This note records what I have learned about compilers; if anything is wrong, or you have questions, feel free to comment.
A compiler's job is to translate source code, which may be written in C++ or Java, into target machine code, which may look like `MOV ax, 2` or `c7 00 0000 0002`. At first glance, what the compiler does is not very complicated. However, under the hood, the compiler needs to do a lot of work:
Scanner
The scanner is the first part of the compiler. It reads the source code (a stream of characters) and groups the characters into more meaningful units. The units produced by the scanner are called tokens, and this step is called lexical analysis, or tokenization. For example, take the following code (in whatever language, Java, C++, etc.):
a[index] = 4 + 2;
After lexical analysis, the character stream is turned into a sequence of meaningful words, which might look like this:
[
  { identifier: 'a' },
  { symbol: { name: 'left bracket', value: '[' } },
  { identifier: 'index' },
  { symbol: { name: 'right bracket', value: ']' } },
  { symbol: { name: 'assignment', value: '=' } },
  { number: 4 },
  { symbol: { name: 'plus', value: '+' } },
  { number: 2 }
]
Note: The above structure is just an example; it may not be what a real-world scanner produces.
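As a sketch of how a scanner might work, here is a toy tokenizer in Python for the statement above. The token kinds and regular expressions are my own assumptions for illustration, not what any real compiler uses:

```python
import re

# Token patterns for the tiny statement language in this note (an
# assumption for illustration; real scanners handle many more kinds).
TOKEN_SPEC = [
    ("number",     r"\d+"),
    ("identifier", r"[A-Za-z_]\w*"),
    ("symbol",     r"[\[\]=+;]"),
    ("skip",       r"\s+"),          # whitespace is discarded
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    """Turn a character stream into a list of (kind, value) tokens."""
    tokens = []
    for match in TOKEN_RE.finditer(source):
        kind = match.lastgroup
        if kind != "skip":
            tokens.append((kind, match.group()))
    return tokens

print(tokenize("a[index] = 4 + 2;"))
# [('identifier', 'a'), ('symbol', '['), ('identifier', 'index'),
#  ('symbol', ']'), ('symbol', '='), ('number', '4'), ('symbol', '+'),
#  ('number', '2'), ('symbol', ';')]
```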
Parser
The parser receives the tokens produced by the scanner in the lexical analysis phase. In this phase, the tokens are combined into expressions; for example, `4 + 2` will be represented as an add-expression. The result of the parser's work is a syntax tree, or parse tree. Sometimes, when we already know that an expression such as an add-expression is one value plus another, the syntax tree can omit some nodes; the resulting tree is called an abstract syntax tree.
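To make this concrete, here is a toy recursive-descent parser in Python that turns the tokens for `4 + 2` into an add-expression node. The token pairs and the tuple node shapes like `('add', ...)` are assumptions made just for this sketch:

```python
# A toy recursive-descent parser for expressions like "4 + 2".
# Tokens are (kind, value) pairs; AST nodes are nested tuples
# ("num", n) and ("add", left, right) -- shapes assumed for illustration.

def parse_expression(tokens, pos=0):
    """Parse number ('+' number)* into an AST; return (node, next_pos)."""
    kind, value = tokens[pos]
    assert kind == "number", f"expected a number, got {kind}"
    node = ("num", int(value))
    pos += 1
    while pos < len(tokens) and tokens[pos] == ("symbol", "+"):
        kind, value = tokens[pos + 1]
        assert kind == "number", "expected a number after '+'"
        # fold the next operand into a left-associative add-expression
        node = ("add", node, ("num", int(value)))
        pos += 2
    return node, pos

tokens = [("number", "4"), ("symbol", "+"), ("number", "2")]
ast, _ = parse_expression(tokens)
print(ast)  # ('add', ('num', 4), ('num', 2))
```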
Semantic Analysis
When parsing is finished, the compiler performs semantic analysis. Semantics differ from syntax: semantics determine the program's behavior in its runtime environment, such as the order of execution. This step comes in two styles: static semantics and dynamic semantics.
A static semantic analyzer typically performs type checking (including computing the data types of expressions) and checks declarations. Dynamic semantics, however, can only be determined by executing the program.
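Static type checking can be sketched as a walk over an abstract syntax tree that computes each node's type. The tree shape (nested tuples such as `('add', ('num', 4), ('num', 2))`) and the type rules here are invented for this sketch only:

```python
# A toy static type checker over an assumed AST shape:
# ("num", 4), ("str", "hi"), or ("add", left, right).

def type_of(node):
    """Compute the static type of an expression node, or raise on a mismatch."""
    tag = node[0]
    if tag == "num":
        return "int"
    if tag == "str":
        return "string"
    if tag == "add":
        left, right = type_of(node[1]), type_of(node[2])
        if left != right:
            # a static semantic error, caught without running the program
            raise TypeError(f"cannot add {left} and {right}")
        return left
    raise ValueError(f"unknown node kind: {tag}")

print(type_of(("add", ("num", 4), ("num", 2))))  # int
```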
Optimizer
Optimization usually comes after semantic analysis. This step performs some optimizations on the code, though how well they work depends partly on the source code itself. A common optimization method is called constant folding: in the example above, the expression `a[index] = 4 + 2;` will finally be replaced by `a[index] = 6;` when optimization finishes.
On the other hand, there are also many methods that optimize the syntax tree directly. A standard choice is three-address code, and we can see what it actually does behind the scenes. In the first step, the optimizer declares a temporary variable to store the expression's result:
int t = 4 + 2;
a[index] = t;
Then the optimizer precomputes the first expression:
int t = 6;
a[index] = t;
Finally, it replaces `t` with its value:
a[index] = 6;
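Constant folding itself can also be sketched as a bottom-up rewrite of the syntax tree. Assuming the tree is represented as nested tuples such as `('add', ('num', 4), ('num', 2))` (a shape chosen just for this sketch):

```python
# A toy constant folder over an assumed nested-tuple AST shape.

def fold(node):
    """Bottom-up: replace constant subtrees like 4 + 2 with their value."""
    if node[0] == "num":
        return node
    if node[0] == "add":
        left, right = fold(node[1]), fold(node[2])
        if left[0] == "num" and right[0] == "num":
            return ("num", left[1] + right[1])  # precompute at compile time
        return ("add", left, right)             # not constant; keep the node
    return node

print(fold(("add", ("num", 4), ("num", 2))))  # ('num', 6)
```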
Another popular intermediate representation is P-code, which was widely used for Pascal.
Code Generation
Code generation is the final step. It takes the intermediate code produced by the optimizer and generates the target machine code. Different machines have different instruction sets, such as the x86 or x64 architectures, so the code generator should be able to produce different machine code while guaranteeing that the behavior stays consistent.
In addition to generating code for different machines, the code generation step should also optimize the target code for the specific machine, to obtain better runtime performance.
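As a sketch, a code generator can walk an expression tree and emit one target instruction per operation. The mnemonics below are x86-flavored pseudo-assembly and the tree is an assumed nested-tuple shape, both chosen for illustration only:

```python
# A toy code generator: walks an expression tree (assumed shape
# ("num", n) / ("add", left, right)) and emits x86-flavored
# pseudo-assembly. The instructions are illustrative, not real encodings.

def gen_expr(node, code):
    """Emit instructions that leave the value of `node` in register ax."""
    if node[0] == "num":
        code.append(f"MOV ax, {node[1]}")
    elif node[0] == "add":
        gen_expr(node[1], code)    # evaluate the left operand into ax
        code.append("PUSH ax")     # save it while we do the right side
        gen_expr(node[2], code)    # right operand into ax
        code.append("MOV bx, ax")  # move right operand to bx
        code.append("POP ax")      # restore left operand into ax
        code.append("ADD ax, bx")  # ax = left + right
    return code

print(gen_expr(("add", ("num", 4), ("num", 2)), []))
# ['MOV ax, 4', 'PUSH ax', 'MOV ax, 2', 'MOV bx, ax', 'POP ax', 'ADD ax, bx']
```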
Error Handling
Error handling is one of the most complicated components of a compiler. It works throughout the entire compilation process. In each phase, the error handler should provide a precise message, including the reason, the line number, and so on; this is static error handling. For dynamic languages, or at run time, the compiler should also provide useful error messages.
We have introduced the different components of a typical compiler; this is its logical structure. This section recaps some physical compiler structures found in the real world.
From this perspective, the compiler divides into two parts: analysis and synthesis. The scanner, parser, and semantic analyzer mentioned before analyze the program, so of course they belong to the analysis part; only code generation is synthesis.
The difference between analysis and synthesis is that analysis is more mathematical and better understood, while synthesis is more concrete and requires more advanced techniques.
A related classical view in compiler construction, by which most compilers are organized, is the split into a front end and a back end. The front end is responsible for lexical analysis, syntax analysis, and semantic analysis, while the back end handles only code generation, which is very similar to the analysis/synthesis view.
In this view, the essential strength is portability: each part can, in principle, be replaced by another one with the same function. In practice, however, such replacement may be very difficult, because the front end and back end depend on the source language and the target language. Especially in the back end, some optimization methods stop working when switching to a new back-end component.
Compilers also use some convenient strategies to translate source code into target code. One of these is passes, which process the code iteratively: one pass transforms the text into tokens, and the next alters the structure or adds some information. Early [Scheme](https://en.wikipedia.org/wiki/Scheme_(programming_language)) compilers adopted this method. It is very easy to implement, but as more features are added, it becomes more complicated and harder to understand and maintain.
After introducing the compiler's construction, we can now look at some common data structures used to implement its essential parts and make the compiler perform better.