A Brief Note About Compiler Construction - Part 1

This note is a personal record of what I have learned about compilers. If anything is wrong, or you have questions, feel free to comment.

Components in Compilers

A compiler translates source code, which may be written in C++ or Java, into target machine code, which may look like MOV ax, 2 or c7 00 0000 0002. At first glance, this does not seem very complicated. Under the hood, however, the compiler has to do a lot of work:

Scanner

The scanner is the first part of the compiler. It reads the source code (a stream of characters) and groups the characters into more meaningful units. The units produced by the scanner are called tokens, and this step is called lexical analysis, or tokenization. For example, take the following code (in whatever language, Java, C++, etc.):

a[index] = 4 + 2;

After lexical analysis, the characters are grouped into meaningful tokens, which might look like this:

[
  { type: 'identifier', value: 'a' },
  { type: 'symbol', name: 'left bracket', value: '[' },
  { type: 'identifier', value: 'index' },
  { type: 'symbol', name: 'right bracket', value: ']' },
  { type: 'symbol', name: 'assignment', value: '=' },
  { type: 'number', value: 4 },
  { type: 'symbol', name: 'plus', value: '+' },
  { type: 'number', value: 2 }
]

Note: the structure above is just an illustration; a real-world scanner would not necessarily produce exactly this output.
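To make this phase concrete, here is a minimal tokenizer sketch in TypeScript. The Token shape and the tokenize function are my own invention for this note, not the API of any real compiler:

type Token =
  | { type: 'identifier'; value: string }
  | { type: 'number'; value: number }
  | { type: 'symbol'; value: string };

function tokenize(source: string): Token[] {
  const tokens: Token[] = [];
  let i = 0;
  while (i < source.length) {
    const ch = source[i];
    if (/\s/.test(ch)) { i++; continue; }         // skip whitespace
    if (/[A-Za-z_]/.test(ch)) {                   // identifier: letters, digits, underscore
      let j = i;
      while (j < source.length && /[A-Za-z0-9_]/.test(source[j])) j++;
      tokens.push({ type: 'identifier', value: source.slice(i, j) });
      i = j;
    } else if (/[0-9]/.test(ch)) {                // number literal
      let j = i;
      while (j < source.length && /[0-9]/.test(source[j])) j++;
      tokens.push({ type: 'number', value: Number(source.slice(i, j)) });
      i = j;
    } else {                                      // single-character symbol: [ ] = + ;
      tokens.push({ type: 'symbol', value: ch });
      i++;
    }
  }
  return tokens;
}

// tokenize('a[index] = 4 + 2;') yields tokens for a, [, index, ], =, 4, +, 2, ;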

Parser

The parser receives the tokens produced by the scanner during the lexical analysis phase. In this phase the tokens are combined into expressions; for example, 4 + 2 is represented as an add-expression. The result of the parser's work is a syntax tree, or parse tree. Sometimes, since we already know that an add-expression is one value plus another, the syntax tree can omit some nodes; the result is then called an abstract syntax tree.
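As a rough sketch, an abstract syntax tree for a[index] = 4 + 2; could look like the following. The node names and the Expr type are made up for this note and do not follow any particular parser:

type Expr =
  | { kind: 'identifier'; name: string }
  | { kind: 'number'; value: number }
  | { kind: 'index'; target: Expr; index: Expr }   // a[index]
  | { kind: 'add'; left: Expr; right: Expr }       // 4 + 2
  | { kind: 'assign'; target: Expr; value: Expr }; // lhs = rhs

const tree: Expr = {
  kind: 'assign',
  target: {
    kind: 'index',
    target: { kind: 'identifier', name: 'a' },
    index: { kind: 'identifier', name: 'index' },
  },
  value: {
    kind: 'add',
    left: { kind: 'number', value: 4 },
    right: { kind: 'number', value: 2 },
  },
};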

Semantic Analysis

When parsing is finished, the compiler performs semantic analysis. Semantics differ from syntax: semantics determine the program's behavior in the runtime environment, such as the order of execution. Semantics come in two styles: static semantics and dynamic semantics.

A static semantic analyzer typically performs type checking (including computing the data type of expressions) and declaration checking. Dynamic semantics, by contrast, can only be determined by executing the program.
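As a small illustration of static type checking, here is a sketch that computes the type of an expression over a simplified version of the Expr type from the parser sketch above. The declarations table, the typeOf function, and the typing rules are invented for this note:

type Type = 'int' | 'int[]';

type Expr =
  | { kind: 'identifier'; name: string }
  | { kind: 'number'; value: number }
  | { kind: 'index'; target: Expr; index: Expr }
  | { kind: 'add'; left: Expr; right: Expr };

// A toy declaration table mapping names to their declared types.
const declarations: Record<string, Type> = { a: 'int[]', index: 'int' };

function typeOf(e: Expr): Type {
  switch (e.kind) {
    case 'number':
      return 'int';
    case 'identifier': {
      const t = declarations[e.name];
      if (!t) throw new Error(`'${e.name}' is not declared`);
      return t;
    }
    case 'index': {
      if (typeOf(e.target) !== 'int[]') throw new Error('only arrays can be indexed');
      if (typeOf(e.index) !== 'int') throw new Error('array index must be an int');
      return 'int';
    }
    case 'add': {
      if (typeOf(e.left) !== 'int' || typeOf(e.right) !== 'int')
        throw new Error("'+' expects int operands");
      return 'int';
    }
  }
}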

Optimizer

Optimization usually comes after semantic analysis. This step performs optimizations on the code, and how well it works depends partly on the source code itself. A common optimization is called constant folding: in the example above, the expression a[index] = 4 + 2; is replaced by a[index] = 6; once optimization is finished.

On the other hand, there are also methods that optimize an intermediate representation rather than the source directly. A standard choice is three-address code, and we can see what it really does behind the scenes. In the first step, the optimizer declares a temporary variable to store the result of the expression:

int t = 4 + 2;
a[index] = t;

Then the optimizer precomputes the first expression:

int t = 6;
a[index] = t;

Finally, it replaces t with its value:

a[index] = 6;

Another popular choice of intermediate code is P-code, which has been widely used for Pascal.
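To make constant folding concrete, here is a minimal sketch over the Expr type from the parser sketch. Real optimizers usually work on an intermediate representation such as three-address code, but the idea is the same; the fold function is made up for this note:

type Expr =
  | { kind: 'identifier'; name: string }
  | { kind: 'number'; value: number }
  | { kind: 'index'; target: Expr; index: Expr }
  | { kind: 'add'; left: Expr; right: Expr }
  | { kind: 'assign'; target: Expr; value: Expr };

// Fold constant sub-expressions bottom-up: 4 + 2 becomes 6.
function fold(e: Expr): Expr {
  switch (e.kind) {
    case 'identifier':
    case 'number':
      return e;
    case 'index':
      return { kind: 'index', target: fold(e.target), index: fold(e.index) };
    case 'assign':
      return { kind: 'assign', target: fold(e.target), value: fold(e.value) };
    case 'add': {
      const left = fold(e.left);
      const right = fold(e.right);
      if (left.kind === 'number' && right.kind === 'number') {
        return { kind: 'number', value: left.value + right.value };
      }
      return { kind: 'add', left, right };
    }
  }
}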

Code Generation

Code generation is the final step. It takes the intermediate code produced by the optimizer and generates the target machine code. Different machines have different instruction sets, such as the x86 or x64 architectures, so code generation must be able to produce different machine code while guaranteeing consistent behavior.

In addition to generating code for different machines, the code generation step should also optimize the target code for the specific machine, in order to achieve better runtime performance.
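As a rough sketch, here is a toy code generator that turns the three-address statements from the optimizer example into MOV-style pseudo-assembly. The ThreeAddress type and the instruction text are only illustrative, not real x86 encoding:

type ThreeAddress =
  | { op: 'const'; dest: string; value: number }                        // t = 6
  | { op: 'store-indexed'; array: string; index: string; src: string }; // a[index] = t

function emit(code: ThreeAddress[]): string[] {
  const asm: string[] = [];
  for (const instr of code) {
    if (instr.op === 'const') {
      asm.push(`MOV ${instr.dest}, ${instr.value}`);
    } else {
      asm.push(`MOV [${instr.array} + ${instr.index}], ${instr.src}`);
    }
  }
  return asm;
}

// emit([{ op: 'const', dest: 't', value: 6 },
//       { op: 'store-indexed', array: 'a', index: 'index', src: 't' }])
// => ['MOV t, 6', 'MOV [a + index], t']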

Error Handling

Error handling is one of the most complicated components of a compiler. It works throughout the whole compilation process. In each phase, the error handler should provide a precise message, including the reason, the line number, and so on; this is static error handling. For dynamic languages, or at run time, the compiler should also provide useful error messages.
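As a small illustration, an error handler might collect diagnostics in a structure like the following; the Diagnostic shape and the report function are invented for this note:

type Phase = 'lexical' | 'syntax' | 'semantic' | 'codegen';

interface Diagnostic {
  phase: Phase;     // which stage of the compiler reported the error
  line: number;     // where in the source the problem was found
  column: number;
  message: string;  // the precise reason, e.g. "';' expected"
}

const diagnostics: Diagnostic[] = [];

function report(d: Diagnostic): void {
  diagnostics.push(d);
  console.error(`${d.phase} error at ${d.line}:${d.column}: ${d.message}`);
}

// report({ phase: 'syntax', line: 1, column: 18, message: "';' expected" });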

Real World Compiler

We have introduced the different components of a typical compiler; this is its logical structure. This section recaps some ways real-world compilers are physically organized.

Analysis and Synthesis

From this perspective, the compiler divides into two parts: analysis and synthesis. The scanner, parser, and semantic analyzer mentioned before analyze the program, so they belong to the analysis part. Only code generation belongs to synthesis.

The difference between analysis and synthesis is that analysis is more mathematical and better understood, while synthesis is more concrete and requires more specialized techniques.

Front End and Back End

This is the classical view in compiler construction, and most compilers are organized this way. The front end is responsible for lexical analysis, syntax analysis, and semantic analysis, while the back end handles code generation; this is very similar to the analysis-and-synthesis view.

The essential benefit of this view is portability: either part can, in principle, be replaced by another one with the same function. In practice, however, such replacement can be very difficult, because the front end and back end depend on the source language and the target language respectively. Especially in the back end, some optimizations stop working when switching to a new back-end component.

Passes

Compilers also use some convenient strategies to transform the source code into the target code. One of them is passes, which process the code iteratively: one pass turns the source into tokens, the next alters the structure or adds information, and so on. Early [Scheme](https://en.wikipedia.org/wiki/Scheme_(programming_language)) compilers adopted this approach. It is easy to implement, but as more features are added, it becomes more complicated and harder to understand and maintain.
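As a rough sketch of the idea, passes can be modeled as functions chained one after another, each consuming the output of the previous one. The stand-in passes here are trivial placeholders to show the chaining, not real compiler passes:

type Pass<In, Out> = (input: In) => Out;

// Trivial stand-in passes, just to show how passes chain together.
const scan: Pass<string, string[]> = (source) => source.split(/\s+/).filter(Boolean);
const lower: Pass<string[], string[]> = (tokens) => tokens.map((t) => t.toLowerCase());
const generate: Pass<string[], string> = (tokens) => tokens.join(' ');

// Each pass consumes the output of the previous one.
function compile(source: string): string {
  return generate(lower(scan(source)));
}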

Essential Data Structure in Compiler

After introducing the structure of a compiler, we can now look at some common data structures used to implement its essential parts and improve its performance.

The Literal Table

The Symbol Table

Temporary Files

Chicken or Egg
