@peey
Created July 22, 2017 14:04

Guide to Babylon

For a high-level overview of how a tokenizer and parser like Babylon's work, start by going through the super tiny compiler.

Tokenizer

The tokenizer reads the source code sequentially and converts it into meaningful tokens such as keywords (return, let, do), punctuation (**, {, ;), names (identifiers), numbers, etc. The kinds of token are called TokenTypes, often imported as tt in the rest of the source. You can find the full list of token types in src/tokenizer/types.js.

There are no separate tokenization and parsing stages. TODO: a brief description of the important info that tokens contain (in state?)

When you require a babel-plugin-syntax-* package in Babel, it does nothing fancier than add an entry to the parserOpts.plugins array. In Babylon's source you'll use the method this.hasPlugin to check whether a plugin was enabled by a Babel transform. Babylon takes the plugins array passed as an option and converts it to a dictionary, this.plugins, where each plugin name is a key and the value is either true or false depending on whether the plugin name was in the array.
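The array-to-dictionary conversion can be pictured with the minimal sketch below. PluginRegistry is a hypothetical stand-in, not Babylon's actual class; only the names plugins and hasPlugin come from the text. In this sketch, names missing from the array are simply absent from the dictionary and read as false, which is an assumption.

```javascript
// Hypothetical, simplified stand-in for the parser's plugin bookkeeping.
class PluginRegistry {
  constructor(pluginList = []) {
    // Convert ["jsx", "decorators"] into { jsx: true, decorators: true }.
    this.plugins = {};
    for (const name of pluginList) {
      this.plugins[name] = true;
    }
  }

  // Boolean lookup by plugin name, in the spirit of this.hasPlugin.
  // Assumption of this sketch: absent names are reported as false.
  hasPlugin(name) {
    return Boolean(this.plugins[name]);
  }
}

const registry = new PluginRegistry(["jsx", "decorators"]);
console.log(registry.hasPlugin("decorators")); // true
console.log(registry.hasPlugin("flow")); // false
```

Parsing code for an experimental syntax would typically be guarded by a check of this kind, so that the syntax is only accepted when the matching plugin was requested.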

Special Note: comments are parsed along with tokenization. This is because comments aren't treated as a part of the AST but are attached to the tokens that they follow or are followed by. The rest of the code is not parsed until the tokenization is complete.

Working

The tokenizer maintains a notion of "position", which starts at line 1, column 0 (lines are counted from 1, columns from 0) and advances every time the tokenizer converts the next few characters of the source code into a token.

When the tokenizer begins, it first skips whitespace characters and comments using the method skipSpace. After skipping, the method next passes control to getTokenFromCode, which determines the type of token from the current character (looking at the following character as well when the current one is not enough) and calls the appropriate method to consume that type of token and advance the position.
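The flow described above can be sketched with a toy tokenizer. The method names skipSpace, next and getTokenFromCode follow the text, but the bodies, the ToyTokenizer class, the helpers finishToken and readWhile, and the string token types are all assumptions made for illustration; Babylon's real implementation handles far more cases.

```javascript
// Toy tokenizer: method names follow the text, bodies are illustrative only.
class ToyTokenizer {
  constructor(input) {
    this.input = input;
    this.pos = 0; // current character offset
    this.type = null; // type of the current token
    this.value = null; // value of the current token
  }

  // Skip whitespace and "//" line comments.
  skipSpace() {
    while (this.pos < this.input.length) {
      const ch = this.input[this.pos];
      if (ch === " " || ch === "\n" || ch === "\t" || ch === "\r") {
        this.pos++;
      } else if (ch === "/" && this.input[this.pos + 1] === "/") {
        while (this.pos < this.input.length && this.input[this.pos] !== "\n") {
          this.pos++;
        }
      } else {
        break;
      }
    }
  }

  // Read the next token into this.type / this.value.
  next() {
    this.skipSpace();
    if (this.pos >= this.input.length) {
      this.type = "eof";
      this.value = null;
      return;
    }
    this.getTokenFromCode(this.input[this.pos]);
  }

  // Dispatch on the current character, peeking at the next one when needed.
  getTokenFromCode(ch) {
    if (ch === "*") {
      // One character is not enough to tell "*" from "**".
      const isExponent = this.input[this.pos + 1] === "*";
      this.finishToken(isExponent ? "exponent" : "star", isExponent ? "**" : "*");
    } else if (/[0-9]/.test(ch)) {
      this.readWhile(/[0-9]/, "num");
    } else if (/[A-Za-z_$]/.test(ch)) {
      this.readWhile(/[A-Za-z0-9_$]/, "name");
    } else {
      throw new SyntaxError(`Unexpected character '${ch}' at ${this.pos}`);
    }
  }

  // Record the token and advance the position past it.
  finishToken(type, value) {
    this.type = type;
    this.value = value;
    this.pos += value.length;
  }

  // Consume characters while they match the pattern.
  readWhile(pattern, type) {
    let end = this.pos;
    while (end < this.input.length && pattern.test(this.input[end])) end++;
    this.finishToken(type, this.input.slice(this.pos, end));
  }
}

// Drive the tokenizer until end of input and collect [type, value] pairs.
function tokenize(source) {
  const tokenizer = new ToyTokenizer(source);
  const tokens = [];
  for (tokenizer.next(); tokenizer.type !== "eof"; tokenizer.next()) {
    tokens.push([tokenizer.type, tokenizer.value]);
  }
  return tokens;
}

console.log(tokenize("a ** 2 // square\n* b"));
```

Note how the comment is dropped by skipSpace before any token is produced, and how "*" and "**" are told apart only by peeking at the following character.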

State

Each token contains a state object which is partially populated at the time of tokenization with information like ... and partially populated at the time of parsing with information like ...

A key aspect is that instead of treating the various syntactic elements of a construct as an array of tokens, the tokenizer treats the construct as one token of that construct's type (e.g. class) and keeps the associated syntax elements as properties on the token (e.g. decorators).

Context

TODO

Useful API methods

The method match(type: TokenType): boolean checks whether the current token is of a certain type, without advancing the position.

The method eat(type: TokenType): boolean advances the position and discards the current token if it matches the type, returning true; otherwise it leaves the position unchanged and returns false.

E.g. to match a, b, c and d in a.b.c.d you could use:

do {
  // process identifier and advance position to the end of identifier
} while (this.eat(tt.dot))
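To make the loop above concrete, here is a minimal runnable sketch over a pre-tokenized input. ToyParser, its current getter, parseDottedPath and the string-valued tt object are hypothetical; only the behaviour of match and eat is meant to mirror the description.

```javascript
// Hypothetical token types standing in for the real tt object.
const tt = { name: "name", dot: ".", eof: "eof" };

class ToyParser {
  constructor(tokens) {
    this.tokens = tokens; // pre-tokenized for brevity
    this.index = 0;
  }

  // The current token, or an eof token once the input is exhausted.
  get current() {
    return this.tokens[this.index] || { type: tt.eof };
  }

  next() {
    this.index++;
  }

  // True if the current token has the given type; does not advance.
  match(type) {
    return this.current.type === type;
  }

  // Consume the current token only if it has the given type.
  eat(type) {
    if (this.match(type)) {
      this.next();
      return true;
    }
    return false;
  }

  // Collect the identifiers of a dotted path such as a.b.c.d.
  parseDottedPath() {
    const parts = [];
    do {
      if (!this.match(tt.name)) {
        throw new SyntaxError("Expected an identifier");
      }
      parts.push(this.current.value);
      this.next(); // advance past the identifier
    } while (this.eat(tt.dot));
    return parts;
  }
}

// Tokens for the source text "a.b.c.d".
const pathTokens = ["a", ".", "b", ".", "c", ".", "d"].map((text) =>
  text === "." ? { type: tt.dot } : { type: tt.name, value: text }
);
console.log(new ToyParser(pathTokens).parseDottedPath());
```

Because eat only consumes the dot when it is actually there, the loop ends naturally at the first token that is not a dot.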

There are a bunch of related API methods which might be useful to you, such as expect and eatContextual. You can find their documentation in the source file src/parser/util.js.

The method raise(pos: number, message: string): empty is used to raise a syntax error at the position given by its pos argument. Its return type is empty because it always throws. Also see unexpected and expect in src/parser/util.js.
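As a rough illustration of what a raise-style method does, the sketch below turns a character offset into a line and column and throws a SyntaxError carrying that information. The function names raiseAt and getLineAndColumn, the message format, and the pos and loc properties on the error are assumptions of this sketch, not a description of Babylon's actual code.

```javascript
// Hypothetical helper: translate a character offset into line and column.
// Assumption of this sketch: lines count from 1, columns from 0.
function getLineAndColumn(input, pos) {
  let line = 1;
  let lineStart = 0;
  for (let i = 0; i < pos; i++) {
    if (input[i] === "\n") {
      line++;
      lineStart = i + 1;
    }
  }
  return { line, column: pos - lineStart };
}

// Sketch of a raise-like function: it always throws and never returns.
function raiseAt(input, pos, message) {
  const loc = getLineAndColumn(input, pos);
  const err = new SyntaxError(`${message} (${loc.line}:${loc.column})`);
  err.pos = pos;
  err.loc = loc;
  throw err;
}

try {
  // Offset 10 is the ";" on the second line of this two-line input.
  raiseAt("let x =\n  ;", 10, "Unexpected token");
} catch (err) {
  console.log(err.message);
}
```

Attaching the position to the error object lets callers report where the problem is without re-scanning the source.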

Lookahead

// maybe this section should be a comment in the src near src/tokenizer/index.js L75C3 isLookahead, L122C3 lookahead ?

Sometimes a lookahead is required to determine the nature of the current token. This is handled by the this.lookahead method, which "pretends" to advance to the next token after eating the current one, saves the result of the lookahead into a variable, restores the old state, and then returns the saved result.
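The save, advance, restore pattern can be sketched as follows. LookaheadTokenizer is a hypothetical toy that works on a list of words; the names state, isLookahead and lookahead echo the source, but how Babylon actually copies and restores its state is not shown here.

```javascript
// Toy tokenizer over a list of words, keeping all mutable data in this.state.
class LookaheadTokenizer {
  constructor(words) {
    this.words = words;
    this.state = { index: -1, type: null, value: null };
    this.isLookahead = false;
    this.next(); // load the first token
  }

  // Advance by replacing this.state with a fresh object (never mutating it),
  // so an old state can be restored simply by keeping a reference to it.
  next() {
    const index = this.state.index + 1;
    const value = index < this.words.length ? this.words[index] : null;
    this.state = { index, type: value === null ? "eof" : "word", value };
  }

  // Pretend to advance, remember what was seen, then put everything back.
  lookahead() {
    const old = this.state; // remember where we are
    this.isLookahead = true;
    this.next(); // pretend to advance to the next token
    this.isLookahead = false;
    const peeked = this.state; // save the result of the lookahead
    this.state = old; // restore the old state
    return peeked;
  }
}

const lookaheadDemo = new LookaheadTokenizer(["async", "function", "f"]);
const peekedToken = lookaheadDemo.lookahead();
console.log(lookaheadDemo.state.value, peekedToken.value); // async function
```

A parser could use this to decide, for example, whether a word starts a special construct based on the token that follows it, without committing to consuming anything.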

Parser

TODO: code structure overview, base.js, comments.js, location.js and then all other files

Misc Notes

  • A lot of the source code in this repository is optimized for performance over readability, but thankfully it is accompanied by comments wherever this is the case.