@harrysarson
Created June 30, 2021 20:05
Lexer/parser thoughts

I like the lexer/parser approach because there seem to be two hard parts to going from text to an AST:

  1. Dealing with text
  2. Working out the (possibly recursive) structure of the text

Elm is a really nice language in that you can separate the two hard parts cleanly when writing a parser for it:

  • The lexer takes text and produces a sequence of tokens. This sequence is flat; the lexer is completely oblivious to the structure of the code [1].
  • The parser takes tokens and works out the structure of the program. Code is almost always nested, so this step generally takes a flat data structure and creates a nested/recursive one. When writing the parser I think you should be careful to make sure that it never looks at text, only at the identities of the tokens (e.g. "this is the keyword type") [2].
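To make the split concrete, here is a tiny hypothetical sketch (in Rust rather than Elm, with made-up `Tok` and `Expr` types): the lexer's output for `1 + 2 * 3` is a flat sequence, while the parser's output is nested.

```rust
// Hypothetical sketch: the lexer's flat output vs. the parser's
// nested output for the expression `1 + 2 * 3`.

#[derive(Debug, PartialEq)]
enum Tok {
    Num(i64),
    Plus,
    Star,
}

#[derive(Debug, PartialEq)]
enum Expr {
    Num(i64),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}

fn main() {
    // Lexer output: a flat sequence with no structure.
    let tokens = vec![Tok::Num(1), Tok::Plus, Tok::Num(2), Tok::Star, Tok::Num(3)];
    assert_eq!(tokens.len(), 5);

    // Parser output: the recursive structure 1 + (2 * 3).
    let ast = Expr::Add(
        Box::new(Expr::Num(1)),
        Box::new(Expr::Mul(Box::new(Expr::Num(2)), Box::new(Expr::Num(3)))),
    );
    println!("{:?}", ast);
}
```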

Lexing

The lexer function for elm would look something like this:

lex : String -> List LexItem


type LexItem
    = Token LexToken
    | Ignorable LexIgnorable
    | Invalid LexInvalid
    | Newlines (List Int) Int
    

type LexToken
    = Sigil LexSigil
    | Identifier
        { qualifiers : List Token.UpperCase
        , name : Token.Token
        }
    | RecordAccessorLiteral Token.LowerCase
    | RecordAccessorFunction Token.LowerCase
    | Keyword Token.Keyword
    | NumericLiteral String
    | TextLiteral LexLiteralType String

Nice properties:

  1. Lexing never fails (it just produces Invalid tokens).
  2. Lexing is full-fidelity: every character of the input ends up in exactly one lexed item.
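A hypothetical Rust sketch of these two properties (the item kinds and character classes here are heavily simplified stand-ins for the Elm types above): unrecognised characters are swept into an Invalid item rather than failing, and concatenating the items reproduces the input.

```rust
// Simplified sketch of a lexer that never fails and is
// full-fidelity. Token/Ignorable/Invalid loosely mirror the
// Elm LexItem variants above.

#[derive(Debug, PartialEq)]
enum LexItem {
    Token(String),     // a run of alphanumeric characters
    Ignorable(String), // whitespace
    Invalid(String),   // anything the lexer does not recognise
}

fn lex(src: &str) -> Vec<LexItem> {
    let chars: Vec<char> = src.chars().collect();
    let mut items = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        let start = i;
        if chars[i].is_alphanumeric() {
            while i < chars.len() && chars[i].is_alphanumeric() { i += 1; }
            items.push(LexItem::Token(chars[start..i].iter().collect()));
        } else if chars[i].is_whitespace() {
            while i < chars.len() && chars[i].is_whitespace() { i += 1; }
            items.push(LexItem::Ignorable(chars[start..i].iter().collect()));
        } else {
            // Never fail: sweep unrecognised characters into Invalid.
            while i < chars.len()
                && !chars[i].is_alphanumeric()
                && !chars[i].is_whitespace() { i += 1; }
            items.push(LexItem::Invalid(chars[start..i].iter().collect()));
        }
    }
    items
}

fn main() {
    let src = "type £ alias";
    let items = lex(src);
    // Full fidelity: concatenating the items reproduces the input.
    let round_trip: String = items.iter().map(|item| match item {
        LexItem::Token(s) | LexItem::Ignorable(s) | LexItem::Invalid(s) => s.as_str(),
    }).collect();
    assert_eq!(round_trip, src);
    println!("{:?}", items);
}
```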

Not-nice properties:

You can see a RecordAccessorLiteral/RecordAccessorFunction hack here. This is needed because in elm `record .field` and `record.field` parse into different ASTs. To handle this, either the parser has to treat the whitespace between tokens as significant, or we need this hack in the lexer.
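A hypothetical sketch of the lexer-side version of that hack: when the lexer reaches `.field`, it decides which token to emit by checking whether the dot directly follows another token with no intervening whitespace.

```rust
// Sketch of the lexer-side disambiguation: `.field` becomes a
// RecordAccessorLiteral when it directly follows an identifier
// (`record.field`), and a RecordAccessorFunction when whitespace
// intervenes (`record .field`). All names are made up.

#[derive(Debug, PartialEq)]
enum Tok {
    Identifier(String),
    Whitespace,
    RecordAccessorLiteral(String),  // record.field
    RecordAccessorFunction(String), // .field used as a function
}

fn classify_accessor(prev: Option<&Tok>, field: &str) -> Tok {
    match prev {
        // No whitespace between the previous token and the dot:
        // this is an access on a value.
        Some(Tok::Identifier(_)) => Tok::RecordAccessorLiteral(field.to_string()),
        // Otherwise the dot starts a standalone accessor function.
        _ => Tok::RecordAccessorFunction(field.to_string()),
    }
}

fn main() {
    // `record.field`: `.field` directly follows an identifier.
    let prev = Tok::Identifier("record".to_string());
    assert_eq!(
        classify_accessor(Some(&prev), "field"),
        Tok::RecordAccessorLiteral("field".to_string())
    );
    // `record .field`: whitespace intervenes.
    assert_eq!(
        classify_accessor(Some(&Tok::Whitespace), "field"),
        Tok::RecordAccessorFunction("field".to_string())
    );
    println!("ok");
}
```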

Parsing

This is where I got stuck. In a language with mutability we have:

struct ParseState { ... }

impl ParseState {
    fn peek(&self) -> &TokenItem { ... }
    fn pop(&mut self) -> TokenItem { ... }
}

and the parser will repeatedly pop tokens from the state and put them into the AST.
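Filling in the elided bodies, a minimal runnable version of that mutable pattern might look like this (the TokenItem type is made up for illustration):

```rust
// Minimal runnable sketch of the mutable peek/pop pattern above,
// with a made-up TokenItem type standing in for the elided one.

#[derive(Debug, Clone, PartialEq)]
enum TokenItem {
    Keyword(String),
    Identifier(String),
    Eof,
}

struct ParseState {
    tokens: Vec<TokenItem>,
    pos: usize,
}

impl ParseState {
    // Look at the next token without consuming it.
    fn peek(&self) -> &TokenItem {
        self.tokens.get(self.pos).unwrap_or(&TokenItem::Eof)
    }
    // Consume and return the next token, advancing the state.
    fn pop(&mut self) -> TokenItem {
        let t = self.peek().clone();
        self.pos += 1;
        t
    }
}

fn main() {
    let mut st = ParseState {
        tokens: vec![
            TokenItem::Keyword("type".to_string()),
            TokenItem::Identifier("Msg".to_string()),
        ],
        pos: 0,
    };
    assert_eq!(st.peek(), &TokenItem::Keyword("type".to_string()));
    let t = st.pop(); // consumes the keyword, mutating the state
    assert_eq!(t, TokenItem::Keyword("type".to_string()));
    assert_eq!(st.peek(), &TokenItem::Identifier("Msg".to_string()));
    println!("ok");
}
```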

Immutability making things hard

Very possibly I have missed something here.

The equivalent pattern in immutable languages is to take the state as input and return a new state as output. However, I worried that it would be too easy to create a new state with the last token popped off but then accidentally use the previous state instead.

I couldn't really find a nice pattern for parsing without mutability [2].

Maybe we need an elm/parser-like library, but one that works with lists of tokens (List t) instead of strings.
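For what it's worth, the state-threading pattern (and the exact bug it invites) can be sketched in Rust, with slices standing in for immutable Elm lists; all names here are made up:

```rust
// Sketch of the immutable pattern: each step takes the remaining
// tokens and returns (popped token, remaining-after). The old
// state is never destroyed, so accidentally reusing it instead of
// the returned one is exactly the hazard described above.

#[derive(Debug, Clone, Copy, PartialEq)]
enum Tok {
    Num(i64),
    Plus,
}

// Roughly: pop : List Tok -> Maybe (Tok, List Tok)
fn pop(tokens: &[Tok]) -> Option<(Tok, &[Tok])> {
    match tokens {
        [first, rest @ ..] => Some((*first, rest)),
        [] => None,
    }
}

fn main() {
    let tokens = [Tok::Num(1), Tok::Plus, Tok::Num(2)];

    let (t1, rest1) = pop(&tokens).unwrap();
    assert_eq!(t1, Tok::Num(1));

    // Correct: thread the *new* state forward.
    let (t2, _rest2) = pop(rest1).unwrap();
    assert_eq!(t2, Tok::Plus);

    // The easy mistake: reuse the *old* state and silently re-read
    // the same token. Nothing in the types stops you.
    let (oops, _) = pop(&tokens).unwrap();
    assert_eq!(oops, Tok::Num(1)); // same token again

    println!("ok");
}
```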

Notes

1

It is not always possible to convert text into tokens whilst ignoring the structure of the code. Try parsing C, for instance: whether `a * b;` multiplies `a` by `b` or declares `b` as a pointer depends on whether `a` names a typedef, so the lexer needs feedback from the parser (the classic "lexer hack"). This is what I mean when I say that elm is a really nice language to parse.

If you ever design a language, make life easy for the folk writing dev tools by making the language easy to parse! For example, it took rust about 5 years of being stable to get a really good vscode plugin, partly because design decisions (e.g. macros) make it really hard to write good, fast tooling for rust. Maybe the language would have gained popularity more quickly if the IDE experience had been better earlier on?

2

See my attempt here.
