@harrysarson
Created June 30, 2021 20:05
Lexer/parser thoughts

I like the lexer/parser approach because there seem to be two hard parts to going from text to an AST:

  1. Dealing with text
  2. Working out the (possibly recursive) structure of the text

Elm is a really nice language in that you can separate the two hard parts cleanly when writing a parser for it:

  • The lexer takes text and produces a sequence of tokens. This sequence is flat; the lexer is completely oblivious to the structure of the code [1].
  • The parser takes tokens and works out the structure of the program. Code is almost always nested, so this step generally takes a flat data structure and creates a nested/recursive one. When writing the parser I think you should be careful to make sure that it never looks at text, only at the identities of the tokens (e.g. "this is the keyword type") [2].
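To make the split concrete, here is a tiny hypothetical sketch (in Rust rather than Elm, with made-up `Tok` and `Expr` types): the lexer's output for `1 + 2 * 3` is a flat sequence, while the parser's output is nested.

```rust
// Hypothetical sketch: the lexer's flat output vs. the parser's
// nested output for the expression `1 + 2 * 3`.

#[derive(Debug, PartialEq)]
enum Tok {
    Num(i64),
    Plus,
    Star,
}

#[derive(Debug, PartialEq)]
enum Expr {
    Num(i64),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}

fn main() {
    // Lexer output: a flat sequence with no structure.
    let tokens = vec![Tok::Num(1), Tok::Plus, Tok::Num(2), Tok::Star, Tok::Num(3)];
    assert_eq!(tokens.len(), 5);

    // Parser output: the recursive structure 1 + (2 * 3).
    let ast = Expr::Add(
        Box::new(Expr::Num(1)),
        Box::new(Expr::Mul(Box::new(Expr::Num(2)), Box::new(Expr::Num(3)))),
    );
    println!("{:?}", ast);
}
```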

Lexing

The lexer function for elm would look something like this:

lex : String -> List LexItem


type LexItem
    = Token LexToken
    | Ignorable LexIgnorable
    | Invalid LexInvalid
    | Newlines (List Int) Int
    

type LexToken
    = Sigil LexSigil
    | Identifier
        { qualifiers : List Token.UpperCase
        , name : Token.Token
        }
    | RecordAccessorLiteral Token.LowerCase
    | RecordAccessorFunction Token.LowerCase
    | Keyword Token.Keyword
    | NumericLiteral String
    | TextLiteral LexLiteralType String

Nice properties:

  1. Lexing never fails (it just produces Invalid tokens).
  2. Lexing is full-fidelity: every character of the input ends up in exactly one lexed item.
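A hypothetical Rust sketch of these two properties (the item kinds and character classes here are heavily simplified stand-ins for the Elm types above): unrecognised characters are swept into an Invalid item rather than failing, and concatenating the items reproduces the input.

```rust
// Simplified sketch of a lexer that never fails and is
// full-fidelity. Token/Ignorable/Invalid loosely mirror the
// Elm LexItem variants above.

#[derive(Debug, PartialEq)]
enum LexItem {
    Token(String),     // a run of alphanumeric characters
    Ignorable(String), // whitespace
    Invalid(String),   // anything the lexer does not recognise
}

fn lex(src: &str) -> Vec<LexItem> {
    let chars: Vec<char> = src.chars().collect();
    let mut items = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        let start = i;
        if chars[i].is_alphanumeric() {
            while i < chars.len() && chars[i].is_alphanumeric() { i += 1; }
            items.push(LexItem::Token(chars[start..i].iter().collect()));
        } else if chars[i].is_whitespace() {
            while i < chars.len() && chars[i].is_whitespace() { i += 1; }
            items.push(LexItem::Ignorable(chars[start..i].iter().collect()));
        } else {
            // Never fail: sweep unrecognised characters into Invalid.
            while i < chars.len()
                && !chars[i].is_alphanumeric()
                && !chars[i].is_whitespace() { i += 1; }
            items.push(LexItem::Invalid(chars[start..i].iter().collect()));
        }
    }
    items
}

fn main() {
    let src = "type £ alias";
    let items = lex(src);
    // Full fidelity: concatenating the items reproduces the input.
    let round_trip: String = items.iter().map(|item| match item {
        LexItem::Token(s) | LexItem::Ignorable(s) | LexItem::Invalid(s) => s.as_str(),
    }).collect();
    assert_eq!(round_trip, src);
    println!("{:?}", items);
}
```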

Not-nice properties:

You can see a RecordAccessorLiteral/RecordAccessorFunction hack here. This is needed because in elm `record .field` and `record.field` parse into different ASTs. To handle this, either the parser has to treat the whitespace between tokens as significant, or we need this hack in the lexer.
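A hypothetical sketch of the lexer-side version of that hack: when the lexer reaches `.field`, it decides which token to emit by checking whether the dot directly follows another token with no intervening whitespace.

```rust
// Sketch of the lexer-side disambiguation: `.field` becomes a
// RecordAccessorLiteral when it directly follows an identifier
// (`record.field`), and a RecordAccessorFunction when whitespace
// intervenes (`record .field`). All names are made up.

#[derive(Debug, PartialEq)]
enum Tok {
    Identifier(String),
    Whitespace,
    RecordAccessorLiteral(String),  // record.field
    RecordAccessorFunction(String), // .field used as a function
}

fn classify_accessor(prev: Option<&Tok>, field: &str) -> Tok {
    match prev {
        // No whitespace between the previous token and the dot:
        // this is an access on a value.
        Some(Tok::Identifier(_)) => Tok::RecordAccessorLiteral(field.to_string()),
        // Otherwise the dot starts a standalone accessor function.
        _ => Tok::RecordAccessorFunction(field.to_string()),
    }
}

fn main() {
    // `record.field`: `.field` directly follows an identifier.
    let prev = Tok::Identifier("record".to_string());
    assert_eq!(
        classify_accessor(Some(&prev), "field"),
        Tok::RecordAccessorLiteral("field".to_string())
    );
    // `record .field`: whitespace intervenes.
    assert_eq!(
        classify_accessor(Some(&Tok::Whitespace), "field"),
        Tok::RecordAccessorFunction("field".to_string())
    );
    println!("ok");
}
```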

Parsing

This is where I got stuck. In a language with mutability we have:

struct ParseState { ... }

impl ParseState {
    fn peek(&self) -> &TokenItem { ... }
    fn pop(&mut self) -> TokenItem { ... }
}

and the parser will repeatedly pop tokens from the state and put them into the AST.
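Filling in the elided bodies, a minimal runnable version of that mutable pattern might look like this (the TokenItem type is made up for illustration):

```rust
// Minimal runnable sketch of the mutable peek/pop pattern above,
// with a made-up TokenItem type standing in for the elided one.

#[derive(Debug, Clone, PartialEq)]
enum TokenItem {
    Keyword(String),
    Identifier(String),
    Eof,
}

struct ParseState {
    tokens: Vec<TokenItem>,
    pos: usize,
}

impl ParseState {
    // Look at the next token without consuming it.
    fn peek(&self) -> &TokenItem {
        self.tokens.get(self.pos).unwrap_or(&TokenItem::Eof)
    }
    // Consume and return the next token, advancing the state.
    fn pop(&mut self) -> TokenItem {
        let t = self.peek().clone();
        self.pos += 1;
        t
    }
}

fn main() {
    let mut st = ParseState {
        tokens: vec![
            TokenItem::Keyword("type".to_string()),
            TokenItem::Identifier("Msg".to_string()),
        ],
        pos: 0,
    };
    assert_eq!(st.peek(), &TokenItem::Keyword("type".to_string()));
    let t = st.pop(); // consumes the keyword, mutating the state
    assert_eq!(t, TokenItem::Keyword("type".to_string()));
    assert_eq!(st.peek(), &TokenItem::Identifier("Msg".to_string()));
    println!("ok");
}
```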

Immutability making things hard

Very possibly I have missed something here.

The equivalent pattern in immutable languages is to take the state as input and return a new state as output. However, I worried that it would be too easy to create a new state with the last token popped off but then accidentally use the previous state instead.

I couldn't really find a nice pattern for parsing without mutability [2].

Maybe we need an elm/parser-like library, but one that works with lists of tokens (List t) instead of strings.
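For what it's worth, the state-threading pattern (and the exact bug it invites) can be sketched in Rust, with slices standing in for immutable Elm lists; all names here are made up:

```rust
// Sketch of the immutable pattern: each step takes the remaining
// tokens and returns (popped token, remaining-after). The old
// state is never destroyed, so accidentally reusing it instead of
// the returned one is exactly the hazard described above.

#[derive(Debug, Clone, Copy, PartialEq)]
enum Tok {
    Num(i64),
    Plus,
}

// Roughly: pop : List Tok -> Maybe (Tok, List Tok)
fn pop(tokens: &[Tok]) -> Option<(Tok, &[Tok])> {
    match tokens {
        [first, rest @ ..] => Some((*first, rest)),
        [] => None,
    }
}

fn main() {
    let tokens = [Tok::Num(1), Tok::Plus, Tok::Num(2)];

    let (t1, rest1) = pop(&tokens).unwrap();
    assert_eq!(t1, Tok::Num(1));

    // Correct: thread the *new* state forward.
    let (t2, _rest2) = pop(rest1).unwrap();
    assert_eq!(t2, Tok::Plus);

    // The easy mistake: reuse the *old* state and silently re-read
    // the same token. Nothing in the types stops you.
    let (oops, _) = pop(&tokens).unwrap();
    assert_eq!(oops, Tok::Num(1)); // same token again

    println!("ok");
}
```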

Notes

1

It is not always possible to convert text into tokens whilst ignoring the structure of the code. Try parsing C, for instance: whether `a * b;` multiplies `a` by `b` or declares `b` as a pointer depends on whether `a` names a typedef, so the lexer needs feedback from the parser (the classic "lexer hack"). This is what I mean when I say that elm is a really nice language to parse.

If you ever design a language, make life easy for the folk writing dev tools by making the language easy to parse! For example, it took rust about 5 years of being stable to get a really good vscode plugin, partly because design decisions (e.g. macros) make it really hard to write good, fast tooling for rust. Maybe the language would have gained popularity more quickly if the IDE experience had been better earlier on?

2

See my attempt here.
