Skip to content

Instantly share code, notes, and snippets.

@Chubek
Last active May 4, 2024 17:22
Show Gist options
  • Save Chubek/4442bf73f284eabfe87a72dcdc1d8ebe to your computer and use it in GitHub Desktop.
Save Chubek/4442bf73f284eabfe87a72dcdc1d8ebe to your computer and use it in GitHub Desktop.
Tokenizer for EBNF in Scheme

Tokenizer for EBNF in Scheme

tokenize-ebnf.scm defines in it a tokenizer (aka lexer, or scanner) for EBNF. This is compliant to ISO but several other meta-structures such as RegExp (/.../) are also consumed. Comments starting withg # are tokenized, consumed and appended to the token stream. Do with them as you wish!

This was an excercise at Scheme --- and so much more, an excercise at hand-rollling lexers in an stateless language with functional paradigm built in.

You can use this to, for example, translate EBNF to LaTEx, HTML, PostScript, ROFF --- or maybe define a parser for it, and convert it to Yacc, PEG, ANTLR, Flex, Re2C, etc.

What is EBNF?

Extended Backus-Naur Form is a meta-language, a domain-specific language used to describe other languages. Meta-lanugages can describe the lexical grammar of a language, or its syntactic grammar. They could also describe the abstract syntax of the language (e.g. Zephyr ASDL, see my translator to C.

EBNF has been 'standardized' by ISO. But the standard is lacking. For example, it is much easier to use regular expressions to embed lexical grammar for terminals amongst non-terminals. But with ISO EBNF you'd have to define it in a much less readble way. I am not a fan of using Regular Expressions in tokenizers, as Rob Pike explains they just make things messy and the final DFA unreadable. Plus, just you try to compile an NFA into DFA in a language without any imperative control structure! But using Regex to explain lexical grammar of terminals, a la PEG, is fine (I am not sure what PEG generators do with Regex, do they make DFA? PEG is often translated to LL(1) Recursive-Descent Parser, so it's up to them what they do. They can scoop up as they go. ANTLR is LL(k) and it uses so many heuristics to parse it runs the Euphrates dry!).

Anyways, the rules for EBNF:

  • We have 'rules', which are 'non-terminal names' separated by ::= from 'non-terminal definitons';
non-terminal-name ::= ... ;
  • We have 'terminals', inside single, double or backtick quotes;
  • You can use infix ... to denote range between left terminal operator, and right terminal operand;
  • We can use infix | to denote 'choice' between rule clusters (terminals, or composite);
  • It's best, for the sake of conversion to e.g. Yacc, to use single quotes for characters, and double quotes for strings;
lower ::= 'a' | 'b' | ... | 'y' | 'z' ;
keywords ::= "do" | "while" ;
  • Delimiters: [] denotes 'optional', {} denotes 'sequence', () denotes group;
if-else-stmt ::= "if" expr [ "then" ] stmt { stmt } { "elsif" { stmt } } [ "else" stmt { stmt } ] ;
  • 'Free capture' can be done with delimiting betweeen ? and ?;
any-char ::= ? any ascii character ?

There are some other rules, but you get the gist!

EBNF and Chomsky Hierarchy

EBNF grammars are well-capable of describing Context-Free (Type 2) and Regular (Type 3) grammars. There's a 1-1 relationship between EBNF rules, and regular expressions, in fact. Regardless of what I said about ISO EBNF not having rules for RE, it's just a matter of rolling up your sleeve to describe or define a Regular grammar, that is, a lexical grammar, in EBNF.

Let's use PEG, which is also capable of both T2 and T3, but has Regexp.

  • PEG:
id <- [a-zA-Z_][a-zA-Z0-9_]* ;
id-list <- id ( ',' id )*
  • Equivalent EBNF:
upper ::= 'A' ... 'Z' ;
lower ::= 'a' ... 'z' ;
digit ::= '0' ... '9' ;
letter ::= upper | lower ;
id-head ::= letter | '_' ;
id ::= id-head { letter | digit | '_' } ;
id-list ::= id { ',' id } ;

So... Why not use PEG?

I think you should use a mix of PEG and EBNF. Like, EBNF with Regexp. Like Python describes and defines its gramamr. PEG is much more 'terse' than it should be for a meta-language. After all, PEG is used to actually implement parsers to be generated!

How do I begin authoring an EBNF grammar (for my language, or an established one)?

I have included an ebnf-bootstrap.ebnf file in this Gist. You can wget it and begin writing your grammar. If you are a [Neo]Vim user, you can use ebnf.vim, also in this Gist.

Conclusion

I hope you learned something from this. Have fun.

Contact me at chubakbidpaa [at] riseup [dot] net if you need help. Check out my work!.

digit ::=
| '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
;
digit-natural ::=
digit - '0'
;
upper-case ::=
| 'A' | 'B' | 'C' | 'D' | 'E' | 'F' | 'G' | 'H' | 'I' | 'J'
| 'K' | 'L' | 'M' | 'N' | 'O' | 'P' | 'Q' | 'R' | 'S' | 'T'
| 'U' | 'V' | 'W' | 'X' | 'Y' | 'Z'
;
lower-case ::=
| 'a' | 'b' | 'c' | 'd' | 'e' | 'f' | 'g' | 'h' | 'i' | 'j'
| 'k' | 'l' | 'm' | 'n' | 'o' | 'p' | 'q' | 'r' | 's' | 't'
| 'u' | 'v' | 'w' | 'x' | 'y' | 'z'
;
punctuation ::=
| '.' | ',' | ';' | ':' | '!' | '?' | '-' | '_' | '(' | ')'
| '{' | '}' | '[' | ']' | '"' | "'" | '`' | '@' | '#'
| '$' | '%' | '^' | '&' | '*' | '+' | '=' | '<' | '>'
| '/' | '|' | '~' | '\'
;
space ::=
? Sequential white space ?
;
newline ::=
? The newline character on native system ?
;
tabulator ::=
? The tab character on native system ?
;
alphabetic ::=
upper-case | lower-case
;
alphanumeric ::=
alphabetic | digit
;
symbolic ::=
punctuation | alphanumeric | space
;
printable ::=
symbolic | tabulator
;
character ::=
printable | newline
;
" ebnf.vim - Syntax highlighting for EBNF
" Improved by: Your Name Here
" Original Author: Chubak Bidpaa (chubakbidpaa@riseup.net)
if exists("b:current_syntax")
finish
endif
" Define regions for comments (adjust according to EBNF variant)
syntax region ebnfComment start=/\v\#.*$/ end=/$/ keepend
" Improved definitions for terminal strings to avoid matching operators inside
syntax region ebnfMultiCharTerminal start=/\v"/ end=/\v"/
syntax region ebnfSingleCharTerminal start=/\v'/ end=/\v'/
" Capture patterns (optional, depends on EBNF variant)
syntax region ebnfCapture start=/\v\s*\?/ end=/\v\s*\?$/ keepend
" Match non-terminal identifiers more accurately
syntax match ebnfNonTermIdent /\v[-_a-zA-Z][-_a-z0-9]*/
" Highlight LHS identifiers uniquely (assuming they start at the line's beginning)
syntax match ebnfLhsIdent /\v^[-_a-zA-Z][-_a-z0-9]*/
syntax match ebnfOperator "::="
syntax match ebnfOperator "[{}()\[\]|/,]"
syntax match ebnfOperator ";"
syntax match ebnfOperator "-"
syntax match ebnfSpecial "\.\.\."
syntax match ebnfQuantifier "[?*+]"
" Linking highlights
highlight link ebnfComment Comment
highlight link ebnfMultiCharTerminal String
highlight link ebnfSingleCharTerminal Character
highlight link ebnfNonTermIdent Identifier
highlight link ebnfLhsIdent Underlined
highlight link ebnfOperator Operator
highlight link ebnfSpecial Special
highlight link ebnfQuantifier SpecialChar
highlight link ebnfCapture Statement
let b:current_syntax = "ebnf"
; tokenize-ebnf.scm
; A tokenizer for EBNF
; This works with any Scheme interpreter or compiler compliant with R7RS-Small
; Released under Public Domain License - 2024 (C) Chubak Bidpaa
(define (consume-while predicate rest)
(let loop ((rest-prime rest)
(acc '()))
(if (and (not (null? rest-prime))
(predicate (car rest-prime)))
(loop (cdr rest-prime) (cons (car rest-prime) acc))
(values (list->string (reverse acc)) rest-prime))))
(define (consume-until predicate rest)
(let loop ((rest-prime rest)
(acc '()))
(if (or (null? rest-prime)
(predicate (car rest-prime)))
(values (list->string (reverse acc)) rest-prime)
(loop (cdr rest-prime) (cons (car rest-prime) acc)))))
(define (tokenize input)
(let loop ((rest (string->list input))
(tokens '(INIT StartOfStream)))
(cond
[(null? rest)
(reverse (list '(END EndOfStream) tokens))]
[(member (car rest)
'(#\newline #\space #\tab))
(loop (cdr rest) tokens)]
[(member (car rest)
'(#\" #\' #\`))
(let-values
([(acc rest-after) (consume-while (lambda (ch) (not (char=? ch (car rest)))) (cdr rest))])
(loop rest-after (cons (cons 'TERMINAL acc) tokens)))]
[(char-alphabetic? (car rest))
(let-values
([(acc rest-after)
(consume-while (lambda (ch)
(or (char-alphabetic? ch)
(char-numeric? ch)
(char=? ch #\-))) rest)])
(loop rest-after (cons (cons 'NON-TEMRINAL-ID acc) tokens)))]
[(char=? (car rest) #\:)
(let-values
([(_ rest-after) (consume-while (lambda (ch) (or (char=? ch #\:) (char=? ch #\=))) (cdr rest))])
(loop rest-after (cons '(ASSIGN ASSIGN) tokens)))]
[(char=? (car rest) #\.)
(let-values
([(_ rest-after) (consume-while (lambda (ch) (char=? ch #\.)) (cdr rest))])
(loop rest-after (cons '(RANGE RANGE) tokens)))]
[(char=? (car rest) #\\)
(let-values
([(acc rest-after) (consume-while (lambda (ch) (not (member ch '(#\newline #\space #\tab)))) (cdr rest))])
(loop rest-after (cons (cons 'SPECIAL acc) tokens)))]
[(char=? (car rest) #\/)
(let-values
([(acc rest-after) (consume-until (lambda (ch) (char=? ch #\/)) (cdr rest))])
(loop rest-after (cons (cons 'REGEX acc) tokens)))]
[(member (car rest) '(#\( #\[ #\{ #\) #\] #\} #\; #\, #\|))
(loop (cdr rest)
(cons (case (car rest)
[(#\;) '(CTRL SEMI)]
[(#\() '(SUB-START LPAREN)]
[(#\[) '(SUB-START LBRACK)]
[(#\{) '(SUB-START LCURLY)]
[(#\)) '(SUB-END RPAREN)]
[(#\]) '(SUB-END RBRACK)]
[(#\}) '(SUB-END RCURLY)]
[(#\|) '(CTRL ALTCHR)]
[(#\,) '(DECORE COLON)]) tokens))]
[(char=? (car rest) #\#)
(let-values
([(acc rest-after) (consume-until (lambda (ch) (char=? ch #\newline)) (cdr rest))])
(loop rest-after (cons (cons 'COMMENT acc) tokens)))]
[else
(error "Loose token" (car rest))])))
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment