Skip to content

Instantly share code, notes, and snippets.

@Chubek
Last active February 28, 2024 22:39
Show Gist options
  • Save Chubek/df09c5d429d1b3ea0bbdf4e8123160ba to your computer and use it in GitHub Desktop.
Save Chubek/df09c5d429d1b3ea0bbdf4e8123160ba to your computer and use it in GitHub Desktop.
Writing Grammars: Mixing Regular Definitions with EBNF

The reason you could be writing a grammar is:

1- You wish people to learn the ins-and-outs of your language;

2- You wish to feed this grammar into an LLM, a lower-spec generator like BNFC, etc;

3- You wish to 'mindstorm' features of your language, or a superset that you are defining for an existing language.

I have devised a method of specifying grammars which is not very original, but the positive side of it is, the lexical and the syntactic grammar are separate, so an LLM like ChatGPT can easily create separate lexical analyszer specs or even the lexical analyzer itself, from the parser --- which is of course defined by the syntactic grammar.

So let's demonstrate this with an example. I'll use the grammar specs I am working on right now, an spec for AWK, for which I am writiing a C translator (called Awk2C, in Ocaml). But you can see a finished one in my ASDL implementation.

So basically, these are general EBNF gramamrs, with syntactic and lexical aspects separated. And 'regular definitions' are used to describe the lexical gramamr (curse below to see what they are).

(The difference between a lexical, and syntactic gramamr, are explained below. Lexical grammar is cComsky type 3, Syntactic grammar is Chomskty type 2).

Here's how you read an EBNF grammar:

Enclosed Right Enclosed Left Meaning
Single Quote Single Quote Single character
Double Quote Double Quote Byte-sequence string
Right Bracket Left Bracket What's within is optional
Right Curly Left Curly What's within is repeatable
Right Paren Left Paren What's within is grouped

So these are the 'major' aspects of EBNF. Keep in mind, in my flavor of EBNF mixed with regular definitions, sometimes, brackets mean 'range' the way they do in regular expressions.

awk            ::= { fn-def | patt-act | comment }

fn-def         ::= "function" unindexed '(' unindexed-list ')' block

unindexed-list ::= unindexed { ',' unindexed }

patt-act       ::= pattern block

pattern        ::= expr-rng
                | regex
                | "BEGIN"
                | "END"

expr-rng       ::= expr [ ',' expr ]

block          ::= '{' stmt { [ ';' ] stmt } '}'

stmt           ::= if-stmt 
                | while-stmt 
                | for-stmt 
                | for-in-stmt 
                | assign-stmt 
                | next-stmt
                | expr

if-stmt        ::= "if" '(' expr ')' block [ "else" block ]

while-stmt     ::= "while" '(' expr ')' block

for-stmt       ::= "for" '(' [ assign-stmt ] ';' [ expr ] ';' [ assign-stmt ] ')' block

for-in-stmt    ::= "for" '(' variable "in" array ')' block

next-stmt      ::= "next"

assign-stmt    ::= variable '=' expr

expr           ::= primary | binary-expr | unary-expr

binary-expr    ::= logical-or-expr

logical-or-expr     ::= logical-and-expr { "||" logical-and-expr }
logical-and-expr    ::= equality-expr { "&&" equality-expr }
equality-expr       ::= relational-expr { ("==" | "!=") relational-expr }
relational-expr     ::= additive-expr { ("<" | ">" | "<=" | ">=") additive-expr }
additive-expr       ::= multiplicative-expr { ("+" | "-") multiplicative-expr }
multiplicative-expr ::= exponentiation-expr { ("*" | "/") exponentiation-expr }
exponentiation-expr ::= unary-expr { "^" unary-expr }

unary-expr     ::= "++" primary
                | "--" primary
                | primary "++"
                | primary "--"
                | '+' primary
                | '-' primary
                | '!' primary

primary        ::= number | regex | variable | function-call | '(' expr ')' ;

function-call  ::= unindexed '(' expr-list ')'

variable       ::= indexed | unindexed

indexed        ::= unindexed '[' no-lbrack+ { SUBSEP no-lbrack+ } ']' # SUBSEP is user-defined, ',' by default

unindexed      ::= ( letter | '_' ) { letter | digit |  '_' }

# Lexical Grammar for AWK

regex          ::= '/' no-slash+ '/'

number         ::= integer | float
float          ::= integer '.' integer
integer        ::= digit+

letter         ::= upper | lower
upper          ::= [A-Z]
lower          ::= [a-z]
digit          ::= [0-9]

no-lbrack      ::= [^\]] | "\]"
no-dquote      ::= [^"] | '\"'
no-squote      ::= [^'] | "\'"
no-slash       ::= [^/] | "\/"

comment        ::= "#" .* \n

People familar with EBNF will notice something odd: regular definitions!

But what are regular definitions?

In the world of computer lingustics, we have type 2 Chomsky grammar (context-free grammars) and type-3 Chomsky grammars (regular languages).

A lot of methods have been developed to describe CFGs. Backaus Naur Form of BNF was invented by eponymous creators of Algol to denote their grammars. Later, methods such as VDL or Vienna Definition Language was invented. These meta-languages were all cool and hip, but none of them were really machine-understandable at the time. At this time, parser generators like Yacc invented their own machine-parsable way of representing grammars.

Yacc relied on both type 2 and type 3 grammars to parse the program. You needed to tokenize your progrma and create a type 3 'scanner', also known as a 'lexer'. That is why Lex was created, ironically by a fella called Lesk, to automate the lexicall scanner generation based on type 3 grammars.

So we talked about type 2 grammars. What is a type 3 grammar?

There are not many type 3 meta-languages around. The most obvious one is regular expressions! But we also have regular definitions.

Regular definitions are similar to regular expressions, except they are more verbose. Regular definitions mix type 2 and type 3 grammars.

They are in a way similar, and precursor, to PEG or Parsing Expression Grammars, created by Robert Ford in his thesis in 2004. To tell you the truth, PEGs seem to be the LATEST, HOTTEST breakthrough in the world of LP! Honesst.

So in my way of defining EBNF grammars, I mix-in regular definitions with EBNF. Plese tell me what you think?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment