Chubek/README.md

## README.md

      
    Raw
  

              README.md
            
          
    The reason you could be writing a grammar is:
1- You wish people to learn the ins-and-outs of your language;
2- You wish to feed this grammar into an LLM, a lower-spec generator like BNFC, etc;
3- You wish to 'mindstorm' features of your language, or a superset that you are defining for an existing language.
I have devised a method of specifying grammars which is not very original, but the positive side of it is, the lexical and the syntactic grammar are separate, so an LLM like ChatGPT can easily create separate lexical analyszer specs or even the lexical analyzer itself, from the parser --- which is of course defined by the syntactic grammar.
So let's demonstrate this with an example. I'll use the grammar specs I am working on right now, an spec for AWK, for which I am writiing a C translator (called Awk2C, in Ocaml). But you can see a finished one in my ASDL implementation.
So basically, these are general EBNF gramamrs, with syntactic and lexical aspects separated. And 'regular definitions' are used to describe the lexical gramamr (curse below to see what they are).
(The difference between a lexical, and syntactic gramamr, are explained below. Lexical grammar is cComsky type 3, Syntactic grammar is Chomskty type 2).
Here's how you read an EBNF grammar:


Enclosed Right
Enclosed Left
Meaning


Single Quote
Single Quote
Single character


Double Quote
Double Quote
Byte-sequence string


Right Bracket
Left Bracket
What's within is optional


Right Curly
Left Curly
What's within is repeatable


Right Paren
Left Paren
What's within is grouped


So these are the 'major' aspects of EBNF. Keep in mind, in my flavor of EBNF mixed with regular definitions, sometimes, brackets mean 'range' the way they do in regular expressions.
awk            ::= { fn-def | patt-act | comment }

fn-def         ::= "function" unindexed '(' unindexed-list ')' block

unindexed-list ::= unindexed { ',' unindexed }

patt-act       ::= pattern block

pattern        ::= expr-rng
                | regex
                | "BEGIN"
                | "END"

expr-rng       ::= expr [ ',' expr ]

block          ::= '{' stmt { [ ';' ] stmt } '}'

stmt           ::= if-stmt 
                | while-stmt 
                | for-stmt 
                | for-in-stmt 
                | assign-stmt 
                | next-stmt
                | expr

if-stmt        ::= "if" '(' expr ')' block [ "else" block ]

while-stmt     ::= "while" '(' expr ')' block

for-stmt       ::= "for" '(' [ assign-stmt ] ';' [ expr ] ';' [ assign-stmt ] ')' block

for-in-stmt    ::= "for" '(' variable "in" array ')' block

next-stmt      ::= "next"

assign-stmt    ::= variable '=' expr

expr           ::= primary | binary-expr | unary-expr

binary-expr    ::= logical-or-expr

logical-or-expr     ::= logical-and-expr { "||" logical-and-expr }
logical-and-expr    ::= equality-expr { "&&" equality-expr }
equality-expr       ::= relational-expr { ("==" | "!=") relational-expr }
relational-expr     ::= additive-expr { ("<" | ">" | "<=" | ">=") additive-expr }
additive-expr       ::= multiplicative-expr { ("+" | "-") multiplicative-expr }
multiplicative-expr ::= exponentiation-expr { ("*" | "/") exponentiation-expr }
exponentiation-expr ::= unary-expr { "^" unary-expr }

unary-expr     ::= "++" primary
                | "--" primary
                | primary "++"
                | primary "--"
                | '+' primary
                | '-' primary
                | '!' primary

primary        ::= number | regex | variable | function-call | '(' expr ')' ;

function-call  ::= unindexed '(' expr-list ')'

variable       ::= indexed | unindexed

indexed        ::= unindexed '[' no-lbrack+ { SUBSEP no-lbrack+ } ']' # SUBSEP is user-defined, ',' by default

unindexed      ::= ( letter | '_' ) { letter | digit |  '_' }

# Lexical Grammar for AWK

regex          ::= '/' no-slash+ '/'

number         ::= integer | float
float          ::= integer '.' integer
integer        ::= digit+

letter         ::= upper | lower
upper          ::= [A-Z]
lower          ::= [a-z]
digit          ::= [0-9]

no-lbrack      ::= [^\]] | "\]"
no-dquote      ::= [^"] | '\"'
no-squote      ::= [^'] | "\'"
no-slash       ::= [^/] | "\/"

comment        ::= "#" .* \n
People familar with EBNF will notice something odd: regular definitions!
But what are regular definitions?
In the world of computer lingustics, we have type 2 Chomsky grammar (context-free grammars) and type-3 Chomsky grammars (regular languages).
A lot of methods have been developed to describe CFGs. Backaus Naur Form of BNF was invented by eponymous creators of Algol to denote their grammars. Later, methods such as VDL or Vienna Definition Language was invented. These meta-languages were all cool and hip, but none of them were really machine-understandable at the time. At this time, parser generators like Yacc invented their own machine-parsable way of representing grammars.
Yacc relied on both type 2 and type 3 grammars to parse the program. You needed to tokenize your progrma and create a type 3 'scanner', also known as a 'lexer'. That is why Lex was created, ironically by a fella called Lesk, to automate the lexicall scanner generation based on type 3 grammars.
So we talked about type 2 grammars. What is a type 3 grammar?
There are not many type 3 meta-languages around. The most obvious one is regular expressions! But we also have regular definitions.
Regular definitions are similar to regular expressions, except they are more verbose. Regular definitions mix type 2 and type 3 grammars.
They are in a way similar, and precursor, to PEG or Parsing Expression Grammars, created by Robert Ford in his thesis in 2004. To tell you the truth, PEGs seem to be the LATEST, HOTTEST breakthrough in the world of LP! Honesst.
So in my way of defining EBNF grammars, I mix-in regular definitions with EBNF. Plese tell me what you think?
Enclosed Right	Enclosed Left	Meaning
Single Quote	Single Quote	Single character
Double Quote	Double Quote	Byte-sequence string
Right Bracket	Left Bracket	What's within is optional
Right Curly	Left Curly	What's within is repeatable
Right Paren	Left Paren	What's within is grouped