Skip to content

Instantly share code, notes, and snippets.

@Chubek
Created May 1, 2024 14:01
Show Gist options
  • Save Chubek/5f3937e81e268146eda7d43be932869d to your computer and use it in GitHub Desktop.
Save Chubek/5f3937e81e268146eda7d43be932869d to your computer and use it in GitHub Desktop.
A tokenizer for EBNF in Scheme (need parser!)

tokenize-ebnf.scm is a tokenizer for EBNF in Scheme. Although it is [almost] useless without a parser --- and that is what I have issue arriving at!

I want this to be an educational experience, so I don't wish to resort to define-sytax macros. Otherwise it would have been much simpler. Scheme's macros are a cheatcode tbqh. In fact, CLisp's macros are cheatcode too. Read 'Let Over Lambda' by Doug Hoyte. If I use macros here, I have doomed myself to be a Scheme Simp (Scsimp?) forever. I just don't want that.

So what I want is your help in creating a parser for EBNF that keeps in mind the recursive, context-free nature of it. EBNF was not meant to be parser, it was never created as an applicable language. I have loooked around and there are not that many parsers for EBNF around. It's a meta-grammar meant to be leared at and hooted at. Not parsed.

And I don't really need to parse it. I wanna write an ebnf2ps program. So this is enough really.

The problem is the 'nested' or 'recursive' nature of EBNF. For example:

my-rule  ::= '>' int [ my-other-rule { ',' my-other-rule } ]

This kinda throws a wrench at the whole process.

Your help is much appreciated. (chubakbidpaa [at] riseup [dot] net)

(define (consume-while predicate rest)
(let loop ((rest-prime rest)
(acc '()))
(if (and (not (null? rest-prime))
(predicate (car rest-prime)))
(loop (cdr rest-prime) (cons (car rest-prime) acc))
(values (list->string (reverse acc)) rest-prime))))
(define (consume-until predicate rest)
(let loop ((rest-prime rest)
(acc '()))
(if (or (null? rest-prime)
(predicate (car rest-prime)))
(values (list->string (reverse acc)) rest-prime)
(loop (cdr rest-prime) (cons (car rest-prime) acc)))))
(define (tokenize input)
(let loop ((rest (string->list input))
(tokens '(INIT)))
(cond
[(null? rest)
(reverse (list 'EOF tokens))]
[(member (car rest)
'(#\newline #\space #\tab))
(loop (cdr rest) tokens)]
[(member (car rest)
'(#\" #\' #\`))
(let-values
([(acc rest-after) (consume-while (lambda (ch) (not (char=? ch (car rest)))) (cdr rest))])
(loop rest-after (cons (cons 'TERMINAL acc) tokens)))]
[(char-alphabetic? (car rest))
(let-values
([(acc rest-after)
(consume-while (lambda (ch)
(or (char-alphabetic? ch)
(char-numeric? ch)
(char=? ch #\-))) rest)])
(loop rest-after (cons (cons 'NON-TEMRINAL-ID acc) tokens)))]
[(char=? (car rest) #\:)
(let-values
([(_ rest-after) (consume-while (lambda (ch) (or (char=? ch #\:) (char=? ch #\=))) (cdr rest))])
(loop rest-after (cons '(ASSIGN ASSIGN) tokens)))]
[(char=? (car rest) #\.)
(let-values
([(_ rest-after) (consume-while (lambda (ch) (char=? ch #\.)) (cdr rest))])
(loop rest-after (cons '(RANGE RANGE) tokens)))]
[(char=? (car rest) #\\)
(let-values
([(acc rest-after) (consume-while (lambda (ch) (not (member ch '(#\newline #\space #\tab)))) (cdr rest))])
(loop rest-after (cons (cons 'SPECIAL acc) tokens)))]
[(char=? (car rest) #\/)
(let-values
([(acc rest-after) (consume-until (lambda (ch) (char=? ch #\/)) (cdr rest))])
(loop rest-after (cons (cons 'REGEX acc) tokens)))]
[(member (car rest) '(#\( #\[ #\{ #\) #\] #\} #\; #\, #\|))
(loop (cdr rest)
(cons (case (car rest)
[(#\;) '(CTRL SEMI)]
[(#\() '(SUB-START LPAREN)]
[(#\[) '(SUB-START LBRACK)]
[(#\{) '(SUB-START LCURLY)]
[(#\)) '(SUB-END RPAREN)]
[(#\]) '(SUB-END RBRACK)]
[(#\}) '(SUB-ENT RCURLY)]
[(#\|) '(CTRL ALTCHR)]
[(#\,) '(DECORE COLON)]) tokens))]
[(char=? (car rest) #\#)
(let-values
([(acc rest-after) (consume-until (lambda (ch) (char=? ch #\newline)) (cdr rest))])
(loop rest-after (cons (cons 'COMMENT acc) tokens)))]
[else
(error "Loose token" (car rest))])))
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment