@onurkerimov
Last active March 4, 2024 01:41
Open idea: parser combinator API using regex

This is my take on a parser combinator library API for JavaScript. The library would have a very small API surface and would be easy to learn for anyone who already knows regex.

Parser combinator libraries often create their own DSL (domain-specific language) to provide a less verbose way of declaring parsers. These DSLs have to support operators for repetition, alternation, optionality, lookahead, and more. Other libraries skip the DSL and expose these operators as helper functions instead.
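For comparison, here is a minimal sketch of the helper-function style. Every name in it (`lit`, `seq`, `alt`, `many`, `optional`) is hypothetical and not taken from any particular library; it only illustrates how much code is needed to express what the regex `/a(b|c)*d?/` says in a few characters.

```javascript
// Hypothetical helper-style combinators. A parser is a function
// (input, pos) => { value, pos } on success, or null on failure.
const lit = s => (input, pos) =>
  input.startsWith(s, pos) ? { value: s, pos: pos + s.length } : null

// Sequence: run every parser in order, fail as soon as one fails.
const seq = (...parsers) => (input, pos) => {
  const values = []
  for (const p of parsers) {
    const r = p(input, pos)
    if (r === null) return null
    values.push(r.value)
    pos = r.pos
  }
  return { value: values, pos }
}

// Alternation: the first parser that succeeds wins.
const alt = (...parsers) => (input, pos) => {
  for (const p of parsers) {
    const r = p(input, pos)
    if (r !== null) return r
  }
  return null
}

// Repetition (zero or more); stops if the parser fails or consumes nothing.
const many = p => (input, pos) => {
  const values = []
  let r
  while ((r = p(input, pos)) !== null && r.pos > pos) {
    values.push(r.value)
    pos = r.pos
  }
  return { value: values, pos }
}

// Optionality: succeed with a null value when the parser fails.
const optional = p => (input, pos) => p(input, pos) ?? { value: null, pos }

// Helper-function equivalent of the regex /a(b|c)*d?/
const abcdParser = seq(lit('a'), many(alt(lit('b'), lit('c'))), optional(lit('d')))
const helperResult = abcdParser('abcbd', 0)
```

With this input, `helperResult.value` holds the nested array `['a', ['b', 'c', 'b'], 'd']`, and `abcdParser('xyz', 0)` returns `null`.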

The following page compares JSON parser implementations written with several parser combinators and generators: https://chevrotain.io/performance/ Some of them are verbose, others are concise.

The parser combinator API I'm proposing does not provide a DSL of its own, yet it results in very short parser declarations, which are arguably readable if you're comfortable with regular expressions. The idea is this: regex is already a DSL in its own right, and it supports many of the operators a parser combinator may need:

  • Repetition: X* (zero or more), X+ (one or more)
  • Optionality: X?
  • Alternation: [XYZ], (X|Y|Z)
  • Negation: [^X]
  • Positive and negative lookahead...

Regular expressions can therefore be used to declare parsing rules. The following snippet demonstrates a JSON parser design using this technique.
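Before the full design, here is a rough sketch of how regex-declared tokens could drive a tokenizer. It is not the proposed library's implementation: `makeToken` and `makeTokenizer` are hypothetical names, the explicit `type` argument is an assumption, and the approach (trying each sticky regex at the current offset) is just one plausible way to do it.

```javascript
// Hypothetical sketch, not the proposed library's API.
const makeToken = (type, regex, transform = t => t) => ({
  type,
  // The sticky flag (y) makes the regex match only at exactly `lastIndex`.
  regex: new RegExp(regex.source, 'y'),
  transform
})

const makeTokenizer = tokenDefs => input => {
  const tokens = []
  let pos = 0
  while (pos < input.length) {
    let matched = false
    // Try the token definitions in declaration order; the first match wins.
    for (const def of tokenDefs) {
      def.regex.lastIndex = pos
      const m = def.regex.exec(input)
      if (m === null || m[0].length === 0) continue
      const tok = def.transform({ type: def.type, value: m[0], offset: pos })
      if (tok !== null) tokens.push(tok) // a null result drops the token
      pos += m[0].length
      matched = true
      break
    }
    if (!matched) throw new SyntaxError(`Unexpected character at offset ${pos}`)
  }
  return tokens
}

const demoTokenize = makeTokenizer([
  makeToken('WhiteSpace', /\s+/, () => null),
  makeToken('NumberLiteral', /-?(0|[1-9]\d*)(\.\d+)?([eE][+-]?\d+)?/),
  makeToken('Delimiter', /[\[\]{}:,]/, t => ({ ...t, type: t.value })),
  makeToken('Keyword', /true|false|null/)
])

const demoTokens = demoTokenize('[1, -2.5e3, true]')
```

Here whitespace is discarded by returning `null` from the transform, and each delimiter uses its own character as its type, mirroring the two callbacks in the design that follows.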

import { token, rule, oneOf, createTokenizer, createParser } from 'my-awesome-parser-combinator-library'
// Tokens
const Keyword = token(/true|false|null/)
const Delimiter = token(/\.|:|,|\{|\}|\(|\)|\[|\]/, token => ({...token, type: token.value}))
const StringLiteral = token(/"(?:[^\\"\n\r]+|\\(?:[bfnrtv"\\/]|u[0-9a-fA-F]{4}))*"/)
const NumberLiteral = token(/-?(0|[1-9]\d*)(\.\d+)?([eE][+-]?\d+)?/)
const WhiteSpace = token(/\s+/, () => null)
// Tokenizer
const tokenizer = createTokenizer([
  WhiteSpace,
  NumberLiteral,
  StringLiteral,
  Delimiter,
  Keyword
])
// Parser
// In a rule's pattern, a digit N stands for the Nth entry of the array
// that follows it; the other characters match delimiter tokens.
const parser = createParser(R => {
  R.JSON = () => oneOf([R.Object, R.Array])
  R.Object = () => rule(/\{(0(,0)*)?\}/, [R.ObjectItem])
  R.Array = () => rule(/\[(0(,0)*)?\]/, [R.Value])
  R.ObjectItem = () => rule(/0:1/, [StringLiteral, R.Value])
  R.Value = () => oneOf([StringLiteral, NumberLiteral, R.Object, R.Array, Keyword])
})
export default str => parser(tokenizer(str))
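How `rule` would work internally is left open above. One possible reading, sketched below under loud assumptions, is to encode the token sequence as a string (one character per token) and run the rule's regex over that string. `matchFlatRule` is a hypothetical name. The sketch only handles flat rules whose operands are token types, supports at most ten operands (one digit each), and only reports whether the rule matches; nested rules such as `R.Value` would need recursive handling that is not shown.

```javascript
// Hypothetical sketch: checks a flat rule against a list of tokens.
const matchFlatRule = (pattern, operandTypes, tokens) => {
  // Encode each token as a single character:
  //  - the operand's index if its type is listed in `operandTypes`
  //  - the type itself for delimiters (their type is their own character)
  //  - '?' for any token the rule does not know about
  const encoded = tokens.map(tok => {
    const index = operandTypes.indexOf(tok.type)
    if (index !== -1) return String(index)
    if (tok.type.length === 1) return tok.type
    return '?'
  }).join('')
  // Anchor the pattern so the rule has to cover the whole token sequence.
  const anchored = new RegExp(`^(?:${pattern.source})$`)
  return anchored.test(encoded)
}

const arrayRule = /\[(0(,0)*)?\]/
const arrayTokens = [
  { type: '[', value: '[' },
  { type: 'NumberLiteral', value: '1' },
  { type: ',', value: ',' },
  { type: 'NumberLiteral', value: '2' },
  { type: ']', value: ']' }
]

// Encoded as "[0,0]", which the array rule accepts.
const arrayOk = matchFlatRule(arrayRule, ['NumberLiteral'], arrayTokens)
// Encoded as "[0,", which the array rule rejects.
const arrayBad = matchFlatRule(arrayRule, ['NumberLiteral'], arrayTokens.slice(0, 3))
```

The appeal of this encoding is that repetition, optionality, and alternation in a grammar rule come for free from the regex engine instead of from combinator helpers.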