Skip to content

Instantly share code, notes, and snippets.

@ryandabler
Last active January 23, 2024 17:57
Show Gist options
  • Save ryandabler/1dcdcb88890f72677bfdaed661d64c8c to your computer and use it in GitHub Desktop.
Save ryandabler/1dcdcb88890f72677bfdaed661d64c8c to your computer and use it in GitHub Desktop.
Complete example of a JSON parser for this article: https://itnext.io/demystifying-json-parse-383bb4907d79
const tokens = [
[/^\s+/, null],
[/^\[/, '['],
[/^]/, ']'],
[/^\{/, '{'],
[/^}/, '}'],
[/^:/, ':'],
[/^,/, ','],
[/^"/, '"'],
[/^\d+/, 'number'],
[/^null\b/, 'null'], // Add word boundaries to ensure that
[/^true\b/, 'true'], // `null`, etc., is matched exactly.
[/^false\b/, 'false'],
[/^[^"]*/, 'string'], // Match anything not a quotation mark.
];
class Tokenizer {
// List of tokens recognized by the tokenizer
#tokens;
// Position in the input string from which it will
// read to get the next token
#cursor;
// String to turn into tokens
#string;
constructor(tokens) {
this.#tokens = tokens;
}
read(string) {
this.#cursor = 0;
this.#string = string;
}
next() {
// If at end of input, not more tokens to generate
if (this.#cursor === this.#string.length) {
return undefined;
}
// Find substring beginning at position of cursor
const str = this.#string.slice(this.#cursor);
for (const [pattern, type] of this.#tokens) {
const [match] = pattern.exec(str) || [];
if (!match) {
continue;
}
this.#cursor += match.length;
// Skip tokens with null types
if (type === null) {
return this.next();
}
return { token: match, type };
}
// Could not extract any tokens, so throw error
throw new Error(`Unrecognized input: ${str[0]}`);
}
}
const astFactory = {
JSON_STRING(token) {
return {
type: 'JSON_STRING',
value: token.token,
};
},
JSON_number(token) {
return {
type: 'JSON_number',
value: +token.token,
};
},
JSON_null() {
return {
type: 'JSON_null',
value: null,
};
},
JSON_BOOL(token) {
return {
type: 'JSON_BOOL',
value: token.token === 'true',
};
},
JSON_ARRAY(elements) {
return {
type: 'JSON_ARRAY',
elements,
};
},
JSON_OBJECT(entries) {
return {
type: 'JSON_OBJECT',
entries,
}
}
};
const jsFactory = {
JSON_STRING(token) {
return token.token;
},
JSON_number(token) {
return +token.token;
},
JSON_null() {
return null;
},
JSON_BOOL(token) {
return token.token === 'true'
},
JSON_ARRAY(elements) {
return elements;
},
JSON_OBJECT(entries) {
return entries.reduce(
(obj, [k, v]) => {
obj[k] = v;
return obj;
},
{}
);
}
};
class Analyzer {
#tokenizer;
#lookahead;
#factory;
constructor(tokenizer, factory) {
this.#tokenizer = tokenizer;
this.#factory = factory;
}
read(string) {
this.#tokenizer.read(string);
this.#lookahead = this.#tokenizer.next();
return this.#JSON();
}
#eat(tokenType) {
const token = this.#lookahead;
if (!token) {
throw new Error(
`Unexpected end of input; expected ${token.type}`
);
}
if (this.#lookahead.type !== tokenType) {
throw new Error(
`Expected ${tokenType} === ${token.type}`
);
}
this.#lookahead = this.#tokenizer.next();
return token;
}
#JSON() {
switch (this.#lookahead.type) {
case '"':
return this.#JSON_STRING();
case 'number':
return this.#JSON_number();
case 'null':
return this.#JSON_null();
case 'true':
case 'false':
return this.#JSON_BOOL();
case '[':
return this.#JSON_ARRAY();
case '{':
return this.#JSON_OBJECT();
default:
throw new Error(
`Received token which cannot be valid JSON`
);
}
}
#JSON_STRING() {
// The quotation marks are necessary for the JSON grammar,
// but contribute nothing to the semantic content of the
// AST, so ensure they exist but do not use
this.#eat('"');
const string = this.#eat('string');
this.#eat('"');
return this.#factory.JSON_STRING(string);
}
#JSON_number() {
const number = this.#eat('number');
return this.#factory.JSON_number(number);
}
#JSON_null() {
this.#eat('null');
return this.#factory.JSON_null();
}
#JSON_BOOL() {
const bool = this.#lookahead.type === 'true'
? this.#eat('true')
: this.#eat('false');
return this.#factory.JSON_BOOL(bool);
}
#JSON_ARRAY() {
this.#eat('[');
const elements = this.#JSON_ARRAY_ELS();
this.#eat(']');
return this.#factory.JSON_ARRAY(elements);
}
#JSON_ARRAY_ELS() {
const elements = [];
while (this.#lookahead.type !== ']') {
elements.push(this.#JSON());
this.#lookahead.type === ',' && this.#eat(',');
}
return elements;
}
#JSON_OBJECT() {
this.#eat('{');
const kvPairs = this.#JSON_OBJECT_KV_PAIRS();
this.#eat('}');
return this.#factory.JSON_OBJECT(kvPairs);
}
#JSON_OBJECT_KV_PAIRS() {
const entries = [];
while (this.#lookahead.type !== '}') {
entries.push(this.#JSON_OBJECT_KV_PAIR());
this.#lookahead.type === ',' && this.#eat(',');
}
return entries;
}
#JSON_OBJECT_KV_PAIR() {
// Get key
this.#eat('"');
const key = this.#eat('string').token;
this.#eat('"');
this.#eat(':');
// Get value
const value = this.#JSON();
return [key, value];
}
}
@Ingmar-Paetzold
Copy link

Ingmar-Paetzold commented Jul 30, 2023

Hi Ryan, thank you very much for your concise and very well written tutorial on JSON parsing (the demystifying article)! I was looking for something that lays out a clean and lightweight algorithm and even some pseudo code, before I invent the wheel on my own. Javascript is perfectly suited for that job, since one can immediately get it running in the browser to play with.
One suggestion: the number token only parses positive integers, thus, it chops the fractional part of floating points, and it does not detect negative values at all. How about a more complete Regex according to the RFC, e.g. /^-?\d+([.]\d+)?([Ee][+-]?\d+)?/?

Edit: I had a string "2019-01-01" that was classified as 'number' by the Tokenizer, since once the quotes are eaten, the number stands "naked". Therefore, I suggest to abandon the quotes as a separate token class and make them part of the string-regex. The quotes must then be trimmed away in #JSON_STRING() and #JSON_OBJECT_KV_PAIR() instead of #eat()ing them. First case in #JSON() becomes case 'string'. This is even more consistent with the RFC 4627 which defines the tokens, among those the 6 "structural characters" []{}:, (not the quote), and the values, where "A string begins and ends with quotation marks".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment