@sma
Last active April 15, 2023 21:24
A TYON implementation by ChatGPT (cannot deal with typed values)

Here is a ChatGPT experiment that might even result in something useful.

I recently saw a mention on Hacker News of a JSON-like data format called TYON, for which a parser written in Dart might be quite useful. It shouldn't be too difficult to write by hand, but why not let ChatGPT 4 do the job?

Basically, in addition to JSON, TYON allows Lisp-style comments, doesn't require " around strings, and starts with an implicit object, so you need no top-level {}. It also supports type definitions that can be used to build objects from plain data.
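To illustrate, here's a small sample I pieced together from that description (my own example, not one from the specification):

; a Lisp-style comment
name = John                  ; no quotes needed
"full name" = "John Doe"
numbers = [1 2 3]

/point = (x y)               ; a type definition
p = /point(1 2)              ; equivalent to p = (x = 1 y = 2)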

Here's my prompt:

Please write a parser in Dart for a file format according to the specification below. Parsing a string of TYON should return a Map<String, dynamic> where dynamic is really either a String, another such map, or a List<dynamic> whose elements are again either a string, a map, or a list.

Then I appended the whole specification to my prompt and pressed RETURN and waited while our future overlord slowly emitted its wisdom.

ChatGPT starts with a textbook solution using a lexer and a parser:

Map<String, dynamic> parseTYON(String input) {
  final lexer = _TYONLexer(input);
  final parser = _TYONParser(lexer);
  return parser.parse();
}

The lexer is also very classical: it stores the _input and maintains an _index to the current character, which can be inspected (_peek) or consumed (_consume):

class _TYONLexer {
  static const _specialChars = '()[]=/;';

  String _input;
  int _index;

  _TYONLexer(String input)
      : _input = input,
        _index = 0;

  String _consume() {
    if (_index < _input.length) {
      return _input[_index++];
    }
    return null;
  }

  String _peek() {
    if (_index < _input.length) {
      return _input[_index];
    }
    return null;
  }

  ...

Note that ChatGPT doesn't know about Dart's null safety, so the _peek and _consume methods should have the return type String?, but that's easy to fix. And _input should probably be final.

A nextToken method returns the next token from the input (again, it should have the return type String?). As you can see, it skips whitespace and returns null once the end of the input has been reached. Otherwise it skips comments and then returns either a syntax token, a quoted string, or a literal:

  String nextToken() {
    String char = _peek();
    while (char != null && _isWhitespace(char)) {
      _consume();
      char = _peek();
    }

    if (char == null) {
      return null;
    }

    if (char == ';') {
      while (char != null && char != '\n') {
        _consume();
        char = _peek();
      }
      return nextToken();
    }

    if (_specialChars.contains(char)) {
      _consume();
      return char;
    }

    if (char == '"') {
      _consume();
      return _readQuotedString();
    }

    return _readLiteral();
  }

The _isWhitespace method does the obvious thing, so I omitted the code.

The _readQuotedString method, however, doesn't follow the specification, which says that a " must be escaped as "". Instead, it implements the usual \-escaping. That's something I will fix later.
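For reference, the specification's rule means doubled quotes inside a quoted string stand for a single quote, as in this line from the examples:

quoted = "some ""quoted"" text"   ; the value is: some "quoted" text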

A literal, according to the specification, cannot start with " or /, but may contain them. This is something ChatGPT also overlooked in _readLiteral. I didn't bother.

The parser declares two methods to interact with the lexer, but only ever uses _nextToken (again, we need to add ? for null safety and a final to make the linter happy):

class _TYONParser {
  _TYONLexer _lexer;

  _TYONParser(this._lexer);

  String _peekToken() => _lexer._peek();

  String _nextToken() => _lexer.nextToken();

  ...

Let's have a look at parse which drives the whole process:

  Map<String, dynamic> parse() {
    Map<String, dynamic> result = HashMap<String, dynamic>();
    String? token = _nextToken();

    while (token != null) {
      if (token == '(' || token == '[') {
        throw FormatException('Unexpected opening bracket');
      }

      String key = token;
      token = _nextToken();

      if (token == '=') {
        token = _nextToken();

        if (token == '(' || token == '[') {
          result[key] = _parseValue(token);
        } else {
          result[key] = token;
        }
      } else {
        throw FormatException('Expected "=" after key');
      }

      token = _nextToken();
    }

    return result;
  }

It reads as many tokens as it can find. It correctly knows that a ( or [ would be illegal while a property name is expected; actually, any syntax character is invalid at this point, not just those two. It then simply assumes that the token is the key, which is incorrect if it's a =, a ), or whatever. Also, because the lexer already strips the " from quoted strings, a perfectly valid expression like "[" = "]" would be rejected. It then expects a = and parses values in [ ] or ( ), or assumes that the token must be a literal value, again without checking. The code is okay-ish but could be improved.

Here is _parseValue:

  dynamic _parseValue(String token) {
    if (token == '(') {
      return _parseMap();
    } else if (token == '[') {
      return _parseList();
    } else {
      return token;
    }
  }

Depending on the token, it parses a map or a list, which is correct. However, anything but a literal or a quoted string (something the lexer cannot distinguish) would be invalid at this point. And there is absolutely no need to test for ( and [ twice: either drop the check in parse or inline the code from _parseValue.
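Here's a sketch of what I mean (my own refactoring, not ChatGPT's code): let _parseValue do all the dispatching and reject stray syntax tokens, so the callers don't have to check for ( and [ themselves:

  dynamic _parseValue(String? token) {
    if (token == null) {
      throw FormatException('Unexpected end of input');
    }
    if (token == '(') return _parseMap();
    if (token == '[') return _parseList();
    if (')]=/'.contains(token)) {
      throw FormatException('Unexpected "$token"');
    }
    return token; // a literal or an (already unquoted) string
  }

Of course, this still can't tell a quoted ")" from the syntax token — that would need a lexer that tags its tokens.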

Parsing a list is rather simple. It accepts tokens that are either new maps, lists, or literals:

  List<dynamic> _parseList() {
    List<dynamic> list = [];
    String? token = _nextToken();

    while (token != null && token != ']') {
      if (token != '(' && token != '[') {
        list.add(token);
      } else {
        list.add(_parseValue(token));
      }
      token = _nextToken();
    }
    return list;
  }

Parsing a map would be more difficult if the code tried to deal with types. But it completely ignores those and just assumes that a map consists of key-value pairs separated by =:

  Map<String, dynamic> _parseMap() {
    Map<String, dynamic> map = HashMap<String, dynamic>();
    String key;
    dynamic value;
    String? token = _nextToken();

    while (token != null && token != ')') {
      key = token;
      token = _nextToken();

      if (token == '=') {
        token = _nextToken();

        if (token == '(' || token == '[') {
          value = _parseValue(token);
        } else {
          value = token!;
        }

        map[key] = value;
        token = _nextToken();
      } else {
        throw FormatException('Expected "=" after key');
      }
    }
    return map;
  }

So it might be able to parse the map example from the specification, but not the more complex typed examples:

person = (
   first = John
   last = Doe
   age = 42
   "favorite numbers" = [1 2 3]
)

I get a "type 'List' is not a subtype of type 'String'" error because ChatGPT declared the value variable as String instead of dynamic. I agree with ChatGPT that it would be so much nicer if Dart had a real lightweight sum type instead of having to use dynamic. But we have to work with what's available.

After fixing that, it works.
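For the record, the fix boils down to a one-word change in the declaration (the full code at the end already contains it):

  dynamic value;  // was: String value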

Parsing the other examples also seems to work, with the exception of typed expressions like the one below.

/person = (first-name middle-initial last-name age)

So, are you impressed by ChatGPT's abilities?

Reprinting a standard parser doesn't impress me much. But this does:

Me: I think, _readQuotedString is wrong. Can you double check?

ChatGPT: You're correct; I made a mistake in _readQuotedString. The method should handle escaped double quotes "" as a single double quote " in the resulting string. I've fixed the issue in the method below:

String _readQuotedString() {
  StringBuffer buffer = StringBuffer();
  String char = _consume();

  while (char != null && char != '"') {
    if (char == '"') {
      if (_peek() == '"') {
        buffer.write('"');
        _consume(); // Consume the second quote
      } else {
        break;
      }
    } else {
      buffer.write(char);
    }
    char = _consume();
  }
  return buffer.toString();
}

Me: You're correct, but that code isn't.

ChatGPT: Apologies for the confusion. I misunderstood your original concern. Here's the corrected _readQuotedString method:

String _readQuotedString() {
  StringBuffer buffer = StringBuffer();
  String char = _consume();

  while (char != null) {
    if (char == '"') {
      if (_peek() == '"') {
        _consume(); // Consume the second quote
        buffer.write('"');
      } else {
        break;
      }
    } else {
      buffer.write(char);
    }
    char = _consume();
  }
  return buffer.toString();
}

Now escaped quotes are correctly parsed. (Let's ignore the fact that strings in general are not correctly parsed because the lexer doesn't distinguish them from syntax tokens.)

Me: I think, you forgot to implement typed lists

Just watching the code slowly appear in the chat window makes me think that this is too difficult for it. This is not correct:

String _parseTypeName(String token) {
  if (token.startsWith('/')) {
    return token.substring(1);
  }
  return null;
}

And this is completely wrong:

List<dynamic> _parseTypedList(String type) {
  String token = _nextToken();
  if (token == '[') {
    List<dynamic> list = _parseList();
    list.insert(0, type); // Insert type as the first element
    return list;
  } else {
    throw FormatException('Expected "[" after type for list');
  }
}

At least it adds: Note that this implementation doesn't support type definitions, as the example provided in the specification does not clearly demonstrate how these should be handled.

I, on the other hand, think the specification is clear. Let's try again:

Me: I think, you forgot to implement typed lists. I'll explain the syntax:

/name = (a b c) will declare a type "name" with three properties "a", "b", and "c". Such a type definition can only be used on the top-level implicit map.

a typed value like /(a b c)(1 2 3) is equivalent to (a=1 b=2 c=3). Instead of using /(a b c) to define a type "inline", you can also refer to a type name like /name(1 2 3) which, using the type definition from above, again creates the map value (a=1 b=2 c=3).

a typed list like /(a b c)[(1 2 3) (3 4 5)] applies the type to each element - unless that element is explicitly typed. It is therefore equivalent to [/(a b c)(1 2 3) /(a b c)(3 4 5)].

It now (correctly) adds Map<String, List<String>> _types; to the parser to store type definitions, and adds a method to parse them. However, it thinks (probably because I reached the token limit) that the lexer returns /(a b c) as a single token.
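For what it's worth, the mechanical part doesn't seem hard to hand-write. Here's my own sketch (untested, and assuming a type is stored as a list of property names, as in ChatGPT's _types map) of applying a type to a list of positional values:

  Map<String, dynamic> _applyType(List<String> properties, List<dynamic> values) {
    if (properties.length != values.length) {
      throw FormatException(
          'Expected ${properties.length} values, got ${values.length}');
    }
    final map = <String, dynamic>{};
    for (var i = 0; i < properties.length; i++) {
      map[properties[i]] = values[i];
    }
    return map;
  }

After the lexer returned a /, the parser would either parse an inline type ( a b c ) or look up a declared type by name, then apply it to the following ( … ) value — or, for a typed list, to each element of the following [ … ] that isn't explicitly typed itself.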

I tried to help with

Me: remember that your _TYONLexer class will correctly tokenize /(a b c) as six tokens "/", "(", "a", "b", "c", ")".

… but it only gets more wrong. I give up.

I need to do it myself or start over in a new session. But not today. When creating parsers, I like to work from an EBNF grammar, so this is my last question:

Me: please write an EBNF grammar for TYON

TYON          ::= { TypeDecl | KeyValuePair }
TypeDecl      ::= '/' Identifier '=' '(' { Identifier } ')'
KeyValuePair  ::= Key '=' Value
Key           ::= Literal | String
Value         ::= Literal | String | List | Map | TypedList | TypedMap
Literal       ::= ValidLiteralCharacters
String        ::= '"' { AnyCharacterExceptDoubleQuote | '""' } '"'
List          ::= '[' { Value } ']'
Map           ::= '(' { KeyValuePair } ')'
TypedList     ::= InlineType '[' { Value } ']'
TypedMap      ::= TypeReferenceOrInlineType '(' { Value } ')'
InlineType    ::= '/' '(' { Identifier } ')'
TypeReferenceOrInlineType ::= InlineType | TypeReference
TypeReference ::= '/' Identifier
Identifier    ::= ValidIdentifierCharacters

And this is rather great.

PS: Here's the original code, in case somebody is interested.

PPS: I asked for a more compact lexer that uses a regular expression to split the input into tokens and got this (I only added a final and made it null-safe). This is more to my liking, although the regexp is incorrect. I leave fixing it as an exercise to the reader ;)

class _TYONLexer {
  _TYONLexer(String input) : _iterator = _regex.allMatches(input).iterator;
  final Iterator<RegExpMatch> _iterator;

  static final RegExp _regex = RegExp(r'[/()\[\]=]|"(?:[^"]|"")*"|;[^\n]*|\S+');

  String? nextToken() {
    if (!_iterator.moveNext()) return null;

    String token = _iterator.current[0]!;
    if (token.startsWith(';')) {
      return nextToken();
    } else if (token.startsWith('"') && token.endsWith('"')) {
      return token.substring(1, token.length - 1).replaceAll('""', '"');
    } else {
      return token;
    }
  }
}
import 'dart:collection';

Map<String, dynamic> parseTYON(String input) {
  final lexer = _TYONLexer(input);
  final parser = _TYONParser(lexer);
  return parser.parse();
}

class _TYONLexer {
  static const _specialChars = '()[]=/;';

  final String _input;
  int _index;

  _TYONLexer(String input)
      : _input = input,
        _index = 0;

  String? _consume() {
    if (_index < _input.length) {
      return _input[_index++];
    }
    return null;
  }

  String? _peek() {
    if (_index < _input.length) {
      return _input[_index];
    }
    return null;
  }

  bool _isWhitespace(String char) {
    return char == ' ' || char == '\t' || char == '\r' || char == '\n';
  }

  String _readQuotedString() {
    StringBuffer buffer = StringBuffer();
    String? char = _consume();
    while (char != null) {
      if (char == '"') {
        if (_peek() == '"') {
          _consume(); // Consume the second quote
          buffer.write('"');
        } else {
          break;
        }
      } else {
        buffer.write(char);
      }
      char = _consume();
    }
    return buffer.toString();
  }

  String _readLiteral() {
    StringBuffer buffer = StringBuffer();
    String? char = _peek();
    while (char != null && !_isWhitespace(char) && !_specialChars.contains(char)) {
      buffer.write(_consume());
      char = _peek();
    }
    return buffer.toString();
  }

  String? nextToken() {
    String? char = _peek();
    while (char != null && _isWhitespace(char)) {
      _consume();
      char = _peek();
    }
    if (char == null) {
      return null;
    }
    if (char == ';') {
      while (char != null && char != '\n') {
        _consume();
        char = _peek();
      }
      return nextToken();
    }
    if (_specialChars.contains(char)) {
      _consume();
      return char;
    }
    if (char == '"') {
      _consume();
      return _readQuotedString();
    }
    return _readLiteral();
  }
}

class _TYONParser {
  final _TYONLexer _lexer;

  _TYONParser(this._lexer);

  String? _nextToken() => _lexer.nextToken();

  dynamic _parseValue(String? token) {
    if (token == '(') {
      return _parseMap();
    } else if (token == '[') {
      return _parseList();
    } else {
      return token;
    }
  }

  List<dynamic> _parseList() {
    List<dynamic> list = [];
    String? token = _nextToken();
    while (token != null && token != ']') {
      if (token != '(' && token != '[') {
        list.add(token);
      } else {
        list.add(_parseValue(token));
      }
      token = _nextToken();
    }
    return list;
  }

  Map<String, dynamic> _parseMap() {
    Map<String, dynamic> map = HashMap<String, dynamic>();
    String key;
    dynamic value;
    String? token = _nextToken();
    while (token != null && token != ')') {
      key = token;
      token = _nextToken();
      if (token == '=') {
        token = _nextToken();
        if (token == '(' || token == '[') {
          value = _parseValue(token);
        } else {
          value = token!;
        }
        map[key] = value;
        token = _nextToken();
      } else {
        throw FormatException('Expected "=" after key');
      }
    }
    return map;
  }

  Map<String, dynamic> parse() {
    Map<String, dynamic> result = HashMap<String, dynamic>();
    String? token = _nextToken();
    while (token != null) {
      if (token == '(' || token == '[') {
        throw FormatException('Unexpected opening bracket');
      }
      String key = token;
      token = _nextToken();
      if (token == '=') {
        token = _nextToken();
        if (token == '(' || token == '[') {
          result[key] = _parseValue(token);
        } else {
          result[key] = token;
        }
      } else {
        throw FormatException('Expected "=" after key, got "$token"');
      }
      token = _nextToken();
    }
    return result;
  }
}

void main() {
  print(parseTYON('''
; key / value pairs
first = 1
second = maybe

; defining the type 'person'
; /person = (first-name middle-initial last-name age)

first = 1 ; first value
second = ; the key 'second'
  "some text" ; the value corresponding to the key 'second'

numbers = [1 2 3] ; a simple list
nested = [ ; a list containing:
  42 ; a value
  [1 2 3] ; a nested list
  ( ; a map
    first = 1
    second = 2
  )
]

person = (
  first = John
  last = Doe
  age = 42
  "favorite numbers" = [1 2 3]
)

key = value ; keys can be literals
"string key" = [1 2 3] ; or strings

simple = "simple string"
quoted = "some ""quoted"" text"
multi = "multiple
lines
of text"

; /person = (first middle last age "favorite number")
'''));
}