@pombredanne
Last active November 11, 2015 18:41
A README for a to-be-written sane YAML parser and serializer

yaml2 is an alternative, sane and opinionated YAML parser and serializer for Python.

If you have been frustrated by PyYAML and you want to use YAML for simple and readable data, keep reading: yaml2 has been made for you.

Why yaml2?

YAML is a formidable data format.

On the one hand, it can be used to handle intuitive, highly readable and portable data in simple text files. This is YAML's main attraction, the good parts of YAML, and the reason it is used most of the time.

On the other hand, YAML can be incredibly complex. This is "YAML: the bad parts":

  • The YAML spec is a long, bloated and complex specification: the JSON spec is one page, the YAML spec is about twice as long as the XML spec!

  • YAML can create arcane and unreadable documents.

  • YAML supports multiple ways to represent the same data: block, flow or canonical representations.

  • YAML can also read and write JSON, which YAML 1.2 considers a subset of itself.

  • YAML supports endless extensibility with custom tags and can eventually serialize and deserialize native objects.

  • YAML does surprising implicit type resolution such as recognizing and converting numbers and dates to native types.

In Python land and in many places, the reference parser is the C-based libYAML and the only seriously available YAML parser for Python is PyYAML, both from the same author. PyYAML attempts to implement the full YAML 1.1 spec and eventually 1.2.

It is mature and well tested but has rather lackluster documentation and an inactive community.

PyYAML also implements unsafe object serialization with support for extensions; its default operation mode is unsafe.

Many developers, attracted to the good parts of YAML, start by using PyYAML and end up fighting against PyYAML, its support for complex features and its insane defaults.

YAML is used in several major Python-based projects and products such as OpenStack and others.

YAML is ubiquitous as configuration files for CI such as Travis and Drone, for the Google App Engine, for devops tools such as Ansible and Salt, and for OpenStack.

This has been the motivation for yaml2.

How does yaml2 work?

yaml2 can load or dump only block-style YAML. Flow style, JSON-looking YAML cannot be parsed and will error out. Only UTF-8 encoded YAML is supported. Loading is 'safe' and will only return primitive Python objects.

Loading works by reading a YAML string or a file and converting the input to a simple structure of nested lists and dictionaries containing only strings.

No implicit type conversion from strings to something else is done: if you want to get specific types, you can convert these in your application.
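As a rough illustration of what converting in your application could look like, here is a minimal sketch. The `loaded` dictionary is a made-up stand-in for what a strings-only loader might return, and `convert_config` is a hypothetical helper, not part of yaml2.

```python
from datetime import datetime

# Hypothetical result of loading a block-style document with a
# strings-only loader: every scalar is a str, whatever it looks like.
loaded = {
    "name": "demo",
    "port": "8080",
    "ratio": "0.75",
    "released": "2015-11-11",
    "tags": ["1.0", "yes"],
}


def convert_config(data):
    """Return a copy of data with selected fields converted by the application."""
    converted = dict(data)
    converted["port"] = int(data["port"])
    converted["ratio"] = float(data["ratio"])
    converted["released"] = datetime.strptime(data["released"], "%Y-%m-%d").date()
    return converted


config = convert_config(loaded)
```

Only the fields the application asks for are converted. Values such as "1.0" and "yes" in `tags` stay strings, which is the point of not doing implicit type resolution.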

Comments in YAML are ignored by default, but they can optionally be kept in the returned data as Comment objects.

You can also add Comment objects to an existing data stream to dump comments in the emitted YAML.

The parser is forgiving, and tries to recover from common hand-written YAML errors:

  • in a block-style string, a colon followed by a space is not considered an error. Instead it is treated just as part of the string.

  • a # in a string is not recognized as the start of an end-of-line comment, but as part of the string

  • things looking like flow-style YAML are treated as plain strings

  • indented comment lines are treated like regular comments
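The forgiving rules above can be sketched for a single line of a block-style mapping. This is only an illustration under simple assumptions (one "key: value" pair per line, no quoting, no nesting); `is_comment_line` and `split_mapping_line` are hypothetical names, not yaml2 functions.

```python
def is_comment_line(line):
    """A whole-line comment counts as a comment whatever its indentation."""
    return line.lstrip().startswith("#")


def split_mapping_line(line):
    """Split one block-style 'key: value' line into (key, value) strings.

    Only the first ': ' separates the key from the value. Anything after
    it, including further ': ' pairs, '#' characters and flow-looking
    brackets, stays in the value as plain text.
    """
    text = line.strip()
    key, sep, value = text.partition(": ")
    if sep:
        return key, value.strip()
    if text.endswith(":"):
        # A key with no value on the same line.
        return text[:-1], ""
    raise ValueError("not a mapping line: %r" % line)
```

For example, `split_mapping_line("title: YAML: the bad parts")` keeps the second colon inside the value, and `split_mapping_line("ports: [80, 443]")` returns the bracketed text as a plain string.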

Canonical YAML and custom tags are not supported and will error out.

Dumping works by reading a Python structure composed only of primitive standard types: dicts, OrderedDicts, sets, lists, namedtuples, tuples, strings.

Anything else that is not a string or unicode object, including ints and floats, will be converted to unicode.

bytes and str are converted to unicode using the surrogate encoding, meaning that binary strings will serialize correctly.
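A minimal sketch of that conversion, assuming that "the surrogate encoding" refers to Python 3's `surrogateescape` error handler (the text does not name it explicitly). `to_text` and `to_bytes` are hypothetical helpers.

```python
def to_text(value):
    """Decode bytes as UTF-8, keeping undecodable bytes as lone surrogates."""
    if isinstance(value, bytes):
        return value.decode("utf-8", errors="surrogateescape")
    return value


def to_bytes(text):
    """Encode text back to bytes, restoring any escaped bytes unchanged."""
    return text.encode("utf-8", errors="surrogateescape")


# Valid UTF-8 ("café ") followed by two bytes that are not valid UTF-8.
raw = b"caf\xc3\xa9 \xff\xfe"
text = to_text(raw)
restored = to_bytes(text)
```

The round trip only restores the original bytes when encoding uses the same error handler; a strict UTF-8 encoder rejects the lone surrogates.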

The YAML created uses the block style and is UTF-8 encoded.

  • It is properly indented using four spaces.
  • Leading and trailing whitespace is stripped from strings.
  • OrderedDicts are dumped in order.
  • Dicts are sorted by keys.
  • Long strings that exceed the width are wrapped using the literal | or folding > style depending on the presence of line returns.
  • Comment and BlankLine objects present in the data stream are recreated in the output. A comment is written as a # followed by a space and the comment line. Comments are not wrapped.
  • None is represented as an empty string and not as 'null'.
  • You can control the wrapping of strings with the width parameter that defaults to 120.
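The dumping rules above can be sketched as a small block-style emitter. This is an illustration only, not the yaml2 implementation. It assumes string keys, does no quoting, treats Comment and BlankLine as items of a list, and uses the literal style for any string containing line returns. The `Comment`, `BlankLine`, `dump` and helper names are hypothetical.

```python
import textwrap
from collections import OrderedDict

INDENT = "    "  # four spaces per nesting level


class Comment(str):
    """Hypothetical stand-in for a comment line kept in the data."""


class BlankLine(object):
    """Hypothetical stand-in for an empty line kept in the data."""


def scalar_lines(prefix, value, level, width):
    """Yield the line(s) for one scalar placed after prefix ('key:' or '-')."""
    text = "" if value is None else str(value).strip()
    body_pad = INDENT * (level + 1)
    if "\n" in text:
        # Line returns present: literal style keeps them as they are.
        yield prefix + " |"
        for line in text.split("\n"):
            yield (body_pad + line).rstrip()
    elif len(text) > width:
        # One long line: folded style, wrapped at whole words.
        yield prefix + " >"
        for line in textwrap.wrap(text, width, break_long_words=False,
                                  break_on_hyphens=False):
            yield body_pad + line
    else:
        yield (prefix + " " + text).rstrip()


def entry_lines(prefix, value, level, width):
    """Yield the lines for one mapping value or one sequence item."""
    if isinstance(value, (dict, list, tuple)):
        yield prefix
        yield from dump_lines(value, level + 1, width)
    else:
        yield from scalar_lines(prefix, value, level, width)


def dump_lines(data, level=0, width=120):
    """Yield block-style lines for nested dicts, lists, tuples and scalars."""
    pad = INDENT * level
    if isinstance(data, dict):
        # An OrderedDict keeps its own order; a plain dict is sorted by key.
        keys = list(data) if isinstance(data, OrderedDict) else sorted(data)
        for key in keys:
            yield from entry_lines("%s%s:" % (pad, key), data[key], level, width)
    else:
        for item in data:
            if isinstance(item, Comment):
                yield "%s# %s" % (pad, item)
            elif isinstance(item, BlankLine):
                yield ""
            else:
                yield from entry_lines(pad + "-", item, level, width)


def dump(data, width=120):
    """Return the block-style text for data, ending with a newline."""
    return "\n".join(dump_lines(data, width=width)) + "\n"
```

Calling `dump({"name": "  demo  ", "nothing": None})` emits `name: demo` and a bare `nothing:` line, and a list such as `[Comment("first item"), "a", BlankLine(), "b"]` is written with the comment and the empty line in place.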

Strings are quoted only if absolutely necessary. In particular, things looking like numbers and dates are quoted if needed to avoid implicit type conversion by another YAML parser. The single quote is always used for quoting. Double quotes are always escaped as ".
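A minimal sketch of the "quote only when needed" rule for numbers and dates. The two regular expressions are simplified assumptions about what another YAML parser might convert; they are not the resolver patterns of any particular library, and `quote_if_needed` is a hypothetical name.

```python
import re

# Assumed, simplified patterns for plain scalars that look like numbers or dates.
LOOKS_LIKE_NUMBER = re.compile(r"^[-+]?(\d+\.?\d*|\.\d+)([eE][-+]?\d+)?$")
LOOKS_LIKE_DATE = re.compile(r"^\d{4}-\d{1,2}-\d{1,2}([Tt ].*)?$")


def quote_if_needed(text):
    """Return text single-quoted when it could be read as a number or a date."""
    if LOOKS_LIKE_NUMBER.match(text) or LOOKS_LIKE_DATE.match(text):
        # Inside YAML single quotes, a single quote is written twice.
        return "'" + text.replace("'", "''") + "'"
    return text
```

With these patterns, "1.0" and "2015-11-11" are quoted, while "hello" and "1.2.3" are left as they are.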

What is yaml2?

yaml2 is a minimalist and forgiving YAML parser that can parse simple block-style YAML without surprises and emit clean, standard block-style YAML. It supports reading and writing comments and empty lines to keep YAML documents readable.

yaml2 started as a set of utilities to deal more sanely with the compounded complexities of YAML and PyYAML and is now based on a fork of the PyYAML codebase keeping only the good parts of YAML and offering:

  • forgiving and unsurprising parsing,
  • serialization to a YAML-compliant block-style readable format,
  • tracking original map order, comments and empty lines and writing these back,
  • comprehensive documentation and easy hacking.

Why not use yaml2?

  • If you want full support of YAML complexities and spec...
  • If you think that JSON is a subset of YAML...
  • If you want to serialize arbitrary Python objects to YAML...
  • If you want to extend YAML with custom tags...
  • If you like YAML flow and canonical representations...

...then there is nothing for you here. yaml2 is not for you: move on.

Instead, consider these alternatives:

  • For JSON, use json.
  • For Python object serialization, use pickle.
  • For complex data and binary data, consider Protocol Buffers or MessagePack, use a full YAML parser like PyYAML, or use XML.

Some of the common libYAML and PyYAML woes have forced the same type of code to be rewritten over and over, hundreds or thousands of times. Here are some examples:

@luzfcb

luzfcb commented Nov 11, 2015

I saw this mentioned in crdoconnor/dumbyaml#1.

Good, but where is the source?

Is it this: https://github.com/nexB/scancode-toolkit/blob/e243767f5ccd67674c1d3d0df8698a10cba419ee/src/licensedcode/saneyaml.py ?

Also, the name "yaml2" could be confused with https://github.com/yaml/YAML2/wiki
