njsmith/config-language-comparison.rst

## config-language-comparison.rst

      
Comparison of configuration file languages

We need to PEPify a static format for writing down bootstrap
information in Python source trees. The initial target is a list of
PEP 508 package requirement strings. It's possible that in the future
we might want to add more features like a build system backend
specification (as in PEPs 516, 517), or an extension namespace feature
to allow third-party developer tools (flit, pytest, coverage, flake8,
etc.) to consolidate their configuration in this file in a systematic
way without bumping into each other.
This file will be a central part of the Python package developer user
experience, and since its role is to provide bootstrap information it
will be rather difficult to change our minds about its format
later. (The goal is that you can change what build system you use by
editing the bootstrap file... but you can't change what bootstrap file
you use by editing the bootstrap file.)
There are a number of perfectly workable options. But given its
central role in developer experience, and that I want the Python
package developer experience to be one of pleasure and joy (no
really), it seems worth lining up the contenders so that we at least
know exactly what the trade-offs are.
In this document I review the four main options that have been
suggested:

JSON
YAML
ConfigParser (as used in setup.cfg)
TOML


YAML

YAML is a widely-used data structure format. I'll use it to introduce
my running example.
This example is meant to give a general flavor of what our eventual
bootstrap files might look like -- I'm not actually proposing anything
here as an actual standard, but hopefully it's enough to get a sense
of how the different formats would feel in actual usage. Each document
includes a schema-version as a hedge for future extensibility, a list
of PEP 508 bootstrap requirement strings, and an extension entry to
see what it would look like if we allow tools like flit to add their
own namespaced configuration. In YAML, it looks like this:
schema-version: 1  # optional
bootstrap-requirements:
  # Temporarily commented out 2016-01-10
  # - magic-build-helper
  - setuptools >= 27
  - numpy >= 1.10  # for the new frobnicate feature
  # Pinned until we get a fix for
  #   https://github.com/cyberdyne/the-versionator/issues/123
  - the-versionator == 0.13

# The owner of pypi name "flit" decides what goes under the
#   extension: flit:
# key
extension:
  flit:
    whatever: true

Running this through PyYAML produces:
# Python
{
  "schema-version": 1,
  "bootstrap-requirements": [
    "setuptools >= 27",
    "numpy >= 1.10",
    "the-versionator == 0.13",
  ],
  "extension": {
    "flit": {
      "whatever": True,
    },
  },
}
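To give a feel for how a tool might consume this structure, here is a minimal sketch of a consumer-side check. The function name `check_bootstrap` and its rules are hypothetical; the key names simply follow the running example and are not a proposed standard.

```python
# Hypothetical consumer-side check; key names follow the running example,
# not any agreed standard.
def check_bootstrap(data):
    """Return the requirement list after basic shape checks."""
    version = data.get("schema-version", 1)  # the example marks it optional
    if version != 1:
        raise ValueError("unsupported schema-version: %r" % (version,))
    requirements = data.get("bootstrap-requirements", [])
    if not isinstance(requirements, list) or not all(
        isinstance(req, str) for req in requirements
    ):
        raise TypeError("bootstrap-requirements must be a list of strings")
    extension = data.get("extension", {})
    if not isinstance(extension, dict):
        raise TypeError("extension must be a mapping keyed by tool name")
    return requirements


parsed = {
    "schema-version": 1,
    "bootstrap-requirements": [
        "setuptools >= 27",
        "numpy >= 1.10",
        "the-versionator == 0.13",
    ],
    "extension": {"flit": {"whatever": True}},
}
requirements = check_bootstrap(parsed)
```

The same check would apply unchanged to the output of any format whose parser yields real lists, dicts, and integers.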

In my experience, YAML is full of subtle and hidden gotchas --
sometimes you need quotes for mysterious reasons, small errors tend to
produce YAML that's valid but meaningless, and so forth. I don't
actually understand how the parsing works, and no-one I know
understands how the parsing works either. The specification is 80
pages of dense text (and that's the 1.1 spec, which most
implementations seem to have settled on -- the 1.2 spec is different
and subtly incompatible). This is a fuzzy metric, but a real issue --
dealing with YAML doesn't help me be a better, happier person. My main
experience of YAML is cursing at the screen because why did it just do
that wtf, and this feeling seems to be widespread. On the other hand,
the reason it's widespread is that YAML itself is widespread, and lots
of people are at least familiar with it.

JSON

JSON needs no introduction. In JSON our example looks like:
{
  "schema-version": 1,
  "bootstrap-requirements": [
    "setuptools >= 27",
    "numpy >= 1.10",
    "the-versionator == 0.13"
  ],
  "extension": {
    "flit": {
      "whatever": true
    }
  }
}

And here's the Python parse:
# Python
{
  "schema-version": 1,
  "bootstrap-requirements": [
    "setuptools >= 27",
    "numpy >= 1.10",
    "the-versionator == 0.13",
  ],
  "extension": {
    "flit": {
      "whatever": True,
    },
  },
}

The nice thing is that the Python and JSON versions are almost
identical. The not-so-nice thing is that we had to strip out all the
comments. Plus there are finicky annoyances like the lack of support
for trailing commas, which trips up human editors and makes diffs
harder to read.
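Both annoyances are easy to see with the stdlib json module. This is a small sketch; the helper name `accepts` is made up for illustration.

```python
import json


def accepts(text):
    """Return True if the stdlib json module parses the text."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


plain = accepts('{"bootstrap-requirements": ["setuptools >= 27"]}')
trailing_comma = accepts('{"bootstrap-requirements": ["setuptools >= 27",]}')
with_comment = accepts('{"schema-version": 1}  // optional')
```

Only the first document parses; the trailing comma and the comment are both rejected.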

ConfigParser

ConfigParser is an INI-like format built into the stdlib. It has many
configuration options that affect the file format; the configuration
traditionally used by setup.cfg is `RawConfigParser with its default
settings
<https://github.com/pypa/setuptools/blob/04d10ff025e1cbef7ec93a2008c930e856045c8a/setuptools/command/setopt.py#L43>`_,
and these defaults are listed here.
This format has the following attributes:

The key namespace is hierarchical with exactly 2 levels: it maps
(<section>, <keyname>) tuples to values.
All values are strings. (But multi-line strings are supported by
indenting continuation lines.)
Both = and : are allowed as assignment characters.
Comments are allowed using either # or ;.
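To make the two-level, strings-only model concrete, here is a minimal sketch using Python 3's configparser with default settings. The section and key names are made up for illustration, and any type conversion has to be requested explicitly by the caller.

```python
import configparser

TEXT = """\
[schema]
# "=" and ":" both work as delimiters
version = 1
optional: yes
"""

cp = configparser.RawConfigParser()
cp.read_string(TEXT)

# Flatten to the (section, key) -> value mapping described above.
flat = {
    (section, key): value
    for section in cp.sections()
    for key, value in cp.items(section)
}

# Every stored value is a string; conversion happens at the call site.
version = cp.getint("schema", "version")
optional = cp.getboolean("schema", "optional")
```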

Example:
; (this is innocuous-looking but broken, see below)
[schema]
version = 1  ; optional

[bootstrap]
requirements =
    setuptools >= 27
    ; Temporarily commented out 2016-01-10
    ; magic-build-helper
    numpy >= 1.10  ; for the frobnicate feature
    ; Pinned until we get a fix for:
    ;   https://github.com/cyberdyne/the-versionator/issues/123
    the-versionator == 0.13

; The owner of pypi name "flit" decides what goes under the
;   extension.flit
; section
[extension.flit]
whatever = True

Or well... no, actually, the above file is broken on both Python 2 and
Python 3, but in different ways. On Python 2, the version line is
parsed correctly (because ; comments are allowed to begin in the
middle of a line -- though # comments are not), but comments are
not recognized inside multiline values, so the requirements entry
gets all the comments mixed in:
# Python 2
{
  "schema": {
    "version": "1",
  },
  "bootstrap": {
    "requirements": "\nsetuptools >= 27\n; Temporarily commented out 2016-01-10\n; magic-build-helper\nnumpy >= 1.10  ; for the frobnicate feature\n; Pinned until we get a fix for:\n;   https://github.com/cyberdyne/the-versionator/issues/123\nthe-versionator == 0.13"
  },
  "extension.flit": {
    "whatever": "True",
  },
}

On Python 3, comments are recognized inside multi-line values, but are
never allowed to begin in the middle of a line, so we instead get:
# Python 3
{
  "schema": {
    "version": "1  ; optional",
  },
  "bootstrap": {
    "requirements": "\nsetuptools >= 27\nnumpy >= 1.10  ; for the frobnicate feature\nthe-versionator == 0.13",
  },
  "extension.flit": {
    "whatever": "True",
  },
}

So compared to the Python 2 parse, some (but not all) of the comments
under the "requirements" key have disappeared -- but under the
"version" key, a new comment has snuck in.
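The Python 3 behavior can be checked with a short script. This is a sketch that feeds the broken example to RawConfigParser with default settings through read_string; the variable names are arbitrary.

```python
import configparser

EXAMPLE = """\
[schema]
version = 1  ; optional

[bootstrap]
requirements =
    setuptools >= 27
    ; Temporarily commented out 2016-01-10
    ; magic-build-helper
    numpy >= 1.10  ; for the frobnicate feature
    ; Pinned until we get a fix for:
    ;   https://github.com/cyberdyne/the-versionator/issues/123
    the-versionator == 0.13

[extension.flit]
whatever = True
"""

cp = configparser.RawConfigParser()
cp.read_string(EXAMPLE)

# Collect everything into plain dicts so the result is easy to inspect.
parsed = {section: dict(cp.items(section)) for section in cp.sections()}
version = parsed["schema"]["version"]
requirements = parsed["bootstrap"]["requirements"].split("\n")
```

Whole-line comments inside the multi-line value vanish, while the two mid-line comments survive as part of the values.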
The obvious workaround here is to teach everyone to stick to the
common subset of Python 2 ConfigParser and Python 3 ConfigParser, so
that comments appear only at the beginning of lines and never in the
middle of multi-line values.
; ConfigParser, corrected example
[schema]
; version is optional
version = 1

[bootstrap]
; numpy 1.10 needed for the frobnicate feature
; the-versionator is pinned to 0.13 until we get a fix for:
;   https://github.com/cyberdyne/the-versionator/issues/123
requirements =
    setuptools >= 27
    numpy >= 1.10
    the-versionator == 0.13

; Temporarily commented out 2016-01-10
;    magic-build-helper

The trade-off is that we've had to rearrange and rewrite the comments
in awkward ways, since we can no longer place the comments next to the
things being commented on.
Also, as far as I can tell from testing and web searches, in Python 2
ConfigParser has no support at all for unicode:
-- test.cfg --
[metadata]
author = Stéfan van der Walt

>>> import sys
>>> sys.version
'2.7.11+ (default, Apr 17 2016, 14:00:29) \n[GCC 5.3.1 20160409]'
>>> import ConfigParser
>>> cp = ConfigParser.RawConfigParser()
>>> cp.read("test.cfg")
['test.cfg']
>>> cp.items("metadata")
[('author', 'St\xc3\xa9fan van der Walt')]

Fortunately, this does not cause immediate problems for the bootstrap
requirements use case, because PyPI mandates that all distribution
names be ASCII-only. But it does mean that if in the future we ever
want to add new build metadata that is genuinely textual, then we'll
either need to add a new file in a better-defined format, or else
define an extended file format -- something like [ConfigParser + a
mandatory post-processing step of calling .decode("utf-8") on all
values].
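As a rough sketch of what that post-processing step might look like: the helper name `decode_items` is hypothetical, and since this is written for Python 3, a bytes literal stands in for the Python 2 str value from the transcript.

```python
def decode_items(items, encoding="utf-8"):
    """Decode every value in a list of (key, byte-string) pairs."""
    return [(key, value.decode(encoding)) for key, value in items]


# Bytes stand in for the Python 2 str values returned by cp.items().
raw_items = [("author", b"St\xc3\xa9fan van der Walt")]
decoded = decode_items(raw_items)
```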
Potentially a setup.cfg PEP could fix up the comment handling in a
similar manner, by defining and mandating a post-processing step that
strips out comments from values according to some PEP-defined grammar.
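Here is one way such a comment-stripping step could look. The grammar is entirely my assumption, not anything a PEP defines: whole-line comments may start with # or ;, but mid-line comments are only recognized after whitespace followed by #, because PEP 508 uses ; to introduce environment markers. The helper name `requirement_lines` is made up.

```python
import re

# Assumed rule: "#" preceded by whitespace starts a mid-line comment.
# ";" is only honoured at the start of a line, since PEP 508 uses ";"
# to introduce environment markers.
_INLINE_COMMENT = re.compile(r"\s+#.*$")


def requirement_lines(value):
    """Split a multi-line value into requirement strings, dropping comments."""
    result = []
    for line in value.splitlines():
        line = line.strip()
        if not line or line[0] in "#;":
            continue  # blank line or whole-line comment
        result.append(_INLINE_COMMENT.sub("", line))
    return result


raw = (
    "\nsetuptools >= 27\n; Temporarily commented out 2016-01-10\n"
    "numpy >= 1.10  # for the frobnicate feature\n"
    'the-versionator == 0.13 ; python_version >= "2.7"'
)
cleaned = requirement_lines(raw)
```

Note that this applies the same rules to whatever string either Python version hands back, which is the point of mandating a post-processing step.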
OTOH, advantages of ConfigParser include that (a) it's in the
stdlib, (b) setup.cfg is a thing that has some precedent.

TOML

TOML is a relatively new contender in the config format races;
possibly its most prominent deployment so far is that it's been used
for some years as the standard format for Rust package metadata.
TOML is basically the good parts of INI/ConfigParser (human
friendliness) crossed with the good parts of JSON (consistent and
unambiguous grammar supported across lots of languages + a simple yet
rich data model -- TOML keeps JSON's string-keyed-dicts, lists, bools,
floats, and strings; drops null; and adds real integers and
datetimes). The specification is short and contains many examples. The
Rust Cargo docs contain many more examples of using TOML to configure
a build system.
Our running example:
schema-version = 1
bootstrap-requirements = [
    "setuptools >= 27",
    # Temporarily comment this out 2016-01-10
    # "magic-build-helper",
    "numpy >= 1.10", # for the new frobnicate feature
    # Pinned until issue #123 is fixed:
    "the-versionator == 0.13",  # <- trailing comma ok, unlike JSON
    ]

# The owner of pypi name "flit" decides what goes under the
#   extension.flit
# key
[extension.flit]
whatever = true

(Note the Python-like list syntax and mandatory string quoting.) This
parses into a Python data structure like:
{
  "schema-version": 1,
  "bootstrap-requirements": [
    "setuptools >= 27",
    "numpy >= 1.10",
    "the-versionator == 0.13",
  ],
  "extension": {
    "flit": {
      "whatever": True,
    },
  },
}

Unicode is fully supported -- TOML's string type is unicode, compliant
TOML files are required to be encoded in UTF-8, and pytoml handles
this correctly on all Python versions:
-- test.toml --
author = "Stéfan van der Walt"

>>> import sys
>>> sys.version
'2.7.11+ (default, Apr 17 2016, 14:00:29) \n[GCC 5.3.1 20160409]'
>>> import pytoml
>>> pytoml.load(open("/tmp/test.toml"))
{u'author': u'St\xe9fan van der Walt'}

(NB though that the toml package doesn't seem to handle unicode
correctly on py2, so stay away from that one.)
So as far as all this goes, TOML seems like the no-brainer best
option. But the potential downsides for TOML aren't about the
technical features of the language -- they're about its relative
immaturity compared to the other options above. So I spent a bit of
time today trying to dial in exactly what its status is.
The specification: The latest version of the TOML specification is
v0.4.0, released Feb. 2015. It has a scary warning at the top: "Be
warned, this spec is still changing a lot. Until it's marked as 1.0,
you should assume that it is unstable and act accordingly."
This doesn't seem to be a wholly accurate reflection of their actual
behavior. There are implementations for many languages and a slightly
out-of-date compatibility test suite. I went back and looked at what
they changed from 0.3.1 to 0.4.0, and not only were the changes small,
but they actually worked with the Rust developers to check that every
existing Cargo.toml file remained valid both before and after the
changes. One of the two main developers wrote recently that "I'd
personally be against most or all breaking changes at this
point---too much has become de facto stable."
There are almost certainly some edge cases and incompatibilities
remaining to be discovered and clarified in the spec and
implementations; none of these seem likely to affect our core use
cases of basic strings and lists and so forth, and it's much better
specified than ConfigParser. Presumably any really dire issues that
might affect us have already been uncovered by Rust, given their
similar use case.
I think that what all this means for us is that if we were to go with
TOML, we'd just specify that our bootstrap file format is TOML v0.4.0
-- which is a stable document, by definition :-) -- and then once they
finally release a v1.0.0, we can look at the changes and decide
whether we want to update. Most likely, it will be tiny
compatibility-preserving improvements, in which case all is fine; or
if not, then we (and Rust, and others) will stick with the old
version, which is exactly the same situation as happened with
YAML. ("YAML" to most people means "YAML 1.1"; supposedly YAML 1.2 is
the latest version, but ~nobody supports it.)
TOML implementations: As mentioned above, the best TOML parser for
Python currently appears to be pytoml. It's TOML v0.4.0 compliant,
passes the TOML test suite (which appears to give pytoml >90%
statement coverage), and the complete parser is 300 lines of code
(plus another 100 lines for the TOML writing support). (Compare to
PyYAML, which is >4200 lines of code.) Nominally, pytoml only supports
Python 2.7 and 3.4+, while pip also supports 2.6 and 3.3. It turns out
that this is trivially fixable, though: it took me about 15 minutes to
add 2.6 and 3.3 support.
This would be an extra library that the pip maintainers would have to
vendor. My impression is that this is a relatively low cost endeavour
compared to the other libraries that pip vendors, given that it's a
small library without external dependencies, and that it performs a
fixed task processing trusted input, so it's unlikely to see much
churn. However, I don't know what the pip maintainers think of this.
I don't know if pytoml's maintainers have any opinion on the prospect
of suddenly finding themselves upstream for pip.

Summary

Personally, I would sum up the above as:
|                             | YAML | JSON | ConfigParser | TOML |
|-----------------------------+------+------+--------------+------|
| Well-defined                | yes  | yes  |              | yes  |
| Real data types             | yes  | yes  |              | yes  |
| Sensible commenting support | yes  |      |              | yes  |
| Consistent unicode support  | yes  | yes  |              | yes  |
| Makes humans happy          |      |      | yes          | yes  |

I personally started this hoping that writing all this down would
reconcile me to the momentum behind setup.cfg, but unfortunately it
did the opposite... Given all of the above, I tend to think the
trade-offs fall in favor of TOML. I'd be willing to contact the pytoml
maintainers to get their perspective, and having taken a look at the
code I'd be willing to take on the responsibility of maintaining
pytoml if worst came to worst and it turned out we needed to fork it
(because upstream didn't want to deal with suddenly having so many
users / because the TOML specification authors decide to switch to an
XML-based format / because ...).  I think that'd be a reasonable price
for making Python packaging more fun and enjoyable.
Or if we end up going with something else, then oh well, hopefully
this document is still useful to make sure we know and can write down
whatever trade-offs we end up making.