@njsmith
Last active October 27, 2024 09:49

Comparison of configuration file languages

We need to PEPify a static format for writing down bootstrap information in Python source trees. The initial target is a list of PEP 508 package requirement strings. It's possible that in the future we might want to add more features like a build system backend specification (as in PEPs 516, 517), or an extension namespace feature to allow third-party developer tools (flit, pytest, coverage, flake8, etc.) to consolidate their configuration in this file in a systematic way without bumping into each other.

This file will be a central part of the Python package developer user experience, and since its role is to provide bootstrap information it will be rather difficult to change our minds about its format later. (The goal is that you can change what build system you use by editing the bootstrap file... but you can't change what bootstrap file you use by editing the bootstrap file.)

There are a number of perfectly workable options. But given its central role in developer experience, and that I want the Python package developer experience to be one of pleasure and joy (no really), it seems worth lining up the contenders so that we at least know exactly what the trade-offs are.

In this document I review the four main options that have been suggested:

  • JSON
  • YAML
  • ConfigParser (as used in setup.cfg)
  • TOML

YAML

YAML is a widely-used data structure format. I'll use it to introduce my running example.

This example is meant to give a general flavor of what our eventual bootstrap files might look like -- I'm not actually proposing anything here as an actual standard, but hopefully it's enough to get a sense of how the different formats would feel in actual usage. Each document includes a schema-version as a hedge for future extensibility, a list of PEP 508 bootstrap requirement strings, and an extension entry to see what it would look like if we allow tools like flit to add their own namespaced configuration. In YAML, it looks like this:

schema-version: 1  # optional
bootstrap-requirements:
  # Temporarily commented out 2016-01-10
  # - magic-build-helper
  - setuptools >= 27
  - numpy >= 1.10  # for the new frobnicate feature
  # Pinned until we get a fix for
  #   https://github.com/cyberdyne/the-versionator/issues/123
  - the-versionator == 0.13

# The owner of pypi name "flit" decides what goes under the
#   extension: flit:
# key
extension:
  flit:
    whatever: true

Running this through PyYAML produces:

# Python
{
  "schema-version": 1,
  "bootstrap-requirements": [
    "setuptools >= 27",
    "numpy >= 1.10",
    "the-versionator == 0.13",
  ],
  "extension": {
    "flit": {
      "whatever": True,
    },
  },
}

In my experience, YAML is full of subtle and hidden gotchas -- sometimes you need quotes for mysterious reasons, small errors tend to produce YAML that's valid but meaningless, and so forth. I don't actually understand how the parsing works, and no-one I know understands how the parsing works either. The specification is 80 pages of dense text (and that's the 1.1 spec, which most implementations seem to have settled on -- the 1.2 spec is different and subtly incompatible). This is a fuzzy metric, but a real issue -- dealing with YAML doesn't help me be a better happier person. My main experience of YAML is cursing at the screen because why did it just do that wtf, and this feeling seems to be wide-spread. On the other hand, the reason it's wide-spread is that YAML itself is wide-spread, and lots of people are at least familiar with it.

JSON

JSON needs no introduction. In JSON our example looks like:

{
  "schema-version": 1,
  "bootstrap-requirements": [
    "setuptools >= 27",
    "numpy >= 1.10",
    "the-versionator == 0.13"
  ],
  "extension": {
    "flit": {
      "some-flag": true
    }
  }
}

And here's the Python parse:

# Python
{
  "schema-version": 1,
  "bootstrap-requirements": [
    "setuptools >= 27",
    "numpy >= 1.10",
    "the-versionator == 0.13",
  ],
  "extension": {
    "flit": {
      "some-flag": True,
    },
  },
}

The nice thing is that the Python and JSON versions are almost identical. The not-so-nice thing is that we had to strip out all the comments. Plus there are finicky annoyances like the lack of support for trailing commas, which trips up human editors and makes diffs harder to read.
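
To make the two complaints concrete, here is a minimal sketch using the stdlib json module. The helper name try_parse and the sample strings are made up for illustration.

```python
import json

VALID = """
{
  "schema-version": 1,
  "bootstrap-requirements": [
    "setuptools >= 27",
    "numpy >= 1.10"
  ]
}
"""

# The same kind of data, but written the way a human editor tends to write it.
WITH_TRAILING_COMMA = '{"bootstrap-requirements": ["setuptools >= 27",]}'
WITH_COMMENT = '{"schema-version": 1  # optional\n}'


def try_parse(text):
    """Return the parsed document, or None if the JSON parser rejects it."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


print(try_parse(VALID) is not None)   # the strict form parses
print(try_parse(WITH_TRAILING_COMMA))  # rejected: trailing comma
print(try_parse(WITH_COMMENT))         # rejected: comment
```

Both "friendly" variants are rejected outright rather than parsed into something surprising, which is at least predictable.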

ConfigParser

ConfigParser is an INI-like format built into the stdlib. It has many configuration options that affect the file format; the configuration traditionally used by setup.cfg is RawConfigParser with its default settings (https://github.com/pypa/setuptools/blob/04d10ff025e1cbef7ec93a2008c930e856045c8a/setuptools/command/setopt.py#L43), and those defaults are the ones described here.

This format has the following attributes:

  • The key namespace is hierarchical with exactly 2 levels: it maps (<section>, <keyname>) tuples to values.
  • All values are strings. (But multi-line strings are supported by indenting continuation lines.)
  • Both = and : are allowed as assignment characters.
  • Comments are allowed using either # or ;.
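
These attributes can be checked directly. The following is a small sketch, assuming Python 3 (where the module is spelled configparser); the section and key names are invented for the illustration.

```python
import configparser

SAMPLE = """\
# hash comment on its own line
; semicolon comment on its own line
[bootstrap]
tool = setuptools
level: 3
flag = True
notes = first line
    second line
"""

cp = configparser.RawConfigParser()
cp.read_string(SAMPLE)

# Exactly two levels: a section name, then key -> value.
values = dict(cp.items("bootstrap"))

for key, value in sorted(values.items()):
    # Every value comes back as a str, including "3" and "True".
    print(key, repr(value), type(value).__name__)
```

Any conversion to integers, booleans, or lists is left to the code that reads the file.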

Example:

; (this is innocuous-looking but broken, see below)
[schema]
version = 1  ; optional

[bootstrap]
requirements =
    setuptools >= 27
    ; Temporarily commented out 2016-01-10
    ; magic-build-helper
    numpy >= 1.10  ; for the frobnicate feature
    ; Pinned until we get a fix for:
    ;   https://github.com/cyberdyne/the-versionator/issues/123
    the-versionator == 0.13

; The owner of pypi name "flit" decides what goes under the
;   extension.flit
; section
[extension.flit]
whatever = True

Or well... no, actually, the above file is broken on both Python 2 and Python 3, but in different ways. On Python 2, the version line is parsed correctly (because ; comments are allowed to begin in the middle of a line -- though # comments are not), but comments are not recognized inside multiline values, so the requirements entry gets all the comments mixed in:

# Python 2
{
  "schema": {
    "version": "1",
  },
  "bootstrap": {
    "requirements": "\nsetuptools >= 27\n; Temporarily commented out 2016-01-10\n; magic-build-helper\nnumpy >= 1.10  ; for the frobnicate feature\n; Pinned until we get a fix for:\n;   https://github.com/cyberdyne/the-versionator/issues/123\nthe-versionator == 0.13"
  },
  "extension.flit": {
    "whatever": "True",
  },
}

On Python 3, comments are recognized inside multi-line values, but are never allowed to begin in the middle of a line, so we instead get:

# Python 3
{
  "schema": {
    "version": "1  ; optional",
  },
  "bootstrap": {
    "requirements": "\nsetuptools >= 27\nnumpy >= 1.10  ; for the frobnicate feature\nthe-versionator == 0.13",
  },
  "extension.flit": {
    "whatever": "True",
  },
}

So compared to the Python 2 parse, some (but not all) of the comments under the "requirements" key have disappeared -- but under the "version" key, a new comment has snuck in.
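
The Python 3 half of this can be reproduced with a short sketch (Python 2 is not covered here). The parse helper is made up for the illustration. inline_comment_prefixes is an option of the Python 3 configparser that the text above does not discuss; it is included only to show how much the chosen options change the result.

```python
import configparser

BROKEN = """\
[schema]
version = 1  ; optional

[bootstrap]
requirements =
    setuptools >= 27
    ; Temporarily commented out 2016-01-10
    ; magic-build-helper
    numpy >= 1.10  ; for the frobnicate feature
    the-versionator == 0.13
"""


def parse(text, **options):
    """Parse text with RawConfigParser and return {section: {key: value}}."""
    cp = configparser.RawConfigParser(**options)
    cp.read_string(text)
    return {section: dict(cp.items(section)) for section in cp.sections()}


# Default settings: whole-line comments vanish, trailing ones stay in the value.
default = parse(BROKEN)
print(repr(default["schema"]["version"]))
print(repr(default["bootstrap"]["requirements"]))

# Opting in to inline comments (Python 3 only) strips the trailing ones too.
with_inline = parse(BROKEN, inline_comment_prefixes=(";",))
print(repr(with_inline["schema"]["version"]))
print(repr(with_inline["bootstrap"]["requirements"]))
```

Since the second variant depends on an option that Python 2 does not have, it does not remove the need for a common subset.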

The obvious workaround here is to teach everyone to stick to the common subset of Python 2 ConfigParser and Python 3 ConfigParser, so that comments appear only at the beginning of lines and never in the middle of multi-line values.

; ConfigParser, corrected example
[schema]
; version is optional
version = 1

[bootstrap]
; numpy 1.10 needed for the frobnicate feature
; the-versionator is pinned to 0.13 until we get a fix for:
;   https://github.com/cyberdyne/the-versionator/issues/123
requirements =
    setuptools >= 27
    numpy >= 1.10
    the-versionator == 0.13

; Temporarily commented out 2016-01-10
;    magic-build-helper

The trade-off is that we've had to rearrange and rewrite the comments in awkward ways, since we can no longer place the comments next to the things being commented on.

Also, as far as I can tell from testing and web searches, in Python 2 ConfigParser has no support at all for unicode:

-- test.cfg --
[metadata]
author = Stéfan van der Walt

>>> import sys
>>> sys.version
'2.7.11+ (default, Apr 17 2016, 14:00:29) \n[GCC 5.3.1 20160409]'
>>> import ConfigParser
>>> cp = ConfigParser.RawConfigParser()
>>> cp.read("test.cfg")
>>> cp.items("metadata")
[('author', 'St\xc3\xa9fan van der Walt')]

Fortunately, this does not cause immediate problems for the bootstrap requirements use case, because PyPI mandates that all distribution names be ascii-only. But it does mean that if in the future we ever want to add new build metadata that is genuinely textual, then we'll either need to add a new file in a better-defined format, or else define an extended file format -- something like [ConfigParser + a mandatory post-processing step of calling .decode("utf-8") on all values].

Potentially a setup.cfg PEP could fix up the comment handling in a similar manner, by defining and mandating a post-processing step that strips out comments from values according to some PEP-defined grammar.
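
As a rough sketch of what such a post-processing step might look like: the comment rule below is purely hypothetical (no PEP defines it), and the function name clean_value is invented. It treats ";" as starting a comment when it is the first non-blank character on a line or is preceded by whitespace.

```python
import re

# Hypothetical grammar for this sketch only: ";" begins a comment when it
# starts the line or follows whitespace; the comment runs to end of line.
_COMMENT = re.compile(r"(?:^|\s+);.*$")


def clean_value(raw):
    """Return the non-empty, comment-free lines of a raw ConfigParser value."""
    cleaned = []
    for line in raw.splitlines():
        line = _COMMENT.sub("", line).strip()
        if line:
            cleaned.append(line)
    return cleaned


# The value the Python 2 parse produced for the broken example above.
PY2_VALUE = (
    "\nsetuptools >= 27\n; Temporarily commented out 2016-01-10\n"
    "; magic-build-helper\nnumpy >= 1.10  ; for the frobnicate feature\n"
    "; Pinned until we get a fix for:\n"
    ";   https://github.com/cyberdyne/the-versionator/issues/123\n"
    "the-versionator == 0.13"
)

print(clean_value(PY2_VALUE))
print(clean_value("1  ; optional"))
```

One catch with this particular rule: PEP 508 uses ";" to introduce environment markers, and allows whitespace before it, so a requirement written as "foo ; python_version < '3'" would lose its marker. Any real grammar would have to deal with that collision.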

OTOH, advantages of ConfigParser include that (a) it's in the stdlib, (b) setup.cfg is a thing that has some precedent.

TOML

TOML is a relatively new contender in the config format races; possibly its most prominent deployment so far is that it's been used for some years as the standard format for Rust package metadata.

TOML is basically the good parts of INI/ConfigParser (human friendliness) crossed with the good parts of JSON (consistent and unambiguous grammar supported across lots of languages + a simple yet rich data model -- TOML keeps JSON's string-keyed-dicts, lists, bools, floats, and strings; drops null; and adds real integers and datetimes). The specification is short and contains many examples. The Rust Cargo docs contain many more examples of using TOML to configure a build system.

Our running example:

schema-version = 1
bootstrap-requirements = [
    "setuptools >= 27",
    # Temporarily comment this out 2016-01-10
    # "magic-build-helper",
    "numpy >= 1.10", # for the new frobnicate feature
    # Pinned until issue #123 is fixed:
    "the-versionator == 0.13",  # <- trailing comma ok, unlike JSON
    ]

# The owner of pypi name "flit" decides what goes under the
#   extension.flit
# key
[extension.flit]
whatever = true

(Note the Python-like list syntax and mandatory string quoting.) This parses into a Python data structure like:

{
  "schema-version": 1,
  "bootstrap-requirements": [
    "setuptools >= 27",
    "numpy >= 1.10",
    "the-versionator == 0.13",
  ],
  "extension": {
    "flit": {
      "whatever": True,
    },
  },
}

Unicode is fully supported -- TOML's string type is unicode, compliant TOML files are required to be encoded in UTF-8, and pytoml handles this correctly on all Python versions:

-- test.toml --
author = "Stéfan van der Walt"

>>> import sys
>>> sys.version
'2.7.11+ (default, Apr 17 2016, 14:00:29) \n[GCC 5.3.1 20160409]'
>>> import pytoml
>>> pytoml.load(open("/tmp/test.toml"))
{u'author': u'St\xe9fan van der Walt'}

(NB though that the toml package doesn't seem to handle unicode correctly on py2, so stay away from that one.)

So as far as all this goes, TOML seems like the no-brainer best option. But the potential downsides for TOML aren't about the technical features of the language -- they're about its relative immaturity compared to the other options above. So I spent a bit of time today trying to dial in exactly what its status is.

The specification: The latest version of the TOML specification is v0.4.0, released Feb. 2015. It has a scary warning at the top: "Be warned, this spec is still changing a lot. Until it's marked as 1.0, you should assume that it is unstable and act accordingly."

This doesn't seem to be a wholly accurate reflection of their actual behavior. There are implementations for many languages and a slightly out-of-date compatibility test suite. I went back and looked at what they changed from 0.3.1 to 0.4.0, and not only were the changes small, but they actually worked with the Rust developers to check that every existing Cargo.toml file remained valid both before and after the changes. One of the two main developers wrote recently that "I'd personally be against most or all breaking changes at this point---too much has become de facto stable."

There are almost certainly some edge cases and incompatibilities remaining to be discovered and clarified in the spec and implementations; none of these seem likely to affect our core use cases of basic strings and lists and so forth, and it's much better specified than ConfigParser. Presumably any really dire issues that might affect us have already been uncovered by Rust, given their similar use case.

I think that what all this means for us is that if we were to go with TOML, we'd just specify that our bootstrap file format is TOML v0.4.0 -- which is a stable document, by definition :-) -- and then once they finally release a v1.0.0, we can look at the changes and decide whether we want to update. Most likely, it will be tiny compatibility-preserving improvements, in which case all is fine; or if not, then we (and Rust, and others) will stick with the old version, which is exactly the same situation as happened with YAML. ("YAML" to most people means "YAML 1.1"; supposedly YAML 1.2 is the latest version, but ~nobody supports it.)

TOML implementations: As mentioned above, the best TOML parser for Python currently appears to be pytoml. It's TOML v0.4.0 compliant, passes the TOML test suite (which appears to give pytoml >90% statement coverage), and the complete parser is 300 lines of code (plus another 100 lines for the TOML writing support). (Compare to PyYAML, which is >4200 lines of code.) Nominally, pytoml only supports Python 2.7 and 3.4+, while pip also supports 2.6 and 3.3. It turns out that this is trivially fixable, though: it took me about 15 minutes to add 2.6 and 3.3 support.

This would be an extra library that the pip maintainers would have to vendor. My impression is that this is a relatively low cost endeavour compared to the other libraries that pip vendors, given that it's a small library without external dependencies, and that it performs a fixed task processing trusted input, so it's unlikely to see much churn. However, I don't know what the pip maintainers think of this.

I don't know if pytoml's maintainers have any opinion on the prospect of suddenly finding themselves upstream for pip.

Summary

Personally, I would sum up the above as:

|                             | YAML | JSON | CP  | TOML |
|-----------------------------+------+------+-----+------|
| Well-defined                | yes  | yes  |     | yes  |
| Real data types             | yes  | yes  |     | yes  |
| Sensible commenting support | yes  |      |     | yes  |
| Consistent unicode support  | yes  | yes  |     | yes  |
| Makes humans happy          |      |      | yes | yes  |

I personally started this hoping that writing all this down would reconcile me to the momentum behind setup.cfg, but unfortunately it did the opposite... Given all of the above, I tend to think the trade-offs fall in favor of TOML. I'd be willing to contact the pytoml maintainers to get their perspective, and having taken a look at the code I'd be willing to take on the responsibility of maintaining pytoml if worst came to worst and it turned out we needed to fork it (because upstream didn't want to deal with suddenly having so many users / because the TOML specification authors decide to switch to an XML-based format / because ...). I think that'd be a reasonable price for making Python packaging more fun and enjoyable.

Or if we end up going with something else, then oh well, hopefully this document is still useful to make sure we know and can write down whatever trade-offs we end up making.

@ChrisBarker-NOAA

It would be nice to add Python literals (as parsed by ast.literal_eval) -- as I see it, like JSON but with comments and less picky about trailing commas. SO, at least for pythonistas, human readable/writable, and clearly defined.

Downside: - Only python has a parser for it.

Unknown: python2 unicode issues?

Maybe you should add a row to your table:

  • well supported by multiple other languages.

@ChrisBarker-NOAA

A comment about YAML:

While I'm sure it has all the complexity issues presented, my experience using it for conda has been very pleasant -- probably because I've never looked at the spec, only looked at conda docs and examples, and only used the basic functionality that is easy to do.

But it is a good example of it being used in a similar way and has worked out fine.

@leorochael

Nominally, pyyaml only supports Python 2.7 and 3.4+, while pip also supports 2.6 and 3.3. It turns out that this is trivially fixable, though: it took me about 15 minutes to add 2.6 and 3.3 support.

I guess you mean pytoml instead of pyyaml above, considering the link to the python 2.6/3.3 fix.

@nexocentric

👍

@guettli

guettli commented May 24, 2016

I worked with YAML during the last days (saltstack). And I think your conclusion "Makes humans happy: False" is true.

It is too short. You need to look twice to see if something is a dict or a list.

Does TOML allow nested datastructures: A dict in a list in a dict ...?

@cclauss

cclauss commented Dec 5, 2018

Perhaps update this gist based on v0.5 https://github.com/toml-lang/toml/blob/master/README.md

@dejlek

dejlek commented Mar 25, 2020

I like the TOML format, but I think a format that has built-in Python support (so either ConfigParser or JSON) should have been picked. Unless there is a plan for the standard library to have TOML support?

@dg-nvm

dg-nvm commented Jun 17, 2020

stealing this
