@hauntsaninja
Last active January 2, 2022 22:07
Proposal: Support for TOML in the Standard Library

Previous discussion:

Motivation

The TOML format is the format of choice for Python packaging, as evidenced by PEP 517, PEP 518 and PEP 621. Including TOML support in the standard library helps avoid bootstrapping problems for Python build tools.

Python tools are increasingly configurable via TOML, for example: black, mypy, pytest, tox, pylint, isort. Those that are not, such as flake8, cite the lack of standard library support as a main reason why.

Given the special place TOML already has in the Python ecosystem, it makes sense for this to be an included battery.

Finally, TOML as a format is increasingly popular (some reasons for this are outlined in PEP 518). Hence this is likely to be a generally useful addition, even looking beyond the needs of Python packaging and Python tooling: various Python TOML libraries have about 2000 reverse dependencies on PyPI (for comparison, requests has about 28k).

Survey of third party TOML packages

toml is a widely used library, with about 1.7k reverse dependencies on PyPI. However, it was maintained by a single person and has become effectively unmaintained. In particular, it does not support TOML v1 (specified as of January 2021 with a release candidate in April 2020).

Given the importance of TOML to Python packaging, inclusion of a TOML package in the standard library could help avoid similar situations going forward.

tomli is a newer library with support for TOML v1. Many projects have recently switched to using tomli from toml. tomli has 113 reverse dependencies on PyPI. These include pip, pytest, mypy, black, flit, coverage, setuptools-scm, cibuildwheel.

tomli is about 800 lines of code with claimed 100% branch coverage. tomli itself only allows you to read TOML; write support is included in its sister package tomli-w. tomli-w is about 200 lines of code with claimed 100% branch coverage (although worth noting that tomli-w is currently much less widely used than tomli).

The author is supportive of potential inclusion in the standard library, as per this

See also tomli's FAQs:

tomlkit supports TOML v1. tomlkit has been around for a while. It was the first Python library I'm aware of to support TOML v1. It has 244 reverse dependencies on PyPI, notably, the poetry packaging tool.

It's more featureful than other libraries mentioned. In particular, it supports round-trip parsing and writing (that is, it preserves whitespace, comments, ordering, style, etc).

tomlkit is about 4600 lines of code.

pytomlpp and rtoml are Python wrappers for the C++ project toml++ and the Rust project toml-rs, respectively.

Concrete proposal

I propose including a TOML package in the standard library with the following API, based on tomli's implementation.

A quick digression: there's been much meta-level discussion, e.g. on what the correct process to add something to the standard library is. I'd personally find it helpful if when replying to this you include at what point you get off the train:

  • Are you +1/0/-1 on TOML to the standard library in the abstract?
  • Are you +1/0/-1 on TOML in the standard library with the proposed API?
  • Are you +1/0/-1 on TOML in the standard library with implementation based on tomli?

Anyway, without further ado...

Read API

I propose using the tomli API for reading TOML:

def load(__fp: SupportsRead[bytes], *, parse_float: Callable[[str], Any] = float) -> dict[str, Any]: ...
def loads(__s: str, *, parse_float: Callable[[str], Any] = float) -> dict[str, Any]: ...

As in stdlib's json, parse_float is a function that is called with the string of every float to be decoded and returns the value to use in its place. It defaults to float; decimal.Decimal can be passed instead in cases where precision is important.
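
A minimal sketch of what the parse_float hook does. Since the proposed module does not exist yet, this uses json.loads from the standard library, whose parse_float argument the proposal mirrors. The sample document is made up for illustration.

```python
import json
from decimal import Decimal

# Made-up sample document; any JSON with a non-integer number works.
doc = '{"price": 0.1, "count": 3}'

# Default behaviour: float literals are parsed with float().
as_float = json.loads(doc)

# With parse_float=Decimal the parser hands the literal text "0.1" to
# Decimal, so no binary floating-point rounding takes place.
as_decimal = json.loads(doc, parse_float=Decimal)

print(type(as_float["price"]).__name__, type(as_decimal["price"]).__name__)
print(as_decimal["price"] * 3)
```

Integers are not routed through parse_float, so "count" stays an int in both results.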

Note we make no attempt to preserve style (comments, whitespace, etc).

Comparison

Here is the read API of toml:

def load(f: Union[str, list, SupportsRead[str]], _dict: Type[MutableMapping[str, Any]] = ..., decoder: TomlDecoder = ...) -> MutableMapping[str, Any]: ...
def loads(s: str, _dict: Type[MutableMapping[str, Any]] = ..., decoder: TomlDecoder = ...) -> MutableMapping[str, Any]: ...

The _dict argument allows the user to control the type of the returned mapping. The decoder argument is undocumented and the API of TomlDecoder is not simple. Its main use case is to pass toml.TomlPreserveCommentDecoder, which allows the user to collect TOML comments. I could only find one use of this on https://grep.app. In addition, TomlPreserveCommentDecoder isn't fully style preserving and has known bugs.

Here is the read API of json:

def loads(
    s: str | bytes,
    *,
    cls: Type[JSONDecoder] | None = ...,
    object_hook: Callable[[dict[Any, Any]], Any] | None = ...,
    parse_float: Callable[[str], Any] | None = ...,
    parse_int: Callable[[str], Any] | None = ...,
    parse_constant: Callable[[str], Any] | None = ...,
    object_pairs_hook: Callable[[list[tuple[Any, Any]]], Any] | None = ...,
    **kwds: Any,
) -> Any: ...
def load(
    fp: SupportsRead[str | bytes],
    *,
    cls: Type[JSONDecoder] | None = ...,
    object_hook: Callable[[dict[Any, Any]], Any] | None = ...,
    parse_float: Callable[[str], Any] | None = ...,
    parse_int: Callable[[str], Any] | None = ...,
    parse_constant: Callable[[str], Any] | None = ...,
    object_pairs_hook: Callable[[list[tuple[Any, Any]]], Any] | None = ...,
    **kwds: Any,
) -> Any: ...

Discussion points

  • Should we preserve style information?

    Tentatively, no.

    The main use case for style preservation is allowing tools to automatically edit TOML without affecting human markup. This is a relatively small fraction of use (as judged by reverse dependencies of toml and tomli vs the style preserving tomlkit) so it seems okay to relegate this additional functionality to third party libraries.

    In particular, we don't need it for the core Python packaging use cases or for tools that merely need to read configuration. Note that this would likely require a large change if we wished to implement it later.

  • Should we add an argument that works like toml's _dict or json's object_hook?

    Tentatively, maybe no.

    It's a) not necessary for core use cases, b) pretty easy to work around, and c) something that could be added later in a backward compatible way.

    I was able to find a few use cases of toml's _dict functionality on https://grep.app. These were a) mostly passing _dict=OrderedDict, which should no longer be necessary now that dicts preserve insertion order (guaranteed since Python 3.7, and true in CPython 3.6), b) a single case which passed a custom class for friendlier KeyErrors, and c) a single case that added several methods to the dictionary-like object (e.g. to help resolve dotted keys).

  • What should we be able to pass to the first argument of load?

    Tentatively, anything with a read method that returns bytes.

    toml allows passing path-like objects (and lists of path-like objects!). I propose not doing this, for consistency with json.load, pickle.load, etc. tomli.load takes a SupportsRead[bytes], toml.load takes a SupportsRead[str], while json.load takes SupportsRead[str | bytes]. Requiring bytes is slightly opinionated; it was a recent change in tomli v1.2 to a) ensure utf-8 is the encoding used, and b) avoid ambiguity in the TOML spec regarding universal newlines (see toml-lang/toml#835).
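
To illustrate the claim that the lack of a _dict-style argument can be worked around, here is a minimal sketch that post-processes the plain dict a parser returns. FriendlyDict and convert_tables are hypothetical names, and the parsed value is written out by hand rather than produced by a real TOML parser.

```python
class FriendlyDict(dict):
    """Hypothetical mapping with a more helpful KeyError message."""

    def __missing__(self, key):
        raise KeyError(f"missing key {key!r}; available keys: {sorted(self)}")


def convert_tables(value, factory):
    """Recursively rebuild every dict in a parsed document with factory."""
    if isinstance(value, dict):
        return factory((k, convert_tables(v, factory)) for k, v in value.items())
    if isinstance(value, list):
        return [convert_tables(item, factory) for item in value]
    return value


# Stand-in for the plain dict a load() call would return.
parsed = {"tool": {"black": {"line-length": 88}}, "plugins": [{"name": "demo"}]}
config = convert_tables(parsed, FriendlyDict)

try:
    config["tool"]["mypy"]
except KeyError as exc:
    message = exc.args[0]
    print(message)
```

The cost of this approach is one extra pass over the parsed data, which is usually negligible for configuration files.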

Write API

I propose we use the write API of tomli-w:

def dump(__obj: Mapping[str, Any], __fp: SupportsWrite[bytes], *, multiline_strings: bool = False) -> None: ...
def dumps(__obj: Mapping[str, Any], *, multiline_strings: bool = False) -> str: ...

The multiline_strings argument controls whether strings containing newlines are written as multiline strings. This defaults to False to ensure preservation of newline byte sequences.
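
A toy sketch of the difference between the two modes. This is not tomli-w's implementation: render_string is a hypothetical helper and its escape table covers only a handful of characters. It only aims to show that the single-line form spells every newline as an explicit escape, while the multiline form leaves raw line feeds in the output.

```python
# Toy escape table covering only a few characters; a real writer must
# handle every control character the TOML spec requires to be escaped.
_ESCAPES = {"\\": "\\\\", '"': '\\"', "\n": "\\n", "\r": "\\r", "\t": "\\t"}


def render_string(value: str, *, multiline_strings: bool = False) -> str:
    """Render a Python string as a TOML basic string (hypothetical helper)."""
    if multiline_strings and "\n" in value:
        # Keep line feeds raw inside the triple quotes, escape the rest.
        body = "".join(ch if ch == "\n" else _ESCAPES.get(ch, ch) for ch in value)
        # A newline directly after the opening delimiter is trimmed on parsing.
        return '"""\n' + body + '"""'
    return '"' + "".join(_ESCAPES.get(ch, ch) for ch in value) + '"'


text = "first\r\nsecond\n"
single = render_string(text)
multi = render_string(text, multiline_strings=True)

print(single)
print(multi)
```

In the single-line output the exact sequence of CR and LF characters is recorded as escapes, so it survives any later newline translation of the file. The raw line feeds in the multiline output are easier to read but depend on the file's line endings being left untouched.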

Comparison

Here is the write API of toml:

def dump(o: Mapping[str, Any], f: SupportsWrite[str], encoder: TomlEncoder = ...) -> str: ...
def dumps(o: Mapping[str, Any], encoder: TomlEncoder = ...) -> str: ...

The encoder argument a) gives users some amount of control over formatting, b) lets users do some type dispatch for serialisation of custom types. However, the API of the TomlEncoder class isn't particularly simple.

Discussion points

  • Should we have a write API at all?

    Tentatively, yes.

    Reasons for:

    • Users will likely expect a write API to be available for consistency.
    • Empirically it seems useful: toml.dump has about a quarter as many hits as toml.load on https://grep.app.
    • At the very least, it will be useful for testing application uses of toml.load (about 1/5 of toml.dump hits are in test files).
    • If we keep featureset narrow, a write API shouldn't be too much additional burden, e.g. tomli-w is 200 LoC.
    • If we're able to re-use the toml package name, having a write API will minimise disruption for any affected users.

    Reasons against:

    • A write API is not needed for the core Python packaging use cases or for tools that merely need to read configuration.
    • Many write use cases I found on https://grep.app made small edits to user specified TOML and wrote it back. These use cases would be better served by a style preserving library to avoid loss of user comments and formatting.
    • Values in TOML can be represented in multiple ways. Several write use cases I found on https://grep.app had extra munging of outputted TOML strings in order to format things in a specific way and may be better served by a more complex API.
  • Should we allow users more control over formatting?

    Tentatively, no.

    As mentioned, TOML values can be represented in multiple ways, so inevitably, people will have strong opinions over how to format strings, when to inline arrays or tables, how much to indent, whether to reorder contents, and so on. In several cases, users could enforce TOML formatting by using an autoformatter of their choice at a later point.

    I acknowledge that supporting multiline_strings is something of an exception to this. If it proves controversial, we can err on the side of simplicity and remove it.

  • Should we allow users more control over serialisation?

    Tentatively, maybe no.

    It could be useful to add the equivalent of a default argument (like json.dump) to allow users to specify how custom types should be serialised. However, I could find only one instance of using toml.TomlEncoder to accomplish this kind of thing on https://grep.app.

    TOML is used more for configuration than serialisation of arbitrary data, so users are perhaps less likely to require custom serialisation than with say JSON. Support for this could be added in a backward compatible way.
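
For reference, this is how the default argument mentioned above works in json.dumps today. An equivalent for TOML is not part of this proposal; the sketch only shows the kind of hook that could be added later. encode_custom and the settings dict are made up for illustration.

```python
import json
from pathlib import PurePosixPath


def encode_custom(obj):
    """Fallback that json calls for objects it cannot serialise itself."""
    if isinstance(obj, PurePosixPath):
        return str(obj)
    raise TypeError(f"cannot serialise {type(obj).__name__}")


settings = {"cache_dir": PurePosixPath("/var/cache/app"), "retries": 3}
encoded = json.dumps(settings, default=encode_custom)
print(encoded)
```

Because the hook is a keyword argument with a default, adding it in a later release would not break existing callers.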

Package name

Ideally, we would be able to use the toml package name. This seems most doable if the maintainer of toml resurfaced and was willing to give up the toml name on PyPI, in which case we could repurpose the PyPI package as a stdlib backport. This would still potentially be breaking for users, but based on my grepping of toml usage, relatively few users should be affected as long as we include a basic write API.

Otherwise, bikesheds include tomllib (like plistlib or pathlib), tomlparser (like configparser) or tomli (assuming we use tomli as the basis for implementation). tomllib seems the best of these to me.

Maintenance

How stable is TOML?

The release of TOML v1 in January 2021 indicates stability. Empirically, TOML has proven to be a stable format even prior to the release of TOML v1. From the changelog, we see TOML has had no major changes since April 2020 and has had two releases in the last five years.

How maintainable is the proposed implementation?

The proposed implementation is in pure Python, well tested and weighs under 1000 lines of code.

The author of tomli has indicated willingness to help integrate tomli into the standard library and help maintain it, as per this.

Since TOML is mainly intended as human readable configuration, there is relatively little need for performance, so we won't need a C extension. Users with extreme performance needs can use a third party library (as is already often the case with JSON, despite an extension module in the standard library).

Is including TOML support a slippery slope for the standard library?

As discussed, TOML holds a special place in the Python ecosystem. The chief reason to include TOML in the standard library does not apply to other formats, such as YAML or MessagePack.

In addition, the simplicity of TOML can help serve as a dividing line: YAML, for example, is large and complicated.

What next?

Would love to hear python-dev's thoughts on:

  • Proposed API
  • tomli as the basis of implementation
  • What the next steps should be (turn this into a PEP, write a pull request, etc)
@hukkin

hukkin commented Jan 2, 2022

Awesome work, thanks!

As you mentioned on the Tomli issue, a PEP is likely required at this point so the style of this document may need to change a bit.

I'm not sure if there's a good way to review and work on gists collaboratively (?) but I'll add a few notes here. Maybe a regular PR would be easier to work on?


I propose using the tomli API for reading TOML

I suggest using the / syntax for positional args instead of __fp, __s and __obj. The reason I don't do this in Tomli (yet) is because it has to support Python 3.7 but I'd like to change this when 3.7 goes EOL. The standard library obviously doesn't need to support EOL Pythons though.

Why positional-only in the first place? Because pickle.loads set a precedent and IMO using a keyword argument here (especially with a non-descriptive name such as s or fp) is bad style and the world may be a better place if nobody writes code like that.
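
A minimal sketch of the suggested signature style. The loads below is a stub that only mirrors the proposed parameters and does no TOML parsing; it is here to show how the / marker behaves at call time.

```python
from typing import Any, Callable


def loads(s: str, /, *, parse_float: Callable[[str], Any] = float) -> dict[str, Any]:
    """Hypothetical stub with the suggested signature; it does not parse TOML."""
    return {"input_length": len(s), "parse_float": parse_float}


# Passing the document positionally is accepted.
result = loads("answer = 42")

# Passing it by keyword is rejected because s is positional-only.
try:
    loads(s="answer = 42")
except TypeError as exc:
    error = str(exc)
else:
    error = None

print(result["input_length"], error)
```

Unlike the double-underscore naming convention, which only type checkers understand, the / marker is enforced by the interpreter itself.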


b) avoid ambiguity in the TOML spec regarding universal newlines (see toml-lang/toml#835)

The correct issue to link to here IMO would be toml-lang/toml#837.

EDIT: Also, I don't think there is any ambiguity. Lone CR characters are strictly prohibited in TOML v1.0.0 (no ambiguity here), but if a document is read with universal newlines enabled, they are converted to LF characters and the document parses just fine despite being invalid TOML.
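
To make the effect concrete, here is a minimal sketch using only the io module. The sample bytes are made up, and no TOML parser is involved; the point is what a parser would be handed in each case.

```python
import io

# A lone CR after "a = 1" (invalid in TOML v1.0.0) and a CRLF after "b = 2".
raw = b"a = 1\rb = 2\r\n"

# Text mode with the default newline=None enables universal newlines:
# "\r" and "\r\n" are both translated to "\n" before a parser sees them.
translated = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8").read()

# Decoding the bytes directly keeps the original newline sequences.
exact = raw.decode("utf-8")

print(repr(translated))
print(repr(exact))
```

Only the second form lets a parser notice and reject the lone CR, which is one argument for load accepting bytes rather than text.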


Should we have a write API at all?

Tentatively, yes.

I don't feel too strongly about this, but I'd personally be leaning slightly towards "Tentatively, no". Mainly because:

  • We now have PyPI, pip, packaging standards and the Internet! The standard library no longer needs to be able to do everything.
  • As discussed, write use-cases probably benefit from style preservation so an external library may be a better choice for writing anyways.

@hukkin

hukkin commented Jan 2, 2022

I hope it's fine if I ping @gaborbernat as they were also willing to write a PEP.

@hauntsaninja
Author

Thanks, this is great! I've incorporated your feedback into https://github.com/hauntsaninja/peps/blob/toml-pep/pep-9999.rst (as linked in hukkin/tomli#141)
