hauntsaninja/toml.md

## toml.md

      
Proposal: Support for TOML in the Standard Library

Previous discussion:

https://bugs.python.org/issue40059
https://mail.python.org/archives/list/python-ideas@python.org/thread/IWJ3I32A4TY6CIVQ6ONPEBPWP4TOV2V7/
https://mail.python.org/pipermail/python-dev/2019-May/157405.html

Motivation

The TOML format is the format of choice for Python packaging, as evidenced by PEP 517, PEP 518 and PEP 621.
Including TOML support in the standard library helps avoid bootstrapping problems for Python build tools.
Python tools are increasingly configurable via TOML, for example: black, mypy, pytest, tox, pylint, isort.
Those that are not, such as flake8, cite the lack of standard library support as a main reason why.
Given the special place TOML already has in the Python ecosystem, it makes sense for this to be an included battery.
Finally, TOML as a format is increasingly popular (some reasons for this are outlined in PEP 518).
Hence this is likely to be a generally useful addition, even looking beyond the needs of Python packaging and Python tooling. The various Python TOML libraries have about 2000 reverse dependencies on PyPI between them (for comparison, requests has about 28k).

Survey of third party TOML packages

toml

This is a widely used library, with about 1.7k reverse dependencies on PyPI.
However, it was maintained by a single person and has become effectively unmaintained.
In particular, it does not support TOML v1 (specified as of January 2021 with a release candidate in April 2020).
Given the importance of TOML to Python packaging, inclusion of a TOML package in the standard library could help avoid similar situations going forward.

tomli

tomli is a newer library with support for TOML v1.
Many projects have recently switched to using tomli from toml.
tomli has 113 reverse dependencies on PyPI. These include pip, pytest, mypy, black, flit, coverage, setuptools-scm, cibuildwheel.
tomli is about 800 lines of code with claimed 100% branch coverage.
tomli itself only allows you to read TOML; write support is included in its sister package tomli-w.
tomli-w is about 200 lines of code with claimed 100% branch coverage (although worth noting that tomli-w is currently much less widely used than tomli).
The author is supportive of potential inclusion in the standard library, as per this
See also tomli's FAQs:

https://github.com/hukkin/tomli#faq
https://github.com/hukkin/tomli-w#faq

tomlkit

tomlkit supports TOML v1.
tomlkit has been around for a while. It was the first Python library I'm aware of to support TOML v1.
It has 244 reverse dependencies on PyPI, notably including the poetry packaging tool.
It's more featureful than other libraries mentioned. In particular, it supports round-trip parsing and writing (that is, it preserves whitespace, comments, ordering, style, etc).
tomlkit is about 4600 lines of code.

pytomlpp / rtoml

pytomlpp and rtoml are Python wrappers for the C++ project toml++ and the Rust project toml-rs, respectively.

Concrete proposal

I propose including a TOML package in the standard library with the following API, based on tomli's implementation.
A quick digression: there's been much meta-level discussion, e.g. on what the correct process to add something to the standard library is.
I'd personally find it helpful if, when replying to this, you say at what point you get off the train:

Are you +1/0/-1 on TOML to the standard library in the abstract?
Are you +1/0/-1 on TOML in the standard library with the proposed API?
Are you +1/0/-1 on TOML in the standard library with implementation based on tomli?

Anyway, without further ado...

Read API

I propose using the tomli API for reading TOML:
def load(__fp: SupportsRead[bytes], *, parse_float: Callable[[str], Any] = float) -> dict[str, Any]: ...
def loads(__s: str, *, parse_float: Callable[[str], Any] = float) -> dict[str, Any]: ...

As in stdlib's json, parse_float is a callable that takes the string form of a float and returns the object to use in its place. For example, pass decimal.Decimal in cases where precision is important.
Note we make no attempt to preserve style (comments, whitespace, etc).
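A minimal sketch of the parse_float hook, under a few stated assumptions. It assumes a module with the proposed read API is importable: it tries tomllib (one of the candidate names discussed under "Package name", and the name Python 3.11+ ships) and falls back to tomli. The sample document and its keys are invented for illustration.

```python
from decimal import Decimal

# Assumption: one of these modules is available. Both expose loads()
# with the signature proposed above.
try:
    import tomllib
except ModuleNotFoundError:
    import tomli as tomllib

# Hypothetical configuration, made up for this example.
DOC = """
[build]
timeout = 0.1
retries = 3
"""

# Default behaviour: TOML floats become Python floats.
as_float = tomllib.loads(DOC)

# With parse_float, each float literal is handed over as a string,
# so Decimal sees "0.1" rather than a binary approximation of it.
as_decimal = tomllib.loads(DOC, parse_float=Decimal)

print(type(as_float["build"]["timeout"]).__name__)
print(type(as_decimal["build"]["timeout"]).__name__)

# Integers are unaffected by parse_float.
print(type(as_decimal["build"]["retries"]).__name__)
```

Only float literals go through the hook, so integers, strings and other values keep their usual Python types.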

Comparison

Here is the read API of toml:
def load(f: Union[str, list, SupportsRead[str]], _dict: Type[MutableMapping[str, Any]] = ..., decoder: TomlDecoder = ...) -> MutableMapping[str, Any]: ...
def loads(s: str, _dict: Type[MutableMapping[str, Any]] = ..., decoder: TomlDecoder = ...) -> MutableMapping[str, Any]: ...

The _dict argument allows the user to control the type of the returned mapping.
The decoder argument is undocumented and the API of TomlDecoder is not simple. Its main use case is to pass toml.TomlPreserveCommentDecoder, which allows the user to collect TOML comments.
I could only find one use of this on https://grep.app. In addition, TomlPreserveCommentDecoder isn't fully style preserving and it has known bugs.
Here is the read API of json:
def loads(
    s: str | bytes,
    *,
    cls: Type[JSONDecoder] | None = ...,
    object_hook: Callable[[dict[Any, Any]], Any] | None = ...,
    parse_float: Callable[[str], Any] | None = ...,
    parse_int: Callable[[str], Any] | None = ...,
    parse_constant: Callable[[str], Any] | None = ...,
    object_pairs_hook: Callable[[list[tuple[Any, Any]]], Any] | None = ...,
    **kwds: Any,
) -> Any: ...
def load(
    fp: SupportsRead[str | bytes],
    *,
    cls: Type[JSONDecoder] | None = ...,
    object_hook: Callable[[dict[Any, Any]], Any] | None = ...,
    parse_float: Callable[[str], Any] | None = ...,
    parse_int: Callable[[str], Any] | None = ...,
    parse_constant: Callable[[str], Any] | None = ...,
    object_pairs_hook: Callable[[list[tuple[Any, Any]]], Any] | None = ...,
    **kwds: Any,
) -> Any: ...

Discussion points


Should we preserve style information?
Tentatively, no.
The main use case for style preservation is allowing tools to automatically edit TOML without affecting human markup.
This is a relatively small fraction of use (as judged by reverse dependencies of toml and tomli vs the style preserving tomlkit) so it seems okay to relegate this additional functionality to third party libraries.
In particular, we don't need it for the core Python packaging use cases or for tools that merely need to read configuration.
Note that style preservation would likely require a large change if we wished to implement it later.
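As a rough illustration of what not preserving style means in practice, the sketch below parses two documents that differ only in comments, spacing and key order. It assumes tomllib (Python 3.11+) or tomli is importable; the table and key names are invented.

```python
# Assumption: one of these modules is available.
try:
    import tomllib
except ModuleNotFoundError:
    import tomli as tomllib

# Hand-edited document with comments and aligned values (hypothetical).
ANNOTATED = """
# Build settings, edited by hand
[tool.example]
name   = "demo"   # aligned for readability
debug  = false
"""

# Same data, with no comments and a different key order.
PLAIN = """
[tool.example]
debug = false
name = "demo"
"""

parsed_annotated = tomllib.loads(ANNOTATED)
parsed_plain = tomllib.loads(PLAIN)

# Both parse to the same plain dict; nothing in the result records
# the comments or the original layout.
print(parsed_annotated == parsed_plain)
print(parsed_annotated)
```

A tool that loads, modifies and rewrites such a file through this API would therefore lose the human markup, which is the gap that style preserving libraries such as tomlkit fill.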


Should we add an argument that works like toml's _dict or json's object_hook?
Tentatively, maybe no.
Such an argument a) is not necessary for core use cases, b) can be pretty easily worked around, and c) could be added later in a backward compatible way.
I was able to find a few use cases of toml's _dict functionality on https://grep.app. These were a) mostly passing _dict=OrderedDict, which should no longer be necessary since dicts preserve insertion order as of Python 3.7 (and CPython 3.6), b) a single case which passed a custom class for friendlier KeyErrors, c) a single case that added several methods to the dictionary-like object (e.g. to help resolve dotted keys).
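One possible workaround, sketched under assumptions: convert the plain dicts after parsing. FriendlyDict and convert are hypothetical names invented here, and the nested dict stands in for whatever load would return.

```python
from typing import Any


class FriendlyDict(dict):
    """Hypothetical dict subclass that reports available keys on a miss."""

    def __missing__(self, key: str) -> Any:
        raise KeyError(f"{key!r} not found; available keys: {sorted(self)}")


def convert(value: Any, mapping_type: type = FriendlyDict) -> Any:
    """Recursively rebuild parsed data using mapping_type for every table."""
    if isinstance(value, dict):
        return mapping_type(
            (key, convert(item, mapping_type)) for key, item in value.items()
        )
    if isinstance(value, list):
        return [convert(item, mapping_type) for item in value]
    return value


# Stand-in for the result of load() / loads().
parsed = {"tool": {"example": {"targets": [{"name": "a"}, {"name": "b"}]}}}
config = convert(parsed)

print(type(config["tool"]["example"]["targets"][0]).__name__)

try:
    config["tool"]["missing"]
except KeyError as exc:
    print(exc)
```

The cost is one extra pass over the data, which is unlikely to matter for configuration-sized documents.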


What should we be able to pass to the first argument of load?
Tentatively, anything with a read method that returns bytes.
toml allows passing path-like objects (and lists of path-like objects!). I propose not doing this, for consistency with json.load, pickle.load, etc.
tomli.load takes a SupportsRead[bytes], toml.load takes a SupportsRead[str], while json.load takes a SupportsRead[str | bytes].
Requiring bytes is slightly opinionated. It was a recent change in tomli v1.2, made to a) ensure UTF-8 is the encoding used, and b) avoid ambiguity in the TOML spec regarding universal newlines (see toml-lang/toml#835).
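A short sketch of what this means for callers, assuming tomllib (Python 3.11+) or tomli is importable. The file name and contents are invented; the point is only that the file is opened in binary mode and decoding is left to the parser.

```python
import io
import tempfile
from pathlib import Path

# Assumption: one of these modules is available.
try:
    import tomllib
except ModuleNotFoundError:
    import tomli as tomllib

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file, written as UTF-8 bytes.
    path = Path(tmp) / "example.toml"
    path.write_bytes('title = "caf\u00e9"\n'.encode("utf-8"))

    # Note "rb": load() reads bytes and decodes them as UTF-8 itself.
    with open(path, "rb") as fp:
        from_file = tomllib.load(fp)

# Any object whose read() returns bytes works, not just real files.
from_memory = tomllib.load(io.BytesIO(b"answer = 42\n"))

print(from_file)
print(from_memory)
```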


Write API

I propose we use the write API of tomli-w:
def dump(__obj: Mapping[str, Any], __fp: SupportsWrite[bytes], *, multiline_strings: bool = False) -> None: ...
def dumps(__obj: Mapping[str, Any], *, multiline_strings: bool = False) -> str: ...

The multiline_strings argument controls whether strings containing newlines are written as multiline strings.
This defaults to False to ensure preservation of newline byte sequences.
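To illustrate why newline preservation is at stake, the sketch below reads the same value written in both styles. It uses only the read side (tomllib, or tomli as a fallback, which is an assumption about what is installed) and hand-written TOML rather than actual tomli-w output.

```python
# Assumption: one of these modules is available.
try:
    import tomllib
except ModuleNotFoundError:
    import tomli as tomllib

# Single-line basic string: the CR LF pair is spelled with TOML escapes.
ESCAPED = 'text = "line one\\r\\nline two"\n'

# Multiline basic string: the CR LF pair appears literally in the document.
MULTILINE = 'text = """line one\r\nline two"""\n'

escaped_value = tomllib.loads(ESCAPED)["text"]
multiline_value = tomllib.loads(MULTILINE)["text"]

# The escaped form keeps the exact characters; the literal form is
# subject to the newline normalisation that the TOML spec permits.
print(repr(escaped_value))
print(repr(multiline_value))
print(escaped_value == multiline_value)
```

Writing strings in the escaped single-line form is less readable, but the value read back matches the value written regardless of how a parser treats literal newlines.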

Comparison

Here is the write API of toml:
def dump(o: Mapping[str, Any], f: SupportsWrite[str], encoder: TomlEncoder = ...) -> str: ...
def dumps(o: Mapping[str, Any], encoder: TomlEncoder = ...) -> str: ...

The encoder argument a) gives users some amount of control over formatting, b) lets users do some type dispatch for serialisation of custom types. However, the API of the TomlEncoder class isn't particularly simple.

Discussion points


Should we have a write API at all?
Tentatively, yes.
Reasons for:

Users will likely expect a write API to be available for consistency.
Empirically it seems useful: toml.dump has about a quarter as many hits as toml.load on https://grep.app.
At the very least, it will be useful for testing application uses of toml.load (about 1/5 of toml.dump hits are in test files).
If we keep the feature set narrow, a write API shouldn't be too much additional burden, e.g. tomli-w is about 200 lines of code.
If we're able to re-use the toml package name, having a write API will minimise disruption for any affected users.

Reasons against:

A write API is not needed for the core Python packaging use cases or for tools that merely need to read configuration.
Many write use cases I found on https://grep.app made small edits to user specified TOML and wrote it back. These use cases would be better served by a style preserving library to avoid loss of user comments and formatting.
Values in TOML can be represented in multiple ways. Several write use cases I found on https://grep.app had extra munging of outputted TOML strings in order to format things in a specific way and may be better served by a more complex API.


Should we allow users more control over formatting?
Tentatively, no.
As mentioned, TOML values can be represented in multiple ways, so inevitably, people will have strong opinions over how to format strings, when to inline arrays or tables, how much to indent, whether to reorder contents, and so on.
In several cases, users could enforce TOML formatting by using an autoformatter of their choice at a later point.
I acknowledge that supporting multiline_strings is something of an exception to this. If it proves controversial, we can err on the side of simplicity and remove it.


Should we allow users more control over serialisation?
Tentatively, maybe no.
It could be useful to add the equivalent of a default argument (like json.dump) to allow users to specify how custom types should be serialised.
However, I could find only one instance of using toml.TomlEncoder to accomplish this kind of thing on https://grep.app.


TOML is used more for configuration than serialisation of arbitrary data, so users are perhaps less likely to require custom serialisation than with, say, JSON.
Support for this could be added in a backward compatible way.
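For reference, this is how the existing default hook behaves in json, which is the model mentioned above. The proposed TOML write API has no such argument; the helper name and the settings below are invented for illustration.

```python
import json
from datetime import timedelta
from pathlib import PurePosixPath


def to_jsonable(obj: object) -> object:
    """Hypothetical fallback, called only for objects json cannot encode."""
    if isinstance(obj, PurePosixPath):
        return str(obj)
    if isinstance(obj, timedelta):
        return obj.total_seconds()
    raise TypeError(f"Cannot serialise {type(obj).__name__}")


settings = {
    "cache_dir": PurePosixPath("/var/cache/demo"),
    "timeout": timedelta(seconds=90),
}

# Without default=..., json.dumps would raise TypeError on these values.
encoded = json.dumps(settings, default=to_jsonable, sort_keys=True)
print(encoded)
```

A TOML equivalent would presumably take the same shape: a keyword-only callable consulted for unsupported types, which is why it could be added later without breaking existing callers.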

Package name

Ideally, we would be able to use the toml package name.
This seems most doable if the maintainer of toml resurfaced and was willing to give up the toml name on PyPI, in which case we could repurpose the PyPI package as a stdlib backport. This would still potentially be breaking for users, but based on my grepping of toml usage, relatively few users should be affected as long as we include a basic write API.
Otherwise, bikesheds include tomllib (like plistlib or pathlib), tomlparser (like configparser) or tomli (assuming we use tomli as the basis for implementation). tomllib seems the best of these to me.

Maintenance

How stable is TOML?

The release of TOML v1 in January 2021 indicates stability.
Empirically, TOML has proven to be a stable format even prior to the release of TOML v1.
From the changelog, we see TOML has had no major changes since April 2020 and has had two releases in the last five years.

How maintainable is the proposed implementation?

The proposed implementation is in pure Python, well tested and weighs under 1000 lines of code.
The author of tomli has indicated willingness to help integrate tomli into the standard library and help maintain it, as per this.
Since TOML is mainly intended as human readable configuration, there is relatively little need for high performance, so we won't need a C extension. Users with extreme performance needs can use a third party library (as is already often the case with JSON, despite an extension module in the standard library).

Is including TOML support a slippery slope for the standard library?

As discussed, TOML holds a special place in the Python ecosystem. The chief reason to include TOML in the standard library does not apply to other formats, such as YAML or MessagePack.
In addition, the simplicity of TOML can help serve as a dividing line. YAML, for example, is large and complicated.

What next?

Would love to hear python-dev's thoughts on:

Proposed API
tomli as the basis of implementation
What the next steps should be (turn this into a PEP, write a pull request, etc)