diff --git a/pep-0680.rst b/pep-0680.rst new file mode 100644 index 00000000000..1dffa4a00df --- /dev/null +++ b/pep-0680.rst @@ -0,0 +1,501 @@ +PEP: 680 +Title: tomllib: Support for parsing TOML in the Standard Library +Author: Taneli Hukkinen, Shantanu Jain +Sponsor: Petr Viktorin +Discussions-To: https://discuss.python.org/t/adopting-recommending-a-toml-parser/4068 +Status: Draft +Type: Standards Track +Content-Type: text/x-rst +Created: 01-Jan-2022 +Python-Version: 3.11 +Post-History: 1900-01-01 + + +Abstract +======== + +This PEP proposes adding a module, ``tomllib``, to the standard library for +parsing TOML (Tom's Obvious Minimal Language, +`https://toml.io `_). + + +Motivation +========== + +The TOML format is the format of choice for Python packaging, as evidenced by +:pep:`517`, :pep:`518` and :pep:`621`. Including TOML support in the standard +library helps avoid bootstrapping problems for Python build tools. Currently +most Python build tools need to vendor a TOML parsing library. + +Python tools are increasingly configurable via TOML, for example: ``black``, +``mypy``, ``pytest``, ``tox``, ``pylint``, ``isort``. Those that are not, such +as ``flake8``, cite the lack of standard library support as a `main reason why +`_. + +Given the special place TOML already has in the Python ecosystem, it makes sense +for this to be an included battery. + +Finally, TOML as a format is increasingly popular (some reasons for this are +outlined in PEP 518). Hence this is likely to be a generally useful addition, +even looking beyond the needs of Python packaging and Python tooling: various +Python TOML libraries have about 2000 reverse dependencies on PyPI. For +comparison, ``requests`` has about 28k reverse dependencies. + + +Rationale +========= + +This PEP proposes basing the standard library support for reading TOML on the +third party library ``tomli`` +(`github.com/hukkin/tomli `_). 
+ +Many projects have recently switched to using ``tomli``, for example, ``pip``, +``build``, ``pytest``, ``mypy``, ``black``, ``flit``, ``coverage``, +``setuptools-scm``, ``cibuildwheel``. + +``tomli`` is actively maintained and well-tested. ``tomli`` is about 800 lines +of code with 100% test coverage and passes all tests in a test suite `proposed +as the official TOML compliance test suite +`_, as well as `the more +established BurntSushi/toml-test suite +`_. + + +Specification +============= + +A new module ``tomllib`` with the following functions will be added: + +.. code-block:: + + def load(fp: SupportsRead[bytes], /, *, parse_float: Callable[[str], Any] = ...) -> dict[str, Any]: ... + def loads(s: str, /, *, parse_float: Callable[[str], Any] = ...) -> dict[str, Any]: ... + +``tomllib.load`` deserializes a binary file containing a +TOML document to a Python dict. +The ``fp`` argument must have a ``read()`` method with the same API as +``io.RawIOBase.read()``. + +``tomllib.loads`` deserializes a str instance containing a TOML document +to a Python dict. + +``parse_float`` is a function that takes a string representing a TOML float and +returns a Python object (similar to ``parse_float`` in ``json.load``). For +example, a function returning a ``decimal.Decimal`` in cases where precision is +important. By default, TOML floats are represented as ``float`` type. + +The returned object contains only basic Python objects (``str``, ``int``, +``bool``, ``float``, ``datetime.{datetime,date,time}``, ``list``, ``dict`` with +string keys), and the results of ``parse_float``. + +``tomllib.TOMLDecodeError`` is raised in the case of invalid TOML. + +Note that this PEP does not propose ``tomllib.dump`` or ``tomllib.dumps`` +functions, see ``_ for details. + + +Maintenance Implications +======================== + +Stability of TOML +----------------- + +The release of TOML v1 in January 2021 indicates stability. 
Empirically, TOML +has proven to be a stable format even prior to the release of TOML v1. From the +`changelog `_, we +see TOML has had no major changes since April 2020 and has had two releases in +the last five years. + +In the event of changes to the TOML specification, we could treat minor +revisions as bug fixes and update the implementation in place. In the event of +major breaking changes, we should preserve support for TOML v1. + +Maintainability of proposed implementation +------------------------------------------ + +The proposed implementation (``tomli``) is in pure Python, well tested and +weighs under 1000 lines of code. It is minimalist, offering a smaller API +surface area than other TOML implementations. + +The author of ``tomli`` is willing to help integrate ``tomli`` into the standard +library and help maintain it, `as per this post +`__. +Petr Viktorin has indicated willingness to maintain a read API, +`as per this post +`__. + +Rewriting the parser in C is not deemed necessary at this time. It's rare for +TOML parsing to be a bottleneck in applications. Users with higher performance +needs can use a third party library (as is already often the case with JSON, +despite a stdlib extension module). + +TOML support a slippery slope for other things +---------------------------------------------- + +As discussed in the Motivation section, TOML holds a special place in the +Python ecosystem. This chief reason to include TOML in the standard library +does not apply to other formats, such as YAML or MessagePack. + +In addition, the simplicity of TOML can help serve as a dividing line: YAML, +for example, is large and complicated. + +An API for writing TOML may, however, be added in a future PEP. + + +Backwards Compatibility +======================= + +This proposal has no backwards compatibility issues within the stdlib, as it +describes a new module. 
+Any existing third-party module named ``tomllib`` will break, as +``import tomllib`` will import the standard library module. +However, ``tomllib`` is not registered on PyPI, so it is unlikely that such +a module is widely used. + +Note that we avoid using the more straightforward name ``toml``, to avoid +backwards compatibility implications for users who have pinned versions of the +current ``toml`` PyPI package. For more details, see ``_. + + +Security Implications +===================== + +Errors in the implementation could cause potential security issues. +The parser's output is limited to simple data types; inability to load +arbitrary classes avoids security issues common in more "powerful" formats like +pickle and YAML. Also, the implementation will be in pure Python, which reduces +security issues endemic to C, such as buffer overflows. + + +How to Teach This +================= + +The API of ``tomllib`` mimics that of other well-established file format +libraries, such as ``json`` and ``pickle``. The lack of a ``dump`` function will +be explained in the documentation, with a link to relevant third-party libraries +(``tomlkit``, ``tomli-w``, ``pytomlpp``). + + +Reference Implementation +======================== + +The proposed implementation can be found at https://github.com/hukkin/tomli + + +Rejected Ideas +============== + +Basing on another TOML implementation +------------------------------------- + +Potential alternatives include: + +* ``tomlkit``. + ``tomlkit`` is well established, actively maintained and supports TOML v1. An + important difference is that ``tomlkit`` supports style roundtripping. As a + result, it has a more complex API and implementation (about 5x as much code as + ``tomli``). The author does not believe that ``tomlkit`` is a good choice for + the standard library. + +* ``toml``. + ``toml`` is a widely used library. However, it is not actively maintained, + does not support TOML v1 and has several known bugs. 
Its API is more complex + than that of ``tomli``. It has some very limited and mostly unused ability to + preserve style through an undocumented decoder API. It has the ability to + customise output style through a complicated encoder API. For more details on + API differences to this PEP, refer to `Appendix A`_. + +* ``pytomlpp``. + ``pytomlpp`` is a Python wrapper for the C++ project ``toml++``. Pure Python + libraries are easier to maintain than extension modules. + +* ``rtoml``. + ``rtoml`` is a Python wrapper for the Rust project ``toml-rs`` and hence has + similar shortcomings to ``pytomlpp``. + In addition, it does not support TOML v1. + +* Writing from scratch. + It's unclear what we would get from this: ``tomli`` meets our needs and the + author is willing to help with its inclusion in the standard library. + +Including an API for writing TOML +--------------------------------- + +There are several reasons to not include an API for writing TOML: + +The ability to write TOML is not needed for the use cases that motivate this +PEP: for core Python packaging use cases or for tools that need to read +configuration. + +Use cases that involve editing TOML (as opposed to writing brand new TOML) are +better served by a style preserving library. TOML is intended as human-readable +and human-editable configuration, so it's important to preserve human markup, +such as comments and formatting. This requires a parser whose output includes +style-related metadata, making it impractical to output plain Python types like +``str`` and ``dict``. Designing such an API is complicated. + +But even without considering style preservation, there are too many degrees of +freedom in how to design a write API. For example, how much control to allow +users over output formatting, over serialization of custom types, and over input +and output validation. 
While there are reasonable choices on how to resolve +these, the nature of the standard library is such that one only gets one chance +to get things right. + +Currently no CPython core developers have expressed willingness to maintain a +write API or sponsor a PEP that includes a write API. Since it is hard to change +or remove something in the standard library, it is safer to err on the side of +exclusion and potentially revisit later. + +So, writing TOML is left to third-party libraries. If a good API and relevant +use cases for it are found later, it can be added in a future PEP. + + +Assorted API details +-------------------- + +Types accepted by the first argument of ``tomllib.load`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The ``toml`` library on PyPI allows passing paths (and lists of path-like +objects, ignoring missing files and merging the documents into a single object). +Doing this would be inconsistent with ``json.load``, ``pickle.load``, etc. If we +agree consistency with other stdlib modules is desirable, allowing paths is +somewhat out of scope for this PEP. This can easily and explicitly be worked +around in user code, or a third-party library. + +The proposed API takes a binary file, while ``toml.load`` takes a text file and +``json.load`` takes either. Using a binary file allows us to a) ensure utf-8 is +the encoding used, b) avoid incorrectly parsing single carriage returns as valid +TOML due to universal newlines. + +Type accepted by the first argument of ``tomllib.loads`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +While ``tomllib.load`` takes a binary file, ``tomllib.loads`` takes +a text string. This may seem inconsistent at first. + +Quoting TOML v1.0.0 specification: + +> A TOML file must be a valid UTF-8 encoded Unicode document. + +``tomllib.loads`` does not intend to load a TOML file, but rather the +document that the file stores. 
The most natural representation of +a Unicode document in Python is ``str``, not ``bytes``. + +It is possible to add ``bytes`` support in the future if needed, but +we are not aware of any use cases for it. + +Controlling the type of mappings returned by ``tomllib.load[s]`` +---------------------------------------------------------------- + +The ``toml`` library on PyPI supports a ``_dict`` argument, which works +similarly to the ``object_hook`` argument in ``json.load[s]``. There are several +uses of ``_dict`` found on https://grep.app, however, almost all of them are +passing ``_dict=OrderedDict``, which should be unnecessary as of Python 3.7. We +found two instances of legitimate use: in one case, a custom class was passed +for friendlier KeyErrors, in another case, the custom class had several +additional lookup and mutation methods (e.g. to help resolve dotted keys). + +Such an argument is not necessary for the core use cases outlined in the +motivation section. The absence of this can be pretty easily worked around using +a wrapper class, transformer function, or a third-party library. Finally, +support could be added later in a backward compatible way. + + +Removing support for ``parse_float`` in ``tomllib.load[s]`` +----------------------------------------------------------- + +This option is not strictly necessary, since TOML floats are "IEEE 754 binary64 +values", which is ``float`` on most architectures. Using ``decimal.Decimal`` +thus allows users extra precision not promised by the TOML format. However, in +the experience of the author of ``tomli``, this is useful in scientific and +financial applications. TOML-facing users may include non-developers who are +not aware of the limits of double-precision float. + +There are also niche architectures where the Python ``float`` is not an IEEE 754 +binary64. The ``parse_float`` argument allows users to achieve correct TOML +semantics even on such architectures. 
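As a minimal sketch of the ``parse_float`` behaviour described above, the following compares the default ``float`` result with ``decimal.Decimal``. It assumes the proposed module is importable as ``tomllib`` (as it is in Python 3.11 and later); the fallback to the third-party ``tomli`` package and the sample document are assumptions made for illustration only.

```python
from decimal import Decimal

try:
    import tomllib  # standard library module in Python 3.11 and later
except ModuleNotFoundError:
    # Assumption: on older interpreters the third-party backport is installed.
    import tomli as tomllib

# Hypothetical sample document, not taken from the PEP.
DOC = """
price = 0.1
count = 3
"""

# Default behaviour: TOML floats become Python float objects.
as_float = tomllib.loads(DOC)

# With parse_float=Decimal, each TOML float string is passed to Decimal().
as_decimal = tomllib.loads(DOC, parse_float=Decimal)

print(type(as_float["price"]).__name__, type(as_decimal["price"]).__name__)
# Decimal arithmetic avoids the binary rounding seen with float.
print(as_float["price"] * 3 == 0.3, as_decimal["price"] * 3 == Decimal("0.3"))
# TOML integers are unaffected by parse_float.
print(type(as_decimal["count"]).__name__)
```

Only values written as TOML floats go through ``parse_float``; integers, strings and other types are returned as the usual basic Python objects.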
+ + +Alternative names for module +---------------------------- + +Ideally, we would be able to use the ``toml`` module name. + +However, the ``toml`` package on PyPI is widely used, so there are backward +compatibility concerns. Since the standard library takes precedence over third +party packages, users who have pinned versions of ``toml`` would be broken when +upgrading Python versions by any API incompatibilities. + +To further clarify, the user pins are the specific concern here. Even if we were +able to get control over the ``toml`` PyPI package and repurpose it as a +standard library backport, we would still break users who have pinned to +versions of the current ``toml`` package. This is unfortunate, since pinning +would likely be a common response to breaking changes introduced by repurposing +the ``toml`` package as a backport (that is incompatible with today's ``toml``). + +There are several API incompatibilities between ``toml`` and the API proposed in +this PEP, listed in `Appendix A`_. + +Finally, the ``toml`` package on PyPI is not actively maintained and `we have +been unable to contact the author `, +so action here would likely have to be taken without the author's consent. + +This PEP proposes ``tomllib``. This mirrors ``plistlib`` (another file format +module in the standard library), as well as several others such as ``pathlib``, +``graphlib``, etc. + +Other considered names include: + +* ``tomlparser``. This mirrors ``configparser``, but is perhaps slightly less + appropriate if we include a write API in the future. +* ``tomli``. This assumes we use ``tomli`` as the basis for implementation. +* ``toml`` under some namespace, such as ``parser.toml``. However, this is + awkward, especially so since existing libraries like ``json``, ``pickle``, + ``marshal``, ``html`` etc. would not be included in the namespace. 
+ + +TODO: Random things +=================== + +Previous discussion: + +* https://bugs.python.org/issue40059 +* https://mail.python.org/archives/list/python-ideas@python.org/thread/IWJ3I32A4TY6CIVQ6ONPEBPWP4TOV2V7/ +* https://mail.python.org/pipermail/python-dev/2019-May/157405.html +* https://github.com/hukkin/tomli/issues/141 +* https://discuss.python.org/t/adopting-recommending-a-toml-parser/4068/84 + +Useful https://grep.app searches (note, ignore vendored): + +* toml.load[s] usage https://grep.app/search?q=toml.load&filter[lang][0]=Python +* toml.dump[s] usage https://grep.app/search?q=toml.dump&filter[lang][0]=Python +* TomlEncoder subclasses https://grep.app/search?q=TomlEncoder%29%3A&filter[lang][0]=Python + + +.. _Appendix A: + +Appendix A: Differences between proposed API and ``toml`` +========================================================= + +This appendix covers the differences between the API proposed in this PEP and +that of the third party package ``toml``. These differences are relevant to +understanding the amount of breakage we could expect if we used the ``toml`` +name for the standard library module, as well as to better understand the design +space. Note that this list might not be exhaustive. + +#. This PEP currently proposes not to include a write API. That is, there will + be no equivalent of ``toml.dump`` or ``toml.dumps``. + + Discussed at ``_. + + If we included a write API, it would be relatively simple to convert most + code that uses ``toml`` to use the API proposed in this PEP (acknowledging + that that is very different from a compatible API). + + A significant fraction of ``toml`` users rely on this. + +#. Different first argument of ``toml.load`` + + ``toml.load`` has the following signature: + + .. code-block:: + + def load( + f: Union[SupportsRead[str], str, bytes, list[PathLike | str | bytes]], + _dict: Type[MutableMapping[str, Any]] = ..., + decoder: TomlDecoder = ..., + ) -> MutableMapping[str, Any]: ... 
+ + This is pretty different from the first argument proposed in this PEP: ``SupportsRead[bytes]``. + + Recapping the reasons for this, previously mentioned at + ``_: + + * Allowing passing of paths (and lists of path-like objects, ignoring missing + files and merging the documents into a single object) is inconsistent with + other similar functions in the standard library. + * Using ``SupportsRead[bytes]`` allows us to a) ensure utf-8 is the encoding used, + b) avoid incorrectly parsing single carriage returns as valid TOML due to + universal newlines. TOML specifies file encoding and valid newline + sequences, and hence is simply a stricter format than what text file objects + represent. + + A significant fraction of ``toml`` users rely on this. + +#. Errors + + ``toml`` raises ``TomlDecodeError`` vs the proposed PEP 8 compliant + ``TOMLDecodeError``. + + A significant fraction of ``toml`` users rely on this. + +#. ``toml.load[s]`` accepts a ``_dict`` argument + + Discussed at ``_. + + As discussed, almost all usage consists of ``_dict=OrderedDict``, which is + not necessary in Python 3.7 and later. + +#. ``toml.load[s]`` support an undocumented ``decoder`` argument + + It seems the intended use case is for an implementation of comment + preservation. The information recorded is not sufficient to roundtrip the + TOML document preserving style, the implementation has known bugs, the + feature is undocumented and I could only find one instance of its use on + https://grep.app. + + The ``toml.TomlDecoder`` interface exposed is not simple, containing nine methods. + See `here `__. + + Users are probably better served by a more complete implementation of style + preserving parsing and writing. + +#. ``toml.dump[s]`` support an ``encoder`` argument + + Note that we currently propose not to include a write API, however if that + were to change, these differences would likely become relevant. 
+ + This enables two use cases, a) control over how custom types should be + serialized, b) control over how output should be formatted. + + The first use case is reasonable, however, I could only find two instances of + this on https://grep.app. One of these two instances used this ability to add + support for dumping ``decimal.Decimal`` (which a potential standard library + implementation would support out of the box). + + If needed, this use case could be well served by the equivalent of the + ``default`` argument in ``json.dump``. + + The second use case is enabled by allowing users to specify subclasses of + ``toml.TomlEncoder`` and overriding methods to specify parts of the TOML + writing process. The API consists of five methods and exposes a lot of + implementation detail. See `here `__. + + There is some usage of the ``encoder`` API on https://grep.app, however, it + likely accounts for a tiny fraction of overall usage of ``toml``. + +#. Timezones + + ``toml`` uses and exposes custom ``toml.tz.TomlTz`` timezone objects. The + proposed implementation uses ``datetime.timezone`` objects from the standard + library. + + +Copyright +========= + +This document is placed in the public domain or under the +CC0-1.0-Universal license, whichever is more permissive. + + + +.. + Local Variables: + mode: indented-text + indent-tabs-mode: nil + sentence-end-double-space: t + fill-column: 70 + coding: utf-8 + End: