PEP: 517
Title: A build-system independent format for source trees
Version: $Revision$
Last-Modified: $Date$
Author: Nathaniel J. Smith <njs@pobox.com>, Thomas Kluyver <thomas@kluyver.me.uk>
BDFL-Delegate: Nick Coghlan <ncoghlan@gmail.com>
Discussions-To: <distutils-sig@python.org>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 30-Sep-2015
Post-History: 1 Oct 2015, 25 Oct 2015, 1 July 2017

Abstract

While distutils / setuptools have taken us a long way, they suffer from three serious problems: (a) they're missing important features like usable build-time dependency declaration, autoconfiguration, and even basic ergonomic niceties like DRY-compliant version number management, and (b) extending them is difficult, so while there do exist various solutions to the above problems, they're often quirky, fragile, and expensive to maintain, and yet (c) it's very difficult to use anything else, because distutils/setuptools provide the standard interface for installing packages expected by both users and installation tools like pip.

Previous efforts (e.g. distutils2 or setuptools itself) have attempted to solve problems (a) and/or (b). This proposal aims to solve (c).

The goal of this PEP is to get distutils-sig out of the business of being a gatekeeper for Python build systems. If you want to use distutils, great; if you want to use something else, then that should be easy to do using standardized methods. The difficulty of interfacing with distutils means that there aren't many such systems right now, but to give a sense of what we're thinking about, see flit or bento. Fortunately, wheels have now solved many of the hard problems here -- e.g. it's no longer necessary that a build system also know about every possible installation configuration -- so pretty much all we really need from a build system is that it have some way to spit out standard-compliant wheels and sdists.

We therefore propose a new, relatively minimal interface for installation tools like pip to interact with package source trees and source distributions.

Reversion to Draft Status

While this PEP was provisionally accepted for implementation in pip and other tools, some additional concerns were subsequently raised around adequately supporting out of tree builds. It has been reverted to Draft status while those concerns are being resolved.

Terminology and goals

A source tree is something like a VCS checkout. We need a standard interface for installing from this format, to support usages like pip install some-directory/.

A source distribution is a static snapshot representing a particular release of some source code, like lxml-3.4.4.tar.gz. Source distributions serve many purposes: they form an archival record of releases, they provide a stupid-simple de facto standard for tools that want to ingest and process large corpora of code, possibly written in many languages (e.g. code search), they act as the input to downstream packaging systems like Debian/Fedora/Conda/..., and so forth. In the Python ecosystem they additionally have a particularly important role to play, because packaging tools like pip are able to use source distributions to fulfill binary dependencies, e.g. if there is a distribution foo.whl which declares a dependency on bar, then we need to support the case where pip install bar or pip install foo automatically locates the sdist for bar, downloads it, builds it, and installs the resulting package.

Source distributions are also known as sdists for short.

A build frontend is a tool that users might run that takes arbitrary source trees or source distributions and builds wheels from them. The actual building is done by each source tree's build backend. In a command like pip wheel some-directory/, pip is acting as a build frontend.

An integration frontend is a tool that users might run that takes a set of package requirements (e.g. a requirements.txt file) and attempts to update a working environment to satisfy those requirements. This may require locating, building, and installing a combination of wheels and sdists. In a command like pip install lxml==2.4.0, pip is acting as an integration frontend.

Source trees

There is an existing, legacy source tree format involving setup.py. We don't try to specify it further; its de facto specification is encoded in the source code and documentation of distutils, setuptools, pip, and other tools. We'll refer to it as the setup.py-style.

Here we define a new style of source tree based around the pyproject.toml file defined in PEP 518, extending the [build-system] table in that file with one additional key, build-backend. Here's an example of how it would look:

[build-system]
# Defined by PEP 518:
requires = ["flit"]
# Defined by this PEP:
build-backend = "flit.api:main"

build-backend is a string naming a Python object that will be used to perform the build (see below for details). This is formatted following the same module:object syntax as a setuptools entry point. For instance, if the string is "flit.api:main" as in the example above, this object would be looked up by executing the equivalent of:

import flit.api
backend = flit.api.main

It's also legal to leave out the :object part, e.g.:

build-backend = "flit.api"

which acts like:

import flit.api
backend = flit.api

Formally, the string should satisfy this grammar:

identifier = (letter | '_') (letter | '_' | digit)*
module_path = identifier ('.' identifier)*
object_path = identifier ('.' identifier)*
entry_point = module_path (':' object_path)?

We import module_path and then look up module_path.object_path (or just module_path if object_path is missing).
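
For illustration, here is a minimal sketch of how a frontend might implement this lookup (resolve_backend is our name for the helper, not something this PEP specifies):

import importlib

def resolve_backend(spec):
    # Split "module.path:object.path"; the object path is optional.
    module_path, _, object_path = spec.partition(":")
    backend = importlib.import_module(module_path)
    if object_path:
        for attr in object_path.split("."):
            backend = getattr(backend, attr)
    return backend

For example, resolve_backend("flit.api:main") imports flit.api and returns its main attribute.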

If the pyproject.toml file is absent, or the build-backend key is missing, the source tree is not using this specification, and tools should fall back to running setup.py.

Where the build-backend key exists, it takes precedence over setup.py, and source trees need not include setup.py at all. Projects may still wish to include a setup.py for compatibility with tools that do not use this spec.
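
For concreteness, here is a sketch of how a frontend might decide which style a given source tree uses (uses_pep517 is our name; the third-party toml parser is an assumption, not something this spec requires):

import os
import toml  # third-party TOML parser, assumed to be available

def uses_pep517(source_tree):
    # Fall back to setup.py unless pyproject.toml names a build backend.
    pyproject = os.path.join(source_tree, "pyproject.toml")
    if not os.path.isfile(pyproject):
        return False
    with open(pyproject) as f:
        data = toml.load(f)
    return "build-backend" in data.get("build-system", {})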

Build backend interface

The build backend object is expected to have callable attributes called "hooks", which the build frontend can use to perform various actions. The two high-level actions defined by this spec are creation of an sdist (analogous to the legacy setup.py sdist command) and building of a wheel (analogous to the legacy setup.py bdist_wheel command). We additionally define a namespace for tool-specific hooks, which may be useful for prototyping future extensions to this specification.

General rules for all hooks

Finding the source tree

All hooks are run with the process working directory set to the root of the source tree (i.e., the directory containing pyproject.toml). To find the source tree, hooks should call os.getcwd() or equivalent.

Rationale: the process working directory has to be set to something, and if we were to leave it up to the build frontend to pick, then package developers would accidentally write code that assumes a particular answer here (example: long_desc = open("README.rst").read()), and this code would break when used with other build frontends. So it's important that we standardize a value for all build frontends to use consistently. And this is the obvious thing to specify it as, especially because it's compatible with popular and long-standing conventions like calling open("README.rst").read(). Then, given that we've decided to standardize on working directory = source directory, it makes sense to say that this is the only way that this information is passed, because providing a second redundant way (example: as an explicit argument to hooks) would only increase the possibility of error without any benefit.
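
To make this contract concrete, a frontend that runs each hook in a fresh process might invoke it like this (run_hook.py is a hypothetical helper script owned by the frontend):

import subprocess
import sys

source_tree_root = "/path/to/some-directory"  # directory containing pyproject.toml
subprocess.check_call(
    [sys.executable, "run_hook.py", "build_wheel"],
    cwd=source_tree_root,  # the hook finds the source tree via os.getcwd()
)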

Lifecycle

XX TODO: do we want to require frontends to use a new process for every hook call, or do we want to require backends to support multiple calls from the same process? Apparently scons and setuptools both can get cranky if you try to invoke them twice from the same process, so someone will be spawning extra processes here; the question is where to put that responsibility. The basic trade-off is that making it the backend's responsibility has better best-case performance if both the frontend and backend are able to re-use a single host process; but, if common frontends end up using new processes for each hook call for other reasons, then in practice either backends will end up spawning unnecessary extra processes, or else will end up with poorly tested paths when multiple hooks are run in the same process.

Given that going from get_requires_for_build_* to build_* in general requires changing the Python environment, it doesn't necessarily make sense to run these in the same process anyway. However, there's an important special case where it does: when get_requires_for_build_* returns []. And this is probably the overwhelmingly most common case.

Does it even matter? Windows is notoriously slow at spawning subprocesses. As a quick test, I tried measuring the time to spawn CPython 3.6 + import a package on a Windows 10 VM running on my laptop. python3.6 -c "import flit" was about 300 ms per call; python3.6 -c "import setuptools" was about 600 ms per call.

We could also potentially get fancy and have a flag to let the frontend and backend negotiate this (e.g. process_reuse_safe as an opt-in flag). This could also be added later as an extension, as long as we initially default to requiring separate processes for each hook.

Calling conventions

Hooks MAY be called with positional or keyword arguments, so backends implementing them MUST be careful to make sure that their signatures – including argument names – exactly match those specified here.

Output

Hooks MAY print arbitrary informational text on stdout and stderr. They MUST NOT read from stdin, and the build frontend MAY close stdin before invoking the hooks.

The build frontend may capture stdout and/or stderr from the backend. If the backend detects that an output stream is not a terminal/console (e.g. not sys.stdout.isatty()), it SHOULD ensure that any output it writes to that stream is UTF-8 encoded. The build frontend MUST NOT fail if captured output is not valid UTF-8, but it MAY not preserve all the information in that case (e.g. it may decode using the replace error handler in Python). If the output stream is a terminal, the build backend is responsible for presenting its output accurately, as for any program running in a terminal.
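
For example, a frontend that captures backend output might decode it leniently, so that invalid UTF-8 degrades gracefully rather than aborting the build (a sketch; hook_runner.py is a hypothetical frontend-owned script):

import subprocess
import sys

result = subprocess.run(
    [sys.executable, "hook_runner.py", "build_wheel"],
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
)
# Captured output MUST NOT cause a failure even if it isn't valid
# UTF-8, so decode with the "replace" error handler.
log = result.stdout.decode("utf-8", errors="replace")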

If a hook raises any exception, or causes the process to terminate, then this indicates that the operation has failed.

User-specified configuration

All hooks take a standard config_settings argument.

This argument is an arbitrary dictionary provided as an "escape hatch" for users to pass ad-hoc configuration into individual package builds. Build backends MAY assign any semantics they like to this dictionary. Build frontends SHOULD provide some mechanism for users to specify arbitrary string-key/string-value pairs to be placed in this dictionary. For example, they might support some syntax like --package-config CC=gcc. Build frontends MAY also provide arbitrary other mechanisms for users to place entries in this dictionary. For example, pip might choose to map a mix of modern and legacy command line arguments like:

pip install                                           \
  --package-config CC=gcc                             \
  --global-option="--some-global-option"              \
  --build-option="--build-option1"                    \
  --build-option="--build-option2"

into a config_settings dictionary like:

{
 "CC": "gcc",
 "--global-option": ["--some-global-option"],
 "--build-option": ["--build-option1", "--build-option2"],
}

Of course, it's up to users to make sure that they pass options which make sense for the particular build backend and package that they are building.
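
As one illustration, the simple --package-config KEY=VALUE form above could be parsed along these lines (a sketch, not pip's actual implementation; later values overwrite earlier ones here):

def config_settings_from_args(package_config_args):
    # Each argument looks like "CC=gcc".
    settings = {}
    for item in package_config_args:
        key, _, value = item.partition("=")
        settings[key] = value
    return settings

# config_settings_from_args(["CC=gcc"]) -> {"CC": "gcc"}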

Hook execution environment

One of the responsibilities of a build frontend is to set up the Python environment in which the build backend will run.

We do not require that any particular "virtual environment" mechanism be used; a build frontend might use virtualenv, or venv, or no special mechanism at all. But whatever mechanism is used MUST meet the following criteria:

  • All requirements specified by the project's build-requirements must be available for import from Python. In particular, the distributions specified in the pyproject.toml key build-system.requires must be made available to all hooks. Some hooks have additional requirements documented below.
  • This must remain true even for new Python subprocesses spawned by the build environment, e.g. code like:

    import sys, subprocess
    subprocess.check_call([sys.executable, ...])

    must spawn a Python process which has access to all the project's build-requirements. For example, this is necessary to support build backends that want to run legacy setup.py scripts in a subprocess.

  • All command-line scripts provided by the build-required packages must be present in the build environment's PATH. For example, if a project declares a build-requirement on flit, then the following must work as a mechanism for running the flit command-line tool:

    import subprocess
    subprocess.check_call(["flit", ...])

A build backend MUST be prepared to function in any environment which meets the above criteria. In particular, it MUST NOT assume that it has access to any packages except those that are present in the stdlib, or that are explicitly declared as build-requirements.
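
As a sketch, one compliant mechanism on a POSIX system might look like this (make_build_env is our name; a real frontend would also handle Windows script paths, caching, and prepending the environment's bin directory to PATH):

import os
import subprocess
import venv

def make_build_env(env_dir, requirements):
    # Create an isolated environment containing only the stdlib plus
    # the declared build requirements.
    venv.EnvBuilder(with_pip=True).create(env_dir)
    python = os.path.join(env_dir, "bin", "python")
    if requirements:
        subprocess.check_call([python, "-m", "pip", "install"]
                              + list(requirements))
    return python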

Building an sdist

Building an sdist involves three phases:

  1. The frontend calls the backend's get_requires_for_build_sdist hook to query for any extra requirements that are needed for the sdist build.
  2. The frontend obtains those requirements. For example, it might download them from PyPI and install them into some kind of virtual environment.
  3. The frontend calls the backend's build_sdist hook to create the sdist.

If either hook is missing, or returns the built-in constant NotImplemented (note that this is the object NotImplemented, not the string "NotImplemented"), then this indicates that this backend does not support building an sdist from this source tree. For example, some build backends might only support building sdists from a VCS checkout, and not from an unpacked sdist. If this occurs then the frontend should respond in whatever way it feels is appropriate. For example, it might display an error to the user.
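
Put together, a frontend might drive these three phases roughly as follows (a sketch; the environment-setup work in phase 2 is elided):

def frontend_build_sdist(backend, sdist_directory, config_settings):
    # Phase 1: ask for extra requirements; a missing hook or a
    # NotImplemented return means this tree can't produce an sdist.
    get_requires = getattr(backend, "get_requires_for_build_sdist", None)
    build = getattr(backend, "build_sdist", None)
    if get_requires is None or build is None:
        return None
    extra_requires = get_requires(config_settings)
    if extra_requires is NotImplemented:
        return None
    # Phase 2: install extra_requires into the build environment
    # (omitted here).
    # Phase 3: build the sdist and return the resulting basename.
    result = build(sdist_directory, config_settings)
    return None if result is NotImplemented else result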

get_requires_for_build_sdist

def get_requires_for_build_sdist(config_settings):
    ...

Computes any additional requirements needed for build_sdist.

Returns: a list of strings containing PEP 508 dependency specifications, or NotImplemented.

Execution environment: everything specified by the build-system.requires key in pyproject.toml.

Example:

def get_requires_for_build_sdist(config_settings):
    return ["cython"]

Or if there are no additional requirements beyond those specified in pyproject.toml:

def get_requires_for_build_sdist(config_settings):
    return []

build_sdist

def build_sdist(sdist_directory, config_settings):
    ...

Builds a .tar.gz source distribution and places it in the specified sdist_directory. sdist_directory MUST be an absolute path.

Returns: The basename (not the full path) of the new .tar.gz file, as a unicode string, or NotImplemented.

Execution environment: everything specified by the build-system.requires key in pyproject.toml and by the return value of get_requires_for_build_sdist.

Notes:

A .tar.gz source distribution (sdist) is named like {name}-{version}.tar.gz (for example: foo-1.0.tar.gz), and contains a single top-level directory called {name}-{version} (for example: foo-1.0), which contains the source files of the package. This directory must also contain the pyproject.toml from the build directory, and a PKG-INFO file containing metadata in the format described in PEP 345. Although historically zip files have also been used as sdists, this hook should produce a gzipped tarball. This is already the more common format for sdists, and having a consistent format makes for simpler tooling, so build backends MUST generate .tar.gz sdists.

The generated tarball should use the modern POSIX.1-2001 pax tar format, which specifies UTF-8 based file names. This is not yet the default for the tarfile module shipped with Python 3.6, so backends using the tarfile module need to explicitly pass format=tarfile.PAX_FORMAT.
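
For example, a backend using the tarfile module might create the archive like this:

import tarfile

with tarfile.open("foo-1.0.tar.gz", "w:gz",
                  format=tarfile.PAX_FORMAT) as tar:
    # Single top-level directory named {name}-{version}, as required.
    tar.add("foo-1.0")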

Building a wheel

The interface for building a wheel is exactly analogous to that for building an sdist: the same three phases, the same interpretation of NotImplemented, etc., except of course that at the end it produces a wheel instead of an sdist.

get_requires_for_build_wheel

def get_requires_for_build_wheel(config_settings):
    ...

Computes any additional requirements needed for build_wheel.

Returns: a list of strings containing PEP 508 dependency specifications, or NotImplemented.

Execution environment: everything specified by the build-system.requires key in pyproject.toml.

Example:

def get_requires_for_build_wheel(config_settings):
    return ["wheel >= 0.25", "setuptools"]

build_wheel

def build_wheel(wheel_directory, config_settings):
    ...

Builds a .whl binary distribution, and places it in the specified wheel_directory. wheel_directory MUST be an absolute path.

Returns: the basename (not the full path) of the new .whl, as a unicode string, or NotImplemented.

Execution environment: everything specified by the build-system.requires key in pyproject.toml and by the return value of get_requires_for_build_wheel.

Note: If you unpack an sdist named {name}-{version}.tar.gz, and then build a wheel from it, then the resulting wheel MUST be named {name}-{version}-{compat-info}.whl.

Extensions

Particular frontends and backends MAY coordinate to define additional hooks beyond those described here, but they MUST NOT claim top-level attributes on the build backend object to do so; these attributes are reserved for future PEPs. Backends MAY provide an extensions dict, and the semantics of the object at BACKEND.extensions["XX"] can be defined by the project that owns the name XX on PyPI. For example, the pip project could choose to define extension hooks like:

BACKEND.extensions["pip"].get_wheel_metadata

or:

BACKEND.extensions["pip"]["prepare_build_files"]
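
A frontend aware of a particular extension might probe for it defensively, since the extensions dict itself is optional (a sketch using the dict-style access from the second example; the hook name is hypothetical):

# BACKEND is the resolved build backend object.
extensions = getattr(BACKEND, "extensions", {})
pip_hooks = extensions.get("pip", {})
prepare_build_files = pip_hooks.get("prepare_build_files")
if prepare_build_files is not None:
    prepare_build_files()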

Recommendations for build frontends (non-normative)

A build frontend MAY use any mechanism for setting up a build environment that meets the above criteria. For example, simply installing all build-requirements into the global environment would be sufficient to build any compliant package -- but this would be sub-optimal for a number of reasons. This section contains non-normative advice to frontend implementors.

A build frontend SHOULD, by default, create an isolated environment for each build, containing only the standard library and any explicitly requested build-dependencies. This has two benefits:

  • It allows for a single installation run to build multiple packages that have contradictory build-requirements. E.g. if package1 build-requires pbr==1.8.1, and package2 build-requires pbr==1.7.2, then these cannot both be installed simultaneously into the global environment -- which is a problem when the user requests pip install package1 package2. Or if the user already has pbr==1.8.1 installed in their global environment, and a package build-requires pbr==1.7.2, then downgrading the user's version would be rather rude.
  • It acts as a kind of public health measure to maximize the number of packages that actually do declare accurate build-dependencies. We can write all the strongly worded admonitions to package authors we want, but if build frontends don't enforce isolation by default, then we'll inevitably end up with lots of packages on PyPI that build fine on the original author's machine and nowhere else, which is a headache that no-one needs.

However, there will also be situations where build-requirements are problematic in various ways. For example, a package author might accidentally leave off some crucial requirement despite our best efforts; or, a package might declare a build-requirement on foo >= 1.0 which worked great when 1.0 was the latest version, but now 1.1 is out and it has a showstopper bug; or, the user might decide to build a package against numpy==1.7 -- overriding the package's preferred numpy==1.8 -- to guarantee that the resulting build will be compatible at the C ABI level with an older version of numpy (even if this means the resulting build is unsupported upstream). Therefore, build frontends SHOULD provide some mechanism for users to override the above defaults. For example, a build frontend could have a --build-with-system-site-packages option that causes the --system-site-packages option to be passed to virtualenv-or-equivalent when creating build environments, or a --build-requirements-override=my-requirements.txt option that overrides the project's normal build-requirements.

The general principle here is that we want to enforce hygiene on package authors, while still allowing end-users to open up the hood and apply duct tape when necessary.

Comparison to competing proposals

The primary difference between this and competing proposals (in particular, PEP 516) is that our build backend is defined via a Python hook-based interface rather than a command-line based interface.

We do not expect that this will, by itself, intrinsically reduce the complexity of calling into the backend, because build frontends will in any case want to run hooks inside a child process -- this is important to isolate the build frontend itself from the backend code and to better control the build backend's execution environment. So under both proposals, there will need to be some code in pip to spawn a subprocess and talk to some kind of command-line/IPC interface, and there will need to be some code in the subprocess that knows how to parse these command line arguments and call the actual build backend implementation. So this diagram applies to all proposals equally:

+-----------+          +---------------+           +----------------+
| frontend  | -spawn-> | child cmdline | -Python-> |    backend     |
|   (pip)   |          |   interface   |           | implementation |
+-----------+          +---------------+           +----------------+

The key difference between the two approaches is how these interface boundaries map onto project structure:

.-= This PEP =-.

+-----------+          +---------------+    |      +----------------+
| frontend  | -spawn-> | child cmdline | -Python-> |    backend     |
|   (pip)   |          |   interface   |    |      | implementation |
+-----------+          +---------------+    |      +----------------+
                                            |
|______________________________________|    |
   Owned by pip, updated in lockstep        |
                                            |
                                            |
                                 PEP-defined interface boundary
                               Changes here require distutils-sig


.-= Alternative =-.

+-----------+    |     +---------------+           +----------------+
| frontend  | -spawn-> | child cmdline | -Python-> |    backend     |
|   (pip)   |    |     |   interface   |           | implementation |
+-----------+    |     +---------------+           +----------------+
                 |
                 |     |____________________________________________|
                 |      Owned by build backend, updated in lockstep
                 |
    PEP-defined interface boundary
  Changes here require distutils-sig

By moving the PEP-defined interface boundary into Python code, we gain three key advantages.

First, because there will likely be only a small number of build frontends (pip, and... maybe a few others?), while there will likely be a long tail of custom build backends (since these are chosen separately by each package to match their particular build requirements), the actual diagrams probably look more like:

.-= This PEP =-.

+-----------+          +---------------+           +----------------+
| frontend  | -spawn-> | child cmdline | -Python+> |    backend     |
|   (pip)   |          |   interface   |        |  | implementation |
+-----------+          +---------------+        |  +----------------+
                                                |
                                                |  +----------------+
                                                +> |    backend     |
                                                |  | implementation |
                                                |  +----------------+
                                                :
                                                :

.-= Alternative =-.

+-----------+          +---------------+           +----------------+
| frontend  | -spawn+> | child cmdline | -Python-> |    backend     |
|   (pip)   |       |  |   interface   |           | implementation |
+-----------+       |  +---------------+           +----------------+
                    |
                    |  +---------------+           +----------------+
                    +> | child cmdline | -Python-> |    backend     |
                    |  |   interface   |           | implementation |
                    |  +---------------+           +----------------+
                    :
                    :

That is, this PEP leads to less total code in the overall ecosystem. And in particular, it reduces the barrier to entry of making a new build system. For example, this is a complete, working build backend:

# mypackage_custom_build_backend.py
import os.path
import pathlib
import tarfile

def get_requires_for_build_wheel(config_settings):
    return ["wheel"]

def build_wheel(wheel_directory, config_settings):
    from wheel.archive import archive_wheelfile
    filename = "mypackage-0.1-py2.py3-none-any"
    path = os.path.join(wheel_directory, filename)
    # archive_wheelfile appends the ".whl" suffix to the given path
    archive_wheelfile(path, "src/")
    return filename + ".whl"

def _exclude_hidden_and_special_files(archive_entry):
    """Tarfile filter to exclude hidden and special files from the archive"""
    if archive_entry.isfile() or archive_entry.isdir():
        if not os.path.basename(archive_entry.name).startswith("."):
            return archive_entry
    return None

def get_requires_for_build_sdist(config_settings):
    return []

def build_sdist(sdist_directory, config_settings):
    sdist_subdir = "mypackage-0.1"
    sdist_path = pathlib.Path(sdist_directory) / (sdist_subdir + ".tar.gz")
    with tarfile.open(str(sdist_path), "w:gz",
                      format=tarfile.PAX_FORMAT) as sdist:
        # Tar up the whole directory, minus hidden and special files
        sdist.add(os.getcwd(), arcname=sdist_subdir,
                  filter=_exclude_hidden_and_special_files)
    return sdist_subdir + ".tar.gz"

Of course, this is a terrible build backend: it requires the user to have manually set up the wheel metadata in src/mypackage-0.1.dist-info/; when the version number changes it must be manually updated in multiple places... but it works, and more features could be added incrementally. Much experience suggests that large successful projects often originate as quick hacks (e.g., Linux -- "just a hobby, won't be big and professional"; IPython/Jupyter -- a grad student's $PYTHONSTARTUP file), so if our goal is to encourage the growth of a vibrant ecosystem of good build tools, it's important to minimize the barrier to entry.

Second, because Python provides a simpler yet richer structure for describing interfaces, we remove unnecessary complexity from the specification -- and specifications are the worst place for complexity, because changing specifications requires painful consensus-building across many stakeholders. In the command-line interface approach, we have to come up with ad hoc ways to map multiple different kinds of inputs into a single linear command line (e.g. how do we avoid collisions between user-specified configuration arguments and PEP-defined arguments? how do we specify optional arguments? when working with a Python interface these questions have simple, obvious answers). When spawning and managing subprocesses, there are many fiddly details that must be gotten right, subtle cross-platform differences, and some of the most obvious approaches -- e.g., using stdout to return data for the build_requires operation -- can create unexpected pitfalls (e.g., what happens when computing the build requirements requires spawning some child processes, and these children occasionally print an error message to stdout? obviously a careful build backend author can avoid this problem, but the most obvious way of defining a Python interface removes this possibility entirely, because the hook return value is clearly demarcated).

In general, the need to isolate build backends into their own process means that we can't remove IPC complexity entirely -- but by placing both sides of the IPC channel under the control of a single project, we make it much cheaper to fix bugs in the IPC interface than if fixing bugs requires coordinated agreement and coordinated changes across the ecosystem.

Third, and most crucially, the Python hook approach gives us much more powerful options for evolving this specification in the future.

For concreteness, imagine that next year we add a new build_wheel2 hook, which replaces the current build_wheel hook with something that adds new features (for example, the ability to build multiple wheels from the same source tree). In order to manage the transition, we want it to be possible for build frontends to transparently use build_wheel2 when available and fall back onto build_wheel otherwise; and we want it to be possible for build backends to define both methods, for compatibility with both old and new build frontends.

Furthermore, our mechanism should also fulfill two more goals: (a) If new versions of e.g. pip and flit are both updated to support the new interface, then this should be sufficient for it to be used; in particular, it should not be necessary for every project that uses flit to update its individual pyproject.toml file. (b) We do not want to have to spawn extra processes just to perform this negotiation, because process spawns can easily become a bottleneck when deploying large multi-package stacks on some platforms (Windows).

In the interface described here, all of these goals are easy to achieve. Because pip controls the code that runs inside the child process, it can easily write it to do something like:

command, backend, args = parse_command_line_args(...)
if command == "build_wheel":
    if hasattr(backend, "build_wheel2"):
        backend.build_wheel2(...)
    elif hasattr(backend, "build_wheel"):
        backend.build_wheel(...)
    else:
        ...  # error handling

In the alternative where the public interface boundary is placed at the subprocess call, this is not possible -- either we need to spawn an extra process just to query what interfaces are supported (as was included in an earlier draft of PEP 516, an alternative to this), or else we give up on autonegotiation entirely (as in the current version of that PEP), meaning that any changes in the interface will require N individual packages to update their pyproject.toml files before any change can go live, and that any changes will necessarily be restricted to new releases.

Evolutionary notes

A goal here is to make it as simple as possible to convert old-style sdists to new-style sdists. (E.g., this is one motivation for supporting dynamic build requirements.) The ideal would be that there would be a single static pyproject.toml that could be dropped into any "version 0" VCS checkout to convert it to the new shiny. This is probably not 100% possible, but we can get close, and it's important to keep track of how close we are... hence this section.

A rough plan would be: create a build system package (setuptools_pypackage or whatever) that knows how to speak whatever hook language we come up with, and converts the hooks into calls to setup.py. This will probably require some sort of hooking into or monkeypatching of setuptools, to provide a way to extract the setup_requires= argument when needed, and to provide a new version of the sdist command that generates the new-style format. This all seems doable and sufficient for a large proportion of packages (though obviously we'll want to prototype such a system before we finalize anything here). (Alternatively, these changes could be made to setuptools itself rather than going into a separate package.)

But there remain two obstacles that mean we probably won't be able to automatically upgrade packages to the new format:

  1. There currently exist packages which insist on particular packages being available in their environment before setup.py is executed. This means that if we decide to execute build scripts in an isolated virtualenv-like environment, then projects will need to check whether they do this, and if so then when upgrading to the new system they will have to start explicitly declaring these dependencies (either via setup_requires= or via static declaration in pyproject.toml).
  2. There currently exist packages which do not declare consistent metadata (e.g. egg_info and bdist_wheel might get different install_requires=). When upgrading to the new system, projects will have to evaluate whether this applies to them, and if so they will need to stop doing that.

Rejected and deferred features

A number of potential extra features were discussed beyond the above. For the most part the decision was made that it was better to defer trying to implement these until we had more experience with the basic interface, and to provide a minimal extension interface (the extensions dictionary) that will allow us to prototype these features before standardizing them. Specifically:

  • Editable installs: This PEP originally specified another hook, install_editable, to do an editable install (as with pip install -e). It was removed due to the complexity of the topic, but may be specified in a later PEP.

    Briefly, the questions to be answered include: what reasonable ways exist of implementing an 'editable install'? Should the backend or the frontend pick how to make an editable install? And if the frontend does, what does it need from the backend to do so?

  • Getting wheel metadata from a source tree without building a wheel: it's believed that when pip adds a backtracking constraint solver for package dependencies, it may be useful to add a hook to query a source tree to get metadata about the wheel that it would generate, if it were asked to build a wheel. Specifically, the kind of situation where it's anticipated that this might come up is:

    1. Package A depends on B and C==1.0
    2. B is only available as an sdist
    3. We fetch the sdist for the latest version of B, build it into a wheel, and then discover that it depends on C==1.5, which means that it isn't compatible with this version of A.
    4. We fetch the sdist for the latest-but-one version of B, build it into a wheel, and then discover that it depends on C==1.4, which means that it isn't compatible with this version of A.
    5. We fetch the sdist for the latest-but-two version of B...

    The idea would be that we could reduce (but not eliminate) the cost of steps 3, 4, 5, ... if there were a way to query a build backend to find out the requirements without actually building a wheel, which is a potentially expensive operation.

    Of course, these repeated fetches are expensive no matter what we do, so the ideal solution would be to provide wheels for B, so that none of this needs to be done at all. And for many packages (for example, pure Python packages), building a wheel is nearly as cheap as fetching the metadata. And building a wheel also has the advantage of giving us something we can store in the wheel cache for next time. But perhaps this is still a good idea for packages that are particularly slow to build (for example, complex packages like scipy or qt).

    It was eventually decided to defer this for now, since it adds non-trivial complexity for build backends (the metadata fetching phase and the wheel building phase run at different times, yet have to produce consistent results), and until pip's backtracking resolver is actually implemented, we're only guessing at the value of this optimization and the exact semantics it will require.

  • A specialized hook for copying a source tree into a new source tree: in certain cases, like when installing directly from a local VCS checkout, pip prefers to copy the source tree to a temporary directory before building it. This provides some protection against build systems that can give incorrect results when repeatedly building in the same tree. Historically, pip has accomplished this copy using a simple shutil.copytree, but this causes various problems, like copying large git checkouts or intermediate artifacts from previous in-place builds. In the future, therefore, pip might move to a multi-step process like:

    1. Create an sdist from the VCS checkout
    2. Unpack this sdist into a temporary directory.
    3. Build a wheel from the unpacked sdist.
    4. Install the wheel.

    Even better, this provides some guarantee that going from VCS checkout → sdist → wheel will produce identical results to going directly from VCS checkout → wheel.

    However, this runs into a potential problem: what if this particular combination of source tree + build backend can't actually build an sdist? (For example, flit may have this limitation for certain trees unpacked from sdists.) Therefore, we considered adding an optional hook like prepare_temporary_tree_for_build_wheel that would copy the required source files into a specified temporary directory.

    But:

    • Such a hook would add non-trivial complexity to this spec: it requires us to promote the idea of an "out of tree build" to a first class concept, and specify which kinds of trees are required to support which operations, etc.
    • A major motivation for doing the build-sdist-unpack-sdist dance in the first place is that we don't trust the backend code to produce the same result when building from a VCS checkout as when building from an sdist, but if we don't trust the backend then it seems odd to add a special hook that puts the backend in charge of doing the dance.
    • If sdist creation is unsupported, then pip can fall back on a shutil.copytree strategy in just a few lines of code.
    • And in fact, for the one known case where this might be a problem (unpacked sdist using flit), shutil.copytree is essentially optimal.
    • Though in fact for flit, this is still a pointless expense – doing an in-place build is perfectly safe and even more efficient.
    • Plus projects using flit always have wheels, so this will essentially never even come up in the first place.
    • And pip hasn't even implemented the sdist optimization for legacy setup.py-based projects yet, so we have no operational experience to refer to and it might turn out there are some unknown-unknowns that we'll want to take into account before standardizing an optimization for it here.

    And since this would be an optional hook anyway, it's just as easy to add later once the exact parameters are better understood.

  • There was some discussion of extending these hooks to allow a single source tree to produce multiple wheels. But this is a complex enough topic that it clearly requires its own PEP.
  • We also discussed making the wheel and sdist hooks build unpacked directories containing the same contents as their respective archives. In some cases this could avoid the need to pack and unpack an archive, but this seems like premature optimisation. It's advantageous for tools to work with archives as the canonical interchange formats (especially for wheels, where the archive format is already standardised). Close control of archive creation is important for reproducible builds. And it's not clear that tasks requiring an unpacked distribution will be more common than those requiring an archive.

Todo:

  • Is the use of unicode for paths on Python 2 going to cause horrible brokenness for people on Unix with non-UTF8 locales? AFAIK py2 doesn't have surrogate-escape :-(. What if the absolute path to the source tree cannot be represented as unicode, for example because it starts /home/<some KOI-8>/...?
  • The process working dir is the source tree. Does this mean that it is automatically on sys.path, or not? We should pick one. Probably it should not be, and neither should random other directories, so frontends should do something like sys.path.pop(0) before calling any hooks.
  • Should there be some way to import a backend from the source tree? Perhaps a pyproject.toml key that names a source-tree relative directory that should be absolutified and then added to sys.path?

    If we do this we should probably recommend frontends set PYTHONDONTWRITEBYTECODE=1

  • Finish the discussion about re-using the same process to call multiple hooks.
  • Either copy in the rationale for NotImplemented from here, or replace with NotImplementedError if we go that way.
  • Add some discussion of in-place/out-of-place – to the "deferred for future PEP" section if nothing else.

Copyright

This document has been placed in the public domain.

Local Variables:
mode: indented-text
indent-tabs-mode: nil
sentence-end-double-space: t
fill-column: 70
coding: utf-8
End:
