
@mcg1969
Last active May 5, 2017 15:59
Conda hackery: virtual hotfixing via build groups

Conda proposal: virtual hotfixing via build groups

In this note, I propose a modification to conda and to its implied package specification that will provide, in my view, a solution to a common pitfall of the conda ecosystem, and will provide additional performance benefits as well.

Motivation

With conda, as with many package ecosystems, it is often necessary for a package to place version constraints on its dependencies to ensure that they interoperate properly. In practice, such constraints must balance two competing concerns:

  • if the constraints are specified too tightly, then packages will often be unable to coexist, because their version constraints will conflict. For example, if package A depends on B <2.0, and package C depends on B >2.0, then packages A, B, and C cannot coexist. The tighter the constraints, the more likely this is to happen.
  • on the other hand, if the constraints are specified too loosely, then packages will often break if one of their dependencies ceases to be compatible upon update. For instance, in the example above, if A truly did require B <2.0 but did not enforce it, then A would break if B 2.0 were installed.

It is my view that the latter risk is the preferable one. Nevertheless, it is important that the conda ecosystem provide a means to correct issues that arise when it is discovered that a version incompatibility exists.

The problem: broken dependency metadata

Bundled with a conda package is a dictionary of metadata, including but not limited to its name, version, build string, build number, and dependency requirements. It is the dependency information that is causing the issues above. The dependency metadata has become inaccurate or broken, and it needs to be corrected somehow.

It is important to note that it is not sufficient to simply construct a new, corrected package, and serve it alongside the older, broken ones. Due to the way conda operates, there will be a variety of situations where the broken package is selected in spite of the presence of the newer one. Furthermore, users who have already installed the offending package still risk breaking their working environments by installing incompatible dependencies.

For more detail on these issues, I refer the reader to this comment in the conda issue tracker. Indeed, this is but one of many discussions that have been had about this problem.

Anaconda's current solution: metadata hotfixes

For the maintainers of the Anaconda Python distribution and the "defaults" conda channel, it has been clear for some time that an active solution to this dependency challenge is necessary. The current approach is to employ metadata hotfixing, in which a package's metadata is modified, and the package is rebuilt. Unfortunately, this results in changes to the package's MD5 signature, prompting deserved consternation (as one can see in the discussion thread linked to above). Patching the metadata is also a somewhat tedious process.

Despite these issues, I remain convinced that metadata hotfixes are the best approach for conda users. Typical users do not concern themselves with MD5 signatures; they care that their functioning conda environments don't break.

The improvement: build groups and "virtual" hotfixes

In this note, I propose a modification to conda and to its implied package specification that provides the benefits of the hotfixing approach without some of its problems. What I propose is to allow newer builds of a package to effectively "hotfix" older siblings of the same package. But we must be very careful to identify what a package's true "siblings" are, hence the need to formally define a concept I am calling build groups.

Build strings

When conda build constructs packages, it uses a filename convention that combines the package name, version, and build string. For instance, the package numpy-1.11.2-py27_1.tar.bz2 represents a numpy package, version 1.11.2, build string py27_1.

We would be right to guess that the build string py27_1 denotes that this is a Python 2.7 build of NumPy, and that its build number is 1. But in fact, conda actually ascribes no semantic content to the build string. Internally, it serves only one purpose: to make the filename unique---e.g., to distinguish it from the Python 3.5 version. It could easily have been called, say, numpy-1.11.2-YUSDXS.tar.bz2, and conda would treat the package no differently. That is not to say that conda ignores the build number or the Python 2.7 version dependency; rather, it pulls them from their dedicated metadata fields instead of the build string.
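As a rough illustration of the filename convention described above, the sketch below splits a package filename into its three parts. The helper name `split_package_filename` is hypothetical and is not conda's own parser; it assumes the `.tar.bz2` extension and that neither the version nor the build string contains a hyphen, which is what allows splitting from the right.

```python
def split_package_filename(filename, extension=".tar.bz2"):
    """Split 'name-version-build.tar.bz2' into (name, version, build_string).

    Hypothetical helper for illustration only; not part of conda.
    """
    if not filename.endswith(extension):
        raise ValueError("unexpected extension: %r" % filename)
    base = filename[:-len(extension)]
    # Package names may themselves contain hyphens (e.g. scikit-learn),
    # so split off the last two fields from the right.
    name, version, build_string = base.rsplit("-", 2)
    return name, version, build_string


print(split_package_filename("numpy-1.11.2-py27_1.tar.bz2"))
print(split_package_filename("numpy-1.11.2-YUSDXS.tar.bz2"))
```

Both filenames parse equally well, which mirrors the point above: the build string is an opaque label, and nothing in the parsing depends on it looking like `py27_1`.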

With vanishingly few exceptions, however, all conda packages do adopt a certain convention: the build string ends with an underscore followed by the build number. Furthermore, the portion of the build string preceding that underscore is constructed programmatically from key dependencies of the package, like the Python version. The reason this is so commonplace is, of course, that it is the convention that conda build uses; but again, as far as base conda is concerned, it is not a standard.

Build groups: a spec

We now propose to elevate this de facto standard to something more official, by giving conda permission to depend in a limited way upon the structure in the build string, and to alter the way that it computes package solutions because of it. In particular, I propose that we define build groups in two steps:

  1. The build stub of a conda package is given by the following function of its build string and build number:
import re
def build_stub(build_string, build_number):
    match = re.match(r'^(.*)_([0-9]+)$', build_string)
    if not match:
        return build_string
    stub, num = match.groups()
    return stub if num == str(build_number) else build_string
  2. Packages are said to belong to the same build group if they share the same name, version, and build stub. So for instance, numpy-1.11.2-py27_1 and numpy-1.11.2-py27_2 are in the same build group, but numpy-1.11.2-py27_1 and numpy-1.11.2-py35_1 are not.

Note in particular that any package that does not obey the <stub>_<build number> convention is treated as if it is in its own build group. This effectively means that the behavior we are about to propose does not apply to such packages; conda behaves as it always has in those cases.
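To make the two-step definition concrete, here is a minimal sketch that computes a build group key from a package record. The `Package` tuple and the `build_group` function are illustrative names assumed for this example, not existing conda structures; `build_stub` is repeated from the definition above so the snippet runs on its own.

```python
import re
from collections import namedtuple

# Illustrative record type; conda's real metadata has many more fields.
Package = namedtuple("Package", "name version build_string build_number")


def build_stub(build_string, build_number):
    match = re.match(r'^(.*)_([0-9]+)$', build_string)
    if not match:
        return build_string
    stub, num = match.groups()
    return stub if num == str(build_number) else build_string


def build_group(pkg):
    """Return the (name, version, build stub) key identifying pkg's group."""
    return (pkg.name, pkg.version, build_stub(pkg.build_string, pkg.build_number))


py27_1 = Package("numpy", "1.11.2", "py27_1", 1)
py27_2 = Package("numpy", "1.11.2", "py27_2", 2)
py35_1 = Package("numpy", "1.11.2", "py35_1", 1)
odd = Package("numpy", "1.11.2", "YUSDXS", 1)  # does not follow the convention

print(build_group(py27_1) == build_group(py27_2))  # same group
print(build_group(py27_1) == build_group(py35_1))  # different groups
print(build_group(odd))  # the whole build string serves as the stub
```

A build string whose numeric suffix disagrees with the build number metadata (say, `py27_1` with build number 3) is also left whole, so such a package ends up alone in its group.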

I would also propose that future versions of conda build support the ability to specify the build stub instead of the entire build string, and auto-generate the build string by combining the stub with the build number. But that is not necessary here.
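The auto-generation suggested here could be as simple as the following sketch. `make_build_string` is a hypothetical helper, not an existing conda build feature; the loop checks that strings produced this way round-trip through `build_stub` as defined earlier.

```python
import re


def build_stub(build_string, build_number):
    match = re.match(r'^(.*)_([0-9]+)$', build_string)
    if not match:
        return build_string
    stub, num = match.groups()
    return stub if num == str(build_number) else build_string


def make_build_string(stub, build_number):
    """Hypothetical helper: join a stub and a build number per the convention."""
    return "%s_%d" % (stub, build_number)


for number in range(3):
    build_string = make_build_string("py27", number)
    # Every generated string recovers the original stub.
    assert build_stub(build_string, number) == "py27"
    print(build_string)
```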

Virtual metadata hotfixes: a spec

Armed with this definition, we are now prepared to propose the following virtual hotfixing behavior:

  1. Among packages in the same build group, the package with the highest build number dictates the dependency behavior of the entire group. So for example, if numpy-1.11.2-py27_1 and numpy-1.11.2-py27_2 have different dependency information, conda uses only the information from the latter. On the other hand, numpy-1.11.2-py35_2 has no influence on the dependency behavior of numpy-1.11.2-py27_1, because they do not belong to the same build group.

What this means is this: if a package is discovered to have broken dependency information or other metadata, the channel maintainer need only issue a new build of the package, with an incremented build number and the corrected metadata. Once this package is in place, conda will effectively ignore the dependency information for the older packages, replacing it with the newer information. Thus we have accomplished the "virtual hotfixing" we seek, but without the need to remove or alter existing packages.
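The rule can be sketched as a pass over a channel's package records before solving. This is only an illustration under assumptions: the `Record` tuple, the `depends` field layout, the example dependency strings, and `apply_virtual_hotfixes` are all made up for this example and do not reflect how conda stores or processes its index.

```python
import re
from collections import namedtuple

# Illustrative record; 'depends' is a tuple of dependency spec strings.
Record = namedtuple("Record", "name version build_string build_number depends")


def build_stub(build_string, build_number):
    match = re.match(r'^(.*)_([0-9]+)$', build_string)
    if not match:
        return build_string
    stub, num = match.groups()
    return stub if num == str(build_number) else build_string


def group_key(rec):
    return (rec.name, rec.version, build_stub(rec.build_string, rec.build_number))


def apply_virtual_hotfixes(records):
    """Return copies of records whose dependencies come from the
    highest-numbered build in their build group."""
    leaders = {}
    for rec in records:
        key = group_key(rec)
        if key not in leaders or rec.build_number > leaders[key].build_number:
            leaders[key] = rec
    # The original records are left untouched; only the view changes.
    return [rec._replace(depends=leaders[group_key(rec)].depends) for rec in records]


index = [
    Record("numpy", "1.11.2", "py27_1", 1, ("python 2.7*", "mkl")),
    Record("numpy", "1.11.2", "py27_2", 2, ("python 2.7*", "mkl 11.3.*")),
    Record("numpy", "1.11.2", "py35_2", 2, ("python 3.5*", "mkl 11.3.*")),
]
for rec in apply_virtual_hotfixes(index):
    print(rec.build_string, rec.depends)
```

In this made-up index, `py27_1` picks up the corrected dependencies of `py27_2`, while `py35_2` is unaffected because it belongs to a different build group. The package files themselves are never modified.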

Other consequences

This rule has other benefits as well:

  1. This provides a simple, straightforward way for all channel maintainers to correct metadata issues in their channels. In fact, some maintainers probably issue new "metadata correction" builds already, wrongly assuming that this fixes such issues correctly.
  2. The approach that the Anaconda distribution takes to metadata hotfixes modifies the MD5 signature of the package, which is undesirable for many users (as the linked discussion above highlights). Such signature changes will not be necessary.
  3. Conda is far more likely to recommend the latest build of a package. In certain unusual but non-contrived scenarios, it is possible for conda to prefer an older build of a package because its dependency requirements are somehow "more" compatible with the goals of the overall installation recipe. With this fix, this simply isn't possible, since all of the builds will be equally compatible.
  4. It will allow conda to reduce the number of packages that it must consider during the solution process. All older builds within a build group (that is, builds sharing the same name, version, and build stub) can be removed from consideration unless they are pinned in another package's dependencies. This should result in faster solve times.
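The pruning described in the last item might look like the sketch below. It is an assumption-laden illustration: `Record`, `prune_older_builds`, and the representation of pins as (name, version, build string) tuples are invented here, and a real solver would have to extract pins from dependency specs itself.

```python
import re
from collections import namedtuple

Record = namedtuple("Record", "name version build_string build_number")


def build_stub(build_string, build_number):
    match = re.match(r'^(.*)_([0-9]+)$', build_string)
    if not match:
        return build_string
    stub, num = match.groups()
    return stub if num == str(build_number) else build_string


def group_key(rec):
    return (rec.name, rec.version, build_stub(rec.build_string, rec.build_number))


def prune_older_builds(records, pinned=frozenset()):
    """Keep the newest build of each build group, plus any build that is
    explicitly pinned as a (name, version, build_string) tuple."""
    newest = {}
    for rec in records:
        key = group_key(rec)
        if key not in newest or rec.build_number > newest[key].build_number:
            newest[key] = rec
    return [
        rec for rec in records
        if newest[group_key(rec)] is rec
        or (rec.name, rec.version, rec.build_string) in pinned
    ]


candidates = [
    Record("numpy", "1.11.2", "py27_0", 0),
    Record("numpy", "1.11.2", "py27_1", 1),
    Record("numpy", "1.11.2", "py27_2", 2),
    Record("numpy", "1.11.2", "py35_1", 1),
]
print([r.build_string for r in prune_older_builds(candidates)])
print([r.build_string for r in prune_older_builds(
    candidates, pinned={("numpy", "1.11.2", "py27_0")})])
```

Packages that do not follow the build string convention each form their own group, so they always survive the pruning.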

Conclusion

I do not believe it will be difficult to implement this behavior. And if desired, we can make it optional---e.g., with a configuration setting---until it has been thoroughly studied in practice.
