Skip to content

Instantly share code, notes, and snippets.

@pradyunsg
Last active February 7, 2020 19:56
Show Gist options
  • Star 6 You must be signed in to star a gist
  • Fork 2 You must be signed in to fork a gist
  • Save pradyunsg/5cf4a35b81f08b6432f280aba6f511eb to your computer and use it in GitHub Desktop.
Save pradyunsg/5cf4a35b81f08b6432f280aba6f511eb to your computer and use it in GitHub Desktop.
@pradyunsg's GSoC 2017 proposal

Adding Proper Dependency Resolution to pip

About Me

I was introduced to Python about five years ago, through "Core Python Programming" by Wesley J Chun. After the initial struggle with Python 2/3, the ball was set rolling and hasn't stopped since. I have fiddled around with Game Programming (PyGame), Computer Vision (OpenCV), Data Analytics (Pandas, SciPy, NumPy), transcompilers (Py2C) and more.

As a high school student, I did my internship at Enthought in 2013 and 2014. The two summers that I spent at Enthought were a great learning experience and I thoroughly enjoyed the environment there.

Other than Python, I have also used C, C++, Web Technologies (JavaScript, HTML, CSS) and preprocessors (Pug, TypeScript, LESS/SASS/SCSS), Java and Bash/Zsh for some other projects.

Curious to understand how pip works, I began looking into pip's source code. I started contributing to pip in May 2016. I am now fairly familiar with the design of pip and have a adequate understanding of how it works, due to the various contributions I have made to pip in the past year.

Mentors

  • Donald Stufft (@dstufft on GitHub)
  • Justin Cappos (@JustinCappos on GitHub)

Communication with the mentors will be mostly done on issues and pull requests on pip's GitHub repository. If at any point in time, a real time discussion is to be done with the mentors, it would be conducted over IRC or Skype. Email can also be used if needed.

Proposal

This project is regarding improving dependency resolution performed within pip by implementing a dependency resolver within it.

Abstract

Currently, pip does not resolve dependencies correctly when there are conflicting requirements. The lack of dependency resolution has caused hard-to-diagnose bugs/failures due to the installation of incompatible packages. The lack of a dependency resolver is also a blocker for various other features - adding an upgrade-all functionality to pip and properly determining build-time dependencies for packages are two such features.

Deliverables

At the end of this project, pip will have the ability to:

  • detect requirement conflicts
  • resolve conflicting dependency requirements where possible

Implementation

The implementation language will be Python. The code will maintain the compatibility requirements of pip - the same source code will support the multiple Python implementations and versions, including but not limited to, CPython 2.7, CPython 3.3+, PyPy 2 and PyPy 3.

New Tests for the functionality introduced will be added to pip's current test suite.

User documentation would not be a major part of this project. The only change needed is to mention, in certain sections, that pip can now resolve dependencies properly even when there are conflicts and update the same sections accordingly.

Overview

The project will be composed of the following stages:

  1. Refactor the dependency resolution logic of pip into a separate module.
  2. Implement dependency information caching in pip.
  3. Implement a backtracking dependency resolver to resolve the dependencies.

Every stage depends on the previous ones being completed. This step-wise approach would make incremental improvements so that there is a constant feedback on the work being done as well as making it easy to change course without losing existing work; if needed for some unforeseen reason.

Discussion

There is a class in pip - RequirementSet, which is currently a God class. It is responsible for the following:

  1. Holding a list of things to be installed
  2. Downloading Files
  3. Dependency Resolution
  4. Unpacking downloaded files (preparation for installation)
  5. Uninstalling packages
  6. Installing packages

This is clearly a bad situation since this is most of the heavy lifting involved in pip. These responsibilities need to be separated and moved out into their independent modules/classes, to allow for simplification of RequirementSet while providing a clean platform for the remaining work. This is the most tricky portion of this project, given the complexity of RequirementSet as it stands today.

There are two kinds of distributions that may be used to install a package - wheels (binary) and sdists (source). When installing a package, pip builds a wheel from an sdist and then proceeds to install the wheel. The difference between the two formats of distribution relevant to this project is - wheels store the information about dependencies within them statically; sdists do not.

Determining the dependencies of a wheel distribution is merely the matter of fetching the information from a METADATA file within the .whl file. The dependency information for an sdist, on the other hand, can only be determined after running its setup.py file on the target system. This means that dependencies of an sdist depend on how its setup.py behaves; which can change due to variations on target systems or could even contain through random choices.

Computation of an sdist's dependencies on the target system is a time-consuming task since it potentially involves fetching a package from PyPI and executing it's setup.py to get the dependency information. In order to improve performance, once an sdist's dependencies are computed, they would be stored in a cache so that during dependency resolution, the dependencies of an sdist are not computed every time they are needed. Further, pip caches wheels it downloads or builds. This means that any built or downloaded wheel's dependency information would be available statically, from the wheel that has been cached, without the need to fetch from the sdist dependency cache.

Like the wheel cache, sdist-dependency-cache will be a file system based cache. The sdist-dependency-cache would only be used if the corresponding sdist is being used.

Since sdist dependencies are non-deterministic, the cached dependency information is potentially incorrect - in certain corner cases such as using random choices in setup.py files. Such uses are not seen as important enough to cater to, compared the benefits of having a cache. Further, this is already the case with the information in the wheel cache.

SAT solver based resolution is not feasible for pip since a SAT solver needs the entire set of packages and their dependencies to compute a solution, which cannot be generated due to the aforementioned non-deterministic behaviour of setup.py file. Computing dependency information for all of PyPI on a target system for installing a package is simply not feasible.

The most reasonable approach is using a backtracking solver. Such a solver can be incrementally provided information about the dependencies of a package and would only need dependency information about packages in the dependency graph of the current system.

There is a need to keep a cache of visited packages during dependency resolution. A certain package-version combination may be reached via multiple paths and it is an inefficient use of computation time to re-compute that whether it is indeed going to satisfy the requirements or not. By storing information about which package-version combinations have been visited and do (or do not) satisfy the constraints, it is possible to speedup the resolution.

Consider the following example:

A-1.0 (depends: B)
A-2.0 (depends: B and E-1.0)
B-1.0 (depends: C and D)
C-1.0 (depends: D)
D-1.0
D-1.1 (depends: E-2.0)
E-1.0
E-2.0

If an installation of A is required, either A-2.0 or D-1.1 should not be installed because they have a conflicting requirement - E. While there are multiple possible solutions to this situation, the "most obvious" one us to use the D-1.0, instead of not installing A-2.0. Further, as multiple packages depend on D, the algorithm would "reach it" multiple times. By maintaining a cache for the visited packages, it is possible to achieve a speedup in such a scenario.

Details

Pull requests would be made on a regular basis during the project to ensure that the feedback loop is quick. This also reduces the possibilities of conflicts due to unrelated changes in pip.

All the code will be tested within pip's existing testing infrastructure. It has everything needed to write tests related to all the changes to be made. Every PR made to pip as a part of this project will contain new tests or modifications to existing ones as needed.

Stage 1

Initially, some abstractions will be introduced to the pip codebase to improve the reuse of certain common patterns within the codebase. This includes cleaner temporary directory management through a TempDirectory class.

RequirementSet.prepare_files and RequirementSet._prepare_file are downloading, unpacking packages as well as doing what pip does as dependency resolution. Taking these functions apart neatly is going to be a tricky task.

The following is a listing of the final modules that will be responsible for the various tasks that are currently being performed by RequirementSet:

  • pip.resolve - Dependency Resolution
  • pip.download - Downloading Files & Unpacking downloaded files
  • pip.req.req_set - Holding a list of things to be installed / uninstalled
  • pip.operations.install - Installing Packages (preparation for installation)
  • pip.operations.uninstall - Uninstalling Packages

To be able to proceed to the next step, only the dependency resolution related code needs to be refactored into a separate module. Other portions of RequirementSet are not required to be refactored but the same will be tackled as optional deliverables. In other words, only pip.resolve needs to be completed to be able to proceed to the next stage in this project. This is needed since in Stage 3, the resolver would be written in pip.resolve, independent of the rest of the codebase.

Stage 2

A new module pip.cache will be created. Within this module, all the caching will be handled. Thus, the code for the current wheel cache would be moved. The new code for a dependency cache for sdists would also be written here.

The new cache would hold all of an sdist's egg-info. The information will be stored on the file-system in a sub directory structure much like that of the wheel cache, in a directory structure based on hash of the sdist file holding the egg-info at the end. This will be done in a class named EggInfoCache.

EggInfoCache cache will be used only if a corresponding wheel for an sdist is not available. Installing an sdist results in the creation of a wheel which contains the dependency information, which would be used over the information available in the EggInfoCache.

To be able to proceed to the next step, it is required that EggInfoCache is implemented.

Stage 3

The module pip.resolve will be modified and a class named BacktrackingResolver will be added to it. The class does what you expect it to do - it would resolve dependencies with recursive backtracking. As described above, there will be some global state stored about the packages that have been explored. Other than the maintenance of global state, in the form of the cache, the rest of the algorithm will essentially follow the same structure as any backtracking algorithm.

The project would be complete when the aforementioned resolver is implemented.

Existing Work

There is existing work directly related to dependency resolution in pip, done by multiple individuals.

  • Robert Collins (un-merged closed pull request on pypa/pip)

    This has an incomplete backtracking dependency resolver and dependency caching.

  • Sebastien Awwad (branch on a separate fork)

    This was used for the depresolve project, to investigate the state of Dependency Resolution in PyPI/pip ecosystem.

  • pip-compile (separate project)

    This has a backtracking dependency resolver implemented to overcome pip's inablity to resolve dependencies.

Their work would be used for reference, where appropriate, during the course of the project. Further, there are many package managers which implement dependency resolution, which would also be looked into.

Tentative Timeline

  • Community Bonding Period: 5 May - 29 May

    • Clean up and finalize my existing pull requests.
    • Read existing literature regarding dependency resolution.
    • Inspect existing implementations of dependency resolvers.

    GOAL: Finish existing work to proceed for the work coming up.

  • Week 1: 30 May - 5 June

    • Introduce abstractions across pip's codebase to make refactoring RequirementSet easier.
    • Begin breaking down RequirementSet.prepare_file.
  • Week 2: 6 June - 12 June

    • Continue working on the refactor of RequirementSet.
  • Week 3: 13 June - 19 June

    • Continue working on the refactor of RequirementSet.
    • Finish and polish pip.resolve.

    GOAL: pip.resolve module will be ready, using the current resolution strategy.

  • Week 4: 20 June - 26 June

    • Finish and polish all work on the refactor of RequirementSet.
  • Week 5: 27 June - 3 July

    • Move code for WheelCache into a new pip.cache module.
    • Write tests for pip.cache.EggInfoCache, based on WheelCache.
    • Begin implementation of pip.cache.EggInfoCache.
  • Week 6: 4 July - 10 July

    • Finish and polish pip.cache.EggInfoCache.

    GOAL: A cache for storing dependency information of sdists would be ready to add to pip.

  • Week 7: 10 July - 16 July

    • Create a comprehensive set of tests for the dependency resolver. (in tests/unit/test_resolve.py)
    • Begin implementation on the backtracking algorithm.

    GOAL: A comprehensive test bed is ready for testing the dependency resolver.

  • Week 8: 17 July - 24 July

    • Complete a rough implementation of the backtracking algorithm

    GOAL: An implementation of a dependency resolver to begin running tests on and work on improving.

  • Week 9: 25 July - 31 July

    • Fixing bugs in dependency resolver
  • Week 10: 1 August - 6 August

    • Finish and polish work on dependency resolver

    GOAL: A ready-to-merge PR adding a backtracking dependency resolver for pip.

  • Week 11: 6 August - 13 August

    Buffer Week.

  • Week 12: 13 August - 21 August

    Buffer Week. Finalization of project for evaluation.

If the deliverable is achieved ahead of schedule, the remaining time will be utilized to resolve open issues on pip's repository in the order of priority as determined under the guidance of the mentors.

Other Commitments

I believe I will not be able to put in 40 hour/week for at most 3 weeks throughout the working period, due to the schedule of my university.

I will have semester-end examinations - from 10th May 2017 to 24th May 2017 - during the Community Bonding Period. My university will re-open for my third semester on 12 July 2017. I expect mid-semester examinations to be held in my University around 20th August 2017. During these times, I would not be able to put in full 40 hour weeks due to the academic workload.

I might take a 3-4 day break during this period, regarding which I would be informing my mentor around a week in advance.

I will be completely free from 30th May 2017 to 11 July 2017.

Future Work

There seems to be some interest in being able to reuse the above dependency resolution algorithm in other packaging related tools, specifically from the buildout project. I intend to eventually move the dependency resolution code that would come out of this project into a separate library to allow for reuse by installer projects - pip, buildout and other tools.

Previous Contributions to pip

(As on 12th March 2017)

Issues

Authored:

  • #3785 - Prefering wheel-based installation over source-based installation (Open)
  • #3786 - Make install command upgrade packages by default (Closed)
  • #3787 - Check if pip broke the dependency graph and warn the user (Open)
  • #3807 - Tests fail since change on PyPI (Closed)
  • #3809 - Switch to TOML for configuration files (Open)
  • #3871 - Provide a way to perform non-eager upgrades (Closed)
  • #4198 - Travis CI - pypy broken dues to dependency change in pycrypto (Closed)
  • #4282 - What's the release schedule? (Closed)

Participated:

  • #59 - Add "upgrade" and "upgrade-all" commands (Open)
  • #988 - Pip needs a dependency resolver (Open)
  • #1056 - pip install -U does not remember whether a package was installed with --user (Open)
  • #1511 - ssl certificate hostname mismatch errors presented badly (Open)
  • #1668 - Default to --user (Open)
  • #1736 - Create a command to make it easy to access the configuration file (Open)
  • #1737 - Don't tell the user what they meant, just do what they meant (Open)
  • #2313 - Automated the Creation and Upload of Release Artifacts (Open)
  • #2732 - pip install hangs with interactive setup.py setups (Open)
  • #3549 - pip -U pip fails (Open)
  • #3580 - Update requests/urllib3 (Closed)
  • #3610 - pip install from package from github, with github dependencies (Open)
  • #3788 - pip version suggested is older than the version which is installed (Closed)
  • #3789 - Error installing Mayavi in Mac OS X (Closed)
  • #3798 - On using python -m pip install -upgrade pip.. its throwing an error like the one below (Closed)
  • #3811 - no matching distribution found for install (Closed)
  • #3814 - pip could not find a version that satisfies the requirement oslo.context (Closed)
  • #3876 - support git refs in @ syntax (Open)
  • #4021 - Will you accept PRs with pep484 type hints? (Open)
  • #4087 - pip list produces error (Closed)
  • #4149 - Exception thrown when binary is already linked to /usr/local/bin (Open)
  • #4160 - Pip does not seem to be handling deep requirements correctly (Open)
  • #4162 - Let --find-links be context aware to support github, gitlab, etc. links (Open)
  • #4170 - pip list |head raises BrokenPipeError (Open)
  • #4182 - pip install should install packages in order to avoid ABI incompatibilities in compiled (Open)
  • #4186 - IOError: [Errno 13] Permission denied: '/usr/local/bin/pip' (Open)
  • #4206 - Where on Windows 10 is pip.conf or pip.ini located? (Closed)
  • #4221 - Feature request: Check if user has permissions before downloading files (Closed)
  • #4229 - "pip uninstall" is too noisy (Open)

Pull Requests

Authored:

  • #3806 - Change install command's default behaviour to upgrade packages by default (Closed, superseded by #3972)
  • #3808 - Fix Tests (Merged)
  • #3818 - Improve UX and tests of check command (Merged)
  • #3972 - Add an upgrade-strategy option (Merged)
  • #3974 - [minor] An aesthetic change to wheel command source (Merged)
  • #4192 - Move out all the config code to a separate module (Merged)
  • #4193 - Add the ability to autocorrect a user's command (Open)
  • #4199 - Fix Tests for Travis CI (Merged)
  • #4200 - Reduce the API exposed by the configuration class (Merged)
  • #4232 - Update documentation to mention upgrade-strategy (Merged)
  • #4233 - Nicer permissions error message (Open)
  • #4240 - [WIP] Add a configuration command (Open)

Participated:

  • #2716 - Issue #988: new resolver for pip (Closed)
  • #2975 - Different mock dependencies based on Python version (Merged)
  • #3744 - Add a "Upgrade all local installed packages" (Open)
  • #3750 - Add a pip check command. (Merged)
  • #3751 - tox.ini: Add "cover" target (Open)
  • #3794 - Use the canonicalize_name function for finding .dist-info (Merged)
  • #4142 - Optionally load C dependencies based on platform (Open)
  • #4144 - Install build dependencies as specified in PEP 518 (Open)
  • #4150 - Clarify default for --upgrade-strategy is eager (Merged)
  • #4194 - Allow passing a --no-progress-bar to the install script to surpress progress bar (Merged)
  • #4201 - Register req_to_install for cleanup sooner (Merged)
  • #4202 - Switch to 3.6.0 final as our latest 3.x release (Merged)
  • #4211 - improve message when installing requirements file (#4127) (Merged)
  • #4241 - Python 3.6 is supported (Merged)

References

  1. pypa/pip#988

    Tracking issue for adding a proper dependency resolver in pip. Contains links to various useful resources.

  2. pypa/pip#2716

    Prior work by Robert Collins for adding a proper dependency resolver in pip.

  3. Python Dependency Resolution

    A writeup by Sebastian Awwad on the current state of dependency resolution in pip and PyPI in general.

  4. PSF Application Template

    For guidance on how to write the application and what information is needed.

  5. Stork: Secure Package Management for VM Environments

    A Paper by Justin Cappos about Stork, used for reference regarding Backtracking Resolution

@erikrose
Copy link

setup.py files do fairly commonly change their dependencies based on what version of Python is running: 2.6, 2.7, 3, etc. One example is Let's Encrypt, whose canonical client, certbot, depends on argparse only in 2.6. pip has had bugs in the past due to overzealous wheel-caching. For example, pip would build an sdist under one version of Python, cache it as a wheel, and then attempt to install the wheel under a different version, which would mistakenly install (or omit) dependencies based on the original version's needs. So I urge you to not completely disregard setup.py conditionals. :-)

@greysteil
Copy link

So, so, so psyched to see this work getting done. Seriously cool.

I've been following your blog posts, and it looks like you're about to start work on the backtracking logic itself. If it's any help, I've been working with folks at Molinillo, which is the resolver used in Ruby's Bundler. The architecture file there might be useful, as might the integration specs which cover some gnarly examples. Happy to help if there's ever any way I can.

@pradyunsg
Copy link
Author

Sorry for the delay in responding. I never got notifications for these comments. o.o


Hi @erikrose! The sdist egg-info is cached using the cache_tag -- which means that it's as specific as a .pyc would be... There's a fallback when it's not available and it depends on the Python Implementation, Version and ABI version. You can see the implementation here. So, the conditionals are not completely disregarded. The only situation where they're disregarded is when they behave differently when run by the same Python interpreter in a slightly different environment - which seems reasonable since we'll be moving to completely declarative information anyway. :)


Hi @greysteil! Those specs are AWESOME and the architecture doc was definitely an interesting read! I'm definitely gonna pick up at least a few of those specs and use them. Thanks a ton! ^.^

@RDIL
Copy link

RDIL commented Nov 22, 2019

Can't wait :D

@pradyunsg
Copy link
Author

In case anyone's looking, the work following up on this GSoC project, is now a grant-funded project: https://wiki.python.org/psf/Pip2020DonorFundedRoadmap.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment