@tacaswell
Created February 11, 2022 00:42
Notes on conda "overlays"

Hot-fixing and extending conda environments

Introduction

We deploy root-owned conda environments as the basis of our data collection and analysis environments. Because they are owned by root they are write-protected, which ensures that users cannot accidentally break them; for the same reason, users cannot upgrade or extend them. While we want to run a stable, standard, well-understood software environment, we do need the ability to upgrade and extend it, both for development and for time-critical hot-fixes.

There are a number of possible solutions to this:

  1. use sudo to edit the environment in place (via conda, pip, or "by hand")
  2. create / clone the conda environment into user space
  3. use $PYTHONPATH and pip install --prefix to create "overlay" directories

Historically, we have primarily gone with options 1 and 2; however, they have significant downsides. Modifying the environment in place requires elevated privileges, and it is very hard to track after the fact what has been done.

This document lays out using "overlays" as a technique to both locally replace already installed packages and to add new packages for development.

General theory of operation

When you run import foo, Python goes through a process to find and load the requested module. An early step uses the import path to search locations on disk for the requested module. This path can be accessed via sys.path.

Installation tools typically place files in directories that Python searches by default, conventionally site-packages. In addition to being directly controllable from inside a Python process, the entries in sys.path can be controlled via the PYTHONPATH environment variable. When searching for an import, Python stops looking as soon as it finds the module, which lets you effectively shadow modules by putting their locations earlier in the path.

Taken together we can now do two things:

  1. install modules someplace we can write to as an unprivileged user
  2. use PYTHONPATH to tell Python to find our modules there
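
The shadowing behaviour described above can be seen in a small experiment. The following is a minimal sketch, not part of the deployment itself: the module name overlay_demo_mod and the two temporary directories are hypothetical stand-ins for a package, the host site-packages, and the overlay.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def write_module(directory, name, version):
    """Create a one-line module that records which copy was imported."""
    directory.mkdir(parents=True, exist_ok=True)
    (directory / f"{name}.py").write_text(f"VERSION = {version!r}\n")


def import_first_match(name, search_dirs):
    """Import `name` with `search_dirs` temporarily placed first on sys.path."""
    saved = list(sys.path)
    sys.modules.pop(name, None)    # forget any previously imported copy
    importlib.invalidate_caches()  # make the finders notice the new files
    try:
        sys.path[:0] = search_dirs
        return importlib.import_module(name)
    finally:
        sys.path[:] = saved
        sys.modules.pop(name, None)


with tempfile.TemporaryDirectory() as tmp:
    host = Path(tmp) / "host"        # stands in for the root-owned site-packages
    overlay = Path(tmp) / "overlay"  # stands in for the user-writable overlay
    write_module(host, "overlay_demo_mod", "1.0 (host)")
    write_module(overlay, "overlay_demo_mod", "2.0 (overlay)")

    # The overlay is listed first, so its copy wins and the host copy is shadowed.
    mod = import_first_match("overlay_demo_mod", [str(overlay), str(host)])
    print(mod.VERSION, "from", mod.__file__)
```

Swapping the order of the two directories makes the host copy win instead, which is why the overlay has to come earlier in the path than the environment's own site-packages.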

Location, location, location

From Python's point of view these extra files can be anywhere, however as a matter of policy we are going to use the location

/some/path/overlays/{env_name}/

as the prefix which means we will have to add the path

/some/path/overlays/{env_name}/lib/{python_version}/site-packages

to the PYTHONPATH.

Similarly, if the package contains anything that will be run from the shell, then

/some/path/overlays/{env_name}/bin

needs to be added to PATH by any mechanism.
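
As a rough sketch, the two directories above can be derived from the overlay root and the environment name. The helper names overlay_paths and overlay_environ are hypothetical, and the code assumes that {python_version} expands to pythonX.Y, which is the layout pip install --prefix produces on POSIX systems.

```python
import os
import sys
from pathlib import Path


def overlay_paths(overlay_root, env_name, version_info=sys.version_info):
    """Return the prefix, site-packages, and bin directories of an overlay.

    Assumes the POSIX layout {prefix}/lib/pythonX.Y/site-packages.
    """
    prefix = Path(overlay_root) / env_name
    python_dir = f"python{version_info[0]}.{version_info[1]}"
    return {
        "prefix": prefix,
        "site_packages": prefix / "lib" / python_dir / "site-packages",
        "bin": prefix / "bin",
    }


def overlay_environ(paths, base_env=None):
    """Return a copy of the environment with the overlay prepended."""
    env = dict(os.environ if base_env is None else base_env)
    for var, key in (("PYTHONPATH", "site_packages"), ("PATH", "bin")):
        existing = env.get(var)
        parts = [str(paths[key])] + ([existing] if existing else [])
        env[var] = os.pathsep.join(parts)
    return env


paths = overlay_paths("/some/path/overlays", "demo_env")
for name, value in paths.items():
    print(f"{name}: {value}")
```

The overlay entries are prepended rather than appended so that they are searched before anything already on PYTHONPATH or PATH.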

Install a new package for development

To install a new package into our overlay directory, we use pip's --prefix flag:

$ conda activate {env_name}
$ pip install --prefix=/some/path/overlays/{env_name} ...

Any dependencies that are already installed in the host environment will be picked up (conda provides the metadata that pip needs to agree a package is installed) and any missing dependencies will be installed alongside your requested package. All standard pip command line flags and arguments should work as expected.

To access the packages you need to arrange for the site-packages directory in the overlay to be added to the PYTHONPATH / sys.path.
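
One way to check that the arrangement works is to start a fresh interpreter with PYTHONPATH set and look at where the overlay lands in its sys.path. This is only a sketch: a temporary directory stands in for the overlay's site-packages directory, and child_sys_path is a hypothetical helper name.

```python
import json
import os
import subprocess
import sys
import tempfile


def child_sys_path(pythonpath_entry):
    """Start a fresh interpreter with PYTHONPATH set and return its sys.path."""
    env = dict(os.environ)
    existing = env.get("PYTHONPATH")
    env["PYTHONPATH"] = os.pathsep.join(
        [pythonpath_entry] + ([existing] if existing else [])
    )
    result = subprocess.run(
        [sys.executable, "-c", "import json, sys; print(json.dumps(sys.path))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)


with tempfile.TemporaryDirectory() as overlay_site_packages:
    paths = child_sys_path(overlay_site_packages)
    # Compare resolved paths so that symlinked temp directories still match.
    resolved = [os.path.realpath(p) for p in paths]
    overlay_index = resolved.index(os.path.realpath(overlay_site_packages))
    print("overlay is entry", overlay_index, "of", len(paths))
```

Entries from PYTHONPATH are placed ahead of the interpreter's own site-packages directory, which is what lets the overlay shadow packages in the host environment.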

Upgrade an existing package

If we want to upgrade an existing package using this technique, the above will fail: as part of the installation process pip will (rather sensibly) attempt to uninstall any existing version of the package, and because our host environment requires elevated privileges to modify, the uninstall fails. To upgrade a package we need to additionally pass the -I (--ignore-installed) flag, which tells pip to ignore the already installed packages and so avoids the permissions error. However, this also means that pip is no longer aware of the already installed dependencies! To avoid re-installing all of the dependencies along with the target package, we use the --no-deps flag to tell pip not to install them. Thus:

$ conda activate {env_name}
$ pip install \
    -I --no-deps \
    --prefix=/some/path/overlays/{env_name} \
    ...
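
The two invocations (new package versus upgrade) differ only in the extra flags, so they could be wrapped in a small helper. This is a sketch under assumptions: pip_overlay_command and some-package are hypothetical names, and running pip as sys.executable -m pip assumes the script is started with the Python of the activated conda environment.

```python
import shlex
import sys


def pip_overlay_command(prefix, requirement, upgrade_existing=False):
    """Build the pip command line for installing into an overlay prefix.

    With upgrade_existing=True the package already exists in the read-only
    host environment, so pip is told to ignore installed packages and to
    skip dependencies, as described above.
    """
    cmd = [sys.executable, "-m", "pip", "install", f"--prefix={prefix}"]
    if upgrade_existing:
        cmd += ["--ignore-installed", "--no-deps"]
    cmd.append(requirement)
    return cmd


prefix = "/some/path/overlays/demo_env"
print(shlex.join(pip_overlay_command(prefix, "some-package")))
print(shlex.join(pip_overlay_command(prefix, "some-package", upgrade_existing=True)))
```

The helper only builds the argument list; passing it to subprocess.run(cmd, check=True) would perform the actual installation.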

timhoffm commented Feb 15, 2022

Nice write-up!

A comment on "Upgrade an existing package": One has to be aware that this will fail if the updated version needs updated dependencies as well.

I'm trying out a different approach for a similar goal:

Task: "Get an environment that matches a standard reference, but can be modified in certain aspects"

Steps:

  • Create a spec file using conda list --explicit (usually stored in a central place)
  • Optionally create a modified spec by stripping out some packages you don't want.
  • Recreate the environment locally from the spec file. Note that this is fast because everything is pinned and the solver does not have to run.
  • Modify the local environment through additional conda or pip commands.

This can and should be hidden in a script controlled via a configuration file.
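
As an illustration of the "strip out some packages" step, here is a minimal sketch of a filter over the text produced by conda list --explicit. The function names, the example URLs, and their build strings are made up for illustration; the code assumes each package line is a URL whose file name has the form name-version-build followed by .tar.bz2 or .conda.

```python
def package_name(spec_line):
    """Return the package name from one URL line of an explicit spec, else None."""
    line = spec_line.strip()
    if not line or line.startswith("#") or line.startswith("@"):
        return None  # comment, blank line, or the @EXPLICIT marker
    # Drop an optional "#<md5>" suffix, then keep only the file name.
    filename = line.split("#", 1)[0].rsplit("/", 1)[-1]
    for ext in (".tar.bz2", ".conda"):
        if filename.endswith(ext):
            filename = filename[: -len(ext)]
            break
    parts = filename.rsplit("-", 2)  # name, version, build
    return parts[0] if len(parts) == 3 else None


def strip_packages(spec_text, unwanted):
    """Return the spec text without the lines for the unwanted package names."""
    kept = [
        line for line in spec_text.splitlines()
        if package_name(line) not in unwanted
    ]
    return "\n".join(kept) + "\n"


# Illustrative spec; the URLs and build strings are not real packages.
example_spec = """\
# platform: linux-64
@EXPLICIT
https://conda.anaconda.org/conda-forge/linux-64/numpy-1.22.2-py39h0000000_0.tar.bz2
https://conda.anaconda.org/conda-forge/noarch/some-package-0.1.0-pyh0000000_0.conda
"""

print(strip_packages(example_spec, {"some-package"}), end="")
```

The filtered text can be written to a file and passed to conda create --name {env_name} --file with that file's path to recreate the environment locally.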

Advantages

  • You can add packages via conda and pip
  • You can remove packages
  • You can have multiple baseline specs
  • You don't need to have all baselines available as envs. The spec file is sufficient, which e.g. makes it easy to version the baseline.
  • You get regular conda environments and don't need path acrobatics
  • The environments are decoupled from the baseline: A change in the baseline cannot break the env.

Disadvantages

  • You create conda environments locally.
  • For pip, similar limitations hold as in the --prefix solution (but you could strip a package out of the spec file and so get away without needing -I).
  • The environments are decoupled from the baseline: An update in the baseline is not reflected in the env.

The last point can become a problem if users hold on to their environments for a long time. A key point here is that environments are disposable and can be easily recreated from the config file. So to update to a new state of the baseline, you simply delete the env and create it anew from the config. One could even introduce monitoring and notification if the baseline changes, but that may be overkill.
