@mcg1969
Last active November 12, 2021 14:28
Conda hackery: namespaces

Conda Proposal: namespaces

Motivation

We would like to position Conda as a language-agnostic package manager, but at present it maintains a distinct bias towards Python. Given its origins this was expected and, frankly, reasonable. Nevertheless, as we begin to use it to subsume other packaging ecosystems, such as CRAN, NPM, Ruby Gems, etc., we are going to want to overcome this history; and one key challenge is to address naming conflicts across platforms.

Our first attempt to incorporate a separate ecosystem involved R. Our solution to the naming conflict issue was to prepend an r- prefix to all of the R packages. In my view this is an aesthetically displeasing solution, likely to be made worse if we continue this practice with other ecosystems; e.g., node-, ruby-, etc. We have long discussed the need to implement namespaces to address this properly, but it has necessarily been a lower priority.

Conda Forge began to see this issue as well. They made a preliminary decision to solve this problem by prepending an ecosystem prefix to every package that merits one, including Python. Unfortunately, they began adding this prefix even to packages that were already in defaults, a move certain to cause genuine confusion for users. They've pulled back from this approach for now, but ultimately it points out the fact that a namespace solution of some sort is needed.

Core principles

What I'd like to do here is to outline some core principles that ought to govern the design of the namespace solution, and some proposed solutions to address those principles. For the sake of discussion, we are going to pretend that the R packages are not prefixed with r-.

Principle 1: clear syntax for explicit namespace specification

While much of our design here is oriented towards minimizing the need for users to worry about namespaces, there will clearly be occasions where explicitly specifying them is necessary. We need a syntax to do so that is easy to read and consistent across the conda command line, meta.yaml files for conda build, and environment.yml files for conda env, etc.

What is more, with conda 4.1 we introduced the notion of channel prioritization, and we need to enhance this by allowing people to explicitly specify a channel for a given package. Therefore we should decide on these syntaxes together, so that they can easily work together.

Based on the discussions with Sean and Ilan, I propose we do this:

channel/namespace:package

The channel and namespace entries are optional, so we could have these combinations as well:

namespace:package          channel/package

*For the purposes of this document, we are going to utilize this syntax, but it is not necessarily final.*
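To make the proposed syntax concrete, here is a minimal parsing sketch. It is only an illustration: the `Spec` tuple and `parse_spec` function are hypothetical names, not part of conda, and the sketch assumes that channel names contain no colons (so URL-style channels are out of scope).

```python
from typing import NamedTuple, Optional


class Spec(NamedTuple):
    channel: Optional[str]    # None when no channel was given
    namespace: Optional[str]  # None = not specified, "" = bare colon
    name: str


def parse_spec(text: str) -> Spec:
    """Split 'channel/namespace:package' into its three parts."""
    channel = None
    namespace = None
    if ":" in text:
        # Everything before the last colon is '[channel/]namespace'.
        prefix, name = text.rsplit(":", 1)
        if "/" in prefix:
            channel, namespace = prefix.rsplit("/", 1)
        else:
            namespace = prefix
    elif "/" in text:
        # 'channel/package' with no namespace.
        channel, name = text.rsplit("/", 1)
    else:
        name = text
    if not name:
        raise ValueError(f"missing package name in {text!r}")
    return Spec(channel, namespace, name)
```

For example, `parse_spec("r:digest")` yields a namespace of `"r"` and no channel, while `parse_spec("digest")` leaves both the channel and the namespace unspecified.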

Principle 2: namespace names should be obvious

For example, consider a package name digest. There are Python and R packages with this package name. The Python package would be uniquely named python:digest, and the R package would be named r:digest.

Whenever possible, the name of the namespace should be identical to its "anchor" package like this. For instance, a NodeJS conda namespace should be simply node, and not npm. We propose to go further and require that every namespace be given a name identical to its anchor package. What if a namespace simply doesn't have a logical anchor package? We suspect that this situation is rare. And yet, if it arises, we propose that an artificial anchor package be created for that purpose.

A package will be considered a member of a namespace if it includes any version of that anchor package as a dependency ~~at any level of its dependency tree~~ (see EDIT 1 below).

One interesting consequence of this definition is that it will be possible for the same package to live in more than one namespace. For instance, Continuum's build of rpy2 includes both R and Python as dependencies, so it will resolve as both r:rpy2 and python:rpy2.

There will also be packages that, arguably, do not live in a larger ecosystem. This would include things like standalone executables, C libraries, and other packages that are readily used across multiple ecosystems. For such packages, we propose that the namespace be empty, and that a namespace-explicit reference to such a package would simply involve a bare colon. For instance, :graphviz refers to the standalone GraphViz package, whereas python:graphviz refers to the Python package with the same name. Note that this global namespace does not have an anchor package; it is the only exception to that rule.
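The membership rule can be sketched in a few lines. This is an illustration under assumptions, not conda code: the `ANCHORS` set and the `namespaces_of` helper are hypothetical, dependencies are given as conda-style strings such as `"python >=2.7"`, and only direct dependencies are inspected.

```python
ANCHORS = {"python", "r", "node"}  # hypothetical set of known anchor packages
GLOBAL = ""                        # the empty, global namespace


def namespaces_of(depends):
    """Return the set of namespaces a package belongs to.

    `depends` lists the package's direct dependencies, each a name
    optionally followed by a version constraint.
    """
    # Any version of an anchor counts, so only the name is compared.
    found = {dep.split()[0] for dep in depends} & ANCHORS
    # No anchor among the dependencies: the package is global.
    return found or {GLOBAL}
```

With these assumptions, a package depending on both `python` and `r` (as rpy2 does) lands in both namespaces, and a plain C library with no anchor dependency lands in the global one.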

OPEN ISSUE: are there instances where this dependency-based approach to namespace resolution fails? Should we offer a facility to override this determination? Are there consequences if a package maintainer accidentally drops an anchor dependency, or does so maliciously?

Principle 3: determining the active namespaces in an environment should be straightforward

We seek to ensure that in most cases, it will not be necessary to supply explicit namespaces every time a package is installed. For this to hold, we need to readily determine which namespaces are "active" in a given environment.

The basic rule is this: the list of active namespaces in a given environment is determined by the set of anchor packages installed, or requested to be installed, in that environment. So if python is installed, the python: namespace is active; if R is installed, the r: namespace is active.

When determining this namespace list during a conda install command, any anchor packages included in the specs should be included. So for instance, if an environment has Python installed and the user does conda install r ..., the namespace list should include both Python and R for the purposes of namespace resolution.

For the purposes of this rule, we propose to treat the global namespace specially: it is never included in the namespace list _unless_ there are no other active namespaces. Given the principles outlined below for resolving namespaces, we suspect that this will very rarely be an issue.
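As a rough sketch of this rule, assuming a hypothetical hardcoded `ANCHORS` set (how that set is actually specified is the open question that follows), the active namespaces could be computed like this:

```python
ANCHORS = {"python", "r", "node"}  # hypothetical; see the open question
GLOBAL = ""                        # the empty, global namespace


def active_namespaces(installed, requested=()):
    """Namespaces active for an environment plus the current install specs.

    Both arguments are iterables of bare package names.
    """
    # Anchors already installed and anchors named in the specs both count.
    active = (set(installed) | set(requested)) & ANCHORS
    # The global namespace is active only when no other namespace is.
    return active or {GLOBAL}
```

So an environment containing Python, combined with a request that includes r, would have both python and r active, while a brand new environment with no anchors requested falls back to the global namespace alone.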

OPEN QUESTION: how do we specify the set of namespaces to consider? Do we hardcode a few, like Python, R, and Lua? How do we ensure that these namespaces are properly treated if the user has a non-standard set of conda channels?

Principle 4: when there is no name conflict, an explicit namespace should never be needed

If a package has a unique name across all namespaces, it should never be necessary to explicitly attach a namespace to the package to retrieve it. So for instance, installing pyomo should never require a python: prefix, even if Python is not yet installed.

It may seem that this should go without saying, but indeed it isn't true now. For instance, Continuum now prepends all R packages with the r- prefix. For the purposes of this document, such prefixing is simply a weak version of the namespacing we are proposing to do here, and it fails this principle as a result.

Principle 5: when there is no name conflict among active namespaces, an explicit namespace should never be needed

If a user is working entirely with Python packages, that user should not be forced to specify a prefix just because the package happens to have the same name as an R package, a Perl package, or a Node package.

To illustrate this core principle, consider the package name digest, which exists in both the Python and R namespaces. We argue that the following behavior should occur:

  • conda create -n newenv python digest, or conda install digest applied to an environment containing Python and not R, should install only python:digest.
  • conda create -n newenv r digest, or conda install digest applied to an environment containing R and not Python, should install only r:digest.

Principle 6: when ambiguity cannot be resolved, favor convenience for the user

What do we do if both Python and R are in the environment? We propose that conda install both packages:

  • conda create -n newenv r python digest, or conda install digest applied to an environment containing both R and Python, should install both python:digest and r:digest.

Is it possible that the user did not intend this? Absolutely. But this may reveal useful information to the user; for instance, they may have been unaware that an R version existed. And of course, they can always reject the installation and re-do it with a namespace prefix if they so desire.
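Principles 4 through 6 can be read as one small resolution procedure for a bare package name. The sketch below is illustrative only: `INDEX` is a hypothetical mapping from package name to the namespaces that offer it, and raising an error when nothing matches is an assumption, since the proposal does not say what should happen in that case.

```python
# Hypothetical index: package name -> namespaces that provide it.
INDEX = {
    "pyomo": {"python"},
    "digest": {"python", "r"},
}


def resolve(name, active, index=INDEX):
    """Return the set of (namespace, name) pairs a bare name expands to."""
    available = index.get(name, set())
    if not available:
        raise KeyError(f"no package named {name!r}")
    if len(available) == 1:
        # Principle 4: a globally unique name never needs a namespace.
        chosen = available
    else:
        # Principle 5: only active namespaces compete.
        # Principle 6: if several remain, select all of them.
        chosen = available & set(active)
    if not chosen:
        raise LookupError(f"{name!r} is ambiguous; specify a namespace")
    return {(namespace, name) for namespace in chosen}
```

With Python alone active, `resolve("digest", {"python"})` selects only the Python package; with both Python and R active it selects both, which is the convenience-first behavior described above.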

Here is a more complex scenario. GraphViz is a popular tool for creating visualizations of graphs. It is a standalone tool, not a Python package. But there are a variety of Python packages which utilize it, including a PyPI package with the name graphviz. This unfortunate naming choice results in confusion for Python users. For instance, conda install graphviz gives you the global Graphviz package, but not the Python module. On the other hand, pip install graphviz gives you the Python module, but not the global package, which it needs to operate properly.

For the purposes of this example, let's assume there is an R package with the name graphviz as well, with the same concomitant ambiguity concerns. Thus we now have three packages with the same name: python:graphviz, r:graphviz, and :graphviz for the global case. We propose the following behaviors:

  • conda create -n newenv python graphviz, or conda install graphviz applied to an environment containing Python and not R, would install python:graphviz. This would then install :graphviz by dependency, so that the Python package would be fully functioning.
  • conda create -n newenv r graphviz, or conda install graphviz applied to an environment containing R and not Python, would install r:graphviz, and the :graphviz package by dependency.
  • conda create -n newenv r python graphviz, or conda install graphviz applied to an environment containing both R and Python, would install both python:graphviz and r:graphviz; and again, by dependency, :graphviz as well.
  • conda create -n newenv graphviz should create an environment containing nothing but :graphviz.

Suppose for the sake of argument, however, that python:graphviz and r:graphviz did not depend on the global package. For instance, perhaps both packages vendor the original Graphviz. As a result, neither one of these packages would have a :graphviz dependency. In this case, the only scenario where :graphviz would be installed is the last one, due to the principle that the global namespace is considered active only if no others are. If the user wishes to override this behavior, then they can use explicit namespacing; i.e., conda install graphviz :graphviz.

Principle 7: require affirmative ambiguity resolution when building packages

We have already discussed a scenario where a package can have a dependency with the same name from another namespace; e.g., python:graphviz depends on :graphviz. The corresponding conda-build recipe for python:graphviz might include these run requirements:

  requirements:
    run:
      - python
      - :graphviz

The desire to make things convenient for the conda user should not necessarily extend to the conda-build user. How strict must we be when parsing the run requirements?

Ideally, we would like to define the logic here such that most packages will continue to build properly, unmodified. Obviously, when there are no naming conflicts across namespaces, this will not be an issue. If the anchor packages present in the run requirements are enough to eliminate ambiguity, that should be sufficient as well. But what we want to avoid is the selection of multiple packages due to namespace ambiguity. In effect, Principles 4 and 5 should still hold for package building, but not Principle 6.

I believe it may be sufficient to assume that every dependency without an explicit namespace is a member of the same namespace as the package itself.

Implementation issues

The package index

Obviously, we can no longer rely on the simple principle that two packages with the same name shall not be installed in the same environment. This must be modified: two packages with the same name and namespace shall not be installed in the same environment. Thus python:graphviz, r:graphviz, and :graphviz may be installed alongside each other.

The groups member variable in conda.Resolve is a dictionary with package name for keys and package keys (channel, filename) for values. This dictionary would now need to be modified so that its keys are (namespace, package name) pairs; or perhaps a dictionary of dictionaries, with the outer key being the namespace, the inner being the package name.
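A minimal sketch of the first option, keying the dictionary on (namespace, package name) pairs, might look like the following. The record layout and filenames are invented for illustration and do not reflect conda's actual index format.

```python
from collections import defaultdict

# Hypothetical records: (namespace, name, channel, filename).
RECORDS = [
    ("python", "graphviz", "defaults", "python-graphviz-example.tar.bz2"),
    ("r", "graphviz", "defaults", "r-graphviz-example.tar.bz2"),
    ("", "graphviz", "defaults", "graphviz-example.tar.bz2"),
]


def build_groups(records):
    """Group package keys by (namespace, name) instead of by name alone."""
    groups = defaultdict(list)
    for namespace, name, channel, filename in records:
        groups[(namespace, name)].append((channel, filename))
    return dict(groups)


groups = build_groups(RECORDS)
```

Because the three graphviz packages now occupy distinct keys, the uniqueness rule can be enforced per key rather than per name.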

EDIT 1: Determining a package's namespace from its dependencies

Originally, we considered that a package's namespace would be determined by the presence of an anchor package "at any level of its dependency tree." However, it seems clear that we should limit our view to top-level dependencies for several reasons.

  1. It ensures that the namespace for a package is entirely in control of the package maintainer. Otherwise, the namespace could possibly change if an anchor package were added or removed as a downstream dependency.
  2. The process of determining a package's namespace can now be accomplished entirely locally, without a tree search algorithm.
  3. We want to support situations where a package exists in one namespace but depends incidentally on packages from another.
  4. Conda Build is similarly restrictive when constructing its build strings. If you want py27, py35, etc. added to the build string (or py in the case of a noarch package), then python has to be in the run dependencies. So package builders are likely adhering to this convention anyway.

Let us expand on point 3 here. Suppose there exists a package that behaves as a standalone executable, and hence should live in the global namespace, but depends on Python as its execution engine, and therefore requires Python as a dependency. In my view, this points to the fact that Python itself must sit in the global namespace. Packages in the python namespace will simply include python as a dependency, while packages that rely on Python as a dependency but are not in the namespace should include :python instead.

EDIT 2: Naming conflicts between anchor and non-anchor packages

Consider the Python package node. Presumably this is going to come into conflict with the logical name of the node namespace. I think that for the sake of clarity, we need to at least attempt to disallow such naming conflicts whenever possible.

However, we may be able to tolerate naming conflicts such as these if we adopt the condition that anchor packages live in the global namespace. Thus :node is the anchor package for node itself, while this package is python:node. If a user runs conda install node in an existing python environment, it will select python:node alone; if they meant to install :node, they will have to use the colon notation. On the other hand, when creating a new environment, conda create -n newenv python node would include both of the anchor packages.

This is a corner case that we may need to study with a first implementation.

@mcg1969
Author

mcg1969 commented Jul 24, 2016

@pelson @jakirkham @kalefranz @msarahan added a couple of EDIT sections at the end, for your comments.

Also: I haven't included this in the document yet, but I think I've figured out how the depends metadata needs to be treated with regards to namespace resolution. It should not be as liberal/inclusive as the command-line case, where conda install digest, for instance, installs all of the packages with that name for the active namespaces. Instead, a dependency should resolve to exactly one package. Furthermore, unless there is an explicit namespace specification, it should look first in the namespace corresponding to the package itself, and only if that fails look in the global namespace.

So, for instance, depends = ['python', 'graphviz'] will be resolved to [':python', 'python:graphviz'].

Interestingly, we could say that if a package refers to another package with the same name, it will be assumed to refer to the global namespace. So the dependencies for python:graphviz could indeed include graphviz, and this would resolve to :graphviz.
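Here is a rough sketch of that lookup order for the depends metadata. The `INDEX` set and the `resolve_dependency` helper are hypothetical, and the same-name shortcut from the previous paragraph is included as an assumption rather than a settled rule.

```python
GLOBAL = ""

# Hypothetical set of (namespace, name) pairs that exist in the index.
INDEX = {
    ("", "python"),
    ("", "graphviz"),
    ("python", "graphviz"),
}


def resolve_dependency(dep, pkg_namespace, pkg_name, index=INDEX):
    """Resolve one dependency of pkg_namespace:pkg_name to a single package."""
    if ":" in dep:
        # An explicit namespace is taken at face value.
        namespace, name = dep.split(":", 1)
        candidates = [(namespace, name)]
    elif dep == pkg_name:
        # A package cannot depend on itself, so look in the global namespace.
        candidates = [(GLOBAL, dep)]
    else:
        # Own namespace first, then the global namespace.
        candidates = [(pkg_namespace, dep), (GLOBAL, dep)]
    for namespace, name in candidates:
        if (namespace, name) in index:
            return f"{namespace}:{name}"
    raise LookupError(f"cannot resolve dependency {dep!r}")
```

For a hypothetical package `python:mypkg`, the dependencies `['python', 'graphviz']` come out as `[':python', 'python:graphviz']`, whereas the same list attached to `python:graphviz` itself would give `[':python', ':graphviz']`.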

@mcg1969
Author

mcg1969 commented Jul 24, 2016

One more thing. I'd like to make this such that all of this namespace resolution/disambiguation occurs outside of the Boolean/SAT logic. So for instance, conda create -n test python r graphviz would end up passing :python, :r, python:graphviz, and r:graphviz to the underlying solver, which in turn would not need to handle namespaces at all.
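A sketch of that pre-solver step, under the assumption that anchor packages live in the global namespace: the `ANCHORS` and `INDEX` tables and the `qualify` function are hypothetical, and the output is simply a list of fully qualified names that a namespace-unaware solver could treat as opaque strings.

```python
ANCHORS = {"python", "r"}  # hypothetical anchor packages

# Hypothetical index: package name -> namespaces that provide it.
INDEX = {
    "python": {""},
    "r": {""},
    "graphviz": {"", "python", "r"},
}


def qualify(specs, installed=()):
    """Expand bare command-line names into namespace-qualified names."""
    # Active namespaces come from installed and requested anchors;
    # the global namespace is active only if nothing else is.
    active = ((set(specs) | set(installed)) & ANCHORS) or {""}
    qualified = []
    for name in specs:
        available = INDEX[name]
        # A unique name needs no disambiguation; otherwise keep every
        # match among the active namespaces.
        chosen = available if len(available) == 1 else available & active
        qualified.extend(f"{namespace}:{name}" for namespace in sorted(chosen))
    return qualified
```

Here `qualify(["python", "r", "graphviz"])` returns `[':python', ':r', 'python:graphviz', 'r:graphviz']`, matching the example above.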

@croth1

croth1 commented Aug 6, 2016

@mcg1969 is there already a meta-issue out there on GitHub I can track?

@jakirkham

Sorry if I have missed the answer somewhere above, but there is another question that is bothering me. Suppose that conda/conda-build does not have some namespace that we want, how do we add a new namespace? Do we have to wait for it to make its way into those tools and a release? Or can we explicitly state a package belongs to a new namespace?

@goanpeca

goanpeca commented Oct 4, 2016

What would the UI for this look like? (Navigator / Conda - Manager)

@kalefranz

Cross Ref: conda/conda#3889

@asmeurer

asmeurer commented May 4, 2017

In the face of ambiguity, refuse the temptation to guess.

One thing that is missing here (unless I missed it) is just asking the user which package(s) she wants. Graphviz seems a bad example because there is an empty namespace package (which is confusing, why not just create some name for the default namespace like base or something), and a Python package that depends on it. So conda can at least attempt to be smart about things, especially since python-graphviz depends on graphviz anyway.

A better example would be an R package and a Python package that happen to share the same name, but aren't actually related. I couldn't find a concrete example, although I didn't do an extensive search. The ones that I thought would work like bokeh and ggplot are actually slightly different (bokeh vs. rbokeh and ggplot vs. ggplot2).

@mcg1969
Author

mcg1969 commented May 5, 2017

I'd be willing to move away from trying to be "super smart" in the face of a true ambiguity. It is far more important to me that we not bog down users with the requirement to make explicit choices when there really isn't an ambiguity. So for instance, if they have a Python-only environment, the existence of an R (say) package of the same name should never be an issue. But if they have both Python and R in the same environment, I think it would be fine to ask them to disambiguate.

I'm also open to giving the base environment a non-empty name.

@ChrisBarker-NOAA

ChrisBarker-NOAA commented Nov 2, 2017

Reviving this conversation:

Am I missing something or is this essentially a proposal to enhance conda itself?

In which case, we're pretty late to the game and need to move it forward!

And in the meantime, we still need a clear naming convention for conda-forge pre-namespace.

@ChrisBarker-NOAA

And one concern:

This seems like a system that would work great for folks that are primarily Python users, or primarily R users, or...

But if you are working on a project that, say, has:

  • some Python code
  • some R code
  • some Perl code
  • a node-based webserver that is serving it all up to an API

then this is just going to make it ugly to deal with.

Which makes me think that implicit namespaces in the package names may be cleaner and certainly easier.

Leaving the key question -- what to do about the existing python-heavy focus of package names?

One thought on that:

build a hack into conda where it will look for a "py_something" whenever someone searches for "something".

  • If only one exists, it gets installed
  • If both exist, and the py_something version is newer, it gets installed
  • if both exist, and they have the same version, or the py_ version is older, the user gets prompted with a question.

Anyway, probably problematic, it's just what I came up with on the spur of the moment.
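For what it's worth, the three rules above are easy to state in code. This is only a sketch of the heuristic as described: `pick_package` is a hypothetical helper, versions are assumed to be comparable tuples of integers, and a return value of `None` stands for "prompt the user".

```python
def pick_package(name, available):
    """Apply the py_ lookup heuristic.

    `available` maps existing package names to version tuples.
    Returns the package name to install, or None if the user must choose.
    """
    plain, prefixed = name, "py_" + name
    has_plain = plain in available
    has_prefixed = prefixed in available
    if not (has_plain or has_prefixed):
        raise KeyError(f"no package matches {name!r}")
    if has_plain != has_prefixed:
        # Only one of the two exists: install it.
        return plain if has_plain else prefixed
    if available[prefixed] > available[plain]:
        # Both exist and the py_ version is newer.
        return prefixed
    # Same version, or the py_ version is older: ask the user.
    return None
```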

@kalefranz

The conda 4.4 MatchSpec supports channel::package_name and channel/subdir::package_name syntax. The natural syntactic extension to include namespaces would be, in full, channel/subdir:namespace:package_name, with variations like */subdir:namespace:package_name and namespace:package_name.
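To illustrate how that extended form could be split apart, here is a small standalone sketch. It is not conda's MatchSpec implementation: `parse_extended_spec` is a hypothetical helper, it assumes channel names without colons (so URL channels are not handled), and it treats an empty namespace slot, as in channel::package_name, as "no namespace given".

```python
def parse_extended_spec(text):
    """Parse '[channel[/subdir]:][namespace:]package_name' into a dict."""
    parts = text.split(":")
    if len(parts) > 3 or not parts[-1]:
        raise ValueError(f"cannot parse {text!r}")
    channel = subdir = namespace = None
    if len(parts) == 3:
        # The first field is 'channel' or 'channel/subdir'; '*' is kept as is.
        channel, _, subdir = parts[0].partition("/")
        channel, subdir = channel or None, subdir or None
    if len(parts) >= 2:
        # An empty slot means the namespace was not specified.
        namespace = parts[-2] or None
    return {
        "channel": channel,
        "subdir": subdir,
        "namespace": namespace,
        "name": parts[-1],
    }
```

How an unspecified namespace would be told apart from an explicitly global one in this syntax is left open here.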

@encukou

encukou commented Mar 12, 2019

Hi!
I'm a packager for Fedora, where we're solving similar problems (but with more historical cruft). But here are some interesting things you might want to consider. Pardon me if I am too ignorant of Conda issues; you might well be going in a different direction entirely.

There will also be packages that, arguably, do not live in a larger ecosystem. This would include things like standalone executables, C libraries, and other packages that are readily used across multiple ecosystems.

You could think of those as ecosystems – but ones that don't work per-package.

Speaking in the proposed syntax, the :graphviz package could provide executable:dot, executable:circo, etc., and (at least on Linux) pkgconfig:libcdt, pkgconfig:libcgraph, pkgconfig:libgvc, etc. – and other packages could depend on those. You might even be able to generate some of the dependencies programmatically. (I'm not familiar with that specific ecosystem at all, so I don't know if, say, CMake can do it – but maybe we can agree it would be a useful feature.)

Fedora models Packages vs. the things they Provide as an N:N relationship. That's not easy to implement, but I fear that it's a better model for the reality.


Another thing is that some packages aren't published on the “normal” distribution channels. You might want to think about how to package a Python module that's not published on PyPI.
python doesn't sound like a well-defined namespace; pypi is. Same for node and npm. That doesn't prevent you from using python, of course. But you'll probably want to be very explicit about what that python means – especially if you plan to add any automation.
pip installs from PyPI only, so setup_requires will use the PyPI namespace (whatever you call it). If you want to package a non-PyPI Python module, it will need to have a name, and you'll not want to shadow a name that is or can be claimed on PyPI for something else.

@jakirkham

FWIW @mcg1969, @njsmith raised a similar idea on the Python discuss recently.

@mcg1969
Author

mcg1969 commented Apr 24, 2019

@jakirkham I sure wish gists had notifications, I only saw this because I got pointed to the Python discuss mention
