@shoyer
Last active May 24, 2018 19:10

I'll mention one other option that I've been contemplating recently, a bit of a hybrid of solutions 1 and 2:

  • We could build a new library with dispatchable functions inside NumPy itself, e.g., "numpy.api".

Functions in numpy.api would work just like those in numpy, with two critical differences (see the sketch after this list):

  1. They support overloading, via some to-be-determined mechanism.
  2. They don't coerce unknown types into NumPy arrays via np.array()/__array__().
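To make that concrete, here is a minimal sketch of what one such function could look like. Everything in it is illustrative: the numpy.api namespace is hypothetical, and the __numpy_api__ attribute is just a stand-in for whatever overloading mechanism we eventually settle on.

```python
import numpy as np

def sum(a, axis=None):
    """Sketch of a dispatchable numpy.api.sum."""
    # Difference 1: support overloading, deferring to the argument's own
    # implementation if it opts in to the (hypothetical) protocol.
    overload = getattr(type(a), '__numpy_api__', None)
    if overload is not None:
        return overload(a, func='sum', axis=axis)
    # Difference 2: never coerce unknown types via __array__(); require an
    # actual NumPy array instead of silently converting.
    if not isinstance(a, np.ndarray):
        raise TypeError('numpy.api.sum expects an ndarray or an object '
                        'implementing __numpy_api__, got %r' % type(a))
    return np.sum(a, axis=axis)
```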

This approach has a number of advantages over adjusting existing NumPy functions:

  • Backwards compatibility. For any particular NumPy function without overloads, there is assuredly existing code that relies on it always coercing to NumPy arrays. Every time we add a new overload in NumPy (e.g., np.any() recently), we've seen things break in downstream libraries like pandas. Even if we require downstream libraries to opt in (e.g., by implementing __array_ufunc__), that just pushes the breakage downstream.
  • Predictability. We can remove any uncertainty over whether a NumPy function supports dispatching. When using a library like dask or sparse, converting an object into a NumPy array is a big deal, but with the current state of affairs it is easy to do accidentally, which leads to bugs (see the first sketch after this list).
  • Cleaning up old APIs. NumPy currently does some ad-hoc dispatching based on attributes, e.g., np.mean(obj) calls obj.mean() if it has a 'mean' attribute (see the second sketch after this list). This sort of implicit overloading is unreliable and would be a mess to maintain alongside a new dispatch mechanism, but it's so widely used that we can't realistically remove it from the existing functions; a new namespace could simply leave it out.
  • Performance. Existing NumPy functions can remain specialized to NumPy arrays and won't need to check for overloading. (This is a minor point, but there are always people counting microseconds in tight loops, or we wouldn't even have scalar objects in NumPy.)
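Two short illustrations of the points above. First, the accidental coercion described under "Predictability": any object that defines __array__() is silently materialized into a dense array by functions like np.concatenate(). LazyArray here is just a stand-in for a dask- or sparse-style array.

```python
import numpy as np

class LazyArray:
    """Stand-in for a dask/sparse-style array that defines __array__()."""
    def __array__(self, dtype=None):
        print('materializing the full array!')  # potentially very expensive
        return np.asarray(np.arange(10 ** 6), dtype=dtype)

# No error and no warning: the lazy array is silently converted to a dense
# NumPy array, exactly the kind of accidental coercion numpy.api would refuse.
result = np.concatenate([LazyArray(), np.zeros(3)])
```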
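Second, the existing attribute-based dispatch described under "Cleaning up old APIs": np.mean() quietly defers to a .mean() method if the argument has one (the exact keyword arguments it forwards can vary between NumPy versions).

```python
import numpy as np

class Duck:
    # np.mean() forwards axis/dtype/out as keyword arguments, so the method
    # has to accept them even if it ignores them.
    def mean(self, axis=None, dtype=None, out=None):
        return 42.0

print(np.mean(Duck()))  # -> 42.0: Duck.mean() is called, no array coercion
```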

Most of these advantages would hold for an external library, too, but of course putting it inside NumPy itself helps solve the community coordination problem.
