I'll mention one other option that I've been contemplating recently, a bit of a hybrid of solutions 1 and 2:
- We could build a new library with dispatchable functions inside NumPy itself, e.g., `numpy.api`.
Functions in `numpy.api` work just like those in `numpy`, with two critical differences:
- They support overloading, via some to-be-determined mechanism.
- They don't coerce unknown types to `np.array()` using `__array__()`.
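To make the two differences concrete, here is a rough sketch of how such a function might behave. Everything specific in it is made up for illustration: the `__numpy_api__` protocol name, the `dispatchable` helper, the `KNOWN_TYPES` tuple, and the `Lazy` and `Opaque` toy classes. The real mechanism is still to be determined, and this version only dispatches on the first argument.

```python
import numpy as np

# Hypothetical protocol name: the real mechanism is still to be determined.
PROTOCOL = "__numpy_api__"

# Types this sketch treats as "known" and hands straight to NumPy.
KNOWN_TYPES = (np.ndarray, np.generic, bool, int, float, complex, list, tuple)


def dispatchable(numpy_impl):
    """Wrap a NumPy function so it dispatches on its first argument
    and refuses to coerce unknown types."""
    name = getattr(numpy_impl, "__name__", repr(numpy_impl))

    def wrapper(x, *args, **kwargs):
        overload = getattr(type(x), PROTOCOL, None)
        if overload is not None:
            # The argument's type takes over entirely.
            return overload(x, wrapper, args, kwargs)
        if isinstance(x, KNOWN_TYPES):
            return numpy_impl(x, *args, **kwargs)
        # No overload and not a known type: fail loudly instead of
        # silently converting through __array__().
        raise TypeError(f"{name}() has no overload for {type(x).__name__}")

    wrapper.__name__ = name
    return wrapper


api_sum = dispatchable(np.sum)  # stand-in for a function in numpy.api


class Lazy:
    """Toy duck array that records operations instead of computing them."""

    def __init__(self, data, ops=()):
        self.data = data
        self.ops = tuple(ops)

    def __numpy_api__(self, func, args, kwargs):
        return Lazy(self.data, self.ops + (func.__name__,))


class Opaque:
    """Toy type that converts via __array__() but defines no overload."""

    def __array__(self, dtype=None, copy=None):
        return np.asarray([1, 2, 3], dtype=dtype)


print(api_sum(np.arange(4)))         # plain ndarray: computed by np.sum
print(api_sum(Lazy([1, 2, 3])).ops)  # overload: the result stays a Lazy
print(np.sum(Opaque()))              # plain NumPy coerces via __array__()
try:
    api_sum(Opaque())
except TypeError as exc:
    print("refused:", exc)           # the sketch raises instead of coercing
```

The point of the last two calls is the contrast: the existing `np.sum` quietly turns `Opaque` into an array, while the wrapped version raises, so an accidental conversion shows up as an error rather than as a silent change of type.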
This approach has a number of advantages over adjusting existing NumPy functions:
- Backwards compatibility. For any particular NumPy function without overloads, there is assuredly existing code that relies on it always coercing to NumPy arrays. Every time we add a new overload in NumPy (e.g., `np.any()` recently), we've seen things break in downstream libraries like pandas. Even if we require downstream libraries to opt in (e.g., by implementing `__array_ufunc__`), that just pushes the breakage downstream.
- Predictability. We can remove any uncertainty over whether a NumPy function supports dispatching. When using a library like dask or sparse, it's a big deal to convert an object into a NumPy array. Unfortunately, with the current state of affairs, this is easy to do accidentally, which leads to bugs.
- Cleaning up old APIs. NumPy currently does some ad-hoc dispatching based on attributes, e.g., `np.mean(obj)` calls `obj.mean()` if it has a `mean` attribute. This sort of implicit overloading is unreliable and would be a mess to maintain alongside a new dispatch mechanism, but it's widely used so we can't really get rid of it.
- Performance. NumPy functions can remain specialized to NumPy arrays, and won't need to check for overloading. (This is a minor point, but there are always people counting microseconds in tight loops, or else we wouldn't even have scalar objects in NumPy.)
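For anyone who hasn't run into the attribute-based dispatching in `np.mean(obj)`, a small demonstration follows. The `Tracked` class is a made-up toy. Its `mean` method accepts the `axis`, `dtype` and `out` keywords because, as far as I can tell from the current implementation, NumPy forwards them when it delegates.

```python
import numpy as np


class Tracked:
    """Toy object with a `mean` method and no other array behaviour."""

    def mean(self, axis=None, dtype=None, out=None, **kwargs):
        # NumPy hands over its own keyword arguments when it delegates.
        return "Tracked.mean was called"


# Tracked is not an ndarray, so np.mean defers to the object's own method
# instead of trying to convert it to an array.
print(np.mean(Tracked()))
```

Nothing about `Tracked` declares that it wants to overload `np.mean`; having an attribute with the right name is enough, which is what makes this style of overloading hard to reason about.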
Most of these advantages would be true for an external library, too, but of course putting it inside NumPy itself helps solve the community problem.