This document describes the reasons for rewriting the gufunc support in numba, and it sets the goals for the rewrite.
Numba supports the creation of numpy ufuncs and gufuncs using the @vectorize
and @guvectorize
decorators, respectively. These decorators provide an easy way to create ufuncs and gufuncs without sacrificing execution performance (see Appendix 1). To use these decorators, users provide a kernel function. For ufuncs, the kernel takes scalar arguments only. For gufuncs, the kernel takes N-d arrays.
This section describes the problems and limitations of the current implementation of the (g)ufunc support.
The @vectorize
and @guvectorize
decorators for ufuncs and gufuncs, respectively, have been available since very early versions, when numba's feature set was small. As numba's feature set grows, the limitations of the current implementation become more obvious.
- Ufuncs and gufuncs should be the same kind of object. Numba's gufunc creation pipeline is distinct from the ufunc creation pipeline. However, ufuncs should be considered a special case of gufuncs where all parameters are of scalar type. Combining the pipelines will reduce code duplication.
- Gufuncs lack dynamic type inference. Numba's gufunc support still requires type declarations, but the ufunc support can perform dynamic type inference.
- Numba @jit'ed functions can't consume gufuncs. Numba @jit'ed functions can call ufuncs by directly emitting the broadcasting and looping logic into the callsite. Gufuncs are more complicated, and we haven't implemented a way to use a gufunc directly in compiled code. (Note: parfor has hacks to use gufunc kernels directly.)
- Numpy (g)ufuncs are limited to array parameters. The new parallel accelerator feature leverages parallel gufuncs as the building block of parallel loops. The new feature will fail if the loops reference non-array-convertible types. (Note: a non-array-convertible type is any type that is not an array or not safely convertible to an array.)
- No array contiguity information. When the numpy (g)ufunc machinery invokes the kernel function, there is no guarantee about the contiguity (C order? Fortran order?) of the array parameters. This limits potential performance gains from automatic SIMD-vectorization.
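The contiguity problem is easy to see with plain numpy: when the gufunc machinery walks an axis of a C-ordered array, the 1-d slice handed to the kernel may be strided rather than contiguous, which blocks contiguity-dependent SIMD optimization.

```python
import numpy as np

a = np.arange(12).reshape(3, 4)   # C-contiguous 2-d array
row = a[0]                        # contiguous 1-d slice
col = a[:, 1]                     # strided 1-d slice (stride = 4 elements)

row.flags['C_CONTIGUOUS']   # True
col.flags['C_CONTIGUOUS']   # False
```

A kernel compiled for the contiguous case cannot safely assume anything when it may also receive slices like `col`.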
This section lists the features of the new gufunc system.
- A single gufunc type where ufunc is just a special case.
- A gufunc implementation that is independent of numpy and python.
- Gufuncs to be first-class numba types.
  - First-class gufuncs can be passed as parameters and used as return values.
  - Currently, no callable types in numba are first-class.
- Supports any numba data types
  - Element types (dtypes): tuples, user-defined structures
  - Container types: any sequence type, ragged arrays...
- Ahead-of-time (AoT) compilation. (Depends on [2].) Precompiling gufuncs into a native shared library can avoid compilation overhead at deployment.
- Usable with or without the numba runtime (NRT).
  - The NRT provides support like reference counting.
  - This will depend on the features used in the gufunc kernel.
- Callable from C code. (Depends on [2]; related to [5, 6])
- Guarantee of array contiguity in the kernel, to enable aggressive optimization.
- Hardware heterogeneity
  - Heterogeneous from the perspective of the user of a gufunc. The gufunc will abstract away how the kernel is dispatched on different hardware.
  - CPU, multicore, GPU targets
  - A gufunc for a certain HW target can be inlined at the callsite.
  - Does not require automatic HW target selection.
- Feature extensions
- Passthru/non-broadcasting argument.
  - Flexible type signatures; e.g. `float32, int32, Any, Any -> int32, float32` and `T, T, Any, Any -> T, T`, where the 2nd and 3rd arguments can be any type and `T` is like a C++ template parameter.
  - Dynamic type inference when a type declaration is not provided.
    - Like how ufuncs (dufuncs) work currently.
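To make the dynamic type inference feature concrete, here is a minimal sketch of an assumed design (for illustration only, not numba's actual dufunc implementation): on the first call with an unseen dtype combination, a new specialization is built and cached; later calls with the same dtypes reuse it.

```python
import numpy as np

class DynamicUFunc:
    """Ufunc-like object that specializes lazily per dtype combination."""

    def __init__(self, pyfunc):
        self.pyfunc = pyfunc
        self.specializations = {}   # dtype tuple -> compiled loop

    def __call__(self, *args):
        arrays = [np.asarray(a) for a in args]
        key = tuple(a.dtype for a in arrays)
        loop = self.specializations.get(key)
        if loop is None:
            # A real implementation would type-infer and compile here;
            # this sketch just falls back to np.vectorize.
            loop = np.vectorize(self.pyfunc)
            self.specializations[key] = loop
        return loop(*arrays)

add = DynamicUFunc(lambda x, y: x + y)
add(np.float64(1.0), np.float64(2.5))  # specializes for (float64, float64)
add(np.int64(1), np.int64(2))          # specializes for (int64, int64)
len(add.specializations)               # 2
```

The point is only the dispatch shape: no signature is declared up front, and the set of specializations grows with the argument types actually seen.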
This section describes the major milestones in the gufunc rewrite work.
The following milestones will land before the numba version 1.0 release.
- Replacement for the current numpy gufunc system
  - CPU replacement
  - Non-CPU replacement (mostly there in numba as pure python code)
  - Unified (g)ufunc implementation
  - Dynamic type inference
  - Array contiguity guarantee
    - Needed for the parallel accelerator features; better SIMD-vectorization.
- Feature extension Lvl 1
  - Passthru/non-broadcasting arguments
    - Potential API:

      ```python
      @guf('(m, n), (n), *, * -> (n), (m)')
      def foo(matrix, vector, extra1, extra2, output1, output2):
          # extra1 and extra2 are not broadcast
          pass

      # explicit outputs
      foo(matrix, vector, extra1, extra2, out=(output1, output2))

      # implicit outputs
      (output1, output2) = foo(matrix, vector, extra1, extra2)
      ```

    - Prioritized due to frequent requests from users.
- Generalized for more common numba types as dtypes.
  - Prioritized due to need in the parallel accelerator.
The following milestones are scheduled for post v1.0. They are not ordered.
- Feature extension Lvl 2
  - Flexible type signatures

    ```python
    @guf('(m, n), (n), *, * -> (n), (m)',
         ['float32, int32, Any, Any -> int32, float32',
          'T, T, Any, Any -> T, T'])
    ```

    - Overload resolution by the first compatible type-signature.
    - Consider datashape type patterns to aid inference of the output type?
- CPU AoT compilation
  - Compile to a shared lib
  - distutils/setuptools helper (like CFFI)
- Generalized for any numba/ndtypes dtypes
  - Numba types to produce an ndtypes spec
- Generalized for any sequence container
- General heterogeneous gufunc object (?)
  - Make XND address-space aware
  - Example: a cpu-gufunc calling a gpu-gufunc.
  - Non-example: a gpu-gufunc cannot call a cpu-gufunc.
- GPU AoT compilation (?)
  - Potential use case: exporting code for use in machine-learning/deep-learning applications.
- We may consider performance enhancements, like inlining gufuncs and compiling explicit loop nests, for post v1.0. But we want a generic implementation first to define a stable API for v1.0.
- Overload resolution: allow a custom resolution function in the underlying implementation. Different user-facing APIs may provide different resolution logic. This could be useful for reusing the new gufunc system to unify all of numba's function dispatch systems.
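The "first compatible type-signature" rule mentioned above can be sketched in a few lines. This is an assumed design for illustration (the `SIGNATURES` table and `resolve` helper are invented): concrete signatures are tried in declaration order, and a template-like signature such as `T, T -> T` matches when all argument types agree.

```python
import numpy as np

# (human-readable signature, concrete pattern or None for template-like)
SIGNATURES = [
    ('float32, int32 -> int32', (np.float32, np.int32)),  # concrete
    ('T, T -> T', None),                                  # template-like
]

def resolve(argtypes):
    """Return the first signature compatible with the argument types."""
    for sig, pattern in SIGNATURES:
        if pattern is None:                # template: all args the same type
            if len(set(argtypes)) == 1:
                return sig
        elif tuple(argtypes) == pattern:   # concrete: exact match
            return sig
    raise TypeError('no compatible signature')

resolve((np.float32, np.int32))  # the first, concrete signature wins
resolve((np.int64, np.int64))    # falls through to the template
```

A custom resolution function would replace `resolve` while keeping the same table-driven shape, which is what makes the dispatch logic pluggable per user-facing API.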
NumPy universal functions (ufuncs) are powerful functional machinery for applying computation over arrays of compatible shapes. They are the core of many NumPy built-in functions. Each ufunc has a kernel function that is an elementwise function. The kernel function can take scalars, vectors, or N-d array slices as elements. Ufuncs that can take N-d array slices are called generalized ufuncs (gufuncs), and they define a shape signature for the array dimensions accepted by the kernel function. Put another way, basic ufuncs are just a special case of gufuncs whose kernel function takes 0-d array slices (thus scalars). For the rest of this document, we will simply refer to them as ufuncs.
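A built-in example makes the shape-signature idea concrete: `np.matmul` is a generalized ufunc whose core dimensions follow the signature `(n,k),(k,m)->(n,m)`, while any leading dimensions are loop dimensions that broadcast like an ordinary ufunc's.

```python
import numpy as np

a = np.ones((5, 2, 3))   # a stack of five (2, 3) matrices
b = np.ones((5, 3, 4))   # a stack of five (3, 4) matrices

c = np.matmul(a, b)
c.shape   # (5, 2, 4): 5 is the broadcast loop dimension,
          # (2, 3) x (3, 4) -> (2, 4) are the core dimensions
```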
The extension API for creating a user-defined ufunc is limited at the Python level. numpy.vectorize
makes ufuncs out of python functions, but its performance is limited. For the most efficient implementation, one must write the ufunc using the C-API.
Numba provides the numba.vectorize
and numba.guvectorize
decorators to simplify the creation of user-defined ufuncs and generalized ufuncs, respectively. These decorators also compile the kernel function; thus, they provide execution performance comparable to a custom ufunc written in C.
See http://ndtypes.readthedocs.io/en/latest/ and http://xnd.readthedocs.io/en/latest/.
How will the annotations and overload resolution interact with the planned improvements to the type system?
Have you considered using PEP 484 type annotations? I'm especially interested in type constraints, etc.