This document describes the reasons for rewriting the gufunc support in numba, and it sets the goals for the rewrite.
Numba supports the creation of numpy ufuncs and gufuncs using the @vectorize
and @guvectorize
decorators, respectively. These decorators provide an easy way to create ufuncs and gufuncs without sacrificing execution performance (see Appendix 1). To use these decorators, users provide a kernel function. For ufuncs, the kernel takes scalar arguments only. For gufuncs, the kernel takes N-d arrays.
This section describes the problems and limitations of the current implementation of the (g)ufunc support.
The @vectorize
and @guvectorize
decorators for ufuncs and gufuncs, respectively, have been available since very early versions, when numba's feature set was small. As numba's feature set grows, the limitations of the current implementation become more obvious.
- Ufuncs and gufuncs should be the same kind of object. Numba's gufunc creation pipeline is distinct from the ufunc creation pipeline. However, ufuncs should be considered a special case of gufuncs where all parameters are of scalar type. Combining the pipelines will reduce code duplication.
- Gufuncs lack dynamic type inference. Numba's gufunc support still requires type declarations, but the ufunc support can perform dynamic type inference.
- Numba @jit'ed functions can't consume gufuncs. Numba @jit'ed functions can call ufuncs by directly emitting the broadcasting and looping logic into the callsite. Gufuncs are more complicated, and we haven't implemented a way to use a gufunc directly in compiled code. (Note: parfor has hacks to use gufunc kernels directly.)
- Numpy (g)ufuncs are limited to array parameters. The new parallel accelerator feature leverages parallel gufuncs as the building block of parallel loops. The new feature will fail if the loops reference non-array-convertible types. (Note: a non-array-convertible type is any type that is not an array or not safely convertible to an array.)
- No array contiguity information. When the numpy (g)ufunc machinery invokes the kernel function, there is no guarantee about the contiguity (C order? Fortran order?) of the array parameters. This limits potential performance gains from automatic SIMD-vectorization.
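The contiguity problem is easy to see with plain numpy: when the gufunc machinery walks an axis of a C-ordered array, the 1-d slice handed to the kernel may be strided rather than contiguous, which blocks contiguity-dependent SIMD optimization.

```python
import numpy as np

a = np.arange(12).reshape(3, 4)   # C-contiguous 2-d array
row = a[0]                        # contiguous 1-d slice
col = a[:, 1]                     # strided 1-d slice (stride = 4 elements)

row.flags['C_CONTIGUOUS']   # True
col.flags['C_CONTIGUOUS']   # False
```

A kernel compiled for the contiguous case cannot safely assume anything when it may also receive slices like `col`.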
This section lists the features of the new gufunc system.
- A single gufunc type where ufunc is just a special case.
- A gufunc implementation that is independent of numpy and python.
- Gufuncs to be first-class numba types.
  - First-class gufuncs can be passed as parameters and used as return values.
  - Currently, no callable types in numba are first-class.
- Supports any numba data types
  - Element types (dtypes): tuples, user-defined structures
  - Container types: any sequence type, ragged arrays...
- Ahead-of-time (AoT) compilation. (Depends on [2].) Precompiling gufuncs into a native shared library can avoid compilation overhead at deployment.
- Usable with or without the numba runtime (NRT).
  - The NRT provides support like reference counting.
  - This will depend on the features used in the gufunc kernel.
- Callable from C code. (Depends on [2]; related to [5, 6])
- Guarantee of array contiguity in the kernel, to enable aggressive optimization.
- Hardware heterogeneity
  - Heterogeneous from the perspective of the user of a gufunc. The gufunc will abstract away how the kernel is dispatched on different hardware.
  - CPU, multicore, GPU targets
  - A gufunc for a certain HW target can be inlined at the callsite.
  - Does not require automatic HW target selection.
- Feature extensions
- Passthru/non-broadcasting argument.
  - Flexible type signatures; e.g. `float32, int32, Any, Any -> int32, float32` and `T, T, Any, Any -> T, T`, where the 2nd and 3rd arguments can be any type and `T` is like a C++ template parameter.
  - Dynamic type inference when a type declaration is not provided.
    - Like how ufuncs (dufuncs) work currently.
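To make the dynamic type inference feature concrete, here is a minimal sketch of an assumed design (for illustration only, not numba's actual dufunc implementation): on the first call with an unseen dtype combination, a new specialization is built and cached; later calls with the same dtypes reuse it.

```python
import numpy as np

class DynamicUFunc:
    """Ufunc-like object that specializes lazily per dtype combination."""

    def __init__(self, pyfunc):
        self.pyfunc = pyfunc
        self.specializations = {}   # dtype tuple -> compiled loop

    def __call__(self, *args):
        arrays = [np.asarray(a) for a in args]
        key = tuple(a.dtype for a in arrays)
        loop = self.specializations.get(key)
        if loop is None:
            # A real implementation would type-infer and compile here;
            # this sketch just falls back to np.vectorize.
            loop = np.vectorize(self.pyfunc)
            self.specializations[key] = loop
        return loop(*arrays)

add = DynamicUFunc(lambda x, y: x + y)
add(np.float64(1.0), np.float64(2.5))  # specializes for (float64, float64)
add(np.int64(1), np.int64(2))          # specializes for (int64, int64)
len(add.specializations)               # 2
```

The point is only the dispatch shape: no signature is declared up front, and the set of specializations grows with the argument types actually seen.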
This section describes the major milestones in the gufunc rewrite work.
The following milestones will land before the numba version 1.0 release.
- Replacement for the current numpy gufunc system
  - CPU replacement
  - Non-CPU replacement (mostly there in numba as pure python code)
  - Unified (g)ufunc implementation
  - Dynamic type inference
  - Array contiguity guarantee
    - Needed for the parallel accelerator features; better SIMD-vectorization.
- Feature extension Lvl 1
  - Passthru/non-broadcasting arguments
    - Potential API:

      ```python
      @guf('(m, n), (n), *, * -> (n), (m)')
      def foo(matrix, vector, extra1, extra2, output1, output2):
          # extra1 and extra2 are not broadcast
          pass

      # explicit outputs
      foo(matrix, vector, extra1, extra2, out=(output1, output2))

      # implicit outputs
      (output1, output2) = foo(matrix, vector, extra1, extra2)
      ```

    - Prioritized due to frequent requests from users.
- Generalized for more common numba types as dtypes.
  - Prioritized due to need in the parallel accelerator.
The following milestones are scheduled for post v1.0. They are not ordered.
- Feature extension Lvl 2
  - Flexible type signatures

    ```python
    @guf('(m, n), (n), *, * -> (n), (m)',
         ['float32, int32, Any, Any -> int32, float32',
          'T, T, Any, Any -> T, T'])
    ```

    - Overload resolution by the first compatible type-signature.
    - Consider datashape type patterns to aid inference of the output type?
- CPU AoT compilation
  - Compile to a shared lib
  - distutils/setuptools helper (like CFFI)
- Generalized for any numba/ndtypes dtypes
  - Numba types to produce an ndtypes spec
- Generalized for any sequence container
- General heterogeneous gufunc object (?)
  - Make XND address-space aware
  - Example: a cpu-gufunc calling a gpu-gufunc.
  - Non-example: a gpu-gufunc cannot call a cpu-gufunc.
- GPU AoT compilation (?)
  - Potential use case: exporting code for use in machine-learning/deep-learning applications.
- We may consider performance enhancements, like inlining gufuncs and compiling explicit loop nests, for post v1.0. But we want a generic implementation first to define a stable API for v1.0.
- Overload resolution: allow a custom resolution function in the underlying implementation. Different user-facing APIs may provide different resolution logic. This could be useful for reusing the new gufunc system to unify all of numba's function dispatch systems.
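The "first compatible type-signature" rule mentioned above can be sketched in a few lines. This is an assumed design for illustration (the `SIGNATURES` table and `resolve` helper are invented): concrete signatures are tried in declaration order, and a template-like signature such as `T, T -> T` matches when all argument types agree.

```python
import numpy as np

# (human-readable signature, concrete pattern or None for template-like)
SIGNATURES = [
    ('float32, int32 -> int32', (np.float32, np.int32)),  # concrete
    ('T, T -> T', None),                                  # template-like
]

def resolve(argtypes):
    """Return the first signature compatible with the argument types."""
    for sig, pattern in SIGNATURES:
        if pattern is None:                # template: all args the same type
            if len(set(argtypes)) == 1:
                return sig
        elif tuple(argtypes) == pattern:   # concrete: exact match
            return sig
    raise TypeError('no compatible signature')

resolve((np.float32, np.int32))  # the first, concrete signature wins
resolve((np.int64, np.int64))    # falls through to the template
```

A custom resolution function would replace `resolve` while keeping the same table-driven shape, which is what makes the dispatch logic pluggable per user-facing API.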
NumPy universal functions (ufuncs) are powerful functional machinery for applying computation over arrays of compatible shapes. They are the core of many NumPy built-in functions. Each ufunc has a kernel function that is an elementwise function. The kernel function can take scalars, vectors, or N-d array slices as elements. Ufuncs that can take N-d array slices are called generalized ufuncs (gufuncs), and they define a shape signature for the array dimensions accepted by the kernel function. Put another way, basic ufuncs are just a special case of gufuncs whose kernel function takes 0-d array slices (thus scalars). For the rest of this document, we will simply refer to them as ufuncs.
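A built-in example makes the shape-signature idea concrete: `np.matmul` is a generalized ufunc whose core dimensions follow the signature `(n,k),(k,m)->(n,m)`, while any leading dimensions are loop dimensions that broadcast like an ordinary ufunc's.

```python
import numpy as np

a = np.ones((5, 2, 3))   # a stack of five (2, 3) matrices
b = np.ones((5, 3, 4))   # a stack of five (3, 4) matrices

c = np.matmul(a, b)
c.shape   # (5, 2, 4): 5 is the broadcast loop dimension,
          # (2, 3) x (3, 4) -> (2, 4) are the core dimensions
```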
The extension API for creating a user-defined ufunc is limited at the Python level. numpy.vectorize
makes ufuncs out of python functions, but its performance is limited. For the most efficient implementation, one must write the ufunc using the C-API.
Numba provides the numba.vectorize
and numba.guvectorize
decorators to simplify the creation of user-defined ufuncs and generalized ufuncs, respectively. These decorators also compile the kernel function; thus, they provide execution performance comparable to a custom ufunc written in C.
See http://ndtypes.readthedocs.io/en/latest/ and http://xnd.readthedocs.io/en/latest/.
How will the annotations and overload resolution interact with the planned improvements to the type system?
Have you considered using PEP 484 type annotations? I'm especially interested in type constraints, etc.