astanin/Array abstraction in Clojure.md

## Array abstraction in Clojure.md

      
    Raw
  

              Array abstraction in Clojure.md
            
          
    Array abstraction in Clojure

Few days ago Mike Anderson wrote a proposal for a generic matrix
API in Clojure which could compete with NumPy.  I wanted to write a
similar post for months, but was reluctant to start. This problem is
very dear to me. Basically, it is a matter if I can use Clojure for
most of my work, or it remains a toy language for me. Thanks to Mike
for bringing the question up. Though, I have a different vision of how
we should approach arrays and matrices in Clojure.
1. Success story: NumPy

The strong point of Python for scientific computing is that the
community has managed to agree on one array abstraction used across
multiple libraries (IMO, the lack of such an agreement is
the weekest point of Haskell). It is exactly
the case of having thousands of functions operating on one data
structure (at least from the end-used view point).
And it gives a huge boost to composability. It's not uncommon to mix
three or four libraries built for different problem domains, and glue
them together in just few lines of Python
code. numpy.ndarray proved to be an abstraction which is
enough for most of the problem domains, but not too specific.
2. Arrays, not matrices

Let's look at ndarray:

it appears to be a fixed size multi-dimensional container,
it is often just a view to another array or an object in memory,
the number of dimensions and items is defined by its shape tuple,
all items are numeric values of the same type and size,
the contents can be accessed by indexing and slicing,
there are many ways to operate on the array as a whole (like element-by-element ops),
its in-memory storage layout is interoperable with C and Fortran libraries.

In the future NumPy will go well beyond this abstraction, but
it is ndarray which made Python the language of choice of many for
numerical computation.
Please note what ndarray is not a matrix, nor a tensor. It
implements some linear algebra methods, and some statistics
methods, but use of additional specialized modules is
required to do more complex stuff.
3. Objectives

I suppose the purpose of the common multi-dimensional array
abstractions is to:


make an array-like output of any Clojure/Java library consumable
by any other library


allow for zero-copying composition of views over arbitrary
array-like values


allow for building array-like values from Clojure collections
and other array-like values


favor array-level operations (arrays as values), rather than
explicit iteration over elements,


Unlike in NumPy, in Clojure we cannot force one conventional storage model.
We have an opposite problem of having many storage models to support.

allow for destructive updates and implement all the API for at
least one "standard" storage type (and promote copy-on-write usage
pattern).

There are always some special-purpose array implementations, like
sparse arrays or analytically defined matrices. The abstraction should
not be too restrictive about how the underlying storage works.
Complexity:

API/protocol should be as simple and orthogonal as possible, so it
is easy to implement its support for new array-like types.

4. My vision of the core API

So I think that the basic Clojure array abstraction should behave like
a generic multi-dimensional container; linear algebra, image
processing, signal processing libraries etc.  should not be part of
it.
So the core array API would be something like:


Random access for an arbitrary number of dimensions.
(get my-3d-array [42 17 1])

The last index is supposed to be the most frequently changing
index.
The API should not mention things like rows or columns
explicitly. Instead only a generic shape function should be
available:
(shape my-3d-array)  ;; => [100 100 3] for a 100x100x3 array

shape function is enough to know the size of the array (apply * (shape ...)), its rank (count (shape ...)) and every dimension.


Slicing/striding syntax to produce partial array views, for
example:
(shape (get my-3d-array [[:from 50], [:every 2], 0]))
;; => [50 50 1]

or probably with a separate slicing function (not get) and with
some shortcuts:
(shape (slice my-3d-array [:all [:from 50 :every 2] :last]))
;; => [100 25 1]


Index-space transforms. Many array operations (like transposition,
cyclic shifts, reversing etc.)  are just index-space transforms;
index-map may produce new array views:
(index-map (comp vec reverse) my-2d-array)  ;; transpose


Changing shape could be done with index-map only, but then we
need to deduce too much about the image of the index transform;
having a separate reshape is probably more pragmatic:
(reshape my-array (apply * (shape my-array)))  ;; flatten
(reshape my-100x100-array [1000 10])


Stacking along an arbitrary axis; stacking for
arrays is like concatenation and zipmap for sequences:
(stack 0 array1 array2)  ;; along the first axis
(stack nil array-2d array-2d array-2d)  ;; => 3D array


Lifting (map) scalar functions to element-by-element functions;
support multiple-arity functions too:
(map + my-array-spam my-array-eggs)

I suppose, lifting may make the hypothetical Clojure ndarray
better than its Python namesake. In Python it works only for
predefined binary arithmetic operators and some predefined unary
functions written in C. It requires to write list
comprehensions/loops in Python and lose performance benefits of
NumPy.
A similar concept is applying a function along some dimension:
(apply-dim 2
    (fn [rgb] (apply + (map * [0.2126 0.7152 0.0722] rgb)))
    my-image-rgb)  ;; linear RGB to BT.709 luminance


A function to explicitly copy/change storage/make contiguous/force
view:
(copy my-array-view)  ;; now contiguous and mutable


Converting  “multi-dimensional” collections (lists of lists,
vectors of vectors etc) to a multi-dimensional array:
(to-ndarray [[1 2] [3 4]])

It should create a contiguous mutable storage by default.


5. The most used features first (candidates for the core API)

I counted my most used NumPy functions
recently. Not counting slicing and arithmetic
operations, they are: asarray, linspace, mean,
dot, sqrt, roll, hstack, loadtxt, linalg.norm.
My overuse of asarray is probably due to how universal (unary
element-by-element) functions are implemented in NumPy. They should be
written in C, and only a limited set of them is available out of the
box. If I cannot express the algorithm using existing UFuncs, masked
arrays, array stacking, reshaping etc, then I run list comprehensions
to do anything I want in Python, and convert the result to ndarray
with asarray. Hopefully, we can do better in Clojure.
It indicates the priority features in building a usable Clojure
ndarray suite:

lifting element-level functions to element-wise array functions
building arrays from arrays (cyclic shifts, stacking, ranges)
basic input-output (loadtxt/savetxt, imread/imwrite are useful too)
some linear algebra and some common statistics/reductions.

I suppose lifting and building array views should be part of the
array API (see above).
Other features to consider for the core API are:

search by value or by condition (like argmin, argmax in NumPy)
zero-copy sorting (like argsort in NumPy) -- can replace search too
element type query (like .dtype) and conversion (probably in to-ndarray)
full and partial reductions (along some axis)
masking and choice

6. Completeness and orthogonality

Provided that arrays are iterable, the core API seems to be enough to
implement almost all features of the NumPy ndarray. A separate set
of utility functions needs to be provided for the end user (things
like transpose).
Strictly speaking, reshape is not orthogonal to index-map, but
it's easier to implement it as a separate function; apply-dim is
redundant if we have slice and map, but I suppose some
implementations may optimize it; slice can be merged with get;
copy and to-ndarray can be merged too, no-op conversion may be
moved to utilities.
Some candidates are redundant too.  I suppose that complete reductions
could be done with reduce if arrays behave like
iterables. clojure.core/map + slice + reduce + stack would do
the work of partial reductions (along some axis), but some
implementations may do better.
Input-output can be implemented as utility functions using the core
API.
Linear algebra is optional, and only some storage models may implement
it correctly.  It might be a good idea to decide on similar extended
API with the support for linear algebra.  Providing generic
unoptimized implementations as a fallback is another solution.
Choice can be covered by stacking and apply-dim. Masking is
equivalent to lifting a conditional function.
I suppose the good approach is to start with an almost orthogonal core
API, and try to implement all utility functions on top of it. Extend
the API, if some functions are too inefficient to do in Clojure.
7. Default storage format

I still don't know what Java implementation should be the default
storage type. I suppose it is better to deal with a contiguous 1D
storage by default, and convert it to other types if one starts doing
things like linear algebra.
Commons Math's ArrayRealVector is my prime
candidate. It doesn't support integer-valued arrays unfortunately.