Skip to content

Instantly share code, notes, and snippets.

@astanin
Last active December 10, 2015 18:49
Show Gist options
  • Star 6 You must be signed in to star a gist
  • Fork 1 You must be signed in to fork a gist
  • Save astanin/4477420 to your computer and use it in GitHub Desktop.
Save astanin/4477420 to your computer and use it in GitHub Desktop.

Array abstraction in Clojure

Few days ago Mike Anderson wrote a proposal for a generic matrix API in Clojure which could compete with NumPy. I wanted to write a similar post for months, but was reluctant to start. This problem is very dear to me. Basically, it is a matter if I can use Clojure for most of my work, or it remains a toy language for me. Thanks to Mike for bringing the question up. Though, I have a different vision of how we should approach arrays and matrices in Clojure.

1. Success story: NumPy

The strong point of Python for scientific computing is that the community has managed to agree on one array abstraction used across multiple libraries (IMO, the lack of such an agreement is the weekest point of Haskell). It is exactly the case of having thousands of functions operating on one data structure (at least from the end-used view point).

And it gives a huge boost to composability. It's not uncommon to mix three or four libraries built for different problem domains, and glue them together in just few lines of Python code. numpy.ndarray proved to be an abstraction which is enough for most of the problem domains, but not too specific.

2. Arrays, not matrices

Let's look at ndarray:

  • it appears to be a fixed size multi-dimensional container,
  • it is often just a view to another array or an object in memory,
  • the number of dimensions and items is defined by its shape tuple,
  • all items are numeric values of the same type and size,
  • the contents can be accessed by indexing and slicing,
  • there are many ways to operate on the array as a whole (like element-by-element ops),
  • its in-memory storage layout is interoperable with C and Fortran libraries.

In the future NumPy will go well beyond this abstraction, but it is ndarray which made Python the language of choice of many for numerical computation.

Please note what ndarray is not a matrix, nor a tensor. It implements some linear algebra methods, and some statistics methods, but use of additional specialized modules is required to do more complex stuff.

3. Objectives

I suppose the purpose of the common multi-dimensional array abstractions is to:

  • make an array-like output of any Clojure/Java library consumable by any other library

  • allow for zero-copying composition of views over arbitrary array-like values

  • allow for building array-like values from Clojure collections and other array-like values

  • favor array-level operations (arrays as values), rather than explicit iteration over elements,

Unlike in NumPy, in Clojure we cannot force one conventional storage model. We have an opposite problem of having many storage models to support.

  • allow for destructive updates and implement all the API for at least one "standard" storage type (and promote copy-on-write usage pattern).

There are always some special-purpose array implementations, like sparse arrays or analytically defined matrices. The abstraction should not be too restrictive about how the underlying storage works.

Complexity:

  • API/protocol should be as simple and orthogonal as possible, so it is easy to implement its support for new array-like types.

4. My vision of the core API

So I think that the basic Clojure array abstraction should behave like a generic multi-dimensional container; linear algebra, image processing, signal processing libraries etc. should not be part of it.

So the core array API would be something like:

  • Random access for an arbitrary number of dimensions.

    (get my-3d-array [42 17 1])
    

    The last index is supposed to be the most frequently changing index.

    The API should not mention things like rows or columns explicitly. Instead only a generic shape function should be available:

    (shape my-3d-array)  ;; => [100 100 3] for a 100x100x3 array
    

    shape function is enough to know the size of the array (apply * (shape ...)), its rank (count (shape ...)) and every dimension.

  • Slicing/striding syntax to produce partial array views, for example:

    (shape (get my-3d-array [[:from 50], [:every 2], 0]))
    ;; => [50 50 1]
    

    or probably with a separate slicing function (not get) and with some shortcuts:

    (shape (slice my-3d-array [:all [:from 50 :every 2] :last]))
    ;; => [100 25 1]
    
  • Index-space transforms. Many array operations (like transposition, cyclic shifts, reversing etc.) are just index-space transforms; index-map may produce new array views:

    (index-map (comp vec reverse) my-2d-array)  ;; transpose
    
  • Changing shape could be done with index-map only, but then we need to deduce too much about the image of the index transform; having a separate reshape is probably more pragmatic:

    (reshape my-array (apply * (shape my-array)))  ;; flatten
    (reshape my-100x100-array [1000 10])
    
  • Stacking along an arbitrary axis; stacking for arrays is like concatenation and zipmap for sequences:

    (stack 0 array1 array2)  ;; along the first axis
    (stack nil array-2d array-2d array-2d)  ;; => 3D array
    
  • Lifting (map) scalar functions to element-by-element functions; support multiple-arity functions too:

    (map + my-array-spam my-array-eggs)
    

    I suppose, lifting may make the hypothetical Clojure ndarray better than its Python namesake. In Python it works only for predefined binary arithmetic operators and some predefined unary functions written in C. It requires to write list comprehensions/loops in Python and lose performance benefits of NumPy.

    A similar concept is applying a function along some dimension:

    (apply-dim 2
        (fn [rgb] (apply + (map * [0.2126 0.7152 0.0722] rgb)))
        my-image-rgb)  ;; linear RGB to BT.709 luminance
    
  • A function to explicitly copy/change storage/make contiguous/force view:

    (copy my-array-view)  ;; now contiguous and mutable
    
  • Converting “multi-dimensional” collections (lists of lists, vectors of vectors etc) to a multi-dimensional array:

    (to-ndarray [[1 2] [3 4]])
    

    It should create a contiguous mutable storage by default.

5. The most used features first (candidates for the core API)

I counted my most used NumPy functions recently. Not counting slicing and arithmetic operations, they are: asarray, linspace, mean, dot, sqrt, roll, hstack, loadtxt, linalg.norm.

My overuse of asarray is probably due to how universal (unary element-by-element) functions are implemented in NumPy. They should be written in C, and only a limited set of them is available out of the box. If I cannot express the algorithm using existing UFuncs, masked arrays, array stacking, reshaping etc, then I run list comprehensions to do anything I want in Python, and convert the result to ndarray with asarray. Hopefully, we can do better in Clojure.

It indicates the priority features in building a usable Clojure ndarray suite:

  • lifting element-level functions to element-wise array functions
  • building arrays from arrays (cyclic shifts, stacking, ranges)
  • basic input-output (loadtxt/savetxt, imread/imwrite are useful too)
  • some linear algebra and some common statistics/reductions.

I suppose lifting and building array views should be part of the array API (see above).

Other features to consider for the core API are:

  • search by value or by condition (like argmin, argmax in NumPy)
  • zero-copy sorting (like argsort in NumPy) -- can replace search too
  • element type query (like .dtype) and conversion (probably in to-ndarray)
  • full and partial reductions (along some axis)
  • masking and choice

6. Completeness and orthogonality

Provided that arrays are iterable, the core API seems to be enough to implement almost all features of the NumPy ndarray. A separate set of utility functions needs to be provided for the end user (things like transpose).

Strictly speaking, reshape is not orthogonal to index-map, but it's easier to implement it as a separate function; apply-dim is redundant if we have slice and map, but I suppose some implementations may optimize it; slice can be merged with get; copy and to-ndarray can be merged too, no-op conversion may be moved to utilities.

Some candidates are redundant too. I suppose that complete reductions could be done with reduce if arrays behave like iterables. clojure.core/map + slice + reduce + stack would do the work of partial reductions (along some axis), but some implementations may do better.

Input-output can be implemented as utility functions using the core API.

Linear algebra is optional, and only some storage models may implement it correctly. It might be a good idea to decide on similar extended API with the support for linear algebra. Providing generic unoptimized implementations as a fallback is another solution.

Choice can be covered by stacking and apply-dim. Masking is equivalent to lifting a conditional function.

I suppose the good approach is to start with an almost orthogonal core API, and try to implement all utility functions on top of it. Extend the API, if some functions are too inefficient to do in Clojure.

7. Default storage format

I still don't know what Java implementation should be the default storage type. I suppose it is better to deal with a contiguous 1D storage by default, and convert it to other types if one starts doing things like linear algebra.

Commons Math's ArrayRealVector is my prime candidate. It doesn't support integer-valued arrays unfortunately.

@lantiga
Copy link

lantiga commented Jan 8, 2013

+1

@mikera
Copy link

mikera commented Jan 9, 2013

I don't think our visions are different at all :-)

Possibly this didn't come across since my post focused on matrix / vector functionality (which is the common actual use case) but I'm looking to provide exactly this sort of nd-array capability in the API.

On the complexity issue: I suggest we solve this through default implementations of common functions in terms of the more basic functions. This makes it easy for implementers if they just want to implement the core functions, while still letting them provide faster implementation if they want to for their specialised array/matrix types.

On the default storage issue: I'm a little uncomfortable with the idea of pulling in a "default" matrix library as a mandatory dependency. Perhaps we could provide a very simple lightweight implementation, but if someone wants to use a fully featured implementation then they should import it explicitly IMHO.

Anyway - your contributions to the effort are very much appreciated!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment