@asmeurer
Last active March 6, 2020 19:45
Idea for an ndindex manipulation library

I would like to have a library that makes it easier to manipulate indices to ndarrays. This primarily means slices and tuples of slices, but could also apply to fancy indexing.

Here are some of the pain points of working with Python slice objects directly (illustrated in the snippet after this list):

  • slice is not hashable, so you have to convert it to a tuple to use it as a dictionary key.

  • To pull apart a slice you have to do start, stop, step = s.start, s.stop, s.step. It would be simpler if one could just do, for instance, range(*s).

  • They are not canonicalized. What I mean here is that the start or step can be None. This is because slice makes no assumptions about slicing semantics. I want something that assumes all indices will be into numpy arrays.

  • For more than one dimension, you have tuples of slices. These don't have any methods to directly access the slice for a given dimension. Even if you only care about one dimension, you have to write code that handles both slice(a, b, c) and (slice(a, b, c),).
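
To make these concrete, here is what each pain point looks like with plain Python/NumPy today (nothing below is part of the proposed library):

import numpy as np

s = slice(1, 10, 2)

# Not hashable (on the Python versions current at the time of writing), so it
# can't be used as a dict key directly:
try:
    d = {s: "value"}
except TypeError as e:
    print(e)  # unhashable type: 'slice'

# Pulling a slice apart is verbose:
start, stop, step = s.start, s.stop, s.step

# Not canonicalized: start and step may simply be None.
s2 = slice(None, 10, None)
print(s2.start, s2.step)  # None None

# One dimension vs. a tuple of slices needs two code paths, even though the
# result of indexing is the same:
a = np.arange(20).reshape(4, 5)
print(np.array_equal(a[slice(0, 2)], a[(slice(0, 2),)]))  # True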

Furthermore, doing arithmetic on slices is complicated. It would be nice to have a library that lets you manipulate indices. Indices would in general make no assumptions about the array they index into, except perhaps implicitly assuming that the index would not produce an IndexError. Some operations may require knowledge of the array's shape, so they would require that to be provided.

I do not care about symbolic entries, although they could potentially simplify things (I'm not sure). They might also make things a lot more complicated, because you have to represent a lot of case distinctions. There is no getting around this, because Python slice semantics are not uniform: slicing behavior depends on things like the signs of the indices.
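
For example, with a plain NumPy array, similar-looking slices behave quite differently depending on the signs of their entries:

import numpy as np

a = np.arange(5)
print(a[slice(3, None)])         # [3 4]        a nonnegative start counts from the front
print(a[slice(-3, None)])        # [2 3 4]      a negative start counts from the end
print(a[slice(None, None, -1)])  # [4 3 2 1 0]  a negative step reverses, with different
                                 #              default start/stop than a positive step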

I envision a wrapper class like

class NDIndex:
    ...

which would wrap a slice, Ellipsis, ndarray, or tuple thereof to support the most important forms of nd-indexing (slices and tuples of slices being the most important).

NDIndex would have subclasses to represent the different types of indexing. It would be hashable, and also have some easy way to pull apart arguments (not __iter__, since we want to override __getitem__ as below, but something like NDIndex.args). All operations on NDIndex would return another NDIndex; nothing would return a slice, tuple, etc. directly.
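
As a very rough sketch of the wrapper (everything beyond the names NDIndex and args mentioned above is a placeholder, not a committed API):

class NDIndex:
    """Base class; subclasses would represent slices, integers, tuples, etc."""

    def __init__(self, *args):
        self._args = args

    @property
    def args(self):
        # The way to pull an index apart (deliberately not __iter__, since
        # __getitem__ is reserved for index composition, i1[i2]).
        return self._args

    def __eq__(self, other):
        return type(self) is type(other) and self.args == other.args

    def __hash__(self):
        # Hashable, unlike a raw slice, so it can be used as a dict key.
        return hash((type(self), self.args))


class Slice(NDIndex):
    """Placeholder subclass wrapping a single slice; canonicalization,
    reduce(), composition, etc. would live in classes like this."""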

Some examples of the sorts of operations that would be supported (a concrete illustration of their intended semantics follows the list). i1, etc. are NDIndex instances; you can imagine they are slices.

  • i1[i2]: This would create a new NDIndex i3 (when possible) so that a[i1][i2] == a[i3].

  • split(i0, [i1, i2, ...]): This would return a list of indices [j1, j2, ...], such that a[i0] = concat(a[i1][j1], a[i2][j2], ...).

    I'm not sure if this is the right abstraction. I have a function in versioned_hdf5 split_chunks which splits a slice into sub-slices over chunks of an array. So split_chunks(i0, CHUNK) would become split(i0, [slice(0, CHUNK), slice(CHUNK, 2*CHUNK), ...]).

  • Related to that, some functions to generate slices over chunks, like in the example above, would be useful. This is especially true because doing this over a single dimension isn't so bad, but it gets worse once your chunks are multi-dimensional (my main use case is HDF5, which deals in chunks; see https://www.pytables.org/usersguide/optimization.html#understanding-chunking).

  • len(i1): the maximum number of elements that can be indexed by i1.

  • i1.reduce(shape): assume that i1 will be indexed into an array of shape shape and reduce it (e.g., if any indices go beyond the extent of the array, reduce them down). After this, for instance, len(i1.reduce(shape)) would be more accurate for arrays of shape shape.

  • Concatenation, so i1 + i2 produces a single index so that a[i1 + i2] gives all the elements of a[i1] and a[i2]. This could also have a strict mode so that, for instance, it only works if the end result is a valid slice.

  • It would implement __index__ where that makes sense (the protocol can only return an integer, so this only applies to integer indices). It would also have a method to convert it back to a normal set of index objects (a tuple of slices, Ellipsis, None, and ndarrays), so it can be used directly as an index.
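
To pin down the intended semantics of a few of these, here are the corresponding identities checked directly against NumPy, with the resulting indices worked out by hand for these specific cases (no library API is assumed):

import numpy as np

a = np.arange(10)

# i1[i2]: composing two slices, so that a[i1][i2] == a[i3].
i1, i2 = slice(2, 9), slice(1, None, 2)
i3 = slice(3, 9, 2)                      # worked out by hand for this case
assert np.array_equal(a[i1][i2], a[i3])

# reduce(shape): clip a slice to the array's extent so len() becomes exact.
i4 = slice(0, 100)                       # runs past the end of a length-10 array
reduced = slice(*i4.indices(len(a)))     # slice(0, 10, 1); fine for positive steps
assert np.array_equal(a[i4], a[reduced]) and len(a[reduced]) == 10

# Concatenation: a[i1 + i2] should give all the elements of a[i1] and a[i2];
# here slice(0, 3) + slice(3, 6) would be slice(0, 6).
assert np.array_equal(a[slice(0, 6)], np.concatenate([a[slice(0, 3)], a[slice(3, 6)]]))

# Generating the chunk slices used by split(): slices covering an axis of
# length n in chunks of CHUNK elements.
CHUNK, n = 4, 10
chunk_slices = [slice(i, min(i + CHUNK, n)) for i in range(0, n, CHUNK)]
assert chunk_slices == [slice(0, 4), slice(4, 8), slice(8, 10)]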

For testing, I would use a combination of exhaustive tests and property-based tests using hypothesis. Basically, create an np.arange and test if the index is correct against it. That way we are assured we are keeping to numpy semantics. Testing is very important because this logic has a lot of corner cases and it's easy to get it wrong. We should aim for full test coverage, not just of lines but of cases.
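
As a sketch of the property-based side, here is the kind of hypothesis test I mean, restricted for simplicity to nonnegative, step-1 slices where the composed index can be worked out by hand (the real library would cover the general, much hairier cases the same way; compose_simple is illustrative, not a library function):

import numpy as np
from hypothesis import given, strategies as st

nonneg = st.integers(0, 30)
simple_slices = st.builds(lambda a, b: slice(a, b), nonneg, nonneg)

def compose_simple(i1, i2):
    # Hand-derived composition for nonnegative, step-1 slices only:
    # a[i1][i2] == a[compose_simple(i1, i2)].
    return slice(i1.start + i2.start, min(i1.stop, i1.start + i2.stop))

@given(i1=simple_slices, i2=simple_slices, n=st.integers(0, 40))
def test_compose_matches_numpy(i1, i2, n):
    a = np.arange(n)
    assert np.array_equal(a[i1][i2], a[compose_simple(i1, i2)])

test_compose_matches_numpy()  # runs the property when executed as a script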

To start out with, there would be a lot of NotImplementedErrors. We would then build things as we need them.

asmeurer commented Mar 6, 2020

@rgommers pointed me to https://numpy.org/neps/nep-0021-advanced-indexing.html. It's something we should consider supporting as well. CC @saulshanabrook
