I would like to have a library that makes it easier to manipulate indices to ndarrays. This primarily means slices and tuples of slices, but could also apply to fancy indexing.
Here are some of the pain points of working with the Python `slice` objects directly:

- `slice` is not hashable, so you have to convert it to a tuple to use it as a dictionary key.
- To pull apart a slice you have to do `start, stop, step = s.start, s.stop, s.step`. It would be simpler if one could just do, for instance, `range(*s)`.
- They are not canonicalized. What I mean here is that the `start` or `step` can be `None`. This is because `slice` makes no assumptions about slicing semantics. I want something that assumes all indices will be into numpy arrays.
- For more than one dimension, you have tuples of slices. These don't have any methods to directly access the individual slices. Even if you only care about one dimension, you have to write code that handles both `slice(a, b, c)` and `(slice(a, b, c),)`.
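To make the first few pain points concrete, here is a short sketch (note that on Python versions before 3.12, `slice` objects are unhashable):

```python
import numpy as np

s = slice(0, 10, 2)

# Pulling apart a slice requires accessing each attribute individually:
start, stop, step = s.start, s.stop, s.step
assert (start, stop, step) == (0, 10, 2)

# Because slice is unhashable (before Python 3.12), using one as a dict
# key requires converting it to a tuple first:
cache = {(s.start, s.stop, s.step): np.arange(20)[s]}

# Un-canonicalized: start and step can be None, even though for a numpy
# array these two slices select exactly the same elements:
a = np.arange(20)
assert (a[slice(None, 10, None)] == a[slice(0, 10, 1)]).all()
```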
Furthermore, doing arithmetic on slices is complicated. It would be nice to have a library that lets you manipulate indices. Indices would in general make no assumptions about the array that they index into, except perhaps implicitly assuming that the index would not produce an `IndexError`. Some operations may require knowledge of the array's shape, in which case that shape would have to be provided to them.
I do not care about symbolic entries, although they could potentially simplify things (I'm not sure). They also might make things a lot more complicated, because you have to represent a lot of `if` cases. There is no getting around this, because Python slice semantics are not uniform: slicing behavior depends on things like the signs of the indices.
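For example, what a slice selects, and even how many elements it selects, depends on the signs of its entries and on the length of the array:

```python
import numpy as np

a = np.arange(10)

# The same (start, stop) pattern means different things depending on sign:
assert (a[slice(2, 5)] == np.array([2, 3, 4])).all()      # indices 2..4
assert (a[slice(-3, None)] == np.array([7, 8, 9])).all()  # last 3 elements

# Even the length of the result can't be computed from the slice alone
# without branching on the signs and on the array length:
assert len(a[slice(5, 2)]) == 0   # empty: start >= stop with step 1
assert len(a[slice(-2, 5)]) == 0  # empty: -2 resolves to 8, which is >= 5
```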
I envision a wrapper class like

```python
class NDIndex:
    ...
```

which would wrap a slice, Ellipsis, ndarray, or tuple thereof, to support the most important kinds of nd-indexing (slices and tuples of slices being the most important). `NDIndex` would have subclasses to represent the different types of indexing. It would be hashable, and also have some easy way to pull apart its arguments (not `__iter__`, since we want to override `__getitem__` as below, but something like `NDIndex.args`). All operations on `NDIndex` would return another `NDIndex`. Nothing will return a slice, tuple, etc. directly.
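A minimal sketch of what this could look like (all names here are hypothetical, and the canonicalization only handles the easy nonnegative cases):

```python
import numpy as np

class NDIndex:
    """Hypothetical immutable, hashable wrapper around an index object."""
    def __init__(self, *args):
        self.args = args  # easy way to pull apart the index

    def __hash__(self):
        return hash((type(self).__name__, self.args))

    def __eq__(self, other):
        return type(self) == type(other) and self.args == other.args

class Slice(NDIndex):
    """Hypothetical subclass wrapping a slice, canonicalizing None entries."""
    def __init__(self, start, stop=None, step=None):
        # Assume numpy semantics: a None step means 1, and a None start
        # means 0 when the step is positive (negative steps need more care).
        if step is None:
            step = 1
        if start is None and step > 0:
            start = 0
        super().__init__(start, stop, step)

    @property
    def raw(self):
        # Convert back to a plain slice for use as an actual index.
        return slice(*self.args)

a = np.arange(10)
s = Slice(None, 5)
assert s == Slice(0, 5, 1)           # canonicalized, so comparable
assert hash(s) == hash(Slice(0, 5))  # hashable, usable as a dict key
assert (a[s.raw] == a[:5]).all()
```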
Some examples of the sorts of operations that would be supported (`i1`, etc. are `NDIndex` instances; you can imagine they are slices):

- `i1[i2]`: This would create a new `NDIndex` `i3` (when possible) so that `a[i1][i2] == a[i3]`.
- `split(i0, [i1, i2, ...])`: This would return a list of indices `[j1, j2, ...]`, such that `a[i0] == concat(a[i1][j1], a[i2][j2], ...)`. I'm not sure if this is the right abstraction. I have a function in versioned_hdf5, `split_chunks`, which splits a slice into sub-slices over chunks of an array. So `split_chunks(i0, CHUNK)` would become `split(i0, [slice(0, CHUNK), slice(CHUNK, 2*CHUNK), ...])`.
- Related to that, some functions to generate slices over chunks like in the example above would be useful, especially since doing this over a single dimension isn't so bad, but it gets worse once your chunks are multi-dimensional (my main use-case is HDF5, which deals in chunks; see https://www.pytables.org/usersguide/optimization.html#understanding-chunking).
- `len(i1)`: the maximum number of elements that can be indexed by `i1`.
- `i1.reduce(shape)`: assume that `i1` will be indexed into an array of shape `shape` and reduce it (e.g., if any indices go beyond the extent of the array, reduce them down). After this, for instance, `len(i1.reduce(shape))` would be more accurate for arrays of shape `shape`.
- Concatenation, so `i1 + i2` produces a single index such that `a[i1 + i2]` gives all the elements of `a[i1]` and `a[i2]`. This could also have a strict mode so that, for instance, it only works if the end result is a valid slice.
- It would implement `__index__` so you can use it directly as an index. It would also have a method to convert it back to a normal set of index objects (tuple of slices, Ellipsis, None, and ndarrays).
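As a concrete instance of the `i1[i2]` composition above, two slices can sometimes be fused into one by hand. A hypothetical helper, valid only under the stated assumptions (nonnegative starts/stops, positive steps, no clipping at the array boundary):

```python
import numpy as np

def compose(i1, i2):
    """Compose two slices so that a[i1][i2] == a[compose(i1, i2)].

    Hypothetical sketch: assumes nonnegative starts/stops, positive
    steps, and that neither slice needs clipping to the array's extent.
    The general case has to branch on signs and None entries.
    """
    start = i1.start + i2.start * i1.step
    stop = i1.start + i2.stop * i1.step
    step = i1.step * i2.step
    return slice(start, stop, step)

a = np.arange(20)
i1, i2 = slice(2, 18, 2), slice(1, 5, 2)
i3 = compose(i1, i2)
assert (a[i1][i2] == a[i3]).all()
```

Even this easy case shows why a library is worthwhile: the arithmetic is simple only as long as every entry is nonnegative and concrete.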
For testing, I would use a combination of exhaustive tests and property-based tests using hypothesis. Basically, create an `np.arange` and test if the index is correct against it. That way we are assured we are keeping to numpy semantics. Testing is very important because this logic has a lot of corner cases and it's easy to get it wrong. We should aim for full test coverage, not just of lines but of cases.
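A sketch of the exhaustive side of this strategy, testing a hypothetical `canonicalize` helper (which only handles nonnegative entries and positive steps) against numpy itself; a hypothesis test would do the same check with generated slice components instead of nested loops:

```python
import itertools
import numpy as np

def canonicalize(s, n):
    """Hypothetical helper: replace None entries with concrete values
    for an array of length n, assuming nonnegative start/stop and a
    positive step."""
    start = 0 if s.start is None else s.start
    stop = n if s.stop is None else s.stop
    step = 1 if s.step is None else s.step
    return slice(start, stop, step)

# The acid test: index an arange both ways and compare, exhaustively
# over a small grid of slice components and array lengths.
values = [None, 0, 1, 2, 5, 7]
steps = [None, 1, 2, 3]
for n in range(8):
    a = np.arange(n)
    for start, stop, step in itertools.product(values, values, steps):
        s = slice(start, stop, step)
        np.testing.assert_array_equal(a[s], a[canonicalize(s, n)])
```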
To start out with, there would be a lot of `NotImplementedError`s. We would then build things as we need them.
@rgommers pointed me to https://numpy.org/neps/nep-0021-advanced-indexing.html. It's something we should consider supporting as well. CC @saulshanabrook