@asmeurer
Last active March 6, 2020 19:45
Idea for an ndindex manipulation library

I would like to have a library that makes it easier to manipulate indices to ndarrays. This primarily means slices and tuples of slices, but could also apply to fancy indexing.

Here are some of the pain points of working with Python slice objects directly (illustrated in the snippet after this list):

  • slice is not hashable, so you have to convert it to a tuple to use it as a dictionary key.

  • To pull apart a slice you have to do start, stop, step = s.start, s.stop, s.step. It would be simpler if one could just do, for instance, range(*s).

  • They are not canonicalized. What I mean here is that the start or step can be None. This is because slice makes no assumptions about slicing semantics. I want something that assumes all indices will be into numpy arrays.

  • For more than one dimension, you have tuples of slices. These don't have any methods to directly access the slice for a given dimension. Even if you only care about one dimension, you have to write code that handles both slice(a, b, c) and (slice(a, b, c),).
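
To make these concrete, here is what each pain point looks like with plain Python/NumPy today (nothing below is part of the proposed library):

import numpy as np

s = slice(1, 10, 2)

# Not hashable (on the Python versions current at the time of writing), so it
# can't be used as a dict key directly:
try:
    d = {s: "value"}
except TypeError as e:
    print(e)  # unhashable type: 'slice'

# Pulling a slice apart is verbose:
start, stop, step = s.start, s.stop, s.step

# Not canonicalized: start and step may simply be None.
s2 = slice(None, 10, None)
print(s2.start, s2.step)  # None None

# One dimension vs. a tuple of slices needs two code paths, even though the
# result of indexing is the same:
a = np.arange(20).reshape(4, 5)
print(np.array_equal(a[slice(0, 2)], a[(slice(0, 2),)]))  # True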

Furthermore, doing arithmetic on slices is complicated. It would be nice to have a library that lets you manipulate indices. Indices would in general make no assumptions about the array they index into, except perhaps implicitly assuming that the index would not produce an IndexError. Some operations may require knowledge of the array's shape, so they would require that to be provided.

I do not care about symbolic entries, although they could potentially simplify things (I'm not sure). They might also make things a lot more complicated, because you have to represent a lot of case distinctions. There is no getting around this, because Python slice semantics are not uniform: slicing behavior depends on things like the signs of the indices.
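
For example, with a plain NumPy array, similar-looking slices behave quite differently depending on the signs of their entries:

import numpy as np

a = np.arange(5)
print(a[slice(3, None)])         # [3 4]        a nonnegative start counts from the front
print(a[slice(-3, None)])        # [2 3 4]      a negative start counts from the end
print(a[slice(None, None, -1)])  # [4 3 2 1 0]  a negative step reverses, with different
                                 #              default start/stop than a positive step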

I envision a wrapper class like

class NDIndex:
    ...

which would wrap a slice, Ellipsis, ndarray, or tuple thereof to support the most important forms of nd-indexing (slices and tuples of slices being the most important).

NDIndex would have subclasses to represent the different types of indexing. It would be hashable, and also have some easy way to pull apart arguments (not __iter__, since we want to override __getitem__ as below, but something like NDIndex.args). All operations on NDIndex would return another NDIndex; nothing would return a slice, tuple, etc. directly.
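
As a very rough sketch of the wrapper (everything beyond the names NDIndex and args mentioned above is a placeholder, not a committed API):

class NDIndex:
    """Base class; subclasses would represent slices, integers, tuples, etc."""

    def __init__(self, *args):
        self._args = args

    @property
    def args(self):
        # The way to pull an index apart (deliberately not __iter__, since
        # __getitem__ is reserved for index composition, i1[i2]).
        return self._args

    def __eq__(self, other):
        return type(self) is type(other) and self.args == other.args

    def __hash__(self):
        # Hashable, unlike a raw slice, so it can be used as a dict key.
        return hash((type(self), self.args))


class Slice(NDIndex):
    """Placeholder subclass wrapping a single slice; canonicalization,
    reduce(), composition, etc. would live in classes like this."""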

Some examples of the sorts of operations that would be supported (a concrete illustration of their intended semantics follows the list). i1, etc. are NDIndex instances; you can imagine they are slices.

  • i1[i2]: This would create a new NDIndex i3 (when possible) so that a[i1][i2] == a[i3].

  • split(i0, [i1, i2, ...]): This would return a list of indices [j1, j2, ...], such that a[i0] = concat(a[i1][j1], a[i2][j2], ...).

    I'm not sure if this is the right abstraction. I have a function in versioned_hdf5 split_chunks which splits a slice into sub-slices over chunks of an array. So split_chunks(i0, CHUNK) would become split(i0, [slice(0, CHUNK), slice(CHUNK, 2*CHUNK), ...]).

  • Related to that, some functions to generate slices over chunks, like in the example above, would be useful. This is especially true because doing this over a single dimension isn't so bad, but it gets worse once your chunks are multi-dimensional (my main use case is HDF5, which deals in chunks; see https://www.pytables.org/usersguide/optimization.html#understanding-chunking).

  • len(i1): the maximum number of elements that can be indexed by i1.

  • i1.reduce(shape): assume that i1 will be indexed into an array of shape shape and reduce it (e.g., if any indices go beyond the extent of the array, reduce them down). After this, for instance, len(i1.reduce(shape)) would be more accurate for arrays of shape shape.

  • Concatenation, so i1 + i2 produces a single index so that a[i1 + i2] gives all the elements of a[i1] and a[i2]. This could also have a strict mode so that, for instance, it only works if the end result is a valid slice.

  • It would implement __index__ where that makes sense (the protocol can only return an integer, so this only applies to integer indices). It would also have a method to convert it back to a normal set of index objects (a tuple of slices, Ellipsis, None, and ndarrays), so it can be used directly as an index.
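
To pin down the intended semantics of a few of these, here are the corresponding identities checked directly against NumPy, with the resulting indices worked out by hand for these specific cases (no library API is assumed):

import numpy as np

a = np.arange(10)

# i1[i2]: composing two slices, so that a[i1][i2] == a[i3].
i1, i2 = slice(2, 9), slice(1, None, 2)
i3 = slice(3, 9, 2)                      # worked out by hand for this case
assert np.array_equal(a[i1][i2], a[i3])

# reduce(shape): clip a slice to the array's extent so len() becomes exact.
i4 = slice(0, 100)                       # runs past the end of a length-10 array
reduced = slice(*i4.indices(len(a)))     # slice(0, 10, 1); fine for positive steps
assert np.array_equal(a[i4], a[reduced]) and len(a[reduced]) == 10

# Concatenation: a[i1 + i2] should give all the elements of a[i1] and a[i2];
# here slice(0, 3) + slice(3, 6) would be slice(0, 6).
assert np.array_equal(a[slice(0, 6)], np.concatenate([a[slice(0, 3)], a[slice(3, 6)]]))

# Generating the chunk slices used by split(): slices covering an axis of
# length n in chunks of CHUNK elements.
CHUNK, n = 4, 10
chunk_slices = [slice(i, min(i + CHUNK, n)) for i in range(0, n, CHUNK)]
assert chunk_slices == [slice(0, 4), slice(4, 8), slice(8, 10)]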

For testing, I would use a combination of exhaustive tests and property-based tests using hypothesis. Basically, create an np.arange and test if the index is correct against it. That way we are assured we are keeping to numpy semantics. Testing is very important because this logic has a lot of corner cases and it's easy to get it wrong. We should aim for full test coverage, not just of lines but of cases.
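
As a sketch of the property-based side, here is the kind of hypothesis test I mean, restricted for simplicity to nonnegative, step-1 slices where the composed index can be worked out by hand (the real library would cover the general, much hairier cases the same way; compose_simple is illustrative, not a library function):

import numpy as np
from hypothesis import given, strategies as st

nonneg = st.integers(0, 30)
simple_slices = st.builds(lambda a, b: slice(a, b), nonneg, nonneg)

def compose_simple(i1, i2):
    # Hand-derived composition for nonnegative, step-1 slices only:
    # a[i1][i2] == a[compose_simple(i1, i2)].
    return slice(i1.start + i2.start, min(i1.stop, i1.start + i2.stop))

@given(i1=simple_slices, i2=simple_slices, n=st.integers(0, 40))
def test_compose_matches_numpy(i1, i2, n):
    a = np.arange(n)
    assert np.array_equal(a[i1][i2], a[compose_simple(i1, i2)])

test_compose_matches_numpy()  # runs the property when executed as a script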

To start out with, there would be a lot of NotImplementedErrors. We would then build things as we need them.

asmeurer commented Mar 6, 2020

@rgommers pointed me to https://numpy.org/neps/nep-0021-advanced-indexing.html. It's something we should consider supporting as well. CC @saulshanabrook
