Skip to content

Instantly share code, notes, and snippets.

@seberg
Last active August 29, 2015 14:24
Show Gist options
  • Save seberg/976373b6a2b7c4188591 to your computer and use it in GitHub Desktop.
Save seberg/976373b6a2b7c4188591 to your computer and use it in GitHub Desktop.
NEP: New ways to index NumPy arrays

New ways to index NumPy arrays

Author: Sebastian Berg
Contact: sebastian@sipsolutions.net
Date: 2015-07-14
Status: draft

Executive summary

Advanced indexing with multiple array indices is typically confusing to both new, and in many cases also old, users of NumPy. To avoid this problem and allow for more and clearer features, we propose (see below for explanations):

  1. Introduce arr.oindex[indices] which allows advanced indices, but uses outer indexing logic.
  2. Introduce arr.vindex[indices] which use the current "vectorized"/broadcasted logic but with two differences from fancy indexing:
    1. Boolean indices always use the outer indexing logic. (Multi dimensional booleans should be allowed).
    2. The integer index result dimensions are always the first axes of the result array. No transpose is done, even for a single integer array index.
  3. Vanilla indexing on the array will only give warnings and eventually errors either:
    • when there is ambiguity between legacy fancy and outer indexing (note that arr[[1, 2], :, 0] is such a case, an integer can be a "second" integer index array),
    • when any integer index array is present (possibly additional for more then one boolean index array).

These constraints are sufficient for making indexing generally consistent with expectations.

Note that all things mentioned here apply both for assignment as well as subscription.

Motivation

Old style advanced indexing with multiple array (boolean or integer) indices, also called "fancy indexing", tends to be very confusing for new users. While fancy (or legacy) indexing is useful in many cases one would naively assume that the result of multiple 1-d ranges is analogous to multiple slices along each dimension (also called "outer indexing").

However, legacy fancy indexing with multiple arrays broadcasts these arrays into a single index over multiple dimensions. There are three main points of confusion when multiple array indices are involved:

  1. Most new users will usually expect outer indexing (consistent with slicing). This is also the most common way of handling this in other packages or languages.
  2. The axes introduced by the array indices are at the front, unless all array indices are consecutive, in which case one can deduce where the user "expects" them to be:
    • arr[:, [0, 1], :, [0, 1]] will have the first dimension shaped 2.
    • arr[:, [0, 1], [0, 1]] will have the second dimension shaped 2.
  3. When a boolean array index is mixed with another boolean or integer array, the result is very hard to understand (the boolean array is converted to integer array indices and then broadcast), and hardly useful. There is no well defined broadcast for booleans, so that boolean indices are logically always "outer" type indices.

Proposed rules

From the three problems noted above some expectations for NumPy can be deduced:

  1. There should be a prominent outer/orthogonal indexing method such as arr.oindex[indices].
  2. Considering how confusing fancy indexing can be, it should only occur explicitly (e.g. arr.vindex[indices])
  3. A new arr.vindex[indices] method, would not be tied to the confusing transpose rules of fancy indexing (which is for example needed for the simple case of a single advanced index). Thus, it no transposing should be done. The axes of the advanced indices are always inserted at the front, even for a single index.
  4. Boolean indexing is conceptionally outer indexing. A broadcasting together with other advanced indices in the manner of legacy "fancy indexing" is generally not helpful or well defined. A user who wishes the "nonzero" plus broadcast behaviour can thus be expected to do this manually. Using this rule, a single boolean index can index into multiple dimensions at once.

All other rules for indexing are identical.

Open Questions

  1. Especially for the new indexing attributes oindex and vindex, a case could be made to not implicitly add an Ellipsis index if necessary. This helps finding bugs since a too high dimensional array can be caught.
  2. The names oindex and vindex are just suggestions at the time of writing this, another name NumPy has used for something like oindex is np.ix_.
  3. It would be possible to limit the use of boolean indices in vindex, assuming that they are rare and to some degree special.
  4. oindex and vindex could always return copies, even when no array operation occurs. One argument for using the same rules is that this way oindex can be used as a general index replacement.

Necessary changes to NumPy

Implement arr.oindex and arr.vindex objects to allow these indexing operations and create warnings (and eventually deprecate) ambiguous direct indexing operations on arrays.

Examples

Since the various kinds of indexing is hard to grasp in many cases, these examples hopefully give some more insights. Note that they are all in terms of shape. All original dimensions start with 5, advanced indexing inserts less long dimensions. (Note that ... or Ellipsis mostly inserts as many slices as needed to index the full array). These examples may be hard to grasp without working knowledge of advanced indexing as of NumPy 1.9.

Example array:

>>> arr = np.ones((5, 6, 7, 8))

Legacy fancy indexing

Single index is transposed (this is the same for all indexing types):

>>> arr[[0], ...].shape
(1, 6, 7, 8)
>>> arr[:, [0], ...].shape
(5, 1, 7, 8)

Multiple indices are transposed if consecutive:

>>> arr[:, [0], [0], :].shape  # future error
(5, 1, 7)
>>> arr[:, [0], :, [0]].shape  # future error
(1, 5, 6)

It is important to note that a scalar is integer array index in this sense (and gets broadcasted with the other advanced index):

>>> arr[:, [0], 0, :].shape  # future error (scalar is "fancy")
(5, 1, 7)
>>> arr[:, [0], :, 0].shape  # future error (scalar is "fancy")
(1, 5, 6)

Single boolean index can act on multiple dimensions (especially the whole array). It has to match (as of 1.10. a deprecation warning) the dimensions. The boolean index is otherwise identical to (multiple consecutive) integer array indices:

>>> # Create boolean index with one True value for the last two dimensions:
>>> bindx = np.zeros((7, 8), dtype=np.bool_)
>>> bindx[[0, 0]] = True
>>> arr[:, 0, bindx].shape
(5, 1)
>>> arr[0, :, bindx].shape
(1, 6)

The combination with anything that is not a scalar is confusing, e.g.:

>>> arr[[0], :, bindx].shape  # bindx result broadcasts with [0]
(1, 6)
>>> arr[:, [0, 1], bindx]  # IndexError

Outer indexing

Multiple indices are "orthogonal" and their result axes are inserted at the same place (they are not broadcasted):

>>> arr.oindex[:, [0], [0, 1], :].shape
(5, 1, 2, 8)
>>> arr.oindex[:, [0], :, [0, 1]].shape
(5, 1, 7, 2)
>>> arr.oindex[:, [0], 0, :].shape
(5, 1, 8)
>>> arr.oindex[:, [0], :, 0].shape
(5, 1, 7)

Boolean indices results are always inserted where the index is:

>>> # Create boolean index with one True value for the last two dimensions:
>>> bindx = np.zeros((7, 8), dtype=np.bool_)
>>> bindx[[0, 0]] = True
>>> arr.oindex[:, 0, bindx].shape
(5, 1)
>>> arr.oindex[0, :, bindx].shape
(6, 1)

Nothing changed in the presence of other advanced indices since:

>>> arr.oindex[[0], :, bindx].shape
(1, 6, 1)
>>> arr.oindex[:, [0, 1], bindx]
(5, 2, 1)

Vectorized/inner indexing

Multiple indices are broadcasted and iterated as one like fancy indexing, but the new axes area always inserted at the front:

>>> arr.vindex[:, [0], [0, 1], :].shape
(2, 5, 8)
>>> arr.vindex[:, [0], :, [0, 1]].shape
(2, 5, 7)
>>> arr.vindex[:, [0], 0, :].shape
(1, 5, 8)
>>> arr.vindex[:, [0], :, 0].shape
(1, 5, 7)

Boolean indices results are always inserted where the index is, exactly as in oindex given how specific they are to the axes they operate on:

>>> # Create boolean index with one True value for the last two dimensions:
>>> bindx = np.zeros((7, 8), dtype=np.bool_)
>>> bindx[[0, 0]] = True
>>> arr.vindex[:, 0, bindx].shape
(5, 1)
>>> arr.vindex[0, :, bindx].shape
(6, 1)

But other advanced indices are again transposed to the front:

>>> arr.vindex[[0], :, bindx].shape
(1, 6, 1)
>>> arr.vindex[:, [0, 1], bindx]
(2, 5, 1)

Related Questions

There exist a further indexing or indexing like method. That is the inverse of a command such as np.argmin(arr, axis=axis), to pick the specific elements along an axis given an array of (at least typically) the same size.

Doing such a thing with the indexing notation is not quite straight forward since the axis on which to pick elements has to be supplied. One plausible solution would be to create a function (calling it pick here for simplicity):

np.pick(arr, index_arr, axis=axis)

where index_arr has to be the same shape as arr except along axis. One could imagine that this can be useful together with other indexing types, but such a function may be sufficient and extra information needed seems easier to pass using a function convention. Another option would be to allow an argument such as compress_axes=None (just to have some name) which maps the axes from the index array to the new array with None signaling a new axis. Also keepdims could be added as a simple default. (Note that the use of axis is not compatible to np.take for an index_arr which is not zero or one dimensional.)

Another solution is to provide functions or features to the arg*``functions to map this to the equivalent ``vindex indexing operation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment