object array docs (future ideas)

Object Arrays

Object arrays are ndarrays with a datatype of np.object whose elements are Python objects, enabling use of numpy's vectorized operations and broadcasting rules with arbitrary Python types. Object arrays have certain special rules to resolve ambiguities that arise between python types and numpy types, described here.

Envisioned uses of object arrays include:

  • Creating ndarrays whose elements are other ndarrays of varying length
  • Creating ndarrays containing number-like Python objects, for example mpmath's multiprecision types, Python's built-in arbitrary-precision integers, or the Decimal type.

Object arrays are often useful for storing Python strings, because they allow arbitrary string lengths (whereas a non-object numpy array's string length is fixed), and because if a string is repeated in the array Python may store it only once and reuse references to it (string interning), saving memory.
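
For example (Python 3 shown; the fixed-width dtype below is determined by the longest string):

>>> np.array(['a', 'much longer string'])                    # fixed-width string dtype
array(['a', 'much longer string'], dtype='<U18')
>>> np.array(['a', 'much longer string'], dtype=np.object)   # references to Python str objects
array(['a', 'much longer string'], dtype=object)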

(Add a note here about how in many cases a "proper" solution is to create a new dtype, but object arrays can be a quick workaround)

Creating object arrays

Object arrays can be created using np.array and explicitly supplying np.object as the dtype argument.

>>> a = np.array([1,2,3], dtype=np.object)

Note that, in deviation from the normal coercion rules, numpy will not attempt to create an object array unless np.object is explicitly supplied as the dtype [, or unless the supplied data contains a Python integer larger than is representable by the largest numpy integer type?]. This prevents the common error of mistakenly supplying subsequences of different lengths, which the normal coercion rules would silently convert to an object array.

Also in deviation from the normal unpacking rules, when the dtype is np.object, np.array will only descend into subsequences of the input sequence if the subsequences are Python lists (and not any other sequence type, such as np.ndarray) and the lists are all of equal length. This resolves ambiguity about the amount of nesting desired: the input [[1,2],[3,4]] will thus be interpreted as an object array of shape (2,2) containing Python integers, and not as an object array of shape (2,) containing Python lists. Creating an object array containing equal-sized Python lists is more complicated, but may be accomplished in two steps:

>>> a = np.empty(2, dtype=np.object)
>>> a[:] = [[1,2,3],[4,5,6]]

Numpy defines an additional object type, np.pytype, which is used to cast to built-in Python types. Casting an ndarray to np.object dtype creates an object array but does not convert the values to Python types, while casting to np.pytype also converts to native Python types by calling .item() on each element.

>>> def printresult(v):
...     print(v.dtype, type(v[0]))
...
>>> printresult( np.arange(10).astype(np.object) )
object <class 'numpy.int64'>
>>> printresult( np.arange(10).astype(np.pytype) )
object <class 'int'>

This is useful for taking advantage of properties of Python's native types, such as its arbitrary-precision integers.
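
For instance, a small sketch: because the element below is a Python int, multiplication cannot overflow at 64 bits (output shown in Python 2 style, matching the examples that follow):

>>> a = np.array([2**40], dtype=np.object)   # the element is a Python int
>>> a * a                                    # exact result, too large for int64
array([1208925819614629174706176L], dtype=object)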

Numpy handles integers larger than its largest integer type by using object arrays. Operations involving large Python integers will often automatically coerce to object type:

>>> np.array([2**128])
array([340282366920938463463374607431768211456L], dtype=object)
>>> np.array([0], dtype=np.int64) + 2**128
array([340282366920938463463374607431768211456L], dtype=object)

Nested sequences

Nesting ndarrays (and other sequence objects such as lists) inside of object arrays can be tricky because numpy will attempt to broadcast assignment operations involving two ndarrays. As a special case for object arrays, values may be assigned to each index individually to avoid broadcasting:

>>> a = np.array([8,9], dtype=np.object)
>>> a[:] = np.array([1,2]) # broadcasts: assigns 1 to a[0] and 2 to a[1]
>>> a[0] = np.array([1,2]) # does not broadcast: stores the whole array at a[0]
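
This is also how an object array whose elements are ndarrays of differing lengths (one of the envisioned uses above) can be built:

>>> a = np.empty(2, dtype=np.object)
>>> a[0] = np.arange(3)   # stores the whole length-3 array at index 0
>>> a[1] = np.arange(5)   # stores the whole length-5 array at index 1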

The same applies to object-dtype fields of structured arrays and scalars:

>>> a = np.empty(2, dtype='O,i8')
>>> a['f0'][0] = np.arange(3)
>>> a[0]['f0'] = np.arange(3)

This is also how one can create an object array containing equal-sized lists:

>>> a = np.empty(2, dtype=np.object)
>>> a[0] = [1,2,3]
>>> a[1] = [4,5,6]

Viewing Object Arrays

Viewing an object array as a different type is not allowed, since this would reinterpret (and potentially corrupt) the underlying object pointers. Similarly, viewing a non-object array as an object array is not allowed.

>>> a = np.array([1,2,3], dtype=np.object)
>>> a.view(np.int64)
TypeError: Cannot change data-type for object array.
>>> np.array([1,2,3], dtype='i4').view(np.object)
TypeError

The array may still be cast to another type using astype as usual.
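
For example (a quick sketch, assuming the elements are all convertible to the target type):

>>> a = np.array([1, 2, 3], dtype=np.object)
>>> b = a.astype(np.int64)   # copies, converting each Python int to an int64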

ufuncs and object arrays

Ufuncs operate specially on object arrays, since the objects contained in the array may not be numpy types. To evaluate a ufunc, numpy tries a series of strategies, in order, for each element of the array (for unary ufuncs) or for each pair of elements from the two arrays (for binary ufuncs):

  1. If there is a Python special method (such as __add__) corresponding to the ufunc, numpy uses it to evaluate the ufunc. This step only applies to the ufuncs add, subtract, multiply, divide, true_divide, floor_divide, remainder, negative, [positive], power, mod, absolute, bitwise_and, bitwise_or, bitwise_xor, invert, left_shift, right_shift, greater, greater_equal, less, less_equal, not_equal, equal.

  2. If all elements passed to the ufunc are one of: a numpy scalar, a python bool, int, long, float or complex, numpy handles evaluation of the ufunc. Unary ufuncs will return a numpy scalar if the input element was a numpy scalar and a python type otherwise. Binary ufuncs will return a numpy scalar if either input element was a numpy scalar, and a Python type otherwise. Note that multiprecision Python integers are evaluated specially to give a multiprecision result.

  3. If the first element has a method with the same name as the ufunc, that method is called, e.g. elem.sqrt(). Binary ufuncs such as logaddexp(x, y) will call x.logaddexp(y). (See the sketch after this list.)

  4. For a small number of ufuncs, notably np.minimum and np.maximum, numpy falls back to a pure-Python implementation, shown in the table below. For the remaining ufuncs a TypeError is raised.
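
As a brief sketch of steps 1 and 3 using standard-library types: np.add on Fraction elements goes through Fraction.__add__, while np.sqrt on Decimal elements falls through to the elements' own sqrt() method.

>>> from fractions import Fraction
>>> from decimal import Decimal
>>> f = np.array([Fraction(1, 3), Fraction(1, 6)], dtype=np.object)
>>> f + f                      # step 1: dispatches to Fraction.__add__
array([Fraction(2, 3), Fraction(1, 3)], dtype=object)
>>> d = np.array([Decimal(2)], dtype=np.object)
>>> np.sqrt(d)                 # step 3: calls Decimal.sqrt() on each element
array([Decimal('1.414213562373095048801688724')], dtype=object)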

This provides a rough way of creating new numeric types compatible with numpy: define a class which implements, as methods, the ufuncs not covered by step 1, and then create an object array containing elements of your type. However, note that creating a user-defined type (see "User-Defined Types") is often preferable: user-defined types will be much faster and give you control over casting and coercion.
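
For illustration, a toy class (MyNum is invented for this example) that supports np.multiply through __mul__ (step 1) and np.sqrt through a sqrt method (step 3):

>>> import math
>>> class MyNum(object):
...     """Toy numeric type: wraps a float, supports * and sqrt."""
...     def __init__(self, x):
...         self.x = x
...     def __mul__(self, other):   # used by np.multiply via step 1
...         return MyNum(self.x * other.x)
...     def sqrt(self):             # used by np.sqrt via step 3
...         return MyNum(math.sqrt(self.x))
...     def __repr__(self):
...         return 'MyNum(%r)' % self.x
...
>>> a = np.array([MyNum(4.0), MyNum(9.0)], dtype=np.object)
>>> a * a
array([MyNum(16.0), MyNum(81.0)], dtype=object)
>>> np.sqrt(a)
array([MyNum(2.0), MyNum(3.0)], dtype=object)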

Internally, in step 2, numpy evaluates ufuncs involving Python float or complex by converting to the equivalent numpy type, computing the ufunc, and converting back to the Python type. However, it uses a custom ufunc implementation for Python int and long (not documented here; see _objectmath.py in the numpy source) to handle Python integers larger than the largest numpy integer type can represent.

The pure-Python fallback implementations used in step 4 are given in the following table:

Numpy ufunc                        Python implementation
np.logical_and(x, y)               bool(x and y)
np.logical_or(x, y)                bool(x or y)
np.logical_xor(x, y)               bool(x or y) and not bool(x and y)
np.logical_not(x)                  bool(not x)
np.maximum(x, y), np.fmax(x, y)    max(x, y)
np.minimum(x, y), np.fmin(x, y)    min(x, y)
np.degrees(x), np.rad2deg(x)       x*180/np.pi
np.radians(x), np.deg2rad(x)       x*np.pi/180
np.square(x)                       x*x
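
A short illustration of the step 4 fallbacks (assuming np.maximum and np.square take the pure-Python path shown in the table for these object elements):

>>> from fractions import Fraction
>>> a = np.array([Fraction(1, 3), Fraction(3, 4)], dtype=np.object)
>>> np.maximum(a, Fraction(1, 2))   # falls back to max(x, y)
array([Fraction(1, 2), Fraction(3, 4)], dtype=object)
>>> np.square(a)                    # falls back to x*x
array([Fraction(1, 9), Fraction(9, 16)], dtype=object)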

Misc Notes/Scratch (remove later)

Relevant issues/PRs:

explicitly specifying dtype=object:

Nesting issues:

Casting issues:

Note that with the plans above

>>> np.array([0,1,2], dtype=np.object)

will give an object array containing Python ints. To get an object array containing numpy types using np.array one must do something like

>>> np.array([np.int64(x) for x in [0,1,2]], dtype=np.object)

but really it's easier to do

>>> np.arange(3).astype(np.object)
>>> type(_[0])
numpy.int64

The general idea is that if you use np.array with dtype=np.object it will take your supplied objects exactly as they are, no conversion.

On the 'pytype' type: What are the alternatives? Maybe this can already be done with vectorize? Defining itemize = vectorize(lambda x: x.item()), it looks like itemize(arr.astype(np.object)) might work, but this casts to int for some reason.

But actually maybe having a special pytype type also makes it clearer that some kind of casting is going on.
