
An alternative-NEP on masking and missing values


The principle of this aNEP is to separate the APIs for masking and for missing values, following:

  • the current implementation of masked arrays
  • Nathaniel Smith's proposal

This discussion is only of the API, and not of the implementation.

Authors:

  • Matthew Brett

Initialization

Missing values can be set as np.NA and are displayed as NA:

>>> np.array([1.0, 2.0, np.NA, 7.0], dtype='NA[f8]')
array([1., 2., NA, 7.], dtype='NA[<f8]')

As the initialization is not ambiguous, this can be written without the NA dtype:

>>> np.array([1.0, 2.0, np.NA, 7.0])
array([1., 2., NA, 7.], dtype='NA[<f8]')

Masked values can be set as np.MASKED and are displayed as MASKED:

>>> np.array([1.0, 2.0, np.MASKED, 7.0], masked=True)
array([1., 2., MASKED, 7.], masked=True)

As the initialization is not ambiguous, this can be written without masked=True:

>>> np.array([1.0, 2.0, np.MASKED, 7.0])
array([1., 2., MASKED, 7.], masked=True)

Ufuncs

By default, NA values propagate:

>>> na_arr = np.array([1.0, 2.0, np.NA, 7.0])
>>> np.sum(na_arr)
NA('float64')

unless the skipna flag is set:

>>> np.sum(na_arr, skipna=True)
10.0
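
The proposed np.NA does not exist in released NumPy. As a rough analogue only, floating-point NaN shows the same propagate-by-default behaviour, with np.nansum playing the role of skipna=True. Using NaN in place of NA is an assumption made for this sketch; it is not the proposed API and only works for float dtypes.

```python
import numpy as np

# NaN stands in for the proposed np.NA (an assumption, float-only).
nan_arr = np.array([1.0, 2.0, np.nan, 7.0])

# Default reduction: the missing value propagates to the result.
propagated = np.sum(nan_arr)

# Analogue of skipna=True: ignore the missing element.
skipped = np.nansum(nan_arr)

print(np.isnan(propagated))  # True
print(skipped)               # 10.0
```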

By default, masking does not propagate:

>>> masked_arr = np.array([1.0, 2.0, np.MASKED, 7.0])
>>> np.sum(masked_arr)
10.0

unless the propmsk flag is set:

>>> np.sum(masked_arr, propmsk=True)
MASKED
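
For comparison, the existing np.ma module already skips masked elements in reductions. The sketch below shows that default, plus a small helper that imitates propmsk=True. The helper name sum_propagating_mask is hypothetical, and np.ma marks hidden elements with mask=True, which is the inverse of the "visible" convention used later in this aNEP.

```python
import numpy as np

# np.ma uses mask=True for hidden elements; the 0.0 payload is arbitrary.
masked_arr = np.ma.array([1.0, 2.0, 0.0, 7.0],
                         mask=[False, False, True, False])

# Default reduction skips masked elements, as this aNEP proposes.
total = masked_arr.sum()


def sum_propagating_mask(arr):
    """Hypothetical analogue of propmsk=True: any masked element masks the result."""
    if np.ma.getmaskarray(arr).any():
        return np.ma.masked
    return arr.sum()


print(total)                                             # 10.0
print(sum_propagating_mask(masked_arr) is np.ma.masked)  # True
```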

An array can be masked, and contain NA values:

>>> both_arr = np.array([1.0, 2.0, np.MASKED, np.NA, 7.0])

In the default case, the NA propagates and the masked value is skipped:

>>> np.sum(both_arr)
NA('float64')

With skipna=True the NA is skipped as well, so the result depends only on propmsk:

>>> np.sum(both_arr, skipna=True)
10.0
>>> np.sum(both_arr, skipna=True, propmsk=True)
MASKED

To break the tie between NA and MASKED when both would propagate, NAs propagate harder:

>>> np.sum(both_arr, propmsk=True)
NA('float64')
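
The reduction rules in this section can be collected into one small reference model. This is a pure-Python sketch of the semantics only, not NumPy code: the Sentinel class, the NA and MASKED objects, and the proposed_sum function are all hypothetical names introduced for illustration.

```python
class Sentinel:
    """Named marker object; stands in for the proposed np.NA / np.MASKED."""

    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return self.name


NA = Sentinel("NA")
MASKED = Sentinel("MASKED")


def proposed_sum(values, skipna=False, propmsk=False):
    """Model of the reduction rules described in the text."""
    has_na = any(v is NA for v in values)
    has_masked = any(v is MASKED for v in values)
    # NA is checked first, so it wins the tie when both would propagate.
    if has_na and not skipna:
        return NA
    if has_masked and propmsk:
        return MASKED
    # Otherwise every marker is skipped and the plain values are summed.
    return sum(v for v in values if v is not NA and v is not MASKED)


both = [1.0, 2.0, MASKED, NA, 7.0]
print(proposed_sum(both))                             # NA
print(proposed_sum(both, skipna=True))                # 10.0
print(proposed_sum(both, skipna=True, propmsk=True))  # MASKED
print(proposed_sum(both, propmsk=True))               # NA
```

Checking NA before MASKED is what encodes the tie-break; swapping the two tests would make masking win instead.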

Assignment

Assignment is obvious in the NA case:

>>> arr = np.array([1.0, 2.0, 7.0])
>>> arr[2] = np.NA
TypeError('dtype does not support NA')
>>> na_arr = np.array([1.0, 2.0, 7.0], dtype='NA[f8]')
>>> na_arr[2] = np.NA
>>> na_arr
array([1., 2., NA], dtype='NA[<f8]')

Direct assignment in the masked case is magic and confusing, and so happens only via the mask:

>>> masked_arr = np.array([1.0, 2.0, 7.0], masked=True)
>>> masked_arr[2] = np.NA
TypeError('dtype does not support NA')
>>> masked_arr[2] = np.MASKED
TypeError('float() argument must be a string or a number')
>>> masked_arr.visible[2] = False
>>> masked_arr
array([1., 2., MASKED], masked=True)
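
A minimal model of mask-only assignment might look like the following. The MaskedSketch class is hypothetical and written in plain Python. It also assumes, as np.ma does today, that hiding an element leaves its underlying value in place; the text above does not specify this.

```python
class MaskedSketch:
    """Hypothetical model: data plus a parallel 'visible' list of booleans."""

    def __init__(self, data):
        self.data = [float(x) for x in data]
        self.visible = [True] * len(self.data)

    def __setitem__(self, index, value):
        # float() rejects non-numeric markers with a TypeError;
        # item assignment never changes visibility.
        self.data[index] = float(value)

    def __repr__(self):
        shown = [str(d) if v else "MASKED"
                 for d, v in zip(self.data, self.visible)]
        return "array([" + ", ".join(shown) + "], masked=True)"


arr = MaskedSketch([1.0, 2.0, 7.0])
arr.visible[2] = False  # hide the element; the 7.0 payload is kept
print(arr)              # array([1.0, 2.0, MASKED], masked=True)
print(arr.data[2])      # 7.0
```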
