njsmith/masked-arrays-and-missing-values-aNEP.rst

An alternative-NEP on masking and missing values

The principle of this aNEP is to separate the APIs for masking and for missing values, according to:

* The current implementation of masked arrays
* Nathaniel Smith's proposal.

This discussion is only of the API, and not of the implementation.
Authors: Matthew Brett

Initialization

First, missing values can be set with np.NA and are displayed as NA:
>>> np.array([1.0, 2.0, np.NA, 7.0], dtype='NA[f8]')
array([1., 2., NA, 7.], dtype='NA[<f8]')
As the initialization is not ambiguous, this can be written without the NA dtype:
>>> np.array([1.0, 2.0, np.NA, 7.0])
array([1., 2., NA, 7.], dtype='NA[<f8]')
Masked values can be set with np.MASKED and are displayed as MASKED:
>>> np.array([1.0, 2.0, np.MASKED, 7.0], masked=True)
array([1., 2., MASKED, 7.], masked=True)
As the initialization is not ambiguous, this can be written without masked=True:
>>> np.array([1.0, 2.0, np.MASKED, 7.0])
array([1., 2., MASKED, 7.], masked=True)
Ufuncs

By default, NA values propagate:
>>> na_arr = np.array([1.0, 2.0, np.NA, 7.0])
>>> np.sum(na_arr)
NA('float64')
unless the skipna flag is set:
>>> np.sum(na_arr, skipna=True)
10.0
By default, masking does not propagate:
>>> masked_arr = np.array([1.0, 2.0, np.MASKED, 7.0])
>>> np.sum(masked_arr)
10.0
unless the propmsk flag is set:
>>> np.sum(masked_arr, propmsk=True)
MASKED
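For comparison, the two defaults above have rough analogs in released NumPy. The sketch below is only an analogy and not the proposed API: it assumes NaN as a stand-in for NA and a numpy.ma array as a stand-in for masked=True. np.NA, np.MASKED, skipna and propmsk are names from this proposal and are not used here.

```python
import numpy as np

# NaN stands in for NA here; this is an analogy, not the proposed NA dtype.
nan_arr = np.array([1.0, 2.0, np.nan, 7.0])
nan_total = np.sum(nan_arr)         # NaN propagates through the sum
skipped_total = np.nansum(nan_arr)  # NaN is skipped, similar in spirit to skipna=True

# numpy.ma stands in for masked=True; mask=True hides an element.
ma_arr = np.ma.masked_array([1.0, 2.0, 99.0, 7.0],
                            mask=[False, False, True, False])
masked_total = ma_arr.sum()         # the hidden element is ignored by default

print(nan_total, skipped_total, masked_total)
```

The hidden value 99.0 is arbitrary: it is there to show that a masked element keeps its underlying data but does not contribute to the sum.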
An array can be masked, and contain NA values:
>>> both_arr = np.array([1.0, 2.0, np.MASKED, np.NA, 7.0])
In the default case, the behavior is obvious:
>>> np.sum(both_arr)
NA('float64')
It's also obvious what to do with skipna=True:
>>> np.sum(both_arr, skipna=True)
10.0
>>> np.sum(both_arr, skipna=True, propmsk=True)
MASKED
To break the tie between NA and MASKED, NA takes precedence and propagates:
>>> np.sum(both_arr, propmsk=True)
NA('float64')
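The reduction rules in this section can be summarised in a small pure-Python sketch. It is illustrative only: the NA and MASKED sentinels and the sketch_sum function are hypothetical stand-ins for np.NA, np.MASKED and np.sum, and say nothing about how an implementation would store either kind of value.

```python
# Hypothetical sentinels standing in for np.NA and np.MASKED.
NA = object()
MASKED = object()


def sketch_sum(values, skipna=False, propmsk=False):
    """Sum a list of floats under the NA / MASKED rules described above."""
    has_na = any(v is NA for v in values)
    has_masked = any(v is MASKED for v in values)
    # NA propagates unless it is skipped, and wins any tie with MASKED.
    if has_na and not skipna:
        return NA
    # MASKED propagates only when explicitly requested.
    if has_masked and propmsk:
        return MASKED
    # Otherwise both kinds of value are left out of the sum.
    return sum(v for v in values if v is not NA and v is not MASKED)


both = [1.0, 2.0, MASKED, NA, 7.0]
print(sketch_sum(both) is NA)                                 # True
print(sketch_sum(both, skipna=True))                          # 10.0
print(sketch_sum(both, skipna=True, propmsk=True) is MASKED)  # True
print(sketch_sum(both, propmsk=True) is NA)                   # True
```

Checking for NA before MASKED is what encodes the tie-break: with propmsk=True and no skipna, the NA branch is reached first.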
Assignment

Assignment is obvious in the NA case:
>>> arr = np.array([1.0, 2.0, 7.0])
>>> arr[2] = np.NA
TypeError('dtype does not support NA')
>>> na_arr = np.array([1.0, 2.0, 7.0], dtype='NA[f8]')
>>> na_arr[2] = np.NA
>>> na_arr
array([1., 2., NA], dtype='NA[<f8]')
Direct assignment in the masked case is magic and confusing, and so happens only via the mask:
>>> masked_arr = np.array([1.0, 2.0, 7.0], masked=True)
>>> masked_arr[2] = np.NA
TypeError('dtype does not support NA')
>>> masked_arr[2] = np.MASKED
TypeError('float() argument must be a string or a number')
>>> masked_arr.visible[2] = False
>>> masked_arr
array([1., 2., MASKED], masked=True)
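The masked-assignment rule can be sketched with a toy class. This is a minimal illustration under stated assumptions, not the proposed implementation: MaskedSketch, its data attribute and the NA / MASKED sentinels are hypothetical names, and only the visible attribute and the error messages are taken from the examples above.

```python
NA = object()      # hypothetical stand-in for np.NA
MASKED = object()  # hypothetical stand-in for np.MASKED


class MaskedSketch:
    """Toy 1-D masked array: float values plus a per-element `visible` flag."""

    def __init__(self, values):
        self.data = [float(v) for v in values]
        self.visible = [True] * len(self.data)

    def __setitem__(self, index, value):
        # Neither sentinel is a valid element value for a masked float array.
        if value is NA:
            raise TypeError("dtype does not support NA")
        if value is MASKED:
            raise TypeError("float() argument must be a string or a number")
        self.data[index] = float(value)

    def __repr__(self):
        shown = [repr(d) if v else "MASKED"
                 for d, v in zip(self.data, self.visible)]
        return "MaskedSketch([" + ", ".join(shown) + "])"


arr = MaskedSketch([1.0, 2.0, 7.0])
arr.visible[2] = False  # masking goes through the mask, never through assignment
print(arr)              # MaskedSketch([1.0, 2.0, MASKED])
```

Hiding an element leaves the underlying 7.0 in data, so setting arr.visible[2] = True would reveal it again.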