@njsmith
Created July 6, 2011 20:35

miniNEP 2: NA support via special dtypes

To try to make more progress on the whole missing values/masked arrays/... debate, it seems useful to have a more technical discussion of the pieces we can agree on. This is the second such miniNEP; it attempts to nail down the details of how NAs can be implemented using special dtypes.

Rationale

An ordinary value is something like an integer or a floating point number. A missing value is a placeholder for an ordinary value that is for some reason unavailable. For example, in working with statistical data, we often build tables in which each row represents one item, and each column represents properties of that item. For instance, we might take a group of people and for each one record height, age, education level, and income, and then stick these values into a table. But then we discover that our research assistant screwed up and forgot to record the age of one of our individuals. We could throw out the rest of their data as well, but this would be wasteful; even such an incomplete row is still perfectly usable for some analyses (e.g., we can compute the correlation of height and income). The traditional way to handle this would be to stick some particular meaningless value in for the missing data, e.g., recording this person's age as 0. But this is very error prone; we may later forget about these special values while running other analyses, and discover to our surprise that babies have higher incomes than teenagers. (In this case, the solution would be to just leave out all the items where we have no age recorded, but this isn't a general solution; many analyses require something more clever to handle missing values.) So instead of using an ordinary value like 0, we define a special "missing" value, written "NA" for "not available".

There are several possible ways to represent such a value in memory. For instance, we could reserve a specific value (like 0, or a particular NaN, or the smallest negative integer) and then ensure that this value is treated specially by all arithmetic and other operations on our array. Another option would be to add an additional mask array next to our main array, use this to indicate which values should be treated as NA, and then extend our array operations to check this mask array whenever performing computations. Each implementation approach has various strengths and weaknesses, but here we focus exclusively on the former (value-based) approach and leave the possible addition of the latter to future discussion. The core advantages of this approach are that (1) it adds no additional memory overhead, (2) it is straightforward to store and retrieve such arrays to disk using existing file storage formats, (3) it allows binary compatibility with R arrays including NA values, (4) it is compatible with the common practice of using NaN to indicate missingness when working with floating point numbers, and (5) the dtype is already a place where "weird things can happen" -- there is a wide variety of dtypes that don't act like ordinary numbers (including structs, Python objects, fixed-length strings, ...), so code that accepts arbitrary numpy arrays already has to be prepared to handle these (even if only by checking for them and raising an error). Therefore adding yet more new dtypes has less impact on extension authors than changing the ndarray object itself would.

The basic semantics of NA values are as follows. Like any other value, they must be supported by your array's dtype -- you can't store a floating point number in an array with dtype=int32, and you can't store an NA in it either. You need an array with dtype=NAint32 or something (exact syntax to be determined). Otherwise, NA values act exactly like any other values. In particular, you can apply arithmetic functions and so forth to them. By default, any function which takes an NA as an argument always returns an NA as well, regardless of the values of the other arguments. This ensures that if we try to compute the correlation of income with age, we will get "NA", meaning "given that some of the entries could be anything, the answer could be anything as well". This reminds us to spend a moment thinking about how we should rephrase our question to be more meaningful. And as a convenience for those times when you do decide that you just want the correlation between the known ages and incomes, you can request that behavior by adding a single argument to your function call.

For floating point computations, NAs and NaNs have (almost?) identical behavior. But they represent different things -- NaN an invalid computation like 0/0, NA a value that is not available -- and distinguishing between them is useful because in some situations they should be treated differently. (For example, an imputation procedure should replace NAs with imputed values, but probably should leave NaNs alone.) And in any case, we can't use NaNs for integers, or strings, or booleans, so we need NA regardless; and once we have NA support for all these types, we might as well support it for floating point too, for consistency.

General strategy

Numpy already has a general mechanism for defining new dtypes and slotting them in so that they're supported by ndarrays, by the casting machinery, by ufuncs, and so on. In principle, we could implement NA-dtypes just using these existing interfaces. But we don't want to do that, because defining all those new ufunc loops etc. from scratch would be a huge hassle, especially since the basic functionality needed is the same in all cases. So we need some generic functionality for NAs -- but it would be better not to bake this in as a single set of special "NA types", since users may well want to define new custom dtypes that have their own NA values, and have them integrate well with the rest of the NA machinery. Our strategy, therefore, is to avoid the mid-layer mistake by exposing some code for generic NA handling in different situations, which dtypes can selectively use or not as they choose.

Some example use cases:
  1. We want to define a dtype that acts exactly like an int32, except that the most negative value is treated as NA.
  2. We want to define a parametrized dtype to represent categorical data, and the bit-pattern to be used for NA depends on the number of categories defined, so our code needs to play an active role handling it rather than simply deferring to the standard machinery.
  3. We want to define a dtype that acts like a length-10 string and supports NAs. Since our string may hold arbitrary binary values, we want to actually allocate 11 bytes for it, with the first byte a flag indicating whether this string is NA and the rest containing the string content.
  4. We want to define a dtype that allows multiple different types of NA data, which print differently and can be distinguished by the new ufunc that we define called is_na_of_type(...), but otherwise takes advantage of the generic NA machinery for most operations.

dtype C-level API extensions

The PyArray_Descr struct gains the following new fields:

void * NA_value;            /* default bit-pattern to use for NA (elsize bytes) */
PyArray_Descr * NA_extends; /* the underlying dtype this NA dtype wraps */
int NA_extends_offset;      /* byte offset of the underlying data within each element */

The following new flag values are defined:

NPY_NA_AUTO_ARRFUNCS      /* delegate missing PyArray_ArrFuncs entries to NA_extends */
NPY_NA_AUTO_CAST          /* enable the automatic casting rules (see below) */
NPY_NA_AUTO_UFUNC         /* fall back to NA_extends ufunc loops (see below) */
NPY_NA_AUTO_UFUNC_CHECKED /* as above, plus check outputs for accidental NAs */
NPY_NA_AUTO_ALL           /* the above flags OR'ed together */

The PyArray_ArrFuncs struct gains the following new fields:

void (*isna)(void * src, void * dst, npy_intp n, void * arr);  /* write n flags to dst: nonzero where the src element is NA */
void (*clearna)(void * data, npy_intp n, void * arr);          /* make all n elements at data non-NA */

We add at least one new convenience macro:

#define NPY_NA_SUPPORTED(dtype) ((dtype)->f->isna != NULL)
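
To make this concrete, here is a minimal sketch of how use case 1 above (an int32 dtype whose most negative value means NA) might fill in these fields. The NA_* fields, flag bits, and function slots used below are the ones proposed in this document, not existing numpy API, and setup_na_int32 is a hypothetical helper:

#include <numpy/arrayobject.h>

/* Bit-pattern reserved for NA: the most negative int32. */
static npy_int32 na_int32_value = NPY_MIN_INT32;

/* Proposed isna slot: write one flag per element into dst, nonzero where
 * the element holds the NA bit-pattern. (Assumes contiguous data; the
 * exact calling convention is left open by this proposal.) */
static void
na_int32_isna(void *src, void *dst, npy_intp n, void *arr)
{
    npy_int32 *in = (npy_int32 *)src;
    npy_bool *out = (npy_bool *)dst;
    npy_intp i;
    for (i = 0; i < n; i++) {
        out[i] = (in[i] == NPY_MIN_INT32);
    }
}

/* Proposed clearna slot: elsize == NA_extends->elsize here, so there is
 * no auxiliary NA storage to reset and a no-op suffices. */
static void
na_int32_clearna(void *data, npy_intp n, void *arr)
{
    (void)data; (void)n; (void)arr;
}

/* Hypothetical registration-time setup for the new descriptor. */
static void
setup_na_int32(PyArray_Descr *descr)
{
    descr->NA_value = &na_int32_value;              /* default NA bit-pattern */
    descr->NA_extends = PyArray_DescrFromType(NPY_INT32);
    descr->NA_extends_offset = 0;                   /* no flag byte; data at offset 0 */
    descr->flags |= NPY_NA_AUTO_ALL;                /* opt in to all generic handling */
    descr->f->isna = na_int32_isna;
    descr->f->clearna = na_int32_clearna;
}

Because elsize here equals NA_extends->elsize, clearna can be a no-op; the string example later in this section shows the other case.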

The general idea is that anywhere we used to call a dtype-specific function pointer, the code will be modified to instead:

  1. Check whether the relevant NPY_NA_AUTO... bit is enabled, the NA_extends field is non-NULL, and the function pointer we wanted to call is NULL.
  2. If these conditions are met, then use isna to identify which entries in the array are NA, and handle them appropriately. Then look up whatever function we were going to call using this dtype on the NA_extends dtype instead, and use that to handle the non-NA elements. (A sketch of this lookup follows this list.)
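
In pseudocode, the lookup might look like the following sketch; find_slot is a stand-in for whatever function-pointer fetch the real call site performs, not an existing numpy routine:

#include <numpy/arrayobject.h>

typedef void (*arrfunc_t)(void);   /* stand-in for any dtype function slot */

/* Placeholder for "fetch the relevant function pointer from descr->f". */
extern arrfunc_t find_slot(PyArray_Descr *descr);

static arrfunc_t
lookup_with_na_fallback(PyArray_Descr *descr, int na_auto_flag)
{
    arrfunc_t f = find_slot(descr);
    if (f == NULL
        && (descr->flags & na_auto_flag)   /* proposed NPY_NA_AUTO_* bit */
        && descr->NA_extends != NULL) {    /* proposed field */
        /* The caller then (a) uses descr->f->isna to find and handle the
         * NA entries itself, and (b) runs the NA_extends dtype's version
         * of the function on the non-NA elements, offsetting data
         * pointers by descr->NA_extends_offset. */
        f = find_slot(descr->NA_extends);
    }
    return f;
}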

For more specifics, see following sections.

Note that if NA_extends points to a parametrized dtype, then the dtype object it points to must be fully specified. For example, if it is a string dtype, it must have a non-zero elsize field.

In order to handle the case where the NA information is stored in a field next to the "real" data, the NA_extends_offset field is set to a non-zero value; it must point to the location within each element of this dtype where some data of the NA_extends dtype is found. For example, if we are storing 10-byte strings with an NA indicator byte at the beginning, then we have:

elsize == 11
NA_extends_offset == 1
NA_extends->elsize == 10

When delegating to the NA_extends dtype, we offset our data pointer by NA_extends_offset (while keeping our strides the same) so that it sees an array of data of the expected type (plus some superfluous padding). This is basically the same mechanism that record dtypes use, IIUC, so it should be pretty well-tested.

When delegating to a function that cannot handle "misbehaved" source data (see the PyArray_ArrFuncs documentation for details), we need to check for alignment issues before delegating (especially with a non-zero NA_extends_offset). If there's a problem, then we need to "clean up" the source data first, using the usual mechanisms for handling misaligned data. (Of course, we should usually set up our dtypes so that there aren't any alignment issues, but if someone screws that up, or decides that reduced memory usage is more important to them than fast inner loops, then we should still handle that gracefully, as we do now.)

The NA_value and clearna fields are used for various sorts of casting. NA_value is a bit-pattern to be used when, for example, assigning from np.NA. clearna can be a no-op if elsize and NA_extends->elsize are the same, but if they aren't then it should clear whatever auxiliary NA storage this dtype uses, so that none of the specified array elements are NA.
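
To illustrate the non-trivial case, here is a hedged sketch of isna and clearna for the 11-byte string layout above (NA flag byte at offset 0, string data at offset 1). It assumes these functions receive packed, contiguous elements, which this proposal leaves unspecified:

#include <numpy/arrayobject.h>

#define NA_STR_ELSIZE 11   /* 1 NA-flag byte + 10 string bytes, as above */

/* A nonzero flag byte means the element is NA. */
static void
na_string10_isna(void *src, void *dst, npy_intp n, void *arr)
{
    char *p = (char *)src;
    npy_bool *out = (npy_bool *)dst;
    npy_intp i;
    for (i = 0; i < n; i++) {
        out[i] = (p[i * NA_STR_ELSIZE] != 0);
    }
}

/* elsize != NA_extends->elsize here, so clearna is not a no-op: it must
 * reset the auxiliary flag byte of every element so that none is NA. */
static void
na_string10_clearna(void *data, npy_intp n, void *arr)
{
    char *p = (char *)data;
    npy_intp i;
    for (i = 0; i < n; i++) {
        p[i * NA_STR_ELSIZE] = 0;
    }
}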

Core dtype functions

The following functions are defined in PyArray_ArrFuncs. The special behavior described here is enabled by the NPY_NA_AUTO_ARRFUNCS bit in the dtype flags, and only enabled if the given function field is not filled in.

getitem: Calls isna. If isna returns true, returns np.NA. Otherwise, delegates to the NA_extends dtype.

setitem: If the input object is np.NA, then runs memcpy(data, self->NA_value, arr->dtype->elsize). Otherwise, calls clearna, and then delegates to the NA_extends dtype.

copyswapn, copyswap: FIXME: Not sure whether there's any special handling to use for these?

compare: FIXME: how should this handle NAs? R's sort function *discards* NAs, which doesn't seem like a good option.

argmax: FIXME: what is this used for? If it's the underlying implementation for np.max, then it really needs some way to get a skipna argument. If not, then the appropriate semantics depends on what it's supposed to accomplish...

dotfunc: QUESTION: is it actually guaranteed that everything has the same dtype? FIXME: same issues as for argmax.

scanfunc: This one's ugly. We may have to explicitly override it in all of our special dtypes, because assuming that we want the option of, say, having the token "NA" represent an NA value in a text file, we need some way to check whether that's there before delegating. But ungetc is only guaranteed to let us put back 1 character, and we need 2 (or maybe 3 if we actually check for "NA "). The other option would be to read to the next delimiter, check whether we have an NA, and if not then delegate to fromstr instead of scanfunc, but according to the current API, each dtype might in principle use a totally different rule for defining "the next delimiter". So... any ideas? (FIXME)

fromstr: Easy -- check for "NA ", if present then assign NA_value, otherwise call clearna and delegate.

nonzero: FIXME: again, what is this used for? (It seems redundant with using the casting machinery to cast to bool.) Probably it needs to be modified so that it can return NA, though...

fill: Use isna to check if either of the first two values is NA. If so, then fill the rest of the array with NA_value. Otherwise, call clearna and then delegate.

fillwithvalue: Guess this can just delegate?

sort, argsort: These should probably arrange to sort NAs to a particular place in the array (either the front or the back -- any opinions?)

scalarkind: FIXME: I have no idea what this does.

castdict, cancastscalarkindto, cancastto: See the section on casting below.

Casting

FIXME: this really needs attention from an expert on numpy's casting rules. But I can't seem to find the docs that explain how casting loops are looked up and decided between (e.g., if you're casting from dtype A to dtype B, which dtype's loops are used?), so I can't go into details. But those details are tricky and they matter...

The general idea is, if you have a dtype with NPY_NA_AUTO_CAST set, then the following conversions are automatically allowed:
  • Casting from the underlying type to the NA-type: this is performed by the usual clearna + potentially-strided copy dance. Also, isna is called to check that none of the regular values have been accidentally converted into NA; if so, then an error is raised.
  • Casting from the NA-type to the underlying type: allowed in principle, but if isna returns true for any of the values that are to be converted, then again, an error is raised. (If you want to get around this, use np.view(array_with_NAs, dtype=float).)
  • Casting between the NA-type and other types that do not support NA: this is allowed if the underlying type is allowed to cast to the other type, and is performed by combining a cast to or from the underlying type (using the above rules) with a cast to or from the other type (using the underlying type's rules).
  • Casting between the NA-type and other types that do support NA: if the other type has NPY_NA_AUTO_CAST set, then we use the above rules plus the usual dance with isna on one array being converted to NA_value elements in the other. If only one of the arrays has NPY_NA_AUTO_CAST set, then it's assumed that that dtype knows what it's doing, and we don't do any magic. (But this is one of the things that I'm not sure makes sense, as per my caveat above.)

Ufuncs

All ufuncs gain an additional optional keyword argument, skipNA=, which defaults to False.

If skipNA == True, then the ufunc machinery *unconditionally* calls isna for any dtype where NPY_NA_SUPPORTED(dtype) is true, and then acts as if any values for which isna returns True were masked out in the where= argument (see miniNEP 1 for the behavior of where=). If a where= argument is also given, then it acts as if the isna values had been ANDed out of the where= mask, though it does not actually modify the mask. Unlike the other changes below, this is performed *unconditionally* for any dtype which has an isna function defined; the NPY_NA_AUTO_UFUNC flag is *not* checked.

If NPY_NA_AUTO_UFUNC is set, then ufunc loop lookup is modified so that whenever it checks for the existence of a loop on the current dtype and does not find one, it also checks for a loop on the NA_extends dtype. If that loop is found, then it uses it in the normal way, with the exceptions that (1) it is only called for values which are not NA according to isna, (2) if the output array has NPY_NA_AUTO_UFUNC set, then clearna is called on it before calling the ufunc loop, (3) pointer offsets are adjusted by NA_extends_offset before calling the ufunc loop. In addition, if NPY_NA_AUTO_UFUNC_CHECKED is set, then after evaluating the ufunc loop we call isna on the *output* array, and if there are any NAs in the output which were not in the input, then we raise an error. (The intention of this is to catch cases where, say, we represent NA using the most-negative integer, and then someone's arithmetic overflows to create such a value by accident.)

FIXME: We should go into more detail here about how NPY_NA_AUTO_UFUNC works when there are multiple input arrays, of which potentially some have the flag set and some do not.

Printing

FIXME: There should be some sort of mechanism by which values which are NA are automatically repr'ed as NA, but I don't really understand how numpy printing works, so I'll let someone else fill in this section.

Indexing

Scalar indexing like a[12] goes via the getitem function, so according to the proposal as described above, if a dtype delegates getitem, then scalar indexing on NAs will return the object np.NA. (If it doesn't delegate getitem, of course, then it can return whatever it wants.)

This seems like the simplest approach, but an alternative would be to add a special case to scalar indexing, where if an NPY_NA_AUTO_INDEX flag were set, then it would call isna on the specified element. If this returned false, it would call getitem as usual; otherwise, it would return a 0-d array containing the specified element. The problem with this is that it breaks expressions like if a[i] is np.NA: .... (Of course, there is nothing nearly so convenient as that for NaN values now, but then, NaN values don't have their own global singleton.) So for now we stick to scalar indexing just returning np.NA, but this can be revisited if anyone objects.
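
As a sketch of that default getitem path under the generic machinery (Npy_NA below is a hypothetical C-level handle to the proposed np.NA singleton; nothing like it exists today):

#include <numpy/arrayobject.h>

extern PyObject *Npy_NA;   /* hypothetical C handle to the proposed np.NA */

/* Generic getitem under NPY_NA_AUTO_ARRFUNCS: return np.NA for NA
 * elements, otherwise delegate to the underlying dtype's getitem. */
static PyObject *
na_auto_getitem(void *data, void *arr)
{
    PyArray_Descr *descr = PyArray_DESCR((PyArrayObject *)arr);
    npy_bool is_na = 0;

    descr->f->isna(data, &is_na, 1, arr);
    if (is_na) {
        Py_INCREF(Npy_NA);
        return Npy_NA;
    }
    /* Skip past any auxiliary NA storage before delegating. */
    return descr->NA_extends->f->getitem(
        (char *)data + descr->NA_extends_offset, arr);
}
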
Python API for generic NA support

NumPy will gain a global singleton called numpy.NA, similar to None, but with semantics reflecting its status as a missing value. In particular, trying to treat it as a boolean will raise an exception, and comparisons with it will produce numpy.NA instead of True or False. These basics are adopted from the behavior of the NA value in the R project. To dig deeper into the ideas, http://en.wikipedia.org/wiki/Ternary_logic#Kleene_logic provides a starting point. Most operations on np.NA (e.g., __add__, __mul__) are overridden to unconditionally return np.NA.

The automagic dtype detection used for expressions like np.asarray([1, 2, 3]) or np.asarray([1.0, 2.0, 3.0]) will be extended to recognize the np.NA value, and use it to automatically switch to a built-in NA-enabled dtype (which one being determined by the other elements in the array). A simple np.asarray([np.NA]) will use an NA-enabled float64 dtype (which is analogous to what you get from np.asarray([])). Note that this means that expressions like np.log(np.NA) will work: first np.NA will be coerced to a 0-d NA-float array, and then np.log will be called on that.

Python-level dtype objects gain the following new fields:

NA_supported
NA_value

NA_supported is a boolean which simply exposes the value of the NPY_NA_SUPPORTED flag; it should be true if this dtype allows for NAs, false otherwise. [FIXME: would it be better to just key this off the existence of the isna function? Even if a dtype decides to implement all other NA handling itself, it still has to define isna in order to make skipNA= work correctly.]

NA_value is a 0-d array of the given dtype, and its sole element contains the same bit-pattern as the dtype's underlying NA_value field. This makes it possible to determine the default bit-pattern for NA values for this type (e.g., with np.view(mydtype.NA_value, dtype=int8)).

We *do not* expose the NA_extends and NA_extends_offset values at the Python level, at least for now; they're considered an implementation detail (and it's easier to expose them later if they're needed than to unexpose them if they aren't).

Two new ufuncs are defined: np.isNA returns a logical array, with true values wherever the dtype's isna function returned true. np.isnumber is only defined for numeric dtypes, and returns True for all elements which are not NA, and for which np.isfinite would return True.

Builtin NA dtypes

The above describes the generic machinery for NA support in dtypes. It's flexible enough to handle all sorts of situations, but we also want to define a few generally useful NA-supporting dtypes that are available by default.

For each built-in dtype, we define an associated NA-supporting dtype, as follows:
  • floats: the associated dtype uses a specific NaN bit-pattern to indicate NA (chosen for R compatibility; see the sketch after this list)
  • complex: we do whatever R does (FIXME: look this up -- two NA floats, probably?)
  • signed integers: the most-negative signed value is used as NA (chosen for R compatibility)
  • unsigned integers: the most-positive value is used as NA (no R compatibility possible)
  • strings: the first byte (or, in the case of unicode strings, the first 4 bytes) is used as a flag to indicate NA, and the rest of the data gives the actual string (no R compatibility possible)
  • objects: Two options (FIXME): either we don't include an NA-ful version, or we use np.NA as the NA bit-pattern.
  • boolean: we do whatever R does (FIXME: look this up -- 0 == FALSE, 1 == TRUE, 2 == NA?)
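
For the float case, "chosen for R compatibility" would presumably mean reproducing R's NA_real_ bit-pattern. To my knowledge that is a NaN whose low 32 bits are 1954, but treat this as an assumption to check against R's source. A sketch:

#include <stdint.h>
#include <string.h>

/* Construct a double with the same bit-pattern as R's NA_real_: a NaN
 * (exponent all ones) carrying 1954 in the low 32 bits of the mantissa.
 * The 1954 payload is my recollection of R's choice; verify against
 * R's sources before relying on it. */
static double
make_r_na_double(void)
{
    uint64_t bits = 0x7FF0000000000000ULL | 1954;
    double na;
    memcpy(&na, &bits, sizeof na);
    return na;
}
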
Each of these dtypes is trivially defined using the above machinery, and they are what the automagic type inference machinery automatically uses (for np.asarray([True, np.NA, False]), etc.).

They can also be accessed via a new function np.withNA, which takes a regular dtype (or an object that can be coerced to a dtype, like 'float') and returns one of the above dtypes. Ideally withNA should also take some optional arguments that let you describe which values you want to count as NA, etc., but I'll leave that for a future draft (FIXME).

FIXME: If d is one of the above dtypes, then what should d.type return?

The NEP also contains a proposal for a somewhat elaborate domain-specific language for describing NA dtypes. I'm not sure how great an idea that is. (I have a bias against using strings as data structures, and find the already existing strings confusing enough as it is -- also, apparently the NEP version of numpy uses strings like 'f8' when printing dtypes, while my numpy uses object names like 'float64', so I'm not sure what's going on there. withNA(float64, arg1=value1) seems like a more pleasant way to print a dtype than "NA[f8, value1]", at least to me.) But if people want it, then cool.

Type hierarchy

FIXME: how should we do subtype checks, etc., for NA dtypes? What does issubdtype(withNA(float), float) return? How about issubdtype(withNA(float), np.floating)?

Serialization

FIXME: How are dtypes stored in .npz or pickle files? Do we need to do anything special here?

@bsouthey

bsouthey commented Jul 8, 2011

I presume missing float values could be addressed with one of the 'special' ranges such as 'Indeterminate' in IEEE 754 (http://babbage.cs.qc.edu/IEEE-754/References.xhtml). The outcome should be determined by the IEEE special operations. So my real concern is handling integer arrays:

  1. How will you find where the missing values are in an array? If there is a variable that denotes missing values are present (NA_flags?) then do you have to duplicate code to avoid this searching when an array has no missing values?
  2. What happens if a normal operation produces that value, e.g. through overflow when adding to an array with elements near max(np.int8):

np.arange(120,127, dtype=np.int8)+2
array([ 122, 123, 124, 125, 126, 127, -128], dtype=int8)
The -128 corresponds to the missing value's bit-pattern -- so is the last element now missing? This gets worse if the overflow is larger.
