ahaldane/structures.mkd

Future Improvements for Structured Arrays?

To add some context to PR #6053, here are some other potential improvements to structured arrays we could make. I think with improvements like these structured arrays could become much more reliable.
structured assignment speedup

Structure assignment is slow because it goes through the 'wrong' path in mapiter_set. It uses copyswapn when dtype_transfer would be much faster, since copyswapn iterates through the field dict for every element. See #1984. This should be a somewhat straightforward fix.
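A rough way to see the cost is to time whole-structure assignment against a per-field copy. This is only a sketch: it assumes the slow path is the one reached through integer-array (advanced) indexing, the array size and field layout are arbitrary, and the names `assign_whole` and `assign_per_field` are made up for illustration. Timings depend heavily on the NumPy version, so no numbers are quoted here.

```python
import timeit

import numpy as np

dt = np.dtype([('x', 'i4'), ('y', 'f8')])
src = np.zeros(100_000, dtype=dt)
src['x'] = np.arange(src.size)
src['y'] = 0.5
dst = np.zeros_like(src)

# An integer index array, so the assignment uses advanced indexing.
idx = np.arange(src.size)


def assign_whole():
    # Assign complete structured elements in one statement.
    dst[idx] = src


def assign_per_field():
    # Workaround: copy one plain (non-structured) field at a time.
    for name in dt.names:
        dst[name][idx] = src[name]


for fn in (assign_whole, assign_per_field):
    seconds = timeit.timeit(fn, number=10)
    print(f"{fn.__name__}: {seconds:.4f} s for 10 runs")
```

If the per-field loop is noticeably faster on a given build, that is consistent with the whole-structure path doing extra per-element work.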
structure comparison & ufuncs

If this PR gets merged in its current form, it might make sense to update the comparison operators for structures. For example:
>>> a = np.zeros(2, dtype=[('a', 'i4'), ('b', 'i4')])
>>> b = np.zeros(2, dtype=[('x', 'i4'), ('y', 'i4')])
>>> a == b

Currently this returns False, but it might make more sense to return an array of True after this PR, so that code like "a[:] = b; a == b" gives True. #5011 also notes some weirdness in the > and < operators that might be fixed, and the void comparison code seems to have accumulated some entropy (e.g., this error is hard to reach). #2676 is related and might be fixed at the same time.
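To make the proposed semantics concrete, here is a small sketch of a by-position comparison written in pure Python. The helper `fieldwise_equal` is hypothetical (it is not a NumPy function), and it assumes both arrays have the same shape and the same number of fields.

```python
import numpy as np


def fieldwise_equal(a, b):
    """Hypothetical helper: compare two structured arrays field by field,
    matching fields by position and ignoring their names."""
    names_a, names_b = a.dtype.names, b.dtype.names
    if len(names_a) != len(names_b):
        raise TypeError("arrays must have the same number of fields")
    result = np.ones(a.shape, dtype=bool)
    for name_a, name_b in zip(names_a, names_b):
        result &= (a[name_a] == b[name_b])
    return result


a = np.zeros(2, dtype=[('a', 'i4'), ('b', 'i4')])
b = np.zeros(2, dtype=[('x', 'i4'), ('y', 'i4')])
a['a'] = [1, 2]
b['x'] = [1, 5]

print(fieldwise_equal(a, b))  # elementwise result, one bool per record
```

Here the first records match in every field and the second records differ in their first field, so the result is elementwise rather than a single scalar.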
Relatedly, a somewhat more far-out idea is to allow ufuncs on structures, which would work field-by-field. E.g., a+b would add the fields together. This would only work if the dtypes are identical. I guess casting would also not be allowed: e.g., sqrt(a) with a defined as above would not work, though it would be possible if the fields were of type 'f4'. But maybe there are problems with this idea.
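A minimal sketch of what field-by-field addition could look like, emulated in Python. `fieldwise_add` is a hypothetical helper, not an existing NumPy API; it insists on identical dtypes and does no casting, in line with the restrictions suggested above.

```python
import numpy as np


def fieldwise_add(a, b):
    """Hypothetical helper: add two structured arrays field by field.
    Requires identical dtypes, so no casting is ever needed."""
    if a.dtype != b.dtype:
        raise TypeError("structured dtypes must be identical")
    out = np.empty_like(a)
    for name in a.dtype.names:
        out[name] = a[name] + b[name]
    return out


dt = [('a', 'i4'), ('b', 'i4')]
p = np.array([(1, 2), (3, 4)], dtype=dt)
q = np.array([(10, 20), (30, 40)], dtype=dt)

total = fieldwise_add(p, q)
print(total['a'], total['b'])
```

A real ufunc implementation would presumably dispatch on the field types inside the inner loop instead of looping over field names in Python.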
issues creating/updating structured datatypes

There are a number of related problems in structure datatype creation. First of all, in some cases it is possible to create size-0 fields, leading to segfaults. See #2196 and the 4 related issues listed there. Probably we want to fix the segfaults just in case, but also disallow size-0 fields (since I think allowing them opens yet other cans of worms: e.g., what is the value of a 0-sized field?). Also, as seen in #5224, #4084 and #663, np.dtype does not respect the align keyword in some cases, and neither does np.require.
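For reference, this is what the align keyword is supposed to do in the simple case where it does work. The field names `flag` and `value` are arbitrary, and the exact padded size depends on the platform's alignment rules (on common platforms an 'i4' is aligned to 4 bytes).

```python
import numpy as np

fields = [('flag', 'u1'), ('value', 'i4')]

packed = np.dtype(fields)               # no padding between fields
aligned = np.dtype(fields, align=True)  # padding inserted as a C compiler would

print(packed.itemsize, aligned.itemsize)
print(aligned.fields['value'][1])       # byte offset of 'value' in the aligned layout
print(aligned.isalignedstruct)
```

The bugs referenced above are cases where a dtype built with align=True does not end up with this padded layout.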
numpy can't load/save many structured arrays

Because of the way np.load and np.save treat padding bytes, it is currently not possible to load/save structured arrays with such padding bytes (a common case). See #2215 and #3176. There are lots of comments in npy_io.py of the form "# XXX we don't treat padding bytes correctly, will fix in the future". I think the fix is that np.load needs custom code to convert the descr to a dtype, which should not try to create fields if the dtype description has an empty string for the field name. I don't think a new format version is even needed.
By the way, this problem might be a symptom of trying to use dtype.descr in an unintended way: dtype.descr is described as the "Array-interface compliant full description of the data-type", yet npy_io uses it for a purpose unrelated to the "Array Interface", to construct the .npy format.
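The empty-named entries are easy to see by inspecting dtype.descr for a padded dtype. A short sketch, assuming a platform where 'i4' is 4-byte aligned so that align=True inserts padding after the one-byte field (the field names are arbitrary):

```python
import numpy as np

dt = np.dtype([('flag', 'u1'), ('value', 'i4')], align=True)

# descr lists padding as an entry with an empty name and a void type.
print(dt.descr)

padding = [entry for entry in dt.descr if entry[0] == '']
print(padding)
```

Any code that rebuilds a dtype from such a list has to treat the empty-named entries as gaps and not as real fields.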