@ahaldane
Last active August 29, 2015 14:26
Future Improvements for Structured Arrays?

To add some context to PR #6053, here are some other potential improvements we could make to structured arrays. I think that with improvements like these, structured arrays could become much more reliable.

structured assignment speedup

Structure assignment is slow because it goes through the 'wrong' path in mapiter_set. It uses copyswapn when dtype_transfer would be much faster, since copyswapn iterates through the field dict for every element. See #1984. This should be a somewhat straightforward fix.
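For reference, here is a minimal sketch of the kind of assignment in question: a whole-record assignment through a fancy index, next to the per-field assignment that produces the same result. The array names and sizes are made up for illustration, and the sketch only shows that the two forms are equivalent; it makes no claim about which internal path either one takes or how fast it is.

```python
import numpy as np

dt = np.dtype([('a', 'i4'), ('b', 'f8')])
dst_whole = np.zeros(6, dtype=dt)
dst_fields = np.zeros(6, dtype=dt)
src = np.array([(1, 1.5), (2, 2.5), (3, 3.5)], dtype=dt)
idx = np.array([0, 2, 4])  # fancy (integer-array) index

# Whole-record assignment through a fancy index.
dst_whole[idx] = src

# The same assignment done one field at a time.
# dst_fields[name] is a view, so this writes into dst_fields.
for name in dt.names:
    dst_fields[name][idx] = src[name]
```

Timing the two forms (for example with `timeit`) on a large array is one way to check whether the slowdown is present in a given NumPy version.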

structure comparison & ufuncs

If this PR gets merged in its current form, it might make sense to update the comparison operators for structures. Eg,

>>> import numpy as np
>>> a = np.zeros(2, dtype=[('a', 'i4'), ('b', 'i4')])
>>> b = np.zeros(2, dtype=[('x', 'i4'), ('y', 'i4')])
>>> a == b

Currently this returns False, but it might make more sense to return an array of True after this PR, so that code like "a[:] = b; a == b" gives true. #5011 also notes some weirdness in > and < operators that might be fixed, and the void comparison code seems to have accumulated some entropy (eg, this error is hard to reach). #2676 is related and might be fixed at the same time.
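To make the proposed semantics concrete, here is a rough sketch of a comparison that matches fields by position and ignores their names. `equal_by_position` is a hypothetical helper written for this note, not a NumPy function, and it assumes both arrays have the same shape and only scalar fields.

```python
import numpy as np

def equal_by_position(x, y):
    """Hypothetical helper: elementwise equality of two structured arrays,
    pairing fields by position and ignoring field names."""
    if len(x.dtype.names) != len(y.dtype.names):
        raise TypeError("structures have a different number of fields")
    result = np.ones(x.shape, dtype=bool)
    for name_x, name_y in zip(x.dtype.names, y.dtype.names):
        # An element is equal only if every pair of fields is equal.
        result &= (x[name_x] == y[name_y])
    return result

a = np.zeros(2, dtype=[('a', 'i4'), ('b', 'i4')])
b = np.zeros(2, dtype=[('x', 'i4'), ('y', 'i4')])
b['y'][1] = 7  # make the second element differ

mask = equal_by_position(a, b)  # first element equal, second not
```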

Relatedly, a somewhat more far-out idea is to allow ufuncs on structures, which would work field-by-field. Eg, a+b would add the corresponding fields together. This would only work if the dtypes are identical. I guess casting would also not be allowed, eg sqrt(a) with a defined as above would not work, though it would be possible if the fields were of type 'f4'. But maybe there are problems with this idea.
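As an illustration of what field-by-field ufuncs could mean, here is a small sketch for binary ufuncs. `apply_fieldwise` is a hypothetical helper, not an existing NumPy API; it assumes identical dtypes with scalar fields, and uses `casting='no'` to mimic the "no casting" restriction mentioned above.

```python
import numpy as np

def apply_fieldwise(ufunc, x, y):
    """Hypothetical helper: apply a binary ufunc to each pair of fields.

    Requires identical structured dtypes; casting='no' rejects any
    operation whose result would need to be cast to fit the field.
    """
    if x.dtype != y.dtype:
        raise TypeError("dtypes must be identical")
    out = np.empty(x.shape, dtype=x.dtype)
    for name in x.dtype.names:
        # out[name] is a view, so the ufunc writes directly into out.
        ufunc(x[name], y[name], out=out[name], casting='no')
    return out

dt = [('a', 'i4'), ('b', 'i4')]
a = np.array([(1, 2), (3, 4)], dtype=dt)
b = np.array([(10, 20), (30, 40)], dtype=dt)

total = apply_fieldwise(np.add, a, b)
```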

issues creating/updating structured datatypes

There are a number of related problems in structure datatype creation. First of all, in some cases it is possible to create size-0 fields, leading to segfaults. See #2196 and the 4 related issues listed there. Probably we want to fix the segfaults just in case, but also disallow size-0 fields (since I think allowing them opens yet other cans of worms - eg, what is the value of a 0-sized field?). Also, as seen in #5224 #4084 #663 np.dtype does not respect the align keyword in some cases, and neither does np.require.
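For context, this is what the align keyword is expected to do in a simple case where it does work. The sketch assumes a typical platform where a 4-byte integer has 4-byte alignment; the linked issues describe other ways of specifying the dtype where the keyword is not respected.

```python
import numpy as np

fields = [('a', 'u1'), ('b', 'i4')]

packed = np.dtype(fields)               # fields stored back to back
aligned = np.dtype(fields, align=True)  # padding inserted, as a C compiler would

# Byte offset of each field within one record.
packed_offsets = [packed.fields[name][1] for name in packed.names]
aligned_offsets = [aligned.fields[name][1] for name in aligned.names]

print(packed_offsets, packed.itemsize)
print(aligned_offsets, aligned.itemsize, aligned.isalignedstruct)
```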

numpy can't load/save many structured arrays

Because of the way np.load and np.save treat padding bytes, it is currently not possible to load/save structured arrays with such padding bytes (a common case). See #2215 #3176. There are lots of comments in npy_io.py of the form "# XXX we don't treat padding bytes correctly, will fix in the future". I think the fix is that np.load needs custom code to convert the descr to a dtype, which should not try to create fields if the dtype description has an empty string for field name. I don't think a new format version is even needed.
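Here is a rough sketch of the conversion I have in mind. `dtype_from_descr` is a hypothetical helper, not existing NumPy code; it assumes a flat descr of (name, typestr) pairs with no nested structures, subarray shapes, or titles. Entries with an empty name are treated as padding: they advance the offset but do not become fields.

```python
import numpy as np

def dtype_from_descr(descr):
    """Hypothetical helper: rebuild a dtype from a flat descr list,
    treating entries with an empty field name as padding bytes."""
    names, formats, offsets = [], [], []
    offset = 0
    for name, typestr in descr:
        field_dtype = np.dtype(typestr)
        if name != '':
            names.append(name)
            formats.append(field_dtype)
            offsets.append(offset)
        # Padding entries only advance the running offset.
        offset += field_dtype.itemsize
    return np.dtype({'names': names, 'formats': formats,
                     'offsets': offsets, 'itemsize': offset})

# A dtype with 3 padding bytes between 'a' and 'b'.
padded = np.dtype({'names': ['a', 'b'], 'formats': ['u1', 'i4'],
                   'offsets': [0, 4], 'itemsize': 8})

descr = padded.descr            # contains a ('', '|V3') padding entry
rebuilt = dtype_from_descr(descr)
```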

By the way, this problem might be a symptom of trying to use dtype.descr in an unintended way: dtype.descr is described as the "Array-interface compliant full description of the data-type", yet npy_io uses it for a purpose unrelated to the "Array Interface", to construct the .npy format.
