
@ahaldane
Last active November 20, 2018 18:11
Structured array change notes

Vision For Structured Array Cleanup

Structured arrays are a numpy feature for interpreting structured data, i.e. data composed of multiple datatypes, organized like "structs" in the C language. While the basic idea and functionality are useful, structured arrays have not received as much attention as other parts of numpy, and as a result some of their behavior is self-contradictory, buggy, or undocumented.

Different users have also put structured arrays to different purposes, which may explain some of the self-contradictory behavior: the original intended use appears to be interpreting binary data blobs, but some users want to use structured arrays as a "pandas-lite" for manipulating tabular data. We have recently tried to discourage the latter.

The purpose of this document is to better specify what we want structured arrays to do within numpy, what problems currently exist, and propose how structured arrays should be fixed.

Proposed Changes to Behavior

There are two main areas in which structured arrays are being changed: casting/assignment and multifield views. These changes are somewhat independent; we can do one without the other. The casting/assignment changes were released in numpy 1.14, while the multifield-view changes have been put off for later.

The main documentation for structured arrays before 1.13 is at: https://docs.scipy.org/doc/numpy-1.13.0/user/basics.rec.html

The new documentation as of 1.15 is at https://docs.scipy.org/doc/numpy-1.15.0/user/basics.rec.html


Casting/assignment

The problem with casting/assignment in numpy ≤1.13 was that the different kinds of assignment were inconsistent with each other and with tuple assignment, and the casting/assignment behavior was almost completely undocumented before numpy 1.14.

Tuple assignment (no change)

To start, let us recall how assignment to a structured array works from a tuple:

>>> arr = np.empty(1, dtype=[('foo', 'i8'), ('bar', 'i8')])
>>> arr[0] = (2, 3)  # tuple assignment
>>> arr
array([(2, 3)],
      dtype=[('foo', '<i8'), ('bar', '<i8')])

This follows a sensible "tuple-unpacking" behavior, which I have also called "assign-by-position". The first field is set to the first element of the tuple, the second field to the second element, and so on. Assigning from a tuple with the wrong number of elements is disallowed.
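The rule can be checked directly, including the wrong-length case (a minimal sketch in current numpy):

```python
import numpy as np

# Tuples assign by position: first field gets the first element, etc.
arr = np.empty(1, dtype=[('foo', 'i8'), ('bar', 'i8')])
arr[0] = (2, 3)
assert arr[0]['foo'] == 2 and arr[0]['bar'] == 3

# A tuple with the wrong number of elements is rejected outright.
try:
    arr[0] = (2, 3, 4)
    wrong_length_ok = True
except ValueError:
    wrong_length_ok = False
```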

Structure-to-structure assignment (needs change)

Now let us examine assignment from one structured array to another in numpy 1.13, or casting from one structured type to another. We will create an array filled with ones, and then assign to it:

arr = np.ones(1, dtype=[('foo', 'i8'), ('bar', 'i8')])

The following table illustrates that we get contradictory behavior in numpy 1.13 for different values assigned in different ways. The result columns show, in shorthand, the final contents of arr as the user sees them.

| description | val | result of `arr[0] = val` (item assignment) | result of `arr[:1] = val` (vector assignment) |
|---|---|---|---|
| identical (no change) | `array([(2,3)], dtype=[('foo', 'i8'), ('bar', 'i8')])` | (2, 3) | (2, 3) |
| reordered fields | `array([(2,3)], dtype=[('bar', 'i8'), ('foo', 'i8')])` | (2, 3) | (3, 2) |
| renamed field | `array([(2,3)], dtype=[('foo', 'i8'), ('baz', 'i8')])` | (2, 3) | (2, 0) |
| missing field | `array([(2,)], dtype=[('bar', 'i8')])` | (2, 0) | (0, 2) |
| changed dtype | `array([(2,3)], dtype=[('foo', 'f8'), ('bar', 'f8')])` | (4607182418800017408, 4611686018427387904) | (2, 3) |

Vector assignment occurs "by fieldname": each field on the lhs is assigned from the field of the same name on the rhs, and lhs fields with no same-named counterpart on the rhs are overwritten with 0. This "assign-by-fieldname" behavior was not the original behavior; it was introduced in numpy 1.6 in PR #36, as one of ~150 commits there, without any mention or discussion.
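To make the by-fieldname semantics concrete, here is a sketch that reproduces them manually in current numpy (the helper name `assign_by_fieldname` is mine, not a numpy API):

```python
import numpy as np

def assign_by_fieldname(dst, src):
    # Reproduce the numpy 1.6-1.13 vector-assignment semantics: copy
    # same-named fields, overwrite unmatched destination fields with 0.
    src_names = src.dtype.names or ()
    for name in dst.dtype.names:
        dst[name] = src[name] if name in src_names else 0

dst = np.ones(1, dtype=[('foo', 'i8'), ('bar', 'i8')])
src = np.array([(2,)], dtype=[('bar', 'i8')])
assign_by_fieldname(dst, src)
# matches the "missing field" row of the table: foo zeroed, bar copied
```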

Item assignment, on the other hand, copies byte-by-byte without casting, and pads any trailing space with 0. It is extremely bug-prone and caused all sorts of segfaults and the like. It incidentally follows the "tuple-unpacking" behavior in many cases, e.g. in the first three rows of the table.
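The large integers in the "changed dtype" row are a direct consequence of this byte copying: they are the raw IEEE-754 bit patterns of float values reinterpreted as i8, as can be checked by viewing a float64's bytes as int64:

```python
import numpy as np

# Viewing a float64's bytes as int64 yields its raw bit pattern.
bits_one = int(np.array(1.0).view(np.int64))
bits_two = int(np.array(2.0).view(np.int64))
assert bits_one == 4607182418800017408   # 0x3ff0000000000000
assert bits_two == 4611686018427387904   # 0x4000000000000000
```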

Changes made in numpy 1.14

Clearly a fix is needed. One might argue that all casting between structures should be disallowed, since C does not allow casting between structs, but that seems too hard to backtrack on at this point. Barring that, we should at least make the two kinds of assignment consistent. In #6053 I decided to make everything go back to the old "tuple-unpacking" behavior in all cases. I think it is much better because it mirrors the tuple->structure assignment users are familiar with, and it avoids the "overwrite-with-0" behavior for missing or renamed fields. That overwrite-with-0 behavior has been a source of confusion (see #7058), and I find it ugly to have a special value appear out of nowhere: why 0 and not nan or 1?

This is a back-incompatible change for users using structured arrays since numpy 1.6.

Fallout from the changes and proposed fixes

The first breakage report on the mailing list was actually about the change to item assignment (not vector assignment): https://mail.python.org/pipermail/numpy-discussion/2018-January/077641.html The old item-assignment behavior is pretty unambiguously buggy and dangerous in my opinion; I think the benefit of that particular change clearly outweighs the back-compat break, and a break in item-assignment behavior was going to be necessary no matter what we did. Other user reports I would file in the same category are #10036, #10752, and #10789; all were making use of the dangerous behavior of the old item assignment.

However we had one reported code-breakage related to vector-assignment style casting, which used to use "cast-by-fieldname". Discussed at the end of #6053, we have a user who was using numpy's structured casting code to add/remove fields from a structured array, like adding/removing columns from a table:

data = np.zeros(2,  dtype=[('a', 'i1'), ('zzz', 'i1'), ('b', 'i4')])
required_data = np.require(data[:], dtype=[('a', 'i1'), ('b', 'i4')])

The idea is to use np.require to select fields of a structured array. My justification for breaking this code is as follows. First, as far as I can see this is an undocumented use of np.require, which seems to have been unintentional on our part. Second, this kind of code is likely to be buggy because of the "overwrite-with-0" behavior: if you np.require with a non-existent field name, you silently end up with a field filled with 0. Third, this is "pandas-like" usage which we do not want to encourage. Selecting a subset of fields is usually better done with multi-field indexing, though as pointed out in the issue that does not work for selecting a subset of nested fields within nested structs.
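For comparison, multi-field indexing (the recommended route) selects fields by name and rejects misspelled names instead of silently zero-filling them; a quick sketch:

```python
import numpy as np

data = np.zeros(2, dtype=[('a', 'i1'), ('zzz', 'i1'), ('b', 'i4')])

# Multi-field indexing selects fields by name (a copy in numpy <1.16,
# a view after the copy->view change).
sub = data[['a', 'b']]
assert sub.dtype.names == ('a', 'b')

# A misspelled field name raises instead of producing a 0-filled field.
try:
    data[['a', 'nonexistent']]
    misspelling_caught = False
except (KeyError, ValueError):
    misspelling_caught = True
```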

Suggested resolution and remaining bugs

Reading over the above, I think the change still seems justified despite some code breakage, so we should keep it. It is already in numpy 1.14 anyway, it might cause even more confusion to go back.

I suggest that, to help users whose code broke, we add a helper function giving the old behavior. This is the np.require_fields helper added in #11526 (along with other changes, see below), which is still open.
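A rough sketch of what such a helper could look like (this is my own illustration of the old "cast-by-fieldname" semantics, not the actual #11526 implementation):

```python
import numpy as np

def require_fields_sketch(arr, newdtype):
    # Build a fresh array of the requested dtype and copy same-named
    # fields across; fields absent from the input stay 0, mimicking
    # the old by-fieldname casting behavior.
    newdtype = np.dtype(newdtype)
    out = np.zeros(arr.shape, dtype=newdtype)
    for name in newdtype.names:
        if name in arr.dtype.names:
            out[name] = arr[name]
    return out

data = np.array([(1, 9, 2), (3, 9, 4)],
                dtype=[('a', 'i1'), ('zzz', 'i1'), ('b', 'i4')])
sub = require_fields_sketch(data, [('a', 'i1'), ('b', 'i4')])
```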

Finally, there are still unfixed (old) bugs related to assignment of subarrays in structured arrays; see #9313 and #11946.

Multifield-views

Separately from the changes above, we also have worked toward making multi-field indexes of structured arrays return a view instead of a copy. Motivation for this change is best explained in this mailing list post:

https://mail.python.org/pipermail/numpy-discussion/2018-January/077624.html

We tried to go ahead with this change in 1.14.0, but it caused enough breakage in people's code that we reverted it in 1.14.1 and have not put it back in yet.
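The semantic difference at stake can be checked directly; whether the write below reaches `arr` depends on whether the multi-field index returned a copy (pre-change numpy) or a view:

```python
import numpy as np

arr = np.zeros(2, dtype=[('a', 'i4'), ('b', 'i4')])
sub = arr[['a']]
sub['a'] = 1

# With copy semantics arr is untouched; with view semantics the write
# propagates back. shares_memory distinguishes the two cases.
is_view = bool(np.shares_memory(arr, sub))
```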

There were two main ways people's code broke:

First, indirectly, due to exposed bugs in np.save and np.load for structured arrays, covered in more detail in the next section. Those functions should be fixed up in any case, since the bugs are present even without the multifield copy->view change, and once they are fixed they no longer present an obstacle. The multifield-view change should only be made after those issues are fixed.

Second, due to a use-case involving datatype-viewing of a structured array obtained from a multifield index. We had a long and useful discussion of the code breakage on the mailing list, particularly in statsmodels, see: https://mail.python.org/pipermail/numpy-discussion/2018-January/077641.html

(also briefly https://mail.python.org/pipermail/numpy-discussion/2018-January/077592.html)

The way forward

One proposal in that discussion was to add new helper functions to account for the use-cases we would break. These are added in the open PR #11526.

It is still possible we should scrap the multifield copy->view change altogether.

It looks like we will not have time to go ahead with this change for 1.16 because the np.save/load fixes are not done. Nevertheless, we should decide whether to go ahead with this change or not soon, and if so, include the helper functions of #11526 in 1.16 so people have a chance to update their code in preparation.

Save/load of aligned/padded structures

Note: This is now fixed in #12358

This is a long-standing bug which came to the fore as a result of the attempt to add multi-field views described above. The issue is that np.save and np.load are unable to correctly save structured arrays which contain any padding bytes. This affects most "aligned" structured types, for instance np.ones(3, dtype=np.dtype('u1,f4', align=True)), because these have extra padding bytes.
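The padding is visible from the dtype itself: with align=True the f4 field must start on a 4-byte boundary, leaving 3 unnamed padding bytes after the u1:

```python
import numpy as np

dt = np.dtype('u1,f4', align=True)
assert dt.itemsize == 8          # 1 (u1) + 3 (padding) + 4 (f4)
assert dt.fields['f1'][1] == 4   # the f4 field starts at offset 4
```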

The multifield copy->view change above further exposed this problem because the views produced now often contain padding bytes.

The relevant issue is #8100, but there are scattered discussions in other github issues linking to that one.

See also #10387, which is closed because we reverted the 1.14.0 behavior, but will need to be reopened once we revert the reversion.

An open PR with some initial work to fix np.save is at #10931, but it needs a lot of work. See the comments there for the way forward.

Structured assignment speedup

Another problem, not discussed anywhere on github I think, is that structured assignment is slow because it goes through the "wrong" path in mapiter_set: it uses copyswapn, which iterates through the field dict for every element, when dtype_transfer would be much faster. See #1984. This is not urgent since it is only a performance issue.

structure comparison & ufuncs

There are current FutureWarnings and Deprecations for comparison operators involving structured arrays, for instance

>>> a = np.zeros(2, dtype=[('a', 'i4'), ('b', 'i4')])
>>> b = np.zeros(2, dtype=[('x', 'i4'), ('y', 'i4')])
>>> a == b

Currently this returns a scalar bool value, but the plan is to return an ndarray of bool dtype in the future.
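Note that the deprecation concerns mismatched dtypes like the example above; comparison with identical structured dtypes already works elementwise, comparing all fields per element:

```python
import numpy as np

a = np.array([(1, 2), (3, 4)], dtype=[('a', 'i4'), ('b', 'i4')])
b = np.array([(1, 2), (0, 4)], dtype=[('a', 'i4'), ('b', 'i4')])

# With matching dtypes, == compares all fields elementwise.
eq = a == b
```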

Relevant PRs: #5964 #9153 #10235

mattip commented Nov 1, 2018

Is the multifield change view->copy or copy->view? In the "Multifield-views" section you talk about copy->view, but in the "way forward" you say view->copy.

mattip commented Nov 1, 2018

Which PR is referred to in "structure comparison & ufuncs"?

mattip commented Nov 1, 2018

I think we should make fixing the save/load bug a priority for 1.16

mattip commented Nov 1, 2018

Do we need to emphasize the tuple-assign paradigm in documents? It is well documented but perhaps we should explicitly point out that names are ignored, even if they match

mattip commented Nov 1, 2018

save load: maybe add a note that #10387 is closed.

ahaldane commented Nov 2, 2018

Thanks Matti, updated with more PR links.

charris commented Nov 11, 2018

@ahaldane Could you add the 1.16.0 milestone to all the PRs you think are ready? The milestone doesn't guarantee that the PR will go in, but it will ensure that it gets looked at.

mattip commented Nov 19, 2018

Do you want to update this? I think the save/load problems are solved, so we can go ahead with the helper function?

ahaldane commented

I don't get notified about comments here so missed your recent comments.

Yes, I will update soon.
