@ahaldane
ahaldane / structured_array_motivation.md
Last active November 20, 2018 18:11
Structured array change notes

Vision For Structured Array Cleanup

Structured arrays are a numpy feature allowing interpretation of structured (composed from multiple datatypes) data organized like "structs" in the C language. While the basic idea and functionality are useful, structured arrays have not received as much attention as other parts of numpy and as a result some of their behavior is self-contradictory, buggy, or undocumented.

Different users have also used structured arrays for different purposes, which may have led to the self-contradictory behavior: The original intended use appears to be for interpreting binary data blobs, but some users want to use structured arrays as a "pandas-lite" for manipulating tabular data. We have tried to discourage the latter behavior recently.
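As a small illustration of the "binary data blob" use case, the sketch below decodes packed C-struct-like records with a structured dtype. The record layout (`id`, `x`, `y`) and the sample bytes are invented for the example.

```python
import struct
import numpy as np

# Hypothetical record layout: a little-endian int32 followed by two float64s,
# packed with no padding (20 bytes per record).
record = np.dtype([("id", "<i4"), ("x", "<f8"), ("y", "<f8")])

# Build a small blob the way a C program might write two packed structs.
blob = struct.pack("<idd", 1, 0.5, 1.5) + struct.pack("<idd", 2, 2.5, 3.5)

# Reinterpret the raw bytes as an array of records, without copying.
arr = np.frombuffer(blob, dtype=record)

ids = arr["id"]  # field access by name
xs = arr["x"]
```

Each field is addressed by name, and `record.itemsize` gives the size of one packed record.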

The purpose of this document is to better specify what we want structured arrays to do within numpy, to describe what problems currently exist, and to propose how structured arrays should be fixed.

@ahaldane
ahaldane / numpy_coercion.rst
Last active June 15, 2019 00:00
C-style type coercion

Using C-style type coercion rules in Numpy

This document explores using C-like type-coercion for + - * // in numpy.

Motivation: Currently, when two dtypes are involved in a binary operation, numpy's principle is that "the output dtype's range covers the range of both input dtypes", and when a single dtype is involved there is never any cast. One often-surprising consequence of this is that "np.uint64 + np.int64" gives "np.float64". This is different from C-style coercion. The current numpy coercion rules lead to unexpected behaviors like this one, which we often get questions about on GitHub and the mailing list. See the issues collected in numpy/numpy#12525.
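The current behavior described above can be reproduced with a short sketch. The array values are arbitrary; only the resulting dtypes matter.

```python
import numpy as np

a = np.array([1], dtype=np.uint64)
b = np.array([1], dtype=np.int64)

# Mixed signed/unsigned 64-bit integers: no integer dtype covers both
# ranges, so numpy promotes the result to float64.
mixed = a + b
promoted = np.result_type(np.uint64, np.int64)

# A single dtype is never cast, so small unsigned integers wrap around.
small = np.array([200], dtype=np.uint8) + np.array([100], dtype=np.uint8)

print(mixed.dtype, promoted, small.dtype, small[0])
```

For comparison, C's usual arithmetic conversions would keep the mixed 64-bit case an unsigned 64-bit integer (assuming both operands have the same rank) and would widen the two 8-bit operands to `int` before adding.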

Why switch to C-style coercion specifically? Because numpy is written in C and is designed around low-level C types like uint8, uint32, float64, etc., and the C language has already defined coercion rules for these types. C-style coercion has gone through 60 years of trial by fire

@ahaldane
ahaldane / structuredoc.rst
Last active October 12, 2016 16:28
structure docs

Structured Arrays

Introduction

Numpy allows creation of arrays with a "structured" datatype composed of

This PR defines a new indexing function "split_classes" to accompany the others, which, every once in a while, I've wished existed. It splits up elements from one array based on the 'classification' provided by another array. In its simplest form, it does this:

import numpy as np

def split_classes(c, v):
    return [v[c == u] for u in np.unique(c)]

This implementation has nagged me, though, because of its performance: if c contains n unique values, it loops through the entire c and v arrays n times each and creates n intermediate boolean arrays. For large v, c, and n, I've been hit by the slowdown.

This PR gives a performance improvement by computing everything in a single pass with no intermediate boolean arrays, and for convenience also allows a choice of axis.
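The PR's actual implementation is not shown here. As a rough pure-numpy stand-in, the sketch below avoids the n boolean masks by sorting once and splitting at class boundaries. It assumes 1-D, non-empty inputs, does not handle the axis option, and costs O(N log N) for the sort rather than being a true single pass. The name `split_classes_sorted` is made up for this example.

```python
import numpy as np

def split_classes_sorted(c, v):
    """Group the elements of v by the class labels in c (1-D inputs only)."""
    c = np.asarray(c)
    v = np.asarray(v)
    # A stable sort keeps the original order of elements within each class,
    # matching the result of boolean masking.
    order = np.argsort(c, kind="stable")
    # Class sizes, in the same sorted order as the classes themselves.
    _, counts = np.unique(c, return_counts=True)
    # Split the reordered values at the cumulative class boundaries.
    boundaries = np.cumsum(counts)[:-1]
    return np.split(v[order], boundaries)

c = np.array([2, 1, 2, 1, 3])
v = np.array([10, 20, 30, 40, 50])
groups = split_classes_sorted(c, v)
```

The groups come back in the order of the sorted unique labels, the same order the list-comprehension version produces.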

@ahaldane
ahaldane / npy_alignment.mkd
Last active September 26, 2015 02:05
Numpy 1.10 Alignment Notes

These are notes on how memory alignment currently works in numpy.

Numpy Alignment Goals

There are three use-cases related to memory alignment in numpy I see:

  1. Creating structured datatypes with fields aligned like in a C-struct.
  2. Speeding up copy operations by using word/double-word assignment instead of memcpy
  3. Guaranteeing safe aligned access for ufuncs/setitem/casting code
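The first and third use-cases can be observed from Python. The sketch below compares a packed structured dtype with one created with `align=True`, then builds a deliberately offset view to inspect the `aligned` flag. The field names are invented, and the exact aligned sizes are platform-dependent.

```python
import numpy as np

fields = [("flag", np.uint8), ("value", np.float64)]

# Default: fields are packed back to back with no padding.
packed = np.dtype(fields)

# align=True: padding is inserted as a C compiler would for a struct.
aligned = np.dtype(fields, align=True)

packed_offset = packed.fields["value"][1]    # byte offset of 'value'
aligned_offset = aligned.fields["value"][1]  # typically 8 on 64-bit platforms

# A float64 view starting one byte into a buffer is usually misaligned,
# which numpy reports through the 'aligned' flag.
buf = np.zeros(17, dtype=np.uint8)
shifted = buf[1:].view(np.float64)

print(packed.itemsize, aligned.itemsize, shifted.flags.aligned)
```

On a typical 64-bit platform this prints a packed size of 9 and an aligned size of 16. Whether the shifted view reports as unaligned depends on where the allocator placed `buf`.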

Alignment variables

@ahaldane
ahaldane / structures.mkd
Last active August 29, 2015 14:26
Future Improvements for Structured Arrays?

Future Improvements for Structured Arrays?

To add some context to PR #6053, here are some other potential improvements to structured arrays we could make. I think with improvements like these, structured arrays could become much more reliable.

structured assignment speedup

Structure assignment is slow because it goes through the 'wrong' path in mapiter_set. It uses copyswapn when dtype_transfer would be much faster, since copyswapn iterates through the field dict for every element. See #1984. This should be a somewhat straightforward fix.

structure comparison & ufuncs

@ahaldane
ahaldane / objectarrays.mkd
Last active December 2, 2017 17:01
object array docs (future ideas)

Object Arrays

Object arrays are ndarrays with a datatype of np.object whose elements are Python objects, enabling use of numpy's vectorized operations and broadcasting rules with arbitrary Python types. Object arrays have certain special rules to resolve ambiguities that arise between python types and numpy types, described here.

Envisioned uses of object arrays include:

  • Creating ndarrays whose elements are other ndarrays of varying length
  • Creating ndarrays containing number-like Python objects, for example mpmath's multiprecision types, or Python's built-in arbitrary precision integers or Decimal type.
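A brief sketch of both envisioned uses follows. It passes the builtin `object` as the dtype, which newer numpy releases require in place of `np.object`. The sample values are arbitrary.

```python
import numpy as np
from decimal import Decimal

# Number-like Python objects: arithmetic is dispatched elementwise to the
# objects' own operators (here Decimal.__add__).
prices = np.array([Decimal("1.10"), Decimal("2.20")], dtype=object)
total = prices + Decimal("0.05")

# Python's arbitrary-precision integers do not overflow in an object array.
big = np.array([2**70, 3**50], dtype=object)
squares = big * big

# Elements that are themselves ndarrays of varying length. Allocating the
# container first and assigning slots avoids numpy trying to build a
# rectangular array from the nested sequences.
ragged = np.empty(2, dtype=object)
ragged[0] = np.arange(3)
ragged[1] = np.arange(5)
lengths = [len(item) for item in ragged]
```

Because every operation goes through Python-level calls, object arrays gain flexibility but not the speed of native numeric dtypes.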