This PR defines a new indexing function "split_classes" to accompany the others, which, every once in a while, I've wished existed. It splits up elements from one array based on the 'classification' provided by another array. In its simplest form, it does this:
def split_classes(c, v):
return [v[c == u] for u in unique(c)]
This implemenation has nagged me though because of performance: If c
contains n unique values, this loops through the entire c and v arrays n times each, and creates n intermediate boolean arrays. For large v,c,n I've been hit by performance.
This PR gives a performance improvement by computing everything in a single pass with no intermediate boolean arrays, and for conveniance also allows choice of axis.
split_classes
might be (roughly) thought of as a generalization of compress
, which itself is a generalization of extract
, which is a generalization of boolean indexing. They often give the same result:
a = np.random.rand(100)
a[a > 0.5]
extract(a > 0.5, a)
compress(a > 0.5, a)
split_classes(a > 0.5, a)[1]
A few example uses:
from numpy.random import rand, choice, randint
# Example 1
data = rand(100,2)
lo, hi = split_classes(data[:,0] > 0.5, data)
# Example 2
classes = (data[:,0] < 0.5) + 2*(data[:,1] < 0.5)
group1, group2, group3, group4 = split_classes(classes, data)
# Example 3
years = [2010, 2011, 2012, 2013, 2014]
data = array([(choice(years), rand()) for i in range(100)], dtype=[('year', 'i4'), ('x', 'f4')])
for cat_data in split_classes(data['year'], data):
print sum(cat_data['x'])
# Example 4
L = 100
seqs = randint(0, 4, size=(1000, L)) # represents a DNA multiple sequence alignment
phenotype = rand(len(seqs))
signal = [[np.mean(c) for c in split_classes(seqs[:,i], phenotype)] for i in range(L)]
A few related stackoverflow questions:
- http://stackoverflow.com/questions/4373631/sum-array-by-number-in-numpy
- http://stackoverflow.com/questions/31483912/split-numpy-array-according-to-values-in-the-array-a-condition/31484134#31484134
- http://stackoverflow.com/questions/28599405/splitting-an-array-into-two-smaller-arrays-in-python
- http://stackoverflow.com/questions/31863083/python-split-numpy-array-based-on-values-in-the-array
- http://stackoverflow.com/questions/7662458/how-to-split-an-array-according-to-a-condition-in-numpy