dariusf/pandas.md

## pandas.md

      
    Raw
  

              pandas.md
            
          
Pandas

Mental model

Pandas is oriented around performing SIMD operations on arrays, and its API directly reflects this. Some operations may be less intuitive than they are in the SQL or stream processing worlds. Operations like apply and iter* which involve user-defined Python are generally slow.
Basics

import pandas as pd
import numpy as np

# by column
df = pd.DataFrame({'A': [1, 2, 3], 'B': [6, 7, 8]})

# by row
df = pd.DataFrame([[1, 1.5]], columns=['int', 'float'])
CSV

# if the headers lie about the number of columns, ignore them and supply new ones
df = pd.read_csv(f1 skiprows=[0], names=['A', 'B', 'C'])

# if the data has merged cells
df.fillna(method='ffill', inplace=True)
Selection

>>> df[df['A'] >= 2]
   A  B
1  2  7
2  3  8
loc selects by index (aka labels). iloc selects by ordinal (aka locations).
Projection

>>> p aa.loc[:, 'A':'B']
   A  B
0  2  7
1  3  8
Index

An index is like an id column in the SQL world. It's an extra column that is by default a range from 0 to length-1, and hidden when printing.
After selection, the index column is unchanged and may have to be reset.
>>> a = df[df['A'] >= 2]
>>> p a
   A  B
1  2  7
2  3  8
>>> a.reset_index(inplace=True)
>>> a
   index  A  B
0      1  2  7
1      2  3  8
>>> del a['index']
>>> a
   A  B
0  2  7
1  3  8
An existing column can be set as the index. A multi-index is a composite key, ordered to be hierarchical.
>>> df = pd.DataFrame({'A': [1, 1, 1, 3], 'B': [2, 5, 5, 8], 'C': [8, 7, 6, 4]})
>>> df.set_index(['A', 'B'])
     C
A B
1 2  8
  5  7
  5  6
3 8  4
Joins

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [6, 7, 8]})
df2 = pd.DataFrame({'A': [1], 'C': [5]})
df1.merge(df2, on='A') # inner
df1.merge(df2, on='A', how='outer')
Union

>>> pd.concat([df1, df2])
   A    B    C
0  1  6.0  NaN
1  2  7.0  NaN
2  3  8.0  NaN
0  1  NaN  5.0
Distinct

>>> pd.concat([df1, df1]).drop_duplicates()
   A  B
0  1  6
1  2  7
2  3  8
Group by

>>> df = pd.DataFrame({'A': [1, 1, 1, 3], 'B': [2, 5, 5, 8], 'C': [8, 7, 6, 4]})
>>> df.groupby('A').sum()
    B   C
A
1  12  21
3   8   4
>>> df.groupby('A').apply(print)
   A  B  C
0  1  2  8
1  1  5  7
2  1  5  6
   A  B  C
3  3  8  4
Empty DataFrame
Columns: []
Index: []
Iteration

>>> df = pd.DataFrame({'A': [1, 2, 3], 'B': [6, 7, 8]})
>>> for k in df.itertuples(): print(k)
Pandas(Index=0, A=1, B=6)
Pandas(Index=1, A=2, B=7)
Pandas(Index=2, A=3, B=8)
Printing

df.head()
df.tail()
df.to_string()
Map/reduce

>>> df = pd.DataFrame({'A': [1, 2, 3], 'B': [6, 7, 8]})
>>> df.apply(lambda x: -x) # unary functions are mapped
   A  B
0 -1 -6
1 -2 -7
2 -3 -8
>>> df.apply(np.sum) # binary functions are folded. default is axis 0 (rows)
A     6
B    21
dtype: int64
>>> df.apply(sum, axis=1) # columns
0     7
1     9
2    11
agg and transform