Pandas is oriented around performing SIMD operations on arrays, and its API directly reflects this. Some operations may be less intuitive than they are in the SQL or stream processing worlds. Operations like apply
and iter*
which involve user-defined Python are generally slow.
import pandas as pd
import numpy as np
# by column
df = pd.DataFrame({'A': [1, 2, 3], 'B': [6, 7, 8]})
# by row
df = pd.DataFrame([[1, 1.5]], columns=['int', 'float'])
# if the headers lie about the number of columns, ignore them and supply new ones
df = pd.read_csv(f1 skiprows=[0], names=['A', 'B', 'C'])
# if the data has merged cells
df.fillna(method='ffill', inplace=True)
>>> df[df['A'] >= 2]
A B
1 2 7
2 3 8
loc
selects by index (aka labels). iloc
selects by ordinal (aka locations).
>>> p aa.loc[:, 'A':'B']
A B
0 2 7
1 3 8
An index is like an id column in the SQL world. It's an extra column that is by default a range from 0 to length-1, and hidden when printing.
After selection, the index column is unchanged and may have to be reset.
>>> a = df[df['A'] >= 2]
>>> p a
A B
1 2 7
2 3 8
>>> a.reset_index(inplace=True)
>>> a
index A B
0 1 2 7
1 2 3 8
>>> del a['index']
>>> a
A B
0 2 7
1 3 8
An existing column can be set as the index. A multi-index is a composite key, ordered to be hierarchical.
>>> df = pd.DataFrame({'A': [1, 1, 1, 3], 'B': [2, 5, 5, 8], 'C': [8, 7, 6, 4]})
>>> df.set_index(['A', 'B'])
C
A B
1 2 8
5 7
5 6
3 8 4
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [6, 7, 8]})
df2 = pd.DataFrame({'A': [1], 'C': [5]})
df1.merge(df2, on='A') # inner
df1.merge(df2, on='A', how='outer')
>>> pd.concat([df1, df2])
A B C
0 1 6.0 NaN
1 2 7.0 NaN
2 3 8.0 NaN
0 1 NaN 5.0
>>> pd.concat([df1, df1]).drop_duplicates()
A B
0 1 6
1 2 7
2 3 8
>>> df = pd.DataFrame({'A': [1, 1, 1, 3], 'B': [2, 5, 5, 8], 'C': [8, 7, 6, 4]})
>>> df.groupby('A').sum()
B C
A
1 12 21
3 8 4
>>> df.groupby('A').apply(print)
A B C
0 1 2 8
1 1 5 7
2 1 5 6
A B C
3 3 8 4
Empty DataFrame
Columns: []
Index: []
>>> df = pd.DataFrame({'A': [1, 2, 3], 'B': [6, 7, 8]})
>>> for k in df.itertuples(): print(k)
Pandas(Index=0, A=1, B=6)
Pandas(Index=1, A=2, B=7)
Pandas(Index=2, A=3, B=8)
df.head()
df.tail()
df.to_string()
>>> df = pd.DataFrame({'A': [1, 2, 3], 'B': [6, 7, 8]})
>>> df.apply(lambda x: -x) # unary functions are mapped
A B
0 -1 -6
1 -2 -7
2 -3 -8
>>> df.apply(np.sum) # binary functions are folded. default is axis 0 (rows)
A 6
B 21
dtype: int64
>>> df.apply(sum, axis=1) # columns
0 7
1 9
2 11