Skip to content

Instantly share code, notes, and snippets.

@dariusf
Created August 15, 2023 08:07
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save dariusf/bb03514c720f7c260d89fa1d3346551c to your computer and use it in GitHub Desktop.
Save dariusf/bb03514c720f7c260d89fa1d3346551c to your computer and use it in GitHub Desktop.

Pandas

Mental model

Pandas is oriented around performing SIMD operations on arrays, and its API directly reflects this. Some operations may be less intuitive than they are in the SQL or stream processing worlds. Operations like apply and iter* which involve user-defined Python are generally slow.

Basics

import pandas as pd
import numpy as np

# by column
df = pd.DataFrame({'A': [1, 2, 3], 'B': [6, 7, 8]})

# by row
df = pd.DataFrame([[1, 1.5]], columns=['int', 'float'])

CSV

# if the headers lie about the number of columns, ignore them and supply new ones
df = pd.read_csv(f1 skiprows=[0], names=['A', 'B', 'C'])

# if the data has merged cells
df.fillna(method='ffill', inplace=True)

Selection

>>> df[df['A'] >= 2]
   A  B
1  2  7
2  3  8

loc selects by index (aka labels). iloc selects by ordinal (aka locations).

Projection

>>> p aa.loc[:, 'A':'B']
   A  B
0  2  7
1  3  8

Index

An index is like an id column in the SQL world. It's an extra column that is by default a range from 0 to length-1, and hidden when printing.

After selection, the index column is unchanged and may have to be reset.

>>> a = df[df['A'] >= 2]
>>> p a
   A  B
1  2  7
2  3  8
>>> a.reset_index(inplace=True)
>>> a
   index  A  B
0      1  2  7
1      2  3  8
>>> del a['index']
>>> a
   A  B
0  2  7
1  3  8

An existing column can be set as the index. A multi-index is a composite key, ordered to be hierarchical.

>>> df = pd.DataFrame({'A': [1, 1, 1, 3], 'B': [2, 5, 5, 8], 'C': [8, 7, 6, 4]})
>>> df.set_index(['A', 'B'])
     C
A B
1 2  8
  5  7
  5  6
3 8  4

Joins

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [6, 7, 8]})
df2 = pd.DataFrame({'A': [1], 'C': [5]})
df1.merge(df2, on='A') # inner
df1.merge(df2, on='A', how='outer')

Union

>>> pd.concat([df1, df2])
   A    B    C
0  1  6.0  NaN
1  2  7.0  NaN
2  3  8.0  NaN
0  1  NaN  5.0

Distinct

>>> pd.concat([df1, df1]).drop_duplicates()
   A  B
0  1  6
1  2  7
2  3  8

Group by

>>> df = pd.DataFrame({'A': [1, 1, 1, 3], 'B': [2, 5, 5, 8], 'C': [8, 7, 6, 4]})
>>> df.groupby('A').sum()
    B   C
A
1  12  21
3   8   4
>>> df.groupby('A').apply(print)
   A  B  C
0  1  2  8
1  1  5  7
2  1  5  6
   A  B  C
3  3  8  4
Empty DataFrame
Columns: []
Index: []

Iteration

>>> df = pd.DataFrame({'A': [1, 2, 3], 'B': [6, 7, 8]})
>>> for k in df.itertuples(): print(k)
Pandas(Index=0, A=1, B=6)
Pandas(Index=1, A=2, B=7)
Pandas(Index=2, A=3, B=8)

Printing

df.head()
df.tail()
df.to_string()

Map/reduce

>>> df = pd.DataFrame({'A': [1, 2, 3], 'B': [6, 7, 8]})
>>> df.apply(lambda x: -x) # unary functions are mapped
   A  B
0 -1 -6
1 -2 -7
2 -3 -8
>>> df.apply(np.sum) # binary functions are folded. default is axis 0 (rows)
A     6
B    21
dtype: int64
>>> df.apply(sum, axis=1) # columns
0     7
1     9
2    11

agg and transform

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment