@BenjaminWolfe
Last active April 19, 2021 17:50
This is a nice test case for learning to use %timeit, as well as np.random.seed. I needed to do something like df.query, but on a Series. It turns out that, at least in simple cases, it can be easy and fast with .loc and a lambda function.
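As a minimal sketch of the idea, assuming a toy Series `s` that is not part of the gist: `.loc` accepts a callable, calls it with the Series itself, and indexes with the boolean mask it returns. That lets a filter sit in the middle of a method chain without naming an intermediate variable.

```python
import pandas as pd

# Toy data, purely illustrative (not from the gist).
s = pd.Series([1, 2, 3, 1], index=["a", "b", "c", "d"])

# Two-step version: name the intermediate, then filter with a boolean mask.
masked = s[s > 1]

# Chained version: .loc calls the lambda with the Series and applies the mask.
chained = s.loc[lambda x: x > 1]

print(chained.tolist())  # [2, 3]
```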
import numpy as np, pandas as pd
df_len = 1000 # integer multiple of 4
np.random.seed(42)
# create a random data frame
df = pd.DataFrame(
    {
        # use floor division so randint gets an integer upper bound
        "group_a": np.random.randint(0, df_len // 4, size=df_len),
        "group_b": np.random.randint(0, df_len // 4, size=df_len),
        "binary": np.random.randint(0, 2, size=df_len),
    }
)
print(df)
# look for non-unique values
def get_non_unique(df=df, groupby=["group_a", "group_b"], check="binary"):
    counts = df.groupby(groupby)[check].nunique()
    return counts[counts > 1]
print(get_non_unique())
# look for non-unique values, but chain using .loc[] and a lambda function
def get_non_unique_chained(df=df, groupby=["group_a", "group_b"], check="binary"):
    return df.groupby(groupby)[check].nunique().loc[lambda x: x > 1]
print(get_non_unique_chained())
# check that the two get the same result
assert get_non_unique().equals(get_non_unique_chained())
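# time the two (%timeit is an IPython magic, so run this in IPython or Jupyter).
# I was concerned that a lambda function would be non-vectorized and slower.
# the lambda is called once with the whole Series, so the comparison inside it stays vectorized.
# it appears that, in at least some simple cases, performance is equivalent.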
%timeit get_non_unique()
%timeit get_non_unique_chained()
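`%timeit` only exists inside IPython and Jupyter. A rough equivalent for a plain Python script, assuming the standard-library `timeit` module and hypothetical helper names (`make_df`, `two_step`, `chained`, `best_time`) that are not in the gist, might look like this:

```python
import timeit

import numpy as np
import pandas as pd


def make_df(df_len=1000, seed=42):
    """Build the same kind of random frame as the gist (hypothetical helper)."""
    rng = np.random.RandomState(seed)
    return pd.DataFrame(
        {
            "group_a": rng.randint(0, df_len // 4, size=df_len),
            "group_b": rng.randint(0, df_len // 4, size=df_len),
            "binary": rng.randint(0, 2, size=df_len),
        }
    )


def two_step(df):
    # Name the intermediate counts, then filter with a boolean mask.
    counts = df.groupby(["group_a", "group_b"])["binary"].nunique()
    return counts[counts > 1]


def chained(df):
    # Same filter, expressed with .loc and a callable.
    return df.groupby(["group_a", "group_b"])["binary"].nunique().loc[lambda x: x > 1]


def best_time(func, df, number=20, repeat=3):
    """Best average seconds per call over `repeat` rounds of `number` calls."""
    return min(timeit.repeat(lambda: func(df), number=number, repeat=repeat)) / number


df = make_df()
assert two_step(df).equals(chained(df))
print(f"two-step: {best_time(two_step, df):.6f} s per call")
print(f"chained:  {best_time(chained, df):.6f} s per call")
```

Taking the minimum over several rounds is a common way to reduce noise from other processes. The absolute numbers depend on the machine and the pandas version.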