Last active April 19, 2021 17:50
Gist: BenjaminWolfe/cc1af6e14ad3fab91da4032484e8ddee
This is a nice test case for learning to use %timeit, as well as np.random.seed. I needed to do something like df.query, but on a Series. It turns out that, at least in simple cases, this is both easy and fast with .loc and a lambda function.
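For a quick feel for the idiom, here is a minimal sketch on a hand-made Series (the values and the names `s`, `masked`, and `chained` are illustrative, not from the gist). When `.loc` is given a callable, pandas calls it with the Series itself and indexes with the boolean mask it returns, so the filter can sit at the end of a method chain without an intermediate variable.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 1], index=["a", "b", "c", "d"])

# Two-step version: name the Series, then index it with a boolean mask.
masked = s[s > 1]

# Chained version: .loc calls the lambda with the Series and applies the mask.
chained = s.loc[lambda x: x > 1]

print(chained.index.tolist())  # ['b', 'c']
```

Both forms build the same vectorized mask; the lambda runs once on the whole Series, not once per element.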
import numpy as np
import pandas as pd

df_len = 1000  # integer multiple of 4
np.random.seed(42)  # fix the seed so the random frame is reproducible

# Create a random data frame.
df = pd.DataFrame(
    {
        # Integer division keeps the upper bound an int.
        "group_a": np.random.randint(0, df_len // 4, size=df_len),
        "group_b": np.random.randint(0, df_len // 4, size=df_len),
        "binary": np.random.randint(0, 2, size=df_len),
    }
)
print(df)


# Look for groups that hold more than one distinct value of `check`.
def get_non_unique(df=df, groupby=["group_a", "group_b"], check="binary"):
    counts = df.groupby(groupby)[check].nunique()
    return counts[counts > 1]


print(get_non_unique())


# Same check, but chained using .loc[] and a lambda function.
def get_non_unique_chained(df=df, groupby=["group_a", "group_b"], check="binary"):
    return df.groupby(groupby)[check].nunique().loc[lambda x: x > 1]


print(get_non_unique_chained())

# Check that the two return the same result (same index, same values).
assert get_non_unique().equals(get_non_unique_chained())

# Time the two. I was concerned that a lambda function would be non-vectorized and slower.
# It appears that, in at least some simple cases, performance is equivalent.
# Note: %timeit is an IPython/Jupyter magic, so these two lines only run there.
%timeit get_non_unique()
%timeit get_non_unique_chained()
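Since %timeit only exists inside IPython or Jupyter, here is a rough sketch of the same comparison as a plain Python script, using the standard-library timeit module. The function names `two_step` and `chained`, the `keys` list, and the repeat counts are choices made for this sketch, not part of the gist, and the timings it prints will vary by machine.

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(42)
n = 1000
df = pd.DataFrame(
    {
        "group_a": np.random.randint(0, n // 4, size=n),
        "group_b": np.random.randint(0, n // 4, size=n),
        "binary": np.random.randint(0, 2, size=n),
    }
)
keys = ["group_a", "group_b"]


def two_step():
    # Name the intermediate Series, then filter it with a boolean mask.
    counts = df.groupby(keys)["binary"].nunique()
    return counts[counts > 1]


def chained():
    # Same filter, expressed as a callable passed to .loc.
    return df.groupby(keys)["binary"].nunique().loc[lambda x: x > 1]


# Best of 3 repeats of 20 calls each; the minimum is the least noisy figure.
timings = {
    f.__name__: min(timeit.repeat(f, number=20, repeat=3)) / 20
    for f in (two_step, chained)
}
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e3:.3f} ms per call")
```

The groupby dominates the cost of both functions, which is one plausible reason the two timings come out so close: the lambda is called once on the whole Series and adds only a single function call.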