@iaindillingham
Last active March 25, 2022 16:25
Join Strategies
# A common pattern when using OpenSAFELY for time series analysis is to extract one
# cohort for slow-to-extract variables that we don't expect to change over time, and
# multiple cohorts (e.g. by week or by month) for fast-to-extract variables that we
# expect to change over time. Each "fast" cohort is then joined to the "slow" cohort for
# analysis.
#
# In this gist, we compare the memory profiles of two join strategies found in the
# OpenSAFELY documentation: a map strategy and a merge strategy. We find that on a
# dataset with an order of magnitude difference between the population size and the
# sample size, the map strategy uses roughly twice as much memory as the merge
# strategy.
#
# https://www.opensafely.org/
# https://docs.opensafely.org/
import sys

import pandas
from memory_profiler import profile
from numpy import random

rng = random.default_rng(seed=1)


def get_all_patients(n=1_000_000):
    """Gets a set of all patients from a slow cohort-extractor extract."""
    return pandas.DataFrame(
        {
            "ethnicity": rng.integers(1, 5, size=n, endpoint=True),
        },
        index=pandas.RangeIndex(n, name="patient_id"),
    ).reset_index()


def get_some_patients(n=100_000):
    """Gets a subset of some patients from a fast cohort-extractor extract."""
    return pandas.DataFrame(
        {
            "age": rng.integers(100, size=n),
            "sex": rng.choice(["F", "M"], size=n),
        },
        index=pandas.RangeIndex(n, name="patient_id"),
    ).reset_index()


@profile
def with_map(all_patients, some_patients):
    """Joins with the map strategy."""
    mapping = dict(zip(all_patients["patient_id"], all_patients["ethnicity"]))
    return some_patients["patient_id"].map(mapping)


@profile
def with_merge(all_patients, some_patients):
    """Joins with the merge strategy."""
    return some_patients.merge(all_patients, how="left", on="patient_id")


if __name__ == "__main__":
    try:
        strategy = sys.argv[1]
    except IndexError:
        strategy = None

    if strategy not in ["with_map", "with_merge"]:
        print("Please supply a valid strategy: either 'with_map' or 'with_merge'")
        sys.exit(1)

    all_patients = get_all_patients()
    print(f"There are {len(all_patients):,} patients in the population.")

    some_patients = get_some_patients()
    print(f"There are {len(some_patients):,} patients in the sample.")

    if strategy == "with_map":
        with_map(all_patients, some_patients)
    else:
        with_merge(all_patients, some_patients)
@iaindillingham
Author

Profiles

There are 1,000,000 patients in the population.
There are 100,000 patients in the sample.
Filename: /Users/iaindillingham/Code/iaindillingham/gist/join_strategies/join_strategies.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    45    109.8 MiB    109.8 MiB           1   @profile
    46                                         def with_map(all_patients, some_patients):
    47                                             """Joins with the map strategy."""
    48    214.3 MiB    104.5 MiB           1       mapping = dict(zip(all_patients["patient_id"], all_patients["ethnicity"]))
    49    260.4 MiB     46.1 MiB           1       some_patients["patient_id"].map(mapping)
There are 1,000,000 patients in the population.
There are 100,000 patients in the sample.
Filename: /Users/iaindillingham/Code/iaindillingham/gist/join_strategies/join_strategies.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    52    108.4 MiB    108.4 MiB           1   @profile
    53                                         def with_merge(all_patients, some_patients):
    54                                             """Joins with the merge strategy."""
    55    167.1 MiB     58.7 MiB           1       some_patients.merge(all_patients, how="left", on="patient_id")

(104.5 + 46.1) / 58.7 = 2.6

The functions with_map and with_merge return None in 587f376 (profiles above). Does this affect the profiles? In 8edffed (profiles below), these functions return the result of either the map or the merge operation.

There are 1,000,000 patients in the population.
There are 100,000 patients in the sample.
Filename: /Users/iaindillingham/Code/iaindillingham/gist/join_strategies/join_strategies.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    45    109.4 MiB    109.4 MiB           1   @profile
    46                                         def with_map(all_patients, some_patients):
    47                                             """Joins with the map strategy."""
    48    213.9 MiB    104.5 MiB           1       mapping = dict(zip(all_patients["patient_id"], all_patients["ethnicity"]))
    49    260.0 MiB     46.2 MiB           1       return some_patients["patient_id"].map(mapping)
There are 1,000,000 patients in the population.
There are 100,000 patients in the sample.
Filename: /Users/iaindillingham/Code/iaindillingham/gist/join_strategies/join_strategies.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    52    109.5 MiB    109.5 MiB           1   @profile
    53                                         def with_merge(all_patients, some_patients):
    54                                             """Joins with the merge strategy."""
    55    184.6 MiB     75.1 MiB           1       return some_patients.merge(all_patients, how="left", on="patient_id")

(104.5 + 46.2) / 75.1 = 2.0
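
For reference, both ratios can be reproduced from the increments reported by memory_profiler. This is only a sketch of the arithmetic: the dictionary layout and the `increment_ratio` helper are illustrative names, not part of the gist, and the numbers are copied by hand from the profiles in this comment.

```python
# Increments (MiB) copied from the memory_profiler output in this comment.
profiles = {
    "587f376 (functions return None)": {"map": (104.5, 46.1), "merge": (58.7,)},
    "8edffed (functions return the result)": {"map": (104.5, 46.2), "merge": (75.1,)},
}


def increment_ratio(increments):
    """Return the total map increment divided by the total merge increment."""
    return sum(increments["map"]) / sum(increments["merge"])


# Hypothetical helper output: one ratio per commit, rounded for display only.
ratios = {commit: increment_ratio(inc) for commit, inc in profiles.items()}
for commit, ratio in ratios.items():
    print(f"{commit}: {ratio:.1f}")
```

Summing the increments inside each function ignores the baseline of roughly 109 MiB that both runs share before the join starts, so the ratio compares only the memory attributed to the join itself.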

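Returning the results of the operations does affect the profiles: the merge strategy's increment rises from 58.7 MiB to 75.1 MiB, while the map strategy's total increment is essentially unchanged (150.6 MiB versus 150.7 MiB). Nevertheless, the map strategy still uses roughly twice as much memory as the merge strategy.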
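
The comparison is only meaningful if both strategies attach the same ethnicity values, so a quick sanity check on toy data may be useful. The sketch below assumes two small, hand-written frames (hypothetical data, not the extracts from the gist) and repeats the bodies of `with_map` and `with_merge` without the `@profile` decorator. Note that `with_map` returns only the mapped column, whereas `with_merge` returns the whole joined frame.

```python
import pandas

# Hypothetical toy data standing in for the "slow" and "fast" extracts.
all_patients = pandas.DataFrame(
    {"patient_id": [0, 1, 2, 3, 4], "ethnicity": [1, 5, 3, 2, 4]}
)
some_patients = pandas.DataFrame(
    {"patient_id": [3, 0, 4], "age": [41, 67, 23], "sex": ["F", "M", "F"]}
)

# Map strategy: build a patient_id -> ethnicity lookup, then map the sample's ids.
mapping = dict(zip(all_patients["patient_id"], all_patients["ethnicity"]))
mapped = some_patients["patient_id"].map(mapping)

# Merge strategy: left join keeps every row of the sample, in the sample's order.
merged = some_patients.merge(all_patients, how="left", on="patient_id")

# Both strategies should yield the same ethnicity for each sampled patient.
same_values = mapped.tolist() == merged["ethnicity"].tolist()
print(same_values)
```

On data where some sampled patients are missing from the population, both strategies would be expected to produce missing values for those rows, but that case is not exercised here.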
