@sebastianschramm
Created August 26, 2022 14:53
An easy way to parallelize pandas.apply processing
import pandas as pd
from pandarallel import pandarallel
from sklearn.datasets import fetch_20newsgroups


def preprocess_text(row: pd.Series) -> list[str]:
    # lowercase and whitespace-tokenize the text of a single row
    return [word.lower() for word in row.text.split()]


def get_data() -> pd.DataFrame:
    return pd.DataFrame(fetch_20newsgroups(subset="train").data, columns=["text"])


if __name__ == "__main__":
    data = get_data()

    # standard pandas way of apply
    processed_text = data.apply(preprocess_text, axis=1)

    # multicore processing with pandarallel and progress bars
    pandarallel.initialize(nb_workers=2, progress_bar=True)
    parallel_processed_text = data.parallel_apply(preprocess_text, axis=1)

    # make sure we are getting the same results in both cases
    pd.testing.assert_series_equal(processed_text, parallel_processed_text)
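
For intuition about what a parallel apply does, here is a minimal standard-library sketch of the general split, apply, and recombine idea. It is not pandarallel's actual implementation. The helpers split_into_chunks, process_chunk, and parallel_apply are hypothetical names introduced only for this illustration, and preprocess_text is adapted here to take a plain string instead of a DataFrame row.

```python
from concurrent.futures import ProcessPoolExecutor
from typing import List


def preprocess_text(text: str) -> List[str]:
    """Lowercase and whitespace-tokenize one document."""
    return [word.lower() for word in text.split()]


def split_into_chunks(items: List[str], n_chunks: int) -> List[List[str]]:
    """Hypothetical helper: split items into at most n_chunks contiguous chunks."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def process_chunk(chunk: List[str]) -> List[List[str]]:
    """Work done by a single worker: apply the function to every item."""
    return [preprocess_text(text) for text in chunk]


def parallel_apply(items: List[str], nb_workers: int = 2) -> List[List[str]]:
    """Hypothetical helper: process chunks in worker processes, then recombine."""
    chunks = split_into_chunks(items, nb_workers)
    with ProcessPoolExecutor(max_workers=nb_workers) as pool:
        # pool.map yields results in input order, so concatenation keeps row order
        results = pool.map(process_chunk, chunks)
        return [tokens for chunk_result in results for tokens in chunk_result]


if __name__ == "__main__":
    texts = ["Hello World", "Parallel APPLY with pandas", "One more Row"]
    serial = [preprocess_text(text) for text in texts]
    # same check as in the gist: serial and parallel results must match
    assert parallel_apply(texts, nb_workers=2) == serial
```

Because each chunk is pickled and sent to a separate process, this kind of approach tends to pay off only when the per-row work is expensive relative to the cost of copying the data.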
@sebastianschramm

Dependencies: Python 3.9, pandarallel==1.6.3, pandas==1.4.3, scikit-learn==1.1.2