@kvnkho
Last active April 18, 2021 21:24
Comparing Pandas and Spark
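# Comparison of creating the inferred_state column
# Maps a phone number's 3-digit area code to a state
area_to_state = {"217": "IL", "312": "IL", "415": "CA", "352": "FL"}

# Pandas implementation
# Take home_state, fall back to work_state, then to the state mapped from the area code
df['inferred_state'] = (
    df['home_state']
    .fillna(df['work_state'])
    .fillna(df['phone'].str.slice(0, 3).map(area_to_state))
)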

# Spark implementation
from itertools import chain
from pyspark.sql.functions import coalesce, col, create_map, lit, substring

# Build a literal map column; create_map takes alternating key and value columns
mapping_expr = create_map([lit(x) for x in chain(*area_to_state.items())])
df = df.withColumn(
    'inferred_state',
    coalesce(
        col('home_state'),
        col('work_state'),
        # Spark substring positions are 1-based, so the area code starts at 1
        mapping_expr.getItem(substring(col('phone'), 1, 3)),
    ),
)
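
The snippet above assumes a `df` that already has `home_state`, `work_state`, and `phone` columns. As a minimal sketch of the Pandas fallback chain on made-up data (the rows and the `999` area code are hypothetical, chosen only to exercise each branch), it might be run like this:

```python
import pandas as pd

area_to_state = {"217": "IL", "312": "IL", "415": "CA", "352": "FL"}

# Hypothetical sample rows, one per fallback case
df = pd.DataFrame({
    "home_state": ["IL", None, None, None],
    "work_state": ["CA", "FL", None, None],
    "phone": ["217-123-4567", "312-123-4567", "415-123-4567", "999-123-4567"],
})

# Same chain as the Pandas implementation: home_state, then work_state,
# then the state looked up from the first three characters of phone
df["inferred_state"] = (
    df["home_state"]
    .fillna(df["work_state"])
    .fillna(df["phone"].str.slice(0, 3).map(area_to_state))
)

result = df["inferred_state"].tolist()
print(result)
```

The first row keeps its `home_state`, the second falls back to `work_state`, the third is resolved from the area code, and the last stays missing because `999` is not in `area_to_state`.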