@isteves
Last active May 25, 2022 11:40
PySpark tricks

"Exploding" aggregations

To apply the same aggregation to many columns, build the expressions in a generator and unpack them into agg():

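from pyspark.sql import functions as F  # avoids shadowing Python's built-in min()

cols_min = ["size", "age"]

# One min() expression per column, unpacked into agg() as separate arguments
df \
  .groupBy("grouping_col") \
  .agg(*(F.min(F.col(c)).alias("min_" + c) for c in cols_min))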

This is like summarize_at() or summarize(across()) in the tidyverse (dplyr).

Working with date strings

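from datetime import datetime, timedelta

def add_days_str(date_str, days):
  # Parse a YYYY-MM-DD string, shift it by `days` (negative = earlier), and format it back
  new_date = datetime.strptime(date_str, "%Y-%m-%d") + timedelta(days=days)
  return new_date.strftime("%Y-%m-%d")

add_days_str("2021-08-01", -30)  # returns "2021-07-02"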
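One possible use for this helper is building a window of date strings, for example to filter a string-typed date column. The sketch below is an illustration, not part of the original notes. `date_range_strs` and `DATE_FMT` are hypothetical names, and `add_days_str` is repeated so the block runs on its own.

```python
from datetime import datetime, timedelta

DATE_FMT = "%Y-%m-%d"

def add_days_str(date_str, days):
    """Shift a YYYY-MM-DD date string by `days` (negative values go back in time)."""
    new_date = datetime.strptime(date_str, DATE_FMT) + timedelta(days=days)
    return new_date.strftime(DATE_FMT)

def date_range_strs(start_str, n_days):
    """Hypothetical helper: n_days consecutive date strings starting at start_str."""
    return [add_days_str(start_str, i) for i in range(n_days)]

end = "2021-08-01"
start = add_days_str(end, -30)        # 30 days before `end`
window = date_range_strs(start, 31)   # every day from `start` through `end`
```

Because YYYY-MM-DD strings sort in the same order as the dates they represent, `start` and `end` can be compared directly against a string date column, for instance in a filter on a column assumed to be named `date`.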