isteves/pyspark_tricks.md

# PySpark tricks

## "Exploding" aggregations

To apply the same aggregation to many columns, generate the expressions in a loop and unpack them into `agg()` instead of writing each one out:
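
```python
from pyspark.sql.functions import col, min  # note: this min shadows Python's built-in min

cols_min = ["size", "age"]

# Build one min() expression per column, then unpack them as separate arguments to agg()
df \
  .groupBy("grouping_col") \
  .agg(*(min(col(c)).alias('min_' + c) for c in cols_min))
```
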
This is similar to `summarize_at()` or `summarize(across())` in the tidyverse.
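
The `*` in front of the generator is what makes this pattern work: it spreads the generated items into separate positional arguments. Here is a minimal pure-Python sketch of that mechanism. The `agg` function below is a hypothetical stand-in that just records what it receives; it is not PySpark's `agg()`, and plain strings stand in for column expressions.

```python
def agg(*exprs):
    """Hypothetical stand-in for a variadic method; returns its arguments as a list."""
    return list(exprs)

cols_min = ["size", "age"]

# The generator yields one item per column; * spreads them into separate arguments.
unpacked = agg(*("min_" + c for c in cols_min))

# Without *, the whole generator arrives as a single argument.
not_unpacked = agg("min_" + c for c in cols_min)

print(unpacked)           # ['min_size', 'min_age']
print(len(not_unpacked))  # 1
```

Adding a column to the aggregation then only means appending its name to `cols_min`.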

## Working with date strings

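```python
from datetime import datetime, timedelta

def add_days_str(date_str, days):
    """Shift a YYYY-MM-DD date string by `days` (negative values go back in time)."""
    new_date = datetime.strptime(date_str, "%Y-%m-%d") + timedelta(days=days)
    return new_date.strftime("%Y-%m-%d")

add_days_str("2021-08-01", -30)  # '2021-07-02'
```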
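
The same parse, shift, and format approach can produce a whole run of date strings, which may be handy when building a list of dates to filter on. This is only a sketch: `date_range_str` and its `fmt` parameter are hypothetical names chosen for illustration, and it assumes both inputs use the same format.

```python
from datetime import datetime, timedelta

def date_range_str(start_str, end_str, fmt="%Y-%m-%d"):
    """Hypothetical helper: list every date string from start_str to end_str, inclusive."""
    start = datetime.strptime(start_str, fmt)
    end = datetime.strptime(end_str, fmt)
    n_days = (end - start).days
    # An end date before the start date yields an empty list.
    return [(start + timedelta(days=i)).strftime(fmt) for i in range(n_days + 1)]

print(date_range_str("2021-07-30", "2021-08-02"))
# ['2021-07-30', '2021-07-31', '2021-08-01', '2021-08-02']
```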