Skip to content

Instantly share code, notes, and snippets.

@gstaubli
Created March 4, 2018 04:38
Show Gist options
  • Star 1 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save gstaubli/db7ef6af8b49a07960539e1d32c6aa65 to your computer and use it in GitHub Desktop.
Save gstaubli/db7ef6af8b49a07960539e1d32c6aa65 to your computer and use it in GitHub Desktop.
PySpark Pandas UDF
from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import *
import pandas as pd
df = spark.read\
.option("header", "true")\
.option("inferSchema", "true")\
.csv("yellow_tripdata_2017-06.csv")
def timestamp_to_epoch(t):
return t.dt.strftime("%s").apply(str) # <-- pandas.Series calls
f_timestamp_copy = pandas_udf(timestamp_to_epoch, returnType=StringType())
df = df.withColumn("timestamp_copy", f_timestamp_copy(F.col("tpep_pickup_datetime")))
df.select('timestamp_copy').distinct().count() #=> 2340959 - 9-10 minute runtime (!!)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment