Skip to content

Instantly share code, notes, and snippets.

@rikturr
Created July 30, 2020 20:11
Show Gist options
  • Save rikturr/bac3da99f390b07718bd5694a9075d16 to your computer and use it in GitHub Desktop.
Save rikturr/bac3da99f390b07718bd5694a9075d16 to your computer and use it in GitHub Desktop.
spark_features
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.pipeline import Pipeline
taxi = taxi.withColumn('pickup_weekday', F.dayofweek(taxi.tpep_pickup_datetime).cast(DoubleType()))
taxi = taxi.withColumn('pickup_hour', F.hour(taxi.tpep_pickup_datetime).cast(DoubleType()))
taxi = taxi.withColumn('pickup_minute', F.minute(taxi.tpep_pickup_datetime).cast(DoubleType()))
taxi = taxi.withColumn('pickup_week_hour', ((taxi.pickup_weekday * 24) + taxi.pickup_hour).cast(DoubleType()))
taxi = taxi.withColumn('store_and_fwd_flag', F.when(taxi.store_and_fwd_flag == 'Y', 1).otherwise(0))
taxi = taxi.withColumn('label', taxi.total_amount)
taxi = taxi.fillna(-1)
assembler = VectorAssembler(
inputCols=features,
outputCol='features',
)
pipeline = Pipeline(stages=[assembler])
assembler_fitted = pipeline.fit(taxi)
X = assembler_fitted.transform(taxi)
X.cache()
X.count()
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment