@lakshay-arora
Created November 4, 2019 07:16
from pyspark.ml import Pipeline
# OneHotEncoderEstimator exists in Spark 2.3-2.4; Spark 3.0+ renamed it to OneHotEncoder
from pyspark.ml.feature import StringIndexer, OneHotEncoderEstimator

# define stage 1 : transform the string column category_1 to a numeric index
stage_1 = StringIndexer(inputCol='category_1', outputCol='category_1_index')
# define stage 2 : transform the string column category_2 to a numeric index
stage_2 = StringIndexer(inputCol='category_2', outputCol='category_2_index')
# define stage 3 : one-hot encode the indexed category_2 column
stage_3 = OneHotEncoderEstimator(inputCols=['category_2_index'], outputCols=['category_2_OHE'])
# setup the pipeline
pipeline = Pipeline(stages=[stage_1, stage_2, stage_3])
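# fit the pipeline on sample_df (an existing Spark DataFrame that has the columns
# category_1 and category_2), then transform the data through all three stages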
pipeline_model = pipeline.fit(sample_df)
sample_df_updated = pipeline_model.transform(sample_df)
# view the transformed data
sample_df_updated.show()
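
To make the two kinds of stage above more concrete, here is a minimal plain-Python sketch of what they compute conceptually. It is not Spark's implementation: the helper names `string_index` and `one_hot`, the sample labels, and the alphabetical tie-breaking are assumptions made for illustration. It mirrors two default behaviours: StringIndexer assigns index 0 to the most frequent label, and the one-hot encoder drops the last category, so that category becomes an all-zero vector. Spark itself returns float indices and sparse vectors rather than Python lists.

```python
from collections import Counter


def string_index(values):
    """Map each label to an index, most frequent label first.

    Ties are broken alphabetically (an assumption of this sketch).
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: idx for idx, label in enumerate(ordered)}


def one_hot(index, num_categories, drop_last=True):
    """Return a dense one-hot list; with drop_last the final category is all zeros."""
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0
    return vec


# hypothetical values standing in for the category_2 column
category_2 = ["X", "Y", "X", "Z", "X", "Y"]

mapping = string_index(category_2)                       # like stage_2
indexed = [mapping[v] for v in category_2]
encoded = [one_hot(i, len(mapping)) for i in indexed]    # like stage_3

for label, idx, vec in zip(category_2, indexed, encoded):
    print(label, idx, vec)
```

The least frequent label here ("Z") receives the highest index and therefore the all-zero vector, which is why the encoded column has one fewer dimension than there are categories.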