@AdroitAnandAI
Created June 6, 2021 07:13
ML Pipeline
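# Configure an ML pipeline with three stages: Tokenizer, CountVectorizer, and LogisticRegression.
# https://spark.apache.org/docs/latest/ml-pipeline.html
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import CountVectorizer, Tokenizer

# Split the "text" column into lowercase words.
# Refer: https://spark.apache.org/docs/latest/ml-features.html#tokenizer
tokenizer = Tokenizer(inputCol="text", outputCol="words")
# Turn word lists into term-count vectors; minDF=2.0 drops terms found in fewer than 2 documents.
# Refer: https://spark.apache.org/docs/latest/ml-features.html#countvectorizer
cv = CountVectorizer(inputCol=tokenizer.getOutputCol(),
                     outputCol="features", minDF=2.0)
# LogisticRegression reads the "features" and "label" columns by default.
lr = LogisticRegression(maxIter=10, regParam=0.001)
pipeline = Pipeline(stages=[tokenizer, cv, lr])

# Fit the pipeline to the training documents.
# `training` is assumed to be a DataFrame with "text" and "label" columns, defined elsewhere.
model = pipeline.fit(training)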
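
To make the first two stages easier to picture, here is a rough plain-Python sketch of what the tokenizer and the count vectorizer with minDF=2 compute. It does not use Spark. The helper names (`tokenize`, `fit_vocabulary`, `count_vector`) and the sample sentences are hypothetical, the whitespace splitting only approximates Spark's Tokenizer, and the alphabetical tie-break in the vocabulary order is a choice made for this sketch, not documented Spark behavior.

```python
from collections import Counter


def tokenize(text):
    # Approximation of the Tokenizer stage: lowercase, then split on whitespace.
    return text.lower().split()


def fit_vocabulary(docs, min_df=2):
    # Keep only terms that occur in at least `min_df` documents,
    # which is how an integer-valued minDF is interpreted.
    term_freq = Counter()
    doc_freq = Counter()
    for tokens in docs:
        term_freq.update(tokens)
        doc_freq.update(set(tokens))
    kept = [term for term in term_freq if doc_freq[term] >= min_df]
    # Most frequent terms first; ties broken alphabetically (sketch-only choice).
    kept.sort(key=lambda term: (-term_freq[term], term))
    return kept


def count_vector(tokens, vocabulary):
    # One count per vocabulary term, in vocabulary order (dense for readability).
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


# Hypothetical sample documents.
texts = ["spark is fast", "spark runs on hadoop", "hadoop is mature"]
docs = [tokenize(t) for t in texts]
vocabulary = fit_vocabulary(docs, min_df=2)
vectors = [count_vector(d, vocabulary) for d in docs]

print(vocabulary)
print(vectors)
```

Words such as "fast" and "runs" occur in only one document, so they never enter the vocabulary. In the Spark pipeline, the resulting count vectors land in the "features" column that the logistic regression stage consumes.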