@mostafam
Created August 13, 2020 07:55
scala> val df = spark.createDataFrame(
Seq((0, "john.doe@gmail.com", "John"), (1, "JackieChan234@xyz.com","jack"), (2, "ping_pong@missed.org","Al"))
).toDF("id", "email", "first_name")
scala> df.show(false)
+---+---------------------+----------+
|id |email                |first_name|
+---+---------------------+----------+
|0  |john.doe@gmail.com   |John      |
|1  |JackieChan234@xyz.com|jack      |
|2  |ping_pong@missed.org |Al        |
+---+---------------------+----------+
scala> val stringChecker = new StringChecker(uid = "string_checker", model = new StringCheckerModel(caseSensitive = false)).setInputCols("email", "first_name").setOutputCol("is_it_there?")
scala> import org.apache.spark.ml.Pipeline
scala> val pipeline = new Pipeline().setStages(Array(stringChecker))
scala> val pipelineModel = pipeline.fit(df)
scala> pipelineModel.transform(df).show(false)
+---+---------------------+----------+------------+
|id |email                |first_name|is_it_there?|
+---+---------------------+----------+------------+
|0  |john.doe@gmail.com   |John      |1.0         |
|1  |JackieChan234@xyz.com|jack      |1.0         |
|2  |ping_pong@missed.org |Al        |0.0         |
+---+---------------------+----------+------------+
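
The source of StringChecker and StringCheckerModel is not included in this gist. Judging only from the output above, the stage appears to emit 1.0 when first_name occurs somewhere in email (ignoring case when caseSensitive = false) and 0.0 otherwise. The plain-Scala sketch below reproduces that row-level check under this assumption; nameInEmail, rows and flags are hypothetical names used for illustration and are not part of the original classes.

```scala
// Hypothetical helper: StringChecker's source is not part of this gist,
// so this only mirrors the behaviour suggested by the output above.
def nameInEmail(email: String, firstName: String, caseSensitive: Boolean): Double = {
  // Lower-case both strings when the check is case-insensitive.
  val (haystack, needle) =
    if (caseSensitive) (email, firstName)
    else (email.toLowerCase, firstName.toLowerCase)
  // 1.0 when the name occurs anywhere in the email, 0.0 otherwise.
  if (haystack.contains(needle)) 1.0 else 0.0
}

// Same rows as the DataFrame above, without the id column.
val rows = Seq(
  ("john.doe@gmail.com", "John"),
  ("JackieChan234@xyz.com", "jack"),
  ("ping_pong@missed.org", "Al")
)

// One flag per row, in input order.
val flags = rows.map { case (email, name) => nameInEmail(email, name, caseSensitive = false) }
```

Pasted into a Scala REPL or spark-shell, flags should line up with the is_it_there? column. Under the same assumption, setting caseSensitive = true would turn the first two rows into 0.0, since "John" and "jack" only match their emails when case is ignored.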