Skip to content

Instantly share code, notes, and snippets.

@MLnick
Created August 18, 2016 07:55
Show Gist options
  • Star 1 You must be signed in to star a gist
  • Fork 1 You must be signed in to fork a gist
  • Save MLnick/9000e78577f4b3f3168cd1716008e202 to your computer and use it in GitHub Desktop.
Save MLnick/9000e78577f4b3f3168cd1716008e202 to your computer and use it in GitHub Desktop.
Using SQLTransformer to join DataFrames in ML Pipeline
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.0.0
/_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_77)
Type in expressions to have them evaluated.
Type :help for more information.
scala> import org.apache.spark.ml.feature.SQLTransformer
import org.apache.spark.ml.feature.SQLTransformer
scala> val df1 = spark.createDataFrame(Seq((1, "foo"), (2, "bar"), (3, "baz"))).toDF("id", "text")
df1: org.apache.spark.sql.DataFrame = [id: int, text: string]
scala> df1.createOrReplaceTempView("table1")
scala> val df2 = spark.createDataFrame(Seq((3, 4), (2, 2), (3, 5))).toDF("id", "num")
df2: org.apache.spark.sql.DataFrame = [id: int, num: int]
scala> val st = new SQLTransformer()
st: org.apache.spark.ml.feature.SQLTransformer = sql_35c801a50fba
scala> st.setStatement("SELECT t1.id, t1.text, th.num FROM __THIS__ th JOIN table1 t1 ON th.id == t1.id")
res1: st.type = sql_35c801a50fba
scala> st.transform(df2).show
+---+----+---+
| id|text|num|
+---+----+---+
| 3| baz| 4|
| 2| bar| 2|
| 3| baz| 5|
+---+----+---+
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment