Skip to content

Instantly share code, notes, and snippets.

@geoHeil
Created April 16, 2016 14:22
Show Gist options
  • Star 1 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save geoHeil/3dff11860ae042792cea6970447c4592 to your computer and use it in GitHub Desktop.
Save geoHeil/3dff11860ae042792cea6970447c4592 to your computer and use it in GitHub Desktop.
Spark window function fails in SQL
import org.apache.spark.sql.{ Row, SQLContext }
import org.apache.spark.sql.expressions.Window
val sql = new org.apache.spark.sql.SQLContext(sc)
val dataset = Seq(
("Thin", "cell phone", 6000),
("Normal", "tablet", 1500),
("Mini", "tablet", 5500),
("Ultra thin", "cell phone", 5000),
("Very thin", "cell phone", 6000),
("Big", "tablet", 2500),
("Bendable", "cell phone", 3000),
("Foldable", "cell phone", 3000),
("Pro", "tablet", 4500),
("Pro2", "tablet", 6500))
val df = dataset.toDF("product", "category", "revenue")
df.show
df.where('category === "tablet").show
val overCategory = Window.partitionBy('category).orderBy('revenue.desc)
val rank = dense_rank.over(overCategory)
val ranked = df.withColumn("rank", dense_rank.over(overCategory))
ranked.show
ranked.where('rank <= 2).show
################
Plain SQL fails
################
df.registerTempTable("productRevenue")
sql.sql("SELECT product,category,revenue FROM (SELECT product,category,revenue,dense_rank() OVER (PARTITION BY category ORDER BY revenue DESC) as rank FROM productRevenue) tmp WHERE rank <= 2")
sql.sql("SELECT product,category,revenue,dense_rank() OVER (PARTITION BY category ORDER BY revenue DESC) as rank FROM productRevenue")
@geoHeil
Copy link
Author

geoHeil commented Apr 16, 2016

Although scala works the SQL string fails. what is wrong?
Error:

sql.sql("SELECT product,category,revenue,dense_rank() OVER (PARTITION BY category ORDER BY revenue DESC) as rank FROM productRevenue")
java.lang.RuntimeException: [1.51] failure: ``union'' expected but `(' found

SELECT product,category,revenue,dense_rank() OVER (PARTITION BY category ORDER BY revenue DESC) as rank FROM productRevenue
                                                  ^
    at scala.sys.package$.error(package.scala:27)
    at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:36)
    at org.apache.spark.sql.catalyst.DefaultParserDialect.parse(ParserDialect.scala:67)
    at org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:211)
    at org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:211)
    at org.apache.spark.sql.execution.SparkSQLParser$$anonfun$org$apache$spark$sql$execution$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:114)
    at org.apache.spark.sql.execution.SparkSQLParser$$anonfun$org$apache$spark$sql$execution$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:113)

@geoHeil
Copy link
Author

geoHeil commented Apr 16, 2016

I just found https://forums.databricks.com/questions/7409/can-not-run-code-in-blog-post-introducing-window-f.html apparently it only works in a Hive-context. Stil need to verify this.

@geoHeil
Copy link
Author

geoHeil commented Apr 16, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment