@pavlov99
Created September 17, 2016 09:00
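import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{levenshtein, row_number}
// The $"..." column syntax needs spark.implicits._ (imported automatically in spark-shell).

// Rank candidate teams for each schedule row by edit distance to the competitor name.
val competitorWindow = Window
  .partitionBy("date", "competitor")
  .orderBy(levenshtein($"competitor", $"short_name"))

// Fuzzy-join schedule to teams, then keep only the closest match per (date, competitor).
val scheduleRich = schedule
  .join(
    teams, levenshtein($"competitor", $"short_name") < 5, "left_outer"
  )
  .withColumn("_rank", row_number().over(competitorWindow))
  .filter($"_rank" === 1)
  .drop("_rank")

scheduleRich.drop("short_name").show(2)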
+----------+----------+---------------+--------+------------------+
|      date|competitor|           team|division|        conference|
+----------+----------+---------------+--------+------------------+
|2016-10-05|   Anaheim|  Anaheim Ducks| Pacific|Western Conference|
|2016-10-09|   Anaheim|  Anaheim Ducks| Pacific|Western Conference|
+----------+----------+---------------+--------+------------------+
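
// A minimal plain-Scala sketch (not part of the original snippet) of the idea the
// join above relies on: compute the edit distance between two names, keep candidates
// under a threshold, and pick the closest one. `editDistance` and `closest` are
// hypothetical helper names used only for illustration; the Spark code itself uses
// the built-in `levenshtein` column function.
def editDistance(a: String, b: String): Int = {
  // prev(j) = distance between the already processed prefix of `a` and b.take(j)
  var prev = Array.range(0, b.length + 1)
  for (i <- 1 to a.length) {
    val curr = new Array[Int](b.length + 1)
    curr(0) = i
    for (j <- 1 to b.length) {
      val cost = if (a(i - 1) == b(j - 1)) 0 else 1
      // cheapest of insertion, deletion, substitution
      curr(j) = math.min(math.min(curr(j - 1) + 1, prev(j) + 1), prev(j - 1) + cost)
    }
    prev = curr
  }
  prev(b.length)
}

// Roughly mirrors the join condition (distance < 5) followed by the `_rank === 1` filter.
def closest(name: String, candidates: Seq[String], maxDistance: Int = 5): Option[String] =
  candidates
    .map(c => (c, editDistance(name, c)))
    .filter(_._2 < maxDistance)
    .sortBy(_._2)
    .headOption
    .map(_._1)

// Example with made-up candidate names: an exact match has distance 0 and wins.
// closest("Anaheim", Seq("Arizona", "Anaheim"))  // Some("Anaheim")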