Skip to content

Instantly share code, notes, and snippets.

@jaceklaskowski
Last active June 15, 2017 11:56
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save jaceklaskowski/d6e819e4a2a1b85ce3468141619008bd to your computer and use it in GitHub Desktop.
Save jaceklaskowski/d6e819e4a2a1b85ce3468141619008bd to your computer and use it in GitHub Desktop.
Day 4

Exercise

Develop a Spark standalone application (using IntelliJ IDEA) with Spark MLlib and LogisticRegression to classify emails.

Think about command line and what parameters you'd like to accept for various use cases.

TIP Use scopt

  1. libraryDependencies += "org.apache.spark" %% "spark-mllib" % "2.1.1"

Exercise

Find the record with the maximum value (in a column) per category.

val records = Seq(
  (0, "hello", 100, 0),
  (1, "world", 200, 0),
  (2, "witaj swiecie", 150, 1)).toDF("id", "token", "level", "category")
scala> records.show
+---+-------------+-----+--------+
| id|        token|level|category|
+---+-------------+-----+--------+
|  0|        hello|  100|       0|
|  1|        world|  200|       0|
|  2|witaj swiecie|  150|       1|
+---+-------------+-----+--------+

Traditional Approach (groupBy + join)

val q = records.groupBy("category").agg(max("level") as "max_level").join(records, "category").where($"max_level" === $"level")

// Solution 1
q.select(records.columns.head, records.columns.tail: _*).show

// Solution 2
val cols = records.columns.map(name => col(name))
q.select(cols: _*).show

Window Aggregate

scala> import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.expressions.Window

scala> val categories = Window.partitionBy("category")
categories: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@5e8d66d9

scala> records.select($"*", max("level") over categories as "max_level").filter($"max_level" === $"level").show
+---+-------------+-----+--------+---------+
| id|        token|level|category|max_level|
+---+-------------+-----+--------+---------+
|  2|witaj swiecie|  150|       1|      150|
|  1|        world|  200|       0|      200|
+---+-------------+-----+--------+---------+
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment