Sample Spark 2 session demonstrating the Dataset API: a word count over /tmp/sample_07.txt in spark-shell.
[root@rkk1 Spark2StarterApp]# /usr/hdp/current/spark2-client/bin/spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
16/11/30 18:01:48 WARN SparkContext: Use an existing SparkContext, some configuration may not take effect.
Spark context Web UI available at http://172.26.81.127:4040
Spark context available as 'sc' (master = local[*], app id = local-1480528906336).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.0.0.2.5.0.0-1133
/_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_77)
Type in expressions to have them evaluated.
Type :help for more information.
scala> import spark.implicits._
import spark.implicits._
scala> val dsArr = spark.read.text("/tmp/sample_07.txt").as[String]
dsArr: org.apache.spark.sql.Dataset[String] = [value: string]
scala> val words = dsArr.flatMap(value => value.split("\\s+"))
words: org.apache.spark.sql.Dataset[String] = [value: string]
scala> val groupedWords = words.groupByKey(_.toLowerCase)
groupedWords: org.apache.spark.sql.KeyValueGroupedDataset[String,String] = org.apache.spark.sql.KeyValueGroupedDataset@5d9627d3
scala> val counts = groupedWords.count()
counts: org.apache.spark.sql.Dataset[(String, Long)] = [value: string, count(1): bigint]
scala> counts.show()
+----------+--------+
| value|count(1)|
+----------+--------+
| 13-2081| 1|
| 349140| 1|
| 45300| 1|
|electrical| 8|
| 49290| 1|
| 77930| 1|
| 9030| 2|
| art| 1|
| 52800| 1|
|paramedics| 1|
| 49630| 1|
| 39-4021| 1|
| travel| 3|
| 43-5111| 1|
| 47-3013| 1|
| 1090| 1|
| 49-2095| 1|
| 18130| 1|
| still| 1|
| surveying| 1|
+----------+--------+
only showing top 20 rows
scala>
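The pipeline above can be mirrored with plain Scala collections, which makes the Dataset operations easier to see: `flatMap` with a whitespace split tokenizes the lines, and `groupByKey(_.toLowerCase)` followed by `count()` corresponds to `groupBy` plus a group-size count. This is a minimal sketch without Spark; the sample input lines are hypothetical stand-ins for the contents of /tmp/sample_07.txt.

```scala
// Word count over plain Scala collections, mirroring the Dataset pipeline:
//   dsArr.flatMap(_.split("\\s+"))        ~ flatMap + split
//   words.groupByKey(_.toLowerCase)       ~ groupBy(_.toLowerCase)
//   groupedWords.count()                  ~ group size per key
object WordCountSketch {
  def wordCounts(lines: Seq[String]): Map[String, Long] =
    lines
      .flatMap(_.split("\\s+"))   // tokenize each line on whitespace
      .filter(_.nonEmpty)         // split can yield empty strings on leading whitespace
      .groupBy(_.toLowerCase)     // case-insensitive grouping, as in the session
      .map { case (word, occs) => (word, occs.size.toLong) }

  def main(args: Array[String]): Unit = {
    // Hypothetical sample lines standing in for /tmp/sample_07.txt
    val sample = Seq("Electrical engineers travel", "travel still travel")
    wordCounts(sample).toSeq.sortBy(-_._2).foreach { case (w, c) =>
      println(s"$w\t$c")
    }
  }
}
```

Unlike the Dataset version, this runs eagerly and holds everything in memory, so it is only suitable for small inputs; the Spark version distributes the same logic across partitions.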