Java WordCount on Spark using Dataset
//tested on spark 2.0
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.sql.*;
import java.util.Arrays;

public class WordCount {
    public static void main(String[] args) {
        if (args.length < 2) {
            System.err.println("Please provide the full path of the input file and the output dir as arguments");
            System.exit(1);
        }
        SparkSession spark = SparkSession
                .builder()
                .appName("JavaWordCount")
                .getOrCreate();
        // Read each line of the input file as a String.
        Dataset<String> df = spark.read().text(args[0]).as(Encoders.STRING());
        // Split lines into lowercase words, dropping empty tokens.
        // The explicit FlatMapFunction cast resolves the Java lambda overload ambiguity.
        Dataset<String> words = df.flatMap((FlatMapFunction<String, String>) s -> {
            return Arrays.asList(s.toLowerCase().split(" ")).iterator();
        }, Encoders.STRING())
                .filter(s -> !s.isEmpty())
                .coalesce(1); // one partition (parallelism level)
        //words.printSchema(); // { value: string (nullable = true) }
        Dataset<Row> t = words.groupBy("value").count(); //<k, iter(V)>
        t = t.sort(functions.desc("count"));
        t.write().csv(args[1]); // write word counts to the output dir
        spark.stop();
    }
}
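The Dataset pipeline above (flatMap, filter, then groupBy/count) has a direct analogue in plain Java streams, which can be handy for sanity-checking the word-count logic on a small input without a cluster. A minimal sketch; the class and method names here are illustrative, not part of the gist:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class WordCountLocal {
    // Mirrors the Spark pipeline: split each line into lowercase words,
    // drop empty tokens, then count occurrences of each distinct word.
    public static Map<String, Long> count(List<String> lines) {
        return lines.stream()
                .flatMap(s -> Arrays.stream(s.toLowerCase().split(" ")))
                .filter(s -> !s.isEmpty())
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(count(List.of("Hello spark", "hello world")));
    }
}
```

Unlike the Spark version, this runs in a single JVM, so it is only a way to verify the transformation logic, not a substitute for the distributed job.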