Jacek Laskowski jaceklaskowski

Exercise

Develop a Spark standalone application (using IntelliJ IDEA) with Spark MLlib and LogisticRegression to classify emails.

Think about the command line and which parameters you'd like to accept for various use cases.

TIP: Use scopt

  1. libraryDependencies += "org.apache.spark" %% "spark-mllib" % "2.1.1"
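The scopt tip above could be sketched roughly as follows. This is only an illustration, assuming the scopt 3.x `OptionParser` API and an extra `"com.github.scopt" %% "scopt"` dependency in the build; the option names (`input`, `model`, `max-iter`) and the `ClassifierConfig` case class are hypothetical and not part of the exercise.

```scala
import scopt.OptionParser

// Hypothetical set of parameters for the email classifier; adjust to your use cases.
case class ClassifierConfig(
  input: String = "",       // path to the emails to train on or classify
  model: String = "model",  // where to save or load the fitted model
  maxIter: Int = 10)        // iterations passed to LogisticRegression

object EmailClassifierArgs {
  // Each option copies its parsed value into the immutable config.
  private val parser = new OptionParser[ClassifierConfig]("email-classifier") {
    opt[String]('i', "input")
      .required()
      .action((value, config) => config.copy(input = value))
      .text("path to the input emails (required)")

    opt[String]('m', "model")
      .action((value, config) => config.copy(model = value))
      .text("path of the model directory")

    opt[Int]("max-iter")
      .action((value, config) => config.copy(maxIter = value))
      .text("maximum number of iterations")
  }

  // Returns None (after scopt reports the problem) when the arguments are invalid.
  def parse(args: Array[String]): Option[ClassifierConfig] =
    parser.parse(args, ClassifierConfig())
}
```

In `main`, a pattern match on `EmailClassifierArgs.parse(args)` would then decide whether to start the Spark job or exit early.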

IDEA:

  • breaks on demand, given the number of exercises
  • a designated "break man" who says when we should have one
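import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.StructType

// Read every file as a single string and parse it as JSON against an explicit schema
val wholeJsonRDD = sc.wholeTextFiles("input.json").map(_._2)
val mySchema = new StructType().add($"n".int)
wholeJsonRDD.toDF.withColumn("json", from_json($"value", mySchema)).show(truncate = false)

val jsonDF = spark.read.json("output.json")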
jaceklaskowski / spark-exercises.md
Last active June 26, 2022 12:04
Spark Exercises

Exercise 1

Union only those rows (from a large table) whose keys appear in the small (left) table, i.e. union two DataFrames, but keep only the rows whose key exists in the small table.
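One possible reading of this exercise can be sketched with a left semi join followed by a union. This is a sketch under assumptions, not the intended solution: both DataFrames are assumed to share the same columns, and the names `unionMatching`, `small`, `large`, and `key` are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{broadcast, col}

object UnionByKeySketch {
  // Keeps the rows of `large` whose key occurs in `small`, then appends them to `small`.
  // Assumes both DataFrames have the same columns (possibly in a different order).
  def unionMatching(small: DataFrame, large: DataFrame, key: String): DataFrame = {
    // A left semi join filters `large` without adding columns or duplicating rows.
    val matching = large.join(broadcast(small.select(key)), Seq(key), "left_semi")
    // union is positional, so align the column order with `small` first.
    small.union(matching.select(small.columns.map(col): _*))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("union-by-key").getOrCreate()
    import spark.implicits._

    val small = Seq((1, "a"), (2, "b")).toDF("key", "value")
    val large = Seq((1, "x"), (3, "y"), (2, "z"), (4, "w")).toDF("key", "value")

    unionMatching(small, large, "key").show()
    spark.stop()
  }
}
```

The `broadcast` hint is a design choice for the case where the small table comfortably fits in memory on every executor; it can be dropped without changing the result.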

Exercise 2

Aggregation on an array of nested JSON: how to sum the quantities across all lines for a given order (which would give 1 + 3 = 4 for the sample dataset below):

{
jaceklaskowski / blockchain.md
Last active March 14, 2018 15:17
Blockchains, Cryptoeconomics, Ethereum, Litecoin, Bitcoin, IOTA
jaceklaskowski / parquet.md
Last active December 26, 2017 19:16
Parquet

Parquet

Introduction

  • Stores schema information along with the data
  • Columnar storage/file format
    • "reference file format on Hadoop HDFS"
    • "read-optimized view of data"
  • Excellent for local file storage on HDFS (instead of external databases)
  • Well suited to writing very large datasets to disk
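To make the points above concrete, here is a minimal round-trip sketch using Spark's built-in Parquet support. The `roundTrip` helper, the sample data, and the temporary path are assumptions made for illustration only.

```scala
import java.nio.file.Files

import org.apache.spark.sql.{DataFrame, SparkSession}

object ParquetRoundTrip {
  // Writes a small DataFrame as Parquet and reads it back from the same path.
  def roundTrip(spark: SparkSession, path: String): DataFrame = {
    import spark.implicits._
    val people = Seq(("Alice", 30), ("Bob", 25)).toDF("name", "age")
    people.write.mode("overwrite").parquet(path)
    // No schema is supplied here: it is recovered from the Parquet files themselves.
    spark.read.parquet(path)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("parquet-sketch").getOrCreate()
    val path = Files.createTempDirectory("parquet-sketch").resolve("people.parquet").toString

    val restored = roundTrip(spark, path)
    restored.printSchema()
    // Selecting a single column lets the columnar format skip the other columns on read.
    restored.select("age").show()

    spark.stop()
  }
}
```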
jaceklaskowski / scala-something.md
Created February 24, 2017 22:42
Scala SOMETHING

Notes

  • Combinators == building blocks
  • functions and higher order functions
  • composition
  • immutable == fewer moving parts to worry about
  • a routine job == boilerplate == boring stuff
  • a context (so the job varies)
  • abstracting away == happening behind the scenes == cutting down repetitive code == eliminating boilerplate
  • "The code becomes small, succinct, and more readable"
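The notes above can be illustrated with a small, plain Scala sketch. All names here (`twice`, `normalize`, `parsePort`, the default port `8080`) are hypothetical and chosen only for the example.

```scala
object Combinators {
  // A higher-order function: takes a function and returns a new one.
  def twice[A](f: A => A): A => A = f andThen f

  // Small building blocks...
  val trim: String => String = _.trim
  val lower: String => String = _.toLowerCase

  // ...composed into a larger one without mutating anything.
  val normalize: String => String = trim andThen lower

  // Option is the context; map, filter and getOrElse hide the
  // null and validity checks that would otherwise be boilerplate.
  def parsePort(raw: String): Int =
    Option(raw)
      .map(_.trim)
      .filter(s => s.nonEmpty && s.forall(_.isDigit))
      .map(_.toInt)
      .getOrElse(8080)
}
```

For instance, `Combinators.normalize("  Hello ")` yields `"hello"`, and `Combinators.parsePort("abc")` falls back to `8080`.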
jaceklaskowski / sparkathon-agenda.md
Last active October 10, 2017 20:28
Sparkathon in Warsaw - Development Activities
jaceklaskowski / dcos.md
Last active June 30, 2018 12:13
Introduction to DC/OS
jaceklaskowski / spark-AlreadyExistsException.md
Created August 17, 2016 02:31
ERROR RetryingHMSHandler: AlreadyExistsException(message:Database default already exists)
scala> Seq(A(4)).toDS
16/08/16 19:26:26 ERROR RetryingHMSHandler: AlreadyExistsException(message:Database default already exists)
	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.create_database(HiveMetaStore.java:891)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:107)
	at com.sun.proxy.$Proxy14.create_database(Unknown Source)
jaceklaskowski / exercise-meetup.md
Last active August 3, 2016 20:01
Exercise for meetup today