Jacek Laskowski (jaceklaskowski): With Spark on Kubernetes
jaceklaskowski / Hadoop Properties for Spark in Cloud (s3a, buckets)
Last active Apr 22, 2021

The following is a list of Hadoop properties for Spark to use HDFS more effectively.

spark.hadoop.-prefixed Spark properties are used to configure the Hadoop Configuration that Spark broadcasts to tasks. Use spark.sparkContext.hadoopConfiguration to review the resulting properties.

  • spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version = 2
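The prefix-stripping Spark applies to such properties can be sketched in plain Scala (the property map below is illustrative; in a real app you would set the property via --conf or SparkSession.builder.config and inspect it through spark.sparkContext.hadoopConfiguration):

```scala
// Sketch of how spark.hadoop.-prefixed properties end up in the Hadoop
// Configuration: the prefix is dropped, the rest becomes a Hadoop property.
val sparkProps = Map(
  "spark.app.name" -> "demo",
  "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version" -> "2")

val hadoopProps = sparkProps.collect {
  case (key, value) if key.startsWith("spark.hadoop.") =>
    key.stripPrefix("spark.hadoop.") -> value
}
// hadoopProps: Map(mapreduce.fileoutputcommitter.algorithm.version -> 2)
```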

Google Cloud Storage

hbase-connectors git:(master) mvn -Dspark.version=2.4.0 -Dscala.version=2.11.12 -Dscala.binary.version=2.11 clean install

[INFO] Scanning for projects...
[INFO] ------------------------------------------------------------------------
[INFO] Detecting the operating system and CPU architecture
[INFO] ------------------------------------------------------------------------
[INFO] osx
[INFO] os.detected.arch: x86_64
[INFO] os.detected.version: 10.14
jaceklaskowski / Kafka Streams Workshop
Last active Jan 19, 2019


Exercise: KStream.transformValues

Use KStream.transformValues
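The essence of KStream.transformValues is a per-record value transformation with access to mutable, stateful context (in Kafka Streams you implement ValueTransformer and hand a ValueTransformerSupplier to transformValues, typically backed by a state store). A plain-Scala sketch of that semantics, with a field standing in for the state store:

```scala
// Plain-Scala analogue of a ValueTransformer: maps each value while
// updating state; `seen` stands in for a Kafka Streams state store.
class CountingTransformer {
  private var seen = 0L
  def transform(value: String): (String, Long) = {
    seen += 1
    (value, seen)
  }
}
```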

Exercise: Using Materialized

val materialized = Materialized.as[String, Long, ByteArrayKeyValueStore]("poznan-state-store")
  import java.util

  import org.apache.kafka.common.serialization.Serializer

  case class Person(id: Long, name: String)

  class PersonSerializer extends Serializer[Person] {
    override def configure(configs: util.Map[String, _], isKey: Boolean): Unit = {}

    override def serialize(topic: String, data: Person): Array[Byte] = {
      println(s">>> serialize($topic, $data)")
      // assumed encoding -- the gist is truncated here; any byte encoding works
      s"${data.id},${data.name}".getBytes("UTF-8")
    }

    override def close(): Unit = {}
  }
jaceklaskowski / Learning R with RStudio on macOS
Last active Sep 11, 2019
jaceklaskowski / Anatolyi - Facebook Profiles
Created Dec 16, 2017
// Let's create a sample dataset with just a single line, i.e. facebook profile
val facebookProfile = "ActivitiesDescription:703 likes, 0 talking about this, 4 were here; Category:;; Hours:Mon-Fri: 8:00 am - 5:00 pm; Likes:703; Link:; Location:165 W Wieuca Rd NE, Ste 310, Atlanta, Georgia; Name:PV Heating & Air; NumberOfPictures:0; NumberOfReviews:26; Phone:(404) 798-9672; ShortDescription:We specialize in residential a/c, heating, indoor air quality & home performance.; Url:; Visitors:4"
val fbs = Seq(facebookProfile).toDF("profile")

scala> fbs.show(truncate = false)
spark.readStream
  .format("kafka")
  .option("subscribe", "topic1")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .load
  .select('value cast "string", 'timestamp)
  .groupBy(window($"timestamp", "5 seconds"))
  .count()
  .writeStream
  .outputMode(OutputMode.Complete)
  .format("console")
  .option("truncate", false)
  .start


Develop a Spark standalone application (using IntelliJ IDEA) with Spark MLlib and LogisticRegression to classify emails.

Think about the command line and the parameters you'd like to accept for various use cases.

TIP: Use scopt

  1. libraryDependencies += "org.apache.spark" %% "spark-mllib" % "2.1.1"
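Before wiring in scopt, the shape of the command line can be sketched with a hand-rolled parser (the option names below are hypothetical; scopt replaces this recursion with declarative opt[...] definitions plus help text and validation):

```scala
// Hypothetical options for the email-classification app.
case class AppConfig(input: String = "", regParam: Double = 0.0)

@annotation.tailrec
def parseArgs(args: List[String], config: AppConfig = AppConfig()): AppConfig =
  args match {
    case Nil => config
    case "--input" :: value :: rest =>
      parseArgs(rest, config.copy(input = value))
    case "--regParam" :: value :: rest =>
      parseArgs(rest, config.copy(regParam = value.toDouble))
    case unknown :: _ =>
      sys.error(s"Unknown option: $unknown")
  }
```

For example, parseArgs(List("--input", "emails.csv", "--regParam", "0.1")) yields AppConfig("emails.csv", 0.1).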


  • breaks on demand, given the number of exercises
  • a break whenever someone says we should have one
val wholeJsonRDD = sc.wholeTextFiles("input.json").map(_._2)
val mySchema = new StructType().add($"n".int)
wholeJsonRDD.toDF.withColumn("json", from_json($"value", mySchema)).show(truncate = false)


Exercise 1

Union only those rows (from the large table) with keys in the small table, i.e. combine two DataFrames but keep only the rows whose key appears in the small table.
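One reading of the exercise is a left-semi join, which Spark SQL offers as large.join(small, Seq("key"), "left_semi") (the DataFrame names and key column are assumptions). The filtering logic, sketched on plain Scala collections:

```scala
// Plain-Scala sketch of a left-semi join: keep rows of `large` whose
// key occurs in `small`; values from `small` are not carried over.
case class Row(key: Int, value: String)

val large = Seq(Row(1, "a"), Row(2, "b"), Row(3, "c"))
val small = Seq(Row(1, "x"), Row(3, "y"))

val smallKeys = small.map(_.key).toSet
val kept = large.filter(row => smallKeys.contains(row.key))
// kept: Seq(Row(1, "a"), Row(3, "c"))
```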

Exercise 2

Aggregation on an array of nested JSON: how to sum the quantities across all lines for a given order (which would give 1 + 3 = 4 for the sample dataset below):
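Assuming each order carries an array of line objects with a quantity field (the sample dataset itself does not survive in this excerpt, so the shape below is an assumption), the Spark SQL version would explode the array and then groupBy + sum. The arithmetic itself, on plain Scala case classes:

```scala
// Hypothetical shape for the nested JSON: an order holding an array of lines.
case class Line(quantity: Long)
case class Order(id: Long, lines: Seq[Line])

val order = Order(1, Seq(Line(1), Line(3)))

// In Spark SQL: explode($"lines"), then sum($"line.quantity") per order.
val total = order.lines.map(_.quantity).sum
// total: 4
```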