Jacek Laskowski (jaceklaskowski)

🎯 With Spark on Kubernetes

jaceklaskowski / hadoop-spark-properties.md (last active Apr 22, 2021)
Hadoop Properties for Spark in Cloud (s3a, buckets)

Hadoop Properties for Spark in Cloud

The following is a list of Hadoop properties for Spark to use HDFS (and cloud object stores such as S3) more effectively.

spark.hadoop.-prefixed Spark properties are used to configure the Hadoop Configuration that Spark broadcasts to tasks. Use spark.sparkContext.hadoopConfiguration to review the effective properties.

  • spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version = 2
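
A minimal spark-shell-style sketch (the application name is made up for illustration): set the committer algorithm with a spark.hadoop.-prefixed property and confirm it ends up in the Hadoop Configuration that Spark broadcasts to tasks.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hadoop-props-demo")
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .getOrCreate()

// The spark.hadoop. prefix is stripped and the remainder becomes a Hadoop property
val hadoopConf = spark.sparkContext.hadoopConfiguration
println(hadoopConf.get("mapreduce.fileoutputcommitter.algorithm.version"))  // prints 2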

Google Cloud Storage

jaceklaskowski / errors.md
hbase-connectors git:(master) mvn -Dspark.version=2.4.0 -Dscala.version=2.11.12 -Dscala.binary.version=2.11 clean install

[INFO] Scanning for projects...
[INFO] ------------------------------------------------------------------------
[INFO] Detecting the operating system and CPU architecture
[INFO] ------------------------------------------------------------------------
[INFO] os.detected.name: osx
[INFO] os.detected.arch: x86_64
[INFO] os.detected.version: 10.14
jaceklaskowski / workshop.md (last active Jan 19, 2019)
Kafka Streams Workshop

Workshop

Exercise: KStream.transformValues

Use KStream.transformValues
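
A minimal sketch for the exercise, assuming the Kafka Streams 2.x Scala DSL and made-up topic names (input-topic, output-topic): transform every record value while keeping the key unchanged.

import org.apache.kafka.streams.kstream.{ValueTransformer, ValueTransformerSupplier}
import org.apache.kafka.streams.processor.ProcessorContext
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.Serdes._

// Stateless example: upper-case the value. A stateful transformer would fetch
// a state store from the ProcessorContext in init.
class UppercaseTransformer extends ValueTransformer[String, String] {
  override def init(context: ProcessorContext): Unit = {}
  override def transform(value: String): String = value.toUpperCase
  override def close(): Unit = {}
}

val builder = new StreamsBuilder
builder
  .stream[String, String]("input-topic")
  .transformValues(new ValueTransformerSupplier[String, String] {
    override def get(): ValueTransformer[String, String] = new UppercaseTransformer
  })
  .to("output-topic")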

Exercise: Using Materialized

val materialized = Materialized.as[String, Long, ByteArrayKeyValueStore]("poznan-state-store")
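
A usage sketch for that Materialized instance, assuming the Kafka Streams 2.x Scala DSL and made-up topic names (words, word-counts): count records per key into the poznan-state-store state store.

import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.Serdes._

val builder = new StreamsBuilder
val words = builder.stream[String, String]("words")
// count accepts the Materialized instance through its implicit parameter list
val counts = words.groupByKey.count()(materialized)
counts.toStream.to("word-counts")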
jaceklaskowski / person.md
  import java.util
  import org.apache.kafka.common.serialization.Serializer

  case class Person(id: Long, name: String)

  class PersonSerializer extends Serializer[Person] {
    override def configure(configs: util.Map[String, _], isKey: Boolean): Unit = {}

    override def serialize(topic: String, data: Person): Array[Byte] = {
      println(s">>> serialize($topic, $data)")
      s"${data.id},${data.name}".map(_.toByte).toArray
    }

    override def close(): Unit = {}
  }
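
A usage sketch (the broker address and topic name are assumptions, and PersonSerializer must be compiled as a top-level class so Kafka can instantiate it by name): register PersonSerializer as the producer's value serializer.

  import java.util.Properties
  import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
  import org.apache.kafka.common.serialization.StringSerializer

  val props = new Properties()
  props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
  props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
  props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[PersonSerializer].getName)

  val producer = new KafkaProducer[String, Person](props)
  producer.send(new ProducerRecord("people", "1", Person(1, "Jacek")))
  producer.close()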
jaceklaskowski / rstudio-macos.md (last active Sep 11, 2019)
Learning R with RStudio on macOS
jaceklaskowski / anatolyi.md (created Dec 16, 2017)
Anatolyi - Facebook Profiles
// Let's create a sample dataset with just a single line, i.e. facebook profile
val facebookProfile = "ActivitiesDescription:703 likes, 0 talking about this, 4 were here; Category:; Email:joe@pvhvac.com; Hours:Mon-Fri: 8:00 am - 5:00 pm; Likes:703; Link:https://www.facebook.com/pvhvac; Location:165 W Wieuca Rd NE, Ste 310, Atlanta, Georgia; Name:PV Heating & Air; NumberOfPictures:0; NumberOfReviews:26; Phone:(404) 798-9672; ShortDescription:We specialize in residential a/c, heating, indoor air quality & home performance.; Url:http://www.pvhvac.com; Visitors:4"
val fbs = Seq(facebookProfile).toDF("profile")

scala> fbs.show(truncate = false)
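
A minimal parsing sketch (the splitting rule is an assumption about the exercise): explode the semicolon-separated attributes and split each one on its first colon, so URLs that contain colons stay intact.

import org.apache.spark.sql.functions._

val pairs = fbs
  .select(explode(split($"profile", "; ")) as "attribute")
  .select(
    substring_index($"attribute", ":", 1) as "key",
    expr("substring(attribute, instr(attribute, ':') + 1)") as "value")

pairs.show(truncate = false)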
jaceklaskowski / structured-streaming.md
// spark-shell style (spark.implicits._ already in scope)
import org.apache.spark.sql.functions.window
import org.apache.spark.sql.streaming.OutputMode.Complete

spark
  .readStream
  .format("kafka")
  .option("subscribe", "topic1")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .load
  .select('value cast "string", 'timestamp)
  .groupBy(window($"timestamp", "5 seconds"))
  .count
  .writeStream
  .outputMode(Complete)
  .format("console")
  .option("truncate", false)
  .start
jaceklaskowski / spark-day4.md

Exercise

Develop a Spark standalone application (using IntelliJ IDEA) with Spark MLlib and LogisticRegression to classify emails.

Think about the command line and what parameters you'd like to accept for various use cases.

TIP Use scopt to parse them (a sketch follows the dependency below).

  1. libraryDependencies += "org.apache.spark" %% "spark-mllib" % "2.1.1"
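
A minimal command-line sketch, assuming scopt 3.x (which needs its own libraryDependencies entry) and a made-up --input parameter; the ML pipeline itself is left as the exercise.

import scopt.OptionParser

case class Config(input: String = "", master: String = "local[*]")

object EmailClassifier {

  // Command-line definition: --input is required, --master is optional
  val parser = new OptionParser[Config]("email-classifier") {
    opt[String]("input").required()
      .action((value, config) => config.copy(input = value))
      .text("Path to the training emails")
    opt[String]("master")
      .action((value, config) => config.copy(master = value))
      .text("Spark master URL")
  }

  def main(args: Array[String]): Unit = {
    parser.parse(args, Config()) match {
      case Some(config) =>
        // Build a SparkSession with config.master and train LogisticRegression on config.input
        println(config)
      case None =>
        sys.exit(1)
    }
  }
}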
jaceklaskowski / notes.md

IDEA:

  • breaks on demand given the number of exercises
  • break man who says we should have one
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.StructType

// Read whole JSON files as single values and parse them with from_json
val wholeJsonRDD = sc.wholeTextFiles("input.json").map(_._2)
val mySchema = new StructType().add($"n".int)
wholeJsonRDD.toDF.withColumn("json", from_json($"value", mySchema)).show(truncate = false)

val jsonDF = spark.read.json("output.json")
jaceklaskowski / spark-exercises.md

Exercise 1

Union only those rows (from the large table) whose keys appear in the small table, i.e. union the two DataFrames, but keep only the rows whose key exists in the small table (see the sketch below).
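
A minimal sketch, assuming two hypothetical DataFrames large and small that share the same schema with a key column: a left-semi join keeps only the matching rows of the large table, which are then unioned with the small table.

// large and small are hypothetical DataFrames with identical schemas and a "key" column
val matching = large.join(small.select("key"), Seq("key"), "left_semi")
val result = small.union(matching)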

Exercise 2

Aggregation on an array of nested JSON: how to sum the quantities across all lines for a given order (which would give 1 + 3 = 4 for the sample dataset below):

{