Sparkathon in Warsaw - Development Activities

Tuesday, September 12, 2017

Sparkathon - Developing Spark Structured Streaming Apps in Scala

Spark SQL

  1. Multiple groupBy or groupByKey aggregations in a batch structured query
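
A minimal sketch of the item above, assuming a local SparkSession and a made-up Dataset of (word, count) pairs; in a batch query several groupBy/groupByKey aggregations over the same Dataset can simply live side by side.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

object MultipleBatchAggregations {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("multi-agg").master("local[*]").getOrCreate()
    import spark.implicits._

    // Made-up input: (word, count) pairs.
    val words = Seq(("spark", 1), ("streaming", 2), ("spark", 3)).toDS

    // Two independent aggregations over the same batch Dataset:
    val totals = words.groupBy($"_1" as "word").agg(sum($"_2") as "total") // untyped groupBy
    val occurrences = words.groupByKey(_._1).count()                       // typed groupByKey

    totals.show()
    occurrences.show()
  }
}
```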

Spark Structured Streaming

  1. Developing custom Sink (using StreamSinkProvider)
  2. Answering the StackOverflow question: how to count items per time window? (see the first sketch after this list)
  3. Understanding OutputMode
  4. Using the flatMapGroupsWithState operator to mimic the output modes: Complete, Append and Update (see the second sketch after this list)
  5. Multiple flatMapGroupsWithState in a streaming query
  6. Multiple groupBy or groupByKey aggregations in a streaming structured query
  7. Streaming aggregation with Append output mode requires watermark
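
A minimal sketch, assuming Spark 2.2, for items 2, 3 and 7 above: counting items per event-time window with a watermark, which is what allows the streaming aggregation to run in the Append output mode. The socket source, host/port and the window/watermark durations are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

object WindowedCounts {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("windowed-counts").getOrCreate()
    import spark.implicits._

    // Assumed input: a socket source that also stamps each line with a timestamp.
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .option("includeTimestamp", true)
      .load() // columns: value, timestamp

    // Count items per 1-minute window; the watermark lets Spark finalize
    // (and, in Append mode, emit) windows once they are 10 minutes late.
    val counts = lines
      .withWatermark("timestamp", "10 minutes")
      .groupBy(window($"timestamp", "1 minute"))
      .count()

    counts.writeStream
      .outputMode("append") // allowed only because of the watermark above
      .format("console")
      .option("truncate", false)
      .start()
      .awaitTermination()
  }
}
```

For items 4 and 5, a basic sketch of the flatMapGroupsWithState operator keeping a per-key running count; the Event and RunningCount types and the socket input format are again assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

// Hypothetical event and state types, for illustration only.
case class Event(user: String, action: String)
case class RunningCount(count: Long)

object RunningCounts {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("running-counts").getOrCreate()
    import spark.implicits._

    // Assumed input: "user,action" lines from a socket source.
    val events = spark.readStream
      .format("socket").option("host", "localhost").option("port", 9999)
      .load()
      .as[String]
      .map { line => val Array(u, a) = line.split(","); Event(u, a) }

    // Keep a running count per user across micro-batches.
    val counts = events
      .groupByKey(_.user)
      .flatMapGroupsWithState(OutputMode.Update, GroupStateTimeout.NoTimeout) {
        (user: String, batch: Iterator[Event], state: GroupState[RunningCount]) =>
          val total = state.getOption.map(_.count).getOrElse(0L) + batch.size
          state.update(RunningCount(total))
          Iterator((user, total))
      }

    counts.writeStream
      .outputMode("update") // must match the operator's OutputMode
      .format("console")
      .start()
      .awaitTermination()
  }
}
```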

Aug 22nd

  1. Using flatMapGroupsWithState operator
  2. Developing custom StreamSinkProvider (with particular focus on OutputMode)
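
A minimal sketch for item 2, assuming the Spark 2.2 internal sink APIs; the demo package and the PrintlnSinkProvider name are made up. The sink only prints each micro-batch together with the OutputMode it was created with.

```scala
package demo

import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.execution.streaming.Sink
import org.apache.spark.sql.sources.StreamSinkProvider
import org.apache.spark.sql.streaming.OutputMode

class PrintlnSinkProvider extends StreamSinkProvider {
  override def createSink(
      sqlContext: SQLContext,
      parameters: Map[String, String],
      partitionColumns: Seq[String],
      outputMode: OutputMode): Sink = new Sink {
    // Called once per micro-batch with the batch id and its rows.
    override def addBatch(batchId: Long, data: DataFrame): Unit = {
      data.collect().foreach(row => println(s"[$outputMode][batch $batchId] $row"))
    }
  }
}

// Usage: df.writeStream.format("demo.PrintlnSinkProvider").outputMode("append").start()
```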

Apr 26th

  1. Creating custom Encoder
  2. Extend Dataset API to support GROUPING SETS (similarly to cube and rollup)
    • it is currently supported only in SQL mode
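
For item 1 above, a minimal sketch assuming a plain class with no built-in encoder; Encoders.kryo supplies one explicitly, which is the simplest entry point before attempting a hand-written ExpressionEncoder.

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

// A plain (non-case) class with no implicit encoder available out of the box.
class Temperature(val celsius: Double) extends Serializable

object CustomEncoderSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("encoders").master("local[*]").getOrCreate()
    import spark.implicits._

    // Explicitly provide an encoder (Kryo-serialized binary representation).
    implicit val temperatureEncoder: Encoder[Temperature] = Encoders.kryo[Temperature]

    val ds = spark.createDataset(Seq(new Temperature(21.5), new Temperature(36.6)))
    ds.map(_.celsius).show()
  }
}
```

As for item 2, GROUPING SETS can in the meantime be used through SQL, e.g. (with a hypothetical sales table) spark.sql("SELECT city, car, sum(amount) FROM sales GROUP BY city, car GROUPING SETS ((city), (car))").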

Spark SQL

  1. Creating custom Encoder
  2. Custom format, i.e. spark.read.format(...) or spark.write.format(...) (see the sketch after this list)
  3. Multiline JSON reader / writer
  4. SQLQueryTestSuite - a fairly new addition in Spark 2.0 for writing Spark SQL tests
  5. http://stackoverflow.com/questions/39073602/i-am-running-gbt-in-spark-ml-for-ctr-prediction-i-am-getting-exception-because
  6. ExecutionListenerManager
  7. (done) Developing a custom RuleExecutor and enabling it in Spark
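
For item 2, a minimal read-side sketch assuming the Spark 2.x data source (v1) API; the demo package is made up, and DefaultSource is the class name Spark looks for when resolving spark.read.format("demo").

```scala
package demo

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// A toy relation that always returns the numbers 0..9 in a single "id" column.
class DefaultSource extends RelationProvider {
  override def createRelation(
      ctx: SQLContext,
      parameters: Map[String, String]): BaseRelation =
    new BaseRelation with TableScan {
      override def sqlContext: SQLContext = ctx
      override def schema: StructType = StructType(StructField("id", LongType) :: Nil)
      override def buildScan(): RDD[Row] =
        ctx.sparkContext.parallelize(0L until 10L).map(Row(_))
    }
}

// Usage: spark.read.format("demo").load().show()
```

For item 3, note that as of Spark 2.2 the built-in JSON reader already accepts a multiLine option, e.g. spark.read.option("multiLine", true).json(path).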

Structured Streaming

  1. Developing a custom StreamSourceProvider (see the sketch after this list)
  2. Migrating TextSocketStream to SparkSession (currently uses SQLContext)
  3. Developing Sink and Source for Apache Kafka
  4. JDBC support (with PostgreSQL as the database)
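
For item 1, a minimal sketch assuming the Spark 2.2 internal source APIs (StreamSourceProvider, Source, LongOffset); the counter source and all names are made up, and the offset handling is deliberately simplistic.

```scala
package demo

import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.execution.streaming.{LongOffset, Offset, Source}
import org.apache.spark.sql.sources.StreamSourceProvider
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// A toy streaming source: every trigger makes one more long value available.
class CounterSourceProvider extends StreamSourceProvider {
  private val counterSchema = StructType(StructField("value", LongType) :: Nil)

  override def sourceSchema(
      sqlContext: SQLContext,
      schema: Option[StructType],
      providerName: String,
      parameters: Map[String, String]): (String, StructType) =
    ("counter", counterSchema)

  override def createSource(
      sqlContext: SQLContext,
      metadataPath: String,
      schema: Option[StructType],
      providerName: String,
      parameters: Map[String, String]): Source = new Source {
    private var latest = 0L

    override def schema: StructType = counterSchema

    // Advance and report the latest available offset (called once per trigger).
    override def getOffset: Option[Offset] = {
      latest += 1
      Some(LongOffset(latest))
    }

    // Produce the rows between the last processed offset and `end`.
    override def getBatch(start: Option[Offset], end: Offset): DataFrame = {
      val from = start.map(_.json.toLong).getOrElse(0L)
      val upper = end.json.toLong
      sqlContext.createDataFrame(
        sqlContext.sparkContext.parallelize((from + 1) to upper).map(Row(_)),
        counterSchema)
    }

    override def stop(): Unit = ()
  }
}

// Usage: spark.readStream.format("demo.CounterSourceProvider").load()
```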

Spark MLlib

  1. Creating custom Transformer (see the sketch after this list)
  • Example: Tokenizer
  • Jonatan + Kuba + the ladies (Justyna + Magda)
  • The problem is saving a Pipeline that contains this Transformer, then reading it back and using it.
  2. Spark MLlib 2.0 Activator
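
For item 1, a minimal Tokenizer-like sketch built on UnaryTransformer (a DeveloperApi); the class name is made up. Making it survive Pipeline save/load, the problem noted above, additionally needs the MLWritable/MLReadable machinery and is left out here.

```scala
import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types.{ArrayType, DataType, StringType}

// A Tokenizer-like transformer that lowercases text and splits it on whitespace.
class SimpleTokenizer(override val uid: String)
    extends UnaryTransformer[String, Seq[String], SimpleTokenizer] {

  def this() = this(Identifiable.randomUID("simpleTok"))

  override protected def createTransformFunc: String => Seq[String] =
    _.toLowerCase.split("\\s+").toSeq

  override protected def outputDataType: DataType = ArrayType(StringType)

  override def copy(extra: ParamMap): SimpleTokenizer = defaultCopy(extra)
}

// Usage: new SimpleTokenizer().setInputCol("text").setOutputCol("words").transform(df)
```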

Core

  1. Monitoring executors (metrics, e.g. memory usage) using SparkListener.onExecutorMetricsUpdate.
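
A small sketch for the item above, assuming Spark 2.x; onExecutorMetricsUpdate receives per-executor heartbeats carrying accumulator-based task metrics, and the listener class name is made up.

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorMetricsUpdate}

// Logs every executor heartbeat that carries task metric (accumulator) updates.
class ExecutorMetricsListener extends SparkListener {
  override def onExecutorMetricsUpdate(
      update: SparkListenerExecutorMetricsUpdate): Unit = {
    println(s"Executor ${update.execId} reported ${update.accumUpdates.size} task metric updates")
  }
}

// Register on an existing SparkContext:
//   sc.addSparkListener(new ExecutorMetricsListener)
// or via configuration (requires a no-arg constructor):
//   --conf spark.extraListeners=ExecutorMetricsListener
```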

Misc / Next Meetup / Backlog

  1. Develop a new Scala-only TCP-based Apache Kafka client
  2. Working on Issues reported in TensorFrames.
  3. Review open issues in Spark's JIRA and pick one to work on.