@jaceklaskowski
Last active October 10, 2017 20:28
Sparkathon in Warsaw - Development Activities


Spark Structured Streaming

  1. Developing custom Source to handle formats like XML
  2. Explore what's stored in checkpointLocation
  3. How is streaming groupBy different with Append and Complete output modes?
  4. Answering the question on StackOverflow — How to count items per time window?
  5. Answering the question on StackOverflow — How to save streaming aggregation in Complete output mode to parquet?
  6. Understanding OutputMode
  7. Using flatMapGroupsWithState operator to mimic the output modes: Complete, Append and Update
  8. Multiple flatMapGroupsWithState in a streaming query
  9. Multiple groupBy or groupByKey aggregations in a streaming structured query
  10. Streaming aggregation with Append output mode requires watermark
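
The output-mode items above (3, 6 and 7) share one underlying model. A pure-Scala toy of a running-count aggregation (no Spark involved; `runBatches` and the data are hypothetical) makes the Complete/Update difference concrete; Append would additionally hold each row back until a watermark finalizes it, which is why item 10 holds:

```scala
// Toy model of output modes for a streaming running count (no Spark needed).
// Per micro-batch:
//   Complete - emit the whole result table
//   Update   - emit only the rows that changed in this batch
//   (Append  - emit a row only once, after a watermark finalizes it)
def runBatches(batches: Seq[Seq[String]]): (Seq[Map[String, Int]], Seq[Map[String, Int]]) = {
  var state = Map.empty[String, Int]
  val complete = collection.mutable.Buffer.empty[Map[String, Int]]
  val update = collection.mutable.Buffer.empty[Map[String, Int]]
  for (batch <- batches) {
    val delta = batch.groupBy(identity).map { case (k, ks) => k -> ks.size }
    state = delta.foldLeft(state) { case (s, (k, n)) => s.updated(k, s.getOrElse(k, 0) + n) }
    complete += state                                  // Complete: full table every batch
    update += delta.keys.map(k => k -> state(k)).toMap // Update: only the touched keys
  }
  (complete.toSeq, update.toSeq)
}

val (complete, update) = runBatches(Seq(Seq("a", "a", "b"), Seq("a")))
// complete: Seq(Map(a -> 2, b -> 1), Map(a -> 3, b -> 1))
// update:   Seq(Map(a -> 2, b -> 1), Map(a -> 3))
```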

Spark SQL

  1. Multiple groupBy or groupByKey aggregations in a batch structured query
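
As a contrast with the streaming restriction, the batch case can be modeled in plain Scala (`maxCityCount` and the column names are hypothetical), imitating something like `sales.groupBy("country", "city").count().groupBy("country").max("count")` — legal in a batch query but currently rejected in a streaming one:

```scala
// Two chained aggregation levels over (country, city) rows, pure Scala:
def maxCityCount(sales: Seq[(String, String)]): Map[String, Int] = {
  // level 1: count rows per (country, city)
  val perCity = sales.groupBy(identity).toSeq.map { case ((country, _), rows) => (country, rows.size) }
  // level 2: aggregate the aggregates per country
  perCity.groupBy(_._1).map { case (country, g) => country -> g.map(_._2).max }
}
```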

Misc

  1. https://github.com/typelevel/frameless

Tuesday, September 12, 2017

Sparkathon - Developing Spark Structured Streaming Apps in Scala

  1. Developing custom Sink (using StreamSinkProvider)
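
A minimal shape of such a Sink, as a sketch only (assumes Spark 2.x on the classpath; the class names and the console-count behaviour are made up for illustration):

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.execution.streaming.Sink
import org.apache.spark.sql.sources.StreamSinkProvider
import org.apache.spark.sql.streaming.OutputMode

class ConsoleCountSink extends Sink {
  // Called once per micro-batch with the rows to persist
  override def addBatch(batchId: Long, data: DataFrame): Unit =
    println(s"batch $batchId: ${data.count()} rows")
}

class ConsoleCountSinkProvider extends StreamSinkProvider {
  override def createSink(
      sqlContext: SQLContext,
      parameters: Map[String, String],
      partitionColumns: Seq[String],
      outputMode: OutputMode): Sink = new ConsoleCountSink
}
// Used as: df.writeStream.format("fully.qualified.ConsoleCountSinkProvider").start()
```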

Aug 22nd

  1. Using flatMapGroupsWithState operator
  2. Developing custom StreamSinkProvider (with particular focus on OutputMode)
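
The heart of a `flatMapGroupsWithState` query is its state-update function; keeping it a pure function makes the logic testable without Spark. A sketch (all names hypothetical) — in a real query it would be adapted to `(key, values, state: GroupState[Int]) => Iterator[(String, Int)]` and passed to `ds.groupByKey(...).flatMapGroupsWithState(...)`:

```scala
// Running count per key; returns the new state and the rows to emit.
// Emitting only the changed row mimics Update output mode.
def updateRunningCount(
    key: String,
    values: Iterator[Int],
    state: Option[Int]): (Option[Int], List[(String, Int)]) = {
  val newCount = state.getOrElse(0) + values.size
  (Some(newCount), List(key -> newCount))
}
```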

Apr 26th

  1. Creating custom Encoder
  2. Extend Dataset API to support GROUPING SETS (similarly to cube and rollup)
    • currently supported only in SQL mode
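
`GROUPING SETS ((country), (country, city))` amounts to a UNION ALL of two separate GROUP BYs; a plain-Scala model of the counts (`countBy` and the data are hypothetical) shows the semantics any Dataset-API extension would have to reproduce:

```scala
// Count rows per key for one grouping set:
def countBy[K](rows: Seq[(String, String)], key: ((String, String)) => K): Map[K, Int] =
  rows.groupBy(key).map { case (k, group) => k -> group.size }

val rows = Seq(("PL", "Warsaw"), ("PL", "Krakow"), ("PL", "Warsaw"), ("DE", "Berlin"))
val byCountry     = countBy(rows, _._1)     // grouping set (country)
val byCountryCity = countBy(rows, identity) // grouping set (country, city)
```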

Spark SQL

  1. Creating custom Encoder
  2. Custom format, i.e. spark.read.format(...) or spark.write.format(...)
  3. Multiline JSON reader / writer
  4. SQLQueryTestSuite - a new facility in Spark 2.0 for writing tests for Spark SQL
  5. http://stackoverflow.com/questions/39073602/i-am-running-gbt-in-spark-ml-for-ctr-prediction-i-am-getting-exception-because
  6. ExecutionListenerManager
  7. (done) Developing a custom RuleExecutor and enabling it in Spark
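
For the custom Encoder item, one generic route is a binary encoder, sketched here (assumes Spark 2.x on the classpath; `Point` is a hypothetical class):

```scala
import org.apache.spark.sql.{Encoder, Encoders}

class Point(val x: Double, val y: Double)  // not a case class, so no implicit encoder

// Generic fallback: serialize the object with Kryo into a single binary column;
// Encoders.javaSerialization[Point] is the slower alternative. With this implicit
// in scope, spark.createDataset(Seq(new Point(1, 2))) compiles.
implicit val pointEncoder: Encoder[Point] = Encoders.kryo[Point]
```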

Structured Streaming

  1. Developing a custom StreamSourceProvider
  2. Migrating TextSocketStream to SparkSession (currently uses SQLContext)
  3. Developing Sink and Source for Apache Kafka
  4. JDBC support (with PostgreSQL as the database)
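
A custom Source starts from a `StreamSourceProvider` skeleton like the one below, a sketch only (Spark 2.x APIs; class and column names are illustrative):

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.execution.streaming.Source
import org.apache.spark.sql.sources.StreamSourceProvider
import org.apache.spark.sql.types.{StringType, StructField, StructType}

class DemoSourceProvider extends StreamSourceProvider {
  private val demoSchema = StructType(StructField("value", StringType) :: Nil)

  // Called before the query starts, to resolve the schema
  override def sourceSchema(
      sqlContext: SQLContext,
      schema: Option[StructType],
      providerName: String,
      parameters: Map[String, String]): (String, StructType) =
    (providerName, schema.getOrElse(demoSchema))

  // Called once per query; the Source answers getOffset/getBatch per micro-batch
  override def createSource(
      sqlContext: SQLContext,
      metadataPath: String,
      schema: Option[StructType],
      providerName: String,
      parameters: Map[String, String]): Source = ??? // your Source here
}
```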

Spark MLlib

  1. Creating a custom Transformer
    • Example: Tokenizer
    • Jonatan + Kuba + the ladies (Justyna + Magda)
    • The open problem: saving a Pipeline containing such a Transformer, then reading it back and using it.
  2. Spark MLlib 2.0 Activator
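
The logic of a Tokenizer-like custom Transformer can live in a pure function (name hypothetical). Wiring it into spark.ml would mean extending `UnaryTransformer[String, Seq[String], MyTokenizer]`, returning this function from `createTransformFunc`, and adding a `DefaultParamsReadable` companion so that a Pipeline containing the Transformer survives save and load — the problem noted above:

```scala
// Lowercase, split on whitespace, drop empty tokens:
def tokenize(text: String): Seq[String] =
  text.toLowerCase.split("\\s+").filter(_.nonEmpty).toSeq
```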

Core

  1. Monitoring executors (metrics, e.g. memory usage) using SparkListener.onExecutorMetricsUpdate.
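
A sketch of such a listener (Spark 2.x; the class name and logging are illustrative):

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorMetricsUpdate}

class ExecutorMetricsLogger extends SparkListener {
  override def onExecutorMetricsUpdate(
      update: SparkListenerExecutorMetricsUpdate): Unit =
    // accumUpdates carries per-task accumulator values, including internal metrics
    println(s"executor ${update.execId}: ${update.accumUpdates.size} task update(s)")
}
// Register with sc.addSparkListener(new ExecutorMetricsLogger)
// or via the spark.extraListeners configuration property.
```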

Misc / Next Meetup / Backlog

  1. Develop a new Scala-only TCP-based Apache Kafka client
  2. Working on Issues reported in TensorFrames.
  3. Review open issues in Spark's JIRA and pick one to work on.