@jaceklaskowski
Last active October 10, 2017 20:28
Sparkathon in Warsaw - Development Activities


Spark Structured Streaming

  1. Developing custom Source to handle formats like XML
  2. Explore what's stored in checkpointLocation
  3. How is streaming groupBy different with Append and Complete output modes?
  4. Answering the question on StackOverflow — How to count items per time window?
  5. Answering the question on StackOverflow — How to save streaming aggregation in Complete output mode to parquet?
  6. Understanding OutputMode
  7. Using flatMapGroupsWithState operator to mimic the output modes: Complete, Append and Update
  8. Multiple flatMapGroupsWithState in a streaming query
  9. Multiple groupBy or groupByKey aggregations in a streaming structured query
  10. Streaming aggregation with Append output mode requires watermark
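
The output-mode items above (3, 6 and 7) share one underlying model. A pure-Scala toy of a running-count aggregation (no Spark involved; `runBatches` and the data are hypothetical) makes the Complete/Update difference concrete; Append would additionally hold each row back until a watermark finalizes it, which is why item 10 holds:

```scala
// Toy model of output modes for a streaming running count (no Spark needed).
// Per micro-batch:
//   Complete - emit the whole result table
//   Update   - emit only the rows that changed in this batch
//   (Append  - emit a row only once, after a watermark finalizes it)
def runBatches(batches: Seq[Seq[String]]): (Seq[Map[String, Int]], Seq[Map[String, Int]]) = {
  var state = Map.empty[String, Int]
  val complete = collection.mutable.Buffer.empty[Map[String, Int]]
  val update = collection.mutable.Buffer.empty[Map[String, Int]]
  for (batch <- batches) {
    val delta = batch.groupBy(identity).map { case (k, ks) => k -> ks.size }
    state = delta.foldLeft(state) { case (s, (k, n)) => s.updated(k, s.getOrElse(k, 0) + n) }
    complete += state                                  // Complete: full table every batch
    update += delta.keys.map(k => k -> state(k)).toMap // Update: only the touched keys
  }
  (complete.toSeq, update.toSeq)
}

val (complete, update) = runBatches(Seq(Seq("a", "a", "b"), Seq("a")))
// complete: Seq(Map(a -> 2, b -> 1), Map(a -> 3, b -> 1))
// update:   Seq(Map(a -> 2, b -> 1), Map(a -> 3))
```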

Spark SQL

  1. Multiple groupBy or groupByKey aggregations in a batch structured query
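
As a contrast with the streaming restriction, the batch case can be modeled in plain Scala (`maxCityCount` and the column names are hypothetical), imitating something like `sales.groupBy("country", "city").count().groupBy("country").max("count")` — legal in a batch query but currently rejected in a streaming one:

```scala
// Two chained aggregation levels over (country, city) rows, pure Scala:
def maxCityCount(sales: Seq[(String, String)]): Map[String, Int] = {
  // level 1: count rows per (country, city)
  val perCity = sales.groupBy(identity).toSeq.map { case ((country, _), rows) => (country, rows.size) }
  // level 2: aggregate the aggregates per country
  perCity.groupBy(_._1).map { case (country, g) => country -> g.map(_._2).max }
}
```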

Misc

  1. https://github.com/typelevel/frameless

Tuesday, September 12, 2017

Sparkathon - Developing Spark Structured Streaming Apps in Scala

  1. Developing custom Sink (using StreamSinkProvider)
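
A minimal shape of such a Sink, as a sketch only (assumes Spark 2.x on the classpath; the class names and the console-count behaviour are made up for illustration):

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.execution.streaming.Sink
import org.apache.spark.sql.sources.StreamSinkProvider
import org.apache.spark.sql.streaming.OutputMode

class ConsoleCountSink extends Sink {
  // Called once per micro-batch with the rows to persist
  override def addBatch(batchId: Long, data: DataFrame): Unit =
    println(s"batch $batchId: ${data.count()} rows")
}

class ConsoleCountSinkProvider extends StreamSinkProvider {
  override def createSink(
      sqlContext: SQLContext,
      parameters: Map[String, String],
      partitionColumns: Seq[String],
      outputMode: OutputMode): Sink = new ConsoleCountSink
}
// Used as: df.writeStream.format("fully.qualified.ConsoleCountSinkProvider").start()
```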

Aug 22nd

  1. Using flatMapGroupsWithState operator
  2. Developing custom StreamSinkProvider (with particular focus on OutputMode)
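
The heart of a `flatMapGroupsWithState` query is its state-update function; keeping it a pure function makes the logic testable without Spark. A sketch (all names hypothetical) — in a real query it would be adapted to `(key, values, state: GroupState[Int]) => Iterator[(String, Int)]` and passed to `ds.groupByKey(...).flatMapGroupsWithState(...)`:

```scala
// Running count per key; returns the new state and the rows to emit.
// Emitting only the changed row mimics Update output mode.
def updateRunningCount(
    key: String,
    values: Iterator[Int],
    state: Option[Int]): (Option[Int], List[(String, Int)]) = {
  val newCount = state.getOrElse(0) + values.size
  (Some(newCount), List(key -> newCount))
}
```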

Apr 26th

  1. Creating custom Encoder
  2. Extend Dataset API to support GROUPING SETS (similarly to cube and rollup)
    • currently supported only in SQL mode
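
`GROUPING SETS ((country), (country, city))` amounts to a UNION ALL of two separate GROUP BYs; a plain-Scala model of the counts (`countBy` and the data are hypothetical) shows the semantics any Dataset-API extension would have to reproduce:

```scala
// Count rows per key for one grouping set:
def countBy[K](rows: Seq[(String, String)], key: ((String, String)) => K): Map[K, Int] =
  rows.groupBy(key).map { case (k, group) => k -> group.size }

val rows = Seq(("PL", "Warsaw"), ("PL", "Krakow"), ("PL", "Warsaw"), ("DE", "Berlin"))
val byCountry     = countBy(rows, _._1)     // grouping set (country)
val byCountryCity = countBy(rows, identity) // grouping set (country, city)
```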

Spark SQL

  1. Creating custom Encoder
  2. Custom format, i.e. spark.read.format(...) or spark.write.format(...)
  3. Multiline JSON reader / writer
  4. SQLQueryTestSuite - a new facility in Spark 2.0 for writing tests for Spark SQL
  5. http://stackoverflow.com/questions/39073602/i-am-running-gbt-in-spark-ml-for-ctr-prediction-i-am-getting-exception-because
  6. ExecutionListenerManager
  7. (done) Developing a custom RuleExecutor and enabling it in Spark
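
For the custom Encoder item, one generic route is a binary encoder, sketched here (assumes Spark 2.x on the classpath; `Point` is a hypothetical class):

```scala
import org.apache.spark.sql.{Encoder, Encoders}

class Point(val x: Double, val y: Double)  // not a case class, so no implicit encoder

// Generic fallback: serialize the object with Kryo into a single binary column;
// Encoders.javaSerialization[Point] is the slower alternative. With this implicit
// in scope, spark.createDataset(Seq(new Point(1, 2))) compiles.
implicit val pointEncoder: Encoder[Point] = Encoders.kryo[Point]
```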

Structured Streaming

  1. Developing a custom StreamSourceProvider
  2. Migrating TextSocketStream to SparkSession (currently uses SQLContext)
  3. Developing Sink and Source for Apache Kafka
  4. JDBC support (with PostgreSQL as the database)
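
A custom Source starts from a `StreamSourceProvider` skeleton like the one below, a sketch only (Spark 2.x APIs; class and column names are illustrative):

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.execution.streaming.Source
import org.apache.spark.sql.sources.StreamSourceProvider
import org.apache.spark.sql.types.{StringType, StructField, StructType}

class DemoSourceProvider extends StreamSourceProvider {
  private val demoSchema = StructType(StructField("value", StringType) :: Nil)

  // Called before the query starts, to resolve the schema
  override def sourceSchema(
      sqlContext: SQLContext,
      schema: Option[StructType],
      providerName: String,
      parameters: Map[String, String]): (String, StructType) =
    (providerName, schema.getOrElse(demoSchema))

  // Called once per query; the Source answers getOffset/getBatch per micro-batch
  override def createSource(
      sqlContext: SQLContext,
      metadataPath: String,
      schema: Option[StructType],
      providerName: String,
      parameters: Map[String, String]): Source = ??? // your Source here
}
```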

Spark MLlib

  1. Creating a custom Transformer
    • Example: Tokenizer
    • Jonatan + Kuba + the ladies (Justyna + Magda)
    • The open problem: saving a Pipeline containing such a Transformer, then reading it back and using it.
  2. Spark MLlib 2.0 Activator
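
The logic of a Tokenizer-like custom Transformer can live in a pure function (name hypothetical). Wiring it into spark.ml would mean extending `UnaryTransformer[String, Seq[String], MyTokenizer]`, returning this function from `createTransformFunc`, and adding a `DefaultParamsReadable` companion so that a Pipeline containing the Transformer survives save and load — the problem noted above:

```scala
// Lowercase, split on whitespace, drop empty tokens:
def tokenize(text: String): Seq[String] =
  text.toLowerCase.split("\\s+").filter(_.nonEmpty).toSeq
```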

Core

  1. Monitoring executors (metrics, e.g. memory usage) using SparkListener.onExecutorMetricsUpdate.
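
A sketch of such a listener (Spark 2.x; the class name and logging are illustrative):

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorMetricsUpdate}

class ExecutorMetricsLogger extends SparkListener {
  override def onExecutorMetricsUpdate(
      update: SparkListenerExecutorMetricsUpdate): Unit =
    // accumUpdates carries per-task accumulator values, including internal metrics
    println(s"executor ${update.execId}: ${update.accumUpdates.size} task update(s)")
}
// Register with sc.addSparkListener(new ExecutorMetricsLogger)
// or via the spark.extraListeners configuration property.
```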

Misc / Next Meetup / Backlog

  1. Develop a new Scala-only TCP-based Apache Kafka client
  2. Working on Issues reported in TensorFrames.
  3. Review open issues in Spark's JIRA and pick one to work on.