Sparkathon - Developing Spark Structured Streaming Apps in Scala
- Multiple `groupBy` or `groupByKey` aggregations in a batch structured query
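As a warm-up for this exercise, here is a minimal sketch (column names `city` and `amount` and the sample data are made up) showing that a batch query happily accepts an aggregation chained on top of another aggregation, which is exactly what the streaming variant of this exercise forbids:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("multi-agg").master("local[*]").getOrCreate()
import spark.implicits._

val sales = Seq(("Warsaw", 10), ("Warsaw", 20), ("Krakow", 5)).toDF("city", "amount")

// First aggregation: total per city
val perCity = sales.groupBy($"city").agg(sum($"amount") as "total")

// Second aggregation over the first one's result: fine in batch,
// unsupported in a streaming structured query (see the streaming exercise below)
val maxTotal = perCity.groupBy().agg(max($"total") as "maxTotal")

perCity.show()
maxTotal.show()
```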
- Developing a custom `Sink` (using `StreamSinkProvider`)
  - Answering the StackOverflow question "How to count items per time window?"
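A possible starting point for this exercise, sketched against the Spark 2.x internal streaming API (`Sink` lives in `org.apache.spark.sql.execution.streaming`; the sink name `counting` and its behavior are made up for illustration):

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.execution.streaming.Sink
import org.apache.spark.sql.sources.{DataSourceRegister, StreamSinkProvider}
import org.apache.spark.sql.streaming.OutputMode

// A toy Sink that merely counts the rows of every micro-batch
class CountingSink extends Sink {
  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    // Called once per micro-batch with the batch's rows as a DataFrame
    println(s"Batch $batchId has ${data.count()} rows")
  }
}

class CountingSinkProvider extends StreamSinkProvider with DataSourceRegister {
  // shortName requires a META-INF/services/org.apache.spark.sql.sources.DataSourceRegister entry;
  // without it, use the fully-qualified class name in format(...)
  override def shortName(): String = "counting"

  override def createSink(
      sqlContext: SQLContext,
      parameters: Map[String, String],
      partitionColumns: Seq[String],
      outputMode: OutputMode): Sink = new CountingSink
}

// Usage sketch: streamingDF.writeStream.format("counting").start()
```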
- Understanding `OutputMode`
- `flatMapGroupsWithState` operator
- Using the `flatMapGroupsWithState` operator to mimic the output modes: `Complete`, `Append` and `Update`
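One way the `Update`-like behavior could be mimicked, sketched with the Spark 2.2+ API (assumes a `Dataset[(String, Long)]` named `events` and `import spark.implicits._` in scope; the names are made up):

```scala
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

// Keep a running count per key in GroupState and emit the new count
// on every trigger, i.e. Update-mode-like output
val counts = events
  .groupByKey(_._1)
  .flatMapGroupsWithState(OutputMode.Update(), GroupStateTimeout.NoTimeout) {
    (key: String, values: Iterator[(String, Long)], state: GroupState[Long]) =>
      val newCount = state.getOption.getOrElse(0L) + values.size
      state.update(newCount)
      Iterator((key, newCount)) // only the keys updated in this trigger are emitted
  }
```

Mimicking `Complete` would instead require emitting the state of every key seen so far, and `Append` would emit a key only once its state is finalized (e.g. on timeout).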
- Multiple `flatMapGroupsWithState` operators in a streaming query
- Multiple `groupBy` or `groupByKey` aggregations in a streaming structured query
- Streaming aggregation with `Append` output mode requires a watermark
- Using the `flatMapGroupsWithState` operator
- Developing a custom `StreamSinkProvider` (with particular focus on `OutputMode`)
- Extend the Dataset API to support `GROUPING SETS` (similarly to `cube` and `rollup`); it is currently supported only in SQL mode
- Creating a custom `Encoder`
  - SPARK-17668 Support representing structs with case classes and tuples in Spark SQL UDF inputs
  - Create an encoder between your custom domain object of type `T` and JSON or CSV
    - See Encoders for available encoders
    - Read Encoders - Internal Row Converters
  - (advanced/integration) Create an encoder for Apache Arrow (esp. after the arrow-0.1.0 RC0 release candidate has recently been announced) and ARROW-288 Implement Arrow adapter for Spark Datasets
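Before writing a fully custom encoder, it may help to see the two existing entry points side by side; a small sketch (the `Person` case class is made up):

```scala
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

case class Person(id: Long, name: String)

// What `import spark.implicits._` derives for you behind the scenes
val derived: Encoder[Person] = Encoders.product[Person]

// The internal implementation: exposes the serializer/deserializer
// expressions a hand-written encoder would have to provide
val internal = ExpressionEncoder[Person]()

println(derived.schema.simpleString)
```

A custom `T`-to-JSON or `T`-to-CSV encoder, as the exercise suggests, would supply its own serializer/deserializer expressions instead of the case-class-derived ones.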
- Custom format, i.e. `spark.read.format(...)` or `spark.write.format(...)`
  - Multiline JSON reader / writer
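A minimal read-side sketch of such a format against the Spark 2.x Data Source (V1) API; the format name `demo` and the single-column schema are made up:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, DataSourceRegister, RelationProvider, TableScan}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// A relation that produces three rows with a single `id` column
class DemoRelation(override val sqlContext: SQLContext) extends BaseRelation with TableScan {
  override def schema: StructType = StructType(StructField("id", LongType) :: Nil)
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(0L until 3L).map(Row(_))
}

// The entry point Spark looks up for spark.read.format(...)
class DefaultSource extends RelationProvider with DataSourceRegister {
  override def shortName(): String = "demo" // needs META-INF/services registration
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation = new DemoRelation(sqlContext)
}

// Usage sketch: spark.read.format("demo").load().show()
```

Write support would additionally implement `CreatableRelationProvider`.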
- `SQLQueryTestSuite`, a fairly new addition in Spark 2.0 for writing tests for Spark SQL
- http://stackoverflow.com/questions/39073602/i-am-running-gbt-in-spark-ml-for-ctr-prediction-i-am-getting-exception-because
- `ExecutionListenerManager`
- (done) Developing a custom `RuleExecutor` and enabling it in Spark
  - Answering "Extending Spark Catalyst optimizer with own rules" on StackOverflow
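The shape of the answer can be sketched as follows: a `Rule[LogicalPlan]` plugged into the `experimental.extraOptimizations` hook (available since Spark 2.0). The rule below, which drops filters whose condition is the literal `true`, is just an illustrative example, not the rule from the actual answer:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.types.BooleanType

// A custom optimization: Filter(true, child) is a no-op, so replace it with child
object RemoveTrueFilters extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case Filter(Literal(true, BooleanType), child) => child
  }
}

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Register the rule; it then runs as part of the optimizer on every query
spark.experimental.extraOptimizations = Seq(RemoveTrueFilters)
```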
- Sparkathon - Developing Spark Extensions in Scala on Sep 28th
- Developing a custom `StreamSourceProvider`
- Migrating `TextSocketStream` to `SparkSession` (it currently uses `SQLContext`)
- Developing a `Sink` and `Source` for Apache Kafka
- JDBC support (with PostgreSQL as the database)
- Creating a custom `Transformer`
  - Example: `Tokenizer`
  - Jonatan + Kuba + the ladies (Justyna + Magda)
  - The challenge: saving a `Pipeline` with this `Transformer`, then loading and using it.
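A sketch of a `Tokenizer`-like transformer (the class name and tokenization logic are made up). The save/load pain point noted above is typically addressed by mixing in `DefaultParamsWritable` and giving the class a `DefaultParamsReadable` companion; note that these traits only became public API in later Spark 2.x releases:

```scala
import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable}
import org.apache.spark.sql.types.{ArrayType, DataType, StringType}

class SimpleTokenizer(override val uid: String)
    extends UnaryTransformer[String, Seq[String], SimpleTokenizer]
    with DefaultParamsWritable {

  def this() = this(Identifiable.randomUID("simpleTok"))

  // Lowercase, then split on whitespace
  override protected def createTransformFunc: String => Seq[String] =
    _.toLowerCase.split("\\s+").toSeq

  override protected def outputDataType: DataType = ArrayType(StringType)
}

// The companion object is what Pipeline.load uses to reinstantiate the stage
object SimpleTokenizer extends DefaultParamsReadable[SimpleTokenizer]
```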
- Spark MLlib 2.0 Activator
- Monitoring executors (metrics, e.g. memory usage) using `SparkListener.onExecutorMetricsUpdate`.
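A minimal listener sketch for that exercise (the class name and the logged message are made up; `sc` is assumed to be an active `SparkContext`):

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorMetricsUpdate}

// Logs every heartbeat-driven metrics update an executor sends to the driver
class MetricsLogger extends SparkListener {
  override def onExecutorMetricsUpdate(update: SparkListenerExecutorMetricsUpdate): Unit = {
    println(s"Executor ${update.execId} sent ${update.accumUpdates.size} task metric updates")
  }
}

// Usage sketch:
// sc.addSparkListener(new MetricsLogger)
```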
- Develop a new Scala-only TCP-based Apache Kafka client
  - A Guide To The Kafka Protocol
  - KAFKA-3360 Add a protocol page/section to the official Kafka documentation
  - See Scala Kafka Client for inspiration, though it's just "a thin Scala wrapper over the official Apache Kafka Java Driver"
- Working on Issues reported in TensorFrames.
- Review open issues in Spark's JIRA and pick one to work on.