loyvon/Introduction to Stream Data Processing.md

## Introduction to Stream Data Processing.md

      
Introduction

This gist started with a collection of resources I was maintaining on stream data processing — also known as distributed logs, data pipelines, event sourcing, CQRS, and other names.
Over time the set of resources grew quite large, and I received some interest in a more guided, opinionated path for learning about stream data processing, so I added the reading list.
Please send me feedback!

  
## reading list.md

      
Reading List

The resources doc has a lot of good stuff, but no guidance. This reading list is meant to be a more guided, opinionated path for learning about stream data processing.
Some works are accompanied by alternative options and/or responses, but those are completely
optional. If possible, try the main works first.

The Foundational Monograph

The Log: What every software engineer should know about real-time data's unifying abstraction by Jay Kreps (December 2013) kicked it all off for me. A seminal work.

Alternatives (optional)


Also published in book form by O’Reilly as I ❤️ Logs: Event Data, Stream Processing, and Data Integration (September 2014)
The Log: an epic software engineering article by Bryan Pendleton (January 2014) is a summary and analysis of Kreps’ monograph

Responses (optional)


The three eras of business data processing by Alex Dean (January 2014) builds on Kreps’ ideas to extrapolate the future of business data processing architecture, as compared to common architectures in the present and the past
Great reads of 2013: Jay Kreps on logs by Rafe Colburn (December 2013) is a quick reaction to, very-high-level summary of, and third-party validation of Kreps’ ideas

The (Conference Talks made into Articles made into a) Book that Fills in all the Gaps

Making Sense of Stream Processing by Martin Kleppmann (March 2016) is a free ebook that compiles many of Kleppmann’s brilliant articles (based on his brilliant talks) on this topic.
This is a fantastic book that covers everything from theory to practice, history to the future. It’s
all broken down into small incremental ideas and clearly explained.

Alternatives (optional)

If you’d prefer to start with videos of Kleppmann’s talks, I recommend starting with these:

Turning the database inside out with Apache Samza describes how we might reimagine what a database is and reshape the entire Web application stack with event streams at every level. I heard this in person and it blew my mind.
Staying agile in the face of data deluge illustrates how “using the right tool for the right job” can lead to incredibly complex and fragile application architectures, and how streaming data can simplify them.
Systems that enable data agility
Samza and the Unix philosophy of distributed systems
Data liberation and data integration with Kafka

For more, see Kleppmann’s playlist of all his talks on YouTube.

The Treatise (on Why and How Stream Data Processing Might be the Future of Application Development)

Introducing Kafka Streams: Stream Processing Made Simple by Jay Kreps (March 2016) explains the why of the new Kafka Streams framework and, in doing so, dives deep into what all this stuff really is, why it matters, and what it means for application development. Brilliant.

A Broader, Cogent, and Less Kafka-Centric Perspective

The world beyond batch: Streaming 101 by Tyler Akidau (August 2015) is a super-helpful alternative perspective that came out of Google rather than LinkedIn. Akidau has worked for years on data processing systems at Google, including MillWheel, Cloud Dataflow, and Apache Beam. I haven’t yet read part 102 but suspect it will be similarly illuminating.

  
## resources.md

      
Resources on Stream Data Processing

The Monograph


The Log: What every software engineer should know about real-time data's unifying abstraction by Jay Kreps (December 2013)

The Book Version of the Monograph


Kreps expanded his monograph into I ❤️ Logs: Event Data, Stream Processing, and Data Integration which was published by O’Reilly in September 2014

Articles


Great reads of 2013: Jay Kreps on logs by Rafe Colburn (December 2013) is a quick reaction to, very-high-level summary of, and third-party validation of Kreps’ ideas
The Log: an epic software engineering article by Bryan Pendleton (January 2014) is a summary and analysis of Kreps’ monograph
The three eras of business data processing by Alex Dean (January 2014) builds on Kreps’ ideas to extrapolate the future of business data processing architecture, as compared to common architectures in the present and the past
Loving a Log-Oriented Architecture by Andrew Montalenti (December 2014) is a good summary/overview of this developing architectural style by a member of a team that has adopted it in earnest.
Stream Processing, Event Sourcing, Reactive, CEP… And Making Sense Of It All by Martin Kleppmann is a great overview of the core concepts that all these seemingly different ideas have in common, and of how and when we might want to employ them.

Books


Making Sense of Stream Processing by Martin Kleppmann

March 2016
Free ebook that compiles many of Kleppmann’s brilliant articles (based on his brilliant talks) on this topic


Unified Log Processing by Alex Dean

To be published by Manning in Spring 2015
Currently available as an “early access edition”
The first chapter is available as a free PDF
Blog post about the book by the author


Designing Data-Intensive Applications by Martin Kleppmann

To be published by O’Reilly in 2015
Currently available as an “Early Release”
Blog post about the book by the author


Talks


Martin Kleppmann

Turning the database inside out with Apache Samza describes how we might reimagine what a database is and reshape the entire Web application stack with event streams at every level. There’s also a written version.
Staying agile in the face of data deluge describes how using the right tool for the right job can lead to incredibly complex and fragile application architectures, and how streaming data might help simplify everything.
Scalable real-time data processing with Apache Samza from Feb 2015 at JFokus


Alex Dean

Why your company needs a unified log from October 2014 at Span takes a step back to look at different approaches to how data flows between systems and services across an entire organization, and how using a unified log makes many things much simpler.


Blogs


The Confluent Blog