#Stream Processing Systems

Nomenclature:

  • Data Streaming Platform: The entire ecosystem of streaming data that includes the messaging system and the data distribution process.
  • Stream Processing System: The set of applications that are tasked with transforming streaming data en route between where the data were generated and where they are eventually placed for long-term storage and ad-hoc analyses.

Confluent has a nice post on what a stream data platform is for and what it is comprised of:

A stream data platform has two primary uses:

  1. Data Integration: The stream data platform captures streams of events or data changes and feeds these to other data systems such as relational databases, key-value stores, Hadoop, or the data warehouse.
  2. Stream processing: It enables continuous, real-time processing and transformation of these streams and makes the results available system-wide.

In its first role, the stream data platform is a central hub for data streams. Applications that integrate don't need to be concerned with the details of the original data source; all streams look the same. It also acts as a buffer between these systems: the publisher of data doesn't need to be concerned with the various systems that will eventually consume and load the data. This means consumers of data can come and go and are fully decoupled from the source. [Emphasis added]
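
The decoupling described above is easiest to see in a small example. Below is a minimal sketch, assuming Kafka as the messaging layer and its standard Java clients; the topic name, group id, and record contents are made up for illustration. The publisher emits events without knowing who will read them, and any number of independent consumer groups (a warehouse loader, a search indexer, ...) can subscribe without the producer changing at all.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class StreamHubSketch {

    // Publisher side: emits events without knowing who will consume them.
    static void publish() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // "page-views" is a made-up topic name for illustration.
            producer.send(new ProducerRecord<>("page-views", "user-42", "/pricing"));
        }
    }

    // Consumer side: each consumer group reads the same stream independently,
    // so consumers can come and go without touching the publisher.
    static void consume() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "warehouse-loader");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("page-views"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> r : records) {
                System.out.printf("key=%s value=%s offset=%d%n", r.key(), r.value(), r.offset());
            }
        }
    }
}
```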

Key features of a stream data platform:

  • It must be reliable enough to handle critical updates such as replicating the changelog of a database to a replica store like a search index, delivering this data in order and without loss.
  • It must support throughput high enough to handle large volume log or event data streams.
  • It must be able to buffer or persist data for long periods of time to support integration with batch systems such as Hadoop that may only perform their loads and processing periodically.
  • It must provide data with latency low enough for real-time applications.
  • It must be possible to operate it as a central system that can scale to carry the full load of the organization and operate with hundreds of applications built by disparate teams all plugged into the same central nervous system.
  • It has to support close integration with stream processing systems.

On using a stream processing system:

Where these frameworks really shine is in areas where there will be lots of complex transformations. If there will be only a small number of processes doing transformations the cost of adopting a complex framework may not pay off, and the framework may come with operational and performance costs of their own. However if there will be a large number of transformations, making these easier to write should justify the additional operational burden.

#Comparisons

Streaming Big Data: Storm, Spark and Samza

  • Terminology Comparison
  • Feature Comparison

The Real-Time Big Data Landscape

  • Different types of processing

##Storm

Streaming Big Data: Storm, Spark and Samza

  • ...Because Storm uses Apache Thrift, you can write topologies in any programming language. If you need state persistence and/or exactly-once delivery though, you should look at the higher-level Trident API, which also offers micro-batching. Source
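
As a rough illustration of what defining a Storm topology looks like in Java (a minimal sketch, not taken from the article above): TestWordSpout is Storm's built-in test spout, while WordCountBolt is a hypothetical bolt, sketched further below under the Samza vs. Storm bullets. For exactly-once or stateful processing you would reach for the Trident API instead.

```java
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.testing.TestWordSpout;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

// Minimal topology wiring: one spout feeding one bolt. TestWordSpout ships
// with Storm and emits random words on a "word" field; WordCountBolt is the
// hypothetical bolt sketched later in this document. (Older Storm releases
// used backtype.storm package names.)
public class WordCountTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("words", new TestWordSpout(), 1);
        // fieldsGrouping routes the same word to the same bolt instance so
        // each instance keeps a consistent count for its words.
        builder.setBolt("counts", new WordCountBolt(), 2)
               .fieldsGrouping("words", new Fields("word"));

        Config conf = new Config();
        conf.setDebug(true);

        // Run in-process for local testing; use StormSubmitter.submitTopology
        // to deploy on a real cluster.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("word-count", conf, builder.createTopology());
        Thread.sleep(10_000);
        cluster.shutdown();
    }
}
```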

Samza vs. Storm

  • Storm allows you to choose the level of guarantee with which you want your messages to be processed:
    • At-most-once
    • At-least-once
    • Exactly-once (via Trident)
  • It has an impressive number of adopters, a strong feature set, and seems to be under active development. It integrates well with many common messaging systems (RabbitMQ, Kestrel, Kafka, etc.).
  • Storm's approach of caching and batching state changes works well if the amount of state in each bolt is fairly small - perhaps less than 100kB. That makes it suitable for keeping track of counters, minimum, maximum and average values of a metric, and the like. However, if you need to maintain a large amount of state, this approach essentially degrades to making a database call per processed tuple, with the associated performance cost.
  • Storm is written in Java and Clojure but has good support for non-JVM languages. It follows a model similar to MapReduce Streaming: the non-JVM task is launched in a separate process, data is sent to its stdin, and output is read from its stdout.
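
A sketch of the at-least-once pattern described in the bullets above, assuming the core (non-Trident) Java API: the bolt anchors each emitted tuple to its input and acks only after the work is done, so unacked tuples are replayed from the spout. The small in-memory map also illustrates the kind of modest per-bolt state (counters and the like) that Storm's caching/batching approach handles well. Class and field names are illustrative.

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class WordCountBolt extends BaseRichBolt {
    private OutputCollector collector;
    // Small in-memory state (fine while it stays modest, e.g. under ~100 kB).
    private Map<String, Long> counts;

    @Override
    public void prepare(Map<String, Object> conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        this.counts = new HashMap<>();
    }

    @Override
    public void execute(Tuple input) {
        String word = input.getStringByField("word");
        long count = counts.merge(word, 1L, Long::sum);
        // Anchor the emitted tuple to its input so a downstream failure
        // causes the spout to replay it...
        collector.emit(input, new Values(word, count));
        // ...and ack only after the work is done: this is what gives
        // at-least-once processing.
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}
```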

##Spark Streaming

Samza vs. Spark

  • Spark Streaming does not guarantee at-least-once or at-most-once messaging semantics because in some situations it may lose data when the driver program fails (see fault-tolerance).
  • Spark Streaming is essentially a sequence of small batch processes (a minimal sketch follows this list). With a fast execution engine, it can reach latencies as low as one second (from their paper). If processing is slower than receiving, the data will be queued as DStreams in memory and the queue will keep growing. In order to run a healthy Spark Streaming application, the system should be tuned until processing is as fast as receiving.
  • Spark Streaming is written in Java and Scala and provides Scala, Java, and Python APIs.
  • Spark has an active user and developer community, and recently released version 1.0.0. Since Spark contains Spark Streaming, Spark SQL, MLlib, GraphX and Bagel, it's tough to tell what portion of companies on the list are actually using Spark Streaming, and not just Spark.
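
A minimal sketch of the micro-batch model using Spark Streaming's Java API; the socket source, local master, and one-second batch interval are arbitrary choices for illustration. Every second a new small batch (RDD) of lines arrives as part of the DStream and the same transformation is applied to it.

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class StreamingWordCount {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("streaming-word-count").setMaster("local[2]");
        // The batch interval is the "small batch" size: here, one second.
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

        JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);
        JavaDStream<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
        JavaPairDStream<String, Integer> counts =
                words.mapToPair(w -> new Tuple2<>(w, 1)).reduceByKey(Integer::sum);
        counts.print();

        jssc.start();                 // begin receiving and processing micro-batches
        jssc.awaitTermination();      // block until stopped
    }
}
```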

##Samza

Samza vs. Spark

  • Samza also allows you to define a deterministic ordering of messages between partitions using a MessageChooser. It provides an at-least-once message delivery guarantee. And it does not require operations to be deterministic.
  • In terms of data loss, there is a difference between Spark Streaming and Samza. If the input is an active streaming system such as Flume or Kafka, Spark Streaming may lose data if a failure happens after the data has been received but before it has been replicated to other nodes (also see SPARK-1647). Samza does not lose data in this situation because it checkpoints the offset of the latest processed message and always commits the checkpoint after processing the data; if a container fails, it reads from the latest checkpoint. When a Samza job recovers from a failure, it may process some data more than once: the job restarts at the last checkpoint, and any messages that had been processed between that checkpoint and the failure are processed again. The amount of reprocessed data can be minimized by setting a small checkpoint interval (see the toy sketch after this list).
  • Samza is written in Java and Scala and has a Java API.
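
The reprocessing behavior described above is easy to see in a toy sketch (plain Java, not Samza's actual API): the offset of the last processed message is committed only every checkpointIntervalMs, so after a crash processing resumes from the last committed offset and everything handled since that commit is processed a second time; a smaller interval means less reprocessing.

```java
// Toy illustration of checkpoint-based recovery (not Samza's real API).
// Messages are processed in order; the offset is committed periodically.
// After a crash, processing resumes from the last committed offset, so the
// messages between that checkpoint and the crash are processed a second time.
public class CheckpointSketch {
    interface OffsetStore {            // stands in for a durable checkpoint store
        void commit(long offset);
        long lastCommitted();          // returns -1 before the first commit
    }

    static void run(java.util.List<String> partition, OffsetStore store, long checkpointIntervalMs) {
        long lastCommitTime = System.currentTimeMillis();
        for (long offset = store.lastCommitted() + 1; offset < partition.size(); offset++) {
            process(partition.get((int) offset));      // may repeat work after a restart
            if (System.currentTimeMillis() - lastCommitTime >= checkpointIntervalMs) {
                store.commit(offset);                  // smaller interval => less reprocessing
                lastCommitTime = System.currentTimeMillis();
            }
        }
    }

    static void process(String message) {
        System.out.println("processing: " + message);
    }
}
```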

Samza vs. Storm

  • Samza also offers guaranteed delivery - currently only at-least-once delivery, but support for exactly-once semantics is planned.
  • Moreover, because Samza never processes messages in a partition out-of-order, it is better suited for handling keyed data. For example, if you have a stream of database updates - where later updates may replace earlier updates - then reordering the messages may change the final result. Provided that all updates for the same key appear in the same stream partition, Samza is able to guarantee a consistent state.
  • Rather than using a remote database for durable storage, each Samza task includes an embedded key-value store, located on the same machine. Reads and writes to this store are very fast, even when the contents of the store are larger than the available memory. Changes to this key-value store are replicated to other machines in the cluster, so that if one machine dies, the state of the tasks it was running can be restored on another machine.
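
A minimal sketch of a Samza task using the embedded key-value store described above, written against the classic low-level StreamTask API; the store name "word-counts" and its serdes are assumed to be declared in the job configuration, and the names here are illustrative.

```java
import org.apache.samza.config.Config;
import org.apache.samza.storage.kv.KeyValueStore;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.task.InitableTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskContext;
import org.apache.samza.task.TaskCoordinator;

// Counts messages per key using Samza's embedded (local, changelog-replicated)
// key-value store. "word-counts" would be configured in the job's properties.
public class WordCountTask implements StreamTask, InitableTask {
    private KeyValueStore<String, Integer> store;

    @Override
    @SuppressWarnings("unchecked")
    public void init(Config config, TaskContext context) {
        store = (KeyValueStore<String, Integer>) context.getStore("word-counts");
    }

    @Override
    public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) {
        String word = (String) envelope.getKey();
        Integer current = store.get(word);
        // Local read/write; changelog replication lets another machine
        // restore this state if the container dies.
        store.put(word, current == null ? 1 : current + 1);
    }
}
```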

##Flink

Intro to Flink

  • Basically all systems keep some form of upstream backup and acknowledge (sets of) records. Apache Storm, for example, tracks records with unique IDs. Put simply, operators send messages acknowledging when a record has been processed. Records are dropped from the backup when they have been fully acknowledged.
  • Flink follows a more coarse-grained approach and acknowledges sequences of records instead of individual records. Periodically, the data sources inject “checkpoint barriers” into the data stream (see figure below). The barriers flow through the data stream, “pushing” records before them and triggering operators to emit all records that depend only on records before the barrier.
  • This is also the idea behind the Kappa architecture, which advocates using a stream processor for all processing. Flink is exactly this, a stream processor that surfaces both a batch and a streaming API, which are both executed by the underlying stream processor.
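
For comparison, a minimal Flink sketch (Java DataStream API) showing where the barrier-based checkpointing described above is switched on; the five-second interval and the socket source are arbitrary illustrative choices.

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class FlinkWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Inject checkpoint barriers into the stream every 5 seconds; operator
        // state is snapshotted consistently as the barriers flow through.
        env.enableCheckpointing(5000);

        DataStream<String> lines = env.socketTextStream("localhost", 9999);
        DataStream<Tuple2<String, Integer>> counts = lines
                .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                    @Override
                    public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                        for (String word : line.split(" ")) {
                            out.collect(new Tuple2<>(word, 1));
                        }
                    }
                })
                .keyBy(t -> t.f0)
                .sum(1);
        counts.print();

        env.execute("flink-word-count");
    }
}
```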

##S4

  • S4 is a general-purpose, near real-time, distributed, decentralized, scalable, event-driven, modular platform that allows programmers to easily implement applications for processing continuous unbounded streams of data.

John Hsieh of Cloudera

  • Specifically, S4 assumes losing data is ok (weaker reliability) and that nodes cannot be added or removed from a running cluster (manageability story not completely clear).

##Amazon Kinesis

  • With the Amazon Kinesis Client Library (KCL) for Java | Python | Ruby | Node.js | .NET and the Amazon Kinesis Connector Library, you can easily build Amazon Kinesis applications and integrate with other services and tools such as Amazon S3, Amazon Redshift, Amazon DynamoDB, and Elasticsearch.

  • Integrates with Storm

##DataTorrent

  • Enterprise, Hadoop-centric streaming platform.

Datatorrent vs. Storm and Spark

  • Most enterprises don't have or want developers that are coding at the platform level. Imagine having your developers struggle with tuple-level acking, configuration & distributed state management! Enterprises have strict business requirements for SLAs (no data loss, performance/latency and availability) and they want their developers to focus on solving their core business problem.

##SQLstream

  • s-Server appears to be the engine for actually performing the processing.
  • SQL-based language for processing data streams.
  • Can integrate with Storm (docs).

##Naiad

  • This release of Naiad is an alpha release. The amount of testing the code has been subjected to is limited, mostly to programs we have written.
  • The Naiad project is an investigation of data-parallel dataflow computation, like Dryad and DryadLINQ, but with a focus on low-latency streaming and cyclic computations. Naiad introduces a new computational model, timely dataflow, which combines low-latency asynchronous message flow with lightweight coordination when required. These primitives allow the efficient implementation of many dataflow patterns, from bulk and streaming computation to iterative graph processing and machine learning.

##SEEP

  • SEEP is an experimental parallel data processing system that is being developed by the Large-Scale Distributed Systems (LSDS) research group at Imperial College London. It is licensed under EPL (Eclipse Public License).
  • The SEEP system is under heavy development and should be considered an alpha release. This is not considered a "stable" branch.