Apache Samza is a distributed stream processing framework developed at LinkedIn. Rather than a framework, it might be better thought of as a skeleton for building streaming applications. Samza combines orchestration (via YARN), messaging (with consumers and producers for Kafka), and state checkpointing (with a JNI wrapper for LevelDB and a log written to Kafka).
Samza snaps together via interfaces, and the supplied implementations are rather simple. It's expected that those building systems atop Samza will replace many, if not all, of them to suit the needs of their application. To best understand the API, take a glance at the Interface Hierarchy. Click through to a few of them – say, StreamJob, StreamTask, and StorageEngine. It's important to note that some of the specific guarantees Samza offers are not easily expressed as interfaces (e.g., partitioned, ordered, replicated streams) and may extend into particular implementations, so be cognizant of the impact alternate implementations might have on those guarantees. That said, there are few surprises in the supplied implementations, and providing custom implementations for persistence, messaging, and so on is straightforward. Samza's Concepts page provides a nice introduction.
Samza's design contrasts with Storm's; see, for example, Storm's class / interface hierarchy, which is extraordinarily rich. Unlike Storm, Samza owes much of its simplicity to the fact that it is not a framework in the conventional sense. It's a thoughtful series of interfaces that define an API upon which one can build well-factored distributed stream processing applications. Those who would prefer an explicit comparison to Storm can find one here, but I find the class / interface hierarchy more instructive. If you're building a streaming analytics application and prefer simple designs to rich ones, Samza gives you a solid set of bones to dress up.
A Samza application can take just about any shape. A simple example might look something like:
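In practice, a Samza job is defined by a properties file handed to the job runner. Here's a minimal sketch; the task class, topic names, and host addresses are hypothetical placeholders, and the 0.7.0 configuration docs cover the full set of properties:

```properties
# Run on YARN and name the job.
job.factory.class=org.apache.samza.job.yarn.YarnJobFactory
job.name=wikipedia-feed

# The StreamTask class to instantiate for each input stream/partition pair.
task.class=com.example.MyStreamTask
# Consume the "wikipedia-raw" topic from the system named "kafka".
task.inputs=kafka.wikipedia-raw

# Wire the "kafka" system to an actual Kafka cluster.
systems.kafka.samza.factory=org.apache.samza.system.kafka.KafkaSystemFactory
systems.kafka.consumer.zookeeper.connect=localhost:2181
systems.kafka.producer.metadata.broker.list=localhost:9092
```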
Most stream jobs will read from a distributed, durable log such as Apache Kafka, divided into several partitions. Individual Samza "TaskRunner" containers will consume from a stream/partition pair, apply a "process()" function to each message received, and periodically checkpoint how far into the stream they've read ("at-least-once" messaging by default). Samza's TaskRunner docs explain the consumption / processing lifecycle in detail.
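That consume / process / checkpoint cycle can be sketched in miniature. Everything below is an illustrative stand-in, not Samza code: a plain list substitutes for a Kafka partition and a map for the checkpoint topic, but the at-least-once behavior falls out the same way.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CheckpointLoop {
    // Toy stand-ins: "partition" plays the role of a Kafka stream/partition,
    // "checkpoints" the role of Samza's checkpoint log.
    static List<String> processAll(List<String> partition, Map<String, Integer> checkpoints) {
        String key = "input/0";
        // Resume from the last checkpointed offset (0 on first run).
        int offset = checkpoints.getOrDefault(key, 0);
        List<String> out = new ArrayList<>();
        while (offset < partition.size()) {
            out.add(partition.get(offset).toUpperCase()); // the "process()" step
            offset++;
            if (offset % 2 == 0) {
                // Periodic checkpoint: a crash between checkpoints replays the
                // uncheckpointed messages on restart -- hence "at-least-once".
                checkpoints.put(key, offset);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Integer> cp = new HashMap<>();
        System.out.println(processAll(Arrays.asList("a", "b", "c"), cp)); // [A, B, C]
        System.out.println(cp.get(key())); // 2 -- a restart would re-read "c"
    }

    static String key() { return "input/0"; }
}
```

Note that the checkpoint lags behind the true position, which is exactly why a message may be processed more than once after a failure.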
Unlike Storm, Samza does not have a concept of declarative processing topologies constructed as a graph. Instead, each Samza job is defined independently as an input, processing stage, and output. You're welcome to chain as many of these phases together as you'd like, but there is no requirement to define them up front or operate them as a contiguous topology.
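Chaining is then just a matter of pointing one job's `task.inputs` at a topic another job writes to. A hypothetical pair of config fragments (all names here are placeholders):

```properties
# enrich.properties -- job A reads raw events; its StreamTask sends
# results to an "events-enriched" topic via its MessageCollector.
job.name=enrich
task.class=com.example.EnrichTask
task.inputs=kafka.events-raw

# aggregate.properties -- job B independently consumes job A's output.
job.name=aggregate
task.class=com.example.AggregateTask
task.inputs=kafka.events-enriched
```

Each job is deployed and scaled on its own; the intermediate Kafka topic is what decouples them.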
Samza's docs include a nice overview of what must be implemented to write a streaming application. Here's the gist.
In short, you've got a StreamTask with a process() method that's called with arriving messages. A MessageEnvelope wraps each message you receive, with the message itself typed according to your de/serializer. Samza provides basic support for windowing and periodic flushing of windowed state as well. As you generate output, pass it to a MessageCollector for dispatch to the appropriate destination.
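Put together, a task might look like the following sketch. The interfaces here are simplified stand-ins for Samza's real ones (the actual process() also receives an envelope carrying key and offset metadata, plus a TaskCoordinator), but the shape is the same: messages arrive at process(), and output goes out through the collector.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins for Samza's interfaces, defined inline so this
// sketch is self-contained; the real ones live in org.apache.samza.task.
interface MessageCollector {
    void send(String stream, Object message);
}

interface StreamTask {
    void process(Object message, MessageCollector collector);
}

// A toy task: uppercase each incoming string and emit it downstream.
class UppercaseTask implements StreamTask {
    public void process(Object message, MessageCollector collector) {
        collector.send("output-topic", ((String) message).toUpperCase());
    }
}

public class StreamTaskSketch {
    public static void main(String[] args) {
        List<Object> sent = new ArrayList<>();
        // Capture "sent" messages in a list instead of producing to Kafka.
        MessageCollector collector = (stream, msg) -> sent.add(msg);
        StreamTask task = new UppercaseTask();
        for (String m : new String[] {"hello", "samza"}) {
            task.process(m, collector);
        }
        System.out.println(sent); // [HELLO, SAMZA]
    }
}
```

The container owns the consumption loop; your task only ever sees one message at a time, which is what keeps the programming model so small.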
For a complete example, see the hello-samza project at http://samza.incubator.apache.org/startup/hello-samza/0.7.0.
Samza is interesting to me as it is representative of the processing model I'd sketched out with a few friends for a streaming engine designed for distributed computation mediated by durable queues and independent aggregation stages. I am excited for Storm's success and celebrate its adoption. At the same time, I'm also glad to see a more "skeletal" framework emerge designed to take care of little more than orchestrating a streaming map/reduce and providing a minimal but thoughtful API to build against. For that reason, I'm thrilled to see Samza developed at LinkedIn and released as an incubating Apache project. Cheers to the team at LinkedIn, the ASF, and all of the folks working to build a community of users around the project.
If stream processing interests you and you're open to checking out something new, give it a skim. The docs are in good shape for a young project, and the architecture is shaping up nicely. A few of the committers hang out in #samza on Freenode, so quick answers to questions that come up should be easy to come by.
Great Review!
But there's one thing I'm not sure about: how did the LinkedIn folks get around the issue of topics created by intermediate tasks?
http://samza.incubator.apache.org/learn/documentation/0.7.0/introduction/architecture.html
In the example above, the task will create many topics keyed by the "message's member ID attribute". If there are millions of intermediate keys, how does Samza handle Kafka's limit on the number of topics? (Ref http://grokbase.com/t/kafka/users/133v60ng6v/limit-on-number-of-kafka-topic )