
@cscotta
Last active December 21, 2015 13:08

A Look at Apache Samza

C. Scott Andreas | Aug 21, 2013

Apache Samza is a distributed stream processing framework developed at LinkedIn. Rather than a full framework, it might be better thought of as a skeleton for building streaming applications. Samza combines orchestration (via YARN), messaging (with consumers and producers for Kafka), and state checkpointing (with a JNI wrapper for LevelDB and a changelog written to Kafka).

A "Skeleton," Not a Framework

Samza snaps together via interfaces, and the supplied implementations are rather simple. It's expected that those building systems atop Samza will replace many if not all of them to suit the needs of their application. To best understand the API, take a glance at the Interface Hierarchy. Click through to a few of them – say, StreamJob, StreamTask, and StorageEngine. It's important to note that some of the specific guarantees that Samza offers are not easily expressed as interfaces (e.g., partitioned, ordered, replicated streams) and may extend into particular implementations. One must be cognizant of the impact alternate implementations might have on these guarantees. That said, there are few surprises in the supplied implementations of each, and providing custom implementations for persistence, messaging, and so on is straightforward. Samza's Concepts page provides a nice introduction.

Samza's design contrasts with Storm's; see Storm's class / interface hierarchy, for example, which is extraordinarily rich. Unlike Storm, Samza owes much of its simplicity to the fact that it is not a framework in the conventional sense. It's a thoughtful series of interfaces that define an API upon which one can build well-factored distributed stream processing applications. Those who would prefer an explicit comparison to Storm can find one here, but I find the class / interface hierarchy more instructive. If you're building a streaming analytics application and prefer simple designs to rich ones, Samza gives you a solid set of bones to dress up.

Architecture

A Samza application can take just about any shape. A simple example might look something like:

Diagram

Most stream jobs will read from a distributed, durable log such as Apache Kafka, divided into several partitions. Individual Samza "TaskRunner" containers will consume from a stream/partition pair, apply a "process()" function to each message received, and periodically checkpoint the position in the stream they've consumed from ("at-least-once" messaging by default). Samza's TaskRunner docs explain the consumption / processing lifecycle in detail.
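That consume/process/checkpoint cycle can be sketched as a simplified loop. This is hypothetical stand-in code, not Samza's actual TaskRunner; every name here is made up for illustration:

```java
import java.util.List;

// Hypothetical sketch of a TaskRunner-style consume/process/checkpoint
// cycle; none of these names come from Samza's real implementation.
class CheckpointLoopSketch {

    // Consumes one partition, committing the offset every commitInterval
    // messages, and returns the last committed offset.
    static long runPartition(List<String> messages, int commitInterval) {
        long committedOffset = -1;
        for (int offset = 0; offset < messages.size(); offset++) {
            process(messages.get(offset));
            // The offset is committed only *after* processing, so a crash
            // between process() and the commit replays the uncommitted
            // messages on restart: hence "at-least-once" by default.
            if ((offset + 1) % commitInterval == 0) {
                committedOffset = offset;
            }
        }
        return committedOffset;
    }

    static void process(String message) {
        // Application logic would go here.
    }

    public static void main(String[] args) {
        long committed = runPartition(List.of("a", "b", "c", "d", "e"), 2);
        System.out.println("last committed offset: " + committed);
    }
}
```

Note that any message between the last committed offset and the crash point is seen twice on recovery; exactly-once behavior, if needed, is the application's problem to solve on top of this.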

Unlike Storm, Samza does not have a concept of declarative processing topologies constructed as a graph. Instead, each Samza job is defined independently as an input, processing stage, and output. You're welcome to chain as many of these phases together as you'd like, but there is no requirement to define them up front or operate them as a contiguous topology.
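Chaining, then, is just a matter of pointing one job's input at another job's output topic. A sketch of what two such independently defined jobs might look like (key names are modeled on Samza's job configuration style, but the topic and class names here are hypothetical):

```properties
# Job 1: reads raw events, writes a derived topic.
job.name=member-id-repartitioner
task.class=com.example.RepartitionTask
task.inputs=kafka.page-views

# Job 2, defined in its own separate config file, simply consumes
# job 1's output topic -- no up-front topology ties the two together:
#   job.name=member-activity-counter
#   task.class=com.example.CounterTask
#   task.inputs=kafka.page-views-by-member
```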

Writing a Samza Application

Samza's docs include a nice overview of what must be implemented to write a streaming application. Here's an overview.

In short, you've got a StreamTask with a process() method that's called with arriving messages. A MessageEnvelope wraps each message you receive, with the message itself typed according to your de/serializer. Samza provides basic support for windowing and periodic flushing of windowed state as well. As you generate output, pass it to a MessageCollector for dispatch to the appropriate destination.
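As a concrete sketch of that shape, here is a tiny task written against simplified stand-in interfaces. These are modeled loosely on Samza's StreamTask and MessageCollector but are not the real org.apache.samza classes, and the in-memory harness stands in for the TaskRunner:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins for Samza's task interfaces; the real ones live
// under org.apache.samza and carry more context (offsets, coordinators).
interface MessageEnvelope { Object key(); Object message(); }

interface MessageCollector { void send(String outputStream, Object key, Object message); }

interface StreamTask { void process(MessageEnvelope envelope, MessageCollector collector); }

// A task that upper-cases each message and forwards it downstream.
class UppercaseTask implements StreamTask {
    public void process(MessageEnvelope envelope, MessageCollector collector) {
        String text = (String) envelope.message();
        collector.send("uppercased-events", envelope.key(), text.toUpperCase());
    }
}

class Demo {
    // What the collector captured: destination stream, key, and message.
    record Sent(String stream, Object key, Object message) {}

    // Minimal in-memory harness standing in for the TaskRunner loop.
    static List<Sent> run(StreamTask task, List<String> input) {
        List<Sent> out = new ArrayList<>();
        MessageCollector collector = (s, k, m) -> out.add(new Sent(s, k, m));
        for (String msg : input) {
            MessageEnvelope envelope = new MessageEnvelope() {
                public Object key() { return null; }
                public Object message() { return msg; }
            };
            task.process(envelope, collector);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Sent> out = run(new UppercaseTask(), List.of("hello", "samza"));
        System.out.println(out.get(0).message() + " " + out.get(1).message());
    }
}
```

The real API adds de/serialization (so message() comes back as your own type) and the windowing hooks mentioned above, but the process-and-collect shape is the core of it.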

For a complete example, see the hello-samza project at http://samza.incubator.apache.org/startup/hello-samza/0.7.0.

Wrapping Up

Samza is interesting to me as it is representative of the processing model I'd sketched out with a few friends for a streaming engine designed for distributed computation mediated by durable queues and independent aggregation stages. I am excited for Storm's success and celebrate its adoption. At the same time, I'm also glad to see a more "skeletal" framework emerge designed to take care of little more than orchestrating a streaming map/reduce and providing a minimal but thoughtful API to build against. For that reason, I'm thrilled to see Samza developed at LinkedIn and released as an incubating Apache project. Cheers to the team at LinkedIn, the ASF, and all of the folks working to build a community of users around the project.

If stream processing interests you and you're open to checking out something new, give it a skim. The docs are in good shape for a young project, and the architecture is shaping up nicely. A few of the committers hang out in #samza on Freenode, so quick answers to questions that come up should be easy to come by.

@stonegao

Great Review!

But there's one thing about topics created by intermediate tasks that I'm not sure how the LinkedIn folks handle.

http://samza.incubator.apache.org/learn/documentation/0.7.0/introduction/architecture.html

The input topic is partitioned using Kafka. Each Samza process reads messages from one or more of the input topic's partitions, and emits them back out to a different Kafka topic keyed by the message's member ID attribute.

In the example above, the task would create many topics keyed by the "message's member ID attribute". If there are millions of intermediate keys, how does Samza handle Kafka's limits on the number of topics? (Ref http://grokbase.com/t/kafka/users/133v60ng6v/limit-on-number-of-kafka-topic )

@criccomini

Hey Stone,

The docs you're referring to are a bit confusing. The phrase, "..topic keyed by the message's member ID," doesn't mean that you'll have one topic, or one partition, per member ID. It means that all messages for a single member ID will go to one of the topic's partitions (usually using a hash mapping such as hash(member id) % numTopicPartitions). Specifically, the message key is completely decoupled from the partition count of a topic. Jay explains this in a bit more detail here:

http://mail-archives.apache.org/mod_mbox/incubator-samza-dev/201308.mbox/browser
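The hash(member id) % numTopicPartitions mapping Chris describes can be demonstrated with a quick sketch; the class and method names below are made up for illustration and are not Kafka or Samza API:

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative only: keying messages by member ID maps an unbounded
// key space onto a fixed, small number of partitions.
class KeyPartitioner {

    static int partitionFor(String key, int numPartitions) {
        // Mask off the sign bit so hashCode() can't yield a negative index.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int numPartitions = 8;
        Set<Integer> partitionsUsed = new HashSet<>();
        for (int memberId = 0; memberId < 1_000_000; memberId++) {
            partitionsUsed.add(partitionFor("member-" + memberId, numPartitions));
        }
        // A million member IDs still touch at most numPartitions partitions,
        // so key cardinality never creates new topics or partitions.
        System.out.println("partitions used: " + partitionsUsed.size());
    }
}
```

The same key always hashes to the same partition, which is what preserves per-member ordering while keeping the partition count fixed.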

Sorry for the poor wording. We'll update the docs.

Cheers,
Chris
