
@agarman
Created June 27, 2017 18:04
MessageHub is Kafka

Kafka/MessageHub is a distributed log. It scales writes by partitioning data. You can write a custom partitioner, though the DefaultPartitioner is sufficient for most applications.

The DefaultPartitioner uses the message key to determine which partition a message is written to. This gives you control over where messages land (important for ConsumerGroups…). It also means Kafka is susceptible to hot partitions, where more messages are routed to a partition than the disk IO of the Kafka nodes hosting that partition can support.
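
For illustration, here's a minimal Java sketch of keyed partitioning using Kafka's Partitioner interface. It mirrors what the DefaultPartitioner does for keyed messages (hash the key bytes, mod the partition count); the real one also spreads keyless messages across partitions, which this sketch simplifies away. The class name is hypothetical.

```java
import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.utils.Utils;

// Sketch of key-based partitioning: equal keys always land on the same
// partition, which is exactly why a hot key produces a hot partition.
public class KeyHashPartitioner implements Partitioner {
    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        if (keyBytes == null) {
            return 0; // simplification; the DefaultPartitioner spreads keyless messages instead
        }
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    @Override
    public void close() {}

    @Override
    public void configure(Map<String, ?> configs) {}
}
```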

Writes are also replicated (3 replicas are recommended). If the set of ISR (in-sync replicas) shrinks below the configured minimum, writes to that partition are refused (Kafka maintains consistency within data partitions).
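
A sketch of where that guarantee shows up on the producer side. It assumes the topic was created with replication.factor=3 and min.insync.replicas=2; with acks=all, a send then fails (NotEnoughReplicas) rather than being accepted when fewer than 2 replicas are in sync. Topic name and broker address are placeholders.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SafeProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("acks", "all"); // wait for all in-sync replicas to acknowledge

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The key ("foo") determines the partition via the partitioner.
            producer.send(new ProducerRecord<>("events", "foo", "123"));
        }
    }
}
```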

There are two styles of consumers in Kafka: 1.) a stateless, low-level API where the consumer tracks its own read position on the partition(s) being consumed; 2.) a high-level ConsumerGroup API where Kafka tracks the read position but allows at most 1 consumer per partition within a group.
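
A minimal sketch of the high-level style: Kafka assigns partitions to the members of the group (at most one member per partition) and tracks committed offsets on the group's behalf. The group id, topic name, and broker address are placeholders.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", "my-group"); // membership in this group drives partition assignment
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> r : records) {
                    System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                            r.partition(), r.offset(), r.key(), r.value());
                }
                consumer.commitSync(); // checkpoint the group's read position in Kafka
            }
        }
    }
}
```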

The consumer group API looks like a queue, but it has limitations a queue doesn't: 1.) messages are only ordered within a partition…not across partitions or topics; 2.) the dequeue is exclusive…a single consumer at a time per partition. That means if you need a shared queue, you have to implement your own dequeue and routing.

It’s best to think of the ConsumerGroup API as a non-shared message iterator with checkpointing support for fail-over/fault tolerance.

Also, a distinct advantage of a Kafka ConsumerGroup over a traditional queue is that you can rewind to a previous offset or time. The data written to Kafka is immutable…
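
A sketch of rewinding: assign a partition directly, then seek either to an absolute offset or to the first offset written at/after a timestamp. Topic name, partition, offset, and broker address are placeholders.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;

public class Rewinder {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("events", 0);
            consumer.assign(Collections.singletonList(tp));

            consumer.seek(tp, 42L); // rewind to a known offset

            // Or rewind by time: find the first offset at/after one hour ago.
            long oneHourAgo = System.currentTimeMillis() - Duration.ofHours(1).toMillis();
            Map<TopicPartition, OffsetAndTimestamp> byTime =
                    consumer.offsetsForTimes(Collections.singletonMap(tp, oneHourAgo));
            if (byTime.get(tp) != null) {
                consumer.seek(tp, byTime.get(tp).offset());
            }
        }
    }
}
```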

Though the data is immutable, there are retention settings and compaction. Retention removes old data; compaction removes all but the newest value for a given message key.

Compaction: [foo|123],[foo|456],…[foo|789] becomes [foo|789]
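
A sketch of configuring both behaviors at topic creation, using Kafka's AdminClient: one topic that deletes data older than its retention window, and one that compacts down to the newest value per key. Topic names, sizing, and broker address are placeholders.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class TopicSetup {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder

        Map<String, String> retained = new HashMap<>();
        retained.put("cleanup.policy", "delete");
        retained.put("retention.ms", "604800000"); // drop data older than 7 days

        Map<String, String> compacted = new HashMap<>();
        compacted.put("cleanup.policy", "compact"); // [foo|123],[foo|456] -> [foo|456]

        try (AdminClient admin = AdminClient.create(props)) {
            admin.createTopics(Arrays.asList(
                    new NewTopic("events", 6, (short) 3).configs(retained),
                    new NewTopic("latest-values", 6, (short) 3).configs(compacted)
            )).all().get();
        }
    }
}
```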

That's Kafka. Compose.io supports Scylla, RabbitMQ & Redis. Bluemix.net also has Cloudant & DashDB. Depending on the use case, any of these may be a better choice than Kafka. IMO, Bluemix.net needs something equivalent to Google PubSub or Azure Queue Storage… and should avoid looking at AWS SQS for inspiration (as that system dictates that all message consumers be idempotent).
