Kafka / WebSockets - lessons learned

WebSockets - The good parts

  • if you want a non-blocking, persistent connection from the browser, this is your only choice
  • the WebSocket protocol has matured over time, with excellent tooling support:
      • Chrome DevTools
      • versatile client and server libraries on all major platforms (see the client sketch after this list)
      • good browser support
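
To make the library point concrete, here is a minimal Java client sketch against the standard JSR 356 (javax.websocket) API. The endpoint URL is made up, and you need a JSR 356 implementation such as Tyrus on the classpath:

```java
import java.net.URI;
import javax.websocket.ClientEndpoint;
import javax.websocket.ContainerProvider;
import javax.websocket.OnMessage;
import javax.websocket.OnOpen;
import javax.websocket.Session;
import javax.websocket.WebSocketContainer;

@ClientEndpoint
public class EchoClient {

    @OnOpen
    public void onOpen(Session session) throws Exception {
        session.getBasicRemote().sendText("hello");   // push a frame as soon as the socket opens
    }

    @OnMessage
    public void onMessage(String message) {
        System.out.println("received: " + message);   // frames arrive asynchronously
    }

    public static void main(String[] args) throws Exception {
        WebSocketContainer container = ContainerProvider.getWebSocketContainer();
        // ws://localhost:9000/echo is a made-up endpoint; point it at any echo server
        container.connectToServer(EchoClient.class, URI.create("ws://localhost:9000/echo"));
        Thread.sleep(2000);   // keep the JVM alive long enough to see the reply
    }
}
```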

WebSockets - Gotchas

Kafka - The good parts

  • Consuming from a topic both as a queue and as an individual subscriber allowed us to do real-time pub-sub while consuming the same stream for analytics at the same time (see the consumer-group sketch after this list)
  • Being able to consume from multiple AWS instances (as part of the same consumer group) without worrying about coordination or data loss is a huge plus
  • Scaling: adding a new server (and [moving partitions](http://kafka.apache.org/documentation.html#basic_ops_cluster_expansion)) in an AWS environment is straightforward and worked well in production
  • Throughput/Latency: we consistently measured ~90-100 ms latency per event on our test cluster (6 brokers deployed on 6 d2.xlarge instances) for real-time streaming (in reality this number is often lower due to batching). We tested this at 100 req/sec. In production we have not experienced any issues yet.
  • For throughput testing we used a topic with 400 partitions and a replication factor of 1.
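
As an aside, here is a minimal sketch of that dual queue/pub-sub pattern, using the current Java KafkaConsumer API (newer than the clients this gist was written against); the broker address, topic, and group name are made up. Instances sharing a group.id split the partitions between them (queue semantics), while a consumer with a different group.id receives its own full copy of the stream (pub-sub semantics):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class StreamWorker {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");   // hypothetical broker address
        props.put("group.id", "realtime-streamers");       // all instances with this id share the partitions
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events"));  // hypothetical topic
            while (true) {
                // Each record is delivered to exactly one member of this group (queue semantics);
                // a consumer with a different group.id gets its own full copy (pub-sub semantics).
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```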

Kafka - Gotchas

  • Client APIs are a bit messy and in flux:
      • new producer vs. old producer?
      • where to store offsets: Kafka or ZK?
      • SimpleConsumer vs. HighLevelConsumer (where SimpleConsumer is actually more complicated than HighLevelConsumer)
  • HighLevelConsumers can take a few seconds to connect (likely due to ZK)
  • Basic patterns are not encapsulated in the API, only loosely documented on the wiki and elsewhere
  • Our solution was to create our own consumer library with an Rx interface and switch over to the new producer
  • Rebalancing, which can happen when a new consumer is added to a consumer group or when a broker is added/removed, caused issues in our SimpleConsumer setup. We made sure our client library handles a broker crash correctly (retries + backoff; see the offset-fallback sketch after this list), but when a rebalancing happened, it still slowed down real-time streaming for a few clients. We are looking into various ways to recover faster. Our HighLevelConsumers could successfully recover from a broker leader change or a consumer rebalancing.
  • Don’t confuse typical pub/sub topics with Kafka topics: topics in Kafka are expensive to create at runtime. In a pub/sub situation you’d almost always want to use partitions instead (see the keyed send in the producer sketch after this list)
  • The pub/sub use case seems to be less well documented; there are not many war stories out there
  • Cleaning up subscribers (SimpleConsumers) is important; monitor your fetch load to spot leaks
  • The older producer API appeared to have a bug where it threw a queue-full exception after a temporary communication error with the broker. The solution was to move to the newer (recommended) producer API and to tune the producer config used by our harness, e.g. batch size, queue size, etc. (see the producer sketch after this list)
  • We had to fine-tune the configuration to get the buffering strategy and the replication factor right. Lots of trial and error.
  • Consider a fallback strategy in a SimpleConsumer when calculating the last-seen offset. It is needed when a broker does not have data for the specified offset or when finding the lead broker fails. Our strategy is to retry with backoff and fall back to the earliest offset in the given partition (see the offset-fallback sketch after this list)
  • Editing topics at runtime (i.e. changing the number of partitions, etc.) can cause rebalancing (adding new topics should be fine); HighLevelConsumers can survive this with the appropriate settings, but SimpleConsumers need to manage this situation manually
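
Two sketches follow. First, the retries + backoff and fall-back-to-earliest strategy from the rebalancing and offset bullets above, transplanted onto the modern Java consumer API rather than our SimpleConsumer code (which also handled lead-broker lookup), so read it as an illustration of the idea, not our implementation. It assumes auto.offset.reset=none so that an out-of-range offset surfaces as an exception instead of being reset silently:

```java
import java.time.Duration;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetOutOfRangeException;
import org.apache.kafka.common.errors.RetriableException;

public final class ResilientPoll {
    // Poll, retrying transient broker errors with exponential backoff; if the
    // requested offset no longer exists on the broker, rewind to earliest.
    static ConsumerRecords<String, String> pollWithFallback(
            KafkaConsumer<String, String> consumer, int maxRetries) throws InterruptedException {
        long backoffMs = 100;
        for (int attempt = 0; ; attempt++) {
            try {
                return consumer.poll(Duration.ofMillis(500));
            } catch (OffsetOutOfRangeException e) {
                // The broker has no data at our offset (e.g. the log was truncated):
                // fall back to the earliest available offset in the affected partitions.
                consumer.seekToBeginning(e.partitions());
            } catch (RetriableException e) {
                if (attempt >= maxRetries) throw e;   // out of retries: propagate
                Thread.sleep(backoffMs);
                backoffMs = Math.min(backoffMs * 2, 5_000);  // capped exponential backoff
            }
        }
    }
}
```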
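
Second, the producer-side tuning from the bullets above, as a sketch against the new Java producer API. The broker address, topic, and every config value here are placeholders rather than our production settings; the keyed send at the bottom is the "use partitions, not topics" pattern:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TunedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");  // hypothetical broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // The knobs alluded to above -- the values are placeholders to tune, not recommendations:
        props.put("acks", "1");                   // durability vs. latency trade-off
        props.put("retries", "3");                // retry transient broker errors instead of failing fast
        props.put("batch.size", "16384");         // max bytes per per-partition batch
        props.put("linger.ms", "5");              // wait briefly so batches can fill
        props.put("buffer.memory", "33554432");   // total send buffer ("queue size" in old-API terms)

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Keying by (say) a client id routes all of that client's events to one
            // partition, giving per-key ordering without creating topics at runtime.
            producer.send(new ProducerRecord<>("events", "client-42", "{\"msg\":\"hello\"}"));
        }
    }
}
```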