kafka / websockets - lessons learned

WebSockets - The good parts

  • if you want a non-blocking, persistent TCP connection from the browser, this is your only choice (see the connection sketch after this list)
  • the WebSocket protocol has matured over time, with excellent tooling support:
      • Chrome DevTools
      • versatile client and server libraries on all major platforms
  • good browser support
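
For illustration, a minimal sketch of such a persistent, non-blocking connection using Java 11's built-in `java.net.http.WebSocket` client; the endpoint URL is hypothetical:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.WebSocket;
import java.util.concurrent.CompletionStage;

public class WsClientDemo {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // Open a persistent, full-duplex connection; all I/O is asynchronous.
        WebSocket ws = client.newWebSocketBuilder()
                .buildAsync(URI.create("wss://stream.example.com/events"), // hypothetical endpoint
                        new WebSocket.Listener() {
                            @Override
                            public CompletionStage<?> onText(WebSocket webSocket,
                                                             CharSequence data, boolean last) {
                                System.out.println("received: " + data);
                                webSocket.request(1); // signal readiness for the next message
                                return null;
                            }
                        })
                .join();

        ws.sendText("subscribe", true);  // non-blocking send
        Thread.sleep(5_000);             // keep the demo alive briefly
        ws.sendClose(WebSocket.NORMAL_CLOSURE, "done").join();
    }
}
```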

WebSockets - Gotchas

Kafka - The good parts

  • Consuming from a topic both as a queue and as an individual subscriber allowed us to do real-time pub-sub while simultaneously consuming the same stream for analytics (see the consumer-group sketch after this list)
  • Being able to consume from multiple AWS instances (as part of the same consumer group) without worrying about coordination or data loss is a huge plus
  • Scaling: adding a new server ([and moving partitions](http://kafka.apache.org/documentation.html#basic_ops_cluster_expansion)) in an AWS environment is straightforward and worked well in production
  • Throughput/latency: we consistently measured ~90-100 ms latency per event on our test cluster (6 brokers deployed on 6 d2.xlarge instances) for real-time streaming at 100 req/sec; in practice this number is often lower due to batching. In production we have not experienced any issues yet.
  • For throughput testing we used a topic with 400 partitions and a replication factor of 1.
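
As a sketch of the queue-plus-subscriber point above: consumers sharing a `group.id` split a topic's partitions between them (queue semantics), while a second `group.id` receives its own full copy of the stream, which is how real-time pub-sub and analytics can read the same topic side by side. This uses the 0.8-era high-level consumer API; the topic name and ZooKeeper address are made up:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.ConsumerIterator;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;

public class GroupConsumerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("zookeeper.connect", "zk1:2181");      // hypothetical ZK address
        // Consumers with the same group.id share the topic's partitions (queue
        // semantics); a different group.id gets its own full copy of the stream.
        props.put("group.id", "realtime-streaming");
        props.put("auto.offset.reset", "largest");

        ConsumerConnector connector =
                Consumer.createJavaConsumerConnector(new ConsumerConfig(props));

        Map<String, Integer> topicCountMap = new HashMap<>();
        topicCountMap.put("events", 1);                  // one stream for topic "events"

        Map<String, List<KafkaStream<byte[], byte[]>>> streams =
                connector.createMessageStreams(topicCountMap);

        ConsumerIterator<byte[], byte[]> it = streams.get("events").get(0).iterator();
        while (it.hasNext()) {
            byte[] payload = it.next().message();
            // fan the message out to connected WebSocket clients here
        }
    }
}
```

Running the same program with `group.id` set to, say, `analytics` would receive every event again, independently of the real-time group.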

Kafka - Gotchas

  • Client APIs are a bit messy and in flux:
      • new producer vs old producer?
      • where to store offsets: Kafka or ZooKeeper?
      • SimpleConsumer vs HighLevelConsumer (where SimpleConsumer is actually more complicated than HighLevelConsumer)
      • HighLevelConsumers can take a few seconds to connect (likely due to ZooKeeper)
  • Basic patterns are not encapsulated in the API, only loosely documented on the wiki and elsewhere
  • Our solution was to build our own consumer library with an Rx interface and switch over to the new producer (see the Rx sketch after this list)
  • Rebalancing, which can happen when a new consumer joins a consumer group or when a broker is added or removed, caused issues in our SimpleConsumer setup. We made sure our client library handles a broker crash correctly (retries + backoff), but a rebalance still slowed down real-time streaming for a few clients; we are looking into ways to recover faster. Our HighLevelConsumers recovered successfully from a broker leader change or a consumer rebalance.
  • Don't confuse typical pub/sub topics with Kafka topics: topics in Kafka are expensive to create at runtime. In a pub/sub situation you'd almost always want to use partitions instead.
  • The pub/sub use case seems to be less well documented; there are not many war stories out there.
  • Cleaning up subscribers (SimpleConsumers) is important; you will want to monitor your fetch load to spot leaks.
  • The older producer API appeared to have a bug where it threw a queue-full exception after a temporary communication error with the broker. The solution was to move to the new producer API, as recommended, and to tune the producer config used by our harness (e.g. batch size, queue size; see the producer config sketch after this list).
  • We had to fine-tune the configuration to make sure the correct buffering strategy and replication factor were chosen. Lots of trial and error.
  • Consider using a fallback strategy in a SimpleConsumer when calculating the last seen offset; this is necessary when a broker has no data for the requested offset or when finding the lead broker fails. Our strategy is to retry with backoff and fall back to the earliest offset in the given partition (see the offset sketch after this list).
  • Editing topics at runtime (i.e. changing the number of partitions, etc.) can cause rebalancing (adding new topics should be fine). HighLevelConsumers can survive this with appropriate settings, but SimpleConsumers need to handle this manually.
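
The Rx interface mentioned above might look roughly like the following sketch, which wraps a blocking high-level consumer stream in an RxJava 1.x `Observable`. This is a minimal sketch of the idea, not our production library; the `stream` argument is assumed to come from `createMessageStreams` as in the earlier example:

```java
import kafka.consumer.ConsumerIterator;
import kafka.consumer.KafkaStream;
import rx.Observable;
import rx.schedulers.Schedulers;

public class RxKafka {
    // Wrap a blocking Kafka stream iterator in an Observable so downstream
    // code can compose, filter, and fan out messages reactively.
    // Note: no backpressure handling in this sketch.
    public static Observable<byte[]> messages(KafkaStream<byte[], byte[]> stream) {
        return Observable.<byte[]>create(subscriber -> {
            ConsumerIterator<byte[], byte[]> it = stream.iterator();
            while (!subscriber.isUnsubscribed() && it.hasNext()) {
                subscriber.onNext(it.next().message());
            }
            if (!subscriber.isUnsubscribed()) {
                subscriber.onCompleted();
            }
        }).subscribeOn(Schedulers.io()); // keep the blocking iterator off the caller's thread
    }
}
```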
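The producer-side tuning from the list above, sketched against the new producer API (`org.apache.kafka.clients.producer.KafkaProducer`); the broker address and the specific values are illustrative, not the settings we ended up with:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TunedProducerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");   // hypothetical broker
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put("acks", "1");               // wait for the partition leader only
        props.put("retries", 3);              // ride out temporary broker hiccups
        props.put("batch.size", 16384);       // bytes per per-partition batch
        props.put("linger.ms", 5);            // trade a little latency for batching
        props.put("buffer.memory", 33554432); // total memory for unsent records

        Producer<byte[], byte[]> producer = new KafkaProducer<>(props);
        producer.send(new ProducerRecord<>("events", "hello".getBytes()));
        producer.close();
    }
}
```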
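And the offset fallback strategy, sketched against the 0.8 SimpleConsumer API (the client id and retry parameters are made up, and error handling is reduced to the bare minimum):

```java
import java.util.HashMap;
import java.util.Map;
import kafka.api.PartitionOffsetRequestInfo;
import kafka.common.TopicAndPartition;
import kafka.javaapi.OffsetResponse;
import kafka.javaapi.consumer.SimpleConsumer;

public class OffsetFallback {
    // Ask the lead broker for an offset at `whichTime`
    // (kafka.api.OffsetRequest.LatestTime() or EarliestTime()).
    static long fetchOffset(SimpleConsumer consumer, String topic, int partition,
                            long whichTime) {
        TopicAndPartition tp = new TopicAndPartition(topic, partition);
        Map<TopicAndPartition, PartitionOffsetRequestInfo> requestInfo = new HashMap<>();
        requestInfo.put(tp, new PartitionOffsetRequestInfo(whichTime, 1));
        kafka.javaapi.OffsetRequest request = new kafka.javaapi.OffsetRequest(
                requestInfo, kafka.api.OffsetRequest.CurrentVersion(), "demo-client");
        OffsetResponse response = consumer.getOffsetsBefore(request);
        if (response.hasError()) {
            return -1L; // caller decides how to recover
        }
        return response.offsets(topic, partition)[0];
    }

    // Retry with backoff, then fall back to the earliest offset in the partition.
    static long lastSeenOffsetWithFallback(SimpleConsumer consumer, String topic,
                                           int partition) throws InterruptedException {
        long backoffMs = 200;
        for (int attempt = 0; attempt < 3; attempt++) {
            long offset = fetchOffset(consumer, topic, partition,
                    kafka.api.OffsetRequest.LatestTime());
            if (offset >= 0) {
                return offset;
            }
            Thread.sleep(backoffMs);
            backoffMs *= 2; // exponential backoff between retries
        }
        // all retries failed: start from the beginning of the partition
        return fetchOffset(consumer, topic, partition,
                kafka.api.OffsetRequest.EarliestTime());
    }
}
```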