@iftheshoefritz
Last active February 14, 2023 13:43

## Current ordering problems:

  • the most complicated parts of the system are coupled to orders being placed
    • e.g. order -> payment completed -> branch assignment -> driver assignment
    • “everything initiated by an order” is hard to avoid in standard web apps, where user input (requests) initiates all activity.
    • this chain is too long and hard to reason about
  • ordering features fight for resources
    • restaurant dashboard vs driver assignment vs order list vs driver list
      • all have different purposes with different requirements in terms of:
        • how often they need to be updated
        • what data is queried
        • how denormalised they are
      • currently all constructed from the same data
        • lots of competition
      • each working very hard to transform the normalised data for its own use case
  • reads fight with writes
    • e.g. drivers constantly updating their status affects whether a driver can be chosen for an order
  • ordering async features are only scalable by function, not by data
    • can scale PaymentProcessor and DriverAssignmentJob easily
    • not so easy to scale DriverAssignmentJob for Branch A separately from DriverAssignmentJob for Branch B
  • understanding state changes that lead to a problem is hard
    • requires current state in DB + extensive use of logs that might not be there

## Characteristics of event streaming architecture that can address the above:

  • events can drive small pieces of logic (easy to understand) that then create another event for a different piece of logic to pick up.
    • creates natural decoupling
  • normal/easy/deterministic to create different views of the same data
    • stay up to date
    • parallelisable
    • each piece of the app becomes fast because it reads from a view that is optimised for it
  • normal to have separate reads and writes
  • scale based on different data
    • e.g. branch A, B, C only have a trickle of data but D needs a dedicated consumer
  • inspect how a piece of data reached its current state by looking at all the events that apply
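
A minimal sketch of the first two points (Python; event and view names are invented for illustration): one append-only log of events feeds several independent "views", each materialised by its own small, pure function.

```python
# Sketch: one append-only event log feeding two independent views.
# Event shapes and field names are invented for illustration.
from collections import defaultdict

events = [
    {"type": "order_placed", "order_id": 1, "branch": "A"},
    {"type": "driver_status", "driver_id": 7, "status": "available"},
    {"type": "order_placed", "order_id": 2, "branch": "B"},
    {"type": "driver_status", "driver_id": 7, "status": "busy"},
]

def build_orders_per_branch(log):
    """View optimised for an order dashboard: orders grouped by branch."""
    view = defaultdict(list)
    for e in log:
        if e["type"] == "order_placed":
            view[e["branch"]].append(e["order_id"])
    return dict(view)

def build_driver_statuses(log):
    """View optimised for driver assignment: latest status per driver."""
    view = {}
    for e in log:
        if e["type"] == "driver_status":
            view[e["driver_id"]] = e["status"]  # later events win
    return view

# Each consumer reads the same log but materialises only what it needs;
# rebuilding a view is just replaying the log through its pure function.
print(build_orders_per_branch(events))  # {'A': [1], 'B': [2]}
print(build_driver_statuses(events))    # {7: 'busy'}
```

Because each view is a deterministic function of the log, the views never compete for the same normalised tables, and "inspect how data reached its current state" is just reading the relevant slice of the log.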

Notes from various sources below:

## Why async jobs aren’t an answer

  • https://3.basecamp.com/3091943/buckets/15558617/messages/3571977802
    • things that Joe says you’d re-implement badly if you tried to do a Kafka project with job queuing:
      • interwoven with explanations from https://ssudan16.medium.com/kafka-internals-47e594e3f006
      • replaying
        • replay events from the beginning for new apps
      • archival
      • partitioning
        • topics are partitioned
        • partition can be determined by producer, or it can be hash of the content
        • partitions have leaders and replicas
        • in-sync replicas is a list of replicas that are caught up with the leader
          • if the leader goes away, one of the ISRs is chosen to replace
        • leaders can remove a dodgy replica if it falls too far behind or doesn’t send heartbeats
      • message identification
        • messages uniquely identified by combination of topic, partition, and offset
        • dunno why this is important and different from DJ
          • maybe messages retried get new IDs?
      • time skew resolution
        • different concepts of time: when the event was generated, when the event was ingested, when the event was processed
        • not too sure of what this was
      • consistency
      • resiliency
        • leader/replica are tracked and can be replaced if they become dodgy
        • min in-sync replicas must be met before the message is committed from the write-ahead log
        • when a broker doesn’t send a heartbeat to ZooKeeper, ZK notifies the cluster controller to fetch all partitions for which that broker was the leader (check that) and promote another broker to be the new leader
      • aggregation
      • compaction
      • scaling
        • topics split into partitions
        • partitions can be split across multiple brokers
        • each partition has one consumer within a consumer group
        • multiple partitions can be consumed by a single consumer
          • as soon as more consumers are added the partitions are distributed between the consumers
          • max scaling within a group up to one consumer per partition
        • more apps consuming events = more consumer groups
        • long polling probably also helps scaling?
    • also higher level tools for:
      • aggregation
      • batching
      • consistency guarantees
      • archival
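
A toy model of the partitioning and scaling mechanics above (Python; names are illustrative, and real Kafka uses murmur2 hashing and a richer assignment protocol): a message is uniquely identified by (topic, partition, offset), the partition comes from a hash of the message key, and partitions are divided among the consumers in a group.

```python
# Toy model of Kafka-style partitioning and consumer-group assignment.
# All names are illustrative; not real Kafka client code.
import zlib

NUM_PARTITIONS = 4

def partition_for(key: str) -> int:
    # Deterministic hash of the key: the same key always lands on the
    # same partition, which preserves per-key ordering.
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

# A "topic" as a list of append-only partition logs.
topic = [[] for _ in range(NUM_PARTITIONS)]

def produce(key: str, value: str):
    p = topic[partition_for(key)]
    p.append(value)
    # The message's unique identity: (topic, partition, offset).
    return ("orders", partition_for(key), len(p) - 1)

def assign(partitions: int, consumers: list):
    """Round-robin partition assignment within one consumer group:
    each partition has exactly one consumer, one consumer may own
    several partitions, and consumers beyond the partition count idle."""
    assignment = {c: [] for c in consumers}
    for p in range(partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment

ident = produce("order-123", "order placed")
print(ident)                    # ('orders', <some partition>, 0)
print(assign(4, ["c1", "c2"]))  # {'c1': [0, 2], 'c2': [1, 3]}
```

This is why scaling "by data" falls out naturally: adding a consumer to the group redistributes partitions, up to one consumer per partition.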

## Martin Fowler event sourcing doc

  • https://martinfowler.com/eaaDev/EventSourcing.html
    • from a processing point of view, just adding events is an unnecessary step
    • does give you history of application
      • this could be done just by logging
    • gains with events
      • complete rebuild: discard state and replay events from the beginning
      • temporal query: start from the beginning and replay up to the time you want
      • replay: reverse events back to a broken event, fix and replay
        • this can also fix events out of order
    • state storage
      • getting state by replaying from the beginning is expensive
        • can start from overnight snapshot and play events after that point in time to get to latest
        • faster recovery from crash
    • interacting with external systems
      • when doing an event replay, we don’t want to (e.g.) send reports again for items that were sold
        • gateway between you and external payment system can decide if this update is because of event replay
        • or notification to outside world can be scheduled using a different event (e.g. monthly) rather than being based on the individual items being sold
      • getting data externally based on an old event might be a problem if the external system can’t give you data as of the date of the old event
        • maybe gateway to external system can save responses so that they can be used again when replaying
    • changes to existing code
      • reprocess data using newly fixed code, but should old events have values based on code at the old time or should they be based on the new code?
    • when to use: AKA it’s difficult, so there must be some kind of return
      • complete audit log can be nice for support or debugging
        • text logs can do some of this
        • debugging by rewinding to the exact state at time X is different to debugging with logs
      • can decouple things and parallelise
      • parallel models
        • you could take the current state and then simulate a Christmas rush by playing lots of fake events
      • retroactive events
        • in a normal system, when some data is broken you have to figure out what the system did with that bad data and work out what it should have been based on the fixed data
        • with event sourcing and retroactive events you can just throw an event on the log that fixes the data and get the fixed result “for free”
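
The rebuild, temporal-query, and snapshot ideas above can be sketched as (Python; the event shape, a timestamp plus a balance delta, is invented):

```python
# Sketch of event sourcing: current state is a fold over the event log.
# The event shape (timestamp + delta) is invented for illustration.
events = [
    {"ts": 1, "delta": +100},
    {"ts": 2, "delta": -30},
    {"ts": 3, "delta": +5},
]

def replay(log, until_ts=None, snapshot=0):
    """Rebuild state by replaying events, optionally starting from a
    snapshot and/or stopping at a point in time (a 'temporal query')."""
    state = snapshot
    for e in log:
        if until_ts is not None and e["ts"] > until_ts:
            break
        state += e["delta"]
    return state

print(replay(events))              # 75 -- complete rebuild from scratch
print(replay(events, until_ts=2))  # 70 -- state as of ts=2

# Snapshot optimisation: persist the state as of some point overnight,
# then on recovery replay only the events after the snapshot.
snapshot_state = replay(events, until_ts=2)
later = [e for e in events if e["ts"] > 2]
print(replay(later, snapshot=snapshot_state))  # 75
```

A retroactive fix is then just another event appended to the log; the corrected state falls out of the same replay.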

## New York Times case study

  • NY Times publishing
  • consumers:
    • Elasticsearch cluster that can be a bit behind but must be easily re-indexable when indexing requirements change
    • list providers that need to be updated when a new piece of content matches their query or stops matching
    • live content that must be up to date ASAP with new content but only cares about the latest content
    • personalisation service that needs to reprocess content every time a personalisation algorithm changes
  • API problems:
    • every API has its own schema, same name different content, different name same content etc
    • API structures are inconsistent
    • consumers need knowledge of each different API’s quirks
    • previously published content is hard to retrieve, difficult to reprocess everything
  • DB as source of truth:
    • problem
      • hard to make big schema changes
      • hard to replace without downtime
        • snapshots immediately become outdated
        • therefore hard to create derived store like search index
        • clients of DBs end up writing to DB AND something else (e.g. search index)
          • consistency problem if one write fails and others succeed
  • Log architectures
    • the problems with the DB are a result of the DB storing only the end result of events like updates
    • in log architecture the store IS the list of events
      • use the log to create materialized custom data stores
      • each application’s data store (probably traditional RDBMS) can store what it wants from the log in the structure that makes sense for that app
    • distinction between getting all data and just getting updates goes away
      • Start at the beginning or start at latest (it doesn’t matter) then keep consuming events after that
      • therefore getting a “dump” is easy
  • kafka uses a log architecture
    • this is just an implementation detail when using Kafka as msg broker
      • other services do just as well, e.g. SQS
    • log architecture fundamental characteristic of using kafka when you need both of the following:
      • ability to recreate a data store from scratch (with all data events ever)
      • order of events matters
  • NYT has a monolog of everything published since 1851
    • recently published service only gets content from the last few hours
    • “science section” service would build up its list from the beginning of time
    • assets are normalised, e.g. an image can be in multiple articles, a tag can be on multiple articles
      • image, tag, article are each separate events
    • monolog is single partition topic
      • want to guarantee that assets referenced by an article appear before the article
  • NYT also has a denormalised log
    • when article is published, a single event containing the article and everything it references is published
      • including assets that already existed on the normalised log
    • denormalised log is built by a Denormaliser application that consumes the normalised log
      • keeps local state of latest version of every article
      • on article create, or on update of a dependent asset, the denormaliser publishes the article again to the denormalised log
    • denormalised log doesn’t need everything to be ordered completely, just different versions of the same article need to be kept in order
      • so now they can put those articles into lots of different partitions, as long as the same article will always appear on the same partition
  • advantages of Kafka at NYT:
    • software is simpler because all content is consistent (i.e. events with particular structure from that one monolog)
    • simpler deployments e.g. just replay all the content when an Elasticsearch analyzer changes instead of making in-place replacements
    • can track updates to each individual piece of content easily
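
A rough sketch of the Denormaliser idea (Python; the event shapes are invented, not NYT’s actual schema): it consumes the normalised monolog, keeps the latest version of every asset locally, and republishes a self-contained article bundle whenever the article or a referenced asset changes. Keying output by article id keeps each article’s versions in order on one partition.

```python
# Sketch of a NYT-style Denormaliser: normalised events in,
# self-contained article events out. Event shapes are invented.
class Denormaliser:
    def __init__(self):
        self.assets = {}    # latest version of every asset (images, tags)
        self.articles = {}  # latest version of every article
        self.out = []       # the denormalised log this app produces

    def consume(self, event):
        if event["kind"] == "asset":
            self.assets[event["id"]] = event
            # republish every article that references this asset
            for art in self.articles.values():
                if event["id"] in art["refs"]:
                    self.publish(art)
        elif event["kind"] == "article":
            self.articles[event["id"]] = event
            self.publish(event)

    def publish(self, article):
        # One event containing the article AND everything it references,
        # keyed by article id so all its versions land on one partition.
        self.out.append({
            "key": article["id"],
            "article": article["body"],
            "assets": {r: self.assets[r]["body"] for r in article["refs"]},
        })

d = Denormaliser()
d.consume({"kind": "asset", "id": "img1", "body": "photo-v1"})
d.consume({"kind": "article", "id": "a1", "body": "text-v1", "refs": ["img1"]})
d.consume({"kind": "asset", "id": "img1", "body": "photo-v2"})  # republish
print(len(d.out))                   # 2 versions of article a1
print(d.out[-1]["assets"]["img1"])  # 'photo-v2'
```

Consumers of the denormalised log never need to join assets themselves; they just take the latest bundle per key.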

## “Turning the database inside out with Apache Samza”

  • typically web app = stateless client + stateless backend + shared global mutable DB
    • but we try to get rid of shared global mutable state in code everywhere else
  • alternatives can be derived by looking at 4 common use cases:
    1. Replication
      • update cart set qty = 3 where customer_id = 123 and product_id = 999
      • write query to leader
      • then replicate:
        • update row 8765 old = [123, 999, 1] new = [123, 999, 3]
      • original update query was imperative
      • replication update is an immutable fact: the customer 123 changed the row in their cart for product 999 from 1->3
    2. Secondary indexes
      • CREATE INDEX ON cart(customer_id); CREATE INDEX ON cart(product_id)
      • DB creates extra data structure which is a different representation of the same data
        • usually some sort of key-value store
        • automatically updates values
          • in a transactional way so index is always consistent
        • no extra thought required from the user, the DB just knows how to do it
      • giant indexes can be done concurrently
        • CREATE INDEX CONCURRENTLY
        • DB also just knows how to do this
    3. caching
      • possible to do this in application layer
        • r = cache.get(key); if !r { r = db.get(key); cache.put(key, r) }
          • invalidation: how do you know you have to update the cache?
          • race conditions
          • cold starts where DB gets overwhelmed because nothing is cached
      • contrast this with ease of CREATE INDEX CONCURRENTLY
    4. Materialized view
      • DB view that is persisted on disk
      • differences with application level caching:
        • more flexibility with application level caching
        • much easier to create materialized view (CREATE MATERIALIZED VIEW)
        • doesn’t take all the load off the DB
  • All 4 above are derived data. Differ in how well they work:
    1. replication: mature
    2. secondary indexes: really good
    3. caching: complete mess
    4. materialized view: meh
  • can we rethink materialized view so that it works better
    • make replication stream a first class concept
    • can write to it efficiently (always just append)
    • can’t query randomly
    • build materialized state using replication stream
      • Apache Samza does this
        • can build materialized view
          • in parallel (so it’s fast)
          • compaction (save disk space)
          • consistent with the base data set
          • immune to race conditions
      • AKA Kappa architecture
  • Benefits:
    1. Better data
      • RDBMSes conflate concerns of reader and writer
      • writes are easy because log is easy to write to
      • reading can be from materialised views that are optimised for whatever reader wants to do
        • as denormalised as you like
        • denormalised data normally has the issue of how/when to update it based on its dependencies
        • having a pure function that takes stream and creates materialized view eliminates uncertainty
      • events are useful for analytics
        • not just “cart has X”, but also the user added X then removed X then added it again
      • historical point-in-time queries
      • recovery from human error
    2. Fully precomputed caches
      • no cold start with lots of missing keys
      • no invalidation
        • logic for updating view based on stream is nicely encapsulated, doesn’t infest business logic everywhere in the system
      • no race conditions
        • stream processors operate sequentially
        • still parallelisable
    3. Streams everywhere
      • instead of request/response, do subscribe/notify
      • subscribe/notify easier with streaming data
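
The "materialized view built from the replication stream" idea can be sketched as (Python; table and field names invented): the stream carries immutable facts (old and new row values), and a pure function folds them into a view, so there is no invalidation logic, no cold start, and replaying the stream always gives the same view.

```python
# Sketch: replication stream of immutable facts -> materialized cart view.
# Field names are invented for illustration.
stream = [
    # (customer_id, product_id, old_qty, new_qty) -- immutable facts,
    # like the replication events in the talk
    (123, 999, None, 1),   # customer 123 added product 999
    (123, 999, 1, 3),      # then changed the quantity 1 -> 3
    (123, 777, None, 2),
    (123, 777, 2, None),   # removed product 777 again
]

def materialize(stream):
    """Pure function from stream to view: the view is always consistent
    with the base facts, with no invalidation and no race conditions."""
    view = {}
    for customer, product, old, new in stream:
        if new is None:
            view.pop((customer, product), None)
        else:
            view[(customer, product)] = new
    return view

view = materialize(stream)
print(view)  # {(123, 999): 3} -- product 777 was added then removed

# Analytics keeps the full history: not just "cart has 999 x3", but
# also that 777 was added and then removed along the way.
removed = [(c, p) for c, p, old, new in stream if new is None]
print(removed)  # [(123, 777)]
```

The update logic lives in one fold function instead of being scattered through business logic, which is exactly the "no invalidation" benefit above.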