joeesteves/malga.md

## malga.md

      
    Raw
  

              malga.md
            
          
    Malga - Make Audit log Great Again

tl;dr;

In our effort to optimize KafkaEx configuration under the Make Kafka Great
initiative, we've identified excessive network resource usage by API consumers.
During deployment windows, network instability in message consumption is notable
due to frequent client disconnections, reconnects to Kafka triggering groups
rebalancing. This disrupts stability and extends the time to achieve a stable
Kafka state.


Source: DD Network dashboard 
The initiative explores applying Domain
Oriented Service Architecture (DOSA) to peripheral consumers/services.
Extracting these consumers into standalone services promises enhanced stability
and reliability by preventing Kafka disconnections during deployments. This
separation also allows for independent scaling and monitoring, boosting system
performance and efficiency.

 
   Introduction

What if we could apply DOSA (Domain Oriented Service Architecture) to peripheral consumers/services?

   By extracting these consumers as standalone services, we could achieve greater stability and reliability since they wouldn’t disconnect from Kafka during each deployment. Separating the consumer services would also provide us with more granular control, enabling independent scaling and monitoring. This approach could significantly enhance our system's performance and efficiency.
   
 
     Problem

Many of our Kafka Consumers are tightly coupled to our deployment cycle, causing
them to disconnect with each deployment. This leads to unnecessary downtime
spent on reconnection and rebalancing. While some consumers are not critical to
our core business, any bugs or issues directly impact them due to their
wide-reaching effects (high blast radius). Additionally, making changes to
non-core services within our development cycle is slower and riskier compared to
more loosely coupled services (e.g., Jamstack deployments take ~2 minutes vs
API/RS ~20 minutes).

Source: DataDog - NetworkMonitor after deploying API to staging

  
     Solution

By extracting consumers into separate services with clear responsibilities and
reduced scope, we align with our DOSA architecture vision. This strategic move
will effectively reduce the blast radius in case of failures and significantly
accelerate our time to production. Independent consumer workers will provide
greater control over escalation and observability. Additionally, this initiative
presents an opportunity to establish a robust audit log while defining a
reusable pattern for extracting consumers from APIs.

  
     Architecture


      graph TD;
    direction TB
    subgraph MALGA Diagram
        direction TB
        SR[Express App] -->|Serves REST endpoints| FE[React App]
        SR <-.->|Queries Messages from PostgreSQL| DB[(PostgreSQL)]
        subgraph MASR[MALGA SERVER]
            FE
            SR
            WS(Web Socket)
        end
        subgraph CO[Scalable Consumers]
            W1[Worker-1-KafkaJS]
            W2[Worker-2-KafkaJS]
            W3[Worker-3-KafkaJS]

        end
        CO -->|Pulls messages from topic audit.messages.v1| KA[(KAFKA)]
        CO -->|Process & Stores messages in PostgreSQL| DB
        CO -->|Brodcast changes| WS
        WS -->|Updates RealTime| FE
        subgraph PR[Producers]
            RS
            API
            Others
        end
        PR -->|Produces messages to topic audit.messages.v1| KA
    end

    style SR fill:#bc8f8f,stroke:#333,stroke-width:2px
    style FE fill:#bc8f8f,stroke:#333,stroke-width:2px
    style W1 fill:#bc8f8f,stroke:#333,stroke-width:2px
    style W2 fill:#bc8f8f,stroke:#333,stroke-width:2px
    style W3 fill:#bc8f8f,stroke:#333,stroke-width:2px
    style DB fill:#6e7f99,stroke:#333,stroke-width:2px
    style KA fill:#6e7f99,stroke:#333,stroke-width:2px
    style CO fill:#d2b48c,stroke:#333,stroke-width:2px;
    style PR fill:#9ec481,stroke:#333,stroke-width:2px;
    style MASR fill:#54b297,stroke:#333,stroke-width:2px;
    style WS fill:#bc8f8f,stroke:#333,stroke-width:2px

    
      Loading

  
     Goals

     
       Enhance consumer stability.
       Optimize the use of network resources.
       Minimize disconnections during deployment, ensuring reconnections only occur when changes or deployments affect the specific consumer.
     
  
     Additional Benefits

    
      As our integration with data science deepens, the significance of our audit log will markedly increase. Introducing an interactive reporting feature will not only boost efficiency but also elevate user satisfaction and overall happiness. 😃
      Adoption of a modern stack (compatible with the latest Kafka) with a strongly typed language, enhancing peace of mind for our engineering team.
      Alignment with our DOSA vision for scalable and resilient architecture.
      Accelerated development cycles, reducing time-to-market for new features and updates.
    

      graph TD;
    subgraph Benefits
        direction TB
        F1[1. Flexible Audit Log Report]
        F2[2. Modern Stack]
        F3[3. Aligned with DOSA ]
        F4[3. Reduce time-to-production ]

    end

    style F1 fill:#ffb3ba,stroke:#333,stroke-width:2px
    style F2 fill:#baffc9,stroke:#333,stroke-width:2px
    style F3 fill:#bae1ff,stroke:#333,stroke-width:2px
    style F4 fill:#baffc9,stroke:#333,stroke-width:2px

    
      Loading

  
     About Kafka Libraries

      
        Features descriptions (based on Sasa Juric talk)

Incremental Rebalance

Incremental rebalance is a feature in Apache Kafka that aims to optimize the rebalancing process of consumer group members when changes occur, such as when new consumers join the group or existing ones leave. Rebalancing refers to the process by which Kafka redistributes partitions among consumer group members to ensure that each partition is consumed by only one consumer within the group.
Before incremental rebalance, when any change occurred in the consumer group (e.g., a new consumer joined or an existing one left), Kafka would trigger a full rebalance. In a full rebalance, Kafka would reassign all partitions to all consumers in the group, regardless of whether they had changed or not. This could result in unnecessary partition reassignments and increased processing overhead, especially in large consumer groups or when changes were minor.
With incremental rebalance, Kafka leverages a more efficient algorithm that computes the minimal set of changes required to rebalance the group based on the actual changes in the group membership. This means that if only one consumer joins or leaves the group, Kafka will only reassign partitions for that specific consumer, rather than for all consumers in the group. This reduces the amount of work required during rebalancing, resulting in faster rebalances and reduced impact on overall system performance.
Incremental rebalance helps improve the efficiency and scalability of Kafka consumer groups, especially in scenarios where consumer group membership changes frequently or when dealing with large numbers of partitions and consumers.
Static Membership

A feature introduced in Apache Kafka to enhance the resilience and efficiency of consumer groups. In traditional consumer group implementations, when a consumer instance leaves or joins a consumer group, it triggers a rebalance, during which Kafka redistributes partitions among the remaining consumers. This process incurs some overhead, especially in large deployments or when consumer instances frequently join or leave the group.
Static membership addresses this issue by allowing consumer instances to maintain persistent, long-lived connections with the broker. With static membership, consumers register with the broker and maintain their assignment of partitions across rebalances, even if they temporarily disconnect or restart. When a consumer rejoins a group after a disconnection, it can reestablish its previous assignment without triggering a full rebalance.
Key features of static membership include:

  Persistent Group Membership: Consumers maintain their membership in a consumer group across restarts or temporary disconnections. This allows them to retain their partition assignments without triggering unnecessary rebalances.
  Session Timeout: Consumers have a session timeout during which they must send heartbeats to the broker to maintain their membership. If a consumer fails to send heartbeats within the session timeout period, the broker may consider it as disconnected and trigger a rebalance.
  Explicit Assignment Protocol: Consumers use an explicit assignment protocol to negotiate partition assignments with the broker. This allows them to specify their desired set of partitions, which helps minimize unnecessary partition movements during rebalances.

Static membership improves the stability and efficiency of consumer groups in Kafka, particularly in deployments with frequent consumer restarts or network disruptions. It reduces the overhead associated with rebalancing by allowing consumers to maintain their partition assignments across sessions, resulting in faster recovery times and reduced impact on overall system performance.
Drain

In the context of Apache Kafka, "drain" typically refers to the process of consuming or processing all the messages available in a Kafka topic or partition. This term is often used when discussing Kafka consumers or when managing Kafka streams or pipelines.
Here are a few scenarios where "drain" might be used:

  Consumer Group Draining: In consumer applications, "draining" can refer to the process of ensuring that all messages in a Kafka topic are consumed before shutting down the consumer group. This might involve stopping the consumption loop once all messages have been processed or setting up a mechanism to monitor the end of the topic.
  Stream Processing: When working with Kafka Streams or other stream processing frameworks, "draining" might refer to processing all records currently in a stream or a partition before stopping the processing job.
  Rebalancing and Partition Reassignment: During consumer group rebalancing, partitions may be reassigned to different consumers. In some cases, it's important to ensure that consumers finish processing any messages they have already fetched from their assigned partitions before they are reassigned to another consumer. This process is sometimes referred to as "draining" the partitions.

In essence, "drain" in Kafka refers to the orderly consumption or processing of messages to ensure that no messages are left unprocessed before shutting down or transitioning consumers, processing jobs, or partitions. It's a crucial aspect of maintaining data integrity and consistency in Kafka-based systems.
      
   
         Elixir/Erlang
         Node
       
       
         Brod
         KafkaEx
         erlkak
         kafkajs
         node-rdkafka
       
     
         Incremental Rebalance
         ❌
         ❌
         ❔
         ❌
         ❔
       
       
         Static Membership
         ❌
         ❌
         ❔
         ✅
         ❔
       
       
         Standalone Consumer
         ❌
         ✅
         ❔
         ❌
         ❌
       
       
         BackPressure
         ✅
         ❌
         ❌
         ✅ with max-inflight
         ✅ with streams
       
       
         Drain
         ❌
         ❌
         ✅
         ✅
         ❔ If shutdown with SIGTERM (to be confirmed)
       
     
Demo


    demo.mp4
    
  
   Development set-up

 
clone the repo
pnpm install
create a .env file with the following content

NODE_ENV=development

API_KEY=#RS API KEY
#REAL SERVER LOCAL URL
API_URL=http://localhost:3000/graphql
#LOCAL DATABASE URL
DATABASE_URL=postgresql://user:psw@0.0.0.0:5433/malga
#LOCAL KAFKA BROKERS
KAFKA_BROKERS=kafka:9092
KAFKA_CONSUMER_GROUP_ID=message.audit.v1.consumer.dev
PORT=4000
WS_URL=ws://localhost:4000


Ensure your DB have a malga DATABASE
To start pnpm devc
	Elixir/Erlang			Node
	Brod	KafkaEx	erlkak	kafkajs	node-rdkafka
Incremental Rebalance	❌	❌	❔	❌	❔
Static Membership	❌	❌	❔	✅	❔
Standalone Consumer	❌	✅	❔	❌	❌
BackPressure	✅	❌	❌	✅ with max-inflight	✅ with streams
Drain	❌	❌	✅	✅	❔ If shutdown with SIGTERM (to be confirmed)