
@joeesteves
Last active July 10, 2024 19:55

Malga - Make Audit log Great Again

tl;dr

As part of our effort to optimize the KafkaEx configuration under the Make Kafka Great initiative, we identified excessive network resource usage by the API consumers. During deployment windows, message consumption becomes noticeably unstable: clients disconnect from and reconnect to Kafka frequently, and each reconnection triggers a consumer group rebalance. This disrupts consumption and extends the time Kafka needs to reach a stable state.


Source: Datadog Network dashboard

The initiative explores applying Domain Oriented Service Architecture (DOSA) to peripheral consumers/services. Extracting these consumers into standalone services promises enhanced stability and reliability by preventing Kafka disconnections during deployments. This separation also allows for independent scaling and monitoring, boosting system performance and efficiency.

Introduction

What if we could apply DOSA (Domain Oriented Service Architecture) to peripheral consumers/services?

By extracting these consumers as standalone services, we could achieve greater stability and reliability since they wouldn’t disconnect from Kafka during each deployment. Separating the consumer services would also provide us with more granular control, enabling independent scaling and monitoring. This approach could significantly enhance our system's performance and efficiency.

Problem

Many of our Kafka consumers are tightly coupled to our deployment cycle, so they disconnect on every deployment. This leads to unnecessary downtime spent on reconnection and rebalancing. Although some consumers are not critical to our core business, they share its deployment, so any bug or issue there impacts them directly (high blast radius). Additionally, making changes to non-core services within our development cycle is slower and riskier than in more loosely coupled services (e.g., Jamstack deployments take ~2 minutes vs. ~20 minutes for API/RS).

Source: Datadog Network Monitor, after deploying API to staging

Solution

By extracting consumers into separate services with clear responsibilities and reduced scope, we align with our DOSA architecture vision. This should reduce the blast radius of failures and shorten our time to production. Independent consumer workers give us greater control over scaling and observability. Additionally, this initiative is an opportunity to establish a robust audit log while defining a reusable pattern for extracting consumers from APIs.

Architecture

graph TD;
    direction TB
    subgraph MALGA Diagram
        direction TB
        SR[Express App] -->|Serves REST endpoints| FE[React App]
        SR <-.->|Queries Messages from PostgreSQL| DB[(PostgreSQL)]
        subgraph MASR[MALGA SERVER]
            FE
            SR
            WS(Web Socket)
        end
        subgraph CO[Scalable Consumers]
            W1[Worker-1-KafkaJS]
            W2[Worker-2-KafkaJS]
            W3[Worker-3-KafkaJS]

        end
        CO -->|Pulls messages from topic audit.messages.v1| KA[(KAFKA)]
        CO -->|Process & Stores messages in PostgreSQL| DB
        CO -->|Broadcast changes| WS
        WS -->|Updates RealTime| FE
        subgraph PR[Producers]
            RS
            API
            Others
        end
        PR -->|Produces messages to topic audit.messages.v1| KA
    end

    style SR fill:#bc8f8f,stroke:#333,stroke-width:2px
    style FE fill:#bc8f8f,stroke:#333,stroke-width:2px
    style W1 fill:#bc8f8f,stroke:#333,stroke-width:2px
    style W2 fill:#bc8f8f,stroke:#333,stroke-width:2px
    style W3 fill:#bc8f8f,stroke:#333,stroke-width:2px
    style DB fill:#6e7f99,stroke:#333,stroke-width:2px
    style KA fill:#6e7f99,stroke:#333,stroke-width:2px
    style CO fill:#d2b48c,stroke:#333,stroke-width:2px;
    style PR fill:#9ec481,stroke:#333,stroke-width:2px;
    style MASR fill:#54b297,stroke:#333,stroke-width:2px;
    style WS fill:#bc8f8f,stroke:#333,stroke-width:2px
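The worker path in the diagram above (consume, store, broadcast) could be sketched in TypeScript roughly as follows. This is a minimal illustration, not the project's actual code: `AuditMessage`, `MessageStore`, `Broadcaster`, `handleAuditMessage`, and the event name are hypothetical, and the in-memory classes stand in for PostgreSQL and the WebSocket server.

```typescript
// Hypothetical shape of an audit message; the real schema is not given in this document.
interface AuditMessage {
  id: string;
  actor: string;
  action: string;
}

// Stand-in for the PostgreSQL layer.
interface MessageStore {
  save(message: AuditMessage): void;
}

// Stand-in for the WebSocket server.
interface Broadcaster {
  broadcast(eventName: string, payload: AuditMessage): void;
}

class InMemoryStore implements MessageStore {
  readonly rows: AuditMessage[] = [];
  save(message: AuditMessage): void {
    this.rows.push(message);
  }
}

class RecordingBroadcaster implements Broadcaster {
  readonly sent: Array<{ eventName: string; payload: AuditMessage }> = [];
  broadcast(eventName: string, payload: AuditMessage): void {
    this.sent.push({ eventName, payload });
  }
}

// Type guard: accepts only objects with the three string fields assumed above.
function isAuditMessage(value: unknown): value is AuditMessage {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return typeof v.id === "string" && typeof v.actor === "string" && typeof v.action === "string";
}

// Parse the raw Kafka value, store it, then notify connected clients.
// Returns false (and stores nothing) when the payload is not a valid message.
function handleAuditMessage(raw: string, store: MessageStore, ws: Broadcaster): boolean {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return false;
  }
  if (!isAuditMessage(parsed)) return false;
  store.save(parsed);
  ws.broadcast("audit.message.created", parsed);
  return true;
}
```

Passing the store and broadcaster in as interfaces keeps the handler testable without a running database or socket server, which fits the goal of small consumers with a narrow scope.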

Goals

  1. Enhance consumer stability.
  2. Optimize the use of network resources.
  3. Minimize disconnections during deployment, ensuring reconnections only occur when changes or deployments affect the specific consumer.

Additional Benefits

  1. As our integration with data science deepens, the audit log will become markedly more important. An interactive reporting feature should boost both efficiency and user satisfaction. 😃
  2. Adoption of a modern stack (compatible with the latest Kafka) with a strongly typed language, enhancing peace of mind for our engineering team.
  3. Alignment with our DOSA vision for scalable and resilient architecture.
  4. Accelerated development cycles, reducing time-to-market for new features and updates.
graph TD;
    subgraph Benefits
        direction TB
        F1[1. Flexible Audit Log Report]
        F2[2. Modern Stack]
        F3[3. Aligned with DOSA ]
        F4[4. Reduce time-to-production ]

    end

    style F1 fill:#ffb3ba,stroke:#333,stroke-width:2px
    style F2 fill:#baffc9,stroke:#333,stroke-width:2px
    style F3 fill:#bae1ff,stroke:#333,stroke-width:2px
    style F4 fill:#baffc9,stroke:#333,stroke-width:2px

About Kafka Libraries

Feature descriptions (based on Sasa Juric's talk)

Incremental Rebalance

Incremental rebalance is a feature in Apache Kafka that aims to optimize the rebalancing process of consumer group members when changes occur, such as when new consumers join the group or existing ones leave. Rebalancing refers to the process by which Kafka redistributes partitions among consumer group members to ensure that each partition is consumed by only one consumer within the group.

Before incremental rebalance, any change in the consumer group (e.g., a new consumer joining or an existing one leaving) triggered a full rebalance. In a full rebalance, every consumer gives up all of its partitions and waits for a new assignment, even if it ends up with the same partitions as before. This results in unnecessary partition reassignments and processing pauses, especially in large consumer groups or when changes are minor.

With incremental rebalance, Kafka computes the minimal set of changes required by the actual change in group membership. If only one consumer joins or leaves the group, only the partitions that need to move are revoked and reassigned, while the other consumers keep processing the partitions they already own. This reduces the work done during rebalancing, resulting in faster rebalances and less impact on overall system performance.

Incremental rebalance helps improve the efficiency and scalability of Kafka consumer groups, especially in scenarios where consumer group membership changes frequently or when dealing with large numbers of partitions and consumers.
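The difference between the two protocols can be illustrated with a small simulation. This is a sketch under stated assumptions: `stickyAssign` is a deliberately simplified assignor written for this example (not Kafka's cooperative sticky assignor), and the worker names and partition count are made up.

```typescript
// partition number -> consumer id
type Assignment = Record<number, string>;

// Simplified sticky assignment: a partition stays with its current owner when
// that owner is still in the group and under its quota; the rest go to the
// least-loaded member.
function stickyAssign(current: Assignment, partitions: number[], members: string[]): Assignment {
  const next: Assignment = {};
  if (members.length === 0) return next;
  const quota = Math.ceil(partitions.length / members.length);
  const load: Record<string, number> = {};
  for (const m of members) load[m] = 0;
  const unassigned: number[] = [];
  for (const p of partitions) {
    const owner = current[p];
    if (owner !== undefined && load[owner] !== undefined && load[owner] < quota) {
      next[p] = owner;
      load[owner] += 1;
    } else {
      unassigned.push(p);
    }
  }
  for (const p of unassigned) {
    let target = members[0];
    for (const m of members) {
      if (load[m] < load[target]) target = m;
    }
    next[p] = target;
    load[target] += 1;
  }
  return next;
}

// Full (eager) rebalance: every member gives up everything it owns.
function revokedEager(current: Assignment): number {
  return Object.keys(current).length;
}

// Incremental rebalance: only partitions whose owner changes are revoked.
function revokedIncremental(current: Assignment, next: Assignment): number {
  let count = 0;
  for (const key of Object.keys(current)) {
    const p = Number(key);
    if (current[p] !== next[p]) count += 1;
  }
  return count;
}

// Six partitions shared by two workers, then a third worker joins.
const topicPartitions = [0, 1, 2, 3, 4, 5];
const initial = stickyAssign({}, topicPartitions, ["worker-1", "worker-2"]);
const rebalanced = stickyAssign(initial, topicPartitions, ["worker-1", "worker-2", "worker-3"]);

console.log("revoked in a full rebalance:", revokedEager(initial));
console.log("revoked in an incremental rebalance:", revokedIncremental(initial, rebalanced));
```

In this scenario the full rebalance revokes all six partitions, while the incremental one revokes only the two that move to the new worker.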

Static Membership

Static membership is a feature introduced in Apache Kafka to enhance the resilience and efficiency of consumer groups. With the default (dynamic) membership, every time a consumer instance leaves or joins a group it triggers a rebalance, during which Kafka redistributes partitions among the remaining consumers. This process incurs overhead, especially in large deployments or when consumer instances join or leave frequently.

Static membership addresses this by giving each consumer instance a persistent identity (the group.instance.id setting in the Java client). The group coordinator remembers the partitions assigned to that identity, so a consumer that disconnects or restarts and rejoins before its session timeout expires gets its previous assignment back without triggering a rebalance.

Key features of static membership include:

  • Persistent Group Membership: Consumers maintain their membership in a consumer group across restarts or temporary disconnections. This allows them to retain their partition assignments without triggering unnecessary rebalances.
  • Session Timeout: Consumers have a session timeout during which they must send heartbeats to the broker to maintain their membership. If a consumer fails to send heartbeats within the session timeout period, the broker may consider it as disconnected and trigger a rebalance.
  • Stable Instance Identity: Each consumer is configured with a unique, stable instance ID that it presents when joining the group. The coordinator uses this ID to recognize a returning member and hand back its previous partitions, which minimizes unnecessary partition movements.

Static membership improves the stability and efficiency of consumer groups in Kafka, particularly in deployments with frequent consumer restarts or network disruptions. It reduces the overhead associated with rebalancing by allowing consumers to maintain their partition assignments across sessions, resulting in faster recovery times and reduced impact on overall system performance.
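The effect of static membership on a deployment restart can be shown with a toy model. `ToyCoordinator` is a hypothetical class written for this example; Kafka's real group coordinator is considerably more involved, and the 30-second timeout is an arbitrary value.

```typescript
// Toy model of a group coordinator that tracks static members by instance id.
class ToyCoordinator {
  rebalances = 0;
  private readonly sessionTimeoutMs: number;
  // instance id -> time (ms) the member was last seen
  private lastSeen: Record<string, number> = {};

  constructor(sessionTimeoutMs: number) {
    this.sessionTimeoutMs = sessionTimeoutMs;
  }

  // A member joins, or rejoins, under its configured instance id.
  join(instanceId: string, nowMs: number): void {
    const known = this.lastSeen[instanceId] !== undefined;
    if (!known) this.rebalances += 1; // unknown member: partitions must be redistributed
    this.lastSeen[instanceId] = nowMs;
  }

  heartbeat(instanceId: string, nowMs: number): void {
    if (this.lastSeen[instanceId] !== undefined) this.lastSeen[instanceId] = nowMs;
  }

  // Expire members whose session timed out; each expiry forces a rebalance.
  tick(nowMs: number): void {
    for (const id of Object.keys(this.lastSeen)) {
      if (nowMs - this.lastSeen[id] > this.sessionTimeoutMs) {
        delete this.lastSeen[id];
        this.rebalances += 1;
      }
    }
  }
}

const coordinator = new ToyCoordinator(30000);
coordinator.join("worker-1", 0); // first join: rebalance
coordinator.join("worker-2", 0); // first join: rebalance

// worker-1 restarts during a deploy and is back after 10 s,
// which is inside the session timeout.
coordinator.tick(10000);
coordinator.join("worker-1", 10000); // same instance id: no rebalance

console.log("rebalances so far:", coordinator.rebalances);
```

The restart costs no rebalance because the member returns under the same ID before its session expires. A restart that takes longer than the session timeout would still cause one rebalance when the member expires and another when it rejoins.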

Drain

In the context of Apache Kafka, "drain" typically refers to the process of consuming or processing all the messages available in a Kafka topic or partition. This term is often used when discussing Kafka consumers or when managing Kafka streams or pipelines.

Here are a few scenarios where "drain" might be used:

  • Consumer Group Draining: In consumer applications, "draining" can refer to the process of ensuring that all messages in a Kafka topic are consumed before shutting down the consumer group. This might involve stopping the consumption loop once all messages have been processed or setting up a mechanism to monitor the end of the topic.
  • Stream Processing: When working with Kafka Streams or other stream processing frameworks, "draining" might refer to processing all records currently in a stream or a partition before stopping the processing job.
  • Rebalancing and Partition Reassignment: During consumer group rebalancing, partitions may be reassigned to different consumers. In some cases, it's important to ensure that consumers finish processing any messages they have already fetched from their assigned partitions before they are reassigned to another consumer. This process is sometimes referred to as "draining" the partitions.

In essence, "drain" in Kafka refers to the orderly consumption or processing of messages to ensure that no messages are left unprocessed before shutting down or transitioning consumers, processing jobs, or partitions. It's a crucial aspect of maintaining data integrity and consistency in Kafka-based systems.
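A drain on shutdown might look like the sketch below. `DrainableWorker` is a hypothetical, synchronous toy model: it only tracks which messages were fetched and which were processed, and does not talk to Kafka.

```typescript
// Toy worker that separates "fetched" from "processed" messages.
class DrainableWorker {
  readonly processed: string[] = [];
  private buffer: string[] = [];
  private stopping = false;

  // Called by the fetch loop; refuses new messages once shutdown has begun.
  offer(message: string): boolean {
    if (this.stopping) return false;
    this.buffer.push(message);
    return true;
  }

  // Process one buffered message, if there is one.
  processNext(): boolean {
    const message = this.buffer.shift();
    if (message === undefined) return false;
    this.processed.push(message);
    return true;
  }

  // Drain: stop accepting new messages, then finish everything already fetched.
  // Returns how many messages were processed during the drain.
  shutdown(): number {
    this.stopping = true;
    let drained = 0;
    while (this.processNext()) drained += 1;
    return drained;
  }
}
```

In a Node worker, `shutdown()` would typically be called from a `SIGTERM` handler before the process exits, so that messages already fetched are not left half-handled when a deployment replaces the instance.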

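| Feature | Brod (Elixir/Erlang) | KafkaEx (Elixir/Erlang) | erlkaf (Elixir/Erlang) | kafkajs (Node) | node-rdkafka (Node) |
| --- | --- | --- | --- | --- | --- |
| Incremental Rebalance | ❌ | ❌ | ❔ | ❌ | ❔ |
| Static Membership | ❌ | ❌ | ❔ | ✅ | ❔ |
| Standalone Consumer | ❌ | ✅ | ❔ | ❌ | ❌ |
| BackPressure | ✅ | ❌ | ❌ | ✅ with max-inflight | ✅ with streams |
| Drain | ❌ | ❌ | ✅ | ✅ | ❔ If shutdown with SIGTERM (to be confirmed) |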
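The BackPressure row refers to limiting how much work a consumer takes on at once. One way to picture the "max in-flight" idea is the counter below. `InFlightLimiter` is a hypothetical helper written for this example and is not part of any of the libraries in the table.

```typescript
// Caps the number of messages being processed at the same time.
class InFlightLimiter {
  private inFlight = 0;
  private readonly limit: number;

  constructor(limit: number) {
    this.limit = limit;
  }

  // Returns true when there is capacity; the caller may then start processing.
  // Returns false when the limit is reached; the caller should pause fetching.
  tryAcquire(): boolean {
    if (this.inFlight >= this.limit) return false;
    this.inFlight += 1;
    return true;
  }

  // Called when a message finishes processing, freeing one slot.
  release(): void {
    if (this.inFlight > 0) this.inFlight -= 1;
  }

  current(): number {
    return this.inFlight;
  }
}
```

A fetch loop would call `tryAcquire()` before handing a message to a handler and `release()` when the handler finishes, so a slow database slows down consumption instead of letting messages pile up in memory.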

Demo

demo.mp4

Development set-up

  1. clone the repo
  2. pnpm install
  3. create a .env file with the following content
NODE_ENV=development

API_KEY=#RS API KEY
#REAL SERVER LOCAL URL
API_URL=http://localhost:3000/graphql
#LOCAL DATABASE URL
DATABASE_URL=postgresql://user:psw@0.0.0.0:5433/malga
#LOCAL KAFKA BROKERS
KAFKA_BROKERS=kafka:9092
KAFKA_CONSUMER_GROUP_ID=message.audit.v1.consumer.dev
PORT=4000
WS_URL=ws://localhost:4000
  4. Ensure your database server has a malga database
  5. Start the app with pnpm devc
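The variables in the .env file above could be validated once at startup so that a missing value fails fast. The sketch below is an assumption about how that might be done, not the project's actual code: `MalgaConfig`, `requiredVar`, and `loadConfig` are hypothetical names, and the environment is passed in as a plain object to keep the example self-contained.

```typescript
interface MalgaConfig {
  nodeEnv: string;
  apiKey: string;
  apiUrl: string;
  databaseUrl: string;
  kafkaBrokers: string[];
  kafkaConsumerGroupId: string;
  port: number;
  wsUrl: string;
}

type Env = Record<string, string | undefined>;

// Read one variable, failing loudly when it is missing or blank.
function requiredVar(env: Env, key: string): string {
  const value = env[key];
  if (value === undefined || value.trim() === "") {
    throw new Error(`Missing required variable: ${key}`);
  }
  return value.trim();
}

function loadConfig(env: Env): MalgaConfig {
  const port = Number(requiredVar(env, "PORT"));
  if (isNaN(port) || Math.floor(port) !== port || port < 1 || port > 65535) {
    throw new Error("PORT must be an integer between 1 and 65535");
  }
  // KAFKA_BROKERS may hold a comma-separated list of host:port pairs.
  const kafkaBrokers = requiredVar(env, "KAFKA_BROKERS")
    .split(",")
    .map((broker) => broker.trim())
    .filter((broker) => broker.length > 0);
  if (kafkaBrokers.length === 0) {
    throw new Error("KAFKA_BROKERS must list at least one broker");
  }
  return {
    nodeEnv: requiredVar(env, "NODE_ENV"),
    apiKey: requiredVar(env, "API_KEY"),
    apiUrl: requiredVar(env, "API_URL"),
    databaseUrl: requiredVar(env, "DATABASE_URL"),
    kafkaBrokers,
    kafkaConsumerGroupId: requiredVar(env, "KAFKA_CONSUMER_GROUP_ID"),
    port,
    wsUrl: requiredVar(env, "WS_URL"),
  };
}
```

In a Node process this would be called as `loadConfig(process.env)` after the .env file has been loaded.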