Skip to content

Instantly share code, notes, and snippets.

@rupeshtiwari
Last active April 25, 2024 14:21
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save rupeshtiwari/4f5433fa00dc71e071fccb6c140a74b3 to your computer and use it in GitHub Desktop.
Save rupeshtiwari/4f5433fa00dc71e071fccb6c140a74b3 to your computer and use it in GitHub Desktop.
Data Analytics Late Arrival Handling

Techniques to handle late arrival

Watermarks and Allowed Lateness are both vital techniques in managing late data in stream processing systems, but they serve slightly different purposes and are often used in conjunction to maximize data integrity and processing efficiency. Here’s an in-depth look at when and why you might choose to use each technique, or both together, along with real-world industry examples.

Watermarks

Purpose: Watermarks are primarily used to handle out-of-order data. They provide a way to estimate the "completeness" of data up to a certain point in time, based on event timestamps.

When to Use: Use watermarks when:

  • You expect data to arrive out of order.
  • You need a mechanism to know when to close a window and process its data.
  • You are processing data that relies heavily on event times for windows (such as logs or user interactions captured in real-time).

Example: In financial trading platforms, transactions and market data might come from different systems with slight delays. A watermark can delay processing until the system is reasonably sure that all data up to a certain timestamp has been received, ensuring that financial calculations like moving averages or stock tick analyses are accurate.

Allowed Lateness

Purpose: Allowed Lateness is used to include data that arrives after the window has technically closed but within an allowed time frame, ensuring that this data is still processed.

When to Use: Use allowed lateness when:

  • There are network or processing delays that cause data to arrive after its window has elapsed.
  • It’s critical not to lose any data that arrives late, as it still contributes valuable information to the aggregate results.

Example: In media streaming analytics, where viewer data (such as start and stop times) might be delayed due to device connectivity issues, allowed lateness ensures that late-arriving data still contributes to the viewership metrics for a specific show or event, even if it arrives after the initial real-time analysis window.

Combining Watermarks and Allowed Lateness

Best Practice: In many real-world applications, combining both watermarks and allowed lateness ensures robust data processing. This combination is especially useful in environments where data integrity and completeness are critical, and there is significant variability in when data might arrive.

How They Work Together:

  1. Watermarks signal when a window can be considered complete based on the typical order and timing of data arrivals.
  2. Allowed Lateness then handles any exceptions by allowing additional data to be processed if it arrives after the watermark has passed but within the specified lateness period.

Real-World Example: In ride-sharing apps like Uber or Lyft, real-time data processing is used for dynamic pricing and supply positioning. Here:

  • Watermarks manage when to calculate fares based on data that is expected to be complete up to a certain point.
  • Allowed Lateness permits late updates from drivers or riders (like trip cancellations or delays) to be factored into the analytics, ensuring that decisions are made on the most complete data set available.

Industry Best Practices

  1. Logging and Monitoring: Always implement comprehensive logging and monitoring of when watermarks are set and when late data arrives. This helps in tuning the parameters based on actual data flow patterns and system performance.

  2. Dynamic Adjustment: Some advanced systems dynamically adjust watermarks and the parameters for allowed lateness based on observed data characteristics and system load, optimizing both performance and data completeness.

  3. Fallback Strategies: Implement fallback strategies for cases when even allowed lateness cannot accommodate the late-arriving data, such as including it in a separate storage for later analysis or re-aggregation.

Real world example

Let's clarify the sequence of events in the 5:00-5:15 PM window for our ride-sharing scenario with precise timestamps for each event, followed by how late arrivals were handled. This will provide a clear timeline of actions and responses, ensuring all data is considered properly.

Detailed Timeline of Events:

Initial Events (5:00-5:15 PM Window):

  • 5:01 PM: 10 ride requests are made in Zone A.
  • 5:03 PM: 8 of these ride requests are accepted by drivers.
  • 5:10 PM: 2 additional ride requests are made.
  • 5:13 PM: 1 cancellation is reported for one of the initial 10 ride requests made at 5:01 PM.

Late Reports (After 5:15 PM):

Events Occurring in the 5:00-5:15 PM Window but Reported Late:

  • 5:18 PM: 2 acceptances are reported for the 2 additional ride requests made at 5:10 PM.
  • 5:22 PM: 1 cancellation is reported for a ride that was requested at 5:02 PM. This ride had been accepted earlier but the rider cancelled.

Explanation of Late Arrival Handling:

Each late report has an impact on the calculations and operations in the ride-sharing system. Here's how each late event is processed:

5:18 PM Late Acceptances:

  • Original Event Time: The rides were requested at 5:10 PM.
  • Reporting Time: The acceptances were reported at 5:18 PM.
  • Handling: Although these acceptances are reported after the typical end of the 5:00-5:15 PM window, they are included in this window's data because they fall within the allowed lateness period (which extends to 5:30 PM due to the 15-minute allowance).

5:22 PM Late Cancellation:

  • Original Event Time: The ride was requested at 5:02 PM.
  • Reporting Time: The cancellation was reported at 5:22 PM.
  • Handling: This cancellation affects an initially accepted ride. Although the report comes in after the window's end and after the watermark has passed, it is still processed within the 5:00-5:15 PM window due to the allowed lateness setting.

Revised Processing and Outcome:

  • Total Requests by 5:15 PM: 12 (10 original + 2 additional).
  • Successful Rides Before Late Reports: 7 (8 initial acceptances - 1 cancellation at 5:13 PM).
  • Effect of Late Reports:
    • Add Late Acceptances: +2 from 5:18 PM reports.
    • Subtract Late Cancellation: -1 from 5:22 PM report.
  • Final Count of Successful Rides for 5:00-5:15 PM Window: 8 (7 before late reports + 2 late acceptances - 1 late cancellation).

Conclusion:

The handling of late reports ensures that the ride-sharing platform maintains an accurate and up-to-date understanding of actual demand and supply conditions. Adjustments made due to late acceptances and cancellations directly influence fare adjustments and driver positioning strategies, ensuring that the dynamic pricing model reflects the true state of the market. This meticulous tracking and processing help optimize operations and customer satisfaction in real-time.

Real-World Implications of Processing Late Reports:

  1. Fare Adjustment:

    • After accounting for all late reports (both the acceptances and the cancellation), the final tally for the 5:00-5:15 PM window shows an adjusted number of successful rides. This accurate count ensures that dynamic pricing algorithms adjust fares based on true demand, not inflated or outdated figures.
  2. Driver Repositioning:

    • Understanding the actual ride completion rates and cancellations helps the system to direct drivers toward zones with unmet demand more accurately. For instance, if Zone A continues to show high demand even after some cancellations, more drivers can be directed there to capitalize on the higher fare rates.
  3. Driver Wait Times and Compensation:

    • In a highly realistic scenario, drivers waiting due to delayed ride acceptances or dealing with cancellations (like the one reported at 5:22 PM) need to be compensated. Systems might adjust to offer incentives or compensations for waits that lead to cancellations, ensuring driver satisfaction and retention.

Fare Adjustment Mechanism Explained

Fare adjustments in ride-sharing platforms are often based on a dynamic pricing model that reacts to real-time changes in demand and supply within specific areas or zones. The goal is to use price as a lever to balance supply and demand, thereby maximizing efficiency and earnings potential for drivers, while also managing wait times for riders.

Scenario Recap for 5:00-5:15 PM Window:

  • Initial Data:

    • 10 ride requests made.
    • 8 rides initially accepted.
    • 1 cancellation reported at 5:13 PM.
    • 2 additional requests made at 5:10 PM.
  • Late Reports:

    • 2 acceptances at 5:18 PM for the 5:10 PM requests.
    • 1 cancellation at 5:22 PM for a 5:02 PM request.

How Watermarks and Allowed Lateness Affect Fare Adjustment:

  1. Inclusion of Late Acceptances:

    • Before Late Report Handling: Prior to processing the late reports, the total number of accepted rides might have appeared as 8 (excluding the two late acceptances).
    • After Including Late Acceptances: With the inclusion of the two late acceptances reported at 5:18 PM, the number of successful rides increases to 10. This indicates a higher demand than initially calculated if these late reports were ignored.
  2. Impact of Late Cancellation:

    • Before Late Report Handling: The system might consider 10 rides successful if the late cancellation at 5:22 PM were excluded.
    • After Including Late Cancellation: With the late cancellation considered, the actual number of successful rides adjusts back down to 9. This adjustment prevents the system from overstating demand, which is crucial because overestimating demand could lead to unfairly high prices.

Real-World Impact on Fare Adjustment:

  • Accurate Demand Representation: By incorporating both the late acceptances and the late cancellation, the system maintains an accurate representation of demand. Accurately knowing that 9 rides were completed (not 8 or 10) allows the dynamic pricing algorithm to adjust fares based on true demand levels. Overestimation could lead to higher than necessary prices, potentially reducing rider demand, while underestimation could lead to missed opportunities for drivers and dissatisfaction due to low availability.

  • Dynamic Pricing Adjustment: If demand is accurately captured as being high (with the final tally of 9 successful rides in a busy period), the system may increase fares slightly to manage demand or attract more drivers to the area. Conversely, if demand were overestimated, the system might incorrectly raise prices, leading to a potential drop in ride requests.

  • Fairness and Efficiency: By ensuring that fare adjustments are based on accurate, timely data, the ride-sharing platform maintains fairness for both riders and drivers. Riders pay prices that genuinely reflect current conditions, and drivers are directed to areas with genuine demand, optimizing their time and earning potential.

Conclusion:

The processing of late reports through mechanisms like watermarks and allowed lateness is not just about data accuracy for its own sake. In dynamic and competitive environments like ride-sharing, these data processing strategies directly impact economic decisions and operational efficiency, playing a critical role in how resources are allocated and priced in real-time.


Apache Beam offers a comprehensive set of strategies to handle late arriving data in stream processing, which is crucial for applications requiring real-time or near-real-time data accuracy and timeliness. Here's a detailed explanation of each strategy and a guide on when to use which:

  1. Windowing: Segmenting data into finite chunks based on timestamps. Useful when data needs to be grouped into intervals (like hourly summaries).

    • Fixed Windows: Use when data can be divided into consistent, non-overlapping intervals.
    • Sliding Windows: Ideal when you need overlapping windows to provide a moving average.
    • Session Windows: Best for data that is sporadic or bursty, grouping interactions by user sessions based on activity bursts.
  2. Watermarks: A heuristic to estimate when all data for a certain period has been received, based on event time. Use watermarks to manage when a window can be considered complete and ready for processing.

  3. Triggers: Define specific conditions under which windowed data should be processed or output. Triggers are critical for handling cases where the watermark may not perfectly predict data completeness.

    • Event Time Triggers: Fire based on progress of time in the event data.
    • Processing Time Triggers: Fire based on the time the data is being processed.
    • Data-driven Triggers: Fire based on the arrival or accumulation of data.
  4. Allowed Lateness: Specifies how late data is allowed to be relative to the watermark before it is discarded. This is useful for dealing with data that arrives well beyond its expected time, ensuring the system's output remains correct and robust.

  5. State and Timers: Manage complex stateful processing, like tracking windows that might need updates based on late data. Timers can schedule future operations, useful for managing cleanup operations or late updates.


Explanation of Watermarks in Apache Beam:

Objective: To understand how watermarks function in processing windows, especially with late data.

Scenario Update:

Suppose a transaction is expected to occur at 9:50 AM but is reported at 10:10 AM. The processing window it belongs to is from 9:00 AM to 10:00 AM.

Watermarks Explained:

  • Watermarks are a mechanism used to manage the timing of when windows are processed and closed. They effectively allow a pipeline to handle events that are delayed or out of order by marking the point in time up to which the data is expected to be complete.

  • Setting Watermarks: In our example, if the watermark for the 9:00 AM to 10:00 AM window is set for 10:10 AM, it anticipates that events up until this time may still belong to the 9:00-10:00 AM window if they were delayed. Thus, a transaction that occurred at 9:50 AM but was reported at 10:10 AM would still be processed as part of the 9:00-10:00 AM window because it falls within the watermark's allowance.

Practical Application:

  • Event at 9:50 AM, Reported at 10:10 AM: The event's timestamp (9:50 AM) dictates which window it belongs to (9:00-10:00 AM). Even though the report comes in at 10:10 AM, if the watermark for the 9:00-10:00 AM window is set to wait until 10:10 AM or slightly later, this event is included in the processing for the 9:00-10:00 AM window. The watermark delays processing or closing of the window to allow for this late data.

  • Window Processing: The window is not finalized and processed until the watermark time is reached, ensuring all data up to that point can be included. In this case, even though the window technically spans 9:00-10:00 AM, processing will wait until the watermark condition (10:10 AM) is met.

Conclusion:

Watermarks allow a system to account for and include data that is reported after the window's end but belongs to that window due to its event timestamp. This ensures that all relevant data is considered in the analysis or aggregation tasks, enhancing the accuracy and reliability of the data processing system.

In summary, the watermark's role is crucial for handling data integrity issues in real-time streaming applications, ensuring that late arrivals due to network delays or other issues are still processed in the correct windows. This correct handling is essential for accurate real-time reporting and analysis, such as in financial transaction monitoring scenarios like the one outlined.


In addition to watermarks, Allowed Lateness is another key technique used in Apache Beam and similar streaming data processing frameworks to handle late data arrival. Here, I'll explain how Allowed Lateness works with detailed examples and timestamps.

Technique: Allowed Lateness

Objective: Manage and include events that arrive after the window has technically closed but within a permissible time frame.

Scenario:

John is still monitoring his daily transactions on Google Pay with a focus on ensuring that all transactions are recorded accurately, even if they are reported late.

Daily Transactions (with expected and actual reporting times):

  1. 8:00 AM: Breakfast purchase - $10 (Reported at 8:00 AM)
  2. 12:00 PM: Lunch purchase - $20 (Actual time 12:00 PM, reported at 12:15 PM due to a mobile network delay)
  3. 3:00 PM: Snack purchase - $5 (Actual time 3:00 PM, reported at 3:10 PM)

Daily Limit: $100

How Allowed Lateness Works:

  • Definition: Allowed Lateness specifies the amount of time after the end of a window during which late data is still considered for processing. This technique is particularly useful when dealing with streaming data that can be subject to delays, ensuring no data is lost simply due to its late arrival.

Implementation:

  • Windows: Assume 3-hour fixed windows for simplicity (e.g., 9 AM - 12 PM, 12 PM - 3 PM, etc.).
  • Allowed Lateness: Set to 15 minutes past the window end.

Detailed Example:

  • 12:00 PM Lunch Purchase:
    • Window: The 12:00 PM purchase falls into the 9 AM - 12 PM window.
    • Report Time: Reported at 12:15 PM.
    • Handling with Allowed Lateness: Even though the reporting time (12:15 PM) is after the window's end (12:00 PM), it is still within the 15-minute allowed lateness period. Therefore, this transaction is included in the 9 AM - 12 PM window for processing.

Practical Application:

  • 3:00 PM Snack Purchase:
    • Window: The 3:00 PM purchase falls into the 12 PM - 3 PM window.
    • Report Time: Reported at 3:10 PM.
    • Handling with Allowed Lateness: Similar to the lunch purchase, the snack purchase is reported 10 minutes after the window's end. Since this is within the 15-minute allowed lateness period, this transaction is also included in the 12 PM - 3 PM window.

Conclusion:

Allowed Lateness ensures that transactions like John’s lunch and snack purchases are accounted for in the correct processing windows despite being reported late. This technique is crucial for real-time financial applications where timely and accurate data aggregation is necessary for reporting and alerting purposes. It complements watermarks by providing an additional safeguard against data loss due to reporting delays, enhancing the robustness and reliability of the data processing system.


Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment