Investigating an Event Queue Hang: The Code Works Correctly

We had a fun bug last week that almost doubles as a logic puzzle. It makes perfect sense in hindsight, but as I was staring at it, it seemed every part of it was nailed down tight - until I spotted the obvious, natural assumption that had been violated from the start.

A bit of background. We're working with microservices propagating state with event streams. For a typical stream, most events will concern unrelated objects, so we parallelize them: each incoming event is assigned a key, and we accept parallel events so long as their keys don't collide with any events currently in progress. Furthermore, because we track our position in the input stream as a single number, we can't retire events as soon as they are finished, but have to wait for all prior events to finish processing first. As a result, our event state looks like this:

struct EventInProgress
{
    ulong eventNumber; // position of the event in the input stream
    K key;             // parallelization key; events whose keys collide must not run concurrently
    bool done;         // set by the worker task once the event has been processed
}
// All events accepted but not yet flushed to the stream tracker, in acceptance order.
EventInProgress[] processingEvents;

Last week, one of our sysops contacted me to tell me that one of our microservices was hanging on a test system. This sort of thing does happen, so we attached gdb to find out what was going on. It seemed that at a certain point, the service had simply signaled the event database to stop sending it more events and then, apparently calmly and without any errors, stopped doing its work entirely. With growing confusion, we chased ghosts for a few days - I thought I knew what the issue was and missed the critical backtrace because its top function looked innocent. (Always read every thread's backtrace!) Finally, we spotted that the service seemed to be waiting for the event handling threadpool to become idle.

"Fine," I thought, "let's add some logging to that and see what's stuffing it."

Nothing was stuffing it. The event handling threadpool was in fact completely idle. Baffled, I logged the list of processing events as well. It reported that several events were still in progress. But the threadpool was completely idle.

Now, this code is a bit indirect. It doesn't actually look at the threadpool at all. It just counts how many events in the process list are not marked as done, and blocks until this number falls below the threadpool limit. But every event that finished should mark itself as done.
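
In sketch form, with made-up names (poolSize, stateMutex, stateCondition and the string key type are stand-ins of mine, not the actual service code), the check looks something like this:

import core.sync.condition : Condition;
import core.sync.mutex : Mutex;
import std.algorithm.searching : count;

alias K = string; // stand-in key type for the sketch; the real type doesn't matter here

enum poolSize = 8;                            // size of the event handling threadpool (made up)
__gshared Mutex stateMutex;                   // guards processingEvents
__gshared Condition stateCondition;           // signaled whenever an event finishes or flushes
__gshared EventInProgress[] processingEvents; // the list from above

shared static this()
{
    stateMutex = new Mutex;
    stateCondition = new Condition(stateMutex);
}

// "Is there threadpool space?" is answered indirectly: count the entries not yet
// marked done and block until that number drops below the pool size. This is only
// a valid proxy for the threadpool's state if every finished task reliably marks
// its own entry as done.
void waitForThreadpoolSpace()
{
    // Caller must hold stateMutex.
    while (processingEvents.count!(e => !e.done) >= poolSize)
        stateCondition.wait();
}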

Now it did not surprise me at all that this code was failing - threading code is iffy at the best of times, and this one was already uncomfortably involved, juggling several precise pieces of logic. I'd previously considered it as "finished": not in the sense that it didn't need changes, but that any further modifications would make it totally impossible to be confident in its function. But I still had to find the issue. And the code, though complex, did look correct.

The Riddle

To cut short a lot of debugging, and make the riddle more fun, I'll first explain to you how the logic works, and then list some possibilities that were not the culprit.

When we get an event, we wait for threadpool space to become free and for all events with conflicting keys to finish, then add the event to the events-in-progress list and kick off a worker task. The worker task processes the event (dropping its lock in the meantime), then finds its entry in the list again by event number (since the list will have changed by this point) and marks it as done. The worker task then performs cleanup: it walks the list in event number order, flushing every event that is done to the stream tracker and dropping it from the array, until it finds one that still needs work. That's it.
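
Putting that whole flow together as a rough sketch, reusing the names from the sketch above (streamTracker is another stand-in, and std.parallelism's taskPool stands in for our own threadpool):

import std.algorithm.searching : any, count;
import std.parallelism : task, taskPool;

struct StreamTracker { void commit(ulong eventNumber) { /* persist the stream position */ } }
__gshared StreamTracker streamTracker; // stand-in for the real stream-position tracker

void acceptEvent(ulong number, K key, void delegate() process)
{
    synchronized (stateMutex)
    {
        // Wait for threadpool space (the count-based check above) and for all
        // events with conflicting keys to finish.
        while (processingEvents.count!(e => !e.done) >= poolSize
            || processingEvents.any!(e => !e.done && e.key == key))
            stateCondition.wait();
        processingEvents ~= EventInProgress(number, key, false);
    }
    taskPool.put(task({
        process(); // the actual event handling, done without the lock
        synchronized (stateMutex)
        {
            // Find our entry in the list again by event number and mark it done.
            foreach (ref entry; processingEvents)
                if (entry.eventNumber == number)
                {
                    entry.done = true;
                    break;
                }
            // Cleanup: flush finished events in order until one still needs work.
            size_t flushed = 0;
            while (flushed < processingEvents.length && processingEvents[flushed].done)
            {
                streamTracker.commit(processingEvents[flushed].eventNumber);
                flushed++;
            }
            processingEvents = processingEvents[flushed .. $];
            stateCondition.notifyAll();
        }
    }));
}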

Finally, here are things that are not the culprit:

  • The threading and synchronization worked fine. There were no threading errors: the code did exactly what it said.
  • The event tasks were not failing - we kill the application if this happens. At any rate, we didn't see a failure log.
  • The tasks weren't getting lost somewhere - the threadpool really was idle. It had finished all its work.
  • It was not a compiler, library, framework or hardware bug. All code worked exactly as described.

This stumped me for some time. Can you spot the source of the error? Take your time. It really is findable from this description. (Tested with an LLM!)

The Solution

Honestly, I didn't find it by thinking really hard but just by spamming debug prints until something jumped out at me.

The events that were marked as not done had been queued multiple times. (!)

That was the failed assumption: I had assumed that the event database would only ever send monotonically increasing event numbers. In fact, under some network conditions, it could apparently reset the event stream to a prior offset and resend a few events. From there, it comes down to precise timing:

  • Event 1 is accepted and goes in progress.
  • Event 2 is accepted, goes in progress and is finished. It is not flushed to the tracker because Event 1 is not done.
  • Event 2 is accepted again, goes in progress (because the previous instance of event 2 is done, so there's no key collision) and finishes.
  • The second Event 2 task searches the list for its entry, but finds the entry of the first Event 2 task, still not flushed, and marks it as done a second time! (See the sketch after this list.)
  • The second Event 2 entry is never removed. The event list gains a permanent resident.
  • Eventually, the number of zombie events exceeds the size of the threadpool and the process halts.
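
In terms of the sketches above, here is the state just before the second Event 2 marks "its" entry, and why the mark lands on the wrong one (illustrative values only):

unittest
{
    auto events = [
        EventInProgress(1, "a", false), // Event 1, still running
        EventInProgress(2, "b", true),  // first Event 2: finished, but not flushed (Event 1 blocks it)
        EventInProgress(2, "b", false), // second Event 2: its processing just completed
    ];
    // The worker's "find my entry by number" loop from the sketch:
    foreach (ref entry; events)
        if (entry.eventNumber == 2)
        {
            entry.done = true; // hits the first Event 2, which was already done
            break;
        }
    assert(!events[2].done); // the second instance stays "in progress" forever
}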

So our code in fact worked exactly right! It had simply relied on a background assumption that seemed so obvious, so common-sense, I hadn't even thought to validate it.

We updated the database code to reject events with a lower number than the last one seen, and the error disappeared.
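
In sketch form (onEventFromDatabase and lastAcceptedNumber are made-up names for the database-facing layer, and I'm assuming events arrive on a single reader thread), the guard simply drops anything that isn't strictly newer than what we already accepted:

__gshared ulong lastAcceptedNumber = 0;

void onEventFromDatabase(ulong number, K key, void delegate() process)
{
    if (number <= lastAcceptedNumber)
        return; // the stream was rewound and this event was resent; we already handled it
    lastAcceptedNumber = number;
    acceptEvent(number, key, process);
}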

@thinkingerrol

I'd previously considered it as "finished": not in the sense that it didn't need changes, but that any further modifications would make it totally impossible to be confident in its function. But I still had to find the issue.

:DDD

@FeepingCreature
Author

FeepingCreature commented Jun 24, 2024

Addendum: This description is a bit simplified. In the actual code, events that share a key with preceding events are queued immediately, but do wait for the preceding events to flush before running. So why didn't Event 2-the-first and Event 2-the-second get a key collision, keeping the second Event 2 back until the first was removed? Well...

Here's the thing. The code that checks for key overlaps only checks for events earlier than itself. So it has to also find its position in the event list... and runs into the exact same error again. :-)
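
In sketch form (again a reconstruction, not the real code), that check looks roughly like this - and the "find my own position" step has the same first-match problem as marking an event done:

// Must this event wait because some earlier, not-yet-flushed event holds the same key?
bool mustWaitForEarlierKey(ulong myNumber, K myKey)
{
    foreach (entry; processingEvents)
    {
        // Stop once we reach our own entry: only events *before* it matter...
        if (entry.eventNumber == myNumber)
            return false; // ...but with a resent number, this is the FIRST entry with
                          // that number, which may be the stale earlier instance.
        if (entry.key == myKey)
            return true; // an earlier event with our key hasn't been flushed yet
    }
    return false;
}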

@lxdlam

lxdlam commented Jun 24, 2024

The idea of "conflicting keys" for grouping tasks is very unreliable, and the idempotency problem is just one part of it. If you're using a sequence generator, like PostgreSQL's sequences, together with some kinds of transactional operations, the sequence may skip many values. You may also cycle back to the beginning quickly, and if you rely on the monotonic property of your IDs, you will face a significant crash. I highly recommend the book DDIA for an overview of the principles of designing such systems.

Thank you for sharing! The detailed, context-rich walkthrough is awesome.
