Investigating an Event Queue Hang: The Code Works Correctly

We had a fun bug last week that almost doubles as a logic puzzle. It makes perfect sense in hindsight, but as I was staring at it, it seemed every part of it was nailed down tight - until I spotted the obvious, natural assumption that had been violated from the start.

A bit of background. We're working with microservices propagating state with event streams. For a typical stream, most events will concern unrelated objects, so we parallelize them: each incoming event is assigned a key, and we accept parallel events so long as their keys don't collide with any events currently in progress. Furthermore, because we track our position in the input stream as a single number, we can't retire events as soon as they are finished, but have to wait for all prior events to finish processing first. As a result, our event state looks like this:

struct EventInProgress
{
    ulong eventNumber; // position of the event in the input stream
    K key;             // parallelization key; events whose keys collide must not run concurrently
    bool done;         // set by the worker task once the event has been processed
}
// All events accepted but not yet flushed to the stream tracker, in acceptance order.
EventInProgress[] processingEvents;

Last week, one of our sysops contacted me to tell me that one of our microservices was hanging on a test system. This sort of thing does happen, so we attached gdb to find out what was going on. It seemed that at a certain point, the service had simply signaled the event database to stop sending it more events and then, apparently calmly and without any errors, stopped doing its work entirely. With growing confusion, we chased ghosts for a few days - I thought I knew what the issue was and missed the critical backtrace because its top function looked innocent. (Always read every thread's backtrace!) Finally, we spotted that the service seemed to be waiting for the event handling threadpool to become idle.

"Fine," I thought, "let's add some logging to that and see what's stuffing it."

Nothing was stuffing it. The event handling threadpool was in fact completely idle. Baffled, I logged the list of processing events as well. It reported that several events were still in progress. But the threadpool was completely idle.

Now, this code is a bit indirect. It doesn't actually look at the threadpool at all. It just counts how many events in the process list are not marked as done, and blocks until this number falls below the threadpool limit. But every event that finished should mark itself as done.
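
In sketch form, with made-up names (poolSize, stateMutex, stateCondition and the string key type are stand-ins of mine, not the actual service code), the check looks something like this:

import core.sync.condition : Condition;
import core.sync.mutex : Mutex;
import std.algorithm.searching : count;

alias K = string; // stand-in key type for the sketch; the real type doesn't matter here

enum poolSize = 8;                            // size of the event handling threadpool (made up)
__gshared Mutex stateMutex;                   // guards processingEvents
__gshared Condition stateCondition;           // signaled whenever an event finishes or flushes
__gshared EventInProgress[] processingEvents; // the list from above

shared static this()
{
    stateMutex = new Mutex;
    stateCondition = new Condition(stateMutex);
}

// "Is there threadpool space?" is answered indirectly: count the entries not yet
// marked done and block until that number drops below the pool size. This is only
// a valid proxy for the threadpool's state if every finished task reliably marks
// its own entry as done.
void waitForThreadpoolSpace()
{
    // Caller must hold stateMutex.
    while (processingEvents.count!(e => !e.done) >= poolSize)
        stateCondition.wait();
}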

Now it did not surprise me at all that this code was failing - threading code is iffy at the best of times, and this one was already uncomfortably involved, juggling several precise pieces of logic. I'd previously considered it as "finished": not in the sense that it didn't need changes, but that any further modifications would make it totally impossible to be confident in its function. But I still had to find the issue. And the code, though complex, did look correct.

The Riddle

To cut short a lot of debugging, and make the riddle more fun, I'll first explain to you how the logic works, and then list some possibilities that were not the culprit.

When we get an event, we wait for threadpool space to become free and for all events with conflicting keys to finish, then add the event to the events-in-progress list and kick off a worker task. The worker task processes the event (dropping its lock in the meantime), then finds its entry in the list again by event number (since the list will have changed by this point) and marks it as done. The worker task then performs cleanup: it walks the list in event number order, flushing every event that is done to the stream tracker and dropping it from the array, until it finds one that still needs work. That's it.
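
Putting that whole flow together as a rough sketch, reusing the names from the sketch above (streamTracker is another stand-in, and std.parallelism's taskPool stands in for our own threadpool):

import std.algorithm.searching : any, count;
import std.parallelism : task, taskPool;

struct StreamTracker { void commit(ulong eventNumber) { /* persist the stream position */ } }
__gshared StreamTracker streamTracker; // stand-in for the real stream-position tracker

void acceptEvent(ulong number, K key, void delegate() process)
{
    synchronized (stateMutex)
    {
        // Wait for threadpool space (the count-based check above) and for all
        // events with conflicting keys to finish.
        while (processingEvents.count!(e => !e.done) >= poolSize
            || processingEvents.any!(e => !e.done && e.key == key))
            stateCondition.wait();
        processingEvents ~= EventInProgress(number, key, false);
    }
    taskPool.put(task({
        process(); // the actual event handling, done without the lock
        synchronized (stateMutex)
        {
            // Find our entry in the list again by event number and mark it done.
            foreach (ref entry; processingEvents)
                if (entry.eventNumber == number)
                {
                    entry.done = true;
                    break;
                }
            // Cleanup: flush finished events in order until one still needs work.
            size_t flushed = 0;
            while (flushed < processingEvents.length && processingEvents[flushed].done)
            {
                streamTracker.commit(processingEvents[flushed].eventNumber);
                flushed++;
            }
            processingEvents = processingEvents[flushed .. $];
            stateCondition.notifyAll();
        }
    }));
}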

Finally, here are things that are not the culprit:

  • The threading and synchronization worked fine. There were no threading errors: the code did exactly what it said.
  • The event tasks were not failing - we kill the application if this happens. At any rate, we didn't see a failure log.
  • The tasks weren't getting lost somewhere - the threadpool really was idle. It had finished all its work.
  • It was not a compiler, library, framework or hardware bug. All code worked exactly as described.

This stumped me for some time. Can you spot the source of the error? Take your time. It really is findable from this description. (Tested with an LLM!)

The Solution

Honestly, I didn't find it by thinking really hard but just by spamming debug prints until something jumped out at me.

The events that were marked as not done had been queued multiple times. (!)

That was the failed assumption: I had assumed that the event database would only ever send monotonically increasing event numbers. In fact, under some network conditions, it could apparently reset the event stream to a prior offset and resend a few events. From there, it comes down to precise timing:

  • Event 1 is accepted and goes in progress.
  • Event 2 is accepted, goes in progress and is finished. It is not flushed to the tracker because Event 1 is not done.
  • Event 2 is accepted again, goes in progress (because the previous instance of event 2 is done, so there's no key collision) and finishes.
  • The second Event 2 task searches the list for its entry, but finds the entry of the first Event 2 task, still not flushed, and marks it as done a second time! (See the sketch after this list.)
  • The second Event 2 entry is never removed. The event list gains a permanent resident.
  • Eventually, the number of zombie events exceeds the size of the threadpool and the process halts.
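
In terms of the sketches above, here is the state just before the second Event 2 marks "its" entry, and why the mark lands on the wrong one (illustrative values only):

unittest
{
    auto events = [
        EventInProgress(1, "a", false), // Event 1, still running
        EventInProgress(2, "b", true),  // first Event 2: finished, but not flushed (Event 1 blocks it)
        EventInProgress(2, "b", false), // second Event 2: its processing just completed
    ];
    // The worker's "find my entry by number" loop from the sketch:
    foreach (ref entry; events)
        if (entry.eventNumber == 2)
        {
            entry.done = true; // hits the first Event 2, which was already done
            break;
        }
    assert(!events[2].done); // the second instance stays "in progress" forever
}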

So our code in fact worked exactly right! It had simply relied on a background assumption that seemed so obvious, so common-sense, I hadn't even thought to validate it.

We updated the database code to reject events with a lower number than the last one seen, and the error disappeared.
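
In sketch form (onEventFromDatabase and lastAcceptedNumber are made-up names for the database-facing layer, and I'm assuming events arrive on a single reader thread), the guard simply drops anything that isn't strictly newer than what we already accepted:

__gshared ulong lastAcceptedNumber = 0;

void onEventFromDatabase(ulong number, K key, void delegate() process)
{
    if (number <= lastAcceptedNumber)
        return; // the stream was rewound and this event was resent; we already handled it
    lastAcceptedNumber = number;
    acceptEvent(number, key, process);
}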

@thinkingerrol

I'd previously considered it as "finished": not in the sense that it didn't need changes, but that any further modifications would make it totally impossible to be confident in its function. But I still had to find the issue.

:DDD

@FeepingCreature
Author

FeepingCreature commented Jun 24, 2024

Addendum: This description is a bit simplified. In the actual code, events that share a key with preceding events are queued immediately, but do wait for the preceding events to flush before running. So why didn't Event 2-the-first and Event 2-the-second get a key collision, keeping the second Event 2 back until the first was removed? Well...

Here's the thing. The code that checks for key overlaps only checks for events earlier than itself. So it has to also find its position in the event list... and runs into the exact same error again. :-)
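
In sketch form (again a reconstruction, not the real code), that check looks roughly like this - and the "find my own position" step has the same first-match problem as marking an event done:

// Must this event wait because some earlier, not-yet-flushed event holds the same key?
bool mustWaitForEarlierKey(ulong myNumber, K myKey)
{
    foreach (entry; processingEvents)
    {
        // Stop once we reach our own entry: only events *before* it matter...
        if (entry.eventNumber == myNumber)
            return false; // ...but with a resent number, this is the FIRST entry with
                          // that number, which may be the stale earlier instance.
        if (entry.key == myKey)
            return true; // an earlier event with our key hasn't been flushed yet
    }
    return false;
}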

@lxdlam

lxdlam commented Jun 24, 2024

The idea of "conflicting keys" for grouping tasks is very unreliable, and the idempotency problem is just one part of it. If you're using a sequence generator, like PostgreSQL's sequences, together with some kinds of transactional operations, the sequence may skip many values. You may also cycle back to the beginning quickly, and if you rely on the monotonic property of your IDs, you will face a significant crash. I highly recommend the book DDIA for an overview of the principles of designing such systems.

Thank you for sharing! The detailed, context-rich walkthrough is awesome.
