@FeepingCreature
Last active June 21, 2024 15:38

Me

Hi! Programming challenge. I'm gonna paste the convo from IRC, names anonymized. Same challenge goes to you. :)
[16:31] <foo> do you wanna hear some nerd shit about my event queue impl and why it started hanging forever once we started using google rpc
[16:31] <baz> Shore
[16:31] * foo spent three days on this bug
[16:31] <bar> Distributed stuff?
[16:31] <foo> okay so we have an eventstoredb client at work that used to use the tcp api but we're switching to the (cool, new) eventstore db
[16:31] <foo> er
[16:31] <foo> grpc api*
[16:32] <foo> bar: single client, multiple threads
[16:32] <foo> and ... not every time, but maybe 50% of the time, one specific service only would hang
[16:32] <foo> just stop processing events entirely
[16:33] <foo> now the grpc code is new and custom, so I spent the first and second day absolutely spamming it with debug messages
[16:33] <foo> manually updating libcurl in the docker container, stuff like that
[16:33] <foo> but it seemed to be working correctly. I thought it was an issue with the http/2 pause logic, but it just paused because it'd saturated the processing threads.
[16:33] <foo> thread.*
[16:34] <foo> and I proved this. the receiver was in fact stopping.
[16:34] <foo> so I grabbed a gdb trace from that thread and noticed something I'd completely overlooked earlier
[16:34] <foo> okay I need to explain a bit here about how eventstoredb works
[16:34] <foo> hang on do you know how kafka works
[16:36] <bar> I mean if it's more complex than echo "stuff hapen" >> /var/log.txt, no.
[16:36] <foo> close enough honestly
[16:37] <baz> Ohh, not Franx Kafka
[16:37] <foo> it's just a giant dump of json messages with a bunch of metadata that gets indexed on
[16:37] <foo> anyway what we do on our side is we want to parallelize processing cause sometimes the service has to do other things that block, like disk io, db io or publishing more events
[16:38] <foo> and mostly events will be unrelated. so we have a helper that assigns every event a key (or list of keys) and then sticks the event handler on a threadpool iff no other events with the same key are currently processing
[16:38] <foo> that's the dispatcher.
[16:39] <foo> the dispatcher does something like... if there's enough space on the threadpool to take the event (if the number of processing events that are not finished yet is less than the threadpool size) then start the task.
[16:39] <foo> this is the line that was blocking.
[16:39] <foo> at the same time, the threadpool was completely idle.
[16:40] <bar> I sorta feel this increasing unease the more sentences you need to describe how something multithreaded works.
[16:40] <foo> XD
[16:40] <foo> but like, okay?? fine?? so I picked up all my debug spam in the networking and moved it over to the dispatcher class
[16:40] <foo> bar: it's even worse actually because it also has to update the tracker, but the tracker isn't involved in today's bug so I won't go into it
[16:41] <foo> let me put it like this, I wrote this class, I have high confidence in it, but it also didn't surprise me that it was the cause of the bug (well, not really but we'll get to it)
[16:41] <foo> so the dispatcher has a dynamic list of tasks that are currently processing
[16:41] <foo> this is a Waitable, which is a concept that ... I genuinely don't know why no other language/framework has this
[16:42] <foo> it's just a way to wrap a data type with the ability to block until a condition about it is true
[16:42] <foo> so it's basically a condition variable glued to a value
[16:42] <foo> you can do Waitable!int test; test.waitFor!(a => a > 5); test.set(3); test.set(6); // waitFor unblocks
[16:43] <foo> and we're getting an event, adding a task entry to the waitable task list, add the task to the threadpool
[16:43] <foo> and the threadpool processes the event and then does... well, the list might have changed, so it grabs a lock and says "list[list.find(event.eventId)].done = true;"
[16:44] <foo> and at this point you can already spot the bug, I've given you enough info, if you are smarter than I am
[16:45] * foo gives it a few minutes
[16:46] <bar> What happens if the list is changed so eventId is gone?
[16:47] <bar> And find fails.
[16:47] <foo> haha. good thinking but no that's fine. .. I actually don't know, but the code is good, that can't happen. I'll stipulate that ... an outside component was not doing exactly what you'd expect, but the code I've described is all that's necessary to cause an infinite hang.
[16:48] <foo> the error is in those lines.
[16:48] <baz> Hmmm
[16:48] * foo gonna ask claude 3.5 too :D
[16:49] <baz> i feel like it's staring me right in the face
[16:49] <foo> hint: there's a background assumption that this code critically rests on that is violated.
[16:49] <baz> Is it the a => a > 5 part?
[16:49] <foo> no that's just an example, not that line.
[16:50] <baz> ohh
[16:50] <foo> the logic as described is wrong.
[16:50] <baz> Well, it assumes that the event has an eventId
[16:50] <foo> it does.
[16:50] <foo> all code pasted does exactly what it says.
[16:50] <baz> and
[16:50] <baz> it also assumes that find will produce an event
[16:50] <foo> find will find an event.

Additional stipulations from a previous conversation: all threading code is working correctly. The key logic is a bit involved, so let's just say that every event key is identical to its event number. Also, importantly: events can only be removed from the list when all prior events are done processing, which is why we check the number of events not yet done rather than the total.
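For context, here is a minimal sketch of what such a Waitable might look like in D, using core.sync. The names and structure are illustrative assumptions on my part, not the actual implementation from the story:

import core.sync.condition : Condition;
import core.sync.mutex : Mutex;

// A value glued to a condition variable: callers can block until
// a predicate over the value becomes true.
class Waitable(T)
{
    private T value;
    private Mutex mutex;
    private Condition cond;

    this(T initial = T.init)
    {
        mutex = new Mutex;
        cond = new Condition(mutex);
        value = initial;
    }

    // Replace the value and wake every waiter so it can recheck.
    void set(T newValue)
    {
        synchronized (mutex)
        {
            value = newValue;
            cond.notifyAll();
        }
    }

    // Block until pred(value) holds.
    void waitFor(alias pred)()
    {
        synchronized (mutex)
        {
            while (!pred(value))
                cond.wait();
        }
    }
}

With something like this, the dispatcher's blocking check amounts to roughly tasks.waitFor!(list => list.count!(t => !t.done) < poolSize), which is exactly the line that was hanging.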

Claude 3.5 Sonnet

Based on the information provided and the additional stipulations, I think I can identify the potential issue. The problem likely lies in the assumption about the relationship between the number of processing events and the threadpool size.

The key issue appears to be in this logic:

"if there's enough space on the threadpool to take the event (if the number of processing events that are not finished yet is less than the threadpool size) then start the task."

The bug is that this logic doesn't account for the possibility that events can complete out of order. Here's the scenario that could cause a deadlock:

  1. Let's say the threadpool size is N.
  2. N events start processing.
  3. The last event (Nth event) finishes processing quickly, while the others are still running.
  4. The system tries to process the (N+1)th event.
  5. The check "if the number of processing events that are not finished yet is less than the threadpool size" will return false, because N-1 events are still processing.
  6. The system waits for an event to finish before starting the new one.
  7. However, the remaining N-1 events might be blocked waiting for the (N+1)th event to process (due to some dependency or the key logic).

This creates a deadlock situation where the system is waiting for an event to finish before starting a new one, but the running events are waiting for the new event to start.

The fix would likely involve changing the logic to consider not just the number of unfinished events, but also which specific events are unfinished, ensuring that events can be processed even if earlier events are still running.

Me

Good thinking, we've had issues like this. In this case I'll state that none of the events are blocking: the threadpool is completely idle.

Claude 3.5 Sonnet

Ah, I see. Thank you for the clarification. If the threadpool is completely idle, yet the system is still hanging, we need to look at this from a different angle. Let's reconsider the information provided:

  1. The threadpool is completely idle.
  2. The dispatcher is blocking on adding a new task.
  3. The condition for adding a new task is: "if the number of processing events that are not finished yet is less than the threadpool size"

Given these facts, the issue likely lies in how the system is tracking the number of "processing events that are not finished yet." Here's a potential explanation for the bug:

The problem could be in how the system updates the count of unfinished events. Specifically, there might be a race condition or a logical error in the code that marks events as finished.

Here's a possible scenario:

  1. All events have actually finished processing (which is why the threadpool is idle).
  2. However, the counter or data structure that keeps track of unfinished events is not being properly updated.
  3. As a result, the system still thinks there are unfinished events, even though there aren't any.
  4. This causes the dispatcher to keep waiting for the number of unfinished events to decrease, which never happens.

The bug might be in this part:

the threadpool processes the event and then does... well, the list might have changed, so it grabs a lock and says "list[list.find(event.eventId)].done = true;"

There could be an issue where this update operation is not correctly decrementing the count of unfinished events, or not properly signaling that an event has finished.

To fix this, you'd need to ensure that:

  1. The count of unfinished events is accurately decremented every time an event finishes.
  2. There's proper synchronization between the thread that finishes events and the dispatcher thread.
  3. The condition variable or mechanism used to signal that events have finished is correctly triggered.

This would explain why the threadpool is idle (all events have actually finished) but the system is still hanging (it thinks there are unfinished events).

Me

Yep. So far, everything you said is correct. I will state that the code I've specified is working correctly and doing exactly what it says. Synchronization and condition signaling are all working perfectly. But note: there is no explicit count of unfinished events; we look at the list of events in progress and manually count how many are unfinished. It follows from this...

Claude 3.5 Sonnet

Ah, thank you for the additional clarification. This is a subtle but critical point. If there's no explicit count and you're manually counting unfinished events from the list, then the issue likely lies in how events are being removed from this list. Let's think through this:

  1. The threadpool processes events and marks them as done in the list.
  2. The dispatcher checks if there's space by looking at the number of unfinished events in the list.
  3. All synchronization and condition signaling is working correctly.

The key realization here is: events are being marked as done, but they're not being removed from the list.

So, the scenario is probably:

  1. Events are processed and marked as done in the list.
  2. The threadpool becomes idle because all events are processed.
  3. However, the list of events still contains all the events, just marked as done.
  4. When the dispatcher checks for space, it's still seeing a full list of events.
  5. Even though all events are marked as done, the total count of events in the list is still equal to (or greater than) the threadpool size.

This explains why the system hangs even though all events are processed and the threadpool is idle. The dispatcher is waiting for the number of events in the list to decrease below the threadpool size, but this never happens because events are never removed from the list.

The fix would involve adding logic to remove completed events from the list, not just marking them as done. This removal should happen either immediately after an event is processed, or periodically as part of a cleanup operation.

Me

Ah, clarification: I've given you the exact code used to mark an event as done. The scheduler counts the number of events that are in the list but not marked as done. You may want to look at that snippet of code in detail. :)

Claude 3.5 Sonnet

Ah, I see! Thank you for pointing that out. Let's look at that snippet again:

the threadpool processes the event and then does... well, the list might have changed, so it grabs a lock and says "list[list.find(event.eventId)].done = true;"

Now I see the subtle but critical issue. The problem is in the use of list.find(event.eventId).

Here's what's happening:

  1. The threadpool finishes processing an event.
  2. It tries to mark the event as done in the list.
  3. It uses list.find(event.eventId) to locate the event in the list.
  4. However, if the event is no longer in the list (perhaps it was removed for some reason), find will return an invalid index or iterator.
  5. When we try to access list[invalidIndex].done, we're accessing an invalid memory location.
  6. This could lead to undefined behavior, possibly corrupting the list or other memory.

As a result, some events might not be properly marked as done, causing the count of unfinished events to remain high even when all events have actually been processed.

The fix would involve adding a check to ensure the event is still in the list before attempting to mark it as done:

auto it = list.find(event.eventId);
if (it != list.end()) {  // or whatever indicates "not found" in your implementation
    it->done = true;
} else {
    // Handle the case where the event is no longer in the list
    // This might involve logging, error handling, or other appropriate action
}

This ensures that we only mark an event as done if it's actually still in the list, preventing potential memory corruption and ensuring accurate tracking of unfinished events.

Me

You're close! But I'll further specify that find is returning a valid offset in the list, corresponding to an event with the given ID.

It's a good one, isn't it!

Claude 3.5 Sonnet

Ah, now that is indeed a tricky one! Thank you for the additional clarification. If find is returning a valid offset, and we're correctly setting done = true for that event, then the issue must be even more subtle.

Let's think about this carefully:

  1. The event is in the list.
  2. We're finding it correctly.
  3. We're setting done = true correctly.
  4. But somehow, the count of unfinished events isn't decreasing.

The key here is that we're using list.find(event.eventId) to locate the event. This implies that we're searching for the event by its ID.

Now, here's the critical realization:

What if there are multiple entries in the list with the same eventId?

If this is possible, then we have our bug. Here's the scenario:

  1. An event with ID X is added to the list multiple times (perhaps due to retries or some other logic in the system).
  2. The threadpool processes one instance of event X and marks it as done.
  3. list.find(event.eventId) finds the first occurrence of event X in the list and marks it as done.
  4. However, there are still other instances of event X in the list that are not marked as done.
  5. When the dispatcher counts unfinished events, it still sees these other instances of event X as unfinished.

This would explain why:

  • The threadpool becomes idle (it has processed all events)
  • Events are being marked as done correctly
  • But the count of unfinished events never decreases to allow new events to be processed

The fix would involve ensuring that all instances of an event with a given ID are marked as done when that event is processed, or preventing duplicate entries in the first place if that's not intended behavior.
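
To make the failure mode concrete, here is a small runnable D sketch of the duplicate-entry scenario (Task and the find-by-id lookup are hypothetical stand-ins for the real dispatcher internals):

import std.algorithm : count, countUntil;
import std.stdio : writeln;

struct Task { long eventId; bool done; }

void main()
{
    // Hypothetical list state after a redelivery: event 7 was added twice.
    auto list = [Task(7), Task(8), Task(7)];

    // The handlers complete and mark themselves done by id lookup:
    list[list.countUntil!(t => t.eventId == 7)].done = true; // marks index 0
    list[list.countUntil!(t => t.eventId == 8)].done = true; // marks index 1
    // The redelivered event 7 completes as well, but the lookup again
    // returns the FIRST entry with that id: index 0, already done.
    list[list.countUntil!(t => t.eventId == 7)].done = true; // index 0, again

    // Index 2 is now a zombie: nothing will ever mark it done, so the
    // dispatcher's count of unfinished entries never drops to zero.
    writeln(list.count!(t => !t.done)); // 1 -- stuck forever
}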

Me

YEP, YOU GOT IT, VERY GOOD! In fact, the EventStoreDB gRPC connection just decides to randomly send us the same sequence of events multiple times here, i.e. we're seeing an event sequence like 5 6 7 8 9 7 8 9 10. If 5 blocks for too long, 7 8 9 end up as zombies, eventually stuffing up the threadpool.
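
For illustration, one possible guard, sketched in D (an assumption on my part, not necessarily the fix that actually shipped): refuse to track the same event id twice, so a redelivery can never plant a zombie entry.

import std.algorithm : canFind;

struct Task { long eventId; bool done; }

// Hypothetical helper: only add an event to the tracking list if its
// id is not already present, so a redelivered event is dropped instead
// of becoming an entry that nothing will ever mark done.
bool track(ref Task[] list, long eventId)
{
    if (list.canFind!(t => t.eventId == eventId))
        return false; // duplicate delivery: skip it
    list ~= Task(eventId, false);
    return true;
}

Alternatively, the completion callback could hold on to the exact entry it created (e.g. via a unique task token) instead of re-finding it by event id, which removes the ambiguity entirely.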
