We had a fun bug last week that almost doubles as a logic puzzle. It makes perfect sense in hindsight, but as I was staring at it, it seemed every part of it was nailed down tight - until I spotted the obvious, natural assumption that had been violated from the start.
A bit of background. We're working with microservices propagating state with event streams. For a typical stream, most events will concern unrelated objects, so we parallelize them: each incoming event is assigned a key, and we accept parallel events so long as their keys don't collide with any events currently in progress. Furthermore, because we track our position in the input stream as a single number, we can't retire events as soon