Skip to content

Instantly share code, notes, and snippets.

@rbranson
Created July 21, 2021 17:17
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save rbranson/3dd6b9d34625b49218e702630dd54ace to your computer and use it in GitHub Desktop.
Save rbranson/3dd6b9d34625b49218e702630dd54ace to your computer and use it in GitHub Desktop.

The other story begins with adding a C library extension to the monolith. This was some Python code that wrapped a C library which gave access to an internal data store. Unfortunately, the bindings had a bug in their string handling code. To get a string object that could be sent to the Python C APIs, it used a family of functions called PyString_FromString.

The developer writing these C bindings didn’t realize that strings are immutable in Python, and the C API documentation for strings doesn’t include that bit of information. To copy a string from the C world to Python-land, the bindings would first use this function to initialize a Python string to use as a buffer. It would fill said buffer with the string from the C side by poking at the object’s internal data structures. This did actually work, as it was never seen BEFORE it was filled nor was it modified AFTER it was sent across the wall to Python-land.

Then disaster struck. A configuration change caused the it-has-always-worked function to begin to request 1-byte buffers from the C Python API occasionally, which had never been tested before. To avoid extra allocation, Python uses a pre-built array of 256 strings that held all valid single-byte strings in the 8-bit space. A fast path in PyString_FromString returns these static objects when it detects that the string is a 1-byte string. That meant the C bindings were effectively modifying all single-byte strings. Each time the string containing byte 0 was requested, it would return byte 13 instead.

Besides creating general chaos, some requests were successful, but returned invalid session cookies. The hash routines in Python concatenate a large number of single-byte strings together into one long string. That meant the hash routine used to sign these cookies was producing invalid signatures (in this world, 0 = 13). The code that validated these cookies before accepting requests would reject the invalid signatures, and re-issue a new cookie with a completely new session. The result was a mass logout of millions of users.

Before we realized what was going on, we were unable to track down the configuration change that triggered the issue. Even though the change had been logged, the log had overwhelmed by thousands of automated changes, making it made it easy to miss it. Rolling the code back didn’t work because the change had lied dormant for months. It was pretty desperate times.

The band-aid fix was "interesting." Hitting the 1-byte buffer code path was rare, but once it “infected” the process, it was hosed. The fix was to check for the symptoms of the infection in the code that handled every response, and intentionally hard-exit the process if it detected a problem. This meant only a handful of requests would fail every few seconds, and more importantly, wouldn’t break anyone’s cookies.

We added this code in a middleware that executed before and after the request path, roughly like so:

if hashlib.md5("foo").hexdigest() != "acbd18db4cc2f85cedef654fccc4a4d8":
  sys.exit(11)

It took another day or so of grinding the process with gdb to even find the general culprit, and several days for the team who wrote the C bindings to track it down.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment