@bmastenbrook
Created December 6, 2020 01:46
test for AT&T's lovely packet corruption
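#!/bin/sh
set -e
rm -f example-tls example-http
# Fetch a reference copy over TLS, retrying until the download succeeds.
while ! curl -m 1 -s -o example-tls https://www.example.com; do
    true
done
# Fetch over plain HTTP repeatedly; stop at the first copy that differs.
while true; do
    if curl -m 1 -s -o example-http http://www.example.com/; then
        if ! diff -q example-tls example-http; then break; fi
    fi
done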
@jhuseman

jhuseman commented Dec 8, 2020

I've been doing a bit of analysis on this, and this is really bizarre.

Of the 53 character changes reported in this thread so far, only 22 are at unique locations in the file; the other 31 are duplicates. Every single change is the result of flipping the 5th bit of a character, counting from the most significant bit (mask 0x08, so the ASCII code increases or decreases by 8), and all but 5 of the changes are spaced a multiple of 128 characters from the previous error (at least within the same person's results).

It's clearly not just some random bit-flipping issue, but probably something to do with some 128-byte framing somewhere.

I can share my python code and/or results spreadsheet if anyone is interested in digging deeper. It's not the most readable code, but can parse the diffs everyone is posting and generate more readable statistics about the byte offsets of the errors.
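
For anyone who wants to try this kind of analysis, here is a minimal sketch. It is not the code mentioned above, and the function names are made up for illustration. It compares a good and a corrupted copy byte by byte, records the XOR mask at each differing offset, and tallies the offsets modulo an assumed frame size of 128. The file names in `compare_files` are the ones the test script writes. Note that `zip` stops at the shorter input, so a length mismatch is silently ignored.

```python
from collections import Counter


def find_flips(good: bytes, bad: bytes):
    """Return (offset, xor_mask) for every byte position where the copies differ."""
    return [(i, g ^ b) for i, (g, b) in enumerate(zip(good, bad)) if g != b]


def summarize(flips, frame=128):
    """Count the XOR masks seen and the offsets modulo an assumed frame size."""
    masks = Counter(mask for _, mask in flips)
    residues = Counter(offset % frame for offset, _ in flips)
    return masks, residues


def compare_files(good_path="example-tls", bad_path="example-http"):
    """Compare the two files written by the test script (names taken from it)."""
    with open(good_path, "rb") as f_good, open(bad_path, "rb") as f_bad:
        return find_flips(f_good.read(), f_bad.read())


# Synthetic demonstration: two flips of mask 0x08, exactly 128 bytes apart.
good = b"x" * 300
bad = bytearray(good)
bad[10] ^= 0x08
bad[138] ^= 0x08
flips = find_flips(good, bytes(bad))
masks, residues = summarize(flips)
print(flips)      # [(10, 8), (138, 8)]
print(masks)      # Counter({8: 2})
print(residues)   # Counter({10: 2})
```

A single dominant mask and a single dominant residue, as in the synthetic data here, is the pattern described above: the same bit, at the same position within a 128-byte frame.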

For the record, I can't reproduce the issue over here in the Raleigh, NC area on ATT Fiber. I was just interested in the strange patterns showing up here and thought I'd share my findings.

@teichopsia

I started doing a similar analysis, but the HTML from www.example.com differs from what I have, so without having everyone redo raw captures to correct the offsets, all I came across initially was 256-byte intervals. I now see a ton more posts here to work with, so yeah. All the errors I saw stole bit #3 (counting from the LSB), then returned it on the next bit flip. Feels like a TCAM corruption.
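
One way to check the "stolen, then returned" pattern is to record the direction of each flip and see whether consecutive flips alternate. This is a rough sketch under the assumption that bit 3 (LSB-numbered) is the only bit of interest; the helper names are hypothetical.

```python
def flip_directions(good: bytes, bad: bytes, bit: int = 3):
    """Return (offset, direction) for each byte where `bit` differs.

    direction is "1->0" if the bit was cleared and "0->1" if it was set.
    """
    mask = 1 << bit
    result = []
    for offset, (g, b) in enumerate(zip(good, bad)):
        if (g ^ b) & mask:
            result.append((offset, "1->0" if g & mask else "0->1"))
    return result


def alternates(directions):
    """True if no two consecutive flips go in the same direction."""
    return all(a != b for a, b in zip(directions, directions[1:]))


# Synthetic demonstration: bit 3 is cleared at offset 0 and set at offset 256.
good = bytearray(b"p" * 512)   # 'p' is 0x70, bit 3 clear
good[0] = ord("x")             # 'x' is 0x78, bit 3 set
bad = bytearray(good)
bad[0] ^= 0x08
bad[256] ^= 0x08
found = flip_directions(bytes(good), bytes(bad))
print(found)                                # [(0, '1->0'), (256, '0->1')]
print(alternates([d for _, d in found]))    # True
```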

@gwillen

gwillen commented Dec 8, 2020

Back when I worked at Google, I saw an example of something very similar to this live, and I knew what I was looking for because I'd read about a past example of a similar nature. In both cases, the problem was ultimately diagnosed as "bad linecard". Hacker News commenters were talking about bad RAM, which makes sense to me. So, it would seem that on some card in some router, there is (well, was) a RAM chip with a single bad bit, and some packet buffer would get allocated across it, presumably with 128-byte alignment. Since the error bit in this case is not always-on or always-off, it could be getting copied from another bit, or from something weirder like an address line, or who knows...

And since the error is always in the same place mod 16 bits, the TCP checksum has very little power to save you; if the number of bits flipped in a packet is even, and the number of 0-1 and 1-0 flips is equal, the checksum will be the same. (This is assuming the TCP checksum is present and functioning end-to-end, and not being recomputed by the offending router or something. I forget how that all works.)
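
To make the cancellation concrete: the TCP checksum is a 16-bit ones' complement sum over the segment (RFC 1071), so a 0-to-1 flip and a 1-to-0 flip at the same bit position within a 16-bit word add and subtract the same amount. The sketch below runs the checksum over a bare payload only, leaving out the TCP header and pseudo-header, which is a simplification for illustration. The payload bytes are made up.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"                      # pad to a whole number of 16-bit words
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:                       # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


# Offsets 10 and 138 are 128 bytes apart, so they sit at the same position mod 16 bits.
payload = bytearray(b"p" * 256)              # 'p' is 0x70, bit 3 clear
payload[10] = ord("x")                       # 'x' is 0x78, bit 3 set

both = bytearray(payload)
both[10] ^= 0x08                             # 1 -> 0 flip
both[138] ^= 0x08                            # 0 -> 1 flip

single = bytearray(payload)
single[10] ^= 0x08                           # one flip only

print(internet_checksum(bytes(both)) == internet_checksum(bytes(payload)))    # True
print(internet_checksum(bytes(single)) == internet_checksum(bytes(payload)))  # False
```

The paired, opposite flips go undetected, while a single flip (or two flips in the same direction) changes the sum and would be caught.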

EDIT: Ooh, I missed @teichopsia's comment. I'm not enough of a network person to be familiar with TCAM specifically, but if the error was "pushing" the bits backwards, as they seem to maybe be describing -- if each flip was in the opposite direction from the previous flip -- then that would give a really high chance of the TCP checksum being unaffected.

@teichopsia

While I realize this is a dead/resolved issue, some people were curious how things like this happen. If you feel like falling down the rabbit hole, this thread from the ARM list gives a good recap of the hell that is corrupted data and how it becomes a thing.

https://lore.kernel.org/lkml/87h8k7h8q9.fsf@linux.ibm.com/T/
