@bmastenbrook
Created December 6, 2020 01:46
test for AT&T's lovely packet corruption
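#!/bin/sh
# Fetch a reference copy over TLS, then refetch over plain HTTP until the
# two copies differ, i.e. until a corrupted response is caught.
set -e
rm -f example-tls example-http
# Reference copy: retry until the HTTPS fetch succeeds.
while ! curl -m 1 -s -o example-tls https://www.example.com; do
    true
done
# Poll over HTTP; stop at the first response that differs from the reference.
while true; do
    if curl -m 1 -s -o example-http http://www.example.com/; then
        if ! diff -q example-tls example-http; then break; fi
    fi
done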
@teichopsia

I started doing a similar analysis, but the HTML from www.example.com differs from what I have. So without redoing all the raw captures to correct the offsets, all I came across initially was 256-byte intervals. I now see a ton more posts here to work with, so yeah. All the errors I saw stole a bit #3 (LSB), then returned it on the next bit flip. Feels like a TCAM corruption.
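
For anyone who wants to repeat this kind of offset analysis on the two files the script leaves behind, here is a rough sketch. `diff_bits` is a made-up helper name, not something from the gist. It assumes a POSIX `cmp`, whose `-l` output is a 1-based decimal byte number followed by the two differing byte values in octal.

```shell
#!/bin/sh
# Hypothetical helper (not part of the gist): list every differing byte
# between two files, with its 0-based offset, that offset modulo 256,
# and the XOR of the two byte values (the set bits are the flipped bits).
diff_bits() {
    # cmp exits 1 when the files differ, which is the expected case here,
    # so its exit status is deliberately ignored.
    { cmp -l "$1" "$2" || true; } | while read -r offset a b; do
        # The leading 0 makes the shell parse $a and $b as octal constants.
        printf 'offset=%d mod256=%d xor=0x%02x\n' \
            "$(( offset - 1 ))" "$(( (offset - 1) % 256 ))" "$(( 0$a ^ 0$b ))"
    done
}
```

Calling `diff_bits example-tls example-http` after the script stops prints one line per corrupted byte. The modulus of 256 is only there because of the interval mentioned above; swap in whatever period you want to test.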

@gwillen

gwillen commented Dec 8, 2020

Back when I worked at Google, I saw an example of something very similar to this live, and I knew what I was looking for because I'd read about a past example of a similar nature. In both cases, the problem was ultimately diagnosed as "bad linecard". Hacker News commenters were talking about bad RAM, which makes sense to me. So, it would seem that on some card in some router, there is (well, was) a RAM chip with a single bad bit, and some packet buffer would get allocated across it, presumably with 128-byte alignment. Since the error bit in this case is not always-on or always-off, it could be getting copied from another bit, or from something weirder like an address line, or who knows...

And since the error is always in the same place mod 16 bits, the TCP checksum has very little power to save you; if the number of bits flipped in a packet is even, and the number of 0-1 and 1-0 flips is equal, the checksum will be the same. (This is assuming the TCP checksum is present and functioning end-to-end, and not being recomputed by the offending router or something. I forget how that all works.)
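
To make the checksum argument concrete, here is a small sketch of the 16-bit one's-complement Internet checksum in plain shell arithmetic. The function name `inet_checksum` and the sample words are made up for illustration, and it only sums the words it is given; a real TCP checksum also covers a pseudo-header, which is left out here.

```shell
#!/bin/sh
# Illustrative only: one's-complement checksum over 16-bit words passed
# as hex arguments. The sample data below is invented, not a real packet.
inet_checksum() {
    sum=0
    for word in "$@"; do
        sum=$(( sum + 0x$word ))
    done
    # Fold any carry out of the low 16 bits back in (end-around carry).
    while [ $(( sum >> 16 )) -ne 0 ]; do
        sum=$(( (sum & 0xFFFF) + (sum >> 16) ))
    done
    printf '%04x\n' $(( ~sum & 0xFFFF ))
}

# Original data: bit 3 is clear in the first word and set in the second.
inet_checksum 4500 0008
# One 0->1 flip and one 1->0 flip at bit 3: same checksum as the original.
inet_checksum 4508 0000
# A single 0->1 flip at bit 3: the checksum changes, so this one is caught.
inet_checksum 4508 0008
```

The paired flips add and subtract the same power of two, so the sum never moves. That only works because both flips sit at the same bit position within their 16-bit words.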

EDIT: Ooh, I missed @teichopsia's comment. I'm not enough of a network person to be familiar with TCAM specifically, but if the error was "pushing" the bits backwards, as they seem to maybe be describing -- if each flip was in the opposite direction from the previous flip -- then that would give a really high chance of the TCP checksum being unaffected.

@teichopsia

While I realize this is a dead/resolved issue, some people were curious how things like this happen. If you feel like falling down the rabbit hole, this thread from the ARM list gives a good recap of the hell that is corrupted data and how it becomes a thing.

https://lore.kernel.org/lkml/87h8k7h8q9.fsf@linux.ibm.com/T/
