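#!/bin/sh
# Fetch a reference copy over TLS, then poll plain HTTP until a response differs.
set -e
rm -f example-tls example-http
# Retry until the HTTPS fetch succeeds; this copy is the reference.
while ! curl -m 1 -s -o example-tls https://www.example.com; do
  true
done
# Stop as soon as a successful HTTP fetch does not match the reference.
while true; do
  if curl -m 1 -s -o example-http http://www.example.com/; then
    if ! diff -q example-tls example-http; then break; fi
  fi
done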
Back when I worked at Google, I saw an example of something very similar to this live, and I knew what I was looking for because I'd read about a past example of a similar nature. In both cases, the problem was ultimately diagnosed as "bad linecard". Hacker News commenters were talking about bad RAM, which makes sense to me. So, it would seem that on some card in some router, there is (well, was) a RAM chip with a single bad bit, and some packet buffer would get allocated across it, presumably with 128-byte alignment. Since the error bit in this case is not always-on or always-off, it could be getting copied from another bit, or from something weirder like an address line, or who knows...
And since the error is always in the same place mod 16 bits, the TCP checksum has very little power to save you. The checksum is a ones' complement sum of 16-bit words, so if the number of 0-to-1 flips at that bit position equals the number of 1-to-0 flips, the changes cancel and the checksum will be the same. (This is assuming the TCP checksum is present and functioning end-to-end, and not being recomputed by the offending router or something. I forget how that all works.)
EDIT: Ooh, I missed @teichopsia's comment. I'm not enough of a network person to be familiar with TCAM specifically, but if the error was "pushing" the bits backwards, as they seem to maybe be describing -- if each flip was in the opposite direction from the previous flip -- then that would give a really high chance of the TCP checksum being unaffected.
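The checksum argument above can be sketched in a few lines of shell. This is only an illustration, not a capture from the actual incident: the `inet_checksum` helper and the four sample words are made up, and it assumes the plain Internet checksum (ones' complement sum of 16-bit words, as described in RFC 1071) with no pseudo-header.

```shell
#!/bin/sh
# Illustrative sketch: Internet checksum over 16-bit words given as hex.
# inet_checksum and the sample words below are hypothetical.
inet_checksum() {
    sum=0
    for word in "$@"; do
        sum=$(( sum + 0x$word ))
    done
    # Fold carries above bit 15 back into the low 16 bits.
    while [ $(( sum >> 16 )) -ne 0 ]; do
        sum=$(( (sum & 0xFFFF) + (sum >> 16) ))
    done
    # The checksum is the ones' complement of the folded sum.
    printf '%04x\n' $(( ~sum & 0xFFFF ))
}

# Original words: bit 3 (0x0008) is set in the first word, clear in the second.
good=$(inet_checksum 4508 0034 1c46 4000)
# Paired flips: bit 3 goes 1->0 in word one and 0->1 in word two.
paired=$(inet_checksum 4500 003c 1c46 4000)
# Single flip: only the 1->0 flip in word one.
single=$(inet_checksum 4500 0034 1c46 4000)

echo "original:     $good"
echo "paired flips: $paired"
echo "single flip:  $single"
```

The paired case yields the same checksum as the original because the two flips add and subtract the same power of two from the sum, while the single flip changes it.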
While I realize this is a dead/resolved issue, some people were curious how things like this happen. If you feel like falling down the rabbit hole, this thread from the ARM list gives a good recap of the hell that is corrupted data and how it becomes a thing.
https://lore.kernel.org/lkml/87h8k7h8q9.fsf@linux.ibm.com/T/
I started doing a similar analysis, but the HTML from www.example.com differs from what I have, so without redoing all the raw captures to correct the offsets, all I came across initially was 256-byte intervals. I now see a ton more posts here to work with, so yeah. All the errors I saw stole bit #3 (LSB) and then returned it on the next bit flip. Feels like a TCAM corruption.
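For anyone who wants to repeat this kind of offset analysis on their own captures, here is a rough sketch. It assumes the two files left behind by the script at the top (`example-tls` and `example-http`) and relies on `cmp -l`, which lists each differing byte as a 1-based offset plus the two byte values in octal. The `flip_report` name and the output format are made up for illustration, and the direction column is only meaningful when a single bit differs in the byte.

```shell
#!/bin/sh
# Illustrative sketch: list where two files differ and which bits flipped.
# flip_report is a hypothetical helper, not part of the original script.
flip_report() {
    # cmp -l prints: <1-based byte offset> <octal byte in $1> <octal byte in $2>
    cmp -l "$1" "$2" | while read -r offset a b; do
        # A leading 0 makes the shell parse the cmp values as octal.
        xor=$(( 0$a ^ 0$b ))
        # If the flipped bit was set in the first file, it went 1->0.
        if [ $(( 0$a & xor )) -ne 0 ]; then dir='1->0'; else dir='0->1'; fi
        printf 'offset=%d mod256=%d xor=0x%02x %s\n' \
            $(( offset - 1 )) $(( (offset - 1) % 256 )) "$xor" "$dir"
    done
}

# Only run against the capture files if both are present.
if [ -f example-tls ] && [ -f example-http ]; then
    flip_report example-tls example-http
fi
```

A run of lines with the same `xor` value and alternating directions would match the "stole a bit, then returned it" pattern, and the `mod256` column makes any 256-byte spacing easy to spot.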