@bmastenbrook
Created December 6, 2020 01:46
test for AT&T's lovely packet corruption
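#!/bin/sh
set -e
rm -f example-tls example-http
# Fetch a reference copy over TLS, retrying until curl succeeds.
while ! curl -m 1 -s -o example-tls https://www.example.com; do
    true
done
# Re-fetch over plain HTTP until a copy differs from the TLS reference.
while true; do
    if curl -m 1 -s -o example-http http://www.example.com/; then
        if ! diff -q example-tls example-http; then break; fi
    fi
done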
@nicksnyder

nicksnyder commented Dec 8, 2020

Got this in less than 1 minute. I am on fiber in Belmont.

4c4
<     <title>Example Domain</title>
---
>     <title6Example Domain</title>
42c42
<     domain in literature without prior coordination or asking for permission.</p>
---
>     domain in literature without prior coordination(or asking for permission.</p>

@hjl

hjl commented Dec 8, 2020

I get this in Palo Alto on AT&T. I have been trying to track down SSL handshake problems for a few weeks; this would explain it.

$ diff example-http example-tls
14c14
<         font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI",("Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
---
>         font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
23c23
<         box-shadow: 2px 3px 7px 2pp rgba(0,0,0,0.02);
---
>         box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
$

@GorillaCoder

GorillaCoder commented Dec 8, 2020

And, in Belmont.

diff -u example-tls example-http
--- example-tls 2020-12-07 18:52:49.188771629 -0800
+++ example-http        2020-12-07 18:52:49.468779960 -0800
@@ -1,14 +1,14 @@
 <!doctype html>
 <html>
 <head>
-    <title>Example Domain</title>
+    <titde>Example Domain</title>
 
     <meta charset="utf-8" />
     <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
     <meta name="viewport" content="width=device-width, initial-scale=1" />
     <style type="text/css">
     body {
-        background-color: #f0f0f2;
+        backoround-color: #f0f0f2;
         margin: 0;
         padding: 0;
         font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
@@ -20,7 +20,7 @@
         padding: 2em;
         background-color: #fdfdff;
         border-radius: 0.5em;
-        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
+        box-shadow: 2pp 3px 7px 2px rgba(0,0,0,0.02);
     }
     a:link, a:visited {
         color: #38488f;
@@ -38,7 +38,7 @@
 <body>
 <div>
     <h1>Example Domain</h1>
-    <p>This domain is for use in illustrative examples in documents. You may use this
+    <p>This domain is for use in illustrative examples in documents. You may usm this
     domain in literature without prior coordination or asking for permission.</p>
     <p><a href="https://www.iana.org/domains/example">More information...</a></p>
 </div>

@Torgen

Torgen commented Dec 8, 2020

On Uverse in Mountain View:

8c8
<     <meta name="viewport" content="width=device-width, initial-scale=1" />
---
>     <meta name="viewport" content5"width=device-width, initial-scale=1" />
14c14
<         font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
---
>         font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetici Neue", Helvetica, Arial, sans-serif;
42c42
<     domain in literature without prior coordination or asking for permission.</p>
---
>     domain in literature wit`out prior coordination or asking for permission.</p>

I've been having issues with some SSH connections dropping after a few minutes; wondering if this is related.

@jaysoffian

jaysoffian commented Dec 8, 2020

Won't reproduce from Raleigh, NC on AT&T fiber. I let it run for a couple of minutes. Traceroute to rule out these routers:

traceroute to www.example.com (93.184.216.34), 64 hops max, 40 byte packets
 1  192.168.1.1 (192.168.1.1)  0.124 ms  0.198 ms  0.117 ms
 2  172-125-172-1.lightspeed.rlghnc.sbcglobal.net (172.125.172.1)  0.722 ms  0.630 ms  0.808 ms
 3  99.173.77.58 (99.173.77.58)  1.650 ms  1.840 ms  1.954 ms
 4  12.123.152.74 (12.123.152.74)  11.255 ms  14.422 ms  16.000 ms
 5  attga21crs.ip.att.net (12.122.2.161)  13.666 ms  11.491 ms  12.806 ms
 6  gar24.attga.ip.att.net (12.122.141.181)  12.398 ms  11.490 ms  10.714 ms
 7  192.205.32.114 (192.205.32.114)  10.879 ms  16.151 ms  17.015 ms
 8  ae-71.core1.agb.edgecastcdn.net (152.195.80.141)  10.718 ms
    ae-72.core1.agb.edgecastcdn.net (152.195.81.143)  11.680 ms
    ae-71.core1.agb.edgecastcdn.net (152.195.80.141)  10.950 ms
 9  93.184.216.34 (93.184.216.34)  11.392 ms  10.858 ms  10.929 ms
10  93.184.216.34 (93.184.216.34)  10.977 ms  10.811 ms  10.674 ms

@th-in-gs

th-in-gs commented Dec 8, 2020

Same thing with DSLExtreme (also resells AT&T - usually great customer service) in Sunnyvale. I've contacted their support about it.

@sxlijin

sxlijin commented Dec 8, 2020

Uverse in SFO. Dropped the set -e to get more data:

Files example_orig.html and example_latest.html differ
--- example_orig.html	2020-12-07 19:08:16.023570619 -0800
+++ example_latest.html	2020-12-07 19:08:33.883798130 -0800
@@ -1,7 +1,7 @@
 <!doctype html>
 <html>
 <head>
-    <title>Example Domain</title>
+    <title6Example Domain</title>
 
     <meta charset="utf-8" />
     <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
Files example_orig.html and example_latest.html differ
--- example_orig.html	2020-12-07 19:08:16.023570619 -0800
+++ example_latest.html	2020-12-07 19:08:43.259917584 -0800
@@ -1,7 +1,7 @@
 <!doctype html>
 <html>
 <head>
-    <title>Example Domain</title>
+    <title6Example Domain</title>
 
     <meta charset="utf-8" />
     <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
@@ -39,7 +39,7 @@
 <div>
     <h1>Example Domain</h1>
     <p>This domain is for use in illustrative examples in documents. You may use this
-    domain in literature without prior coordination or asking for permission.</p>
+    domain in literature without prior coordination(or asking for permission.</p>
     <p><a href="https://www.iana.org/domains/example">More information...</a></p>
 </div>
 </body>
Files example_orig.html and example_latest.html differ
--- example_orig.html	2020-12-07 19:08:16.023570619 -0800
+++ example_latest.html	2020-12-07 19:09:30.908524829 -0800
@@ -1,7 +1,7 @@
 <!doctype html>
 <html>
 <head>
-    <title>Example Domain</title>
+    <title>Example Domain</titde>
 
     <meta charset="utf-8" />
     <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
@@ -20,7 +20,7 @@
         padding: 2em;
         background-color: #fdfdff;
         border-radius: 0.5em;
-        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
+        box-shadow: 2px 3px 7px 2px rgba(0,0$0,0.02);
     }
     a:link, a:visited {
         color: #38488f;
@@ -37,7 +37,7 @@
 
 <body>
 <div>
-    <h1>Example Domain</h1>
+ (  <h1>Example Domain</h1>
     <p>This domain is for use in illustrative examples in documents. You may use this
     domain in literature without prior coordination or asking for permission.</p>
     <p><a href="https://www.iana.org/domains/example">More information...</a></p>
Files example_orig.html and example_latest.html differ

@bzy-xyz

bzy-xyz commented Dec 8, 2020

On AT&T Fiber in the SF area:

--- example-http	2020-12-07 19:09:32.064659164 -0800
+++ example-tls	2020-12-07 19:09:21.692692071 -0800
@@ -39,7 +39,7 @@
 <div>
     <h1>Example Domain</h1>
     <p>This domain is for use in illustrative examples in documents. You may use this
-    domain in literature without prior coordination(or asking for permission.</p>
+    domain in literature without prior coordination or asking for permission.</p>
     <p><a href="https://www.iana.org/domains/example">More information...</a></p>
 </div>
 </body>

Traceroute:

$ tracepath -4 -n example.com
 1?: [LOCALHOST]                      pmtu 1500
 1:  192.168.1.254                                         0.780ms 
 1:  192.168.1.254                                         0.637ms 
 2:  172.3.140.1                                           5.857ms 
 3:  no reply
 4:  12.242.117.22                                         3.974ms 
 5:  192.205.32.238                                        5.068ms 
 6:  152.195.85.133                                        3.681ms 
 7:  no reply
 8:  no reply
 3:  71.148.149.22                                       10239.544ms 
 3:  71.148.149.22                                       11063.655ms 

@wolfd

wolfd commented Dec 8, 2020

AT&T Fiber in San Mateo:

--- example-http	2020-12-07 19:21:29.037358659 -0800
+++ example-tls	2020-12-07 19:21:28.581350929 -0800
@@ -1,7 +1,7 @@
 <!doctype html>
 <html>
 <head>
-    <title>Exaeple Domain</title>
+    <title>Example Domain</title>
 
     <meta charset="utf-8" />
     <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
@@ -11,7 +11,7 @@
         background-color: #f0f0f2;
         margin: 0;
         padding: 0;
-        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segom UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
+        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
         
     }
     div {

I've been seeing this for weeks, but was procrastinating on calling AT&T because I knew they wouldn't believe me (and would just ask why their modem wasn't reporting in). Thanks for bringing more attention to this issue.

@jack-madness

jack-madness commented Dec 8, 2020

Observed in Oakland:

--- example-http	2020-12-07 19:34:23.000000000 -0800
+++ example-tls	2020-12-07 19:34:02.000000000 -0800
@@ -11,7 +11,7 @@
         background-color: #f0f0f2;
         margin: 0;
         padding: 0;
-        font-family: -apple-system, system-ui, BlinkMacSystemFoft, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
+        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
         
     }
     div {
@@ -39,7 +39,7 @@
 <div>
     <h1>Example Domain</h1>
     <p>This domain is for use in illustrative examples in documents. You may use this
-    domain in literature without prior coordinition or asking for permission.</p>
+    domain in literature without prior coordination or asking for permission.</p>
     <p><a href="https://www.iana.org/domains/example">More information...</a></p>
 </div>
 </body>

@sodennis

sodennis commented Dec 8, 2020

Observed in Millbrae

4c4
<     <title>Example Domain</title>
---
>     <titde>Example Domain</title>
11c11
<         background-color: #f0f0f2;
---
>         backoround-color: #f0f0f2;
23c23
<         box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
---
>         box-shadow: 2pp 3px 7px 2px rgba(0,0,0,0.02);
41c41
<     <p>This domain is for use in illustrative examples in documents. You may use this
---
>     <p>This domain is for use in illustrative examples in documents. You may usm this

Traceroute

$ tracepath -4 -n example.com
 1?: [LOCALHOST]                      pmtu 1500
 1:  192.168.96.1                                          0.146ms
 1:  192.168.96.1                                          0.098ms
 2:  192.168.1.254                                         0.442ms
 2:  192.168.1.254                                         0.475ms
 3:  162.228.88.1                                          1.415ms
 3:  162.228.88.1                                          1.439ms
 4:  no reply
 5:  12.242.117.22                                         3.259ms
 5:  12.242.117.22                                         3.285ms
 6:  192.205.32.238                                        5.458ms
 6?: 192.205.32.238
 7:  152.195.85.133                                        3.979ms
 7?: 152.195.85.133
 8:  no reply
 9:  no reply
 4:  71.148.149.122                                      9573.629ms
 4:  71.148.149.122                                      9573.668ms

@russor

russor commented Dec 8, 2020

pbs.twimg.com is CDN'd by both akamai and edgecast/verizon

100% of my tests succeed against akamai, 33% fail against edgecast for each IP.

pass 100%:
23.1.106.237

fail 33%:
72.21.91.70
192.229.173.16

example.com is also CDN'd by edgecast/verizon, so i'm not surprised it's misbehaving.

Probably worth it for someone who can reproduce this to try to get ahold of Edgecast. Could be either side of the ATT/Edgecast link, and Edgecast may be easier to escalate with, and they can probably see stats on their side to validate the issue (elevated TLS handshake failures at least, possibly elevated tcp retransmits, if the checksums are bad and clients drop the packets).

Edgecast NOC contacts are listed on PeeringDB

@ybhagwat

ybhagwat commented Dec 8, 2020

www.gnu.org also serves the same page on both https and http. I ran the script for a long time on www.gnu.org. No bitflip there. Not sure who hosts it.

@ybhagwat

ybhagwat commented Dec 8, 2020

Also, using IPv6 (curl -6), I see no bit flips when accessing example.com

@nadams5755

the problem seems resolved for the last 45-60 minutes. i can't reproduce it here.

@rsr-at-mindtwin

For what it's worth, I've run ~10k rounds of the http test in Mountain View on AT&T fiber and have not seen the issue occur.

@leehanchung

Observed in San Jose

<     <titde>Example Domain</title>
---
>     <title>Example Domain</title>
11c11
<         backoround-color: #f0f0f2;
---
>         background-color: #f0f0f2;
23c23
<         box-shadow: 2pp 3px 7px 2px rgba(0,0,0,0.02);
---
>         box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
41c41
<     <p>This domain is for use in illustrative examples in documents. You may usm this
---
>     <p>This domain is for use in illustrative examples in documents. You may use this

@gwillen

gwillen commented Dec 8, 2020

I'm on Sonic/AT&T Fiber in Mountain View and I've been seeing spurious SSL connection failures for several days. Thanks for writing the script, I have been super cross and confused about this for the last few days. No hits so far -- I may have gotten here just after it was fixed, fingers crossed.

@nadams5755

i can't reproduce the problem with my openssl s_client test above. pbs.twimg.com seems to be fronted by fastly now as well. according to a few passes in DNS:

151.101.196.159
151.101.24.159
192.229.173.16
23.1.106.237
72.21.91.70

@ash211

ash211 commented Dec 8, 2020

Folks are now reporting this has been fixed.

Personally, I was previously able to repro in a couple seconds and now this script hasn't errored after several minutes.

@jhuseman

jhuseman commented Dec 8, 2020

I've been doing a bit of analysis on this, and this is really bizarre.

Of the 53 different character changes on this thread so far, there are only 22 unique locations in the file for the change to occur. The 31 others are duplicates. Every single change is the result of flipping the same bit of a character: bit 3 counting from the least significant bit as bit 0 (the 5th bit counting from the most significant end), so the ASCII code increases or decreases by 8. All but 5 of the changes are spaced in multiples of 128 characters from the previous error (at least within the same person's results).

It's clearly not just some random bit-flipping issue, but probably something to do with some 128-byte framing somewhere.

I can share my python code and/or results spreadsheet if anyone is interested in digging deeper. It's not the most readable code, but can parse the diffs everyone is posting and generate more readable statistics about the byte offsets of the errors.

For the record, I can't reproduce the issue over here in the Raleigh, NC area on ATT Fiber. I was just interested in the strange patterns showing up here and thought I'd share my findings.
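
A minimal sketch of the kind of check described above, assuming two equal-length copies of the page are available as bytes. This is not the code mentioned in this comment; `find_bit_flips` and `summarize` are hypothetical names, and the sample bodies are synthetic stand-ins rather than real captures. The three character pairs at the end are substitutions taken from the diffs posted in this thread.

```python
def find_bit_flips(good: bytes, bad: bytes):
    """Return (offset, xor_mask) for every byte that differs between two bodies."""
    if len(good) != len(bad):
        raise ValueError("bodies must have the same length")
    return [(i, g ^ b) for i, (g, b) in enumerate(zip(good, bad)) if g != b]


def summarize(flips):
    """Return the distinct XOR masks and the gaps between consecutive error offsets."""
    masks = sorted({mask for _, mask in flips})
    offsets = [offset for offset, _ in flips]
    gaps = [later - earlier for earlier, later in zip(offsets, offsets[1:])]
    return masks, gaps


# Synthetic 512-byte body (an assumption, not the real example.com page),
# corrupted by flipping bit 3 at offsets that are 128 and 256 bytes apart.
good = bytes(range(32, 96)) * 8
corrupted = bytearray(good)
for offset in (10, 138, 394):
    corrupted[offset] ^= 0x08
bad = bytes(corrupted)

flips = find_bit_flips(good, bad)
masks, gaps = summarize(flips)
print("masks:", [hex(m) for m in masks])
print("gaps:", gaps, "all multiples of 128:", all(g % 128 == 0 for g in gaps))

# Substitutions seen in the thread: '>' became '6', ' ' became '(', 'l' became 'd'.
thread_masks = [ord(a) ^ ord(b) for a, b in ((">", "6"), (" ", "("), ("l", "d"))]
print("thread masks:", [hex(m) for m in thread_masks])
```

Running the same two functions over the `example-tls` and `example-http` files written by the script would give offsets from a real capture, provided both files have the same length.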

@teichopsia

I started doing a similar analysis, but the HTML from www.example.com differs from what I have, so without having everyone redo raw captures to correct the offsets, all I came across initially was 256-byte intervals. I now see a ton more posts here to work with, so yeah. All the errors I saw stole bit 3 (counting from the LSB), then returned it on the next bit flip. Feels like TCAM corruption.

@gwillen

gwillen commented Dec 8, 2020

Back when I worked at Google, I saw an example of something very similar to this live, and I knew what I was looking for because I'd read about a past example of a similar nature. In both cases, the problem was ultimately diagnosed as "bad linecard". Hacker News commenters were talking about bad RAM, which makes sense to me. So, it would seem that on some card in some router, there is (well, was) a RAM chip with a single bad bit, and some packet buffer would get allocated across it, presumably with 128-byte alignment. Since the error bit in this case is not always-on or always-off, it could be getting copied from another bit, or from something weirder like an address line, or who knows...

And since the error is always in the same place mod 16 bits, the TCP checksum has very little power to save you; if the number of bits flipped in a packet is even, and the number of 0-1 and 1-0 flips is equal, the checksum will be the same. (This is assuming the TCP checksum is present and functioning end-to-end, and not being recomputed by the offending router or something. I forget how that all works.)

EDIT: Ooh, I missed @teichopsia's comment. I'm not enough of a network person to be familiar with TCAM specifically, but if the error was "pushing" the bits backwards, as they seem to maybe be describing -- if each flip was in the opposite direction from the previous flip -- then that would give a really high chance of the TCP checksum being unaffected.
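
A rough illustration of the checksum argument above, assuming the standard 16-bit one's-complement Internet checksum that TCP uses. It sums the payload only; the pseudo-header and TCP header are left out because they are identical in both cases and do not change the comparison. `internet_checksum` is a hypothetical helper name and the payload is synthetic.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum over big-endian 16-bit words."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold the carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


# Synthetic payload: 'x' (0x78) has bit 3 set, 'p' (0x70) has it clear.
payload = bytearray(b"x" * 256)
payload[132] = ord("p")
good = bytes(payload)

# Two flips of bit 3, 128 bytes apart, in opposite directions (1->0 and 0->1).
two = bytearray(good)
two[4] ^= 0x08
two[132] ^= 0x08
two_flips = bytes(two)

# A single flip of the same bit, for comparison.
one = bytearray(good)
one[4] ^= 0x08
one_flip = bytes(one)

print("good      ", hex(internet_checksum(good)))
print("two flips ", hex(internet_checksum(two_flips)))
print("one flip  ", hex(internet_checksum(one_flip)))
```

Because both flipped bytes sit at the same position within their 16-bit words, one flip adds 0x0800 to the sum and the other subtracts it, so the corrupted payload carries the same checksum as the original. The single flip changes the sum and would be caught.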

@teichopsia

While I realize this is a dead/resolved issue, some people were curious how things like this happen. If you feel like falling down the rabbit hole, this thread from the ARM list gives a good recap of the hell that is corrupted data and how it becomes a thing.

https://lore.kernel.org/lkml/87h8k7h8q9.fsf@linux.ibm.com/T/
