From: Matthew Prince
Date: Thu, Oct 7, 2010 at 9:09 AM
Subject: Re: Where's my dns?
To: John Graham-Cumming

So here's what happened...

CloudFlare runs a bunch of DNS servers scattered across 5 data centers
(Tokyo, San Jose, Chicago, Ashburn, Amsterdam). Whenever you make a
change to your DNS config, a server in Chicago (out of which we run
our master DB) picks up that change and ships it out to all the DNS
servers. We target having those changes pushed network-wide within
about 15 seconds and the new records responding within 1 minute.

Last night some code got pushed which updated how the master DB ships
out changes. It was, we thought, a minor update (updating to a newer
version of Google's protocol buffers code) but, even though it had
passed our unit tests, when it went live it started pushing out
corrupted DNS zone files. The speed and efficiency of the system meant
everything crashed very quickly. We reverted the push and had things
back working within about 8 minutes of getting the alert that there
was a problem.

Unfortunately, it now looks like there was one corrupted zone that
persisted. I don't yet fully understand how it got saved, but that's
what we're investigating this morning now that things are back up.
That zone was pushed out network-wide. DNS failures began just after
6:00am (PDT), beginning in Europe. As DNS systems went down in data
centers, failover worked and traffic shifted to other data centers.
Unfortunately, because the problem was corrupt data we had pushed out
everywhere, as the zone continued to be referenced we would continue
to have DNS failures. We've done a lot to protect against external
attacks on our DNS infrastructure; this morning makes it clear we
haven't done enough to protect against bad data we introduce
ourselves.

So that's what we're working on now: better integrity checks before
pushing out DNS data, better detection of DNS failures, and diversity
in our DNS infrastructure so one bug is less likely to shut down all
systems.

Thanks for the report. You were the first of many to write in. If you
ever see anything in the future, don't hesitate to call/text my cell.
This is the first thing I've ever done professionally that I truly,
completely love. I wake up every day just thrilled at the work we're
doing. As a result, downtime kills me, so it's been a rough morning.
We'll make a lot of mistakes in the future, I'm sure, but we won't
make this particular one again.

Matthew.

On Thursday, October 7, 2010, Matthew Prince wrote:
> Something blew up this morning. Don't know what yet. Will send post
> mortem as soon as I know exact cause. Back up now.
>
> On Oct 7, 2010, at 6:28 AM, John Graham-Cumming wrote:
>
>> Help!
>>
>> John.
Well, I can tell you I was mid-transaction when that happened, and I could tell the shift instantly. Coinbase was normal, then it went missing, so I downloaded the app to retrieve my wallet ID. Needless to say, none of that is anywhere to be found except on the blockchain.