From: Matthew Prince
Date: Thu, Oct 7, 2010 at 9:09 AM
Subject: Re: Where's my dns?
To: John Graham-Cumming
So here's what happened...
CloudFlare runs a bunch of DNS servers scattered across 5 data centers
(Tokyo, San Jose, Chicago, Ashburn, Amsterdam). Whenever you make a
change to your DNS config, a server in Chicago (out of which we run
our master DB) picks up that change and ships it out to all the DNS
servers. We target having those changes pushed network-wide within
about 15 seconds and the new records responding within 1 minute.
Last night some code got pushed which updated how the master DB ships
out changes. It was, we thought, a minor update (updating to a newer
version of Google's protocol buffers code), but even though it had
passed our unit tests, when it went live it started pushing out
corrupted DNS zone files. The speed and efficiency of the system meant
everything crashed very quickly. We reverted the push and had things
back working within about 8 minutes of getting the alert that there
was a problem.
Unfortunately, it now looks like there was one corrupted zone that
persisted. I don't yet fully understand how it got saved, but that's
what we're investigating this morning now that things are back up.
That zone was pushed out network-wide. DNS failures began just after
6:00am (PDT), starting in Europe. As DNS systems went down in data
centers, failover worked and traffic shifted to other data centers.
Unfortunately, because the problem was corrupt data we had pushed out
everywhere, as the zone continued to be referenced we would continue
to have DNS failures. We've done a lot to protect against external
attacks on our DNS infrastructure, but this morning makes it clear we
haven't done enough to protect against bad data we introduce
ourselves.
So that's what we're working on now: better integrity checks before
pushing out DNS data, better detection of DNS failures, and diversity
in our DNS infrastructure so one bug is less likely to shut down all
systems.
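To make the first item concrete, here is a minimal sketch of what an integrity check before a push might look like. It is an illustration only, not CloudFlare's code: the zone layout, the function names (`validate_record`, `encode_zone`, `check_before_push`) and the list of record types are all hypothetical, and JSON stands in for the protocol buffers encoding mentioned above. The idea is to refuse to ship any payload that fails a checksum, does not decode back to the zone it came from, or contains records that do not look sane.

```python
import hashlib
import ipaddress
import json


def validate_record(record):
    """Return a list of problems found in one record (empty if it looks sane)."""
    problems = []
    name = record.get("name", "")
    rtype = record.get("type", "")
    value = record.get("value", "")
    ttl = record.get("ttl")

    if not name or len(name) > 253:
        problems.append(f"bad name: {name!r}")
    if not isinstance(ttl, int) or ttl <= 0:
        problems.append(f"bad ttl for {name!r}: {ttl!r}")

    if rtype == "A":
        try:
            ipaddress.IPv4Address(value)
        except ValueError:
            problems.append(f"bad IPv4 address for {name!r}: {value!r}")
    elif rtype == "AAAA":
        try:
            ipaddress.IPv6Address(value)
        except ValueError:
            problems.append(f"bad IPv6 address for {name!r}: {value!r}")
    elif rtype in ("CNAME", "NS", "MX", "TXT"):
        if not value:
            problems.append(f"empty value for {name!r}")
    else:
        # Hypothetical policy: anything outside the list above is rejected.
        problems.append(f"unsupported record type: {rtype!r}")
    return problems


def encode_zone(zone):
    """Serialize a zone and return (payload, sha256 hex digest).

    JSON is a stand-in here for whatever wire format is really used.
    """
    payload = json.dumps(zone, sort_keys=True).encode("utf-8")
    return payload, hashlib.sha256(payload).hexdigest()


def check_before_push(zone, payload, digest):
    """Raise ValueError unless the payload is safe to ship to DNS servers."""
    if hashlib.sha256(payload).hexdigest() != digest:
        raise ValueError("checksum mismatch: payload changed after encoding")
    try:
        decoded = json.loads(payload.decode("utf-8"))
    except ValueError as exc:  # covers bad UTF-8 and bad JSON
        raise ValueError("payload does not decode") from exc
    if decoded != zone:
        raise ValueError("round-trip mismatch: decoded zone differs from source")
    records = decoded.get("records", [])
    if not records:
        raise ValueError("zone has no records")
    problems = [p for record in records for p in validate_record(record)]
    if problems:
        raise ValueError("invalid records: " + "; ".join(problems))
```

A push script would call `encode_zone` once, then run `check_before_push` immediately before shipping the payload, and skip the push (keeping the last known-good zone in place) if it raises. The round-trip comparison is the part aimed at this kind of failure, where the encoder itself produces bad output even though the source data is fine.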
Thanks for the report. You were the first of many to write in. If you
ever see anything in the future, don't hesitate to call/text my cell.
This is the first thing I've ever done professionally that I truly,
completely love. I wake up every day just thrilled at the work we're
doing. As a result, downtime kills me, so it's been a rough morning.
We'll make a lot of mistakes in the future, I'm sure, but we won't
make this particular one again.
Matthew.
On Thursday, October 7, 2010, Matthew Prince wrote:
> Something blew up this morning. Don't know what yet. Will send post
> mortem as soon as I know exact cause. Back up now.
>
> On Oct 7, 2010, at 6:28 AM, John Graham-Cumming wrote:
>
>> Help!
>>
>> John.
@kristymoser2018

commented Jul 14, 2019

Well, I can tell you I was mid-transaction when that happened, and I could tell the shift instantly. Coinbase was normal, then it went missing, so I downloaded the app to retrieve my wallet ID. Needless to say, none of that is anywhere to be found except on the blockchain.
