jgrahamc/gist:6bb02a6f7c3799a1590b3cdb901f8e08 Secret

## gistfile1.txt
From: Matthew Prince
Date: Thu, Oct 7, 2010 at 9:09 AM
Subject: Re: Where's my dns?
To: John Graham-Cumming

So here's what happened.....

CloudFlare runs a bunch of DNS servers scattered across 5 data centers
(Tokyo, San Jose, Chicago, Ashburn, Amsterdam). Whenever you make a
change to your DNS config, a server in Chicago (out of which we run
our master DB) picks up that change and ships it out to all the DNS
servers. We target having those changes pushed network-wide within
about 15 seconds and the new records responding within 1 minute.

Last night some code got pushed which updated how the master DB ships
out changes. It was, we thought, a minor update (moving to a newer
version of Google's protocol buffers code), but even though it had
passed our unit tests, when it went live it started pushing out
corrupted DNS zone files. The speed and efficiency of the system meant
everything crashed very quickly. We reverted the push and had things
back working within about 8 minutes of getting the alert that there
was a problem.

Unfortunately, it now looks like there was one corrupted zone that
persisted. I don't yet fully understand how it got saved, but that's
what we're investigating this morning now that things are back up.
That zone was pushed out network-wide. DNS failures began just after
6:00am (PDT), starting in Europe. As DNS systems went down in data
centers, failover worked and traffic shifted to other data centers.
Unfortunately, because the problem was corrupt data we had pushed out
everywhere, as the zone continued to be referenced we would continue
to have DNS failures. We've done a lot to protect against external
attacks on our DNS infrastructure; this morning makes it clear we
haven't done enough to protect against bad data we introduce
ourselves.

So that's what we're working on now: better integrity checks before
pushing out DNS data, better detection of DNS failures, and diversity
in our DNS infrastructure so one bug is less likely to shut down all
systems.
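
To make the first of those concrete, here is a minimal sketch of a pre-push integrity check. It is only an illustration, not the code CloudFlare actually runs: the record layout, the function names (`validate_records`, `prepare_push`) and the use of JSON as a stand-in for protocol buffers are all assumptions. The idea is to validate the records, serialize them, decode the serialized bytes again, and refuse to push unless the decoded copy matches the original. A checksum travels with the payload so that each DNS server could verify what it received.

```python
import hashlib
import json

# Hypothetical set of record types this sketch accepts.
VALID_TYPES = {"A", "AAAA", "CNAME", "MX", "NS", "TXT"}


def serialize_zone(records):
    """Stand-in serializer (JSON here; the real system used protocol buffers)."""
    return json.dumps(records, sort_keys=True).encode("utf-8")


def deserialize_zone(payload):
    """Inverse of serialize_zone."""
    return json.loads(payload.decode("utf-8"))


def validate_records(records):
    """Return a list of problems; an empty list means the zone looks sane."""
    problems = []
    if not records:
        problems.append("zone has no records")
    for i, rec in enumerate(records):
        for field in ("name", "type", "value", "ttl"):
            if field not in rec:
                problems.append(f"record {i}: missing {field}")
        if "type" in rec and rec["type"] not in VALID_TYPES:
            problems.append(f"record {i}: unknown type {rec['type']!r}")
        ttl = rec.get("ttl")
        if "ttl" in rec and (isinstance(ttl, bool) or not isinstance(ttl, int) or ttl <= 0):
            problems.append(f"record {i}: bad ttl {ttl!r}")
    return problems


def prepare_push(records, serializer=serialize_zone):
    """Validate, serialize, and round-trip a zone before it is shipped.

    Returns (payload, sha256_hex). Raises ValueError instead of returning
    a payload that is invalid or does not decode back to the input.
    """
    problems = validate_records(records)
    if problems:
        raise ValueError("zone failed validation: " + "; ".join(problems))

    payload = serializer(records)

    # Round-trip check: decode what we are about to ship and compare.
    try:
        decoded = deserialize_zone(payload)
    except ValueError as exc:  # includes JSON and Unicode decode errors
        raise ValueError("serialized zone does not decode") from exc
    if decoded != records:
        raise ValueError("serialized zone does not match source records")

    return payload, hashlib.sha256(payload).hexdigest()
```

A receiving server could recompute the SHA-256 of the payload and keep serving its previous copy of the zone if the digests differ. A check like this only catches corruption introduced at the serialization step; it would not by itself explain or prevent a bad zone that was saved upstream.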

Thanks for the report. You were the first of many to write in. If you
ever see anything in the future, don't hesitate to call/text my cell.
This is the first thing I've ever done professionally that I truly,
completely love. I wake up every day just thrilled at the work we're
doing. As a result, downtime kills me, so it's been a rough morning.
We'll make a lot of mistakes in the future, I'm sure, but we won't
make this particular one again.

Matthew.

On Thursday, October 7, 2010, Matthew Prince wrote:
> Something blew up this morning. Don't know what yet. Will send post
> mortem as soon as I know exact cause. Back up now.
>
> On Oct 7, 2010, at 6:28 AM, John Graham-Cumming wrote:
>
>> Help!
>>
>> John.