Source: http://groups.google.com/group/mongodb-user/browse_thread/thread/528a94f287e9d77e
(Note: this is being posted with Foursquare’s permission.)
As many of you are aware, Foursquare had a significant outage this
week. The outage was caused by capacity problems on one of the
machines hosting the MongoDB database used for check-ins. This is an
account of what happened, why it happened, how it can be prevented,
and how 10gen is working to improve MongoDB in light of this outage.
It’s important to note that throughout this week, 10gen and Foursquare
engineers have been working together very closely to resolve the
issue.
* Some history
Foursquare has been hosting check-ins on a MongoDB database for some
time now. The database was originally running on a single EC2
instance with 66GB of RAM. About 2 months ago, in response to
increased capacity requirements, Foursquare migrated that single
instance to a two-shard cluster. Now, each shard was running on its
own 66GB instance, and both shards were also replicating to a slave
for redundancy. This was an important migration because it allowed
Foursquare to keep all of their check-in data in RAM, which is
essential for maintaining acceptable performance.
The data had been split into 200 evenly distributed chunks based on
user id. The first half went to one server, and the other half to the
other. Each shard had about 33GB of data in RAM at this point, and
the whole system ran smoothly for two months.
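For readers who have not configured sharding before, the setup looks
roughly like the following. This is a minimal pymongo sketch; the
database, collection, and field names ("foursquare", "checkins", "uid")
and the mongos address are illustrative assumptions, not Foursquare's
actual schema.

    from pymongo import MongoClient

    # Connect to the mongos query router (hypothetical address).
    client = MongoClient("mongos.example.com", 27017)

    # Enable sharding on the database, then shard the check-in collection
    # on the user id field so that chunks are ranges of user ids.
    client.admin.command("enableSharding", "foursquare")
    client.admin.command("shardCollection", "foursquare.checkins", key={"uid": 1})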
* What we missed in the interim
Over these two months, check-ins were being written continually to
each shard. Unfortunately, these check-ins did not grow evenly across
chunks. It’s easy to imagine how this might happen: assuming certain
subsets of users are more active than others, it’s conceivable that
their updates might all go to the same shard. That’s what occurred in
this case, resulting in one shard growing to 66GB and the other only
to 50GB. [1]
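One way to catch this kind of drift early is to compare per-shard data
sizes periodically. A minimal sketch using the collStats command through
pymongo (collection and host names are again assumptions):

    from pymongo import MongoClient

    client = MongoClient("mongos.example.com", 27017)

    # On a sharded collection, collStats includes a per-shard breakdown.
    stats = client["foursquare"].command("collstats", "checkins")
    for shard_name, shard_stats in stats.get("shards", {}).items():
        size_gb = shard_stats["size"] / (1024 ** 3)
        print(f"{shard_name}: {size_gb:.1f} GB of check-in data")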
* What went wrong
On Monday morning, the data on one shard (we’ll call it shard0)
finally grew to about 67GB, surpassing the 66GB of RAM on the hosting
machine. Whenever data size grows beyond physical RAM, it becomes
necessary to read and write to disk, which is orders of magnitude
slower than reading and writing RAM. Thus, certain queries started to
become very slow, and this caused a backlog that brought the site
down.
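The tipping point is data size exceeding physical RAM. A rough way to
watch for it is to compare the mongod's resident memory against the
machine's RAM, as in this sketch (the host name is hypothetical; the
66GB figure is from the account above):

    from pymongo import MongoClient

    RAM_MB = 66 * 1024  # physical RAM on the shard host

    shard0 = MongoClient("shard0.example.com", 27017)
    status = shard0.admin.command("serverStatus")

    resident_mb = status["mem"]["resident"]  # resident set size, in MB
    print(f"resident: {resident_mb} MB of {RAM_MB} MB physical RAM")
    if resident_mb > 0.9 * RAM_MB:
        print("warning: working set is at or near physical RAM")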
We first attempted to fix the problem by adding a third shard. We
brought the third shard up and started migrating chunks. Queries were
now being distributed to all three shards, but shard0 continued to hit
disk very heavily. When this failed to correct itself, we ultimately
discovered that the problem was due to data fragmentation on shard0.
In essence, although we had moved 5% of the data from shard0 to the
new third shard, the data files, in their fragmented state, still
needed the same amount of RAM. This can be explained by the fact that
Foursquare check-in documents are small (around 300 bytes each), so
many of them can fit on a 4KB page. Removing 5% of these just made
each page a little more sparse, rather than removing pages
altogether.[2]
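The arithmetic is worth spelling out. With the figures quoted above, a
4KB page holds roughly 13 check-in documents, so randomly removing 5% of
documents still leaves about 12 on every page; nearly every page is
still touched by queries and still has to stay in RAM:

    PAGE_BYTES = 4096
    DOC_BYTES = 300

    docs_per_page = PAGE_BYTES // DOC_BYTES   # ~13 check-ins per 4KB page
    after_migration = docs_per_page * 0.95    # ~12.35 left per page on average
    print(docs_per_page, after_migration)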
After the first day's outage it had become clear that chunk migration,
sans compaction, was not going to solve the immediate problem. By the
time the second day's outage occurred, we had already moved 5% of the
data off of shard0, so we decided to start an offline process to
compact the database using MongoDB’s repairDatabase() feature. This
process took about 4 hours (partly due to the data size, and partly
because of the slowness of EBS volumes at the time). At the end of
the 4 hours, the RAM requirements for shard0 had in fact been reduced
by 5%, allowing us to bring the system back online.
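For reference, repairDatabase() is invoked as a server command against
the mongod in question, which has to be out of rotation while it runs
(hence the offline process described above). A minimal pymongo sketch,
with a hypothetical host name:

    from pymongo import MongoClient

    # Connect directly to shard0's mongod after taking it out of rotation.
    shard0 = MongoClient("shard0.example.com", 27017)

    # repairDatabase rewrites the data files, compacting out fragmentation.
    # It blocks the database and can take hours on a large data set.
    shard0["foursquare"].command("repairDatabase")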
* Afterwards
Since repairing shard0 and adding a third shard, we’ve set up even
more shards, and now the check-in data is evenly distributed and there
is a good deal of extra capacity. Still, we had to address the
fragmentation problem. We ran a repairDatabase() on the slaves, and
promoted the slaves to masters, further reducing the RAM needed on
each shard to about 20GB.
* How is this issue triggered?
Several conditions need to be met to trigger the issue that brought
down Foursquare:
1. Systems are at or over capacity. How capacity is defined varies; in
the case of Foursquare, all data needed to fit into RAM for acceptable
performance. Other deployments may not have such strict RAM
requirements.
2. Document size is less than 4k. Such documents, when moved, may be
too small to free up pages and, thus, memory.
3. Shard key order and insertion order are different. This prevents
data from being moved in contiguous chunks.
Most sharded deployments will not meet these criteria. Anyone whose
documents are larger than 4KB will not suffer significant
fragmentation because the pages that aren’t being used won’t be
cached.
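A quick way to check whether your own deployment meets criterion 2 is to
look at the average object size that collStats reports, for example
(database and collection names are assumptions):

    from pymongo import MongoClient

    client = MongoClient("mongos.example.com", 27017)
    stats = client["foursquare"].command("collstats", "checkins")

    avg_size = stats["avgObjSize"]  # average document size, in bytes
    if avg_size < 4096:
        print(f"avg document is {avg_size} bytes (under 4KB): "
              "migrating chunks may not free whole pages")
    else:
        print(f"avg document is {avg_size} bytes: this fragmentation issue is unlikely")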
* Prevention
The main thing to remember here is that once you’re at max capacity,
it’s difficult to add more capacity without some downtime when objects
are small. However, if caught in advance, adding more shards on a
live system can be done with no downtime.
For example, if we had notifications in place to alert us 12 hours
earlier that we needed more capacity, we could have added a third
shard, migrated data, and then compacted the slaves.
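A concrete form of that notification can be as simple as a periodic job
that compares data size against a capacity threshold and alerts well
before 100%. A minimal sketch, assuming direct access to each shard host
and an arbitrary 80% threshold:

    from pymongo import MongoClient

    RAM_BYTES = 66 * 1024 ** 3   # RAM on each shard host, per the account above
    ALERT_AT = 0.80              # assumed threshold: alert well before data outgrows RAM

    # Hypothetical shard hosts; in practice this list would come from the cluster config.
    for host in ["shard0.example.com", "shard1.example.com"]:
        data_bytes = MongoClient(host, 27017)["foursquare"].command("dbstats")["dataSize"]
        if data_bytes > ALERT_AT * RAM_BYTES:
            print(f"{host}: data at {data_bytes / RAM_BYTES:.0%} of RAM, add capacity now")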
Another salient point: when you're operating at or near capacity,
realize that if things get slow at your hosting provider, you may
suddenly find yourself effectively over capacity.
* Final Thoughts
The 10gen tech team is working hard to correct the issues exposed by
this outage. We will continue to work as hard as possible to ensure
that everyone using MongoDB has the best possible experience. We are
thankful for the support that we have received from Foursquare and our
community during this unfortunate episode. As always, please let us
know if you have any questions or concerns.
[1] Chunks are split into two 100MB halves when they reach 200MB. This
means that even if the number of chunks on each shard is the same, the
data size is not necessarily the same. This is something we are going to
be addressing in MongoDB: we'll be making the splitting and balancing
logic look for this imbalance so that it can act on it.
[2] The 10gen team is working on doing online incremental compaction
of both data files and indexes. We know this has been a concern in
non-sharded systems as well. More details about this will be coming
in the next few weeks.