Alice's MongoDB FAQ

A FAQ giving more detail than you ever wanted to know on a variety of MongoDB subjects.

Personal

Which operating system do you use for your servers?

I use Gentoo Linux, for a wide variety of reasons. Performance is a big one: it uses the least RAM at startup after a fresh install, boots to all-services-responding in around four seconds, and has broad support for operation in a cluster, i.e. distributed computing. By default it compiles packages from source, allowing the installed packages to be highly optimized for the physical hardware they run on; however, it also integrates binary package support and distributed compiling, allowing one "master" host to orchestrate compilation across an entire cluster. This results in "from depclean" kernel compilation times on the order of 40 seconds in my production cluster. I SSHFS-mount the package directory from the master on the other hosts and simply install the binary packages it produces.

Hand-installing a new Gentoo build takes around ten minutes on a reasonably powered server VM; compiling desktop applications takes substantially longer. For an example of me doing this, and screwing up a bunch as I tried to parallelize the install too much, see my YouTube video (21 minutes @ 2x speed) on the subject.

For one client, the reduced memory footprint from switching from Ubuntu Server to Gentoo translated into savings of roughly $12,000 USD/year by allowing the majority of VMs to use smaller RAM and CPU allocations.

Hosting

Should I adjust ulimit resource limits on MongoDB servers?

In general, yes. The official documentation covers resource limits, including recommendations for values. Of note, both hard and soft limits will need to be changed; per-session soft limits may not exceed the hard limit set by the root user.

There are several situations where increasing the limit can be critical:

  • Where you have enabled smallFiles and are hosting large databases, or many databases, each with many on-disk stripes.
  • Where you have a very large number of simultaneous connections.
  • Where your client applications are unable to pool connections; i.e. opening and closing a connection on each request.

In the latter case you may also need to reduce the TIME_WAIT timeout on the server hosting your client application. Outbound port numbers are tied up for TIME_WAIT seconds after a connection is closed before they can be allocated again, so shortening it increases the availability of ports for new outbound connections.
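
As a quick sanity check, you can inspect the limits the current session would hand to a child process such as mongod using Python's standard resource module. This is only a sketch; the 64000 threshold below is illustrative, and the official documentation lists the actual recommended values.

```python
# A minimal sketch: report the soft/hard limits the current shell session
# would hand to a child process such as mongod. The 64000 figure is only an
# illustrative target; consult the official ulimit recommendations.
import resource

for name, limit in (("open files", resource.RLIMIT_NOFILE),
                    ("processes/threads", resource.RLIMIT_NPROC)):
    soft, hard = resource.getrlimit(limit)
    print(f"{name}: soft={soft} hard={hard}")
    if soft != resource.RLIM_INFINITY and soft < 64000:
        print(f"  warning: {name} soft limit is below 64000")
```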

Memory

What's the deal with "rss" and "vsz" memory allocated to MongoDB?

From Wikipedia:

Resident set size, RSS, is the portion of memory occupied by a process that is held in main memory (RAM). The rest of the occupied memory exists in the swap space or file system, either because some parts of the occupied memory were paged out, or because some parts of the executable were never loaded.

For MongoDB this quantity of RAM represents the server process itself, connection tracking, working memory (thread stacks), "working set" where answers to queries are prepared, and any chunks of the on-disk stripes locked for writing or explicitly marked for copying into the MongoDB server process. It explicitly does not relate to the overall data set size, and will grow and shrink according to utilization.

VSZ, on the other hand, represents the virtual memory size. It includes all memory that the process can access, including memory that is swapped out, memory-mapped files, and shared libraries, which are also memory-mapped. This value will include the totality of the allocated on-disk stripes. It'll rarely appear to shrink, but don't worry, it's virtual, and sparse. I.e. it doesn't represent actual memory allocation, and is filled with large holes.

Having a large virtual memory area to work in is very important for MongoDB; this is why 32-bit support is deprecated. On 32-bit Linux kernels the VSZ is limited to half of the possible maximum of 4 GiB—Linux uses a 50/50 user/kernel split—thus the 2 GiB data size limit on these systems.

The vast majority of MongoDB's virtual memory map will be memory-mapped on-disk files.
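
To see both numbers for a running mongod you can read them straight out of /proc. A minimal sketch, Linux only, assuming a single mongod process is running:

```python
# A minimal sketch, Linux only: report VmRSS and VmSize for a running mongod
# by reading /proc/<pid>/status. Assumes exactly one mongod process exists.
import subprocess

pid = subprocess.check_output(["pidof", "mongod"]).split()[0].decode()

with open(f"/proc/{pid}/status") as status:
    for line in status:
        if line.startswith(("VmRSS:", "VmSize:")):
            print(line.strip())  # values are reported in kB
```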

Should I enable swap on my MongoDB servers?

Yes. The Linux Kernel's virtual memory manager expects swap to be present, and may under-allocate RAM if no swap can be found. I always ensure I have, at a minimum, as much swap as there is RAM. Where possible I aim to have 4x as much swap as RAM. On larger servers I cap the amount of swap at 8GB.

Should I adjust swappiness?

To date I have not found a situation improved by adjusting swappiness from the default. Alice's Law #144: Optimization without measurement is by definition premature. Adjustments to a value like this, for me, require concrete evidence that the adjustment will garner an improvement without negative impact elsewhere.

How much RAM should I allocate to a MongoDB server?

As with most things it comes down to picking two of: good, fast, or cheap.

The best case is always to have enough RAM to fit all of your data. This will allow MongoDB free rein to answer your queries without too large a penalty for queries not covered by indexes. (These will still be slower than indexed queries, but there won't be a penalty for fetching data from the disk.) When your data set grows, you can use sharding to spread your data across multiple hosts. This is the "good" and "fast" choice, but likely not cheap (or possibly even practical) if you have lots of data. It is cheap from a development standpoint: you can be far more relaxed with your overall query performance, but this can bite pretty hard when it comes time to scale.

The second option is to have enough RAM to contain your indexes, but not necessarily the rest of the data. This situation is good in that it may be the only option, and cheaper due to the lowered hardware requirements. It's only fast if you are very careful writing your queries: you must ensure that your queries are always answerable using an index, and where possible utilize covered queries to allow MongoDB to avoid interrogating the on-disk data.
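
To illustrate, a covered query written with pymongo might look like the following; the users collection, its fields, and the index are all made up for the example:

```python
# A minimal sketch of a covered query using pymongo; the "users" collection
# and its fields are hypothetical. The filter and the projection both use
# only fields present in the index, so MongoDB can answer the query from the
# index alone without touching the on-disk documents.
from pymongo import ASCENDING, MongoClient

db = MongoClient().example  # assumes a local mongod and an "example" database
db.users.create_index([("email", ASCENDING), ("name", ASCENDING)])

cursor = db.users.find(
    {"email": "alice@example.com"},      # filter on an indexed field
    {"_id": 0, "email": 1, "name": 1},   # project only indexed fields; exclude _id
)
print(list(cursor))
```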

MongoDB measures memory performance using the "page faults" counter. A page fault is what happens when a process attempts to read a chunk of memory that isn't actually loaded into RAM, either because it has been swapped to disk to make room or because it's part of a memory-mapped file that isn't in the cache. If your data set size exceeds RAM you should expect some faulting. Just ensure all of your queries use indexes and you should be fine: as long as the indexes fit in RAM, indexed queries shouldn't page fault during the query itself, only while streaming the matching data back to the client.

In the event of a page fault, the CPU's MMU (memory management unit) raises the fault and the kernel's page fault handler takes over. While this handler is running, un-swapping pages or loading file chunks from disk to correct the fault, the thread attempting the read is suspended. This is why, in general, excessive page fault counts are a Very Bad Thing™ and can cripple performance for database systems.

Regardless of which option you go with it's important to monitor the page fault counts as an indicator of trouble.
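
One easy way to watch that counter is the serverStatus command. A minimal sketch using pymongo, assuming a locally reachable mongod (on Linux the counter is reported under extra_info):

```python
# A minimal sketch: read the page fault counter from serverStatus via pymongo.
# On Linux the counter lives under extra_info.page_faults; availability can
# vary by platform and server version, hence the defensive lookup.
from pymongo import MongoClient

client = MongoClient()  # assumes a local mongod
status = client.admin.command("serverStatus")
print("page faults since startup:",
      status.get("extra_info", {}).get("page_faults", "not reported"))
```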

What's the deal with this "Transparent Huge Page" thing I keep seeing warnings about?

Huge pages are a method for an application to request large contiguous (uninterrupted) chunks of RAM; up to 1 GB per "page" vs. the 4 KiB standard size. Transparent huge page support is a method which allows a kernel, when allocating memory to applications not otherwise requesting huge pages, to allocate huge pages anyway. This primarily benefits applications that perform bulk transfers. Such applications include web browsers (these days a large strain on virtual memory systems), multimedia applications like video codecs or 3D renderers, or, in general, desktop-class workloads. Server processes, on the other hand, suffer from the unpredictability of THP support and some of the pathological cases it introduces.

Say you're writing a server that uses memory-mapped files, like MongoDB does. You expect that not all of a file may be loaded at once. The cost of a page fault can be estimated: you need to load 4 KiB of data from the disk. (Roughly; there are other optimizations for file access, such as read-ahead, that I'm not covering here.) If you have transparent huge pages, though, the amount of data that needs to be moved in the event of a page fault may exceed a gigabyte! This will clearly take much longer to service than a 4 KiB chunk.

It gets worse. There is a problem with huge pages: because RAM is allocated willy-nilly, and memory may get freed in a different order than it was allocated in, RAM becomes fragmented. If THP is disabled, this isn't a problem. Every page is the same size, so fragmentation doesn't matter at all. If THP is enabled, though, the kernel will regularly attempt to defragment RAM to free up enough contiguous (uninterrupted) space to build new huge pages. It will also attempt to do this in the event it wants to hand a process a huge page, but none are available—causing a variable time delay in the allocation!

These two aspects make it very important to disable THP on servers running a mongod process in order to have reliable performance characteristics and to avoid memory fragmentation issues. Do not ignore the warning MongoDB emits on startup if it detects THP is enabled. If you frequently use slow IO devices (like USB thumb drives) on a desktop Linux installation you may want to disable THP, too, if you notice intermittent process hangs when writing to those slow IO devices.
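
To verify the setting on a Linux host you can read the sysfs knobs directly; this sketch only reports the current state, and disabling THP persistently is normally done through your init system or boot configuration:

```python
# A minimal sketch, Linux only: report whether transparent huge pages are
# enabled by reading the sysfs knobs. The currently selected value is the
# one in square brackets, e.g. "always madvise [never]".
for knob in ("enabled", "defrag"):
    path = f"/sys/kernel/mm/transparent_hugepage/{knob}"
    try:
        with open(path) as f:
            print(f"THP {knob}: {f.read().strip()}")
    except FileNotFoundError:
        print(f"THP {knob}: not present ({path})")
```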

Backups

What methods are available to backup my data?

There are several options available to you. Each has its own pros and cons, and notably each requires a very different amount of effort (and thus downtime) to restore. A standalone MongoDB server—one not a member of a replica set—is not generally recommended, doubly so without a regular backup plan in place. Without replication, several of the options listed below won't be applicable.

MongoDB Management Service

The backup service offered by MMS requires your database operate as a replica set. The first gigabyte is free, and it's $2.50/GB/month ($30/GB/year) otherwise. I cannot stress enough how useful MMS is as a service, and, full disclosure, I'm a very satisfied customer. It allows you to perform point-in-time restores from a convenient web interface. Of note: place your backup processes on hosts other than your mongod database servers.

Replication Failover

Do this. (I do. Everyone should.)

The optimum solution for small-scale failures (i.e. failure of a single server) is a replica set. This will allow for automatic failover in the event of a problem with the active primary. This also ensures the shortest downtime—in my own tests, election of a new primary takes a few hundred milliseconds, and client drivers already integrate support for failover.

Some work may be needed in your application to handle "retrying" cursors that die mid-iteration gracefully, but even then replication failover will ensure the minimum amount of downtime possible for your application. Without replication, your application's availability is at substantial risk.
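
As a rough illustration of that client-side handling with pymongo; the host names, replica set name, collection, and retry policy are all invented for the example:

```python
# A minimal sketch: connect to a hypothetical three-member replica set and
# retry a read if the connection drops mid-failover. Host names, the replica
# set name, and the retry policy are illustrative.
import time
from pymongo import MongoClient
from pymongo.errors import AutoReconnect

client = MongoClient(
    "mongodb://db1.example.com,db2.example.com,db3.example.com/?replicaSet=rs0"
)

for attempt in range(5):
    try:
        docs = list(client.example.events.find({"level": "error"}))
        break
    except AutoReconnect:
        time.sleep(2 ** attempt)  # back off while a new primary is elected
else:
    raise RuntimeError("replica set did not recover in time")
```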

Delayed Secondary

MongoDB allows you to "delay" a secondary by a given amount of time, say, 24 hours. This delay must be within the bounds of your oplog; if your oplog tracks the last 110 hours of operations, then you can delay by almost that much without issue. (Running close to the limit of your oplog may introduce some reliability issues; try to ensure your oplog always has enough headroom for growth without disrupting your backups.)

These delayed secondaries are extremely useful not just to recover from catastrophic failure of the whole cluster, but to allow recovery from user error: dropped collections, mangled records, and so on.
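
Setting one up looks roughly like the following sketch using pymongo; the member index and host name are hypothetical, and in the MongoDB versions current at the time of writing the relevant field is slaveDelay (a delayed member must also have priority 0 and is typically hidden):

```python
# A minimal sketch: reconfigure a hypothetical replica set so that member 2
# applies operations 24 hours behind the primary. Run against the current
# primary; the member index and host name are illustrative.
from pymongo import MongoClient

client = MongoClient("db1.example.com")  # assumed to be the current primary
config = client.admin.command("replSetGetConfig")["config"]

member = config["members"][2]        # the member to delay (illustrative index)
member["priority"] = 0               # a delayed member must never become primary
member["hidden"] = True              # keep clients from reading stale data
member["slaveDelay"] = 24 * 60 * 60  # delay, in seconds

config["version"] += 1
client.admin.command({"replSetReconfig": config})
```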

Manual Dump

MongoDB comes with many command-line tools to import, export, and otherwise manipulate your data. A pair of these, mongodump and mongorestore, allows you to extract your data in a compact binary form and load it back later. mongodump provides many options which can be useful for performing reliable backups. An important one is --oplog; it tells mongodump to record the operations that happen to the database during the time it takes to perform the backup, giving you a true "point in time" snapshot. This option is only usable against a replica set.

Because MongoDB on-disk data files are "sparse", i.e. they develop holes, and are over-allocated vs. the amount of actual data, and each record has some "head room" added to it to allow it to grow without needing to be moved, you may see a potentially large difference between the MongoDB on-disk data file sizes and the size of your backup. This is normal; the backup .bson files omit the holes present in the source data.

Of note, the mongodump tool can be extremely intensive on your database. Running it against a secondary may be advisable, if the replication delay is acceptable.
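
For completeness, a scripted dump driven from Python might look something like this; the secondary's host name and the output path are assumptions for the example:

```python
# A minimal sketch: run mongodump against a (hypothetical) secondary with
# --oplog for a point-in-time snapshot. Host name and paths are illustrative.
import datetime
import subprocess

stamp = datetime.datetime.utcnow().strftime("%Y%m%d-%H%M%S")
subprocess.check_call([
    "mongodump",
    "--host", "db2.example.com",   # a secondary, to keep load off the primary
    "--oplog",                     # capture operations made during the dump
    "--out", f"/backups/mongodb-{stamp}",
])
```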
