@jnewbery
Last active February 23, 2021 06:00
File Descriptors in Bitcoin Core

Overview

  • We use leveldb for multiple databases: block index, chainstate and (optionally) txindex. block index is very small, but chainstate and txindex can grow to many hundreds of files.
  • Each database has a limit on the number of 'open files'. This limit counts both mmapped files (which ldb opens, mmaps and then closes, returning the fd to the system) and files which ldb opens and holds the fd for. There's a default in the leveldb code here. This is the per-database max_open_files limit.
  • As well as this per-database max_open_files limit, there's a global leveldb limit on the number of mmapped files. That's set here. This is the global mmap_limit.
  • (each ldb database also has a few housekeeping files that it keeps open: LOCK, MANIFEST and a log, but let's ignore those)
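To make the interaction between the two limits concrete, here's a toy model of the worst-case fd accounting (the function and its name are mine, not anything in leveldb; housekeeping files are ignored as above):

```python
def retained_fds(num_dbs: int, max_open_files: int, mmap_limit: int) -> int:
    """Worst-case number of fds leveldb holds open for table files.

    Simplified model: each database may hold up to max_open_files table
    files open at once. The first mmap_limit of those (globally, shared
    across all databases) are mmapped and consume no fd; any open files
    beyond that keep their fd.
    """
    total_open = num_dbs * max_open_files
    return max(0, total_open - mmap_limit)

# Pre-2018 settings: 3 databases x 64 files each, mmap_limit of 1000.
print(retained_fds(num_dbs=3, max_open_files=64, mmap_limit=1000))  # 0
```

With the old settings the model gives zero retained fds, matching the "ldb would hold very few fds" claim below.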

Previous settings

Prior to 2018, we set max_open_files to 64 on all architectures, due to concerns about fd exhaustion. The global mmap_limit was set to 1000. Since there were only three databases, and the maximum number of open files was 64 for each, the maximum possible number of open files was 64 x 3 = 192. That's nowhere near the global mmap_limit, so all open files would be mmapped and ldb would hold very few fds.

Changes in 2018

In PR 12495, eklitzke increased max_open_files from 64 to 1000 for all systems except 32-bit POSIX. The PR description is excellent. I'll copy the rationale here:

When a LevelDB file handle is opened, a bloom filter and block index are decoded, and some CRCs are checked. Bloom filters and block indexes in open table handles can be checked purely in memory. This means that when doing a key lookup, if a given table file may contain a given key, all of the lookup operations can happen completely in RAM until the block itself is fetched. In the common case fetching the block is one disk seek, because the block index stores its physical offset. This is the ideal case, and what we want to happen as often as possible.

If a table file handle is not open in the table cache, then in addition to the regular system calls to open the file, the block index and bloom filter need to be decoded before they can be checked. This is expensive and is something we want to avoid.

Evan claimed that the fd exhaustion issue wouldn't be hit because:

  • On 64-bit POSIX hosts LevelDB will open up to 1000 file descriptors using mmap(), and it does not retain an open file descriptor for such files.
  • On Windows non-socket files do not interfere with the main network select() loop, so the same fd exhaustion issues do not apply there.

I think his reasoning for 64-bit POSIX systems was wrong because max_open_files is a per-db limit and mmap_limit is a global limit. With #12495, each database could now open up to 1000 files (3000 total), but the global mmap_limit was 1000. Once that mmap limit was hit, ldb would continue to open files but would retain the fds for any files beyond 1000, meaning up to 2000 fds could be used by ldb.
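The worst case under the #12495 settings is simple arithmetic:

```python
# Worst case after PR 12495: three databases, each allowed up to 1000 open
# table files, but only 1000 mmap slots shared between all of them.
total_open = 3 * 1000
mmap_limit = 1000

# Files beyond the global mmap limit are opened normally and keep their fd.
retained = max(0, total_open - mmap_limit)
print(retained)  # 2000 fds potentially held by ldb
```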

I believe this is what happened to ossifrage in July 2018. His ldb reached the global 1000 mmap limit and then started retaining fds:

218 2018-07-31T19:41:23  <ossifrage> it looks like bitcoin-qt is using up 800 FDs for open files (chainstate, txindex, etc...)
219 2018-07-31T19:43:15  <ossifrage> 509 for txindex and 288 for chainstate
...
229 2018-07-31T19:58:47  <sipa> how many connwctions do you have?
230 2018-07-31T20:01:26  <ossifrage> sipa, 160

link

That meant that the number of fds held by bitcoind reached the 1024 limit and select() was not able to monitor new file descriptors:

208 2018-07-31T19:19:34  <ossifrage> This is a curious error: " dropped: non-selectable socket"

The fix was in PR 13860 (not merged but fixed in our ldb repo and merged as a subtree). The global mmap_limit was increased to 4096. That means that once again, the mmap_limit can't be reached (4096 > 1000 x 3), so ldb will only ever have a very small number of fds.
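The same arithmetic with the post-13860 settings shows why the problem can't recur:

```python
# After PR 13860: mmap_limit raised to 4096, which exceeds the worst case
# of 3 x 1000 open table files, so every open file can be mmapped.
total_open = 3 * 1000
mmap_limit = 4096

retained = max(0, total_open - mmap_limit)
print(retained)  # 0 fds held by ldb for table files
```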

select() -> poll()

Finally, PR 14336 changes the net code to use the poll() syscall instead of select() for linux systems (select() is fine to use on windows because it doesn't have the 1024 fd limit, and poll() seems to be broken on macos). This should remove any concerns about fd exhaustion for good on linux. The only reason it was ever a problem was that select() could run into its limit of 1024.
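The select() vs poll() difference is easy to demonstrate from Python on Linux (fd 1500 here is an arbitrary number above FD_SETSIZE, deliberately not a real open fd):

```python
import select

# select() on Linux can't handle fds >= FD_SETSIZE (1024); CPython raises
# ValueError before even issuing the syscall.
try:
    select.select([1500], [], [], 0)
except ValueError as e:
    print("select:", e)  # filedescriptor out of range in select()

# poll() has no such limit: a high fd registers fine, and polling an fd
# that isn't actually open just reports POLLNVAL rather than failing.
p = select.poll()
p.register(1500)
print("poll:", p.poll(0))  # [(1500, POLLNVAL)]
```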

Long term

We should move to using libevent, which takes care of all of this stuff and just uses the most appropriate syscall. But:

 34 2018-08-01T00:45:46  <gmaxwell> come on, why can't we just take the not-many-line-change to use poll?  I know libevent future ra ra ra... but we have held off this simple fix for years. :(

someone just has to go ahead and implement it.

Snapshot of current running node

→ bcli uptime
2603491
→ lsof -p $(pidof bitcoind) |    awk 'BEGIN { fd=0; mem=0; } /ldb$/ { if ($4 == "mem") mem++; else fd++ } END { printf "mem = %s, fd = %s\n", mem, fd}'
lsof: WARNING: can't stat() tracefs file system /sys/kernel/debug/tracing
      Output information may be incomplete.
mem = 1985, fd = 0

So I have 1985 ldb files mmapped, and none open with file descriptors.

→ lsof -p $(pidof bitcoind) | grep ldb | grep txindex | wc -l
lsof: WARNING: can't stat() tracefs file system /sys/kernel/debug/tracing
      Output information may be incomplete.
975
→ lsof -p $(pidof bitcoind) | grep ldb | grep chainstate | wc -l
lsof: WARNING: can't stat() tracefs file system /sys/kernel/debug/tracing
      Output information may be incomplete.
992

Both of those are around the max_open_files limit. If I didn't have PR 13860 and mmap_limit set to 4096, then I expect many of those files would be open with fds instead of mmapped.
