@jnewbery
Last active February 23, 2021 06:00
File Descriptors in Bitcoin Core

Overview

  • We use leveldb for multiple databases: block index, chainstate and (optionally) txindex. block index is very small, but chainstate and txindex can grow to many hundreds of files.
  • Each database has a limit on the number of 'open files'. This limit counts both mmapped files (which ldb opens, mmaps and then closes, returning the fd to the system) and files which ldb opens and holds the fd for. There's a default in the leveldb code here. This is the per-database max_open_files limit.
  • As well as this per-database max_open_files limit, there's a global leveldb limit on the number of mmapped files. That's set here. This is the global mmap_limit.
  • (each ldb database also has a few housekeeping files that it keeps open: LOCK, MANIFEST and a log, but let's ignore those)
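To make the interaction between the two limits concrete, here's a toy model of the worst-case fd accounting (the function and its name are mine, not anything in leveldb; housekeeping files are ignored as above):

```python
def retained_fds(num_dbs: int, max_open_files: int, mmap_limit: int) -> int:
    """Worst-case number of fds leveldb holds open for table files.

    Simplified model: each database may hold up to max_open_files table
    files open at once. The first mmap_limit of those (globally, shared
    across all databases) are mmapped and consume no fd; any open files
    beyond that keep their fd.
    """
    total_open = num_dbs * max_open_files
    return max(0, total_open - mmap_limit)

# Pre-2018 settings: 3 databases x 64 files each, mmap_limit of 1000.
print(retained_fds(num_dbs=3, max_open_files=64, mmap_limit=1000))  # 0
```

With the old settings the model gives zero retained fds, matching the "ldb would hold very few fds" claim below.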

Previous settings

Prior to 2018, we set max_open_files to 64 on all architectures, due to concerns about fd exhaustion. The global mmap_limit was set to 1000. Since there were only three databases, and the maximum number of open files was 64 for each, the maximum possible number of open files was 64 x 3 = 192. That's nowhere near the global mmap_limit, so all open files would be mmapped and ldb would hold very few fds.

Changes in 2018

In PR 12495, eklitzke increased max_open_files from 64 to 1000 for all systems except 32-bit POSIX. The PR description is excellent. I'll copy the rationale here:

When a LevelDB file handle is opened, a bloom filter and block index are decoded, and some CRCs are checked. Bloom filters and block indexes in open table handles can be checked purely in memory. This means that when doing a key lookup, if a given table file may contain a given key, all of the lookup operations can happen completely in RAM until the block itself is fetched. In the common case fetching the block is one disk seek, because the block index stores its physical offset. This is the ideal case, and what we want to happen as often as possible.

If a table file handle is not open in the table cache, then in addition to the regular system calls to open the file, the block index and bloom filter need to be decoded before they can be checked. This is expensive and is something we want to avoid.

Evan claimed that the fd exhaustion issue wouldn't be hit because:

  • On 64-bit POSIX hosts LevelDB will open up to 1000 file descriptors using mmap(), and it does not retain an open file descriptor for such files.
  • On Windows non-socket files do not interfere with the main network select() loop, so the same fd exhaustion issues do not apply there.

I think his reasoning for 64-bit POSIX systems was wrong because max_open_files is a per-db limit and mmap_limit is a global limit. With #12495, each database could now open up to 1000 files (3000 total), but the global mmap_limit was 1000. Once that mmap limit was hit, ldb would continue to open files but would retain the fds for any files beyond 1000, meaning up to 2000 fds could be used by ldb.
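The worst case under the #12495 settings is simple arithmetic:

```python
# Worst case after PR 12495: three databases, each allowed up to 1000 open
# table files, but only 1000 mmap slots shared between all of them.
total_open = 3 * 1000
mmap_limit = 1000

# Files beyond the global mmap limit are opened normally and keep their fd.
retained = max(0, total_open - mmap_limit)
print(retained)  # 2000 fds potentially held by ldb
```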

I believe this is what happened to ossifrage in July 2018. His ldb reached the global 1000 mmap limit and then started retaining fds:

218 2018-07-31T19:41:23  <ossifrage> it looks like bitcoin-qt is using up 800 FDs for open files (chainstate, txindex, etc...)
219 2018-07-31T19:43:15  <ossifrage> 509 for txindex and 288 for chainstate
...
229 2018-07-31T19:58:47  <sipa> how many connwctions do you have?
230 2018-07-31T20:01:26  <ossifrage> sipa, 160

link

That meant that the number of fds held by bitcoind reached the 1024 limit and select() was not able to monitor new file descriptors:

208 2018-07-31T19:19:34  <ossifrage> This is a curious error: " dropped: non-selectable socket"

The fix was in PR 13860 (not merged but fixed in our ldb repo and merged as a subtree). The global mmap_limit was increased to 4096. That means that once again, the mmap_limit can't be reached (4096 > 1000 x 3), so ldb will only ever have a very small number of fds.
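The same arithmetic with the post-13860 settings shows why the problem can't recur:

```python
# After PR 13860: mmap_limit raised to 4096, which exceeds the worst case
# of 3 x 1000 open table files, so every open file can be mmapped.
total_open = 3 * 1000
mmap_limit = 4096

retained = max(0, total_open - mmap_limit)
print(retained)  # 0 fds held by ldb for table files
```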

select() -> poll()

Finally, PR 14336 changes the net code to use the poll() syscall instead of select() for linux systems (select() is fine to use on windows because it doesn't have the 1024 fd limit, and poll() seems to be broken on macos). This should remove any concerns about fd exhaustion for good on linux. The only reason it was ever a problem was that select() could run into its limit of 1024.
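The select() vs poll() difference is easy to demonstrate from Python on Linux (fd 1500 here is an arbitrary number above FD_SETSIZE, deliberately not a real open fd):

```python
import select

# select() on Linux can't handle fds >= FD_SETSIZE (1024); CPython raises
# ValueError before even issuing the syscall.
try:
    select.select([1500], [], [], 0)
except ValueError as e:
    print("select:", e)  # filedescriptor out of range in select()

# poll() has no such limit: a high fd registers fine, and polling an fd
# that isn't actually open just reports POLLNVAL rather than failing.
p = select.poll()
p.register(1500)
print("poll:", p.poll(0))  # [(1500, POLLNVAL)]
```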

Long term

We should move to using libevent, which takes care of all of this stuff and just uses the most appropriate syscall. But:

 34 2018-08-01T00:45:46  <gmaxwell> come on, why can't we just take the not-many-line-change to use poll?  I know libevent future ra ra ra... but we have held off this simple fix for years. :(

someone just has to go ahead and implement it.

Snapshot of current running node

→ bcli uptime
2603491
→ lsof -p $(pidof bitcoind) |    awk 'BEGIN { fd=0; mem=0; } /ldb$/ { if ($4 == "mem") mem++; else fd++ } END { printf "mem = %s, fd = %s\n", mem, fd}'
lsof: WARNING: can't stat() tracefs file system /sys/kernel/debug/tracing
      Output information may be incomplete.
mem = 1985, fd = 0

So I have 1985 ldb files mmapped, and none open with file descriptors.

→ lsof -p $(pidof bitcoind) | grep ldb | grep txindex | wc -l
lsof: WARNING: can't stat() tracefs file system /sys/kernel/debug/tracing
      Output information may be incomplete.
975
→ lsof -p $(pidof bitcoind) | grep ldb | grep chainstate | wc -l
lsof: WARNING: can't stat() tracefs file system /sys/kernel/debug/tracing
      Output information may be incomplete.
992

Both of those are around the max_open_files limit. If I didn't have PR 13860 and mmap_limit set to 4096, then I expect many of those files would be open with fds instead of mmapped.
