Database thoughts

Issue explanation

For trie node reading, we have huge read amplification. According to Peter's benchmark statistics, the real data size is 35GB while the total disk read is 10,000GB, so the read amplification is around 280.

I think this issue exists not only in state serving (for fast sync) but also in block processing. Right now we have two state-access modes during block processing:

Read from the trie directly

In this mode, we try to access the state data through the trie path by resolving all the trie nodes on the specified path. Although we have an in-memory trie database to reduce disk access, I guess the cache hit rate for trie nodes may not be very high.

Disk access is then a huge problem: looking up a single trie node (key = hash(trie_node_content)) may require several disk accesses (multiple level-0 accesses, several non-level-0 accesses).

Searching a single sstable may involve these steps:

  • Open the file and load the metadata (index block, bloom filter)
  • Filter by the bloom filter
  • Search the index block to locate the data quickly
  • Load the specific data block (4KiB per operation)
  • Cache the loaded data
  • Cache the loaded metadata

If the key isn't found at level 0, we continue with level 1, ..., level n.
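To make the cost concrete, here is a minimal Go sketch of that lookup flow. The types and helpers (`sstable`, `loadMeta`, `mayContain`, `readBlock`) are made up for illustration and are not goleveldb's real API; the point is only how many disk reads a single trie-node Get can trigger.

```go
package main

import "fmt"

// sstable is a stand-in for one on-disk table; hasKey simulates whether the
// table actually contains the key we are looking for.
type sstable struct {
	name   string
	hasKey bool
}

// loadMeta simulates reading the index block and bloom filter from disk
// (this read is skipped when the table's metadata is already cached).
func (t *sstable) loadMeta() {
	fmt.Println("disk read: metadata of", t.name)
}

// mayContain simulates the bloom-filter check; with 10 bits/key roughly 2% of
// absent keys would still pass and cost a useless data-block read.
func (t *sstable) mayContain(key []byte) bool {
	return t.hasKey
}

// readBlock simulates loading one 4KiB data block that may hold the key.
func (t *sstable) readBlock(key []byte) ([]byte, bool) {
	fmt.Println("disk read: 4KiB data block of", t.name)
	return []byte("trie node"), t.hasKey
}

// get walks the levels top-down: all overlapping tables at level 0, then (in
// real leveldb) at most one candidate table per deeper level.
func get(levels [][]*sstable, key []byte) ([]byte, bool) {
	for _, level := range levels {
		for _, table := range level {
			table.loadMeta()
			if !table.mayContain(key) {
				continue // bloom filter saved a data-block read
			}
			if v, ok := table.readBlock(key); ok {
				return v, true
			}
		}
	}
	return nil, false
}

func main() {
	levels := [][]*sstable{
		{{name: "level0-a"}, {name: "level0-b"}}, // level 0 tables can overlap
		{{name: "level1-a", hasKey: true}},       // the key lives here
	}
	get(levels, []byte("hash(trie_node_content)"))
	// Result: 3 metadata reads + 1 data-block read for a single trie node.
}
```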

Read from the snapshot

With the help of the snapshot, we can locate the state data directly without resolving the whole Merkle path (which is really expensive), read it, and continue VM execution. But during the commit stage we still need to resolve the Merkle path and regenerate a new path, based on the old one and the execution changeset, for the new hash calculation.

So all in all, we need to resolve the Merkle path for all state data (accounts, slots). The average Merkle path may contain around 10 trie nodes (a really rough number). For each trie node, if we miss the cache, we may incur several disk accesses.
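As a rough sketch of the two modes (the interfaces below are hypothetical simplifications, not go-ethereum's actual types), the snapshot turns a read into one flat key lookup, while committing still walks the Merkle path node by node:

```go
// Package statedemo sketches the two state-access modes; the types here only
// mirror the idea and are not go-ethereum's real API.
package statedemo

type hash [32]byte

// snapshot holds flat account/slot data keyed directly by address/slot hash.
type snapshot interface {
	Account(addrHash hash) ([]byte, error) // one key-value lookup, no path walk
}

// trie stores hash-keyed nodes; every access resolves root -> ... -> leaf.
type trie interface {
	// resolve loads one node by key = hash(node content); a cache miss here is
	// where the LevelDB read amplification shows up.
	resolve(nodeHash hash) ([]byte, error)
}

// readForExecution uses the snapshot: a single lookup during VM execution.
func readForExecution(snap snapshot, addrHash hash) ([]byte, error) {
	return snap.Account(addrHash)
}

// commitAccount still resolves the ~10-node Merkle path for a touched account,
// so the new root hash can be computed from the old path plus the changeset.
func commitAccount(t trie, path []hash) error {
	for _, nodeHash := range path { // each cache miss may cost several disk reads
		if _, err := t.resolve(nodeHash); err != nil {
			return err
		}
	}
	// ...rebuild the path with the new value and re-hash upwards (omitted)...
	return nil
}
```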

LevelDB built-in cache

LevelDB has its own cache mechanism, which is used to store two things: (a) sstable metadata and (b) sstable data.

metadata cache

The metadata includes (a) bloom filters, (b) data block indexes, etc. If we have this metadata cached, we can skip most of the useless disk accesses (given the bloom filter's low false positive rate).

Also, the metadata is small. Let's calculate it.

  • LevelDB generates one filter for every 2KB of data. If we assume the average trie node size is 200B, then a new filter is created for every 10 nodes we store.
  • The parameter we currently use for LevelDB is 10 bits per key, so each filter is 100 bits ≈ 12 bytes.
  • The sstable size is 2MB, so there are 1,000 filters in each sstable file, about 12KB of filters per file.

If we have 100,000 sstable files (200GB in total), then we will have 1.2GB of filters (it's acceptable to keep them in memory).
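For a quick sanity check of that arithmetic, here is a tiny Go program using the same assumed numbers (2KB per filter, 200B nodes, 10 bits/key, 2MB sstables, 100,000 files):

```go
package main

import "fmt"

func main() {
	const (
		bytesPerFilter = 2000      // leveldb emits one filter per ~2KB of data
		avgTrieNode    = 200       // assumed average trie node size in bytes
		bitsPerKey     = 10        // current bloom-filter setting
		sstableSize    = 2_000_000 // ~2MB per sstable
		sstableCount   = 100_000   // ~200GB of live data
	)

	keysPerFilter := bytesPerFilter / avgTrieNode     // ~10 keys per filter
	filterBytes := keysPerFilter * bitsPerKey / 8     // ~12 bytes per filter
	filtersPerTable := sstableSize / bytesPerFilter   // ~1000 filters per sstable
	perTable := filtersPerTable * filterBytes         // ~12KB of filters per file
	totalGB := float64(perTable) * sstableCount / 1e9 // across all live sstables

	fmt.Printf("filters per sstable: %d (~%d KB)\n", filtersPerTable, perTable/1000)
	fmt.Printf("total filter size for %d sstables: ~%.1f GB\n", sstableCount, totalGB)
}
```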

And with the current bloom filter parameter (10 bits per key), the overall false positive rate is around 2%. Imagine if we put the metadata of all live sstable files in memory: whenever we try to load some data, we can skip lots of metadata disk accesses.

data cache

Whenever we load a data block, its size is 4KiB and we also put it in the data cache. However, I would say this kind of cached data is mostly useless, because trie node keys are totally random: there is a very high chance the cached block won't be touched again before it is evicted.
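To back that claim, here is a small simulation with made-up sizes (roughly 200GB of 4KiB blocks against a few-GB block cache, not measured geth numbers); with uniformly random accesses the hit rate collapses to about the cache/data ratio:

```go
package main

import (
	"fmt"
	"math/rand"
)

func main() {
	const (
		totalBlocks  = 50_000_000 // ~200GB of data split into 4KiB blocks
		cachedBlocks = 2_000_000  // ~8GB worth of block cache (assumed)
		accesses     = 5_000_000
	)

	cache := make(map[int]struct{}, cachedBlocks)
	var order []int // FIFO eviction is a fair stand-in under a uniform workload
	hits := 0

	for i := 0; i < accesses; i++ {
		block := rand.Intn(totalBlocks) // hash-keyed trie nodes => uniform access
		if _, ok := cache[block]; ok {
			hits++
			continue
		}
		cache[block] = struct{}{}
		order = append(order, block)
		if len(order) > cachedBlocks {
			delete(cache, order[0])
			order = order[1:]
		}
	}
	fmt.Printf("observed hit rate: %.2f%% (cache/data ratio: %.2f%%)\n",
		100*float64(hits)/accesses, 100.0*cachedBlocks/totalBlocks)
}
```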

How can we improve the overall cache hit rate?

  • Increase the size of the metadata cache

    Actually, we already assign a very high limit to the metadata cache (half of the maximum fd limit), so we need to improve the quality of the cached items instead:

    • Evict items removed by compaction (avoid useless items)
    • Add items created by compaction automatically (avoid re-opening)

    Hopefully these approaches can improve the metadata hit rate.

  • Allocate less space for the data cache

  • Add these metrics to LevelDB for monitoring and debugging (a rough sketch of the raw counters follows this list):

    • Metadata cache hit rate for read operations

    • Data cache hit rate for read operations

    • Memory db hit rate for read operations

    • Bloom filter false positive rate
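A minimal sketch of the raw counters behind those metrics (the struct and names are made up here; in practice they would be wired into geth's existing metrics system):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// dbCounters holds the raw counters the read path would bump; hit rates and
// the bloom false-positive rate are derived from them when reporting.
type dbCounters struct {
	metaCacheHit, metaCacheMiss  uint64
	dataCacheHit, dataCacheMiss  uint64
	memDBHit, memDBMiss          uint64
	bloomNegative, bloomFalsePos uint64 // falsePos = filter said "maybe", block had no key
}

func (c *dbCounters) report() {
	rate := func(hit, miss uint64) float64 {
		if hit+miss == 0 {
			return 0
		}
		return 100 * float64(hit) / float64(hit+miss)
	}
	fmt.Printf("metadata cache hit rate: %.1f%%\n", rate(atomic.LoadUint64(&c.metaCacheHit), atomic.LoadUint64(&c.metaCacheMiss)))
	fmt.Printf("data cache hit rate:     %.1f%%\n", rate(atomic.LoadUint64(&c.dataCacheHit), atomic.LoadUint64(&c.dataCacheMiss)))
	fmt.Printf("memory db hit rate:      %.1f%%\n", rate(atomic.LoadUint64(&c.memDBHit), atomic.LoadUint64(&c.memDBMiss)))
	fmt.Printf("bloom false positives:   %.2f%%\n", rate(atomic.LoadUint64(&c.bloomFalsePos), atomic.LoadUint64(&c.bloomNegative)))
}

func main() {
	var c dbCounters
	// The read path would increment these, e.g. per metadata cache lookup:
	atomic.AddUint64(&c.metaCacheHit, 97)
	atomic.AddUint64(&c.metaCacheMiss, 3)
	atomic.AddUint64(&c.bloomNegative, 980)
	atomic.AddUint64(&c.bloomFalsePos, 20)
	c.report()
}
```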
