@ryogrid
Last active December 23, 2023 01:54
Survey of the internal design of Lightning Memory-Mapped Database (LMDB) and its usage in real applications (Nostr relays)

Introduction

  • This article summarizes what I learned while surveying LMDB, a DBMS

Disclaimer

  • The author is not an expert on database systems, so some descriptions may be sloppy. Please keep this in mind, and I would appreciate it if you could point out any mistakes
  • Some of the content is based on speculation. I write in a way that makes it clear which parts are speculation
    • (Because I couldn't go as far as checking the implementation)

Why I decided to survey LMDB

  • There is a communication protocol architecture for distributed applications called nostr
  • Although it allows a relatively high degree of freedom, its main application at present is distributed SNS (microblogging), so the keyword nostr is often used to refer to the distributed SNS itself
  • For details, please refer to the link below
  • The author found LMDB because it is used in some implementations of nostr's relay servers
    • For example, it is used in an implementation called strfry
      • As will be explained later, LMDB is a KVS, and range scans can only be done on K (the key)
      • Since this is inconvenient, strfry wraps LMDB with a library called RasgueaDB, which provides a more sophisticated interface such as range scans that use specific data in V as an index (see the sketch after this list). It's an interesting approach
    • LMDB provides only primitive functionality, but conversely it is compact software that carries no implementation of advanced features. It also takes the form of an embedded DB. For these reasons, it is sometimes used as a storage engine for other DBMSs
      • For example, TiDB, a distributed database, apparently allowed LMDB to be used as its storage engine at one point in the past
      • InfluxDB, a time-series database, also considered it as a candidate when renewing its storage engine
  • After searching the web a bit, I found claims that it has high performance, with emphasis placed on it being memory-mapped (using mmap), so I became interested
    • The expression "high performance" is full of ambiguity, but please forgive me
  • The next section explains why I was interested in the memory-mapped aspect
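To illustrate the indexing trick mentioned above, here is a minimal sketch of the general technique (my own illustration of the idea, not RasgueaDB's actual code or key layout): encoding a field of V, such as a timestamp, as a big-endian prefix of K makes the KVS's key-ordered iteration double as a time-range scan.

```c
/* Sketch of the composite-key technique (illustrative, not RasgueaDB's
 * actual layout): key = [8-byte big-endian created_at][32-byte event id].
 * Big-endian encoding makes memcmp() order equal to numeric time order,
 * so a key-ordered cursor scan becomes a time-range scan. */
#include <stdint.h>
#include <string.h>

void make_index_key(uint8_t out[40], uint64_t created_at,
                    const uint8_t event_id[32]) {
    for (int i = 0; i < 8; i++)
        out[i] = (uint8_t)(created_at >> (56 - 8 * i));
    memcpy(out + 8, event_id, 32);
}
```

A range scan starting at time T would then position a cursor with an operation like LMDB's MDB_SET_RANGE on a key whose first 8 bytes encode T.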

Note: A simple explanation of the relay server

  • It is the only kind of server in the nostr architecture and acts as an intermediary for data between clients. Looking only at what it does, it could be called a peculiar database system with a stream interface (WebSocket)
  • Characteristics of relay server workload (as perceived by the author)
    • Structure of data to be handled
      • The data has somewhat relational aspects
    • Mostly read
    • Writes mostly add new entries and only rarely update existing ones
    • Access patterns have temporal locality
      • Many requests are for entries recorded within the past few hours relative to the current time

Overview of LMDB

  • Name: Lightning Memory-Mapped Database (LMDB)
    • It seems that it was originally called Memory-Mapped Database (MDB), but changed to its current name in 2012
  • KVS (Key-Value Store)
    • You can iterate by key
    • Supports transactions
  • OpenLDAP originally used Berkeley DB as its backend, but there were various issues, and LMDB's development began as a solution to them
    • Although I didn't know about it, it seems to be a fairly well-known system in the database systems domain
  • Embedded DB
  • Read Optimized
  • According to the official website, it is "fully transactional with full ACID semantics"
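Before going into the design, here is a minimal sketch of what the points above look like through LMDB's C API (error handling trimmed for brevity; the directory "./testdb" is an assumed path and must exist beforehand). It stores one key/value pair in a write transaction, then iterates all pairs in key order with a cursor.

```c
#include <stdio.h>
#include <lmdb.h>

int main(void) {
    MDB_env *env;
    MDB_txn *txn;
    MDB_dbi dbi;
    MDB_cursor *cur;
    MDB_val key = { 5, (void *)"hello" };
    MDB_val val = { 5, (void *)"world" };

    mdb_env_create(&env);
    mdb_env_open(env, "./testdb", 0, 0664);

    /* Write transaction: LMDB allows only one writer at a time. */
    mdb_txn_begin(env, NULL, 0, &txn);
    mdb_dbi_open(txn, NULL, 0, &dbi);
    mdb_put(txn, dbi, &key, &val, 0);
    mdb_txn_commit(txn);

    /* Read transaction: iterate by key; values point into the map. */
    mdb_txn_begin(env, NULL, MDB_RDONLY, &txn);
    mdb_cursor_open(txn, dbi, &cur);
    while (mdb_cursor_get(cur, &key, &val, MDB_NEXT) == 0)
        printf("%.*s = %.*s\n", (int)key.mv_size, (char *)key.mv_data,
               (int)val.mv_size, (char *)val.mv_data);
    mdb_cursor_close(cur);
    mdb_txn_abort(txn);

    mdb_env_close(env);
    return 0;
}
```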

Is LMDB's performance good?

Design of LMDB

  • This section covers the points of LMDB's design, and the points I found interesting, that I took from the following paper by the main developer, Howard Chu

  • use of mmap

    • (Although this is not particularly unusual for embedded databases)
    • The paper claims that using mmap improves I/O efficiency compared to using a buffer pool, but a paper published in 2022 argues that this is incorrect
    • TLB shootdowns occur frequently when mmap is used on the multi-core machines that have become commonplace in recent years (though the same should have applied to multi-processor machines), and this seems to make the situation worse
      • Regarding the LMDB paper, it may be necessary to interpret it in light of the time period in which it was written
    • After all, as of 2023, my thinking is that the advantage of using mmap is that it lets you do something similar to Pointer Swizzling, which is introduced in this year's Advent Calendar article, more easily and with less overhead (see the sketch below)
      • That it does something like that is speculation (I have not been able to confirm it in the implementation)
      • Even if it works as I predict, I think it would be hard to say whether that would be effective enough to overcome the disadvantages of using mmap shown by Andrew Crotty et al.
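For reference, here is a minimal sketch (my own illustration, not LMDB's source) of the kind of read-only mapping this design point is about: the whole DB file is mapped once with PROT_READ, and pages are then reached by plain pointer arithmetic instead of read() calls into a buffer pool.

```c
#include <stddef.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

void *map_db_readonly(const char *path, size_t *len_out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;
    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return NULL; }
    *len_out = (size_t)st.st_size;
    /* PROT_READ only: application code can never scribble on DB pages. */
    void *base = mmap(NULL, *len_out, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);  /* the mapping remains valid after closing the fd */
    return base == MAP_FAILED ? NULL : base;
}
```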
  • Read Optimized

  • Copy on Write semantics (page contents cannot be modified once written)

    • When performing an update, instead of rewriting the relevant page in place, a new page is created with the contents copied, and the updates are applied to it. I will explain later why this method works
  • Append Only B+tree

    • Under Copy on Write semantics, an existing node can never be rewritten directly, no matter how the links between nodes are represented, so write transactions need a different approach than usual to make changes to the tree. Therefore, a data structure/algorithm called the Append Only B+tree is used
    • The link below explains this
    • Let me explain the content of the linked page with some supplementary information...
      • The pages at the bottom of the tree are called leaves, and key-value pairs are stored in the leaves
      • A search proceeds from the top of the tree (the root) toward the leaves
      • When updating the contents of a leaf (adding a new leaf should work the same way), all pages from the target leaf up to the root node are replaced following the Copy on Write semantics
        • In the non-leaf copies, the links are rewritten to point to the appropriate new pages
      • Information about the root page, which is the starting point when a transaction accesses data, is recorded in a page called a meta page. The meta page is also newly created following Copy on Write and is written out to the DB file separately from the tree pages shown in the diagram. The root page information on the mmapped memory is presumably replaced through this mechanism (speculation)
      • By doing this, you effectively end up with a tree whose leaf contents have been updated. However, for a while the not-yet-updated tree also continues to exist
        • (Replaced pages are added to a list of reusable pages once they are found to be no longer needed, i.e. they become logically deleted)
      • This is the important point when executing multiple transactions concurrently and in parallel: another read transaction may already be traversing the tree at the moment you start updating it, and that read transaction can simply continue processing on the not-yet-updated tree
        • In this case, the read transaction may read the leaf's value from before the update, but I will explain later why this is not a problem
      • In other words, by adopting the Append Only B+tree, LMDB can execute read and write transactions without destroying data integrity and without any special exclusive control (a conceptual sketch of the path copying follows this list)
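As referenced above, here is a conceptual sketch of the path copying (my own illustration with invented types and helpers, not LMDB's actual implementation): the touched leaf and every ancestor up to the root are copied to fresh pages and relinked, while the old tree stays intact for in-flight readers.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct Page Page;
struct Page {
    int      nkeys;
    uint64_t keys[64];
    Page    *children[65];   /* NULL entries in a leaf */
};

Page *copy_page(const Page *p) {           /* CoW: never modify in place */
    Page *q = malloc(sizeof *q);
    memcpy(q, p, sizeof *q);
    return q;
}

/* path[0] is the root and path[depth-1] the target leaf of the OLD
 * tree; slot[d] is the index of path[d] inside its parent path[d-1]. */
Page *cow_update_path(Page *path[], int slot[], int depth) {
    Page *child_copy = NULL;
    for (int d = depth - 1; d >= 0; d--) {
        Page *copy = copy_page(path[d]);
        if (child_copy != NULL)
            copy->children[slot[d + 1]] = child_copy; /* relink child */
        child_copy = copy;
    }
    /* child_copy is now the NEW root; it is published by writing a new
     * meta page, while the old root stays valid for ongoing readers. */
    return child_copy;
}
```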
  • Methods for maintaining data integrity and ensuring crash resistance with CoW semantics and Append Only B+tree

    • Maintaining data integrity
      • After all, roughly speaking, concurrency control is done using the method explained under Append Only B+tree
      • I mentioned that while a write transaction is executing, an old tree and a new tree coexist, and a read transaction may read values from the older one. This creates a situation where multiple versions of an entry (effectively, of a page) exist, meaning that each transaction may see a different version
      • Yes, performing concurrency control between transactions this way is called MultiVersion Concurrency Control (MVCC)
      • It is probably obvious to those reading this article that data integrity can be maintained relatively easily with the behavior described above, so I will omit the details in this article!
        • Keeping in mind the constraint that there is only one write transaction at a time
    • Making it crash resistant
      • Many database systems achieve crash recovery with a method called Write Ahead Logging (WAL), writing logs about the start and end of write transactions, the operations performed, and so on, to a file. LMDB does not use this method
      • I don't think it was clearly explained in the LMDB paper, but the author's understanding (including speculation) is as follows
        • ① When a new page is created (including when a page is reused), instead of writing a log, its contents are written to the file each time, along with the meta page
        • ② Since only one write transaction executes at a time, when the writing in ① finishes there cannot be another transaction writing alongside it or with writes still pending
        • ③ Since the mmapped memory area is mapped read-only, there can be no dirty pages in it, and the OS never writes data from the mapped area back to the file
        • From these three points, the contents of the DB file are basically always synchronized with the in-memory contents and all data is consistent, so even if a crash occurs, recovery only requires re-reading the DB file
          • If a crash occurs while a write transaction is writing data
            • In this case, the contents of the DB file may be in an inconsistent state
            • However, as described in ①, the meta page is written out at the same time, so it seems possible to distinguish whether data is uncommitted by checking whether that write completed
              • As with WAL, LMDB must avoid notifying the application of transaction completion until all writes have completed
              • There was a statement in the paper or on the official web page that the DB file is also the log; I guess that is what it meant
            • Since LMDB can distinguish whether a write to storage completed or not, it can perform a rollback in the same way as the Undo of Redo/Undo recovery: it should be possible to discard the uncommitted write data and fall back to the pre-write versions of the corresponding data. By design, those previous versions always still exist after a crash, since they have not been reused (a sketch of the meta-page selection idea follows)
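To make the recovery idea concrete, here is a sketch of meta-page selection on startup. LMDB does keep two meta pages and alternates between them on commits, but the struct fields and the magic value below are my illustrative inventions, not its actual definitions.

```c
#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint32_t magic;      /* detects a torn or garbage meta page        */
    uint64_t txnid;      /* ID of the last committed write transaction */
    uint64_t root_pgno;  /* page number of that transaction's root     */
} Meta;

#define META_MAGIC 0xBEEFC0DEu  /* illustrative value */

/* On open, the valid meta page with the larger transaction ID wins,
 * so a torn write from a crashed commit is simply ignored. */
const Meta *pick_meta_on_open(const Meta *m0, const Meta *m1) {
    int ok0 = (m0->magic == META_MAGIC);
    int ok1 = (m1->magic == META_MAGIC);
    if (ok0 && ok1) return (m0->txnid > m1->txnid) ? m0 : m1;
    if (ok0) return m0;
    if (ok1) return m1;
    return NULL;  /* both damaged: the DB file is unrecoverable */
}
```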
  • Disk I/O is written directly to the file, not through the mmapped area

    • When mmapping, the mapped memory area is made read-only
    • Since the memory area is handed as is to the application that uses LMDB, if writing were enabled, a stray write from application code could destroy the data in the DB
    • When a write transaction performs a write operation, if there is a reusable page (= a page known to no longer be referenced), it reuses it and writes directly to the corresponding area on the file side. If there is no reusable page, the page is appended to the end of the DB file (see the sketch below)
      • Due to the Copy on Write semantics, each write operation produces pages (areas) in the DB file that are no longer referenced. It seems that if these were not reused, the DB file would grow quickly
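As referenced above, a sketch of the direct-write path (illustrative page size and helper name, not LMDB's code): a page is written straight to its offset in the DB file with pwrite(2), bypassing the read-only mapping; a reusable page number overwrites in place, while a fresh page number past the end of the file appends.

```c
#include <stdint.h>
#include <unistd.h>

#define PAGE_SIZE 4096  /* assumed page size */

/* pgno: a reusable (freed) page number, or the next number past the
 * current end of file when appending. */
int write_page(int fd, uint64_t pgno, const void *page) {
    off_t off = (off_t)(pgno * PAGE_SIZE);
    return pwrite(fd, page, PAGE_SIZE, off) == PAGE_SIZE ? 0 : -1;
}
```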
  • Passing data to the application with zero copy

    • The pointer into the memory area mapped with mmap is passed to the application as is
    • In other words, a read transaction performs no memory copying in the LMDB layer, so the number of memory copies is zero
    • If pointers handed to the application stayed accessible forever, old pages could never be reused, so LMDB requires the application to make explicit function calls that let it know when a transaction starts and ends. After the transaction ends, access to the passed memory area is no longer guaranteed (see the example below)
    • There is a caveat with this interface: as long as a transaction is not terminated, the pages containing data accessed during it cannot be reused, so the documentation says not to keep transactions open for a long time
      • I think it is still better than alternatives because it does not block the execution of other transactions
    • In other words, the application is required either to use the acquired data quickly and be done with it, or to copy the memory and save it to another location
      • Simply keeping a reference around is not acceptable, and there are many applications where the former is difficult. On the other hand, the latter poses the dilemma of copying memory even though it was acquired with zero copy, which may be a troubling point for users
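The example below shows both sides of this, using the real LMDB C API (error handling trimmed; env/dbi and the key "hello" are assumed from the earlier example): the returned MDB_val points straight into the map, and data that must outlive the transaction has to be copied out explicitly.

```c
#include <stdlib.h>
#include <string.h>
#include <lmdb.h>

char *read_and_keep(MDB_env *env, MDB_dbi dbi, size_t *len_out) {
    MDB_txn *txn;
    MDB_val key = { 5, (void *)"hello" }, val;
    char *copy = NULL;

    mdb_txn_begin(env, NULL, MDB_RDONLY, &txn);
    if (mdb_get(txn, dbi, &key, &val) == 0) {
        /* val.mv_data points into the mmapped file: zero copy, and
         * fine to read here ... */
        copy = malloc(val.mv_size);
        memcpy(copy, val.mv_data, val.mv_size); /* ... but it must be
                                                 * copied to outlive txn */
        *len_out = val.mv_size;
    }
    mdb_txn_abort(txn);
    /* from here on, val.mv_data must NOT be dereferenced */
    return copy;
}
```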
  • Sequential disk I/O with copy on write semantics

    • As mentioned above, when there is no reusable page, data is appended to the end of the file. This lets a write transaction write as much of its data as possible to the end of the file at commit time, so the I/O takes the form of efficient sequential access
    • As a side note, I think there was a similar story with LSM-tree based DBMSs

After all, why is it so fast?

  • I think the advantages over other systems come down to the following
    • (But does that make it that much faster? I'd like an opinion from an expert)
    • Zero copy
    • Read Optimized
      • Reads are not blocked by writes, and multiple read transactions can run
      • (On the other hand, writes cannot execute in multiple transactions simultaneously, so write performance may be average or below)
    • Raw pointers can be used in in-memory processing
      • In many DBMSs that use a buffer pool, pages must repeatedly be fetched from the pool via logical addresses (page IDs, etc.), but since LMDB is memory-mapped, those lookups may be replaced with raw-pointer accesses, either by doing something similar to Pointer Swizzling or by mapping to the same address at every initialization (I think; a hypothetical contrast follows this list)
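Here is the hypothetical contrast mentioned above (all type and function names are invented for illustration; neither side is a real DBMS API):

```c
#include <stdint.h>

#define PAGE_SIZE 4096

typedef struct Node Node;
struct Node {
    int      nkeys;
    uint64_t child_pgno[65];  /* logical page numbers of children */
};

/* Buffer-pool style: every hop pays a lookup (hash table, pin, latch). */
typedef struct BufferPool BufferPool;
Node *buffer_pool_fetch(BufferPool *pool, uint64_t pgno); /* hypothetical */

Node *child_via_pool(BufferPool *pool, const Node *n, int i) {
    return buffer_pool_fetch(pool, n->child_pgno[i]);
}

/* mmap style: with a stable base address, a page number is just an
 * offset, so the "lookup" collapses to pointer arithmetic. */
Node *child_via_map(void *map_base, const Node *n, int i) {
    return (Node *)((char *)map_base + n->child_pgno[i] * PAGE_SIZE);
}
```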

Suitable and unsuitable use cases

  • (Please note that this description is relative to other KVS-type DBMSs)
  • Suitable case
    • Read-bound workload case
  • Cases that are not suitable
    • Cases where the data size handled in one write transaction is too small
    • Consider adding or updating a new key-value pair with a 4-byte key and a 4-byte value, 8 bytes in total
    • LMDB processes data page by page, and once a page is created it is immutable, so each such operation involves the following inefficient processing:
      • In addition to the leaf page, every page on the path up to the root of the B+tree (including the root page) is copied in memory, updated, and then written to disk
        • (LMDB's page size may be larger than 4 KB depending on the settings)
    • This area is discussed, using terms such as "write amplification", in the exchange between the authors of the linked articles below (the second of which was written by the LMDB developer); a back-of-the-envelope calculation follows
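A back-of-the-envelope illustration of that write amplification (the page size, tree height, and meta-page count are assumptions made for the sake of the calculation, not measured LMDB figures):

```c
#include <stdio.h>

int main(void) {
    const int page_size   = 4096; /* assumed page size               */
    const int tree_height = 3;    /* root + inner + leaf (assumed)   */
    const int meta_pages  = 1;    /* meta page written per commit    */
    const int payload     = 8;    /* 4-byte key + 4-byte value       */

    /* Every page on the root-to-leaf path plus a meta page is rewritten. */
    int physical = (tree_height + meta_pages) * page_size;
    printf("%d bytes written for %d logical bytes: %dx amplification\n",
           physical, payload, physical / payload);
    /* -> 16384 bytes for 8 bytes: 2048x */
    return 0;
}
```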

Then, is it suitable for Nostr's relay server?

  • The data handled has somewhat relational aspects
    • => It seems manageable if the KVS is used properly
      • It's neither good nor bad
  • Read-bound workload
    • => LMDB is Read Optimized
      • Very Good
  • Writes mostly add new entries and rarely update existing ones
    • => LMDB in particular offers no advantage here
      • It's neither good nor bad
  • Access patterns have temporal locality (many requests are for entries recorded within the past few hours relative to the current time)
    • => This is a characteristic that DBMSs in general find easy to handle, and I do not think LMDB performs especially well on workloads with it. On the contrary, since LMDB only allows iteration in K order (index scan) and does not support iteration using other data (*), it can be seen as inferior to an RDB and the like here
      • Not Good
      • *: Here we assume that time information or the like would be used as the index key
  • My opinion: Considering LMDB's inherent speed, I think it is a better choice than common RDB (*) or NoSQL (including KVS) systems
    • *: I think using an RDB is overkill for what a relay wants to do
    • Note: This is based on the following assumptions
      • It is necessary to handle data larger than primary memory, or such a situation could arise
        • Without this premise, I think it would be better to use something like Redis
      • Stateless servers are not scaled horizontally with data fetched from a common DB server
      • A machine with a very large number of cores, such as the one used in the evaluation of Andrew Crotty et al.'s paper on mmap usage, is not used
        • The paper uses a 64-core machine for its evaluation
        • According to information on the Web, the negative performance impact of TLB shootdowns (not limited to I/O) seems to grow as the number of cores increases
        • Therefore, if you are not using a machine with a monstrous number of cores, the I/O performance disadvantage shown in the paper's evaluation results may not materialize (although I don't think the paper's claims can be ignored)
        • In other words, the disadvantage of the mmap-based LMDB compared to a DBMS that uses a buffer pool may not be that pronounced
        • Furthermore, LMDB may be able to avoid some of the mmap disadvantages other than the drop in read I/O performance due to TLB shootdowns
          • It does not write through the mapped area, which is the worst thing to do when using mmap
            • To guarantee durability, a DBMS must physically write data out to secondary storage; it is not enough for the data merely to have reached the OS's in-memory disk cache
            • Nevertheless, writing to storage via the mapped memory area would incur a memory copy into the OS's disk cache, which is unnecessary overhead
          • Memory management of the mmapped area
            • When a new page is loaded from the DB file into memory and an existing page must be evicted to free memory for it, which page gets evicted is an important point that a DBMS normally decides itself
            • In this respect, LMDB may exert some control over which pages are evicted by cleverly using the madvise function (a sketch follows)
              • I feel like implementing a buffer pool would still be better, though...
              • Although I am not certain of the details, I can confirm that there are calls to the madvise function in the implementation
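As referenced above, a sketch of the kind of hint meant here (the choice of advice flag is my illustration; which advice values LMDB actually passes at its call sites is not confirmed here):

```c
#include <stddef.h>
#include <sys/mman.h>

void hint_access_pattern(void *map_base, size_t map_len) {
    /* B+tree lookups jump around the file, so aggressive sequential
     * readahead would mostly fetch pages that are never used. */
    madvise(map_base, map_len, MADV_RANDOM);
}
```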

Side Note

  • Given nostr's inherently redundant architecture and the relatively loose reliability that users expect from its applications (this is just my understanding), I feel that a DBMS that tolerates data loss to a certain extent would be fine
    • Even if data for a certain time window is lost, it can be retrieved again from other relay servers
  • For example, wouldn't it be nice to have something like Redis that can handle data larger than primary memory, with persistence done via periodic dumps, as Redis does?

Conclusion

  • Enjoy!