@ryogrid
Last active December 23, 2023 01:54
Survey of the internal design of Lightning Memory-Mapped Database (LMDB) and its usage in real applications (Nostr relays)

Introduction

  • This article summarizes what I learned while surveying LMDB, a DBMS

Disclaimer

  • The author is not an expert on database systems, so some descriptions may be sloppy. Please keep this in mind, and I would appreciate it if you could point out any mistakes
  • Some of the content is based on speculation. I write in a way that makes it clear which parts are speculation
    • (Because I couldn't go as far as checking the implementation)

Why I decided to survey LMDB

  • There is a communication protocol architecture for distributed applications called nostr
  • Although it allows a relatively high degree of freedom, its main application at present is distributed SNS (microblogging), so the keyword nostr is often used to refer to the distributed SNS itself
  • For details, please refer to the link below
  • The author found LMDB because it is used in some implementations of nostr's relay servers
    • For example, it is used in an implementation called strfry
      • As will be explained later, LMDB is a KVS, and range scans can only be done on K (the key)
      • Since this is inconvenient, strfry wraps LMDB with a library called RasgueaDB, which provides a more sophisticated interface such as range scans that use specific data in V as an index (see the sketch after this list). It's an interesting approach
    • LMDB provides only primitive functionality, but conversely it is compact software that carries no implementation of advanced features. It also takes the form of an embedded DB. For these reasons, it is sometimes used as a storage engine for other DBMSs
      • For example, TiDB, a distributed database, apparently allowed LMDB to be used as its storage engine at one point in the past
      • InfluxDB, a time-series database, also considered it as a candidate when renewing its storage engine
  • After searching the web a bit, I found claims that it has high performance, with emphasis placed on it being memory-mapped (using mmap), so I became interested
    • The expression "high performance" is full of ambiguity, but please forgive me
  • The next section explains why I was interested in the memory-mapped aspect
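To illustrate the indexing trick mentioned above, here is a minimal sketch of the general technique (my own illustration of the idea, not RasgueaDB's actual code or key layout): encoding a field of V, such as a timestamp, as a big-endian prefix of K makes the KVS's key-ordered iteration double as a time-range scan.

```c
/* Sketch of the composite-key technique (illustrative, not RasgueaDB's
 * actual layout): key = [8-byte big-endian created_at][32-byte event id].
 * Big-endian encoding makes memcmp() order equal to numeric time order,
 * so a key-ordered cursor scan becomes a time-range scan. */
#include <stdint.h>
#include <string.h>

void make_index_key(uint8_t out[40], uint64_t created_at,
                    const uint8_t event_id[32]) {
    for (int i = 0; i < 8; i++)
        out[i] = (uint8_t)(created_at >> (56 - 8 * i));
    memcpy(out + 8, event_id, 32);
}
```

A range scan starting at time T would then position a cursor with an operation like LMDB's MDB_SET_RANGE on a key whose first 8 bytes encode T.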

Note: A simple explanation of the relay server

  • It is the only kind of server in the nostr architecture and acts as an intermediary for data between clients. Looking only at what it does, it could be called a peculiar database system with a stream interface (WebSocket)
  • Characteristics of relay server workload (as perceived by the author)
    • Structure of data to be handled
      • The data has somewhat relational aspects
    • Mostly read
    • Writes mostly add new entries and only rarely update existing ones
    • Access patterns have temporal locality
      • Many requests are for entries recorded within the past few hours relative to the current time

Overview of LMDB

  • Name: Lightning Memory-Mapped Database (LMDB)
    • It seems that it was originally called Memory-Mapped Database (MDB), but changed to its current name in 2012
  • KVS (Key-Value Store)
    • You can iterate by key
    • Supports transactions
  • OpenLDAP originally used Berkeley DB as its backend, but there were various issues, and LMDB's development began as a solution to them
    • Although I didn't know about it, it seems to be a fairly well-known system in the database systems domain
  • Embedded DB
  • Read Optimized
  • According to the official website, it is "fully transactional with full ACID semantics"
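Before going into the design, here is a minimal sketch of what the points above look like through LMDB's C API (error handling trimmed for brevity; the directory "./testdb" is an assumed path and must exist beforehand). It stores one key/value pair in a write transaction, then iterates all pairs in key order with a cursor.

```c
#include <stdio.h>
#include <lmdb.h>

int main(void) {
    MDB_env *env;
    MDB_txn *txn;
    MDB_dbi dbi;
    MDB_cursor *cur;
    MDB_val key = { 5, (void *)"hello" };
    MDB_val val = { 5, (void *)"world" };

    mdb_env_create(&env);
    mdb_env_open(env, "./testdb", 0, 0664);

    /* Write transaction: LMDB allows only one writer at a time. */
    mdb_txn_begin(env, NULL, 0, &txn);
    mdb_dbi_open(txn, NULL, 0, &dbi);
    mdb_put(txn, dbi, &key, &val, 0);
    mdb_txn_commit(txn);

    /* Read transaction: iterate by key; values point into the map. */
    mdb_txn_begin(env, NULL, MDB_RDONLY, &txn);
    mdb_cursor_open(txn, dbi, &cur);
    while (mdb_cursor_get(cur, &key, &val, MDB_NEXT) == 0)
        printf("%.*s = %.*s\n", (int)key.mv_size, (char *)key.mv_data,
               (int)val.mv_size, (char *)val.mv_data);
    mdb_cursor_close(cur);
    mdb_txn_abort(txn);

    mdb_env_close(env);
    return 0;
}
```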

Is LMDB's performance good?

Design of LMDB

  • This section covers the points of LMDB's design, and the points I found interesting, that I took from the following paper by the main developer, Howard Chu

  • use of mmap

    • (Although this is not particularly unusual for embedded databases)
    • The paper claims that using mmap improves I/O efficiency compared to using a buffer pool, but a paper published in 2022 argues that this is incorrect
    • TLB shootdowns occur frequently when mmap is used on the multi-core machines that have become commonplace in recent years (though the same should have applied to multi-processor machines), and this seems to make the situation worse
      • Regarding the LMDB paper, it may be necessary to interpret it in light of the time period in which it was written
    • After all, as of 2023, my thinking is that the advantage of using mmap is that it lets you do something similar to Pointer Swizzling, which is introduced in this year's Advent Calendar article, more easily and with less overhead (see the sketch below)
      • That it does something like that is speculation (I have not been able to confirm it in the implementation)
      • Even if it works as I predict, I think it would be hard to say whether that would be effective enough to overcome the disadvantages of using mmap shown by Andrew Crotty et al.
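For reference, here is a minimal sketch (my own illustration, not LMDB's source) of the kind of read-only mapping this design point is about: the whole DB file is mapped once with PROT_READ, and pages are then reached by plain pointer arithmetic instead of read() calls into a buffer pool.

```c
#include <stddef.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

void *map_db_readonly(const char *path, size_t *len_out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;
    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return NULL; }
    *len_out = (size_t)st.st_size;
    /* PROT_READ only: application code can never scribble on DB pages. */
    void *base = mmap(NULL, *len_out, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);  /* the mapping remains valid after closing the fd */
    return base == MAP_FAILED ? NULL : base;
}
```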
  • Read Optimized

  • Copy on Write semantics (page contents cannot be modified once written)

    • When performing an update, instead of rewriting the relevant page in place, a new page is created with the contents copied, and the updates are applied to it. I will explain later why this method works
  • Append Only B+tree

    • Under Copy on Write semantics, an existing node can never be rewritten directly, no matter how the links between nodes are represented, so write transactions need a different approach than usual to make changes to the tree. Therefore, a data structure/algorithm called the Append Only B+tree is used
    • The link below explains this
    • Let me explain the content of the linked page with some supplementary information...
      • The pages at the bottom of the tree are called leaves, and key-value pairs are stored in the leaves
      • A search proceeds from the top of the tree (the root) toward the leaves
      • When updating the contents of a leaf (adding a new leaf should work the same way), all pages from the target leaf up to the root node are replaced following the Copy on Write semantics
        • In the non-leaf copies, the links are rewritten to point to the appropriate new pages
      • Information about the root page, which is the starting point when a transaction accesses data, is recorded in a page called a meta page. The meta page is also newly created following Copy on Write and is written out to the DB file separately from the tree pages shown in the diagram. The root page information on the mmapped memory is presumably replaced through this mechanism (speculation)
      • By doing this, you effectively end up with a tree whose leaf contents have been updated. However, for a while the not-yet-updated tree also continues to exist
        • (Replaced pages are added to a list of reusable pages once they are found to be no longer needed, i.e. they become logically deleted)
      • This is the important point when executing multiple transactions concurrently and in parallel: another read transaction may already be traversing the tree at the moment you start updating it, and that read transaction can simply continue processing on the not-yet-updated tree
        • In this case, the read transaction may read the leaf's value from before the update, but I will explain later why this is not a problem
      • In other words, by adopting the Append Only B+tree, LMDB can execute read and write transactions without destroying data integrity and without any special exclusive control (a conceptual sketch of the path copying follows this list)
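As referenced above, here is a conceptual sketch of the path copying (my own illustration with invented types and helpers, not LMDB's actual implementation): the touched leaf and every ancestor up to the root are copied to fresh pages and relinked, while the old tree stays intact for in-flight readers.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct Page Page;
struct Page {
    int      nkeys;
    uint64_t keys[64];
    Page    *children[65];   /* NULL entries in a leaf */
};

Page *copy_page(const Page *p) {           /* CoW: never modify in place */
    Page *q = malloc(sizeof *q);
    memcpy(q, p, sizeof *q);
    return q;
}

/* path[0] is the root and path[depth-1] the target leaf of the OLD
 * tree; slot[d] is the index of path[d] inside its parent path[d-1]. */
Page *cow_update_path(Page *path[], int slot[], int depth) {
    Page *child_copy = NULL;
    for (int d = depth - 1; d >= 0; d--) {
        Page *copy = copy_page(path[d]);
        if (child_copy != NULL)
            copy->children[slot[d + 1]] = child_copy; /* relink child */
        child_copy = copy;
    }
    /* child_copy is now the NEW root; it is published by writing a new
     * meta page, while the old root stays valid for ongoing readers. */
    return child_copy;
}
```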
  • Methods for maintaining data integrity and ensuring crash resistance with CoW semantics and Append Only B+tree

    • Maintaining data integrity
      • After all, roughly speaking, concurrency control is done using the method explained under Append Only B+tree
      • I mentioned that while a write transaction is executing, an old tree and a new tree coexist, and a read transaction may read values from the older one. This creates a situation where multiple versions of an entry (effectively, of a page) exist, meaning that each transaction may see a different version
      • Yes, performing concurrency control between transactions this way is called MultiVersion Concurrency Control (MVCC)
      • It is probably obvious to those reading this article that data integrity can be maintained relatively easily with the behavior described above, so I will omit the details in this article!
        • Keeping in mind the constraint that there is only one write transaction at a time
    • Making it crash resistant
      • Many database systems achieve crash recovery with a method called Write Ahead Logging (WAL), writing logs about the start and end of write transactions, the operations performed, and so on, to a file. LMDB does not use this method
      • I don't think it was clearly explained in the LMDB paper, but the author's understanding (including speculation) is as follows
        • ① When a new page is created (including when a page is reused), instead of writing a log, its contents are written to the file each time, along with the meta page
        • ② Since only one write transaction executes at a time, when the writing in ① finishes there cannot be another transaction writing alongside it or with writes still pending
        • ③ Since the mmapped memory area is mapped read-only, there can be no dirty pages in it, and the OS never writes data from the mapped area back to the file
        • From these three points, the contents of the DB file are basically always synchronized with the in-memory contents and all data is consistent, so even if a crash occurs, recovery only requires re-reading the DB file
          • If a crash occurs while a write transaction is writing data
            • In this case, the contents of the DB file may be in an inconsistent state
            • However, as described in ①, the meta page is written out at the same time, so it seems possible to distinguish whether data is uncommitted by checking whether that write completed
              • As with WAL, LMDB must avoid notifying the application of transaction completion until all writes have completed
              • There was a statement in the paper or on the official web page that the DB file is also the log; I guess that is what it meant
            • Since LMDB can distinguish whether a write to storage completed or not, it can perform a rollback in the same way as the Undo of Redo/Undo recovery: it should be possible to discard the uncommitted write data and fall back to the pre-write versions of the corresponding data. By design, those previous versions always still exist after a crash, since they have not been reused (a sketch of the meta-page selection idea follows)
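To make the recovery idea concrete, here is a sketch of meta-page selection on startup. LMDB does keep two meta pages and alternates between them on commits, but the struct fields and the magic value below are my illustrative inventions, not its actual definitions.

```c
#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint32_t magic;      /* detects a torn or garbage meta page        */
    uint64_t txnid;      /* ID of the last committed write transaction */
    uint64_t root_pgno;  /* page number of that transaction's root     */
} Meta;

#define META_MAGIC 0xBEEFC0DEu  /* illustrative value */

/* On open, the valid meta page with the larger transaction ID wins,
 * so a torn write from a crashed commit is simply ignored. */
const Meta *pick_meta_on_open(const Meta *m0, const Meta *m1) {
    int ok0 = (m0->magic == META_MAGIC);
    int ok1 = (m1->magic == META_MAGIC);
    if (ok0 && ok1) return (m0->txnid > m1->txnid) ? m0 : m1;
    if (ok0) return m0;
    if (ok1) return m1;
    return NULL;  /* both damaged: the DB file is unrecoverable */
}
```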
  • Disk I/O is written directly to the file, not through the mmapped area

    • When mmapping, the mapped memory area is made read-only
    • Since the memory area is handed as is to the application that uses LMDB, if writing were enabled, a stray write from application code could destroy the data in the DB
    • When a write transaction performs a write operation, if there is a reusable page (= a page known to no longer be referenced), it reuses it and writes directly to the corresponding area on the file side. If there is no reusable page, the page is appended to the end of the DB file (see the sketch below)
      • Due to the Copy on Write semantics, each write operation produces pages (areas) in the DB file that are no longer referenced. It seems that if these were not reused, the DB file would grow quickly
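As referenced above, a sketch of the direct-write path (illustrative page size and helper name, not LMDB's code): a page is written straight to its offset in the DB file with pwrite(2), bypassing the read-only mapping; a reusable page number overwrites in place, while a fresh page number past the end of the file appends.

```c
#include <stdint.h>
#include <unistd.h>

#define PAGE_SIZE 4096  /* assumed page size */

/* pgno: a reusable (freed) page number, or the next number past the
 * current end of file when appending. */
int write_page(int fd, uint64_t pgno, const void *page) {
    off_t off = (off_t)(pgno * PAGE_SIZE);
    return pwrite(fd, page, PAGE_SIZE, off) == PAGE_SIZE ? 0 : -1;
}
```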
  • Passing data to the application with zero copy

    • The pointer into the memory area mapped with mmap is passed to the application as is
    • In other words, a read transaction performs no memory copying in the LMDB layer, so the number of memory copies is zero
    • If pointers handed to the application stayed accessible forever, old pages could never be reused, so LMDB requires the application to make explicit function calls that let it know when a transaction starts and ends. After the transaction ends, access to the passed memory area is no longer guaranteed (see the example below)
    • There is a caveat with this interface: as long as a transaction is not terminated, the pages containing data accessed during it cannot be reused, so the documentation says not to keep transactions open for a long time
      • I think it is still better than alternatives because it does not block the execution of other transactions
    • In other words, the application is required either to use the acquired data quickly and be done with it, or to copy the memory and save it to another location
      • Simply keeping a reference around is not acceptable, and there are many applications where the former is difficult. On the other hand, the latter poses the dilemma of copying memory even though it was acquired with zero copy, which may be a troubling point for users
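The example below shows both sides of this, using the real LMDB C API (error handling trimmed; env/dbi and the key "hello" are assumed from the earlier example): the returned MDB_val points straight into the map, and data that must outlive the transaction has to be copied out explicitly.

```c
#include <stdlib.h>
#include <string.h>
#include <lmdb.h>

char *read_and_keep(MDB_env *env, MDB_dbi dbi, size_t *len_out) {
    MDB_txn *txn;
    MDB_val key = { 5, (void *)"hello" }, val;
    char *copy = NULL;

    mdb_txn_begin(env, NULL, MDB_RDONLY, &txn);
    if (mdb_get(txn, dbi, &key, &val) == 0) {
        /* val.mv_data points into the mmapped file: zero copy, and
         * fine to read here ... */
        copy = malloc(val.mv_size);
        memcpy(copy, val.mv_data, val.mv_size); /* ... but it must be
                                                 * copied to outlive txn */
        *len_out = val.mv_size;
    }
    mdb_txn_abort(txn);
    /* from here on, val.mv_data must NOT be dereferenced */
    return copy;
}
```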
  • Sequential disk I/O with copy on write semantics

    • As mentioned above, when there is no reusable page, data is appended to the end of the file. This lets a write transaction write as much of its data as possible to the end of the file at commit time, so the I/O takes the form of efficient sequential access
    • As a side note, I think there was a similar story with LSM-tree based DBMSs

After all, why is it so fast?

  • I think the advantages over other systems come down to the following
    • (But does that make it that much faster? I'd like an opinion from an expert)
    • Zero copy
    • Read Optimized
      • Reads are not blocked by writes, and multiple read transactions can run
      • (On the other hand, writes cannot execute in multiple transactions simultaneously, so write performance may be average or below)
    • Raw pointers can be used in in-memory processing
      • In many DBMSs that use a buffer pool, pages must repeatedly be fetched from the pool via logical addresses (page IDs, etc.), but since LMDB is memory-mapped, those lookups may be replaced with raw-pointer accesses, either by doing something similar to Pointer Swizzling or by mapping to the same address at every initialization (I think; a hypothetical contrast follows this list)
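Here is the hypothetical contrast mentioned above (all type and function names are invented for illustration; neither side is a real DBMS API):

```c
#include <stdint.h>

#define PAGE_SIZE 4096

typedef struct Node Node;
struct Node {
    int      nkeys;
    uint64_t child_pgno[65];  /* logical page numbers of children */
};

/* Buffer-pool style: every hop pays a lookup (hash table, pin, latch). */
typedef struct BufferPool BufferPool;
Node *buffer_pool_fetch(BufferPool *pool, uint64_t pgno); /* hypothetical */

Node *child_via_pool(BufferPool *pool, const Node *n, int i) {
    return buffer_pool_fetch(pool, n->child_pgno[i]);
}

/* mmap style: with a stable base address, a page number is just an
 * offset, so the "lookup" collapses to pointer arithmetic. */
Node *child_via_map(void *map_base, const Node *n, int i) {
    return (Node *)((char *)map_base + n->child_pgno[i] * PAGE_SIZE);
}
```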

Suitable and unsuitable use cases

  • (Please note that this description is relative to other KVS-type DBMSs)
  • Suitable case
    • Read-bound workload case
  • Cases that are not suitable
    • Cases where the data size handled in one write transaction is too small
    • Consider adding or updating a new key-value pair with a 4-byte key and a 4-byte value, 8 bytes in total
    • LMDB processes data page by page, and once a page is created it is immutable, so each such operation involves the following inefficient processing:
      • In addition to the leaf page, every page on the path up to the root of the B+tree (including the root page) is copied in memory, updated, and then written to disk
        • (LMDB's page size may be larger than 4 KB depending on the settings)
    • This area is discussed, using terms such as "write amplification", in the exchange between the authors of the linked articles below (the second of which was written by the LMDB developer); a back-of-the-envelope calculation follows
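A back-of-the-envelope illustration of that write amplification (the page size, tree height, and meta-page count are assumptions made for the sake of the calculation, not measured LMDB figures):

```c
#include <stdio.h>

int main(void) {
    const int page_size   = 4096; /* assumed page size               */
    const int tree_height = 3;    /* root + inner + leaf (assumed)   */
    const int meta_pages  = 1;    /* meta page written per commit    */
    const int payload     = 8;    /* 4-byte key + 4-byte value       */

    /* Every page on the root-to-leaf path plus a meta page is rewritten. */
    int physical = (tree_height + meta_pages) * page_size;
    printf("%d bytes written for %d logical bytes: %dx amplification\n",
           physical, payload, physical / payload);
    /* -> 16384 bytes for 8 bytes: 2048x */
    return 0;
}
```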

Then, is it suitable for Nostr's relay server?

  • The data handled has somewhat relational aspects
    • => It seems manageable if the KVS is used properly
      • It's neither good nor bad
  • Read-bound workload
    • => LMDB is Read Optimized
      • Very Good
  • Writes mostly add new entries and rarely update existing ones
    • => LMDB in particular offers no advantage here
      • It's neither good nor bad
  • Access patterns have temporal locality (many requests are for entries recorded within the past few hours relative to the current time)
    • => This is a characteristic that DBMSs in general find easy to handle, and I do not think LMDB performs especially well on workloads with it. On the contrary, since LMDB only allows iteration in K order (index scan) and does not support iteration using other data (*), it can be seen as inferior to an RDB and the like here
      • Not Good
      • *: Here we assume that time information or the like would be used as the index key
  • My opinion: Considering LMDB's inherent speed, I think it is a better choice than common RDB (*) or NoSQL (including KVS) systems
    • *: I think using an RDB is overkill for what a relay wants to do
    • Note: This is based on the following assumptions
      • It is necessary to handle data larger than primary memory, or such a situation could arise
        • Without this premise, I think it would be better to use something like Redis
      • Stateless servers are not scaled horizontally with data fetched from a common DB server
      • A machine with a very large number of cores, such as the one used in the evaluation of Andrew Crotty et al.'s paper on mmap usage, is not used
        • The paper uses a 64-core machine for its evaluation
        • According to information on the Web, the negative performance impact of TLB shootdowns (not limited to I/O) seems to grow as the number of cores increases
        • Therefore, if you are not using a machine with a monstrous number of cores, the I/O performance disadvantage shown in the paper's evaluation results may not materialize (although I don't think the paper's claims can be ignored)
        • In other words, the disadvantage of the mmap-based LMDB compared to a DBMS that uses a buffer pool may not be that pronounced
        • Furthermore, LMDB may be able to avoid some of the mmap disadvantages other than the drop in read I/O performance due to TLB shootdowns
          • It does not write through the mapped area, which is the worst thing to do when using mmap
            • To guarantee durability, a DBMS must physically write data out to secondary storage; it is not enough for the data merely to have reached the OS's in-memory disk cache
            • Nevertheless, writing to storage via the mapped memory area would incur a memory copy into the OS's disk cache, which is unnecessary overhead
          • Memory management of the mmapped area
            • When a new page is loaded from the DB file into memory and an existing page must be evicted to free memory for it, which page gets evicted is an important point that a DBMS normally decides itself
            • In this respect, LMDB may exert some control over which pages are evicted by cleverly using the madvise function (a sketch follows)
              • I feel like implementing a buffer pool would still be better, though...
              • Although I am not certain of the details, I can confirm that there are calls to the madvise function in the implementation
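As referenced above, a sketch of the kind of hint meant here (the choice of advice flag is my illustration; which advice values LMDB actually passes at its call sites is not confirmed here):

```c
#include <stddef.h>
#include <sys/mman.h>

void hint_access_pattern(void *map_base, size_t map_len) {
    /* B+tree lookups jump around the file, so aggressive sequential
     * readahead would mostly fetch pages that are never used. */
    madvise(map_base, map_len, MADV_RANDOM);
}
```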

Side Note

  • Given nostr's inherently redundant architecture and the relatively loose reliability that users expect from its applications (this is just my understanding), I feel that a DBMS that tolerates data loss to a certain extent would be fine
    • Even if data for a certain time window is lost, it can be retrieved again from other relay servers
  • For example, wouldn't it be nice to have something like Redis that can handle data larger than primary memory, with persistence done via periodic dumps, as Redis does?

Conclusion

  • Enjoy!