Skip to content

Instantly share code, notes, and snippets.

@sreeix
Last active March 25, 2022 07:20
Show Gist options
  • Star 16 You must be signed in to star a gist
  • Fork 4 You must be signed in to fork a gist
  • Save sreeix/45aec4f51a3578a20700 to your computer and use it in GitHub Desktop.
Save sreeix/45aec4f51a3578a20700 to your computer and use it in GitHub Desktop.
Notes for reading LSM tree paper

Notes for reading the LSM tree paper

  • Slow disk access: Disk(not withstanding SSD) are very slow for access compared to memory. Generally established number for this is 3 orders of magnitude. ie. a memory lookup that may take 10 microseconds will take 10 ms when it hits the disk. So all databases will avoid reading/writing to disk as much as possible(think where does mongodb's perf gains come from?)
  • Five minute rule refers to the Jim gray paper on when to cache an object in memory. Wikipedia, [Paper] (http://www.hpl.hp.com/techreports/tandem/TR-86.1.pdf), a later paper Five minute rule, ten years later and a later paper Five minute rule 20 years later and how flash memory changes it.

But from the perspective of the paper. The cost of increasing the Ram size

Intuition for B tree based indexes. A sparse index, that has the Pointer to the start of the first record on the page, thse Assume we want to do select * from employees. We can't just scan every available disk block. So we store all the elements in this relation in some form in a different place on the disk(same/different, typically different ) file. Obviously storing the data here does not make sense, so only pointers to the disk block are stored. Now we just have to find something like select * from employees where id = 100;

Assume we are querying on the db for something like select * from employees where salary > 10000; Remember salary is(typically) stored contigiously on the disk based on the employeeid(assuming employeeid is primary key)

Terms
  • TPC - A bunch of tests suites that are used to test for database performance. Usually you will hear something like "DB2 beats Oracle on the TPCC benchmark"
  • Transaction Logs - For durability, most databases will first write the changes it is going to perform on to the database to a file. These are transaction logs, also known as Write ahead logging (WAL). Because this is write only and a lot of data is being written fairly continuosly and flushed, the performance of TLOGS become super important for db performance
  • Btrees -> loosly refers to B+ trees. A data structure that allows fast inserts and finds (for a specific id) and also sequential ordered access.
  • Blocks -> refers to the the contents of one node in the btree. A block in b+ tree contains the keys for guiding the search process. And number size of the block has is related to how tall the tree is going to be.It has nothing to do with the disk blocks(sectors)
  • Pages -> Refers to a smallest chunk of information db can work with. Ie. do locks and persistence. Data in a db are kept in a series of pages that are mapped to the disk(sectors) A Btree will hold links to these data pages. Multiple pages can be resident on a single disk block(not the same block as above)
  • Multi-page blocks -> A block that contains multiple pages and can be fit into a single read/write operations. In LSM tree a block contains as many entires as can fit a disk block.
  • Leaf nodes - Typically the pointers to the records on disk?
  • Buffer(pool) -> A collection of data pages that have been loaded into the main memory. Typically managed by buffer manager. All operation on pages happen in memory(ie. search on the recordid/match record fields.), so if the page requested is not in memory(buffer) then it is loaded from the disk into the main memory, some exisitng pages may have to be evicted.
  • Records -> Each row of data stored on the disk. One page typically has multiple records stored sequentially(clustered) Row id (RID) -> Record id pointer to the record on the disk.(

Merge process(also called compaction in some papers) -> On reaching memory thresholds, the tree is merged downstream tree, similar to merge sort where 2 pointers are used to track the current smallest( remember that indexes need to be kept in sorted order.)

Things to think about while reading the paper
  • What are the big ideas?
  • How does this data structure work for concurrent Access?. Most databases use locking for acheiving Isolation and Atomicity. Does the locking work better on BTrees or LSM trees?
  • What about durability? Is it easier to checkpoint and write to disk with btrees?
  • What sort of in memory structure would be better for fast insertions and scans?
  • Why are the new databases(KV) now using LSM?
Things that may make sense?
  • Can/Should you have multiple memtables. Does it improve performance?
Display the source blob
Display the rendered blob
Raw
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
View raw

(Sorry about that, but we can’t show files that are this big right now.)

Display the source blob
Display the rendered blob
Raw
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
@at15
Copy link

at15 commented Jan 24, 2017

In the term part about B Tree, there seems to be a typo

And number size of the block has is related to how tall the tree is going to be

'And number of blocks a node has is related to how tall the tree is going to be' might be a fix

And another one in the slide's personal context part

User a slightly different db terminology

User should be Use I guess

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment