johnpmitsch/a.md

## a.md

      
    Raw
  

              a.md
            
          
    Databases

Indexing


B+ trees are always best, but ISAM came first, ISAM is a simpler way to start learning
Each node is a page on disk and has a pointer to the related parts, 'block 7'
B+ tree, only leaf nodes matter, the rest are to direct traffic
binary search is really difficult, it is reads on disk, so instead of this we use an "index" file that stores key values with pointers to pages
B+ tree is height-balanced.

each node has 50% pointers on them because pages will be split in half to make space


pointers can be between leaves of the tree
because the b+ tree is height balanced, the top levels can be kept "hot" in a cache/memory, reducing the I/O for a lookup
New leaves can be created up until a fixed number, then a new level of the tree will be created.
When inserting new values, if there is room in the page, nothing changes except for the insertion. If not, the leaf splits into two and is inserted. This can happen recursively.
In practice, most db systems don't actually ensure half-full page leaves, they just leave them there. This is because most people have insert-heavy workflows, not delete-heavy ones.

Joins


Joins are rendevous, combining information from multiple files/locations
To combine multiple groups of data and scan them, you can use multiple approaches:

Nested loops: slow, iterate over everything

page-oriented nested loops: read by groups, a bit faster than just nested loops, but still slow


Sort-merge join: rows are sorted, then walked through, can set pointers on the values and work from there, no need to scan higher values since they are sorted.
Hash-merge join: hash one side, hash the other, then join common


Databases use different algorithms depending on the situation, work of the query optimizer

Relational Model and Query Optimization