Skip to content

Instantly share code, notes, and snippets.

@johnpmitsch
Last active August 29, 2020 15:26
Show Gist options
  • Save johnpmitsch/b31579d057359b57e61785b59d7aad6f to your computer and use it in GitHub Desktop.
Save johnpmitsch/b31579d057359b57e61785b59d7aad6f to your computer and use it in GitHub Desktop.
Bradfield CS database notes

Databases

Indexing

  • B+ trees are always best, but ISAM came first, ISAM is a simpler way to start learning
  • Each node is a page on disk and has a pointer to the related parts, 'block 7'
  • B+ tree, only leaf nodes matter, the rest are to direct traffic
  • binary search is really difficult, it is reads on disk, so instead of this we use an "index" file that stores key values with pointers to pages
  • B+ tree is height-balanced.
    • each node has 50% pointers on them because pages will be split in half to make space
  • pointers can be between leaves of the tree
  • because the b+ tree is height balanced, the top levels can be kept "hot" in a cache/memory, reducing the I/O for a lookup
  • New leaves can be created up until a fixed number, then a new level of the tree will be created.
  • When inserting new values, if there is room in the page, nothing changes except for the insertion. If not, the leaf splits into two and is inserted. This can happen recursively.
  • In practice, most db systems don't actually ensure half-full page leaves, they just leave them there. This is because most people have insert-heavy workflows, not delete-heavy ones.

Joins

  • Joins are rendevous, combining information from multiple files/locations
  • To combine multiple groups of data and scan them, you can use multiple approaches:
    • Nested loops: slow, iterate over everything
      • page-oriented nested loops: read by groups, a bit faster than just nested loops, but still slow
    • Sort-merge join: rows are sorted, then walked through, can set pointers on the values and work from there, no need to scan higher values since they are sorted.
    • Hash-merge join: hash one side, hash the other, then join common
  • Databases use different algorithms depending on the situation, work of the query optimizer

Relational Model and Query Optimization

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment