Random database IDs

I'm inferring this based on vague statements on the internet and a reading of how B-trees work.

Databases and file systems use B-tree indexes to implement key-value maps (such as the indexes on a database table). Since a B-tree is a search tree rather than a hash table, it is inherently ordered.
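
A tiny illustration of the difference, with a plain Python list and dict standing in for the real structures (the keys here are arbitrary): an ordered index supports range scans, while a hash table only supports point lookups.

```python
import bisect

ordered_index = [3, 7, 12, 19, 25, 31]   # keys kept sorted, as in a B-tree
lo = bisect.bisect_left(ordered_index, 10)
hi = bisect.bisect_right(ordered_index, 30)
print(ordered_index[lo:hi])              # [12, 19, 25] -- every key in [10, 30]

hash_index = {3: "a", 7: "b", 12: "c"}   # hash table: point lookups only
print(12 in hash_index)                  # True, but a range scan needs a full pass
```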

Cache locality

When inserting sequentially increasing values into a B-tree, you always hit the rightmost leaf node (sorting left to right), since that's where a value that is higher than all the existing values must be inserted. When the leaf is full, it is split in two and the parent node is updated to have an extra child. Since the parent may also already be full, it may need to be split, and so on. When the root node is full, it needs to be split, creating a new root and increasing the height of the tree by 1 level.
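
To make the mechanics concrete, here is a toy model (not any real engine: it assumes a capacity of 10 keys per node and the plain 50/50 split; a more realistic split rule is covered under Fragmentation below) that tracks only the key count of each node. Every append touches only the rightmost node of each level:

```python
CAPACITY = 10  # assumed keys per node, for illustration only

def append_key(levels):
    """levels[0] is the leaf level; levels[i] is a list of per-node key
    counts, left to right. An appended key always lands in the last node."""
    level = 0
    levels[level][-1] += 1
    # Propagate splits up the rightmost edge of the tree.
    while levels[level][-1] > CAPACITY:
        levels[level][-1] = CAPACITY // 2               # old node keeps half
        levels[level].append(CAPACITY - CAPACITY // 2)  # new right sibling
        level += 1
        if level == len(levels):
            levels.append([0])   # new root: the tree grows one level taller
        levels[level][-1] += 1   # parent gains the pivot/separator

levels = [[0]]
for _ in range(1000):
    append_key(levels)
print([len(nodes) for nodes in levels])  # node count per level, leaves first
```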

So all the insertion activity happens along the right edge of the tree. If nodes hold 10 records, then every insert updates the rightmost leaf, every 10th insert also updates its parent node, every 100th insert updates the parent's parent, and so on. This is very cache-friendly: 10 inserts touch only one node, 100 inserts touch a working set of only two nodes, and so on. Those nodes can remain in memory the entire time and may not even need to be written to disk.

If your inserts are random, however, then it will be a random leaf that needs to be updated by each insert. Unless the entire index tree resides in memory, nodes will need to be shuffled between memory and disk randomly. The larger the index tree, the more cache contention there will be.
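
A crude way to see this is to map each key to a fixed leaf page and replay both insert orders against a small LRU page cache (the page size and cache size are arbitrary assumptions; real buffer pools are more sophisticated):

```python
import random
from collections import OrderedDict

KEYS_PER_PAGE = 100  # assumed leaf capacity
CACHE_PAGES = 10     # assumed cache size, far smaller than the 1000 pages

def cache_misses(keys):
    cache = OrderedDict()  # page id -> None, kept in LRU order
    misses = 0
    for key in keys:
        page = key // KEYS_PER_PAGE
        if page in cache:
            cache.move_to_end(page)          # mark as most recently used
        else:
            misses += 1
            cache[page] = None
            if len(cache) > CACHE_PAGES:
                cache.popitem(last=False)    # evict the least recently used
    return misses

keys = list(range(100_000))
print(cache_misses(keys))   # sequential: 1000, one miss per page
random.shuffle(keys)
print(cache_misses(keys))   # random: ~99,000, nearly one miss per insert
```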

Fragmentation

The fragmentation problem arises from the way leaf nodes are split when they become full. Splitting a node means choosing a value to act as the "pivot": values below the pivot go into the new left node, values above (or equal to) it go into the new right node, and the right node is inserted into the parent using the pivot as its starting value.

If the middle (median) value is used as the pivot, the node is split 50/50, but a sequence of increasing inserts then leaves behind a trail of half-full leaf nodes, which is not space efficient. So an optimization is made when the new value is higher than every value in the leaf: instead of building two 50% full left and right nodes, the full node is reused as the new left node and a new right node is built containing only the new record. (The same optimization can be applied when adding the new leaf to the parent node.) This way, inserting increasing values leaves behind a trail of 100% full leaf nodes, and since each new right leaf starts with a single record instead of 50% full, splits happen half as often.
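
A sketch of the two strategies side by side (leaf level only; the parent update and pivot bookkeeping are left out, and real engines differ in the details):

```python
def split_leaf(keys, new_key):
    """Insert new_key into a full, sorted leaf; return (left, right) leaves."""
    if new_key > keys[-1]:
        # Rightmost optimization: the old leaf stays 100% full and the new
        # right leaf starts with just the incoming record.
        return list(keys), [new_key]
    # General case: insert in order, then split at the median, leaving two
    # roughly 50% full leaves.
    keys = sorted(keys + [new_key])
    mid = len(keys) // 2
    return keys[:mid], keys[mid:]

full_leaf = list(range(0, 20, 2))  # a full leaf holding 0, 2, ..., 18
print(split_leaf(full_leaf, 99))   # ([0, 2, ..., 18], [99]) -- full left node
print(split_leaf(full_leaf, 7))    # two half-full nodes
```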

However, if your insert values are random rather than increasing, then you will frequently insert a value between two existing values, node splits are unavoidable, and nodes end up roughly 70-75% full on average (naively, halfway between the 100% usage immediately before a split and the 50% usage immediately after; the classical analysis of random insertions puts it at ln 2 ≈ 69%).
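
A quick simulation of that steady state (uniform random keys, an assumed capacity of 100 keys per leaf, median splits only) lands near the classical figure:

```python
import bisect, random

CAPACITY = 100
leaves = [[]]  # each leaf is a sorted list of keys
bounds = []    # bounds[i] is the lowest key of leaves[i + 1]

for _ in range(200_000):
    key = random.random()
    i = bisect.bisect_right(bounds, key)  # find the leaf this key belongs in
    bisect.insort(leaves[i], key)
    if len(leaves[i]) > CAPACITY:
        leaf = leaves[i]
        mid = len(leaf) // 2
        leaves[i:i + 1] = [leaf[:mid], leaf[mid:]]  # median split
        bounds.insert(i, leaf[mid])

utilization = sum(len(l) for l in leaves) / (CAPACITY * len(leaves))
print(utilization)  # settles around 0.69-0.70
```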

Clustered indexes

A "clustered" index is one where the table data is stored directly in the leaf nodes. In a non-clustered index the leaf nodes instead contain pointers into a different data structure containing the table data (a "heap", like the heap operated on by malloc()/free() I guess).

This means that the concerns about indexes (cache locality and fragmentation) also become concerns about the table data, since they are literally the same data structure. If your table data is not in a clustered index, however, the two are unrelated: cache locality and fragmentation of the table data are decided by a separate policy, which probably groups rows together by time of insertion rather than by primary key.

Postgres primary keys are not clustered indexes, but the CLUSTER command can be used to sort the table data by the index values as a once-off. This way, applications designed to be friendly to B-trees (e.g. reads/writes grouped at the highest key values) will also be friendly to the layout of the table data (reads/writes grouped at the highest block offsets). The Postgres docs say:

In cases where you are accessing single rows randomly within a table, the actual order of the data in the table is unimportant. However, if you tend to access some data more than others, and there is an index that groups them together, you will benefit from using CLUSTER.
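
(For reference, the command takes the form `CLUSTER table_name USING index_name;`, with your own table and index names. Since the clustering is not maintained as new rows arrive, it has to be re-run periodically to keep the layout sorted.)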
