I often see people get confused by binary search tree balancing. This is very fair, because all the algorithms to make it work are quite complicated, and understanding them deeply is a lot of work.
But the misconceptions I see are often not about the mechanics of BST balancing, which are legitimately intricate, but about the motivation and goals of these mechanisms. So I’ve written a few brief notes here in the hope of clarifying what BST balancing mechanisms are supposed to do.
The key selling point of a BST is that it has O(log(n)) search and modification. The O(log(n)) here comes from the height of the tree. So we just require that the height of the tree doesn’t grow to more than a constant multiple of the logarithm of the number of nodes.
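To see where the height dependence comes from, here's a minimal sketch (my own toy code, not from any particular library) of plain BST search. The loop runs once per level of the tree, so its cost is proportional to the height, which is exactly why we care about keeping the height logarithmic:

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right

def search(root, key):
    # Walks down one level per iteration, so the cost is O(height).
    node = root
    while node is not None:
        if key == node.key:
            return node
        node = node.left if key < node.key else node.right
    return None

# Usage: a tiny hand-built tree.
tree = Node(5, Node(2, Node(1), Node(3)), Node(8))
assert search(tree, 3).key == 3
assert search(tree, 7) is None
```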
Two common balancing mechanisms are AVL trees and red-black trees. In a red-black tree, the maintained invariant is that the paths from the root to a leaf can’t differ in length by more than a factor of two. So it’s fine for your left subtree to have a depth of 10 and your right subtree to have a depth of 20, but it’s not okay to have depths 10 and 22. This is good enough to ensure that height is only a constant multiple of log(n). (To see this, try to construct a tree with a given number of items but with the maximum height.)
(You might remember the rules of red-black trees. The important ones are: All nodes are red or black. Red nodes can’t connect to other red nodes. There are the same number of black nodes on every path from the root to a leaf. These rules lead pretty immediately to the invariant that distance-to-leaf can’t vary by more than a factor of two: the longest possible path alternates red and black nodes, and the shortest possible path is all black. Another way of thinking about red-black trees is as a weird way of representing B-trees. If you want to learn about this, I recommend these slides.)
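Those rules are short enough to check mechanically. Here's a sketch of a checker on a toy node type (the `RBNode` layout and the "R"/"B" encoding are mine, invented for illustration; empty children count as black leaves, as usual):

```python
class RBNode:
    def __init__(self, key, color, left=None, right=None):
        self.key, self.color = key, color
        self.left, self.right = left, right

def black_height(node):
    """Return the black-height of the subtree if the red-black
    rules hold below this node; raise ValueError otherwise."""
    if node is None:
        return 1  # empty leaves count as black
    if node.color == "R":
        for child in (node.left, node.right):
            if child is not None and child.color == "R":
                raise ValueError("red node with a red child")
    lh = black_height(node.left)
    rh = black_height(node.right)
    if lh != rh:
        raise ValueError("unequal black counts on two paths")
    return lh + (1 if node.color == "B" else 0)

# A valid little tree: black root with two red children.
ok = RBNode(2, "B", RBNode(1, "R"), RBNode(3, "R"))
assert black_height(ok) == 2
```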
In an AVL tree, the rule is that your two subtrees have to have heights that differ by no more than one. This also ensures that your height is no more than a constant multiple of the logarithm of your number of nodes.
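The AVL rule is even easier to state in code. This is a sketch of the invariant check only (not the rebalancing), using bare tuples `(key, left, right)` as nodes for brevity:

```python
def height(node):
    # node is None or a (key, left, right) tuple
    if node is None:
        return 0
    _, left, right = node
    return 1 + max(height(left), height(right))

def is_avl(node):
    # At every node, the two subtree heights differ by at most one.
    if node is None:
        return True
    _, left, right = node
    return (abs(height(left) - height(right)) <= 1
            and is_avl(left) and is_avl(right))

balanced = (2, (1, None, None), (3, None, None))
stick = (1, None, (2, None, (3, None, None)))  # degenerate chain
assert is_avl(balanced)
assert not is_avl(stick)
```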
In both these types of BSTs, it’s a lot easier to describe the invariant the tree maintains than to explain exactly how the tree maintains this invariant when you insert or delete items. I recommend not really worrying about that. You can figure it out on the spot if you need to, by remembering the invariant that the tree is trying to maintain. (Figuring it out will be easier if you remember what BST rotations are.)
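If you don't remember rotations, here's a sketch of a left rotation, the primitive that both AVL and red-black rebalancing are built from (tuples stand in for nodes: `(key, left, right)`). The key fact is that a rotation reshapes the tree while preserving the in-order key sequence:

```python
def rotate_left(node):
    # Pull the right child up to become the new root of this subtree.
    key, left, (rkey, rleft, rright) = node
    return (rkey, (key, left, rleft), rright)

def keys_in_order(node):
    if node is None:
        return []
    key, left, right = node
    return keys_in_order(left) + [key] + keys_in_order(right)

before = (1, None, (2, None, (3, None, None)))
after = rotate_left(before)
# The in-order key sequence is unchanged by the rotation.
assert keys_in_order(before) == keys_in_order(after) == [1, 2, 3]
assert after == (2, (1, None, None), (3, None, None))
```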
An extremely common misconception is “Balance means that you have similar numbers of nodes on the left and right side of the tree.” This is wrong. You only care about the heights of the left and right subtrees being similar, where height means the maximum distance to a leaf node. This doesn’t require similar numbers of nodes on the different sides of the tree. For example, suppose that you require that your two child nodes can’t have heights that differ by more than a factor of two. At maximum imbalance, your left child can have 2^h descendants and your right child can have 2^(2h) = 4^h descendants. The difference between 2^h and 4^h grows extremely rapidly as h grows. If your left child has a thousand elements, your right child can have a million elements (that’s h = 10).
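Checking the arithmetic from that example: heights h and 2h are within a factor of two, but the corresponding node counts of roughly 2^h and 4^h differ enormously, since 4^h is the square of 2^h.

```python
h = 10
left_nodes = 2 ** h    # ~a thousand
right_nodes = 4 ** h   # ~a million
assert right_nodes == left_nodes ** 2
print(left_nodes, right_nodes)  # 1024 1048576
```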
Here are a few random comments about BSTs that you could meditate on and maybe achieve enlightenment:
- Almost all nodes in a balanced BST have very few descendants.
- This is kind of obvious, but nodes in a balanced BST have only log(n) ancestors. This means that modifying all the ancestors of a BST node is fast. This is the idea behind order statistic trees, and augmented BSTs more generally.
- Rebalancing after an insertion or deletion is normally cheap, because you usually only have to fix things up locally, near the node you modified. (Finicky detail: In a red-black tree, rebalancing is amortized O(1). In an AVL tree, there are weird cases where a sequence of insertions and deletions can require O(log(n))-time rebalancing operations.)
- Another neat consequence of BSTs having only log(n) ancestors is that you can quickly implement persistent BSTs—that is, BSTs where the insertion method returns a new tree. You only build new BST node objects for the nodes which were changed, which are just the ancestors of the node which you ended up messing with. So it only costs O(log(n)) to get a new copy of the tree without modifying the old one.
- My favorite unusual BST implementation is a hash-ordered treap.
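To make the order-statistic idea from the list above concrete, here's a minimal sketch (the node layout and names are mine, not from any particular library): each node stores the size of its subtree, and because a node has only O(log n) ancestors in a balanced tree, keeping these counts up to date during modifications is cheap. The payoff is finding the k-th smallest key in time proportional to the height:

```python
def size(node):
    return node.size if node is not None else 0

class SNode:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right
        # The augmentation: every node caches its subtree size.
        self.size = 1 + size(left) + size(right)

def kth_smallest(node, k):
    """Return the k-th smallest key (1-based) in O(height) time."""
    left_size = size(node.left)
    if k <= left_size:
        return kth_smallest(node.left, k)
    if k == left_size + 1:
        return node.key
    return kth_smallest(node.right, k - left_size - 1)

tree = SNode(4, SNode(2, SNode(1), SNode(3)), SNode(6, SNode(5)))
assert [kth_smallest(tree, k) for k in range(1, 7)] == [1, 2, 3, 4, 5, 6]
```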
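The persistence point can also be sketched in a few lines. This is a toy path-copying insert on an unbalanced BST (balancing omitted for brevity, and the tuple representation is mine): the function returns a new root, allocates fresh nodes only along the search path, and shares every other subtree with the old tree.

```python
def insert(node, key):
    # node is None or (key, left, right); returns a new tree,
    # copying only the O(height) nodes on the search path.
    if node is None:
        return (key, None, None)
    k, left, right = node
    if key < k:
        return (k, insert(left, key), right)
    else:
        return (k, left, insert(right, key))

t1 = insert(insert(insert(None, 2), 1), 3)
t2 = insert(t1, 0)
# t1 is untouched; t2 shares t1's entire right subtree.
assert t1 == (2, (1, None, None), (3, None, None))
assert t2 == (2, (1, (0, None, None), None), (3, None, None))
assert t2[2] is t1[2]  # structural sharing, not a copy
```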
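"Hash-ordered treap" isn't a standard textbook name, so the following is my guess at the construction, written as an illustrative sketch rather than a definitive implementation: a treap whose BST ordering key is a hash of the element (so no adversarial insertion order can make it degenerate) and whose heap priority is an independent hash, which makes the tree's shape a deterministic function of the set of elements. All names and the node layout below are invented.

```python
import hashlib
import itertools, functools

def _hash(key, salt):
    data = (salt + repr(key)).encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

def insert(node, key):
    """node is None or (order, prio, key, left, right)."""
    return _insert(node, _hash(key, "order:"), _hash(key, "prio:"), key)

def _insert(node, order, prio, key):
    if node is None:
        return (order, prio, key, None, None)
    norder, nprio, nkey, left, right = node
    if order == norder:
        return node  # already present (ignoring hash collisions)
    if order < norder:
        child = _insert(left, order, prio, key)
        if child[1] > nprio:  # heap property broken: rotate right
            corder, cprio, ckey, cl, cr = child
            return (corder, cprio, ckey, cl, (norder, nprio, nkey, cr, right))
        return (norder, nprio, nkey, child, right)
    else:
        child = _insert(right, order, prio, key)
        if child[1] > nprio:  # heap property broken: rotate left
            corder, cprio, ckey, cl, cr = child
            return (corder, cprio, ckey, (norder, nprio, nkey, left, cl), cr)
        return (norder, nprio, nkey, left, child)

def contains(node, key):
    order = _hash(key, "order:")
    while node is not None:
        if order == node[0]:
            return True
        node = node[3] if order < node[0] else node[4]
    return False

# The shape is independent of insertion order: every permutation of
# the same keys builds the identical tree.
trees = set()
for perm in itertools.permutations(["a", "b", "c", "d"]):
    trees.add(functools.reduce(insert, perm, None))
assert len(trees) == 1
assert contains(trees.pop(), "c")
```

The deterministic shape is the interesting design consequence: two parties who build the structure from the same set of elements, in any order, end up with byte-identical trees, which is handy for comparing or syncing sets.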