Notes on Building a Database #2: Exploring B+ tree

In this post we will explore B+tree data structure and how it is utilised by databases.

Why do we need indexes?

As we discussed in previous post we need indexes to efficiently read and write data. Lets look at how a B+ tree is structured

B+ tree structure

A B+tree has root nodes, internal nodes and leaf nodes.

Root nodes and internal nodes only contain keys and pointers to the children.

And only leaf nodes will contain the values. The values can be actual value or pointers to disk.

We typically align the node size with the OS Page size (often 4KB). This ensures that when the database requests a node, the disk fetches exactly one page, minimizing wasted I/O operations.

B+tree is a balanced tree and all the leaf nodes are at the same level. And each leaf node has a pointer to next leaf node. This makes it great for Sequential/ range based scan.

                 [20]
                   /  -------------
                /                 \
           [10]                      [30]
          /  \                    /       \ 
        /     \                /           \
  [5 | 6 | 7] [10 | 12 | 17] [20 | 22 | 25] [30 | 35 | 40]
    |             |              |               |
    +----=>--------+----=>---------+----=>----------+

Here we have root node, with key 20 and internal nodes 10, 30.

Internal Node 10, connects to two leaf nodes. One leaf node has all values less than 10 and other has all values greater than or eual to 10.

Similarly, for the other internal node too.

All the keys are sorted, so any child to right of node will have key >= to current node.

Note: In implementations each internal node may not have children on both sides.

B+tree over B tree

Compared to B tree, A B+ tree will always have values in the leaf nodes.

B tree will have values in internal nodes also, this leads to

If our index size is big that memory can’t fully hold it, we will have to offload parts of it to disk and then this introduces significant complexity. If we are querying something which is not loaded to memory we have to first move it to memory and then do searching.

In contrast just holding the keys as in B+tree, will considerably reduce the amount of memory needed.

B+ tree also fan out more, these are shorter trees with a lot of branches. The number of branches is determined internally and is called the branching order.

FeatureB-TreeB+ Tree
Data StorageKeys & Values in all nodesValues only in Leaf nodes
Tree HeightHigher (pointers take up space)Lower (more keys per node)
Range ScansSlow (requires tree traversal)Fast (linked list at leaves)

B+ tree operation

Get Query

Range Query

Insert Query

Delete Query

Node management

Here is a simple demonstration how a Split and merge happen in B+tree.

B+ tree Visualisation: https://www.cs.usfca.edu/~galles/visualization/BPlusTree.html

Splitting Nodes

As we have seen splitting of nodes is caused due to hitting the size limit of a node.

Once node has been split, lets say we will have two new nodes created left and right.

Left node replaces the original node. The link from parent node to original node now points to left node.

The right node has no parent and the first key in right node is promoted as key to parent node.

In some cases, this promotion to parent node causes split of parent node and so on, this is called cascading split.

Cascading splits may cause even the root node to split and addition of new root node, causing increase in height of tree.

Merging nodes

Sometimes deletion of keys causes, node to have size less than a minimum threshold. It is not good to have nodes which do not have many items.

Here it makes sense to merge nodes which have underflowed with their sibling.

When a two nodes are merged, the key to right node is removed from parent.

Similar to insert, sometimes these merges might cascade vertically and even in some cases might lead to decrease in height of the tree.

Conclusion

As we can see we have to do a lot of things when we insert. That’s why B+ tree based DBs do not perform well for write heavy applications.

B+ tree based DBs shine in read-heavy applications.