Notes on Building a Database #1: B+trees Vs LSM trees and SS Tables

Before talking about building a db

We need to discuss why do we even need a db?

One answer is to make our data durable. But why database ? I can make my data durable by storing it into a file.

Is storing all the info in a file a good idea? I basically want to do few operations on my data and if those are possible I am good with file.

Almost all other operations are combination of these two

Retrieval

Add Info

Although Adding info is very efficient in file based system, it is not the case that we just write in our system…we also have to go back and look at data we have written in past.

If only there was a way to improve the retrieval process.

The problem we have right now is that we have to look into the entire file. If there was anyway to reduce our search space so we can have less data to search upon.

And this is where databases comes in. Databases arrange our data in specific order and build indexes to reach to our data faster. Reducing the space that is required to search.

There are mainly two ways of arranging these indexes

B+ tree or B Tree

#computer-science #database #data-structure

B+tree is used by most traditional DBs , it works great with storage devices like HDDs and SSDs.

Structure of B+tree

                 [20]
                /    \
           [10]              [30]
          /   \             /    \
  [5 | 6 | 7] [10 | 12 | 17] [20 | 22 | 25] [30 | 35 | 40]
    |             |              |               |
    +----=>--------+----=>---------+----=>----------+

The above shows a B+ tree.

All the data is available only in the leafnodes, in context to DBs it will be offsets pointing to disk.

The root and intermediate nodes are internal nodes and are used to quickly reach out to leaf nodes, basically reducing the search space.

The number of branches created for each node depends on branching factor in db and is decided by Db internally.

In case of DB, Nodes will have a size limit which will be dependent on multiple factors like page size and OS.

Distinction b/w B Tree and B+ tree

The major distinction b/w B Tree and B+ tree is all the values in B+tree will only be in leaf nodes, while in case of B-Tree internal nodes will also keep values.

And this significantly increases the size of index.

Also in B+trees rhe Leaf nodes will generally point to next leaf node, this is specially helpful in case of range queries.

Search and Range queries

B+tree are highly efficient for reads and performs well for range queries.

For an incoming request, it is easy to binary search among the internal nodes and arrive on the correct leaf nodes.

To find a particular value we can scan that leaf node.

And to get a range, we can follow the next pointers from the leaf nodes.

Insertion

Insertion is where it gets difficult. Consider the case where you have to insert a value

Similar to insertion deletion also causes issues as it can lead to merging of internal nodes and this could also be propagated.

LSM tree and SS tables

The problem with b+tree based indexes is that you have to do a lot of work for writes as a write can lead to splitting of nodes and creation of another level of internal nodes.

Although reads are pretty easy.

How can we make writes faster ?

As we earlier looked the easiest way to do a write is just appending to a file. But..but that creates a problem for reads which goes to O(n).

So, how can we solve both of these problems.

We start dividing our logs into multiple files and in these files will contain keys in sorted order. We will also have unique constraint on the key in one segment file. This is called Sorted String Table

Also incoming write operations will be done in memory, in something called Memtable. This would utilise a data structure which allows reading back in sorted structure (RB or AVL tree).

Memtable contains all the key value pair, they are inserted as they come in.

Memtable acts as a buffer and once it crosses a threshold it is written to disk as a SSTable file, this would become the newest segment.

Searching for key

Searching for a key would be different than what we seen in B+ trees.

Here to search a key we first look into the memtable, then to newest sstable and so on.

Once we find the value of key we return it.

SStables include an additional sparse index to speed up this query. In this index we will keep some keys and their offset in SStable. This allows us to efficiently search for key in multiple segments, once we get the value or move to next SStable.

In case of deletion we set the key to something called tombstone marker, which indicates the key has been deleted.

Compaction And Merging

For a key we have unique constraint per SStable, so we might have repeated keys in different segments. And we would generally care about latest value, even in cases of delete we do not care about the key anymore.

If we continue the creation of segments we will sooner or later use up all the storage and performance will also degrade as we will have to combine results from numerous segments.

So what to do?

We can start merging different segments. When we are merging different segments if we keep the value of key in latest segment among the ones being merged and discard other values.

Optimisations

Compression

SS table can be compressed into mutiple chunks using different algos, to reduce the storage size and io. Compression should allow searching via keys and since there can be range queries getting a chunk of data at once allows for faster queries.

Bloom filters

One issue with this approach is searching for keys that do not exist. Following our approach if we query for data which does not exist, our db will start querying the memtable and various ss table but won’t find anything.

We can include bloom filters per SStable to check if key definitely does not exist in that particular SStable and we could reduce our search space.

Since bloom filters are memory efficient and fast we can quickly filter out keys which do not exist in our db.