Before talking about building a db
We need to discuss why do we even need a db?
One answer is to make our data durable. But why database ? I can make my data durable by storing it into a file.
Is storing all the info in a file a good idea? I basically want to do few operations on my data and if those are possible I am good with file.
- Retrieve specific Info
- Add Info
Almost all other operations are combination of these two
Retrieval
- For retrieving we will have to scan the entire file and that is ok for small dataset
- But as soon as our dataset size increases , this operation is time consuming
- Time Complexity: O(n)
Add Info
- Adding info is easier in a file
- We can just keep on appending to the end of file
- Time complexity: O(1)
Although Adding info is very efficient in file based system, it is not the case that we just write in our system…we also have to go back and look at data we have written in past.
If only there was a way to improve the retrieval process.
The problem we have right now is that we have to look into the entire file. If there was anyway to reduce our search space so we can have less data to search upon.
And this is where databases comes in. Databases arrange our data in specific order and build indexes to reach to our data faster. Reducing the space that is required to search.
There are mainly two ways of arranging these indexes
- B+trees
- LSM trees and SS tables
B+ tree or B Tree
#computer-science #database #data-structure
B+tree is used by most traditional DBs , it works great with storage devices like HDDs and SSDs.
Structure of B+tree
[20]
/ \
[10] [30]
/ \ / \
[5 | 6 | 7] [10 | 12 | 17] [20 | 22 | 25] [30 | 35 | 40]
| | | |
+----=>--------+----=>---------+----=>----------+
The above shows a B+ tree.
All the data is available only in the leafnodes, in context to DBs it will be offsets pointing to disk.
The root and intermediate nodes are internal nodes and are used to quickly reach out to leaf nodes, basically reducing the search space.
The number of branches created for each node depends on branching factor in db and is decided by Db internally.
In case of DB, Nodes will have a size limit which will be dependent on multiple factors like page size and OS.
Distinction b/w B Tree and B+ tree
The major distinction b/w B Tree and B+ tree is all the values in B+tree will only be in leaf nodes, while in case of B-Tree internal nodes will also keep values.
And this significantly increases the size of index.
Also in B+trees rhe Leaf nodes will generally point to next leaf node, this is specially helpful in case of range queries.
Search and Range queries
B+tree are highly efficient for reads and performs well for range queries.
For an incoming request, it is easy to binary search among the internal nodes and arrive on the correct leaf nodes.
To find a particular value we can scan that leaf node.
And to get a range, we can follow the next pointers from the leaf nodes.
Insertion
Insertion is where it gets difficult. Consider the case where you have to insert a value
- First you need to reach the correct internal node.
- Now you create a new leaf node and start inserting in leaf node, but the internal node is already full, so you need to split it.
- Once the internal node is split, you will also have to update the parent with reference of new internal node.
- If the parent is also full, parent node would also have to be split.
- This process can keep going on.
Similar to insertion deletion also causes issues as it can lead to merging of internal nodes and this could also be propagated.
LSM tree and SS tables
The problem with b+tree based indexes is that you have to do a lot of work for writes as a write can lead to splitting of nodes and creation of another level of internal nodes.
Although reads are pretty easy.
How can we make writes faster ?
As we earlier looked the easiest way to do a write is just appending to a file. But..but that creates a problem for reads which goes to O(n).
So, how can we solve both of these problems.
We start dividing our logs into multiple files and in these files will contain keys in sorted order. We will also have unique constraint on the key in one segment file. This is called Sorted String Table
Also incoming write operations will be done in memory, in something called Memtable. This would utilise a data structure which allows reading back in sorted structure (RB or AVL tree).
Memtable contains all the key value pair, they are inserted as they come in.
Memtable acts as a buffer and once it crosses a threshold it is written to disk as a SSTable file, this would become the newest segment.
Searching for key
Searching for a key would be different than what we seen in B+ trees.
Here to search a key we first look into the memtable, then to newest sstable and so on.
Once we find the value of key we return it.
SStables include an additional sparse index to speed up this query. In this index we will keep some keys and their offset in SStable. This allows us to efficiently search for key in multiple segments, once we get the value or move to next SStable.
In case of deletion we set the key to something called tombstone marker, which indicates the key has been deleted.
Compaction And Merging
For a key we have unique constraint per SStable, so we might have repeated keys in different segments. And we would generally care about latest value, even in cases of delete we do not care about the key anymore.
If we continue the creation of segments we will sooner or later use up all the storage and performance will also degrade as we will have to combine results from numerous segments.
So what to do?
We can start merging different segments. When we are merging different segments if we keep the value of key in latest segment among the ones being merged and discard other values.
Optimisations
Compression
SS table can be compressed into mutiple chunks using different algos, to reduce the storage size and io. Compression should allow searching via keys and since there can be range queries getting a chunk of data at once allows for faster queries.
Bloom filters
One issue with this approach is searching for keys that do not exist. Following our approach if we query for data which does not exist, our db will start querying the memtable and various ss table but won’t find anything.
We can include bloom filters per SStable to check if key definitely does not exist in that particular SStable and we could reduce our search space.
Since bloom filters are memory efficient and fast we can quickly filter out keys which do not exist in our db.