@rkenmi - B-Trees vs. LSM Trees

# B-Trees

Modern databases are typically represented as B-Trees or LSM Trees (Log structured merge trees).

B-trees are "tried and true" data structures that are popular in database usage, most notably SQL databases. With a B-Tree indexing structure, data is written onto the disk in fixed size page segments. These page segments are often about 4 KB in size, and have key value pairs sorted by the key.

A single B-Tree node is like an array with references to a page range. The max. number of references in the array are known as the branching factor. Each page range is another B-Tree node with references to other page ranges. Eventually, at the leaf level, you'll come to find the singular pages.

This idea is similar to pointers in low level programming languages, except that these page references are stored on disk as opposed to on memory.

When B-Tree inserts and deletes happen, it can often split a node into two subtrees in order to satisfy the branching factor. This is typically a big operation that can be dangerous if, say, the database crashes for whatever reason in the middle of the process.

To alleviate the scenario in which the database crashes, B-Tree implementations also write a write-ahead log (WAL) that records every single atomic database transaction, to keep track of the history. This WAL is used to restore the state of the B-Tree in the case that the B-Tree becomes corrupt.

# LSM Trees

LSM (log structured merge) Trees are a popular and trending data structure for use in modern relational and non-relational databases such as Bitcask, MongoDB and SQLite4.

A simple log structured storage works by having an in-memory hash table (or hash index) that keeps track of keys and values. The values would be a byte offset reference inside an append-only log file (that is on disk, not in-memory), which contains the actual value string.

For example, suppose you have a db1.txt log file with key/value pairs, and you have the key foo set to bar. 5 minutes later, you decide to update foo to bur. This transaction actually gets appended to the end of the db1.txt file, instead of modifying the line where foo is set to bar in the file.

This append-only design leads to better write speeds, due to sequential writes, which are generally faster than random writes.

While a hash index is fine for basic usage, performance can be optimized by sorting the keys. In order to keep the keys sorted in memory, we can use a data structure such as a Red-Black tree or AVL tree. This is often called a memtable.

## SSTables

While memtables are very fast, they have a size limitation since we typically have much less RAM than disk space. To optimize this problem, we create log files onto the disk, with keys that are usually kept in sorted order. These log files are known as SSTables, or Sorted String Tables. By having the keys sorted, this allows us to make hefty performance improvements.

In a basic LSM tree implementation, data is set and queried using this memtable. When the memtable reaches a certain threshold size, SSTable segment files are created onto the disk to reduce the burden of overwhelming the memory capacity.

Of course, only having a single SSTable segment won't be future-proof, since it will just keep growing and growing until disk space runs out. To counter this problem, SSTables are split up into smaller segments, and periodically, these segments go through the process of compaction and merging.

What compaction means is that duplicate keys are removed. In an append-only log, what matters is the date in which a transaction occurs. If foo used to be bar, but then got changed to bur, then we no longer need that foo=bar transaction inside our log files. Thus, compaction will preserve the most recent keys, and discard older keys that happen to be duplicates. There is a whole lot more that goes on behind the scenes, but this is the basic gist of it.

What merging means is simply combining all of the smaller segmented log files together, to reduce disk space and prevent overflow.

Compaction and merging may happen hand-in-hand or altogether, depending on the database implementation.

Like B-Trees, LSM Trees can also use a log file similar to WAL to read a history of transactions, in the case that a database crash occurs.

# Deciding Factor

B-Trees and LSM Trees are quite distinct from each other in design. B-Trees typically modify entries in-place. LSM Trees on the other hand append entries and discard stale entries from time to time. B-Trees only have to worry about 1 unique key (primary key) per index, where as LSM Trees will potentially have duplicate keys (before compaction and merging).

If reads are a concern, then it may be worthwhile to look into B-Trees instead of LSM Trees. In a nutshell, if we do a simple database query with LSM Trees, we'll first look at the memtable for the key. If it doesn't exist there, then we look at the most recent SSTable, and if not there, then we look at the 2nd most recent SSTable, and onwards. This means that if the key to be queried doesn't exist at all, then LSM Trees can be quite slow.

Otherwise if writes are a concern, LSM Trees are more attractive due to sequential writes. In general, sequential write is a lot faster than random writes, especially on magnetic hard disks. Do note though, that the maximum write throughput can be more unpredictable than B-Trees due to the periodic compaction and merging going on in the background.