The Log-Structured Merge-Tree (LSM-Tree)

The Log-Structured Merge-Tree (LSM-Tree)

This is an old paper, but increasingly used in a wide range of Storage Systems. The main idea of this paper is using a in memory tree to consolidate the write/update operations on the on disk tree structure. Thus, a sequence of tree write/update operations turns out to be a merge operation between the in memory tree and the on disk tree.


As shown in the above figure, $C_0$ tree works as $C_1$ read/write cache. When a trigger occurs, such as the size of $C_0$ exceeds the limit, the $C_0$ then be merged to $C_1$. The merge operation is a batch update process for $C_1$, thus has better I/O efficiency.


At the heart of the LSM algorithm is a rolling merge process:

The rolling merge acts in a series of merge steps. A read of a multi-page block containing leaf nodes of the C_1 tree makes a range of entries in C 1 buffer resident. Each merge step then reads a disk page sized leaf node of the C_1 tree buffered in this block, merges entries from the leaf node with entries taken from the leaf level of the C 0 tree, thus decreasing the size of C_0 , and creates a newly merged leaf node of the C_1 tree.

The buffered multi-page block containing old C 1 tree nodes prior to merge is called the emptying block, and new leaf nodes are written to a different buffered multi-page block called the filling block. When this filling block has been packed full with newly merged leaf nodes of C_1 , the block is written to a new free area on disk. The new multi-page block containing merged results is pictured in Figure 2.2 as lying on the right of the former nodes. Subsequent merge steps bring together increasing index value segments of the C_0 and C_1 components until the maximum values are reached and the rolling merge starts again from the smallest values.

The merge process is performed based on the collocated data of $C_1$ so the storage system can achieve better I/O throughput.

What LSM improves?

LSM is designed to improve the write/update performance of on disk B-tree. Because of the read-before-write problem, updating of B-tree structure causes write amplification. Consolidating of update operations can mitigate the problem.

The benchmark shows LSM-tree has better throughput in write intensive workloads, while perform worse than B-tree in read intensive workloads.