Design and Implementation of a Fast Persistent Key-Value Store

KVell [pdf] avoids storing indexes on disk in order to maximize the performance of the KV store.

There are two major index schemes for Key-Value stores:

  • LSM-Tree: good for write-dominated workloads (RocksDB, Cassandra)
  • B-Tree: good for read-dominated workloads (MongoDB, WiredTiger)

Both schemes become CPU-bound as the raw performance of NVMe SSDs increases rapidly. The CPU bottleneck lies in the fact that both designs keep keys sorted on disk.

  • In an LSM-Tree, items are first appended to a log and then merged into the tree through compaction; compaction is both CPU- and I/O-intensive (see the sketch after this list).
  • B-Trees and their variants suffer from synchronization overheads.
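
To see where the CPU time goes, here is a minimal sketch of the core of a compaction step: merging two sorted runs. It is not the actual code of RocksDB or Cassandra; the entry layout and the duplicate-resolution rule are simplified assumptions. The point is that every entry gets compared and rewritten, so the work grows with the amount of data the tree keeps reshuffling.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical on-disk entry: a fixed-size key plus the offset of its value. */
struct entry { char key[8]; unsigned long value_off; };

/* Merge two sorted runs into `out`. Every entry is compared and copied,
 * which is why compaction is CPU- and I/O-intensive: the same data is
 * read, compared, and written again each time it moves down the tree. */
static size_t merge_runs(const struct entry *a, size_t na,
                         const struct entry *b, size_t nb, struct entry *out)
{
    size_t i = 0, j = 0, n = 0;
    while (i < na && j < nb) {
        int c = memcmp(a[i].key, b[j].key, sizeof a[i].key);
        if (c < 0)      out[n++] = a[i++];
        else if (c > 0) out[n++] = b[j++];
        else { out[n++] = b[j++]; i++; }  /* assume run b is newer: it wins */
    }
    while (i < na) out[n++] = a[i++];
    while (j < nb) out[n++] = b[j++];
    return n;
}

int main(void)
{
    struct entry a[] = { {"aaa", 0}, {"ccc", 1} };
    struct entry b[] = { {"bbb", 2}, {"ccc", 3} };
    struct entry out[4];
    printf("merged %zu entries\n", merge_runs(a, 2, b, 2, out));
    return 0;
}
```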

Workload: YCSB A (50% write / 50% read), uniform key distribution, 1KB KV-item size.

Design principles

  1. Share nothing.
  2. Do not sort on disk, but keep indexes in memory.
  3. Aim for fewer syscalls, not for sequential I/O.
  4. No commit log.

Indexes are stored in memory only: roughly 1.7GB of RAM is enough to index 100M items (about 17 bytes per item). Keys never need to be kept ordered on disk, so there is no background reordering or compaction to cause tail-latency spikes.

Worker threads are independent and each maintains its own index, so no synchronization is required.
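
As a rough illustration of principles 1 and 2: each worker could own a private in-memory index that maps a key (here, its hash) to the item's location on disk, and keys are routed to workers by hashing, so workers never touch each other's state. This is only a sketch under simplified assumptions; the paper's per-worker index is an ordered structure (so scans work), and the hash table, sizing constant, and field names below are made up for illustration.

```c
#include <stdint.h>
#include <stdio.h>

#define SLOTS_PER_WORKER (1u << 20)   /* hypothetical sizing; a real store grows this */

/* On-disk location of an item: which slab file and which slot inside it. */
struct loc { uint32_t file_id; uint64_t slot; };

/* One entry of a worker-private index, kept entirely in RAM. */
struct idx_entry { uint64_t key_hash; struct loc where; uint8_t used; };

/* Each worker owns its own index and its own files; workers share
 * nothing, so no locks are needed (design principle 1). */
struct worker { struct idx_entry index[SLOTS_PER_WORKER]; };

/* Route a key to a worker: a simple modulo over a 64-bit key hash. */
static unsigned pick_worker(uint64_t key_hash, unsigned nworkers)
{
    return (unsigned)(key_hash % nworkers);
}

/* Insert or update an index entry; linear probing on collisions. */
static void index_put(struct worker *w, uint64_t key_hash, struct loc where)
{
    size_t i = key_hash % SLOTS_PER_WORKER;
    while (w->index[i].used && w->index[i].key_hash != key_hash)
        i = (i + 1) % SLOTS_PER_WORKER;
    w->index[i] = (struct idx_entry){ key_hash, where, 1 };
}

/* Look up where an item lives on disk; returns 0 if the key is unknown. */
static int index_get(const struct worker *w, uint64_t key_hash, struct loc *out)
{
    size_t i = key_hash % SLOTS_PER_WORKER;
    while (w->index[i].used) {
        if (w->index[i].key_hash == key_hash) { *out = w->index[i].where; return 1; }
        i = (i + 1) % SLOTS_PER_WORKER;
    }
    return 0;
}

int main(void)
{
    static struct worker workers[4];        /* indexes live only in RAM */
    uint64_t h = 0x1234abcdULL;             /* hash of some key (made up) */
    struct worker *w = &workers[pick_worker(h, 4)];

    index_put(w, h, (struct loc){ .file_id = 0, .slot = 42 });

    struct loc found;
    if (index_get(w, h, &found))
        printf("item is in file %u, slot %llu\n",
               found.file_id, (unsigned long long)found.slot);
    return 0;
}
```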

I/O API of Linux

KVell uses Linux's asynchronous I/O (AIO) interface. There are two AIO implementations on Linux:

  • libaio: a thin user-space wrapper around the kernel's native AIO system calls (io_setup, io_submit, io_getevents)
  • POSIX AIO: emulated entirely in user space (with threads), without any kernel support

Among the available interfaces, AIO is the most efficient, ahead of both mmap and synchronous direct I/O.
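
A minimal sketch of the pattern (not KVell's actual code): open a file with O_DIRECT, batch many read requests into a single io_submit call, and reap all completions with one io_getevents call, so a whole batch of I/O costs only a couple of syscalls. The file name, batch size, and block size below are assumptions for illustration; build with gcc -D_GNU_SOURCE aio_batch.c -laio.

```c
#define _GNU_SOURCE            /* for O_DIRECT */
#include <libaio.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BATCH 64               /* requests submitted with a single syscall */
#define BLOCK 4096             /* O_DIRECT needs aligned buffers and offsets */

int main(void)
{
    int fd = open("data.slab", O_RDONLY | O_DIRECT);   /* file name is made up */
    if (fd < 0) { perror("open"); return 1; }

    io_context_t ctx = 0;
    if (io_setup(BATCH, &ctx) < 0) { fprintf(stderr, "io_setup failed\n"); return 1; }

    struct iocb cbs[BATCH], *cbps[BATCH];
    void *bufs[BATCH];
    for (int i = 0; i < BATCH; i++) {
        if (posix_memalign(&bufs[i], BLOCK, BLOCK)) return 1;
        /* Each request reads one 4KB block; offsets are arbitrary here. */
        io_prep_pread(&cbs[i], fd, bufs[i], BLOCK, (long long)i * BLOCK);
        cbps[i] = &cbs[i];
    }

    /* One syscall submits the whole batch (design principle 3). */
    if (io_submit(ctx, BATCH, cbps) < 0) { fprintf(stderr, "io_submit failed\n"); return 1; }

    /* One syscall reaps the completions; in a real worker loop the thread
     * would keep serving requests instead of blocking here. */
    struct io_event events[BATCH];
    int done = io_getevents(ctx, BATCH, BATCH, events, NULL);
    printf("%d reads completed\n", done);

    io_destroy(ctx);
    close(fd);
    return 0;
}
```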

Performance of Different I/O Interfaces

mmap is the least efficient!

Overall, KVell is about 2x faster on read-dominated workloads and 5x faster on write-dominated workloads.