BF-Tree: Approximate Tree Indexing

BF-Tree: Approximate Tree Indexing on VLDB 2014.

A B-tree is a self-balancing search tree whose nodes can hold many keys and children. Locating an item in a tree stored on HDD costs one small random read per level, so a low-fanout tree requires a huge number of them. A B+ tree is a specialized B-tree that keeps all data in the leaves and typically has a large number of children per node; the higher fanout makes the tree shallower, so items can be located with fewer random reads. Tree structures, especially B+ trees, are widely used because they are optimized for the common storage technology – HDD – and they offer efficient indexing and access for both sorted and unsorted data (e.g., in heap files). Most modern, popular databases use B+ trees for storage (Oracle, PostgreSQL, MySQL, Microsoft SQL Server, …).

Indexing for SSDs also adopts B+ trees and focuses on addressing slow writes on flash and the limited device lifetime. This paper proposes to employ probabilistic data structures (Bloom filters) to trade accuracy for size and produce smaller, yet powerful, tree indexes – BF-Trees.

[Figure: BF-Tree structure]

Each leaf node corresponds to a page range (min_pid – max_pid) and to a key range (min_key – max_key), and it consists of a number, S, of Bloom filters (BFs), each of which stores the key membership of one page in the range (or of a group of consecutive pages).
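
As a rough illustration of the leaf structure, here is a minimal Python sketch, assuming one Bloom filter per page; the class names and the filter parameters (m, k) are illustrative, not from the paper:

```python
import hashlib


class BloomFilter:
    """A minimal Bloom filter: k hash functions over an m-bit array."""

    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = 0

    def _positions(self, key):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def may_contain(self, key):
        return all(self.bits >> p & 1 for p in self._positions(key))


class BFTreeLeaf:
    """One BF-Tree leaf: a page range plus one Bloom filter per page."""

    def __init__(self, min_pid, filters):
        self.min_pid = min_pid   # first page id covered by this leaf
        self.filters = filters   # filters[i] covers page min_pid + i

    def candidate_pages(self, key):
        # Every page whose filter says "maybe" must be read and verified;
        # a false positive costs an extra page read, never a wrong result.
        return [self.min_pid + i
                for i, bf in enumerate(self.filters)
                if bf.may_contain(key)]
```

A lookup descends the inner nodes by key range exactly as in a B+ tree, then probes the leaf's filters and reads only the candidate pages.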

Compared to the B+ tree, which consists of large leaves, the BF-Tree is a more compact data structure.

[Figure: B+-tree structure]

Implicit Clustering

One of the most important features of these data sets is implicit clustering. Related data tend to cluster on the time dimension, and can therefore be co-located on storage.

[Figure: implicit clustering]

This provides evidence that correlated data can be co-located.

SDF: Software-Defined Flash for Web-Scale Internet Storage System

SDF: Software-Defined Flash for Web-Scale Internet Storage System on ASPLOS 2014.

Currently only 40% or less of the raw bandwidth of the flash memory in the SSDs is delivered by the storage system to the applications.

This is because the I/O requests from upper-level software, such as file systems, databases, and key-value stores, are usually not optimized to generate SSD-friendly access patterns. This observation is similar to what FlashGraph has observed:

the random (4KB) I/O throughput of SSDs today is only two or three times less than their sequential I/O.

To overcome these problems, SDF has redesigned the interface of SSDs:

  1. The read unit is limited to 4KB, but the write unit must be a multiple of the flash erase block size, and write addresses must be block-aligned (see the sketch after this list).
  2. The internal parallelism is exposed to the applications: each channel has its own independent controller and is exposed as a separate device.
  3. The erase operation is exposed as a new command to the device, so no over-provisioned space is needed and no background GC process has to run.
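
A minimal sketch of the write contract from point 1; the 2MB erase block size is illustrative (the real value is device-specific):

```python
ERASE_BLOCK = 2 * 1024 * 1024   # illustrative erase block size
READ_UNIT = 4 * 1024            # reads are served in 4KB units

def check_sdf_write(offset: int, length: int) -> None:
    """Reject writes that violate the SDF write contract."""
    if offset % ERASE_BLOCK:
        raise ValueError("write offset must be erase-block aligned")
    if length == 0 or length % ERASE_BLOCK:
        raise ValueError("write length must be a multiple of the erase block")
```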

The I/O stack is much simpler in this design. However, the application design becomes more complicated.

The Dirty-Block Index

The Dirty-Block Index on ISCA 2014.

This paper is about the management of dirty blocks in on-chip caches. Existing cache organizations mix dirty and clean blocks in a single structure and use a bit in each tag entry to distinguish them. The problems with this design show up when 1) determining whether a block is dirty, and 2) identifying the list of spatially co-located dirty blocks. In this paper, the authors remove the dirty bit from the tag entries and organize the dirty-block information differently, in a separate structure called the Dirty-Block Index (DBI).

[Figure: cache lookup with DBI]
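
A minimal sketch of the DBI bookkeeping, assuming one bit-vector of dirty bits per DRAM row; the dict here stands in for what would be a small cache-like hardware structure, and BLOCKS_PER_ROW is illustrative:

```python
class DirtyBlockIndex:
    """Tracks dirtiness separately from the tag store: one dirty
    bit-vector per DRAM row, keyed by the row address."""

    BLOCKS_PER_ROW = 64   # illustrative: cache blocks per DRAM row

    def __init__(self):
        self.rows = {}    # row address -> dirty bit-vector

    def set_dirty(self, row, block):
        self.rows[row] = self.rows.get(row, 0) | (1 << block)

    def clear_dirty(self, row, block):
        if row in self.rows:
            self.rows[row] &= ~(1 << block)
            if self.rows[row] == 0:
                del self.rows[row]

    def is_dirty(self, row, block):
        # Answers "is this block dirty?" without touching the tag store.
        return bool(self.rows.get(row, 0) >> block & 1)

    def dirty_blocks_in_row(self, row):
        bits = self.rows.get(row, 0)
        return [b for b in range(self.BLOCKS_PER_ROW) if bits >> b & 1]
```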

This design is very similar to Read-Write Disparity on HPCA 2014, which made the discovery that dirty pages are rarely read in some applications. Onur Mutlu is on both author lists.

The DBI enables the following optimizations:

Efficient DRAM-Aware Aggressive Writeback

In DRAM… Requests to the row buffer (row buffer hits) are much faster and more efficient than other requests (row buffer misses)… filling the write buffer with blocks from the same DRAM row will lead to a faster and more efficient writeback phase.

This is very similar to the motivation for batch cache updates in SSDs.
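
Reusing the DirtyBlockIndex sketch above, the aggressive writeback might look like this: when one block of a DRAM row is written back, all other dirty blocks of the same row are queued as well, since they will hit in the open row buffer:

```python
def aggressive_writeback(dbi, row, write_buffer):
    """Drain every dirty block of one DRAM row into the write buffer,
    turning the writeback phase into a run of row-buffer hits."""
    for block in dbi.dirty_blocks_in_row(row):
        write_buffer.append((row, block))
        dbi.clear_dirty(row, block)
```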

Efficient Cache Lookup Bypass

If an application’s miss rate exceeds a threshold (0.95, in our experiments), all accesses of the application (except those that map to the sampled sets) are predicted to miss in the cache in the next epoch.
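
A sketch of that epoch-based bypass decision; only the 0.95 threshold comes from the paper, the rest of the structure is illustrative:

```python
class BypassPredictor:
    """Per-application, epoch-based cache lookup bypass."""

    THRESHOLD = 0.95   # miss-rate threshold from the paper

    def __init__(self):
        self.hits = self.misses = 0
        self.bypass = False

    def record(self, hit: bool):
        self.hits += hit
        self.misses += not hit

    def end_epoch(self):
        total = self.hits + self.misses
        miss_rate = self.misses / total if total else 0.0
        # Next epoch: skip the cache lookup entirely (except for accesses
        # to the sampled sets) when the miss rate exceeds the threshold.
        self.bypass = miss_rate > self.THRESHOLD
        self.hits = self.misses = 0
```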

Reducing ECC Overhead

… only dirty blocks require a strong ECC; clean blocks only require error detection. This is because, if an error is detected in a clean block, the block can be retrieved from the next level of the memory hierarchy

C-Miner: Mining Block Correlations in Storage Systems

C-Miner: Mining Block Correlations in Storage Systems on FAST 2004.

Storage software accesses data following semantic patterns. These semantic links are referred to as block correlations.

For example, in a database that uses index trees such as B-trees to speed up query performance, a tree node is correlated to its parent node and its ancestor nodes. Similarly, in a file server backend storage system, a file block is correlated to its inode block.

Unfortunately, information about block correlations is unavailable at the storage system level, because a storage system exports only a block interface.

[Figure: block correlation]

Such semantic links are also pervasive in graphs, as discussed in FlashGraph. Because the semantic links in a graph are so dominant, the I/O patterns of graph analysis applications consist mainly of random I/Os. The random I/O throughput of SSDs today is only two or three times less than their sequential throughput. This might be because the cache hit ratio of the mapping-table lookup is very low for random I/O, so each I/O causes one mapping-page read and one data-page read.

This paper proposes a practical black-box approach to mining block correlations in storage systems.

In the first step, preprocessing uses non-overlapped cutting to split the access stream into access sequences of equal size. Overlapped cutting would capture more patterns, but most frequent subsequences remain frequent even after non-overlapped cutting.
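
Non-overlapped cutting is just fixed-size, disjoint windowing; a sketch:

```python
def cut_stream(accesses, window):
    """Non-overlapped cutting: split the access stream into disjoint,
    equal-sized access sequences (the last one may be shorter)."""
    return [accesses[i:i + window]
            for i in range(0, len(accesses), window)]

# cut_stream(['a', 'b', 'c', 'd', 'e'], 2) -> [['a','b'], ['c','d'], ['e']]
```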

The second step is mining the frequent sequences. The mining algorithm is an iterative algorithm [1]. It relies on the assumption that if a sequence is frequent, then all of its subsequences are frequent. So the algorithm first searches for frequent subsequences of length 1, then of length 2, and so on.

[Figure: frequent sequence mining]
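
A much-simplified, Apriori-style sketch of that level-wise search (C-Miner itself builds on a closed-sequence miner and adds gap constraints, which this toy version omits):

```python
def frequent_subsequences(sequences, min_support, max_len=3):
    """Level-wise mining: a sequence can only be frequent if all of its
    subsequences are frequent, so level k+1 extends only level-k survivors."""

    def is_subseq(cand, seq):
        it = iter(seq)
        return all(x in it for x in cand)   # consume seq left to right

    def support(cand):
        return sum(is_subseq(cand, s) for s in sequences)

    alphabet = sorted({b for s in sequences for b in s})
    frequent, level = {}, [(b,) for b in alphabet]
    for _ in range(max_len):
        scored = [(c, support(c)) for c in level]
        survivors = [c for c, s in scored if s >= min_support]
        frequent.update((c, s) for c, s in scored if s >= min_support)
        if not survivors:
            break
        # Grow length-(k+1) candidates from frequent length-k sequences.
        level = [c + (b,) for c in survivors for b in alphabet]
    return frequent
```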

The third step generates rules from the frequent sequences, e.g., “if block a is accessed, then block c will be accessed”. Each rule carries a confidence that estimates how likely the rule is to hold.
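
Given the supports mined above, rule generation and confidence can be sketched as follows (the min_confidence value is illustrative):

```python
def generate_rules(frequent, min_confidence=0.8):
    """Rule head -> tail, e.g. ('a', 'b') -> 'c', with
    confidence = support(head + tail) / support(head)."""
    rules = []
    for seq, sup in frequent.items():
        if len(seq) < 2:
            continue
        head, tail = seq[:-1], seq[-1]
        conf = sup / frequent[head]   # head is frequent by the Apriori property
        if conf >= min_confidence:
            rules.append((head, tail, conf))
    return rules
```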

The rule set generated for real-world traces is very small. For a 3-day Cello-92 trace, 228k rules are generated in 7800s (about 2.2 hours), with a total size of 3.1MB. The rules for a given trace are very stable, so new rules do not need to be generated constantly.


  1. An iterative algorithm executes steps in iterations. It aims to find successive approximation in sequence to reach a solution. They are most commonly used in linear programs where large numbers of variables are involved. — Rajmeet Ghai
    The process of attempting for solving a problem which finds successive approximations for solution, starting from an initial guess. The result of repeated calculations is a sequence of approximate values for the quantities of interest. — Vidya Sagar
    From: http://www.careerride.com/Data-structure-iterative-algorithm.aspx 

Processing Graphs on SSDs

FlashGraph: Processing Billion-Node Graphs on an Array of Commodity SSDs on FAST 2015.

This paper performs semi-external-memory graph analysis. “Graph analysis performs many random reads and writes, thus, these workloads are typically performed in memory”. As data sets grow larger, the graph can no longer be stored in the memory of a single node. All edges in a graph are connected in some way, so graphs cannot be clustered or partitioned effectively to localize access. Because the network communication required between graph partitions emerges as the bottleneck, recent work has turned back to processing graphs from disk drives on a single machine. These analysis engines use partitioning so that a small subset of a large graph can be processed at a time.

FlashGraph performs sequential I/O when possible, but it does not emphasize sequential I/O:

For SSDs, FlashGraph places a higher priority in reducing the number of bytes read from SSDs than in performing sequential I/O because the random (4KB) I/O throughput of SSDs today is only two or three times less than their sequential I/O.

The block driver is optimized for SSDs:

It deploys dedicated per-SSD I/O threads to issue I/O requests with Linux AIO to reduce locking overhead in the Linux kernel; it refactors I/Os from applications and sends them to I/O threads with message passing. Furthermore, it has a scalable, lightweight page cache that organizes pages in a hash table and places multiple pages in a hash table slot [31]. This page cache reduces locking overhead and incurs little overhead when the cache hit rate is low; it increases application-perceived performance linearly along with the cache hit rate.
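
The quoted page cache might be sketched as follows; the per-slot capacity, FIFO eviction, and per-slot locking are illustrative simplifications of the cited design:

```python
import threading


class PageCache:
    """Hash-table page cache: several pages per slot, one lock per slot."""

    PAGES_PER_SLOT = 8   # illustrative

    def __init__(self, nslots):
        self.slots = [[] for _ in range(nslots)]   # slot: [(pid, data), ...]
        self.locks = [threading.Lock() for _ in range(nslots)]

    def lookup(self, pid):
        i = hash(pid) % len(self.slots)
        with self.locks[i]:   # locking is per slot, never global
            for p, data in self.slots[i]:
                if p == pid:
                    return data
        return None

    def insert(self, pid, data):
        i = hash(pid) % len(self.slots)
        with self.locks[i]:
            slot = self.slots[i]
            if len(slot) >= self.PAGES_PER_SLOT:
                slot.pop(0)   # evict the oldest page in this slot (FIFO)
            slot.append((pid, data))
```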

The task executes inside the filesystem, avoiding memory allocation and copying:

Upon completion of a request, the associated user task executes inside the filesystem, accessing data in the page cache directly. Therefore, there is no memory allocation and copy for asynchronous I/O.

In the graph execution engine, the value of each vertex is stored in memory, while the edges are stored on SSDs. The execution engine uses multiple techniques to compress both the in-memory and the external-memory data representations.
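
To make the semi-external-memory model concrete, here is a minimal BFS sketch; read_neighbors(v) is a hypothetical callback standing in for FlashGraph's asynchronous adjacency-list reads, not its actual API:

```python
from collections import deque


def semi_external_bfs(num_vertices, read_neighbors, root):
    """BFS with per-vertex state in memory and edge lists read from SSD."""
    level = [-1] * num_vertices    # vertex state: small enough to fit in RAM
    level[root] = 0
    frontier = deque([root])
    while frontier:
        v = frontier.popleft()
        for u in read_neighbors(v):   # edges stream in from the SSDs
            if level[u] == -1:
                level[u] = level[v] + 1
                frontier.append(u)
    return level
```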

The execution engine also uses multiple techniques to improve the cache hit ratio, including merging I/Os to perform sequential I/O and partitioning to improve locality.