FlashGraph: Processing Billion-Node Graphs on an Array of Commodity SSDs

FlashGraph: Processing Billion-Node Graphs on an Array of Commodity SSDs on FAST 2015.

Graph analysis performs many random reads and writes, thus, these workloads are typically performed in memory. As the data sets become larger, the graph can not be stored in memory in a single node. Thus, many cloud based graph processing engine, such as the Graphx on Spark, are studied recent years. One of the major challenge to process large graph on physically distributed nodes is the large communication overhead, because the computing nodes need to exchange the vertex information in each iteration with each other.

Many partition strategy has been proposed to reduce the communication overhead. However, because all edges in a graph are connected in some way, thus, graphs cannot be clustered or partitioned effectively to localize access. Because of the significant bottleneck of network performance, recent work has turned back to processing large graphs on a single node by introducing semi-external memory Graph processing. In the semi-external memory Graph processing engine, large graphs are partitioned to multiple sub graph, so that each sub graph can be stored and processed in the main memory. Because there is no network communication, these semi-external memory Graph processing can beat some small scale cloud graph processing engine in performance. However, state-of-art solutions do not fully exploit the potential of SSDs.

FlashGraph Architecture

flashgraph

As shown in above graph, the FlashGraph provide a vertex-centric interface to let the user graph algorithms define the vertex programs. Then the vertex programs are translate to vertex tasks. The graph engine schedule the vertex tasks based on the on disk locality of graph.

The vertex-centric interface, which is also used in Pregel and PowerGraph, provide a flexible programming interface to express a variety of graph algorithms and their optimizations. The rest of FlashGraph’s programming interface is event-driven to overlap computation and I/O, and receive notifications from the graph engine and other vertices.

flashgraph-execution-model

FlashGraph splits a graph into multiple partitions and assigns a worker thread to each partition to process vertices. Each worker thread maintains a queue of active vertices within its own partitio. The vertex is divided to inactive vertex which is stored on disk, active vertex which is stored in memory and running vertex which is under processing. The asynchronous user-task I/O interface load the inactive vertex to memory based on the schedule. The running vertices interact with other vertices via message passing.

FlashGraph’s scheduler both manages the order of execution of active vertices and guarantees only a fixed number of running vertices in a thread.

Message passing is used to avoids concurrent data access to the state of other vertices. So, all the execution interface inside the graph engine is asynchronous.

Optimization in FlashGraph

Sequential I/O

FlashGraph performs sequential I/O when possible, but it do not emphasize on sequential I/Os because:

For SSDs, FlashGraph places a higher priority in reducing the number of bytes read from SSDs than in performing sequential I/O because the random (4KB) I/O throughput of SSDs today is only two or three times less than their sequential I/O.

Parallel I/O for SSDs

It deploys dedicated per-SSD I/O threads to issue I/O requests with Linux AIO to reduce locking overhead in the Linux kernel; it refactors I/Os from applications and sends them to I/O threads with message passing. Furthermore, it has a scalable, lightweight page cache that organizes pages in a hash table and places multiple pages in a hash table slot [31]. This page cache reduces locking overhead and incurs little overhead when the cache hit rate is low; it increases application-perceived performance linearly along with the cache hit rate.

Avoid Memory Copy

The task is executed in side of filesystem, to avoid memory allocation and copy:

Upon completion of a request, the associated user task executes inside the filesystem, accessing data in the page cache directly. Therefore, there is no memory allocation and copy for asynchronous I/O.

Reduce the Memory Usage by compressing

For the Graph execution engine, the value of each verticals are stored in memory, the edges are stored in SSDs. The execution engine use multiple ways to compress the data both for the in memory data representation and external-memory data representation.

Improving Cache hit ratio by locality I/O

And also the execution engine use multiple ways to improve the cache hit ratio, including merge I/Os to perform sequential I/O and use partitioning to improve the locality.