[ASBD 2016] ApproxSSD: Fast Data Sampling on SSD Arrays (ISCA 2016 Workshop)

With the explosive growth in data volume and the increasing demand for
real-time analysis, running analytical frameworks on a subset of the input
data has become increasingly common. Such data sampling techniques compute
over a combination of sub-datasets and deliver results with acceptable
"error bars" at interactive speed. Data sampling is often performed at the
application level by selecting data randomly, without any knowledge of the
lower-level data placement. However, the I/O performance of today's widely
deployed primary storage, the Solid State Disk (SSD), is highly dependent on
the data access pattern, and random workloads result in severe performance
degradation for SSDs. In this paper, we propose ApproxSSD, a framework that
leverages the tolerance of data selection in many applications to perform
data-layout aware sampling on the SSD array. Aiming to minimize read latency,
ApproxSSD not only uses data-layout aware sampling to balance workloads on
the SSD array, but also utilizes delay reflection to avoid occasional
contentions. We have developed a prototype system for ApproxSSD in Scala.
Evaluation results show that our prototype system can achieve up to a 2.7x
speedup compared to Spark while maintaining high output accuracy.
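
To illustrate the idea of data-layout aware sampling, the following is a minimal sketch and not the ApproxSSD implementation. It assumes a hypothetical mapping from block IDs to SSD devices (here, simple striping) and picks sample blocks round-robin across devices so that no single SSD receives a disproportionate share of the reads. The function name `layout_aware_sample` and the striping layout are illustrative assumptions; delay reflection is not modeled.

```python
import random
from collections import Counter, defaultdict


def layout_aware_sample(block_to_device, k, seed=None):
    """Pick k distinct blocks while spreading reads evenly across devices.

    block_to_device: dict mapping block ID -> device ID (assumed layout).
    Per-device read counts differ by at most one as long as every device
    still has unused blocks when its turn comes.
    """
    rng = random.Random(seed)

    # Group blocks by the device that stores them, then shuffle each group
    # so the choice within a device is still random.
    by_device = defaultdict(list)
    for block, device in block_to_device.items():
        by_device[device].append(block)
    for blocks in by_device.values():
        rng.shuffle(blocks)

    # Visit devices in a random order, taking one unused block per visit.
    devices = sorted(by_device)
    rng.shuffle(devices)

    chosen = []
    while len(chosen) < k:
        progressed = False
        for device in devices:
            if len(chosen) == k:
                break
            if by_device[device]:
                chosen.append(by_device[device].pop())
                progressed = True
        if not progressed:
            raise ValueError("k exceeds the number of available blocks")
    return chosen


# Hypothetical layout: 64 blocks striped over 4 SSDs.
layout = {block: block % 4 for block in range(64)}
sample = layout_aware_sample(layout, 10, seed=1)
loads = Counter(layout[block] for block in sample)
```

In this sketch, uniform random selection over all blocks could by chance place most of the 10 reads on one device, whereas the round-robin selection keeps the per-device load within one read of each other. The trade-off is that the sample is no longer a uniform draw over all subsets of blocks, which is acceptable only for applications that tolerate flexibility in which data is selected.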