Layout-Aware Data Scheduling for Bulk End-to-End Data Transfers

LADS: Optimizing Data Transfers Using Layout-Aware Data Scheduling, published at FAST 2015 by Youngjae Kim et al.

Youngjae Kim is the author of FlashSim, DFTL, HybridStore, NVMalloc and Active Flash.

This paper discusses bulk data movement between large parallel file systems (PFS) attached to HPC systems (e.g., OLCF’s Titan and ALCF’s Mira) and scientific instruments (e.g., ORNL’s SNS and ANL’s APS). The authors observed that data transfer tools that are unaware of the underlying file system layout contribute to the problem of I/O contention in the PFS. This research aims to lower the storage occupancy rate of stalled I/Os, in order to minimize the impact of storage congestion on overall I/O performance, using three techniques: layout-aware I/O scheduling, OST congestion-aware I/O scheduling, and object caching on SSDs.

HPC systems and scientific instruments use large parallel file systems (PFS), connected by a high-performance network, as their data storage systems. These systems often need bulk data movement between geographically distributed organizations.

The HPC systems run simulations that have intense computational phases, followed by inter-process communication, and periodically by I/O to checkpoint state or save an output file. A simulation’s startup is dominated by a read phase to retrieve the input files as well as the application binary and libraries.

The instruments, on the other hand, do not have a read phase and have strictly write workloads that capture measurements. These measurements are triggered by a periodic event, such as an accelerated particle hitting a target, which generates various energies and sub-particles. The instrument’s detectors capture these events, and the data must be moved off the device before the next event arrives.

File/data transfer on a PFS launches multiple threads to read/write concurrently. The following figure gives an example in which two threads are launched to read two files concurrently. Traditionally, the scheduling scheme is unaware of the data layout, i.e., which server/storage target a given slice of data resides on. Multiple threads may therefore request data from the same server/storage at the same time, causing contention on that server and delaying all of the threads.
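The core of layout awareness is to organize work by storage target rather than by file. A minimal sketch in Python, where `layout` is a hypothetical callable standing in for a real layout query (e.g., asking Lustre which OST holds each object); the per-OST queue structure, not the API, is the point:

```python
from collections import defaultdict

def build_ost_queues(files, layout):
    """Group each file's data objects by the OST that stores them.

    files  : list of (filename, number_of_objects) pairs
    layout : hypothetical callable (filename, object_index) -> OST id,
             standing in for a real file-system layout query
    """
    queues = defaultdict(list)
    for fname, num_objects in files:
        for idx in range(num_objects):
            ost = layout(fname, idx)
            queues[ost].append((fname, idx))
    return queues

# Illustrative example: two files striped round-robin over 4 OSTs.
files = [("file_a", 8), ("file_b", 8)]
layout = lambda fname, idx: idx % 4
queues = build_ost_queues(files, layout)
# Each OST queue now holds objects from both files, so worker threads
# can pull work from different OSTs instead of colliding on one server.
```

With per-OST queues, a pool of threads can each drain a different queue, which is what keeps two threads from hammering the same storage target at once.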


We also use multiple threads to read multiple slices of a single file concurrently. As shown in the following figure, these threads can likewise be delayed when contention occurs.


Given a set of files, we determine where all of the objects reside in the case of reading at the source, or determine which servers to stripe the objects over when writing at the sink. We then schedule the accesses based on the location of the objects, not based on the file. … If another thread is accessing an OST, the other threads skip that queue and move on to the next. … We then extend layout-aware scheduling to be congestion-aware by implementing a heuristic algorithm to detect and avoid congested OSTs. … If the average read time during a window (W) is greater than the pre-set threshold value (T), the algorithm marks the server as congested and tells the threads to skip that server (M) times.
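The congestion heuristic above can be sketched in a few lines. This is an illustrative Python rendering of the W/T/M rule, not the paper's implementation; the window size, threshold, and skip count are tunables and the values below are placeholders:

```python
class CongestionTracker:
    """Heuristic congestion detector: if the average read time over the
    last W reads from an OST exceeds threshold T, mark the OST congested
    and have threads skip it for the next M visits."""

    def __init__(self, window=8, threshold=0.05, skips=4):
        self.window, self.threshold, self.skips = window, threshold, skips
        self.read_times = {}   # ost -> recent read durations (seconds)
        self.skip_left = {}    # ost -> remaining skips while congested

    def record_read(self, ost, duration):
        times = self.read_times.setdefault(ost, [])
        times.append(duration)
        if len(times) > self.window:
            times.pop(0)                      # keep a sliding window of W
        if len(times) == self.window and sum(times) / self.window > self.threshold:
            self.skip_left[ost] = self.skips  # congested: skip it M times

    def should_skip(self, ost):
        left = self.skip_left.get(ost, 0)
        if left > 0:
            self.skip_left[ost] = left - 1
            return True
        return False
```

A worker thread would call `should_skip(ost)` before dequeuing from that OST's queue, moving on to the next queue when it returns `True`.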

The local storage is faster than the network. Sometimes, “the source’s I/O threads will stall because they have no buffers in which to read”.

To mitigate this, we investigate using a fast NVM device to extend the buffer space available for reading at the source.
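The idea can be sketched as a buffer pool that spills to the fast device instead of blocking. This is a simplified illustration, assuming a pool of fixed-size RAM buffers and a directory on the NVM/SSD mount (here a temp directory stands in for the SSD); it is not the paper's actual caching code:

```python
import os
import tempfile

class ExtendedBufferPool:
    """Sketch of extending a fixed RAM buffer pool with an NVM/SSD cache.
    When no RAM buffer is free, a read object is written to a file on the
    fast device so the reader thread can keep draining the OST instead
    of stalling."""

    def __init__(self, ram_buffers, ssd_dir):
        self.free_ram = ram_buffers   # number of free RAM buffers
        self.ssd_dir = ssd_dir        # mount point of the fast NVM device
        self.spilled = 0

    def stage(self, object_id, data):
        if self.free_ram > 0:
            self.free_ram -= 1        # hold the object in RAM
            return ("ram", data)
        # RAM exhausted: spill to the SSD instead of blocking the thread
        path = os.path.join(self.ssd_dir, f"obj_{object_id}")
        with open(path, "wb") as f:
            f.write(data)
        self.spilled += 1
        return ("ssd", path)

# Illustrative use: with only one RAM buffer, the second object spills.
pool = ExtendedBufferPool(ram_buffers=1, ssd_dir=tempfile.mkdtemp())
first = pool.stage(0, b"object-0")   # lands in RAM
second = pool.stage(1, b"object-1")  # spills to the SSD path
```

The spilled objects would later be drained over the network (and the SSD files reclaimed) once transmit buffers free up, letting the source keep reading at storage speed rather than network speed.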