Jiangling Yin, Jun Wang, Jian Zhou, Tyler Lukasiewicz, Dan Huang and Junyao Zhang. IPDPS 2015.
In this paper, we study parallel data access on distributed file systems, e.g., the Hadoop file system. Our experiments show that parallel data read requests often access data remotely and in an imbalanced fashion, which results in serious disk access and data transfer contention on certain cluster/storage nodes. We conduct a complete analysis of how the remote and imbalanced read patterns occur and how they are affected by the size of the cluster. We then propose a novel method to Optimize Parallel Data Access on Distributed File Systems, referred to as Opass. Opass models the data read requests issued by parallel applications to cluster nodes as a graph data structure in which edge weights encode the demands of data locality and load capacity. To achieve the maximum degree of data locality and balanced access, we propose new matching-based algorithms that match processes to data based on the configuration of this graph. Our proposed method can benefit parallel data-intensive analysis under different parallel data access strategies. Experiments are conducted on PRObE's Marmot 128-node cluster testbed, and the results from both benchmarks and well-known parallel applications show the performance benefits and scalability of Opass.
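The idea of matching processes to data on a locality graph can be illustrated with a small sketch. This is not the algorithm from the paper. It is a simplified stand-in that assumes unit edge weights, a fixed per-process quota, and a plain augmenting-path bipartite matching. The function name `assign_chunks` and its parameters are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Set


def assign_chunks(chunk_replicas: Dict[str, Set[str]],
                  process_node: Dict[str, str],
                  quota: int) -> Dict[str, str]:
    """Map each chunk to a process, preferring node-local reads.

    chunk_replicas: chunk id -> set of nodes holding a replica (assumed input)
    process_node:   process id -> node the process runs on (assumed input)
    quota:          maximum chunks matched locally per process (load capacity)
    """
    # Bipartite graph edges: a chunk connects to every process that is
    # co-located with one of its replicas.
    local = {
        c: [p for p, n in process_node.items() if n in nodes]
        for c, nodes in chunk_replicas.items()
    }
    assigned: Dict[str, List[str]] = {p: [] for p in process_node}

    def try_assign(chunk: str, visited: Set[str]) -> bool:
        # Augmenting-path search: place the chunk locally, displacing
        # an already placed chunk if that chunk can move elsewhere.
        for p in local[chunk]:
            if p in visited:
                continue
            visited.add(p)
            if len(assigned[p]) < quota:
                assigned[p].append(chunk)
                return True
            for other in list(assigned[p]):
                if try_assign(other, visited):
                    assigned[p].remove(other)
                    assigned[p].append(chunk)
                    return True
        return False

    # Chunks that cannot be matched locally fall back to a remote read
    # on whichever process currently has the lightest load.
    leftovers = [c for c in chunk_replicas if not try_assign(c, set())]
    for c in leftovers:
        p = min(assigned, key=lambda q: len(assigned[q]))
        assigned[p].append(c)

    return {c: p for p, chunks in assigned.items() for c in chunks}
```

For example, with chunks `c1` on nodes `{n1, n2}`, `c2` and `c3` on `{n1}`, `c4` on `{n2}`, one process per node, and `quota=2`, the search moves `c1` from the first process to the second so that all four reads stay local and each process reads two chunks. A weighted formulation, as the abstract describes, would replace the unit edges with weights for locality and load capacity.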