PERP: Attacking the balance among energy, performance and recovery in storage systems

Recently, an important metric called “energy proportionality” was proposed as a guideline for energy-efficient systems (Barroso and Hölzle, 2007); it advocates that energy consumption should be proportional to system performance/utilization. However, this tradeoff metric is defined only for normal mode, in which the system functions without node failures. When a node fails, the system enters degradation mode, during which node reconstruction is initiated. Reconstruction must wake or spin up a number of disks and consumes a substantial amount of I/O bandwidth, which compromises both energy efficiency and performance. Moreover, replication-based storage systems such as the Google File System (Ghemawat et al., 2003) and the Hadoop Distributed File System (Borthakur, 2007) adopt a recovery policy that defines a deadline for recovery rather than simply recovering the data as soon as possible. This flexibility in recovery time makes it possible to reduce energy consumption while still meeting the performance and recovery requirements. This raises a natural problem: how can an energy-efficient storage system balance performance, energy, and recovery in degradation mode? We find that current energy-proportional solutions cannot answer this question accurately, because they do not consider the I/O bandwidth contention between recovery and performance. This paper presents a mathematical model named Perfect Energy, Recovery and Performance (PERP), which provides guidelines for provisioning the number of active nodes and the recovery bandwidth assigned at each time slot, subject to the performance and recovery constraints. To apply PERP in storage systems, we take data layouts into consideration and propose a node selection algorithm named the “Gradual Increase/decrease” Algorithm (GIA), which selects the active nodes based on the PERP results.
We apply PERP and GIA to popular power-proportional layouts and test their effectiveness on a 25-node in-house CASS cluster. Experimental results show that, while meeting both performance and recovery constraints, PERP helps realize 25% power savings compared with the maximum recovery policy from Sierra (Thereska et al., 2011) and 20% power savings compared with the recovery group policy from Rabbit (Amur et al., 2010).