1. Design and Evaluation of a Risk-Aware Failure Identification Scheme for Improved RAS in Erasure-Coded Data Centers.
- Author
-
Huang, Weichen, Fang, Juntao, Wan, Shenggang, Xie, Changsheng, and He, Xubin
- Subjects
- *
SERVER farms (Computer network management) , *EMULATION software , *MATHEMATICAL optimization , *IDENTIFICATION , *HEART beat - Abstract
Data reliability and availability, and serviceability (RAS) of erasure-coded data centers are highly affected by data repair induced by node failures. In a traditional failure identification scheme, all chunks share the same identification time threshold, thus losing opportunities to further improve the RAS. To solve this problem, we propose RAFI, a novel risk-aware failure identification scheme. In RAFI, chunk failures in stripes experiencing different numbers of failed chunks are identified using different time thresholds. For those chunks in a high-risk stripe, a shorter identification time is adopted, thus improving the overall data reliability and availability. For those chunks in a low-risk stripe, a longer identification time is adopted, thus reducing the repair network traffic. Therefore, RAS can be improved simultaneously. We also propose three optimization techniques to reduce the additional overhead that RAFI imposes on management nodes and to ensure that RAFI can work properly under large-scale clusters. We use simulation, emulation, and prototyping implementation to evaluate RAFI from multiple aspects. Simulation and prototype results prove the effectiveness and correctness of RAFI, and the performance improvement of the optimization techniques on RAFI is demonstrated by running the emulator. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF