Distributed cache strategy based on LT codes under Spark platform.
- Author
- Shang, Jing, Zhang, Yifei, Wang, Jibin, Wu, Zhihui, and Xiao, Zhiwen
- Abstract
During the execution of tasks in a distributed computing system, the master needs to assign new workers to replace lagging workers, which typically arise from network problems, hardware failures, software bugs, or high cluster load. If the portions of tasks already completed by lagging workers are not cached in time, new workers must recompute them, leading to further delays. In the Spark platform, checkpointing preserves the intermediate results computed by workers in resilient distributed datasets (RDDs) by storing them in the Hadoop distributed file system (HDFS). However, storing and retrieving data from HDFS remains time-consuming. Traditional distributed cache strategies exploit neither coding techniques nor the differing lag probabilities of workers to reduce the overhead of caching data. This paper introduces a distributed cache strategy based on Luby transform (LT) codes to enhance the system's effectiveness and robustness. An optimization method is designed to guide the selection of RDDs for caching within prescribed cache limits. Additionally, a novel algorithm is proposed for determining which partitions participate in encoded caching, taking into account worker failure probabilities and the overhead of encoding and decoding. In the event of worker failure, intermediate data can be swiftly recovered through decoding, and standby workers can rapidly take over stragglers' tasks, markedly reducing overall task latency. Theoretical analysis and experimental results demonstrate that the LT code-based caching strategy is more efficient and achieves lower latency than recomputation, HDFS-based, and weight-based caching methods, and that it incurs far less memory overhead than Redis. [ABSTRACT FROM AUTHOR]
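The LT-code recovery idea described in the abstract can be illustrated with a toy encoder/decoder. This is a minimal sketch, not the paper's algorithm: single-integer "partitions" stand in for RDD partition data, a uniform degree distribution stands in for the robust soliton distribution used by practical LT codes, and all names are illustrative.

```python
import random

def lt_encode(blocks, n_symbols, seed=42):
    """Produce n_symbols LT-coded symbols from k source blocks (ints).

    Each symbol is (index set, XOR of the selected blocks). The index
    set is kept as metadata so the decoder can peel.
    """
    k = len(blocks)
    rng = random.Random(seed)
    symbols = []
    for _ in range(n_symbols):
        degree = rng.randint(1, k)  # toy uniform degree distribution
        idxs = frozenset(rng.sample(range(k), degree))
        val = 0
        for i in idxs:
            val ^= blocks[i]
        symbols.append((idxs, val))
    return symbols

def lt_decode(symbols, k):
    """Peeling (belief-propagation) decoder.

    Repeatedly treat any degree-1 symbol as a recovered block, then
    XOR it out of every remaining symbol. Returns None for any block
    that cannot be recovered from the symbols given.
    """
    pending = [(set(idxs), val) for idxs, val in symbols]
    recovered = {}
    changed = True
    while changed and len(recovered) < k:
        changed = False
        # collect every current degree-1 symbol
        for idxs, val in pending:
            if len(idxs) == 1:
                i = next(iter(idxs))
                if i not in recovered:
                    recovered[i] = val
                    changed = True
        # peel recovered blocks out of all pending symbols
        peeled = []
        for idxs, val in pending:
            for i in idxs.intersection(recovered):
                val ^= recovered[i]
            idxs.difference_update(recovered)
            if idxs:
                peeled.append((idxs, val))
        pending = peeled
    return [recovered.get(i) for i in range(k)]

# Usage: encode 8 toy partitions redundantly, then recover them.
blocks = [0x5A, 0x13, 0xC4, 0x7E, 0x02, 0x99, 0x31, 0xEF]
symbols = lt_encode(blocks, n_symbols=40)
decoded = lt_decode(symbols, k=len(blocks))
# any block the peeling decoder recovers matches its source block
assert all(d is None or d == b for d, b in zip(decoded, blocks))
```

In the paper's setting, the encoded symbols (not the raw partitions) would be what gets cached across workers, so a failed worker's intermediate data can be rebuilt by decoding rather than by recomputation.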
- Published
- 2024