Dynamic Traffic Control of Staging Traffic on the Interconnect of the HPC Cluster System

Authors :
Arata Endo
Hiroki Ohtsuji
Erika Hayashi
Eiji Yoshida
Chunghan Lee
Susumu Date
Shinji Shimojo
Source :
IEEE Access, Vol 8, Pp 198518-198531 (2020)
Publication Year :
2020
Publisher :
IEEE, 2020.

Abstract

High-performance computing (HPC) cluster systems sometimes adopt a two-layered file system composed of local and global file systems to achieve both capacity and performance in storage. In such a cluster system, the input data of an application must be staged in from the global storage to the local storage, and the output data must be staged out from the local storage to the global storage. This staging operation must be performed quickly and efficiently to achieve higher job throughput, because an inefficient staging operation delays the execution of waiting job requests. In particular, in a cluster system whose oversubscribed interconnect is shared by the storage and the computing nodes, inter-node communication traffic and staging traffic collide, which may degrade the job throughput. In this research, we focus on the collision between inter-node communication traffic and staging traffic, targeting cluster systems with an oversubscribed interconnect that carries both types of traffic. Specifically, we investigate whether dynamically controlling the traffic generated by the staging operation improves the job throughput. For this investigation, we present a traffic collision avoidance method that dynamically configures a set of data paths for each type of traffic only while the staging operation is in progress. The evaluation in this article shows that the proposed method avoids traffic collisions and accelerates the staging operation by 22.0% on our cluster system. The evaluation also indicates that the overhead the proposed method imposes on the application is negligible. Furthermore, the proposed method reduces the job execution time by 8.7%.

Details

Language :
English
ISSN :
2169-3536
Volume :
8
Database :
Directory of Open Access Journals
Journal :
IEEE Access
Publication Type :
Academic Journal
Accession number :
edsdoj.048157a9514c90b0accafc08c18bef
Document Type :
article
Full Text :
https://doi.org/10.1109/ACCESS.2020.3035158