14 results for "He Sun"
Search Results
2. A Holistic Heterogeneity-Aware Data Placement Scheme for Hybrid Parallel I/O Systems
- Author
-
Yanlong Yin, Xian-He Sun, Shuibing He, Xiaohua Xu, Zheng Li, Yong Chen, and Jiang Zhou
- Subjects
Distributed database, Computer science, Distributed computing, Parallel I/O, Computational Theory and Mathematics, Hardware and Architecture, Server, Signal Processing, Data placement - Abstract
We present H2DP, a holistic heterogeneity-aware data placement scheme for hybrid parallel I/O systems, which consist of HDD servers and SSD servers. Most existing approaches focus on either server performance or application I/O pattern heterogeneity when placing data. H2DP considers three axes of heterogeneity: server performance, server space, and application I/O pattern. More specifically, H2DP determines optimized stripe sizes based on server performance, keeps only critical data on all hybrid servers and the remaining data on HDD servers, and dynamically migrates data among the different types of servers at runtime. This holistic heterogeneity-awareness enables H2DP to achieve high performance by alleviating server load imbalance, utilizing SSD space efficiently, and accommodating variation in application patterns. We have implemented a prototype of H2DP under MPICH2 atop OrangeFS. Extensive experimental results demonstrate that H2DP significantly improves I/O system performance compared to existing data placement schemes.
- Published
- 2020
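The heart of H2DP's first heterogeneity axis is performance-proportional striping. Below is a minimal Python sketch of that idea, assuming hypothetical server bandwidths and a fixed full-stripe size; the actual H2DP policy also accounts for server space and I/O patterns.

    # Hypothetical sketch: split one full stripe across servers in proportion
    # to measured bandwidth. All numbers are illustrative assumptions, not
    # values from the paper.
    def stripe_sizes(bandwidths_mb_s, full_stripe_kb=1024):
        total_bw = sum(bandwidths_mb_s)
        return [round(full_stripe_kb * bw / total_bw) for bw in bandwidths_mb_s]

    # Two SSD servers and two HDD servers (assumed bandwidths, MB/s).
    print(stripe_sizes([400, 400, 100, 100]))  # -> [410, 410, 102, 102]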
3. LPM: A Systematic Methodology for Concurrent Data Access Pattern Optimization from a Matching Perspective
- Author
-
Yuhang Liu and Xian-He Sun
- Subjects
Data flow diagram, Data access, Correctness, Computational Theory and Mathematics, Memory hierarchy, Hardware and Architecture, Computer science, Concurrency, Distributed computing, Signal Processing, Locality, Concurrent computing, Hierarchical control system - Abstract
As applications become increasingly data intensive, conventional computing systems become increasingly inefficient due to data access performance bottlenecks. While intensive efforts have been made to develop new memory technologies and design special-purpose machines, solutions for evaluating and utilizing recent hardware advances to address the memory-wall problem in a systematic way are lacking. In this study, we present the memory Layered Performance Matching (LPM) methodology to provide a systematic approach to data access performance optimization. LPM uniquely represents and exploits data access concurrency, in addition to data access locality, in a hierarchical memory system. The LPM methodology consists of models and algorithms and is supported by a series of analytic results establishing its correctness. The rationale of LPM is to reduce the overall data access delay by matching the data request rate and the data supply rate at each layer of a memory hierarchy, with a balanced consideration of data locality, data concurrency, and latency hiding of data flow. Extensive experiments on both physical platforms and software simulators confirm our theoretical findings and show that the LPM approach can be applied on diverse computing platforms and can effectively guide performance optimization of memory systems.
- Published
- 2019
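A back-of-the-envelope illustration of LPM's matching rule, in Python: compare the request rate each memory layer sees against the rate it can supply, and flag mismatched layers as optimization targets. Layer names and rates are invented for illustration; the paper's models are far more detailed.

    # (name, request rate into the layer, sustainable supply rate), requests/cycle.
    layers = [
        ("L1", 0.50, 0.60),
        ("L2", 0.20, 0.15),   # requests arrive faster than L2 can serve them
        ("DRAM", 0.05, 0.08),
    ]

    for name, demand, supply in layers:
        ratio = demand / supply
        verdict = "matched" if ratio <= 1.0 else "MISMATCH: add locality/concurrency here"
        print(f"{name}: demand/supply = {ratio:.2f} -> {verdict}")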
4. On Cost-Driven Collaborative Data Caching: A New Model Approach
- Author
-
Cheng-Zhong Xu, Xiaopeng Fan, Xian-He Sun, Shuibing He, and Yang Wang
- Subjects
Computer science, Distributed computing, Data modeling, Dynamic programming, Computational Theory and Mathematics, Hardware and Architecture, Server, Signal Processing, Cache, Online algorithm, Cache algorithms - Abstract
In this paper we consider a new caching model that enables data sharing for network services in a cost-effective way. The proposed caching algorithms use monetary cost and access information to control cache replacement, instead of the capacity-oriented strategies of traditional approaches. In particular, given a stream of requests to a shared data item under a homogeneous cost model, we first propose a fast off-line algorithm based on dynamic programming techniques, which generates an optimal schedule in $O(mn)$ time and space by using caching, migration, and replication to serve an $n$-length request sequence in an $m$-node network, substantially improving the previous results. Furthermore, we study the online form of this problem and present a 3-competitive online algorithm that leverages the idea of anticipatory caching. The algorithm serves an online request in constant time and uses only $O(m)$ space, making it practical. We evaluate our algorithms, together with some variants, through extensive simulation studies. Our results show that the optimal cost of the off-line algorithm varies parabolically as the ratio of caching cost to transfer cost increases, and that in most cases the online algorithm incurs less than twice the cost of its optimal off-line counterpart.
- Published
- 2019
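To make the off-line idea concrete, here is a heavily simplified Python dynamic program in the same spirit: one shared item, a single cached copy, and a per-request choice between remote transfer and migration. The single-copy restriction and all costs are illustrative assumptions; the paper's $O(mn)$ algorithm additionally handles replication.

    def offline_min_cost(requests, m, transfer=1.0, migrate=3.0):
        """requests: node ids issuing requests; the copy starts at node 0."""
        INF = float("inf")
        cost = [INF] * m           # cost[j]: best cost so far with the copy at node j
        cost[0] = 0.0
        for r in requests:
            new = [INF] * m
            for j in range(m):
                if cost[j] == INF:
                    continue
                # Serve r remotely from j; the copy stays at j.
                new[j] = min(new[j], cost[j] + (0.0 if r == j else transfer))
                # Migrate the copy to r, then serve locally.
                new[r] = min(new[r], cost[j] + (0.0 if r == j else migrate))
            cost = new
        return min(cost)

    print(offline_min_cost([2, 2, 2, 2, 1], m=3))  # -> 4.0: migrating to node 2 pays off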
5. Fault-aware runtime strategies for high-performance computing
- Author
-
Yawei Li, Zhiling Lan, Prashasta Gujrati, and Xian-He Sun
- Subjects
Event history analysis -- Usage, Fault tolerance (Computers) -- Evaluation, Fault tolerance, Performance improvement, Business, Computers, Electronics, Electronics and electrical industries - Published
- 2009
6. A parallel two-level hybrid method for tridiagonal systems and its application to fast Poisson solvers
- Author
-
Xian-He Sun and Wu Zhang
- Subjects
Computer programming -- Analysis, Parallel processing -- Analysis, Poisson processes -- Analysis, Computer programming, Parallel processing, Business, Computers, Electronics, Electronics and electrical industries - Abstract
The Parallel Two-Level Hybrid (PTH) method, developed to solve tridiagonal systems on parallel computers, is presented. Theoretical analyses and numerical experiments indicate that PTH can be over 10 times faster than existing methods on massively parallel computers.
- Published
- 2004
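For context, the standard serial kernel that parallel tridiagonal methods such as PTH apply within each processor's partition is the Thomas algorithm, sketched below in Python; PTH's two-level parallel decomposition itself is not reproduced here.

    def thomas(a, b, c, d):
        """Solve a tridiagonal system: a = sub-, b = main, c = super-diagonal, d = rhs."""
        n = len(b)
        cp, dp = [0.0] * n, [0.0] * n
        cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
        for i in range(1, n):
            denom = b[i] - a[i] * cp[i - 1]
            cp[i] = c[i] / denom if i < n - 1 else 0.0
            dp[i] = (d[i] - a[i] * dp[i - 1]) / denom
        x = [0.0] * n
        x[-1] = dp[-1]
        for i in range(n - 2, -1, -1):
            x[i] = dp[i] - cp[i] * x[i + 1]
        return x

    # -x_{i-1} + 2x_i - x_{i+1} = f_i is the classic 1-D Poisson stencil.
    print(thomas([0, -1, -1], [2, 2, 2], [-1, -1, 0], [1, 1, 1]))  # ~[1.5, 2.0, 1.5]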
7. Cost-Aware Region-Level Data Placement in Multi-Tiered Parallel I/O Systems
- Author
-
Zheng Li, Yang Wang, Xian-He Sun, Shuibing He, and Cheng-Zhong Xu
- Subjects
Computer science, Distributed computing, Data modeling, Server, File system, Distributed database, Dynamic data, Workload, Solid-state drive, Parallel I/O, Computational Theory and Mathematics, Hardware and Architecture, Signal Processing, Computer network - Abstract
Multi-tiered parallel I/O systems that combine traditional HDDs with emerging SSDs mitigate the cost burden of SSDs while benefiting from their superior I/O performance. While a multi-tiered parallel I/O system is promising for data-intensive applications in high-performance computing (HPC) domains, placing data on each tier of the system to achieve high I/O performance remains a challenge. In this paper, we propose a cost-aware region-level (CARL) data placement scheme for multi-tiered parallel I/O systems. CARL divides a large file into several small regions, and then places regions on different types of servers based on region access costs. CARL includes a static policy, S-CARL, and a dynamic policy, D-CARL. For applications whose I/O access patterns are completely known, S-CARL calculates the region costs over the entire workload duration and uses a static data placement scheme to selectively place regions on the proper servers. To adapt to applications whose access patterns are unknown in advance, D-CARL uses a dynamic data placement scheme that migrates data among different servers within each time window. We have implemented CARL under the MPI-IO library in an OrangeFS parallel file system environment. Our evaluation with representative benchmarks and an application shows that CARL is both feasible and able to improve I/O performance significantly.
- Published
- 2017
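The S-CARL policy can be pictured as a greedy, cost-ranked assignment of regions to the SSD tier. The Python sketch below assumes a toy cost model (access count times region size) and made-up numbers; the paper's region cost model is more refined.

    def place_regions(regions, ssd_capacity_mb):
        """regions: (region_id, size_mb, access_count) tuples."""
        # Rank regions by how costly they would be to serve from HDD.
        ranked = sorted(regions, key=lambda r: r[1] * r[2], reverse=True)
        placement, used = {}, 0
        for rid, size, _count in ranked:
            if used + size <= ssd_capacity_mb:
                placement[rid], used = "SSD", used + size
            else:
                placement[rid] = "HDD"
        return placement

    regions = [("r0", 64, 900), ("r1", 64, 20), ("r2", 128, 500), ("r3", 64, 700)]
    print(place_regions(regions, ssd_capacity_mb=192))
    # -> {'r2': 'SSD', 'r0': 'SSD', 'r3': 'HDD', 'r1': 'HDD'}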
8. Using MinMax-Memory Claims to Improve In-Memory Workflow Computations in the Cloud
- Author
-
Cheng-Zhong Xu, Xian-He Sun, Shuibing He, and Yang Wang
- Subjects
CPU cache, Computer science, Concurrency, Distributed computing, Cloud computing, Provisioning, Deadlock, Workflow technology, Workflow, Memory management, Computational Theory and Mathematics, Hardware and Architecture, Signal Processing, Concurrent computing, Workflow management system - Abstract
In this paper, we consider improving scientific workflows in cloud environments where data transfers between tasks are performed via provisioned in-memory caching as a service, instead of relying entirely on slower disk-based file systems. This improvement is not free, since services in the cloud are usually charged under a "pay-as-you-go" model; consequently, workflow tenants have to estimate the amount of memory they are willing to pay for. Given the intrinsic complexity of workflows, making an accurate prediction is very hard, and errors lead to either oversubscription or undersubscription, resulting in unproductive spending or performance degradation. To address this problem, we propose the concept of the minmax memory claim (MMC) to achieve cost-effective workflow computations in in-memory cloud computing environments. The minmax memory claim is defined as the minimum amount of memory required to finish the workflow without compromising its maximum concurrency. With the concept of MMC, workflow tenants can achieve the best performance via in-memory computing while minimizing cost. In this paper, we present a procedure for finding the MMCs of workflows with arbitrary graphs in general, and develop efficient optimal algorithms for some well-structured workflows in particular. To further show the value of this concept, we implement these algorithms and apply them, through a simulation study, to improve deadlock resolution in workflow-based workloads when memory resources are constrained.
- Published
- 2017
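For a strictly level-structured workflow (every task in one level finishes before the next level starts), the minmax memory claim reduces to the largest per-level memory sum, as in the Python sketch below. This level-structure restriction is a simplifying assumption for illustration; general DAGs require the algorithms developed in the paper.

    def mmc_levelled(levels):
        """levels: list of levels, each a list of per-task memory demands (GB)."""
        # Full concurrency is realized one level at a time, so the binding
        # requirement is the widest level's total demand.
        return max(sum(level) for level in levels)

    # A fan-out/fan-in workflow: 1 task, then 4 parallel tasks, then 1 task.
    print(mmc_levelled([[2], [1, 1, 2, 1], [3]]))  # -> 5 GB at the widest level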
9. Improving Performance of Parallel I/O Systems through Selective and Layout-Aware SSD Cache
- Author
-
Xian-He Sun, Shuibing He, and Yang Wang
- Subjects
Snoopy cache, Computer science, Cache coloring, Cache pollution, Parallel I/O, Smart Cache, File server, Computational Theory and Mathematics, Hardware and Architecture, Cache invalidation, Server, Signal Processing, Operating system, Page cache, Cache, Cache algorithms - Abstract
Parallel file systems (PFSs) are widely used to ease the I/O bottleneck of modern high-performance computing systems. However, PFSs do not work well for small requests, especially small random requests. Newer solid-state drives (SSDs) have excellent performance on small random data accesses, but also incur a high monetary cost. In this study, we propose SLA-Cache, a Selective and Layout-Aware Cache system that employs a small set of SSD-based file servers as a cache for conventional HDD-based file servers. SLA-Cache uses a novel scheme to identify performance-critical data and applies a selective cache admission (SCA) policy to fully utilize the SSD-based file servers. Moreover, since the data layout of the cache system also largely influences access performance, SLA-Cache applies a layout-aware cache placement scheme (LCP) to store data on the SSD-based file servers. By storing data with the layout that has the lowest access cost among three typical layout candidates, LCP further improves system performance. We have implemented SLA-Cache under the MPICH2 I/O library. Experimental results show that SLA-Cache can significantly improve I/O throughput, and is a promising approach for parallel applications.
- Published
- 2016
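A rough Python sketch of a layout-aware placement decision in the spirit of LCP: estimate the cost of serving a request under three common parallel-file-system layouts and pick the cheapest. The cost model (per-server startup plus transfer time) and every constant below are illustrative assumptions.

    def layout_cost(req_kb, servers, startup_ms=0.5, per_kb_ms=0.01):
        """Cost of one request striped across `servers` cache servers."""
        return startup_ms * servers + (req_kb / servers) * per_kb_ms

    def best_layout(req_kb, total_servers=8):
        candidates = {
            "1-DV (one server)": layout_cost(req_kb, 1),
            "2-D (server subset)": layout_cost(req_kb, total_servers // 4),
            "1-DH (all servers)": layout_cost(req_kb, total_servers),
        }
        return min(candidates.items(), key=lambda kv: kv[1])

    print(best_layout(16))    # small request: a single server wins
    print(best_layout(4096))  # large request: full striping wins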
10. Towards Exploring Data-Intensive Scientific Applications at Extreme Scales through Systems and Simulations
- Author
-
Ioan Raicu, Robert Ross, Xian-He Sun, Dries Kimpe, Dongfang Zhao, and Ning Liu
- Subjects
Computer science, Distributed computing, Concurrency, Fault tolerance, Cloud computing, Supercomputer, Bottleneck, Metadata, Computational Theory and Mathematics, Computer architecture, Hardware and Architecture, Signal Processing, Scalability, Systems design - Abstract
The state-of-the-art storage architecture of high-performance computing systems was designed decades ago, and at today's scale and level of concurrency it is showing significant limitations. Our recent work proposed a new architecture to address the I/O bottleneck of the conventional design, and the system prototype (FusionFS) demonstrated its effectiveness on up to 16K nodes, a scale on par with today's largest supercomputers. The main objective of this paper is to investigate FusionFS's scalability towards exascale. Exascale computers are predicted to emerge by 2018, comprising millions of cores and billions of threads. We built an event-driven simulator (FusionSim) according to the FusionFS architecture and validated it against FusionFS traces; the simulation results deviate from the traces by less than 4 percent. With FusionSim we simulated workloads on up to two million nodes and found nearly linear scaling of I/O performance, justifying FusionFS's viability for exascale systems. In addition to the simulation work, this paper extends the FusionFS system prototype in the following respects: (1) fault tolerance of file metadata is supported, (2) the limitations of the current system design are discussed, and (3) a more thorough performance evaluation is conducted, covering N-to-1 metadata writes, system efficiency, and additional platforms such as Amazon's cloud.
- Published
- 2016
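FusionSim is event-driven; the generic mechanism behind any such simulator is a clock plus a priority queue of timestamped events, as in this minimal Python skeleton (a generic illustration only, not FusionSim's actual design; the 1.5-unit service time is an assumption).

    import heapq

    def simulate(events):
        """events: (time, label) tuples, processed in time order."""
        clock, queue = 0.0, list(events)
        heapq.heapify(queue)
        while queue:
            clock, label = heapq.heappop(queue)
            print(f"t={clock:5.2f}  {label}")
            if label == "io_request":  # each request completes 1.5 time units later
                heapq.heappush(queue, (clock + 1.5, "io_complete"))
        return clock

    simulate([(0.0, "io_request"), (0.5, "io_request")])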
11. Fault-Aware Runtime Strategies for High-Performance Computing
- Author
-
Zhiling Lan, Yawei Li, Xian-He Sun, and P. Gujrati
- Subjects
Runtime system, Computational Theory and Mathematics, Hardware and Architecture, Computer science, Software fault tolerance, Distributed computing, Signal Processing, Runtime verification, Concurrent computing, Fault tolerance, Scheduling (computing), Fault management - Abstract
As the scale of parallel systems continues to grow, fault management of these systems is becoming a critical challenge. While existing research mainly focuses on developing or improving fault tolerance techniques, a number of key issues remain open. In this paper, we propose runtime strategies for spare node allocation and job rescheduling in response to failure prediction. These strategies, together with a failure predictor and fault tolerance techniques, constitute a runtime system called FARS (Fault-Aware Runtime System). In particular, we propose a 0-1 knapsack model and demonstrate its flexibility and effectiveness for reallocating running jobs to avoid failures. Experiments with synthetic data and real traces from production systems show that FARS has the potential to significantly improve system productivity (i.e., performance and reliability).
- Published
- 2009
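The 0-1 knapsack view maps naturally onto code: spare nodes are the capacity, and each failure-threatened job has a node requirement (weight) and an expected amount of saved work (value). A minimal Python DP follows; the job sizes and values are illustrative assumptions, not data from the paper.

    def knapsack(jobs, spare_nodes):
        """jobs: (name, nodes_needed, expected_saved_work). Classic 0-1 knapsack DP."""
        best = [(0.0, [])] * (spare_nodes + 1)  # best[c] = (value, chosen jobs)
        for name, need, value in jobs:
            for c in range(spare_nodes, need - 1, -1):
                cand = best[c - need][0] + value
                if cand > best[c][0]:
                    best[c] = (cand, best[c - need][1] + [name])
        return best[spare_nodes]

    jobs = [("jobA", 4, 12.0), ("jobB", 3, 10.0), ("jobC", 2, 7.0)]
    print(knapsack(jobs, spare_nodes=5))  # -> (17.0, ['jobB', 'jobC'])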
12. Integrated range comparison for data-parallel compilation systems
- Author
-
Thomas Fahringer, Xian-He Sun, and M. Pantano
- Subjects
Correctness, Computer science, Fortran, Parallel computing, Execution time, Computational Theory and Mathematics, Hardware and Architecture, Signal Processing, Scalability, Software system - Abstract
A major difficulty in restructuring compilation, and in parallel programming in general, is how to compare parallel performance over a range of system and problem sizes. Execution time varies with system and problem size, and an initially fast implementation may become slow when system and problem size scale up. This paper introduces the concept of range comparison. Unlike conventional execution-time comparison, in which performance is compared for a particular system and problem size, range comparison compares the performance of programs over a range of ensemble and problem sizes via scalability and performance crossing-point analysis. A novel algorithm is developed to predict the crossing point automatically. The correctness of the algorithm is proven, and a methodology is developed to integrate range comparison into restructuring compilation for data-parallel programming. A preliminary prototype of the methodology is implemented and tested under the Vienna Fortran Compilation System. Experimental results demonstrate that range comparison is feasible and effective; it is an important asset for program evaluation, restructuring compilation, and parallel programming.
- Published
- 1999
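The crossing-point idea is easy to see with two toy execution-time models: one program starts faster but scales worse, and range comparison asks where the ranking flips. Both models below are invented stand-ins for measured scalability data, not the paper's prediction algorithm.

    def t1(n): return 2.0 + 0.10 * n   # low startup overhead, poorer scaling
    def t2(n): return 8.0 + 0.02 * n   # higher startup overhead, better scaling

    def crossing_point(sizes):
        """Smallest problem size at which the faster program changes."""
        for n in sizes:
            if t1(n) > t2(n):
                return n
        return None

    print(crossing_point(range(1, 200)))  # -> 76: beyond this size, t2's program wins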
13. Performance considerations of shared virtual memory machines
- Author
-
Jianping Zhu and Xian-He Sun
- Subjects
Speedup, Computational Theory and Mathematics, Shared memory, Hardware and Architecture, Computer science, Signal Processing, Scalability, Virtual memory, Uniprocessor system, Parallel computing, Supercomputer - Abstract
Generalized speedup is defined as parallel speed over sequential speed. In this paper the generalized speedup and its relationship to other performance metrics, such as traditional speedup, efficiency, and scalability, are carefully studied. Using the newly introduced notion of asymptotic speed, we show that the difference between generalized speedup and traditional speedup lies in the definition of the efficiency of uniprocessor processing, a very important issue for shared virtual memory machines. A scientific application has been implemented on a KSR-1 parallel computer. Experimental and theoretical results show that the generalized speedup is distinct from the traditional speedup and provides a more reasonable measurement. In the study of different speedups, an interesting relation between fixed-time and memory-bounded speedup is revealed. Various causes of superlinear speedup are also presented.
- Published
- 1995
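The definition in the record above is directly computable: speed is work over time, and generalized speedup is parallel speed over sequential speed, which allows the parallel run to solve a larger problem than the uniprocessor run. The numbers below are illustrative assumptions.

    def speed(work, time):
        return work / time

    seq = speed(work=100.0, time=50.0)  # uniprocessor: 100 work units in 50 s
    par = speed(work=800.0, time=25.0)  # 16 nodes, larger problem: 800 units in 25 s
    print(par / seq)                    # generalized speedup -> 16.0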
14. Scalability of parallel algorithm-machine combinations
- Author
-
Xian-He Sun and Diane T. Rover
- Subjects
Speedup, Computer science, Parallel algorithm, Scalability testing, Parallel computing, Software metric, Visualization, Computational Theory and Mathematics, Hardware and Architecture, Signal Processing, Scalability, Instrumentation (computer programming) - Abstract
Scalability has become an important consideration in parallel algorithm and machine designs. The terms scalable and scalability are widely used in the parallel processing community, yet there is no adequate, commonly accepted definition of scalability available. The scalability of computer systems and programs is difficult to quantify, evaluate, and compare. In this paper, scalability is formally defined for algorithm-machine combinations, and a practical method is proposed to provide a quantitative measurement of it. The relation between the newly proposed scalability and other existing parallel performance metrics is studied, and a harmony between speedup and scalability is observed. Theoretical results show that a large class of algorithm-machine combinations is scalable and that scalability can be predicted through premeasured machine parameters. Two algorithms have been studied on an nCUBE 2 multicomputer and on a MasPar MP-1 computer; these case studies show how scalability can be measured, computed, and predicted. Performance instrumentation and visualization tools have also been used and developed to understand scalability-related behavior.
- Published
- 1994
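One widely cited formulation from this line of work is isospeed scalability: when the machine grows from p to p2 processors, let W2 be the work needed to keep the average per-processor speed unchanged; then psi = (p2 * W) / (p * W2), which equals 1 when work only needs to grow linearly with the processor count. The Python sketch below paraphrases that definition with invented numbers; it is not a reproduction of the paper's exact notation.

    def isospeed_scalability(p, w, p2, w2):
        """psi(p, p2) = (p2 * W) / (p * W2); 1.0 is ideal."""
        return (p2 * w) / (p * w2)

    # Work had to grow from 1000 to 2400 units (ideal would be 2000) to hold
    # the average speed when doubling 8 -> 16 processors.
    print(isospeed_scalability(p=8, w=1000.0, p2=16, w2=2400.0))  # -> ~0.83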