Understanding performance of distributed data-intensive applications

Authors :: Michael V. Miceli
Shantenu Jha
Christopher Miceli
Bety Rodriguez-Milla
Source :: Philosophical transactions. Series A, Mathematical, physical, and engineering sciences. 368(1926)
Publication Year :: 2010
Abstract: Grids, clouds and cloud-like infrastructures are capable of supporting a broad range of data-intensive applications. There are interesting and unique performance issues that appear as the volume of data and degree of distribution increase. New scalable data-placement and management techniques, as well as novel approaches to determine the relative placement of data and computational workload, are required. We develop and study a genome sequence matching application that is simple to control and deploy, yet serves as a prototype of a data-intensive application. The application uses a SAGA-based implementation of the All-Pairs pattern. This paper aims to understand some of the factors that influence the performance of this application and the interplay of those factors. We also demonstrate how the SAGA approach can enable data-intensive applications to be extensible and interoperable over a range of infrastructure. This capability enables us to compare and contrast two different approaches for executing distributed data-intensive applications: simple application-level data-placement heuristics versus distributed file systems.

Subjects :: Matching (statistics)
Theoretical computer science
Computer science
General Mathematics
Distributed computing
Interoperability
General Engineering
Volume (computing)
General Physics and Astronomy
Cloud computing
Grid computing
Scalability
Data-intensive computing
Heuristics

ISSN :: 1364503X
Volume :: 368
Issue :: 1926
Database :: OpenAIRE
Journal :: Philosophical transactions. Series A, Mathematical, physical, and engineering sciences
Accession number :: edsair.doi.dedup.....413b8d306262762a2e31ef663836c048
