1. Scalable monitoring and dependable job scheduling support for multi-domain grid infrastructures
- Author
Luca Foschini, Antonio Corradi, Javier Povedano-Molina, Marcello Cinque, Flavio Frattini
- Subjects
Job scheduler, Monitoring, Computer science, Distributed computing, Fault tolerance, Networking & telecommunications, Engineering and technology, Troubleshooting, Dependability, Grid, Fair-share scheduling, Scheduling (computing), Grid computing, Scalability, Electrical engineering, electronic engineering, information engineering, Software
- Abstract
The management of Grid systems commonly lacks the information needed to identify failures that hinder the timely completion of jobs and waste computing resources. Monitoring can help, but novel approaches are needed for such large and geographically distributed systems. We propose a Grid Architecture for scalable Monitoring and Enhanced dependable job ScHeduling (GAMESH). GAMESH is a completely distributed and highly efficient management infrastructure for disseminating monitoring data and troubleshooting job execution failures in large-scale, multi-domain Grid environments. Evaluated in a real deployment and compared to other Grid management systems, GAMESH is shown to (i) provide measurements of both computing resources and task scheduling conditions at geographically dispersed sites while inducing a low overhead on the entire infrastructure, and (ii) enable failure-aware scheduling and improve overall system performance, even in the presence of failures, by coordinating local job schedulers across multiple domains.
- Published
- 2016