Predicting Production Job Failures for the CIPRES Science Gateway

Authors :
Naik, Pooja
Smallen, Shava
Saul, Lawrence
Publication Year :
2017
Publisher :
figshare, 2017.

Abstract

A science gateway is a web-based interface that gives researchers and the broader scientific community access to high-performance computing (HPC) systems and storage. The Cyberinfrastructure for Phylogenetic Research (CIPRES) science gateway is one of the most popular gateways, with approximately 1300 active users per month and an average of 422 new users added every month. Expansion to additional resources, coupled with the constant addition of functionality and tools, makes the science gateway more susceptible to failures. Knowledge of the probability of failure in a production environment is most useful when it is available before or at job submission time, so that the science gateway can take corrective action. In this poster, we analyze temporally sequential data from CIPRES and combine it with software and services monitoring data to create a machine learning model that predicts the outcome of a job, focusing on the most common cause of failure: system errors. At one operating point of our classifier, we are able to detect almost 85% of jobs that will fail with a false negative rate of about 5%. Our code will be deployed to production through an API and will allow for the easy addition or modification of features. In the future, the model can be extended to build a more generic automated monitoring analysis service for science gateways.

Details

Database :
OpenAIRE
Accession number :
edsair.doi.dedup.....172099ffeec6ba1ecb055c4077baed46
Full Text :
https://doi.org/10.6084/m9.figshare.4522511.v1