1. Programming models and runtimes
- Author
- Jorge G. Barbosa, Peter Van Roy, Ali Shoker, Georges Da Costa, Juan C. Díaz-Martín, Juan A. Rico-Gallego, Matthias Janetschek, Ravi Reddy Manumachu, Radu Prodan, Albert van der Linde, João Leitão, Emmanuel Jeannot, Juan L. García-Zapata, Alexey Lastovetsky, Jesus Carretero, Albert Zomaya
- Affiliations: Système d’exploitation, systèmes répartis, de l’intergiciel à l’architecture (IRIT-SEPIA), Institut de recherche en informatique de Toulouse (IRIT), Université de Toulouse (UT), Université Fédérale Toulouse Midi-Pyrénées, Université Toulouse Capitole (UT Capitole), Université Toulouse - Jean Jaurès (UT2J), Université Toulouse III - Paul Sabatier (UT3), Centre National de la Recherche Scientifique (CNRS), Institut National Polytechnique (Toulouse) (Toulouse INP), Toulouse Mind & Brain Institut (TMBI); University College Dublin (UCD); Faculdade de Engenharia da Universidade do Porto (FEUP), Universidade do Porto; Universidad de Extremadura (UEX); Leopold Franzens Universität Innsbruck (University of Innsbruck); Topology-Aware System-Scale Data Management for High-Performance Computing (TADAAM), Laboratoire Bordelais de Recherche en Informatique (LaBRI), Université de Bordeaux (UB), École Nationale Supérieure d'Électronique, Informatique et Radiocommunications de Bordeaux (ENSEIRB), Inria Bordeaux - Sud-Ouest, Institut National de Recherche en Informatique et en Automatique (Inria); Universidade Nova de Lisboa (NOVA); Alpen-Adria-Universität Klagenfurt; Université Catholique de Louvain (UCL); High-Assurance Software Laboratory (HASLab), University of Minho, Institute for Systems and Computer Engineering, Technology and Science (INESC TEC)
- Subjects
Computer science; Fault-tolerant mechanisms; Reliability (computer networking); Distributed computing; Engineering and technology; Task (project management); Software; Ultrascale computing systems; Software fault tolerance; Electrical engineering, electronic engineering, information engineering; Runtimes; Overhead (computing); [INFO] Computer Science [cs]; Resilience (network); Programmer; Distributed programming; Programming models; Checkpointing; Checkpoint restart; Computer hardware & architecture; Programming paradigm; Failure management; [INFO.INFO-DC] Computer Science [cs]/Distributed, Parallel, and Cluster Computing [cs.DC]; Software stack
- Abstract
Ultrascale computing systems (UCS) will run several million execution flows. Understanding their coherency is an overwhelming task for the programmer, as is coordinating them for the runtime. Moreover, given the scale of UCS and its impact on reliability, the current static point of view is no longer sufficient. A runtime cannot restart an application because of the failure of a single node, since statistically several nodes will fail every day. Classical management of these failures by programmers using checkpoint restart is also too limited, due to its overhead at such a scale. The article explores the programming models and runtimes required to facilitate the task of scaling and extracting performance on continuously evolving platforms, while providing resilience and fault-tolerance mechanisms to tackle the increasing probability of failures throughout the whole software stack.
- Published
- 2019