Application-driven Fault-Tolerance for High Performance Distributed Computing
Abstract
As the age of Exascale draws closer and the size of large-scale distributed applications continues to increase, so does the failure rate, and with it the need for advanced resilience techniques to handle failures. Over the last years, resilience has evolved from an open question into a clear requirement: the occurrence of failures is no longer questioned, and the focus is instead on how frequently such disruptive events occur during the execution of applications at scale. Solutions that transparently manage faults at the system level exist, but their scalability remains limited and their overhead in terms of performance and resource utilization remains high, even at low failure frequencies. Empowering developers to deal with failures at the application level instead brings more opportunities to reduce the resilience overhead, although doing so needs holistic support from all layers: hardware, software, and the parallel programming paradigm. This tutorial highlights application-driven solutions to survive faults and provides a basic understanding of their expected costs at scale. The presented solutions cover two complementary approaches: (1) application-defined checkpoint-restart (as demonstrated through the VeloC runtime); and (2) user-level failure mitigation (as demonstrated through the ULFM runtime).
Topic area: Programming Models & Systems Software
Keywords: application-driven fault tolerance, high performance computing, user-level failure mitigation, checkpoint-restart
Agenda
- Introduction
  - Why resilience? What strategies exist to enable resilience?
  - What makes resilience particularly difficult to enable for distributed, tightly coupled applications?
  - Broad principles: backward recovery (coordinated/uncoordinated checkpoint-restart), forward recovery, application-specific forward recovery
  - Trade-offs for system-level vs. application-level techniques
- Checkpoint-Restart
  - Introduction to VeloC: general considerations
  - Features of VeloC
    - multi-level checkpoint-restart
    - masked complexity of interaction with heterogeneous storage hierarchies
    - flexibility: configurable resilience strategy, synchronous vs. asynchronous mode of operation
    - extensibility: modular design enables custom modules (compression, filtering)
    - memory-based vs. file-based checkpointing API
  - Hands-on session
    - installation, API walkthrough
    - example application to be modified by students to integrate with VeloC
    - configure, deploy and run application with VeloC
- User-Level Failure Mitigation (ULFM)
  - Introduction to ULFM
  - Extensions to MPI to enable:
    - failure notification
    - error propagation
    - error recovery
  - Software infrastructure:
    - scalable failure detection
    - scalable revocation
    - scalable agreement
  - Hands-on session
    - Installing and using the Docker environment
    - Examples illustrating the MPI extensions
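To give a flavor of the application-defined checkpoint-restart pattern listed in the agenda, the following is a minimal single-process sketch in plain C. It does not use the VeloC API or MPI; the names `app_state`, `save_checkpoint`, `load_checkpoint`, and `run`, as well as the `stop_at` parameter used to simulate a failure, are hypothetical and chosen only for illustration. It also assumes POSIX `rename` semantics, where replacing an existing file is atomic.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical application state: an iteration counter plus a small array. */
#define N 8

typedef struct {
    int iteration;
    double data[N];
} app_state;

/* Write the state to a temporary file, then rename it into place, so that a
 * crash during the write leaves the previous checkpoint intact
 * (assumes POSIX rename semantics). Returns 0 on success, -1 on error. */
static int save_checkpoint(const char *path, const app_state *s) {
    char tmp[512];
    snprintf(tmp, sizeof tmp, "%s.tmp", path);
    FILE *f = fopen(tmp, "wb");
    if (!f) return -1;
    size_t written = fwrite(s, sizeof *s, 1, f);
    if (fclose(f) != 0 || written != 1) {
        remove(tmp);
        return -1;
    }
    return rename(tmp, path) == 0 ? 0 : -1;
}

/* Returns 0 and fills *s if a complete checkpoint exists, -1 otherwise. */
static int load_checkpoint(const char *path, app_state *s) {
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    app_state tmp;
    size_t got = fread(&tmp, sizeof tmp, 1, f);
    fclose(f);
    if (got != 1) return -1;
    *s = tmp;
    return 0;
}

/* One step of a toy computation. */
static void compute_step(app_state *s) {
    for (int i = 0; i < N; i++) s->data[i] += (double)(i + 1);
    s->iteration++;
}

/* Restore from the last checkpoint if there is one, otherwise start from
 * scratch; then iterate, checkpointing every `interval` steps.
 * If stop_at >= 0, return early at that iteration to simulate a failure.
 * Returns the iteration reached, or -1 if a checkpoint could not be written. */
static int run(const char *path, int total_iters, int interval,
               int stop_at, app_state *s) {
    assert(interval > 0);
    if (load_checkpoint(path, s) != 0) {
        app_state fresh = {0};
        *s = fresh;
    }
    while (s->iteration < total_iters) {
        if (s->iteration == stop_at) return s->iteration;
        compute_step(s);
        if (s->iteration % interval == 0 && save_checkpoint(path, s) != 0)
            return -1;
    }
    return s->iteration;
}
```

Calling `run` a second time with the same checkpoint path resumes from the last saved iteration rather than from zero, so only the work done since that checkpoint is repeated. In a real distributed application each process would protect its own state, and a runtime would handle the storage hierarchy and coordination that this sketch leaves out.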
Target audience
Any individual with an interest in understanding the challenges related to reliability when dealing with the increasing scale of computer infrastructures. Any practitioner interested in acquiring a basic understanding of some of the potential solutions that empower application developers to overcome process and node failures.
Prerequisite knowledge
Participants are expected to bring their own laptop, running Linux or Mac OS X. Some prior knowledge of MPI and the C programming language is necessary. The content is structured to cover 30% beginner, 40% intermediate, and 30% advanced material. We estimate an audience of 30 participants.
Material samples
The tutorial will use an updated slide deck of a similar tutorial presented by one of the authors available here:
Presenters
George Bosilca is a Research Assistant Professor and Adjunct Assistant Professor at the Innovative Computing Laboratory at the University of Tennessee, Knoxville. His research interests revolve around providing support for parallel applications to maximize their efficiency, scalability and heterogeneity at any scale and in any setting. Dr. Bosilca works on programming paradigms for parallel applications, and on designing parallel programming paradigms that provide scalable and portable constructs for dealing with heterogeneity and resilience.
Bogdan Nicolae is a Computer Scientist at Argonne National Laboratory, USA. In the past, he held appointments at Huawei Research, Germany and IBM Research, Ireland. He specializes in scalable storage, data management and fault tolerance for large-scale distributed systems, with a focus on high performance architectures and cloud computing. He holds a PhD from University of Rennes 1, France and a Dipl. Eng. degree from Politehnica University Bucharest, Romania.