Application-driven Fault-Tolerance for High Performance Distributed Computing

Abstract

As the age of Exascale draws closer and large-scale distributed applications continue to grow, so does the failure rate, and with it the need for advanced resilience techniques to handle failures. Over the past few years, resilience has evolved from an open question into a clear requirement: the occurrence of failures is no longer in question, and the focus is instead on how frequently such disruptive events strike during the execution of applications at scale. Solutions that transparently manage faults at the system level exist, but their scalability is limited and their overhead in terms of performance and resource utilization remains high, even at low failure frequencies. Empowering developers to deal with failures at the application level instead brings more opportunities to reduce the resilience overhead, but it needs holistic support from all layers: hardware, software, and the parallel programming paradigm. This tutorial highlights application-driven solutions to survive faults and provides a basic understanding of their expected costs at scale. The presented solutions cover two complementary approaches: (1) application-defined checkpoint-restart (as demonstrated through the VeloC runtime); and (2) user-level failure mitigation (as demonstrated through the ULFM runtime).

Topic area: Programming Models & Systems Software

Keywords: application-driven fault tolerance, high performance computing, user-level failure mitigation, checkpoint-restart

Agenda

Introduction
- Why resilience? What strategies exist to enable resilience?
- What makes resilience particularly difficult to enable for distributed, tightly coupled applications?
- Broad principles: backward recovery (coordinated/uncoordinated checkpoint-restart), forward recovery, application-specific forward recovery
- Trade-offs for system-level vs. application-level techniques
Checkpoint-Restart
- Introduction to VeloC: general considerations
- Features of VeloC
  - multi-level checkpoint-restart
  - masked complexity of interaction with heterogeneous storage hierarchies
  - flexibility: configurable resilience strategy, synchronous vs. asynchronous mode of operation
  - extensibility: modular design enables custom modules (compression, filtering)
  - memory-based vs. file-based checkpointing API
- Hands-on session
  - installation, API walkthrough
  - example application to be modified by students to integrate with VeloC
  - configure, deploy and run application with VeloC
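
To give a flavor of what application-defined checkpoint-restart means before the hands-on session, the following is a minimal sketch in plain C. It deliberately does not use the VeloC API: the names `app_state`, `save_checkpoint`, `load_checkpoint` and `run`, as well as the `fail_at` parameter used to simulate a failure, are hypothetical and chosen only for illustration. It assumes a POSIX system, where `rename` replaces an existing file atomically.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical application state: everything needed to resume the loop. */
typedef struct {
    int    next_iter;  /* first iteration that has not been applied yet */
    double sum;        /* running result of the computation */
} app_state;

/* Write the state to a temporary file, then rename it over the previous
 * checkpoint, so a crash in the middle of a write never damages the last
 * good copy (assumes POSIX rename semantics). Returns 0 on success. */
int save_checkpoint(const char *path, const app_state *s)
{
    char tmp[512];
    int len = snprintf(tmp, sizeof tmp, "%s.tmp", path);
    if (len < 0 || (size_t)len >= sizeof tmp) return -1;

    FILE *f = fopen(tmp, "wb");
    if (!f) return -1;
    size_t n = fwrite(s, sizeof *s, 1, f);
    if (fclose(f) != 0 || n != 1) { remove(tmp); return -1; }
    return rename(tmp, path);
}

/* Returns 0 and fills *s if a complete checkpoint exists, -1 otherwise. */
int load_checkpoint(const char *path, app_state *s)
{
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    size_t n = fread(s, sizeof *s, 1, f);
    fclose(f);
    return n == 1 ? 0 : -1;
}

/* Sum the integers 1..total_iters, checkpointing every `interval`
 * iterations. If fail_at >= 0, the run stops just before applying that
 * iteration, which simulates a process failure.
 * Returns 0 when finished (result stored in *result), 1 when interrupted
 * by the simulated failure, and -1 if a checkpoint could not be written. */
int run(const char *path, int total_iters, int interval, int fail_at,
        double *result)
{
    assert(interval > 0);

    app_state s;
    if (load_checkpoint(path, &s) != 0) {
        /* No usable checkpoint: start from scratch. */
        s.next_iter = 0;
        s.sum = 0.0;
    }

    for (int i = s.next_iter; i < total_iters; i++) {
        if (i == fail_at) return 1;          /* simulated failure */
        s.sum += (double)(i + 1);
        s.next_iter = i + 1;
        if (s.next_iter % interval == 0 && save_checkpoint(path, &s) != 0)
            return -1;
    }
    *result = s.sum;
    return 0;
}
```

Calling `run` once with a non-negative `fail_at` and then again with `fail_at` set to -1 on the same checkpoint path resumes from the last saved iteration instead of iteration 0. The application, not the system, decides what goes into `app_state`, which is the main source of savings compared with system-level checkpointing.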
User-Level Failure Mitigation (ULFM)
- Introduction to ULFM
- Extensions to MPI to enable:
  - failure notification
  - error propagation
  - error recovery
- Software infrastructure:
  - scalable failure detection
  - scalable revocation
  - scalable agreement
- Hands-on session
  - Installing and using the Docker environment
  - Examples illustrating the MPI extensions

Target audience

Anyone interested in understanding the reliability challenges that arise as computing infrastructures grow in scale, and any practitioner interested in acquiring a basic understanding of some of the potential solutions that empower application developers to overcome process and node failures.

Prerequisite knowledge

Participants are expected to bring their own laptop, running Linux or Mac OS X. Some prior knowledge of MPI and the C programming language is necessary. The content is structured to be roughly 30% beginner, 40% intermediate, and 30% advanced level. We estimate an audience of 30 participants.

Material samples

The tutorial will use an updated slide deck of a similar tutorial presented by one of the authors available here:

Presenters

George Bosilca is a Research Assistant Professor and Adjunct Assistant Professor at the Innovative Computing Laboratory at the University of Tennessee, Knoxville. His research interests revolve around providing support for parallel applications to maximize their efficiency, scalability and heterogeneity at any scale and in any setting. Dr. Bosilca works on programming paradigms for parallel applications, and on designing parallel programming paradigms that provide scalable and portable constructs for dealing with heterogeneity and resilience.

Bogdan Nicolae is a Computer Scientist at Argonne National Laboratory, USA. In the past, he held appointments at Huawei Research, Germany and IBM Research, Ireland. He specializes in scalable storage, data management and fault tolerance for large-scale distributed systems, with a focus on high performance architectures and cloud computing. He holds a PhD from the University of Rennes 1, France and a Dipl. Eng. degree from Politehnica University of Bucharest, Romania.