
Greg Bronevetsky
Modern high performance systems have grown in both size and complexity to a point where their mean time to failure is on the order of hours or days. Checkpointing is a general-purpose fault tolerance technique where the state of the application is periodically saved to reliable media. When the application or the system fails, the application is restarted from a recently stored computational state. In the case of parallel applications prior work on checkpointing has involved modifying the underlying implementation of the parallel software or hardware to enable appropriate state-saving and coordination. While effective, this approach is non-portable, requiring all vendors of parallel communication libraries to individually implement their own checkpointing functionality. The primary contribution of this project is to develop mechanisms that checkpoint popular message-passing and shared memory APIs (MPI and OpenMP) at the level of the API, thus ensuring that a single system could checkpoint any implementation of MPI and OpenMP. In addition to these improvements in portability, we developed several novel checkpoint coordination protocols that improved the performance and scalability of parallel checkpointing in general.
Developed a novel checkpoint coordination protocol for message passing applications and implemented it in the context of the portable MPI checkpointer [PPoPP2003] that supports the entire MPI1.1 specification[ICS2003]. The protocol is essentially a combination between the Chandy-Lamport distributed snapshot protocol and message logging, where different processes may checkpoint at different points in time and the protocol records the appropriate messages and non-deterministic event outcomes to ensure that these individual snapshots can be combined into a full global checkpoint. Experimental results on Windows/PentiumIII/cLAN and Tru64/Alpha/Quadrics demontrated the scalability, portability and performance of this approach [SC2004].
Designed a novel protocol for checkpointing shared memory applications/APIs, regardless of the underlying memory model[ASPLOS2004]. The protocol is based on all threads performing two successive barrier operations and recording their local and shared states between the two barriers. Since this insertion of barriers can create deadlocks in combination with the application's regular synchronization operations, all threads automatically release all their resources before starting a checkpoint and reacquire them before completing the checkpoint. The protocol ensures that no thread acquires a resource unless this would be legal in a checkpoint-free execution. The protocol's correctness was proven in [Cornell07]. The protocol was extended to incorporate the entire OpenMP 1.0 specification, excluding nested parallelism[ICS2006] and was shown to exhibit good performance on three different platforms: Linux/Itanium, Tru64/Alpha and Solaris/Sparc. Since OpenMP is specified as a set of source code annotations, the OpenMP checkpointer works as a source-to-source compiler that tranforms the application source code into a self-checkpointing version that still runs on top of OpenMP.
A major issue for checkpointing large-scale applications is the fact that modern high-performance computing systems have a great deal of RAM. Some examples among the largest systems include ASC Purple at LLNL, with 49TB of RAM, BlueGene/L at LLNL, with 54TB, and the Ranger system at TACC, with 123 TB. While useful for performing large-scale scientific applications, these large memory sizes mean that the cost of taking a single checkpoint becomes very high, with the BlueGene/L taking more than 20 minutes per full machine checkpoint and the upcoming BlueGene/P system at Argonne expected to take no less than 30 minutes per checkpooint. These large checkpointing costs make it imperative to reduce the cost of checkpointing by reducing the sizes of the checkpoints. I have been working on two different approaches to this problem.
In [Static] I have developed a novel compiler analysis that analyzes application read-write patterns of 1-dimensional arrays. It can identify the sub-range of each array that is written or read by a given loop, even if the size of this region is computed suring the loop itself. This makes it possible to identify arrays that are overwritten before they are read (i.e. they have dead data), a common pattern with buffers used in communication and file access.
In [Hybrid] I have developed a hybrid static-dynamic approach for optimizing incremental and asynchronous checkpointing. Incremental checkpointing reduces checkpoint sizes by detecting memory regions that have no been changed since the last checkpointing and omitting them from the next checkpoint. Asynchronous checkpointing identified periods of time around a checkpoint when a given memory region is not written to by the application and checkpoints them asynchronously, while the application is allowed to resume its work. In this work a static analysis identifies for each array and checkpoint (the programmer annotates the code with possible checkpoint locations), the last write in the code to the array before the checkpoint and the first write to the array after the checkpoint. The runtime system then uses this information to asynchronously checkpoint this array in parallel with the application. This is in contrast to purely runtime approaches that may only begin the asynchronous state-saving after the checkpoint has begin. These annotations are futher used to identify arrays that are not written to between two checkpoints and omitting these arrays from the second checkpoint.
The rise of multi-core processors is creating a revolution in computing by bringing the benefits and pitfalls of parallel programming to general programmers. While there has been significant work on parallel programming models and compilers that can create parallel code from sequential code or highly structured parallel code (i.e. High Performance Fortran), there has been little work on the analysis and optimization of parallel code. In particular, most prior work has focused on sequential analyses of parallel applications or analyses of mostly sequential behavior, such as detecting the static barrier operations that may be matched at runtime. This one-sides approach has resulted in a situation where large numbers of application programmers are starting to write parallel applications, with no compiler analyses that can actually understand the applications' parallel structure. In this project I am developing a compiler analysis that can detect the communication topology of message passing applications. Based on shape analysis techniques, the analysis works as a symbolic evaluation of the application's control-flow graph. The analysis creates one thread for every set of processes used by the application and analyzes the application from the point of view of each process set in parallel. Whenever the application tries to communicate, the threads of the appropriate process sets are synchronized and the analysis identifies the match between the relevant send and receive operations. The analysis is conservative in the sense that if a send-receive pair is identified as matching by the analysis, it may match at runtime. However, if the pair is identified as non-matching, it will never match at runtime. In addition to simply matching static send and receive operations, the analysis can create communication topology graphs that represent the true structure of the fully parallel application. Although the above analysis has numerous specific applications, its primary use is to extend classic analyses to parallel applications. Simple examples include computing flows of data, dominator relationships, def-use chains and inspector-executor. Additinal applications include