Greg Bronevetsky

Research

Application-level Checkpointing for Parallel Applications

Modern high performance systems have grown in both size and complexity to a point where their mean time to failure is on the order of hours or days. Checkpointing is a general-purpose fault tolerance technique where the state of the application is periodically saved to reliable media. When the application or the system fails, the application is restarted from a recently stored computational state. In the case of parallel applications prior work on checkpointing has involved modifying the underlying implementation of the parallel software or hardware to enable appropriate state-saving and coordination. While effective, this approach is non-portable, requiring all vendors of parallel communication libraries to individually implement their own checkpointing functionality. The primary contribution of this project is to develop mechanisms that checkpoint popular message-passing and shared memory APIs (MPI and OpenMP) at the level of the API, thus ensuring that a single system could checkpoint any implementation of MPI and OpenMP. In addition to these improvements in portability, we developed several novel checkpoint coordination protocols that improved the performance and scalability of parallel checkpointing in general.

MPI

Developed a novel checkpoint coordination protocol for message passing applications and implemented it in the context of the portable MPI checkpointer [PPoPP2003] that supports the entire MPI1.1 specification[ICS2003]. The protocol is essentially a combination between the Chandy-Lamport distributed snapshot protocol and message logging, where different processes may checkpoint at different points in time and the protocol records the appropriate messages and non-deterministic event outcomes to ensure that these individual snapshots can be combined into a full global checkpoint. Experimental results on Windows/PentiumIII/cLAN and Tru64/Alpha/Quadrics demontrated the scalability, portability and performance of this approach [SC2004].

OpenMP

Designed a novel protocol for checkpointing shared memory applications/APIs, regardless of the underlying memory model[ASPLOS2004]. The protocol is based on all threads performing two successive barrier operations and recording their local and shared states between the two barriers. Since this insertion of barriers can create deadlocks in combination with the application's regular synchronization operations, all threads automatically release all their resources before starting a checkpoint and reacquire them before completing the checkpoint. The protocol ensures that no thread acquires a resource unless this would be legal in a checkpoint-free execution. The protocol's correctness was proven in [Cornell07]. The protocol was extended to incorporate the entire OpenMP 1.0 specification, excluding nested parallelism[ICS2006] and was shown to exhibit good performance on three different platforms: Linux/Itanium, Tru64/Alpha and Solaris/Sparc. Since OpenMP is specified as a set of source code annotations, the OpenMP checkpointer works as a source-to-source compiler that tranforms the application source code into a self-checkpointing version that still runs on top of OpenMP.

Checkpoint Optimization

A major issue for checkpointing large-scale applications is the fact that modern high-performance computing systems have a great deal of RAM. Some examples among the largest systems include ASC Purple at LLNL, with 49TB of RAM, BlueGene/L at LLNL, with 54TB, and the Ranger system at TACC, with 123 TB. While useful for performing large-scale scientific applications, these large memory sizes mean that the cost of taking a single checkpoint becomes very high, with the BlueGene/L taking more than 20 minutes per full machine checkpoint and the upcoming BlueGene/P system at Argonne expected to take no less than 30 minutes per checkpooint. These large checkpointing costs make it imperative to reduce the cost of checkpointing by reducing the sizes of the checkpoints. I have been working on two different approaches to this problem.

Static Analysis

In [Static] I have developed a novel compiler analysis that analyzes application read-write patterns of 1-dimensional arrays. It can identify the sub-range of each array that is written or read by a given loop, even if the size of this region is computed suring the loop itself. This makes it possible to identify arrays that are overwritten before they are read (i.e. they have dead data), a common pattern with buffers used in communication and file access.

Hybrid Compiler + Runtime

In [Hybrid] I have developed a hybrid static-dynamic approach for optimizing incremental and asynchronous checkpointing. Incremental checkpointing reduces checkpoint sizes by detecting memory regions that have no been changed since the last checkpointing and omitting them from the next checkpoint. Asynchronous checkpointing identified periods of time around a checkpoint when a given memory region is not written to by the application and checkpoints them asynchronously, while the application is allowed to resume its work. In this work a static analysis identifies for each array and checkpoint (the programmer annotates the code with possible checkpoint locations), the last write in the code to the array before the checkpoint and the first write to the array after the checkpoint. The runtime system then uses this information to asynchronously checkpoint this array in parallel with the application. This is in contrast to purely runtime approaches that may only begin the asynchronous state-saving after the checkpoint has begin. These annotations are futher used to identify arrays that are not written to between two checkpoints and omitting these arrays from the second checkpoint.

Formal Methods

Shared memory is a popular mechanism for writing parallel application. A direct extension of sequential programming, it allows multiple threads of execution to access the same address space, thus allowing them to communicate and coordinate their work. While it is generally simplest for programmers to think of these threads as executing on a single processor and accessing their shared memory one instruction at a time, this abstraction, called "sequential consistency", cannot be maintained efficiently. Real shared memory systems provide programmers with a much looser set of guarantees about the possible interactions between reads and writes, documenting them in a specification called a "memory model". One example of a non-trivial memory model comes up in systems that use 32-bit messages to exchange information between different processors. If two threads running on two processors simultaneously decide to write different 64-bit values to the same memory location, the fact that these updates will be sent as two different 32-bit messages means that the final value of the memory location may actually be a combination of the two values. In other words, the result of this race is a value that no thread actually intended to write. This is in contrast to sequential consistency, which stipulates that the final value must be the value written by one of the threads. The OpenMP specification constains a section that describes the official memory model guaranteed by any implementation of OpenMP. Although reasonably detailed, it is written in English and suffers from the imprecision typical of any natural-language text. As such, it is often difficult to answer detailed questions about an application's behavior on top of an arbitrary OpenMP implementation and about the legalities of various compiler transformations. In [IJPP2006] we developed a detailed and fully formal specification of the OpenMP memory model, after consulting with various experts on the OpenMP specification. This formalization is novel in specifying the memory model at two levels. A compiler-level specification identifies legal transformations from OpenMP source code to sequences of low-level read, write and synchronization operations. A runtime-level specification verifies whether there exists a valid way to execute such sequences of low-level operations and ensure that all the reads return the values recorded in the sequence. This is contrast to prior memory model work that exclusively focused on runtime-level specifications. The primary effect of this extended design is that unlike prior work, it connects the memory model specification to actual application source code. This is especially critical for the OpenMP specification, which is explicitly an application-level API, and has not been a major focus in the past because prior work focused on formalizing hardware-level events.

Static Communication Topology Detection

The rise of multi-core processors is creating a revolution in computing by bringing the benefits and pitfalls of parallel programming to general programmers. While there has been significant work on parallel programming models and compilers that can create parallel code from sequential code or highly structured parallel code (i.e. High Performance Fortran), there has been little work on the analysis and optimization of parallel code. In particular, most prior work has focused on sequential analyses of parallel applications or analyses of mostly sequential behavior, such as detecting the static barrier operations that may be matched at runtime. This one-sides approach has resulted in a situation where large numbers of application programmers are starting to write parallel applications, with no compiler analyses that can actually understand the applications' parallel structure. In this project I am developing a compiler analysis that can detect the communication topology of message passing applications. Based on shape analysis techniques, the analysis works as a symbolic evaluation of the application's control-flow graph. The analysis creates one thread for every set of processes used by the application and analyzes the application from the point of view of each process set in parallel. Whenever the application tries to communicate, the threads of the appropriate process sets are synchronized and the analysis identifies the match between the relevant send and receive operations. The analysis is conservative in the sense that if a send-receive pair is identified as matching by the analysis, it may match at runtime. However, if the pair is identified as non-matching, it will never match at runtime. In addition to simply matching static send and receive operations, the analysis can create communication topology graphs that represent the true structure of the fully parallel application. Although the above analysis has numerous specific applications, its primary use is to extend classic analyses to parallel applications. Simple examples include computing flows of data, dominator relationships, def-use chains and inspector-executor. Additinal applications include

  • converting communication patterns expressed as point-to-point messages into larger collective communications and
  • reducing the runtime cost of matching incoming packets to outstanding receive buffers by identifying these matches at the sender

Soft Errors

As the feature sizes of modern electronics shift to fewer and fewer nanometers, they become increasingly vulnerable to soft errors. A soft error is a one-time fault in the operation of a circuit that is not repeated by the circuit in subsequent operation. Soft errors may be caused by a variety of physical phenomena, including:
  • ambient neutron radiation from cosmic rays or chip packaging can cause decay and nuclear fission reactions that cause charged particles to pass through transistors and overload their charge
  • thin separation between adjacent wires can cause electrons to tunnel from one wire to another
  • process variations can result in intermittent errors caused by marginal components
I am working on techniques to evaluate the soft error vulnerability of realistic scientific applications and to develop general techniques for detecting and tolerating such errors. In contrast to prior work, I focus on application-level vulnerability and am working on techniques that are applicable to scientific applications in general rather than a few specific applications.
Soft Error Vulnerability of Iterative Linear Algebra Methods