The state of the art today in detection and tolerance of complex faults can be divided into two camps. On one side are solutions such as replication that are inefficient but apply automatically to all codes. On the other side is a wide range of manual solutions that provide high-quality and (usually) efficient error detection and tolerance for specific codes but don't generalize to arbitrary programs. In this rift lies the more useful and much more complex problem of efficient error detection and tolerance for arbitrary programs with no programmer intervention.
This project explores the use of compilers to determine the vulnerability of programs to random errors. A simple stochastic model is assumed, modeling the effect of radiation on system memory (i.e., random bit-flips). The effect of such bit-flips is propagated through the program to determine which regions of memory and regions of code are most vulnerable to such errors. In addition to providing us with information about the vulnerability of a given program to random errors, this analysis would be able to identify an efficient way of making the program more fault tolerant. In particular, if it is observed that the execution of a particular piece of code increases the probability of error beyond a certain bound, we can replicate this code, run it multiple times and compare the results. This local use of replication would bring the probability of error below a reasonable bound without the significant overhead of applying replication blindly to the whole program. Furthermore, if a certain amount of reliable memory is available in the system, this analysis would be able to allocate variables to that memory in the way that best reduces the vulnerability of the program to errors.
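The idea of local replication can be illustrated with a minimal Python sketch. It is not the compiler analysis itself: the names `flip_random_bit`, `replicated` and `make_faulty_sum` are hypothetical, the fault model is a single uniformly chosen bit-flip in a 32-bit value, and majority voting over three copies is one assumed way of "comparing the results".

```python
import random
from collections import Counter

def flip_random_bit(value, width=32, rng=random):
    """Return value with one uniformly chosen bit flipped (assumed fault model)."""
    return value ^ (1 << rng.randrange(width))

def replicated(fn, copies=3):
    """Wrap fn so that it runs `copies` times and returns the majority result."""
    def wrapper(*args):
        results = [fn(*args) for _ in range(copies)]
        winner, votes = Counter(results).most_common(1)[0]
        if votes <= copies // 2:
            raise RuntimeError("no majority among replicas")
        return winner
    return wrapper

def make_faulty_sum(fault_on_call, rng):
    """Hypothetical vulnerable code region: a sum whose result is hit by one
    bit-flip on the given call number and is correct on every other call."""
    calls = {"n": 0}
    def faulty_sum(xs):
        calls["n"] += 1
        total = sum(xs)
        if calls["n"] == fault_on_call:
            total = flip_random_bit(total, rng=rng)
        return total
    return faulty_sum
```

Wrapping only the region that the analysis flags, for example `replicated(make_faulty_sum(2, random.Random(0)))`, masks a fault that hits a single replica, while the rest of the program runs once and pays no replication overhead.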
Watchdog monitors have been examined by a number of studies over the years, but most of this work has focused on their use for control-flow checking. This project examines the use of a generic programmable monitoring module that sits on the processor-memory bus to perform additional error checks. In particular, the focus is on memory safety: ensuring that the processor only uses the memory that it has been allocated, with no accesses that are misaligned relative to the declared type of each memory buffer.
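The two checks named above (bounds and type alignment) can be modeled in software with a short Python sketch. This is only an illustration of the checking logic, not a design for the hardware module: the `Buffer` and `BusMonitor` names, the registration interface and the string verdicts are all assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Buffer:
    base: int       # address of the first byte of the allocation
    size: int       # length of the allocation in bytes
    elem_size: int  # size in bytes of the buffer's declared element type

class BusMonitor:
    """Hypothetical software model of a monitor observing memory accesses."""

    def __init__(self):
        self.buffers = []

    def register(self, base, size, elem_size):
        """Record an allocation together with its declared element size."""
        self.buffers.append(Buffer(base, size, elem_size))

    def check(self, addr, width):
        """Classify an access of `width` bytes starting at address `addr`."""
        for buf in self.buffers:
            in_bounds = buf.base <= addr and addr + width <= buf.base + buf.size
            if in_bounds:
                # The access must start on an element boundary of the buffer.
                if (addr - buf.base) % buf.elem_size != 0:
                    return "misaligned"
                return "ok"
        return "out-of-bounds"
```

For a 16-byte buffer of 4-byte elements registered at `0x1000`, an access at `0x1004` passes, one at `0x1002` is flagged as misaligned, and one at `0x1010` is flagged as out of bounds.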
The Cornell Checkpoint Compiler (C3) provides applications with checkpointing capabilities by transforming their source code so as to allow them to periodically save their own state. The advantages of working at the application level are twofold. First, since it is the application that is modified for checkpointing rather than the system, the application can checkpoint itself on any system. This is significant considering the many systems for which checkpointing is either not available or very restricted. Second, since we are working at the application level, it is possible to use compile-time analyses to determine which parts of the program's active state really do need to be saved and which may be omitted from the checkpoint. Given that hard disks are fairly slow (especially networked disks), this can lead to dramatic reductions in the amount of time it takes to checkpoint large applications.
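The second advantage, saving only the state that is actually needed, can be sketched by hand in Python. This is not how C3 works (C3 transforms the application's source code automatically); the `Simulation` class and its fields are hypothetical, and the choice of which fields to omit is made manually here rather than by a compile-time analysis.

```python
import os
import pickle

class Simulation:
    """Hypothetical application that saves and restores its own state."""

    def __init__(self, n):
        self.step = 0
        self.values = [float(i) for i in range(n)]
        self.scratch = [0.0] * n  # overwritten at the start of every step

    def advance(self):
        self.scratch = [v * 0.5 for v in self.values]
        self.values = [v + s for v, s in zip(self.values, self.scratch)]
        self.step += 1

    def checkpoint(self, path):
        # Only `step` and `values` are live across steps; `scratch` is
        # recomputed before it is read, so it is left out of the checkpoint.
        state = {"step": self.step, "values": self.values}
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:
            pickle.dump(state, f)
        os.replace(tmp, path)  # avoid leaving a half-written checkpoint

    @classmethod
    def restore(cls, path):
        with open(path, "rb") as f:
            state = pickle.load(f)
        sim = cls(len(state["values"]))
        sim.step = state["step"]
        sim.values = state["values"]
        return sim
```

A run that is restored from the checkpoint and advanced by one step ends in the same state as the original run advanced by one step, even though roughly half of the array data was never written to disk.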
One limitation of application-level techniques is that it becomes difficult to checkpoint programs that use libraries for which source code has not been provided. This project aims to modify the default system linker to:
Linear relationships between variables provide for the detection and correction of errors, a form of error-correcting coding. However, a set of linear relationships between n variables may not allow detection and correction of more than one erroneous variable, since errors may cancel one another and the relationships can still hold. This paper aims to examine linear error detection and correction and its limitations. Moreover, the paper will consider linear equations over binary fields and non-linear equations, namely cyclic polynomials.
Suppose that we have a set of variables and we know a set of affine relationships between them. Since we know these relationships, it becomes fairly trivial to detect errors that might have affected the variables by simply verifying that the relationships still hold. (Note that affine relationships are transitive: if we know how A relates to B and how B relates to C, we know how A relates to C.)
So the challenge is this: can we develop techniques to improve the detectability and locatability of errors in variables with known affine relationships between them? In particular, it should be possible to do this by adding additional linearly independent affine relationships to the mix. For example, borrowing a page from error correcting codes, we can add additional redundant variables that are guaranteed to be related to the original set of variables via some affine relationships linearly independent from the original set. Given n original variables and k redundant variables we should be able to generate n+k linearly independent relationships between them.
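As a minimal Python sketch of the redundant-variable idea, assume integer variables and two hypothetical check variables: a plain sum `c1` and a position-weighted sum `c2`. These two relationships are one illustrative choice, not a scheme proposed by the project. Under the assumption that at most one variable is corrupted, the two syndromes both detect the error and locate it.

```python
def encode(data):
    """Compute two redundant variables that are linear in the data."""
    c1 = sum(data)
    c2 = sum((i + 1) * x for i, x in enumerate(data))
    return c1, c2

def syndromes(data, c1, c2):
    """Return how far each relationship is from holding (0 means it holds)."""
    s1 = sum(data) - c1
    s2 = sum((i + 1) * x for i, x in enumerate(data)) - c2
    return s1, s2

def locate_and_correct(data, c1, c2):
    """Return (data, location), assuming at most one corrupted variable.

    location is None (no error), "check" (a check variable was hit),
    or the index of the data variable that was repaired."""
    s1, s2 = syndromes(data, c1, c2)
    if s1 == 0 and s2 == 0:
        return list(data), None
    if s1 == 0 or s2 == 0:
        return list(data), "check"
    # An error e in data[j] gives s1 = e and s2 = (j + 1) * e.
    if s2 % s1 != 0 or not 1 <= s2 // s1 <= len(data):
        raise ValueError("syndromes inconsistent with a single error")
    j = s2 // s1 - 1
    fixed = list(data)
    fixed[j] -= s1
    return fixed, j
```

The sketch also shows the limitation of such schemes: two data errors of +e and -e cancel in the first relationship, so the single-error logic misreports them as a corrupted check variable. Catching such cases would require further independent relationships.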
Challenges: (pick any reasonable subset)
Ask CTC and CIT to install a script that runs on every processor when it is not busy. This script will perform repeated checks of the system and report back to a server. The problem with detecting random errors is that they are fairly infrequent, and traditionally people have studied them by either shooting processors with radiation or simulating bit flips via software, neither of which is realistic. The goal of this project is to raise the probability of seeing real errors by running our checking scripts on as many processors as possible, for as long as possible. Hopefully we'll see enough solar/space/temperature activity to catch the processors failing reasonably often. It would be great if we also kept track of solar/space/temperature activity so that we could correlate the results.
One extension would be to grab some computers that are about to be thrown away and subject them to variations in temperature, varying numbers of on-off cycles, etc. This way we can try to correlate external stimulation to actual errors, though with used computers (the only ones we're likely to be able to do this to) their varying prior experiences in life will probably skew the results.
The Pittsburgh Supercomputing Center is home to a 3000-processor Alpha cluster that has been in operation since 2001. In this time they have collected a lot of information about processor failures and temperatures in parts of the room. The project would involve taking this information and performing statistical analysis on it so that we can understand the source of the errors. Two observations that the PSC people have already made are (1) processor failures do correlate with increased temperatures (within reason, since they always try to keep their temperatures low) and (2) the biggest predictor of whether a processor will fail in the future is whether it has failed in the past, since it appears that some processors keep breaking while others live for a long time (apparently the source of the problem is manufacturing defects that cause transient faults).
There are many different software fault detection schemes. Unfortunately, experimental evaluations of these schemes are wildly different in the types of errors injected and applications used for evaluation. As such, this body of work does not leave us with a clear impression of the relative strengths and weaknesses of the different schemes. This project would involve the implementation of several software fault detection schemes (such as signature-based control flow checking, temporal replication of code, hand-inserted assertions, etc.) and their comparison along several criteria such as error detection coverage (percent of errors detected) and latency (time from error to detection).
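The two comparison criteria can be pinned down with a small Python sketch. The record format is an assumption made for illustration: each fault-injection trial is a pair `(inject_time, detect_time)`, with `detect_time` set to `None` when the scheme missed the error, and `summarize` is a hypothetical helper name.

```python
from statistics import mean

def summarize(trials):
    """Return (coverage in percent, mean detection latency) for one scheme.

    trials: list of (inject_time, detect_time or None) pairs, in any
    consistent time unit. Latency is averaged over detected errors only."""
    latencies = [detect - inject for inject, detect in trials
                 if detect is not None]
    coverage = 100.0 * len(latencies) / len(trials) if trials else 0.0
    mean_latency = mean(latencies) if latencies else None
    return coverage, mean_latency
```

Running every scheme against the same set of injected errors and the same applications, then comparing the resulting pairs, is what would make the numbers comparable across schemes.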