CS717: Application-level Detection and Tolerance of Complex Faults

(Fall 2004)


Instructors

  • Keshav Pingali, Rhodes 457
  • Greg Bronevetsky, Rhodes 490

Time/Place
Tuesdays & Thursdays, 1:25 - 2:40 PM, Rhodes 484

List of Lectures and Papers
List of Projects

Course Description

You give a problem to a computer. It chugs away for a while and returns an answer. How do you know that it's correct? A particle of radiation could have hit the memory and flipped some bits. A hacker could have changed the code to modify the output. The program could simply have been buggy and computed the wrong thing. The question is: how can you tell?

Computation in the presence of an adversary is an important area of computer science. However, the work on non-trivial adversaries (i.e., when the system doesn't simply halt on error) has thus far been a patchwork of independent efforts in disparate fields. Efforts by the Theory, Distributed Systems, Security, Numerical Analysis, Software Engineering and Computer Architecture communities have produced a number of solutions, all of which either

In CS717 we will examine a broad range of literature cutting across diverse fields: from theoretical checking schemes to encoded computation to Byzantine quorum systems to experiments on particle accelerators. The goal: to find ways to detect and tolerate errors at the application level. The ability to create checkers and correctors that are customized to any given application would be a tremendous tool, allowing us to deal cheaply with complex faults in a way that could be embedded in a compiler.

This is a large field with many unexplored corners. Indeed, it is the difficulty of the problem that creates the need for this course: if we can see the many techniques used in the past, we may get inspiration for new approaches of our own. This semester we will study the past in our search for the future.

Course Outline