Greg Bronevetsky

Contact Info:
    Email: greg@bronevetsky.com
    Phone: 1-925-424-5756
    Fax: 1-925-422-9551
    Address:
    L-557, Building 453, Room 4074
    7000 East Ave
    Livermore, CA 94551

I recently received the Presidential Early Career Award for Scientists and Engineers (PECASE) and the ARRA Early Career Award. Before that I was a Lawrence Post-doctoral Fellow and an NSF Graduate Fellow.

I am a Computer Scientist at the Lawrence Livermore National Laboratory (LLNL). My research focuses on a variety of topics related to supercomputing. The primary motivation of my work is to make supercomputing a commodity technology, available to anybody who has a hard computational problem to solve. Unfortunately, the reality is that to design a machine that can reach such extreme levels of performance (the current peak is more than a quadrillion mathematical operations per second), it is necessary to use hundreds of thousands of processors and similarly gigantic amounts of memory, network cabling and power. Further, productive operation of these machines depends on decades of effort by many scientists to write physical simulations, numerical solvers, operating systems and management tools.

Altogether, a supercomputer is a nightmare of complexity, and the goal of my research is to tame it. Some of the key challenges I'm working on are:

  • Reliability:
    While supercomputers are made from very high-quality components, the staggering size of these machines means that components fail on a regular basis, which makes these machines Reliably Unreliable. I am working on enabling supercomputers to tolerate a large range of such failures.
    • To tolerate failures that cause a key system component to stop functioning, I have developed algorithms to efficiently save the state of large-scale parallel applications. When an application is affected by such an error, it recovers by rolling back to a prior state and continuing work as if nothing happened.
    • While some failures are obvious, others are far more insidious. Computer circuits are very reliable, but when a quadrillion operations execute each second, the chance that some mathematical operation is computed incorrectly grows uncomfortably high. I am working to solve this problem by (i) developing techniques to quantify the vulnerability of applications to such errors and (ii) designing algorithms that can detect and tolerate such corruptions.
  • Statistical Analysis and Management of Computers:
    In addition to outright failures, supercomputers are vulnerable to more subtle problems that lead them to perform inefficiently. Phenomena such as faulty hardware, application load imbalance or poor performance tuning can cause a $100 million machine to perform well below its potential without any clear reason. Since the complexity of these machines makes it impossible for any human developer to analyze these phenomena, I am working on an empirical approach to analyzing and managing computer behavior. The basic idea is to look at a computer as a natural phenomenon and study its behavior by performing experiments, making observations and creating statistical models that organize this information into a precise understanding of how the hardware and software behave in different contexts. This Scientific Computer Engineering will enable computers to analyze their own behavior and use the predictions made by statistical models to proactively manage their own resource use to ensure high efficiency and performance.
  • Parallel Programming:
    Application developers on the next generation of supercomputers will need to manage millions of threads, multiple levels of memory hierarchy and complex network topologies such as multi-dimensional tori. They will need to coordinate all this work in a way that performs the most computation using the least power and network capacity. All this must be done while doing bleeding-edge science, which relies on many decades of work on simulation components that may or may not have been written to run on the machines available today. To overcome this challenge we must develop ways for developers to express their applications in a way that achieves high performance while maintaining developer productivity and leveraging existing code bases.

    My approach to addressing this challenge is fundamentally pragmatic. Instead of proposing a new programming model that developers must incorporate into their existing applications, I am working on ways to improve existing programming models such as MPI to ensure that they continue to enable future computational science while requiring developers to make only incremental changes to their applications. The basic insight is that message passing applications are already written to express large degrees of parallelism and locality and it will be possible to achieve high performance by leveraging this existing information rather than demanding that it be re-expressed in a new programming model. My work includes (i) research on compiler analyses to understand the structure of MPI applications, (ii) extensions to MPI runtime implementations to take advantage of such high-level information and (iii) new formal methods to enable developers to reason about the correctness of their applications regardless of the scale they run on.
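The rollback-recovery idea described above can be illustrated in miniature. The sketch below is a toy single-process illustration (hypothetical names, not the parallel checkpointing algorithms themselves): state is snapshotted periodically, and when a failure strikes, the computation restores the last snapshot and continues as if nothing happened.

```python
import copy

def run_with_checkpoints(steps, checkpoint_every, fail_at=None):
    """Toy rollback recovery: snapshot state periodically; on a
    failure, restore the last snapshot and resume from there."""
    state = {"step": 0, "total": 0}
    checkpoint = copy.deepcopy(state)  # initial checkpoint
    failed_once = False
    while state["step"] < steps:
        try:
            # Inject a one-time failure to exercise the recovery path.
            if fail_at is not None and state["step"] == fail_at and not failed_once:
                failed_once = True
                raise RuntimeError("injected component failure")
            state["total"] += state["step"]  # one unit of "work"
            state["step"] += 1
            if state["step"] % checkpoint_every == 0:
                checkpoint = copy.deepcopy(state)  # save progress
        except RuntimeError:
            state = copy.deepcopy(checkpoint)  # roll back and retry
    return state["total"]
```

Because the running total is part of the checkpointed state, the recovered run produces the same answer as a failure-free run; only the work since the last checkpoint is repeated.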
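The empirical, statistical approach to machine behavior can likewise be sketched in a few lines (a hypothetical helper, not a real monitoring tool): treat per-node timing measurements as experimental data, fit a simple statistical model (here just mean and standard deviation), and flag nodes that deviate from the population.

```python
import statistics

def flag_slow_nodes(timings, threshold=2.5):
    """Toy performance-anomaly detector: flag nodes whose iteration
    time deviates from the population mean by more than `threshold`
    standard deviations. (With few samples, a single outlier's
    z-score is bounded by (n-1)/sqrt(n), so the threshold is modest.)"""
    values = list(timings.values())
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return sorted(node for node, t in timings.items()
                  if stdev > 0 and abs(t - mean) / stdev > threshold)
```

A real system would use richer models (per-context baselines, robust statistics) and many metrics beyond timing, but the shape is the same: observe, model, then act on the model's predictions.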