Contact Info:
Email: greg@bronevetsky.com
Phone: 1-925-424-5756
Fax: 1-925-422-9551
Address:
L-557, Building 453, Room 4074
7000 East Ave
Livermore, CA 94551
I am a Computer Scientist at the Lawrence Livermore
National Laboratory (LLNL). My research focuses on a variety of
topics related to supercomputing. The primary motivation of my work
is to make supercomputing a commodity technology, available to
anybody who has a hard computational problem to solve.
Unfortunately, the reality is that to design a machine that can
reach such extreme levels of performance (the current peak is more
than a quadrillion mathematical operations per second) it is
necessary to use hundreds of thousands of processors and similarly
gigantic amounts of memory, network cables and power. Further,
productive operation of these machines depends on decades of effort
by many scientists to write physical simulations, numerical solvers,
operating systems and management tools.
Altogether, a supercomputer is a nightmare of complexity, and the
goal of my research is to tame it. Some of the key challenges I'm
working on are:
- Reliability:
While supercomputers are made from very high-quality components,
the staggering size of these machines means that components fail
on a regular basis, which makes these machines Reliably
Unreliable. I am working on enabling supercomputers to tolerate
a large range of such failures.
- To tolerate failures that cause a key system component
to stop functioning, I have developed algorithms to
efficiently save the state of large-scale parallel
applications. When an application is affected by such an
error, it recovers by rolling back to a prior state and
continuing work as if nothing happened.
- While some failures are obvious, others are far more
insidious. Computer circuits are very reliable but when a
quadrillion operations are executing each second, the chance
that some mathematical operation may be computed incorrectly
grows to be uncomfortably high. I am working to solve this
problem by (i) developing techniques to quantify the
vulnerability of applications to such errors and (ii)
designing algorithms that can detect and tolerate such
corruptions.
- Statistical Analysis and Management of Computers:
In addition to outright failures, supercomputers are vulnerable
to more subtle problems that lead them to perform inefficiently.
Phenomena such as faulty hardware, application load imbalance or
poor performance tuning can cause a $100 million machine to
perform well below its potential without any clear reason. Since
the complexity of these machines makes it impossible for any
human developer to analyze these phenomena, I am working on an
empirical approach to analyzing and managing computer behavior.
The basic idea is to look at a computer as a natural phenomenon
and study its behavior by performing experiments, making
observations and creating statistical models that organize this
information to build a precise understanding of how the hardware
and software behave in different contexts. This Scientific
Computer Engineering will enable computers to analyze their own
behavior and use the predictions made by statistical models to
proactively manage their resource use, ensuring high efficiency
and performance.
- Parallel Programming:
Application developers on the next generation of
supercomputers will need to manage millions of threads, multiple
levels of memory hierarchy and complex network topologies such
as multi-dimensional tori. They will need to coordinate all
this work in a way that performs the most computation using the
least amount of power and network capacity. All this must be
done while doing bleeding-edge science, which relies on many
decades of work on simulation components that may or may not
have been written to run on the machines available today. To
overcome this challenge, we must give developers ways to express
their applications that achieve high performance while
maintaining developer productivity and leveraging existing
code bases.
My approach to addressing this challenge is fundamentally
pragmatic. Instead of proposing a new programming model that
developers must incorporate into their existing applications, I
am working on ways to improve existing programming models such
as MPI to ensure that they continue to enable future
computational science while requiring developers to make only
incremental changes to their applications. The basic insight is
that message-passing applications are already written to express
large degrees of parallelism and locality. It should therefore be
possible to achieve high performance by leveraging this existing
information rather than demanding that it be re-expressed in a
new programming model. My work includes (i) research on compiler
analyses to understand the structure of MPI applications, (ii)
extensions to MPI runtime implementations to take advantage of
such high-level information and (iii) new formal methods to
enable developers to reason about the correctness of their
applications regardless of the scale they run on.
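
To make the rollback-recovery idea from the Reliability item above concrete, here is a minimal single-process sketch in Python. It is an illustration only, not the algorithms developed in this research: real checkpointing for large-scale parallel applications must also coordinate state across many processes. All names in it (run_with_rollback, SimulatedFailure, accumulate) are hypothetical and chosen for this example.

```python
import copy


class SimulatedFailure(Exception):
    """Hypothetical stand-in for a component that stops functioning."""


def run_with_rollback(state, step, total_steps, checkpoint_every, fail_at=frozenset()):
    """Advance `state` by calling `step`, rolling back to the last checkpoint on failure.

    `step(state, i)` must return a new state rather than mutate its input.
    `fail_at` lists step indices where a failure is injected exactly once.
    """
    checkpoint = (0, copy.deepcopy(state))  # (step index, saved state)
    pending_failures = set(fail_at)
    rollbacks = 0
    i = 0
    while i < total_steps:
        try:
            if i in pending_failures:
                pending_failures.discard(i)  # each injected failure fires once
                raise SimulatedFailure(f"failure at step {i}")
            state = step(state, i)
            i += 1
            if i % checkpoint_every == 0:
                # Save a private copy so later work cannot alter the checkpoint.
                checkpoint = (i, copy.deepcopy(state))
        except SimulatedFailure:
            # Recover: restore the saved state and redo the lost steps.
            i, saved = checkpoint
            state = copy.deepcopy(saved)
            rollbacks += 1
    return state, rollbacks


def accumulate(state, i):
    """Toy computation: add the step index to a running total."""
    return {"total": state["total"] + i}


result, rollbacks = run_with_rollback(
    {"total": 0}, accumulate, total_steps=10, checkpoint_every=3, fail_at={4, 8}
)
clean, _ = run_with_rollback({"total": 0}, accumulate, total_steps=10, checkpoint_every=3)
```

In this sketch the run with two injected failures ends in the same state as the failure-free run, which is the sense in which the application continues "as if nothing happened". The checkpoint interval trades the cost of saving state against the amount of work redone after a failure.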