Greg Bronevetsky, Daniel Marques, Keshav Pingali, and Paul Stodghill. "Collective Operations in an Application-level Fault Tolerant MPI System" (http://iss.cs.cornell.edu/Publications/Papers/ICS2003.pdf), International Conference on Supercomputing, 2003.
Many more papers on the effects of environmental factors on computer hardware can be found below.
Artificial Intelligence
The broad goal of AI is to make sense of complex, failure-prone systems. In particular, machine learning tries to capture the complexity of systems using (hopefully) simple automata. As such, it seems that techniques from the field of Artificial Intelligence will be useful in detecting and tolerating complex faults.
Nicholas Pippenger, "Developments in 'The Synthesis of Reliable Organisms from Unreliable Gates'" (survey), Proceedings of Symposia in Pure Mathematics, 1990.
Nicholas Pippenger, "Invariance of Complexity Measures for Networks with Unreliable Gates", Journal of the ACM, 1989.
Nicholas Pippenger, "On Networks of Noisy Gates", IEEE, 1985.
S. Winograd, "Redundancy and Complexity of Logical Elements", Information and Control, 1963.
Rudolf Ahlswede, "Improvements on Winograd's Result on Computation in the Presence of Noise", IEEE Transactions on Information Theory, 1984.
John von Neumann, "Probabilistic logics and synthesis of reliable organisms from unreliable components", Automata Studies, 1956.
Paper on defeating the Java virtual machine's security through random bit flips (experiment involves heating CPU)
J. F. Ziegler, H. W. Curtis, F. P. Muhlfeld, C. J. Montrose, B. Chin, M. Nicewicz, C. A. Russell, W. Y. Wang, L. B. Freeman, P. Hosier, L. E. LaFave, J. L. Walsh, J. M. Orro, G. J. Unger, J. M. Ross, T. J. O'Gorman, B. Messina, T. D. Sullivan, A. J. Sykes, H. Yourke, T. A. Enger, V. Tolat, T. S. Scott, A. H. Taber, R. J. Sussman, W. A. Klein, C. W. Wahaus,
"IBM Experiments in Soft Fails in Computer Electronics", IBM Journal of Research and Development, 1996
Additional miscellaneous papers:
Injecting faults into a prototype system or a software simulation of the system
Schmid, M.E., R.L. Trapp, A.E. Davidoff, and G.M. Masson, "Upset exposure by means of abstraction verification," Proc. 12th Int'l Fault-Tolerant Comput. Symp., pp. 237-244, Jun. 1982.
Fault injection for increasing software test coverage
Bieman, J.M., D. Dreilinger, and L. Lin, "Using Fault Injection to Increase Software Test Coverage," Proc. IEEE Int'l Symp. on Software Reliability Engineering, pp. 166-174, 30 Oct.-2 Nov. 1996.
Electronic system fault injection
Iyer, R.K. and D. Tang, "Experimental Analysis of Computer System Dependability," Center for Reliable and High-Performance Computing, Technical Report CRHC-93-15, University of Illinois at Urbana-Champaign, 1993.
General-purpose fault inserter (screws around with the chip's input pins)
Schuette, M.A., and J.P. Shen, "Processor Control Flow Monitoring Using Signatured Instruction Streams," IEEE Trans. Comput., Vol. C-36, No. 3, pp. 264-276, Mar. 1987.
Shooting chips with radiation
Gunneflo, U., J. Karlsson, and J. Torin, "Evaluation of error detection schemes using fault injection by heavy-ion radiation," 19th International Symp. on Fault Tolerant Computing, Chicago, IL, pp. 340-347, Jun. 21-23, 1989.
Karlsson, J., P. Liden, P. Dahlgren, R. Johansson, and U. Gunneflo, "Using Heavy-Ion Radiation to Validate Fault-Handling Mechanisms," IEEE Micro, Vol. 14, No. 1, pp. 8-23, February 1994.
Shaeffer, D.L., et al., "High energy proton SEU test results for the commercially available MIPS R3000 microprocessor and R3010 floating point unit," IEEE Trans. on Nuclear Science, Vol. 38, No. 6, pt. 1, pp. 1421-1428, Dec. 1991.
Shaeffer, D.L., et al., "Proton-induced SEU, dose effects, and LEO performance predictions for R3000 microprocessors," IEEE Trans. on Nuclear Science, Vol. 39, No. 6, pt. 2, pp. 2309-2315, Dec. 1992.
Kaschmitter, J.L., et al., "Operation of commercial R3000 processors in the Low Earth Orbit (LEO) space environment," IEEE Trans. on Nuclear Science, Vol. 38, No. 6, pt. 1, pp. 1415-1420, Dec. 1991.
Power supply disturbance
Cortes, M., et al., "Properties of Transient Errors Due to Power Supply Disturbances," Center for Reliable Computing Technical Report, No. 86-1, Stanford University, 1986.
(used by both Miremadi papers above)
Fault injection using software simulation of hardware
Cortes, M., et al., "Techniques for Injecting non-stuck-at faults," Center for Reliable Computing Technical Report, No. 87-21, Stanford University, 1987.
Goswami, K.K., R.K. Iyer, and L.Y. Young, "DEPEND: A Simulation-Based Environment for System Level Dependability Analysis," IEEE Trans. On Computers, Vol. 46, No. 1, pp. 60-74, January 1997.