Algorithm Level Fault Tolerance for Molecular Dynamic Applications

JQ Liu and G Agrawal, 2015 IEEE 22nd International Conference on High Performance Computing (HiPC), 406-415 (2015).

DOI: 10.1109/HiPC.2015.53

Handling soft errors has recently emerged as an important topic in high performance computing. Though there has been a significant amount of work on algorithm-level fault tolerance (ABFT) solutions, they have been applied to linear algebra problems only. Molecular dynamics represents a popular class of computational applications that are susceptible to soft errors because of their long-running nature, and yet there has been no ABFT solution for them. This paper develops such a solution. We show how the key computational kernel of molecular dynamics can be mapped to a matrix-vector multiplication (MVM), in which the matrix holds the intermediate data, the vector comprises the coordinates of the atoms, and the final force is the matrix-vector product. We adapt existing MVM-based solutions to this problem, though additional optimizations are required for efficiency. Our effectiveness evaluation shows that our method always achieves an F-score of over 0.9, provided an appropriate tolerance threshold is chosen. The overall overhead of detection and recovery is also always less than 10%.
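
To illustrate the kind of MVM-based detection the abstract refers to, the following is a minimal sketch of a generic checksum scheme for a matrix-vector product y = A x. It is not the paper's implementation: the abstract does not describe how the force kernel is mapped to A and x, nor the additional optimizations, so the function names (checksum_row, protected_mvm), the relative tolerance tol, the fault-injection hook, and recovery by recomputation are all assumptions made for illustration.

```python
# Generic checksum-protected matrix-vector product (illustrative sketch only).
# All names and the recovery strategy are hypothetical, not taken from the paper.

def checksum_row(A):
    """Column sums of A, i.e. the encoded row c = e^T A."""
    return [sum(col) for col in zip(*A)]

def mvm(A, x):
    """Plain matrix-vector product y = A x."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def check(y, c, x, tol):
    """Return True if sum(y) matches c . x within a relative tolerance."""
    expected = sum(ci * xi for ci, xi in zip(c, x))
    observed = sum(y)
    scale = max(1.0, abs(expected))
    return abs(observed - expected) <= tol * scale

def protected_mvm(A, x, tol=1e-9, inject=None):
    """Compute y = A x, detect a corrupted result, and recover by recomputing.

    inject is an optional (index, delta) pair that simulates a soft error
    in one output element. Returns (y, error_detected).
    """
    c = checksum_row(A)
    y = mvm(A, x)
    if inject is not None:
        i, delta = inject
        y[i] += delta  # simulated soft error
    if check(y, c, x, tol):
        return y, False
    # Assumed recovery strategy: recompute the product from scratch.
    return mvm(A, x), True
```

The tolerance matters because floating-point rounding makes the two sums differ slightly even without a fault: too tight a threshold yields false positives, too loose a threshold misses small corruptions. This is the trade-off behind the abstract's remark that the F-score depends on choosing an appropriate tolerance threshold.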
