Runtime asynchronous fault tolerance via speculation

Author(s): Zhang, Yun; Ghosh, Soumyadeep; Huang, Jialu; Lee, Jae W; Mahlke, Scott A; et al

Abstract: Transient faults are emerging as a critical reliability concern in modern microprocessors. Redundant hardware solutions are commonly deployed to detect transient faults, but they are less flexible and cost-effective than software solutions. However, software solutions are rendered impractical because of high performance overheads. To address this problem, this paper presents Runtime Asynchronous Fault Tolerance via Speculation (RAFT), the fastest transient fault detection technique known to date. Serving as a layer between the application and the underlying platform, RAFT automatically generates two symmetric program instances from a program binary. It detects transient faults in a non-invasive way and exploits high-confidence value speculation to achieve low runtime overhead. Evaluation on a commodity multicore system demonstrates that RAFT delivers a geomean performance overhead of 2.83% on a set of 30 SPEC CPU benchmarks and STAMP benchmarks. Compared with existing transient fault detection techniques, RAFT exhibits the best performance and fault coverage, without requiring any change to the hardware or the software applications.
Publication Date: 2012
Citation: Zhang, Yun, Soumyadeep Ghosh, Jialu Huang, Jae W. Lee, Scott A. Mahlke, and David I. August. "Runtime asynchronous fault tolerance via speculation." Proceedings of the Tenth International Symposium on Code Generation and Optimization (2012): pp. 145-154. doi:10.1145/2259016.2259035
DOI: 10.1145/2259016.2259035
ISSN: 2164-2397
Pages: 145 - 154
Type of Material: Conference Article
Journal/Proceeding Title: Proceedings of the Tenth International Symposium on Code Generation and Optimization
Version: Author's manuscript
