System-level effects of soft errors
Abstract/Contents
- Abstract
- Radiation-induced transient errors (soft errors) are a major reliability concern for digital systems in advanced silicon CMOS technologies. Soft errors create unexpected changes in signal values during system operation, mostly in on-chip memories and flip-flops, resulting in undetected data corruption or expensive downtimes. This dissertation focuses on soft errors in flip-flops because design techniques to protect flip-flops are generally expensive. To protect on-chip memories, coding techniques are routinely used. Error injection simulations are widely used for characterizing system-level effects of soft errors in a given design. These techniques generally inject single-bit errors into randomly-chosen locations (flip-flops, software-visible registers and memories) during randomly-chosen clock cycles. Flip-flop-level error injections suffer from slow Register-Transfer-Level (RTL) simulations. High-level error injections, which inject errors into software-visible registers or memories, are generally fast. Unfortunately, very little literature exists on the accuracy of high-level error injection techniques. We demonstrate that existing high-level error injections can be inaccurate by more than an order of magnitude, and present detailed error propagation analysis to quantitatively explain the causes of such inaccuracies. For fast, yet accurate, error injection simulations, we present a new mixed-mode simulation platform that combines simulators at two different abstraction levels. This platform achieves 20,000× speedup over RTL-only simulation for an industrial multi-core chip consisting of approximately half-a-billion transistors. This platform targets soft errors in uncore components (e.g., memory subsystem, I/O controllers) that occupy significant portions of the overall chip area. Using this platform, we demonstrate, for the first time, that flip-flop soft errors in uncore components can significantly impact system-level reliability.
We also demonstrate that recovery from uncore soft errors can be challenging for traditional system-level checkpointing techniques. A new replay technique overcomes these challenges for uncore components belonging to the memory subsystem.
Description
Type of resource | text |
---|---|
Form | electronic; electronic resource; remote |
Extent | 1 online resource. |
Publication date | 2015 |
Issuance | monographic |
Language | English |
Creators/Contributors
Associated with | Cho, Hyungmin | |
---|---|---|
Associated with | Stanford University, Department of Electrical Engineering. | |
Primary advisor | Mitra, Subhasish | |
Thesis advisor | Mitra, Subhasish | |
Thesis advisor | Gill, John T III | |
Thesis advisor | Olukotun, Oyekunle Ayinde | |
Advisor | Gill, John T III | |
Advisor | Olukotun, Oyekunle Ayinde | |
Subjects
Genre | Theses |
---|---|
Bibliographic information
Statement of responsibility | Hyungmin Cho. |
---|---|
Note | Submitted to the Department of Electrical Engineering. |
Thesis | Thesis (Ph.D.)--Stanford University, 2015. |
Location | electronic resource |
Access conditions
- Copyright
- © 2015 by Hyungmin Cho