System-level effects of soft errors

Abstract/Contents

Abstract
Radiation-induced transient errors (soft errors) are a major reliability concern for digital systems in advanced silicon CMOS technologies. Soft errors cause unexpected changes in signal values during system operation, mostly in on-chip memories and flip-flops, and can result in undetected data corruption or expensive downtime. On-chip memories are routinely protected with coding techniques; this dissertation focuses on soft errors in flip-flops, because design techniques to protect flip-flops are generally expensive. Error injection simulations are widely used to characterize the system-level effects of soft errors in a given design. These techniques generally inject single-bit errors into randomly chosen locations (flip-flops, software-visible registers, and memories) during randomly chosen clock cycles. Flip-flop-level error injection is limited by slow Register-Transfer-Level (RTL) simulation. High-level error injections, which inject errors into software-visible registers or memories, are generally fast. Unfortunately, very little literature exists on the accuracy of high-level error injection techniques. We demonstrate that existing high-level error injections can be inaccurate by more than an order of magnitude, and present a detailed error propagation analysis that quantitatively explains the causes of these inaccuracies. For fast yet accurate error injection simulations, we present a new mixed-mode simulation platform that combines simulators at two different abstraction levels. This platform achieves a 20,000× speedup over RTL-only simulation for an industrial multi-core chip consisting of approximately half a billion transistors. It targets soft errors in uncore components (e.g., memory subsystem, I/O controllers) that occupy significant portions of the overall chip area. Using this platform, we demonstrate, for the first time, that flip-flop soft errors in uncore components can significantly impact system-level reliability.
We also demonstrate that recovery from uncore soft errors can be challenging for traditional system-level checkpointing techniques. A new replay technique overcomes these challenges for uncore components belonging to the memory subsystem.

Description

Type of resource text
Form electronic; electronic resource; remote
Extent 1 online resource.
Publication date 2015
Issuance monographic
Language English

Creators/Contributors

Associated with Cho, Hyungmin
Associated with Stanford University, Department of Electrical Engineering.
Primary advisor Mitra, Subhasish
Thesis advisor Mitra, Subhasish
Thesis advisor Gill, John T III
Thesis advisor Olukotun, Oyekunle Ayinde

Subjects

Genre Theses

Bibliographic information

Statement of responsibility Hyungmin Cho.
Note Submitted to the Department of Electrical Engineering.
Thesis Thesis (Ph.D.)--Stanford University, 2015.
Location electronic resource

Access conditions

Copyright
© 2015 by Hyungmin Cho
