Online self-test, diagnostics, and self-repair for robust system design

Abstract/Contents

Abstract
Electronic system malfunctions resulting from hardware failures are a growing concern. For silicon integrated circuits (ICs) with remarkably small geometries, several hardware failure mechanisms that were largely benign in the past are now becoming visible at the system level. Therefore, a wide range of future electronic systems will require tolerance of hardware failures during their operation. Hardware failures can be broadly classified as permanent or temporary. In this dissertation, we present cost-effective techniques to enable the design of robust systems with built-in resilience to permanent failures, overcoming a major reliability challenge of electronic systems.

To tolerate permanent failures in robust systems, three critical activities are involved: detection, localization, and self-repair. Online self-test and diagnostics enable a system to test itself during normal operation to detect and localize permanent failures. We present three new online self-test ideas:
1. A technique called CASP (Concurrent Autonomous chip self-test using Stored test Patterns) which achieves thorough online self-test while incurring minimal area and power cost.
2. Software orchestration techniques that minimize visible system performance impact during online self-test of processor cores.
3. New design principles that minimize system-level performance impact during online self-test of uncore components.

Examples of uncore components include cache controllers, memory controllers, and I/O controllers. Although uncore components occupy significant portions of multi-core ICs, very little existing research targets online self-test of these components. We demonstrate the effectiveness of these online self-test ideas using the OpenSPARC T2 design supporting 8 processor cores and 64 hardware threads. The results show that our CASP technique, assisted by software orchestration techniques for processor cores and online self-test design principles for uncore components, achieves very high online test coverage while imposing only 1% area cost, 1% power cost, and 3% performance cost.

We also present new self-repair techniques, which are required to keep the system functioning correctly in the presence of failures. Unlike on-chip memories and processor cores, very little attention has been paid to self-repair of uncore components. Naïve redundancy-based techniques impose significant (e.g., 16%) chip-level area costs. We present new techniques that enable cost-effective self-repair of uncore components. For OpenSPARC T2, the chip-level area, power, and performance costs are 7%, 6%, and 5%, respectively, for self-repair of any single faulty component. Our techniques are capable of self-repairing multiple faulty components with graceful degradation of system performance.

Description

Type of resource text
Form electronic; electronic resource; remote
Extent 1 online resource.
Publication date 2013
Issuance monographic
Language English

Creators/Contributors

Associated with Li, Yanjing
Associated with Stanford University, Department of Electrical Engineering.
Primary advisor Mitra, Subhasish
Thesis advisor Mitra, Subhasish
Thesis advisor Gill, John T III
Thesis advisor Olukotun, Oyekunle Ayinde

Subjects

Genre Theses

Bibliographic information

Statement of responsibility Yanjing Li.
Note Submitted to the Department of Electrical Engineering.
Thesis Thesis (Ph.D.)--Stanford University, 2013.
Location electronic resource

Access conditions

Copyright
© 2013 by Yanjing Li
License
This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC 3.0).
