Microarchitecture Level Reliability Assessment: Throughput and Accuracy
in conjunction with MICRO 2017
October 15, 2017 (Sunday morning)
Early assessment of the vulnerability of microprocessor components to hardware faults can drive effective protection decisions. Microarchitecture-level simulators are employed for such early assessments and can deliver reliability reports for a large number of hardware structures, taking into consideration the masking effects of the entire stack of hardware and software layers. Statistical fault injection at the microarchitecture level is very accurate, but its throughput can be low when a statistically significant assessment is required.
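To give a feel for why statistical significance costs throughput, the sketch below computes how many faults a campaign would have to inject for a given error margin and confidence level. It uses the standard sample-size formula with finite-population correction; the function name, the z-score table, and the example population are illustrative assumptions and are not taken from the tutorial material.

```python
import math

# Two-sided z-scores for common confidence levels (standard normal quantiles).
Z_SCORES = {0.90: 1.6449, 0.95: 1.9600, 0.99: 2.5758}


def sample_size(population, margin, confidence=0.99, p=0.5):
    """Return the number of faults to inject (illustrative helper).

    population: size of the full fault space, e.g. bits x cycles (assumed).
    margin:     accepted error margin on the estimate, e.g. 0.03 for 3%.
    confidence: one of the levels in Z_SCORES.
    p:          expected failure proportion; 0.5 is the worst case.
    """
    t = Z_SCORES[confidence]
    n = population / (1 + margin ** 2 * (population - 1) / (t ** 2 * p * (1 - p)))
    return math.ceil(n)


if __name__ == "__main__":
    # Hypothetical fault space of 10**12 (bit, cycle) pairs.
    print(sample_size(10 ** 12, margin=0.03, confidence=0.99))
```

Each sampled fault needs its own simulation run, so tightening the margin or reporting on many structures multiplies the number of runs quickly.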
This tutorial focuses on recent advances delivered by the Computer Architecture Lab of the University of Athens in the area of microarchitecture-level reliability assessment using statistical fault injection. We present GeFIN (Gem5-based Fault Injector), a state-of-the-art microarchitecture-level fault injection framework built on the Gem5 simulator. GeFIN supports massive and fast injection campaigns for all fault types (transient, permanent, intermittent) on arbitrary combinations of several dozen microarchitectural components modeled in Gem5. We first present the baseline Gem5 engine, the AVF (Architectural Vulnerability Factor) and FIT (Failures in Time) measurements reported by the tool, and its fine-grained fault effect classifications.
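As a rough illustration of how AVF and FIT figures follow from classified injection outcomes, the sketch below estimates AVF as the fraction of injections that were not masked and scales it to a FIT rate for a structure. This is a minimal sketch under stated assumptions: the outcome labels, the raw per-bit FIT value, and the function names are hypothetical and do not reflect GeFIN's actual classes or output format.

```python
from collections import Counter


def avf_from_outcomes(outcomes, masked_label="masked"):
    """Estimate AVF as the share of injections with a non-masked outcome.

    outcomes: iterable of outcome labels, one per injection run.
    The label names are assumptions made for this example.
    """
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no injection outcomes given")
    return 1.0 - counts[masked_label] / total


def fit_rate(avf, raw_fit_per_bit, num_bits):
    """Scale a structure's raw FIT (failures per 10**9 hours) by its AVF."""
    return avf * raw_fit_per_bit * num_bits


if __name__ == "__main__":
    # Hypothetical campaign: 100 injections into a 32 KB structure.
    outcomes = ["masked"] * 90 + ["sdc"] * 6 + ["crash"] * 4
    avf = avf_from_outcomes(outcomes)
    print(avf, fit_rate(avf, raw_fit_per_bit=1e-4, num_bits=32 * 1024 * 8))
```

Keeping the per-class counts, rather than only the masked/non-masked split, is what allows a fine-grained breakdown of fault effects to be reported next to the single AVF number.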
We also present two GeFIN add-ons designed to improve the throughput of injection campaigns while preserving the accuracy of the reliability measurements. The first add-on is a set of speed-up methods for individual GeFIN runs. The second add-on is MeRLiN, a fault classification approach based on dynamic instruction profiling that aims at pruning the number of faults in extremely large fault lists. Both add-ons deliver large throughput improvements (several orders of magnitude) for comprehensive (and thus statistically significant) fault injection campaigns while preserving the reported AVF measurements.
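The general idea of pruning a fault list through profiling can be sketched as follows. This is a toy illustration only and not the MeRLiN algorithm: it assumes transient faults in a storage structure, groups them by the instruction that first reads the faulty entry, treats faults that are overwritten before being read as masked, and injects one representative per group. The callbacks `reader_of` and `inject`, and the fault tuple layout, are hypothetical.

```python
from collections import defaultdict


def prune_fault_list(faults, reader_of):
    """Group faults by the instruction that first reads the faulty entry.

    faults:    iterable of (entry, cycle) tuples (assumed layout).
    reader_of: hypothetical profiling callback that returns the program
               counter of the reading instruction, or None when the entry
               is overwritten before it is read.
    Returns (groups, never_read), where groups maps pc -> list of faults.
    """
    groups = defaultdict(list)
    never_read = []
    for fault in faults:
        pc = reader_of(fault)
        if pc is None:
            never_read.append(fault)  # cannot propagate, so counted as masked
        else:
            groups[pc].append(fault)
    return dict(groups), never_read


def estimate_avf(groups, never_read, inject):
    """Inject one representative per group and weight it by group size.

    inject: hypothetical callback that runs one injection and returns True
            when the outcome is not masked.
    """
    total = len(never_read) + sum(len(g) for g in groups.values())
    if total == 0:
        raise ValueError("empty fault list")
    failing = sum(len(g) for g in groups.values() if inject(g[0]))
    return failing / total
```

In this sketch the number of simulation runs equals the number of groups rather than the length of the fault list, which is where the throughput gain would come from; how well the representative stands in for its group determines how closely the estimate tracks the full campaign.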
The tutorial includes measurements for different microarchitectural configurations (corresponding to different CPU models), a discussion of ACE analysis and fault injection at the microarchitecture level, a discussion of CPU and GPU reliability assessment at the microarchitecture level, and a comparison between microarchitecture-level and register-transfer-level fault injection on a commercial CPU model.
Introduction to Microarchitecture Level Reliability Assessment
– Early Reliability Assessment
– Throughput and Accuracy
– Statistical Fault Injection
Microarchitecture Simulator Selection – Gem5
The GeFIN Baseline Fault Injection engine
– Fault models
– Reliability measurements
– Fault effects classifications
GeFIN Fast Modes
MeRLiN Fault Pruning
– CPU models
– Components configurations
– Microarchitecture vs. RTL comparison
GPUs Reliability Assessment on Microarchitecture Simulators
The target audience of the tutorial includes researchers and practitioners interested in microprocessor reliability assessment at the early design stages. Basic understanding of microarchitecture and reliability terminology and techniques is required.
Dimitris Gizopoulos, Athanasios Chatzidimitriou, Manolis Kaliorakis (University of Athens)