A Survey of Fault-Injection Methodologies for Soft Error Rate Modeling in Systems-on-Chips

The development of process technology has increased system performance, but the system failure probability has also significantly increased. It is important to consider the system reliability in addition to the cost, performance, and power consumption. In this paper, we describe the types of faults that occur in a system and where these faults originate. Then, fault-injection techniques, which are used to characterize the fault rate of a system-on-chip (SoC), are investigated to provide a guideline to SoC designers for the realization of resilient SoCs.


Introduction
Recently, the development of process technology has increased system performance, but the system failure probability has also significantly increased. Circuit designers must therefore consider robustness against failures in addition to cost, performance, and power consumption. Thus, resilient design has become particularly prominent in system design as a means of increasing reliability. Fault tolerance and fault diagnosis have been studied for over half a century to keep systems under control in anticipation of inevitable defects.
Fault avoidance is a technique that addresses the faults of devices to prevent failure. This technique includes improving the reliability of the product through inspection and testing processes. However, a completely fault-free design and manufacturing process for complex devices such as processors is difficult to achieve. Further, these devices often encounter dangerous situations because of aging and internal and external hardware defects. Therefore, researchers have actively studied fault-tolerance techniques that ensure normal operation even when a device experiences some faults. Fault-tolerance techniques have progressed considerably over nearly half a century. Among them, the redundancy scheme, which tolerates faults by using additional resources, has been highlighted. This scheme is simple to implement and provides high reliability. Redundancy schemes can be classified into four general types: hardware, software, information, and time redundancy.
To detect the faults in a system, the hardware redundancy technique uses replicated hardware, the software redundancy technique employs an additional program routine, the information redundancy technique inserts extra bits into the data to be transferred, and the time redundancy technique repetitively performs the same process and compares the results. These techniques involve additional costs and delays because of the additional resources; thus, designers should select one of these techniques after considering the tradeoffs. Structures and methods have been studied in recent years to reduce these additional resources.
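As an illustration of three of the redundancy types described above, the following minimal sketch models hardware redundancy as a majority vote over three replicas, information redundancy as a single even-parity bit, and time redundancy as repeated execution with comparison. All function names are hypothetical and chosen for this example only; practical schemes (for instance, stronger error-correcting codes) are considerably more elaborate.

```python
from collections import Counter


def tmr_vote(a, b, c):
    """Hardware-style redundancy: majority vote over three replicated results."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        raise ValueError("no majority: all three replicas disagree")
    return value


def add_parity(word):
    """Information redundancy: append one even-parity bit to an integer word."""
    parity = bin(word).count("1") & 1
    return (word << 1) | parity


def parity_ok(coded):
    """Return True when the coded word still has an even number of 1 bits."""
    return bin(coded).count("1") % 2 == 0


def run_twice(fn, *args):
    """Time redundancy: execute the same computation twice and compare."""
    first, second = fn(*args), fn(*args)
    if first != second:
        raise RuntimeError("results differ: transient fault suspected")
    return first
```

A single parity bit detects any odd number of flipped bits but cannot correct them, whereas the voter masks one faulty replica at roughly three times the hardware cost. This reflects the tradeoff between added resources and reliability noted above.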
System diagnosis is naturally important for handling a system failure caused by faults. Methods that test and evaluate devices have also been developed alongside fault-tolerance techniques. Fault-injection (FI) techniques have been widely used as a fault-testing plan. The FI techniques can be classified according to which device injects a fault into a target system. These techniques have their own advantages and disadvantages. Because most defects can be eliminated during the testing process, the proper FI technique should be selected in accordance with the particular design. In this paper, we describe what types of faults occur in a system and where the faults originate. Then, the FI techniques, which are used to characterize the fault rate of a system-on-chip (SoC), are investigated to provide a guideline to SoC designers for the design of resilient SoCs.
The rest of this paper is organized as follows. We first classify the types of faults in Section 2 and explain the causes that lead to faults in Section 3. Section 4 addresses the FI techniques and includes a discussion. Finally, we conclude this paper in Section 5.

Type of Faults
A fault can be classified as a hardware or a software fault according to where it occurs. We focus on hardware faults in this study, which greatly affect the device and system. A hardware fault is classified as a permanent, an intermittent, or a transient fault according to how long it exists in a device (see Figure 1). The data errors that result from a hardware fault include hard and soft errors. A hard error is data corruption caused by permanent and intermittent faults. A soft error is data corruption caused by a disturbance in the environment, such as alpha particles or neutrons, and originates from transient faults. In contrast to a hard error, a soft error arises under conditions in which the device is not damaged. A soft error can be a single bit flip or multiple bit flips. A single bit flip consists of one data flip, and multiple bit flips consist of several data flips. Further, a single bit flip can be categorized as a single event upset (SEU) or a single event transient (SET), depending on where it occurs. An SEU occurs in a storage element, e.g., a latch or flip-flop, whereas an SET appears in combinational logic. An erroneous SEU value in a storage element can be captured by the following sequential logic. An SET in combinational logic leads to failure less often than an SEU because the errors are reduced by logical, temporal, or electrical masking. However, correcting an SET is more costly because the result of the operation propagates as soon as the input data are entered.
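The distinction between single and multiple bit flips can be made concrete with a small sketch. It models a stored word as an unsigned integer of an assumed width and a bit flip as an XOR with a mask; the helper names `flip_bits` and `hamming_distance` are hypothetical.

```python
def flip_bits(value, positions, width=32):
    """Return value with the listed bit positions inverted (XOR with a mask)."""
    mask = 0
    for p in positions:
        if not 0 <= p < width:
            raise ValueError(f"bit {p} is outside a {width}-bit word")
        mask |= 1 << p
    return (value ^ mask) & ((1 << width) - 1)


def hamming_distance(a, b):
    """Number of bit positions in which two words differ."""
    return bin(a ^ b).count("1")


stored = 0b1010
single = flip_bits(stored, [0])        # one upset bit
multiple = flip_bits(stored, [3, 7])   # two upset bits in the same word
```

Because the flip is an XOR, applying the same mask a second time restores the original word. This is consistent with the observation that a soft error corrupts data without damaging the device.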
The soft error rate (SER) is defined as the rate at which soft errors occur in a device. The SER is commonly expressed as the number of failures in time (FIT) or as the mean time between failures (MTBF). In most embedded digital applications without a microprocessor, the flip-flops are the main contributors to the SER [1].
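The two SER metrics are related by a simple conversion, sketched below. The sketch relies on the standard definition of one FIT as one failure per 10^9 device-hours and assumes a constant failure rate; summing the FIT values of components additionally assumes independent components whose individual failures each cause a system failure. The function names are hypothetical.

```python
HOURS_PER_FIT_UNIT = 1e9  # 1 FIT = 1 failure per 10^9 device-hours


def fit_to_mtbf_hours(fit):
    """Convert a failure rate in FIT to the MTBF in hours."""
    if fit <= 0:
        raise ValueError("FIT must be positive")
    return HOURS_PER_FIT_UNIT / fit


def mtbf_hours_to_fit(mtbf_hours):
    """Convert an MTBF in hours to a failure rate in FIT."""
    if mtbf_hours <= 0:
        raise ValueError("MTBF must be positive")
    return HOURS_PER_FIT_UNIT / mtbf_hours


def system_fit(component_fits):
    """Total FIT of a system, assuming independent components in series."""
    return sum(component_fits)
```

For example, under these assumptions a block rated at 1000 FIT has an MTBF of 10^6 hours, which is roughly 114 years of continuous operation.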

ISSN: 2302-9285 
A Survey of Fault-Injection Methodologies for Soft Error Rate Modeling in SoC (Seung Eun Lee) 171

Cause of Hardware Faults
Understanding the types of faults and how they occur is essential for fault modeling and diagnosis. Most causes of hardware faults stem from advances in process technology. As process technology advances, the probability of encountering process variations or external noise sources (alpha particles or neutrons in cosmic rays) increases in the semiconductor manufacturing process, leading to soft errors. Similarly, VLSI circuits that operate at high speeds and low supply voltages are susceptible to process variations; thus, these circuits have a higher error probability owing to the switching delay of transistors. In addition, the number of defects per chip area in a VLSI process increases with the number of on-chip transistors, which increases the probability of failure. Thus, process-technology improvements result in both hard and soft errors.

Cosmic-Ray Particles
Cosmic rays cause soft errors in the system. Most cosmic rays do not reach the Earth's surface. However, cosmic rays produce energetic secondary particles such as neutrons and protons by collision with a nucleus in the Earth's atmosphere. A neutron by itself cannot interfere with the circuit; however, it is absorbed by the nucleus and causes a "neutron capture" reaction that emits alpha particles. These alpha particles generate an incorrect value when they collide with the circuit. Neutrons also originate from nuclear-fission reactions or from the creation and destruction of radioactive nuclei. Alpha particles originate from various radioisotopes during radioactive decay and are detected in materials such as glasses, fillers, alumina, plastic, and even in the sea [2].
A study showing that cosmic rays potentially affect devices was presented in 1962 [3]. Communication disruption due to cosmic rays actually occurred, and the cosmic-ray event rate was calculated in 1975 through experiments using a scanning electron microscope [4]. Devices have become more vulnerable to neutrons and alpha particles from cosmic rays because of the development of process technology [5]. The increased susceptibility of circuits to atmospheric neutrons was confirmed by comparing the neutron-induced SER across scaled CMOS transistor sizes [6]. By checking the SER caused by alpha particles and radiation, it was observed that circuits become vulnerable to alpha particles when the operating voltage of the devices is lowered in sequential logic, static combinational logic, and SRAM [7]. Reference [8] confirmed that the multi-bit error rate for 90-nm SRAM was slightly higher than that for 130-nm SRAM.

Noise Sources
Layman and Chamberlain demonstrated that the various noise sources that cause soft errors are thermal, shot, and 1/f noise [9]. Thermal noise is caused by heat when the charge carriers (electrons or holes) move erratically in the capacitor. Thermal noise affects the semiconductor threshold voltage and flips the original value in the logic, resulting in a soft error. Thermal noise can be modeled as a voltage or a current [10]. Shot noise is generated when the carriers pass over the potential barrier in a semiconductor and the number of carriers becomes irregular. Because the direction and speed of the electron motion are irregular, each carrier introduces a disturbance in the semiconductor. The 1/f noise is caused by conductance fluctuation, which is inversely proportional to the frequency. The 1/f noise in the internal components increases significantly in the low-frequency region and decreases in the high-frequency region. Thus, these additional noise sources erode the noise margins of the semiconductors and increase the SER.
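To give a sense of the magnitudes involved, the sketch below evaluates the standard textbook expressions for these noise sources. The formulas are general circuit-theory results and are not taken from [9] or [10]; the function names and the flicker coefficient `k_f` are assumptions made for this example.

```python
import math

K_B = 1.380649e-23      # Boltzmann constant in J/K
Q_E = 1.602176634e-19   # elementary charge in C


def thermal_noise_vrms(resistance_ohm, temp_k, bandwidth_hz):
    """Thermal noise modeled as a voltage: sqrt(4 k T R B)."""
    return math.sqrt(4 * K_B * temp_k * resistance_ohm * bandwidth_hz)


def thermal_noise_irms(resistance_ohm, temp_k, bandwidth_hz):
    """Thermal noise modeled as a current: sqrt(4 k T B / R)."""
    return math.sqrt(4 * K_B * temp_k * bandwidth_hz / resistance_ohm)


def ktc_noise_vrms(capacitance_f, temp_k):
    """Thermal noise voltage sampled on a capacitor: sqrt(k T / C)."""
    return math.sqrt(K_B * temp_k / capacitance_f)


def shot_noise_irms(dc_current_a, bandwidth_hz):
    """Shot noise current: sqrt(2 q I B)."""
    return math.sqrt(2 * Q_E * dc_current_a * bandwidth_hz)


def flicker_psd(freq_hz, k_f=1.0):
    """1/f noise power spectral density with a hypothetical coefficient k_f."""
    return k_f / freq_hz
```

With these expressions, a 1-fF node at 300 K carries a thermal noise voltage of about 2 mV rms, which indicates why small node capacitances combined with low supply voltages leave less noise margin.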

Critical Charge
Critical charge is the minimum amount of charge required to change the state of a semiconductor node. When at least the critical charge is collected, the logic value changes. As the semiconductor feature size decreases, the charge required to upset the logic also decreases, and the device becomes more susceptible to soft errors. Similarly, the critical charge has been confirmed to decrease under lower operating voltages and smaller feature sizes [11]. Reference [12] confirmed that the SER depends on several factors, including the critical charge. A device-level 3D simulation was performed to model the relationship between the bit error rate and the critical charge values in 90-nm SRAM [13].
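The dependence of the SER on the critical charge can be sketched with a first-order model. The sketch assumes that the critical charge is approximately the node capacitance multiplied by the supply voltage and that the SER falls off exponentially with the critical charge, a form commonly used in empirical SER models. Neither assumption is taken from [11]-[13], and the parameter names (`q_s_fc` for the charge-collection parameter, `flux`, `area`) are hypothetical.

```python
import math


def critical_charge_fc(node_cap_ff, vdd_v):
    """First-order estimate: Q_crit ~ C_node * V_dd (fF * V = fC)."""
    return node_cap_ff * vdd_v


def relative_ser(q_crit_fc, q_s_fc, flux=1.0, area=1.0):
    """Assumed empirical form: SER proportional to flux * area * exp(-Q_crit / Q_s)."""
    return flux * area * math.exp(-q_crit_fc / q_s_fc)


# Lowering the supply voltage from 1.0 V to 0.8 V on an assumed 2-fF node
ser_nominal = relative_ser(critical_charge_fc(2.0, 1.0), q_s_fc=1.0)
ser_low_vdd = relative_ser(critical_charge_fc(2.0, 0.8), q_s_fc=1.0)
ser_increase = ser_low_vdd / ser_nominal
```

Under these assumptions, `ser_increase` equals exp(0.4), i.e., the modeled SER rises by roughly 50% for the lower supply voltage. The absolute values are meaningless without a calibrated flux, area, and collection parameter.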

Crosstalk
Crosstalk is electrical interference that occurs when the distance between two conductors is sufficiently small. The narrowing of this distance in deep submicron technology causes electrical distortion and adversely affects reliability. The high-frequency operation of VLSI causes a skin effect, in which the current propagates along the surface of a conductor [14]. This skin effect causes frequency-dependent interconnect resistance. Reliability problems appear more readily across a circuit because the signal interference from crosstalk increases as transistor and interconnect dimensions shrink [15]. Most designs potentially encounter soft errors from the RC delay, noise interference, and crosstalk [16]. To investigate the crosstalk properties, coupled RLC parameter values on four different interconnects were measured for the 0.13-µm and 0.18-µm processes [17].

NBTI
Negative bias temperature instability (NBTI) is a type of aging. The time delay of a circuit increases in proportion to the transistor threshold voltage (Vth). NBTI leads to timing errors because the initial value of Vth for a PMOS transistor varies with the negative bias and temperature of a circuit that has been used for a long time. This phenomenon was observed in 1967 [18]. Reference [19] demonstrated that a longer exposure time to a negative voltage at the gate results in a larger fluctuation in the threshold voltage. Further, a larger change in Vth results in more occurrences of the critical timing problem.
Schroder and Babcock presented many process conditions that affect the NBTI sensitivity, such as oxide damage; the temperature; the oxide electric field; the presence of hydrogen, boron, nitrogen, and water; and the gate length [20]. Reliability under NBTI decreases significantly when the transistor operates at a high temperature, has a small gate length, and has a large content of boron, nitrogen, hydrogen, or water. Hydrogen was shown to increase NBTI in [21]. Similarly, the lifetime of a semiconductor is significantly reduced because the change in Vth depends on the boron content of the gate oxide and on the gate length [22]. Under NBTI stress, the fluctuation in Vth is larger in nitrided oxide than in pure SiO2 [1]. Water can also affect Vth when the gate oxide layer is formed. As the size of CMOS devices is gradually reduced with the development of process technology, nitrided oxide is being used in the gate instead of the existing SiO2 to thin the gate insulator film and improve the performance. However, the thin nitrided oxide is very sensitive to NBTI stress; thus, the PMOS transistor acquires defects more easily than one using the existing SiO2 [1].

Fault Injection (FI) Techniques
FI is adopted to verify the reliability of a system or to perform fault modeling. In this manner, the parts of the system that are sensitive to faults and any lack of fault tolerance can be identified to create a resilient design. The basic environment of the FI method includes an FI system and the target system (see Figure 2). The FI system interacts with the target system for fault generation, control, and fault analysis. The FI methods can be classified into four techniques: hardware-based FI, software-based FI, simulation-based FI, and emulation-based FI.

Hardware-Based FI
Hardware-based FI is the most realistic method; it subjects the target system to faults at the physical level and measures the occurrence of failures (see Figure 3(a)). The circuit is tested by changing the operating power or temperature or by applying external shocks that cause transient errors. Moreover, this technique can directly apply a stimulus at the pins or the sockets. The testing speed is fast owing to the real-time FI structure. By directly changing the environment, a wide range of circuits can be evaluated through these disturbances. However, the process is difficult to monitor and control because the exact moment at which the disturbance injects a fault is unknown. In addition, the target system can be damaged, and the actual circuit cannot be restored after testing [23]. A circuit was validated with a pin-level FI tool (MESSALINE) by deriving experimental measurements such as defective time distribution and size [24]. Wang tested the fault-tolerance capability of software against changes in the power supply and payload of a satellite's on-board computer [25]. He injected the faults through a cable and monitored the changes in the output port.
Laser injection schemes are also available. The reliability of time-resolved ICs exposed to a pulsed laser was evaluated [26]. The SER was confirmed by calculating and normalizing the cumulative error histogram in accordance with the laser pulse delay of the circuits. Pin-level FI was conducted to verify a fault-tolerant multiprocessor system (FASST) [27]. FASST performed a fast fail-silent technique that analyzed the error detection coverage and latencies. A method that used a high-intensity laser on microcontrollers was proposed [28]. Its drawback was that the disturbance of the circuit could not be completely controlled in the experiment.

Software-Based FI
Software-based FI causes a software fault by modifying the execution code of the software actually running in the system (see Figure 3(b)). Software-based FI is practical because the hardware and software of the actual device are used, and no additional hardware is required to inject a fault. However, the method is limited in the types of faults that software can inject. In addition, detailed information on the hardware and software is necessary to model and control the fault.
Wulf et al. used MATLAB to inject faults into a cache with a software-based FI tool for multi-core devices [29]. This tool can generate a cache error by randomly injecting faults into the data accessed by the load instructions. Roberto et al. suggested a method for selecting appropriate fault locations from an analysis of the circuit complexity through 3.8 million experiments [30], which significantly reduced the size of the fault load and improved the performance. A dynamic software fault injection system that targeted the Apache Web server was proposed using the PIN framework, a dynamic binary instrumentation tool from Intel [31]. To test the reliability of the server, this tool injected faults dynamically after recording the information of the fault locations. The work in [32] injected faults into microprocessors and the main memory circuits. Several FI methods were evaluated for software fault tolerance that detects and masks hardware errors, and the results were then compared. Several fault models were applied to a communication channel between the serial port driver and the OS kernel to evaluate the effect of software FI on the system [33]. The tests reported the average execution time, implementation complexity, coverage, and injection efficiency in detail.
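The general idea of injecting random faults into the data accessed by load operations can be sketched as a small campaign. The code below is not the tool of [29] or of any other cited work. It uses a hypothetical toy workload (the maximum of a byte array guarded by a range check), flips one random bit of one loaded value per run, and classifies each run as masked, silent data corruption (`sdc`), or detected. The word width, the workload, and all names are assumptions.

```python
import random
from collections import Counter

WIDTH = 16  # assumed word width of the toy memory


def workload(load, n):
    """Toy workload: maximum of n bytes, with a range check as error detector."""
    best = 0
    for i in range(n):
        value = load(i)
        if not 0 <= value < 256:
            raise ValueError("range check failed")
        best = max(best, value)
    return best


def run_once(memory, fault=None):
    """Run the workload; fault=(index, bit) flips one bit of one loaded value."""
    def load(i):
        value = memory[i]
        if fault is not None and i == fault[0]:
            value ^= 1 << fault[1]
        return value
    return workload(load, len(memory))


def campaign(memory, trials, seed=0):
    """Inject one random single-bit fault per trial and classify the outcome."""
    rng = random.Random(seed)
    golden = run_once(memory)  # fault-free reference result
    outcomes = Counter()
    for _ in range(trials):
        fault = (rng.randrange(len(memory)), rng.randrange(WIDTH))
        try:
            result = run_once(memory, fault)
        except ValueError:
            outcomes["detected"] += 1
            continue
        outcomes["masked" if result == golden else "sdc"] += 1
    return outcomes
```

In this toy setting, a flip in the low bits of a non-maximal element is masked, a flip in the maximal element silently corrupts the result, and a flip in any of the upper eight bits is caught by the range check. Comparing each run against a fault-free golden run is what allows the campaign to separate these outcomes.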

Simulation-Based FI
Simulation-based FI injects a fault into the design and observes the failure using computer simulation tools (see Figure 3(c)). Simulation-based FI operates along with the actual workload in the software program and can be used in every stage of design for functional verification. Reliability can be verified together with the functional verification of the design. Fault modeling and control can be performed without damaging the real system. Moreover, simulation-based FI can change the data at any location thanks to its superior accessibility. Additionally, setting up the environment is cheap because additional hardware is not required. However, simulation-based FI suffers from a long simulation setup process and a long simulation time.
A fault injector that injects board-level component faults was implemented for board-level built-in test (BIT) software [34]; it is suitable for testing the reliability of a BIT system because it was created to address the lack of validation in the BIT software. A SystemC hardware simulation model that uses embedded benchmark software was proposed to reduce the hardware resources [35]. This model supports a mixed-level simulation conducted at the electronic system level and RTL. Ruano et al. used a simulation-based FI platform that models soft errors to evaluate the reliability of a system [36]. This platform has low circuit costs and high controllability and can be used with both synthesizable and non-synthesizable models. Reference [37] showed that faults can be injected at any location in RTL and gate-level designs; the tool supports a C function for adding new types of faults. Wang et al. tested a method that modifies the data of a processor using a full system simulator-based FI tool (FSFI) at the system level [38]. The FSFI can check processor components such as the integer register files, ALU, and decoder.
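The accessibility that simulation offers can be illustrated with a toy gate-level simulator. This is a minimal sketch and is unrelated to the tools in [34]-[38]. The two-gate netlist, the fault kinds (`sa0`, `sa1`, and `flip` for a transient inversion), and all names are hypothetical, and faults can be placed only on gate outputs.

```python
from itertools import product

# Hypothetical netlist in topological order: (output net, gate type, input nets)
NETLIST = [("n1", "and", ("a", "b")), ("y", "or", ("n1", "c"))]
GATES = {"and": lambda x, y: x & y, "or": lambda x, y: x | y}


def simulate(inputs, fault=None):
    """Evaluate the netlist; fault=(net, kind), kind in {'sa0', 'sa1', 'flip'}."""
    nets = dict(inputs)
    for out, gate, ins in NETLIST:
        value = GATES[gate](*(nets[i] for i in ins))
        if fault is not None and fault[0] == out:
            value = {"sa0": 0, "sa1": 1, "flip": value ^ 1}[fault[1]]
        nets[out] = value
    return nets["y"]


def propagation_rate(fault, names=("a", "b", "c")):
    """Fraction of input patterns for which the fault changes the output y."""
    patterns = list(product((0, 1), repeat=len(names)))
    hits = 0
    for bits in patterns:
        inputs = dict(zip(names, bits))
        if simulate(inputs, fault) != simulate(inputs):
            hits += 1
    return hits / len(patterns)
```

For this netlist, a transient inversion on the internal net `n1` reaches the output only when `c` is 0, so `propagation_rate(("n1", "flip"))` evaluates to 0.5. The remaining patterns are logically masked by the OR gate. Exhaustive pattern enumeration is feasible only for very small circuits, which is consistent with the long simulation times noted above.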

Emulation-Based FI
Emulation-based FI injects faults into a design implemented in an FPGA (see Figure 3(d)). Emulation-based FI was proposed to overcome the long simulation time of simulation-based FI. Diagnosis can be performed quickly with real-time or partial reconfiguration. However, emulation-based FI is constrained by the precondition that the target design must be optimized in the FPGA before the experiment. Further, flexibly checking the response of the target design to a failure is difficult.
An FI method for any microprocessor implemented on an FPGA with an on-chip debugger (OCD) and a JTAG interface was implemented to mitigate the time bottleneck of an OCD built into a processor for debugging [39]. This implementation combined hardware and software FI in the FPGA design. The OCD-based method is a balanced technique in terms of


ISSN: 2089-3191 Bulletin of EEI Vol. 5, No. 2, June 2016: 169-177
A permanent fault (stuck-at, stuck-open, and bridging faults) remains permanently in the circuit, a transient fault appears and disappears within a brief time, and an intermittent fault repeatedly corrupts data in a specific place because of hardware damage. Permanent and intermittent faults occur because of inaccurate specifications, implementation mistakes, or component defects. A transient fault usually occurs because of internal and external noise.

Figure 1. Block diagram of fault and error terminology focused on the hardware fault.