Single event effects (SEEs) in FPGAs, ASICs, and processors

by Dagan White , TechOnline India - January 12, 2012

This article has broad applicability to any industry in which safety and reliability are of critical importance. It should be useful to a wide audience comprised of system architects, engineering and program managers, and certification authorities.

Single-event effects (SEEs) are a growing concern in high-reliability system development, yet users of application-specific integrated circuits (ASICs) and field-programmable gate arrays (FPGAs) differ widely in their understanding of how susceptible their designs might be.

The avionics and industrial system development guidance that currently exists is only broadly beginning to consider SEEs and their impact on system reliability. Unfortunately, standards such as DO-254, DO-178, ARP 4754, ARP 4761, and IEC 61508 provide little or no direction on how to handle SEEs. This white paper highlights concerns regarding effects of SEEs on ASICs and FPGAs and points to analysis and mitigation techniques for handling SEEs.

All sub-micron integrated electronics devices are susceptible to SEEs to some degree. The effects can range from transients causing logical errors, to upsets changing data, to destructive single-event latch-up (SEL). Traditionally, FPGAs were targeted as being more sensitive due to their use of SRAM for the configuration storage. As dimensions shrink to below 90 nm, SEEs in all devices, including ASICs, FPGAs, and application-specific standard products (ASSPs) must be considered.

Although targeted to an avionics audience, this article has broad applicability to any industry in which safety and reliability are of critical importance. It should be useful to a wide audience comprised of system architects, engineering and program managers, and certification authorities.
 
Understanding the SEE phenomenon

SEEs result from the interaction of high-energy particles with circuit elements in integrated circuits. When a high-energy particle passes through the silicon substrate of a device, charged particles are created as the result of sub-atomic particle collisions. These charged particles are generated in an ionization trail along the path of the incoming particle.

If a charged particle impacts at or near a transistor junction, the collected charge can induce an upset to the state of that transistor. If the collected charge is larger than the critical charge of the element, the element changes state. This change in state (or bit flip, in the case of a memory cell) is referred to as a single-event upset (SEU). Similarly, the charged particles can induce a current and voltage spike on a metal interconnect, which is referred to as a single-event transient (SET). If the pulse width of the spike is wide enough, the spike can propagate through the circuit (see Types of Single Event Effects).

Sources of charged particles

Two sources of charged particles are of concern to designers of high-reliability systems: cosmic ray interactions with the atmosphere, and impurities in packaging materials and the silicon substrate.

Atmospheric sources

Galactic cosmic rays (GCR) originate in outer space, are primarily comprised of subatomic particles and light ions, travel at nearly the speed of light, and strike Earth from all directions. As high-energy cosmic rays enter the atmosphere and react with atoms, through a process known as direct nuclear spallation, neutrons are generated in the atmosphere. The result of this phenomenon is often referred to as an air shower. Neutrons with energy greater than 10 MeV carry sufficient energy to cause SEEs in integrated circuits.

Atmospheric depth (density) also plays a significant role in causing neutron-generating reactions and in transporting neutrons to ground level. An intense neutron environment exists at higher altitudes in the atmosphere, 10 km to 40 km and more above the surface. In addition, Earth's magnetic field causes the flux to vary from the equator to the poles, with the equator having the least flux and the poles having the greatest flux. The magnetic field of the sun as it varies during the sunspot cycle also influences the flux of cosmic rays; maximum flux occurs at a solar minimum, for example.{1}

Packaging material impurities

Packaging materials used for integrated circuits often contain impurities, including trace amounts of uranium and thorium isotopes, which emit alpha particles as they decay. Although these particles are low in energy and have limited penetration depth, they are a concern for integrated circuits due to their close proximity to the silicon substrates.

Another source of alpha particles in packaging is the eutectic lead solders used for the solder bumps in flip-chip packaging. Even if the solder is purified of other radioactive impurities, it is impossible to remove the lead isotope 210Pb. Although 210Pb is not an alpha emitter per se, its decay chain contains the strong alpha emitter 210Po.

Substrate impurities

The element boron used in borophospho-silicate glass (BPSG) is another source of ionizing radiation. When one of the common boron isotopes, 10B, is struck by low-energy neutrons, an alpha particle and a lithium ion are generated. Given the significant amount of boron present in substrates plus the number of low-energy neutrons in the air shower, the effect is significant.

Types of single-event effects

A number of events fall under the general category of SEEs (see Figure 1). These events or errors can be divided into two broad categories: soft errors versus hard errors. Soft errors are those events that have no damaging effects and are cleared by normal device operation. Hard errors are events that generally result in lasting damage to the circuitry.

Figure 1: Types of single event effects

Soft (recoverable) errors

Soft errors are upsets to the device operation and are self-correcting in time or are correctable by rewriting a memory element. The three subclasses of soft errors are:

• Single-event transients (SETs) result when a high-energy particle impacts a combinatorial path of a device and can induce a voltage/current spike. If the pulse-width of this spike is sufficient and at the right time, it can propagate through the circuit.
• Single-event upsets (SEUs) are the result of high-energy particles causing a change in the state of a memory element (SRAM, flash, flop, or latch). SEUs can be categorized as single-bit or multi-bit upsets (SBUs or MBUs). SBUs are by far the most common SEE seen in avionics applications.
• Single-event function interrupts (SEFIs) are disruptions to normal device operation that fall beyond a simple corruption of user data. These types of effects alter the functionality of the circuit and typically require reconfiguration/reset or power cycling for recovery.

Note: Failures-in-time (FIT) rates are commonly discussed in relation to SEUs, SETs, and SEFIs, but these are soft errors that affect the functionality and not permanent failures of the device.

Hard (non-recoverable) errors

Errors that cause lasting damage to the device are classified as hard errors. The three subclasses of hard errors are:

• Single-event latch-up (SEL) is a circuit latch-up induced by radiation. This latch-up can be either permanent or clearable with power cycling.
• Single-event burnout (SEB) is a short-circuiting caused when a high-energy ion impacts a transistor source, causing forward biasing. SEBs are typically a threat to power MOSFETs but are also seen in IGBTs, high-voltage diodes, and similar circuits.
• Single-event gate rupture (SEGR) is a plasma spike caused by a high-energy ion impact, resulting in rupture of the gate oxide insulation.

FPGAs can be protected from SEL, SEGR, and SEB caused by neutron radiation through careful fabrication and engineering processes that take design, process, and technology variables into account. Likewise, space-grade parts can be rad-hard by design, and as such they can be made immune to latch-up from heavy ions. The process adds significant design challenge and device cost, however. Heavy ion radiation is not an issue inside Earth's atmosphere, so space-grade parts are not necessary for avionics applications.

ASICs, FPGAs, and SEUs

DRAM was the first technology for which terrestrial SEU became a concern, but these devices are now fairly robust. SRAM soft error rates (SERs) then became a concern, and that concern has persisted: although the per-bit SER has held steady as feature sizes have decreased, the total number of SRAM bits per system or device has increased greatly. SRAM is used inside stand-alone memory devices as well as in FPGAs. Concern over FPGAs arises from their use of SRAM for user block memory as well as device-configuration memory. With the latest sub-90 nm technology nodes, concern over ASIC upset rates is rising.

SRAM-based FPGAs hold the device routing in a configuration memory, and they use block RAMs for user memory. Both of these memory structures, along with flip-flops, can be upset by radiation, although at different rates. User block RAM can be protected with error-correcting code (ECC) and parity schemes, as can external memory devices. FPGA configuration memory, however, cannot be directly protected in the same manner as block memory via ECC or parity checks. SEU mitigation techniques that monitor device configuration memories are recommended for FPGA designs. Ideally, a device should have built-in configuration memory error detection capabilities (using ECC) and SEU mitigation IP available to monitor and repair configuration memory. Other FPGA structures are upsettable as well but at an insignificant rate.
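
As a rough illustration of how ECC protects a word of user memory, the following Python sketch implements a textbook Hamming(7,4) code, which corrects any single-bit upset in a 7-bit codeword. It is not the ECC scheme of any particular FPGA block RAM or memory device, and the function names are hypothetical.

```python
def hamming74_encode(nibble):
    """Encode 4 data bits (0..15) into a 7-bit Hamming codeword.

    Codeword positions 1..7 hold: p1 p2 d0 p4 d1 d2 d3
    (bit i of the returned integer holds position i + 1).
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    p1 = d[0] ^ d[1] ^ d[3]   # covers positions 1, 3, 5, 7
    p2 = d[0] ^ d[2] ^ d[3]   # covers positions 2, 3, 6, 7
    p4 = d[1] ^ d[2] ^ d[3]   # covers positions 4, 5, 6, 7
    bits = [p1, p2, d[0], p4, d[1], d[2], d[3]]
    return sum(b << i for i, b in enumerate(bits))


def hamming74_decode(word):
    """Return (data, syndrome). A non-zero syndrome is the position
    (1..7) of the single bit that was found upset and corrected."""
    bits = [(word >> i) & 1 for i in range(7)]
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos - 1]:
            syndrome ^= pos
    if syndrome:
        bits[syndrome - 1] ^= 1          # repair the flipped bit
    data_bits = [bits[2], bits[4], bits[5], bits[6]]
    data = sum(b << i for i, b in enumerate(data_bits))
    return data, syndrome


# Illustrative use: store a value, flip one bit to mimic an SEU, read it back.
stored = hamming74_encode(0b1011)
upset = stored ^ (1 << 4)                # single-bit upset at position 5
recovered, where = hamming74_decode(upset)
```

A plain parity bit, by contrast, would only detect the upset; it could not locate or repair it.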

Note: The SRAM cells used for the configuration memory of FPGAs should be larger and more robust than the SRAM cells used for general-purpose memory, which are optimized for speed and cost. Moreover, configuration memory cells should be optimized for SEU resistance.

SEE concerns in ASICs

SEE concerns in ASICs have risen because of the decreased operating voltages and element capacitance combined with increased clock speeds. These factors mean that transient upsets are more likely to occur and can easily translate to clocked functional errors. SERs can now easily exceed 50,000 FIT per processor, including logic gates and on-chip memory. System-level consideration and mitigation techniques are necessary for ASICs.{2} Other data shows that ASIC designs below 90 nm have exhibited 1,000 FIT per million gates, and 1,000 FIT per million memory bits.{3} User memory can be protected, but logic upsets, which can account for a substantial portion of the upset rate, cannot be easily protected. Logical SETs, when latched, can lead to logic errors and consequently are no longer negligible in processors manufactured on deep sub-micron processes. System-level solutions are required.{4}

At the same time that ASICs have become more susceptible to upset, some FPGAs have been designed for improved immunity and lower soft error FIT rates. In fact, Xilinx devices at 65 nm and below have shown improved immunity, with nominal rates on the order of less than 100 FIT/Mb for configuration memories and below 500 FIT/Mb for user memories.{5}

For both ASICs and FPGAs, there are non-zero error rates, non-zero detection times, and non-zero correction times. It is imperative to consider SEEs both when using ASICs and when using FPGAs in any high-reliability application. Designers should seek vendors that provide information to assist them in analyzing system FIT rates and estimating the FIT rate for their targeted device.{5,6} Exact processor FIT rates can also be tricky to determine, requiring a combination of analysis, simulation, and beam testing.

Beyond vendor data, some airframe manufacturers have SEE models or estimations that they apply and levy broadly across multiple vendors' technologies. This approach provides a rough estimate for those vendors that do not supply data, but this approach is risky. Products fully tested and characterized provide a safer solution.{7}

SEE rates are probabilistic and vary with geographic location, environmental conditions, and altitude. All FIT rates are estimates based on modeling, analysis, and/or testing, but not all published FIT rates are created equal. It is important that data comply with the JEDEC Standard 89A (JESD89A), which Xilinx played a role in updating.{6} The Xilinx FIT rate calculator applies the models from JESD89A with FPGA FIT rate data to yield an adjusted, application-specific FIT rate.

Not all radiation testing is created equal either. Particle test beams can vary in their energy and particle type, for example. To counteract this variability, companies need to use control devices when conducting beam testing to adjust for test setup and beam variation from run to run. Flight tests can capture real-world data regarding FIT rates, but geographic location and timing relative to solar activity can cause variability in the data. Ultimately, soft-error FIT rates are estimates that enable the developer to assess the probability of fielded system upset rates.

Systems that use sub-90 nm devices such as ASICs and FPGAs in any avionics or high-reliability application must adopt proper techniques to mitigate the susceptibility of these technologies to SEEs. In part two of this article, continued after References below, we will discuss approaches to mitigate the effects of SEEs.

 

References

1. “NSEU Mitigation in Avionics Applications,” Xilinx application note XAPP1073 (2010).
2. Robert Baumann, “Soft Errors in Advanced Computer Systems,” IEEE Design & Test of Computers, copublished by the IEEE CS and the IEEE CASS (2005).
3. “Xilinx FPGAs Overcome the Side Effects of Sub-90 nm Technology,” Xilinx whitepaper WP256 (2011).
4. Rebaudengo, Reorda, et al., “Coping With SEUs/SETs in Microprocessors by Means of Low-Cost Solutions: A Comparison Study,” IEEE Transactions on Nuclear Science (2002).
5. UG116, Device Reliability Report
6. “Continuing Experiments of Atmospheric Neutron Effects on Deep Submicron Integrated Circuits,” Xilinx whitepaper WP286 (2011).
7. Xilinx Space website

 

Part one (above) of this two-part article covered types of single-event effects (SEEs), the challenges they present, and the susceptibility of application-specific integrated circuits (ASICs) and field-programmable gate arrays (FPGAs) used in terrestrial systems. In part two, we will discuss ways to mitigate these effects.

To understand the various mitigation approaches for SEEs, we can examine several scenarios. Consider a processor having a failure-in-time (FIT) rate of 600 at sea level in New York, NY, corresponding to a mean time between failures (MTBF) of roughly 190 years. While an MTBF of this magnitude can seem insignificant, if 1,000 systems are fielded, then the combined MTBF of all systems drops to 70 days—one upset every 70 days on average. This rate might not be tolerable for high-reliability systems such as networking routers or those used in industrial applications.

Alternatively, let's examine an application at high altitude. A FIT rate of 600 at sea level in New York corresponds to a rate of 367,200 at an elevation of 40,000 feet over the poles, which represents an MTBF of roughly 110 days for a single fielded unit. Flying a hundred units results in roughly one upset per day. In other words, one system in the air has nearly the same magnitude of upset rate as 1,000 systems on the ground.
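
The arithmetic behind these two scenarios can be sketched in a few lines of Python. The sketch assumes the usual definition of FIT as failures per billion device-hours and a year of 8,766 hours; the function name mtbf_hours is only illustrative. The FIT values of 600 and 367,200 come from the text above.

```python
HOURS_PER_DAY = 24
HOURS_PER_YEAR = 8766          # 365.25 days


def mtbf_hours(fit, units=1):
    """Mean time between upsets, in hours, for `units` identical systems.

    Assumes FIT means failures per 1e9 device-hours.
    """
    return 1e9 / (fit * units)


# Ground scenario: FIT of 600 at sea level in New York.
ground_single_years = mtbf_hours(600) / HOURS_PER_YEAR
ground_fleet_days = mtbf_hours(600, units=1000) / HOURS_PER_DAY

# Airborne scenario: FIT of 367,200 at 40,000 feet over the poles.
air_single_days = mtbf_hours(367_200) / HOURS_PER_DAY
air_fleet_days = mtbf_hours(367_200, units=100) / HOURS_PER_DAY

print(f"ground, 1 unit:     {ground_single_years:.0f} years")
print(f"ground, 1000 units: {ground_fleet_days:.0f} days")
print(f"air, 1 unit:        {air_single_days:.0f} days")
print(f"air, 100 units:     {air_fleet_days:.1f} days")
```

Run as written, the sketch gives about 190 years and 69 days for the ground cases, and about 113 days for a single airborne unit, which the text rounds to 110 days.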

Both the memory and logical structures in ASICs are susceptible to SEEs, especially at sub-90-nm technology nodes. Similarly, FPGA configuration memory and user block memory are upsettable. This susceptibility does not mean that these technologies are unsuitable for avionics and high-reliability systems; it means that SEEs should be considered in the development process and mitigation tactics must be employed. Designers should assess the following before making a final selection between an ASIC and an FPGA:

• Frequency of events—FIT rate and MTBF
• Detection time of events and means of detecting the event
• Recovery time after event detection
• Performance, area, and monetary cost of the mitigation solutions
• System performance and system design implications

When designing with both ASIC and FPGA solutions, the following fault detection and mitigation techniques should be considered:

• Soft-error mitigation IP (SEM IP)—good for FPGAs and soft processor only
• ECC or parity checks for user memories in both ASICs and FPGAs
• Software-implemented fault tolerance (SWIFT) for both soft and hard processor solutions
• Hardware mitigation solutions—lockstep operation, dual and triple module redundancy (DMR and TMR) for FPGA solutions or ASIC designs
• Watchdog timers
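
To make the difference between DMR and TMR concrete, here is a minimal Python sketch of the comparison logic. It models only the voting idea on integer words; the function names are hypothetical, and a real design would implement this in hardware with redundant copies of the logic.

```python
def dmr_mismatch(a, b):
    """DMR: two copies can detect a disagreement but cannot tell
    which copy is wrong, so the result is detection only."""
    return a != b


def tmr_vote(a, b, c):
    """TMR: bitwise majority of three copies masks an upset
    in any single copy."""
    return (a & b) | (a & c) | (b & c)


good = 0b1010_0110
upset_copy = good ^ 0b0000_1000      # one copy suffers a single-bit upset

detected = dmr_mismatch(good, upset_copy)
voted = tmr_vote(good, upset_copy, good)
```

Here detected is True and voted equals the uncorrupted word, which illustrates why TMR can keep operating through a single upset while DMR needs some other recovery action.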

All mitigation approaches should consider area, performance, detection time, and correction time balanced against fixed and variable costs as well as system safety and reliability costs. Effective FPGA SEE mitigation methods include:

• External watchdog timer with external handling control (lacks full device check)
• Full-device cyclic redundancy check (CRC) with external reset of FPGA (might upset operation when unnecessary)
• Full-device CRC with bit correction and flag to design (design can decide on further actions)
• Full-device CRC with correction and non-essential bit classification (ignores 66% of false positives). See Architectures and Refinement of FIT Rates for a description of essential bits.
• DMR and TMR design techniques, or lockstep operation (area hit)
• Additional built-in fault tolerance checks (custom generated)
• Safe state machines—“safe_implementation” and “when others” statement with recovery state
• Software-implemented fault tolerance (SWIFT) techniques (for processors)
• Memory protections using ECC or parity checks
• Flow checks, range checks, signatures, CRCs, parity, etc.
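
The safe state machine item above can be illustrated with a small Python model. In HDL this is usually expressed with a "when others" branch and a synthesis attribute; the sketch below only mimics that behavior and assumes a hypothetical 3-bit state register with four legal encodings, so that values 4 through 7 represent states reachable only through an upset.

```python
from enum import IntEnum


class State(IntEnum):
    IDLE = 0
    LOAD = 1
    RUN = 2
    RECOVER = 3      # dedicated recovery state


def next_state(raw_state, start, done):
    """Compute the next state from the raw register value.

    Any encoding that is not a legal state (for example after a
    bit flip) is routed to RECOVER, like a 'when others' branch.
    """
    try:
        state = State(raw_state)
    except ValueError:
        return State.RECOVER
    if state is State.IDLE:
        return State.LOAD if start else State.IDLE
    if state is State.LOAD:
        return State.RUN
    if state is State.RUN:
        return State.IDLE if done else State.RUN
    return State.IDLE    # RECOVER re-initializes and returns to IDLE


# A bit flip turns RUN (0b010) into the illegal encoding 0b110.
after_upset = next_state(0b110, start=False, done=False)
```

Without the recovery branch, an illegal encoding could leave the machine stuck in a state from which no defined transition exists.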

ASIC SEE mitigation methods include:

• External watchdog timers (can catch every time-dependent behavior)
• Architectural mitigation (costly solutions on top of increasingly costly technology nodes)
• SWIFT techniques (for processors)
• Memory protections using ECC or parity checks
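
A watchdog timer appears in both lists, so a minimal sketch of the idea may help. The Python class below is a software model with hypothetical names; timestamps are passed in explicitly so the behavior is easy to follow, whereas a real external watchdog is a separate hardware device that resets the monitored part.

```python
class Watchdog:
    """Model of a watchdog: the monitored system must call kick()
    at least once every timeout_s seconds."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_kick = 0.0

    def kick(self, now):
        """Record that the monitored system is still alive."""
        self.last_kick = now

    def expired(self, now):
        """True if no kick arrived within the timeout window."""
        return (now - self.last_kick) > self.timeout_s


wd = Watchdog(timeout_s=1.0)
wd.kick(now=10.0)
healthy = not wd.expired(now=10.5)   # kicked recently
hung = wd.expired(now=11.5)          # missed its deadline
```

A watchdog of this kind only notices that the system stopped responding in time; it says nothing about whether data or configuration has been silently corrupted.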

ASIC and processor robustness

With each successive process node, the cost of ASIC non-recurring engineering increases by $5 million or more. At the same time, the ASIC susceptibility to SEEs increases as operating voltage and elemental capacitance decrease. These smaller technology nodes are the critical enablers of power reduction and increased performance with higher clocking speeds. All of these aspects drive greater design density.

Larger and larger end-markets are now necessary to support the non-recurring cost of developing modern ASICs. TMR techniques, lockstepping, or whatever silicon mitigation techniques might be employed to enhance ASIC immunity to SEE are contradictory to the natural evolution of commercial-grade ASICs. Although these reliability enhancing features are desirable for high-reliability markets, they are not necessarily desirable for mainstream COTS markets.{1} Commercial markets might not care about the SEE frequency.

Boeing conducted a detailed study of the effects of SEE on the clock, flip-flop, and logic structures inside of a commercial-grade 90-nm standard-cell ASIC, with the conclusion that hardening techniques must be considered and applied differently across all circuit structures in the device to achieve an appropriately hardened ASIC suitable for avionics applications. This study identifies some of the complex considerations that go into hardening different elements of ASIC structure for the ultimate goal of building a SEE-robust ASIC.{2}

In lieu of hardware solutions or in combination with such solutions, SWIFT is one means of enhancing SEE handling in processor hardware. Much research has been conducted on SWIFT techniques, but more innovation might be required to turn the research into viable market solutions—and this burden will likely fall on the high-reliability market. Many of the techniques for handling SEUs in processors apply to both soft and hard processor solutions, a benefit for both ASIC and FPGA-based processors.

Many possible software methods are available to address processor upsets. Software techniques can include data-flow error monitoring and control-flow monitoring, but these techniques have not reached 100% coverage. Hardware techniques might include memory access checks, consistency checks, control-flow checks, watchdogs, and dynamic verification. Soft and hard processors may require different strategies in some cases. One study has shown that for a soft processor, a hybrid hardware and software approach can yield 100% fault detection with processing time overhead around 150% of the non-mitigated design.{3}

Research work from Brigham Young University (BYU) assesses SWIFT techniques versus DMR/TMR techniques in terms of performance and area solution costs. While this research is geared for space-based applications and is focused on soft processors, the same concepts can be applied to terrestrial and airborne systems that use both soft and hard processors. This work shows that software-implemented techniques can achieve decent detection and correction rates versus DMR and TMR, with all solutions capturing greater than 90% of errors. SWIFT techniques do, however, lead to a performance hit nearing a factor of two. On the other hand, DMR and TMR are costly in terms of area, with 2.5X and 3.7X area hit respectively, but they do achieve greater detection rates with only minimal performance hits.{4} The designer needs to review the trade-offs when selecting mitigation methods.

 

Architectures and refinement of FIT rates

It is a tricky proposition to assess the effect of an SEU to an FPGA configuration bit for any specific end design. First, which bits are truly critical to a user design? And second, if a bit is critical to the design, is it critical to the function at the time that the bit upset occurs and prior to its correction?

Joint research work by BYU and Los Alamos National Laboratory (LANL) assesses the vulnerability of FPGA designs to configuration bit upsets and examines the bits that are critical to a design.{5} For those configuration bits that are critical, the research explores which ones, even if functionally corrected, might not correct a disturbance of the processing state. The study classifies bits as either persistent or non-persistent, referring to state-machine control bits or feedback bits that when upset can corrupt processing versus corruption of passing information in the datapath (such as corruption of a video display data bit). The results demonstrate that the proportion of persistent bits in a design depends on the design architecture.{5}

Similar questions arise when assessing SEEs in the logical structure that controls an ASIC. Intel Corp. and others have recognized this issue and have conducted research in an attempt to quantify an architectural vulnerability factor.{6,7}  The theory of the work applies to any soft or hard processor. Xilinx recognizes similar ideologies and has carried out similar research focused on FPGAs.

Generally, the FIT rate for the configuration memory of an FPGA is calculated simply by multiplying the FIT/Mb by the configuration memory size, after subtracting overhead bits and block RAM content. However, the results are overly pessimistic as only a maximum of 10% of the configuration bit upsets actually result in a functional failure in the design. Similarly, an SEU mitigation strategy that flags every upset to configuration memory as being critical results in many false positives.

Determining which bits are critical to a design is a time-consuming project, however, because it requires injecting faults into every configuration bit of an end design. To simplify the process, Xilinx developed essential bits technology. The essential bits output produces a list of bits that affect functionality of the design. In contrast, critical bits represent a subset of the essential bits that results in a functional failure in the design if upset. For example, an essential bit upset in a non-active area of the design (in higher order bits of a counter, a rarely used state, or test circuitry) does not result in a functional failure. The essential bits output is conservative but can still allow the user to rule out 66% or more of the configuration bits for a given design.

Using the essential bits output with SEM IP, which detects and corrects upsets, allows the system to ignore non-essential bit upsets. Non-essential bits are still corrected to prevent accumulation of errors, but the design can continue to operate without further intervention. If an essential bit is upset, then that bit is corrected and the user design can determine whether or not a device reset is prudent, depending on architectural knowledge of the design and the effects of persistent and non-persistent errors. Using this technology, the effective FIT rate of a full device is reduced to 33% or less.
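
The effect of these refinements on a configuration-memory FIT estimate can be sketched as follows. The 100 FIT/Mb figure is the upper bound quoted in part one for configuration memory, and the 33% and 10% factors come from the text above; the 50 Mb configuration size and the function name are purely hypothetical, and the vendor FIT rate calculator remains the proper tool for a real estimate.

```python
def config_fit(fit_per_mb, config_mbits, derating=1.0):
    """Configuration-memory FIT: rate per Mb times size, optionally
    scaled by the fraction of bits that matter to the design."""
    return fit_per_mb * config_mbits * derating


FIT_PER_MB = 100      # upper bound quoted for configuration memory
CONFIG_MBITS = 50     # hypothetical device, for illustration only

raw = config_fit(FIT_PER_MB, CONFIG_MBITS)                 # every bit counted
essential = config_fit(FIT_PER_MB, CONFIG_MBITS, 0.33)     # essential bits only
critical = config_fit(FIT_PER_MB, CONFIG_MBITS, 0.10)      # at most 10% critical

print(f"raw: {raw:.0f} FIT, essential: {essential:.0f} FIT, "
      f"critical: {critical:.0f} FIT")
```

The raw figure is the pessimistic estimate described earlier; the essential-bits figure is the conservative refinement, and the critical-bits figure is the bound that only full fault injection could confirm.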

Even if an essential and critical bit upset is corrected, an error can still propagate. DMR/TMR and other architectural techniques are required to guarantee uninterrupted operation. An upset that affects a feedback or decision path could propagate or place the design in an unintended mode prior to correction of the upset configuration bit. For this reason, short of robust architectural mitigations, it is prudent to correct all upset bits, and then, if it is an essential bit upset, internally reset the device. Xilinx is continuing to develop technologies that can enhance the fidelity of SEU responses.
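
The conservative policy described in this paragraph (correct every upset, and reset only when the upset bit is essential) can be written as a short decision function. This is a sketch with hypothetical names and bit addresses; it does not represent the interface of the SEM IP or of any essential bits output file.

```python
def handle_upset(bit_address, essential_bits):
    """Return the ordered list of actions for one detected upset.

    Every upset is corrected to prevent error accumulation; an
    upset in an essential bit additionally triggers a reset.
    """
    actions = ["correct"]
    if bit_address in essential_bits:
        actions.append("reset")
    return actions


# Hypothetical set of essential configuration bit addresses.
ESSENTIAL = {0x0012A0, 0x0012A1, 0x03F400}

non_essential_case = handle_upset(0x000010, ESSENTIAL)
essential_case = handle_upset(0x03F400, ESSENTIAL)
```

A design with robust architectural mitigation such as TMR could replace the reset step with a flag to the user logic, as the earlier list of FPGA methods suggests.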

 

Recommendations for designers

Xilinx enables users to employ various levels of SEU protection (see Mitigation Approaches) and recommends that designers:

1. Assess the soft error data for device families.{8}
2. Select a device family that supports SEM IP (Virtex®-5 FPGAs and later).{8}
3. Employ the SEU FIT Rate Calculator (available from Xilinx) to assess the soft error FIT rate and MTBF for the design and target device with the level of device utilization and environmental conditions that are expected. This is a preliminary assessment tool.
4. Complete the normal design process incorporating the SEM IP.
5. Simulate the design and use the SEU fault-injection simulation capability to verify the design. Also simulate forced invalid states in state machines.
6. Use the ISE® Design Suite 13.2 (or later) essential bits output data to assess the estimated SEU rate for the design. These are refinements that can be fed back into the FIT Rate Calculator to yield a more accurate estimate of the design FIT rate. The Essential Bits outputs can be used with certain versions of the SEM IP and target devices to reduce unnecessary handling of false-positive SEU hits.{8}

 
Systems that use sub-90 nm devices such as ASICs and FPGAs in any avionics or high-reliability application must adopt proper techniques to mitigate the susceptibility of these technologies to SEEs. FIT rate estimates can be used to assess the MTBF of these technologies so that the proper mitigation can be chosen at the device and system level. Any mitigation strategy ultimately needs to address trade-offs that include area, performance, detection time, and correction time. These factors need to be balanced against fixed and variable costs as well as system safety and reliability costs.

References

1. Nicholas J. Wang and Sanjay J. Patel, “ReStore: Symptom-Based Soft Error Detection in Microprocessors,” University of Illinois.
2. David L. Hansen, Eric J. Miller, Aj Kleinosowski, Kirk Kohnen, Anthony Le, Dick Wong, and Karina Amador, “Clock, Flip-Flop, and Combinatorial Logic Contributions to the SEU Cross Section in 90 nm ASIC Technology,” IEEE Transactions on Nuclear Science (2009).
3. Azambuja, Lapolli, et al., “Detecting SEEs in Microprocessors Through a Non-Intrusive Hybrid Technique,” IEEE Transactions on Nuclear Science (2011).
4. Nathaniel H. Rollins and Michael J. Wirthlin, “Software Fault-Tolerant Techniques for Softcore Processors in Commercial SRAM-Based FPGAs,” NSF Center for High-Performance Reconfigurable Computing (CHREC), Brigham Young University (2011).
5. Johnson, Morgan, et al., “Detection of Configuration Memory Upsets Causing Persistent Errors in SRAM-based FPGAs,” 7th Annual Military and Aerospace Programmable Logic Devices International Conference (2004).
6. Mukherjee, Weaver, et al., “A Systematic Methodology to Compute the Architectural Vulnerability Factors for a High-Performance Microprocessor,” Proceedings of the 36th International Symposium on Microarchitecture (2003).
7. Duan, Li, et al., “Versatile Prediction and Fast Estimation of Architectural Vulnerability Factor from Processor Performance Metrics,” Proceedings of the 15th IEEE International Symposium on High-Performance Computer Architecture (HPCA-15), Raleigh, NC (2009).
8. Xilinx Avionics website

About the author

Dagan White has over 15 years of multidisciplinary engineering experience within the A&D industry. His electronics and systems development work has spanned mixed-signal electronics hardware design and FPGA development for lidar systems and radiometers.  He has worked for Lockheed Martin Coherent Technologies and ITT Geospatial Systems, and is now with Xilinx as staff systems architect for avionics, tasked with leading FPGA avionics solutions development; current areas of focus include DO-254, SEU, IP, and design flows. Dagan holds a BSEE and an MBA from the University of Colorado.

Article Courtesy: Military, Aerospace DesignLine
