Functional safety poses challenges for semiconductor design

by Karl Greb and Riccardo Mariani, TechOnline India - May 15, 2011


To manage systematic and random failures, vendors have applied functional safety techniques at the system level for decades. As the capability to integrate multiple system-level functions into a single component has increased, there’s been a desire to apply those same practices at the semiconductor component or even subcomponent level. 

Although the state of the art in functional safety is not yet well aligned with the state of the art in semiconductors, recent work on the IEC 61508 second edition and ISO 26262 draft standards have brought improvements. Many challenges remain, however. 

Texas Instruments and Yogitech, a company that verifies and designs mixed-signal system-on-chip solutions, are working together to address these challenges both in standards committees and in new TMS570 microcontroller designs. (See Figure 1 below for an example of current-generation designs.)
 

Standards and analysis

All elements that interact to realize a safety function or contribute to the achievement of a semiconductor safety goal must be considered. Regrettably, the available standards aren’t consistent in application or scope. For example, IEC 61508 makes a general distinction between system design and software design, while ISO 26262 prescribes separate system, hardware component and software component developments.

So how should we consider reusable subcomponent modules such as an analog/digital converter or processor core? “Hard” modules, such as A/D converters, have a fixed implementation and can easily be developed according to hardware component guidelines. “Soft” modules, such as processor cores, are delivered as source code and have no physical implementation until synthesized into a design.

Because a “soft” module blurs the line between hardware and software components, some levels of quantitative safety analysis cannot be performed until the module is synthesized. Trial synthesis with well-documented targets is thus recommended to allow the calculation of reference safety metrics, so that potential users can evaluate a module’s suitability for their design.

To ensure functional safety, it is critical to understand the probability of random failure of the elements that constitute a safety function or that achieve a safety goal. In traditional analysis, each component in the safety function is typically treated as a black box, with a monolithic failure rate. 

Traditional failure rates are estimated based on reliability models, handbooks, field data and experimental data. Those methods often generate wildly different estimates; deltas of 10x to 1,000x are not uncommon. Such variation can pose significant system integration hurdles.

How can you perform meaningful quantitative safety analysis without component failure rates estimated under the same assumptions? One solution is to standardize failure rate estimation on a single generic model, such as the one presented in IEC TR 62380 (currently considered in the ISO 26262 draft guidelines). Another is to focus on ratiometric safety analyses: calculations that compare the ratio of detected faults to total faults rather than the absolute magnitude of the failure rate.
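
To make the idea of a ratiometric analysis concrete, the sketch below computes two such ratios, diagnostic coverage and an IEC 61508-style safe failure fraction, from a hypothetical partition of an element's failure rate into safe, dangerous-detected and dangerous-undetected portions. The figures are illustrative only and are not taken from any real device.

# Ratiometric safety metrics: ratios of detected-to-total failure rates,
# so that a uniform scaling error in the absolute failure rate cancels out.
# All failure-rate figures below are hypothetical, in FIT (failures per 1e9 h).

def diagnostic_coverage(lambda_dd, lambda_du):
    """Fraction of dangerous failures detected by safety mechanisms."""
    return lambda_dd / (lambda_dd + lambda_du)

def safe_failure_fraction(lambda_s, lambda_dd, lambda_du):
    """IEC 61508-style safe failure fraction: all failures except the
    dangerous undetected ones, relative to the total failure rate."""
    return (lambda_s + lambda_dd) / (lambda_s + lambda_dd + lambda_du)

lambda_s  = 120.0   # safe failures, FIT (hypothetical)
lambda_dd = 95.0    # dangerous failures detected, e.g. by a lockstep compare
lambda_du = 5.0     # dangerous failures left undetected

print(f"DC  = {diagnostic_coverage(lambda_dd, lambda_du):.1%}")
print(f"SFF = {safe_failure_fraction(lambda_s, lambda_dd, lambda_du):.1%}")

# Scaling every rate by the same factor (as a different reliability handbook
# might) leaves both ratios unchanged, which is the point of the approach.
assert abs(diagnostic_coverage(10 * lambda_dd, 10 * lambda_du)
           - diagnostic_coverage(lambda_dd, lambda_du)) < 1e-12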

Figure 1. TMS570 transportation MCUs provide onboard safety mechanisms, such as lockstep CPU compare; ECC, which protects both memories and interconnect; and innovative built-in self-test controllers to simplify system safety development.


Individual elements

Performing safety analysis at the device level introduces a problem of failure rate allocation. Given only a single probability of failure for a die, how can you determine the failure rate of an individual design element, such as a CPU? Lacking more detailed information, a typical method is to assume that failure rates are equally distributed per unit area on the die.

This can be a valid approach if you consider gate oxide breakdown as the primary failure mode. It falls short, however, when you consider the variety of failure modes recognized in modern semiconductor reliability standards, such as the recently updated JEDEC JEP122F standard. For many failure modes, such as a single-event upset (SEU) that affects memories or sequential logic, the problem can be addressed by applying per-element failure rates measured in accelerated reliability tests (such as neutron beam bombardment).
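
As a rough illustration of how such an allocation might be combined, the sketch below splits a die-level permanent-failure rate across design elements in proportion to area, then adds a per-element soft-error contribution derived from an accelerated-test figure quoted per megabit of memory. Every number, element name and area share here is hypothetical.

# Allocating a die-level failure rate to design elements. The permanent-fault
# FIT is split in proportion to area (the simple assumption discussed above);
# the soft-error FIT is added per element from accelerated-test data instead.
# All figures are hypothetical, in FIT (failures per 1e9 device-hours).

DIE_PERMANENT_FIT = 50.0     # black-box permanent-failure rate for the die
SEU_FIT_PER_MBIT  = 700.0    # hypothetical neutron-beam result for SRAM

elements = {
    # name: (area share of die, embedded SRAM in Mbit)
    "cpu":         (0.15, 0.06),
    "flash_ctrl":  (0.05, 0.00),
    "sram":        (0.30, 3.00),
    "peripherals": (0.50, 0.10),
}

for name, (area_share, mbit) in elements.items():
    permanent_fit = DIE_PERMANENT_FIT * area_share   # area-proportional split
    seu_fit = SEU_FIT_PER_MBIT * mbit                # per-bit soft-error rate
    print(f"{name:12s} permanent {permanent_fit:6.1f} FIT + "
          f"SEU {seu_fit:7.1f} FIT = {permanent_fit + seu_fit:7.1f} FIT total")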

Another challenge is determining safe vs. dangerous failures. In general, failures that do not cause a failure of the safety function are labeled safe and have little or no impact on safety metrics. Dangerous failures, by contrast, cause a violation of a safety goal.

For most black-box analysis, the detailed information necessary to make a safe-vs.-dangerous determination is not available. Standard practice here is to estimate a ratio of 50 percent safe and 50 percent dangerous faults. 

Detailed white-box analysis of a design provides intriguing possibilities for a more thorough quantification of this ratio. For example, analysis of signal propagation and fanout can determine an architectural lower bound for the safe-vs.-dangerous ratio that is independent of application usage. If additional system-level data flow information or system software is available, a further quantification of the ratio is possible.
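
One way such an architectural bound could be derived is sketched below, under the assumption that a fault in logic with no structural fan-out path to a safety-relevant output can be classified as safe regardless of the application. The tiny netlist, the node names and the choice of safety-relevant output are purely illustrative.

# A toy white-box refinement of the 50/50 safe-vs.-dangerous assumption:
# nodes with no structural path to a safety-relevant output are counted as
# architecturally safe, giving a lower bound on the safe fraction that is
# independent of application usage. The "netlist" below is hypothetical.

netlist = {
    # node: list of nodes it fans out to
    "dbg_trace":  ["dbg_port"],     # debug logic, no path to a safety output
    "dbg_port":   [],
    "alu":        ["result_reg"],
    "result_reg": ["bus_if"],
    "bus_if":     [],               # safety-relevant output
}
safety_outputs = {"bus_if"}

def reaches_safety_output(node, seen=None):
    """True if any fan-out path from node ends at a safety-relevant output."""
    if seen is None:
        seen = set()
    if node in safety_outputs:
        return True
    seen.add(node)
    return any(reaches_safety_output(n, seen)
               for n in netlist[node] if n not in seen)

safe_nodes = [n for n in netlist if not reaches_safety_output(n)]
safe_fraction_lower_bound = len(safe_nodes) / len(netlist)
print(f"architecturally safe nodes: {safe_nodes}")
print(f"safe-fraction lower bound:  {safe_fraction_lower_bound:.0%}")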

Confirmation of the diagnostic effectiveness of implemented safety mechanisms is another challenge. 

Fault insertion is already used at the system level to verify safety mechanism implementation. Many IC faults, however, such as the failure of an on-chip memory controller, cannot be injected at the system level. Instead, faults can be injected into design models, such as gate-level netlists, to determine whether the safety mechanisms detect them within the expected time.

Challenges include the quality of the fault insertion models, setup of the simulation environment, and selection of test benches and faults to be injected to get representative results. For example, you can’t inject all possible bridging faults in all possible locations, so you must direct that verification process by ranking failure criticality.

The gaps and potential solutions noted above reveal the need for deep integration of functional safety techniques into the development of semiconductor products. Figure 2 shows a simplified example of how functional safety can be incorporated into the IC design flow.


Figure 2. A simplified example of how functional safety can be integrated with the IC design flow.

Qualitative safety analysis, performed concurrently with the specification of functional requirements, identifies potential failure modes, flags early safety gaps and defines the safety requirements. Next, quantitative safety analysis predicts failure rates and enables safety-oriented design exploration: the identification of design safety trade-offs and the selection of optimized safety mechanisms.

The end result is a “safety manual” of the IC that clearly lists all the assumptions of use, instructs the system integrator how to use the product in safety systems and provides safety metrics for use in system-level analysis.
 


About the authors:

Karl Greb is functional safety technologist for the Texas Instruments TMS570 line of microcontrollers.
Riccardo Mariani is co-founder and chief technical officer of Yogitech SpA.
 

 
