Don't Let Metastability Cause Problems in Your FPGA-Based Design

by Jennifer Stephenson, Altera Corp., TechOnline India - September 29, 2009

Metastability is a phenomenon that can cause system failure in digital devices such as FPGAs, when a signal is transferred between circuitry in asynchronous clock domains. This article describes metastability in FPGAs, explains why the phenomenon occurs, and discusses how it can cause design failures. The calculated mean time between failures (MTBF) due to metastability indicates whether designers should take steps to reduce the chance of such failures. This article also explains how MTBF is calculated from design and device parameters, and presents techniques to improve system reliability with increased MTBF.

What Is Metastability?
Each register has defined signal timing requirements that allow it to correctly capture data and produce an output signal. The input to a register must be stable for a minimum setup time before the clock edge (tSU) and a minimum hold time after the clock edge (tH). The register output then is available after a specified clock-to-output delay (tCO). If a signal transition violates tSU or tH requirements, the register output may go into a metastable state. In a metastable state, the register output hovers at a value between the high and low states for some period of time, which means the output transition to a stable state is delayed beyond the specified tCO.

In fully synchronous systems, the input signals must meet the register timing requirements, so metastability does not occur. Metastability problems occur most often when a signal is transferred between circuitry in unrelated clock domains, or completely asynchronously. The designer cannot guarantee that the signal will meet tSU and tH requirements, because the signal can arrive at any time relative to the destination clock. If a signal transition violates a register's tSU or tH, the likelihood that the register enters a metastable state and the time required to return to a stable state vary depending on the operating conditions and the process technology used to manufacture the device.

A register capturing a data signal at a clock edge can be visualized as a ball being dropped onto a hill, as shown in Figure 1. The sides at the bottom of the hill represent stable states—the signal's old and new data values after a signal transition—and the top of the hill represents a metastable state. If the ball is dropped at the top of the hill, it might balance there indefinitely, but in practice it falls slightly to one side and rolls down the hill. The further the ball lands from the top of the hill, the faster it reaches a stable state at the bottom.

If a data signal transitions after the clock edge and the minimum tH, it is analogous to the ball being dropped on the "old data value" side of the hill, and the output remains at the original value for that clock transition. When a register's data input transitions before the clock edge and minimum tSU, it is analogous to the ball being dropped on the "new data value" side of the hill, and the output reaches the stable new state quickly enough to meet the defined tCO time. However, when a register's data input violates the tSU or tH, it is analogous to the ball being dropped on the hill itself. If the ball lands near the top of the hill, it takes too long to reach the bottom, increasing the delay from the clock transition to a stable output beyond the defined tCO.


Figure 1- Metastability Illustrated as a Ball Dropped on a Hill
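
The three cases in the ball-and-hill analogy can be sketched as a simple timing check. This is only an illustration: the function name, the time units, and the example tSU and tH values are assumptions, not figures from any device data sheet.

```python
def classify_transition(t_data, t_clk_edge, t_su, t_h):
    """Classify a data transition relative to one capturing clock edge.

    All times share one unit (for example, nanoseconds).
    """
    if t_data <= t_clk_edge - t_su:
        return "new"        # stable before the setup window: new value is captured
    if t_data >= t_clk_edge + t_h:
        return "old"        # changes after the hold window: old value is kept
    return "violation"      # inside the setup/hold window: output may go metastable


# Hypothetical register with tSU = 0.2 ns and tH = 0.1 ns, clock edge at 10 ns.
for t in (9.5, 9.9, 10.05, 10.3):
    print(t, classify_transition(t, 10.0, 0.2, 0.1))
```

A "violation" result does not mean the register will go metastable, only that the timing requirements no longer guarantee it will not.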
Figure 2 illustrates the voltage level of metastable signals with passage of time. The input signal transitions from a low to high state while the clock signal transitions, violating a register's tSU requirement. The output signals start in the low state and go metastable, hovering between the high and low states. Signal A resolves to the input data's new logic high state, and signal B resolves to the data input's original logic low state. In both cases, the output transition to a defined state is delayed beyond the register's specified tCO.


Figure 2- Examples of Metastable Output Signals

When Does Metastability Cause Design Failures?
If the signal resolves to a valid high or low state before the next register captures the data, the metastable signal does not negatively impact operation of a system that is properly designed for asynchronous inputs. Continuing the ball and hill analogy, failure can occur when the time it takes for the ball to reach the bottom of the hill (a stable logic value) exceeds the allotted time. The allotted time includes the register's defined tCO plus any extra timing slack in the path from the register. When a metastable signal does not resolve quickly enough, a failure can result if different destination registers capture different values for the metastable signal.

Note that asynchronous input signals, or signals that transfer between unrelated clock domains, can transition at any point relative to the clock edge of the capturing register. Therefore the designer cannot predict the sequence of a signal's transitions or how many destination clock edges will pass before the data transitions. For example, if a bus of asynchronous signals is transferred between clock domains, the individual data signals could be captured on different clock edges. As a result, the received values of the bus data could be incorrect.

The designer must accommodate this behavior with circuitry such as dual-clock FIFO logic to store the signal values, or hand-shaking logic. FIFO logic transmits control signals between the two clock domains, and data is written and read with dual-port memory. Designers may use Gray encoding to ensure that only one bit of a bus changes at a time, which allows detection if data is skewed across clock cycles because of metastability. If an asynchronous signal acts as part of hand-shaking logic between two clock domains, control signals indicate when data can be transferred between clock domains. In a properly designed system, the design functions correctly as long as each signal resolves to a stable value before it is used.
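
The single-bit-change property of Gray encoding mentioned above can be checked with a short Python sketch. The function names and the 4-bit pointer width are illustrative assumptions and are not taken from any particular FIFO implementation.

```python
def binary_to_gray(n: int) -> int:
    """Convert a non-negative binary count to its reflected Gray code."""
    return n ^ (n >> 1)


def gray_to_binary(g: int) -> int:
    """Invert binary_to_gray by XOR-ing together all right shifts of g."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


def bits_changed(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


WIDTH = 4  # hypothetical 4-bit FIFO pointer
for count in range(2 ** WIDTH):
    nxt = (count + 1) % (2 ** WIDTH)
    print(
        f"{count:2d} -> {nxt:2d}: "
        f"binary changes {bits_changed(count, nxt)} bit(s), "
        f"Gray changes {bits_changed(binary_to_gray(count), binary_to_gray(nxt))} bit(s)"
    )
```

Because only one bit differs between consecutive Gray-coded counts, a pointer sampled in the middle of an increment reads as either the old or the new count, whereas a binary pointer can briefly read as an unrelated value when several bits change at once.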

Synchronization Registers
When a signal transfers between circuitry in asynchronous clock domains, the signal must be synchronized to the new clock domain. The first register in the new clock domain acts as a synchronization register. To minimize failures due to metastability, designers use a sequence of registers (a synchronization register chain or synchronizer) in the destination clock domain. These registers allow additional time for a potentially metastable signal to resolve to a known value before the signal is used in the rest of the design. With FIFO or handshaking logic, control signals are synchronized to ensure there is enough settling time for any metastable conditions to resolve before data signals are used. The timing slack after each register is the time available for a metastable signal to settle or resolve to a known value, and is known as the available metastability settling time.

A synchronizer is defined as a sequence of registers that meets the following requirements:

  • The registers in the chain are all clocked by the same or phase-related clocks
  • The first register in the chain is driven from an unrelated clock domain, or asynchronously
  • Each register fans out to only one register, except the last register in the chain

The length of the synchronizer is the number of registers in the synchronizing clock domain that meet these requirements. Figure 3 shows a sample synchronizer of length two.

    Figure 3- Sample Synchronization Register Chain
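
The structure of a synchronization register chain can be sketched as a small behavioral model. This is a deliberately simplified illustration: the class and method names are hypothetical, and the model shows only the register-to-register data flow and the latency it adds, not the analog metastable behavior itself.

```python
class Synchronizer:
    """Behavioral model of a synchronization register chain."""

    def __init__(self, length=2, reset_value=0):
        if length < 1:
            raise ValueError("length must be at least 1")
        self.regs = [reset_value] * length

    def clock(self, async_in):
        """Apply one destination clock edge and return the chain output."""
        # Each register captures the value its predecessor held before the edge;
        # the first register samples the asynchronous input directly.
        self.regs = [async_in] + self.regs[:-1]
        return self.regs[-1]


sync = Synchronizer(length=2)
outputs = [sync.clock(bit) for bit in [0, 1, 1, 1, 0, 0]]
print(outputs)
```

In this model only the first register ever sees the asynchronous input. Every later register samples a value that has already had a full destination clock period, minus routing and setup time, in which to settle.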

    Calculating Metastability MTBF
    The mean time between failures, or MTBF, due to metastability provides an estimate of the average time between instances when metastability could cause a design failure. A higher MTBF (such as hundreds or thousands of years between metastability failures) indicates a more robust design. The required MTBF depends on the system application. Increasing the metastability MTBF reduces the chance that signal transfers will cause any metastability problems on the device.

    The metastability MTBF of a synchronizer chain is calculated with the following formula and parameters:

    $$\mathrm{MTBF} = \frac{e^{t_{MET}/C_2}}{C_1 \cdot f_{CLK} \cdot f_{DATA}}$$

    The C1 and C2 constants depend on the device process and operating conditions. The constants are determined by characterizing the FPGA for metastability. The difficulty with this characterization is that MTBFs for typical FPGA designs are in years, so measuring the time between metastability events using real designs under real operating conditions is impractical. Characterizing the device-specific metastability constants must be performed with a test circuit designed to have a short, measurable MTBF. The MTBF versus tMET results are plotted on a logarithmic scale. The C2 constant corresponds to the slope of the trend line for the experimental results, and the C1 constant scales the line linearly.
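
The characterization step can be illustrated with a least-squares fit on a logarithmic scale. The sketch below assumes the MTBF relationship has the form MTBF = exp(tMET/C2) / (C1 * fCLK * fDATA), so that ln(MTBF) plotted against tMET is a straight line with slope 1/C2. The function name is hypothetical, and the data points are synthetic values generated from made-up constants rather than measurements from any device.

```python
import math


def fit_constants(t_met, mtbf, f_clk, f_data):
    """Least-squares fit of ln(MTBF) against tMET; returns (C1, C2)."""
    y = [math.log(m) for m in mtbf]
    n = len(t_met)
    mean_x = sum(t_met) / n
    mean_y = sum(y) / n
    sxx = sum((x - mean_x) ** 2 for x in t_met)
    sxy = sum((x - mean_x) * (v - mean_y) for x, v in zip(t_met, y))
    slope = sxy / sxx                    # equals 1 / C2
    intercept = mean_y - slope * mean_x  # equals -ln(C1 * fCLK * fDATA)
    c2 = 1.0 / slope
    c1 = math.exp(-intercept) / (f_clk * f_data)
    return c1, c2


# Synthetic "measurements" from made-up constants (not real device data).
TRUE_C1, TRUE_C2 = 1e-9, 50e-12      # seconds
F_CLK, F_DATA = 400e6, 100e6         # hertz
t_points = [0.2e-9, 0.3e-9, 0.4e-9, 0.5e-9]
mtbf_points = [
    math.exp(t / TRUE_C2) / (TRUE_C1 * F_CLK * F_DATA) for t in t_points
]

c1, c2 = fit_constants(t_points, mtbf_points, F_CLK, F_DATA)
print(f"C1 = {c1:.3e} s, C2 = {c2:.3e} s")
```

The short tMET values keep the synthetic MTBFs well under a second, which mirrors the reason a test circuit with a short, measurable MTBF is needed.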

    The fCLK and fDATA parameters depend on the design specifications: fCLK is the clock frequency of the clock domain receiving the asynchronous signal and fDATA is the toggling frequency of the asynchronous input data signal. Faster clock frequencies and faster-toggling data reduce (or worsen) the MTBF.

    The tMET parameter is the available metastability settling time, or the timing slack available beyond the register's tCO, for a potentially metastable signal to resolve to a known value. The tMET for a synchronizer is the sum of the output timing slacks for each register in the chain.
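
As a rough illustration of how these parameters combine, the following sketch evaluates the MTBF of one synchronizer, assuming the relationship MTBF = exp(tMET/C2) / (C1 * fCLK * fDATA). The function names, the C1 and C2 values, the frequencies, and the slack values are all hypothetical and chosen only to make the example run.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def chain_t_met(output_slacks):
    """tMET of a synchronizer: the sum of the register output timing slacks."""
    return sum(output_slacks)


def synchronizer_mtbf(t_met, c1, c2, f_clk, f_data):
    """MTBF in seconds for one synchronizer chain (times in s, rates in Hz)."""
    return math.exp(t_met / c2) / (c1 * f_clk * f_data)


# Hypothetical device constants and design parameters.
C1, C2 = 1e-9, 50e-12          # seconds
slacks = [1.0e-9, 1.2e-9]      # output slack of each register in the chain

mtbf_s = synchronizer_mtbf(chain_t_met(slacks), C1, C2, f_clk=300e6, f_data=50e6)
print(f"MTBF is roughly {mtbf_s / SECONDS_PER_YEAR:,.0f} years")
```

Doubling either frequency halves the result, while each additional C2 of settling time multiplies it by e.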

    The overall design MTBF can be determined from the MTBF of each synchronizer. The failure rate for a synchronizer is 1/MTBF, and the failure rate for the entire design is calculated by adding the failure rates for each synchronizer, as follows:

    $$\mathrm{failure\_rate}_{design} = \sum_{i=1}^{N} \frac{1}{\mathrm{MTBF}_i}$$

    The design metastability MTBF is then $1/\mathrm{failure\_rate}_{design}$.
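
The summation of failure rates is short enough to express directly. In this sketch the function name and the three MTBF values are made up for illustration, and any time unit works as long as it is used consistently.

```python
def design_mtbf(synchronizer_mtbfs):
    """Combine per-synchronizer MTBFs into one design-level MTBF."""
    failure_rate = sum(1.0 / mtbf for mtbf in synchronizer_mtbfs)
    return 1.0 / failure_rate


# Hypothetical design with three synchronizers, MTBF given in years.
mtbfs_years = [500.0, 2000.0, 10000.0]
print(f"Design MTBF is about {design_mtbf(mtbfs_years):.0f} years")
```

The combined MTBF is always lower than the lowest individual MTBF, because every synchronizer contributes some failure rate to the total.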

    Designers using Altera FPGAs do not need to check the MTBF constant values or perform MTBF calculations manually, because Altera's Quartus II software incorporates the metastability parameters within the tool. The software reports the MTBF for identified synchronizers as well as providing an overall design metastability MTBF.

    Improving Metastability MTBF
    Due to the exponential factor in the MTBF equation, the tMET/C2 term has the largest effect on the MTBF calculation. Therefore metastability MTBF can be improved by reducing the device's C2 constant with architecture or process enhancements, or by optimizing the design to increase the tMET in the synchronization registers.

    FPGA Architecture Enhancements
    The metastability time constant C2 in the MTBF equation depends on various factors related to the process technology used to manufacture the device, including the transistor speed and the supply voltage. Faster process technologies and faster transistors allow metastable signals to resolve more quickly. As FPGAs migrated from 180-nm process geometries to 90 nm, the increase in transistor speed usually improved metastability MTBF. As a result, metastability has not been a major concern for FPGA designers.

    However, as the supply voltage reduces with reduced process geometries, the threshold voltage for the circuit does not decrease proportionally. When a register goes metastable, its voltage is approximately one-half of the supply voltage. With a reduced power supply voltage, the metastable voltage level is closer to the threshold voltage in the circuit. When these voltages get closer together, the gain of the circuit is reduced and the registers take longer to transition out of metastability. As FPGAs enter the 65-nm process geometry and lower, with power supplies at 0.9V and lower, the threshold voltage consideration is becoming more important than the increase in transistor speed. Therefore, metastability MTBFs generally get worse unless the vendor designs the FPGA circuitry to improve metastability robustness.

    FPGA vendors can use metastability analysis of the FPGA architecture to optimize circuitry for improved MTBF. For example, architecture improvements in Altera's 40-nm FPGAs and new device development have improved robustness by reducing the metastability time constant C2.

    Design Optimizations
    The exponential factor in the MTBF equation means that an increase in the design-dependent tMET value increases a synchronizer MTBF exponentially. For example, if the C2 constant for a given device and set of operating conditions is 50 ps, then an increase of just 200 ps in the tMET adds 200/50 = 4 to the exponent and increases the MTBF by a factor of e^4, or more than 50 times, while an increase of 400 ps multiplies the MTBF by e^8, or almost 3,000 times.
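
The arithmetic in this example can be reproduced, and inverted, with a few lines of Python. The helper names are hypothetical, and the 50 ps value for C2 is the illustrative figure from the text rather than a characterized device constant.

```python
import math


def mtbf_gain(delta_t_met, c2):
    """Factor by which MTBF grows when tMET increases by delta_t_met."""
    return math.exp(delta_t_met / c2)


def extra_slack_for_gain(gain, c2):
    """Additional tMET needed to multiply MTBF by the given factor."""
    return c2 * math.log(gain)


C2 = 50e-12  # seconds; illustrative value from the text
for delta in (200e-12, 400e-12):
    print(f"+{delta * 1e12:.0f} ps of tMET multiplies MTBF by {mtbf_gain(delta, C2):,.1f}")

needed = extra_slack_for_gain(1000.0, C2)
print(f"A 1000x improvement needs about {needed * 1e12:.0f} ps of extra tMET")
```

The inverse form is often the more useful one in practice, since it converts a reliability target into a timing budget.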

    The synchronizer with the worst MTBF dominates the design's overall metastability MTBF. Consider a design that has nine synchronizers with an MTBF of a million years but one synchronizer with an MTBF of 100 years. The design failure rate is the sum of the synchronizer failure rates (1/MTBF), which is 9 * 1/1,000,000 + 1/100 = 0.010009 failures per year. The design MTBF is therefore about 99.9 years, just slightly less than the MTBF of the worst synchronizer in the design.

    To improve metastability MTBF, designers can increase tMET by adding extra register stages to synchronization chains. The timing slack on each additional register-to-register connection is added to the tMET value used in the MTBF calculation. Designers commonly use two registers to synchronize a signal, but this may not be enough to produce a high MTBF when a design runs at high clock and data frequencies. Using three registers is recommended for better metastability protection. For designers who use FIFO logic between clock domains, Altera's parameterizable FIFO function offers an option to improve metastability protection with three or more synchronization stages for its internal synchronizers. Adding a register adds a latency stage to the synchronization logic, so designers must evaluate whether that is acceptable.
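
The trade-off between extra register stages and a target MTBF can be estimated with a short calculation. This is a sketch under stated assumptions: it uses the relationship MTBF = exp(tMET/C2) / (C1 * fCLK * fDATA), treats every added stage as contributing the same slack, and uses made-up constants, frequencies, and function names throughout.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def required_t_met(target_mtbf, c1, c2, f_clk, f_data):
    """tMET (seconds) needed to reach target_mtbf (seconds)."""
    return c2 * math.log(target_mtbf * c1 * f_clk * f_data)


def extra_stages_needed(base_t_met, slack_per_stage, target_mtbf,
                        c1, c2, f_clk, f_data):
    """Number of added register stages needed to reach the target MTBF."""
    shortfall = required_t_met(target_mtbf, c1, c2, f_clk, f_data) - base_t_met
    if shortfall <= 0:
        return 0
    return math.ceil(shortfall / slack_per_stage)


# Hypothetical numbers: existing chain with 1.5 ns of tMET, 1.2 ns of slack
# per added stage, and a target of 1,000 years between failures.
stages = extra_stages_needed(
    base_t_met=1.5e-9,
    slack_per_stage=1.2e-9,
    target_mtbf=1000 * SECONDS_PER_YEAR,
    c1=1e-9, c2=50e-12, f_clk=500e6, f_data=100e6,
)
print(f"Extra stages needed: {stages}")
```

Each added stage costs one destination clock cycle of latency, so the result is best read as the smallest latency penalty that meets the reliability target under these assumptions.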

    Designers can also optimize the placement of synchronizers in the FPGA to add timing slack and improve the MTBF. Altera's Quartus II software offers metastability analysis and optimization features to automatically increase the tMET on synchronization register chains. When synchronizers are identified, the software places synchronization registers closer together to increase the output timing slacks.

    Conclusion
    Metastability can occur when signals are transferred between circuitry in unrelated or asynchronous clock domains. The mean time between metastability failures is related to the device process technology, design specifications, and timing slack in the synchronization logic. FPGA vendors can improve metastability MTBF with process and architecture enhancements. FPGA designers can increase metastability MTBF and improve system reliability by increasing the tMET with design techniques or placement optimizations. Designers using Altera FPGAs can also take advantage of software features to report metastability MTBF for their design and optimize design placement to increase MTBF.

    Jennifer Stephenson, Applications Engineer, Member of Technical Staff, Altera Corp.

    Jennifer Stephenson is an Applications Engineer focusing on FPGA design and synthesis. She works closely with EDA synthesis vendors and Altera software engineering, as well as with Altera's technical support and sales organizations, to understand strategic customer needs and help improve the customer experience. Jennifer holds a B.A.Sc. degree in Electrical Engineering from the University of Toronto.
