PRODUCT HOW-TO - Building high-speed FPGA memory interfaces

by Paul Evans, Altera Corp., TechOnline India - September 02, 2009

This article examines the architecture behind the I/O blocks in high-end FPGAs and how these FPGAs are able to achieve 533 MHz or 1,067 Mbps data rates. It also examines the tools that are used to build a memory interface and provides a brief overview of the timing budget.

Building reliable, high-speed memory interfaces relies on both the FPGA I/O structures and the intellectual property (IP) within the design software that allows rapid configuration of memory interfaces. Together, these techniques use IP to help gain extra timing margin at high-speed operation.

External double data rate (DDR) memory types are a common part of many FPGA designs. The high-end FPGAs examined in this article are Altera's Stratix IV devices.

These FPGAs support the five leading double data rate memory types (DDR1, DDR2, DDR3, QDRII+, and RLDRAM) as well as other memory interface types. RLDRAM is supported at rates of up to 1,600 Mbps (400 MHz), QDRII+ at 1,400 Mbps, and DDR3 at speeds of up to 533 MHz (1,067 Mbps).

Figure 1 illustrates the I/O block found in a high-end FPGA. It comprises six key areas (A, B, C, D, E, and F), each essential for reliably interfacing to high-speed external memories:

  • Position "A": Dynamic on-chip termination (OCT)
  • Position "B": I/O buffering
  • Position "C": Variable I/O delays
  • Position "D": 2:1/1:2 muxing/demuxing
  • Position "E": Read/write leveling blocks
  • Position "F": Half-rate registers

    Figure 1. Calibrated Dynamic OCT for Proper Line Termination and Power Savings

    In the dynamic termination block (Position A, expanded view on the right side of Figure 1), the FPGA swaps between parallel termination when reading and series termination when writing. In this way, the FPGA is always able to provide the ideal line termination for a switching bidirectional bus, depending on its operation.

    Being able to dynamically turn the parallel termination on and off not only provides the proper line termination, but also delivers significant power savings compared to a fixed external resistor configuration, which draws power constantly. Because the parallel termination is switched in only when the bidirectional DDR bus needs it, the path to ground through the FPGA's termination is effectively taken out of circuit the rest of the time, saving power.

    Having the termination in the FPGA also helps reduce costs in several ways: first by simplifying design complexity, and second by reducing the number of external components, since no external resistor network is required. This in turn saves board real estate, and shrinking the size of a board reduces overall cost. Lastly, reducing the power brings system savings in other areas such as power supplies.

    The I/O buffer (Position B) controls the drive strength and can be used to meet the various JEDEC-specified drive strengths. Just left of center (Position C) are the variable I/O delays. Seen at Position D are the standard 2:1/1:2 muxing/demuxing registers that take data from the double data rate domain to a single data rate domain. For DDR3, the read/write leveling blocks (Position E) are provided and, finally, at Position F sits a second set of 1:2/2:1 muxing/demuxing registers known as the half-rate registers.

    The variable I/O delay elements allow designers to compensate for board trace mismatch. Additionally, the fine resolution of 25-picosecond steps allows automatic deskew algorithms to be developed, which help compensate for or remove process variations in both the FPGA and the memory device, buying back a significant chunk of timing margin when operating at high speeds.
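As a rough illustration of what those 25 ps steps buy, the sketch below converts a trace-length mismatch into delay-tap settings. Only the 25 ps step size comes from the article; the propagation delay per inch and the trace lengths are invented for the example.

```python
# Sketch: choosing variable I/O delay settings to compensate trace mismatch.
# The 25 ps step size is from the article; PS_PER_INCH and the trace
# lengths below are illustrative assumptions, not vendor figures.
STEP_PS = 25          # variable I/O delay resolution (from the article)
PS_PER_INCH = 170.0   # assumed FR-4 propagation delay per inch

def delay_taps(trace_len_in, longest_len_in):
    """Taps of added delay so a shorter trace matches the longest one."""
    mismatch_ps = (longest_len_in - trace_len_in) * PS_PER_INCH
    return round(mismatch_ps / STEP_PS)

lengths = [2.0, 2.1, 1.8, 2.05]        # DQ trace lengths in inches (assumed)
longest = max(lengths)
taps = [delay_taps(length, longest) for length in lengths]
print(taps)                            # taps of delay to add per DQ bit
```

The granularity matters: with 25 ps steps, even a few tenths of an inch of mismatch resolves to only one or two taps of residual error.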

    Another essential section within the FPGA I/O block is the double data rate capture registers (see Figure 1, Position D). The relationship between the data and the data strobe is used to capture the data. Normally, the data strobe, or DQS, leaves the memory device on a read edge-aligned with the data.

    The FPGA or controller side must shift the DQS into a position where it can be used to capture the data. The Stratix Series FPGAs are the first FPGAs to use dedicated DQS phase shift circuitry for reliable capture (see Figure 2).

    Figure 2. Double Data Rate Capture and DQS Phase Shift

    There are four DLLs on the FPGA dedicated to providing process, voltage, and temperature compensation for this DQS phase shift circuit. These DLLs update the DQS phase shift circuit non-intrusively, keeping it at a constant phase rather than relying on a single delay element that may vary over process, voltage, and temperature. Each DLL provides two output controls that stretch over two sides of the chip, allowing multiple memories and phases to be supported around all four sides of the device.

    The alignment and synchronization block (shown in Figure 3) is essential for true DDR3 read/write leveling operation. Without it, FPGAs cannot easily interface to JEDEC-compliant DDR3 DIMMs. Stratix III and IV are the only FPGAs available with this capability built directly into the I/O block. Competing FPGAs offer only variable delay but, as shown in Table 1, this is not enough to successfully interface to a DDR3 DIMM at high speed.

    Figure 3. The Alignment and Synchronization Block

    Table 1. Required DDR3 Leveling Features

    Behind the alignment and synchronization block is a second set of double data rate registers known as the half-rate registers. These registers effectively give the DDR memory interface a SERDES factor of four: the I/O can toggle at extremely high rates while the data is expanded onto a wider bus at a lower frequency, easing the internal design constraints.
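The 2:1 capture registers and the half-rate registers together behave like a 1:4 deserializer. The behavioral sketch below models that data flow only; the function names and framing are illustrative, not the silicon implementation.

```python
# Behavioral sketch of the overall 1:4 deserialization in the read path:
# the DDR I/O registers split the serial stream into two bits per strobe
# cycle, and the half-rate registers widen that to four bits at half the
# clock rate. Purely a data-flow model, not the hardware design.

def ddr_capture(serial_bits):
    """2:1 demux: pair bits captured on rising/falling strobe edges."""
    return [tuple(serial_bits[i:i + 2]) for i in range(0, len(serial_bits), 2)]

def half_rate(pairs):
    """Half-rate registers: combine two DDR pairs into one 4-bit word."""
    return [pairs[i] + pairs[i + 1] for i in range(0, len(pairs), 2)]

stream = [1, 0, 1, 1, 0, 0, 1, 0]   # bits arriving serially on one DQ pin
words = half_rate(ddr_capture(stream))
print(words)   # 4-bit words presented to the fabric at a quarter of the bit rate
```

Each stage halves the clock rate while doubling the bus width, which is why the fabric can close timing even though the pin toggles at 1,067 Mbps.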

    Examining the IP
    Having examined the hardware, let's now turn the focus to the IP that sits above the silicon and controls the interface. To help configure this I/O block, an IP core is integrated directly into the Quartus II design software. This quickly and easily assists in building a PHY or data path section of the memory interface.

    The ALTMEMPHY IP is found in the design software directly from the Tools menu (Figure 4). The MegaWizard Plug-In Manager allows selection of different memory types and configures the memory interface to match the preferred memory type and speed. For DDR3, for example, ALTMEMPHY integrates all of the I/O features and sets up a PLL with the required clocks and phases to shift the data across the I/O block and into the FPGA fabric.

    Figure 4. ALTMEMPHY IP Tools Menu in Quartus Design Software

    ALTMEMPHY builds three key blocks in the FPGA soft logic to help buy back timing margin:

    1. Voltage and temperature tracking blocks
    2. A clock domain crossing FIFO
    3. A calibration control block

    The voltage and temperature tracking blocks and the clock domain crossing FIFO present source-synchronous data into the FPGA, while the calibration control block is specifically concerned with techniques that help gain extra timing margin for high-speed operation.

    The calibration control block runs at startup, helping to increase the timing margin at the resynchronization stage, and runs a deskew algorithm that increases the timing margin at the capture stage, at the registers closest to the I/O pin. In combination, these three blocks gain additional timing margin over a traditional static timing analysis.

    In a traditional static timing analysis approach, all the worst-case conditions, the worst-case possible shifts, and the resulting positions of the data with respect to each other are taken into account. Knowing all these worst-case positions, a safe window is calculated, and the center of this safe window sets the phase for the resynchronization clock. This is a time-tested method, and it allows operation up to approximately the 267 MHz to 333 MHz range. However, to reliably reach higher frequencies such as 400 MHz and 533 MHz, a different technique is required.

    This is where voltage and temperature tracking can be used in the resynchronization stage to maximize the data window and operate in conditions where the traditional static timing methodology would leave no safe window at all. The voltage and temperature tracking compensation block (shown in Figure 5) must compensate for environmental changes at all times during operation and must never interfere with data transfer or require any additional overhead.
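The static-timing calculation described above reduces to simple arithmetic: shave the worst-case uncertainty off both edges of the bit period and see what window remains. The numbers below are illustrative, not from any datasheet.

```python
# Sketch of the static-timing safe-window calculation: subtract worst-case
# early and late uncertainties from the bit period, then place the
# resynchronization clock at the center of whatever window survives.
# All picosecond values here are invented for illustration.

def safe_window(bit_period_ps, early_unc_ps, late_unc_ps):
    """Return (window_width, resync_phase_offset), or None if no window."""
    width = bit_period_ps - early_unc_ps - late_unc_ps
    if width <= 0:
        return None                       # static analysis fails outright
    center = early_unc_ps + width / 2     # midpoint of the surviving window
    return width, center

# ~333 MHz DDR (667 Mbps, ~1500 ps bit period): a window remains.
print(safe_window(1500, 500, 600))
# ~533 MHz DDR (1,067 Mbps, ~938 ps bit period): the same worst-case
# uncertainty leaves no safe window, so calibration/tracking is needed.
print(safe_window(938, 500, 600))
```

This is exactly why the article's higher-speed techniques exist: the worst-case sum stays roughly constant while the bit period shrinks with frequency.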

    Figure 5. Voltage and Temperature Tracking Compensation Blocks

    This is achieved by creating a mimic clock that is sent through the bidirectional I/O of the FPGA and back into the FPGA fabric. The mimic clock is periodically swept by another clock, known as the measure clock, to find the positions of its rising and falling edges, and the results are compared to the last sweep. If the result differs from the last sweep, the resynchronization phase can be adjusted. Updates are always made in steps of plus or minus one PLL phase to prevent hunting or oscillation of the resynchronization clock. The system has proven effective for DDR3 operation at up to 533 MHz over PVT.
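The tracking loop just described can be sketched in a few lines. The edge positions below are invented; the one-step-per-update rule is the behavior the article describes.

```python
# Sketch of the mimic-path tracking loop: compare the mimic clock's measured
# edge position against the previous sweep and nudge the resynchronization
# phase by at most one PLL step per update, which prevents hunting.
# Edge positions and the starting phase are illustrative.

def track(prev_edge_ps, new_edge_ps, phase):
    """Return the updated resync phase setting after one measure sweep."""
    if new_edge_ps > prev_edge_ps:    # path delay drifted out: add a step
        phase += 1
    elif new_edge_ps < prev_edge_ps:  # path delay drifted back: remove one
        phase -= 1
    return phase                      # unchanged when the sweep matches

phase = 8                                  # assumed initial PLL phase setting
edges = [400, 400, 425, 450, 440, 440]     # mimic-clock edge positions (ps)
for prev, new in zip(edges, edges[1:]):
    phase = track(prev, new, phase)
print(phase)   # phase has followed the drift, one step at a time
```

Limiting each update to a single phase step is the key design choice: the loop can only creep, never overshoot, so it cannot oscillate around the drift it is chasing.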

    Another technique is to calibrate at startup, which helps remove process variations in both the FPGA and the memory device and opens up the timing window. At startup, all the DQ pins are swept by the output phase of the reconfigurable PLL. A known data pattern is written into the memory and continuously read back or, in the case of DDR3, the pre-loaded data pattern on DQS zero can be used.

    As the data is read back, it is compared against the known written pattern at each of the PLL phases, and the result is recorded as either pass or fail. Once the edges of a valid safe window are known for each DQ bit, the window for the entire bus can be established and the resynchronization clock placed in the middle of the valid window for maximum setup and hold.

    In effect, the technique pushes the valid window to one end until the data fails, pushes it to the other end until the data fails again, and then takes the midpoint of the good region to attain maximum setup and hold.
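The sweep-and-center step reduces to finding the longest run of passing phases and taking its midpoint. The pass/fail vector below is invented; the algorithm is a sketch of the behavior the article describes, not Altera's implementation.

```python
# Sketch of the startup calibration sweep: the known pattern is read back
# at every PLL phase and each phase is marked pass or fail; the
# resynchronization clock is then centered in the longest run of passes.
# The results vector is invented for illustration.

def best_phase(results):
    """Index at the center of the longest contiguous run of passes."""
    best_start, best_len, start = 0, 0, None
    for i, ok in enumerate(results + [False]):   # sentinel closes a final run
        if ok and start is None:
            start = i                            # a passing run begins
        elif not ok and start is not None:
            if i - start > best_len:             # run ended: keep if longest
                best_start, best_len = start, i - start
            start = None
    return best_start + best_len // 2

# One pass/fail result per PLL phase step across the sweep (invented):
phases = [False, False, True, True, True, True, True, True, False, False]
print(best_phase(phases))   # phase index chosen for maximum setup and hold
```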

    The third and final technique is deskew, which increases the timing margin at the capture I/O: the variable I/O delay elements just before the capture registers are used to deskew automatically in DDR3 (see Figure 6). This deskew happens at startup and within each DQ group. The eight bits are aligned to each other, which allows the capture (DQS) phase to be positioned for maximum setup and hold at those capture registers.

    Figure 6. DDR3 De-Skew to Increase Capture Margin
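The per-bit alignment within a DQ group can be sketched as follows. The measured data-valid centers are invented for illustration; the 25 ps delay step is the figure given earlier in the article.

```python
# Sketch of per-bit deskew inside one DQ group: the calibration sweep
# yields a data-valid center for each of the eight bits, and delay taps
# (25 ps steps, per the article) are added so every bit lines up with the
# latest one before the common DQS capture phase is positioned.
# The center values below are invented for illustration.

STEP_PS = 25   # variable I/O delay resolution

def deskew(centers_ps):
    """Delay taps per DQ bit that align all centers to the latest bit."""
    target = max(centers_ps)
    return [round((target - c) / STEP_PS) for c in centers_ps]

centers = [110, 160, 135, 185, 120, 150, 175, 140]   # eight DQ bits (assumed)
print(deskew(centers))   # taps of added delay per bit within the group
```

Once the bits coincide, a single DQS phase serves the whole group, which is what buys back the capture margin at the registers closest to the pins.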
    Timing analysis
    Another very important aspect of any memory interface is timing. The TimeQuest Timing Analyzer in the Quartus II design software is a built-in, ASIC-strength timing analysis tool that accurately analyzes source-synchronous double data rate interfaces. The ALTMEMPHY IP provides constraints in the Synopsys Design Constraints (SDC) format, and TimeQuest uses these constraints in conjunction with the place-and-route timing results to analyze the memory interface for skew and the various relationships between the clocks, the DQS signals, and the DQ data across the entire bus over PVT.

    Looking at a top-level timing analysis (see Table 2), the benefits of this calibration and the various startup techniques that are used in the timing analysis can be seen. Without calibration, the uncertainties are too high to allow a valid safe window in the traditional static timing analysis when looking at something like a DDR3 operating at 533 MHz.

    Table 2. Read Timing Analysis of DDR3-800 Memory

    When the three techniques are put in place, an approximate 300-picosecond timing margin becomes available for safe operation across PVT while running the DDR3 interface at 533 MHz. These techniques also provide extra timing margin for memory interfaces running at lower frequencies.

    Conclusion
    The total result can be seen in Figure 7: a very reasonable, extremely clean eye. This eye diagram was generated at full speed, with the DQ running at 533 MHz and the FPGA driving into the DDR3 memory.

    Figure 7. DDR3 533 MHz Eye Diagram

    These FPGAs achieve the industry's highest external memory interface speeds. A design can be started by downloading the latest version of the Quartus II software. Additional resources are available at the Stratix IV external memory resource center, including application notes, white papers, and other collateral to assist in designing these memory interfaces.

    About the author

    Paul Evans
    Altera Corp.
    Paul Evans is the product marketing engineer responsible for Altera's Stratix III FPGAs. He joined Altera in 2000 as a senior application engineer. Prior to joining Altera, Mr. Evans was a technical services manager for Ometron Ltd. where he established Ometron's technical services department. Mr. Evans has also held engineering positions at Image Automation Inc. and Smiths Industries Ltd. He holds a BEng in digital electronic engineering from the University of Kent at Canterbury in England.
