    Efficient interfacing with external memory in high-end video
    TechOnline India

    High-definition multimedia devices such as DTVs, set-top boxes, video players and even mobile phones make up one of the fastest-growing segments of the consumer electronics market. The main drivers behind this growth are consumer demands for high-resolution digital video content, greater color depth, and higher refresh rates.

    Typically, a high-performance chip for such devices contains a video decoder supporting multiple video standards, a picture processing engine, a 3-D graphics engine and a display controller. Each of these blocks must process a large amount of data and store and retrieve it internally and/or externally, depending on the system requirements. Figure 1 shows an example system diagram with only the external memory connectivity.

    Combining multiple high-performance functions at a consumer price point demands powerful system architectures with appropriate trade-offs. Even though advanced process nodes like 65nm or 45nm can reduce the silicon area of such an IC, the extremely high external memory bandwidth consumption creates a bottleneck for the whole system. The peak aggregate system bandwidth of these computing engines can approach 4-5GBps, and a high-performance external memory system is necessary to sustain high-definition workloads.

    The bandwidth required by the different processing engines can vary dramatically depending on the image content and the processing algorithms used. A careful analysis of each engine's bandwidth requirement, access pattern and latency requirement is crucial for selecting the external memory, architecting the DDR controller and deciding on the system arbitration mechanism.

    This article analyzes different aspects of the DDR controller and the DDR/DDR2 memory module, along with the operational trade-offs involved.



    I. Peak bandwidth analysis

    The first and most important parameter in selecting an appropriate external memory controller and memory bus architecture for a complex, bandwidth-hungry SoC is the peak bandwidth requirement of the whole system. This requirement directly influences both the performance and the cost of the system.

    The peak bandwidth calculation for the video SoC shown in Figure 1 can be very complex because the functional modules differ in latency tolerance, buffering capability, access pattern, peak-to-average bandwidth ratio and so on. The window for the peak bandwidth calculation should be selected carefully so that, over that window, every module gets its required peak bandwidth and the peak-to-average ratio of the system bandwidth remains moderate.

    In this section, the window selection is analyzed for a simplified SoC comprising an H.264 decoder capable of decoding 1080p @30fps at 200MHz, a display controller requiring 4:2:0 input, a system DMA that stores the encoded stream, and a system CPU. The bandwidth characteristics of the individual engines differ from one another and are complex in nature.

    For example, when decoding 1080p @30fps, the H.264 decoding module can have an average bandwidth of the order of 400-450 MBps, while its peak bandwidth over the processing time of 3-4 macroblocks can be close to 700-800 MBps. The occurrence of consecutive peaks is statistical, and the decoder's latency tolerance depends on the characteristics of the stream, the display picture buffering, the granularity of the external memory access switchover and so on. The bandwidth requirement of the display controller, by contrast, is uniform, and the controller cannot tolerate any latency beyond its buffering capability. The access pattern of the input stream loader depends entirely on the buffering limit of the system.

    Consider a display controller with a ping-pong line buffer. A 1080p @30fps display expects 1920 bytes of data every 20.4us, and because this requirement is mandatory, the 20.4us time slot itself can be taken as the peak bandwidth window. The combined peak of the other clients over this 20.4us slot then determines the system peak. As the H.264 decoder needs about 4us to process a single macroblock at 1080p @30fps, its peak requirement must be characterized over 20.4us, i.e. a 5-macroblock time slot. A typical peak bandwidth over 5 macroblocks may be 700MBps.
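
    The 4us and 20.4us figures above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not the authors' method: it assumes the 1080p picture is coded as 1920x1088 pixels (120 x 68 macroblocks of 16x16), which the article does not state, and all function names are illustrative.

```python
# Sketch: derive the macroblock time and the display window for 1080p @30fps.
# Assumption (not from the article): the coded picture is 1920x1088,
# i.e. 120 x 68 = 8160 macroblocks of 16x16 pixels per frame.

FPS = 30
MB_PER_FRAME = (1920 // 16) * (1088 // 16)  # 8160 under the assumption above


def macroblock_time_us(fps=FPS, mb_per_frame=MB_PER_FRAME):
    """Average time budget for one macroblock, in microseconds."""
    return 1e6 / (fps * mb_per_frame)


def window_us(n_macroblocks):
    """Length of a window spanning n macroblock slots, in microseconds."""
    return n_macroblocks * macroblock_time_us()


def bandwidth_mbps(n_bytes, duration_us):
    """Bytes per microsecond equals MBps (with 1 MB = 10**6 bytes)."""
    return n_bytes / duration_us


if __name__ == "__main__":
    w = window_us(5)
    print(f"macroblock time : {macroblock_time_us():.2f} us")
    print(f"5-macroblock window : {w:.1f} us")
    print(f"display bandwidth : {bandwidth_mbps(1920, w):.1f} MBps")
```

    Under these assumptions the 5-macroblock window comes out at roughly 20.4us, and 1920 bytes per window corresponds to a uniform display bandwidth of about 94MBps.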

    Now assume that the decoder can average its processing time over 5 macroblocks as long as the data required for those 5 macroblocks is served with a delay of at most 2us, i.e. the decoder's latency tolerance is 2us. The same amount of data can then be delivered over 20.4us + 2us, so the effective peak bandwidth reduces to 700*20.4/(20.4+2) = ~637MBps.
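
    Written out as a formula, the calculation spreads the data of one window over the window plus the tolerated delay. The symbols below are notation introduced here for illustration and do not appear in the article: B_peak is the peak bandwidth measured over the window, T_w is the window length and T_lat is the latency tolerance.

```latex
\[
B_{\mathrm{eff}}
  = \frac{B_{\mathrm{peak}}\, T_w}{T_w + T_{\mathrm{lat}}}
  = \frac{700\ \mathrm{MBps} \times 20.4\ \mu\mathrm{s}}{20.4\ \mu\mathrm{s} + 2\ \mu\mathrm{s}}
  \approx 637.5\ \mathrm{MBps}
\]
```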

    The window in turn depends on the buffering available in the different functional blocks. For example, if the display controller can buffer 2 lines instead of a single line, it expects 3840 bytes of data every 40.8us, and the peak selection window can be widened to 40.8us, i.e. a 10-macroblock decoding slot.

    When the window becomes wider, the decoder's peak requirement falls (because the decoder averages over more macroblocks) and, at the same time, its latency tolerance increases. As a typical example, the peak bandwidth requirement over a 10-macroblock decoding time can be 600MBps with a latency tolerance of 5us. The peak bandwidth requirement of the decoder in the new window is then 600*40.8/(40.8+5) = ~535MBps. Similarly, if the decoder is given an extra display jitter buffer, the peak bandwidth reduces further, as the decoder now has 66.6ms to decode 2 frames.
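
    The effect of widening the window can be compared side by side. The sketch below uses only the figures quoted in the text (700MBps over 20.4us with 2us tolerance, 600MBps over 40.8us with 5us tolerance). The `Window` class and its field names are hypothetical, chosen for illustration.

```python
# Sketch: compare the decoder's effective peak bandwidth for the two
# windows discussed in the text. Class and field names are illustrative.
from dataclasses import dataclass


@dataclass
class Window:
    name: str
    window_us: float             # length of the peak-selection window
    peak_mbps: float             # decoder peak measured over that window
    latency_tolerance_us: float  # extra delay the decoder can absorb

    def effective_peak_mbps(self) -> float:
        """Spread the window's data over window + tolerated latency."""
        total_us = self.window_us + self.latency_tolerance_us
        return self.peak_mbps * self.window_us / total_us


WINDOWS = [
    Window("1 line / 5 macroblocks", 20.4, 700.0, 2.0),
    Window("2 lines / 10 macroblocks", 40.8, 600.0, 5.0),
]


def compare(windows):
    """Return {window name: effective peak in MBps}."""
    return {w.name: w.effective_peak_mbps() for w in windows}


if __name__ == "__main__":
    for name, mbps in compare(WINDOWS).items():
        print(f"{name:26s} {mbps:6.1f} MBps")
```

    With these inputs, doubling the display line buffer lowers the decoder's effective peak by roughly 100MBps, which is the trade-off between on-chip buffering and external memory bandwidth that the text describes.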

    Whether the bandwidth needed to load the encoded input stream influences the peak bandwidth requirement depends on how much of that stream is buffered. If the buffering is sufficient, this bandwidth need not be included in the calculation, because the stream loader can work in cycle-stealing mode.

    Once the window is selected, the net amount of data to be accessed is known, so the next parameter to characterize is the efficiency of the memory controller for the different clients. The data access patterns of the display controller and the bit-stream loader are regular, so their efficiency can easily be characterized in simulation or even theoretically. The decoder's accesses, however, involve frequent context switching and page and bank changeovers caused by the two-dimensional data access of motion compensation (MC), among other things, so a careful analysis is required.
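
    One way to see why per-client efficiency matters is to convert each client's net bandwidth into the raw bus bandwidth it occupies. The sketch below assumes that efficiency means useful data transferred divided by the bus capacity consumed, so raw = net / efficiency. The efficiency values are purely hypothetical placeholders, not measurements from the article. The net figures are the display and decoder values worked out in the text.

```python
# Sketch: raw bus bandwidth consumed for a given net bandwidth and
# controller efficiency per client. Efficiency values are HYPOTHETICAL.

def raw_bandwidth_mbps(clients):
    """clients maps name -> (net_mbps, efficiency in (0, 1])."""
    total = 0.0
    for name, (net_mbps, efficiency) in clients.items():
        if not 0.0 < efficiency <= 1.0:
            raise ValueError(f"{name}: efficiency must be in (0, 1]")
        total += net_mbps / efficiency
    return total


CLIENTS = {
    # net MBps from the text; efficiency figures are placeholders only
    "display controller": (94.1, 0.85),  # regular, linear access
    "H.264 decoder": (637.5, 0.55),      # 2-D access, page/bank changeovers
}

if __name__ == "__main__":
    print(f"raw bus bandwidth needed: {raw_bandwidth_mbps(CLIENTS):.0f} MBps")
```

    The point of the sketch is only the shape of the calculation: a client with irregular accesses can occupy far more of the bus than its net data rate suggests, which is why the decoder's efficiency deserves the most attention.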

    The following sections describe a few important techniques for increasing the efficiency of the memory controller and data bus, with a focus on the video decoder.
