Efficient interfacing with external memory in high-end video

by Sutirtha Deb, TechOnline India - January 14, 2010

The bandwidth required by different processing engines can vary dramatically depending on the image content and the processing algorithms used. A careful analysis of the individual bandwidth requirements, access patterns and latency tolerances is crucial in order to select the external memory, architect the DDR controller and decide the system arbitration mechanism. In this article, different aspects of the DDR controller and the DDR/DDR2 memory module, and the associated operational trade-offs, are analyzed.

High-definition multimedia devices like DTVs, set-top boxes, video players and even mobile phones comprise one of the fastest growing segments of the consumer electronics market. The main drivers behind this growth are consumer demand for high-resolution digital video content, greater color depth and higher refresh rates.

Typically, such a high-performance chip contains a high-performance video decoder (supporting multiple video standards), a picture processing engine, a 3-D graphics engine and a display controller. All these blocks are expected to process huge amounts of data and to store and retrieve it internally and/or externally, based on the system requirements. Figure 1 shows an example system diagram with only the external memory connectivity.

The combination of multiple high-performance functions at a consumer price point demands powerful system architectures with appropriate trade-offs. Even though advanced process nodes like 65nm or 45nm can reduce the silicon area of such an IC, the extremely high external memory bandwidth consumption creates the bottleneck in the whole system. The peak aggregate bandwidth of these computing engines can approach 4-5GBps, and a high-performance external memory system is necessary to sustain the high-definition workloads.


[Figure 1: Example system diagram of a video SoC, showing only the external memory connectivity.]

I. Peak bandwidth analysis

The first and most important parameter for selecting an appropriate external memory controller and memory bus architecture for a complex, bandwidth-hungry SoC is the peak bandwidth requirement of the whole system. The peak bandwidth requirement directly influences both the performance and the cost of the system.

The peak bandwidth calculation for a video SoC such as the one in Figure 1 can be very complex, because the latency tolerance, buffering capability, access pattern and peak-to-average bandwidth ratio vary from one functional module to another. A window for the peak bandwidth calculation should be carefully selected such that, over this window, every module gets its required peak bandwidth while the peak-to-average ratio of the system bandwidth remains moderate.

In this section, the window selection is analyzed for a simplified SoC comprising an H.264 decoder capable of decoding 1080p @30fps at 200MHz, a display controller requiring 4:2:0 input, a system DMA that stores the encoded stream, and a system CPU. The bandwidth characteristics of the individual engines are different and complex in nature.

For example, the H.264 decoding module can have an average bandwidth of the order of 400-450 MBps, while its peak bandwidth over a 3-4 macroblock processing window can be close to 700-800 MBps for 1080p @30fps decoding. The occurrence of consecutive peaks is statistical, and the decoder's latency tolerance depends on the characteristics of the stream, the display picture buffering, the granularity of the external memory access switchover, etc. The bandwidth requirement of the display controller, on the other hand, is uniform, and it cannot tolerate any latency beyond its buffering capability. The input stream loader's access pattern depends entirely on the buffering limit of the system.

Considering that the display controller has a ping-pong line buffer, a 1080p @30fps display would expect 1920 bytes of data every 20.4us, and as this requirement is mandatory, the 20.4us time slot itself can be taken as the peak bandwidth window. The combined peak of the other clients over this 20.4us slot will decide the peak of the system. As the H.264 decoder needs about 4us to process a single macroblock at 1080p @30fps, its peak requirement must be characterized over the 20.4us, i.e. 5-macroblock, time slot. A typical peak bandwidth over 5 macroblocks may be 700MBps.

Let's further assume that the decoder can average out its processing time over 5 macroblocks even if the data for those 5 macroblocks is served with a delay of at most 2us, i.e. the decoder's latency tolerance is 2us. Then the effective peak bandwidth reduces to 700 * 20.4/(20.4 + 2) = ~637MBps.

This window again depends on the buffering at the different functional levels. For example, if the display controller can buffer 2 lines instead of a single line, it would expect 3840 bytes of data every 40.8us, and hence the peak selection window can be widened to 40.8us, i.e. a 10-macroblock decoding slot.

When the window becomes wider, the decoder's peak requirement reduces (because the decoder averages over more macroblocks) and, at the same time, its latency tolerance increases. As a typical example, the peak bandwidth requirement over a 10-macroblock decoding time may be 600 MBps with a latency tolerance of 5us, so the decoder's peak bandwidth requirement in the new window is 600 * 40.8/(40.8 + 5) = ~535MBps. Further, if the decoder gets an extra display jitter buffer, the peak bandwidth reduces automatically, as the decoder then gets 66.6ms to decode 2 frames.
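To make the window arithmetic concrete, here is a minimal C sketch that reproduces the two cases above. The numbers (700MBps over the 5-macroblock window with 2us tolerance, 600MBps over the 10-macroblock window with 5us tolerance) come straight from the text; the helper name is ours:

```c
#include <stdio.h>

/* Effective peak bandwidth of a client that tolerates 'latency_us' of
   serving delay on top of a 'window_us' peak window:
       effective = peak * window / (window + latency)                  */
static double effective_peak(double peak_mbps, double window_us,
                             double latency_us)
{
    return peak_mbps * window_us / (window_us + latency_us);
}

int main(void)
{
    /* 5-macroblock window: 700MBps peak, 2us latency tolerance  */
    printf("5-MB window : %.1f MBps\n", effective_peak(700.0, 20.4, 2.0));
    /* 10-macroblock window: 600MBps peak, 5us latency tolerance */
    printf("10-MB window: %.1f MBps\n", effective_peak(600.0, 40.8, 5.0));
    return 0;
}
```

This prints 637.5 and 534.5 MBps, matching the ~637MBps and ~535MBps figures above.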

The buffering of the encoded input data decides whether its bandwidth requirement influences the peak bandwidth calculation or not. If the buffering is sufficient, this bandwidth should not be considered in the calculation, as the stream loader can work in cycle-stealing mode.

Once the window is selected, the net data to be accessed is known, and the next parameter to be characterized is the efficiency of the memory controller for the different clients. The data access patterns of the display controller and the bit-stream loader are regular, so their efficiency can easily be characterized in simulation or even theoretically. But the decoder's access involves frequent context switching, page/bank changeovers due to the 2-dimensional data access of motion compensation (MC), etc., and so a careful analysis is required.

The following sections describe a few important techniques, from the video decoder's perspective, for increasing the efficiency of the memory controller and the data bus.

II. DDR/DDR2 burst length selection

In a video decoder, the extra pixels fetched by MC during the 2-d reference fetch are one of the factors that influence the external memory bandwidth requirement. For example, if MC requires N bytes of data and the DDR controller limits the minimum data per burst to M bytes (assuming burst termination is not allowed), then M-N bytes are the pixel overhead. This section analyses the selection of the burst length (BL) to minimize this pixel overhead.

Before starting the analysis, remember that data is accessed from SDRAM as a burst (if not prematurely terminated), where the burst length (2, 4 or 8 in DDR and 4 or 8 in DDR2) is fixed during memory initialization, and the DDR needs only the base address for the complete burst. If the DDR memory data bus width is W bits, the atomic data access per address is BL x W bits. A fetch smaller than the atomic unit needs a burst terminate (BST) command in the case of DDR1, but that generates bubbles in the data bus and reduces the efficiency, as explained in Figure 2.

DDR2, on the other hand, terminates the previous burst with a new command issued early, which mandates that the next address be available or guaranteed before the current read command is issued. So, unless absolutely required, a complete burst fetch is recommended.


[Figure 2: Bubbles created on the DDR1 data bus by a burst terminate (BST) command.]

The data bus efficiency and the BL do not have any fixed relationship - it all depends on the memory controller implementation and the nature of the successive addresses. As long as successive addresses access the same page or an already open page, the BL makes no difference to the efficiency. But when the addresses traverse different banks, the BL directly impacts the efficiency. Before analyzing the efficiency impact of a smaller BL, let's first analyze the pixel overhead due to different burst lengths.


[Figure 3: An 8x21 reference block fetch from 32-bit DDR in the best (A) and worst (B) alignments.]

Consider the case in Figure 3, where the reference frame is stored in raster scan order in a 32-bit DDR memory and MC needs to fetch an 8x21 reference block from it. If the BL is kept at 8, the atomic data unit is 32 bytes, and the 8x21 block actually fetches 32x21 = 672 bytes (300% pixel overhead) in the best alignment (A) and 64x21 = 1344 bytes (700% pixel overhead) in the worst alignment (B). But if the burst length is reduced to 4, the atomic unit becomes 16 bytes and the pixel overhead drops to 100% and 300% in the best and worst cases respectively. It is obvious from the diagram that the bigger the referenced block, the smaller the relative pixel overhead.
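These figures follow directly from the atomic-unit arithmetic and can be reproduced with a few lines of C. This is a minimal sketch (the helper and the alignment offsets are ours); the worst case simply starts the row on the last byte of an atomic unit.

```c
#include <stdio.h>

/* Bytes actually fetched per block row when the atomic unit is
   BL x bus-width bytes: every atomic unit the row touches must be
   fetched in full (no burst termination). 'offset' is the row's byte
   offset inside an atomic unit (0 = best alignment).                */
static unsigned fetched_per_row(unsigned width, unsigned atomic,
                                unsigned offset)
{
    unsigned span  = offset + width;               /* bytes spanned */
    unsigned units = (span + atomic - 1) / atomic; /* units touched */
    return units * atomic;
}

int main(void)
{
    const unsigned width = 8, rows = 21, bus_bytes = 4; /* 32-bit DDR */
    const unsigned need  = width * rows;                /* 168 bytes  */
    const unsigned bls[2] = { 8, 4 };

    for (int i = 0; i < 2; i++) {
        unsigned atomic = bls[i] * bus_bytes;
        unsigned best   = fetched_per_row(width, atomic, 0) * rows;
        unsigned worst  = fetched_per_row(width, atomic, atomic - 1) * rows;
        printf("BL=%u: best %u bytes (%u%% overhead), "
               "worst %u bytes (%u%% overhead)\n",
               bls[i], best, 100 * (best - need) / need,
               worst, 100 * (worst - need) / need);
    }
    return 0;
}
```

For BL=8 this prints 672/1344 bytes (300%/700% overhead), and for BL=4 it prints 336/672 bytes (100%/300%), as derived above.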

On the contrary, a smaller burst length has an adverse effect on DDR controller efficiency when consecutive addresses change banks. This is explained in detail in Figure 4 and Figure 5. In Figure 4, it is assumed that every consecutive address changes bank and opens a new page in that bank, the BL is set to 8 and the CAS latency is 3 cycles. An efficient DDR controller will pre-charge the page of the 1st bank immediately after activating the page in the 0th bank, and, tRCD after the activation in the 0th bank, it will place the 0th read command. Due to this out-of-order activity in the DDR controller, data flows continuously on the data bus.


[Figure 4: Consecutive accesses across banks with BL = 8; early pre-charge/activation keeps the data bus continuously busy.]

Figure 5 depicts the same address sequence for a BL of 4. After putting the read command for the 3rd bank, the controller has to wait for tRC before it can activate a new page in the 0th bank. So there will be 3 idle clock cycles on the DDR data bus.


[Figure 5: The same address sequence with BL = 4; waiting for tRC leaves idle cycles on the data bus.]

Because of this trade-off between data overhead and DDR bus efficiency, rigorous simulation is required to decide the burst length of the DDR/DDR2 memory.
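One way to weigh the two effects against each other is to fold them into a single figure of merit before running detailed simulations. The sketch below is a deliberately simplified model, not a substitute for simulation: the raw bandwidth assumes a 32-bit DDR at 200MHz, and the efficiency and overhead values are the ones derived above for the all-banks-change pattern.

```c
#include <stdio.h>

/* Useful bandwidth = raw bandwidth x bus efficiency / (1 + overhead),
   where 'efficiency' is the fraction of cycles the data bus carries
   data and 'overhead' is the extra bytes fetched per useful byte.    */
static double useful_bw(double raw_mbps, double efficiency, double overhead)
{
    return raw_mbps * efficiency / (1.0 + overhead);
}

int main(void)
{
    double raw = 1600.0; /* assumed: 32-bit DDR @200MHz = 1600 MBps raw */

    /* BL=8: continuous data bus (Figure 4), 300% best-case overhead    */
    printf("BL=8: %.0f MBps useful\n", useful_bw(raw, 1.0, 3.0));
    /* BL=4: 2 data cycles + 3 idle cycles per burst (Figure 5) = 40%
       bus efficiency in this pattern, but only 100% overhead           */
    printf("BL=4: %.0f MBps useful\n", useful_bw(raw, 0.4, 1.0));
    return 0;
}
```

Neither setting wins universally (here BL=8 yields 400 MBps useful against 320 MBps for BL=4), which is exactly why the simulation must use the real address traces.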

III. Distribute the reference frame in different banks

The multibank architecture of SDRAM allows concurrent access to different banks. If an address needs to open a new page in the same bank, it incurs the page change-over latency; but if the next address opens a page in a separate bank, the bank change-over latency can be hidden (refer to Figure 4). This concurrent access across banks, together with early pre-charge/activation by the DDR controller, should be exploited when the filter stores and MC fetches 2-d chunks. If the reference frame is stored contiguously in raster scan order, as in Figure 6 (C), each line change in the reference frame may need to open a new page in the DDR, depending on the page size and the resolution. If the resolution is high and the page size is small, the number of page changes will be significantly high.

A better way is to distribute adjacent raster lines across different banks, as in Figure 6 (B), so that data from different raster lines is accessed from different banks.


[Figure 6: Reference frame storage, contiguous in raster scan order (C) versus adjacent lines distributed across banks (B).]
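As a concrete illustration of this layout, the following sketch maps a pixel coordinate (line, x) to a bank/row/column address so that adjacent raster lines land in different banks. The bank count, page size and all names are our assumptions for illustration:

```c
/* Hypothetical address decomposition for the layout of Figure 6 (B):
   consecutive raster lines go to different banks, so a vertical
   traversal alternates banks and the change-over latency can be
   hidden by early pre-charge/activation.                             */
#define NUM_BANKS  4u     /* assumed bank count          */
#define PAGE_BYTES 2048u  /* assumed DDR page (row) size */

typedef struct { unsigned bank, row, col; } ddr_addr_t;

static ddr_addr_t map_pixel(unsigned line, unsigned x, unsigned stride)
{
    /* Lines line, line+NUM_BANKS, line+2*NUM_BANKS, ... are packed
       consecutively inside one bank, 'stride' bytes apart.           */
    unsigned in_bank = (line / NUM_BANKS) * stride + x;

    ddr_addr_t a;
    a.bank = line % NUM_BANKS;     /* next line -> next bank   */
    a.row  = in_bank / PAGE_BYTES; /* DDR page within the bank */
    a.col  = in_bank % PAGE_BYTES; /* column within the page   */
    return a;
}
```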

IV. Reference data access in 2-dimensional chunks

The trade-off between pixel overhead and DDR controller efficiency when MC accesses data as a 2-d chunk has already been explained. In this section, an alternative approach is explained that addresses both factors.

Consider the example of Figure 7 (A), where MC needs to access 4 lines a, b, c and d as a 2-d chunk. If the BL is set smaller to reduce the pixel overhead, the efficiency drops; if the BL is kept higher to improve the efficiency, the pixel overhead is higher. If, on the contrary, the 4 lines are stored adjacent to each other in contiguous address space in the external memory, as shown in Figure 7 (B), the entire chunk is accessed from a single page. The smallest 2-d tile dimensions can be chosen such that the burst length computed as BL = clamp(floor(Σ size(line_i)/W), BL_MIN, BL_MAX) becomes BL_MAX, i.e. one tile fills a complete maximum-length burst. Adjacent tiles should be stored in different banks so that the DDR controller can fetch adjacent 2-d chunks without any bubble on the data bus.


[Figure 7: A 4-line 2-d chunk in raster storage (A) versus tiled storage in contiguous address space (B).]
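A tiled layout like Figure 7 (B) is easy to express as an address mapping. In the sketch below, the tile dimensions are our assumptions, chosen so that one tile (8 bytes x 4 lines = 32 bytes) fills exactly one BL=8 burst on a 32-bit bus:

```c
/* Map a frame byte coordinate (x, y) to a linear address in the tiled
   layout of Figure 7 (B). 'pitch' is the frame width in bytes and is
   assumed to be a multiple of TILE_W. All bytes of one tile are
   contiguous, so a 2-d chunk falling inside a tile is served from a
   single page by one full-length burst.                               */
#define TILE_W 8u /* bytes per tile line (assumed) */
#define TILE_H 4u /* lines per tile (assumed)      */

static unsigned tiled_address(unsigned x, unsigned y, unsigned pitch)
{
    unsigned tiles_per_row = pitch / TILE_W;
    unsigned tile  = (y / TILE_H) * tiles_per_row + (x / TILE_W);
    unsigned inner = (y % TILE_H) * TILE_W + (x % TILE_W);
    return tile * (TILE_W * TILE_H) + inner;
}
```

Placing adjacent tile indices in different banks (e.g. bank = tile % number-of-banks) then gives the bank interleaving recommended above.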

V. Higher DDR frequency versus wider DDR bus

The available bandwidth from the memory can be increased in two ways:

1) By using a wider memory bus
2) By increasing the memory frequency

A wider memory bus means more data per cycle and thus proportionally higher bandwidth; but at the same time a wider bus causes higher pixel overhead, a bigger memory size (in most cases larger than required) with its associated cost, a higher pin count on the chip, etc. Hence a wider bus is not always feasible.

Although the higher frequency reduces the system reliability, it is preferable to use a higher frequency with a smaller DDR data width, so that the pixel overhead is lower and the BL can be set to the maximum for higher efficiency.
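As a quick sanity check with assumed figures, compare a 64-bit bus at 200MHz against a 32-bit bus at 400MHz. Both deliver the same raw bandwidth, but the narrower, faster bus halves the atomic burst unit, and with it the minimum fetch granularity that drives the pixel overhead:

```c
#include <stdio.h>

/* DDR transfers data on both clock edges, so
   raw MBps = 2 * MHz * bytes-per-beat. Atomic unit = BL * bus bytes. */
struct cfg { const char *name; unsigned mhz, bus_bits, bl; };

int main(void)
{
    struct cfg c[2] = {
        { "64-bit @200MHz", 200, 64, 8 },
        { "32-bit @400MHz", 400, 32, 8 },
    };
    for (int i = 0; i < 2; i++) {
        unsigned bytes = c[i].bus_bits / 8;
        printf("%s: raw %u MBps, atomic unit %u bytes\n",
               c[i].name, 2 * c[i].mhz * bytes, c[i].bl * bytes);
    }
    return 0;
}
```

Both configurations report 3200 MBps raw, but the atomic unit shrinks from 64 bytes to 32 bytes, so the 2-d fetch overhead analyzed in section II is lower on the narrower bus.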

VI. Address look-ahead and out-of-order execution of commands by the DDR controller

Advanced bus protocols like AMBA AXI or OCP (Open Core Protocol) allow the system to decouple the address and data chronology. Addresses come with tags, and all the data corresponding to a particular address carries the same tag as its address. Data belonging to different address tags can come in any order, while data within the same tag must maintain its chronology. This helps the system enhance DDR controller performance.

The out-of-order execution can be performed at three stages. If the DDR controller has multiple clients, the first reordering can be performed at the AXI/OCP interconnect level itself. The interconnect should append the client ID to the address ID to separate the addresses of different clients, and it should have look-ahead capability over a few addresses per client. If the next address of one client needs a page change but another client's address will access an already open page, the interconnect should schedule the other client's address first, so that the number of page changes is minimized.
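A minimal sketch of this interconnect-level reordering is shown below. The types, names and open-page bookkeeping are hypothetical illustrations; a real interconnect must also bound the reordering to avoid starving a client:

```c
#include <stddef.h>

#define NUM_BANKS 4u /* assumed */

typedef struct {
    unsigned client_id; /* appended by the interconnect */
    unsigned bank;      /* decoded from the address     */
    unsigned page;      /* DDR row the request falls in */
} request_t;

/* Page currently open in each bank; ~0u means none. The caller updates
   this table after issuing a pre-charge/activate.                      */
static unsigned open_page[NUM_BANKS] = { ~0u, ~0u, ~0u, ~0u };

/* Pick the next request from the look-ahead window: prefer an
   open-page hit; otherwise fall back to the oldest request (index 0)
   to preserve fairness.                                                */
static size_t pick_next(const request_t *q, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (open_page[q[i].bank] == q[i].page)
            return i; /* page hit: schedule before page-change requests */
    return 0;         /* no hit: oldest request goes first              */
}
```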



The second reordering can be performed inside the DDR controller itself. If the immediate next address needs a page activation but some other address in the pipeline will access data from an open page, the controller should execute that address first and send its data with the corresponding tag.

The third type does not alter the data chronology, but some DDR commands are executed out of order to improve DDR bus utilization. If an address in the pipeline needs to pre-charge/activate a new bank, the controller should perform the pre-charge and activation in advance, so that BL/2 cycles after putting the last read/write command to the current bank it can put the read/write command for the newly opened bank (refer to Figure 4 or Figure 5). Thus the read/write commands are issued in the order their addresses arrived, but the pre-charges and activations are done out of order.
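The cycle budget of this third stage can be sketched with back-of-the-envelope arithmetic. The timing values below are assumptions for illustration only:

```c
#include <stdio.h>

int main(void)
{
    const int BL = 8;               /* burst length (assumed)           */
    const int tRCD = 3;             /* activate-to-read delay (assumed) */
    const int last_read_cycle = 10; /* last READ to the current bank    */

    /* For a seamless data bus, the READ to the newly opened bank must
       follow BL/2 cycles after the previous READ; its ACTIVATE must
       therefore be issued at least tRCD cycles before that, i.e. out
       of order with respect to the reads still in flight.              */
    int next_read_cycle = last_read_cycle + BL / 2;
    int activate_cycle  = next_read_cycle - tRCD;

    printf("ACTIVATE next bank at cycle %d, READ it at cycle %d\n",
           activate_cycle, next_read_cycle);
    return 0;
}
```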

VII. Multiple smaller external memories instead of a single wider memory

In a complex video SoC, multiple clients are connected to the DDR controller, and many of them need to access memory very frequently but for short durations. Due to this access pattern, there is frequent context switching, and the associated page-break latency reduces the overall system efficiency. A very complex arbitration mechanism is also required to serve the different access patterns of the clients. If, instead of a single DDR, two DDRs each of half the size and half the width are used, the load sharing is better, the client switchover latency reduces and, at the same time, the smaller DDR bus reduces the pixel overhead in the decoder.

The benefit in system efficiency needs to be weighed against the cost of the additional DDR controller on the chip. But in most cases, that die size increment is not a big concern as the technology keeps shrinking. So, multiple smaller memories are preferable to a single wider memory.

VIII. Conclusion

According to a press release of the U.S. Federal Communications Commission on July 2, 2009, 65% of total television broadcasting had migrated to digital by June 2009. In Europe and Japan too, analog broadcast will be replaced by digital in almost the same time frame. So today's very high-end digital video systems will be in every consumer's home by the 2010-2011 time frame. And to satisfy never-ending consumer demand, even more complex applications like 3DTV, multiple full-HD picture-in-picture, beyond-1080 resolutions and 120Hz refresh rates have already started getting incorporated into these systems.

Even today's high-end applications will migrate to mid-range or even low-end systems at low cost. This huge bandwidth demand can't be met simply by using wider DDR2/DDR3 at very high frequency, because of silicon pad count limitations, EMI and design complexity, which translate into high system costs and longer time to market. We will have to use different memories like QDR (Quad Data Rate) and XDR (eXtreme Data Rate) memories, and very sophisticated memory controllers, to address the rapidly increasing bandwidth requirement.

Author's Bio
Sutirtha Deb is a lead design engineer of the Video IP group at Ittiam Systems (P) Ltd, Bangalore, India. He can be reached at sutirthadeb@rediffmail.com
