Beefing up the Cortex-M3-based MCU to Handle 480 Mbps High-speed USB

by Jacko Wilbrink, TechOnline India - May 24, 2009

The right mix of DMA controllers and memories can boost the throughput of a deterministic 32-bit Cortex-M3 MCU enough to handle the high-speed 480 Mbps USB specification, providing an effective solution for current and future embedded system data requirements.

The universal serial bus (USB) has completely replaced UART, PS/2 and IEEE-1284 parallel interfaces on PCs, and is now gaining wide acceptance in embedded applications. Most of the I/O devices (keyboards, scanners, mice) used with embedded systems are USB-based, and for good reason.

Because USB is a well-defined standard whose compliance program is run by the USB Implementers Forum (USB-IF), any USB-certified device from any vendor will work in a plug-and-play fashion with any USB-certified device from any other vendor.

Multiple devices can operate on the same bus without affecting each other. It is not surprising that the majority of 32-bit flash MCU and MPU vendors offer some form of USB interface as a standard peripheral (USB host, USB device or USB OTG), usually limited to the "full-speed" specification of 12 Mbps (Figure 1, below).

Figure 1. USB Block Diagram

Now, the USB standard is set to solve another issue for embedded systems: the exponential growth in data rates. Five years ago, a data rate of 10 Mbps was considered high. A 12 Mbps full-speed USB, 10 Mbps SPI or 400 kbps I2C interface could cover the data requirements of nearly any embedded application.

Today, however, with ever larger log files and increasingly sophisticated user interfaces, data rates of even tens of Mbps are no longer enough to provide an adequate user experience. Log files of several gigabytes must be transferred between systems in tens of seconds, and information collected from different PCBs within a system must be transferred to a graphical display unit.

Enter "high-speed" USB. With a bandwidth of 480 Mbps, high-speed USB can meet today's demand for tens of Mbps and tomorrow's demand for hundreds of Mbps. It provides a well-understood, easy-to-use vehicle for handling large amounts of data and for interconnection both between systems and between the printed circuit boards (PCBs) inside a system.
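
As a rough illustration of the gap between full speed and high speed, the raw wire time for a large file can be estimated as below. This is only a sketch: the helper name and the 1 GB file size are assumptions for illustration, and protocol overhead is ignored, so real transfers take longer.

```c
#include <assert.h>
#include <stdint.h>

/* Whole seconds (rounded up) needed to move `bytes` over a link running at
 * `mbps` megabits per second. Protocol overhead is deliberately ignored. */
static uint64_t transfer_seconds(uint64_t bytes, uint64_t mbps)
{
    assert(mbps > 0);
    uint64_t bits = bytes * 8u;
    uint64_t bits_per_second = mbps * 1000000u;
    return (bits + bits_per_second - 1u) / bits_per_second;
}
```

Under these assumptions, a 1 GB (10^9 byte) file needs about 667 seconds at 12 Mbps and about 17 seconds at 480 Mbps.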

It will undoubtedly soon be adopted for inter-chip communication on a single PCB, bypassing the physical layer and removing the need for cable connections. Systems supporting this technology are forecast to ramp up in 2009.

Clearly, microcontrollers will have to add high-speed USB support to keep up with the market. But the question is: how do you build a microcontroller that can handle the data load from a 480 Mbps USB interface while meeting the power consumption and real-time constraints of an embedded system?

Sustaining a 480 Mbps data rate on a 400 MHz ARM9-based microprocessor with on-chip cache memory connected to a single-plane high-speed external memory is fairly easy. Running it on a 100 MHz Cortex-M3 flash MCU, executing from relatively slow flash memory, is a whole other story.

Can the Cortex-M3 handle all these data transfers while running process-intensive tasks such as data processing algorithms, file systems and communication protocols?

The solution is to adapt the multi-layer bus architecture (Figure 2, below) used in ARM9 microprocessors to the Cortex-M3 and to divide the memory space into multiple blocks distributed across the architecture, so that real-time-critical processing is protected while high-speed data is transferred by DMA in parallel.

Figure 2. High-data-rate-architecture.

Direct memory access (DMA) is critical. Using the CPU for transfers would overload the CPU, likely preventing it from processing the application and real-time control tasks. Ideally, three types of DMAs must be connected to all low- and high-speed peripherals to minimize the data transfer load on the bus and memories, and to free the processor for the data processing and system control tasks.

DMAs with built-in buffers (for improved tolerance to bus latency), burst transfers and linked-list support are relatively bulky and are reserved for the highest-speed interfaces.

They offer a high level of CPU independence and minimum bus usage. Because of their higher cost per channel, using these full-featured DMAs on every on-chip peripheral is not feasible.

A multi-type DMA implementation offers a balance between cost and performance by combining three controllers: a peripheral DMA controller (PDC) that links low-data-rate peripherals directly to memories with minimum complexity and the simplest programming model; a tuned DMA controller dedicated to and optimized for the highest-bandwidth peripherals; and a central DMA optimized for memory-to-memory and memory-to-high-speed-peripheral block transfers.

Since the DMA controller operates completely independently of the processor, it eliminates interrupt overhead and substantially reduces the number of clock cycles required for a data transfer.

A cost-reduced DMA implementation is the Peripheral DMA Controller, or PDC, which is tightly integrated in the peripheral programmer's space. The PDC brings benefits in overall system performance, cost and ease of programming.

The user interface of a PDC channel is integrated in each peripheral's memory space. It contains a 32-bit memory pointer register, a 16-bit transfer count register, a 32-bit register for the next memory pointer, and a 16-bit register for the next transfer count.

When the peripheral receives an external character, it sends a Receive Ready signal to the PDC which then requests access to the system bus. When access is granted, the PDC starts a read of the peripheral Receive Holding Register (RHR) and then triggers a write in the memory.

After each transfer, the relevant PDC memory pointer is incremented and the number of transfers left is decremented. When the memory block size is reached, the next block transfer is automatically started or a signal is sent to the peripheral and the transfer stops. The same procedure is followed, in reverse, for transmit transfers.

When the first programmed data block has been transferred, the corresponding peripheral generates an end-of-transfer interrupt. The second block transfer starts automatically, so the ARM processor can process the first block in parallel. This removes the heavy real-time interrupt constraint of updating the DMA memory pointers on the processor, and sustains data transfers of up to 15 to 20 Mbps on any peripheral.
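
The receive sequence described above can be sketched as a small software model. This is an illustration under stated assumptions, not the actual register map: the struct layout, field names and the function pdc_rx_byte are hypothetical, and the interrupt is modeled as a simple counter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical software model of one PDC receive channel. */
typedef struct {
    uint8_t *ptr;               /* models the 32-bit memory pointer register */
    uint16_t count;             /* models the 16-bit transfer count register */
    uint8_t *next_ptr;          /* models the next memory pointer register   */
    uint16_t next_count;        /* models the next transfer count register   */
    unsigned end_of_block_irqs; /* end-of-transfer interrupts raised so far  */
} pdc_rx_channel;

/* Called once per Receive Ready: store the byte read from the holding
 * register, advance the pointer and decrement the count. When a block
 * completes, raise an interrupt and reload from the "next" registers.
 * Returns 1 if the byte was stored, 0 if no buffer is programmed. */
static int pdc_rx_byte(pdc_rx_channel *ch, uint8_t rhr)
{
    assert(ch != NULL);
    if (ch->count == 0)
        return 0;                   /* transfer stopped: nothing programmed */
    *ch->ptr++ = rhr;
    if (--ch->count == 0) {
        ch->end_of_block_irqs++;    /* CPU may now process the finished block */
        ch->ptr = ch->next_ptr;     /* next block starts automatically */
        ch->count = ch->next_count;
        ch->next_ptr = NULL;
        ch->next_count = 0;
    }
    return 1;
}
```

In this model, the processor only has to refill next_ptr and next_count before the current block completes, which is a far looser deadline than servicing every received character.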

If simultaneous requests of the same type (receiver or transmitter) occur on identical peripherals, the priority is determined by the numbering of the peripherals. If transfer requests are not simultaneous, they are treated in the order they occurred. Requests from the receivers are handled first and then followed by transmitter requests.
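
For simultaneous requests, these priority rules can be sketched as follows. The bitmask encoding, the type pdc_grant and the function names are hypothetical, and the sketch assumes that the lowest-numbered peripheral wins, which the text does not state explicitly.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    int valid;           /* 0 when nothing is pending */
    int is_rx;           /* 1 for a receiver request, 0 for a transmitter */
    unsigned peripheral; /* peripheral number that wins the arbitration */
} pdc_grant;

/* Index of the lowest set bit; the mask must be non-zero. */
static unsigned lowest_set_bit(uint32_t mask)
{
    assert(mask != 0);
    unsigned i = 0;
    while (!(mask & 1u)) {
        mask >>= 1;
        i++;
    }
    return i;
}

/* Bit n of each mask means "peripheral n has a pending request".
 * Receivers are served before transmitters; within one type the
 * lowest peripheral number wins (an assumption of this sketch). */
static pdc_grant pdc_arbitrate(uint32_t rx_pending, uint32_t tx_pending)
{
    pdc_grant g = { 0, 0, 0 };
    if (rx_pending != 0) {
        g.valid = 1;
        g.is_rx = 1;
        g.peripheral = lowest_set_bit(rx_pending);
    } else if (tx_pending != 0) {
        g.valid = 1;
        g.peripheral = lowest_set_bit(tx_pending);
    }
    return g;
}
```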

Above 10 Mbps, the time needed to get access to the internal bus and memory becomes critical, and a two-level-deep receive and transmit buffer is no longer sufficient to sustain a continuous transfer. A central DMA with a deeper buffer and the ability to burst data transfers is required.

This central DMA features a built-in FIFO for increased tolerance to bus latency, programmable-length burst transfers that optimize the average number of clock cycles per transfer, and scatter-gather and linked-list operations. It can be programmed for memory-to-memory transfers or for memory-to-peripheral transfers, for example to a 48 Mbps SPI, a 4-bit 192 Mbps SDIO/SD Card 2.0 interface or an 8-bit 384 Mbps MMC 4.3 host interface.
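
A linked-list (scatter-gather) transfer can be pictured as a chain of descriptors that the controller walks without CPU help. The sketch below models this in software; the descriptor layout and the names dma_desc and dma_run_chain are hypothetical, since real controllers define their own descriptor fields, and memcpy stands in for the hardware burst.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical descriptor: one contiguous piece of a larger transfer. */
typedef struct dma_desc {
    const void *src;
    void *dst;
    size_t len;
    const struct dma_desc *next; /* NULL terminates the chain */
} dma_desc;

/* Walks the chain the way a linked-list DMA would, copying each piece
 * in turn. Returns the total number of bytes moved. */
static size_t dma_run_chain(const dma_desc *d)
{
    size_t total = 0;
    for (; d != NULL; d = d->next) {
        assert(d->len == 0 || (d->src != NULL && d->dst != NULL));
        memcpy(d->dst, d->src, d->len);
        total += d->len;
    }
    return total;
}
```

A typical gather use is assembling a packet from a header and a payload that live in different buffers: two descriptors point at the two sources and at adjacent offsets in one destination buffer, and the processor is only involved when the whole chain has completed.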

The dedicated DMAs each have their own layer on the bus matrix, which eliminates bus access latency. These DMAs are optimized for high-speed peripherals such as a high-speed USB interface, an Ethernet MAC or an LCD controller, maximizing the data transferred in the shortest time.

For a 7-endpoint high-speed USB device interface, such a DMA would integrate a 4 KB DPRAM to store full packet payloads and prevent buffer underflows and overflows. The DMA would be composed of several channels, each dedicated to an endpoint. The DMA channels would transfer data between multiple banks within the DPRAM and the USB controller.

Figure 3. SAM3U Block Diagram

Up to three banks should be allocated to a single endpoint to store micro-frames. Such a high-speed USB implementation (Figure 3, above) on a Cortex-M3 MCU running at 96 MHz with a 5-layer bus matrix and two central SRAM memories can sustain the maximum bandwidth of 480 Mbps.
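
Whether a given endpoint layout fits in the DPRAM is simple bookkeeping, sketched below. The type ep_config, the function name and the example bank sizes are assumptions for illustration; only the 4 KB DPRAM size and the three-bank limit come from the text.

```c
#include <assert.h>
#include <stddef.h>

#define DPRAM_BYTES 4096u /* 4 KB DPRAM, as described in the text */
#define MAX_BANKS   3u    /* up to three banks per endpoint */

typedef struct {
    unsigned bank_bytes; /* size of one bank, e.g. the endpoint's max packet size */
    unsigned banks;      /* number of banks allocated to this endpoint */
} ep_config;

/* Returns the DPRAM bytes used by the layout, or 0 if an endpoint has an
 * invalid bank count or the layout does not fit in the DPRAM. */
static unsigned dpram_bytes_used(const ep_config *eps, size_t n)
{
    assert(eps != NULL);
    unsigned used = 0;
    for (size_t i = 0; i < n; i++) {
        if (eps[i].banks == 0 || eps[i].banks > MAX_BANKS)
            return 0;
        used += eps[i].bank_bytes * eps[i].banks;
        if (used > DPRAM_BYTES)
            return 0;
    }
    return used;
}
```

For example, a 64-byte control endpoint with one bank, two 512-byte bulk endpoints with two banks each and a 64-byte interrupt endpoint with three banks use 2304 bytes. Adding a triple-banked 1024-byte endpoint to a layout that already holds a double-banked 512-byte endpoint and a 64-byte control endpoint would exceed 4096 bytes.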

Memories. The Cortex-M3 processor has three memory buses: the Instruction bus (I), the Data bus (D) and the System bus (S). This bus architecture is a major difference from the single-bus ARM7 architecture. Apart from instruction fetches on the S bus, every bus offers the processor single-cycle memory access.

There are two options for getting the highest instruction fetch speed from the Cortex-M3 and overcoming the access-time limitation of flash memory: either connect a single-cycle-access SRAM to the I bus, or increase the width of the flash memory so that multiple instructions are read at once, which accelerates sequential instruction fetches.

Only SRAM is fast enough to match the maximum 96 MHz operating frequency of the Cortex-M3 processor. However, optimizing for performance by using multiple on-chip SRAM memories for instruction and data storage increases both cost and standby power consumption. Using less expensive, low-leakage non-volatile flash memory poses a different problem: flash cannot run at the same frequency as the Cortex-M3 processor.

The typical flash access time is in the range of 35 to 50 ns, while the Cortex-M3 clock cycle at 96 MHz is only about 10 ns. To compensate for the long flash access time, the width of the flash memory can be increased to 64 or 128 bits, corresponding respectively to four and eight 16-bit Thumb-2 instructions.

This approach allows multiple sequential instructions to be read each cycle, stored in a register and then executed at the full 96 MHz clock rate without any wait states.
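
The trade-off between flash width and access time can be checked with a little arithmetic. The sketch below is a simplification under stated assumptions: it takes the clock period from the 96 MHz figure (about 10.417 ns), counts only 16-bit instructions, and ignores branches and any prefetch-buffer details. The function names are hypothetical.

```c
#include <assert.h>

/* Clock cycles needed for one flash read: ceil(access / cycle).
 * Times are given in picoseconds to stay in integer arithmetic. */
static unsigned flash_read_cycles(unsigned access_ps, unsigned cycle_ps)
{
    assert(cycle_ps > 0);
    return (access_ps + cycle_ps - 1u) / cycle_ps;
}

/* Returns 1 if one wide read delivers at least as many 16-bit
 * instructions as the number of cycles the read takes, so that
 * sequential code can run without stalling. */
static int flash_keeps_up(unsigned width_bits, unsigned access_ps,
                          unsigned cycle_ps)
{
    unsigned instructions_per_read = width_bits / 16u;
    return instructions_per_read >= flash_read_cycles(access_ps, cycle_ps);
}
```

With a 10.417 ns cycle, a 35 ns flash read takes 4 cycles and a 50 ns read takes 5. Under these assumptions a 64-bit flash (four instructions per read) keeps up only at the fast end of the range, while a 128-bit flash (eight instructions per read) keeps up across the whole 35 to 50 ns range.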

SRAM should also be used to store data and as scratch-pad memory. In order to maximize performance, the SRAM memory needs to be split into multiple blocks to allow the processor and peripherals to move data in and out via the parallel busses of the multi-bus architecture. Such blocks can be directly integrated in high bandwidth peripherals or used as central memory assigned to the CPU and peripherals under software control.

Data memories connected to the Cortex-M3 D bus are not shared with other on-chip peripherals or DMAs. This poses a dilemma: (1) should the data memory be shared on the S bus, giving up some raw performance in exchange for allowing the DMAs to load and store data in memory without CPU intervention, or (2) should the data memory be connected to the D bus, with the processor used to move data?

The PDC benchmarks in Table 1 below, run on a SAM7X microcontroller, clearly show the benefits of maximizing DMA usage for data transfers in a microcontroller with multiple communication channels.

Table 1: Benchmark reading from Flash memory at 55MHz in Thumb Mode on a SAM7X Flash Microcontroller

A Cortex-M3 microcontroller with a high-speed USB interface will in general have other high-speed communication or storage interfaces on-chip. To sustain high data transfer rates between these interfaces while performing protocol conversion, data processing and system control in parallel, multiple shared SRAM blocks should be implemented for the best system performance.

The DMAs and the processor S-bus can be connected to an on-chip multi-layer bus matrix as bus masters and have direct access to the blocks of shared memories.

If the objective is to create a number cruncher that streams only a limited amount of data, putting SRAM on the I and D buses is mandatory. In this case, a portion of the performance-critical code should be shadowed in the SRAM connected to the I bus to release the full 1.25 DMIPS/MHz at 100 MHz operation.

When off-chip memory is required for program and data storage, the Cortex-M3 S bus is typically used, since it combines instruction and data fetches and so limits the external interface to a single bus. This compromise results in an immediate drop in performance from 1.25 to 0.9 DMIPS/MHz, because instruction fetches on the S bus take one extra cycle compared with the I bus.

The efficiency of the internal bus architecture and the external memory bus controller implementation determines the final performance obtained when executing out of external memory.

Table 2: Maximum performance benchmark, Dhrystone 2.1, RVCT compiler and linker, optimization for performance, zero wait state memory

Table 2 above compares an ARM7 MCU with a Cortex-M3 MCU, both with an external bus interface (EBI). The ARM7-based EBI can sustain between 0.54 and 0.67 DMIPS/MHz, while the EBI on the Cortex-M3 device is limited to only 0.117 DMIPS/MHz. Clearly, benchmarking execution out of external memory needs to be part of the microcontroller selection process.

Multi-layer Bus. Another problem facing data-intensive applications is on-chip bus bandwidth. When multiple DMA controllers and the processor push massive amounts of data over a single bus, the bus can become overloaded and slow down the entire system. A 32-bit bus clocked at 48 MHz has a maximum data rate of about 1.5 billion bits per second (Gbps).

Although that sounds like a lot, in data-intensive applications there may be so much data that the bus itself becomes a bottleneck. Such is the case in a USB high-speed to SDIO gateway, where each interface has a data rate of more than 100 Mbps. Even with a DMA that optimizes data transfers, little bandwidth is left for the processor to access the shared data memories.
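
The load that a data stream puts on a single shared bus can be estimated as below. This is an idealized sketch: the function names are hypothetical, the capacity is the theoretical peak of one transfer per clock, and the assumption that gateway data crosses the bus twice (once into memory, once out) is made for illustration. Arbitration, wait states and non-burst accesses reduce the usable capacity further.

```c
#include <assert.h>

/* Theoretical peak capacity in Mbps: bus width in bits times clock in MHz. */
static unsigned bus_capacity_mbps(unsigned width_bits, unsigned clock_mhz)
{
    return width_bits * clock_mhz;
}

/* Percentage of the bus (rounded up) consumed by a stream of `stream_mbps`
 * that crosses the bus `crossings` times on its way through the system. */
static unsigned bus_load_percent(unsigned stream_mbps, unsigned crossings,
                                 unsigned capacity_mbps)
{
    assert(capacity_mbps > 0);
    return (stream_mbps * crossings * 100u + capacity_mbps - 1u) / capacity_mbps;
}
```

With a 32-bit, 48 MHz bus the peak is 1536 Mbps. Under these assumptions, a stream at the full 480 Mbps USB rate crossing the bus twice already takes about 63% of that ideal peak before the processor has made a single access.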

This situation can be avoided by providing multiple, parallel on-chip busses (also called a multi-layer bus matrix). In contrast to the ARM7, which supports only a single-layer ASB bus, the Cortex M3 processor has native support for a multi-layer bus. As a result, the Cortex M3 is a better processor for microcontrollers with high speed USB.

On the multi-layer (AHB) bus (Figure 4, below), the processor and the different DMA controllers in the chip each have their own dedicated layer on the bus.

Figure 4. SAM3U Multi-layer bus & Memories

When multiple peripherals share a DMA channel or when memories are shared, an arbiter sets the priorities. Examples of arbitration strategies are fixed-priority and round-robin arbitration, with no default master, a last-accessed default master or a fixed default master.
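
The two arbitration strategies named above can be sketched in a few lines. The request bitmask encoding and the function names are hypothetical, the fixed-priority variant assumes that the lowest-numbered master has the highest priority, and default-master handling is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Fixed priority: the lowest-numbered requesting master wins
 * (an assumption of this sketch). Returns -1 if nobody requests. */
static int fixed_priority_grant(uint32_t requests, unsigned n_masters)
{
    assert(n_masters > 0 && n_masters <= 32);
    for (unsigned m = 0; m < n_masters; m++) {
        if (requests & (UINT32_C(1) << m))
            return (int)m;
    }
    return -1;
}

/* Round-robin: the search starts just after the master granted last,
 * so every requesting master is served within n_masters grants.
 * Returns -1 if nobody requests. */
static int round_robin_grant(uint32_t requests, unsigned n_masters,
                             unsigned last_granted)
{
    assert(n_masters > 0 && n_masters <= 32 && last_granted < n_masters);
    for (unsigned i = 1; i <= n_masters; i++) {
        unsigned candidate = (last_granted + i) % n_masters;
        if (requests & (UINT32_C(1) << candidate))
            return (int)candidate;
    }
    return -1;
}
```

Fixed priority gives the most predictable latency to the top-priority master but can starve the others, whereas round-robin shares the slave evenly among all requesting masters.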

Benefits of the AHB bus over the ASB bus
The AHB bus offers a major advantage over the ASB bus, because it is designed for parallel data transfers. A configurable number of buses can be routed in parallel to the main bus connected to the processor.

A Direct Memory Access (DMA) controller should use one of these additional buses to transfer data from peripherals to memory or from memory to memory without the intervention of the processor. On an ARM7 ASB architecture, the DMA would steal the bus from the processor and prevent it from fetching instructions or data from on- and off-chip memory.

Finally, by increasing the number of volatile memory blocks in the controller, conflicts in access to shared memories can be reduced and another performance bottleneck removed. Memory blocks can be directly integrated in high-bandwidth peripherals or used as central memory assigned to the processor or peripherals under software control.

Thus, it is possible to create a Cortex-M3-based implementation that can sustain a 480 Mbps data transfer rate with a maximum clock rate of 96 MHz at 1.8 V and 84 MHz at 1.62 V, delivering 1.25 DMIPS/MHz with single-cycle multiply and hardware divide.

The key to achieving this is a highly efficient multi-channel DMA architecture in the bus interface, built on a low-gate-count, easy-to-use DMA capable of sustaining up to 20 Mbps per channel on each of the low- and mid-range data rate peripherals.

In addition, central DMA channels and a dedicated high-speed USB DMA should be used to support the higher data rates of several hundred Mbps per peripheral. With this setup, the processor only intervenes to start each data transfer and is alerted by an interrupt when the transfer is complete.

This arrangement takes almost the entire data transfer load off the processor, enabling it to prioritize control and data processing tasks. Interrupts are handled deterministically by the Cortex-M3 interrupt controller in a maximum of 12 cycles. With tail-chaining, interrupt latency falls to 6 cycles, further reducing the processor overhead when switching tasks.
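
To put these cycle counts in terms of time, the conversion below can be used. It is a small arithmetic sketch with a hypothetical helper name; the 96 MHz clock is the maximum frequency quoted earlier in the article.

```c
#include <assert.h>
#include <stdint.h>

/* Duration of `cycles` clock cycles at `clock_mhz`, in picoseconds. */
static uint64_t cycles_to_ps(uint64_t cycles, uint64_t clock_mhz)
{
    assert(clock_mhz > 0);
    return cycles * 1000000u / clock_mhz;
}
```

At 96 MHz, 12 cycles correspond to 125 ns and 6 cycles to 62.5 ns.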

Finally, the increased speed offered by USB 2.0 high-speed (480 Mbps) makes it faster than the vast majority of other interfaces, enabling larger amounts of data to be transferred in less time and improving the user experience of a system.

The right mix of DMAs and memories can boost the resources of a deterministic Cortex-M3 MCU to easily handle high-speed USB, providing an effective solution for current and future data requirements in microcontrollers.

Jacko Wilbrink, Product Marketing Director, Atmel, fostered the development of the industry's first ARM-based standard product microcontroller in 1998. He has an engineering degree from the University of Twente, the Netherlands, and over 20 years of experience in the semiconductor industry. He can be contacted at
