PRODUCT HOW-TO: Taking the delay out of your multicore design's intra-chip interconnections

by Steve Leibson, TechOnline India - January 07, 2009

How to incorporate custom instructions into multicore designs based on the Xtensa LX2 processor to bypass bus protocols and arbitration mechanisms that require multi-cycle transactions and can slow system performance.

Today's SOC designers readily accept the idea of using multiple processor cores in their complex systems to achieve design goals. Unfortunately, a 40-year history of processor-based system design has made the main processor bus the sole data highway into and out of most processor cores. The widespread use of processor cores in SOC designs, combined with heavy reliance on the processors' main buses for primary on-chip interconnect, produces SOC architectures based on bus hierarchies like the one shown in Figure 1 below.

Figure 1: SoCs with multiple processors often employ bus hierarchies

Because processors interact with other types of bus masters, including other processors and DMA controllers, main processor buses feature sophisticated transaction protocols and arbitration mechanisms to support that complexity.

These protocols and arbitration mechanisms usually require multi-cycle bus transactions that can slow system performance. As more processors are designed into a chip to perform more of the processing, the hierarchical bus architecture shown in Figure 1 becomes increasingly inefficient: more processors must use inefficient bus arbitration and transaction protocols to gain access to relatively limited bus interconnect resources.

For example, the Xtensa LX2 processor's main bus, called the PIF, uses read transactions that require at least six cycles and write transactions that require at least one cycle, depending on the speed of the target device. Using these transaction timings, we can calculate the minimum number of cycles needed to perform a simple flow-through computation: load two numbers from memory, add them, and store the result back into memory. The assembly code to perform this computation might look like this:

 L32I reg_A, Addr_A ; Load the first operand
L32I reg_B, Addr_B ; Load the second operand
ADD reg_C, reg_A, reg_B ; Add the two operands
S32I reg_C, Addr_C ; Store the result
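In C terms, the same flow-through computation is simply a load-add-store sequence. A minimal sketch, with pointer names standing in for Addr_A, Addr_B, and Addr_C:

/* Flow-through computation: load two 32bit operands, add them, store the result. */
void add_flow_through(const int *addr_a, const int *addr_b, int *addr_c)
{
    int a = *addr_a;      /* L32I reg_A, Addr_A */
    int b = *addr_b;      /* L32I reg_B, Addr_B */
    *addr_c = a + b;      /* ADD reg_C, reg_A, reg_B; S32I reg_C, Addr_C */
}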

The minimum cycle count required to perform the flow-through computation is:

 L32I reg_A, Addr_A ; 6 cycles
L32I reg_B, Addr_B ; 6 cycles
ADD reg_C, reg_A, reg_B ; 1 cycle
S32I reg_C, Addr_C ; 1 cycle
Total cycle count: 14 cycles (not counting additional processor pipeline effects)

The large number of cycles required for processor-based flow-through operations often becomes a major factor favoring a purpose-built block of RTL hardware for the flow-through task, because a conventional processor core communicating over its main bus would simply be too slow.

One frequently used way to attack this problem is a faster bus. For example, like many RISC processor cores, some Xtensa processor configurations have a local-memory bus interface, called the XLMI, that implements a simpler transaction protocol than the processor's PIF. XLMI transactions are simpler than PIF transactions because the XLMI bus is not designed to support multiple bus masters; as a result, load and store operations can complete in one cycle.

Conducting loads and stores over the processor's XLMI bus instead of the PIF results in the following timing:

 L32I reg_A, Addr_A ; 1 cycle
L32I reg_B, Addr_B ; 1 cycle
ADD reg_C, reg_A, reg_B ; 1 cycle
S32I reg_C, Addr_C ; 1 cycle

Total cycle count: 4 cycles (with the same caveat regarding the processor pipeline)

This timing represents a 3.5x improvement in the function's cycle count (4 cycles versus 14), and that improvement, achieved with no increase in clock rate, can mean the difference between acceptable and unacceptable performance. However, the XLMI bus still conducts only one transaction at a time; loads and stores still occur sequentially, which is still too slow for many processing tasks.

Consequently, processor core vendors offer much faster alternatives for on-chip, block-to-block communications. For example, Tensilica has boosted the I/O bandwidth of the Xtensa LX2 processor core with two different features called TIE ports and queue interfaces. (Note: TIE is Tensilica's Instruction Extension language, used to customise Xtensa processors for specific on-chip roles.) These two features can easily boost I/O transaction speeds by as much as three orders of magnitude with no clock-speed increase.

Ports and queue interfaces are simple, direct communication structures. Transactions conducted over ports and queue interfaces are not handled by the processor's load/store unit and use no explicit memory addresses. Instead, customised processor instructions initiate port and queue transactions: ports and queues reside outside of the processor's memory space, and each one is specified implicitly by the custom port or queue instruction that uses it. One designer-defined port or queue instruction can initiate transactions on several ports and queues at the same time, which further boosts the processor's I/O bandwidth.

Using this interconnect, it's possible to create queue interfaces that are especially efficient for the simple flow-through problem discussed above (load two operands, add them, output the result). Three queue interfaces are needed to minimise the time required for this task: two input queues for the input operands and one output queue for the result. With these three queue interfaces defined, it's possible to define a customised instruction that implicitly draws input operands A and B from their respective input queues, adds them together, and outputs the result C on the output queue.

The problem becomes more interesting if we make the three operands and the associated addition operation 256 bits wide. An off-the-shelf, 32bit processor core would need to process the 256bit operands in eight 32bit chunks using 16 loads, eight adds, and eight stores (a C sketch of this chunked baseline appears after the queue declarations below). A customised processor core can perform the entire task as one operation. The TIE code needed to create the three 256bit queue interfaces is:

queue InQ_A 256 in
queue InQ_B 256 in
queue OutQ_C 256 out

The first two statements declare 256bit input queues named InQ_A and InQ_B. The third statement declares a 256bit output queue named OutQ_C. Each TIE queue statement adds a parallel I/O port along with the handshake lines needed to connect to an external FIFO memory.
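For reference, the chunked 32bit baseline mentioned above can be sketched in C as follows (the function name is hypothetical, and carry propagation between the 32bit chunks is ignored here, matching the 16-load, eight-add, eight-store count above):

/* Conventional 32bit core: process 256bit operands as eight 32bit chunks. */
void add256_chunked(const unsigned int a[8], const unsigned int b[8], unsigned int c[8])
{
    for (int i = 0; i < 8; i++) {
        c[i] = a[i] + b[i];   /* two loads, one add, one store per chunk */
    }
}

Each chunk also pays the bus-transaction costs described earlier, which is what makes the single-instruction, queue-based alternative so much faster.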

The following TIE code describes an instruction, ADD_XFER, that reads a 256bit operand from each of the input queues defined above, adds the two values together, and writes the result to the 256bit output queue.

operation ADD_XFER {} {in InQ_A, in InQ_B, out OutQ_C}
{
    assign OutQ_C = InQ_A + InQ_B;
}

With this new instruction, the target task reduces to one instruction:

ADD_XFER

The hardware added to the processor to perform the ADD_XFER operation appears in Figure 2 below.

Figure 2: Creating a dedicated instruction requires very few additional gates, but delivers impressive results

Very few additional gates are required to add this ability to a processor core, yet the performance increase is immense. The ADD_XFER instruction takes five cycles to run through the processor's 5-stage pipeline, but because of processor pipelining the instruction has a flow-through latency of only one clock cycle.

By placing the ADD_XFER instruction within a zero-overhead loop, the processor delivers an effective throughput of one ADD_XFER instruction per clock cycle, which is 112 times faster than performing the same operation over the processor's PIF using 32bit load, store, and add instructions (14 cycles per 32 bits of I/O and computation equals 112 clock cycles per 256 bits).
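From C, a TIE instruction like ADD_XFER is normally reached through a compiler intrinsic of the same name, made available by a header that the TIE tools generate. A minimal sketch of the streaming loop, assuming such an intrinsic and an illustrative header name:

#include <xtensa/tie/add_xfer_tie.h>   /* illustrative name; the real header is generated from the TIE file */

/* Stream n 256bit additions through the queue interfaces.
   Each call consumes one operand from InQ_A and one from InQ_B
   and pushes one result onto OutQ_C; no data crosses the PIF. */
void stream_add_xfer(unsigned int n)
{
    for (unsigned int i = 0; i < n; i++) {
        ADD_XFER();   /* one custom instruction per iteration */
    }
}

The compiler can typically map such a simple loop onto the processor's zero-overhead loop hardware, and if an input queue runs empty or the output queue fills, the processor stalls until the external FIFO is ready again.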

Steve Leibson is with Tensilica Inc. This article is adapted from the author's book, Designing SOCs with Configured Cores, published by Morgan Kaufmann.
