Today's SOC designers readily accept the idea of using multiple
processor cores in their complex systems to achieve design goals.
Unfortunately, a 40-year history of processor-based system design has
made the main processor bus the sole data highway into and out of most
processor cores. The widespread use of processor cores in SOC designs
and the heavy reliance on the processors' main buses for primary on-chip
interconnect produce SOC architectures based on bus hierarchies like
the one shown in Figure 1 below.
Figure 1: SoCs with multiple processors often employ bus hierarchies
Because processors interact with other types of bus masters,
including other processors and DMA controllers, main processor buses
feature sophisticated transaction protocols and arbitration mechanisms
that support such design complexity.
These protocols and arbitration mechanisms usually require
multi-cycle bus transactions that can slow system performance. As more
processors are designed into a chip to perform more processing, the
bus-hierarchy architecture shown in Figure 1 becomes increasingly
inefficient because more processors must use slow bus arbitration
and transaction protocols to gain access to, and then use,
relatively limited bus interconnect resources.
For example, the Xtensa LX2 processor's main bus, called the PIF,
uses read transactions that require at least six cycles and write
transactions that require at least one cycle, depending on the speed of
the target device. Using these transaction timings, we can calculate
the minimum number of cycles needed to perform a simple flow-through
computation: load two numbers from memory, add them, and store the
result back into memory. The assembly code to perform this computation
might look like this:
L32I reg_A, Addr_A       ; Load the first operand
L32I reg_B, Addr_B       ; Load the second operand
ADD  reg_C, reg_A, reg_B ; Add the two operands
S32I reg_C, Addr_C       ; Store the result
The minimum cycle count required to perform the flow-through
computation is:
L32I reg_A, Addr_A ; 6 cycles
L32I reg_B, Addr_B ; 6 cycles
ADD reg_C, reg_A, reg_B ; 1 cycle
S32I reg_C, Addr_C ; 1 cycle
Total cycle count: 14 cycles
The large number of required cycles for processor-based flow-through
operations often becomes a major factor that favors the design of a
purpose-built block of RTL hardware to perform the flow-through task
because a conventional processor core communicating over its main bus
would be too slow.
One frequently used way to solve this problem is to use a faster
bus. For example, like many RISC processor cores, some Xtensa processor
configurations have a local-memory bus interface called the XLMI that
implements a simpler transaction protocol than the processor's PIF.
XLMI transaction protocols are simpler than PIF protocols because the
XLMI bus is not designed to support multimaster protocols. As a result,
load and store operations can occur in one cycle.
Conducting loads and stores over the processor's XLMI bus instead of
the PIF results in the following timing:
L32I reg_A, Addr_A ; 1 cycle
L32I reg_B, Addr_B ; 1 cycle
ADD reg_C, reg_A, reg_B ; 1 cycle
S32I reg_C, Addr_C ; 1 cycle
Total cycle count: 4 cycles
(with the same caveat regarding the processor pipeline)
This timing represents a 3.5x improvement in the function's cycle
count (14 cycles reduced to 4). This improvement can mean the difference
between acceptable and unacceptable performance, yet it requires no
increase in clock rate.
However, the XLMI bus still conducts only one transaction at a time;
loads and stores still occur sequentially, which is still too slow for
many processing tasks.
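The two cycle counts above can be reproduced with a small model. The following Python sketch is only an illustration: the names (PIF_CYCLES, XLMI_CYCLES, total_cycles) are hypothetical, and the per-operation costs are the minimum figures quoted in this article, not values taken from any processor data book.

```python
# Minimum per-operation cycle costs quoted in the article (assumed, best case).
PIF_CYCLES = {"load": 6, "add": 1, "store": 1}
XLMI_CYCLES = {"load": 1, "add": 1, "store": 1}

# The flow-through task: load two operands, add them, store the result.
FLOW_THROUGH = ["load", "load", "add", "store"]


def total_cycles(program, cycles_per_op):
    """Sum the minimum cycle cost of each operation in the program."""
    return sum(cycles_per_op[op] for op in program)


pif_total = total_cycles(FLOW_THROUGH, PIF_CYCLES)
xlmi_total = total_cycles(FLOW_THROUGH, XLMI_CYCLES)
speedup = pif_total / xlmi_total

print(pif_total, xlmi_total, speedup)  # 14 4 3.5
```

Because the model simply adds per-operation costs, it also shows why a faster bus alone has a ceiling: every operand still costs one sequential transaction.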
Consequently, processor core vendors offer much faster alternatives
for on-chip, block-to-block communications. For example, Tensilica has
boosted the I/O bandwidth of the Xtensa LX2 processor core with two
different features called TIE ports and queue interfaces. (Note: TIE is
Tensilica's Instruction Extension language, used to customise Xtensa
processors for specific on-chip roles.) These two features can easily
boost I/O transaction speeds by as much as three orders of magnitude
with no clock-speed increase.
Ports and queue interfaces are simple direct communication
structures. Transactions conducted over ports and queue interfaces are
not conducted by the processor's load/store unit using explicit memory
addresses. Instead, customised processor instructions initiate port and
queue transactions. Ports and queues reside outside of the
processor's memory space; the port or queue to be accessed is implicitly
specified by the custom port or queue instruction. One designer-defined port or
queue instruction can initiate transactions on several ports and queues
at the same time, which further boosts the processor's I/O bandwidth.
Using this interconnect, it's possible to create queue interfaces
that are especially efficient for the simple flow-through problem
discussed above (load two operands, add them, output the result). Three
queue interfaces are needed to minimise the amount of time required for
this task: two input queues for the input operands and one output queue
for the result. With these three queue interfaces defined, it's
possible to define a customised instruction that implicitly draws
input operands A and B from their respective input queues, adds A and B
together, and outputs the result of the addition (C) on the output
queue.
The problem becomes more interesting if we make the three operands
and the associated addition operation 256 bits wide. An off-the-shelf,
32-bit processor core would need to process the 256-bit operands in eight
32-bit chunks using 16 loads, eight adds, and eight stores. A customised
processor core can perform the entire task as one operation. The TIE
code needed to create the three 256-bit queue interfaces is:
queue InQ_A 256 in
queue InQ_B 256 in
queue OutQ_C 256 out
The first two statements declare 256-bit input queues named InQ_A and
InQ_B. The third statement declares a 256-bit output queue named OutQ_C.
Each TIE queue statement adds a parallel I/O port along with the
handshake lines needed to connect to an external FIFO memory.
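For comparison, the chunked approach that a stock 32-bit core would need can be sketched in Python. This is a hedged illustration, not Xtensa code: the helper names (split_words, chunked_add_256) are hypothetical, and the sketch folds carry propagation into each 32-bit add, a detail the operation count in the text does not itemize.

```python
MASK32 = 0xFFFFFFFF


def split_words(value, n_words=8):
    """Split an integer into n_words 32-bit words, least significant first."""
    return [(value >> (32 * i)) & MASK32 for i in range(n_words)]


def chunked_add_256(a, b):
    """Add two 256-bit values 32 bits at a time.

    Returns the 256-bit sum (carry out of the top word is discarded)
    and a tally of the 32-bit operations performed.
    """
    a_words, b_words = split_words(a), split_words(b)
    counts = {"load": 0, "add": 0, "store": 0}
    result, carry = 0, 0
    for i in range(8):
        x, y = a_words[i], b_words[i]        # two 32-bit loads
        counts["load"] += 2
        s = x + y + carry                    # one add, carry folded in
        counts["add"] += 1
        carry = s >> 32
        result |= (s & MASK32) << (32 * i)   # one 32-bit store
        counts["store"] += 1
    return result, counts


total, ops = chunked_add_256((1 << 32) - 1, 1)
print(hex(total), ops)
```

The tally comes out to 16 loads, eight adds, and eight stores, matching the figures given above for the 256-bit case.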
The following TIE code describes an instruction, ADD_XFER, that reads
256-bit operands from each of the input queues defined above, adds the
values together, and writes the result to the 256-bit output queue.
operation ADD_XFER {} {in InQ_A, in InQ_B, out OutQ_C} {
    assign OutQ_C = InQ_A + InQ_B;
}
With this new instruction, the target task reduces to one
instruction:
ADD_XFER
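The behaviour of ADD_XFER can be approximated in software to make the data flow concrete. The Python sketch below is a behavioural model under stated assumptions, not generated hardware: the function add_xfer and the deque-based queues are hypothetical stand-ins for the external FIFOs, the sum is assumed to wrap at the 256-bit queue width, and an empty input queue is modelled as "no progress" where real hardware would be expected to stall.

```python
from collections import deque

MASK256 = (1 << 256) - 1  # assumed: results wrap to the 256-bit queue width


def add_xfer(in_q_a, in_q_b, out_q_c):
    """Pop one operand from each input queue, add them, push the sum.

    Returns False without touching the queues if either input is empty.
    """
    if not in_q_a or not in_q_b:
        return False
    a = in_q_a.popleft()
    b = in_q_b.popleft()
    out_q_c.append((a + b) & MASK256)
    return True


in_q_a = deque([1, 2, MASK256])
in_q_b = deque([10, 20, 1])
out_q_c = deque()

# One call stands in for one ADD_XFER instruction.
while add_xfer(in_q_a, in_q_b, out_q_c):
    pass

print(list(out_q_c))  # [11, 22, 0]
```

Note that no addresses appear anywhere in the model: the queues are named by the operation itself, which mirrors how the custom instruction implicitly selects its ports and queues.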
The hardware added to the processor to perform the ADD_XFER
operation appears in Figure 2 below.
Figure 2: Creating a dedicated instruction requires very few additional gates,
but delivers impressive results
Very few additional gates are required to add this ability to a
processor core, yet the performance increase is immense. The ADD_XFER
instruction takes five cycles to run through the processor's 5-stage
pipeline, but the instruction has a flow-through latency of only one
clock cycle because of processor pipelining. By placing the ADD_XFER
instruction within a zero-overhead loop, the processor delivers an
effective throughput of one ADD_XFER instruction per clock cycle, which
is 112 times faster than performing the same operation over the
processor's PIF using 32-bit load, store, and add instructions (14
cycles per 32 bits of I/O and computation equals 112 clock cycles per
256 bits).
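The throughput arithmetic can be checked with a short Python sketch. It assumes an idealised pipeline: the loop issues one ADD_XFER per cycle with no bubbles, and the queues never run empty or full. The function name pipelined_cycles is hypothetical, and the fill-then-stream formula is standard pipeline arithmetic rather than a figure from the processor documentation.

```python
PIPELINE_STAGES = 5          # pipeline depth quoted in the article
PIF_CYCLES_PER_32_BITS = 14  # flow-through cost over the PIF, from the article
CHUNKS_PER_256_BITS = 256 // 32


def pipelined_cycles(n_instructions, stages=PIPELINE_STAGES):
    """Cycles to complete n back-to-back instructions, assuming no stalls.

    The first instruction needs `stages` cycles; each later one
    completes one cycle after its predecessor.
    """
    if n_instructions == 0:
        return 0
    return n_instructions + (stages - 1)


pif_cycles_per_256 = PIF_CYCLES_PER_32_BITS * CHUNKS_PER_256_BITS

for n in (1, 100, 10_000):
    cycles = pipelined_cycles(n)
    print(n, cycles, round(cycles / n, 4))

print(pif_cycles_per_256)  # 112
```

As the loop count grows, the four-cycle pipeline fill is amortised and the cost per ADD_XFER approaches one cycle, which is the basis for the 112x comparison.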
Steve Leibson is with Tensilica Inc.
This article is adapted from the author's book, Designing SOCs with
Configured Cores, published by Morgan Kaufmann.