In August 2006, I attended the Girvan Workshop for
the Cell Broadband Engine, and it's an experience I'll never forget. For
two solid days, IBM engineers explained the processor's architecture,
tools and the many software libraries available for building Cell
applications.
I was stunned, not only by the processor's extraordinary
capabilities, but also by how much there was to learn: spulets and CESOF
files, AltiVec and Single Instruction, Multiple Data (SIMD) math,
debuggers and simulators. I did my best to comprehend it all, but most
of the material flew over my head in a fierce whoosh.
When Sony released the PlayStation 3, I grabbed the first console
off the shelf and ran home for a more thorough investigation. It was
daunting at first. Then, as now, IBM's Software Development Kit
provided a vast number of documents that covered three essential
subjects: development tools, software libraries and the processor
itself. The docs were helpful, but there was no overlap or coordination
between them. This is a serious problem because any practical Cell
developer needs to understand these subjects as an integrated whole.
It took time before the whooshing sound dissipated, but when it did,
I genuinely understood how to program the Cell's PowerPC Processor Unit
(PPU) and Synergistic Processor Units (SPUs).
It wasn't that hard, really: just regular C/C++ and a set of
communications mechanisms. Yet the blogs and discussion groups
disagreed: to them Cell programming was much too complex for normal
developers to understand. However, they hadn't really given the Cell a
chance; they saw the disjointed pieces, but not how they fit together.
Programming the Cell Processor is my best attempt to reduce the whoosh
associated with Cell development. My goal is to tie together the Cell's
tools, architecture and libraries in a straightforward progression that
appeals to intuition. And I've included many code examples so that you
can follow the material in a hands-on fashion. To download the
examples, go to
http://informit.com/title.
Editor's Note: Reproduced by permission of the book's
publisher, Pearson Education, Inc., this series of five articles
describes how developers can use a collection of open source GCC
development tools, the Linux operating system and the Eclipse IDE to do
development on the Cell multicore architecture. Part 1, which follows, describes the basics of the Cell processor architecture and briefly introduces the developer to the Cell Software Development Kit. Part 2 covers building applications for the Cell processor; Part 3 covers debugging the Cell processor; Part 4 covers simulating applications; and Part 5 is about the Cell SDK IDE, including Eclipse and the C/C++ development tooling, and details how to manage an SPU project with the Cell IDE.
In Randall Hyde's fine series of books, Write Great Code, one of his
fundamental lessons is that, for optimal
performance, you need to know how your code runs on the target
processor. Nowhere is this truer than when programming the Cell
processor.
It isn't enough to learn the C/C++ commands for the different cores;
you need to understand how the elements communicate with memory and one
another. This way, you'll have a bubble-free instruction pipeline, an
increased probability of cache hits, and an orderly, nonintersecting
communication flow between processing elements. What more could anyone
ask?
Figure 1.1 below shows the primary building blocks of the
Cell: the Memory Interface Controller (MIC), the PowerPC Processor
Element (PPE), the eight Synergistic Processor Elements (SPEs), the
Element Interconnect Bus (EIB), and the Input/Output Interface (IOIF).
Each of these is explored in greater depth throughout the book, but for
now, it's a good idea to see how they function individually and
interact as a whole.
Figure 1.1 The top-level anatomy of the Cell processor
The Memory Interface Controller (MIC)
The MIC connects the Cell's system memory to the rest of the chip. It
provides two channels to system memory, but because you can't control
its operation through code, the discussion of the MIC is limited to
this brief treatment. However, you should know that, like the
PlayStation 2's Emotion Engine, the first-generation Cell supports
connections only to Rambus memory.
This memory, called eXtreme Data Rate Dynamic Random Access Memory,
or XDR DRAM, differs from conventional DRAM in that it makes eight data
transfers per clock cycle rather than the usual two or four. This way,
the memory can provide high data bandwidth without needing very high
clock frequencies. The XDR interface can support different memory sizes; the PlayStation 3, for example, uses 256MB of XDR DRAM as its system memory.
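To make the bandwidth argument concrete, here is a rough back-of-the-envelope sketch. It is written in Python as plain arithmetic and does not use the Cell SDK. The eight transfers per clock and the two channels come from the text above; the 400MHz clock and the 4-byte channel width are assumed values chosen for illustration, and the function name is hypothetical.

```python
def peak_bandwidth_bytes_per_s(clock_hz, transfers_per_clock, channels, bytes_per_channel):
    """Peak bandwidth = clock rate x transfers per clock x total bus width in bytes."""
    return clock_hz * transfers_per_clock * channels * bytes_per_channel


# Eight transfers per clock and two channels come from the text.
# ASSUMPTION: a 400 MHz memory clock and a 4-byte (32-bit) channel width.
xdr = peak_bandwidth_bytes_per_s(400_000_000, 8, 2, 4)

# The same assumed clock and bus width, but with only two transfers per clock.
two_per_clock = peak_bandwidth_bytes_per_s(400_000_000, 2, 2, 4)

print(xdr / 1e9, "GB/s at eight transfers per clock")
print(two_per_clock / 1e9, "GB/s at two transfers per clock")
```

Under these assumptions, moving from two to eight transfers per clock quadruples the peak bandwidth without raising the clock frequency at all.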
The PowerPC Processor Element (PPE)
The PPE is the Cell's control center. It runs the operating system,
responds to interrupts, and contains and manages the 512KB L2 cache. It
also distributes the processing workload among the SPEs and coordinates
their operation. If the Cell is an eight-horse coach, the PPE is
the coachman, controlling the coach by feeding the horses and keeping
them in line.
As shown in Figure 1.2 below, the PPE consists of two
operational blocks. The first is the PowerPC Processor Unit, or PPU. This
processor's instruction set is based on the 64-bit PowerPC 970
architecture, used most prominently as the CPU of Apple Computer's
Power Mac G5. The PPU executes PPC 970 instructions in addition to other
Cell-specific commands, and is the only general-purpose processing unit
in the Cell. This is why Linux is installed to run on the PPU and not on
the other processing units.
Figure 1.2 Structure of the PPE
But the PPU can do more than just housekeeping. It contains IBM's
VMX engine for Single Instruction, Multiple Data (SIMD) processing. This
means the PPU can operate on groups of numbers (e.g., multiply two sets
of four floating-point values) with a single instruction. The PPU's
SIMD instructions are the same as those used in Apple's image
processing applications, and are collectively referred to as the
AltiVec instruction set.
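The idea of one instruction acting on four values at once can be modeled with a short sketch. This is plain Python, not AltiVec code: the list comprehension only imitates what the hardware does in a single instruction, and the names `simd_multiply` and `instructions_needed` are hypothetical. The four-lane width assumes a 128-bit vector holding four 32-bit floats.

```python
LANES = 4  # ASSUMPTION: one 128-bit vector holds four 32-bit floats


def simd_multiply(a, b):
    """Model one SIMD multiply: the same operation applied to every lane."""
    if len(a) != LANES or len(b) != LANES:
        raise ValueError("this sketch models vectors of exactly four values")
    return [x * y for x, y in zip(a, b)]


def instructions_needed(n_values, lanes):
    """Count the multiply instructions needed to process n_values."""
    return (n_values + lanes - 1) // lanes  # round up to whole vectors


product = simd_multiply([1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0])
print(product)
print(instructions_needed(16, 1), "scalar multiplies versus",
      instructions_needed(16, LANES), "SIMD multiplies")
```

The instruction count is the point of the comparison: for the same 16 products, the four-lane version issues a quarter as many multiply instructions as the scalar version.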
Another important aspect of the PPU is its capacity for simultaneous
multithreading (SMT). The PPU allows two threads of execution to run at
the same time, and although each receives a copy of most of the PPU's
registers, they have to share basic on-chip execution blocks.
This doesn't provide the same performance gain as if the threads ran
on different processors, but it allows you to maximize usage of the PPU
resources. For example, if one thread is waiting on the PPU's memory
management unit (MMU) to complete a memory write, the other can perform
mathematical operations with the vector and scalar unit (VSU).
The second block in the PPE is the PowerPC Processor Storage
Subsystem, or PPSS. This contains the L2 cache along with registers and
queues for reading and writing data. The cache plays a very important
role in the Cell's operation: not only does it perform the regular
functions of an L2 cache, it's also the only shared memory bank in the
device. Therefore, it's important to know how it works and maintains
coherence.
The Synergistic Processor Element (SPE)
The PPU is a powerful processor, but it's the Synergistic Processor
Unit (SPU) in each SPE that makes the Cell such a groundbreaking
device. These processors are designed for one purpose only: high-speed
SIMD operations. Each SPU contains two parallel pipelines that execute
instructions at 3.2GHz.
In only a handful of cycles, one pipeline can multiply and
accumulate 128-bit vectors while the other loads more vectors from
memory. SPUs weren't designed for general-purpose processing and aren't
well suited to run operating systems. Instead, they receive
instructions from the PPU, which also starts and stops their execution.
The SPU's instructions, like its data, are stored in a unified 256KB
local store (LS), shown in Figure 1.3 below. The LS is not cache; it's
the SPU's own individual memory for instructions and data. This, along
with the SPU's large register file (128 128-bit registers), is the only
memory the SPU can directly access, so it's important to have a deep
understanding of how the LS works and how to transfer its contents to
other elements.
Figure 1.3 Structure of the SPE
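Because instructions and data share the same 256KB, it helps to budget the local store before writing any SPU code. The following is a minimal sketch of that arithmetic in plain Python. The 256KB and 128 x 128-bit figures come from the text; the function name and the example sizes for code, data, and stack are hypothetical.

```python
LOCAL_STORE_BYTES = 256 * 1024   # unified 256KB local store, from the text
REGISTER_FILE_BYTES = 128 * 16   # 128 registers x 128 bits (16 bytes) each


def local_store_headroom(code_bytes, data_bytes, stack_bytes):
    """Return the bytes left in the LS once code, data, and stack are placed.

    A negative result means the program does not fit.
    """
    return LOCAL_STORE_BYTES - (code_bytes + data_bytes + stack_bytes)


# Hypothetical program: 64KB of code, 128KB of buffers, 16KB of stack.
headroom = local_store_headroom(64 * 1024, 128 * 1024, 16 * 1024)
print(headroom // 1024, "KB of local store left over")

# Hypothetical program whose buffers are too large for the LS.
overflow = local_store_headroom(64 * 1024, 200 * 1024, 16 * 1024)
print("fits" if overflow >= 0 else "does not fit")
```

Anything that does not fit has to stay in system memory and be brought into the LS in pieces.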
The Cell provides hardware security (or digital rights management,
if you prefer) by allowing users to isolate individual SPUs from the
rest of the device. While an SPU is isolated, other processing elements
can't access its LS or registers, but it can continue running its
program normally. The isolated processor will remain secure even if an
intruder acquires root privileges on the PPU.
Figure 1.3 above shows the Memory Flow Controller (MFC)
contained in each SPE. This manages communication to and from an SPU,
and by doing so, frees the SPU for crunching numbers. More
specifically, it provides a number of different mechanisms for
inter-element communication, such as mailboxes and channels.
The MFC's most important function is to enable direct memory access
(DMA). When the PPU wants to transfer data to an SPU, it gives the MFC
an address in system memory and an address in the LS, and tells the MFC
to start moving bytes.
Similarly, when an SPU needs to transfer data into its LS, it can
not only initiate DMA transfers, but also create lists of
transfers. This way, an SPU can access noncontiguous sections of memory
efficiently, without burdening the central bus or significantly
disturbing its processing.
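A list of transfers can be pictured as a series of (address, size) pairs that the MFC walks through on the SPU's behalf. The sketch below simulates that gather in plain Python with byte arrays standing in for system memory and the local store. It is not the MFC's real interface: the function name, the pair layout, and the buffer sizes are all assumptions made for illustration.

```python
def gather_with_dma_list(main_memory, dma_list, local_store, ls_offset=0):
    """Copy each (address, size) region of main_memory into local_store,
    packing the regions back to back. Returns the number of bytes written."""
    cursor = ls_offset
    for address, size in dma_list:
        if address < 0 or address + size > len(main_memory):
            raise ValueError("region lies outside system memory")
        if cursor + size > len(local_store):
            raise ValueError("local store buffer is too small")
        local_store[cursor:cursor + size] = main_memory[address:address + size]
        cursor += size
    return cursor - ls_offset


# Stand-in system memory: 1,024 bytes where byte i holds the value i % 256.
main_memory = bytearray(range(256)) * 4
local_store = bytearray(64)

# Three noncontiguous 16-byte regions, described once and fetched together.
written = gather_with_dma_list(main_memory, [(0, 16), (272, 16), (48, 16)], local_store)
print(written, "bytes gathered into the local store")
```

The SPU describes the scattered regions once, and the data arrives packed together in its LS, ready for processing.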
The Element Interconnect Bus (EIB)
The EIB serves as the infrastructure underlying the DMA requests and
inter-element communication. Functionally, it consists of four rings,
two that carry data in the clockwise direction (PPE > SPE1 > SPE3
> SPE5 > SPE7 > IOIF1 > IOIF0 > SPE6 > SPE4 > SPE2
> SPE0 > MIC) and two that transfer data in the counterclockwise
direction. Each ring is 16 bytes wide and can support three data
transfers simultaneously.
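The ring layout lends itself to a small model. The sketch below, in plain Python, uses the clockwise order listed above to count hops between two elements and to work out how many transfers the four rings can carry at once. It assumes the ring closes from the MIC back to the PPE, and it picks the shorter direction purely for illustration; the real bus arbiter's policy is not described here, and the function names are hypothetical.

```python
# Clockwise order of elements, as listed in the text.
# ASSUMPTION: the ring closes from MIC back to PPE.
RING_ORDER = ["PPE", "SPE1", "SPE3", "SPE5", "SPE7", "IOIF1",
              "IOIF0", "SPE6", "SPE4", "SPE2", "SPE0", "MIC"]
RINGS = 4               # two clockwise, two counterclockwise
TRANSFERS_PER_RING = 3  # simultaneous transfers per ring


def clockwise_hops(src, dst):
    """Number of hops from src to dst when traveling clockwise."""
    return (RING_ORDER.index(dst) - RING_ORDER.index(src)) % len(RING_ORDER)


def shortest_route(src, dst):
    """Return (direction, hops) for the shorter way around the ring."""
    cw = clockwise_hops(src, dst)
    ccw = (len(RING_ORDER) - cw) % len(RING_ORDER)
    return ("clockwise", cw) if cw <= ccw else ("counterclockwise", ccw)


max_concurrent_transfers = RINGS * TRANSFERS_PER_RING

print(shortest_route("PPE", "SPE1"))
print(shortest_route("SPE1", "PPE"))
print(max_concurrent_transfers, "transfers can be in flight at once")
```

Having rings in both directions means a neighbor is one hop away regardless of which side it sits on.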
Each DMA transfer can hold a payload of 1, 2, 4, 8, or 16
bytes, or a multiple of 16 bytes up to a maximum of 16KB. Each DMA
transfer, no matter how large or small, consists of eight bus transfers
(128 bytes).
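The size rule is easy to get wrong in practice, so here is a minimal sketch of it in plain Python. It checks whether a payload size is legal and splits a larger buffer into transfers of at most 16KB. The function names are hypothetical, and alignment requirements are deliberately left out.

```python
MAX_DMA_BYTES = 16 * 1024  # 16KB ceiling per transfer, from the text


def is_valid_dma_size(size):
    """True for 1, 2, 4, or 8 bytes, or any multiple of 16 up to 16KB."""
    if size in (1, 2, 4, 8):
        return True
    return 0 < size <= MAX_DMA_BYTES and size % 16 == 0


def split_into_transfers(total_bytes):
    """Split a buffer (a multiple of 16 bytes) into transfers of at most 16KB."""
    if total_bytes <= 0 or total_bytes % 16 != 0:
        raise ValueError("this sketch expects a positive multiple of 16 bytes")
    sizes = []
    remaining = total_bytes
    while remaining > 0:
        chunk = min(remaining, MAX_DMA_BYTES)
        sizes.append(chunk)
        remaining -= chunk
    return sizes


print(is_valid_dma_size(12), is_valid_dma_size(16 * 1024))
chunks = split_into_transfers(40 * 1024)  # hypothetical 40KB buffer
print(chunks)
```

A 40KB buffer, for example, cannot move in one transfer; it has to be issued as several, each no larger than 16KB.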
The Input/Output Interface (IOIF)
As the name implies, the IOIF connects the Cell to external peripherals.
Like the memory interface, it is based on Rambus technology: FlexIO. The
FlexIO connections can be configured for data rates between 400MHz and
8GHz, and with the high number of connections on the Cell, its maximum
I/O bandwidth approaches 76.8GB/s.
In the PlayStation 3, the I/O is connected to Nvidia's RSX graphics
processor. The IOIF can be accessed only by privileged applications, and
for this reason, interfacing with the IOIF lies beyond the scope of this
book.
The CBE Software Development Kit
This book uses a hands-on approach to teach Cell programming, so the
development tools are very important. The most popular toolset is IBM's
Software Development Kit (SDK), which runs exclusively on Linux and
provides many different tools and libraries for building Cell
applications.
IBM provides the SDK free of charge, although some of the tools have
more restrictive licensing than others. For the purposes of this book,
the most important aspect of the SDK is the GCC-based toolchain for
compiling and linking code.
The two compilers, ppu-gcc and spu-gcc, compile code for the PPU and
SPU, respectively. They provide multiple optimization levels and can
combine scalar operations into more efficient vector operations.
The SDK also includes IBM's Full-System Simulator, tailored for Cell
applications. This impressive application runs on a conventional
computer and provides cycle-accurate simulation of the Cell processor,
keeping track of every thread and register in the PPU and SPUs. In
addition to basic simulation and debugging, it provides many advanced
features for responding to processing events.
The SDK contains many code libraries to ease the transition from
traditional programming to Cell development. It provides most standard
C/C++ libraries for both the PPU and SPU, POSIX commands for the PPU,
and a subset of the POSIX API on the SPU. Many of the libraries are
related to math, but others can be used to profile an SPU's operation,
maintain a software cache, and synchronize communication between
processing units.
All of these tools and libraries can be accessed through the Cell
SDK integrated development environment (IDE). This is an Eclipse-based
graphical user interface for managing, editing, building, and analyzing
code projects. It provides a powerful text editor for code entry,
point-and-click compiling, and a feature-rich interface to the Cell
debugger. With this interface, you can watch variables as you step
through code and view every register and memory location in the Cell.
Conclusion
Some time ago, I had the pleasure of programming assembly language on a
multicore digital signal processor, or DSP. The DSP performed matrix
operations much, much faster than the computer on my desk, but there
were two problems: I had to write all the routines for resource
management and event handling, and there was no file system to organize
the data. And without a network interface, it was hard to transfer data
in and out of the device.
The Cell makes up for these shortcomings and provides many
additional advantages. With SIMD processing, values can be grouped into
vectors and processed in a single cycle. With Linux running on the PPE,
memory and I/O can be accessed through a standard, reliable API. Most
important, when all the SPEs crunch numbers simultaneously, they can
process matrices at incredible speed.
The goal is to enable you to build applications with similar
performance. As with the DSP, however, it's not enough just to know the
C/C++ functions. You have to understand how the different processing
elements work, how they're connected, and how they access memory. But
first, you need to know how to use the tools.
Next in Part 2, Building Applications for the Cell Processor.
Matthew Scarpino lives in the San Francisco Bay Area and
develops software to interface embedded devices. He has a master's
degree in electrical engineering and has spent more than a decade in
software development. His experience includes computing clusters,
digital signal processors, microcontrollers, field programmable
gate arrays and, of course, the Cell Processor.
This series of articles is reproduced from the book "Programming
the Cell Processor", Copyright © 2009, by permission of
Pearson Education, Inc. Written permission from Pearson Education,
Inc. is required for all other uses.
To read more about the Cell processor architecture on Embedded.com,
go to:
1) A glimpse inside the Cell processor
2) Programming the Cell Broadband Engine
3) Programming the Cell Processor
4) Cell Processor makes computing more connected