Using open-source GNU, Eclipse & Linux to develop multicore Cell apps: Part 1

by Matthew Scarpino , TechOnline India - August 24, 2009

In a five part series this week, Matthew Scarpino, author of "Programming the Cell Processor," describes how to use open source GCC tools, Linux and the Eclipse IDE to develop multicore apps. Up first: Introducing the Cell Processor.

In August 2006, I attended the Girvan Workshop for the Cell Broadband Engine, and it's an experience I'll never forget. For two solid days, IBM engineers explained the processor's architecture, tools and the many software libraries available for building Cell applications.

I was stunned, not only by the processor's extraordinary capabilities, but also by how much there was to learn: spulets and CESOF files, AltiVec and Single Instruction, Multiple Data (SIMD) math, debuggers and simulators. I did my best to comprehend it all, but most of the material flew over my head in a fierce whoosh.

When Sony released the PlayStation 3, I grabbed the first console off the shelf and ran home for a more thorough investigation. It was daunting at first. Then, as now, IBM's Software Development Kit provided a vast number of documents that covered three essential subjects: development tools, software libraries and the processor itself. The docs were helpful, but there was no overlap or coordination between them. This is a serious problem because any practical Cell developer needs to understand these subjects as an integrated whole.

It took time before the whooshing sound dissipated, but when it did, I genuinely understood how to program the Cell's PowerPC Processor Unit (PPU) and Synergistic Processor Units (SPUs).

It wasn't that hard, really: just regular C/C++ and a set of communication mechanisms. Yet the blogs and discussion groups disagreed: to them, Cell programming was much too complex for normal developers to understand. However, they hadn't really given the Cell a chance; they saw the disjointed pieces, but not how they fit together.

Programming the Cell Processor is my best attempt to reduce the whoosh associated with Cell development. My goal is to tie together the Cell's tools, architecture and libraries in a straightforward progression that appeals to intuition. And I've included many code examples so that you can follow the material in a hands-on fashion. To download the examples, go to http://informit.com/title.

Editor's Note: Reproduced by permission of the book's publisher, Pearson Education, Inc., this series of five articles describes how developers can use a collection of open source GCC development tools, the Linux operating system and the Eclipse IDE to do development on the Cell multicore architecture. Part 1 starts on the next page, and describes the basics of the Cell Processor architecture, introducing the developer briefly to the Cell Software Development Kit. Part 2 is on Building Applications for the Cell processor; Part 3 covers debugging the Cell processor; Part 4 covers simulating applications and Part 5 is about the Cell SDK IDE, including Eclipse and the C/C++ development tooling as well as detailing how to manage an SPU project with the Cell IDE.

In Randall Hyde's fine series of books, Write Great Code, one of his fundamental lessons is that, for optimal performance, you need to know how your code runs on the target processor. Nowhere is this truer than when programming the Cell processor.

It isn't enough to learn the C/C++ commands for the different cores; you need to understand how the elements communicate with memory and one another. This way, you'll have a bubble-free instruction pipeline, an increased probability of cache hits, and an orderly, nonintersecting communication flow between processing elements. What more could anyone ask?

Figure 1.1 below shows the primary building blocks of the Cell: the Memory Interface Controller (MIC), the PowerPC Processor Element (PPE), the eight Synergistic Processor Elements (SPEs), the Element Interconnect Bus (EIB), and the Input/Output Interface (IOIF). Each of these is explored in greater depth throughout the book, but for now, it's a good idea to see how they function individually and interact as a whole.

Figure 1.1 The top-level anatomy of the Cell processor

The Memory Interface Controller (MIC)
The MIC connects the Cell's system memory to the rest of the chip. It provides two channels to system memory, but because you can't control its operation through code, the discussion of the MIC is limited to this brief treatment. However, you should know that, like the PlayStation 2's Emotion Engine, the first-generation Cell supports connections only to Rambus memory.

This memory, called eXtreme Data Rate Dynamic Random Access Memory, or XDR DRAM, differs from conventional DRAM in that it makes eight data transfers per clock cycle rather than the usual two or four. This way, the memory can provide high data bandwidth without needing very high clock frequencies. The XDR interface can support different memory sizes; the PlayStation 3, for example, uses 256MB of XDR DRAM as its system memory.

The PowerPC Processor Element (PPE)
The PPE is the Cell's control center. It runs the operating system, responds to interrupts, and contains and manages the 512KB L2 cache. It also distributes the processing workload among the SPEs and coordinates their operation. If the Cell is an eight-horse coach, the PPE is the coachman, controlling the coach by feeding the horses and keeping them in line.

As shown in Figure 1.2 below, the PPE consists of two operational blocks. The first is the PowerPC Processor Unit, or PPU. This processor's instruction set is based on the 64-bit PowerPC 970 architecture, used most prominently as the CPU of Apple Computer's Power Mac G5. The PPU executes PPC 970 instructions in addition to other Cell-specific commands, and is the only general-purpose processing unit in the Cell. This is why Linux is installed to run on the PPU and not on the other processing units.

Figure 1.2 Structure of the PPE

But the PPU can do more than just housekeeping. It contains IBM's VMX engine for Single Instruction, Multiple Data (SIMD) processing. This means the PPU can operate on groups of numbers (e.g., multiply two sets of four floating-point values) with a single instruction. The PPU's SIMD instructions are the same as those used in Apple's image processing applications, and are collectively referred to as the AltiVec instruction set.
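
To make this concrete, here is a minimal sketch of PPU-side SIMD using the AltiVec/VMX intrinsics. It assumes a PPU compiler with AltiVec support enabled (for example, ppu-gcc with the -maltivec flag), and the values are arbitrary; it isn't taken from the book's examples.

#include <altivec.h>
#include <stdio.h>

int main(void)
{
    /* Four floating-point lanes packed into each 128-bit vector */
    vector float a = { 1.0f, 2.0f, 3.0f, 4.0f };
    vector float b = { 10.0f, 10.0f, 10.0f, 10.0f };
    vector float c = { 0.5f, 0.5f, 0.5f, 0.5f };

    /* A single vec_madd computes (a * b) + c for all four lanes at once */
    vector float result = vec_madd(a, b, c);

    /* Store the vector to an aligned array so the lanes can be printed */
    float out[4] __attribute__((aligned(16)));
    vec_st(result, 0, out);
    printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);

    return 0;
}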

Another important aspect of the PPU is its capacity for simultaneous multithreading (SMT). The PPU allows two threads of execution to run at the same time, and although each receives a copy of most of the PPU's registers, they have to share basic on-chip execution blocks.

This doesn't provide the same performance gain as if the threads ran on different processors, but it allows you to maximize usage of the PPU resources. For example, if one thread is waiting on the PPU's memory management unit (MMU) to complete a memory write, the other can perform mathematical operations with the vector and scalar unit (VXU).
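
From Linux on the PPE, you exploit this simply by running two POSIX threads. The sketch below is illustrative only (it isn't SDK-specific, and the workload sizes are arbitrary): one thread is memory-bound and the other compute-bound, the kind of mix that lets the two hardware threads overlap. Compile with -pthread.

#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define BUF_SIZE (1 << 20)
static char src[BUF_SIZE], dst[BUF_SIZE];

/* Mostly waits on loads and stores */
static void *memory_work(void *arg)
{
    int i;
    for (i = 0; i < 100; i++)
        memcpy(dst, src, BUF_SIZE);
    return NULL;
}

/* Mostly keeps the arithmetic units busy */
static void *compute_work(void *arg)
{
    volatile double sum = 0.0;
    int i;
    for (i = 1; i < 10000000; i++)
        sum += 1.0 / i;
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, memory_work, NULL);
    pthread_create(&t2, NULL, compute_work, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("both threads finished\n");
    return 0;
}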

The second block in the PPE is the PowerPC Processor Storage Subsystem, or PPSS. This contains the L2 cache along with registers and queues for reading and writing data. The cache plays a very important role in the Cell's operation: not only does it perform the regular functions of an L2 cache, it's also the only shared memory bank in the device. Therefore, it's important to know how it works and maintains coherence.

The Synergistic Processor Element (SPE)
The PPU is a powerful processor, but it's the Synergistic Processor Unit (SPU) in each SPE that makes the Cell such a groundbreaking device. These processors are designed for one purpose only: high-speed SIMD operations. Each SPU contains two parallel pipelines that execute instructions at 3.1GHz.

In only a handful of cycles, one pipeline can multiply and accumulate 128-bit vectors while the other loads more vectors from memory. SPUs weren't designed for general-purpose processing and aren't well suited to run operating systems. Instead, they receive instructions from the PPU, which also starts and stops their execution.
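
As a preview of how that control looks from the PPU side, the sketch below uses the SDK's libspe2 library to create an SPE context, load a program into it, and run it. The embedded program handle hello_spu is a placeholder for an SPU executable linked into the PPU binary; the code is a sketch, compiled on the PPU and linked with -lspe2.

#include <libspe2.h>
#include <stdio.h>

extern spe_program_handle_t hello_spu;   /* placeholder embedded SPU program */

int main(void)
{
    unsigned int entry = SPE_DEFAULT_ENTRY;
    spe_stop_info_t stop_info;

    /* Create a context: the PPU's handle to one SPE */
    spe_context_ptr_t ctx = spe_context_create(0, NULL);
    if (ctx == NULL) {
        perror("spe_context_create");
        return 1;
    }

    /* Load the SPU executable into the SPE's local store */
    if (spe_program_load(ctx, &hello_spu) != 0) {
        perror("spe_program_load");
        return 1;
    }

    /* Run the SPU program; this call blocks until the SPU stops */
    if (spe_context_run(ctx, &entry, 0, NULL, NULL, &stop_info) < 0) {
        perror("spe_context_run");
        return 1;
    }

    spe_context_destroy(ctx);
    return 0;
}

In a real application, each spe_context_run() call would typically live in its own PPU thread so that all eight SPEs can work concurrently.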

The SPU's instructions, like its data, are stored in a unified 256KB local store (LS), shown in Figure 1.3 below. The LS is not cache; it's the SPU's own individual memory for instructions and data. This, along with the SPU's large register file (128 128-bit registers), is the only memory the SPU can directly access, so it's important to have a deep understanding of how the LS works and how to transfer its contents to other elements.

Figure 1.3 Structure of the SPE

The Cell provides hardware security (or digital rights management, if you prefer) by allowing users to isolate individual SPUs from the rest of the device. While an SPU is isolated, other processing elements can't access its LS or registers, but it can continue running its program normally. The isolated processor will remain secure even if an intruder acquires root privileges on the PPU.

Figure 1.3 above shows the Memory Flow Controller (MFC) contained in each SPE. This manages communication to and from an SPU, and by doing so, frees the SPU for crunching numbers. More specifically, it provides a number of different mechanisms for inter-element communication, such as mailboxes and channels.
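
As a small taste of the mailbox mechanism, the sketch below shows the SPU side only: it blocks until the PPU writes a 32-bit message into the SPU's inbound mailbox, then replies through the outbound mailbox. The three-argument SPU main() signature and the channel calls come from the SDK's spu_mfcio.h; the arithmetic is arbitrary.

#include <spu_mfcio.h>

int main(unsigned long long speid, unsigned long long argp,
         unsigned long long envp)
{
    /* Block until the PPU writes a 32-bit value to the inbound mailbox */
    unsigned int msg = spu_read_in_mbox();

    /* Reply through the outbound mailbox; the PPU can poll for this
       value with libspe2's spe_out_mbox_read() */
    spu_write_out_mbox(msg + 1);

    return 0;
}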

The MFC's most important function is to enable direct memory access (DMA). When the PPU wants to transfer data to an SPU, it gives the MFC an address in system memory and an address in the LS, and tells the MFC to start moving bytes.

Similarly, when an SPU needs to transfer data into its LS, it can not only initiate DMA transfers, but also create lists of transfers. This way, an SPU can access noncontiguous sections of memory efficiently, without burdening the central bus or significantly disturbing its processing.
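
Here is a minimal sketch of such a transfer on the SPU side, assuming the PPU passes the effective address of a source buffer through the argp run-time argument. The MFC intrinsics come from the SDK's spu_mfcio.h; the size and tag values are arbitrary.

#include <spu_mfcio.h>

#define CHUNK_SIZE 4096                 /* multiple of 16 bytes, at most 16KB */

static char buffer[CHUNK_SIZE] __attribute__((aligned(128)));

int main(unsigned long long speid, unsigned long long argp,
         unsigned long long envp)
{
    unsigned int tag = 1;               /* any tag ID from 0 to 31 */

    /* Queue the DMA: local-store address, effective address, size, tag */
    mfc_get(buffer, argp, CHUNK_SIZE, tag, 0, 0);

    /* Wait only for transfers with this tag to complete */
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();

    /* buffer now holds CHUNK_SIZE bytes copied from system memory */
    return 0;
}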

The Element Interconnect Bus (EIB)
The EIB serves as the infrastructure underlying the DMA requests and inter-element communication. Functionally, it consists of four rings, two that carry data in the clockwise direction (PPE > SPE1 > SPE3 > SPE5 > SPE7 > IOIF1 > IOIF0 > SPE6 > SPE4 > SPE2 > SPE0 > MIC) and two that transfer data in the counterclockwise direction. Each ring is 16 bytes wide and can support three data transfers simultaneously.

Each DMA transfer can hold payload sizes of 1, 2, 4, 8, and 16 bytes, and multiples of 16 bytes up to a maximum of 16KB. Each DMA transfer, no matter how large or small, consists of eight bus transfers (128 bytes).

The Input/Output Interface (IOIF)
As the name implies, the IOIF connects the Cell to external peripherals. Like the memory interface, it is based on Rambus technology: FlexIO. The FlexIO connections can be configured for data rates from 400MHz to 8GHz, and with the high number of connections on the Cell, its maximum I/O bandwidth approaches 76.8GB/s.

In the PlayStation 3, the I/O is connected to Nvidia's RSX graphics processor. The IOIF can be accessed only by privileged applications, and for this reason, interfacing with the IOIF lies beyond the scope of this book.

The CBE Software Development Kit
This book uses a hands-on approach to teach Cell programming, so the development tools are very important. The most popular toolset is IBM's Software Development Kit (SDK), which runs exclusively on Linux and provides many different tools and libraries for building Cell applications.

IBM provides the SDK free of charge, although some of the tools have more restrictive licensing than others. For the purposes of this book, the most important aspect of the SDK is the GCC-based toolchain for compiling and linking code.

The two compilers, ppu-gcc and spu-gcc, compile code for the PPU and SPU, respectively. They provide multiple optimization levels and can combine scalar operations into more efficient vector operations.

The SDK also includes IBM's Full-System Simulator, tailored for Cell applications. This impressive application runs on a conventional computer and provides cycle-accurate simulation of the Cell processor, keeping track of every thread and register in the PPU and SPUs. In addition to basic simulation and debugging, it provides many advanced features for responding to processing events.

The SDK contains many code libraries to ease the transition from traditional programming to Cell development. It provides most standard C/C++ libraries for both the PPU and SPU, POSIX commands for the PPU, and a subset of the POSIX API on the SPU. Many of the libraries are related to math, but others can be used to profile an SPU's operation, maintain a software cache, and synchronize communication between processing units.

All of these tools and libraries can be accessed through the Cell SDK integrated development environment (IDE). This is an Eclipse-based graphical user interface for managing, editing, building, and analyzing code projects. It provides a powerful text editor for code entry, point-and-click compiling, and a feature-rich interface to the Cell debugger. With this interface, you can watch variables as you step through code and view every register and memory location in the Cell.

Conclusion
Some time ago, I had the pleasure of programming assembly language on a multicore digital signal processor, or DSP. The DSP performed matrix operations much, much faster than the computer on my desk, but there were problems: I had to write all the routines for resource management and event handling, there was no file system to organize the data, and without a network interface, it was hard to transfer data in and out of the device.

The Cell makes up for these shortcomings and provides many additional advantages. With SIMD processing, values can be grouped into vectors and processed in a single cycle. With Linux running on the PPE, memory and I/O can be accessed through a standard, reliable API. Most important, when all the SPEs crunch numbers simultaneously, they can process matrices at incredible speed.

The goal is to enable you to build applications with similar performance. As with the DSP, however, it's not enough just to know the C/C++ functions. You have to understand how the different processing elements work, how they're connected, and how they access memory. But first, you need to know how to use the tools.

Next in Part 2, Building Applications for the Cell Processor.

Matthew Scarpino lives in the San Francisco Bay area and develops software to interface with embedded devices. He has a master's degree in electrical engineering and has spent more than a decade in software development. His experience includes computing clusters, digital signal processors, microcontrollers and field-programmable gate arrays and, of course, the Cell Processor.

This series of articles is reproduced from the book "Programming the Cell Processor", Copyright © 2009, by permission of Pearson Education, Inc. Written permission from Pearson Education, Inc. is required for all other uses.

To read more about the Cell processor architecture on Embedded.com, go to:

1) A glimpse inside the Cell processor
2) Programming the Cell Broadband Engine
3) Programming the Cell Processor
4) Cell Processor makes computing more connected
