Parallel software plays catch-up with multicore

TechOnline India - June 22, 2009

Microprocessors are marching into a multicore future to keep delivering performance gains without frying in their own heat. But mainstream software has yet to find its path to using the new parallelism.

Proprietary programming approaches are gaining traction in a handful of applications. It could take a decade or more, however, for the bulk of the industry to catch up in any organized fashion, and the way forward goes through some tough terrain.

"Anything performance-critical will have to be rewritten," said Kunle Olukotun, director of the Pervasive Parallelism Lab at Stanford University, one of many research groups working on the problem seen as the toughest in computer science today.

"Eventually, you either have to rewrite that code or it will become obsolete," said Olukotun, who will deliver a keynote on the topic this month during the Multicore Virtual Conference.

"This is one of the biggest problems telecom companies face today," said Alex Bachmutsky, a system architect and author of an upcoming book on telecom design. "Their apps were not written for multiple cores and threads, and the apps are huge; they have millions and tens of millions of lines of code."

The ubiquitous C language "is the worst [tool] because it is inherently sequential and obscures any parallelism inherent in an algorithm," said Jeff Bier, president of DSP consulting firm Berkeley Design Technology Inc.

In a study conducted earlier this year by TechInsights, the publisher of EE Times, 62 percent of the embedded systems developers polled said their latest project was written in C. A further 24 percent said they used C++.


Researchers have developed a handful of parallel programming languages, although none is a panacea and all face a long road to commercial adoption. Ultimately Olukotun foresees a set of high-level application-specific tools that automate the process of finding parallelism.

In the meantime, multicore processors are working their way into mainstream designs, so systems, chip and tool vendors are fielding a range of tools for harnessing them.

Some are existing multiprocessing tools, such as OpenMP, now applied at the chip level. Intel and others have released libraries to manage software threads. Startups such as Critical Blue (Edinburgh, Scotland) and Cilk Arts Inc. (Burlington, Mass.) have developed tools to help find parallelism in today's C code.

"There are pros and cons to each of these," said Rob Oshana, a director of software research and development at Freescale Semiconductor who tracks the options.

Developers will need new modeling tools to figure out how to partition their apps early in the design stage, Oshana said. Also on the horizon are parallel debuggers that will help developers optimize their code by letting them visualize data as it travels through the various cores, accelerators and interconnects in a complex system-on-chip.

In the short term, chip vendors will try to package as complete a software stack of parallel code as possible. In some cases, they will go so far as to offer some generic parallel applications, said Steve Cole, a senior system architect at Freescale.

"It could include our applications and third-party apps, and operating systems and tools," Cole said. "That's where the industry is headed."

Freescale has doubled the size of its multicore software team in preparation for such offerings, Cole said.

Meanwhile, OEMs are already finding their own ways to address the problems.


Telecom architect Bachmutsky said control-plane designs are harnessing multicore chips with system-level symmetric multiprocessing (SMP) operating systems. The resulting designs "look [to software] like systems with multiple line cards and a load balancer that divides traffic between blades," he said.

Data plane designs that require tenfold greater performance are tougher because they often use assembly language coding. That environment cannot afford the shared memory overhead of SMP constructs.

Developers find themselves carefully dividing up tasks to each core, watching for data dependencies. They must craft detailed messaging schemes between tasks, then figure out ways to communicate between the data and control plane software stacks, Bachmutsky said.

With the assembly language code, "you are closely linked to the silicon provider and their libraries, and you cannot port software easily to another processor," he said. "Whatever you choose, you wind up married to those solutions."

Progress at the bleeding edge

Some specialized apps are moving even further down the road to parallelism, albeit using proprietary chips and tools. For instance, Nvidia has been pioneering massively parallel programming with its CUDA environment, which runs on versions of its graphics chips, in vertical applications such as oil and gas exploration.

Some designers report success living on the bleeding edge of parallel processing. For example, England's Cambridge Consultants has done contract design work on 3G and WiMax basestations using devices from PicoChip (Bath, England) that pack 250 cores per chip.

For those applications, the consulting company has found the PicoChip devices a better approach than some quad-core digital signal processors. "It seems strange to people at first, but we get shorter, more reliable programs with higher-quality output [using PicoChip] than with conventional single or low-core-count DSPs," said Monty Barlow, who leads a DSP group at Cambridge.


"The high [core count] multicore architecture lets you split functions between cores, develop and test them separately and then move on to other parts of the system knowing those parts will not interact in negative ways," said Barlow. "The alternative is to write programs as threads and rely on an operating system to share time, but the tasks run at unrelated rates, and one day things may conspire against you and something will run late," causing a crash.

The approach requires rewriting software for the PicoChip devices. But Barlow said he finds it a worthwhile trade-off: doing more work up front on the architecture lets the development process that follows flow smoothly.

In a recent column, analyst Bier noted that massively multicore startups such as PicoChip and Tilera use radically different software tools, which makes migrating from their architectures difficult and risky. "It's valuable innovation, but it's long odds because the startups have to succeed in chips and in parallel software," he said in an interview.

"I see these specific [architectures] being short-lived," said Olukotun of Stanford. "As more general-purpose environments become more capable and energy-efficient, they will overtake them."

Olukotun believes research efforts from labs like his will ultimately yield innovations at multiple levels of the software stack. They will automate the process of generating parallel code, eliminating the need for developers to work with threads, locks, message passing, synchronized access to memory and other constructs.

In their place, developers will code in high-level domain-specific languages that automatically generate parallel tasks for a new class of runtime environments with sophisticated schedulers. Those runtime systems will "combine components to create different execution modes so you could generate streaming, atomics, fault tolerance, security or performance monitoring operations," Olukotun said.

Oshana of Freescale agreed. "Many apps are willing to accept more abstraction to get better integration," he said. "For example, hypervisors that let you run multiple OSes will become more and more popular."

"The uptake of these new ideas will depend on the amount of pain developers feel" programming multicore processors, said Olukotun.

That pain may not be widespread for some time. Only about 7 percent of embedded developers in the TechInsights survey said they were using multicore processors. That's up from 4 percent two years ago.

Given the complexity of parallel programming, processors with four or more cores will probably represent little more than 10 percent of the communications systems market in 2012, Linley Gwennap, principal analyst with The Linley Group (Mountain View, Calif.), said in a talk in March.

Gwennap projected that dual-core designs could command as much as 20 percent of the market by 2012.

See also: How SonicWall scaled multicore barriers

