Researchers report progress on parallel path

TechOnline India - August 24, 2009

SAN JOSE, Calif. — Researchers are gaining traction in their efforts to beat a path to programming tomorrow's many-core microprocessors. In a Monday (August 24) session at the Hot Chips conference, three top labs will give what amounts to their first report cards on a task many have characterized as the most ambitious in the history of computer science.

Researchers at UC-Berkeley, University of Illinois and Stanford University have been at work for a little over a year with grants from Intel, Microsoft and a handful of other backers mainly from the PC industry. All three are forming ideas about the future of chips with dozens or hundreds of cores and coding prototype parallel software to harness them.

All sides see the need to rewrite today's software stack, starting with performance-hungry applications, then plugging in new languages and runtime environments, and rewriting or scrapping traditional operating systems. They are already beginning to develop and test early versions of their code.

Work on hardware is in a more nascent stage. Most agree changes will focus on developing new memory structures, but fundamental debates about how to architect memory for many-core processors remain unresolved.

"I think we've made significant strides," said Kunle Olukotun, director of the Pervasive Parallelism Lab at Stanford. "We had this vision and have started to fill in the pieces, and it feels like the vision really is coming to pass," he said.

The labs hope to be able to show "reasonably complete" parallel software stacks running on simulators or prototype hardware within one or two years when their grants are up for renewal. By the end of the five-year grants, they hope to have work that is solid enough to show a path for commercial use.

"We believe within four years we can show companies like Intel and Microsoft how they can make their offerings better support parallel programming," said Marc Snir, co-director of the Universal Parallel Computing Research Center (UPCRC) at the University of Illinois. "There is no silver bullet, but we hope we can make developing parallel software as easy as developing today's software," he said.

The industry expects processors with 64 cores or more will arrive by 2015, forcing the need for parallel software, said David Patterson of the Berkeley Parallel Lab. Although researchers have failed to create a useful parallel programming model in the past, he was upbeat that this time there is broad industry focus on solving the problem.

Patterson will describe Berkeley's work on a two-level approach to scheduling parallel jobs in software. At the lowest level, the group's Tessellation OS allocates to an application a set of hardware resources such as cores, cache and bandwidth, essentially creating a logical partition for coarse-grained parallelism.

Above Tessellation, the Lithe runtime environment provides protocols for sharing resources. Lithe lets users tap into multiple parallel libraries, something that hasn't been possible to date.
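
The division of labor described above can be illustrated with a toy sketch. This is not Berkeley's code; the class names and interfaces are invented here to show the two-level idea: a coarse allocator in the Tessellation role hands an application a disjoint set of cores, and a cooperative lender in the Lithe role lets two parallel libraries share that partition without oversubscribing it.

```python
class PartitionOS:
    """Coarse level: hands each application a disjoint set of cores."""
    def __init__(self, num_cores):
        self.free = set(range(num_cores))

    def allocate(self, n):
        if n > len(self.free):
            raise RuntimeError("not enough free cores")
        part = set(list(self.free)[:n])
        self.free -= part          # cores now belong to this partition only
        return part

class CooperativeRuntime:
    """Fine level: within a partition, libraries borrow and return cores."""
    def __init__(self, cores):
        self.available = set(cores)

    def borrow(self, n):
        lent = set(list(self.available)[:min(n, len(self.available))])
        self.available -= lent
        return lent

    def give_back(self, cores):
        self.available |= cores

os_ = PartitionOS(num_cores=8)
app_cores = os_.allocate(4)        # a logical partition for one application
runtime = CooperativeRuntime(app_cores)
tbb_like = runtime.borrow(3)       # first parallel library takes three cores
omp_like = runtime.borrow(3)      # second library gets only what remains
assert len(tbb_like) + len(omp_like) == len(app_cores)  # never oversubscribed
```

The key point of the sketch is the last assertion: because both libraries draw from one shared pool rather than each spawning its own threads, the partition is never oversubscribed.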

The Berkeley Parallel Lab foresees many-core processors needing a nuanced software stack

Berkeley hopes to release this summer a version of Lithe that works with both Intel's Threading Building Blocks and the OpenMP libraries and runs on today's operating systems. The Tessellation environment is up and running on an x86 multicore processor and is being ported both to Intel's Nehalem server CPU and to the Ramp FPGA simulator board developed at Berkeley.

In a separate project, one graduate student used new data structures to map a high-end computer vision algorithm to a multicore graphics processor, shaving the time to recognize an image from 7.8 to 2.1 seconds. The effort was one example of developing a new stack to better harness parallelism.

"Our goal is to understand what are the recurring problems in applications and come up with frameworks so the next time we parallelize code it doesn't take as much time or a graduate student to do it," said Krste Asanovic, an associate professor at Berkeley.

In a separate project, students demonstrated a method for automatically generating, from programs in the popular Python and Ruby languages, C code geared for multicore environments such as OpenMP and Nvidia's CUDA. Results showed the automatically generated code could be just as fast as hand-written parallel code that requires much more time and effort.
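
A minimal sketch can show the shape of such a generator, though the function name and C template below are illustrative, not the labs' actual tools: a high-level elementwise expression goes in, and equivalent C with an OpenMP worksharing pragma comes out.

```python
def specialize_map(expr, func_name="mapped_kernel"):
    """Emit OpenMP C source for: out[i] = <expr in terms of x[i]>.

    A toy "specializer": the caller writes one line of high-level code
    and never touches threads, pragmas or loop bounds directly.
    """
    return (
        f"void {func_name}(const double *x, double *out, int n) {{\n"
        f"    #pragma omp parallel for\n"
        f"    for (int i = 0; i < n; i++) {{\n"
        f"        out[i] = {expr};\n"
        f"    }}\n"
        f"}}\n"
    )

# Generate a parallel kernel for out[i] = x[i]^2 + 1
c_source = specialize_map("x[i] * x[i] + 1.0")
print(c_source)
```

The emitted string would then be compiled and linked at runtime; the productivity win is that the pragma and loop boilerplate are produced mechanically rather than written by hand for each kernel.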

Likewise, researchers at the UPCRC in Illinois are working to extend languages such as Java and C# so that they prohibit thorny parallelism errors such as race conditions, just as those languages disallow memory and type errors in serial programs today.

"We already have results showing this will work" with the group's prototype of Deterministic Parallel Java, said Snir. "Now we're trying to incorporate this into C#," he said.

"This is perhaps the most important direction of our work in software," Snir added.
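
The core check such a deterministic language performs can be sketched in a few lines. This is an illustration of the general idea, not Deterministic Parallel Java itself: two tasks may run in parallel only if their memory "effects" do not conflict, that is, no region is written by one task and touched by the other.

```python
def conflicts(task_a, task_b):
    """Each task declares 'reads' and 'writes' as sets of region names.

    Returns True when the tasks cannot safely run in parallel:
    one task writes a region the other reads or writes.
    """
    return bool(
        task_a["writes"] & (task_b["reads"] | task_b["writes"])
        or task_b["writes"] & task_a["reads"]
    )

# Hypothetical regions for two halves of a tree, plus one racy task.
left  = {"reads": {"tree.left"},  "writes": {"tree.left"}}
right = {"reads": {"tree.right"}, "writes": {"tree.right"}}
racy  = {"reads": set(),          "writes": {"tree.left"}}

assert not conflicts(left, right)  # disjoint regions: safe to run in parallel
assert conflicts(left, racy)       # overlapping writes: a compiler would reject this
```

A deterministic language performs this kind of analysis statically, so a program with a potential race simply fails to compile, rather than failing intermittently at runtime.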

For its part, Stanford has prototyped in C++ a language it calls Liszt, one of several domain-specific languages it aims to develop. It leverages existing work at the university to create a high-level language for simulating hypersonic vehicles using partial differential equations over meshes.

Liszt is being used in tandem with a lower-level language called Scala, which developers at Twitter adopted when they found they could not get the performance they needed from Ruby. Scala provides support for both functional and object-oriented programming and generates bytecodes that can run in a Java virtual machine.

The Stanford Pervasive Parallelism Lab replaces a traditional operating system with multiple parallel languages and runtime environments

At the next level down, Stanford is developing a homegrown scheduler called Delite that extracts parallelism implicit in higher-level languages. Delite finds code dependencies and manages task execution.

"As you go above 128 cores most OSes are struggling to manage the parallelism," said Olukotun. "Going forward OSes should just give you a bunch of processors, get out of the way and let runtime environments [like Delite] work," he said.

The lab hopes to have by the end of the year a working prototype of Delite for multiple multicore processors. It has parallel efforts in other emerging applications for areas such as virtual worlds and machine learning.
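
The dependency-management job a runtime like Delite takes on can be sketched with a toy scheduler. The code below is an assumption-laden illustration, not Stanford's implementation: given tasks and their dependencies, it groups them into "waves" that can each execute in parallel.

```python
def schedule_waves(deps):
    """Group tasks into parallel waves by dependency order.

    deps maps each task name to the set of tasks it depends on.
    Each returned wave contains tasks whose dependencies are all
    satisfied by earlier waves, so a wave can run fully in parallel.
    """
    deps = {task: set(d) for task, d in deps.items()}
    waves, done = [], set()
    while deps:
        ready = {task for task, d in deps.items() if d <= done}
        if not ready:
            raise ValueError("dependency cycle detected")
        waves.append(sorted(ready))
        done |= ready
        for task in ready:
            del deps[task]
    return waves

# A and B are independent; C needs both; D needs C.
print(schedule_waves({"A": set(), "B": set(), "C": {"A", "B"}, "D": {"C"}}))
# -> [['A', 'B'], ['C'], ['D']]
```

The first wave, A and B, could be dispatched to different cores at once; a real runtime would additionally extract these dependencies automatically from the high-level program rather than taking them as input.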

Hardware represents a smaller slice of the work at the labs so far. Nevertheless, they all expect to define new features to optimize future multicore chips.

"Within the next year or so, once we have a full software stack in place, we can start to understand the real way the apps behave and have fairly good ideas of what hardware support we'll need," said Olukotun. "Most of the innovation will be in memory systems, but some will be in execution units," he added.

The Stanford lab includes Mark Horowitz, a lead developer behind Rambus, and William Dally, who defined the concept of stream processors. For his part, Olukotun helped design what became Sun Microsystems' Niagara processors, some of the most aggressive multicore microprocessors currently available.

The labs expect to define new ways of communicating between processors and between processors and memory. That means today's cache coherency protocols will probably be completely rewritten.

"There's a raging argument about whether you need coherence or not," said Olukotun. "[Renowned game developer] Tim Sweeney wants it for ease of programming, and he's willing to give up performance to get it, but graphics chip designers say you don't need it," Olukotun said.

Researchers at Illinois are studying how to map 1,000 cores onto a design. Snir called for a global memory model that gives programmers more insight into where data resides and when and where it moves.

"You cannot afford to have a traditional cache structure, instead you need to build local caches and a coherence protocol for local clusters to move chunks of data from one cluster to another," said Snir.

Several projects, including the Bulk Multicore effort at Illinois, explore ways to group transactions into buckets that can be executed in parallel, saving memory resources. Microsoft's concept of transactional memory pursues a similar vision.

The Illinois project adds the idea of a cache coherency protocol that can identify data dependencies and data races, Snir said.

A separate Illinois project called DeNovo is in the early stages of exploring how high-level languages and multicore processors can coordinate to avoid data clashes.

"Most of the area in a chip already is not in the processor cores, it's in the caches and buses," said Snir. "That's where the budget is in transistors and energy, and that's where our projects are focused," he said.

For its part, Berkeley has contributed the Ramp FPGA board as a platform to quickly try out the new parallel software on different kinds of simulated multicore processors. The latest version can simulate a 64-core processor at a cost of about $750 per board and turn out results 50 to 100 times faster than a software simulation.
