Multicore technologies and software challenges

by Rajagopal Nagarajan, TechOnline India - January 05, 2010

Multicore processors, that is, processors with more than one core, are entering the mainstream. Today even desktops ship with two or four cores, and this trend will only accelerate in the coming years. This article looks at the drivers behind multicore, the challenges the technology poses to the software community, the options available on the software side, and how the community is likely to respond.


In recent times, there has been a perceptible slowdown in Moore's Law, which states that the number of transistors on a chip doubles roughly every 18 months. Transistor counts are still doubling, but performance is no longer keeping pace. Performance kept up until about 2002, driven by techniques such as pipelining, caching and superscalar design; after that, the gap became visible as these techniques started yielding diminishing returns [1].

Fig 1: Gap between performance and number of transistors (Courtesy: Embedded Systems Design).

For example, between 1993 and 1999, CPU clock speeds increased tenfold. The first 1GHz CPU was released in 2000, but in the nine years since, clock speeds have risen only to 3.3GHz, considerably slower growth than in the preceding six years [2].

Power is another driver. Dynamic power consumption in CMOS rises roughly in proportion to frequency (and with the square of supply voltage), so pushing frequency up drains power heavily. Designing adequate heat sinks and airflows in servers and desktops becomes harder as frequency increases. Higher power also inflates datacenter opex through air conditioning and cooling, at a time when enterprises and service providers are under pressure to cut operating costs. This is referred to as the 'power wall'.

Having exploited the available optimisation techniques and hit the power wall, semiconductor companies are now packing more cores into the processor instead of increasing clock speed. A precursor was Hyper-Threading, which allowed two or more thread contexts on a single core, sharing the same cache and external bus. Intel's first true multicore processor, the dual-core Pentium, was released in 2005.

Recent products from Intel sport quad cores. New Atom cores and ARM Cortex-A9 designs support multiple cores and are aimed at the smartphone and netbook markets. The leader in core count is Tilera, whose processors support up to 64 cores, and there are predictions that processors will support hundreds of cores in the near future [3].

The onus of improving performance has now moved from hardware to software, and the big question is whether the software world is prepared for it.

If software is to use multiple cores effectively, it has to move to a parallel programming model in which each core works independently of the others.

Parallel programming paradigms are not new to the software world; classic textbooks on parallel programming were written twenty years ago. There are "embarrassingly parallel" operations, such as matrix multiplication, Fast Fourier Transforms and graphics rendering, that are ideally suited to parallel execution, and there are well-known programming models such as SIMD.

But for a long time parallel programming stayed on the fringes of the mainstream, an interest area for scientists working on weather forecasting and modeling, or for weapons designers in national laboratories. The rest of the world carried on with conventional sequential programming [4].

The advent of multicore processors has changed the scene. The "free lunch" of ever-faster processors is over, and performance improvement now has to come from software. Unfortunately, the news on the software side is not good: the community is woefully unprepared for the transition [5].

From the software point of view, there are three aspects:

- How to develop new parallel software
- How to refactor existing code to extract performance
- How to avoid processor, tool or platform lock-in in the process

The challenge is most pronounced for existing software that is already in use.

Issues of concurrent programming

When two or more threads access a piece of data, there is always a chance of the accesses being wrongly ordered, creating a race condition. To prevent race conditions, programmers are expected to guard access to the shared resource with a semaphore or a lock. But such concurrent programming is error-prone and can deadlock the system, and it is very difficult to certify that a multithreaded application has no deadlocks.
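A minimal sketch in C++ (using the standard thread and mutex facilities; the counter and iteration counts are purely illustrative) shows both the hazard and the cure. Without the lock, the two threads' read-modify-write sequences can interleave and increments are lost:

#include <iostream>
#include <mutex>
#include <thread>

long counter = 0;           // shared data, visible to both threads
std::mutex counter_mutex;   // serialises access to counter

void add_many(int iterations) {
    for (int i = 0; i < iterations; ++i) {
        // Without this guard, the read-modify-write of ++counter can
        // interleave between the two threads and updates are lost.
        std::lock_guard<std::mutex> guard(counter_mutex);
        ++counter;
    }
}

int main() {
    std::thread t1(add_many, 1000000);
    std::thread t2(add_many, 1000000);
    t1.join();
    t2.join();
    std::cout << counter << std::endl;  // 2000000 only because of the lock
}

Note that the correctness of the result depends entirely on the programmer remembering the lock; nothing in the language enforces it, which is exactly the fragility described above.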

Difficulty of visualization

It is difficult to visualize the interplay of threads of execution once the number of threads grows beyond a handful. The complexity of access orderings and sequences grows exponentially, and the human mind is not used to thinking in parallel.

How can one be sure of bug-free code?

Whenever code is modified there is always a possibility of introducing errors; with the introduction of multiple threads and concurrency, errors become a near certainty. When a program executes sequentially, certain assumptions are taken for granted, the most important being that the code will execute in the order presented in the program. Once the program is split across processors, this assumption no longer holds and confidence in the code nosedives.

So what is the way out?

There are two proposals being discussed in the engineering community: move fully to concurrent programming using "functional programming", or use tools alongside existing programming methods to address the problem.

As we have seen, the main issue with concurrent programming is parallel access to shared data and the complications arising from it. All procedural languages (C, C++ and Java alike) suffer from this issue.

There is a school of thought that advocates moving to the "functional programming" paradigm to avoid the problem altogether [6]. Functional programming languages have their roots in a branch of theoretical computer science called the lambda calculus, invented by Alonzo Church in the 1930s. This evolved into the language LISP. Functional languages treat all computation as the evaluation of functions and avoid state and mutable data: there are no global variables, functions have no side effects, and all relevant data lives on the stack, local to the function. For this reason, functional programming is well suited to distributing computation over multiple cores, and programming in such languages means one need not worry about race conditions, deadlocks and so on.
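The idea can be sketched even in C++ (an illustration of the style only, not functional programming proper; the function names are invented for this example). A function whose result depends only on its arguments and which mutates nothing outside itself can run on any number of cores without locks; its stateful counterpart cannot:

#include <numeric>
#include <vector>

// Pure: the result depends only on the argument, and nothing outside the
// call is modified, so any number of cores may run this concurrently.
// (Assumes samples is non-empty.)
double mean(const std::vector<double>& samples) {
    return std::accumulate(samples.begin(), samples.end(), 0.0)
           / samples.size();
}

// Impure counterpart: the write to the global last_mean is a side effect,
// so two threads calling this concurrently race on it.
double last_mean = 0.0;
double mean_with_side_effect(const std::vector<double>& samples) {
    last_mean = std::accumulate(samples.begin(), samples.end(), 0.0)
                / samples.size();
    return last_mean;
}

Functional languages make the first style the only style, which is why the race-condition problem disappears by construction.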

So, is functional programming the answer to the challenges of parallel programming? Unfortunately, there are issues with practical adoption of the model [7]. Most programmers are used to the procedural way of programming, and learning LISP or another functional language involves a steep learning curve. Just as Object Oriented Programming needed a "mindset" change, one has to change one's way of thinking.

Some "midway" approaches have been proposed to address this. These include Scala, a Java/FP hybrid [8]. Scala lets a programmer use existing Java code and familiar Eclipse plugins, and compiles Scala source to Java bytecode. Similarly, Microsoft has come up with F# for the .NET environment. These hybrid models allow you to move some modules to functional programming without leaving the familiar environment of existing libraries and code.

For pure functional programming, there are choices like Erlang, a language used extensively by Ericsson in its mission-critical products. But the long learning curve and academic flavour of these languages stand in the way of wider adoption.

So, is the world ready to adopt functional programming? By all accounts, the answer is no. We cannot expect a desktop or web application writer to abandon C# or Java and move to Erlang or Scala.

So what is the way out? Instead of starting revolutions and expecting everyone to move to new promised lands, some companies are building tools that help refactor existing code for better performance [9]. These tools examine the existing code, suggest optimisations and help the programmer refactor it to extract performance.

RapidMind is a tool that helps parallelize existing code to exploit multiple cores [10]. The code is annotated with RapidMind constructs and then recompiled by the tool. RapidMind was acquired by Intel in August 2009.

The tool's strengths include the ability to target different architectures (nVidia GPUs, x86, the Cell processor) and to use native tool chains (Windows VC++ or the Linux GNU tool chain). There is no lock-in to a processor architecture, since the code is retargeted for different runtimes. The product claims it can create thousands of threads, shows impressive performance gains, and ships with performance analysis and debugging tools. The cons: only the C++ language is supported, and there is lock-in to the tool itself.

Fig 2: Architecture of RapidMind tool (Source: RapidMind)


OpenMP is a standard for shared-memory multiprocessing [11], defined for the C, C++ and Fortran languages [12]. It consists of compiler directives, a runtime library and environment variables. The code is instrumented with the directives and compiled with an OpenMP-enabled compiler.

Fig 3: OpenMP working (Source: https://computing.llnl.gov)

The standard allows automatic creation of multiple threads and automatic forking and joining of them. It is very useful in operations like array manipulation.
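A representative sketch (assuming a C++ compiler with OpenMP support; the array sizes are illustrative): the parallel for directive tells the compiler to split the loop's iterations across a team of threads.

#include <cstdio>
#include <vector>

int main() {
    const int n = 1000000;
    std::vector<double> a(n), b(n), c(n);
    for (int i = 0; i < n; ++i) { a[i] = i; b[i] = 2.0 * i; }

    // The directive forks a team of threads, divides the iterations of
    // the following loop among them, and joins the team at loop end.
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];

    std::printf("c[n-1] = %f\n", c[n - 1]);
    return 0;
}

Built with an OpenMP-aware tool chain, for example g++ -fopenmp; without the flag, the pragma is ignored and the loop simply runs sequentially, which is part of OpenMP's appeal for incremental parallelisation.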

The pros of OpenMP include the relative ease of the technique, the hiding of thread semantics, the ability to parallelise code incrementally, support for a wide variety of platforms and languages, and support for both coarse- and fine-grained parallelism. The main con is the need for tool chains (compilers and runtimes) that support OpenMP; the Sun Studio tool chain and GNU GCC 4.3.1 do. OpenMP is popular in the mathematical and scientific communities.

CUDA is a development environment promoted by nVidia [13]. It offers a parallel programming model that scales over thousands of threads [14]. Like OpenMP, it extends the C/C++ languages. Threads are grouped into "thread blocks" that work together on a portion of shared memory, and there are extensions for synchronisation and locking. CUDA is designed for nVidia processors and involves a substantial learning curve. It is being adopted in scientific computing, visualization, high-end graphics, financial modeling and digital signal processing; outside those areas there is little current interest in the model, as it is wedded tightly to one processor family.
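A hedged sketch of the model (standard CUDA C/C++ extensions; the sizes and names here are illustrative): the __global__ kernel runs once per thread, the <<<blocks, threads>>> launch configuration defines the grid of thread blocks, and each thread derives its element index from its block and thread IDs.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    // One thread per element: compute this thread's global index.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *h_a = new float[n], *h_b = new float[n], *h_c = new float[n];
    for (int i = 0; i < n; ++i) { h_a[i] = i; h_b[i] = 2.0f * i; }

    // Copy inputs to the GPU, since kernels operate on device memory.
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Launch enough 256-thread blocks to cover all n elements.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    std::printf("h_c[1] = %f\n", h_c[1]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    delete[] h_a; delete[] h_b; delete[] h_c;
    return 0;
}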

Ct is Intel's answer to nVidia's CUDA. Ct offers C++ language extensions in the style of an STL library. The model is meant to provide determinism, avoiding deadlocks and races, and the tool chain is (naturally) optimised for Intel x86 processors. The product is still under development.

The various software market segments, such as servers, scientific computing and desktop computing, are likely to adopt different strategies for multicore:

1. Desktop applications like spreadsheets, word processors and editors are not amenable to multithreading. Refactoring such code to exploit parallel processing is a huge effort, so it is likely that mainstream working software will not be refactored in a big way and will continue to run as it does today.

2. Some "embarrassingly parallel" applications can be rewritten using techniques like OpenMP, RapidMind or CUDA, and these will enter the mainstream. One example is high-end visualization, where each pixel can be rendered in parallel. Similarly, in packet processing, each core can handle an incoming packet independently; such applications scale easily across cores. Another example is transcoding, where video encoded in one format is converted to another in realtime for streaming.

3. New languages like Erlang, or hybrids like Scala and F#, will be tried in specific environments, but widespread programming and deployment are unlikely in the near future, given the paradigm shift required.

4. On the desktop side, new operating systems like Windows 7 expose processor affinity features. These let the user attach an application to a core: antivirus software on one core, multimedia playback on another, and the rest of the applications on the remaining cores. This coarse-grained parallelisation will improve the desktop user experience (a small sketch of the affinity call follows this list).

5. On the server side, virtualisation will allow each core to run a different image or operating system: one core running Windows Server, another running Linux 2.6, and a third running a legacy application on Windows 95. Multicore will be well leveraged on servers, but at the coarser level of the OS itself rather than within applications.

6. In the embedded world it is possible to refactor code, and also to partition it logically into large blocks that run on different cores. For example, one could run a hypervisor with the embedded code on one core and Windows on another; the data produced by the embedded code can be visualised in Windows, cutting the overall cost of the system.
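As promised in point 4, here is a minimal sketch of coarse-grained affinity using the Win32 API (the call shown is a standard Windows API; the single-core mask is purely illustrative):

#include <windows.h>
#include <cstdio>

int main() {
    // Bit 0 set: restrict this process to logical core 0 only.
    // (The 0x1 mask is illustrative; production code should first query
    // the legal mask with GetProcessAffinityMask.)
    DWORD_PTR mask = 0x1;
    if (SetProcessAffinityMask(GetCurrentProcess(), mask))
        std::printf("Process pinned to core 0\n");
    else
        std::printf("SetProcessAffinityMask failed: %lu\n", GetLastError());
    return 0;
}

The same effect can be achieved interactively from Task Manager, which is how end users would typically assign applications to cores.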

Conclusion

Multicore is here to stay, and the software community is being forced to respond. The challenges of concurrent programming are to be addressed by functional programming languages and hybrids, and by tools like CUDA, RapidMind and OpenMP that help developers refactor code.

In the short to medium term, there will be wider adoption in some areas: server applications (in the form of virtualisation), specialised processing applications (such as visualisation and high-end graphics) and embedded applications. In mainstream programming, however, progress will be slow.

In spite of the tools and new approaches, it will be quite some time before the software community can extract a decent increase in application performance.

References:

1. http://www.embedded.com/columns/technicalinsights/198701652?_requestid=1042869.

2. http://www.slideshare.net/tabithascatena/build-high-performance-apps-on-multicore-systems-using-sun-studio-compilers-and-tools - gives a good idea of Solaris tools based on OpenMP.

3. http://www.geek.com/articles/chips/thousands-of-cores-in-our-multi-core-future-20070330/.

4. http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.pdf - University of California, Berkeley report on trends in parallel processing.

5. http://www.mdronline.com/mpr_public/editorials/edit21_18.html - the classic "Dread of threads" article in MPR, which gives a bleak view of the software world's unpreparedness.

6. http://www.defmacro.org/ramblings/fp.html - a good introduction to functional programming.

7. http://www.ddj.com/development-tools/212201710;jsessionid=RECHGGDHI45WKQSNDLPCKH0CJUNN2JVN?_requestid=589724 - article in Dr. Dobb's Journal on functional programming.

8. http://www.artima.com/scalazine/articles/twitter_on_scala.html.

9. http://www.mdronline.com/mpr_public/editorials/edit22_30.html - article in Microprocessor Report that discusses different tools for multicore.

10. http://www.rapidmind.net/pdfs/WP_RapidMindPlatform.pdf - the RapidMind platform.

11. http://www.slideshare.net/timcrack/10-multicore-07 - a good tutorial on shared-memory parallel processing.

12. https://computing.llnl.gov/tutorials/openMP/ - tutorial on OpenMP from Lawrence Livermore National Laboratory.

13. http://www.nvidia.com/docs/IO/55972/220401_Reprint.pdf - report in MDR on CUDA.

14. http://www.slideshare.net/npinto/iap09-cudamit-6963-lecture-01-gpu-computing-using-cuda-david-luebke-nvidia-presentation - a good presentation on CUDA.

Rajagopal Nagarajan is head of the System Software practice at MindTree Ltd. He has more than 20 years of experience in embedded and networking product development. At MindTree, he has led the Networking industry group, which was responsible for VoIP product development and for building intellectual property on network processors. Before joining MindTree, he worked on device driver and networking product development at Wipro. Nagarajan holds a BE in Electronics and Communication Engineering from the College of Engineering, Guindy, India.
