The keys to success in multicore application development

by Rob Oshana, TechOnline India - August 11, 2009

Configuration, virtualization, and visualization are the keys to successful embedded multicore system integration, says Robert Oshana, Freescale's director of engineering in the Development Technology Group.

Multicore processors are becoming ubiquitous in embedded processing. But as these processors become more and more complex, the application developer needs to understand many important architectural details to facilitate proper partitioning of applications across multiple processing elements.

These processing elements could include multiple homogeneous or heterogeneous CPUs, as well as function acceleration blocks and complex peripheral subsystems. Multicore processors are complex systems and require the following to be successfully adopted:

1. System configuration and partitioning, to achieve the best overall performance of the application
2. System virtualization, to abstract the complexity from the developer and provide flexibility in the solution model
3. System visualization, to understand the system performance and profile as data flows through the cores, accelerators, peripherals, and communication interconnect.

Multicore in Networking Applications
As an example, we will consider the networking space, where multicore processing is growing. In the embedded networking area, a network processor is a processor with a feature set specifically targeted at the networking application domain. These processors are software-programmable devices, which allows them to be used in many different domains, including:

* Routers and switches
* Firewalls
* Intrusion detection devices
* Intrusion prevention devices
* Network monitoring systems.

Networking applications require both control and data plane processing (Figure 1 below). Data plane processing consists of both ingress and egress processing. Ingress processing requires high performance since packets can be of various lengths and protocols, and all packets must be parsed, classified, checked for denial-of-service attacks and other security threats, and possibly edited and modified in various ways.

All this must be done at line rates (the data rate of the raw bit stream of a communication link) so performance is key to ingress processing. Egress processing is easier and mainly consists of traffic management functions.
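
Conceptually, the ingress path is a short pipeline of per-packet stages. The following is a minimal sketch in C; the packet layout and stage bodies are hypothetical stand-ins for real parsing, classification, and security logic, which is device- and protocol-specific.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct packet {
    uint8_t *data;   /* raw frame */
    size_t   len;    /* frame length in bytes */
    uint8_t  proto;  /* filled in during parsing */
    uint16_t flow;   /* filled in during classification */
};

static bool parse(struct packet *p)
{
    if (p->len < 34)            /* minimal Ethernet + IPv4 headers */
        return false;
    p->proto = p->data[23];     /* IPv4 protocol field (offset 14 + 9) */
    return true;
}

static void classify(struct packet *p)      { p->flow = p->proto; }   /* toy: queue by protocol */
static bool security_check(struct packet *p){ return p->len <= 1518; }/* toy: drop oversized frames */
static void edit(struct packet *p)          { (void)p; }              /* header rewrite, if any */

/* Ingress: every packet is parsed, classified, checked, and possibly
 * edited -- at line rate, so each stage must be cheap. */
bool ingress(struct packet *p)
{
    if (!parse(p))          return false;   /* malformed: drop */
    classify(p);
    if (!security_check(p)) return false;   /* failed checks: drop */
    edit(p);
    return true;                            /* enqueue for forwarding */
}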

Control processing manages the state of the network elements, including route selection, capability signaling, etc. Control plane processing can be performed with standard RISC-based processing elements, and CPU MIPS are the key focus in this area.

Figure 1. The Network Processing domain consists of data plane and control plane requirements

So the key question is: how do we translate this user domain to the device domain? In other words, how do we partition the application onto a device that meets both the functional as well as the non-functional (e.g., performance and QoS) requirements?

System configuration and partitioning
Embedded systems are designed for efficiency in performance, power and memory. Many embedded systems have the most significant computational requirements driven by a relatively small number of algorithms, which can be identified using common profiling techniques.

These algorithms can then be optimized using software techniques or converted to hardware acceleration using design automation tools. The "accelerators" can then be efficiently interfaced to the offloaded processor, significantly increasing overall system performance.
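
As a simple illustration of this kind of profiling, the sketch below times a candidate hot spot on a Linux host using clock_gettime; the measured function is a stand-in for whatever routine a profiler such as gprof flags as dominant.

#include <stdio.h>
#include <time.h>

static volatile unsigned sink;   /* volatile so the loop isn't optimized away */

/* Stand-in for the routine profiling identifies as a hot spot. */
static void candidate_algorithm(void)
{
    for (unsigned i = 0; i < 10000; i++)
        sink += i * i;
}

int main(void)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < 1000; i++)
        candidate_algorithm();             /* repeat for a stable measurement */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ms = (t1.tv_sec - t0.tv_sec) * 1e3 +
                (t1.tv_nsec - t0.tv_nsec) / 1e6;
    printf("1000 calls took %.3f ms\n", ms);
    return 0;
}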

Figure 2 below is an example of an embedded processing system using these techniques. This device has eight e500 Power Architecture processing cores and acceleration blocks used to manage several important system functions including pattern matching, encryption, buffer management, queue management, and frame management.

If you map this back to the control and data plane processing requirements mentioned earlier, then the partitioning becomes clearer. The data plane processing requirements can be mapped to the acceleration blocks and some of the CPU cores, and the control plane requirements can be performed using the remaining e500 cores. Since there are eight cores, they can be partitioned, if necessary, to perform a combination of data plane as well as control plane processing.

But there are a number of complicating factors that make this easier said than done. For example, it may be necessary to run a lightweight OS like an RTOS on the data plane cores due to performance and QoS requirements.

A heavyweight OS like Linux may be required to handle the complicated control processing on the control plane. There may even be a requirement to run two or more operating systems on a single core if we want to preserve an earlier system configuration and quickly migrate this legacy system to a multicore processor.

Figure 2. Multicore processor with 8 processing elements and acceleration blocks

Embedded processing has been adopting various forms of parallelism for many years. Bit-level parallelism has, of course, been addressed using larger and larger word sizes: 8-bit, 16-bit, 32-bit, 64-bit, and so on.

Instruction level parallelism (ILP) has been addressed by adding more execution units and then using the compiler to do the hard work of managing the data dependencies and scheduling the parallel instructions on the execution units.

Data parallelism has been addressed using technologies like Single Instruction Multiple Data (SIMD), implemented with a combination of hardware execution units, software libraries, and compiler technology.
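
A short example of a data-parallel loop: each iteration is independent of the others, so a vectorizing compiler (or hand-written SIMD intrinsics) can map it onto SIMD execution units and process several elements per instruction. This is generic C, not tied to any particular instruction set.

/* z = a*x + y, element by element. Each iteration is independent,
 * which makes the loop SIMD-friendly. */
void saxpy(float a, const float *x, const float *y, float *z, int n)
{
    for (int i = 0; i < n; i++)
        z[i] = a * x[i] + y[i];
}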

Which approach to parallelism is best?
Some applications, such as certain matrix operations, map well to this sort of parallelism. But other applications, like crypto and security algorithms, are more sequential in nature and do not lend themselves well to data parallelism.

Task parallelism is the last form of parallelism and involves mapping parallel threads of instructions onto the multiple cores. Identifying the parallel execution threads in an application is primarily a manual effort, possibly combined with instrumenting the code to tell a tool how to partition the application. The key challenge is identifying the parallelism opportunities.
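
As a concrete illustration, the sketch below spawns two threads and pins each to its own core using the Linux-specific pthread_setaffinity_np extension; the thread bodies are placeholders for real application tasks.

#define _GNU_SOURCE   /* for pthread_setaffinity_np (Linux/glibc) */
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* One task per core: a minimal sketch of task parallelism. */
static void *task(void *arg)
{
    long core = (long)arg;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET((int)core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    printf("task running on core %ld\n", core);
    /* ... task-specific work ... */
    return NULL;
}

int main(void)
{
    pthread_t t0, t1;
    pthread_create(&t0, NULL, task, (void *)0L);  /* e.g. statistics */
    pthread_create(&t1, NULL, task, (void *)1L);  /* e.g. table updates */
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    return 0;
}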

Multicore processors are suited for both data and task parallelization. Figure 3 below shows an example of a multicore processor with a configuration that supports both data and task parallelism. Each of the cores in the data plane is running the same packet processing application but can operate on different types of data, such as TCP, UDP, and RTP.
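
In this model every data-plane core runs the same image, and only the data differs. A minimal sketch of such a dispatch loop follows; the handler functions are hypothetical stubs (note that RTP is carried over UDP, so a real classifier would also inspect the payload).

#include <stdint.h>
#include <stdio.h>

#define PROTO_TCP 6    /* IP protocol numbers */
#define PROTO_UDP 17

/* Hypothetical per-protocol handlers -- stubs standing in for real code. */
static void handle_tcp(const uint8_t *pkt, int len)   { (void)pkt; printf("TCP  %d bytes\n", len); }
static void handle_udp(const uint8_t *pkt, int len)   { (void)pkt; printf("UDP  %d bytes\n", len); } /* incl. RTP */
static void handle_other(const uint8_t *pkt, int len) { (void)pkt; printf("other %d bytes\n", len); }

/* Every data-plane core runs this same dispatch loop; only the data
 * (the packets arriving on that core's queue) differs. */
void dispatch(const uint8_t *pkt, int len, uint8_t proto)
{
    switch (proto) {
    case PROTO_TCP: handle_tcp(pkt, len);   break;
    case PROTO_UDP: handle_udp(pkt, len);   break;
    default:        handle_other(pkt, len); break;
    }
}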

The control plane is primarily a task parallel model, running various threads on multiple cores performing tasks such as error handling, table updates, execution management, and other management functions.

Now we need to discuss how to "operationalize" a system that is conceptually partitioned like this. How do we actually make it work, using operating systems as the primary programming interface for such a complicated system?

Figure 3. Data parallelism and Task parallelism on the same device

The system configuration of a multicore solution involves deciding how to partition the application services across multiple cores and accelerators. In this example networking application, the key partitioning decisions include how many cores are required to host the application control plane at the desired performance and capability, and how many cores are needed for the data plane.

It's common for embedded systems to also share many system resources between the various tasks, including peripherals, memory, and of course the cores themselves.

Two approaches to multicore configuration
There are two approaches to configuring a multicore system using operating system resources. The first approach is to control the multiple cores on the device with a single operating system. This approach is referred to as SMP (Symmetric Multi-Processing).

This is how many desktop systems work. The OS controls all of the major management functions of the system including scheduling, messaging, synchronization, memory management, and the other services required to implement the complete system. This approach works well for relatively simple systems but is hard to scale as the system complexity grows.

An alternative approach is to run an independent copy of the OS on each of the cores. This is referred to as AMP (Asymmetric Multi-Processing). This approach provides support for added system complexity and flexibility.

For example, each core can be independently restarted if needed using an AMP approach. The OSs can also be different (for example, Linux plus an RTOS running on different cores). The key challenge in this approach is the additional difficulty of setting up such a system: detailed knowledge of both the processor and the OS is required to configure it.

There are additional system-level issues, such as preventing software running on one core from interfering with the memory of another core. Fortunately, there are ways of managing this as well, using additional software we will discuss shortly.

There is nothing that prevents us from running a combination of SMP and AMP configurations for systems that require the benefits of both. For example, next-generation wireless standards such as Long Term Evolution (LTE) have requirements for both the physical layer 1 (PHY) software, which is very MIPS-intensive and requires low-latency processing, and the Medium Access Control (MAC) layer 2 software, which is more control and state machine oriented.

Figure 4. A networking application with SMP configuration for the control plane and AMP configuration for the data plane

In a system like this, it's possible to use one or more cores running an AMP configuration to process layer 1 using an RTOS or some other lightweight executive, and then hand off the processed data to layer 2, which runs under SMP with an OS such as Linux that maps better to the control and management processes of the MAC. Figure 4 above shows an example system configuration for this telecom/networking application.

System Virtualization
It's obvious from the previous discussion that multicore computing has several different configuration options based on the system functional and non-functional requirements.

This flexibility has advantages, but it can also lead to complexity that must be managed. We have seen scenarios where multiple operating systems may be required on a multicore device.

There may even be the need to have more than one OS share the same core, if we choose to enable a legacy application or provide performance capability where low latency is required. To accomplish this, and to manage the inherent complexity it leads to, we must take additional steps to transform the multicore system into a virtual machine.

A virtual machine is created by running the OSs on a software implementation of the device. This software is called a hypervisor; it sits below the operating system(s) and manages access to system resources, such as memory and peripherals, that are not duplicated for each core.

One of the key capabilities of a hypervisor is to provide memory protection, often in the form of a memory management and protection capability. Due to the latency requirements of embedded systems, an embedded hypervisor must be real-time capable, with low memory and processing overhead.

As such, hypervisors usually have support at the device level to accelerate some of the common tasks. A hypervisor that runs directly on the hardware with this type of support is referred to as a "bare metal" hypervisor.

Once again, the choice of whether a hypervisor is needed depends on the system requirements. A simple multicore configuration may be a master-slave configuration where one core manages the others and is the only interface to I/O. The memory map may be static (does not change during run time).
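
In such a static configuration, inter-core communication can be as simple as a polled mailbox in shared memory. A minimal sketch follows; the address, layout, and protocol are hypothetical, and a real implementation would add the device's memory barriers and likely an interrupt instead of polling.

#include <stdint.h>

#define MAILBOX_ADDR 0x20000000u   /* hypothetical shared region from the fixed memory map */

struct mailbox {
    volatile uint32_t ready;   /* 0 = empty, 1 = message pending */
    volatile uint32_t msg;     /* payload word */
};

#define MBOX ((struct mailbox *)MAILBOX_ADDR)

/* Master core posts a message... */
void mbox_send(uint32_t msg)
{
    while (MBOX->ready)        /* wait for the slave to drain */
        ;
    MBOX->msg = msg;
    MBOX->ready = 1;           /* a real port adds a write barrier here */
}

/* ...and the slave core polls for it. */
uint32_t mbox_recv(void)
{
    while (!MBOX->ready)
        ;
    uint32_t msg = MBOX->msg;  /* a real port adds a read barrier here */
    MBOX->ready = 0;
    return msg;
}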

In this scenario a hypervisor may not be required. But more complex system configurations must be managed using another layer of software abstraction. Hypervisors can also provide system flexibility.

For example, consider a low-power application where we want to dynamically shut down cores in times of low network demand, run multiple OSs on the same core, and then bring additional cores back online in times of peak demand. A hypervisor can manage this dynamic scenario.

Figure 5 below shows a conceptual diagram of a multicore system that is virtualized using software virtual machines, on each of which an operating system runs various data and control plane applications.

Each virtual machine has access to a subset of the system cores, memory, and I/O/peripherals. The hypervisor provides virtualization of cores, memory, I/O, and asynchronous events such as interrupts from the environment.

Figure 5. Several virtual machines managed by a hypervisor software layer

System Visibility

One of the disadvantages of integrating multiple cores and accelerators on a single die is the vanishing visibility this causes. There is limited access to the on-chip activity from the outside world (e.g., the device pins), so bus analyzers and logic analyzers become more difficult to use in these scenarios.

In order to overcome this, more of the instrumentation, debug, and profiling logic is being moved from the logic analyzer to the chip itself, by integrating the various cores, accelerators, peripherals, and interconnect with the appropriate triggers, counters, and trace capability.

This provides the information necessary to aid the system integrator with integration, debug, and profiling. Figure 6 below shows a system-wide debug/profiling architecture that maps onto the multicore architecture shown in Figure 2 earlier.

Figure 6. A system debug architecture to support a multicore device

The capability to configure, trigger, decode, sample, and window debug and profiling information is now part of the on-chip debug logic. The challenge becomes using this complex logic and circuitry in a way that supports the debug use case while providing enough coverage to prevent "black boxes" in the system, where there is no visibility into certain aspects of system performance.

Full system visibility provides access to debug and profiling information from the cores, peripherals, accelerators, and interconnect. For system integration, this information is used in different ways, including:

* System level debug. Debug information is collected from a specific problem area, and only "shallow" visibility is required, provided the proper window can be created to capture the relevant debug information.

* System profiling. Profiling information is collected from across the system, and "deep" visibility is required, since the application may have to run for an extended period of time to capture the relevant system profile. (A counter-based profiling sketch follows this list.)
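
On-chip counters and trace units are configured with vendor-specific tools, but the underlying idea can be illustrated on a Linux host with the perf_event interface, which counts hardware events around a region of interest:

#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* Count CPU cycles; perf_event_open has no glibc wrapper, so the
     * syscall is invoked directly. */
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type     = PERF_TYPE_HARDWARE;
    attr.size     = sizeof(attr);
    attr.config   = PERF_COUNT_HW_CPU_CYCLES;
    attr.disabled = 1;

    int fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    /* ... region of interest ... */
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t cycles = 0;
    if (read(fd, &cycles, sizeof(cycles)) == sizeof(cycles))
        printf("cycles: %llu\n", (unsigned long long)cycles);
    close(fd);
    return 0;
}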

As an example of the system profiling challenge, Figure 7 below shows the processing path for the network processing application we have been discussing.

Packets come in from the Gigabit Ethernet peripheral, pass through a Frame Manager accelerator, which manages the network packets and creates the required packet queues, and then move into the data plane cores for dedicated processing.

They then either go back out through the Gigabit Ethernet or move on to additional data plane processing, and perhaps additional accelerator processing such as packet security or a pattern-matching engine to aid in deep packet processing.

The developer needs visibility into this complicated processing path, and the debug/profiling architecture provides it for this application use case, for debug as well as system profiling.

Figure 7. An application use case showing packet flows through a multicore device

In order to provide this needed visibility, the developer must use the appropriate development tools or scripting (a sketch against a hypothetical tool API follows the list) to:

* Configure and window the on-chip debug/profiling logic
* Trigger the capture of data when the appropriate conditions are met
* Extract the data and decode it into coherent information
* Display the data to the user in an appropriate way
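
The sketch below walks through those four steps in C against a purely hypothetical trace API; none of the function names correspond to an actual tool library, and the stub bodies simply make the flow concrete.

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical on-chip trace interface -- illustrative stubs only; a
 * real implementation would program the device's debug logic. */
static int trace_configure(uint32_t core_mask, uint32_t event_mask)
{ printf("window cores 0x%x, events 0x%x\n", core_mask, event_mask); return 0; }

static int trace_arm(uint32_t trigger)
{ printf("armed on trigger %u\n", trigger); return 0; }

static size_t trace_extract(uint8_t *buf, size_t max)
{ (void)buf; (void)max; return 0; }   /* would drain the on-chip trace buffer */

static int trace_decode(const uint8_t *buf, size_t len)
{ (void)buf; printf("decoded %zu bytes of trace\n", len); return 0; }

int main(void)
{
    static uint8_t raw[64 * 1024];

    trace_configure(0x0F, 0x3);   /* 1: window cores 0-3, chosen events */
    trace_arm(1);                 /* 2: e.g. fire on a queue overflow */

    /* ... run the application workload ... */

    size_t n = trace_extract(raw, sizeof raw);  /* 3: pull captured data */
    return trace_decode(raw, n);                /* 4: decode and display */
}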

The solution should provide system-wide visibility, as shown in Figure 8 below. It's not just about what's going on inside the core(s); it's now a system-wide debug/profiling challenge, and the multicore system integrator needs this data to perform the required system integration, test, and profiling.

Figure 8. System-level visualization support for a multicore device provides full system visibility

Conclusion
Multicore processors are complex systems and can be used to solve complex problems. To take full advantage of these powerful devices, they must be configured properly to achieve the best overall performance of the application.

They must also be virtualized, to abstract the complexity from the developer and provide flexibility in the solution model, and visualized, to understand the system performance and profile as data flows through the cores, accelerators, peripherals, and communication interconnect.

Proper support and understanding of these three multicore paradigms will give the developer the tools needed to effectively produce high-performing multicore solutions.

Robert Oshana, Director of Engineering in the Development Technology group at Freescale, has 25 years of experience in the real-time embedded industry, in both applications and tools technology development. He currently manages an international engineering team with global product development responsibility. He is widely published in the industry and speaks regularly at the Embedded Systems Conference. Rob has chaired international standards committees in the embedded space and is a licensed professional engineer. He is also an adjunct lecturer at Southern Methodist University.
