Multicore processors are becoming ubiquitous in embedded processing. As these processors grow more complex, however, the application developer needs to understand many important architectural details in order to partition applications properly across multiple processing elements.
These processing elements can include multiple heterogeneous or homogeneous CPUs as well as function acceleration blocks and complex peripheral subsystems. Multicore processors are complex systems, and successful adoption requires the following:
1. System configuration and partitioning, to achieve the best overall performance of the application
2. System virtualization, to abstract the complexity from the developer and provide flexibility in the solution model
3. System visualization, to understand the system performance and profile as data flows through the cores, accelerators, peripherals, and communication interconnect.
Multicore in Networking Applications
As an example, we will consider the networking space where multicore processing is growing. In the embedded networking area, a network processor is a processor which has a feature set specifically targeted at the networking application domain. These processors are software programmable devices because they are used in many different domains, including:
* Routers and switches
* Firewalls
* Intrusion detection devices
* Intrusion prevention devices
* Network monitoring systems.
Networking applications require both control and data plane processing (Figure 1 below). Data plane processing consists of both ingress and egress processing. Ingress processing requires high performance because packets arrive in various lengths and protocols, and every packet must be parsed, classified, checked for denial-of-service attacks and other security threats, and possibly edited or modified in various ways.
All this must be done at line rates (the data rate of the raw bit stream of a communication link) so performance is key to ingress processing. Egress processing is easier and mainly consists of traffic management functions.
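To make the line-rate requirement concrete, the following sketch estimates the per-packet time budget on an Ethernet link. The 10 Gb/s link speed and the 1.2 GHz core clock are illustrative assumptions, not figures from the device discussed in this article. The 20 bytes of per-frame overhead (preamble, start-of-frame delimiter, and inter-frame gap) are standard for Ethernet.

```python
# Rough per-packet budget at line rate (a sketch under stated assumptions).

# 7 bytes preamble + 1 byte start-of-frame delimiter + 12 bytes inter-frame gap
ETHERNET_OVERHEAD_BYTES = 20


def packets_per_second(link_bps, frame_bytes):
    """Maximum frame rate on a link when every frame is frame_bytes long."""
    bits_on_wire = (frame_bytes + ETHERNET_OVERHEAD_BYTES) * 8
    return link_bps / bits_on_wire


def ns_per_packet(link_bps, frame_bytes):
    """Time available to process one frame before the next one arrives."""
    return 1e9 / packets_per_second(link_bps, frame_bytes)


def cycles_per_packet(link_bps, frame_bytes, core_hz):
    """Cycle budget per frame for a single core running at core_hz."""
    return core_hz / packets_per_second(link_bps, frame_bytes)


if __name__ == "__main__":
    LINK_BPS = 10_000_000_000  # assumed 10 Gb/s link
    CORE_HZ = 1_200_000_000    # assumed 1.2 GHz core clock
    for size in (64, 512, 1518):
        print(
            size,
            round(packets_per_second(LINK_BPS, size)),
            round(ns_per_packet(LINK_BPS, size), 1),
            round(cycles_per_packet(LINK_BPS, size, CORE_HZ), 1),
        )
```

Under these assumptions, minimum-size 64-byte frames arrive at roughly 14.88 million packets per second, which leaves about 67 ns per packet. A budget this small is one reason ingress work tends to be spread across several cores and accelerators.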
Control plane processing manages the state of the network elements, including route selection, capability signaling, and so on. It can be performed with standard RISC-based processing elements, and CPU MIPS are the key focus in this area.
Figure 1. The network processing domain consists of data plane and control plane requirements
So the key question is how to translate this user domain to the device domain. In other words, how do we partition the application onto a device so that it meets both the functional and the non-functional (e.g., performance and QoS) requirements?
System configuration and partitioning
Embedded systems are designed for efficiency in performance, power, and memory. In many embedded systems, the most significant computational requirements are driven by a relatively small number of algorithms, which can be identified using common profiling techniques.
These algorithms can then be optimized using software techniques or converted to hardware acceleration using design automation tools. The "accelerators" can then be efficiently interfaced to the offloaded processor, significantly increasing overall system performance.
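How much an accelerator helps depends on how much of the run time the offloaded algorithms account for. A minimal sketch of that reasoning, using Amdahl's law, is shown here. The profile values, function names, and the 10x accelerator speedup are hypothetical and chosen only for illustration.

```python
# Estimating the benefit of offloading hot spots (illustrative numbers only).

# Hypothetical profile: share of total run time spent in each function.
PROFILE = {
    "pattern_match": 0.45,
    "encrypt": 0.30,
    "parse": 0.15,
    "other": 0.10,
}


def hot_spots(profile, threshold=0.25):
    """Return the functions whose run-time share meets the threshold, largest first."""
    names = [name for name, share in profile.items() if share >= threshold]
    return sorted(names, key=lambda name: profile[name], reverse=True)


def overall_speedup(offloaded_fraction, accelerator_speedup):
    """Amdahl's law: whole-application speedup when only a fraction is accelerated."""
    if not 0.0 <= offloaded_fraction <= 1.0:
        raise ValueError("offloaded_fraction must be between 0 and 1")
    if accelerator_speedup <= 0:
        raise ValueError("accelerator_speedup must be positive")
    remaining = 1.0 - offloaded_fraction
    return 1.0 / (remaining + offloaded_fraction / accelerator_speedup)


if __name__ == "__main__":
    candidates = hot_spots(PROFILE)
    fraction = sum(PROFILE[name] for name in candidates)
    print(candidates, round(overall_speedup(fraction, 10.0), 2))
```

With these made-up numbers, offloading 75% of the run time to blocks that are ten times faster gives roughly a 3x overall gain. The 25% that stays on the CPU limits the result, which is why profiling comes first.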
Figure 2 below is an example of an embedded processing system using these techniques. This device has eight e500 Power Architecture processing cores and acceleration blocks used to manage several important system functions including pattern matching, encryption, buffer management, queue management, and frame management.
If you map this back to the control and data plane processing requirements mentioned earlier, then the partitioning becomes clearer. The data plane processing requirements can be mapped to the acceleration blocks and some of the CPU cores, and the control plane requirements can be performed using the remaining e500 cores. Since there are eight cores, they can be partitioned, if necessary, to perform a combination of data plane as well as control plane processing.
But a number of complicating factors make this easier said than done. For example, it may be necessary to run a lightweight OS such as an RTOS on the data plane cores to meet performance and QoS requirements.
A heavyweight OS such as Linux may be required to manage the complicated processing on the control plane. There may even be a requirement to run two or more operating systems on a single core if we want to preserve an earlier system configuration and quickly migrate this legacy system to a multicore processor.
Figure 2. Multicore processor with 8 processing elements and acceleration blocks
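The partitioning described above for the eight-core device can be sketched as a static configuration table that assigns each core to a plane and an operating system. The six-to-two split, the partition names, and the `validate` helper are hypothetical. A real system would express this through the vendor's boot or hypervisor configuration, not through application code.

```python
# Static core partitioning sketch for an eight-core device (hypothetical split).
from dataclasses import dataclass

NUM_CORES = 8  # the device in the text has eight e500 cores


@dataclass(frozen=True)
class Partition:
    name: str     # role of the partition, e.g. data plane or control plane
    cores: tuple  # core indices owned by this partition
    os: str       # operating system running on those cores


def validate(partitions, num_cores=NUM_CORES):
    """Reject out-of-range or doubly assigned cores; return unassigned cores."""
    seen = set()
    for part in partitions:
        for core in part.cores:
            if not 0 <= core < num_cores:
                raise ValueError(f"{part.name}: core {core} does not exist")
            if core in seen:
                raise ValueError(f"{part.name}: core {core} is already assigned")
            seen.add(core)
    return sorted(set(range(num_cores)) - seen)


# Assumed split: six cores for packet processing, two for control.
PARTITIONS = [
    Partition("data_plane", (0, 1, 2, 3, 4, 5), "RTOS"),
    Partition("control_plane", (6, 7), "Linux"),
]

if __name__ == "__main__":
    print("unassigned cores:", validate(PARTITIONS))
```

Keeping the assignment in one table makes it easy to try a different split, for example moving a core from the data plane to the control plane, and to catch a core that has been claimed twice.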
Embedded processing has been adopting various forms of parallelism for many years. Bit-level parallelism has, of course, been addressed using larger and larger word sizes: 8 bit, 16 bit, 32 bit, 64 bit, etc.
Instruction-level parallelism (ILP) has been addressed by adding more execution units and then using the compiler to do the hard work of managing the data dependencies and scheduling the parallel instructions on the execution units.
Data parallelism has been addressed using technologies like Single Instruction Multiple Data (SIMD), which is implemented using a combination of hardware, software libraries, and compiler technology.
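The idea behind data parallelism can be illustrated without SIMD hardware. The sketch below uses a plain 32-bit word as four independent 8-bit lanes and adds all four lanes with one pass of integer arithmetic, a technique sometimes called "SIMD within a register". It is a software emulation for illustration, not how a hardware SIMD unit is programmed, and the function names are hypothetical.

```python
# Four 8-bit additions in one 32-bit word (software emulation of the SIMD idea).

LANE_HIGH_BITS = 0x80808080  # top bit of each 8-bit lane
LANE_LOW_BITS = 0x7F7F7F7F   # low seven bits of each 8-bit lane


def pack_u8x4(lanes):
    """Pack four values in 0..255 into one word, first lane most significant."""
    if len(lanes) != 4:
        raise ValueError("exactly four lanes are required")
    word = 0
    for value in lanes:
        if not 0 <= value <= 0xFF:
            raise ValueError("lane values must fit in 8 bits")
        word = (word << 8) | value
    return word


def unpack_u8x4(word):
    """Split a 32-bit word back into its four 8-bit lanes."""
    return [(word >> shift) & 0xFF for shift in (24, 16, 8, 0)]


def add_u8x4(a, b):
    """Add four 8-bit lanes at once; each lane wraps modulo 256 independently."""
    # Adding only the low seven bits keeps carries from crossing lane boundaries.
    low_sum = (a & LANE_LOW_BITS) + (b & LANE_LOW_BITS)
    # The top bit of each lane is its carry-in XOR the two operands' top bits.
    return (low_sum ^ ((a ^ b) & LANE_HIGH_BITS)) & 0xFFFFFFFF


if __name__ == "__main__":
    a = pack_u8x4([1, 255, 127, 16])
    b = pack_u8x4([1, 1, 128, 1])
    print(unpack_u8x4(add_u8x4(a, b)))
```

The second lane wraps from 255 to 0 without disturbing its neighbors, which is the behavior a hardware SIMD unit provides directly with dedicated lanes.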