Beyond cores: Unlocking multicore’s full potential

by Stephen Lau, TechOnline India - March 09, 2011

The number of cores in a system isn’t the be-all and end-all when it comes to processing power. When developers switch from a single-core to a multicore device, they can still run into performance limits. These caps aren’t just about processor speed; they can also stem from the constraints of the overall system configuration.

So instead of focusing on cores alone, developers should turn their attention to system architecture and the performance of key interfaces when gauging the quality of a new platform. That will enable them to measure the function of key interfaces and correlate those to software operations to reach what should be their end goal: unlocking the full potential of a multicore device. 

In the past, with single-core devices, processor speed was the key determinant of performance. Now—with the addition of four, six or more processing elements—measuring system performance can be trickier. 

Links in the chain

You’ll notice we’ve changed our terminology from cores to processing elements. That’s because multicore devices typically do not consist of just cores. They also include hardware accelerators and intelligent peripherals. Those processing elements are all part of the processing chain, so they must be considered in order to assess and tune system performance accurately.

Processor performance continues to increase exponentially, as integrating multiple cores, accelerators and even subsystems into a single device has become routine. This is an extension of the current trend to integrate card-level functions into a single device. 

Because interoperability between devices relies on standards, however, the pace of change for card-level I/O has been much slower. At the same time, the number of interfaces on devices has increased while physical package sizes remain constrained, forcing interfaces to become both fast and narrow. 

As a result, as Figure 1 illustrates, real I/O performance is leveling off.


Figure 1. CPU performance vs. I/O performance.


Where to focus

Focusing on the system level requires a thorough understanding of the data flow. Figure 2 shows a block diagram of a typical multicore device, with a concentration on the shared aspects of the device. 



Figure 2. Monitoring performance on shared interfaces on a TI KeyStone multicore device.


The processing elements consist of cores and coprocessors that share I/O, the external memory interface and internal memory, and are connected by on-chip buses. Typically, the external memory interface is the most heavily used of these shared resources, which makes it the most likely bottleneck. One way to relieve it is to increase internal memory, but that is expensive compared with using commodity external memory. On-chip buses are normally designed with enough performance for the device, so they are less often the limiting factor.

From a software perspective, maximum efficiency is obtained when all processing elements are utilized. 

Modern processing elements achieve maximum performance with repetitive operations on blocks of data, analogous to single instruction, multiple data (SIMD) operations in a core. Those operations are data intensive, however.

The more parallel an algorithm becomes, the greater the data demands on the system. Therefore, having the ability to monitor performance on the memory interface that feeds the processing element is crucial. It is also valuable to measure performance at all key processing element interfaces to gain an overall view of system performance. 
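
The scaling argument can be made concrete with a back-of-the-envelope estimate. The sketch below is only an illustration: the function names, the 4-byte item size, the per-element processing rate and the interface bandwidth are all hypothetical values chosen for the arithmetic, not figures for any particular device.

```python
def required_bandwidth(elements, bytes_per_item, items_per_second):
    """Bytes per second that the memory interface must deliver.

    Assumes every processing element streams its data from external
    memory at the same rate (a simplification for illustration).
    """
    return elements * bytes_per_item * items_per_second


def interface_utilization(elements, bytes_per_item, items_per_second,
                          interface_bytes_per_second):
    """Fraction of the interface's peak bandwidth that the workload needs."""
    demand = required_bandwidth(elements, bytes_per_item, items_per_second)
    return demand / interface_bytes_per_second


# Hypothetical workload: 4-byte items, 100 million items/s per element,
# and a memory interface that peaks at 3.2 GB/s.
for n in (1, 2, 4, 8):
    share = interface_utilization(n, 4, 100_000_000, 3_200_000_000)
    print(f"{n} element(s): {share:.1%} of interface bandwidth")
```

Under these assumed numbers, one element needs an eighth of the interface, while eight elements running the same algorithm in parallel would need all of it, which is why the memory interface deserves monitoring.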

Maximizing performance

Understanding data flow enables developers to focus their efforts on the appropriate interface to boost system performance.

In Figure 2, the external memory interface is of key interest. In many devices, counters in the memory interface provide throughput information. These aggregate counts may not be sufficiently fine-grained, however, if the memory interface is shared among processing elements.

Users need the ability to narrow performance measurements to the processing element of interest. Correlating system-level performance to program operation helps developers make decisions on system performance in the proper context. 
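
As a rough sketch of what narrowing a measurement to one processing element could look like, the example below derives per-element throughput from periodic counter snapshots. The `CounterSample` record, the element names and the sample values are assumptions made for illustration; real devices expose such counters through their own debug tools and register layouts.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    """One snapshot of a hypothetical per-element byte counter."""
    time_us: float      # global timestamp in microseconds
    element: str        # processing element that issued the traffic
    bytes_total: int    # cumulative bytes transferred so far


def throughput_by_element(samples):
    """Return average throughput in MB/s for each processing element.

    Uses the earliest and latest snapshot seen for each element.
    """
    first, last = {}, {}
    for s in sorted(samples, key=lambda s: s.time_us):
        first.setdefault(s.element, s)
        last[s.element] = s
    result = {}
    for name, start in first.items():
        end = last[name]
        elapsed_us = end.time_us - start.time_us
        if elapsed_us <= 0:
            continue  # need two snapshots taken at different times
        # Bytes per microsecond is numerically equal to MB/s (10^6 bytes/s).
        result[name] = (end.bytes_total - start.bytes_total) / elapsed_us
    return result


samples = [
    CounterSample(0.0, "core0", 0),
    CounterSample(0.0, "accel", 0),
    CounterSample(100.0, "core0", 20_000),
    CounterSample(100.0, "accel", 180_000),
]
rates = throughput_by_element(samples)
print(rates)
```

With these made-up samples, an aggregate counter would report 2,000 MB/s on the interface, while the per-element view shows that the accelerator accounts for nine-tenths of it.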

Multicore devices deployed in real-time applications must maintain real-time performance while achieving high power efficiency and lower cost. Real-time systems have deadlines, so performance information, such as accumulated wait time, helps developers understand why a processing element is not performing as expected when working with a specific interface.

For example, if the accumulated wait time for a processing element to access a RapidIO serial interface is longer than expected, it may be the result of other processing elements monopolizing the interface. Information such as average access width is also valuable, as it helps developers understand whether the performance level is due to the type of accesses being made or is the result of some other issue. 
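
The two metrics mentioned here are simple to derive once per-access data is available. The following minimal sketch assumes each access has been recorded as a (wait cycles, bytes transferred) pair for one processing element on one interface; the record format and the numbers are hypothetical.

```python
def interface_stats(accesses):
    """Summarize accesses made by one processing element to one interface.

    accesses: list of (wait_cycles, bytes_transferred) tuples.
    Returns None when there is nothing to summarize.
    """
    count = len(accesses)
    if count == 0:
        return None
    total_wait = sum(wait for wait, _ in accesses)
    total_bytes = sum(size for _, size in accesses)
    return {
        "accumulated_wait_cycles": total_wait,
        "average_wait_cycles": total_wait / count,
        "average_access_width_bytes": total_bytes / count,
    }


# Hypothetical capture: one wide burst followed by three narrow accesses.
stats = interface_stats([(4, 64), (40, 4), (36, 4), (40, 8)])
print(stats)
```

Read together, the numbers help separate the two causes described above: a long accumulated wait with a healthy average access width points toward contention from other processing elements, whereas a small average width suggests that the access pattern itself is worth revisiting.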

A debug architecture that addresses the key challenges of performance analysis on important interfaces and provides visibility into multicore operation is often overlooked during device selection, but it can have a major effect on product schedule and performance.

Industry standards, such as the IEEE 1149.7 debug standard and the MIPI Alliance’s System Trace Protocol (STP) specification, have helped make system visibility and debug capability easily recognizable and more straightforward for software developers to request. 

Developers must also consider the cost of using advanced debug capabilities. For example, system trace information can be collected through an on-chip buffer or a low-cost debug and test controller.

System tracing using the STP standard provides software developers with a hardware-accelerated multicore “printf” capability, through which messages from each processing element are identified and are globally time-stamped by hardware. Software developers receive a global time-correlated view of software execution across processing elements. 
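
To illustrate why a shared hardware time stamp matters, the sketch below merges message streams from two processing elements into one time-ordered view. It is a host-side model only: the tuple layout (timestamp, element, text) is an assumption for illustration and does not represent the STP wire format.

```python
import heapq


def merge_streams(*streams):
    """Merge per-element message streams into one global timeline.

    Each stream is a list of (timestamp, element, text) tuples already
    sorted by timestamp, as they would be if stamped by a shared clock.
    """
    return list(heapq.merge(*streams, key=lambda message: message[0]))


core0 = [(100, "core0", "start FFT"), (340, "core0", "FFT done")]
core1 = [(150, "core1", "buffer ready"), (300, "core1", "start filter")]

for timestamp, element, text in merge_streams(core0, core1):
    print(f"{timestamp:>6} {element}: {text}")
```

Because every message carries a time stamp from the same clock, ordering events across elements becomes a simple merge rather than a guess about clock offsets.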

The capability also provides multiple channels that aid in filtering messages. 

For instance, developers can leverage different channels for each type of software function: Low-level device drivers would be on channel 1, operating system messages on channel 10 and application threads on channel 100. In this instance, a developer working on a hardware problem could view device driver messages by filtering for channel 1 on cores 1 and 2. That would result in the system showing just the device driver messages from those cores. The developer could also correlate these to the performance on a system interface. 
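
The channel scheme in this example can be sketched as a simple filter over captured messages. The `TraceMessage` record and the sample messages are hypothetical; only the channel numbers (1, 10 and 100) and the core numbers come from the example above.

```python
from typing import NamedTuple


class TraceMessage(NamedTuple):
    """A captured trace message (hypothetical host-side representation)."""
    timestamp: int
    core: int
    channel: int
    text: str


# Channel assignments taken from the example in the text.
DRIVER_CHANNEL = 1
OS_CHANNEL = 10
APP_CHANNEL = 100


def filter_messages(messages, channels, cores):
    """Keep only messages on the given channels from the given cores."""
    return [m for m in messages if m.channel in channels and m.core in cores]


messages = [
    TraceMessage(100, 1, DRIVER_CHANNEL, "dma start"),
    TraceMessage(105, 2, OS_CHANNEL, "task switch"),
    TraceMessage(110, 2, DRIVER_CHANNEL, "dma done"),
    TraceMessage(115, 3, DRIVER_CHANNEL, "irq"),
    TraceMessage(120, 1, APP_CHANNEL, "frame ready"),
]

# Device driver messages from cores 1 and 2 only.
for m in filter_messages(messages, {DRIVER_CHANNEL}, {1, 2}):
    print(m.timestamp, f"core{m.core}", m.text)
```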

In Figure 3, transactions made to an accelerator are measured by the CTools 4 Bus watchpoint and traffic monitors. The transactions are shown in context with the software threads.


Figure 3. Visualization of system trace information showing measurement of transactions on a key interface in a TI KeyStone multicore device.


The software threads, which include messages from two processing elements, provide developers with the ability to quickly find and eliminate inefficiencies at the device level.
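
One way to picture the correlation between interface measurements and software threads is to attribute each time-stamped transaction to the thread that was running when it occurred. The sketch below assumes both data sets share a global time base; the thread names, time stamps and byte counts are invented for illustration and do not reflect the output format of any particular tool.

```python
import bisect


def attribute_to_threads(thread_switches, transactions):
    """Total the bytes transferred while each software thread was running.

    thread_switches: list of (timestamp, thread_name) sorted by timestamp,
        marking the moment each thread starts running.
    transactions: list of (timestamp, byte_count) measured on an interface.
    """
    starts = [timestamp for timestamp, _ in thread_switches]
    totals = {}
    for timestamp, byte_count in transactions:
        index = bisect.bisect_right(starts, timestamp) - 1
        if index < 0:
            continue  # transaction happened before the first known thread
        name = thread_switches[index][1]
        totals[name] = totals.get(name, 0) + byte_count
    return totals


switches = [(0, "fft"), (100, "filter"), (250, "fft")]
transfers = [(10, 64), (120, 128), (130, 128), (260, 64), (300, 32)]
print(attribute_to_threads(switches, transfers))
```

A per-thread total like this points a developer at the thread responsible for most of the traffic on an interface, which is the kind of context a raw throughput counter cannot give.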

When working with multicore information, visualization can be a challenge, as trends quickly identified graphically may not be so obvious in a textual display. 

Figure 4 highlights performance analysis on a memory interface from the perspective of a processing element in the device. It also shows throughput (red) and the average access width (green) on the interface. Developers can use such information to fine-tune their systems in order to improve performance.



Figure 4. Performance analysis on memory interface showing throughput (red) and average access width (green).


Although the potential is there for multicore processors to increase processing performance significantly, developers must change their focus from the core to the system.

The ability to obtain a time-correlated view of system performance and software activity is necessary for developers wishing to unlock the full potential of their multicore devices.

About the Author:



Stephen Lau is in emulation technology product management at Texas Instruments. He is responsible for the definition of on-chip debug technology and associated emulator products deployed through TI’s Third-Party Emulation Developer Community.
