The ITRS process roadmap and nextgen embedded multicore SoC design

by Fawzi Behmann, Power.org , TechOnline India - March 22, 2009

The chairman of the networking System Drivers Working Group at ITRS outlines the impact that group's process technology roadmap will have on the architecture and capabilities of next-generation multicore SoC designs.

Driven by general macro trends such as Internet everywhere, IP everywhere and seamless mobility, the International Technology Roadmap for Semiconductors, in its 15-year assessment of semiconductor technology requirements, projects that as technologies and structures push the limits of Moore's Law and productivity, new semiconductor approaches to scaling and new functionality on- and off-chip will be required. Figure 1, below, shows macro trends for cellular technology.

The semiconductor technologies that will be required can be broadly grouped into three categories: "Moore" (geometric scaling), "More of Moore" (equivalent scaling) and "More than Moore" (functional diversification), all of which will have significant impact on the embedded networking space, with new system-on-chip architectures that make extensive use of:

1) Multi-Core (MC),
2) Cache hierarchy,
3) On-chip fabric,
4) On demand Accelerator Engine (AE), and
5) Connectivity,

all engineered to provide a scalable, software-based Multi-Core/Accelerator Engine SoC (SOC-MC/AE) solution that targets a wide range of applications, from ultra low-end to high-end, and preserves and extends the user experience through new services.

Figure 1. Macro trends for cellular technology.

The three "Moores"
As technologies and structures push the limits of Moore's Law and productivity, the ITRS initiated the concept of "More Than Moore." The concept, which first appeared in the 2005 ITRS publication, calls for the integration of functionality that does not scale: mostly analog functionality, but also passives, high voltage, sensors, actuators and enablement.

During the ITRS summer conference, an overall definition was introduced grouping the three aspects of the "Moore" concept:

Moore: Geometric Scaling
More of Moore: Equivalent Scaling
More Than Moore: Functional Diversification

"Moore's Law" is mostly focused on geometric scaling: the continued shrinking of horizontal and vertical physical feature sizes of on-chip logic and memory in order to improve density (cost-per-function reduction), performance (speed, power) and reliability for applications and end customers.

"More of Moore" is about equivalent scaling, which occurs in conjunction with, and also enables, continued geometric scaling, plus non-geometrical process techniques that affect the electrical performance of the chip. The third element, "More Than Moore," is about functional diversification.

"More Than Moore" refers to the incorporation into devices of functionalities that do not necessarily scale according to "Moore's Law," but provide additional value to the end customer in different ways.

The "More-Than-Moore" approach typically allows non-digital functionalities (e.g. RF communication, power control, passive components, sensors, actuators, third-party IP/enablement) to migrate from the system board level to the package level (SiP) or chip level (SoC).

There is an increasing tendency to put more functions on a chip that do not scale according to the same pattern [as defined in Moore's Law]. This is functional diversification rather than scaling, but it is part of the same business and the same technology.

The combination of "Moore's Law" and "More Than Moore" enables the creation of system-on-a-chip and system-in-a-package and, as such, adds value to systems rather than just integrating more of the same functions on a chip.

Functional diversification in SoC design
The ITU-R is currently studying user demand predictions for future systems, such as the amount of traffic from the year 2010 onwards, to calculate the spectrum bandwidth required for the future development of IMT-2000 and IMT-Advanced.

The IMT-2000 (International Mobile Telecommunications) systems are third-generation mobile systems that provide access to a wide range of telecommunication services supported by the fixed telecommunication networks (e.g. PSTN/ISDN/IP), and to other services specific to mobile users. Among the key features of IMT-2000 are:

1) Capability for multimedia applications within a wide range of services and terminals
2) High degree of commonality of design worldwide
3) Compatibility of services within IMT-2000 and with the fixed networks
4) High quality
5) Worldwide roaming capability, and,
6) Small terminal suitable for worldwide use

The next 5-15 years will also mark trends towards:

1) Scalable networks that deliver rich multimedia content at broadband speeds anywhere, anytime and on any device;
2) Markets in which the consumer will play a major role in creating rich multimedia content;
3) Emergence of advanced IP-based applications and services that drive high-bandwidth scalable networks;
4) Complex multiprocessing platforms equipped with multi-core/multi-threading and accelerators that support advanced applications and services;
5) Advancement in process technology from 65nm through 45nm, 32nm and 22nm to sub-10nm;
6) Scalable encryption and antivirus everywhere in the network;
7) Home networking becoming a complex converged network for data communications and entertainment;
8) Seamless mobility in the home, in the office/vertical market and on the road.

In contrast with PC and server applications, and due to the fundamental gap between core speeds and memory/I/O latencies, today's embedded processor architectures are unable to deliver meaningful performance for the connected computing scenarios outlined earlier.

Nearly every commercially available integrated general-purpose processor shipping in volume today is designed using a single-threaded architecture, which is performance- and application-limited by today's standards. As applications become more and more network-centric, this legacy processor design approach fails to address the throughput requirements of today's converging compute and networking paradigm.

This evolving packet-oriented environment is characterized by high memory access latencies, which are not effectively managed by conventional processor architectures. This weakness can severely impact processor performance and workload efficiency. When a memory access cannot be serviced immediately and no additional instructions are ready to be executed, conventional processors stall and waste valuable processing cycles.
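A back-of-the-envelope model makes the cost of these stalls concrete. The miss rate and miss penalty below are illustrative assumptions, not figures from the article:

```python
# Back-of-the-envelope model of cycles lost to memory stalls on a
# single-threaded in-order core. All numbers are illustrative assumptions.

def stall_fraction(miss_rate, miss_penalty_cycles, cpi_base=1.0):
    """Fraction of total cycles spent stalled on memory misses."""
    stall_cycles = miss_rate * miss_penalty_cycles   # stall cycles per instruction
    return stall_cycles / (cpi_base + stall_cycles)

# A packet-processing workload with poor locality: 5% of instructions
# miss all caches and pay a 200-cycle round trip to DRAM.
f = stall_fraction(miss_rate=0.05, miss_penalty_cycles=200)
print(f"{f:.0%} of cycles wasted stalling")  # ~91%
```

Even modest miss rates dominate execution time once the penalty runs to hundreds of cycles, which is why the architectures discussed below attack latency rather than raw frequency.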

{pagebreak}

SOC-PE Consumer & SOC-MC/AE Networking Architectures

Adding "More of Moore" to the mix (Table 1, below) provides a converged, integrated heterogeneous platform (Figure 2, below) that makes possible the creation of a scalable, intelligent, compact value-add ecosystem. This SoC-PE/platform implementation, based on the scaling achieved through the three "Moores," is becoming an important paradigm moving forward.

Table 1. SOC-MC/AE Networking Platform and Moore's classification

Early in 2005, the ITRS introduced the SOC-PE architecture template, where a PE is a processor customized for a specific function, targeting portable and wireless applications such as smart media-enabled phones or digital camera chips, but also high-performance computing and enterprise applications.

To complement this SOC-PE architecture, a Multi-Core/Accelerator Engine SoC architecture template is defined to address the embedded networking space, as shown in Figure 2 below. As shown, the MC/AE SOC networking platform contains the necessary building blocks to:

1) Support Multi-Core (MC) processing for high performance within a 30-watt power envelope;

2) Support an unprecedented tri-level cache hierarchy, with backside L2 caches, multiple shared L3 caches and multiple memory controllers;

3) Support high-speed interconnectivity;

4) Introduce a scalable on-chip fabric for concurrent, non-blocking, hardware-based, 100% cache-coherent platform connectivity that scales to more than 32 cores and supports heterogeneous cores;

5) Eliminate shared-bus contention and support dramatically higher address-issue bandwidth to "feed" multiple cores;

6) Include an on-demand Accelerator Engine (AE) that offers performance advantages over pure core processing cycles, enables lower-power implementations and reduces silicon area and cost;

7) Support a hybrid simulation environment combining cycle accuracy and functional accuracy that enables ease of software development, performance prediction and optimization;

8) Provide network/system enablement and an ecosystem addressing software partitioning and virtualization that leverage the multi-core hardware architecture.

Figure 2. A Multi-Core/Accelerator Engine SoC Architecture template defined to address embedded networking apps

The MC/AE SOC network platform contains the necessary building blocks to provide a scalable, software-based solution and addresses a wide range of applications, from ultra low-end to high-end, that preserve and extend the user experience through new services.

Multi-Core (MC). Core frequencies across the multi-core product line will be targeted at over 1 GHz. This platform targets the highest instructions-per-cycle (IPC) and the highest frequency for a given watt per unit area.

The MCs are also designed to offload repetitive and compute-intensive operations to high-performance acceleration blocks, freeing processing cycles for higher throughput or new services and applications.

Each MC core in the platform will have its own backside L2 cache. The backside cache is connected to the CPU through a direct channel, enabling extremely high application performance.

It allows the cache to match the full speed of the CPU, resulting in latency improvements of well over 50 percent versus "shared bus/shared cache" architectures. The backside L2 cache also enables tuning the contents of the cache between instructions and data according to different application needs, easing partitioning and improving performance by drastically reducing CPU stalls.

In addition, the L2 backside cache reduces traffic on the on-chip fabric and main memory, which reduces latencies and improves bandwidth for other users of the fabric and system memory.

Multithreading and multiprocessing are closely related. Indeed, one could argue that the difference is one of degree: Whereas multiprocessors share only memory and/or connectivity, multithreaded processors share those, but also share instruction fetch and issue logic, and potentially other processor resources.

In a single multithreaded processor, the various threads compete for issue slots and other resources, which limits parallelism. Some "multithreaded" programming and architectural models assume that new threads are assigned to distinct processors, to execute fully in parallel.
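The distinction can be sketched in a few lines of Python (a generic illustration, not code for the platform itself): threads share one address space and must synchronize around shared state, while processes share nothing and communicate by shipping data back and forth.

```python
# Threads vs. processes: threads share memory (and need locks around shared
# state); processes get separate address spaces and exchange results by
# message passing. Workload and data are illustrative.
import threading
import multiprocessing

def count_words(chunk):
    return len(chunk.split())

chunks = ["one two three", "four five", "six"]

# Threads: results land in a shared list, guarded by a lock.
results = []
lock = threading.Lock()

def worker(c):
    n = count_words(c)
    with lock:                      # shared state needs synchronization
        results.append(n)

threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Processes: no shared memory; the pool ships chunks out and results back.
if __name__ == "__main__":
    with multiprocessing.Pool(2) as pool:
        proc_results = pool.map(count_words, chunks)
    assert sum(proc_results) == sum(results)  # same answer, different sharing model
```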

Cache Hierarchy. Recognizing the limitations of existing processors that rely on a shared cache model, a new approach calls for incorporating a three-tiered cache hierarchy into the MC Networking Platform. L1 cache is retained on the core.

As previously mentioned, the L2 cache is attached to the cores as a backside implementation that can significantly improve performance. Each core has its own backside L2 cache, which provides:

1) Aggregate bandwidth that could never be sustained by a single shared cache;

2) Latency improvements versus a front-side (shared) cache;

3) Per-core tuning of cache policies according to different working sets, for easier implementation of performance isolation, priority and QoS;

4) A private cache that is more self-contained (versus a single shared cache) and can serve as a natural unit for resource management (e.g., powering off to save energy).

However, there are some tasks for which a shared cache is desirable, such as inter-processor communication and operating on shared data structures. For those instances, we are also providing a multi-megabyte L3 cache. This high-bandwidth, shared cache maximizes hit-rates while providing fast memory access for input/output (I/O) and accelerator blocks.
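The payoff of such a hierarchy can be seen in a standard average-memory-access-time (AMAT) calculation. The hit rates and latencies below are illustrative assumptions, not published figures for the platform:

```python
# Average memory access time (AMAT) through a three-tier cache hierarchy.
# Hit rates and latencies are illustrative assumptions, not real figures.

def amat(levels, dram_latency):
    """levels: list of (hit_rate, latency_cycles), ordered from L1 outward."""
    time = 0.0
    reach = 1.0                     # fraction of accesses reaching this level
    for hit_rate, latency in levels:
        time += reach * latency     # every access that gets here pays this latency
        reach *= (1.0 - hit_rate)   # misses fall through to the next level
    return time + reach * dram_latency

hierarchy = [(0.90, 2),    # L1: on the core
             (0.80, 10),   # L2: private backside cache
             (0.60, 30)]   # L3: shared, multi-megabyte
print(amat(hierarchy, dram_latency=200))  # ~5.2 cycles per access
```

Under these assumptions only 0.8% of accesses pay the full DRAM penalty, which is the point of keeping the backside L2 close and the shared L3 large.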

On-Chip Fabric. The on-chip fabric works in concert with the caching hierarchy to enable cache-coherent and concurrent accesses. The innovative backside cache implementation combined with the fabric is designed to enable data replication, modified intervention and full hardware coherence tracking.

The MC Networking Platform will employ highly scalable and modular on-chip fabric, the result of multi-year research and development, which enables cache-coherent, concurrent, low-latency connectivity among cores.

Unlike a shared bus as the interconnecting medium among cores, memory and peripherals, the on-chip fabric helps to reduce the bus arbitration and contention issues that other multi-core architectures face as more traffic is introduced into the system. It behaves like a mesh, allowing concurrent traffic to enter and exit the system from any point within the fabric rather than through a single point.

Inherently scalable, the fabric is designed to sustain multiple, fully coherent transactions every cycle and to expand easily to accommodate more cores. The on-chip fabric also supports the option of heterogeneous clustering, allowing a full portfolio of MCs, spanning a wide range of power and performance design points, to be mixed and matched in a product with full coherency among the cores.

Connectivity. The MC Networking Platform integrates an extensive set of networking and I/O resources to support its high-throughput architecture. This section focuses on these resources, which provide system designers with a wide range of choices for scalable, high-performance systems.

{pagebreak}

SOC-MC/AE Networking Platform Interfaces and building blocks
The SOC-MC/AE Networking Platform supports multiple interfaces, including RGMII, XGMII and SPI-4.2 interface controllers. Additional high-speed interfaces include a PCI-X interface and serial RapidIO (sRIO) interfaces.

Peripherals Interface. Peripheral devices and ROMs are connected to the MC Networking Platform through the various ports of the Peripherals Interface. The ports are created with different combinations of a 32-bit Peripherals I/O Bus and the programmable General-Purpose Input/Output (GPIO) signals.

The MC Networking Platform includes essential standard buses, such as I2C bus ports, each consisting of two bidirectional bus lines: the serial data (SDA) line and the serial clock (SCL) line.

On-Demand Accelerator Engine (AE). On-demand acceleration provides Accelerator Engine (AE) technologies that take the MC networking architecture to a new level of performance and flexibility. An asynchronous, shared-resource architecture enables lower latency and multi-task handling without the overhead of thread switching.

On-demand application acceleration offers performance advantages over pure core processing cycles, enables lower-power implementations and reduces silicon area, thus reducing cost. On-demand, high-performance Accelerator Engine (AE) technologies include:

1) Pattern matching for deep packet inspection and full content processing
2) Decompression/Compression to unpack data for inspection and pack it for delivery
3) Crypto security for confidentiality, integrity and authentication
4) Table lookups for packet parsing and flow classification
5) Data path resource management to efficiently allocate on-chip resources
6) Packet distribution and queue management
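The offload pattern behind these engines can be sketched generically: the core enqueues a work descriptor and moves on, and the accelerator consumes the queue asynchronously. The class and operation below are hypothetical stand-ins for illustration, not the platform's actual API:

```python
# Sketch of the on-demand offload pattern: the core posts work to an
# accelerator queue and keeps its own cycles free. Names are hypothetical.
from queue import Queue

class AcceleratorEngine:
    """Stand-in for a hardware AE that consumes work descriptors from a queue."""
    def __init__(self, op):
        self.op = op                # the fixed function this engine performs
        self.queue = Queue()

    def submit(self, data):
        self.queue.put(data)        # core returns immediately after enqueue

    def drain(self):
        out = []
        while not self.queue.empty():
            out.append(self.op(self.queue.get()))
        return out

# A crypto-style AE modeled as a trivial transform (reverse = "encrypt").
crypto_ae = AcceleratorEngine(op=lambda pkt: pkt[::-1])
for pkt in ("abc", "defg"):
    crypto_ae.submit(pkt)           # offload; the core moves on
print(crypto_ae.drain())            # ['cba', 'gfed']
```

The asynchronous queue is what makes the acceleration "on demand": the core never blocks waiting for the engine, mirroring the shared-resource model described above.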

Hybrid Simulation Environment
The SOC-MC/AE networking platform will require a full-system simulation model: a hybrid that combines cycle-accurate modeling technology with functional modeling technology, enabling ease of software development, performance prediction and optimization of customer applications for the MC networking platform.

Using the hybrid simulation environment, which allows easy switching between functional and cycle accurate models, developers will be able to migrate and partition operating systems, middleware and applications onto the virtualized MC networking platform for development, debugging and benchmarking - even prior to silicon availability.

The environment also enables safe and easy experimentation with partitioning, parallelizing and optimizing systems and applications. Software developers can perform "what if" scenarios and tune the performance for specific situations without real-world hardware constraints. The hybrid simulator provides a programmer's view of the hardware, and features the following elements:

1) A fast, functional model for the MC networking platform
2) A detailed cycle-accurate model of the MC networking platform
3) A comprehensive package with infrastructure and tools for software development, code partitioning and debugging, profiling and visualization
4) Visibility into system state, both architectural and microarchitectural, including caches, registers and pipelines
5) Run-time control of software execution, including breakpointing, stepping and reverse execution
6) Ability to boot multiple operating systems

A major advantage of a hybrid simulator is its ability to dynamically switch back and forth from a high speed functional mode to a more detailed cycle-accurate mode.

This allows software developers to quickly boot an operating system and execute code at critical points and then switch to the more detailed cycle accurate mode to analyze specific areas of interest - no more waiting days for results.

As a development platform for multi-core systems, the hybrid simulation environment is designed to enable an extensive amount of flexibility and experimentation in a non-invasive environment - no instrumentation is needed in the operating system or application. Software developers are able to decrease bring-up time for the target system all while improving the overall quality of their code.
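A minimal sketch of the mode-switching idea follows; the class, mode names and relative costs are invented for illustration and do not describe any real simulator's API:

```python
# Toy model of a hybrid simulator that switches at run time between a fast
# functional mode and a slow, detailed cycle-accurate mode.

class HybridSimulator:
    # Relative simulation cost per instruction in each mode (illustrative).
    MODES = {"functional": 1, "cycle_accurate": 20}

    def __init__(self):
        self.mode = "functional"
        self.cost = 0               # accumulated simulation work

    def set_mode(self, mode):
        assert mode in self.MODES
        self.mode = mode            # dynamic switch, no restart needed

    def run(self, instructions):
        # Functional mode tracks architectural state cheaply; cycle-accurate
        # mode models every instruction in microarchitectural detail.
        self.cost += instructions * self.MODES[self.mode]

sim = HybridSimulator()
sim.run(1_000_000)                  # boot the OS fast, functionally
sim.set_mode("cycle_accurate")      # then zoom in on a hot region
sim.run(10_000)
print(sim.cost)                     # 1_200_000
```

Booting functionally and switching to cycle accuracy only for the region of interest is exactly the "no more waiting days for results" workflow described above.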

{pagebreak}

The MC/AE Enablement Ecosystem (EE)
MC/AE Networking platforms will require software engineers to spend significantly more time thinking about software architecture. Exploiting the performance potential of MC processors means embracing parallel processing, which can be a challenge given the long and successful history of single-core systems that are largely self-synchronizing.

Networking applications offer coarse grained parallelism in the form of packet processing, and the interactions between a networking data path and the control plane are sufficiently decoupled to create an additional level of parallelism.

While this immediate parallelism is easy to envision, things get interesting when the performance requirements of a data path flow exceed a single CPU's capabilities, or when a single core can't provide sufficient control plane responsiveness. Load balancing and mixed asymmetric/symmetric multi-processing environments on the same device are challenges that MC Networking Platform is designed to address.

While software architects are thinking about distribution of tasks, the processing densities offered by MC Networking Platform will cause hardware architects to think about consolidation and re-partitioning of functions that have been distributed across discrete CPUs or modules.

These decisions will interact strongly with the introduction of new services and capabilities in the system. For both software and hardware architectures, there is a need for a great deal of flexibility in a multi-core processor and for good mechanisms to help facilitate experimentation with future architectures.

The SoC-MC/AE networking platform implements multiple cores, each with its own private L2 cache, also known as backside cache. In addition, the platform is equipped with on-demand accelerator engines that can be application-specific.

While the Multi-core Platform is designed with aggressive performance targets, ease of use has also figured prominently in the platform definition. One of the significant obstacles in multi-core implementations today is programming efficiency and debugging. The two most likely scenarios (shown in Figure 3 below) are:

Scenario 1: The number of cores and system performance are both normalized to a 1-core device in 2007.

In this scenario, the 45nm system delivers 3.6x the performance of the 65nm baseline, using 3.7 cores against 1 core at 65nm. Similarly, at 32nm, system performance is 13.5x with 7.5 cores, compared to 1 core at 65nm. The graph shows that performance scales roughly linearly with core count.

Scenario 2: The number of cores and system performance are both normalized to a 4-core device in 2007.

In this scenario, the 45nm system delivers 14.7x the performance of the 65nm baseline, using 10.9 cores against 4 cores at 65nm. Similarly, at 32nm, system performance is 54x with 30 cores, compared to 4 cores at 65nm. The graph in Figure 3 below shows that performance scales roughly linearly with core count.
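The figures quoted in the two scenarios can be reduced to implied per-core speedups with a few lines. The performance and core counts are taken directly from the text; the helper function is ours:

```python
# Per-core speedup implied by the two scenarios in the text.
# (performance multiple, core count) at each node, versus the 2007 baseline.
scenario1 = {"45nm": (3.6, 3.7), "32nm": (13.5, 7.5)}   # baseline: 1 core
scenario2 = {"45nm": (14.7, 10.9), "32nm": (54.0, 30)}  # baseline: 4 cores

def perf_per_core(perf, cores, baseline_cores=1):
    """System speedup divided by the core-count ratio against the baseline."""
    return perf / (cores / baseline_cores)

for node, (perf, cores) in scenario1.items():
    print("scenario 1", node, round(perf_per_core(perf, cores), 2))
for node, (perf, cores) in scenario2.items():
    print("scenario 2", node, round(perf_per_core(perf, cores, baseline_cores=4), 2))
```

For example, scenario 1 at 32nm implies 13.5 / 7.5 = 1.8x per normalized core, a quick sanity check on the linear-scaling claim in the graph.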

Figure 3. Two likely scenarios for the evolution of multicore SoCs as process technologies move from 65 down to 32 nanometer and below geometries.

The SOC-MC/AE Platform Value Proposition
Tomorrow's networking needs can no longer be met by increasing the operating frequencies of single-core architectures. Adding cores (MCs) will improve performance (geometric scaling).

But thermal management challenges, in the embedded space, are overwhelming the performance improvements achievable by increasing CPU frequency. Hence the need to look at the challenge from the SOC Platform perspective.

There may be contention for bus bandwidth and memories, scalability problems and, perhaps even worse, unused processing cycles due to a lack of programming visibility.

Adding Accelerator Engines (AEs) will continue to add incremental improvement to performance (equivalent scaling) in the context of the SOC-MC/AE networking platform. But leveraging the hardware requires greater investment in software enablement and a simulation environment (functional diversification).

Thus, SOC-MC/AE Networking Platform is not only designed to provide superior performance and energy efficiency, but also to help make the transition to multi-core processors as quick and as painless as possible with an industry leading enablement ecosystem.

Thus, the Multi-Core (MC), Accelerator Engine (AE) and Simulation/Enablement/Ecosystem (SEE) are the three ingredients that will change the landscape of networking and deliver scalable and sustainable performance to meet next-generation advanced applications and services.

Fawzi Behmann is the Chair of the Marketing Committee of Power.org, the open community driving collaborative innovation around Power Architecture technology, where he works with member companies to advance the roadmap of the Power architecture and its ecosystem. Fawzi also chairs the networking System Drivers Working Group at ITRS (International Technology Roadmap for Semiconductors) and is presently contributing to the definition of a networking platform that will address a new class of advanced applications and services over the coming 10-15 years. He has also been the Director of Strategic Marketing for the Networking Systems Division within Freescale's Networking and Multimedia Group.

References
1. ITU-R M.1645, "Future Development of IMT-2000 and IMT-Advanced," WG Spectrum, Document 8F revised draft, July 2007
2. IEEE Communications, "Web Services in Telecommunications" and "Orchestration in Web Services and Real-Time Communications," July 2007, pp. 26-27, 44-50
3. IEEE Wireless Communications, "New Generation Heterogeneous Mobile Networks," April 2007, pp. 2-3
4. IEEE Wireless Communications, "The Multiple Access Scheme for Wireless Communication," June 2007, pp. 2-3
5. IEEE Wireless Communications, "Next Generation CDMA vs OFDMA for 4G Wireless Applications," June 2007, pp. 6-7
6. IEEE Wireless Communications, "IFDMA: A Scheme Combining the Advantages of OFDMA and CDMA," June 2007, pp. 9-17
7. Communications News, Enterprise Network Solutions, "Are You Ready for Converged IP?," July 2007, pp. 40-41
8. Semiconductor International, "Semicon West 2007," June 2007, p. 20
9. Mobile Enterprise: Connecting Enterprise Solutions to Business Strategy, "Bettering Behavior, Mobile Tools," July 2007, pp. 8, 19-25
10. EE Times, "Freescale CEO: IC Growth Drivers Shifting," July 2, 2007, p. 8
11. IEEE Micro, "Hot Chips 18," March-April 2007, pp. 7-9; "The AMD Opteron Northbridge Architecture," pp. 10-21; "The Blackford Northbridge Chipset for the Intel 5000," pp. 22-33; "ARM996HS: The First Licensable, Clockless 32-bit Processor Core," pp. 58-68
12. Power Architecture / Cell BE, "Cell Microprocessor," Wikipedia
13. IEEE Computer Society, "Synergistic Processing in Cell's Multi-core Architecture," 2006, pp. 10-24
14. ACM, "Evolution of Low Power Electronics and Its Future Applications," 2003, pp. 2-5
15. IEEE Computer Society, "CMOS Scaling for Sub-90 nm to Sub-10 nm," 2004, pp. 1-6
16. IEEE Journal of Solid-State Circuits, "CMOS Technology: Year 2010 and Beyond," 1999, pp. 357-366
17. IEEE Proceedings of the 8th IPFA, "Direction of Silicon Technology from Past to Future," 2001, pp. 1-35
18. ITRS 2005 publication: introduction of the "More than Moore" concept
19. ITRS 2007 Summer Working Group Workshop/Public Conference: work in progress on "More than Moore"
20. Semiconductor International, July 18 ITRS Summer Conference, panel focus on "More Than Moore," by Peter Singer, Editor-in-Chief
21. ITRS 2007 System Drivers publications, Networking Driver: SoC Multicore/Accelerators Platform, pp. 3-5

To learn more about designing embedded systems designs using multicore technology, check out a number of classes and presentations on this topic at the Embedded Systems Conference Silicon Valley 2009.
