Today’s SOC (system-on-chip) processors integrate a diversity of cores, accelerators, and other processing elements. These heterogeneous multicore architectures provide increased computational capacity, but the resulting complexity also poses new challenges for embedded-system developers across a variety of applications, including control-plane processors, video servers, wireless base stations, and broadband gateways.
Discrete cores each have full access to and control of their resources. Such predictable access allows straightforward management and deterministic performance in applications with real-time constraints. In a multicore architecture, however, cores share access to resources, and potential contention complicates many design factors, such as processing latency and deterministic interrupt handling.
To provide deterministic behavior equivalent to that of single-core devices, multicore architectures have begun to implement resource-sharing and management techniques that have been proved in network communications. These architectures use established queue- and traffic-management techniques to efficiently allocate resources among multiple cores, maximize throughput, minimize response latency, and avoid unnecessary congestion.
From an architectural standpoint, SOCs are complex systems with multiple cores that connect across a high-speed fabric to a variety of controllers and resources (Figure 1). In many ways, the myriad interactions within an SOC resemble a communications network with multiple sources, or cores, that interconnect to the same destinations, including memory, peripherals, and buses. Not surprisingly, bandwidth-management techniques, such as virtualization, which designers developed to improve network efficiency, have proved useful in managing traffic among multiple processor cores and shared peripherals.
Virtualization of on-chip resources enables cores to share access; this shared access is transparent to applications. Each application can treat a resource as if it were the sole owner, and a virtualization manager aggregates shared ownership, measured by the amount of allocated bandwidth. Virtualizing and sharing access to resources require both a queue manager and a traffic manager. Applications use one or more queues to buffer access to a resource. Virtualization adds events or transactions to the queue and pulls them off when the resource is available. Queues comprise a list of buffer descriptors pointing to data in a buffer, and you can implement queues in many ways, depending on the needs of the applications. The number of supported queues varies in an SOC from a few hundred to hundreds of thousands to meet the needs of various applications.
The queue manager updates the queue state (that is, the queue size, head pointer, tail pointer, and start address) and maintains fill levels and thresholds, including full, almost full, almost empty, and empty. The queue manager also provides full memory management for each queue, including allocation and deallocation of buffers from free pools and checking of access rights when an event is added to a queue (Figure 2). Multiple requesters may simultaneously add descriptors to one or more queues, and the queue manager also supports selection among multiple queues waiting for service.
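The queue state described above can be sketched in software as a ring of buffer descriptors. This is a minimal illustration only, not the hardware design: the class name `DescriptorQueue`, its method names, and the threshold parameters are assumptions made for this example.

```python
class DescriptorQueue:
    """Hypothetical ring of buffer descriptors with fill-level tracking."""

    def __init__(self, capacity, almost_full, almost_empty):
        self.capacity = capacity          # queue size, in descriptors
        self.almost_full = almost_full    # fill level at or above this is "almost full"
        self.almost_empty = almost_empty  # fill level at or below this is "almost empty"
        self.ring = [None] * capacity     # descriptor storage (the start address in hardware)
        self.head = 0                     # index of the next descriptor to dequeue
        self.tail = 0                     # index of the next free slot
        self.count = 0                    # current fill level

    def enqueue(self, descriptor):
        """Add a descriptor; return False if the queue is full."""
        if self.count == self.capacity:
            return False
        self.ring[self.tail] = descriptor
        self.tail = (self.tail + 1) % self.capacity
        self.count += 1
        return True

    def dequeue(self):
        """Remove and return the oldest descriptor, or None if empty."""
        if self.count == 0:
            return None
        descriptor = self.ring[self.head]
        self.ring[self.head] = None
        self.head = (self.head + 1) % self.capacity
        self.count -= 1
        return descriptor

    def level(self):
        """Report the fill level against the configured thresholds."""
        if self.count == 0:
            return "empty"
        if self.count == self.capacity:
            return "full"
        if self.count >= self.almost_full:
            return "almost full"
        if self.count <= self.almost_empty:
            return "almost empty"
        return "normal"
```

A traffic source could poll `level()` (or, in hardware, receive a threshold event) and slow down when the queue reports "almost full".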
The manager serves as the arbitrator for available bandwidth among queues assigned to the same resource. It performs this task not only between applications sharing a resource but also among the multiple queues an application may have to enable QOS (quality of service).
Traffic management employs policing and shaping mechanisms to measure and control the amount of bandwidth assigned to a flow or a group of flows. Policing controls the rate at which the traffic manager adds events to a queue, and shaping controls the rate at which the traffic manager removes events from the queue. For the most control and ability to manage queue priority, you must implement policing and shaping on a per-queue basis. The traffic manager also maps multiple queues to a single shared resource based on a predefined servicing algorithm.
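One common way to express per-queue policing and shaping is a token bucket; the article does not say which mechanism a given SOC uses, so the following is a sketch under that assumption. The names `TokenBucket` and `ManagedQueue` are hypothetical, and time is passed in explicitly so the behavior is easy to follow.

```python
from collections import deque


class TokenBucket:
    """Hypothetical rate limiter: `rate` tokens per second, up to `burst` tokens."""

    def __init__(self, rate, burst):
        self.rate = rate
        self.burst = burst
        self.tokens = burst   # start with a full bucket
        self.last = 0.0       # time of the previous update, in seconds

    def allow(self, now, cost=1):
        """Refill for the elapsed time, then try to spend `cost` tokens."""
        elapsed = now - self.last
        self.last = now
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class ManagedQueue:
    """Queue with a policer on the way in and a shaper on the way out."""

    def __init__(self, policer, shaper):
        self.events = deque()
        self.policer = policer
        self.shaper = shaper

    def offer(self, event, now):
        """Policing: reject events that exceed the admission rate."""
        if not self.policer.allow(now):
            return False
        self.events.append(event)
        return True

    def service(self, now):
        """Shaping: release an event only when the drain rate permits."""
        if self.events and self.shaper.allow(now):
            return self.events.popleft()
        return None
```

Giving each queue its own policer and shaper instance corresponds to the per-queue control the text calls for.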
By bringing queue and traffic management together, you can provide reliable, end-to-end QOS. This approach allows multiple paths to share a resource without negatively affecting bandwidth subscriptions. Fine-grained QOS supports SLAs (service-level agreements), guaranteeing minimum, average, and maximum bandwidth on a per-flow basis. Developers can implement queue levels for marking and metering traffic to prevent congestion. Early notification of congestion allows the queue manager to take corrective action through feedback to traffic sources to eliminate the unnecessary processing of packets that are likely to be dropped or, ideally, to avoid congestion altogether.
For example, a queue- and traffic-management-based Ethernet driver prevents any one processor from unfairly monopolizing port bandwidth. It also guarantees bandwidth allocations and maximum-latency constraints regardless of other queue states. The driver supports a choice of arbitration schemes (strict priority or weighted round robin, for example) and facilitates reliable real-time services, such as video streaming. In the end, multiple sources can share the Ethernet port without adversely affecting bandwidth subscriptions. Tasks such as IP (Internet Protocol) forwarding become straightforward to implement robustly, and latency-sensitive applications, such as audio or video delivery, benefit from deterministic and reliable port management. In addition, when you implement the queue and traffic management in hardware, the driver can maintain end-to-end QOS with little to no software overhead.
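The two arbitration schemes named above can be illustrated in a few lines. This is a simplified software sketch, not the driver's actual logic; the function names, the per-round weights, and the sample queue contents are assumptions for illustration.

```python
from collections import deque


def strict_priority(queues):
    """Serve the first non-empty queue; `queues` is ordered highest priority first."""
    for q in queues:
        if q:
            return q.popleft()
    return None


def weighted_round_robin(queues, weights):
    """Yield events round by round; queue i may send up to weights[i] events per round."""
    while any(queues):
        for q, weight in zip(queues, weights):
            for _ in range(weight):
                if not q:
                    break
                yield q.popleft()


# Hypothetical example: a video queue weighted 3:1 over a bulk-data queue.
video = deque(["v1", "v2", "v3", "v4"])
data = deque(["d1", "d2"])
order = list(weighted_round_robin([video, data], [3, 1]))
```

Strict priority can starve the lower-priority queues whenever the top queue stays busy, whereas weighted round robin gives every queue a share of each round in proportion to its weight.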
The virtualization layer
Early multicore SOCs, like the original network processors, left all of the work of virtualizing resources to developers. Applications, to some degree, had to recognize that they might share a resource with other applications. When an application used a shared resource, it had to do so in a way that allowed coexistence with other applications. The operating system also needed to support
In a traditional architecture, processors manage their own access to shared resources through a software layer (Figure 3a). Processors must be aware of what resources are available and how often they can use them. As the number of processors increases, so does the complexity of resource sharing. One downside of software-based virtualization is that it introduces overhead to every transaction to store and later retrieve packets. Such overhead consumes processor cycles and introduces complexity to the coding process. It also places the burden of bandwidth management and meeting subscription guarantees on the virtualization software. Even when using tools to automate the creation of virtualization code, developers still must troubleshoot application interactions as they pass through the virtualization code.
The added overhead and complexity of virtualization have limited the use of multicore SOCs. Queue and traffic management, however, is a fairly deterministic process that you can implement in hardware. Developers configure queues once for an application, and the hardware mechanisms can then completely offload queue management, thus restoring substantial computational cycles back to the application processors. The ability to dynamically change allocations allows modification of the overall configuration at runtime to accommodate changing task loads.
In an architecture using a hardware-based queuing and synchronization mechanism, each processor operates independently of the others (Figure 3b). Through virtualization of resources, sharing becomes transparent to the applications. The mechanism allocates each processor and each task resource bandwidth, and each processor and task operates as if it were the only controller of the resource. Although the gains from implementing queue and traffic management vary from application to application, hardware-based resource virtualization and sharing significantly improve system efficiency.
A hardware-based virtualization layer removes or accelerates the software-virtualization layer. Offloading virtualization substantially increases processor efficiency. In some cases, hardware-based virtualization removes the need for software-based virtualization, other than during initial configuration. In other cases, hardware-based queue and traffic management significantly accelerates virtualization software in the datapath.
A hardware-based virtualization layer also lowers design complexity and speeds time to market because it eliminates the need for developers to implement and design around the virtualization layer. This hardware-based layer also increases determinism. Eliminating virtualization overhead removes a major source of system interrupts, which in turn reduces processing latency and increases system responsiveness.
Another benefit of this approach is that it simplifies debugging. Because virtualization and resource sharing are hardware functions, the virtualization layer itself is not part of the development process. However, developers still have full access to and control of queues if necessary for troubleshooting. A hardware-based virtualization layer also increases reliability because hardware-implemented queue and traffic management is not vulnerable to many of the issues that can arise with a software-based implementation. For example, if the core-handling software-based virtualization becomes compromised, the entire system is vulnerable. With a hardware-based implementation, there is no centralized control routine to compromise.
The level of supported queue offloading depends on the implementation. For example, some SOCs might provide locking mechanisms but not perform all state management of queues. Ideally, developers want a flexible system that supports different configurations, is straightforward to integrate with software, and minimizes the software changes necessary to adopt the SOC. A virtualization mechanism may be efficient; if it requires significant deviation from traditional programming models, however, porting application code will increase system cost and delay time to market.
How you implement queues can also affect system performance. For example, queue location affects which processors can access those queues. Some queues must reside in specific memory types, be spread across multiple chips, or be tied to a resource. Dynamically allocated queues give developers the flexibility to appropriately partition queues to applications and resources. For systems using multiple multicore SOCs, the ability to manage queues over a system bus, such as PCIe (Peripheral Component Interconnect Express), enables sharing of resources not just between cores on the same SOC but also between those on different SOCs.
For example, a cluster of processors can share a single forwarding database. Alternatively, a multiple-SOC system may have a single deep-packet-inspection engine that applications running on different SOCs must access. Such multichip sharing of resources allows even further virtualization of system resources.
One of the greatest design challenges in multichip architectures is partitioning tasks in a way that equally spreads resource requirements among all processors. In software-based virtualization, this process can be time-consuming and places a burden on designers, including the challenge of efficiently managing free memory pools. In addition, any change in software can result in a shift in resource requirements, requiring developers to repartition the system. Many of these issues apply to both asymmetrical and symmetrical multiprocessor architectures.
With hardware-based virtualization, most partitioning management takes place in hardware, and the operating system handles a small remainder. With this abstracted partitioning, developers can make system changes without manually repartitioning the system. This approach also offloads tasks, such as managing free memory pools, from the application and operating system.
Control of a resource also extends to limiting the maximum allocation a processor can receive to address potential processing bottlenecks on the receiver side. For example, many communications, audio/video, data-acquisition, and test-and-measurement applications have a maximum transmission data rate that the receiving processor is expecting or can handle. In these cases, even if there is more capacity available on the peripheral because other processors are not currently using their allocations, the application may not want the queue flushed at a faster rate because this flushing may overwhelm the receiving processor and result in loss of data.
Many developers take a worst-case approach to design; they make sure there is enough capacity to support worst-case loading. This approach means, however, that, under typical operating conditions, there will be underused resource capacity. A typical round-robin arbitration algorithm, for example, supports only minimum allocations. If the system can have as many as 10 requesters for a resource, each can expect to always have at least 10% of the bandwidth. However, if only one requester is active, that requester could receive 100% of the bandwidth.
Virtual and transparent resource allocation means that an application does not know how much bandwidth it might receive. For applications with receiver-side bottlenecks, the ability to set a maximum allocation for a resource is important for the stability of the system. This maximum allows developers, no matter what allocation algorithm is in use, to control resource bandwidth per application, to prevent swamping the receiver-side processor, and to prevent data loss. Developers also have the option of implementing standard mechanisms, such as IEEE 802.1Qav or 802.1Qau, to manage congestion.
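The effect of a maximum allocation can be seen in a small worked sketch. It assumes an equal-share policy among active requesters with unused bandwidth redistributed to those still below their caps; the function name `allocate`, the requester names, and the percentages are illustrative assumptions, not taken from any particular SOC.

```python
def allocate(capacity, active, max_rate):
    """Split `capacity` equally among `active` requesters without exceeding
    any requester's cap in `max_rate`; requesters without an entry are uncapped."""
    alloc = {r: 0.0 for r in active}
    remaining = float(capacity)
    unsatisfied = set(active)
    while unsatisfied and remaining > 1e-9:
        share = remaining / len(unsatisfied)
        # Requesters whose cap would be reached by an equal share.
        capped = {r for r in unsatisfied
                  if max_rate.get(r, float("inf")) - alloc[r] <= share}
        if not capped:
            # Nobody hits a cap: hand out the equal share and finish.
            for r in unsatisfied:
                alloc[r] += share
            break
        for r in capped:
            remaining -= max_rate[r] - alloc[r]
            alloc[r] = float(max_rate[r])
        unsatisfied -= capped
    return alloc


# Ten active requesters each receive the 10% minimum from the text.
ten = allocate(100, [f"app{i}" for i in range(10)], {})
# A single active requester would receive 100% ...
alone = allocate(100, ["app0"], {})
# ... unless a 40% maximum protects its receiver.
alone_capped = allocate(100, ["app0"], {"app0": 40})
```

With the cap in place, the remaining 60% simply stays unused when no other requester is active, which is the intended trade-off for receiver-side stability.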
An application may sometimes attempt to use a resource to which it does not have access. This situation can occur because of an error in programming, when an only partially updated application is in use, or when an overwriting of code or data memory has occurred. You must prevent such an application from corrupting other applications (that is, by writing in their memory space) or negatively affecting their performance (for example, by seizing control of a shared resource). In software-based resource-sharing implementations, a corrupted application may ignore its bandwidth allocation and monopolize a shared resource. Similarly, if the processor hosting virtualization becomes corrupt, queuing mechanisms may fail and bring down the entire system.
Hardware-based queue management allows you to protect the various components of the system from each other. The most basic form of fault isolation is preventing access to memory and resource bandwidth allocated to other applications. To keep sharing of virtualized resources completely transparent to applications, the queue and traffic manager must take action only on the corrupted application. In other words, you must shield applications from both the actions of other applications and the need to accommodate the failure of another application to maintain stability. Dedicated queues, by their nature, isolate faults and shield other processors and applications from their effects. Such queues also facilitate effective error recovery; a dedicated queue can be completely cleared with no loss of data for other applications.
A queue- and traffic-management controller can implement several levels of response to resource-access violations. The simplest response is to prevent the access and generate an alarm to the application, typically through an interrupt. This alarm tells the application that it has tried to do something it shouldn’t have. A second method logs the violation for use by developers to help troubleshoot issues in the field. The queue- and traffic-management controller also must be able to escalate its response by triggering a reset and the reinitialization of a potentially corrupted application. Ideally, a developer can create a policy that controls this response. For example, a developer could set a threshold dictating that, if an application makes three illegal accesses, it is assumed to be a corrupted application and must restart.
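The escalation policy in the example above might look like the following sketch. The class name `AccessPolicy`, the permission table, and the returned action strings are hypothetical; a real controller would raise an interrupt or trigger a reset rather than return a string.

```python
class AccessPolicy:
    """Hypothetical violation policy: block, log, alarm, then reset at a threshold."""

    def __init__(self, permissions, reset_threshold=3):
        self.permissions = permissions        # application name -> set of allowed queue IDs
        self.reset_threshold = reset_threshold
        self.violations = {}                  # application name -> illegal-access count
        self.log = []                         # (application, queue ID) records for field debugging

    def check(self, app, queue_id):
        """Return "allow", "alarm", or "reset" for an attempted queue access."""
        if queue_id in self.permissions.get(app, set()):
            return "allow"
        # The access is blocked; record it for later troubleshooting.
        count = self.violations.get(app, 0) + 1
        self.violations[app] = count
        self.log.append((app, queue_id))
        if count >= self.reset_threshold:
            # Assume the application is corrupted and must be reinitialized.
            self.violations[app] = 0
            return "reset"
        return "alarm"


policy = AccessPolicy({"video": {0, 1}, "data": {2}})
results = [policy.check("data", 0) for _ in range(3)]
```

Because the counters are kept per application, a misbehaving application escalates toward a reset without affecting the standing of any other application.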
When a series of transactions must take place in order, this requirement becomes a blocking instance because other requesters must wait until the transaction is complete before they can gain control of the resource. Consider a typical SATA (serial-advanced-technology-attachment) transaction in which you first configure the SATA port and it then executes a sequence of commands. Unlike an Ethernet port, in which packets are single events, the SATA port must lock to an application until the transaction is complete; otherwise, two applications may overwrite each other before either has completed its task.
Although applications cannot fully share resources supporting transactions of this nature, allocation can be partially virtualized. Applications wanting to use the resource first must make sure that the port is available and then lock the port while it is in use. Support of locking requires a thin software layer between operating systems to enable them to communicate to see which application has control of the lock. The use of hardware, however, can manage and accelerate acquisition of the lock. You must implement lock acquisition in hardware to provide a failsafe mechanism for the resource; otherwise, a locked processor could also lock the resource.
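The article does not describe how the failsafe works. One plausible approach, shown here purely as an assumption, is a lock with a lease: if the holder hangs and never releases the port, the lock can be reclaimed once the lease expires. The class name `ResourceLock`, the lease length, and the explicit `now` argument are all illustrative.

```python
class ResourceLock:
    """Hypothetical lock with a lease so a hung holder cannot block the resource forever."""

    def __init__(self, lease):
        self.lease = lease    # maximum hold time, in seconds
        self.owner = None     # application currently holding the lock
        self.expires = 0.0    # time at which the current hold lapses

    def acquire(self, app, now):
        """Take the lock if it is free or the holder has overrun its lease."""
        if self.owner is not None and now >= self.expires:
            self.owner = None  # failsafe: reclaim from a stalled holder
        if self.owner is None:
            self.owner = app
            self.expires = now + self.lease
            return True
        return False

    def release(self, app):
        """Release the lock; only the current owner may do so."""
        if self.owner == app:
            self.owner = None
            return True
        return False


port_lock = ResourceLock(lease=5.0)
first = port_lock.acquire("app_a", now=0.0)    # app_a locks the port
second = port_lock.acquire("app_b", now=1.0)   # app_b must wait
```

A real implementation would also need to abort or clean up the stalled transaction before handing the port to the next application; that step is outside the scope of this sketch.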
Depending on the application, a system must support shared resources that can be completely virtualized and those that require locking. An SOC could, for example, provide a SATA port without any hardware support for sharing, but then only one processor could use it, and any sharing of the resource would have to be implemented in software. By also supporting lockable resources, cores within the SOC can still share the resource in a manner transparent to all applications with failsafe reliability.
Ease of integration is an important aspect of multicore architectures. Bringing multiple processors onto a single chip requires that application software be straightforward to migrate; otherwise, developers might as well design a new system.
You must consider a number of factors in determining ease of migration to a virtualized architecture. For example, the architecture must support multiple operating systems because multiprocessors often use multiple operating systems across cores, depending on the applications that require support. Multicore architectures that support only one operating system force developers to use that operating system and then port all code to it. By supporting multiple operating systems, a multicore SOC simplifies code migration.
You also need to consider QOS because applications have different bandwidth needs. Latency-sensitive applications, such as video streaming, need real-time access to shared resources, whereas data-based applications, such as content downloading, can tolerate delay and take advantage of underused bandwidth. The ability to service different bandwidth requirements enables developers to bring together divergent applications under the same processor cluster.
Also consider whether the architecture includes transparent resource sharing because transparency allows developers to migrate both applications that support virtualization and applications that don’t support it. Another aspect is removal of the software-virtualization layer. Although some code rewriting is necessary when migrating between SOCs, for many applications, most changes when moving to an SOC with hardware-based resource sharing involve not changing the software but rather eliminating the software-virtualization layer. Removal of this layer simplifies system design and troubleshooting and increases system efficiency. In cases in which manufacturers have licensed virtualization code, removing this layer also reduces system cost.
Another factor to consider is whether the architecture consolidates system resources. When a system employs multiple chips and operating systems, each must have its own storage resources. By managing resources in hardware, all tasks can share access to a resource they need. For example, rather than requiring multiple drives, as a traditional architecture does, a single hard drive or Ethernet port can serve an entire system, even across SOCs. This approach results in equipment savings and lower overall system-component count.
Communication among SOCs is also important. Applications can share resources over a high-bandwidth system bus, such as PCIe, to enable sharing among SOCs. Traditional architectures allocate a fixed bandwidth to each processor in a manner that limits effective QOS management among cores and makes it difficult to oversubscribe reliably (Figure 4a). With hardware-based virtualization, resource allocation is flexible, even among SOCs (Figure 4b). Full bandwidth management is possible, with rate policing, traffic shaping, and queue arbitration that reflect and balance the needs of each processor and application. Together, these mechanisms enable the efficient sharing of resources, such as a hard drive or a security engine, across the entire system, not just one processor.
You should also consider interprocessor communications because multiprocessor systems often must transfer significant amounts of data among processors. Queue-management mechanisms provide a simple and efficient means of accelerating communication among processors on an SOC and across multiple SOCs.
Today’s next-generation SOCs are complex multiprocessor environments that must share on-chip resources without incurring additional overhead that reduces system efficiency. Queue management helps virtualize chip resources and simplifies resource sharing by providing a reliable mechanism for allocating bandwidth, isolating faults, and facilitating robust error recovery. Traffic management ensures that a system fairly shares resources in a manner that meets the differing latency and throughput needs of applications by controlling the rate at which traffic enters and leaves queues. With hardware-implemented virtualization, developers can offload queue and traffic management to improve application efficiency, maximize resource throughput, reduce latency, and increase system reliability.
About the author:
Satish Sathe is with AppliedMicro
This article first appeared in EDN, January 2011