Challenges of safety-critical multi-core systems

by Chris Ault, Wind River, TechOnline India - April 27, 2011

This paper explores the benefits of virtualization for safety-critical systems, examines some of the associated challenges, and describes how to mitigate the risks they introduce.

Many embedded systems are realizing the benefits of multi-core CPUs. These benefits include the ability to consolidate multiple distinct hardware boards on a single CPU, the ability to deliver more performance per watt, and the ability to quickly migrate existing designs to new processors and then use the additional compute power to implement new functionality.

These benefits are also very enticing to projects building embedded systems specifically for the safety-critical market. However, these systems face their own challenges with regard to safety certification. Ideally, safety-critical systems would reap the same benefits (consolidation, performance, migration) while keeping certification costs as low as possible.

One particularly attractive scenario for safety-critical systems is to combine a certified subsystem, such as a robot-control application, with a non-certified subsystem, perhaps a Linux- or Windows-based human-machine interface. The challenge in this scenario is certification of the complete product.

The challenges of multi-core CPUs include interrupt handling, bus contention, and increased coding and debugging complexity; in addition, there are hardware devices on the CPU that cannot be shared between safety-certified and general-purpose applications.

These challenges can be mitigated, and the benefits of multi-core realized, by partitioning the devices and presenting specific devices to specific cores and applications. Complicated custom software can be written to perform this partitioning and isolation, but embedded virtualization offers a configurable means by which devices can be partitioned and presented to specific cores, operating systems, and applications.

Code footprint directly impacts certification costs. Choosing an embedded virtualization solution with a minimal code footprint will minimize recertification costs and maintain the real-time responsiveness of a device. Choosing a safety-certified virtualization solution will ensure that the complete application stack can be safety certified.


Migration to Multi-core

Safety-critical systems are embedded systems in which errors or failures could cause injury or loss of human life, loss of or severe damage to equipment, or environmental harm.

Systems such as flight control, automotive drive-by-wire, or nuclear reactor management are examples. There is no room for software error in these systems. To ensure the utmost in reliable, bug-free operation, these systems must be scrutinized to various levels of industry-standard certification, depending on the nature of the device.

Safety-related components require temporal and spatial separation from other system components of different levels of criticality. Today's separation concepts are mostly designed to use completely independent subsystems for each function. There could be, for example, a single board computer (SBC) for the safety-related aspect and a separate SBC for the human-machine interaction.

This approach is not hardware efficient; it increases cost and limits product functionality and evolution. The introduction of multi-core CPUs in embedded devices offers unique opportunities for safety-critical equipment; however, there are many challenges that need to be resolved.

Multi-core processors allow devices to be partitioned so that specific functions can be performed on dedicated cores. This provides performance isolation while offering the ability to segregate functions. Functions that can be segregated include the separation, or isolation, of the safety-related functions from the general-purpose functions of a device, such as standards-based communication stacks or rich graphics for human-machine interfaces.

This segregation means the amount of code that needs to be certified is significantly reduced, which lowers product costs while improving time-to-market. Another opportunity that arises from this segregation is the ability to update or enhance the general-purpose partitions without modifying the safety-critical applications.

The certification of a two-SBC system consisting of a small embedded RTOS with a safety function alongside a Linux system with an HMI requires certification of only the SBC with the RTOS and the safety function.
The Linux system is out of scope for certification. Consolidating this into a single SBC running Linux would expand the certification scope to include the entire Linux system, which is not commercially feasible at high certification levels for standards such as IEC 61508 or DO-178B.

What is really needed is a way to combine the small RTOS that hosts the safety function and consolidate it with the Linux system without raising the certification cost significantly.

 

                             

Figure 1: Segregation of safety-critical and general-purpose functionality

Crafting a device that utilizes such segregation allows it to have mixed levels of safety and certification: Only a subset of the device needs to be safety-certified.

 

Challenges for Safety-Critical Applications

Multi-core CPUs and systems offer the attractive benefit of increased computational power at lower power consumption; however, it can be hard to utilize this added performance. In the past, programs did not have to change with each new generation of CPU: application performance increased simply as CPU clock rates increased.

Getting higher performance from additional processor cores requires multiple threads running in parallel to utilize the increased core count. In many embedded applications, the processor runs a single thread most of the time. Additional cores sit idle while that thread keeps the first core busy. The added compute power is underutilized, or available for other purposes.
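As a minimal sketch (using POSIX threads; the sample-filtering workload is invented purely for illustration), the difference is simply whether independent work is split across enough threads to occupy the additional cores:

```c
/* Minimal sketch: a single thread leaves extra cores idle, while one
 * worker per core keeps them busy. The "filtering" workload below is
 * hypothetical and stands in for any independent, parallelizable task.
 * Build with: cc -pthread demo.c */
#include <pthread.h>
#include <stdio.h>

#define NUM_WORKERS 4
#define SAMPLES_PER_WORKER 1000000

/* Each worker processes its own block of samples, so the threads can
 * run in parallel on separate cores with no shared state to contend for. */
static void *filter_samples(void *arg)
{
    int id = *(int *)arg;
    double acc = 0.0;
    for (int i = 0; i < SAMPLES_PER_WORKER; i++)
        acc += (double)(i % 37) * 0.5;      /* stand-in for real filtering */
    printf("worker %d done (checksum %.1f)\n", id, acc);
    return NULL;
}

int main(void)
{
    pthread_t workers[NUM_WORKERS];
    int ids[NUM_WORKERS];

    /* One thread per core keeps the additional cores busy; a single
     * thread doing all the work would leave them idle. */
    for (int i = 0; i < NUM_WORKERS; i++) {
        ids[i] = i;
        pthread_create(&workers[i], NULL, filter_samples, &ids[i]);
    }
    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_join(workers[i], NULL);
    return 0;
}
```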

In a single-core CPU there is only one operating system, and it owns and controls all hardware devices it detects at startup and during run-time. One way to put the underutilized cores of a multi-core processor to work is to add operating systems that take on additional device functionality.

This can be done by installing an additional general-purpose operating system to tackle tasks that are not safety-related, such as implementing standards-based communication protocols or enhancing the human-machine interface with an operating system that provides rich graphics libraries.

Because all operating systems attempt to have full control of all devices that they can detect on the hardware, a challenge arises when multiple disparate operating systems are running on the same multi-core CPU, competing for access to shared resources such as interrupt controllers, timers, I/O devices, memory ranges, and so on.
When multiple operating systems execute on a multi-core processor, they typically need to know a great deal about each other, for example by operating in a master/slave relationship. This situation can be simplified by inserting a supervisor or hypervisor layer above the hardware, on top of which the operating systems execute.

Configuring board support packages (BSPs) of multiple operating systems so that even two of them can work together with well-defined device boundaries while not trampling on each other is a challenging task. Other challenges of moving to multi-core CPUs include the following:

- Bus contention
- Interrupt handling
- Time management

Bus contention can result when a single memory management unit (MMU) on the CPU or board is accessed by all of the cores. Cores executing different operating systems can interfere with one another's use of the shared MMU, producing unpredictable results. The software needed to control access to the MMU would be complex and would need to be shared among all cores, regardless of operating system. This additional software would increase code bloat as well as product certification costs.

Like the MMU, the programmable interrupt controller (PIC) is also shared among cores. There are times during normal run-time that the operating system must disable interrupts so that special execution, such as operations on shared data, can proceed uninterrupted.

When one operating system is masking interrupts for all cores (and other operating systems), it can lead to time skew, missed events, and misbehaving applications. The impact of this on a safety-critical device can be deadly.

The software needed to manage and avoid this would be sizable and would increase code bloat as well as product certification costs. On some processor implementations, the clock interrupt is directed only to one core and must be propagated to the other cores. This directly impacts system reliability and safety certification costs.
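The following toy model (plain C and entirely hypothetical; it simulates mask registers in ordinary variables rather than touching real hardware) illustrates the problem: with a single shared mask, one partition's critical section silently suppresses another partition's timer ticks, while per-partition virtual masks, of the kind a hypervisor interposes, leave the neighbor unaffected.

```c
/* Toy model of a shared interrupt mask vs. per-partition virtual masks.
 * Illustrative only; it does not reflect any particular PIC or
 * hypervisor implementation. */
#include <stdio.h>
#include <stdbool.h>

#define TICKS 10

static int delivered_shared[2];   /* timer ticks seen with one shared mask */
static int delivered_virtual[2];  /* timer ticks seen with per-partition masks */

int main(void)
{
    bool shared_mask = false;                /* one mask covering both partitions */
    bool virtual_mask[2] = { false, false }; /* one mask per partition */

    for (int tick = 0; tick < TICKS; tick++) {
        /* Partition 0 enters a critical section for ticks 3..6 and
         * masks interrupts the only way it knows how. */
        bool p0_in_critical_section = (tick >= 3 && tick <= 6);
        shared_mask     = p0_in_critical_section;
        virtual_mask[0] = p0_in_critical_section;

        for (int p = 0; p < 2; p++) {
            if (!shared_mask)     delivered_shared[p]++;
            if (!virtual_mask[p]) delivered_virtual[p]++;
        }
    }

    /* With a shared mask, partition 1 loses the ticks that partition 0
     * masked; with per-partition masks it receives all of them. */
    printf("shared mask : p0=%d p1=%d of %d ticks\n",
           delivered_shared[0], delivered_shared[1], TICKS);
    printf("virtual mask: p0=%d p1=%d of %d ticks\n",
           delivered_virtual[0], delivered_virtual[1], TICKS);
    return 0;
}
```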

Resolving the challenges of augmenting current designs and architectures to gain the benefits of multi-core CPUs can be daunting. But the challenges can be avoided when operating systems execute on top of a hardware virtualization layer.

 

Virtualization and Partitioning

In a multi-core compute environment in which multiple operating systems are performing specific and differentiated tasks, the challenges can rapidly diminish the perceived value of multi-core CPUs.

But by carefully partitioning and isolating the hardware devices for each of the specific operating systems, and presenting each device only to the operating system that owns it, the problems of sharing devices that should not be shared can be avoided.

To completely partition a multi-core processor and devices in such a way that there can be safety-certified and general-purpose partitions doing different tasks, there would need to be an arbitration layer between the operating systems and the devices such as the PIC and MMU.

The operating systems need virtualized access to these specific hardware devices so that they can operate in the mode for which they were written, that is, under the assumption that each has control over all devices. In this manner, one operating system's interaction with the MMU or PIC will not have any detrimental impact on the other operating systems on the processor.

 

                             

Figure 2: Virtualization supervises access to cores, memory, and devices


By inserting an embedded virtualization layer below the multiple operating systems that virtualizes and arbitrates access to the cores, memory, and devices, each operating system can properly execute in its own isolated partition. Leveraging a fully configurable hypervisor makes the task of presenting specific devices to each partition very simple and deterministic, while also ensuring that device access for the operating system partitions remains at the efficiencies required by the real-time devices.

 

                             

Figure 3: Configurable device partitioning

Configurable device partitioning allows system designers to specifically describe which operating environment, and therefore operating system, is presented with which device from the hardware. This means that a safety function that needs direct access to devices such as actuators or sensors can be given direct access to the necessary hardware. In many situations this means that the safety function does not have to be re-written to run in the virtualized environment.

When migrating existing designs from single-core implementations to multi-core CPUs, changes in hardware board layout and devices can force the need for many modifications in the software. Changes to application code can be minimized by presenting an explicit subset of devices to a particular partition. This means that not only is an operating system prevented from accessing devices it is not supposed to access, it does not even detect their presence.
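To make this concrete, a partition description might conceptually resemble the sketch below: a hand-written C table (hypothetical; real hypervisors use their own configuration formats such as XML files or device trees) that pins each operating environment to specific cores, gives it a private memory window, and lists the only devices it is allowed to see.

```c
/* Hypothetical partition map: which cores, memory, and devices each
 * guest is allowed to see. Names and sizes are invented for
 * illustration only. */
#include <stdio.h>
#include <stdint.h>

struct partition {
    const char *name;          /* guest operating environment */
    uint32_t    core_mask;     /* bit N set => core N assigned */
    uint32_t    mem_base_mib;  /* private physical memory window */
    uint32_t    mem_size_mib;
    const char *devices[4];    /* devices presented; all others are invisible */
};

static const struct partition board_config[] = {
    { "safety_rtos", 0x1,  0,  64, { "position_sensor", "actuator_bus", NULL } },
    { "linux_hmi",   0x6, 64, 448, { "display", "ethernet0", "usb0", NULL } },
};

int main(void)
{
    for (size_t i = 0; i < sizeof board_config / sizeof board_config[0]; i++) {
        const struct partition *p = &board_config[i];
        printf("%-12s cores=0x%x mem=[%u MiB, +%u MiB) devices:",
               p->name, (unsigned)p->core_mask,
               (unsigned)p->mem_base_mib, (unsigned)p->mem_size_mib);
        for (int d = 0; p->devices[d]; d++)
            printf(" %s", p->devices[d]);
        printf("\n");
    }
    return 0;
}
```

Because the general-purpose partition's device list simply omits the safety-related hardware, that operating system never probes for it, which is the behavior described in the following paragraphs.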

Consider a safety-certified partition that reads a position sensor detecting the placement of machinery, in order to provide closed-loop feedback on the machinery's position.

It would be undesirable for a general-purpose operating system, providing a rich graphics human-machine interface, to be given the task of detecting the position sensor hardware. With configurable partitioning, not only does the general purpose operating system not have access to the hardware, it doesn’t know the hardware exists.

With such separation and configurable partitioning, product updates can be delivered that involve enhancements to the general-purpose partitions, while leaving the safety-certified partition intact. This allows product updates to be delivered without forcing recertification expenses and delays.

 

Certification Considerations

To maintain or attain complete product safety certification to the levels necessary for the product and industry, all aspects of the compute platform leading up to the certified application are under certification scrutiny. This includes the virtualization layer.

 

                             

 

Figure 4: Mixed safety levels on a multi-core CPU

From a certification standpoint the following components of the product require certification:

- The hardware
- The virtualization layer
- The operating system running inside the virtual board (VxWorks CERT for example)
- The safety application itself

Obtaining certification is a detailed process and depends highly on the standard used for certification. In the case presented here, all parts require certification.

For most standards this means that the code needs to be written to strict coding standards, the system must be methodically tested, and the tests must be traceable to all of the requirements. Risk and time-to-market can be significantly reduced by using components for which this work has already been done.

Certification efforts are labor intensive and very costly. With the hypervisor included in the scope of certification, one must choose a virtualization solution with a minimal code footprint.

For example, the Linux kernel with KVM is approximately 2.7 million lines of code (Mloc), and changes (additions and deletions) run to tens of thousands of lines per day. Certifying such a code base is impossible, not only because of its size but also because of its churn.

 

                             

Figure 5: Mixed safety levels reduce certification costs

The configuration in Figure 5 (above) has partitions that provide the safety functions alongside isolated partitions that can run Linux or MS Windows to provide a graphical user interface.

What the virtualization layer guarantees, however, is exclusivity between the HMI partition and the safety partition(s); that is, rogue processes executing in the HMI partition cannot impact the safety functions.
The safety functions run on different cores (or share a core through a pre-set schedule); they use different devices and different memory. They run in their own separate environment, enforced by the embedded virtualization layer.
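Where safety and non-safety partitions do share a core, the pre-set schedule mentioned above is typically a fixed, cyclic table of time windows, in the spirit of ARINC 653 time partitioning. The sketch below is a hypothetical illustration of such a static schedule; the slot lengths and partition names are invented.

```c
/* Hypothetical static time-partitioning table for one shared core.
 * The major frame repeats forever; each partition runs only inside its
 * own fixed window, so the HMI can never steal time from the safety task. */
#include <stdio.h>

struct time_slot {
    const char *partition;
    unsigned    duration_ms;   /* fixed budget inside every major frame */
};

static const struct time_slot major_frame[] = {
    { "safety_rtos", 6 },      /* safety loop always gets its 6 ms first */
    { "linux_hmi",   4 },      /* remainder of the 10 ms frame for the HMI */
};

int main(void)
{
    unsigned t = 0;
    /* Simulate three major frames of the cyclic schedule. */
    for (int frame = 0; frame < 3; frame++) {
        for (size_t s = 0; s < sizeof major_frame / sizeof major_frame[0]; s++) {
            printf("t=%3u ms: run %-12s for %u ms\n",
                   t, major_frame[s].partition, major_frame[s].duration_ms);
            t += major_frame[s].duration_ms;
        }
    }
    return 0;
}
```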

This claim of separation must be substantiated by the embedded virtualization layer. That places a heavy certification burden on the virtualization layer, but once it is proven, the system integrator can certify the safety side and iterate on the non-safety side, updating functionality and user interfaces without heavy cost. Selecting a vendor with a history of security- and safety-certified products will minimize the overall product certification costs.

Leveraging a safety-certified embedded virtualization solution allows developers to migrate their existing safety-certified applications from single-core CPUs to multi-core CPUs while introducing new functionality that can be hosted on the remaining CPU cores.

 

                             

Figure 6: Safety-Certified Embedded Virtualization with Multiple Partitions

Migration of existing software assets to a new multi-core platform with embedded virtualization can be easily attained when the designers have full control over the specific device configuration for all partitions.

Conclusion

There are many revolutionary changes occurring in the embedded industry, and the additional processing power of multi-core devices is delivering many new capabilities into the market. This change is affecting devices with certification requirements as well. Multi-core devices can provide significant benefits to these types of products; however, the certification requirements need to be met.

Embedded virtualization provides enforceable separation between safety and non-safety functions, allowing for the consolidation of mixed safety level systems on a single piece of silicon. This technology allows direct hardware access for embedded performance as well as the strict separation needed to ensure that the final system can be certified.

Operating systems and virtualization layers with available certification evidence can significantly reduce risk and speed time-to-market for these devices.

 

About the author:

Chris Ault is a Senior Product Manager with Wind River Systems focusing on virtualization solutions. Prior to joining Wind River, Chris worked in roles ranging from software engineering and engineering management to technical sales and product management at Mitel, Nortel, Ciena, AppZero, and Liquid Computing, with a focus on virtualization products, technologies, and sales. Chris holds degrees in electronics, computer science, and economics and resides in Ottawa, Canada.
