Using trace to solve the multicore system debug problem

by Aaron Spear, VMware, TechOnline India - April 12, 2011

In this paper we outline current multicore development trends, explore the deficiencies in traditional software development tooling when applied to multicore systems. We will also introduce the “Common Trace Format” (CTF), a coming standard for tracing multicore systems over time.

Modern multicore designs often aggregate wildly different hardware and software technologies. Traditional debuggers, which show a snapshot of a portion of the system, do little to uncover issues that arise due to the complex interaction of components.

Engineers routinely cobble together proprietary tracing facilities in order to have some chance of catching hard-to-find defects. What is needed to debug these diverse systems is an approach that can analyze trace data coming from many different collection technologies.

The nature of the multicore debug problem

Debugging and optimizing modern multicore designs is growing in difficulty in proportion to the exponential growth of the underlying technology. In 2011, the number of “smart phones” sold will overtake that of “feature phones” in the U.S. That means over half the population have scaled their expectations for the new “normal” embedded system to include:


* Hip touch screen UI with 3D effects

* Extremely responsive

* Reliable/fast Internet connectivity anywhere

* Long battery lifetime

* Available now!

The net result is that the embedded systems designer is handed a set of requirements that are extraordinarily difficult to reconcile. Requirements on power consumption, external connectivity (cellular, Wi-Fi, Bluetooth), human interface, and time to market have resulted in some general patterns in how devices are designed:

* Low-power requirements force using more cores at often variable clock rates in order to meet the same performance levels.

* Performance critical functionality is pushed into dedicated hardware.

* Open source operating systems, drivers, and software stacks are used (e.g. Linux, Android).

* The control plane and data plane are split into separate processors (with different OSes on each).

Embedded systems are doing more, and doing it in parallel. Debugging an embedded system with a single core is difficult enough, but debugging multiple streams of software activity that interact with each other is far harder.

The problem with software debugging

A traditional software debugger such as GDB allows developers to look at snapshots of one program at a time. The developer sets a breakpoint, runs the program, and then sequentially steps through the code from the breakpoint, verifying that state changes as expected. The developer sees the stack of the functions that were called leading up to the next line of source code to be executed. When debugging multiple threads of execution concurrently (multicore or single core), most software debuggers add more contexts to the same display. Figure 1 below, for instance, shows Eclipse CDT with GDB debugging three Linux processes, each with multiple threads.



          Figure 1: Multi-process debugging with CDT and GDB 7.2 [1]


Debugging a set of cores that are interacting with each other requires that a debugger be attached to all of them. Depending on the cores and tools being used, it may be possible to have a debugger capable of debugging multiple cores concurrently. In some cases the hardware may allow synchronous debugging, that is, all cores running and stopping together.

However, this use case tends to be the exception and not the rule. In many cases, decoupled tools must be used to debug the cores individually. This sounds difficult, but it is in fact much worse than that due to the layering of operating systems and application stacks.

Android, for example, consists of three distinctly different software domains in which to debug: at the bottom is kernel space (which includes the Linux kernel and drivers), in the middle is Linux user process space and a unique JVM, and at the top Java applications and application frameworks.

There are no open source tools that allow you to seamlessly debug from Java down into a C/C++ user process, and then down into Linux kernel code. Debugging everything in the software stack requires three different debuggers for the core running Android, and then additional debuggers for other cores.

The extreme difficulty of simply setting up a debug session is further compounded by problems that are impossible for a software debugger to solve.

Some systems have real-time deadlines and simply cannot be stopped at all. It is also increasingly common for significant portions of functionality to be pushed into dedicated “hardware accelerators”: on-chip resources that offload functionality such as networking or graphics from the main processor. These blocks tend to be black boxes, inaccessible to software debuggers.

Software debuggers are invaluable for debugging algorithms, but for debugging a system of concurrently running hardware and software components, what is needed is a tool that can trace the state and interaction of software and hardware over time.


The benefit of being able to see state change over time is no revelation to any programmer who has written any amount of C code. The de facto method for logging since the 1950s has been to dump state information to stdout or files via the venerable printf routine (or one of its ancestors in older languages).

Developers, having no alternative, add large amounts of manual instrumentation in software to provide visibility into the system's behavior over time. Most commercial applications of any significant size have some sort of logging infrastructure, likely cobbled together by the implementers of the product. These facilities can be invaluable for troubleshooting issues in the field after a product ships, for instance.

For all its benefit, the way that this sort of logging is most commonly done does have some practical limits:

* The logging is intrusive; that is, it changes the timing of the system. If the logging is architected well, the intrusion may be small and deterministic, but depending on the architecture it may also completely change the system's behavior (e.g., two threads running in parallel that both want to insert a log message into the same buffer/file at the same time require some form of mutual exclusion). Additionally, if the logging is done via string formatting, the performance impact on the system may be a concern.

* Only locations that have been instrumented are logged. In general this sort of logging is added by inserting some preprocessor macros or logging function calls into the source code and then building the application. This static instrumentation is extremely valuable, but the designer may not be able to anticipate all of the information that may be needed. Among other things, it also requires the developer to be able to change and rebuild all components where data will be collected, which may be difficult or impossible.  

* Correlation in time is difficult. If you are trying to correlate multiple contexts running in parallel you must include some sort of time stamp that you can use to understand ordering.  In the multicore world, this can mean cores with unrelated clocks.

* Tools for multicore log analysis don't readily exist. Logs created by different technologies often use unique formats, and engineers must build their own tools to analyze them because what they need simply does not exist in a form they can use as is.
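To make the intrusiveness point concrete, here is a minimal sketch (the names `log`, `log_buffer`, and `worker` are hypothetical) of the kind of shared-buffer logging described above. Every call funnels through one lock, which is exactly the mutual exclusion that perturbs timing:

```python
import threading
import time

log_lock = threading.Lock()   # serializes every writer
log_buffer = []               # shared in-memory stand-in for a log file

def log(msg):
    # Two threads that want to log at the same instant must take turns
    # here -- this lock is the timing intrusion described above.
    with log_lock:
        log_buffer.append((time.monotonic(),
                           threading.current_thread().name, msg))

def worker(n):
    for i in range(n):
        log("event %d" % i)

threads = [threading.Thread(target=worker, args=(1000,), name="t%d" % k)
           for k in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# All 2000 entries arrive, but their timing was perturbed by lock contention.
print(len(log_buffer))
```

The lock guarantees a consistent log, at the cost of forcing otherwise-parallel threads to serialize at every logging call.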


The solution to the multicore system analysis problem is one that enables tracing the state of a heterogeneous system over time, yet addresses the limitations of “logging” mentioned above. In this context, “tracing” is used instead of “logging”. The difference between the two is arguably one of semantics; however, they are distinguished by a few attributes:

* Performance of tracing is paramount; intrusion must be minimized. Methods typically employed in logging, such as string-formatted messages, may not be acceptable.

* Trace events tend to be low-level and high-volume (e.g., OS scheduler information)

* Flow of execution through a system tends to be the primary concern in tracing, whereas logging tends to be focused on whether or not an event occurred (i.e., a log might drop repeated events in favor of being concise, a trace may want them all).

* Tracing may be inserted into a system dynamically. Logging tends to be a static feature of a product.

* Tracers may have stringent robustness requirements. Some systems function in “flight recorder” mode, always collecting trace data as a part of normal operation in order to enable detailed analysis in the case of a failure, or an error that only occurs on production workloads.
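The performance distinction above can be illustrated with a small sketch. The 24-byte record layout and the function names are purely illustrative; the point is that a tracer appends fixed-size binary records while a logger pays string-formatting costs on the hot path:

```python
import struct

# Hypothetical fixed event layout: u32 event id, u32 context (thread id),
# u64 timestamp in ns, u64 payload value -- 24 bytes, native byte order.
EVENT = struct.Struct("=IIQQ")

def trace_event(buf, event_id, ctx, ts_ns, value):
    # The tracer appends a fixed-size binary record: no string
    # formatting cost at collection time.
    buf.extend(EVENT.pack(event_id, ctx, ts_ns, value))

def log_event(event_id, ctx, ts_ns, value):
    # The logging equivalent formats a human-readable line up front,
    # paying the formatting cost on every event.
    return "[%d] ctx=%d id=%d value=%d\n" % (ts_ns, ctx, event_id, value)

binary = bytearray()
trace_event(binary, 1, 42, 123456789, 7)
text = log_event(1, 42, 123456789, 7)
print(len(binary), len(text))   # the binary record is a fixed 24 bytes
```

Deferring all formatting to the analyzer is what lets tracers sustain the low-level, high-volume event rates mentioned above.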

Over time, many companies and open source initiatives have created capable tracing mechanisms (DTrace, SystemTap, Ftrace, strace, LTTng, …), but there is one fundamental problem with all these great ideas.

Modern multicore systems are a combination of many different technologies. A single design may aggregate different processor architectures, closed and open source operating systems and application stacks, hardware acceleration IP, etc.

To trace this system adequately, you would have to coalesce many different trace data formats, and no tool exists that can do this. Even with a capable trace analyzer that could help you see events over time in context, adapting all of these formats is difficult.


The MCA Tools Infrastructure Working Group

Some time ago the Multi-core Association's member companies decided that it would be beneficial to have a Tools Infrastructure Working Group (TIWG) working on standards to advance the interoperability of multicore tools.

Around the table at the inaugural meeting were various tool vendors, operating system vendors, and semiconductor companies, discussing pain points in our industry and how we might work together. It became apparent quite quickly that the increasing parallelism and complexity of multicore systems was a daunting problem, and not a single one of us had a solution for everything (though we all had solutions for portions of the problem space).

A far-sighted goal everyone shared was tooling that enabled people to develop and optimize for multicore systems. How do you optimize something you do not understand? The first step must be to see, understand, and benchmark the behavior of the system. It is then clear what to focus on, and changes can be validated against previous benchmarks. The idea of an open standard for the interchange of trace data that would enable our tools to interoperate seemed like a great first step and a foundation for future work.

The MCA was not the only group in the world thinking that a trace data standard was needed. In the Linux world in particular, a number of different tracing technologies with incompatible data formats exist. In that space there was a strong desire to be able to share analysis tooling, and Ericsson had been driving a unification effort for Linux since 2008.

At the Embedded Linux Conference in April 2010, the MCA and Linux communities decided to collaborate to create a standard that could meet both Linux and embedded system needs.  The initial work to create the Common Trace Format (CTF) specification was begun by Mathieu Desnoyers, the maintainer of the “Linux Trace Toolkit next generation”  (LTTng) with input from the MCA.  Mathieu’s work was financed by Ericsson and the Embedded Linux Forum.

Common Trace Format (CTF)

CTF endeavors to create a general trace data format that is application, architecture, and programming-language agnostic. It is meant as a carrier for data that is temporal in nature, that is, ordered events occurring in time. The goal at its highest level is to be able to easily analyze trace streams coming from wildly different collection mechanisms on different cores/systems, each with its own clock domain.

The traces may be created by operating systems, hardware probes, bus analyzers, simulators, or instrumentation in any arbitrary application. A common use case might be something like this:

* A RISC processor running Linux instrumented with LTTng to collect kernel-level trace data, such as high-level scheduling information for the kernel.

* A hardware trace probe collecting low-level instruction and data trace on the same RISC chip (this can be used to provide function call-level detail).

* A DSP running a proprietary RTOS which also has kernel instrumentation trace.

* A hardware trace coming from a network “accelerator” hardware block in the same system.

Each of these trace sources records different types of data and has a different clock. The clocks are often related in some way, and trace streams from the different sources can possibly be correlated in other ways, for example via IPC between the cores that results in events in multiple contexts.

Each trace collector is independent, resulting in decoupled traces. The goal is that one or more analyzers can correlate those traces and perform specific analyses on them, understanding each trace collector's event schema.

At its lowest level, CTF is simply a format to express arbitrary events vs. time. The lowest-level analysis tool that uses CTF could read in multiple traces from different trace collectors and simply display a text log containing records with timestamp, context, event type, and event-specific data (rendered as a string, for instance).

On top of the raw events, the intent is to build rules for interpreting the data in terms of common concepts. Take, for example, tracing an operating system scheduler vs. time. Operating systems may have different models for execution (“cores”, “processes”, “threads”, “interrupts”, “blocks”, …), but regardless of the details of the implementation, it is universally desirable to be able to present a Gantt chart showing execution state vs. time.
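As an illustration of such an interpretation rule, here is a sketch (with hypothetical event names and generic states) that maps collector-specific scheduler events onto a per-context state timeline, the model a Gantt view needs:

```python
# Hypothetical mapping from collector-specific scheduler event names to a
# small set of generic execution states.
GENERIC_STATE = {
    "sched_switch_in":  "RUNNING",
    "sched_switch_out": "READY",
    "block_on_mutex":   "BLOCKED",
}

def build_timeline(events):
    """events: iterable of (timestamp, context, collector_event_name).
    Returns {context: [(timestamp, generic_state), ...]}, i.e. one Gantt
    row of state changes per context."""
    timeline = {}
    for ts, ctx, name in sorted(events):
        timeline.setdefault(ctx, []).append((ts, GENERIC_STATE[name]))
    return timeline

rows = build_timeline([
    (10, "thread-A", "sched_switch_in"),
    (25, "thread-A", "block_on_mutex"),
    (25, "thread-B", "sched_switch_in"),
    (40, "thread-B", "sched_switch_out"),
])
print(rows["thread-A"])   # [(10, 'RUNNING'), (25, 'BLOCKED')]
```

Only the mapping table is OS-specific; the timeline model and any Gantt renderer built on it stay generic.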

CTF will be able to express a general concept of “state” for a context vs. time that will be applicable to any OS. CTF tries to philosophically account for arbitrary collection mechanisms, but the initial proof of concept was done with the following trace collectors:

* Operating system instrumentation (e.g. Ftrace, LTTng)

* Application tracing (e.g. LTTng User Space Tracing (LTTng-UST))

* Hardware trace (e.g. instruction and data trace, Nexus)

Details of the format itself are shaped by requirements coming from the trace collector (the producer of the trace data), the trace analyzer (the consumer of the trace data), and the storage and transmission media. The following lists are not exhaustive, but they are instructive:


Trace Collector (trace producer) requirements

* Trace data may be optimized for the trace collector (i.e. native byte order and alignment allowable)

* Trace data may be packed or sparse at the collector's convenience

* Must be able to move data directly from target buffers into files/communication channels

* Ability to record or infer context (core/process/thread), type of event, and timestamp on every event

* Ability to record arbitrary binary event data

* Different collection use cases possible: many traces each of one type of event, single trace with many event types, traces from one core, aggregate traces from many cores, …

* Meta-data that describes the layout of the trace data must be simple to generate

* Must be able to stream trace data over a network

Trace Analyzer (trace data consumer) requirements

* Must be able to process extremely large trace files (> 10GB)

* Must be possible to aggregate traces that come from different components/processes/cores/systems

* Must not be required to scan all events in order to begin processing

* Should be possible to do a binary search on large files (by time)

* Must be able to bias/offset timestamps on traces for alignment
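The middle three analyzer requirements can be pictured with a small sketch: if the analyzer keeps an index of (first timestamp, file offset) per event packet, it can seek into a multi-gigabyte trace by time without scanning every event. The index layout and values here are purely illustrative:

```python
import bisect

# Illustrative packet index: (first timestamp in packet, file offset).
packet_index = [(0, 0), (1000, 4096), (2500, 8192), (9000, 12288)]
first_timestamps = [ts for ts, _ in packet_index]

def packet_offset_for_time(t):
    """File offset of the event packet that may contain time t."""
    # Binary search: the last packet whose first timestamp is <= t.
    i = bisect.bisect_right(first_timestamps, t) - 1
    return packet_index[max(i, 0)][1]

print(packet_offset_for_time(2600))   # 8192: seek there, skip the rest
```

A reader only needs the index (or packet headers) to locate a time window, which is what makes processing >10GB files and binary search by time feasible.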

CTF Event Model

The CTF event model is shaped by the anticipated use cases of collection and consumption. Raw events are assumed to be collected in some sort of optimized buffer specific to the trace collector. At some point after those events are buffered, they are either inserted into CTF trace file(s) or streamed over a communication link (network). The terminology of the format has been chosen with those use cases in mind. Some preliminary definitions:

Event Trace: A container of one or more event streams and meta-data that describes them.

Event Stream: An ordered sequence of events, containing a subset of the trace event types.

Event Packet: A sequence of physically contiguous events within an event stream. Event packets are variable sized, with optional padding at the end.

Event: The basic entry in a trace. A variable sized container of one or more event specific attributes including:

- event type: Numeric identifier that uniquely identifies the “class” of event within the trace stream.

- event context: Numeric identifier for the context of the event. Interpretation is use-case specific; it may be a “core”, a process id, a thread id, etc.

- event timestamp: The size and meaning of a timestamp is described in the meta-data.  It may be omitted if not available for the use case.

- event payload: Event specific binary data interpreted as defined by the metadata.
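A minimal sketch of this event model, assuming illustrative field sizes (in CTF the actual sizes, ordering, and encoding are defined by the meta-data, not fixed as they are here):

```python
import struct

# Illustrative header: u32 event type, u32 context, u64 timestamp,
# followed by an event-specific binary payload of arbitrary size.
HEADER = struct.Struct("=IIQ")

def encode_event(event_type, context, timestamp, payload):
    return HEADER.pack(event_type, context, timestamp) + payload

def decode_event(data):
    event_type, context, timestamp = HEADER.unpack_from(data, 0)
    return event_type, context, timestamp, data[HEADER.size:]

raw = encode_event(3, 1, 1000, struct.pack("=I", 0xCAFE))
print(decode_event(raw)[:3])   # (3, 1, 1000)
```

A real reader would first parse the meta-data to learn these layouts rather than hard-coding them, which is what makes the format architecture-agnostic.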

The meta-data for a trace is human-readable text that describes the encoding, size, and interpretation of the various fields in the events. CTF includes support for a complete set of basic types (integers, floating point, fixed point, enums, strings, …) as well as compound types (i.e. structures) and arrays.

File representation

The final output of the trace, after its generation and optional transport over the network, is expected to reside on permanent or temporary storage in a file system. Because each event stream is appended to while a trace is being recorded, each is associated with a separate output file. A stored trace can therefore be represented as a directory containing one file per stream, as shown in Figure 2 below.



Figure 2. CTF file architecture 
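The layout in Figure 2 amounts to a directory holding one file per event stream plus the meta-data text. A sketch, with made-up stream names and contents:

```python
import os
import tempfile

# A trace directory holding one file per event stream plus the
# human-readable meta-data; names and contents here are illustrative.
trace_dir = tempfile.mkdtemp(prefix="ctf_trace_")
streams = {
    "metadata": b"/* text describing the event layout */",
    "stream_0": b"\x00" * 64,   # binary event packets for stream 0
    "stream_1": b"\x00" * 64,   # binary event packets for stream 1
}
for name, payload in streams.items():
    with open(os.path.join(trace_dir, name), "wb") as f:
        f.write(payload)

print(sorted(os.listdir(trace_dir)))   # ['metadata', 'stream_0', 'stream_1']
```

Because each stream is its own append-only file, a recorder never has to interleave writers, and an analyzer can open streams independently.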

Network streaming

A key use case for CTF is collecting events on a remote system and transmitting them over a network. It is also possible to stream events as they occur. The terminology of “event packets” was deliberately chosen to reflect the common use case of a group of events that are transmitted together.

The size of an event packet can be chosen to fit within one UDP packet for minimal usage of network and CPU bandwidth. A protocol is also being developed that builds on the Target Communication Framework (TCF) for control of trace collection as well as streaming of trace data.
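A sketch of that grouping logic, using an illustrative 1400-byte payload budget and a hypothetical 16-byte event record; `flush()` stands in for handing a finished packet to `socket.sendto()`:

```python
import struct

UDP_PAYLOAD_BUDGET = 1400        # illustrative: stay under a typical MTU
EVENT = struct.Struct("=IIQ")    # hypothetical 16-byte event record

class PacketStream:
    """Groups events into event packets sized to fit one UDP datagram.
    flush() collects finished packets so the logic is testable."""
    def __init__(self):
        self.current = bytearray()
        self.sent = []

    def add(self, event_type, context, timestamp):
        # Close the current packet before it would exceed the budget.
        if len(self.current) + EVENT.size > UDP_PAYLOAD_BUDGET:
            self.flush()
        self.current += EVENT.pack(event_type, context, timestamp)

    def flush(self):
        if self.current:
            self.sent.append(bytes(self.current))  # stand-in for sendto()
            self.current = bytearray()

stream = PacketStream()
for i in range(100):
    stream.add(1, 0, i)
stream.flush()
print([len(p) for p in stream.sent])   # [1392, 208]: 87 events, then 13
```

Keeping each packet within one datagram avoids IP fragmentation, so a dropped datagram costs one packet of events rather than corrupting the stream.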

Current CTF status

The CTF specification is currently at draft “pre-v1.7” and is being reviewed by the TIWG and other members of the open source community. There are currently both open source and commercial tools being developed that have CTF generation and analysis support. The specification can be found on the EfficiOS web site.

Open source tools and frameworks

* BabelTrace: An MIT-licensed converter utility called “BabelTrace” has been created as a framework to convert from legacy trace formats to CTF. The idea is that anyone can write a utility to convert from their own format to CTF. Sources are available for download.

* LTTng kernel tracer: The Linux kernel tracer is currently being ported to use CTF format natively in its internal buffering as well as CTF files as output.

* LTTng User Space Tracer: Will be ported to use CTF format natively.

* LTTV: The open source Linux trace analyzer will be ported to consume CTF files.

* Eclipse Linux Tools LTTng viewer: Open source analyzer for Linux LTTng traces. This project will be extended in the summer of 2011 to consume CTF format natively, following the support in LTTV. There is growing interest in extending this tool to support additional operating systems and trace environments.

Commercial tools with CTF support

* Mentor Graphics System Analyzer: Multicore system analysis product that consumes CTF format as one of a number of supported trace formats.

CTF in the future

The MCA Tools Infrastructure Working Group is currently working on expanding the CTF specification to allow systems to emit information about the relationships between the clocks used in different traces. This will allow automatic correlation of trace events collected from different cores. Group member Texas Instruments is currently spearheading the effort.
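One simple correlation model, sketched below, assumes the two clocks are related linearly (a fixed offset plus a drift rate); two matched event pairs then suffice to map one trace's timestamps onto the other's timeline. This is an illustration of the general idea, not the TIWG's actual proposal:

```python
def make_clock_map(pair1, pair2):
    """pair1, pair2: (t_a, t_b) timestamps of the same two events as seen
    by trace A's clock and trace B's clock. Returns a function that maps
    a trace-B timestamp onto trace A's timeline."""
    (a1, b1), (a2, b2) = pair1, pair2
    rate = (a2 - a1) / (b2 - b1)          # relative clock rate (drift)
    return lambda t_b: a1 + (t_b - b1) * rate

# Example: B starts 100 ticks after A's epoch and ticks at half A's rate.
to_a = make_clock_map((100, 0), (1100, 500))
print(to_a(250))   # 600.0
```

Matched pairs can come from IPC events visible in both traces; with all timestamps mapped to one timeline, events from different cores can be ordered directly.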

The MCA TIWG is also working on the creation of an MIT-licensed portable C library for creating CTF files. The vision is to provide an easy migration path to CTF for proprietary and open source operating systems, tool suppliers, and others. With this library it will be a straightforward exercise to add CTF support inside various products (the firmware of a hardware analyzer, the source of an RTOS, etc.).

It is our belief that CTF will enable analyzing the behavior of complex multicore systems in a way that was not previously possible due to the diversity of technologies involved.  An open standard and the existence of easily accessible open source components for creation and analysis of the format will enable developers to quickly add tracing to applications and application frameworks. 

The result will be better understanding of the behavior of multicore systems.  The TIWG welcomes involvement and participation from interested companies and individuals.


About the author:

Aaron Spear works as an ecosystems infrastructure architect for VMware on tools for cloud computing. In the past 17 years he has designed many embedded systems (gas analyzers, control systems, audio, ...), lived in a van as an aspiring rock star, written RTOS extensions, and then stepped through the looking glass as the architect for a multicore embedded systems debugger. He is currently the chairman of the MCA's Tools Infrastructure Working Group, and spends lots of time pondering how to gracefully solve the parallel computing problems that pop up in every domain.


References

[1] Staying ahead of the multi-core revolution with CDT Debug, Patrick Chuong (TI), Dobrin Alexiev (TI), Marc Khouzam (Ericsson Canada)

[2] CTF format requirements and proposal, Mathieu Desnoyers

