Is lock-free programming practical for multicore?

by David Kalinsky, TechOnline India, April 06, 2011


In a multicore environment, you can do resource sharing efficiently without locks, but there are some caveats.

I teach courses in industry about multicore software design. The professional engineers attending often feel uncomfortable in this Brave New World of truly parallel software, which is significantly different from the "pseudo-concurrency" of traditional single-CPU multitasking.

When I get to the part of the course about shared resources, the attendees are shifting nervously in their seats as I describe four different kinds of semaphores.[1] By the time I add to the discussion three or more varieties of "spin-locks," with their characteristic active "spinning," some attendees begin to audibly groan. As I try to explain the often subtle differences in usage of these seven or more mechanisms, attendees frequently stop me in mid-sentence to ask: "Hey David, is there any way I can just avoid that entire can of worms?"

What they’re really asking is: "Hey David, is there a way I can do resource sharing efficiently without locks in a multicore environment?"

My answer? Facilities for lock-free programming are indeed available in the hardware of many multicore systems-on-chip (SoCs), as well as in the software of some multicore operating systems.

The caveat is that, unfortunately, developing application software algorithms with these facilities is both different from and more challenging than working in the traditional way. When using traditional locks, a software designer first identifies "critical sections" of code and then "protects" them by selecting one of the traditional locking mechanisms to "guard" the critical sections. In the past, avoiding locks seemed dangerous or tended to involve intricate, convoluted algorithms. For these reasons, lock-free programming has not been widely practiced.

But as application software developers gain experience with multicore SoCs and multicore operating systems, they frequently discover that traditional multitasking design approaches are often inappropriate in a multicore environment. Multiple tasks, and often multiple cores, can spend a great deal of time waiting on locked locks, thus reducing parallelism, increasing latencies, and reducing the benefits of using a multicore SoC.

While lock-free programming is not a cure-all, it can be used sensibly to provide a significant performance advantage over lock-based programming. This advantage is most often achieved by using lock-free programming within a small percentage of application software’s tightest, most deeply-nested and heavily-executed loops.

The very best form of lock-free programming is simply to design application software as very large and totally independent parallel chunks of code. Such immense blocks of code could run concurrently (typically on different cores) for long periods of time without any interaction at all between them. There would be no need for locks, since there would be no data interactions. But this is not practical in many applications.

If data interactions are unavoidable, some simple data interactions between parallel chunks of code can be governed by a lock-free operation called CAS, or atomic compare-and-swap, that is offered in hardware on various multicore SoCs and also in a number of multicore operating systems. I’ll give several examples of how to use CAS later in this article.

Problems with locks

Locking mechanisms, such as semaphores, mutexes, and multiple reader-writer locks, are problematic even in single-CPU multitasking environments. Using them in a multicore environment tends to exacerbate their problems because of the true parallelism involved and the more chaotic nature of task scheduling done by multicore operating systems. In multicore environments, additional locking mechanisms called spin-locks join the fray. Here are some oft-encountered problems involving these locks.

Deadlocks — A deadlock is a situation where two (or more) tasks wait for each other to release a resource that the other is holding. Since they’re waiting, they won’t release anything; and as a result, they won’t do any useful work ever again. An example is shown in Figure 1.

[Figure 1: Two tasks deadlock, each holding one semaphore while waiting for the semaphore held by the other.]

The deadlock here will not occur consistently but rather randomly, making it very hard to debug. On a multicore SoC, the deadlocking tasks may well be on different cores. And if they’re trying to lock spin-locks (instead of the semaphores in the figure), the deadlocking tasks might be "spinning" at full speed on different cores while making absolutely no progress.
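
As a minimal sketch of the pattern, using POSIX mutexes in place of the figure’s semaphores (the task names here are illustrative, not from the figure):

#include <pthread.h>

pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

void *task_1(void *arg)
{
    pthread_mutex_lock(&lock_a);   /* holds A ...                   */
    pthread_mutex_lock(&lock_b);   /* ... and waits for B           */
    /* ... use both shared resources ... */
    pthread_mutex_unlock(&lock_b);
    pthread_mutex_unlock(&lock_a);
    return NULL;
}

void *task_2(void *arg)
{
    pthread_mutex_lock(&lock_b);   /* holds B ...                   */
    pthread_mutex_lock(&lock_a);   /* ... and waits for A: deadlock */
    /* ... */
    pthread_mutex_unlock(&lock_a);
    pthread_mutex_unlock(&lock_b);
    return NULL;
}

If task_1 acquires lock_a and task_2 acquires lock_b before either reaches its second call, both block forever. Always locking in one agreed global order would prevent this particular deadlock, but such discipline is hard to enforce across a large code base.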

Lockouts — A lockout is a situation where a task locks a lock, uses the resource the lock is "guarding," but later forgets to release the lock. If another task (or perhaps the same task again) then tries to lock the lock, it will never get it. If that task waits indefinitely for the lock, it will never run its application code again. It will appear to have "died."

Priority inversions — A priority inversion is a situation in which a low-priority task prevents a high-priority task from executing. This can happen very easily if a low-priority task locks a lock, and then a high-priority task (perhaps running simultaneously on another core) tries to lock the same lock. The high-priority task will be made to wait on the lock until the low-priority task releases it. If the locking, and hence the waiting, goes on for a long time, it can seriously degrade the performance of the high-priority task.

Performance degradations — Locks are operating system abstractions that are used by calling operating system services, which are often cumbersome. For example, operating systems associate task-waiting queues with most kinds of locks and must manage the associated queue during every operation on a lock. This frequent queue management can be time-consuming. In addition, tasks asking to lock a lock may be delayed in the task-waiting queue if the lock is in use. In a multicore environment, multiple tasks on multiple cores may be forced to wait, possibly in a busy-wait if they are spinning on a spin-lock.

Interrupt handler limitations — Interrupt handlers such as ISRs (interrupt service routines) are limited in their interactions with most locks. For example, ISRs are forbidden to wait for semaphores. They are forbidden to operate in any way at all on (priority-promotion) mutexes. Thus, tasks can’t use semaphores or mutexes to protect critical sections of code that they share with ISRs. ISRs are permitted to wait for spin-locks; however, such waiting could inject significant additional latency into the ISR.

 

Lock-free to the rescue?

Lock-free programming can be used in some situations to provide a significant performance advantage and avoid the litany of problems inherent in lock-based programming. Simple data interactions between parallel chunks of code can be governed by a lock-free operation, such as CAS, that is offered as a machine instruction on a number of multicore SoCs as well as a service of some multicore operating systems.

The operation of CAS, expressed in a C-like language, is shown in Figure 2.[2]

[Figure 2: The compare-and-swap (CAS) operation, expressed in a C-like language.]

CAS compares the current content of var with an expected value. If it is as expected, var is updated to a new value. If it is not as expected, var is left unchanged. All of this is done as a single noninterruptible, nonpreemptable operation.
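
The figure’s listing is not reproduced here, but its semantics can be sketched in C as follows. Note that this is a sketch of the semantics only: the entire function body must execute as one indivisible step, which real systems obtain from a machine instruction (such as x86’s CMPXCHG) or from an operating-system service, never from ordinary C code like this.

#include <stdbool.h>

/* Sketch of CAS semantics only; the whole body must execute as a
 * single atomic, nonpreemptable operation. */
bool cas(long *var, long expected, long new_value)
{
    if (*var == expected) {    /* value is as expected ...        */
        *var = new_value;      /* ... so install the new value    */
        return true;           /* report success                  */
    }
    return false;              /* another task changed var first  */
}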

To use CAS in lock-free code, begin by copying the original value of a variable (or pointer) into expected. A second copy is then used in a (possibly long and complex) calculation to arrive at a new value for the variable. But during that processing, another task — perhaps on another core — may already have modified the variable from its original value. We want to update the variable with the new value only if it has not been "tampered with" by another task in the interim. CAS performs the check and the update together, without the need for a locking mechanism.

If CAS shows that the variable no longer contains its original or expected value, application software needs to decide what to do next. The variable has not been updated with the new value. What is often done is to repeat the entire calculation of the (possibly long and complex) processing, this time based on a new reading of the current value of the variable. Then CAS is invoked again, in the hope that this time there was no "tampering" by another task.


EXAMPLE 1: Incrementing a counter


Let’s begin with a simple example of two tasks that both increment a shared value using the C-language ++ operator, which is not an atomic operation. See Figure 3.

[Figure 3: Two tasks that each increment the shared variable counter using the ++ operator.]

The two tasks may become active at the same time in a multicore environment, whether they have the same priority or differing priorities. Occasionally, the ++ operations in the two tasks will be executing at precisely the same time — resulting in a corruption of the value stored in counter. For example, a count may be lost.
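
A minimal sketch of the hazard, with counter shared by two illustrative tasks:

long counter = 0;   /* shared by both tasks */

/* counter++ compiles to a load, an add, and a store. If both tasks
 * run those three steps at the same time on different cores, one
 * task's store can overwrite the other's, and a count is lost. */
void task_A(void) { counter++; }
void task_B(void) { counter++; }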

This can be avoided by using a lock to protect the ++ operation as a "critical section." But we’re trying to do lock-free programming! Figure 4 shows how to achieve the same objective in a lock-free fashion.[2]

[Figure 4: Lock-free increment of counter using a CAS retry loop.]
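
The listing itself is not reproduced here; a minimal sketch of the same retry loop, reusing the hypothetical cas() wrapper sketched earlier:

void lock_free_increment(long *counter)
{
    long expected, new_value;
    do {
        expected  = *counter;       /* snapshot the current value */
        new_value = expected + 1;   /* compute the updated value  */
    } while (!cas(counter, expected, new_value));
    /* cas() fails only if another task changed *counter in the
     * meantime; the loop then recomputes from the fresh value.  */
}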

In the Figure 4 code, it doesn’t matter whether the + 1 operation is atomic or not. It only matters whether or not another task has been "tampering with" (or "working on") the variable counter during the calculation. The CAS operation checks this and will only update counter if there’s been no tampering. If there has been tampering, this code will simply loop around to the beginning and try to increment again.

Before leaving this example, let’s note several things. First of all, this code is much more complex and "delicate" than its lock-based alternative. So it is much harder to understand and excruciatingly harder to debug. Secondly, the looping behavior of this code makes its execution time uncontrollable (in situations where "tampering" might be detected again and again). This style of coding might be unacceptable in a hard real-time system.

In some applications, it may be possible to limit the number of iterations through the loop, in order to bound the worst-case execution time. This may require justification, such as a probabilistic calculation showing that the iteration limit will be hit so rarely as to be negligible.
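
For instance, a bounded variant of the increment loop might look like the following sketch (MAX_RETRIES and the fallback policy are illustrative assumptions, not from the article):

#include <stdbool.h>

#define MAX_RETRIES 8   /* illustrative bound */

/* Returns true on success. After MAX_RETRIES failed CAS attempts it
 * gives up, so the caller can fall back to a lock or defer the work;
 * worst-case execution time is thereby bounded. */
bool bounded_increment(long *counter)
{
    for (int i = 0; i < MAX_RETRIES; i++) {
        long expected = *counter;
        if (cas(counter, expected, expected + 1))
            return true;
    }
    return false;   /* rare failure path; caller decides what to do */
}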

 

 

EXAMPLE 2: Lock-free linked list

Linked lists are often used in embedded software to organize large quantities of sensor data or other data streams. Figure 5 illustrates a linked list, with each list member shown as a rectangle containing both a data field and a pointer to the next member of the list. (The data field in a list member could be used to store a pointer to a large buffer, if there is a large quantity of data associated with it.)

[Figure 5: A linked list; each member contains a data field and a pointer to the next member.]

Inserting a new member into a linked list is straightforward with a lock-free approach. First, the proper position in the list for the new member must be located. Then the content of the new member must be prepared. And finally, the new member is inserted into the list using a single CAS operation. This operation is shown in Figure 6.

[Figure 6: Inserting a new member into the list with a single CAS on an existing member’s pointer-to-next-member field.]

In this situation, CAS operates on the pointer-to-next-member field of an existing list member to verify that there has been no "tampering" with it, before CAS changes it to point to the new member. Here, "tampering" really means that while the new member was being prepared for insertion into the list, another member had already been inserted at that position in the list. Hence all preparations for inserting the new member must be started again "from scratch" in that situation.
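
A minimal sketch of such an insertion, assuming an illustrative node type and a hypothetical pointer-flavored CAS wrapper with the Figure 2 semantics:

#include <stdbool.h>

typedef struct node {
    int          data;   /* or a pointer to a larger data buffer */
    struct node *next;   /* pointer to the next list member      */
} node_t;

/* Hypothetical pointer CAS with the semantics sketched in Figure 2. */
bool cas_ptr(node_t **var, node_t *expected, node_t *new_value);

/* Insert new_node immediately after prev, retrying if another task
 * changes prev->next while the new member is being prepared. */
void insert_after(node_t *prev, node_t *new_node)
{
    node_t *expected;
    do {
        expected       = prev->next;   /* snapshot current successor */
        new_node->next = expected;     /* link the new member to it  */
    } while (!cas_ptr(&prev->next, expected, new_node));
}

Note that the CAS covers only the final pointer swing; all the preparation happens on the private new_node, which is why no lock is needed.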

 

At first glance, it may seem that deleting an existing member from a list ought to be just as easy. "Just ‘CAS’ the pointer field of the previous list member to a pointer to the following list member." It sounds like it ought to work. But there is a "race condition" here, as illustrated in Figure 7.

The "Delete Member B" operation is shown in the upper part of the figure.  But in parallel, an "Insert Member Z after Member B" operation could be going on at exactly the same time, as shown in the lower part of the figure. The result would be that Member Z would be left inaccessible.

So here we see that CAS is not an all-powerful magic bullet in itself. Developers of lock-free programs need to "help it along" with solutions to problems such as the race condition of Figure 7. Here’s an idea that may help: Before the "Delete Member B" code calls CAS, it should intentionally modify the pointer field of Member B with some kind of deletion mark. This would make a parallel CAS from an "Insert" request involving Member B fail. In that way, the "Insert" would be postponed until a later retry, which would then succeed in inserting the new list member correctly.
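
One way to realize that idea is the common "marked pointer" technique, sketched below reusing node_t and cas_ptr from the insertion sketch. It assumes nodes are aligned in memory so the low bit of a next pointer is normally zero and can serve as the deletion mark; this particular encoding is an illustration, not something specified in the article.

#include <stdbool.h>
#include <stdint.h>

#define MARK(p)      ((node_t *)((uintptr_t)(p) | 1u))
#define IS_MARKED(p) (((uintptr_t)(p) & 1u) != 0)

/* Delete member b (the successor of prev) in two steps. */
bool delete_after(node_t *prev, node_t *b)
{
    node_t *succ;

    /* Step 1: mark b->next. A concurrent insert-after-b that took its
     * snapshot earlier now fails its CAS and must retry. (A robust
     * insert_after() must also retry whenever it reads a marked
     * pointer; that check is omitted from the sketch above.) */
    do {
        succ = b->next;
        if (IS_MARKED(succ))
            return false;              /* another task is deleting b */
    } while (!cas_ptr(&b->next, succ, MARK(succ)));

    /* Step 2: unlink b. If this CAS fails, prev->next has changed,
     * and a later traversal must finish removing the marked member. */
    cas_ptr(&prev->next, b, succ);
    return true;
}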

In this way, it’s possible to do ultra-high-performance lock-free manipulation of a linked list. But once again, we have encountered logic that is more complex and "delicate" than its lock-based alternative. So again, it’s harder to understand and much harder to debug. And once again, there is looping behavior in this logic that makes its execution time possibly uncontrollable.

What’s all the fuss?

Perhaps some of the readers of this article are disappointed at this point. Wasn’t one of the big expectations of lock-free programming that we could eliminate the task waiting and core waiting inherent in lock-based programming on multicore SoCs? But now we have seen that application software designed for lock-free programming could itself involve significant task waiting and core waiting. And this waiting, as in the examples we’ve seen, is typically busy-waiting where application software just keeps on running ("hogging" the "running state" of a core) while preparing for the next attempt to CAS.

Some readers were perhaps expecting that "lock-free programming" would be synonymous with "wait-free programming." But now we know they’re different. And wait-free programming is even more challenging to design than lock-free programming.[3]

However, lock-free programming does have distinct advantages over traditional lock-based programming in areas such as typical performance, deadlock avoidance, lockout avoidance, and priority-inversion avoidance. Program execution can sometimes be accelerated by factors of two or more at former critical sections.

Lock-free programming isn’t often used at present, perhaps because of the ultra-fine craftsmanship required of the software developers. The design of library software might be a sensible place to consider it. Or perhaps there is one small corner of your application where the elimination of high-frequency locking would give a major performance boost. Those might be good opportunities for you to try out a lock-free programming approach for the first time.


About the author:


David Kalinsky is director of customer education at D. Kalinsky Associates – Technical Training, a provider of intensive short courses on embedded systems and software development for professional engineers. He is a popular lecturer and seminar leader on technologies for embedded software in North America, Europe, and Israel. In recent years, David has built high-tech training programs for a number of Silicon Valley companies on various real-time operating systems and other aspects of software engineering for the development of real-time and embedded systems. Before that, he was involved in the design of many embedded medical and aerospace systems. David holds a Ph.D. in nuclear physics from Yale University. Contact him at david@kalinskyassociates.com.

Endnotes:

1. Kalinsky, David. "Multicore’s ‘Fourth Semaphore’: multiple reader-writer lock," Embedded.com, posted October 26, 2010. www.eetimes.com/4210147.

2. Siebert, Fridtjof. "Facing the Challenges for Real-Time Software Development on Multi-Cores," Embedded Systems Conference 2010 Boston, paper ESC-422. (https://www.cmpevents.com/ESCe10/a.asp?option=C&V=11&SessID=11396)

3. Alexandrescu, Andrei. "Lock-Free Data Structures," Dr. Dobb’s Journal, Oct. 1, 2004. http://drdobbs.com/184401865
