Semaphores, locks (spinlocks and mutexes), and condition variables are the standard tools used to write safe multithreaded code. You'll see them if you look inside a typical modern operating system kernel designed for multi-processor (SMP) systems.
But there are times when these standard synchronization mechanisms don't quite cut it. Perhaps it's a piece of performance-critical code, or code which has trouble scaling to large numbers of processors because of lock use. If you dig around in Linux and other systems, you'll find some other clever approaches to synchronization, too.
First a disclaimer: You probably shouldn't be using these methods in your code, unless you understand the issues involved, have good reasons to avoid standard synchronization primitives, and are willing to write some architecture-specific code. Even then, you should think twice. But they can be fun to learn about.
Remember transactional memory from discussion last week? There, we were able to write programs without any explicit locking—only marking code blocks that we wanted to be atomic—and the processor would take of the details for us. Unfortunately, there aren't any processors you can buy that will do this now. However, we may able to get some of the same performance benefits—not having to use explicit locks—in restricted cases by doing a lot of the work ourselves in software.
Here's a simple example: we'll implement a linked list and give it stack operations: push an item onto the list and pop an item off the list. We'd like multiple threads to be able to access the list concurrently, without the need for locks.
The data structure is simple and what you'd expect:
struct item {
int value;
struct item *next;
};
struct item *stack_top = NULL;
Here is how we might implement the push and pop operations for a single-threaded program:
void push(int value)
{
struct item *node = new item;
node->value = value;
node->next = stack_top;
stack_top = node;
}
int pop()
{
struct item *node = stack_top;
if (node == NULL)
return -1; // Signal the stack is empty
stack_top = node->next;
int value = node->value;
delete node;
return value;
}
This code won't work as written if multiple threads may be calling push or pop simultaneously. To implement it without locks, we'll rely on the compare-and-exchange (or compare-and-swap) operation. The behavior of compare-and-exchange is essentially
int cmpxchg(int source, int expected, int *dest)
{
int old_value = *dest;
if (old_value == expected)
*dest = source;
return old_value;
}
except that the code executes atomically. This would be implemented as
a single instruction in the processor. In words, compare-and-exchange
compares the value at a memory address with an expected value, and if
the two are equal, it writes a third value into memory. If the values
don't agree, it does nothing. In either case, it will return the value
which was actually found in memory. (So here, if the return value is
equal to expected, then we know the write to memory
occurred.)
There are different variants of compare-and-exchange that might be available, depending upon the system; the key idea is that we have a single atomic instruction which can check a value in memory against an expected value, if it matches write a different value, and let us know whether it succeeded or not.
How can we use this in our linked-list example? We'll use cmpxchg to actually update the stack_top variable. If another thread tries to update simultaneously, the thread that goes second will notice the problem, go back, and retry the operation. So in the push function, we'll get the new item ready for the front of the list, and then atomically try to put it into place. If stack_top changed while we were doing this, we'll have to fix up the next pointer on the new item, and then retry. pop works similarly.
void push(int value)
{
struct item *node = new item;
node->value = value;
struct item *old_stack_top;
do {
old_stack_top = stack_top;
node->next = old_stack_top;
} while (cmpxchg(node, old_stack_top, &stack_top) != old_stack_top);
}
int pop()
{
struct item *node;
do {
node = stack_top;
if (node == NULL)
return -1; // Signal the stack is empty
} while (cmpxchg(node->next, node, &stack_top) != node);
int value = node->value;
delete node;
return value;
}
(Actually, there's still a potential small race condition in the code for pop as things are written. Can you spot it? What happens if two threads execute pop nearly simultaneously? What effect might the delete node line in the first thread have? Hold this thought for the next section.)
Though it is simple, compare-and-exchange is powerful enough that it can be used to implement a wide variety of lock-free data structures. It is more powerful than the test-and-set; while we could use test-and-set to implement locks, test-and-set is not sufficient for implmenting all these lock-free data structures.
Suppose that you have a data structure which is read by very many threads, often concurrently, but is only very rarely written to. For accesses to be safe, you'd ordinarily need to use a lock. Since most threads are reading, we could use a reader-writer lock to allow more concurrency. But if most threads are only accessing the data only very briefly, even this will not help much. (Why? Consider our implementation of a reader-writer lock from lecture, and think about what will happen if many, many threads are all trying to acquire a read lock at the same time.)
The Linux kernel supports a technique called Ready-Copy-Update, or RCU, for cases such as these. By observing a few restrictions, RCU allows all the readers to access the data concurrently, without ever having to acquire any locks!
For a concrete example: suppose our kernel is keeping track of the date and time:
struct datetime {
int year, month, day;
int hour, minute, second;
};
struct datetime now;
We set up a timer to update now once per second, so that it
always contains the current time. What could happen if another piece of
code tries to print the date, and is interrupted by the timer after it
has read the year, but before the month and day?
We'd like is for readers to be able to see a consistent version of the data, without having to use a lock to synchronize access. RCU solves this problem by:
How does the system know when no thread is still using the old object? One possibility, if the kernel scheduler is not pre-emptive, is to establish the rule that a thread cannot start to read an RCU object, yield the CPU (perhaps by going to sleep or blocking on a lock), and then finish reading the RCU object. Once all processors in the system have performed a context switch, it is safe to delete old objects. (Why?)
Can you think of other mechanisms that could be used to decide when old objects can be reclaimed?
If you'd like to read more, see the Wikipedia article on RCU, which also contains links to other references.
If we have time, we may discuss some of these questions in discussion. Feel free to think about them ahead of time, though don't worry if you don't have the chance.