atomic_counter whitepaper

This started out as the same material as published in the August 1998 issues of Visual C++ Developer, and then is edited to integrate it into the rest of the documentation and evolves with updates to the material.

When it comes to multi-threaded programs, one issue looms large: race conditions. The operating system (Windows NT, or Windows 95 / 98) provides primitives for mutual exclusion as a means to cope with these race conditions.

Mutual Exclusion means that if one thread is executing a particular block of code, then no other thread is allowed to execute that same block. There are a variety of kernel objects for this, including the Mutex.

But, Mutexís are slow. They are pure overhead and donít contribute any real meaning to the program, so making them expensive is just adding insult to injury. Adding thread safety to the reference counting in a string class, for example, slowed it down by a factor of three.

Less general purpose, but designed specifically for these race conditions, Win32 provides a CRITICAL_SECTION object. This is basically a Mutex that cheats. Without changing from user mode to kernel mode, it can check for the (hopefully common) case of the object being un-owned, and grab it. It can also check for the case of recursive locking, where the same thread acquires the object more than once. In other cases, when one thread has to wait for another, a real Mutex is used. For a program with little or no contention, this produces a dramatic increase in speed.

But this is still not the last word in eliminating race conditions. To get any faster, we need to get away from the idea of a protected region of code all together.

Digging Deeper

Just how does the CRITICAL_SECTION do its user-mode checking, anyway? Somehow, it needs to modify a flag and test it as a single unbroken operation. This is an existence proof that there is a way to do a read/modify/write operation in a thread-safe manner.

In fact, if you peruse the Win32 API you will find a set of functions for this purpose:

InterlockedCompareExchange
InterlockedDecrement
InterlockedExchange
InterlockedExchangeAdd
InterlockedIncrement

But, I was surprised to find that the Microsoft compiler for x86 doesnít emit inline code for these. Although they could be optimized down to a single machine instruction, you instead call a small function. Furthermore, the InterlockedExchange is implemented so poorly that I had to get a second opinion to even be sure it was correct.

Before we slam the functions too badly, letís take a look at why they might be useful to us. Suppose you were implementing a reference count in some object. Some thread may increment the value while another thread decrements the value, or two threads may increment it at the same time. This is a classic read/modify/write operation, susceptible to serious race conditions. But instead of throwing a region of mutual exclusion around this simple operation (appalling because the overhead outweighs the operation by a factor of a thousand—like shipping a single SIMM in a foam block the size of boxcar), we could use the Interlocked instructions.

Basically, the CPU has a primitive capability of doing an increment or decrement in a single unbroken operation, or atomically. You might suppose that a single INC DWORD PTR instruction is atomic in and of itself. After all, an interrupt can't occur between the read and the write, since interrupts are only serviced between instructions.

That's true enough for a single CPU, but even with only one CPU (and no contention from other devices like bus-mastering or memory-mapped I/O) it's not true if you want to see what the value is at the same time you changed it. That is, both testing the value and changing is not atomic. However, there are some CPU instructions that provide this, such as XADD (exchange and add).

In order to be truly safe, you must use a special feature of the CPU to make sure that the instruction is indeed atomic. On the x86 this is the bus lock, achieved with the LOCK prefix. When applied to one of those single-instruction read/test/modify/write opcodes like XADD, it uses hardware controls to make that atomicness hold true even with multiple CPUs (or other masters) in the system.

The EnterCriticalSection is in fact implemented by using one of these instructions. It atomically reads the old value and writes a “taken” flag to see if the critical section is already taken. If the old value is “not taken”, then the code proceeds. If the old value is also “taken” then it has to wait, and writing the “taken” flag did not change anything so nothing needs to be undone.

For our reference count, a single atomic increment or decrement is all we need to accomplish our final goal! There is no need to go through the whole logic of EnterCriticalSection and then LeaveCriticalSection. Using atomic primitives is two to three times faster! Even more important to some uses, they never block.

test Dual Pentium Pro
200 MHz, WinNT4
circa 1998
Pentium 4 w/hyperthreading
2.8 GHz, WinXP
circa 2006
regular memory variable: 31.25 nanoseconds 2.34 nanoseconds
hand-coded memory variable: 31.25 nanoseconds 2.08 nanoseconds
atomic counter: 178.1 nanoseconds 38 nanoseconds
critical section: 504.6 nanoseconds 78.1 nanoseconds

This shows that an atomic counter is 16 times slower than using a regular variable. However, this does not consider the superscaling features and how the atomic instruction prevents that. A test that interleaves three variables is just as fast as one variable! This means that an atomic instruction can be 48 times slower, since it stalls the pipelining of operations. Generally, newer processors have a heavier penalty than older ones for stalling the pipeline, because the newer CPUs do more simultainiously. A critical section, when it doesnít block, requires two atomic instructions: one to enter and one to leave. So it is not suprising that it benchmarks at more than twice the duration of the atomic counter.

These tests and more can be found in the atomic_counter_benchmark.cxx file, and you can run it for yourself.

Taking the next step

So, we know that atomic primitives are nice to have around. The selection of Win32 calls, however , are sadly incomplete and inherently inefficient, not to mention too primitive for C++. We can do a lot better, with just a little work.

Instead of

y= InterlockedExchangeAdd (&x, 1);

Iíd rather write the cleaner

y= x++;

This is just syntactic sugar, as InterlockedExchangeAdd of 1 is just an atomic post-increment. But what about an atomic pre-increment? There is no such function in the Win32 arsenal, so you would have to use the post-increment and then repeat the operation on the returned “old” value to find out what the new value was at that instant.

y=
InterlockedExchangeAdd (&x, 1) // increments x but returns old value
+1; // re-compute to know what the new value of x is.

This is why I said that the functions are inherently inefficient. As it turns out, the machine primitive is a pre-increment! A similar trick is done in reverse to un-do the operation on the return value, as part of the implementation of InterlockedExchangeAdd. So, to get a pre- increment we apply two more operations that cancel each other out. Yuck.

If there was both pre- and post-increment available, you could use whichever you really needed. Which one was native under the hood would not matter, since in either case you do the minimum work to get the result you actually wanted.

Another shortcoming of the Interlocked suite of functions is the argument type. They only work on signed long values. Actually, the code works on any 4-byte integral value, but only a long* signature is provided so you have to cast to use it on int, unsigned, unsigned long, or any pointer type. And it cannot work at all on short, unsigned short, char, or unsigned char.

A proper C++ solution would be fully rich in terms of available types, and be type-safe without casting. The solution presented here is a template which does exactly this.

The atomic_counter template lets you create a value that is manipulated atomically. Being a template, you can bestow this atomicness on anything you like.

atomic_counter<int> c1;
atomic_counter<unsigned> c2;
atomic_counter<byte> c3;
// Ö etcÖ

Once declared, you can use prefix or postfix increment or decrement, in addition to += and -= on the variables, and they behave atomically. Listing A shows a first cut at a template class definition, which you can see is pretty simple.

template <typename T>
class atomic_counter {
protected:
   T value;
   // >> need a critical section, too.
public:
   operator T() const volatile { return value; }
   T operator= (atomic_counter x) volatile { value = x.value; }
   atomic_counter (T x) : value(x) {}
   atomic_counter() {}
   T operator++() volatile;
   T operator--() volatile;
   T operator++ (int) volatile;
   T operator-- (int) volatile;
   T operator+= (T) volatile;
   T operator-= (T) volatile;
   T exchange (T) volatile;
   };

// The implementation of the general case requires a critical
// section.  No implementation is provided at this time.

This is a fairly ordinary template, which creates a class to hold a value of the specified type and apply various operators to it.

This parameterized class could be implemented using a CRITICAL_SECTION, and then it would handle any kind of type so long as the underlying operations existed. This would include floating point values, and any user-defined types such as BCD.

But, thatís not what we have in mind. We want the efficient atomic instructions to be used, not critical sections.

Getting Specialized

The special atomic instructions only work for certain types. Specifically, integers (signed or unsigned) of 1, 2, or 4 bytes. The instructions can also be applied to pointers with a slightly different implementation. They wonít work on arbitrary types and certainly not on user-defined class types.

So, the idea is to define a generic atomic_counter that will take anything because it uses a portable implementation. Then, tell the compiler that for particular types, use an alternate implementation.

The fact that int is special is hidden from the user. You can have:

atomic_counter<double> dc;
double val= dc++;

and a critical section will be used. But if you write:

atomic_counter<unsigned long> Lc;
unsigned long val= Lc++;

then the atomic instructions will be used instead.

Keeping the syntax uniform is more than just being neat and relaxing the need for the user to remember two different template names and when to use each. Keeping the syntax the same is crucial in writing other templates. Suppose I wrote:

template <typename T>
T foo()
 {
 atomic_counter<T> count;
 // Ö

Here I don't know what T will be, and this template would inherit any special restrictions from the templates used to implement it. That is not nice. In order to be able to use an atomic_counter<T> within some other template function or template class, the syntax must be uniform across all types.

But that doesnít mean the implementation has to be the same!

In the simplest case, you can write an explicit specialization of a class. For example, listing B sketches an explicit specialization of atomic_counter<int>.

template<>
class atomic_counter<int> {
   int value;
public:
      inline int operator++()
         { return InterlockedExchangeAdd(&value, 1); }
         in reality, would need volatile and casting.
      // Ö etc
   };

This tells the compiler that when it needs an atomic_counter<int>, to use the definition supplied instead of generating it from the generic atomic_counter<T>. In principle, we need to supply a specialized definition of all the types that can be handled specially.

That can get rather lengthy. The thought of writing 10 classes each with ten functions makes me cringe. Thatís what templates are for!

And sure enough, you can use another template to implement nine of the special cases. You still need to define all the classes, but now they are trivial.

template<>
class atomic_counter<int> : public atomic_integer_xx<int> {
public:
   atomic_counter() {}
   atomic_counter (int x) : atomic_integer_xx<int>(x) {}
   int operator= (atomic_counter x) volatile { return value = x.value; }
   };

The real work is done by the helper template atomic_integer_xx, and then we declare the specialization to inherit the implementation generated by the helper template. We need only supply the constructors and assignment operator.

The key to writing the helper template is uniformity. If all the different flavors look the same, then the template can generate them from a generic description. The key to making all of them look the same is to isolate the differences in small functions, and allow overloading to choose the correct function.

So, the implementation of one of the functions is:

T operator++() volatile
      { return 1+ internal::Xadd (&value,1); }

where internal::Xadd is an overloaded function that handles one, two, or four byte values with the appropriate atomic instructions. The beauty is that all the functions except exchange can be implemented in terms of Xadd. So the template handles all the common operators and all the available types, and the real implementation boils down to three primitives to atomically add bytes, shorts, and longs.

Atomic CPU Instructions

To implement the three forms of Xadd, we need assembly language. The supplied InterlockedExchangeAdd only handles the 4-byte case, and no supplied function exists for the other two sizes. But, the function is quite trivial to write.

__declspec(naked) int __fastcall Xadd (volatile int*, int)
 {
 __asm {
    lock xadd dword ptr [ECX], EDX
    mov EAX, EDX
    ret
    }
 }

As seen in Listing C, inline assembly can be used to spit out the instruction, and also to return the proper result. The __fastcall keyword tells the compiler that we want arguments passed in registers instead of on the stack. There, we can immediately use them in the lock xadd instruction. Then, return a value by moving it into the EAX register. The __declspec(naked) tells the Microsoft compiler not to bother generating the normal entry and exit code for the function—we have no need for it.

The byte and 16-bit forms are almost identical.

volatile values

You may have noticed that all the members of atomic_counter are declared volatile, as are various arguments within the implementation. This is necessary in order to allow the user to declare volatile atomic counters.

In C++, volatile is a modifier similar to const. While const says that the value will not be changed by the code, volatile says that the value might be changed by some process unknown to the code. In particular, any time a volatile value is used, it must actually be checked. The compiler is not allowed to optimize it out by pulling it into a register or assume that just because it didn't change it means it still holds the same value.

Consider a loop like this:

atomic_counter<int> count= 0;
// create some threads Ö
// use count Ö

//later,
while (count > 0) Sleep(1000);

The idea here is that this code waits for other threads to finish their work, and those other threads decrement count as they finish. When the count hits zero, this code can proceed. In reality, it causes an infinite loop when optimizations are turned on. The compiler will see that count is never changed in the body of the loop, and assume that it has not changed. The volatile keyword tells the compiler that count could indeed changed by other means, in this case by other threads.

volatile atomic_counter<int> count= 0;

Because of the intended use of atomic counters as values that will be manipulated simultaneously on multiple threads, they should be volatile in general. (so why not just make the value member volatile? Because I just now thought of it).

Now, if count is declared as volatile, we could not call operator++ on it unless operator++ were a volatile function. Itís the same as with const, which you should be more familiar with. The compiler will not allow an implicit conversion that loses qualifiers. If I declare the variable as volatile, I can only pass a pointer or reference to it to a function that understands that itís volatile and respects it as such. I could not pass it to a function that then caches it in a register—that would go against the standing orders about this variable.

So, the need for the volatile keyword propagates all the way down into the most primitive implementation functions.

The use of volatile also requires me to supply an assignment operator, which at first glance would seem to be no different than what the compiler would do for me anyway. But, the compiler- generated copy assignment operator is not volatile, and if I want to assign a value to a volatile variable, I need to specify otherwise.

In practice

The bottom line is that atomic counters are easy to use and satisfy a specialized but somewhat common need more efficiently than a more general-purpose mutual exclusion mechanism.

Benchmarks (see table) show that this class is more than twice the speed of a CRITICAL_SECTION, though it is still much slower than an ordinary non-thread-safe counter. It is, however, as good as we can get on the hardware.

Besides the speed, it offers two more advantages. It does not need to be initialized like a CRITICAL_SECTION, so initialization order issues are avoided. Second, it fits in the same place as a primitive value—an atomic int is four bytes, no larger than a normal int. This means that atomic arithmetic can be performed on structures that have already been defined to hold integer values at specific offsets, and we donít have large CRITICAL_SECTION objects that outweigh the object being protected.

The atomic_counter template is a powerful tool in my arsenal of thread-safe code.

Availability

The atomic_counter code is part of the Repertoire Project, a study in designing a set of solid reusable primitive foundation components. Complete code and documentation is available at http:// www.dlugosz.com/Repertoire. This class may be trivially extracted from the library and modified and incorporated into your own code, or you can just use the entire Classics library, which is full of other goodies.