Double-Checked Locking Is Fixed In C++11

The double-checked locking pattern (DCLP) is a bit of a notorious case study in lock-free programming. Up until 2004, there was no safe way to implement it in Java. Before C++11, there was no safe way to implement it in portable C++.

As the pattern gained attention for the shortcomings it exposed in those languages, people began to write about it. In 2000, a group of high-profile Java developers got together and signed a declaration entitled “Double-Checked Locking Is Broken”. In 2004, Scott Meyers and Andrei Alexandrescu published an article entitled “C++ and the Perils of Double-Checked Locking”. Both papers are great primers on what DCLP is, and why, at the time, those languages were inadequate for implementing it.

All of that’s in the past. Java now has a revised memory model, with new semantics for the volatile keyword, which makes it possible to implement DCLP safely. Likewise, C++11 has a shiny new memory model and atomic library which enable a wide variety of portable DCLP implementations. C++11, in turn, inspired Mintomic, a small library I released earlier this year which makes it possible to implement DCLP on some older C/C++ compilers as well.

In this post, I’ll focus on the C++ implementations of DCLP.

What Is Double-Checked Locking?

Suppose you have a class which implements the well-known Singleton pattern, and you want to make it thread-safe. The obvious approach is to ensure mutual exclusivity by adding a lock. That way, if two threads call Singleton::getInstance simultaneously, only one of them will create the singleton.

Singleton* Singleton::getInstance() {
    Lock lock;      // scope-based lock, released automatically when the function returns
    if (m_instance == NULL) {
        m_instance = new Singleton;
    }
    return m_instance;
}

It’s a totally valid approach, but once the singleton is created, there isn’t really any need for the lock anymore. Locks aren’t necessarily slow, but they don’t scale well under heavy contention.

The double-checked locking pattern avoids this lock when the singleton already exists. However, it’s not so simple, as the Meyers-Alexandrescu paper shows. In that paper, the authors describe several flawed attempts to implement DCLP in C++, dissecting each attempt to explain why it’s unsafe. Finally, on page 12, they show an implementation which is safe, but which depends on unspecified, platform-specific memory barriers.

Singleton* Singleton::getInstance() {
    Singleton* tmp = m_instance;
    ...                     // insert memory barrier
    if (tmp == NULL) {
        Lock lock;
        tmp = m_instance;
        if (tmp == NULL) {
            tmp = new Singleton;
            ...             // insert memory barrier
            m_instance = tmp;
        }
    }
    return tmp;
}

Here, we see where the double-checked locking pattern gets its name: We only take a lock when the singleton pointer m_instance is NULL, which serializes the first group of threads which happen to see that value. Once inside the lock, m_instance is checked a second time, so that only the first thread will create the singleton.

This is very close to a working implementation. It’s just missing some kind of memory barrier at the two commented placeholders. At the time when the authors wrote the paper, there was no portable C/C++ function which could fill in the blanks. Now, with C++11, there is.

Using C++11 Acquire and Release Fences

You can safely complete the above implementation using acquire and release fences, a subject which I explained at length in my previous post. However, to make this code truly portable, you must also wrap m_instance in a C++11 atomic type and manipulate it using relaxed atomic operations. Here’s the resulting code; the acquire and release fences are the two std::atomic_thread_fence calls.

std::atomic<Singleton*> Singleton::m_instance;
std::mutex Singleton::m_mutex;

Singleton* Singleton::getInstance() {
    Singleton* tmp = m_instance.load(std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_acquire);
    if (tmp == nullptr) {
        std::lock_guard<std::mutex> lock(m_mutex);
        tmp = m_instance.load(std::memory_order_relaxed);
        if (tmp == nullptr) {
            tmp = new Singleton;
            std::atomic_thread_fence(std::memory_order_release);
            m_instance.store(tmp, std::memory_order_relaxed);
        }
    }
    return tmp;
}

This works reliably, even on multicore systems, because the memory fences establish a synchronizes-with relationship between the thread which creates the singleton and any subsequent thread which skips the lock. Singleton::m_instance acts as the guard variable, and the contents of the singleton itself are the payload.

That’s what all those flawed DCLP implementations were missing: Without any synchronizes-with relationship, there was no guarantee that all the writes performed by the first thread – in particular, those performed in the Singleton constructor – were visible to the second thread, even if the m_instance pointer itself was visible! The lock held by the first thread didn’t help, either, since the second thread doesn’t acquire any lock, and can therefore run concurrently.

If you’re looking for a deeper understanding of how and why these fences make DCLP work reliably, there’s some background information in my previous post as well as in earlier posts on this blog.

Using Mintomic Fences

Mintomic is a small C library which provides a subset of functionality from C++11’s atomic library, including acquire and release fences, and which works on older compilers. Mintomic relies on the assumptions of the C++11 memory model – specifically, the absence of out-of-thin-air stores – which is technically not guaranteed by older compilers, but it’s the best we can do without C++11. Keep in mind that these are the circumstances in which we’ve written multithreaded C++ code for years. Out-of-thin-air stores have proven unpopular over time, and good compilers tend not to perform them.

Here’s a DCLP implementation using Mintomic’s acquire and release fences. It’s basically equivalent to the previous example using C++11’s acquire and release fences.

mint_atomicPtr_t Singleton::m_instance = { 0 };
mint_mutex_t Singleton::m_mutex;

Singleton* Singleton::getInstance() {
    Singleton* tmp = (Singleton*) mint_load_ptr_relaxed(&m_instance);
    mint_thread_fence_acquire();
    if (tmp == NULL) {
        mint_mutex_lock(&m_mutex);
        tmp = (Singleton*) mint_load_ptr_relaxed(&m_instance);
        if (tmp == NULL) {
            tmp = new Singleton;
            mint_thread_fence_release();
            mint_store_ptr_relaxed(&m_instance, tmp);
        }
        mint_mutex_unlock(&m_mutex);
    }
    return tmp;
}

To implement acquire and release fences, Mintomic tries to generate the most efficient machine code possible on every platform it supports. For example, here’s the resulting machine code on Xbox 360, which is based on PowerPC. On this platform, an inline lwsync is the leanest instruction which can serve as both an acquire and release fence.

The previous C++11-based example could (and ideally, would) generate the exact same machine code for PowerPC when optimizations are enabled. Unfortunately, I don’t have access to a C++11-compliant PowerPC compiler to verify this.

Using C++11 Low-Level Ordering Constraints

C++11’s acquire and release fences can implement DCLP correctly, and should be able to generate optimal machine code on the majority of today’s multicore devices (as Mintomic does), but they’re not considered very fashionable. The preferred way to achieve the same effect in C++11 is to use atomic operations with low-level ordering constraints. As I’ve shown previously, a write-release can synchronize-with a read-acquire.

std::atomic<Singleton*> Singleton::m_instance;
std::mutex Singleton::m_mutex;

Singleton* Singleton::getInstance() {
    Singleton* tmp = m_instance.load(std::memory_order_acquire);
    if (tmp == nullptr) {
        std::lock_guard<std::mutex> lock(m_mutex);
        tmp = m_instance.load(std::memory_order_relaxed);
        if (tmp == nullptr) {
            tmp = new Singleton;
            m_instance.store(tmp, std::memory_order_release);
        }
    }
    return tmp;
}

Technically, this form of lock-free synchronization is less strict than the form using standalone fences; the above operations are only meant to prevent memory reordering around themselves, as opposed to standalone fences, which are meant to prevent certain kinds of memory reordering around all neighboring operations. Nonetheless, on the x86/64, ARMv6/v7, and PowerPC architectures, the best possible machine code is the same for both forms. For example, in an older post, I showed how C++11 low-level ordering constraints emit dmb instructions on an ARMv7 compiler, which is the same thing you’d expect using standalone fences.

One platform on which the two forms are likely to generate different machine code is Itanium. Itanium can implement C++11’s load(memory_order_acquire) using a single CPU instruction, ld.acq, and store(tmp, memory_order_release) using st.rel. I’d love to investigate the performance difference of these instructions versus standalone fences, but have no access to an Itanium machine.

Another such platform is the recently introduced ARMv8 architecture. ARMv8 offers ldar and stlr instructions, which are similar to Itanium’s ld.acq and st.rel instructions, except that they also enforce the heavier StoreLoad ordering between the stlr instruction and any subsequent ldar. In fact, ARMv8’s new instructions are intended to implement C++11’s SC atomics, described next.

Using C++11 Sequentially Consistent Atomics

C++11 offers an entirely different way to write lock-free code. (We can consider DCLP “lock-free” in certain codepaths, since not all threads take the lock.) If you omit the optional std::memory_order argument on all atomic library functions, the default value is std::memory_order_seq_cst, which turns all atomic variables into sequentially consistent (SC) atomics. With SC atomics, the whole algorithm is guaranteed to appear sequentially consistent as long as there are no data races. SC atomics are really similar to volatile variables in Java 5+.

Here’s a DCLP implementation which uses SC atomics. As in all previous examples, the store to m_instance in the creating thread will synchronize-with the initial load in any later thread, once the singleton is created.

std::atomic<Singleton*> Singleton::m_instance;
std::mutex Singleton::m_mutex;

Singleton* Singleton::getInstance() {
    Singleton* tmp = m_instance.load();
    if (tmp == nullptr) {
        std::lock_guard<std::mutex> lock(m_mutex);
        tmp = m_instance.load();
        if (tmp == nullptr) {
            tmp = new Singleton;
            m_instance.store(tmp);
        }
    }
    return tmp;
}

SC atomics are considered easier for programmers to reason about. The tradeoff is that the generated machine code tends to be less efficient than that of the previous examples. For example, here’s some x64 machine code for the above code listing, as generated by Clang 3.3 with optimizations enabled:

Because we’ve used SC atomics, the store to m_instance has been implemented using an xchg instruction, which acts as a full memory fence on x64. That’s a heavier instruction than DCLP really needs on x64. A plain mov instruction would have done the job. It doesn’t matter too much, though, since the xchg instruction is only issued once, in the codepath where the singleton is first created.

On the other hand, if you compile SC atomics for PowerPC or ARMv6/v7, you’re pretty much guaranteed lousy machine code. For the gory details, see 00:44:25 - 00:49:16 of Herb Sutter’s atomic<> Weapons talk, part 2.

Using C++11 Data-Dependency Ordering

In all of the examples shown so far, there’s a synchronizes-with relationship between the thread which creates the singleton and any subsequent thread which avoids the lock. The guard variable is the singleton pointer, and the payload is the contents of the singleton itself. In this case, the payload is considered a data dependency of the guard pointer.

It turns out that when working with data dependencies, a read-acquire operation, which all of the above examples use, is actually overkill! We can do better by performing a consume operation instead. Consume operations are cool because they eliminate one of the lwsync instructions on PowerPC, and one of the dmb instructions on ARMv7. I’ll write more about data dependencies and consume operations in a future post.
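
For illustration, here’s a sketch of the earlier low-level ordering example with the first load changed to a consume operation. For the dependency ordering to apply, the singleton must be accessed through the returned pointer, and as the later post on memory_order_consume explains, today’s compilers may simply treat the consume as an acquire:

Singleton* Singleton::getInstance() {
    Singleton* tmp = m_instance.load(std::memory_order_consume);  // dependency chain starts here
    if (tmp == nullptr) {
        std::lock_guard<std::mutex> lock(m_mutex);
        tmp = m_instance.load(std::memory_order_relaxed);
        if (tmp == nullptr) {
            tmp = new Singleton;
            m_instance.store(tmp, std::memory_order_release);
        }
    }
    return tmp;     // callers read the singleton through this pointer
}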

Using a C++11 Static Initializer

Some readers already know the punch line to this post: C++11 doesn’t require you to jump through any of the above hoops to get a thread-safe singleton. You can simply use a static initializer.

Singleton& Singleton::getInstance() {
    static Singleton instance;
    return instance;
}

The C++11 standard’s got our back in section 6.7.4:

If control enters the declaration concurrently while the variable is being initialized, the concurrent execution shall wait for completion of the initialization.

It’s up to the compiler to fill in the implementation details, and DCLP is the obvious choice. There’s no guarantee that the compiler will use DCLP, but it just so happens that some (perhaps most) do. Here’s some machine code generated by GCC 4.6 when compiling for ARM with the -std=c++0x option:

Since the Singleton is constructed at a fixed address, the compiler has introduced a separate guard variable for synchronization purposes. Note in particular that there’s no dmb instruction to act as an acquire fence after the initial read of this guard variable. The guard variable is a pointer to the singleton, and therefore the compiler can take advantage of the data dependency to omit the dmb instruction. __cxa_guard_release performs a write-release on the guard, which is therefore dependency-ordered-before the read-consume once the guard has been set, making the whole thing resilient against memory reordering, just like all the previous examples.
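
Conceptually, the generated code behaves something like the following hand-written sketch (not the actual ABI, which funnels through __cxa_guard_acquire and __cxa_guard_release, but close in spirit). The guard here is a pointer to the instance, so the first load only needs consume, or data-dependency, ordering:

Singleton& Singleton::getInstance() {
    alignas(Singleton) static unsigned char storage[sizeof(Singleton)];
    static std::atomic<Singleton*> guard(nullptr);           // plays the role of the compiler's guard variable
    static std::mutex initMutex;
    Singleton* p = guard.load(std::memory_order_consume);    // data dependency orders the payload reads
    if (p == nullptr) {
        std::lock_guard<std::mutex> lock(initMutex);
        p = guard.load(std::memory_order_relaxed);
        if (p == nullptr) {
            p = new (storage) Singleton;                      // construct in the static storage
            guard.store(p, std::memory_order_release);
        }
    }
    return *p;
}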

As you can see, we’ve come a long way with C++11. Double-checked locking is fixed, and then some!

Personally, I’ve always thought that if you want to initialize a singleton, it’s best to do it at program startup. But DCLP can certainly help you out of a jam. And as it happens, you can also use DCLP to store arbitrary value types in a lock-free hash table. More about that in a future post as well.


Acquire and Release Fences Don't Work the Way You'd Expect

Raymond Chen defined acquire and release semantics as follows, back in 2008:

An operation with acquire semantics is one which does not permit subsequent memory operations to be advanced before it. Conversely, an operation with release semantics is one which does not permit preceding memory operations to be delayed past it.

Raymond’s definition applies perfectly well to Win32 functions like InterlockedIncrementRelease, which he was writing about at the time. It also applies perfectly well to atomic operations in C++11, such as store(1, std::memory_order_release).

It’s perhaps surprising, then, that this definition does not apply to standalone acquire and release fences in C++11! Those are a whole other ball of wax.

To see what I mean, consider the following two code listings. They’re both taken from my post about the double-checked locking pattern in C++11. The code on the left performs a release operation directly on m_instance, while the code on the right uses a release fence instead.
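
The relevant lines from each listing look like this (taken from the DCLP examples above):

// Listing on the left: a release operation directly on m_instance
tmp = new Singleton;
m_instance.store(tmp, std::memory_order_release);

// Listing on the right: a standalone release fence, then a relaxed store
tmp = new Singleton;
std::atomic_thread_fence(std::memory_order_release);
m_instance.store(tmp, std::memory_order_relaxed);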

In both cases, the purpose is the same: to prevent other threads from observing the store to m_instance before any stores performed inside the Singleton constructor.

Now, what if Raymond’s definition of release semantics did apply to the release fence on the right? That would only mean that preceding memory operations are prevented from being reordered past the fence. But that guarantee, alone, is not enough. There would still be nothing to prevent the relaxed atomic store to m_instance, on the very next line, from being reordered before the fence – and even before the stores performed by the Singleton constructor – defeating the whole purpose of the fence in the first place.

This misconception is very common. Herb Sutter himself makes this mistake at 1:10:35 in part 1 of his (otherwise superb) atomic<> Weapons talk, even going so far as to say that release makes no sense on a standalone fence.

“If this memory barrier were just one way – for instance, the memory barrier itself was a release… the write to global = temp can skip right over, and then all the code’s together again. Now the fox is in the henhouse, and mayhem ensues.”

Fortunately, that’s not how C++11 release fences work. A release fence actually prevents all preceding memory operations from being reordered past subsequent writes.

To his credit, Herb has acknowledged the error in a comment here.

In C++11, a Release Fence Is Not Considered a “Release Operation”

I think that the above misconception stems from some confusion about what is a “release operation” in C++11 and what isn’t. You might reasonably expect a release fence to be considered a “release operation”, but if you comb through the C++11 standard, you’ll find that it’s actually very careful not to call it that.

In the language of C++11, only a store can be a release operation, and only a load can be an acquire operation. (See section 29.3.1 of working draft N3337.) A memory fence is neither a load nor a store, so obviously, it can’t be an acquire or release operation. Furthermore, if we accept that acquire and release semantics apply only to acquire and release operations, it’s clear that Raymond Chen’s definition does not apply to acquire and release fences. In my own post about acquire and release semantics, I was careful to specify the kinds of operations to which they can apply.

Nor Can a Release Operation Take the Place of a Release Fence

It goes the other way, too. The two code listings near the start of this post both do a fine job of implementing DCLP, but they are not equivalent to each other. While the code on the right achieves all the effects of the code on the left, the reverse is not true: Technically, the code on the left does not achieve all the effects of the code on the right. The difference is quite subtle.

The release operation on the left, m_instance.store(tmp, std::memory_order_release), actually places fewer memory ordering constraints on neighboring operations than the release fence on the right. A release operation (such as the one on the left) only needs to prevent preceding memory operations from being reordered past itself, but a release fence (such as the one on the right) must prevent preceding memory operations from being reordered past all subsequent writes. Because of this difference, a release operation can never take the place of a release fence.

It’s easy to see why not. Consider what happens when we take the code listing on the right, and replace the release fence with a release operation on a separate atomic variable, g_dummy:

Singleton* tmp = new Singleton;
g_dummy.store(0, std::memory_order_release);
m_instance.store(tmp, std::memory_order_relaxed);

This time, we really do have the problem that Herb Sutter was worried about: The store to m_instance is now free to be reordered before the store to g_dummy, and possibly before any stores performed by the Singleton constructor. The fox is in the henhouse, and mayhem ensues!

(Interesting side note: An early draft of the C++11 standard, N2588, dating back to 2008, actually tried to define memory fences in a manner similar to this example. There was no standalone atomic_thread_fence function in that draft; there was only a member function on atomic objects, fence. For convenience, the draft included a global_fence_compatibility object, similar to the g_dummy object used here. A paper by Peter Dimov revealed some shortcomings in this design. As a result, the C++11 standard committee ditched the approach in favor of the standalone fence function we have today.)

Now That We’ve Cleared That Up…

Standalone memory fences are considered difficult to use, and I think part of the reason is because very few people even use the C++11 atomic library – let alone this part of it. As a result, misconceptions tend to go unnoticed more easily.

This particular misconception has bugged me for a while. Part of that is because earlier this year, I released an open source library called Mintomic. The only way to enforce memory ordering in Mintomic is by using standalone fences that are equivalent to those in C++11. I’d rather not have people think they don’t work!

Bitcoin Address Generator in Obfuscated Python

Recently, I became interested in the inner workings of Bitcoin – specifically, the way it uses elliptic curve cryptography to generate Bitcoin addresses such as 1PreshX6QrHmsWbSs8pHpz6kLRcj9kdPy6. It inspired me to write another obfuscated Python script. The following is valid Python code:

_                   =r"""A(W/2,*M(3*G
               *G*V(2*J%P),G,J,G)+((M((J-T
            )*V((G-S)%P),S,T,G)if(S@(G,J))if(
         W%2@(S,T)))if(W@(S,T);H=2**256;import&h
       ashlib&as&h,os,re,bi    nascii&as&k;J$:int(
     k.b2a_hex(W),16);C$:C    (W/    58)+[W%58]if(W@
    [];X=h.new("rip           em    d160");Y$:h.sha25
   6(W).digest();I$                 d=32:I(W/256,d-1)+
  chr(W%256)if(d>0@"";                  U$:J(k.a2b_base
 64(W));f=J(os.urando       m(64))        %(H-U("AUVRIxl
Qt1/EQC2hcy/JvsA="))+      1;M$Q,R,G       :((W*W-Q-G)%P,
(W*(G+2*Q-W*W)-R)%P)       ;P=H-2**       32-977;V$Q=P,L=
1,O=0:V(Q%W,W,O-Q/W*                      L,L)if(W@O%P;S,
T=A(f,U("eb5mfvncu6                    xVoGKVzocLBwKb/Nst
zijZWfKBWxb4F5g="),      U("SDra         dyajxGVdpPv8DhEI
qP0XtEimhVQZnEfQj/       sQ1Lg="),        0,0);F$:"1"+F(W
 [1:])if(W[:1           ]=="\0"@""        .join(map(B,C(
  J(W))));K$:               F(W          +Y(Y(W))[:4]);
   X.update(Y("\4"+                     I(S)+I(T)));B$
    :re.sub("[0OIl    _]|            [^\\w]","","".jo
     in(map(chr,ra    nge    (123))))[W];print"Addre
       ss:",K("\0"+X.dig    est())+"\nPrivkey:",K(
         "\x80"+I(f))""";exec(reduce(lambda W,X:
            W.replace(*X),zip(" \n&$@",["","",
               " ","=lambda W,",")else "])
                    ,"A$G,J,S,T:"+_))

Python 2.5 – 2.7 is required. Each time you run this script, it generates a Bitcoin address with a matching private key.

So, what’s going on here? Basically, this little script gives you the ability to throw some money around. Obviously, I don’t recommend doing so. I just think it’s cool that such a thing is even possible. Allow me to demonstrate.

Sending Bitcoins to One of These Addresses

To show that the above Python script generates working Bitcoin addresses, I’ll go ahead and send 0.2 BTC – that’s currently over $100 worth – to the first address shown in the above screenshot. I’ll use Bitcoin-Qt, the original Bitcoin desktop wallet.

Here’s the transaction verified on Blockchain.info. Goodbye, 0.2 BTC!

Now, if I didn’t have the private key corresponding to 1AbbYb365sQ5DpZXTKkoXMCDMjLSx6m3pH, those bitcoins would be lost forever. Fortunately, I do have the private key. It was generated by the Python script too. So let’s get those bitcoins back.

Recovering Those Bitcoins

To recover those bitcoins, I’ll use another desktop wallet called Electrum. Under Wallet → Private keys → Import, I can enter the private key:

…and presto! Electrum considers those 0.2 BTC mine to spend once again.

To make sure, let’s send them back to another address.

Here’s the final transaction verified on Blockchain.info.

There you have it. We’ve successfully sent money to – and more importantly, back from! – a Bitcoin address that was generated by some code shaped like a Bitcoin logo.

What Does This Illustrate About Bitcoin?

Bitcoin addresses are created out of thin air. First, the script generates a pseudorandom number – that’s the private key. It then multiplies that number by an elliptic curve point to find the matching public key. The public key is shortened by a hash function, producing a Bitcoin address. Finally, both private key and address are encoded as text. Most Bitcoin wallet applications generate addresses in exactly this way.

Randomness ensures that each address is unique. With addresses created out of thin air, you might worry that two different Bitcoin wallets will eventually generate the same address. That’s not impossible, but with a strong pseudorandom number generator, it’s very, very, very, very, very, very, very, very unlikely. There are 2^160 possible Bitcoin addresses. If you were to generate one million addresses per second for 5000 years, you’d be more likely to have a meteor fall on your house than to ever see the same address twice.

You must keep your private keys safe! The security of your bitcoins depends entirely on your ability to keep your private keys secret. Normally, your collection of private keys is stored in a wallet, so it’s absolutely critical to keep that wallet safe – whether it’s stored online, encrypted to a file on your hard disk, or printed on paper. If you lose access to your wallet, you lose your bitcoins. Likewise, if a thief gains access to your wallet, and bitcoins are still stored at any address inside it, he or she could steal those bitcoins within seconds. Indeed, such thefts happen regularly.

In researching Bitcoin, I found that there are a lot of smart people who understand Bitcoin very well, and a lot of people who know almost nothing about it. Luckily, the first group has created plenty of resources for learning more. This post was pieced together from information on Wikipedia, this blog post, the Bitcoin wiki, and the original white paper.

If you enjoyed this post, send bitcoins! If I manage to get a few donations, I’ll be tempted to post a de-obfuscated version of this script with a simple explanation of the math behind it.

What Is a Bitcoin, Really?

When I first started learning about Bitcoin, I found plenty of information, but nothing that directly answered the most burning question:

When you buy bitcoins… what is it that you own, exactly?

That’s the question I’ll answer in this post. Along the way, I’ll explain several key Bitcoin concepts. You’ll see how bitcoins are secured and how they’re transferred between owners.

First and foremost, a bitcoin is a unit of account, in the same sense that a gallon is a unit of volume, or a gram is a unit of weight. You can’t pick up a bitcoin and hold it in your hand like you can a dollar bill. But that’s OK, because that’s not what’s important. What’s important is that:

  • Bitcoins can be possessed.
  • Bitcoins can be transferred.
  • Bitcoins are impossible to copy.

These three properties, combined, allow bitcoins to function effectively as a system of distribution of wealth. And fundamentally, that’s what makes bitcoins useful.

When you think about it, what makes dollar bills good at distributing wealth, anyway? It’s the same three properties. You can possess a dollar bill by putting it in your pocket. You can transfer it to somebody else by giving it to them. And dollar bills are pretty difficult to copy. Hence, cold, hard cash is one good way to distribute wealth, and bitcoins are another.

Since bitcoins are not physical objects, but merely units of account, there needs to be some other way to keep track of them. Bitcoin’s solution to this problem is particularly clever.

How Bitcoins Are Distributed

As you may have heard, there is no central server to keep track of everyone’s bitcoins. But that doesn’t mean there are no servers keeping track of bitcoins. Quite the contrary. There are, in fact, hundreds of thousands of servers keeping track of bitcoins. Each server in the Bitcoin network is called a node.

A Bitcoin node is basically an electronic bookkeeper, and anybody in the world can set up and run one. Each node has a complete copy of the public ledger – that’s a record of every Bitcoin transaction that ever happened, in history, all the way back to the very beginning of Bitcoin. As of today, the public ledger contains more than 30 million transactions and requires 13 GB of disk space.

To actually use bitcoins, you need some kind of device which functions as a wallet. It can be an application running on your computer, a mobile app, a service offered by a website, or something else entirely. Your wallet can add a transaction to the public ledger by informing a single node on the Bitcoin network. That node will relay the transaction to other nodes, which will relay it to others, and so on – similar to the way BitTorrent works. It only takes about 7 seconds for a transaction to propagate across the entire Bitcoin network.

What’s In a Transaction?

By now, it should be apparent that when you “send” bitcoins to another person, you aren’t really sending anything directly to that person. Instead, your wallet reassigns those bitcoins, from one owner to another, by adding a transaction to the public ledger. For instance, here’s a set of three transactions that took place in December last year:

As you can see, every transaction has a set of inputs and a set of outputs. The inputs identify which bitcoins are being spent, and the outputs assign those bitcoins to their new owners. Each input is just a digitally signed reference to some output from a previous transaction. Once an output is spent by a subsequent input, no other transaction can spend that output again. That’s what makes bitcoins impossible to copy.
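
As a rough mental model only (this is not Bitcoin’s actual wire format, and real transactions use scripts, as noted at the end of this post), the structure described above could be sketched like this:

struct Input {
    std::array<uint8_t, 32> previousTxId;   // which earlier transaction the bitcoins come from
    uint32_t previousOutputIndex;           // which output of that transaction is being spent
    std::vector<uint8_t> signature;         // digital signature authorizing the spend
};

struct Output {
    uint64_t amountSatoshis;                // amount, in hundred-millionths of a bitcoin
    std::string address;                    // the new owner's Bitcoin address
};

struct Transaction {
    std::vector<Input> inputs;
    std::vector<Output> outputs;
};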

Each unspent output represents some amount of bitcoins that are currently in someone’s possession. If you add up all unspent outputs on the public ledger, you’ll get the same total amount as there are bitcoins in existence. You could even go so far as to say that the unspent outputs are the bitcoins.

Note that nobody’s real name appears anywhere within a transaction. That’s why Bitcoin is often said to be pseudonymous. Instead of real names, bitcoins are assigned to addresses such as 1PreshX6QrHmsWbSs8pHpz6kLRcj9kdPy6. A Bitcoin address is like a numbered bank account, only much easier to create, and each person can have a potentially unlimited number of them.

Where Do Addresses Come From?

Obviously, if you want to receive bitcoins, you need to come up with a Bitcoin address. Your wallet can do this for you.

In order to generate an address, your wallet first generates a private key. A private key is nothing but a large pseudorandom number roughly between 1 and 2^256. To make such numbers shorter to write, it’s customary to encode them as a sequence of numbers and letters.

Next, your wallet converts that private key to a Bitcoin address using a well-known function. This function is very straightforward for a computer to perform. If anyone knows your private key, they could easily convert it to a Bitcoin address, too. In fact, many Bitcoin wallets have a feature allowing you to import private keys.

On the other hand, it’s extremely difficult to go the other way. If someone knows only your Bitcoin address, it’s virtually impossible to figure out what the private key was.

It’s perfectly safe to give your Bitcoin addresses to other people, but extremely important to keep your private keys secret. Most of the time, as a Bitcoin user, you’ll never even see your own private keys. Typically, your wallet keeps track of your private keys for you, usually by storing them in an encrypted wallet file, either on your hard drive, on a server on the Internet, or elsewhere.

How Are Transactions Authorized?

That brings us to why it’s important to keep your private keys secret: Your private keys give you the ability to spend the bitcoins you’ve received.

To see how, take a closer look at the second transaction in the above listing, b6f4ec453a021ac561…. This transaction spends the bitcoins from previous output e14768c1d648b98a52…:0. When we examine that previous output, we see that those bitcoins were previously sent to the address 1NqUaJrFeStshjad1bhrEFFzWSQw6JHbqv. It stands to reason that this transaction should be authorized by whoever generated that address in the first place.

That’s where the digital signature comes in. In Bitcoin, a valid digital signature serves as proof that the transaction was authorized by the address’s owner. Here’s what makes it safe: Just as a private key was required to generate that address, the same private key is required, once again, to generate a valid digital signature.

A digital signature is only valid if a specific equation is satisfied by the address, the previous output and the signature. As you’d expect, every time a Bitcoin node receives a new transaction, it checks to make sure each digital signature is valid. The node has no idea which private key was used to generate each signature, but that’s OK, because it doesn’t need to know. It only needs to verify that the equation is satisfied.

The concept of digital signatures is based on an old idea known as public-key cryptography. Bitcoin is not the first digital currency to secure transactions using such cryptography, but it is the first to do so without relying on a single, centralized server. That’s the breakthrough at the heart of Bitcoin.

Some Wallet Types

There’s already a wide range of Bitcoin wallets to choose from, but in most cases, the purposes of the wallet are the same:

  • To store your private keys.
  • To send bitcoins to other people.
  • To generate addresses, so you can receive bitcoins from other people.
  • To view your transaction history and current balance.

A desktop wallet is an application you install on Windows, MacOS or Linux. Examples include Electrum, Multibit and Bitcoin-Qt. Your private keys are stored locally, in a file somewhere on your hard drive such as wallet.dat, and the security of your bitcoins is only as good as your ability to protect that file from data loss and theft. The exception is Bitcoin-Qt, which also turns your computer into a Bitcoin node, and therefore requires much more disk space and bandwidth than the other applications.

There are also web wallets such as Coinbase or Blockchain.info’s My Wallet service. When using a web wallet, your private keys are stored – usually encrypted – on the website’s servers instead of your own hard drive. Some web wallets are also Bitcoin exchanges, such as Bitstamp or Virtex, where bitcoins can be traded for US dollars and other currencies.

A mobile wallet is an app you install on a smartphone or tablet. Many mobile wallets, such as the Coinbase app, are simply interfaces to a web wallet, which means that your private keys are once again stored online. One notable exception is Bitcoin Wallet for Android, which stores private keys directly on your mobile device. Currently, Apple forbids mobile apps from sending bitcoins, which renders iOS wallets rather useless.

Since virtually all smartphones have a built-in camera, QR codes have become a popular way to communicate Bitcoin addresses. You can send bitcoins to someone by scanning their QR code with your mobile wallet.

QR codes can be dynamically generated, too, and can include additional information such as the exact quantity of bitcoin the recipient expects to receive. This allows some mobile wallets to function as point-of-sale devices.

There is even such a thing as a paper wallet. A paper wallet serves just one purpose: to store your private keys. Some of the above wallets can print your private keys for you, but you can also generate a brand new private key and address pair without an existing wallet. Bitaddress.org offers one such service. Once your private keys are printed, you can lock them away in a safe place, such as a safety deposit box, and wipe them from your computer. This is referred to as cold storage.

Bitcoin’s protocol is completely open, allowing anyone to implement an application or device that is compatible with Bitcoin. In effect, the entire Bitcoin ecosystem has been crowdsourced. This has sparked all sorts of innovation – there are many more Bitcoin wallet and point-of-sale devices than the ones I’ve mentioned here, and there are certainly more to come. The landscape is changing constantly.

Unfortunately, this openness also creates an opportunity for scammers. That’s why you must choose your wallet carefully. Do some research before deciding that your wallet provider is trustworthy. In the case of a web wallet, you must trust that website with your bitcoins exactly as you would trust a bank with your cash. A desktop wallet lets you be your own bank, but you must still trust that the desktop wallet actually functions as advertised. It helps to choose a wallet that is open source, is actually built from that source, and has a good reputation on discussion forums such as the Bitcoin subreddit and bitcointalk.org.

Are Bitcoins Backed By Anything Valuable?

If you’re looking for something of value behind Bitcoin, I’d argue that it’s the private keys. Certainly, every fraction of a bitcoin is backed by one. And once bitcoins are sent to an address, the corresponding private key becomes extremely valuable.

Consider the Bitcoin address 1933phfhK3ZgFQNLGSDXvqCn32k2buXY8a. It currently contains 111,114 bitcoins. At the time of writing, that’s worth about $90 million US. If you only knew the private key for that address, you could spend that $90 million as if it were yours.

But you don’t know it. Nobody knows that private key, except for the owner of 1933phfhK3ZgFQNLGSDXvqCn32k2buXY8a, whoever they are. Maybe the private key is stored in an encrypted file on a USB stick somewhere. Maybe it’s printed on a paper wallet hidden inside a vault. Wherever that private key is, I’ll bet it’s guarded carefully, because it’s highly valuable.

Suffice to say, thieves around the world are itching to discover people’s private keys. A new breed of wallet-stealing malware has already emerged. Safeguarding your wallet is an important subject I won’t cover here, except to say that you must never, ever forget the password to an encrypted wallet, and you must practice impeccable computer hygiene to ensure that no spyware or malware infects any computer that is used to access a Bitcoin wallet.

Summary

So, when you buy bitcoins… what is it that you own, exactly?

When you own bitcoins, what you have is the exclusive ability to add specific transactions to the public ledger. Your bitcoins exist as unspent outputs from previous transactions on the ledger, sent to an address that your wallet created out of thin air, waiting for you to use as inputs to a future transaction. Your wallet is the only wallet that can digitally sign those inputs, because it contains a private key that no one else has.

The entire public ledger is stored, redundantly, on hundreds of thousands of nodes scattered across the globe. As you can imagine, shutting down Bitcoin at this point would be very difficult indeed.

Finally, there are many simplifications of Bitcoin floating around the web. In Bitcoin’s introductory video, you see one user flipping a stack of coins directly to another. In mainstream news articles about Bitcoin, there’s usually a stock photo of a physical coin with the Bitcoin logo on it. By now, you should understand that such illustrations are largely symbolic. They don’t depict the way bitcoins are actually stored and transferred in practice. (While some physical bitcoins do exist, and are similar in concept to paper wallets, such items are novelties.)

Further Information

Now that you know exactly what a bitcoin is, how it’s secured, and how it’s transferred, you have a foundation for understanding further information about Bitcoin.

  • Bitcoin’s public ledger is also known as the blockchain. Since the blockchain is totally public, people have built websites to interactively browse its contents, such as Blockchain.info and BlockExplorer.
  • I’ve intentionally left out any details about the mining process and its role in extending the blockchain. It’s simply not profitable to mine bitcoins using ordinary computing hardware anymore. Therefore, mining is not the first concern to novice Bitcoin users. If you’re interested in the state of the art, check into ASIC miners and how to join a mining pool.
  • Khan Academy has a fairly comprehensive series of videos diving into the guts of Bitcoin.
  • There’s an insightful and entertaining talk, dated Sep. 25, 2013, with Andreas Antonopoulos, a recognized Bitcoin expert.
  • The original white paper by Satoshi Nakamoto is what started it all. It’s an accessible read for those with a background in computer science.
  • Almost all other cryptocurrencies in existence, including Litecoin, Peercoin, Namecoin, Dogecoin and all those listed on CoinWarz, are cloned and derived from the reference Bitcoin implementation on GitHub.
  • This post presented a simplified description of a Bitcoin transaction. Real Bitcoin transactions are based on scripts, which allow various other kinds of financial instruments to exist on the Bitcoin platform.

Have any other information you think would be of interest to a novice Bitcoin user? Any corrections or feedback about this post? Feel free to comment!

The Purpose of memory_order_consume in C++11

In the C++11 standard atomic library, most functions accept a memory_order argument:

enum memory_order {
    memory_order_relaxed,
    memory_order_consume,
    memory_order_acquire,
    memory_order_release,
    memory_order_acq_rel,
    memory_order_seq_cst
};

The above values are referred to as memory ordering constraints. Each of them has its intended purpose. Among them, memory_order_consume is probably the least well-understood. It’s the most complicated ordering constraint, and it offers the least reward for using it correctly. Nonetheless, there it is, tempting the curious programmer to make sense of it – if only to unlock its dark, mysterious secrets. That’s exactly what this post aims to do.

First, let’s get our terminology straight: An operation that uses memory_order_consume is said to have consume semantics. We call such operations consume operations.

Perhaps the most valuable observation about memory_order_consume is that you can always safely replace it with memory_order_acquire. That’s because acquire operations provide all of the guarantees of consume operations, and then some. In other words, acquire is stronger.

Both consume and acquire serve the same purpose: To help pass non-atomic information safely between threads. And just like acquire operations, a consume operation must be combined with a release operation in another thread. The main difference is that there are fewer cases where consume operations are legal. In return for the inconvenience, consume operations are meant to be more efficient on some platforms. I’ll illustrate all of these points using an example.

A Quick Recap of Acquire and Release Semantics

This example will begin by passing a small amount of data between threads using acquire and release semantics. Later, we’ll modify it to use consume semantics instead.

First, let’s declare two shared variables. Guard is a C++11 atomic integer, while Payload is just a plain int. Both variables are initially zero.

atomic<int> Guard(0);
int Payload = 0;

The main thread sits in a loop, repeatedly attempting the following sequence of reads. Basically, the purpose of Guard is to protect access to Payload using acquire semantics. The main thread won’t attempt to read from Payload until Guard is non-zero.

g = Guard.load(memory_order_acquire);
if (g != 0)
    p = Payload;

At some point, an asynchronous task (running in another thread) comes along, assigns 42 to Payload, then sets Guard to 1 with release semantics.

Payload = 42;
Guard.store(1, memory_order_release);

Readers should be familiar with this pattern by now; we’ve seen it several times before on this blog. Once the asynchronous task writes to Guard, and the main thread reads it, it means that the write-release synchronized-with the read-acquire. In that case, we are guaranteed that p will equal 42, no matter what platform we run this example on.

Here, we’ve used acquire and release semantics to pass a simple non-atomic integer Payload between threads, but the pattern works equally well with larger payloads, as demonstrated in previous posts.
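
For instance, here’s a sketch of the same pattern with a hypothetical struct as the payload (the names here are made up for illustration):

struct Message {            // plain, non-atomic payload
    int id;
    int value;
};

Message g_message;
atomic<int> g_ready(0);     // guard variable

// Producer thread:
void publishMessage() {
    g_message.id = 1;
    g_message.value = 42;
    g_ready.store(1, memory_order_release);        // publish the payload
}

// Consumer thread:
bool tryReadMessage(Message& out) {
    if (g_ready.load(memory_order_acquire) != 0) { // synchronizes-with the store above
        out = g_message;                           // safe: all payload writes are visible
        return true;
    }
    return false;
}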

The Cost of Acquire Semantics

To measure the cost of memory_order_acquire, I compiled and ran the above example on three different multicore systems. For each architecture, I chose the compiler with the best available support for C++11 atomics. You’ll find the complete source code on GitHub.

Let’s look at the resulting machine code around the read-acquire:

g = Guard.load(memory_order_acquire);
if (g != 0)
    p = Payload;

Intel x86-64

On Intel x86-64, the Clang compiler generates compact machine code for this example – one machine instruction per line of C++ source code. This family of processors features a strong memory model, so the compiler doesn’t need to emit special memory barrier instructions to implement the read-acquire. It just has to keep the machine instructions in the correct order.

PowerPC

PowerPC is a weakly-ordered CPU, which means that the compiler must emit memory barrier instructions to guarantee acquire semantics on multicore systems. In this case, GCC used a sequence of three instructions, cmp;bne;isync, as recommended here. (A single lwsync instruction would have done the job, too.)

ARMv7

ARM is also a weakly-ordered CPU, so again, the compiler must emit memory barrier instructions to guarantee acquire semantics on multicore. On ARMv7, dmb ish is the best available instruction, despite being a full memory barrier.

Here are the timings for each iteration of our example’s main loop, running on the test machines shown above:

On PowerPC and ARMv7, the memory barrier instructions impose a performance penalty, but they are necessary for this example to work. In fact, if you delete all dmb ish instructions from the ARMv7 machine code, but leave everything else the same, memory reordering can be directly observed on the iPhone 4S.

Data Dependency Ordering

Now, I’ve said that PowerPC and ARM are weakly-ordered CPUs, but in fact, there are some cases where they do enforce memory ordering at the machine instruction level without the need for explicit memory barrier instructions. Specifically, these processors always preserve memory ordering between data-dependent instructions.

Two machine instructions, executed in the same thread, are data-dependent whenever the first instruction outputs a value and the second instruction uses that value as input. The value could be written to a register, as in the following PowerPC listing. Here, the first instruction loads a value into r9, and the second instruction treats r9 as a pointer for the next load:

Because there is a data dependency between these two instructions, the loads will be performed in-order.

You may think that’s obvious. After all, how can the second instruction know which address to load from before the first instruction loads r9? Obviously, it can’t. Keep in mind, though, that it’s also possible for the load instructions to read from different cache lines. If another CPU core is modifying memory concurrently, and the second instruction’s cache line is not as up-to-date as the first, that would result in memory reordering, too! PowerPC goes the extra mile to avoid that, keeping each cache line fresh enough to ensure data dependency ordering is always preserved.

Data dependencies are not only established through registers; they can also be established through memory locations. In this listing, the first instruction writes a value to memory, and the second instruction reads that value back, establishing a data dependency between the two:

When multiple instructions are data-dependent on each other, we call it a data dependency chain. In the following PowerPC listing, there are two independent data dependency chains:

Data dependency ordering guarantees that all memory accesses performed along a single chain will be performed in-order. For example, in the above listing, memory ordering between the first and last loads of one chain will be preserved, and memory ordering between the first and last loads of the other chain will be preserved. On the other hand, no guarantees are made about independent chains! So, a load from the first chain could still effectively happen after any of the loads from the second chain.

There are other processor families that preserve data dependency ordering, too. Itanium, PA-RISC, SPARC (in RMO mode) and zSeries all respect data dependency ordering at the machine instruction level. In fact, the only known weakly-ordered processor that does not preserve data dependency ordering is the DEC Alpha.

It goes without saying that strongly-ordered CPUs, such as Intel x86, x86-64 and SPARC (in TSO mode), also respect data dependency ordering.

Consume Semantics Are Designed to Exploit That

When you use consume semantics, you’re basically trying to make the compiler exploit data dependencies on all those processor families. That’s why, in general, it’s not enough to simply change memory_order_acquire to memory_order_consume. You must also make sure there are data dependency chains at the C++ source code level.

At the source code level, a dependency chain is a sequence of expressions whose evaluations all carry-a-dependency to one another. Carries-a-dependency is defined in §1.10.9 of the C++11 standard. For the most part, it just says that one evaluation carries-a-dependency to another if the value of the first is used as an operand of the second. It’s kind of like the language-level version of a machine-level data dependency. (There is actually a strict set of conditions for what constitutes carrying-a-dependency in C++11 and what does not, but I won’t go into the details here.)
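
For example, using some hypothetical variables, the following reads are all part of one source-level dependency chain, while a direct read of an unrelated global is not:

atomic<int*> DepGuard(nullptr);     // hypothetical guard pointer
int SomeGlobal = 0;                 // hypothetical unrelated global

void reader(int& a, int& b, int& c) {
    int* p = DepGuard.load(memory_order_consume);
    if (p != nullptr) {
        a = *p;          // carries-a-dependency: the loaded value of p feeds this read
        b = *p + 1;      // still dependent: arithmetic on a dependent value
        c = SomeGlobal;  // not dependent on p: consume alone does not order this read
    }
}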

Now, let’s go ahead and modify the original example to use consume semantics. First, we’ll change the type of Guard from atomic<int> to atomic<int*>:

atomic<int*> Guard(nullptr);
int Payload = 0;

We do that because, in the asynchronous task, we want to store a pointer to Payload to indicate that the payload is ready:

Payload = 42;
Guard.store(&Payload, memory_order_release);

Finally, in the main thread, we replace memory_order_acquire with memory_order_consume, and we load p indirectly, via the pointer obtained from g. Loading from g, rather than reading directly from Payload, is key! It makes the first line of code carry-a-dependency to the third line, which is crucial to using consume semantics correctly in this example:

g = Guard.load(memory_order_consume);
if (g != nullptr)
    p = *g;

You can view the complete source code on GitHub.

Now, this modified example works every bit as reliably as the original example. Once the asynchronous task writes to Guard, and the main thread reads it, the C++11 standard guarantees that p will equal 42, no matter what platform we run it on. The difference is that, this time, we don’t have a synchronizes-with relationship anywhere. What we have this time is called a dependency-ordered-before relationship.

In any dependency-ordered-before relationship, there’s a dependency chain starting at the consume operation, and all memory operations performed before the write-release are guaranteed to be visible to that chain.

The Value of Consume Semantics

Now, let’s take a look at some machine code for our modified example using consume semantics.

Intel x86-64

This machine code loads Guard into register rcx, then, if rcx is not null, uses rcx to load the payload, thus creating a data dependency between the two load instructions. The data dependency doesn’t really make a difference here, though. x86-64’s strong memory model already guarantees that loads are performed in-order, even if there isn’t a data dependency.

PowerPC

This machine code loads Guard into register r9, then uses r9 to load the payload, thus creating a data dependency between the two load instructions. And it helps – this data dependency lets us completely avoid the cmp;bne;isync sequence of instructions that formed a memory barrier in the original example, while still ensuring that the two loads are performed in-order.

ARMv7

This machine code loads Guard into register r4, then uses r4 to load the payload, thus creating a data dependency between the two load instructions. This data dependency lets us completely avoid the dmb ish instruction that was present in the original example, while still ensuring that the two loads are performed in-order.

Finally, here are new timings for each iteration of the main loop, using the assembly listings I just showed you:

Unsurprisingly, consume semantics make little difference on Intel x86-64, but they make a huge difference on PowerPC and a significant difference on ARMv7, by allowing us to eliminate costly memory barriers. Keep in mind, of course, that these are microbenchmarks. In a real application, the performance gain would depend on the frequency of acquire operations being performed.

One real-world example of a codebase that uses this technique – exploiting data dependency ordering to avoid memory barriers – is the Linux kernel. Linux provides an implementation of read-copy-update (RCU), which is suitable for building data structures that are read frequently from multiple threads, but modified infrequently. As of this writing, however, Linux doesn’t actually use C++11 (or C11) consume semantics to eliminate those memory barriers. Instead, it relies on its own API and conventions. Indeed, RCU served as motivation for adding consume semantics to C++11 in the first place.

Today’s Compiler Support is Lacking

I have a confession to make. Those assembly code listings I just showed you for PowerPC and ARMv7? Those were fabricated. Sorry, but GCC 4.8.3 and Clang 4.6 don’t actually generate that machine code for consume operations! I know, it’s a little disappointing. But the goal of this post was to show you the purpose of memory_order_consume. Unfortunately, the reality is that today’s compilers do not yet play along.

You see, compilers have a choice of two strategies for implementing memory_order_consume on weakly-ordered processors: an efficient strategy and a lazy one. The efficient strategy is the one described in this post. If the processor respects data dependency ordering, the compiler can refrain from emitting memory barrier instructions, as long as it outputs a machine-level dependency chain for each source-level dependency chain that begins at a consume operation. In the lazy strategy, the compiler simply treats memory_order_consume as if it were memory_order_acquire, and ignores dependency chains altogether.

Current versions of GCC and Clang/LLVM use the lazy strategy, all the time. As a result, if you compile memory_order_consume for PowerPC or ARMv7 using today’s compilers, you’ll end up with unnecessary memory barrier instructions, which defeats the whole point.

It turns out that it’s difficult for compiler writers to implement the efficient strategy while adhering to the letter of the current C++11 specification. There are some proposals being put forth to improve the specification, with the goal of making it easier for compilers to support. I won’t go into the details here; that’s a whole other potential blog post.

If compilers did implement the efficient strategy, I can think of a few use cases (besides RCU) where consume semantics might pay off with modest performance gains: lazy initialization via double-checked locking, lock-free hash tables with non-trivial types, and lock-free stacks. Mind you, these gains will only be realized on specific processor families. Nonetheless, I’ll probably continue to pepper this blog with examples using memory_order_consume, regardless of whether compilers actually implement it efficiently or not.

My Multicore Talk at CppCon 2014

Last month, I attended CppCon 2014 in Bellevue, Washington. It was an awesome conference, filled with the who’s who of C++ development, and loaded with interesting, relevant talks. It was a first-year conference, so I’m sure CppCon 2015 will be even better. I highly recommend it for any serious C++ developer.

While I was there, I gave a talk entitled, “How Ubisoft Montreal Develops Games For Multicore – Before and After C++11.” You can watch the whole thing here:

To summarize the talk:

  • At Ubisoft Montreal, we exploit multicore by building our game engines on top of three common threading patterns.
  • To implement those patterns, we need to write a lot of custom concurrent objects.
  • When a concurrent object is under heavy contention, we optimize it using atomic operations.
  • Game engines have their own portable atomic libraries. These libraries are similar to the C++11 atomic library’s “low level” functionality.

Most of the talk is spent exploring that last point: Comparing game atomics to low-level C++11 atomics.

There was a wide range of experience levels in the room, which was cool. Among the attendees were Michael Wong, CEO of OpenMP and C++ standard committee member, and Lawrence Crowl, who authored most of section 29, “Atomic operations library,” in the C++11 standard. Both of them chime in at various points. (I certainly wasn’t expecting to explain the standard to the guy who wrote it!)

You can download the slides here and grab the source code for the sample application here. A couple of corrections about certain points:

Compiler Ordering Around C++ Volatiles

At 24:05, I said that the compiler could have reordered some instructions on x86, leading to the same kind of memory reordering bug we saw at runtime on PowerPC, and that we were just lucky it didn’t.

However, I should acknowledge that in the previous console generation, the only x86 compiler we used at Ubisoft was Microsoft’s. Microsoft’s compiler is exceptional in that it does not perform those particular instruction reorderings on x86, because it treats volatile variables differently from other compilers, and m_writePos is volatile. That’s Microsoft’s default x86 behavior today, and it was its only x86 behavior back then. So in fact, the absence of compiler reordering was more than just luck: It was a vendor-specific guarantee. If we had used GCC or Clang, then we would have run the risk of compiler reordering in these two places.

Enforcing Correct Usage of Concurrent Objects

Throughout the talk, I keep returning to the example of a single-producer, single-consumer concurrent queue. For this queue to work correctly, you must follow the rules. In particular, it’s important not to call tryPush from multiple threads at the same time.

At 54:00, somebody asks if there’s a way to prevent coworkers from breaking such rules. My answer was to talk to them. At Ubisoft Montreal, the community of programmers playing with lock-free data structures is small, and we tend to know each other, so this answer is actually quite true for us. In many cases, the only person using a lock-free data structure is the one who implemented it.

But there was a better answer to his question, which I didn’t think of at the time: We can implement a macro that fires an assert when two threads enter the same function simultaneously. As it turns out, the tryPush and tryPop functions are two perfect candidates for it; a sketch of one possible implementation follows the listing below. This assert won’t prevent people from breaking the rules, but it will help catch errors earlier.

bool tryPush(const T& item)
{
    ASSERT_SINGLE_THREADED(m_pushDetector);
    int w = m_writePos.load(memory_order_relaxed);      // only the producer modifies m_writePos
    if (w >= size)                                      // queue is full
        return false;
    m_items[w] = item;
    m_writePos.store(w + 1, memory_order_release);      // publish the new item to the consumer
    return true;
}
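
Here is a minimal sketch of what such a macro might look like. The struct and class names below are placeholders I made up for illustration, not code from the talk; the idea is simply an atomic counter that is incremented on entry and decremented on exit, so if the previous value on entry wasn’t zero, another thread must currently be inside the same function.

#include <atomic>
#include <cassert>

struct SingleThreadedDetector
{
    std::atomic<int> entries;
    SingleThreadedDetector() : entries(0) {}
};

class SingleThreadedScope
{
public:
    SingleThreadedScope(SingleThreadedDetector& detector) : m_detector(detector)
    {
        // If another thread is already inside the guarded function, the previous value won't be 0.
        assert(m_detector.entries.fetch_add(1) == 0);
    }
    ~SingleThreadedScope()
    {
        m_detector.entries.fetch_sub(1);
    }
private:
    SingleThreadedDetector& m_detector;
};

// One detector per guarded function; one use of the macro per function body.
#define ASSERT_SINGLE_THREADED(detector) SingleThreadedScope singleThreadedScope(detector)

A real version would also compile away to nothing in builds where asserts are disabled.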

How to Install the Latest GCC on Windows

Several modern C++ features are currently missing from Visual Studio Express, and from the system GCC compiler provided with many of today’s Linux distributions. Generic lambdas – also known as polymorphic lambdas – are one such feature. This feature is, however, available in the latest versions of GCC and Clang.

The following guide will help you install the latest GCC on Windows, so you can experiment with generic lambdas and other cutting-edge C++ features. You’ll need to compile GCC from sources, but that’s not a problem. Depending on the speed of your machine, you can have the latest GCC up and running in as little as 15 minutes.

The steps are:

  1. Install Cygwin, which gives us a Unix-like environment running on Windows.
  2. Install a set of Cygwin packages required for building GCC.
  3. From within Cygwin, download the GCC source code, build and install it.
  4. Test the new GCC compiler in C++14 mode using the -std=c++14 option.

[Update: As a commenter points out, you can also install native GCC compilers from the MinGW-w64 project without needing Cygwin.]

1. Install Cygwin

First, download and run either the 32- or 64-bit version of the Cygwin installer, depending on your version of Windows. Cygwin’s setup wizard will walk you through a series of steps. If your machine is located behind a proxy server, make sure to check “Use Internet Explorer Proxy Settings” when you get to the “Select Your Internet Connection” step.

When you reach the “Select Packages” step, don’t bother selecting any packages yet. Just go ahead and click Next. We’ll add additional packages from the command line later.

After the Cygwin installer completes, it’s very important to keep the installer around. The installer is an executable named either setup-x86.exe or setup-x86_64.exe, and you’ll need it to add or remove Cygwin packages in the future. I suggest moving the installer to the same folder where you installed Cygwin itself; typically C:\cygwin or C:\cygwin64.

If you already have Cygwin installed, it’s a good idea to re-run the installer to make sure it has the latest available packages. Alternatively, you can install a new instance of Cygwin in a different folder.

2. Install Required Cygwin Packages

Next, you’ll need to add several packages to Cygwin. You can add them all in one fell swoop. Just open a Command Prompt (in Windows), navigate to the folder where the Cygwin installer is located, and run the following command:

setup-x86_64.exe -q -P wget -P gcc-g++ -P make -P diffutils -P libmpfr-devel -P libgmp-devel -P libmpc-devel

A window will pop up and download all the required packages along with their dependencies.

At this point, you already have a working GCC compiler on your system. It’s just not the latest version of GCC; it’s whatever version the Cygwin developers chose as their system compiler. At the time of writing, that’s GCC 4.8.3. To get a more recent version of GCC, you’ll have to compile it yourself, using the GCC compiler you already have.

3. Download, Build and Install the Latest GCC

Open a Cygwin terminal, either from the Start menu or by running Cygwin.bat from the Cygwin installation folder.

If your machine is located behind a proxy server, you must run the following command from the Cygwin terminal before proceeding – otherwise, wget won’t work. This step is not needed if your machine is directly connected to the Internet.

$ export http_proxy=$HTTP_PROXY ftp_proxy=$HTTP_PROXY

To download and extract the latest GCC source code, enter the following commands in the Cygwin terminal. If you’re following this guide at a later date, there will surely be a more recent version of GCC available. I used 4.9.2, but you can use any version you like. Keep in mind, though, that it’s always best to have the latest Cygwin packages installed when building the latest GCC. Be patient with the tar command; it takes several minutes.

$ wget http://ftpmirror.gnu.org/gcc/gcc-4.9.2/gcc-4.9.2.tar.gz
$ tar xf gcc-4.9.2.tar.gz

That will create a subdirectory named gcc-4.9.2. Next, we’ll configure our GCC build. As the GCC documentation recommends, it’s best to configure and build GCC in another directory outside gcc-4.9.2, so that’s what we’ll do.

$ mkdir build-gcc
$ cd build-gcc
$ ../gcc-4.9.2/configure --program-suffix=-4.9.2 --enable-languages=c,c++ --disable-bootstrap --disable-shared

Here’s a description of the command-line options passed to configure:

  • The --program-suffix=-4.9.2 option means that once our new GCC is installed, we’ll run it as g++-4.9.2. This will make it easier for the new GCC compiler to coexist alongside the system GCC compiler provided by Cygwin.

  • The --enable-languages=c,c++ option means that only the C and C++ compilers will be built. Compilers for other languages, such as Fortran, Java and Go, will be excluded. This will save compile time.

  • The --disable-bootstrap option means that we only want to build the new compiler once. If we don’t specify --disable-bootstrap, the new compiler will be built three times, for testing and performance reasons. However, the system GCC compiler (4.8.3) provided by Cygwin is pretty recent, so --disable-bootstrap is good enough for our purposes. This will save a significant amount of compile time.

  • The --disable-shared option means that we don’t want to build the new standard C++ runtime library as a DLL that’s shared with other C++ applications on the system. It’s totally possible to make C++ executables work with such DLLs, but it takes care not to introduce conflicts with C++ executables created by older or newer versions of GCC. That’s something distribution maintainers need to worry about; not us. Let’s just avoid the additional headache.

  • By default, the new version of GCC will be installed to /usr/local in Cygwin’s virtual filesystem. This will make it easier to launch the new GCC, since /usr/local/bin is already listed in Cygwin’s PATH environment variable. However, if you’re using an existing Cygwin installation, it might prove difficult to uninstall GCC from /usr/local later on (if you so choose), since that directory tends to contain files from several different packages. If you prefer to install the new GCC to a different directory, add the option --prefix=/path/to/directory to the above configure command.

We’re not going to build a new Binutils, which GCC relies on, because the existing Binutils provided by Cygwin is already quite recent. We’re also skipping a couple of packages, namely ISL and CLooG, which means that the new compiler won’t be able to use any of the Graphite loop optimizations.

Next, we’ll actually build the new GCC compiler suite, including C, C++ and the standard C++ library. This is the longest step.

$ make -j4

The -j4 option lets the build process spawn up to four child processes in parallel. If your machine’s CPU has at least four hardware threads, this option makes the build process run significantly faster. The main downside is that it jumbles the output messages generated during the build process. If your CPU has even more hardware threads, you can specify a higher number with -j. For comparison, I tried various numbers on a Xeon-based machine having 12 hardware threads and compared the resulting build times.

Be warned: I encountered a segmentation fault the first time I ran with -j4. Bad luck on my part. If that happens to you, running the same command a second time should allow the build process to finish successfully. Also, when specifying higher numbers with -j, there are often strange error messages at the end of the build process involving “jobserver tokens”, but they’re harmless.

Once that’s finished, install the new compiler:

$ make install
$ cd ..

This installs several executables to /usr/local/bin; it installs the standard C++ library’s include files to /usr/local/include/c++/4.9.2; and it installs the static standard C++ library to /usr/local/lib, among other things. Interestingly, it does not install a new standard C library! The new compiler will continue to use the existing system C library that came with Cygwin.

If, later, you decide to uninstall the new GCC compiler, you have several options:

  • If you installed GCC to a directory other than /usr/local, and that directory contains no other files, you can simply delete that directory.
  • If you installed GCC to /usr/local, and there are files from other packages mixed into the same directory tree, you can run the list_modifications.py script from this post to determine which files are safe to delete from /usr/local.
  • You can simply uninstall Cygwin itself, by deleting the C:\cygwin64 folder in Windows, along with its associated Start menu entry.

4. Test the New Compiler

All right, let’s compile some code that uses generic lambdas! Generic lambdas are part of the C++14 standard. They let you pass arguments to lambda functions as auto (or any templated type), like the one in the listing below. Create a file named test.cpp with the following contents:

#include <iostream>

int main()
{
    auto lambda = [](auto x){ return x; };
    std::cout << lambda("Hello generic lambda!\n");
    return 0;
}

You can add files to your home directory in Cygwin using any Windows-based text editor; just save them to the folder C:\cygwin64\home\Jeff (or similar) in Windows.

First, let’s see what happens when we try to compile it using the system GCC compiler provided by Cygwin:

$ g++ --version
$ g++ -std=c++1y test.cpp

If the system compiler version is less than 4.9, compilation will fail, since generic lambdas aren’t supported before GCC 4.9.

Now, let’s try it again using our freshly built GCC compiler. The new compiler is already configured to locate its include files in /usr/local/include/c++/4.9.2 and its static libraries in /usr/local/lib. All we need to do is run it:

$ g++-4.9.2 -std=c++14 test.cpp
$ ./a.exe

It works!

How to Build a GCC Cross-Compiler

GCC is not just a compiler. It’s an open source project that lets you build all kinds of compilers. Some compilers support multithreading; some support shared libraries; some support multilib. It all depends on how you configure the compiler before building it.

This guide will demonstrate how to build a cross-compiler, which is a compiler that builds programs for another machine. All you need is a Unix-like environment with a recent version of GCC already installed.

In this guide, I’ll use Debian Linux to build a full C++ cross-compiler for AArch64, a 64-bit instruction set available in the latest ARM processors. I don’t actually own an AArch64 device – I just wanted an AArch64 compiler to verify this bug.

Required Packages

Starting with a clean Debian system, you must first install a few packages:

$ sudo apt-get install g++ make gawk

Everything else will be built from source. Create a new directory somewhere, and download the following source packages. (If you’re following this guide at a later date, there will be more recent releases of each package available. Check for newer releases by pasting each URL into your browser without the filename. For example: http://ftpmirror.gnu.org/binutils/)

$ wget http://ftpmirror.gnu.org/binutils/binutils-2.24.tar.gz
$ wget http://ftpmirror.gnu.org/gcc/gcc-4.9.2/gcc-4.9.2.tar.gz
$ wget https://www.kernel.org/pub/linux/kernel/v3.x/linux-3.17.2.tar.xz
$ wget http://ftpmirror.gnu.org/glibc/glibc-2.20.tar.xz
$ wget http://ftpmirror.gnu.org/mpfr/mpfr-3.1.2.tar.xz
$ wget http://ftpmirror.gnu.org/gmp/gmp-6.0.0a.tar.xz
$ wget http://ftpmirror.gnu.org/mpc/mpc-1.0.2.tar.gz
$ wget ftp://gcc.gnu.org/pub/gcc/infrastructure/isl-0.12.2.tar.bz2
$ wget ftp://gcc.gnu.org/pub/gcc/infrastructure/cloog-0.18.1.tar.gz

The first four packages – Binutils, GCC, the Linux kernel and Glibc – are the main ones. We could have installed the next three packages in binary form using our system’s package manager instead, but that tends to provide older versions. The last two packages, ISL and CLooG, are optional, but they enable a few more optimizations in the compiler we’re about to build.

How The Pieces Fit Together

By the time we’re finished, we will have built each of the following programs and libraries. First, we’ll build the tools on the left, then we’ll use those tools to build the programs and libraries on the right. We won’t actually build the target system’s Linux kernel, but we do need the kernel header files in order to build the target system’s standard C library.

The compilers on the left will invoke the assembler & linker as part of their job. All the other packages we downloaded, such as MPFR, GMP and MPC, will be linked into the compilers themselves.

The diagram on the right represents a sample program, a.out, running on the target OS, built using the cross compiler and linked with the target system’s standard C and C++ libraries. The standard C++ library makes calls to the standard C library, and the C library makes direct system calls to the AArch64 Linux kernel.

Note that instead of using Glibc as the standard C library implementation, we could have used Newlib, an alternative implementation. Newlib is a popular C library implementation for embedded devices. Unlike Glibc, Newlib doesn’t require a complete OS on the target system – just a thin hardware abstraction layer called Libgloss. Newlib doesn’t have regular releases; instead, you’re meant to pull the source directly from the Newlib CVS repository. One limitation of Newlib is that currently, it doesn’t seem to support building multithreaded programs for AArch64. That’s why I chose not to use it here.

Build Steps

Extract all the source packages.

$ for f in *.tar*; do tar xf $f; done

Create symbolic links from the GCC directory to some of the other directories. These five packages are dependencies of GCC, and when the symbolic links are present, GCC’s build script will build them automatically.

$ cd gcc-4.9.2
$ ln -s ../mpfr-3.1.2 mpfr
$ ln -s ../gmp-6.0.0 gmp
$ ln -s ../mpc-1.0.2 mpc
$ ln -s ../isl-0.12.2 isl
$ ln -s ../cloog-0.18.1 cloog
$ cd ..

Choose an installation directory, and make sure you have write permission to it. In the steps that follow, I’ll install the new toolchain to /opt/cross.

$ sudo mkdir -p /opt/cross
$ sudo chown jeff /opt/cross

Throughout the entire build process, make sure the installation’s bin subdirectory is in your PATH environment variable. You can remove this directory from your PATH later, but most of the build steps expect to find aarch64-linux-gcc and other host tools via the PATH by default.

$ export PATH=/opt/cross/bin:$PATH

Pay particular attention to the stuff that gets installed under /opt/cross/aarch64-linux/. This directory is considered the system root of an imaginary AArch64 Linux target system. A self-hosted AArch64 Linux compiler could, in theory, use all the headers and libraries placed here. Obviously, none of the programs built for the host system, such as the cross-compiler itself, will be installed to this directory.

1. Binutils

This step builds and installs the cross-assembler, cross-linker, and other tools.

$ mkdir build-binutils
$ cd build-binutils
$ ../binutils-2.24/configure --prefix=/opt/cross --target=aarch64-linux --disable-multilib
$ make -j4
$ make install
$ cd ..
  • We’ve specified aarch64-linux as the target system type. Binutils’s configure script will recognize that this target is different from the machine we’re building on, and configure a cross-assembler and cross-linker as a result. The tools will be installed to /opt/cross/bin, their names prefixed by aarch64-linux-.
  • --disable-multilib means that we only want our Binutils installation to work with programs and libraries using the AArch64 instruction set, and not any related instruction sets such as AArch32.

2. Linux Kernel Headers

This step installs the Linux kernel header files to /opt/cross/aarch64-linux/include, which will ultimately allow programs built using our new toolchain to make system calls to the AArch64 kernel in the target environment.

$ cd linux-3.17.2
$ make ARCH=arm64 INSTALL_HDR_PATH=/opt/cross/aarch64-linux headers_install
$ cd ..
  • We could even have done this before installing Binutils.
  • The Linux kernel header files won’t actually be used until step 6, when we build the standard C library, although the configure script in step 4 expects them to be already installed.
  • Because the Linux kernel is a different open-source project from the others, it has a different way of identifying the target CPU architecture: ARCH=arm64

All of the remaining steps involve building GCC and Glibc. The trick is that there are parts of GCC which depend on parts of Glibc already being built, and vice versa. We can’t build either package in a single step; we need to go back and forth between the two packages and build their components in a way that satisfies their dependencies.

3. C/C++ Compilers

This step will build GCC’s C and C++ cross-compilers only, and install them to /opt/cross/bin. It won’t invoke those compilers to build any libraries just yet.

$ mkdir -p build-gcc
$ cd build-gcc
$ ../gcc-4.9.2/configure --prefix=/opt/cross --target=aarch64-linux --enable-languages=c,c++ --disable-multilib
$ make -j4 all-gcc
$ make install-gcc
$ cd ..
  • Because we’ve specified --target=aarch64-linux, the build script looks for the Binutils cross-tools we built in step 1 with names prefixed by aarch64-linux-. Likewise, the C/C++ compiler names will be prefixed by aarch64-linux-.
  • --enable-languages=c,c++ prevents other compilers in the GCC suite, such as Fortran, Go or Java, from being built.

4. Standard C Library Headers and Startup Files

In this step, we install Glibc’s standard C library headers to /opt/cross/aarch64-linux/include. We also use the C compiler built in step 3 to compile the library’s startup files and install them to /opt/cross/aarch64-linux/lib. Finally, we create a couple of dummy files, libc.so and stubs.h, which are expected in step 5, but which will be replaced in step 6.

$ mkdir -p build-glibc
$ cd build-glibc
$ ../glibc-2.20/configure --prefix=/opt/cross/aarch64-linux --build=$MACHTYPE --host=aarch64-linux --target=aarch64-linux --with-headers=/opt/cross/aarch64-linux/include --disable-multilib libc_cv_forced_unwind=yes
$ make install-bootstrap-headers=yes install-headers
$ make -j4 csu/subdir_lib
$ install csu/crt1.o csu/crti.o csu/crtn.o /opt/cross/aarch64-linux/lib
$ aarch64-linux-gcc -nostdlib -nostartfiles -shared -x c /dev/null -o /opt/cross/aarch64-linux/lib/libc.so
$ touch /opt/cross/aarch64-linux/include/gnu/stubs.h
$ cd ..
  • --prefix=/opt/cross/aarch64-linux tells Glibc’s configure script where it should install its headers and libraries. Note that it’s different from the usual --prefix.
  • Despite some contradictory information out there, Glibc’s configure script currently requires us to specify all three --build, --host and --target system types.
  • $MACHTYPE is a predefined environment variable which describes the machine running the build script. --build=$MACHTYPE is needed because in step 6, the build script will compile some additional tools which run as part of the build process itself.
  • --host has a different meaning here than we’ve been using so far. In Glibc’s configure, both the --host and --target options are meant to describe the system on which Glibc’s libraries will ultimately run.
  • We install the C library’s startup files, crt1.o, crti.o and crtn.o, to the installation directory manually. There doesn’t seem to be a make rule that does this without other side effects.

5. Compiler Support Library

This step uses the cross-compilers built in step 3 to build the compiler support library. The compiler support library contains some C++ exception handling boilerplate code, among other things. This library depends on the startup files installed in step 4. The library itself is needed in step 6. Unlike some other guides, we don’t need to re-run GCC’s configure. We’re just building additional targets in the same configuration.

$ cd build-gcc
$ make -j4 all-target-libgcc
$ make install-target-libgcc
$ cd ..
  • Two static libraries, libgcc.a and libgcc_eh.a, are installed to /opt/cross/lib/gcc/aarch64-linux/4.9.2/.
  • A shared library, libgcc_s.so, is installed to /opt/cross/aarch64-linux/lib64.

6. Standard C Library

In this step, we finish off the Glibc package, which builds the standard C library and installs its files to /opt/cross/aarch64-linux/lib/. The static library is named libc.a and the shared library is libc.so.

$ cd build-glibc
$ make -j4
$ make install
$ cd ..

7. Standard C++ Library

Finally, we finish off the GCC package, which builds the standard C++ library and installs it to /opt/cross/aarch64-linux/lib64/. It depends on the C library built in step 6. The resulting static library is named libstdc++.a and the shared library is libstdc++.so.

$ cd build-gcc
$ make -j4
$ make install
$ cd ..

Dealing with Build Errors

If you encounter any errors during the build process, there are three possibilities:

  1. You’re missing a required package or tool on the build system.
  2. You’re attempting to perform the build steps in an incorrect order.
  3. You’ve done everything right, but something is just broken in the configuration you’re attempting to build.

You’ll have to examine the build logs to determine which case applies. GCC supports a lot of configurations, and some of them may not build right away. The less popular a configuration is, the greater the chance of it being broken. GCC, being an open source project, depends on contributions from its users to keep each configuration working.

Automating the Above Steps

I’ve written a small bash script named build_cross_gcc to perform all of the above steps. You can find it on GitHub. On my Core 2 Quad Q9550 Debian machine, it takes 13 minutes from start to finish. Customize it to your liking before running.

build_cross_gcc also supports Newlib configurations. When you build a Newlib-based cross-compiler, steps 4, 5 and 6 above can be combined into a single step. (Indeed, that’s what many existing guides do.) For Newlib support, edit the script options as follows:

TARGET=aarch64-elf
USE_NEWLIB=1
CONFIGURATION_OPTIONS="--disable-multilib --disable-threads"

Another way to build a GCC cross-compiler is using a combined tree, where the source code for Binutils, GCC and Newlib are merged into a single directory. A combined tree will only work if the intl and libiberty libraries bundled with GCC and Binutils are identical, which is not the case for the versions used in this post. Combined trees don’t support Glibc either, so it wasn’t an option for this configuration.

There are a couple of popular build scripts, namely crosstool-NG and EmbToolkit, which automate the entire process of building cross-compilers. I had mixed results using crosstool-NG, but it helped me make sense of the build process while putting together this guide.

Testing the Cross-Compiler

If everything built successfully, let’s check our cross-compiler for a dial tone:

$ aarch64-linux-g++ -v
Using built-in specs.
COLLECT_GCC=aarch64-linux-g++
COLLECT_LTO_WRAPPER=/opt/cross/libexec/gcc/aarch64-linux/4.9.2/lto-wrapper
Target: aarch64-linux
Configured with: ../gcc-4.9.2/configure --prefix=/opt/cross --target=aarch64-linux --enable-languages=c,c++ --disable-multilib
Thread model: posix
gcc version 4.9.2 (GCC)

We can compile the C++14 program from the previous post, then disassemble it:

$ aarch64-linux-g++ -std=c++14 test.cpp
$ aarch64-linux-objdump -d a.out
...
0000000000400830 <main>:
  400830:       a9be7bfd        stp     x29, x30, [sp,#-32]!
  400834:       910003fd        mov     x29, sp
  400838:       910063a2        add     x2, x29, #0x18
  40083c:       90000000        adrp    x0, 400000 <_init-0x618>
  ...

This was my first foray into building a cross-compiler. I basically wrote this guide to remember what I’ve learned. I think the above steps serve as a pretty good template for building other configurations; I used build_cross_gcc to build TARGET=powerpc-eabi as well. You can browse config.sub from any of the packages to see what other target environments are supported. Comments and corrections are more than welcome!


Fixing GCC's Implementation of memory_order_consume

As I explained previously, there are two valid ways for a C++11 compiler to implement memory_order_consume: an efficient strategy and a heavy one. In the heavy strategy, the compiler simply treats memory_order_consume as an alias for memory_order_acquire. The heavy strategy is not what the designers of memory_order_consume had in mind, but technically, it’s still compliant with the C++11 standard.

There’s a somewhat common misconception that all current C++11 compilers use the heavy strategy. I certainly had that impression until recently, and others I spoke to at CppCon 2014 seemed to have that impression as well.

This belief turns out to be not quite true: GCC does not always use the heavy strategy (yet). GCC 4.9.2 actually has a bug in its implementation of memory_order_consume, as described in this GCC bug report. I was rather surprised to learn that, since it contradicted my own experience with GCC 4.8.3, in which the PowerPC compiler appeared to use the heavy strategy correctly.

I decided to verify the bug on my own, which is why I recently took an interest in building GCC cross-compilers. This post will explain the bug and document the process of patching the compiler.

An Example That Illustrates the Compiler Bug

Imagine a bunch of threads repeatedly calling the following read function:

#include <atomic>

std::atomic<int> Guard(0);
int Payload[1] = { 0xbadf00d };

int read()
{
    int f = Guard.load(std::memory_order_consume);    // load-consume
    if (f != 0)
        return Payload[f - f];                        // plain load from Payload[f - f]
    return 0;
}

At some point, another thread comes along and calls write:

void write()
{
    Payload[0] = 42;                                  // plain store to Payload[0]
    Guard.store(1, std::memory_order_release);        // store-release
}

If the compiler is fully compliant with the current C++11 standard, then there are only two possible return values from read: 0 or 42. The outcome depends on the value seen by the load-consume highlighted above. If the load-consume sees 0, then obviously, read will return 0. If the load-consume sees 1, then according to the rules of the standard, the plain store Payload[0] = 42 must be visible to the plain load Payload[f - f], and read must return 42.

As I’ve already explained, memory_order_consume is meant to provide ordering guarantees that are similar to those of memory_order_acquire, only restricted to code that lies along the load-consume’s dependency chain at the source code level. In other words, the load-consume must carry-a-dependency to the source code statements we want ordered.

In this example, we are admittedly abusing C++11’s definition of carry-a-dependency by using f in an expression that cancels it out (f - f). Nonetheless, we are still technically playing by the standard’s current rules, and thus, its ordering guarantees should still apply.
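
For contrast, here is a sketch of the kind of dependency chain consume was designed for, where the guard is a pointer and the payload is reached through that pointer. The names here are mine, for illustration only:

#include <atomic>

struct Node { int value; };
std::atomic<Node*> GuardPtr(nullptr);

int readTypical()
{
    Node* p = GuardPtr.load(std::memory_order_consume);   // load-consume
    if (p != nullptr)
        return p->value;        // dependent load through the consumed pointer
    return 0;
}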

Compiling for AArch64

The compiler bug report mentions AArch64, a new 64-bit instruction set supported by the latest ARM processors. Conveniently enough, I described how to build a GCC cross-compiler for AArch64 in the previous post. Let’s use that cross-compiler to compile the above code and examine the assembly listing for read:

$ aarch64-linux-g++ -std=c++11 -O2 -S consumetest.cpp
$ cat consumetest.s

The machine code generated for read is flawed. AArch64 is a weakly-ordered CPU architecture that preserves data dependency ordering, and yet neither compiler strategy has been taken:

  • No heavy strategy: There is no memory barrier instruction between the load from Guard and the load from Payload[f - f]. The load-consume has not been promoted to a load-acquire.
  • No efficient strategy: There is no dependency chain connecting the two loads at the machine code level. In the generated listing, the load from Guard and the load from Payload[f - f] lie along separate machine-level dependency chains.

As a result, the processor is free to reorder the loads at runtime so that the second load sees an older value than the first. There is a very real possibility that read will return 0xbadf00d, the initial value of Payload[0], even though the C++ standard forbids it.

Patching the Cross-Compiler

Andrew Macleod posted a patch for this issue in the bug report. His patch adds the following lines near the end of the get_memmodel function in gcc/builtins.c:

  /* Workaround for Bugzilla 59448. GCC doesn't track consume properly, so
     be conservative and promote consume to acquire.  */
  if (val == MEMMODEL_CONSUME)
    val = MEMMODEL_ACQUIRE;

Let’s apply this patch and build a new cross-compiler.

$ cd gcc-4.9.2/gcc
$ wget -qO- https://gcc.gnu.org/bugzilla/attachment.cgi?id=33831 | patch
$ cd ../../build-gcc
$ make
$ make install
$ cd ..

Now let’s compile the same source code as before:

$ aarch64-linux-g++ -std=c++11 -O2 -S consumetest.cpp
$ cat consumetest.s

This time, the generated assembly is valid. The compiler now implements the load-consume from Guard using ldar, a new AArch64 instruction that provides acquire semantics. This instruction acts as a memory barrier on the load itself, ensuring that the load will be completed before all subsequent loads and stores (among other things). In other words, our AArch64 cross-compiler now implements the “heavy” strategy correctly.

This Bug Doesn’t Happen on PowerPC

Interestingly, if you compile the same example for PowerPC, there is no bug. This is using the same GCC version 4.9.2 without Andrew’s patch applied:

$ powerpc-linux-g++ -std=c++11 -O2 -S consumetest.cpp
$ cat consumetest.s

The PowerPC cross-compiler appears to implement the “heavy” strategy correctly, promoting consume to acquire and emitting the necessary memory barrier instructions. Why does the PowerPC cross-compiler work in this case, but not the AArch64 cross-compiler? One hint lies in GCC’s machine description (MD) files. GCC uses these MD files in its final stage of compilation, after optimization, when it converts its intermediate RTL format to a native assembly code listing. Among the AArch64 MD files, in gcc-4.9.2/gcc/config/aarch64/atomics.md, you’ll currently find the following:

    if (model == MEMMODEL_RELAXED
    || model == MEMMODEL_CONSUME
    || model == MEMMODEL_RELEASE)
      return "ldr<atomic_sfx>\t%<w>0, %1";
    else
      return "ldar<atomic_sfx>\t%<w>0, %1";

Meanwhile, among PowerPC’s MD files, in gcc-4.9.2/gcc/config/rs6000/sync.md, you’ll find:

  switch (model)
    {
    case MEMMODEL_RELAXED:
      break;
    case MEMMODEL_CONSUME:
    case MEMMODEL_ACQUIRE:
    case MEMMODEL_SEQ_CST:
      emit_insn (gen_loadsync_<mode> (operands[0]));
      break;

Based on the above, it seems that the AArch64 cross-compiler currently treats consume the same as relaxed at the final stage of compilation, whereas the PowerPC cross-compiler treats consume the same as acquire at the final stage. Indeed, if you move case MEMMODEL_CONSUME: one line earlier in the PowerPC MD file, you can reproduce the bug on PowerPC, too.
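
In other words, the PowerPC snippet would exhibit the same problem if it were edited to look like the following (a hypothetical change, shown only to illustrate the point):

  switch (model)
    {
    case MEMMODEL_RELAXED:
    case MEMMODEL_CONSUME:
      break;
    case MEMMODEL_ACQUIRE:
    case MEMMODEL_SEQ_CST:
      emit_insn (gen_loadsync_<mode> (operands[0]));
      break;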

Andrew’s patch appears to make all compilers treat consume the same as acquire at an earlier stage of compilation.

The Uncertain Future of memory_order_consume

It’s fair to call memory_order_consume an obscure subject, and the current status of GCC support reflects that. The C++ standard committee is wondering what to do with memory_order_consume in future revisions of C++.

My opinion is that the definition of carries-a-dependency should be narrowed to require that different return values from a load-consume result in different behavior for any dependent statements that are executed. Let’s face it: Using f - f as a dependency is nonsense, and narrowing the definition would free the compiler from having to support such nonsense “dependencies” if it chooses to implement the efficient strategy. This idea was first proposed by Torvald Riegel in the Linux Kernel Mailing List and is captured among various alternatives described in Paul McKenney’s proposal N4036.

C++ Has Become More Pythonic

C++ has changed a lot in recent years. The last two revisions, C++11 and C++14, introduce so many new features that, in the words of Bjarne Stroustrup, “It feels like a new language.”

It’s true. Modern C++ lends itself to a whole new style of programming – and you can’t help but feel Python’s influence on this new style. Range-based for loops, type deduction, vector and map initializers, lambda expressions. The more you explore modern C++, the more you find Python’s fingerprints all over it.

Was Python a direct influence on modern C++? Or did Python simply adopt a few common idioms before C++ got around to it? You be the judge.

Literals

Python introduced binary literals in 2008. Now C++14 has them:

static const unsigned primes = 0b10100000100010100010100010101100;

Python also introduced raw string literals back in 1998. They’re convenient when hardcoding a regular expression or a Windows path. C++ added them in C++11:

const char* path = R"(c:\this\string\has\backslashes)";

Range-Based For Loops

In Python, a for loop always iterates over a Python object:

for x in myList:
    print(x)

Meanwhile, for nearly three decades, C++ supported only C-style for loops. Finally, in C++11, range-based for loops were added:

for (int x : myList)
    std::cout << x;

You can iterate over a std::vector or any class which implements the begin and end member functions – not unlike Python’s iterator protocol. With range-based for loops, I often find myself wishing C++ had Python’s xrange function built-in.

Auto Type Deduction

Python has always been a dynamically typed language. You don’t need to declare variable types anywhere, since types are a property of the objects themselves.

x = "Hello world!"
print(x)

C++, on the other hand, is not dynamically typed; it’s statically typed. But since C++11 repurposed the auto keyword for type deduction, you can write code that looks a lot like dynamic typing:

auto x = "Hello world!";
std::cout << x;

When you call functions that are overloaded for several types, such as std::ostream::operator<< or a template function, C++ resembles a dynamically typed language even more. C++14 further fleshes out support for the auto keyword, adding support for auto return values and auto arguments to lambda functions.

Tuples

Python has had tuples pretty much since the beginning. They’re nice when you need to package several values together, but don’t feel like naming a class.

triple = (5, 6, 7)
print(triple[0])

C++ added tuples to the standard library in C++11. The proposal even mentions Python as an inspiration:

auto triple = std::make_tuple(5, 6, 7);
std::cout << std::get<0>(triple);

Python lets you unpack a tuple into separate variables:

x, y, z = triple

You can do the same thing in C++ using std::tie:

std::tie(x, y, z) = triple;

Uniform Initialization

In Python, lists are a built-in type. As such, you can create a Python list using a single expression:

myList = [6, 3, 7, 8]
myList.append(5)

C++’s std::vector is the closest analog to a Python list. Uniform initialization, new in C++11, now lets us create them using a single expression as well:

auto myList = std::vector<int>{ 6, 3, 7, 8 };
myList.push_back(5);

In Python, you can also create a dictionary with a single expression:

myDict = {5: "foo", 6: "bar"}
print(myDict[5])

Similarly, uniform initialization also works on C++’s std::map and unordered_map:

auto myDict = std::unordered_map<int, const char*>{ { 5, "foo" }, { 6, "bar" } };
std::cout << myDict[5];

Lambda Expressions

Python has supported lambda functions since 1994:

myList.sort(key = lambda x: abs(x))

Lambda expressions were added in C++11:

std::sort(myList.begin(), myList.end(), [](int x, int y){ return std::abs(x) < std::abs(y); });

In 2001, Python added statically nested scopes, which allow lambda functions to capture variables defined in enclosing functions:

def adder(amount):
    return lambda x: x + amount
...
print(adder(5)(5))

Likewise, C++ lambda expressions support a flexible set of capture rules, allowing you to do similar things:

auto adder(int amount) {
    return [=](int x){ return x + amount; };
}
...
std::cout << adder(5)(5);

Standard Algorithms

Python’s built-in filter function lets you selectively copy elements from a list (though list comprehensions are preferred):

result = filter(lambda x: x >= 0, myList)

C++11 introduces std::copy_if, which lets us use a similar, almost-functional style:

auto result = std::vector<int>{};
std::copy_if(myList.begin(), myList.end(), std::back_inserter(result), [](int x){ return x >= 0; });

The upcoming ranges proposal has the potential to simplify such expressions further. Other C++ algorithms that mimic Python built-ins include transform, any_of, all_of, min and max.
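
For instance, std::any_of plays a role similar to Python’s any built-in. A quick sketch, assuming the myList vector from above and the <algorithm> header:

bool hasNegative = std::any_of(myList.begin(), myList.end(), [](int x){ return x < 0; });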

Variable Arguments

Python has supported arbitrary argument lists since 1998. You can define a function taking a variable number of arguments, exposed as a tuple, and expand a tuple when passing arguments to another function:

def foo(*args):
    return tuple(args)
...
triple = foo(5, 6, 7)

C++11 adds support for parameter packs, which also let you accept a variable number of arguments and pass those arguments to other functions. One important difference: C++ parameter packs are not exposed as a single object at runtime. You can only manipulate them through template metaprogramming at compile time.

template <typename... T> auto foo(T&&... args) {
    return std::make_tuple(args...);
}
...
auto triple = foo(5, 6, 7);

Not all of the new C++11 and C++14 features mimic Python functionality, but it seems a lot of them do. Python is recognized as a friendly, approachable programming language. Perhaps some of its charisma has rubbed off?

What do you think? Do the new features succeed in making C++ simpler, more approachable or more expressive?

Double-Checked Locking is Fixed In C++11

The double-checked locking pattern (DCLP) is a bit of a notorious case study in lock-free programming. Up until 2004, there was no safe way to implement it in Java. Before C++11, there was no safe way to implement it in portable C++.

The pattern gained attention for the shortcomings it exposed in those languages, and people began to write about it. In 2000, a group of high-profile Java developers got together and signed a declaration entitled “Double-Checked Locking Is Broken”. In 2004, Scott Meyers and Andrei Alexandrescu published an article entitled “C++ and the Perils of Double-Checked Locking”. Both papers are great primers on what DCLP is, and why, at the time, those languages were inadequate for implementing it.

All of that’s in the past. Java now has a revised memory model, with new semantics for the volatile keyword, which makes it possible to implement DCLP safely. Likewise, C++11 has a shiny new memory model and atomic library which enable a wide variety of portable DCLP implementations. C++11, in turn, inspired Mintomic, a small library I released earlier this year which makes it possible to implement DCLP on some older C/C++ compilers as well.

In this post, I’ll focus on the C++ implementations of DCLP.

What Is Double-Checked Locking?

Suppose you have a class which implements the well-known Singleton pattern, and you want to make it thread-safe. The obvious approach is to ensure mutual exclusivity by adding a lock. That way, if two threads call Singleton::getInstance simultaneously, only one of them will create the singleton.

Singleton* Singleton::getInstance() {
    Lock lock;      // scope-based lock, released automatically when the function returns
    if (m_instance == NULL) {
        m_instance = new Singleton;
    }
    return m_instance;
}

It’s a totally valid approach, but once the singleton is created, there isn’t really any need for the lock anymore. Locks aren’t necessarily slow, but they don’t scale well under heavy contention.

The double-checked locking pattern avoids this lock when the singleton already exists. However, it’s not so simple, as the Meyers-Alexandrescu paper shows. In that paper, the authors describe several flawed attempts to implement DCLP in C++, dissecting each attempt to explain why it’s unsafe. Finally, on page 12, they show an implementation which is safe, but which depends on unspecified, platform-specific memory barriers.

Singleton* Singleton::getInstance() {
    Singleton* tmp = m_instance;
    ...                     // insert memory barrier
    if (tmp == NULL) {
        Lock lock;
        tmp = m_instance;
        if (tmp == NULL) {
            tmp = new Singleton;
            ...             // insert memory barrier
            m_instance = tmp;
        }
    }
    return tmp;
}

Here, we see where the double-checked locking pattern gets its name: We only take a lock when the singleton pointer m_instance is NULL, which serializes the first group of threads which happen to see that value. Once inside the lock, m_instance is checked a second time, so that only the first thread will create the singleton.

This is very close to a working implementation. It’s just missing some kind of memory barrier on the highlighted lines. At the time when the authors wrote the paper, there was no portable C/C++ function which could fill in the blanks. Now, with C++11, there is.

Using C++11 Acquire and Release Fences

You can safely complete the above implementation using acquire and release fences, a subject which I explained at length in my previous post. However, to make this code truly portable, you must also wrap m_instance in a C++11 atomic type and manipulate it using relaxed atomic operations. Here’s the resulting code, with the acquire and release fences highlighted.

std::atomic<Singleton*> Singleton::m_instance;
std::mutex Singleton::m_mutex;

Singleton* Singleton::getInstance() {
    Singleton* tmp = m_instance.load(std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_acquire);
    if (tmp == nullptr) {
        std::lock_guard<std::mutex> lock(m_mutex);
        tmp = m_instance.load(std::memory_order_relaxed);
        if (tmp == nullptr) {
            tmp = new Singleton;
            std::atomic_thread_fence(std::memory_order_release);
            m_instance.store(tmp, std::memory_order_relaxed);
        }
    }
    return tmp;
}

This works reliably, even on multicore systems, because the memory fences establish a synchronizes-with relationship between the thread which creates the singleton and any subsequent thread which skips the lock. Singleton::m_instance acts as the guard variable, and the contents of the singleton itself are the payload.

That’s what all those flawed DCLP implementations were missing: Without any synchronizes-with relationship, there was no guarantee that all the writes performed by the first thread – in particular, those performed in the Singleton constructor – were visible to the second thread, even if the m_instance pointer itself was visible! The lock held by the first thread didn’t help, either, since the second thread doesn’t acquire any lock, and can therefore run concurrently.

If you’re looking for a deeper understanding of how and why these fences make DCLP work reliably, there’s some background information in my previous post as well as in earlier posts on this blog.

Using Mintomic Fences

Mintomic is a small C library which provides a subset of functionality from C++11’s atomic library, including acquire and release fences, and which works on older compilers. Mintomic relies on the assumptions of the C++11 memory model – specifically, the absence of out-of-thin-air stores – which is technically not guaranteed by older compilers, but it’s the best we can do without C++11. Keep in mind that these are the circumstances in which we’ve written multithreaded C++ code for years. Out-of-thin-air stores have proven unpopular over time, and good compilers tend not to perform them.

Here’s a DCLP implementation using Mintomic’s acquire and release fences. It’s basically equivalent to the previous example using C++11’s acquire and release fences.

mint_atomicPtr_t Singleton::m_instance = { 0 };
mint_mutex_t Singleton::m_mutex;

Singleton* Singleton::getInstance() {
    Singleton* tmp = (Singleton*) mint_load_ptr_relaxed(&m_instance);
    mint_thread_fence_acquire();
    if (tmp == NULL) {
        mint_mutex_lock(&m_mutex);
        tmp = (Singleton*) mint_load_ptr_relaxed(&m_instance);
        if (tmp == NULL) {
            tmp = new Singleton;
            mint_thread_fence_release();
            mint_store_ptr_relaxed(&m_instance, tmp);
        }
        mint_mutex_unlock(&m_mutex);
    }
    return tmp;
}

To implement acquire and release fences, Mintomic tries to generate the most efficient machine code possible on every platform it supports. On Xbox 360, for example, which is based on PowerPC, an inline lwsync is the leanest instruction which can serve as both an acquire and release fence, so that’s what Mintomic emits there.

The previous C++11-based example could (and ideally, would) generate the exact same machine code for PowerPC when optimizations are enabled. Unfortunately, I don’t have access to a C++11-compliant PowerPC compiler to verify this.

Using C++11 Low-Level Ordering Constraints

C++11’s acquire and release fences can implement DCLP correctly, and should be able to generate optimal machine code on the majority of today’s multicore devices (as Mintomic does), but they’re not considered very fashionable. The preferred way to achieve the same effect in C++11 is to use atomic operations with low-level ordering constraints. As I’ve shown previously, a write-release can synchronize-with a read-acquire.

std::atomic<Singleton*> Singleton::m_instance;
std::mutex Singleton::m_mutex;

Singleton* Singleton::getInstance() {
    Singleton* tmp = m_instance.load(std::memory_order_acquire);
    if (tmp == nullptr) {
        std::lock_guard<std::mutex> lock(m_mutex);
        tmp = m_instance.load(std::memory_order_relaxed);
        if (tmp == nullptr) {
            tmp = new Singleton;
            m_instance.store(tmp, std::memory_order_release);
        }
    }
    return tmp;
}

Technically, this form of lock-free synchronization is less strict than the form using standalone fences; the above operations are only meant to prevent memory reordering around themselves, as opposed to standalone fences, which are meant to prevent certain kinds of memory reordering around neighboring operations. Nonetheless, on the x86/64, ARMv6/v7, and PowerPC architectures, the best possible machine code is the same for both forms. For example, in an older post, I showed how C++11 low-level ordering constraints emit dmb instructions on an ARMv7 compiler, which is the same thing you’d expect using standalone fences.

One platform on which the two forms are likely to generate different machine code is Itanium. Itanium can implement C++11’s load(memory_order_acquire) using a single CPU instruction, ld.acq, and store(tmp, memory_order_release) using st.rel. I’d love to investigate the performance difference of these instructions versus standalone fences, but have no access to an Itanium machine.

Another such platform is the recently introduced ARMv8 architecture. ARMv8 offers ldar and stlr instructions, which are similar to Itanium’s ld.acq and st.rel instructions, except that they also enforce the heavier StoreLoad ordering between the stlr instruction and any subsequent ldar. In fact, ARMv8’s new instructions are intended to implement C++11’s SC atomics, described next.

Using C++11 Sequentially Consistent Atomics

C++11 offers an entirely different way to write lock-free code. (We can consider DCLP “lock-free” in certain codepaths, since not all threads take the lock.) If you omit the optional std::memory_order argument on all atomic library functions, the default value is std::memory_order_seq_cst, which turns all atomic variables into sequentially consistent (SC) atomics. With SC atomics, the whole algorithm is guaranteed to appear sequentially consistent as long as there are no data races. SC atomics are really similar to volatile variables in Java 5+.

Here’s a DCLP implementation which uses SC atomics. As in all previous examples, the store to m_instance will synchronize-with the initial load from m_instance once the singleton is created.

std::atomic<Singleton*> Singleton::m_instance;
std::mutex Singleton::m_mutex;

Singleton* Singleton::getInstance() {
    Singleton* tmp = m_instance.load();
    if (tmp == nullptr) {
        std::lock_guard<std::mutex> lock(m_mutex);
        tmp = m_instance.load();
        if (tmp == nullptr) {
            tmp = new Singleton;
            m_instance.store(tmp);
        }
    }
    return tmp;
}

SC atomics are considered easier for programmers to reason about. The tradeoff is that the generated machine code tends to be less efficient than that of the previous examples. For example, consider the x64 machine code for the above listing, as generated by Clang 3.3 with optimizations enabled.

Because we’ve used SC atomics, the store to m_instance has been implemented using an xchg instruction, which acts as a full memory fence on x64. That’s a heavier instruction than DCLP really needs on x64. A plain mov instruction would have done the job. It doesn’t matter too much, though, since the xchg instruction is only issued once, in the codepath where the singleton is first created.

On the other hand, if you compile SC atomics for PowerPC or ARMv6/v7, you’re pretty much guaranteed lousy machine code. For the gory details, see 00:44:25 - 00:49:16 of Herb Sutter’s atomic<> Weapons talk, part 2.

Using C++11 Data-Dependency Ordering

In all of the above examples I’ve shown here, there’s a synchronizes-with relationship between the thread which creates the singleton and any subsequent thread which avoids the lock. The guard variable is the singleton pointer, and the payload is the contents of the singleton itself. In this case, the payload is considered a data dependency of the guard pointer.

It turns out that when working with data dependencies, a read-acquire operation, which all of the above examples use, is actually overkill! We can do better by performing a consume operation instead. Consume operations are cool because they eliminate one of the lwsync instructions on PowerPC, and one of the dmb instructions on ARMv7. I’ll write more about data dependencies and consume operations in a future post.
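
As a rough sketch of where that’s headed, the low-level example from earlier would look like this with a consume operation substituted for the read-acquire. This is illustrative only; today’s compilers generally treat consume as acquire anyway, so don’t expect a measurable win just yet.

std::atomic<Singleton*> Singleton::m_instance;
std::mutex Singleton::m_mutex;

Singleton* Singleton::getInstance() {
    // Identical to the acquire version, except for the first load. On processors
    // that preserve data dependency ordering, a compiler implementing consume
    // efficiently could omit the memory barrier for this load.
    Singleton* tmp = m_instance.load(std::memory_order_consume);
    if (tmp == nullptr) {
        std::lock_guard<std::mutex> lock(m_mutex);
        tmp = m_instance.load(std::memory_order_relaxed);
        if (tmp == nullptr) {
            tmp = new Singleton;
            m_instance.store(tmp, std::memory_order_release);
        }
    }
    return tmp;
}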

Using a C++11 Static Initializer

Some readers already know the punch line to this post: C++11 doesn’t require you to jump through any of the above hoops to get a thread-safe singleton. You can simply use a static initializer.

[Update: Beware! As Rober Baker points out in the comments, this example doesn’t work in Visual Studio 2012 SP4. It only works in compilers that fully comply with this part of the C++11 standard.]

Singleton& Singleton::getInstance() {
    static Singleton instance;
    return instance;
}

The C++11 standard’s got our back in §6.7.4:

If control enters the declaration concurrently while the variable is being initialized, the concurrent execution shall wait for completion of the initialization.

It’s up to the compiler to fill in the implementation details, and DCLP is the obvious choice. There’s no guarantee that the compiler will use DCLP, but it just so happens that some (perhaps most) C++11 compilers do. Consider the machine code generated by GCC 4.6 when compiling for ARM with the -std=c++0x option.

Since the Singleton is constructed at a fixed address, the compiler has introduced a separate guard variable for synchronization purposes. Note in particular that there’s no dmb instruction to act as an acquire fence after the initial read of this guard variable. The guard variable is a pointer to the singleton, and therefore the compiler can take advantage of the data dependency to omit the dmb instruction. __cxa_guard_release performs a write-release on the guard, and is therefore dependency-ordered-before the read-consume once the guard has been set, making the whole thing resilient against memory reordering, just like all the previous examples.

As you can see, we’ve come a long way with C++11. Double-checked locking is fixed, and then some!

Personally, I’ve always thought that if you want to initialize a singleton, best to do it at program startup. But DCLP can certainly help you out of a jam. And as it happens, you can also use DCLP to store arbitrary value types in a lock-free hash table. More about that in a future post as well.

Semaphores are Surprisingly Versatile

In multithreaded programming, it’s important to make threads wait. They must wait for exclusive access to a resource. They must wait when there’s no work available. One way to make threads wait – and put them to sleep inside the kernel, so that they no longer take any CPU time – is with a semaphore.

I used to think semaphores were strange and old-fashioned. They were invented by Edsger Dijkstra back in the early 1960s, before anyone had done much multithreaded programming, or much programming at all, for that matter. I knew that a semaphore could keep track of available units of a resource, or function as a clunky kind of mutex, but that seemed to be about it.

My opinion changed once I realized that, using only semaphores and atomic operations, it’s possible to implement all of the following primitives:

  1. A Lightweight Mutex
  2. A Lightweight Auto-Reset Event Object
  3. A Lightweight Read-Write Lock
  4. Another Solution to the Dining Philosophers Problem
  5. A Lightweight Semaphore With Partial Spinning

Not only that, but these implementations share some desirable properties. They’re lightweight, in the sense that some operations happen entirely in userspace, and they can (optionally) spin for a short period before sleeping in the kernel. You’ll find all of the C++11 source code on GitHub. Since the standard C++11 library does not include semaphores, I’ve also provided a portable Semaphore class that maps directly to native semaphores on Windows, MacOS, iOS, Linux and other POSIX environments. You should be able to drop any of these primitives into almost any existing C++11 project.

A Semaphore Is Like a Bouncer

Imagine a set of waiting threads, lined up in a queue – much like a lineup in front of a busy nightclub or theatre. A semaphore is like a bouncer at the front of the lineup. He only allows threads to proceed when instructed to do so.

Each thread decides for itself when to join the queue. Dijkstra called this the P operation. P originally stood for some funny-sounding Dutch word, but in a modern semaphore implementation, you’re more likely to see this operation called wait. Basically, when a thread calls the semaphore’s wait operation, it enters the lineup.

The bouncer, himself, only needs to understand a single instruction. Originally, Dijkstra called this the V operation. Nowadays, the operation goes by various names, such as post, release or signal. I prefer signal. Any running thread can call signal at any time, and when it does, the bouncer releases exactly one waiting thread from the queue. (Not necessarily in the same order they arrived.)

Now, what happens if some thread calls signal before there are any threads waiting in line? No problem: As soon as the next thread arrives in the lineup, the bouncer will let it pass directly through. And if signal is called, say, 3 times on an empty lineup, the bouncer will let the next 3 threads to arrive pass directly through.

Of course, the bouncer needs to keep track of this number, which is why all semaphores maintain an integer counter. signal increments the counter, and wait decrements it.

The beauty of this strategy is that if wait is called some number of times, and signal is called some number of times, the outcome is always the same: The bouncer will always release the same number of threads, and there will always be the same number of threads left waiting in line, regardless of the order in which those wait and signal calls occurred.
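
If it helps to see those semantics spelled out in code, here’s a minimal sketch of a semaphore built on std::mutex and std::condition_variable. To be clear, this is not the portable Semaphore class from the GitHub repository (that one wraps native OS semaphores); it just illustrates the counter behavior described above.

#include <condition_variable>
#include <mutex>

class SketchSemaphore
{
public:
    explicit SketchSemaphore(int initialCount = 0) : m_count(initialCount) {}

    void wait()     // Dijkstra's P: join the lineup and sleep until released
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_condition.wait(lock, [this] { return m_count > 0; });
        --m_count;
    }

    void signal()   // Dijkstra's V: release one waiting thread, or bank a count
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        ++m_count;
        m_condition.notify_one();
    }

private:
    std::mutex m_mutex;
    std::condition_variable m_condition;
    int m_count;
};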

1. A Lightweight Mutex

I’ve already shown how to implement a lightweight mutex in an earlier post. I didn’t know it at the time, but that post was just one example of a reusable pattern. The trick is to build another mechanism in front of the semaphore, which I like to call the box office.

The box office is where the real decisions are made. Should the current thread wait in line? Should it bypass the queue entirely? Should another thread be released from the queue? The box office cannot directly check how many threads are waiting on the semaphore, nor can it check the semaphore’s current signal count. Instead, the box office must somehow keep track of its own previous decisions. In the case of a lightweight mutex, all it needs is an atomic counter. I’ll call this counter m_contention, since it keeps track of how many threads are simultaneously contending for the mutex.

class LightweightMutex
{
private:
    std::atomic<int> m_contention;         // The "box office"
    Semaphore m_semaphore;                 // The "bouncer"

When a thread decides to lock the mutex, it first visits the box office to increment m_contention.

public:
    void lock()
    {
        if (m_contention.fetch_add(1, std::memory_order_acquire) > 0)  // Visit the box office
        {
            m_semaphore.wait();     // Enter the wait queue
        }
    }

If the previous value was 0, that means no other thread has contended for the mutex yet. As such, the current thread immediately considers itself the new owner, bypasses the semaphore, returns from lock and proceeds into whatever code the mutex is intended to protect.

Otherwise, if the previous value was greater than 0, that means another thread is already considered to own the mutex. In that case, the current thread must wait in line for its turn.

When the previous thread unlocks the mutex, it visits the box office to decrement the counter:

    void unlock()
    {
        if (m_contention.fetch_sub(1, std::memory_order_release) > 1)  // Visit the box office
        {
            m_semaphore.signal();   // Release a waiting thread from the queue
        }
    }

If the previous counter value was 1, that means no other threads arrived in the meantime, so there’s nothing else to do. m_contention is simply left at 0.

Otherwise, if the previous counter value was greater than 1, another thread has attempted to lock the mutex, and is therefore waiting in the queue. As such, we alert the bouncer that it’s now safe to release the next thread. That thread will be considered the new owner.

Every visit to the box office is an indivisible, atomic operation. Therefore, even if multiple threads call lock and unlock concurrently, they will always visit the box office one at a time. Furthermore, the behavior of the mutex is completely determined by the decisions made at the box office. After they visit the box office, they may operate on the semaphore in an unpredictable order, but that’s OK. As I’ve already explained, the outcome will remain valid regardless of the order in which those semaphore operations occur. (In the worst case, some threads may trade places in line.)
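
Here’s a quick usage sketch of my own, assuming the fragments above are assembled into a complete class and that m_contention starts at zero, as the constructor in the GitHub version ensures. Two threads share the mutex like any ordinary one:

#include <thread>

LightweightMutex g_mutex;     // assumed: m_contention initialized to 0
int g_sharedCounter = 0;

void worker()
{
    for (int i = 0; i < 100000; i++)
    {
        g_mutex.lock();
        g_sharedCounter++;    // the critical section protected by the mutex
        g_mutex.unlock();
    }
}

int main()
{
    std::thread t1(worker);
    std::thread t2(worker);
    t1.join();
    t2.join();
    return (g_sharedCounter == 200000) ? 0 : 1;   // both threads' increments are preserved
}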

This class is considered “lightweight” because it bypasses the semaphore when there’s no contention, thereby avoiding system calls. I’ve published it to GitHub as NonRecursiveBenaphore along with a recursive version. However, there’s no need to use these classes in practice. Most available mutex implementations are already lightweight. Nonetheless, they’re noteworthy for serving as inspiration for the rest of the primitives described here.

2. A Lightweight Auto-Reset Event Object

You don’t hear autoreset event objects discussed very often, but as I mentioned in my CppCon 2014 talk, they’re widely used in game engines. Most often, they’re used to notify a single other thread (possibly sleeping) of available work.

An autoreset event object is basically a semaphore that ignores redundant signals. In other words, when signal is called multiple times, the event object’s signal count will never exceed 1. That means you can go ahead and publish work units somewhere, blindly calling signal after each one. It’s a flexible technique that works even when you publish work units to some data structure other than a queue.

Windows has native support for event objects, but its SetEvent function – the equivalent of signal – can be expensive. On one machine, I timed it at 700 ns per call, even when the event was already signaled. If you’re publishing thousands of work units between threads, the overhead for each SetEvent can quickly add up.

Luckily, the box office/bouncer pattern reduces this overhead significantly. All of the autoreset event logic can be implemented at the box office using atomic operations, and the box office will invoke the semaphore only when it’s absolutely necessary for threads to wait.

I’ve published the implementation as AutoResetEvent. This time, the box office has a different way to keep track of how many threads have been sent to wait in the queue. When m_status is negative, its magnitude indicates how many threads are waiting:

class AutoResetEvent
{
private:
    // m_status == 1: Event object is signaled.
    // m_status == 0: Event object is reset and no threads are waiting.
    // m_status == -N: Event object is reset and N threads are waiting.
    std::atomic<int> m_status;
    Semaphore m_sema;

In the event object’s signal operation, we increment m_status atomically, but only if the new value won’t exceed 1. As you can see on the highlighted line, if the event object is already signaled, the function returns immediately, having only performed a single load from shared memory. It’s a big improvement over that 700 ns SetEvent call.

public:
    void signal()
    {
        int oldStatus = m_status.load(std::memory_order_relaxed);
        for (;;)    // Increment m_status atomically via CAS loop.
        {
            if (oldStatus == 1)
                return;     // Event object is already signaled.
            int newStatus = oldStatus + 1;
            if (m_status.compare_exchange_weak(oldStatus, newStatus, std::memory_order_release, std::memory_order_relaxed))
                break;
            // The compare-exchange failed, likely because another thread changed m_status.
            // oldStatus now has its latest value (passed by reference). Retry the CAS loop.
        }
        if (oldStatus < 0)
            m_sema.signal();    // Release one waiting thread.
    }
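
The matching wait operation isn’t shown above. Based on the m_status convention just described, it might look something like the following sketch; see the GitHub repository for the authoritative version.

    void wait()
    {
        // Decrement m_status. If it was 1, we consumed the signal and proceed.
        // Otherwise the new value is negative: we've joined the wait queue.
        int oldStatus = m_status.fetch_sub(1, std::memory_order_acquire);
        assert(oldStatus <= 1);     // assumes <cassert>; the status never exceeds 1
        if (oldStatus < 1)
        {
            m_sema.wait();          // Sleep until a future signal releases us.
        }
    }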

3. A Lightweight Read-Write Lock

Using the same box office/bouncer pattern, it’s possible to implement a pretty good read-write lock. This read-write lock is completely lock-free in the absence of writers, it’s starvation-free for both readers and writers, and just like the other primitives, it can spin before putting threads to sleep. It requires two semaphores: one for waiting readers, and another for waiting writers. The code is available as NonRecursiveRWLock. I intend to write a dedicated post about it later.

4. Another Solution to the Dining Philosophers Problem

The box office/bouncer pattern can also solve Dijkstra’s dining philosophers problem in a way that I haven’t seen described elsewhere. If you’re not familiar with this problem, it involves philosophers that share dinner forks with each other. Each philosopher must obtain two specific forks before he or she can eat. I don’t believe this solution will prove useful to anybody, so I won’t go into great detail. I’m just including it as further demonstration of semaphores’ versatility.

In this solution, we assign each philosopher (thread) its own dedicated semaphore. The box office keeps track of which philosophers are eating, which ones have requested to eat, and the order in which those requests arrived. With that information, the box office is able to shepherd all philosophers through their bouncers in an optimal way.

I’ve posted two implementations. One is DiningPhilosophers, which implements the box office using a mutex. The other is LockReducedDiningPhilosophers, in which every visit to the box office is lock-free.

5. A Lightweight Semaphore with Partial Spinning

You read that right: It’s possible to combine a semaphore with a box office to implement… another semaphore.

Why would you do such a thing? Because you end up with a LightweightSemaphore. It becomes extremely cheap when the lineup is empty and the signal count climbs above zero, regardless of how the underlying semaphore is implemented. In such cases, the box office will rely entirely on atomic operations, leaving the underlying semaphore untouched.

Not only that, but you can make threads wait in a spin loop for a short period of time before invoking the underlying semaphore. This trick helps avoid expensive system calls when the wait time ends up being short.
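
Here’s a rough sketch of that combination. It’s a simplification written for this explanation, not the actual LightweightSemaphore source; positive counts represent banked signals, and negative counts represent waiting threads.

#include <atomic>

class LightweightSemaphoreSketch
{
    std::atomic<int> m_count;   // > 0: banked signals, < 0: number of waiting threads
    Semaphore m_sema;           // the portable Semaphore class described earlier

public:
    LightweightSemaphoreSketch() : m_count(0) {}

    void wait()
    {
        // Spin briefly in userspace, hoping a signal arrives before we must sleep.
        for (int spin = 0; spin < 10000; spin++)
        {
            int oldCount = m_count.load(std::memory_order_relaxed);
            if (oldCount > 0 &&
                m_count.compare_exchange_weak(oldCount, oldCount - 1,
                                              std::memory_order_acquire))
                return;             // grabbed a banked signal; no system call needed
        }
        // No luck: register as a waiter and fall back on the kernel semaphore.
        if (m_count.fetch_sub(1, std::memory_order_acquire) <= 0)
            m_sema.wait();
    }

    void signal()
    {
        if (m_count.fetch_add(1, std::memory_order_release) < 0)
            m_sema.signal();        // at least one thread is waiting; release one
    }
};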

In the GitHub repository, all of the other primitives are implemented on top of LightweightSemaphore, rather than using Semaphore directly. That’s how they all inherit the ability to partially spin. LightweightSemaphore sits on top of Semaphore, which in turn encapsulates a platform-specific semaphore.

The repository comes with a simple test suite, with each test case exercising a different primitive. It’s possible to remove LightweightSemaphore and force all primitives to use Semaphore directly. Here are the resulting timings on my Windows PC:

                          LightweightSemaphore   Semaphore
testBenaphore                   375 ms            5503 ms
testRecursiveBenaphore          393 ms             404 ms
testAutoResetEvent              593 ms            4665 ms
testRWLock                      598 ms            7126 ms
testDiningPhilosophers          309 ms             580 ms

As you can see, the test suite benefits significantly from LightweightSemaphore in this environment. Having said that, I’m pretty sure the current spinning strategy is not optimal for every environment. It simply spins a fixed 10000 times before falling back on Semaphore. I looked briefly into adaptive spinning, but the best approach wasn’t obvious. Any suggestions?

Comparison With Condition Variables

With all of these applications, semaphores are more general-purpose than I originally thought – and this wasn’t even a complete list. So why are semaphores absent from the standard C++11 library? For the same reason they’re absent from Boost: a preference for mutexes and condition variables. From the library maintainers’ point of view, conventional semaphore techniques are just too error prone.

When you think about it, though, the box office/bouncer pattern shown here is really just an optimization for condition variables in a specific case – the case where all condition variable operations are performed at the end of the critical section.

Consider the AutoResetEvent class described above. I’ve implemented AutoResetEventCondVar, an equivalent class based on a condition variable, in the same repository. Its condition variable is always manipulated at the end of the critical section.

    void signal()
    {
        // Increment m_status atomically via critical section.
        std::unique_lock<std::mutex> lock(m_mutex);
        int oldStatus = m_status;
        if (oldStatus == 1)
            return;     // Event object is already signaled.
        m_status++;
        if (oldStatus < 0)
            m_condition.notify_one();   // Release one waiting thread.
    }

We can optimize AutoResetEventCondVar in two steps:

  1. Pull each condition variable outside of its critical section and convert it to a semaphore. The order-independence of semaphore operations makes this safe. After this step, we’ve already implemented the box office/bouncer pattern. (In general, this step also lets us avoid a thundering herd when multiple threads are signaled at once.)

  2. Make the box office lock-free, greatly improving its scalability. This step results in AutoResetEvent.

On my Windows PC, using AutoResetEvent in place of AutoResetEventCondVar makes the associated test case run 10x faster.

Safe Bitfields in C++

In my cpp11-on-multicore project on GitHub, there’s a class that packs three 10-bit values into a 32-bit integer.

I could have implemented it using traditional bitfields…

struct Status
{
    uint32_t readers : 10;
    uint32_t waitToRead : 10;
    uint32_t writers : 10;
};

Or with some bit twiddling…

uint32_t status = readers | (waitToRead << 10) | (writers << 20);

Instead, I did what any overzealous C++ programmer does. I abused the preprocessor and templating system.

BEGIN_BITFIELD_TYPE(Status, uint32_t)           // type name, storage size
    ADD_BITFIELD_MEMBER(readers, 0, 10)         // member name, offset, number of bits
    ADD_BITFIELD_MEMBER(waitToRead, 10, 10)
    ADD_BITFIELD_MEMBER(writers, 20, 10)
END_BITFIELD_TYPE()

The above set of macros defines a new bitfield type Status with three members. The second argument to BEGIN_BITFIELD_TYPE() must be an unsigned integer type. The second argument to ADD_BITFIELD_MEMBER() specifies each member’s offset, while the third argument specifies the number of bits.

I call this a safe bitfield because it performs safety checks to ensure that every operation on the bitfield fits within the available number of bits. It also supports packed arrays. I thought the technique deserved a quick explanation here, since I’m going to refer back to it in future posts.

How to Manipulate a Safe Bitfield

Let’s take Status as an example. Simply create an object of type Status as you would any other object. By default, it’s initialized to zero, but you can initialize it from any integer of the same size. In the GitHub project, it’s often initialized from the result of a C++11 atomic operation.

Status status = m_status.load(std::memory_order_relaxed);

Setting the value of a bitfield member is easy. Just assign to the member the same way you would using a traditional bitfield. If asserts are enabled – such as in a debug build – and you try to assign a value that’s too large for the bitfield, an assert will occur at runtime. It’s meant to help catch programming errors during development.

status.writers = 1023;     // OK
status.writers = 1024;     // assert: value out of range

You can increment or decrement a bitfield member using the ++ and -- operators. If the resulting value is too large, or if it underflows past zero, the operation will trigger an assert as well.

status.writers++;          // assert if overflow; otherwise OK
status.writers--;          // assert if underflow; otherwise OK

It would be easy to implement a version of increment and decrement that silently wrap around, without corrupting any neighboring bitfield members, but I haven’t done so yet. I’ll add those functions as soon as I have a need for them.

You can pass the entire bitfield to any function that expects a uint32_t. In the GitHub project, they’re often passed to C++11 atomic operations. It even works by reference.

m_status.store(status, std::memory_order_relaxed);
m_status.compare_exchange_weak(oldStatus, newStatus,
                               std::memory_order_acquire, std::memory_order_relaxed);

For each bitfield member, there are helper functions that return the representation of 1, as well as the maximum value the member can hold. These helper functions let you atomically increment a specific member using std::atomic<>::fetch_add(). You can invoke them on temporary objects, since they return the same value for any Status object.

Status oldStatus = m_status.fetch_add(Status().writers.one(), std::memory_order_acquire);
assert(oldStatus.writers + 1 <= Status().writers.maximum());

How It’s Implemented

When expanded by the preprocessor, the macros shown near the top of this post generate a union that contains four member variables: wrapper, readers, waitToRead and writers:

// BEGIN_BITFIELD_TYPE(Status, uint32_t)
union Status
{
    struct Wrapper
    {
        uint32_t value;
    };
    Wrapper wrapper;

    Status(uint32_t v = 0) { wrapper.value = v; }
    Status& operator=(uint32_t v) { wrapper.value = v; return *this; }
    operator uint32_t&() { return wrapper.value; }
    operator uint32_t() const { return wrapper.value; }

    typedef uint32_t StorageType;

    // ADD_BITFIELD_MEMBER(readers, 0, 10)
    BitFieldMember<StorageType, 0, 10> readers;

    // ADD_BITFIELD_MEMBER(waitToRead, 10, 10)
    BitFieldMember<StorageType, 10, 10> waitToRead;

    // ADD_BITFIELD_MEMBER(writers, 20, 10)
    BitFieldMember<StorageType, 20, 10> writers;

// END_BITFIELD_TYPE()
};

The cool thing about unions in C++ is that they share a lot of the same capabilities as C++ classes. As you can see, I’ve given this one a constructor and overloaded several operators, to support some of the functionality described earlier.

Each member of the union is exactly 32 bits wide. readers, waitToRead and writers are all instances of the BitFieldMember class template. BitFieldMember<uint32_t, 20, 10>, for example, represents a range of 10 bits starting at offset 20 within a uint32_t. (In the diagram below, the bits are ordered from most significant to least, so we count offsets starting from the right.)

Here’s a partial definition of the BitFieldMember class template. You can view the full definition on GitHub:

template <typename T, int Offset, int Bits>
struct BitFieldMember
{
    T value;

    static const T Maximum = (T(1) << Bits) - 1;
    static const T Mask = Maximum << Offset;

    operator T() const
    {
        return (value >> Offset) & Maximum;
    }

    BitFieldMember& operator=(T v)
    {
        assert(v <= Maximum);               // v must fit inside the bitfield member
        value = (value & ~Mask) | (v << Offset);
        return *this;
    }

    ...

operator T() is a user-defined conversion that lets us read the bitfield member as if it was a plain integer. operator=(T v) is, of course, an assignment operator that lets us write to the bitfield member. This is where all the necessary bit twiddling and safety checks take place.
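
The one() and maximum() helpers used earlier fit naturally into the same template. The following two-liner is only my guess at their shape, not the exact GitHub code:

    T one() const { return T(1) << Offset; }      // the representation of "1" for this member
    T maximum() const { return Maximum; }         // the largest value the member can hold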

No Undefined Behavior

Is this legal C++? We’ve been reading from various Status members after writing to others; something the C++ standard generally forbids. Luckily, in §9.5.1, it makes the following exception:

If a standard-layout union contains several standard-layout structs that share a common initial sequence … it is permitted to inspect the common initial sequence of any of standard-layout struct members.

In our case, Status fits the definition of a standard-layout union; wrapper, readers, waitToRead and writers are all standard-layout structs; and they share a common initial sequence: uint32_t value. Therefore, we have the standard’s endorsement, and there’s no undefined behavior. (Thanks to Michael Reilly and others for helping me sort that out.)

Bonus: Support for Packed Arrays

In another class, I needed a bitfield to hold a packed array of eight 4-bit values.

Packed array members are supported using the ADD_BITFIELD_ARRAY macro. It’s similar to the ADD_BITFIELD_MEMBER macro, but it takes an additional argument to specify the number of array elements.

BEGIN_BITFIELD_TYPE(AllStatus, uint32_t)
    ADD_BITFIELD_ARRAY(philos, 0, 4, 8)     // 8 array elements, 4 bits each
END_BITFIELD_TYPE()

You can index a packed array member just like a regular array. An assert is triggered if the array index is out of range.

AllStatus status;
status.philos[0] = 5;           // OK
status.philos[8] = 0;           // assert: array index out of range

Packed array items support all of the same operations as bitfield members. I won’t go into the details, but the trick is to overload operator[] in philos so that it returns a temporary object that has the same capabilities as a BitFieldMember instance.

status.philos[1]++;
status.philos[2]--;
std::cout << status.philos[3];
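
To make that proxy idea concrete, here’s a stripped-down sketch of how operator[] can return a temporary element object. It’s not the GitHub implementation, which also supports ++, -- and the rest of the BitFieldMember operations; it’s just the core trick.

#include <cassert>
#include <cstdint>

template <typename T, int BaseOffset, int BitsPerItem, int NumItems>
struct BitFieldArraySketch
{
    T value;

    static const T Maximum = (T(1) << BitsPerItem) - 1;

    struct Element
    {
        T& value;
        int offset;

        operator T() const { return (value >> offset) & Maximum; }

        Element& operator=(T v)
        {
            assert(v <= Maximum);                       // element value out of range
            value = (value & ~(Maximum << offset)) | (v << offset);
            return *this;
        }
    };

    Element operator[](int index)
    {
        assert(index >= 0 && index < NumItems);         // array index out of range
        return Element{ value, BaseOffset + BitsPerItem * index };
    }
};

// Usage: the same layout as AllStatus above -- eight 4-bit elements in a uint32_t.
typedef BitFieldArraySketch<uint32_t, 0, 4, 8> AllStatusSketch;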

When optimizations are enabled, MSVC, GCC and Clang do a great job of inlining all the hidden function calls behind this technique. The generated machine code ends up as efficient as if you had explicitly performed all of the bit twiddling yourself.

I’m not the first person to implement custom bitfields on top of C++ unions and templates. The implementation here was inspired by this blog post by Evan Teran, with a few twists of my own. I don’t usually like to rely on clever language contortions, but this is one of those cases where the convenience gained feels worth the increase in obfuscation.

You Can Do Any Kind of Atomic Read-Modify-Write Operation

Atomic read-modify-write operations – or “RMWs” – are more sophisticated than atomic loads and stores. They let you read from a variable in shared memory and simultaneously write a different value in its place. In the C++11 atomic library, all of the following functions perform an RMW:

std::atomic<>::fetch_add()
std::atomic<>::fetch_sub()
std::atomic<>::fetch_and()
std::atomic<>::fetch_or()
std::atomic<>::fetch_xor()
std::atomic<>::exchange()
std::atomic<>::compare_exchange_strong()
std::atomic<>::compare_exchange_weak()

fetch_add, for example, reads from a shared variable, adds another value to it, and writes the result back – all in one indivisible step. You can accomplish the same thing using a mutex, but a mutex-based version wouldn’t be lock-free. RMW operations, on the other hand, are designed to be lock-free. They’ll take advantage of lock-free CPU instructions whenever possible, such as ldrex/strex on ARMv7.

A novice programmer might look at the above list of functions and ask, “Why does C++11 offer so few RMW operations? Why is there an atomic fetch_add, but no atomic fetch_multiply, no fetch_divide and no fetch_shift_left?” There are two reasons:

  1. Because there is very little need for those RMW operations in practice. Try not to get the wrong impression of how RMWs are used. You can’t write safe multithreaded code by taking a single-threaded algorithm and turning each step into an RMW.
  2. Because if you do need those operations, you can easily implement them yourself. As the title says, you can do any kind of RMW operation!

Compare-and-Swap: The Mother of All RMWs

Out of all the available RMW operations in C++11, the only one that is absolutely essential is compare_exchange_weak. Every other RMW operation can be implemented using that one. It takes a minimum of two arguments:

shared.compare_exchange_weak(T& expected, T desired, ...);

This function attempts to store the desired value to shared, but only if the current value of shared matches expected. It returns true if successful. If it fails, it loads the current value of shared back into expected, which despite its name, is an in/out parameter. This is called a compare-and-swap operation, and it all happens in one atomic, indivisible step.

So, suppose you really need an atomic fetch_multiply operation, though I can’t imagine why. Here’s one way to implement it:

uint32_t fetch_multiply(std::atomic<uint32_t>& shared, uint32_t multiplier)
{
    uint32_t oldValue = shared.load();
    while (!shared.compare_exchange_weak(oldValue, oldValue * multiplier))
    {
    }
    return oldValue;
}

This is known as a compare-and-swap loop, or CAS loop. The function repeatedly tries to exchange oldValue with oldValue * multiplier until it succeeds. If no concurrent modifications happen in other threads, compare_exchange_weak will usually succeed on the first try. On the other hand, if shared is concurrently modified by another thread, it’s totally possible for its value to change between the call to load and the call to compare_exchange_weak, causing the compare-and-swap operation to fail. In that case, oldValue will be updated with the most recent value of shared, and the loop will try again.

The above implementation of fetch_multiply is both atomic and lock-free. It’s atomic even though the CAS loop may take an indeterminate number of tries, because when the loop finally does modify shared, it does so atomically. It’s lock-free because if a single iteration of the CAS loop fails, it’s usually because some other thread modified shared successfully. That last statement hinges on the assumption that compare_exchange_weak actually compiles to lock-free machine code – more on that below. It also ignores the fact that compare_exchange_weak can fail spuriously on certain platforms, but that’s such a rare event, there’s no need to consider the algorithm any less lock-free.

You Can Combine Several Steps Into One RMW

fetch_multiply just replaces the value of shared with a multiple of the same value. What if we want to perform a more elaborate kind of RMW? Can we still make the operation atomic and lock-free? Sure we can. To offer a somewhat convoluted example, here’s a function that loads a shared variable, decrements the value if odd, divides it in half if even, and stores the result back only if it’s greater than or equal to 10, all in a single atomic, lock-free operation:

uint32_t atomicDecrementOrHalveWithLimit(std::atomic<uint32_t>& shared)
{
    uint32_t oldValue = shared.load();
    uint32_t newValue;
    do
    {
        if (oldValue % 2 == 1)
            newValue = oldValue - 1;
        else
            newValue = oldValue / 2;
        if (newValue < 10)
            break;
    }
    while (!shared.compare_exchange_weak(oldValue, newValue));
    return oldValue;
}

It’s the same idea as before: If compare_exchange_weak fails – usually due to a modification performed by another thread – oldValue is updated with a more recent value, and the loop tries again. If, during any attempt, we find that newValue is less than 10, the CAS loop terminates early, effectively turning the RMW operation into a no-op.

The point is that you can put anything inside the CAS loop. Think of the body of the CAS loop as a critical section. Normally, we protect a critical section using a mutex. With a CAS loop, we simply retry the entire transaction until it succeeds.

This is obviously a synthetic example. A more practical example can be seen in the AutoResetEvent class described in my earlier post about semaphores. It uses a CAS loop with multiple steps to atomically increment a shared variable up to a limit of 1.

You Can Combine Several Variables Into One RMW

So far, we’ve only looked at examples that perform an atomic operation on a single shared variable. What if we want to perform an atomic operation on multiple variables? Normally, we’d protect those variables using a mutex:

std::mutex mutex;
uint32_t x;
uint32_t y;

void atomicFibonacciStep()
{
    std::lock_guard<std::mutex> lock(mutex);
    int t = y;
    y = x + y;
    x = t;
}

This mutex-based approach is atomic, but obviously not lock-free. That may very well be good enough, but for the sake of illustration, let’s go ahead and convert it to a CAS loop just like the other examples. std::atomic<> is a template, so we can actually pack both shared variables into a struct and apply the same pattern as before:

struct Terms
{
    uint32_t x;
    uint32_t y;
};

std::atomic<Terms> terms;

void atomicFibonacciStep()
{
    Terms oldTerms = terms.load();
    Terms newTerms;
    do
    {
        newTerms.x = oldTerms.y;
        newTerms.y = oldTerms.x + oldTerms.y;
    }
    while (!terms.compare_exchange_weak(oldTerms, newTerms));
}

Is this operation lock-free? Now we’re venturing into dicey territory. As I wrote at the start, C++11 atomic operations are designed to take advantage of lock-free CPU instructions “whenever possible” – admittedly a loose definition. In this case, we’ve wrapped std::atomic<> around a struct, Terms. Let’s see how GCC 4.9.2 compiles it for x64:

We got lucky. The compiler was clever enough to see that Terms fits inside a single 64-bit register, and implemented compare_exchange_weak using lock cmpxchg. The compiled code is lock-free.

This brings up an interesting point: In general, the C++11 standard does not guarantee that atomic operations will be lock-free. There are simply too many CPU architectures to support and too many ways to specialize the std::atomic<> template. You need to check with your compiler to make absolutely sure. In practice, though, it’s pretty safe to assume that atomic operations are lock-free when all of the following conditions are true:

  1. The compiler is a recent version of MSVC, GCC or Clang.
  2. The target processor is x86, x64 or ARMv7 (and possibly others).
  3. The atomic type is std::atomic<uint32_t>, std::atomic<uint64_t> or std::atomic<T*> for some type T.
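
If you’d rather not rely on rules of thumb, C++11 also lets you ask the library directly via std::atomic<>::is_lock_free. A quick sanity check might look like this:

#include <atomic>
#include <cstdint>
#include <cstdio>

int main()
{
    std::atomic<uint32_t> a32(0);
    std::atomic<uint64_t> a64(0);
    // is_lock_free() is a runtime query; on some platforms the answer can
    // depend on alignment, so check the actual objects you plan to use.
    std::printf("32-bit atomics lock-free: %d\n", (int) a32.is_lock_free());
    std::printf("64-bit atomics lock-free: %d\n", (int) a64.is_lock_free());
    return 0;
}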

As a personal preference, I like to hang my hat on that third point, and limit myself to specializations of the std::atomic<> template that use explicit integer or pointer types. The safe bitfield technique I described in the previous post gives us a convenient way to rewrite the above function using an explicit integer specialization, std::atomic<uint64_t>:

BEGIN_BITFIELD_TYPE(Terms, uint64_t)
    ADD_BITFIELD_MEMBER(x, 0, 32)
    ADD_BITFIELD_MEMBER(y, 32, 32)
END_BITFIELD_TYPE()

std::atomic<uint64_t> terms;

void atomicFibonacciStep()
{
    Terms oldTerms = terms.load();
    Terms newTerms;
    do
    {
        newTerms.x = oldTerms.y;
        newTerms.y = (uint32_t) (oldTerms.x + oldTerms.y);
    }
    while (!terms.compare_exchange_weak(oldTerms, newTerms));
}

Some real-world examples where we pack several values into an atomic bitfield appeared in the previous posts: the lightweight read-write lock packs its readers, waitToRead and writers counts into a single 32-bit status word, and the lock-reduced dining philosophers solution packs eight philosopher states into one integer.

In general, any time you have a small amount of data protected by a mutex, and you can pack that data entirely into a 32- or 64-bit integer type, you can always convert your mutex-based operations into lock-free RMW operations, no matter what those operations actually do! That’s the principle I exploited in my Semaphores are Surprisingly Versatile post, to implement a bunch of lightweight synchronization primitives.

Of course, this technique is not unique to the C++11 atomic library. I’m just using C++11 atomics because they’re quite widely available now, and compiler support is pretty good. You can implement a custom RMW operation using any library that exposes a compare-and-swap function, such as Win32, the Mach kernel API, the Linux kernel API, GCC atomic builtins or Mintomic. In the interest of brevity, I didn’t discuss memory ordering concerns in this post, but it’s critical to consider the guarantees made by your atomic library. In particular, if your custom RMW operation is intended to pass non-atomic information between threads, then at a minimum, you should ensure that there is the equivalent of a synchronizes-with relationship somewhere.

Lightweight In-Memory Logging

When debugging multithreaded code, it’s not always easy to determine which codepath was taken. You can’t always reproduce the bug while stepping through the debugger, nor can you always sprinkle printfs throughout the code, as you might in a single-threaded program. There might be millions of events before the bug occurs, and printf can easily slow the application to a crawl, mask the bug, or create a spam fest in the output log.

One way of attacking such problems is to instrument the code so that events are logged to a circular buffer in memory. This is similar to adding printfs, except that only the most recent events are kept in the log, and the performance overhead can be made very low using lock-free techniques.

Here’s one possible implementation. I’ve written it specifically for Windows in 32-bit C++, but you could easily adapt the idea to other platforms. The header file contains the following:

#include <windows.h>
#include <intrin.h>

namespace Logger
{
    struct Event
    {
        DWORD tid;        // Thread ID
        const char* msg;  // Message string
        DWORD param;      // A parameter which can mean anything you want
    };

    static const int BUFFER_SIZE = 65536;   // Must be a power of 2
    extern Event g_events[BUFFER_SIZE];
    extern LONG g_pos;

    inline void Log(const char* msg, DWORD param)
    {
        // Get next event index
        LONG index = _InterlockedIncrement(&g_pos);
        // Write an event at this index
        Event* e = g_events + (index & (BUFFER_SIZE - 1));  // Wrap to buffer size
        e->tid = ((DWORD*) __readfsdword(24))[9];           // Get thread ID
        e->msg = msg;
        e->param = param;
    }
}

#define LOG(m, p) Logger::Log(m, p)

And you must place the following in a .cpp file.

namespace Logger
{
    Event g_events[BUFFER_SIZE];
    LONG g_pos = -1;
}

This is perhaps one of the simplest examples of lock-free programming which actually does something useful. There’s a single macro LOG, which writes to the log. It uses _InterlockedIncrement, an atomic operation which I’ve talked about in previous posts, for thread safety. There are no readers. You are meant to be the reader when you inspect the process in the debugger, such as when the program crashes, or when the bug is otherwise caught.
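
For what it’s worth, the same idea ports to standard C++11 with only minor changes. The following is my own hedged adaptation, not part of the original implementation; it trades the Windows intrinsics for std::atomic and std::this_thread::get_id, giving up a little of the micro-optimization discussed later in exchange for portability.

#include <atomic>
#include <cstdint>
#include <thread>

namespace PortableLogger
{
    struct Event
    {
        std::thread::id tid;    // Thread ID
        const char* msg;        // Message string
        uint32_t param;         // A parameter which can mean anything you want
    };

    const int BUFFER_SIZE = 65536;          // Must be a power of 2
    extern Event g_events[BUFFER_SIZE];     // define these two in one .cpp file,
    extern std::atomic<uint32_t> g_pos;     // with g_pos initialized to 0

    inline void Log(const char* msg, uint32_t param)
    {
        // Claim the next slot with a single relaxed atomic increment.
        uint32_t index = g_pos.fetch_add(1, std::memory_order_relaxed);
        Event& e = g_events[index & (BUFFER_SIZE - 1)];   // Wrap to buffer size
        e.tid = std::this_thread::get_id();
        e.msg = msg;
        e.param = param;
    }
}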

Using It to Debug My Previous Post

My previous post, Memory Reordering Caught In the Act, contains a sample program which demonstrates a specific type of memory reordering. There are two semaphores, beginSema1 and beginSema2, which are used to repeatedly kick off two worker threads.

While I was preparing the post, there was only a single beginSema shared by both threads. To verify that the experiment was valid, I added a makeshift assert to the worker threads. Here’s the Win32 version:

DWORD WINAPI thread1Func(LPVOID param)
{
    MersenneTwister random(1);
    for (;;)
    {
        WaitForSingleObject(beginSema, INFINITE);  // Wait for signal
        while (random.integer() % 8 != 0) {} // Random delay

        // ----- THE TRANSACTION! -----
        if (X != 0) DebugBreak();  // Makeshift assert
        X = 1;
        _ReadWriteBarrier();  // Prevent compiler reordering only
        r1 = Y;

        ReleaseSemaphore(endSema, 1, NULL);  // Notify transaction complete
    }
    return 0;  // Never returns
};

Surprisingly, this “assert” got hit, which means that X was not 0 at the start of the experiment, as it should have been. This puzzled me, since as I explained in that post, the semaphores are supposed to guarantee the initial values X = 0 and Y = 0 are completely propagated at this point.

I needed more visibility on what was going on, so I added the LOG macro in a few strategic places. Note that the integer parameter can be used to log any value you want. In the second LOG statement below, I use it to log the initial value of X. Similar changes were made in the other worker thread.

    for (;;)
    {
        LOG("wait", 0);
        WaitForSingleObject(beginSema, INFINITE);  // Wait for signal
        while (random.integer() % 8 != 0) {} // Random delay

        // ----- THE TRANSACTION! -----
        LOG("X ==", X);
        if (X != 0) DebugBreak();  // Makeshift assert
        X = 1;
        _ReadWriteBarrier();  // Prevent compiler reordering only
        r1 = Y;

        ReleaseSemaphore(endSema, 1, NULL);  // Notify transaction complete
    }

And in the main thread:

    for (int iterations = 1; ; iterations++)
    {
        // Reset X and Y
        LOG("reset vars", 0);
        X = 0;
        Y = 0;
        // Signal both threads
        ReleaseSemaphore(beginSema, 1, NULL);
        ReleaseSemaphore(beginSema, 1, NULL);
        // Wait for both threads
        WaitForSingleObject(endSema, INFINITE);
        WaitForSingleObject(endSema, INFINITE);
        // Check if there was a simultaneous reorder
        LOG("check vars", 0);
        if (r1 == 0 && r2 == 0)
        {
            detected++;
            printf("%d reorders detected after %d iterations\n", detected, iterations);
        }
    }

The next time the “assert” was hit, I checked the contents of the log simply by watching the expressions Logger::g_pos and Logger::g_events in the Watch window.

In this case, the assert was hit fairly quickly. Only 17 events were logged in total (0 - 16). The final three events made the problem obvious: a single worker thread had managed to iterate twice before the other thread got a chance to run. In other words, thread1 had stolen the extra semaphore count which was intended to kick off thread2! Splitting this semaphore into two separate semaphores fixed the bug.

This example was relatively simple, involving a small number of events. In some games I’ve worked on, we’ve used this kind of technique to track down more complex problems. It’s still possible for this technique to mask a bug; for example, when memory reordering is the issue. But even if so, that may tell you something about the problem.

Tips on Viewing the Log

The g_events array is only big enough to hold the latest 65536 events. You can adjust this number to your liking, but at some point, the index counter g_pos will have to wrap around. For example, if g_pos has reached a value of 3630838, you can find the last log entry by taking this value modulo 65536. Using interactive Python:

>>> 3630838 % 65536
26358

When breaking, you may also find that “CXX0017: Error: symbol not found” is sometimes shown in the Watch window, as seen here:

This usually means that the debugger’s current thread and stack frame context is inside an external DLL instead of your executable. You can often fix it by double-clicking a different stack frame in the Call Stack window and/or a different thread in the Threads window. If all else fails, you can always add the context operator to your Watch expression, explicitly telling the debugger which module to use to resolve these symbols:

One convenient detail about this implementation is that the event log is stored in a global array. This allows the log show up in crash dumps, via an automated crash reporting system for example, even when limited minidump flags are used.

What Makes This Lightweight?

In this implementation, I strived to make the LOG macro as non-intrusive as reasonably possible. Besides being lock-free, this is mainly achieved through copious use of compiler intrinsics, which avoid the overhead of DLL function calls for certain functions. For example, instead of calling InterlockedIncrement, which involves a call into kernel32.dll, I used the intrinsic function _InterlockedIncrement (with an underscore).

Similarly, instead of getting the current thread ID from GetCurrentThreadId, I used the compiler intrinsic __readfsdword to read the thread ID directly from the Thread Information Block (TIB), an undocumented but well-known data structure in Win32.

You may question whether such micro-optimizations are justified. However, after building several makeshift logging systems, usually to handle millions of events in high-performance, multi-threaded code, I’ve come to believe that the less intrusive you can make it, the better. As a result of these micro-optimizations, the LOG macro compiles down to a few machine instructions, all inlined, with no function calls, no branching and no blocking:

This technique is attractive because it is very easy to integrate. There are many ways you could adapt it, depending on your needs, especially if performance is less of a concern. You could add timestamps and stack traces. You could introduce a dedicated thread to spool the event log to disk, though this would require much more sophisticated synchronization than the single atomic operation used here.

After adding such features, the technique would begin to resemble Microsoft’s Event Tracing for Windows (ETW) framework, so if you’re willing to go that far, it might be interesting to look at ETW’s support for user-mode provider events instead.


An Introduction to Lock-Free Programming

Lock-free programming is a challenge, not just because of the complexity of the task itself, but because of how difficult it can be to penetrate the subject in the first place.

I was fortunate in that my first introduction to lock-free (also known as lockless) programming was Bruce Dawson’s excellent and comprehensive white paper, Lockless Programming Considerations. And like many, I’ve had the occasion to put Bruce’s advice into practice developing and debugging lock-free code on platforms such as the Xbox 360.

Since then, a lot of good material has been written, ranging from abstract theory and proofs of correctness to practical examples and hardware details. I’ll leave a list of references in the footnotes. At times, the information in one source may appear orthogonal to other sources: For instance, some material assumes sequential consistency, and thus sidesteps the memory ordering issues which typically plague lock-free C/C++ code. The new C++11 atomic library standard throws another wrench into the works, challenging the way many of us express lock-free algorithms.

In this post, I’d like to re-introduce lock-free programming, first by defining it, then by distilling most of the information down to a few key concepts. I’ll show how those concepts relate to one another using flowcharts, then we’ll dip our toes into the details a little bit. At a minimum, any programmer who dives into lock-free programming should already understand how to write correct multithreaded code using mutexes, and other high-level synchronization objects such as semaphores and events.

What Is It?

People often describe lock-free programming as programming without mutexes, which are also referred to as locks. That’s true, but it’s only part of the story. The generally accepted definition, based on academic literature, is a bit more broad. At its essence, lock-free is a property used to describe some code, without saying too much about how that code was actually written.

Basically, if some part of your program satisfies the following conditions, then that part can rightfully be considered lock-free. Conversely, if a given part of your code doesn’t satisfy these conditions, then that part is not lock-free.

In this sense, the lock in lock-free does not refer directly to mutexes, but rather to the possibility of “locking up” the entire application in some way, whether it’s deadlock, livelock – or even due to hypothetical thread scheduling decisions made by your worst enemy. That last point sounds funny, but it’s key. Shared mutexes are ruled out trivially, because as soon as one thread obtains the mutex, your worst enemy could simply never schedule that thread again. Of course, real operating systems don’t work that way – we’re merely defining terms.

Here’s a simple example of an operation which contains no mutexes, but is still not lock-free. Initially, X = 0. As an exercise for the reader, consider how two threads could be scheduled in a way such that neither thread exits the loop.

while (X == 0)
{
    X = 1 - X;
}

Nobody expects a large application to be entirely lock-free. Typically, we identify a specific set of lock-free operations out of the whole codebase. For example, in a lock-free queue, there might be a handful of lock-free operations such as push, pop, perhaps isEmpty, and so on.

Herlihy & Shavit, authors of The Art of Multiprocessor Programming, tend to express such operations as class methods, and offer the following succinct definition of lock-free (see slide 150): “In an infinite execution, infinitely often some method call finishes.” In other words, as long as the program is able to keep calling those lock-free operations, the number of completed calls keeps increasing, no matter what. It is algorithmically impossible for the system to lock up during those operations.

One important consequence of lock-free programming is that if you suspend a single thread, it will never prevent other threads from making progress, as a group, through their own lock-free operations. This hints at the value of lock-free programming when writing interrupt handlers and real-time systems, where certain tasks must complete within a certain time limit, no matter what state the rest of the program is in.

A final precision: Operations that are designed to block do not disqualify the algorithm. For example, a queue’s pop operation may intentionally block when the queue is empty. The remaining codepaths can still be considered lock-free.

Lock-Free Programming Techniques

It turns out that when you attempt to satisfy the non-blocking condition of lock-free programming, a whole family of techniques fall out: atomic operations, memory barriers, avoiding the ABA problem, to name a few. This is where things quickly become diabolical.

So how do these techniques relate to one another? To illustrate, I’ve put together the following flowchart. I’ll elaborate on each one below.

Atomic Read-Modify-Write Operations

Atomic operations are ones which manipulate memory in a way that appears indivisible: No thread can observe the operation half-complete. On modern processors, lots of operations are already atomic. For example, aligned reads and writes of simple types are usually atomic.

Read-modify-write (RMW) operations go a step further, allowing you to perform more complex transactions atomically. They’re especially useful when a lock-free algorithm must support multiple writers, because when multiple threads attempt an RMW on the same address, they’ll effectively line up in a row and execute those operations one-at-a-time. I’ve already touched upon RMW operations in this blog, such as when implementing a lightweight mutex, a recursive mutex and a lightweight logging system.

Examples of RMW operations include _InterlockedIncrement on Win32, OSAtomicAdd32 on iOS, and std::atomic<int>::fetch_add in C++11. Be aware that the C++11 atomic standard does not guarantee that the implementation will be lock-free on every platform, so it’s best to know the capabilities of your platform and toolchain. You can call std::atomic<>::is_lock_free to make sure.

Different CPU families support RMW in different ways. Processors such as PowerPC and ARM expose load-link/store-conditional instructions, which effectively allow you to implement your own RMW primitive at a low level, though this is not often done. The common RMW operations are usually sufficient.

As illustrated by the flowchart, atomic RMWs are a necessary part of lock-free programming even on single-processor systems. Without atomicity, a thread could be interrupted halfway through the transaction, possibly leading to an inconsistent state.

Compare-And-Swap Loops

Perhaps the most often-discussed RMW operation is compare-and-swap (CAS). On Win32, CAS is provided via a family of intrinsics such as _InterlockedCompareExchange. Often, programmers perform compare-and-swap in a loop to repeatedly attempt a transaction. This pattern typically involves copying a shared variable to a local variable, performing some speculative work, and attempting to publish the changes using CAS:

void LockFreeQueue::push(Node* newHead)
{
    for (;;)
    {
        // Copy a shared variable (m_Head) to a local.
        Node* oldHead = m_Head;

        // Do some speculative work, not yet visible to other threads.
        newHead->next = oldHead;

        // Next, attempt to publish our changes to the shared variable.
        // If the shared variable hasn't changed, the CAS succeeds and we return.
        // Otherwise, repeat.
        if (_InterlockedCompareExchange(&m_Head, newHead, oldHead) == oldHead)
            return;
    }
}

Such loops still qualify as lock-free, because if the test fails for one thread, it means it must have succeeded for another – though some architectures offer a weaker variant of CAS where that’s not necessarily true. Whenever implementing a CAS loop, special care must be taken to avoid the ABA problem.

Sequential Consistency

Sequential consistency means that all threads agree on the order in which memory operations occurred, and that order is consistent with the order of operations in the program source code. Under sequential consistency, it’s impossible to experience memory reordering shenanigans like the one I demonstrated in a previous post.

A simple (but obviously impractical) way to achieve sequential consistency is to disable compiler optimizations and force all your threads to run on a single processor. A processor never sees its own memory effects out of order, even when threads are pre-empted and scheduled at arbitrary times.

Some programming languages offer sequential consistency even for optimized code running in a multiprocessor environment. In C++11, you can declare all shared variables as C++11 atomic types with default memory ordering constraints. In Java, you can mark all shared variables as volatile. Here’s the example from my previous post, rewritten in C++11 style:

std::atomic<int> X(0), Y(0);
int r1, r2;

void thread1()
{
    X.store(1);
    r1 = Y.load();
}

void thread2()
{
    Y.store(1);
    r2 = X.load();
}

Because the C++11 atomic types guarantee sequential consistency, the outcome r1 = r2 = 0 is impossible. To achieve this, the compiler outputs additional instructions behind the scenes – typically memory fences and/or RMW operations. Those additional instructions may make the implementation less efficient compared to one where the programmer has dealt with memory ordering directly.

Memory Ordering

As the flowchart suggests, any time you do lock-free programming for multicore (or any symmetric multiprocessor), and your environment does not guarantee sequential consistency, you must consider how to prevent memory reordering.

On today’s architectures, the tools to enforce correct memory ordering generally fall into three categories, which prevent both compiler reordering and processor reordering:

  • A lightweight sync or fence instruction, which I’ll talk about in future posts;
  • A full memory fence instruction, which I’ve demonstrated previously;
  • Memory operations which provide acquire or release semantics.

Acquire semantics prevent memory reordering of operations which follow it in program order, and release semantics prevent memory reordering of operations preceding it. These semantics are particularly suitable in cases when there’s a producer/consumer relationship, where one thread publishes some information and the other reads it. I’ll also talk about this more in a future post.
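
To ground that in code, here’s a minimal sketch of the producer/consumer publish pattern just described, written with C++11 acquire and release operations. It’s my illustration, not code taken from elsewhere in this post.

#include <atomic>

int Value;                              // plain (non-atomic) payload
std::atomic<int> IsPublished(0);        // guard variable

void sendValue(int x)
{
    Value = x;                                          // write the payload
    IsPublished.store(1, std::memory_order_release);    // then publish it
}

int tryRecvValue()
{
    if (IsPublished.load(std::memory_order_acquire))    // observe the guard
        return Value;                                    // payload is guaranteed visible
    return -1;                                           // not yet published
}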

Different Processors Have Different Memory Models

Different CPU families have different habits when it comes to memory reordering. The rules are documented by each CPU vendor and followed strictly by the hardware. For instance, PowerPC and ARM processors can change the order of memory stores relative to the instructions themselves, but normally, the x86/64 family of processors from Intel and AMD do not. We say the former processors have a more relaxed memory model.

There’s a temptation to abstract away such platform-specific details, especially with C++11 offering us a standard way to write portable lock-free code. But currently, I think most lock-free programmers have at least some appreciation of platform differences. If there’s one key difference to remember, it’s that at the x86/64 instruction level, every load from memory comes with acquire semantics, and every store to memory provides release semantics – at least for non-SSE instructions and non-write-combined memory. As a result, it’s been common in the past to write lock-free code which works on x86/64, but fails on other processors.

If you’re interested in the hardware details of how and why processors perform memory reordering, I’d recommend Appendix C of Is Parallel Programming Hard. In any case, keep in mind that memory reordering can also occur due to compiler reordering of instructions.

In this post, I haven’t said much about the practical side of lock-free programming, such as: When do we do it? How much do we really need? I also haven’t mentioned the importance of validating your lock-free algorithms. Nonetheless, I hope for some readers, this introduction has provided a basic familiarity with lock-free concepts, so you can proceed into the additional reading without feeling too bewildered. As usual, if you spot any inaccuracies, let me know in the comments.

[This article was featured in Issue #29 of Hacker Monthly.]

Additional References

Memory Ordering at Compile Time

Between the time you type in some C/C++ source code and the time it executes on a CPU, the memory interactions of that code may be reordered according to certain rules. Changes to memory ordering are made both by the compiler (at compile time) and by the processor (at run time), all in the name of making your code run faster.

The cardinal rule of memory reordering, which is universally followed by compiler developers and CPU vendors, could be phrased as follows:

Thou shalt not modify the behavior of a single-threaded program.

As a result of this rule, memory reordering goes largely unnoticed by programmers writing single-threaded code. It often goes unnoticed in multithreaded programming, too, since mutexes, semaphores and events are all designed to prevent memory reordering around their call sites. It’s only when lock-free techniques are used – when memory is shared between threads without any kind of mutual exclusion – that the cat is finally out of the bag, and the effects of memory reordering can be plainly observed.

Mind you, it is possible to write lock-free code for multicore platforms without the hassles of memory reordering. As I mentioned in my introduction to lock-free programming, one can take advantage of sequentially consistent types, such as volatile variables in Java or C++11 atomics – possibly at the price of a little performance. I won’t go into detail about those here. In this post, I’ll focus on the impact of the compiler on memory ordering for regular, non-sequentially-consistent types.

Compiler Instruction Reordering

As you know, the job of a compiler is to convert human-readable source code into machine-readable code for the CPU. During this conversion, the compiler is free to take many liberties.

One such liberty is the reordering of instructions – again, only in cases where single-threaded program behavior does not change. Such instruction reordering typically happens only when compiler optimizations are enabled. Consider the following function:

int A, B;

void foo()
{
    A = B + 1;
    B = 0;
}

If we compile this function using GCC 4.6.1 without compiler optimization, it generates the following machine code, which we can view as an assembly listing using the -S option. The memory store to global variable B occurs right after the store to A, just as it does in the original source code.

$ gcc -S -masm=intel foo.c
$ cat foo.s
        ...
        mov     eax, DWORD PTR _B
        add     eax, 1
        mov     DWORD PTR _A, eax
        mov     DWORD PTR _B, 0
        ...

Compare that to the resulting assembly listing when optimizations are enabled using -O2:

$ gcc -O2 -S -masm=intel foo.c
$ cat foo.s
        ...
        mov     eax, DWORD PTR B
        mov     DWORD PTR B, 0
        add     eax, 1
        mov     DWORD PTR A, eax
        ...

This time, the compiler has chosen to exercise its liberties, and reordered the store to B before the store to A. And why shouldn’t it? The cardinal rule of memory ordering is not broken. A single-threaded program would never know the difference.

On the other hand, such compiler reorderings can cause problems when writing lock-free code. Here’s a commonly-cited example, where a shared flag is used to indicate that some other shared data has been published:

int Value;
int IsPublished = 0;
 
void sendValue(int x)
{
    Value = x;
    IsPublished = 1;
}

Imagine what would happen if the compiler reordered the store to IsPublished before the store to Value. Even on a single-processor system, we’d have a problem: a thread could very well be pre-empted by the operating system between the two stores, leaving other threads to believe that Value has been updated when in fact, it hasn’t.

Of course, the compiler might not reorder those operations, and the resulting machine code would work fine as a lock-free operation on any multicore CPU having a strong memory model, such as an x86/64 – or in a single-processor environment, any type of CPU at all. If that’s the case, we should consider ourselves lucky. Needless to say, it’s much better practice to recognize the possibility of memory reordering for shared variables, and to ensure that the correct ordering is enforced.

Explicit Compiler Barriers

The minimalist approach to preventing compiler reordering is by using a special directive known as a compiler barrier. I’ve already demonstrated compiler barriers in a previous post. The following is a full compiler barrier in GCC. In Microsoft Visual C++, _ReadWriteBarrier serves the same purpose.

int A, B;

void foo()
{
    A = B + 1;
    asm volatile("" ::: "memory");
    B = 0;
}

With this change, we can leave optimizations enabled, and the memory store instructions will remain in the desired order.

$ gcc -O2 -S -masm=intel foo.c
$ cat foo.s
        ...
        mov     eax, DWORD PTR _B
        add     eax, 1
        mov     DWORD PTR _A, eax
        mov     DWORD PTR _B, 0
        ...

Similarly, if we want to guarantee our sendValue example works correctly, and we only care about single-processor systems, then at an absolute minimum, we must introduce compiler barriers here as well. Not only does the sending operation require a compiler barrier, to prevent the reordering of stores, but the receiving side needs one between the loads as well.

#define COMPILER_BARRIER() asm volatile("" ::: "memory")

int Value;
int IsPublished = 0;

void sendValue(int x)
{
    Value = x;
    COMPILER_BARRIER();          // prevent reordering of stores
    IsPublished = 1;
}

int tryRecvValue()
{
    if (IsPublished)
    {
        COMPILER_BARRIER();      // prevent reordering of loads
        return Value;
    }
    return -1;  // or some other value to mean not yet received
}

As I mentioned, compiler barriers are sufficient to prevent memory reordering on a single-processor system. But it’s 2012, and these days, multicore computing is the norm. If we want to ensure our interactions happen in the desired order in a multiprocessor environment, and on any CPU architecture, then a compiler barrier is not enough. We need either to issue a CPU fence instruction, or perform any operation which acts as a memory barrier at runtime. I’ll write more about those in the next post, Memory Barriers Are Like Source Control Operations.

The Linux kernel exposes several CPU fence instructions through preprocessor macros such as smp_rmb, and those macros are reduced to simple compiler barriers when compiling for a single-processor system.
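
For a rough idea of what such a macro might look like, here is a simplified sketch in the spirit of the kernel's headers. This is not the actual kernel source; the real definitions are more elaborate and architecture-specific, and the lfence shown here is just an x86 example.

#define barrier()   asm volatile("" ::: "memory")        // compiler barrier only
#define rmb()       asm volatile("lfence" ::: "memory")  // CPU read fence (x86 example)

#ifdef CONFIG_SMP
    #define smp_rmb()   rmb()        // multiprocessor build: emit a real fence instruction
#else
    #define smp_rmb()   barrier()    // uniprocessor build: a compiler barrier is enough
#endif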

Implied Compiler Barriers

There are other ways to prevent compiler reordering. Indeed, the CPU fence instructions I just mentioned act as compiler barriers, too. Here’s an example CPU fence instruction for PowerPC, defined as a macro in GCC:

#define RELEASE_FENCE() asm volatile("lwsync" ::: "memory")

Anywhere we place RELEASE_FENCE throughout our code, it will prevent certain kinds of processor reordering in addition to compiler reordering. For example, it can be used to make our sendValue function safe in a multiprocessor environment.

void sendValue(int x)
{
    Value = x;
    RELEASE_FENCE();
    IsPublished = 1;
}

In the new C++11 (formerly known as C++0x) atomic library standard, every non-relaxed atomic operation acts as a compiler barrier as well.

int Value;
std::atomic<int> IsPublished(0);

void sendValue(int x)
{
    Value = x;
    // <-- reordering is prevented here!
    IsPublished.store(1, std::memory_order_release);
}

And as you might expect, every function containing a compiler barrier must act as a compiler barrier itself, even when the function is inlined. (However, Microsoft’s documentation suggests that may not have been the case in earlier versions of the Visual C++ compiler. Tsk, tsk!)

void doSomeStuff(Foo* foo)
{
    foo->bar = 5;
    sendValue(123);       // prevents reordering of neighboring assignments
    foo->bar2 = foo->bar;
}

In fact, the majority of function calls act as compiler barriers, whether they contain their own compiler barrier or not. This excludes inline functions, functions declared with the pure attribute, and cases where link-time code generation is used. Other than those cases, a call to an external function is even stronger than a compiler barrier, since the compiler has no idea what the function’s side effects will be. It must forget any assumptions it made about memory that is potentially visible to that function.

When you think about it, this makes perfect sense. In the above code snippet, suppose our implementation of sendValue exists in an external library. How does the compiler know that sendValue doesn’t depend on the value of foo->bar? How does it know sendValue will not modify foo->bar in memory? It doesn’t. Therefore, to obey the cardinal rule of memory ordering, it must not reorder any memory operations around the external call to sendValue. Similarly, it must load a fresh value for foo->bar from memory after the call completes, rather than assuming it still equals 5, even with optimization enabled.

$ gcc -O2 -S -masm=intel dosomestuff.c
$ cat dosomestuff.s
        ...
        mov    ebx, DWORD PTR [esp+32]
        mov    DWORD PTR [ebx], 5            // Store 5 to foo->bar
        mov    DWORD PTR [esp], 123
        call    sendValue                     // Call sendValue
        mov    eax, DWORD PTR [ebx]          // Load fresh value from foo->bar
        mov    DWORD PTR [ebx+4], eax
        ...

As you can see, there are many instances where compiler instruction reordering is prohibited, and many where the compiler must even reload certain values from memory. I believe these hidden rules form a big part of the reason why people have long been saying that volatile data types in C are not usually necessary in correctly-written multithreaded code.

Out-Of-Thin-Air Stores

Think instruction reordering makes lock-free programming tricky? Before C++11 was standardized, there was technically no rule preventing the compiler from getting up to even worse tricks. In particular, compilers were free to introduce stores to shared memory in cases where there previously was none. Here’s a very simplified example, inspired by the examples provided in multiple articles by Hans Boehm.

int A, B;

void foo()
{
    if (A)
        B++;
}

Though it’s rather unlikely in practice, nothing prevents a compiler from promoting B to a register before checking A, resulting in machine code equivalent to the following:

void foo()
{
    register int r = B;    // Promote B to a register before checking A.
    if (A)
        r++;
    B = r;          // Surprise! A new memory store where there previously was none.
}

Once again, the cardinal rule of memory ordering is still followed. A single-threaded application would be none the wiser. But in a multithreaded environment, we now have a function which can wipe out any changes made concurrently to B in other threads – even when A is 0. The original code didn’t do that. This type of obscure, technical non-impossibility is part of the reason why people have been saying that C++ doesn’t support threads, despite the fact that we’ve been happily writing multithreaded and lock-free code in C/C++ for decades.

I don’t know anyone who ever fell victim to such “out-of-thin-air” stores in practice. Maybe it’s just because for the type of lock-free code we tend to write, there aren’t a whole lot of optimization opportunities fitting this pattern. I suppose if I ever caught this type of compiler transformation happening, I would search for a way to wrestle the compiler into submission. If it’s happened to you, let me know in the comments.

In any case, the new C++11 standard explicitly prohibits such behavior from the compiler in cases where it would introduce a data race. The wording can be found in and around §1.10.22 of the most recent C++11 working draft:

Compiler transformations that introduce assignments to a potentially shared memory location that would not be modified by the abstract machine are generally precluded by this standard.

Why Compiler Reordering?

As I mentioned at the start, the compiler modifies the order of memory interactions for the same reason that the processor does it – performance optimization. Such optimizations are a direct consequence of modern CPU complexity.

I may be going out on a limb, but I somehow doubt that compilers did a whole lot of instruction reordering in the early 80’s, when CPUs had only a few hundred thousand transistors at most. I don’t think there would have been much point. But since then, Moore’s Law has provided CPU designers with about 10000 times the number of transistors to play with, and those transistors have been spent on tricks such as pipelining, memory prefetching, ILP and more recently, multicore. As a result of some of those features, we’ve seen architectures where the order of instructions in a program can make a significant difference in performance.

The first Intel Pentium released in 1993, with its so-called U and V-pipes, was the first processor where I really remember people talking about pipelining and the significance of instruction ordering. More recently, though, when I step through x86 disassembly in Visual Studio, I’m actually surprised how little instruction reordering there is. On the other hand, whenever I’ve stepped through SPU disassembly on Playstation 3, I’ve found that the compiler really goes to town. These are just anecdotal experiences; they may not reflect the experience of others, and certainly should not influence the way we enforce memory ordering in our lock-free code.

Memory Barriers Are Like Source Control Operations


If you use source control, you’re on your way towards understanding memory ordering, an important consideration when writing lock-free code in C, C++ and other languages.

In my last post, I wrote about memory ordering at compile time, which forms one half of the memory ordering puzzle. This post is about the other half: memory ordering at runtime, on the processor itself. Like compiler reordering, processor reordering is invisible to a single-threaded program. It only becomes apparent when lock-free techniques are used – that is, when shared memory is manipulated without any mutual exclusion between threads. However, unlike compiler reordering, the effects of processor reordering are only visible in multicore and multiprocessor systems.

You can enforce correct memory ordering on the processor by issuing any instruction which acts as a memory barrier. In some ways, this is the only technique you need to know, because when you use such instructions, compiler ordering is taken care of automatically. Examples of instructions which act as memory barriers include (but are not limited to) explicit CPU fence instructions such as PowerPC’s lwsync, Win32 Interlocked operations, many operations on C++11 atomic types, and operations on POSIX mutexes such as pthread_mutex_lock.

Just as there are many instructions which act as memory barriers, there are many different types of memory barriers to know about. Indeed, not all of the above instructions produce the same kind of memory barrier – leading to another possible area of confusion when writing lock-free code. In an attempt to clear things up to some extent, I’d like to offer an analogy which I’ve found helpful in understanding the vast majority (but not all) of possible memory barrier types.

To begin with, consider the architecture of a typical multicore system: picture a device with two cores, each having 32 KiB of private L1 data cache. There’s 1 MiB of L2 cache shared between both cores, and 512 MiB of main memory.

A multicore system is a bit like a group of programmers collaborating on a project using a bizarre kind of source control strategy. For example, the above dual-core system corresponds to a scenario with just two programmers. Let’s name them Larry and Sergey.

In this analogy, the shared, central repository represents a combination of main memory and the shared L2 cache. Larry has a complete working copy of the repository on his local machine, and so does Sergey – these (effectively) represent the L1 caches attached to each CPU core. There’s also a scratch area on each machine, to privately keep track of registers and/or local variables. Our two programmers sit there, feverishly editing their working copy and scratch area, all while making decisions about what to do next based on the data they see – much like a thread of execution running on that core.

Which brings us to the source control strategy. In this analogy, the source control strategy is very strange indeed. As Larry and Sergey modify their working copies of the repository, their modifications are constantly leaking in the background, to and from the central repository, at totally random times. Once Larry edits the file X, his change will leak to the central repository, but there’s no guarantee about when it will happen. It might happen immediately, or it might happen much, much later. He might go on to edit other files, say Y and Z, and those modifications might leak into the repository before X gets leaked. In this manner, stores are effectively reordered on their way to the repository.

Similarly, on Sergey’s machine, there’s no guarantee about the timing or the order in which those changes leak back from the repository into his working copy. In this manner, loads are effectively reordered on their way out of the repository.

Now, if each programmer works on completely separate parts of the repository, neither programmer will be aware of these background leaks going on, or even of the other programmer’s existence. That would be analogous to running two independent, single-threaded processes. In this case, the cardinal rule of memory ordering is upheld.

The analogy becomes more useful once our programmers start working on the same parts of the repository. Let’s revisit the example I gave in an earlier post. X and Y are global variables, both initially 0:
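
Reconstructed as a sketch (the original post presents it as a side-by-side listing), the example looks like this, with thread1 running on Larry’s core and thread2 on Sergey’s:

int X = 0, Y = 0;
int r1, r2;

void thread1()        // Larry
{
    X = 1;
    r1 = Y;
}

void thread2()        // Sergey
{
    Y = 1;
    r2 = X;
}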

Think of X and Y as files which exist on Larry’s working copy of the repository, Sergey’s working copy, and the central repository itself. Larry writes 1 to his working copy of X and Sergey writes 1 to his working copy of Y at roughly the same time. If neither modification has time to leak to the repository and back before each programmer looks up his working copy of the other file, they’ll end up with both r1 = 0 and r2 = 0. This result, which may have seemed counterintuitive at first, actually becomes pretty obvious in the source control analogy.

Types of Memory Barrier

Fortunately, Larry and Sergey are not entirely at the mercy of these random, unpredictable leaks happening in the background. They also have the ability to issue special instructions, called fence instructions, which act as memory barriers. For this analogy, it’s sufficient to define four types of memory barrier, and thus four different fence instructions. Each type of memory barrier is named after the type of memory reordering it’s designed to prevent: for example, #StoreLoad is designed to prevent the reordering of a store followed by a load.

As Doug Lea points out, these four categories map pretty well to specific instructions on real CPUs – though not exactly. Most of the time, a real CPU instruction acts as some combination of the above barrier types, possibly in addition to other effects. In any case, once you understand these four types of memory barriers in the source control analogy, you’re in a good position to understand a large number of instructions on real CPUs, as well as several higher-level programming language constructs.

#LoadLoad

A LoadLoad barrier effectively prevents reordering of loads performed before the barrier with loads performed after the barrier.

In our analogy, the #LoadLoad fence instruction is basically equivalent to a pull from the central repository. Think git pull, hg pull, p4 sync, svn update or cvs update, all acting on the entire repository. If there are any merge conflicts with his local changes, let’s just say they’re resolved randomly.

Mind you, there’s no guarantee that #LoadLoad will pull the latest, or head, revision of the entire repository! It could very well pull an older revision than the head, as long as that revision is at least as new as the newest value which leaked from the central repository into his local machine.

This may sound like a weak guarantee, but it’s still a perfectly good way to prevent seeing stale data. Consider the classic example, where Sergey checks a shared flag to see if some data has been published by Larry. If the flag is true, he issues a #LoadLoad barrier before reading the published value:

if (IsPublished)                   // Load and check shared flag
{
    LOADLOAD_FENCE();              // Prevent reordering of loads
    return Value;                  // Load published value
}

Obviously, this example depends on having the IsPublished flag leak into Sergey’s working copy by itself. It doesn’t matter exactly when that happens; once the leaked flag has been observed, he issues a #LoadLoad fence to prevent reading some value of Value which is older than the flag itself.

#StoreStore

A StoreStore barrier effectively prevents reordering of stores performed before the barrier with stores performed after the barrier.

In our analogy, the #StoreStore fence instruction corresponds to a push to the central repository. Think git push, hg push, p4 submit, svn commit or cvs commit, all acting on the entire repository.

As an added twist, let’s suppose that #StoreStore instructions are not instant. They’re performed in a delayed, asynchronous manner. So, even though Larry executes a #StoreStore, we can’t make any assumptions about when all his previous stores finally become visible in the central repository.

This, too, may sound like a weak guarantee, but again, it’s perfectly sufficient to prevent Sergey from seeing any stale data published by Larry. Returning to the same example as above, Larry needs only to publish some data to shared memory, issue a #StoreStore barrier, then set the shared flag to true:

Value = x;                         // Publish some data
STORESTORE_FENCE();
IsPublished = 1;                   // Set shared flag to indicate availability of data

Again, we’re counting on the value of IsPublished to leak from Larry’s working copy over to Sergey’s, all by itself. Once Sergey detects that, he can be confident he’ll see the correct value of Value. What’s interesting is that, for this pattern to work, Value does not even need to be an atomic type; it could just as well be a huge structure with lots of elements.

#LoadStore

Unlike #LoadLoad and #StoreStore, there’s no clever metaphor for #LoadStore in terms of source control operations. The best way to understand a #LoadStore barrier is, quite simply, in terms of instruction reordering.

Imagine Larry has a set of instructions to follow. Some instructions make him load data from his private working copy into a register, and some make him store data from a register back into the working copy. Larry has the ability to juggle instructions, but only in specific cases. Whenever he encounters a load, he looks ahead at any stores that are coming up after that; if the stores are completely unrelated to the current load, then he’s allowed to skip ahead, do the stores first, then come back afterwards to finish up the load. In such cases, the cardinal rule of memory ordering – never modify the behavior of a single-threaded program – is still followed.
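
In code, the situation looks something like the following sketch; X and Y are just illustrative names for two unrelated shared variables:

int r = X;      // load from X
Y = 1;          // store to an unrelated location
// Without a #LoadStore barrier between them, the store may effectively
// become visible to other cores before the load has completed.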

On a real CPU, such instruction reordering might happen on certain processors if, say, there is a cache miss on the load followed by a cache hit on the store. But in terms of understanding the analogy, such hardware details don’t really matter. Let’s just say Larry has a boring job, and this is one of the few times when he’s allowed to get creative. Whether or not he chooses to do it is completely unpredictable. Fortunately, this is a relatively inexpensive type of reordering to prevent; when Larry encounters a #LoadStore barrier, he simply refrains from such reordering around that barrier.

In our analogy, it’s valid for Larry to perform this kind of LoadStore reordering even when there is a #LoadLoad or #StoreStore barrier between the load and the store. However, on a real CPU, instructions which act as a #LoadStore barrier typically act as at least one of those other two barrier types.

#StoreLoad

A StoreLoad barrier ensures that all stores performed before the barrier are visible to other processors, and that all loads performed after the barrier receive the latest value that is visible at the time of the barrier. In other words, it effectively prevents reordering of all stores before the barrier against all loads after the barrier, respecting the way a sequentially consistent multiprocessor would perform those operations.

#StoreLoad is unique. It’s the only type of memory barrier that will prevent the result r1 = r2 = 0 in the example given in Memory Reordering Caught in the Act; the same example I’ve repeated earlier in this post.
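
As a sketch, preventing that outcome means placing such a barrier between each thread’s store and its subsequent load. The STORELOAD_FENCE macro here is hypothetical; on a real machine it would expand to a full fence instruction such as mfence on x86/64 or sync on PowerPC.

void thread1()
{
    X = 1;
    STORELOAD_FENCE();
    r1 = Y;
}

void thread2()
{
    Y = 1;
    STORELOAD_FENCE();
    r2 = X;
}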

If you’ve been following closely, you might wonder: How is #StoreLoad different from a #StoreStore followed by a #LoadLoad? After all, a #StoreStore pushes changes to the central repository, while #LoadLoad pulls remote changes back. However, those two barrier types are insufficient. Remember, the push operation may be delayed for an arbitrary number of instructions, and the pull operation might not pull from the head revision. This hints at why the PowerPC’s lwsync instruction – which acts as all three #LoadLoad, #LoadStore and #StoreStore memory barriers, but not #StoreLoad – is insufficient to prevent r1 = r2 = 0 in that example.

In terms of the analogy, a #StoreLoad barrier could be achieved by pushing all local changes to the central repository, waiting for that operation to complete, then pulling the absolute latest head revision of the repository. On most processors, instructions that act as a #StoreLoad barrier tend to be more expensive than instructions acting as the other barrier types.

If we throw a #LoadStore barrier into that operation, which shouldn’t be a big deal, then what we get is a full memory fence – acting as all four barrier types at once. As Doug Lea also points out, it just so happens that on all current processors, every instruction which acts as a #StoreLoad barrier also acts as a full memory fence.

How Far Does This Analogy Get You?

As I’ve mentioned previously, every processor has different habits when it comes to memory ordering. The x86/64 family, in particular, has a strong memory model; it’s known to keep memory reordering to a minimum. PowerPC and ARM have weaker memory models, and the Alpha is famous for being in a league of its own. Fortunately, the analogy presented in this post corresponds to a weak memory model. If you can wrap your head around it, and enforce correct memory ordering using the fence instructions given here, you should be able to handle most CPUs.

The analogy also corresponds pretty well to the abstract machine targeted by both C++11 (formerly known as C++0x) and C11. Therefore, if you write lock-free code using the standard library of those languages while keeping the above analogy in mind, it’s more likely to function correctly on any platform.

In this analogy, I’ve said that each programmer represents a single thread of execution running on a separate core. On a real operating system, threads tend to move between different cores over the course of their lifetime, but the analogy still works. I’ve also alternated between examples in machine language and examples written in C/C++. Obviously, we’d prefer to stick with C/C++, or another high-level language; this is possible because again, any operation which acts as a memory barrier also prevents compiler reordering.

I haven’t written about every type of memory barrier yet. For instance, there are also data dependency barriers. I’ll describe those further in a future post. Still, the four types given here are the big ones.

If you’re interested in how CPUs work under the hood – things like store buffers, cache coherency protocols and other hardware implementation details – and why they perform memory reordering in the first place, I’d recommend the fine work of Paul McKenney & David Howells. Indeed, I suspect most programmers who have successfully written lock-free code have at least a passing familiarity with such hardware details.

Acquire and Release Semantics


Generally speaking, in lock-free programming, there are two ways in which threads can manipulate shared memory: They can compete with each other for a resource, or they can pass information co-operatively from one thread to another. Acquire and release semantics are crucial for the latter: reliable passing of information between threads. In fact, I would venture to guess that incorrect or missing acquire and release semantics is the #1 type of lock-free programming error.

In this post, I’ll demonstrate various ways to achieve acquire and release semantics in C++. I’ll touch upon the C++11 atomic library standard in an introductory way, so you don’t need to know it already. And to be clear from the start, the information here pertains to lock-free programming without sequential consistency. We’re dealing directly with memory ordering in a multicore or multiprocessor environment.

Unfortunately, the terms acquire and release semantics appear to be in even worse shape than the term lock-free, in that the more you scour the web, the more seemingly contradictory definitions you’ll find. Bruce Dawson offers a couple of good definitions (credited to Herb Sutter) about halfway through this white paper. I’d like to offer a couple of definitions of my own, staying close to the principles behind C++11 atomics:

Acquire semantics is a property which can only apply to operations which read from shared memory, whether they are read-modify-write operations or plain loads. The operation is then considered a read-acquire. Acquire semantics prevent memory reordering of the read-acquire with any read or write operation which follows it in program order.

Release semantics is a property which can only apply to operations which write to shared memory, whether they are read-modify-write operations or plain stores. The operation is then considered a write-release. Release semantics prevent memory reordering of the write-release with any read or write operation which precedes it in program order.

Once you digest the above definitions, it’s not hard to see that acquire and release semantics can be achieved using simple combinations of the memory barrier types I described at length in my previous post. The barriers must (somehow) be placed after the read-acquire operation, but before the write-release. [Update: Please note that these barriers are technically more strict than what’s required for acquire and release semantics on a single memory operation, but they do achieve the desired effect.]

What’s cool is that neither acquire nor release semantics requires the use of a #StoreLoad barrier, which is often a more expensive memory barrier type. For example, on PowerPC, the lwsync (short for “lightweight sync”) instruction acts as all three #LoadLoad, #LoadStore and #StoreStore barriers at the same time, yet is less expensive than the sync instruction, which includes a #StoreLoad barrier.

With Explicit Platform-Specific Fence Instructions

One way to obtain the desired memory barriers is by issuing explicit fence instructions. Let’s start with a simple example. Suppose we’re coding for PowerPC, and __lwsync() is a compiler intrinsic function which emits the lwsync instruction. Since lwsync provides so many barrier types, we can use it in the following code to establish either acquire or release semantics as needed. In Thread 1, the store to Ready turns into a write-release, and in Thread 2, the load from Ready becomes a read-acquire.
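
Reconstructed as a sketch (the original article presents this listing as an image, and the function names here are mine), the example looks something like this:

int A = 0;
int Ready = 0;

void thread1()
{
    A = 42;
    __lwsync();    // #LoadStore + #StoreStore: the following store becomes a write-release
    Ready = 1;
}

void thread2()
{
    int r1 = Ready;
    __lwsync();    // #LoadLoad + #LoadStore: the preceding load becomes a read-acquire
    int r2 = A;
}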

If we let both threads run and find that r1 == 1, that serves as confirmation that the value of A assigned in Thread 1 was passed successfully to Thread 2. As such, we are guaranteed that r2 == 42. In my previous post, I already gave a lengthy analogy for #LoadLoad and #StoreStore to illustrate how this works, so I won’t rehash that explanation here.

In formal terms, we say that the store to Ready synchronized-with the load. I’ve written a separate post about synchronizes-with here. For now, suffice to say that for this technique to work in general, the acquire and release semantics must apply to the same variable – in this case, Ready – and both the load and store must be atomic operations. Here, Ready is a simple aligned int, so the operations are already atomic on PowerPC.

With Fences in Portable C++11

The above example is compiler- and processor-specific. One approach for supporting multiple platforms is to convert the code to C++11. All C++11 identifiers exist in the std namespace, so to keep the following examples brief, let’s assume the statement using namespace std; was placed somewhere earlier in the code.

C++11’s atomic library standard defines a portable function atomic_thread_fence() which takes a single argument to specify the type of fence. There are several possible values for this argument, but the values we’re most interested in here are memory_order_acquire and memory_order_release. We’ll use this function in place of __lwsync().

There’s one more change to make before this example is complete. On PowerPC, we knew that both operations on Ready were atomic, but we can’t make that assumption about every platform. To ensure atomicity on all platforms, we’ll change the type of Ready from int to atomic<int>. I know, it’s kind of a silly change, considering that aligned loads and stores of int are already atomic on every modern CPU that exists today. I’ll write more about this in the post on synchronizes-with, but for now, let’s do it for the warm fuzzy feeling of 100% correctness in theory. No changes to A are necessary.
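
Reconstructed as a sketch (again, the original listing is an image), the fence-based version looks something like this:

#include <atomic>
using namespace std;

int A = 0;
atomic<int> Ready(0);

void thread1()
{
    A = 42;
    atomic_thread_fence(memory_order_release);
    Ready.store(1, memory_order_relaxed);
}

void thread2()
{
    int r1 = Ready.load(memory_order_relaxed);
    atomic_thread_fence(memory_order_acquire);
    int r2 = A;
}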

The memory_order_relaxed arguments above mean “ensure these operations are atomic, but don’t impose any ordering constraints/memory barriers that aren’t already there.”

Once again, both of the above atomic_thread_fence() calls can be (and hopefully are) implemented as lwsync on PowerPC. Similarly, they could both emit a dmb instruction on ARM, which I believe is at least as effective as PowerPC’s lwsync. On x86/64, both atomic_thread_fence() calls can simply be implemented as compiler barriers, since usually, every load on x86/64 already implies acquire semantics and every store implies release semantics. This is why x86/64 is often said to be strongly ordered.

Without Fences in Portable C++11

In C++11, it’s possible to achieve acquire and release semantics on Ready without issuing explicit fence instructions. You just need to specify memory ordering constraints directly on the operations on Ready:
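
Reconstructed as a sketch, using the same A and Ready as before:

void thread1()
{
    A = 42;
    Ready.store(1, memory_order_release);          // write-release
}

void thread2()
{
    int r1 = Ready.load(memory_order_acquire);     // read-acquire
    int r2 = A;
}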

Think of it as rolling each fence instruction into the operations on Ready themselves. [Update: Please note that this form is not exactly the same as the version using standalone fences; technically, it’s less strict.] The compiler will emit any instructions necessary to obtain the required barrier effects. In particular, on Itanium, each operation can be easily implemented as a single instruction: ld.acq and st.rel. Just as before, r1 == 1 indicates a synchronizes-with relationship, serving as confirmation that r2 == 42.

This is actually the preferred way to express acquire and release semantics in C++11. In fact, the atomic_thread_fence() function used in the previous example was added relatively late in the creation of the standard.

Acquire and Release While Locking

As you can see, none of the examples in this post took advantage of the #LoadStore barriers provided by acquire and release semantics. Really, only the #LoadLoad and #StoreStore parts were necessary. That’s just because in this post, I chose a simple example to let us focus on API and syntax.

One case in which the #LoadStore part becomes essential is when using acquire and release semantics to implement a (mutex) lock. In fact, this is where the names come from: acquiring a lock implies acquire semantics, while releasing a lock implies release semantics! All the memory operations in between are contained inside a nice little barrier sandwich, preventing any undesirable memory reordering across the boundaries.

Here, acquire and release semantics ensure that all modifications made while holding the lock will propagate fully to the next thread which obtains the lock. Every implementation of a lock, even one you roll on your own, should provide these guarantees. Again, it’s all about passing information reliably between threads, especially in a multicore or multiprocessor environment.
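
As an illustration (my own sketch, not code from the article), here’s a minimal spinlock built on a C++11 atomic, showing where the acquire and release semantics land:

#include <atomic>

class SpinLock
{
    std::atomic<int> m_flag;

public:
    SpinLock() : m_flag(0) {}

    void lock()
    {
        // Read-modify-write with acquire semantics: memory operations inside
        // the critical section can't move above this point.
        while (m_flag.exchange(1, std::memory_order_acquire) == 1)
        {
            // spin until the previous holder releases the lock
        }
    }

    void unlock()
    {
        // Store with release semantics: memory operations inside the critical
        // section can't move below this point.
        m_flag.store(0, std::memory_order_release);
    }
};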

In a followup post, I’ll show a working demonstration of C++11 code, running on real hardware, which can be plainly observed to break if acquire and release semantics are not used.

Weak vs. Strong Memory Models


There are many types of memory reordering, and not all types of reordering occur equally often. It all depends on the processor you’re targeting and/or the toolchain you’re using for development.

A memory model tells you, for a given processor or toolchain, exactly what types of memory reordering to expect at runtime relative to a given source code listing. Keep in mind that the effects of memory reordering can only be observed when lock-free programming techniques are used.

After studying memory models for a while – mostly by reading various online sources and verifying through experimentation – I’ve gone ahead and organized them into four categories, ordered from weakest to strongest: weak, weak with data dependency ordering, strong, and sequentially consistent. Each memory model makes all the guarantees of the weaker ones before it, plus some additional ones. I’ve drawn a clear line between weak memory models and strong ones, to capture the way most people appear to use these terms. Read on for my justification for doing so.

Each of these categories is typified by real hardware. A hardware memory model tells you what kind of memory ordering to expect at runtime relative to an assembly (or machine) code listing.

Every processor family has different habits when it comes to memory reordering, and those habits can only be observed in multicore or multiprocessor configurations. Given that multicore is now mainstream, it’s worth having some familiarity with them.

There are software memory models as well. Technically, once you’ve written (and debugged) portable lock-free code in C11, C++11 or Java, only the software memory model is supposed to matter. Nonetheless, a general understanding of hardware memory models may come in handy. It can help you explain unexpected behavior while debugging, and — perhaps just as importantly — appreciate how incorrect code may function correctly on a specific processor and toolchain out of luck.

Weak Memory Models

In the weakest memory model, it’s possible to experience all four types of memory reordering I described using a source control analogy in a previous post. Any load or store operation can effectively be reordered with any other load or store operation, as long as it would never modify the behavior of a single, isolated thread. In reality, the reordering may be due to either compiler reordering of instructions, or memory reordering on the processor itself.

When a processor has a weak hardware memory model, we tend to say it’s weakly-ordered or that it has weak ordering. We may also say it has a relaxed memory model. The venerable DEC Alpha is everybody’s favorite example of a weakly-ordered processor. There’s really no mainstream processor with weaker ordering.

The C11 and C++11 programming languages expose a weak software memory model which was in many ways influenced by the Alpha. When using low-level atomic operations in these languages, it doesn’t matter if you’re actually targeting a strong processor family such as x86/64. As I demonstrated previously, you must still specify the correct memory ordering constraints, if only to prevent compiler reordering.

Weak With Data Dependency Ordering

Though the Alpha has become less relevant with time, we still have several modern CPU families which carry on in the same tradition of weak hardware ordering:

  • ARM, which is currently found in hundreds of millions of smartphones and tablets, and is increasingly popular in multicore configurations.
  • PowerPC, which the Xbox 360 in particular has already delivered to 70 million living rooms in a multicore configuration.
  • Itanium, which Microsoft no longer supports in Windows, but which is still supported in Linux and found in HP servers.

These families have memory models which are, in various ways, almost as weak as the Alpha’s, except for one common detail of particular interest to programmers: they maintain data dependency ordering. What does that mean? It means that if you write A->B in C/C++, you are always guaranteed to load a value of B which is at least as new as the value of A. The Alpha doesn’t guarantee that. I won’t dwell on data dependency ordering too much here, except to mention that the Linux RCU mechanism relies on it heavily.
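
As a hypothetical illustration (the names here are mine), suppose another thread has published a node through a shared pointer:

struct Node { int payload; };
Node* g_node;    // set to point at a fully initialized Node by another thread

int reader()
{
    Node* n = g_node;         // load the pointer
    if (n)
        return n->payload;    // data-dependent load through that pointer
    return -1;                // not published yet
}

// On ARM, PowerPC and Itanium, the payload read is guaranteed to be at least
// as new as the pointer value itself; on the Alpha, an additional barrier
// would be needed between the two loads.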

Strong Memory Models

Let’s look at hardware memory models first. What, exactly, is the difference between a strong one and a weak one? There is actually a little disagreement over this question, but my feeling is that in 80% of the cases, most people mean the same thing. Therefore, I’d like to propose the following definition:

A strong hardware memory model is one in which every machine instruction comes implicitly with acquire and release semantics. As a result, when one CPU core performs a sequence of writes, every other CPU core sees those values change in the same order that they were written.

It’s not too hard to visualize. Just imagine a refinement of the source control analogy where all modifications are committed to shared memory in-order (no StoreStore reordering), pulled from shared memory in-order (no LoadLoad reordering), and instructions are always executed in-order (no LoadStore reordering). StoreLoad reordering, however, still remains possible.

Under the above definition, the x86/64 family of processors is usually strongly-ordered. There are certain cases in which some of x86/64’s strong ordering guarantees are lost, but for the most part, as application programmers, we can ignore those cases. It’s true that an x86/64 processor can execute instructions out-of-order, but that’s a hardware implementation detail – what matters is that it still keeps its memory interactions in-order, so in a multicore environment, we can still consider it strongly-ordered. Historically, there has also been a little confusion due to evolving specs.

Apparently SPARC processors, when running in TSO mode, are another example of a strong hardware ordering. TSO stands for “total store order”, which in a subtle way, is different from the definition I gave above. It means that there is always a single, global order of writes to shared memory from all cores. The x86/64 has this property too: See Volume 3, §8.2.3.6-8 of Intel’s x86/64 Architecture Specification for some examples. From what I can tell, the TSO property isn’t usually of direct interest to low-level lock-free programmers, but it is a step towards sequential consistency.

Sequential Consistency

In a sequentially consistent memory model, there is no memory reordering. It’s as if the entire program execution is reduced to a sequential interleaving of instructions from each thread. In particular, the result r1 = r2 = 0 from Memory Reordering Caught in the Act becomes impossible.

These days, you won’t easily find a modern multicore device which guarantees sequential consistency at the hardware level. However, it seems at least one sequentially consistent, dual-processor machine existed back in 1989: The 386-based Compaq SystemPro. According to Intel’s docs, the 386 wasn’t advanced enough to perform any memory reordering at runtime.

In any case, sequential consistency only really becomes interesting as a software memory model, when working in higher-level programming languages. In Java 5 and higher, you can declare shared variables as volatile. In C++11, you can use the default ordering constraint, memory_order_seq_cst, when performing operations on atomic library types. If you do those things, the toolchain will restrict compiler reordering and emit CPU-specific instructions which act as the appropriate memory barrier types. In this way, a sequentially consistent memory model can be “emulated” even on weakly-ordered multicore devices. If you read Herlihy & Shavit’s The Art of Multiprocessor Programming, be aware that most of their examples assume a sequentially consistent software memory model.
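
For instance, here’s a sketch of that r1 = r2 = 0 experiment rewritten with C++11 atomics using the default ordering constraint; under sequential consistency, that outcome can no longer occur:

#include <atomic>

std::atomic<int> X(0), Y(0);
int r1, r2;

void thread1()
{
    X.store(1);        // memory_order_seq_cst by default
    r1 = Y.load();
}

void thread2()
{
    Y.store(1);
    r2 = X.load();
}

// With the default constraint, at least one thread must observe the other's
// store, so r1 == 0 && r2 == 0 is impossible.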

Further Details

There are many other subtle details filling out the spectrum of memory models, but in my experience, they haven’t proved quite as interesting when writing lock-free code at the application level. There are things like control dependencies, causal consistency, and different memory types. Still, most discussions come back to the four main categories I’ve outlined here.

If you really want to nitpick the fine details of processor memory models, and you enjoy eating formal logic for breakfast, you can check out the admirably detailed work done at the University of Cambridge. Paul McKenney has written an accessible overview of some of their work and its associated tools.
