(Recently) Software Engineer @ Google Silicon
(Also recently) EE/CS @ The Cooper Union
On 3/6/2023, 8:27:52 PM
I spent some time reviewing low-level computer-architecture and performance optimization topics recently, both to brush up on some of the topics from my work at Google, and for interview preparation as I continue my job search. Memory seems to be among the most pertinent of these.
For now I'll list some of the resources that I was reading, with a brief description of what I learned from each. That being said, these are fantastic readings and it will probably take me a long time to get a good handle on the topics they cover in depth, so I may not have a good grasp of them myself yet. I may also add documents later as I continue exploring this space.
My exploration into memory in Linux began with this Drepper paper. It seems to be a highly recommended paper, also posted around the same time on LWN and mentioned in the Morin article linked below. This Drepper paper has instantly become one of my favorite papers of all time -- I admire the in-depth technical theory, the practical experiments that demonstrate the theory and further understanding, and the clarity of the writing (which seems to owe a lot to Jonathan Corbet, a regular poster of fantastic LWN articles).
I wanted to study these problems because they are very relevant to low-level programming in general. Referencing the definitions provided by the Sorin et al. paper listed below, we can define these problems roughly as: consistency, the correctness of shared memory and the (observed) order of loads and stores; and coherence, the correctness of the (unshared) caches. Coherence is generally a prerequisite for consistency, since (generally) all loads and stores go through the cache.
I encountered these issues during my time on Android because we dealt with an incoherent memory subsystem, and ARM has a relaxed memory ordering. They're not too relevant in my everyday programming, given that x86_64 tends to be strongly ordered and most memory subsystems are coherent, but they're good to know and understand1.
Memory consistency in particular was especially confusing to me: it didn't seem to make sense that different CPUs could see the same memory writes in different orders. But it turns out that there are many different views of memory operations: "source code order," program order, execution order, and "observed order" (by other CPUs). Source code order may differ from program order (the compiled assembly) due to compiler optimizations; program order may differ from execution order due to out-of-order execution (pipelining) optimizations; and execution order may differ from the order in which other CPUs observe the memory operations due to second-order effects such as the store buffer in front of the cache (see this Stack Overflow answer for more details). These reorderings can be prevented by providing stricter memory ordering hints (i.e., std::memory_order hints to operations on std::atomics, or by directly issuing memory fence instructions).
Birbachar's talk gives a great rundown of the memory-ordering semantics of the std::atomic<T> standard: i.e., it explains relaxed vs. release/acquire vs. sequential consistency memory orders2 with some practical examples. He also clearly explains the difference between sequence points, synchronization, and happens-before relationships, giving a pretty good overall view of synchronization (again, with practical examples in C++).
This study of futexes was born out of a desire to learn more about how synchronization primitives and atomics (at least in C++) are developed. The very high-level summary of my findings is this: all synchronization primitives and atomics are built on atomic CPU instructions, especially Compare-And-Swap (CAS). (On modern hardware, these instructions are typically implemented by having the core hold the cache line exclusively, via the coherence (MESI) protocol, for the duration of the operation.) std::atomic is built directly on top of these instructions, as are lock-free data structures.
When a thread may need to wait/sleep to synchronize, Linux provides the futex ("fast userspace mutex") API3, which is a generalization of (and is used to implement both) mutexes and condition variables. Note that futexes still use CAS to manage state atomically, and only transfer control to the kernel if the thread needs to block. Futexes also rely on a kernel wait-queue implementation that allows waiting threads to be woken quickly when a futex is signalled.
std::atomic<T> is actually more powerful than simply providing atomic access to its contents (of type T). It turns out that atomics are much more useful for synchronization if they also provide memory ordering semantics4; this is explained in more detail by Birbachar's talk linked above.
This reading discusses the volatile keyword, which provides functionality similar to std::atomic with memory barriers by default. Like the above Drepper paper, this is a case of a non-trivial synchronization bug. It acts as a great brain-teaser!
1. Sorin et al. make the distinction that memory consistency is important for the programmer to get right when dealing with shared memory and concurrency, whereas coherence is usually a problem abstracted away by the hardware.
2. My reductive rundown of memory orders: relaxed memory order doesn't guarantee anything (beyond atomicity); release/acquire semantics orders memory operations w.r.t. a single std::atomic; and sequential consistency gives a total order over all std::atomics (one that doesn't contradict the program order of any single thread).
3. Wherever you find an article detailing the motivation for futexes, the stated motivation is that they are faster than the previous POSIX mutex synchronization mechanism, which required a switch into kernel mode on every operation -- expensive especially when the lock was uncontended and a single CAS operation would do. This explanation makes sense to me, but I can't really find much other information on these old POSIX mutexes for some reason; modern POSIX mutexes seem to be based on futexes.
4. The default memory ordering for std::atomic loads/stores is sequential consistency (std::memory_order::seq_cst). If we only need purely atomic behavior and don't care about the relative ordering of other memory operations, then we can use std::memory_order::relaxed.
© Copyright 2023 Jonathan Lam