Jonathan Lam

Core Developer @ Hudson River Trading

Blog

Understanding memory

On 3/6/2023, 8:27:52 PM

I spent some time reviewing low-level computer-architecture and performance optimization topics recently, both to brush up on some of the topics from my work at Google, and for interview preparation as I continue my job search. Memory seems to be among the most pertinent of these.

For now I'll list some of the resources that I was reading, with a brief description of my learning. That being said, these are fantastic readings and I will probably take a long time to get a good handle on the topics they go into detail on, so I may not have a good grasp on them myself yet. I may also add documents later as I continue exploring this space.

Overview of memory in Linux

My exploration into memory in Linux began with this Drepper paper. It seems to be a highly recommended paper, also posted around the same time on LWN and mentioned in the Morin article linked below. This Drepper paper has instantly become one of my favorite papers of all time -- I admire the in-depth technical theory, the practical experiments that demonstrate the theory and further understanding, and the clarity of speech (which seems to owe a lot to Jonathan Corbet, a regular poster of fantastic LWN articles).

What every programmer should know about memory (Drepper, 2007): A fantastic overview of memory in Linux. This interleaves useful knowledge (designed for the programmer) about the hardware with practical examples running on commodity and server hardware. It dives deep into many relevant topics surrounding memory, especially but not limited to: memory hardware, caching, virtualization/paging, CPU and memory subsystem topology (e.g., NUMA factors), hardware/software prefetching, effects of SMT/MT/SMP, TLB/page fault optimization, etc. It's 15 years old but still very relevant: see this Stack Overflow post for more details on its modern relevance.
Low latency optimization: understanding huge pages (Morin, 2022): This was what brought me to the Drepper paper in the first place, as I was exploring the high frequency trading space. After reading the Drepper paper, huge pages make sense for many reasons, but it wasn't recommended at that time (2007). But memory capacity has increased since then and high-frequency trading is extremely performance-critical, so this article shows that this optimization is very useful with practical experiments.

Memory consistency and cache coherence

I wanted to study these problems because they are very relevant in low-level programming in general. Referencing the definitions provided by the Sorin et al. paper listed below, we can define these problems roughly as: consistency: the correctness of shared memory and the (observed) order of loads and stores; coherence: the correctness of unshared caches. Coherence is generally required to have consistency, as (generally) all loads and stores go through the cache.

I encountered these issues at my time in Android because we dealt with an incoherent memory subsystem and ARM has a relaxed memory ordering. It's not too useful in my everyday programming given that x86_64 tends to be strongly ordered and many memory subsystems are coherent, but it's good to know and understand¹.

Memory consistency in particular was especially confusing to me: it didn't seem to make sense that different CPUs could see the same memory writes in different orders. But it turns out that there are many different observers of memory operations: "source code order," program order, execution order, and "observed order" (by other CPUs). Source code order may differ from program order (the compiled assembly) due to compiler optimizations; program order may differ from execution order due to out-of-order execution (pipelining) optimizations; and execution order may differ from the order that other CPUs view the memory due to second-order effects such as a cache store buffer (see this Stack Overflow answer for more details). These optimizations can be disabled by providing stricter memory ordering hints (i.e., std::memory_order hints to operations on std::atomics, or directly calling memory fence instructions).

A primer on memory consistency and cache coherence (Sorin, Hill, and Wood, 2011): I haven't read the whole thing, but this provides a good outline on the titular problems.
Atomic's memory orders, what for? (Birbachar, 2017): A practical walkthrough of different memory orders supported by the C++ std::atomic<T> standard. I.e., explains relaxed vs. release/acquire vs. sequential consistency memory orders² with some practical examples. He also clearly explains the difference between sequence points, synchronization, and happens-before relationships, giving a pretty good overall view on synchronization (again, with practical examples in C++).

Synchronization (atomics and futexes)

This study of futexes was born out of a desire to learn more about how synchronization primitives and atomics (at least in C++) are developed. The very high-level summary of my findings is this: all synchronization primitives and atomics are based on atomic CPU instructions, especially Compare-And-Swap (CAS). (Note that this works by temporarily pausing the MESI protocol in the caches). std::atomic is directly built on top of this, as well as lock-less data structures.

When a thread may need to wait/sleep to synchronize, Linux provides the futex (Fast Userspace mutEXES) API³, which is a generalization of (and is used to implement both) mutexes and condition variables. Note that futexes still use CAS to manage state atomically, and may transfer control to the kernel if the thread needs to block. Futexes also require a kernel waitqueue implementation that cause a quick wakeup for waiting threads upon signalling a futex.

Now, C++'s std::atomic<T> is actually more powerful than simply providing atomic access to their contents (of type T). It turns out that synchronization atomics are much more useful for synchronization if they also provide memory ordering semantics⁴; this is explained in more detail by Birbachar's talk linked above.

Basics of futexes (Bendersky, 2018): A brief overview to the futex API. It's probably good to also read the futex (2) manpage, but I haven't gotten through that yet.
Futexes are tricky (Drepper, 2004): Referenced by the Bendersky article, this shows a basic implementation of a mutex using the futex interface. The implementation is indeed tricky.
Double-Checked Locking is Broken (Bacon et al., ~1999): While we're on the topic of locking, I wanted to read this paper recommended to me by my former tech lead at Google. This issue arises due to the JMM and the pre-Java1.5 inability to add explicit memory fences or control memory order like in C++. Note that this issue is fixed after Java1.5 added the volatile keyword, which provides a functionality similar to std::atomic with memory barriers by default. Like the above Drepper paper, this is a case of a non-trivial synchronization bug. It acts as a great brain-teaser!

Footnotes

1. Sorin et al. make the distinction that memory consistency is important for the programmer to get right when dealing with shared memory and concurrency, whereas coherence is usually a problem abstracted away by the hardware.

2. My reductive rundown of memory orders: relaxed memory order doesn't guarantee anything; release/acquire semantics orders memory operations w.r.t. a single std::atomic; and sequential consistency gives a total order for all std::atomics (that doesn't contradict the program order for any single thread).

3. Wherever you find an article detailing the motivation for futexes, the stated motivation for futexes is that they are faster than the previous POSIX mutex synchronization mechanism. This old mutex required a switch into kernel mode, which was expensive especially when the lock was uncontested and a single CAS operation would do. This explanation makes sense to me, but I can't really find any other information on these old POSIX mutexes for some reason; modern POSIX mutexes seem to be based on futexes.

4. By default, the default memory ordering for std::atomic load/stores is sequential consistency (std::memory_order::seq_cst). If we only needed purely atomic behavior and didn't care about the relative memory ordering of other variables, then we can use std::memory_order::relaxed.