The C semantics assume that (within a single thread) all loads and stores stay in order. That is, it is not allowed to reorder a store past a load of the same address.
In SSA, each argument of an instruction is a pointer to the instruction that produced it. These edges force serialization of the code.
We want to apply the same idea to loads and stores; this makes memory ordering explicit.
In Static Single Assignment (SSA) form, this is done with memory tokens: memory state variables that are threaded through every load and store.
load: dest = load address, memory_token
store: new_memory_token = store value, address, memory_token
Calls to functions that might modify memory also need to read and write memory tokens.
Treat a store as though it created a new copy of memory.
We can use phi functions on memory tokens at control-flow joins, just as for ordinary variables.
Maintaining Correct Memory Order: By tracking memory states explicitly in SSA form (through these memory tokens and versioning), SSA ensures that memory operations respect the correct order, even if the control flow of the program is complex. This helps compilers optimize code by making memory dependencies explicit.
If we can prove the load address is the same as a preceding store address, forward the stored value and remove the load.
If we can prove the load address is different, the load may be moved above the store.
Otherwise, leave the order unchanged; see the sketch below.
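A minimal sketch of the three rules at the C source level (the function and the names p, q, v are hypothetical, not from the notes):

// Assume p and q point to ints and may or may not alias.
void alias_rules(int *p, int *q, int v) {
    // Rule 1: the load provably aliases the preceding store,
    // so the compiler forwards the stored value and removes the load.
    *p = v;
    int a = *p;   // becomes: int a = v;

    // Rule 2: if the compiler can prove p != q,
    // the load may move above the store.
    *p = v;
    int b = *q;   // becomes: int b = *q; *p = v;

    // Rule 3: if aliasing is unknown, the memory-token edge
    // keeps the store and load in program order.
    (void)a; (void)b;
}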
multi-threaded programs
Compilers started out assuming targets are single-threaded. What optimizations change for multi-threaded code? How do users tell the compiler that the target is multi-threaded?
shared memory multi-threading
The most common parallel system:
a single big memory
multiple threads addressing that memory
an example
Pretend we have parallel programming in bril
setup:
zero: int = const 0;
one: int = const 1;
a: ptr<int> = alloc one;
b: ptr<int> = alloc one;
store a zero;
store b zero;
create two threads somehow
thread 1                 thread 2
store a one;             store b one;
bv: int = load b;        av: int = load a;
av  bv  allowed under sequential consistency (defined below)?
0   0   no
0   1   yes
1   0   yes
1   1   yes
what is sequential consistency (SC)?
Program Order is Maintained Within Threads:
Operations (reads and writes) appear to occur in the order they are issued by each individual thread. If a thread performs a write followed by a read, the read cannot appear to happen before the write in the execution.
Global Order of Operations Across Threads:
All threads see the effects of memory operations in the same sequential order. Every thread agrees on the order of reads and writes, though the specific order is not predefined; it just needs to be consistent across all threads.
Interleaving of Operations:
The execution can be viewed as an interleaving of instructions from all threads. However, the interleaving must follow the program order within each thread.
no real machine or compiler implements this
compiler effects
Compiler transformations that break multi-thread sequential consistency (SC) often reorder or optimize instructions in ways that do not respect the original program order seen by other threads. These transformations can lead to subtle bugs in multithreaded programs where the expected interleaving of operations is violated.
Load/Store Reordering
Transformation: Compilers might reorder loads and stores to improve performance.
Violation: In a multi-threaded environment, this can lead to a situation where one thread sees stale or unexpected data.
Example:
// x and y are shared and both start at 0
// Thread 1
x = 1; // Store
r1 = y; // Load
// Thread 2
y = 1; // Store
r2 = x; // Load
Under sequential consistency, if thread 1’s x = 1 happens before thread 2’s r2 = x, then thread 2 should observe r2 == 1. But reordering could result in thread 2 reading x as 0.
Common Subexpression Elimination (CSE)
Transformation: If a variable or expression is computed multiple times, the compiler may optimize by reusing the result of an earlier computation.
Violation: This assumes that no other thread modifies shared variables between these uses.
Example:
If x is modified by another thread between the two reads, the transformed code will incorrectly assume the value of x hasn't changed; a sketch follows.
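A sketch of the broken transformation (the shared variable x and the helper names are hypothetical):

int x = 0; // shared with another thread

void reader() {
    int r1 = x;
    // ... other work; another thread may store to x here ...
    int r2 = x; // CSE may rewrite this as: int r2 = r1;
    // That rewrite is only correct if nothing wrote x in between.
    (void)r1; (void)r2;
}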
Dead Code Elimination (DCE)
Transformation: The compiler may remove stores to variables that are not subsequently read in the same thread.
Violation: If the variable is shared and accessed by other threads, removing the store could lead to unexpected behavior.
Example:
// Original code
x = 1;
// Transformed code (DCE applied)
// x = 1 is removed because x is not used locally
If another thread reads x, it expects the store to have happened, but DCE breaks this assumption.
Speculative Execution (Out-of-Order Execution)
Transformation: Compilers (or hardware) may execute instructions speculatively, assuming certain branches are likely to be taken.
Violation: This can cause out-of-order writes or reads visible to other threads, breaking SC.
Example:
if (flag) {
r1 = x;
}
If the compiler speculatively reads x before knowing the value of flag, another thread’s write to x might be missed or observed out-of-order.
Loop Invariant Code Motion
Transformation: The compiler moves computations that are invariant inside a loop to outside the loop.
Violation: If these computations involve shared variables modified by other threads within the loop, moving them outside could make the code see stale values.
Example:
// Original code
while (condition) {
r = shared_variable;
}
// Transformed code (Loop Invariant Code Motion)
temp = shared_variable;
while (condition) {
r = temp;
}
If shared_variable is updated by another thread, the transformed code might keep using the old value.
Register Allocation (Caching Shared Variables in Registers)
Transformation: Compilers can keep a shared variable in a register for efficiency rather than repeatedly loading it from memory.
Violation: If another thread modifies that shared variable in memory, the compiler's register optimization would cause the thread to read stale data.
Example:
while (flag == 0) {
// busy-wait
}
If flag is cached in a register, updates to flag by another thread in memory won’t be reflected, breaking SC.
Instruction Fusion (Combining Loads/Stores)
Transformation: The compiler may combine consecutive memory accesses into one, such as merging adjacent stores into a single store or combining two loads.
Violation: If other threads expect these loads or stores to happen separately, they might see an inconsistent view of memory.
Example:
// Original code
x = 1;
y = 2;
// Transformed code (store fusion)
// x and y are stored together in a single transaction
A thread expecting x and y to be updated separately might observe an inconsistent state if this transformation is applied.
volatile
volatile turns off some optimizations, so compiler writers don't like it; there are lots of bugs in compilers around volatile.
The spec is confusing; developers think it means things that are not in the spec.
It is best for memory-mapped hardware, not good for inter-thread communication.
Memory is allowed to tear (see the sketch after this list). No synchronization is provided.
Undefined: bitfields, unions, pre/post increment (not atomic), big objects (more than one transfer operation).
volatile is like const: it does not apply in constructors/destructors, and it does not affect calling conventions or returns.
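A sketch of tearing (assuming a 32-bit target, where a 64-bit access takes two transfer operations; the names are made up):

volatile long long big = 0; // a 64-bit object

void writer() {
    // On a 32-bit target this is compiled as two 32-bit stores.
    // A concurrent reader can see one half written and the other not:
    // volatile gives no atomicity, only constraints on the compiler.
    big = 0x1122334455667788LL;
}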
int i, j;
volatile int vol;
i = vol = j; // does the assignment to i re-read vol, or reuse j? the spec is unclear
volatile int buffer_ready;
char buffer[BUF_SIZE];
void buffer_init() {
int i;
for (i=0; i<BUF_SIZE; i++)
buffer[i] = 0;
buffer_ready = 1;
}
// the compiler may legally move the (non-volatile) buffer assignments below the volatile store to buffer_ready, since volatile only orders volatile accesses
In C++, using volatile for inter-thread communication is a data race, and a racy program is not a valid program, so the compiler is free to treat it any way it wants.
thread libraries
Start out assuming the target is single-threaded, then add a threads library like pthreads. Problems with this approach:
multiple threads could access shared memory simultaneously, leading to race conditions, inconsistent data, and undefined behavior.
Modern CPUs and compilers perform optimizations like instruction reordering, which can break assumptions about the order of memory operations in multithreaded programs.
Multithreaded code is harder to test because race conditions and bugs might only manifest under certain timing conditions.
Debugging multithreaded programs is more difficult due to the unpredictable nature of thread execution and interactions.
Some optimizations might reorder instructions in a way that is incompatible with multithreading, introducing subtle bugs or performance regressions.
Caching, prefetching, or other memory optimizations need to account for the fact that multiple threads may be accessing the same memory, which a simple thread library does not handle.
using libraries
Functions such as pthread_mutex_lock() that are guaranteed by the standard to "synchronize memory" include hardware instructions ("memory barriers") that prevent hardware reordering of memory operations around the call.
To prevent the compiler from moving memory operations around calls to functions such as pthread_mutex_lock(), they are essentially treated as calls to opaque functions, about which the compiler has no information.
The compiler effectively assumes that pthread_mutex_lock() may read or write any global variable. Thus a memory reference cannot simply be moved across the call. This approach also ensures that transitive calls, e.g. a call to a function f() which then calls pthread_mutex_lock(), are handled in the same way more or less appropriately, i.e. memory operations are not moved across the call to f() either, whether or not the entire user program is being analyzed at once.
example 2
for (...) {
...
if (mt) pthread_mutex_lock(...);
x = ... x ...
if (mt) pthread_mutex_unlock(...);
}
becomes
r = x;
for (...) {
...
if (mt) {
x = r; pthread_mutex_lock(...); r = x;
}
r = ... r ...
if (mt) {
x = r; pthread_mutex_unlock(...); r = x;
}
}
x = r;
The promotion adds reads and writes of x outside the lock (before the loop and after it), introducing accesses to x that the original program did not have.
adding multi-threading to the language: letting users explain their intent
C/C++ added atomics
Atomic operations are operations that are completed as a single, uninterruptible action. No other thread can observe a partial update or interfere with the operation.
These operations ensure that read-modify-write sequences are safe without needing explicit locks.
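For instance, a hedged sketch of a lock-free counter built on an atomic read-modify-write (the names are made up):

#include <atomic>

std::atomic<int> counter{0};

void worker() {
    // fetch_add is a single indivisible read-modify-write:
    // no other thread can observe or interleave with a partial update,
    // so no explicit lock is needed.
    counter.fetch_add(1);
}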
an example
#include <atomic>
#include <iostream>
#include <thread>
// Global spinlock using atomic_flag
std::atomic_flag lock = ATOMIC_FLAG_INIT;
void enter_critical_section() {
// Busy-wait (spin) until the lock is acquired
while (lock.test_and_set(std::memory_order_acquire)) {
// Spin and wait for the lock to become available
}
}
void leave_critical_section() {
// Release the lock
lock.clear(std::memory_order_release);
}
// Shared resource
int shared_counter = 0;
void critical_section_task(int num_increments) {
for (int i = 0; i < num_increments; ++i) {
enter_critical_section();
// Begin critical section
++shared_counter;
// End critical section
leave_critical_section();
}
}
load acquire (needs special hardware)
Used by default with atomics, but not for non-atomics.
All memory reads and writes after the load operation cannot be moved before the load. This ensures that after acquiring the value, any operations that depend on it (like accessing shared data) will see consistent and up-to-date memory.
A one-way fence: nothing below can move up above the load.
write release (needs special hardware)
Prevents the compiler or processor from moving any memory operations (reads or writes) that appear before the release store to after it. This guarantees that all modifications to shared data made before the release are visible to other threads that subsequently perform an acquire operation.
Also a one-way fence: nothing above can move down below the store.
load.acquire
    ... loads and stores on non-atomics: the compiler picks the order of these operations ...
store.release
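A minimal message-passing sketch under these rules (the names data and ready are hypothetical):

#include <atomic>

int data = 0;                   // non-atomic payload
std::atomic<bool> ready{false};

void producer() {
    data = 42;                                    // ordinary store
    ready.store(true, std::memory_order_release); // nothing above may move below this
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) { /* spin */ }
    // Nothing below may move above the acquire; the acquire load
    // synchronizes with the release store, so the write to data is visible.
    int r = data;
    (void)r;
}

Note that the non-atomic accesses between an acquire and a release can still be reordered among themselves; the one-way fences only stop them from escaping the bracket.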
using atomics
All operations appear to occur in a single total order that is consistent across all threads, as if all operations were executed in some sequential order. This is the default for C++ atomics (std::memory_order_seq_cst).
This limits the hardware and the compiler because it prevents reordering.
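A sketch of the earlier a/b example using default (sequentially consistent) atomics; thread creation is omitted, and with seq_cst the av == 0, bv == 0 outcome from the table above is ruled out:

#include <atomic>

std::atomic<int> a{0}, b{0};
int av = 0, bv = 0;

void thread1() {
    a.store(1); // defaults to std::memory_order_seq_cst
    bv = b.load();
}

void thread2() {
    b.store(1);
    av = a.load();
}
// All four operations fall into one total order; whichever store comes
// first precedes the other thread's load, so at least one load returns 1
// and av == 0 && bv == 0 cannot happen.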
Data Race Free
Data Race Free (DRF) means that a program is free from data races, which occur when:
Two or more threads access the same variable concurrently.
At least one of the accesses is a write.
There is no synchronization mechanism (like mutexes or atomic operations) to control the access.
In a data-race-free program, every shared variable is accessed in a way that ensures predictable results. C++ provides various synchronization primitives (such as mutexes and atomic types) to help developers write DRF code.
All shared variables must be accessed using synchronization to prevent concurrent threads from modifying shared data simultaneously without coordination.
an example
#include <iostream>
#include <atomic>
#include <thread>
int shared_counter1 = 0; // First non-atomic shared variable
int shared_counter2 = 0; // Second non-atomic shared variable
std::atomic<bool> lock_flag(false); // Atomic flag to control access
void safe_increment() {
for (int i = 0; i < 1000; ++i) {
// Spin until the lock is acquired
while (lock_flag.exchange(true)) {
// Busy-wait (spin) until the lock is free
}
// Critical section: update the non-atomic shared variables
++shared_counter1;
++shared_counter2;
// Release the lock
lock_flag.store(false);
}
}
language rules
C and C++
do not define what happens in the presence of data races. If a program has data races (e.g., multiple threads concurrently reading and writing to the same variable without synchronization), the behavior is considered undefined. This means that the program may produce unexpected results, crash, or behave inconsistently across different executions or platforms.
Java
tries to define what happens, but the definition is very complex and possibly inconsistent.
Rust
Compile-Time Guarantees: Rust’s ownership and borrowing system prevents data races at compile time. If a program is not DRF, the Rust compiler will typically refuse to compile it, enforcing memory safety guarantees.
can the compiler add a race to a DRF program?
New rule: the compiler cannot add a write to a shared variable. For example, transforming
if (x == 1) y++;
into
y++;
if (x != 1) y--;
is not allowed: the transformed code writes y even when x != 1, racing with any other thread that accesses y.
how does this affect hardware?
struct { char a; char b; char c; char d; } s;
s.a = 1;
s.c = 3;
can a compiler do
char temp[4] = s; // load 32 bits
temp[0] = 1;
temp[2] = 3;
s = temp; // store 32 bits
not allowed - the transformed code reads and writes b and d, so the compiler has incorrectly added writes
The options are either byte-addressable hardware, or padding so that each char gets its own 32 bits (sketch below).
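A sketch of the padding option in C++ (alignas gives each char its own 32-bit slot; whether a compiler would need this depends on the target):

struct Padded {
    alignas(4) char a; // each field starts on its own 4-byte boundary,
    alignas(4) char b; // so a word-sized read or write of one field
    alignas(4) char c; // cannot touch the bytes of its neighbors
    alignas(4) char d;
};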