1. Thanks
Our thanks go to the many people who have contributed to this discussion over many years. Hans Boehm, Soham Chakraborty, David Goldblatt, Chung-Kil Hur, Alan Jeffrey, Ori Lahav, Paul McKenney, Scott Owens, Anouk Paradis, Marco Paviotti, William Pugh, James Riely, Jaroslav Ševčík, Ben Simner, Viktor Vafeiadis, Daniel Wright, and others!
2. Introduction
This paper is a step in integrating semantic dependency (sdep) into the C++ standard in order to fix the longstanding thin-air problem. It explains sdep with examples and then poses questions to SG1 members on the finer details of C++ compiler behaviour.
sdep is intended to capture a minimum set of ordering implicit in the data, control and address dependencies of a program -- sdep is made up of only the dependencies that must be preserved by an optimising C++ compiler. [P1780] (Kona 2022) provided standards text introducing sdep as an implementation-defined relation, restricted by a technical report, auxiliary to the standard. Adding sdep is a small change to the standard that fixes a deep problem, avoiding a rewrite of the concurrency specification. We hope this paper will lead to that auxiliary technical report by defining cases where there must be sdep, and the compiler may not optimise, and cases where the compiler must be free to optimise, so there can be no sdep.
We use small example programs with defined outcomes, called litmus tests, to introduce sdep, and then to discuss the corner cases where the C++ standard should limit the compiler's optimiser to preserve program order. By convention, locations in litmus tests are atomic and memory accesses are relaxed unless otherwise stated.
Target architectures like ARM and POWER allow relaxed concurrent behaviours that cannot be explained by any sequential scheduling of the memory events in the program. The following litmus test, called Load Buffering (LB), can produce the outcome r1 = 1 ∧ r2 = 1 when translated directly to POWER and ARM targets, so C++ relaxed atomics must be allowed to witness this same behaviour. Similarly, optimisers can reorder events, permitting this behaviour even on targets whose architectures forbid the relaxed behaviour.
2.1. Load buffering
// Init: x := 0; y := 0;
// Thread 0:
r1 := x;
y := 1
// Thread 1:
r2 := y;
x := 1
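For concreteness, here is one way LB might be written against the C++ atomics interface, with every access relaxed. This is our sketch rather than part of the test, and the names thread0 and thread1 are ours.

#include <atomic>

std::atomic<int> x{0}, y{0};
int r1, r2;

void thread0() {
  r1 = x.load(std::memory_order_relaxed);   // r1 := x
  y.store(1, std::memory_order_relaxed);    // y := 1
}

void thread1() {
  r2 = y.load(std::memory_order_relaxed);   // r2 := y
  x.store(1, std::memory_order_relaxed);    // x := 1
}

Run on separate threads over a POWER or ARM target, this admits the relaxed outcome r1 = 1 ∧ r2 = 1 described above.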
The concurrent behaviour of a program is defined by the C++ memory model as a set of executions -- graphs of the memory events performed in one path of the program’s control flow. The memory reads, writes and fences make up the nodes, the black edges represent sequenced before, sb, and the red edges represent reads from, rf, indicating which write each read is reading from. The memory model forbids executions that have the wrong kind of cycles in these relations, but because we must allow the relaxed outcome in the LB test, it cannot forbid executions containing a cycle in sb and rf as below.
There is a flaw in the existing C++ standard where, in an attempt to allow aggressive compiler optimisation, the language definition goes too far, allowing programs to produce values out of thin air. The test below, LB+datas, is a variant of LB where the values written are taken from the values read on each thread.
2.2. Load buffering with data dependencies
// Init: x := 0; y := 0;
// Thread 0:
r1 := x;
y := r1
// Thread 1:
r2 := y;
x := r2
The standard allows the outcome r1 = 1 ∧ r2 = 1, or even r1 = 42 ∧ r2 = 42, despite the only writes in the program being of value 0: the values 1 and 42 come out of thin air. This is a problem: no combination of compiler optimisations and quirks of the target architecture admits this behaviour, and allowing it is known to make it impossible to prove C++ code correct -- the standard must be changed to forbid this outcome [Problem].
The curved arrows in the executions are dependency edges: they reflect the fact that the values written by the writes are data dependent on the values read. In the previous test, LB, there were no dependency edges because concrete values were written, making the writes independent of the reads. By forbidding cycles in dependency and rf, we could forbid the relaxed outcome here, but keep the outcome in the LB test, just as we need to. Note that this choice restricts the compiler: the specification requires the compiler to keep the accesses in order in LB+datas. That is not an onerous restriction: we do not expect the compiler to optimise here.
The next test demonstrates that conditional statements can create dependencies too.
2.3. Load buffering with control dependency
// Init: x := 0; y := 0;
// Thread 0:
r1 := x;
if (r1 == 1) { y := r1 }
// Thread 1:
r2 := y;
if (r2 == 1) { x := 1 }
Once again, the execution where r1 = 1 ∧ r2 = 1 is allowed by the current standard, but it should not be. Again, it can be ruled out by forbidding cycles in dependency and rf.
It is not enough to consider only the syntactic structure of the program when calculating dependencies. In the following test, a value loaded from memory is used in a conditional, but because a write of value 1 to x occurs regardless, the compiler may optimise this conditional away.
2.4. Load buffering with false-dependency
// Init: x := 0; y := 0;
// Thread 0:
r1 := x;
if (r1 == 1) { y := r1 }
// Thread 1:
r2 := y;
if (r2 == 1) { x := 1 } else { x := 1 }
The standard must allow the optimisation. Semantic dependency, sdep, is a relation that captures just the dependencies that should not be optimised away. In particular, there is no sdep from the load of y to the store of x in this example. There is no cycle in sdep and rf, so the following execution is allowed, as required. In all execution graphs the orange arrows represent sdep.
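As an illustration (our sketch, not proposed wording), after the optimisation Thread 1 of LB+false-dependency amounts to an unconditional store; we assume x and y are relaxed std::atomic<int> as in the C++ sketch of LB above, and the function name is ours.

void thread1_optimised() {
  int r2 = y.load(std::memory_order_relaxed);  // r2 := y
  x.store(1, std::memory_order_relaxed);       // both branches wrote 1, so the conditional is removed
}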
2.5. Java Hotspot
While the audience for this paper is SG1, it is useful to consider other programming languages that have grappled with similar problems. Java has a complex memory consistency model, and its designers too have had to consider what dependency means in their language. The following program [Sevcik] is an example where the Java concurrency specification fails to allow all of Hotspot's optimisations.
// Init: x := 0; y := 0;
// Thread 0:
r1 := x;
y := r1
// Thread 1:
r2 := y;
if (r2 == 1) { r3 := y; x := r3 } else { x := 1 }
Hotspot will optimise Thread 1 as follows: the load of y in the true branch can take its value from the prior load of y; in the true branch, that load must be of value 1; the true branch now performs a concrete write of 1 to x, just as the false branch does, so there is no need for the conditional; Thread 1 simply loads y into r2 and r3 and writes 1 to x. Now the optimised program can produce the outcome r1 = 1 ∧ r2 = 1 on architectures that exhibit relaxed behaviour in the LB test.
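Putting those steps together, a sketch of the optimised Thread 1 (ours; x and y are relaxed std::atomic<int> as before, and the function name is ours):

void thread1_optimised() {
  int r2 = y.load(std::memory_order_relaxed);  // r2 := y
  int r3 = r2;                                 // the second load of y is forwarded from r2
  x.store(1, std::memory_order_relaxed);       // both branches write 1, so the conditional is gone
}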
Rather than describe the list of optimisations that is performed, sdep is more abstract, describing the dependencies left in each execution after optimisation. The optimisations described above impact the execution below, where both loads of y on Thread 1 read from the write of value 1 on Thread 0. There is no semantic dependency from the load of y to the write of x because the series of optimisations performed by Hotspot can remove it. The execution with outcome r1 = 1 ∧ r2 = 1 is then allowed because there is no cycle in sdep and rf.
2.6. Java Causality Test Case 18
There is an established set of litmus tests for Java that analyses the effects of dependencies on the memory consistency model, called the Java Causality Tests. Test case 18 highlights the impact of global value-range analysis.
// Init: x := 0; y := 0;
// Thread 0:
r3 := x;
if (r3 == 0) { x := 42 };
r1 := x;
y := r1
// Thread 1:
r2 := y;
x := r2
A global analysis could determine that the only writes to x in this program are of values 0 and 42. Then on Thread 0, either the initial load of x sees value 42, or the conditional write of x sets it to 42, and in any case, the value of r1 is the concrete value 42. Global value range analysis causes the compiler to break the sdep between the loads of x and the store of y.

In the execution above, the absence of sdep on Thread 0 means that there is no cycle in sdep and rf, and the outcome where r1 = 42 ∧ r2 = 42 is allowed.
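A sketch of Thread 0 after such an analysis (ours; the accesses remain relaxed atomics, and the function name is ours), in which the store to y no longer depends on either load of x:

void thread0_optimised() {
  int r3 = x.load(std::memory_order_relaxed);            // r3 := x
  if (r3 == 0) x.store(42, std::memory_order_relaxed);   // conditional write unchanged
  int r1 = 42;                                            // value-range analysis: r1 is always 42
  y.store(42, std::memory_order_relaxed);                 // y := r1 becomes a constant store
}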
We believe this optimisation should be allowed. Does SG1 agree?
These introductory tests show examples where sdep should, and should not, be present, and highlight the existence of cases where we need to ask SG1 to be sure. There is a selection of tests where this choice is less clear-cut and we are interested in SG1's opinion; we discuss these next.
3. Controversial Tests
3.1. Goldblatt alignment
First, we will look at an example from previous discussions with SG1 committee members when looking at [P1780]. David Goldblatt pointed out that the compiler may enforce additional constraints on the values used in a program, e.g. aligning objects, and then optimisation may take advantage of these constraints, as in the following example.
struct S { float vectorData[8]; };
S s0; S s1; S s2;
atomic<S*> x{&s0};
atomic<S*> y{&s1};
S* r1; S* r2;

void t1() {
  r1 = x.load(rlx);
  if ((uintptr_t)r1 % 32 == 0) {
    y.store(&s2);
  }
}

void t2() {
  r2 = y.load(rlx);
  if ((uintptr_t)r2 % 32 == 0) {
    x.store(&s2);
  }
}
Here t1 and t2 are to be run concurrently. The compiler may have chosen to align all S objects to 32 bytes. With this constraint, the compiler can remove the conditions in t1 and t2: pointers to S-objects will always be 32-byte aligned. Therefore, there must not be sdep edges from the loads to the stores in t1 and t2. If this optimisation is allowed, the behaviour allowed by the definition of concurrency is necessarily implementation-defined.
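Under that 32-byte alignment choice, a sketch of what the optimiser might produce (ours, reusing the declarations from the listing above; the _optimised names are ours):

void t1_optimised() {
  r1 = x.load(rlx);
  y.store(&s2);      // (uintptr_t)r1 % 32 == 0 always holds for 32-byte-aligned S objects
}

void t2_optimised() {
  r2 = y.load(rlx);
  x.store(&s2);      // condition removed for the same reason
}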
We believe compilers do create constraints on values like this, and then optimise in the context of these choices, so there can be no sdep. Does the committee agree?
3.2. RFUB
Continuing with examples from SG1 members, in [P1217R1] Hans Boehm has discussed programs that appear to read from untaken branches.
// Init: x := 0; y := 0;
// Thread 0:
r1 := x;
y := r1
// Thread 1:
z := 0;
r2 := y;
if (¬(r2 == 42)) { z := 1; r2 := 42 };
x := r2;
r3 := z
In this program, for similar reasons to the Java Hotspot example, a compiler could reason that, no matter what, r2 is assigned 42 at the end of Thread 1: if Thread 1 did not read 42 from y into r2, then in the conditional, it will set r2 to 42. Optimisation would not remove the conditional entirely: the write of 1 to z is still dependent on the value read from y. Removing the dependency from the load of y to the store of x yields executions where we observe a store of 42 to x, but not the store of 1 to z.
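A sketch of the transformed Thread 1 (ours; all locations are relaxed atomics per the litmus convention, and the function name is ours): the store to x becomes a constant store of 42, while the store of 1 to z keeps its dependency on the load of y.

void thread1_optimised() {
  z.store(0, std::memory_order_relaxed);
  int r2 = y.load(std::memory_order_relaxed);
  if (r2 != 42) z.store(1, std::memory_order_relaxed);  // still dependent on the value read from y
  x.store(42, std::memory_order_relaxed);               // r2 is 42 on every path, so the store is constant
  int r3 = z.load(std::memory_order_relaxed);           // r3 := z
}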
Does allowing this execution, and choosing this set of sdep edges, match the intention of SG1?
3.3. OOTA4
This example has an execution that is contentious among concurrent models of dependency [PwT]. The program has apparent data dependencies from the load of y to the store of x on Thread 0 and from the load of x to the store of y on Thread 2.
// Init: x := 0; y := 0; z := 0; b := 0;
// Thread 0:
r2 := y;
r3 := b;
if (¬(r3 == 0)) { x := r2; z := r2 } else { x := 1 }
// Thread 1:
b := 1
// Thread 2:
r1 := x;
y := r1
In the following execution, the rf edges over x and y form a cycle if data dependencies are recognised as sdep edges. In this execution, the read of b reads from the write on Thread 1. We believe that these sdep edges are correct and the execution should be forbidden.
The explanation of why this might be reasonable is rather different to the workings of the C++ concurrency model. It hinges on changing the write that the read on Thread 2 gets its value from, from the else branch on Thread 0 to the true branch. In this pattern of reasoning, we note that the read of b on Thread 0 could read 0 from the initialisation, leading to the else branch and a write of 1 to x. Then Thread 2 can read 1 from x and write 1 to y. Now we reconsider the execution of Thread 0, changing the read of b so that it reads 1 from Thread 1. Correspondingly, we change the control flow of Thread 0 so that it follows the true branch and writes 1 to x and z, allowing the execution above.
We do not have a sequence of optimisations that leads to this behaviour; we believe this execution should be forbidden, and the sdep edges should be as shown above. Does SG1 agree?
3.4. CohCYC
This example also has an execution that is contentious among concurrent models of dependency [WeakestMO]. The program has apparent control dependencies from the load of x to the store of y on Thread 0 and from the load of y to the store of x on Thread 1.
// Init: x := 0; y := 0;
// Thread 0:
x := 2;
r1 := x;
if (¬(r1 == 2)) { y := 1 }
// Thread 1:
x := 1;
r2 := x;
r3 := y;
if (¬(r3 == 0)) { x := 3 }
In the execution of concern, Thread 1 reads the value 2 from x, and then follows a typical control-dependent load from y and store to x. The explanation of why this might be reasonable hinges on changing the write that satisfies the condition on Thread 0. Thread 0 will write 1 to y if it reads 1 from the write of x on Thread 1. Then, if we read 1 from y on Thread 1, 3 will be written to x. Now, so the reasoning goes, reading 3 from x on Thread 0 will also satisfy the conditional and lead to the write of y, so we can swap to reading from the write of 3 to x.
We do not have a sequence of optimisations that leads to this behaviour; we believe this execution should be forbidden, and the sdep edges should be as shown above. Does SG1 agree?
3.5. Freeze Example
The trickiest example program and optimisation sequence we have is from Ori Lahav and Chung-Kil Hur.
// Init: k := 0; x := 0; y := 0; z := 0; boom := 0;
// Thread 0:
r1 := z;
x := r1
// Thread 1:
r2 := k.load(acq);
if (r2 == 42) {
  r3 := x;
  if (r3 == 1) { z := r3; boom := 1 }
  else { r4 := y.load(na); z := r4 }
}
// Thread 2:
y.store(1, na, normal);
k.store(42, rel, normal)
In this test, we could reason that an execution that stores 1 to boom could be allowed.

This depends on the sdep edges that we apply in the following execution. Note that we include enough sdep edges here to forbid the behaviour. We will go on to discuss a series of optimisations that may or may not be reasonable. If the optimisation sequence is reasonable, it would lead to removing one of the sdep edges and allowing the behaviour.
The optimisation sequence hinges on the use of a new freeze operation in LLVM IR that allows the compiler to make assumptions about the values read by non-atomics. Use of freeze can produce LLVM poison values, whose values are indeterminate, and can lead to undefined behaviours. We would like the SG1 opinion on the freeze optimisation.
The cycle in sdep and rf that forbids this outcome is over the load of z and the store of x on Thread 0, and the load of x and the store of z on Thread 1. The question is whether there should be an sdep edge from the load of x to the store of z on Thread 1.
The series of optimisations that might remove that sdep edge and permit this behaviour relies on global analysis. The optimisations are all on Thread 1, and we repeat the transformed program after each step. To start with, Thread 1 is as follows.
// Thread 1:
r2 := k.load(acq);
if (r2 == 42) {
  r3 := x;
  if (r3 == 1) { z := r3; boom := 1 }
  else { r4 := y.load(na); z := r4 }
}
Load introduction
We introduce a non-atomic load of y into r5. This step relies on global knowledge: introducing a non-atomic load in the general case may introduce a race, but here a global analysis could infer that the only other access of y, on Thread 2, is not racy because the k.load(acq) on Thread 1 guards against it, ensuring happens-before if the read is run.
// Thread 1:
r2 := k.load(acq);
if (r2 == 42) {
  r5 := y.load(na);
  r3 := x;
  if (r3 == 1) { z := r3; boom := 1 }
  else { r4 := y.load(na); z := r4 }
}
Load forwarding
We replace the non-atomic load of y in the else branch with r5 from our introduced non-atomic load.
// Thread 1:
r2 := k.load(acq);
if (r2 == 42) {
  r5 := y.load(na);
  r3 := x;
  if (r3 == 1) { z := r3; boom := 1 }
  else {
    r4 := r5; // load forward
    z := r4
  }
}
LLVM trace preserving transformation
We freeze the non-atomic load of y at value 1. This adds a branch on freeze(r5 == 1). In the true case of the branch, the compiler can optimise with the knowledge that r5 == 1, but in the else branch, r5 carries a poison value that may lead to undefined behaviour.
// Thread 1:
r2 := k.load(acq);
if (r2 == 42) {
  r5 := y.load(na);
  if (freeze(r5 == 1)) { // freeze
    r3 := x;
    if (r3 == 1) { z := r3; boom := 1 }
    else { r4 := r5; z := r4 }
  } else { /* ... */ }
}
LLVM trace preserving transformation
We simplify terms using the predicates on branches.
// Thread 1:
r2 := k.load(acq);
if (r2 == 42) {
  r5 := y.load(na);
  if (freeze(r5 == 1)) {
    r3 := x;
    if (r3 == 1) {
      z := 1; // simplification using r3 == 1
      boom := 1
    } else {
      r4 := 1; // simplification using freeze(r5 == 1)
      z := r4
    }
  } else { /* ... */ }
}
LLVM trace preserving transformation
Here we constant propagate the register value into the store on z and then hoist the store z := 1 out of the conditional. The else branch is now empty.
// Thread 1:
r2 := k.load(acq);
if (r2 == 42) {
  r5 := y.load(na);
  if (freeze(r5 == 1)) {
    r3 := x;
    z := 1; // hoisted
    if (r3 == 1) { boom := 1 } else { }
  } else { /* ... */ }
}
There is now no dependency between the load of x and the store to z, so they can be re-ordered by the hardware.
In context with the rest of the program, the write of 1 to z is now observable and the program can store 1 to boom. This execution would be allowed if the sdep edge between the load of x and the store of z were removed from the execution above.
Note that this execution is race-free: the non-atomic store happens-before the load, thanks to the release/acquire pair on k.
Is SG1 happy with this set of transformations? Should this weak execution be allowed?
4. sdep in Java Causality Tests
To contextualise sdep with the Java Causality Tests, we print the tests, the interesting execution, and our conclusion on acyclicity of sdep and rf. See [JCT].
4.1. JCTC1
// Init: x := 0; y := 0;
// Thread 0:
r1 := x;
if (r1 >= 0) { y := 1 }
// Thread 1:
r2 := y;
x := r2
Allowed, no cycle. A compiler can observe that r1 is always greater than or equal to zero.
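For instance (our sketch; relaxed atomics as per the litmus convention, function name ours), Thread 0 simplifies to an unconditional store once the condition is known to hold:

void thread0_optimised() {
  int r1 = x.load(std::memory_order_relaxed);  // every value of x in this program is non-negative
  y.store(1, std::memory_order_relaxed);       // if (r1 >= 0) removed
}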
4.2. JCTC2
// Init: x := 0; y := 0;
// Thread 0:
r1 := x;
r2 := x;
if (r1 == r2) { y := 1 }
// Thread 1:
r3 := y;
x := r3
Allowed, no cycle. A compiler can assert that two relaxed loads to the same location will yield the same value.
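For instance (our sketch, function name ours), forwarding the first load to the second makes the condition trivially true on Thread 0:

void thread0_optimised() {
  int r1 = x.load(std::memory_order_relaxed);
  int r2 = r1;                                 // second load of x forwarded from the first
  y.store(1, std::memory_order_relaxed);       // r1 == r2 now always holds
}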
4.3. JCTC3
// Init: x := 0; y := 0;
// Thread 0:
x := 2
// Thread 1:
r1 := x;
r2 := x;
if (r1 == r2) { y := 1 }
// Thread 2:
r3 := y;
x := r3
Allowed, no cycle. Similar to above.
4.4. JCTC4
// Init: x := 0; y := 0;
// Thread 0:
r1 := x;
y := r1
// Thread 1:
r2 := y;
x := r2
Forbidden, cycle in sdep ∪ rf. No values from thin air.
4.5. JCTC5
// Init: x := 0; y := 0; z := 0;
// Thread 0:
z := 1
// Thread 1:
r3 := z;
x := r3
// Thread 2:
r2 := y;
x := r2
// Thread 3:
r1 := x;
y := r1
Forbidden, cycle in sdep ∪ rf. No values from thin air.
4.6. JCTC6
// Init: A := 0; B := 0;
// Thread 0:
r1 := A;
if (r1 == 1) { B := 1 }
// Thread 1:
r2 := B;
if (r2 == 1) { A := 1 };
if (r2 == 0) { A := 1 }
Allowed, no cycle. A compiler can observe that the if cases cover all possible values of r2 in this program, so there is always a store of 1 to A.
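For instance (our sketch; A and B treated as relaxed atomics, function name ours), with r2 known to be either 0 or 1 the two conditionals on Thread 1 collapse into one unconditional store:

void thread1_optimised() {
  int r2 = B.load(std::memory_order_relaxed);  // r2 is 0 or 1 in this program
  A.store(1, std::memory_order_relaxed);       // one of the two conditionals always fires
}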
4.7. JCTC7
// Init: x := 0; y := 0; z := 0;
// Thread 0:
r1 := z;
r2 := x;
y := r2
// Thread 1:
r3 := y;
z := r3;
x := 1
Allowed, no cycle. Hardware and compiler reordering makes this observable.
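For instance (our sketch, function name ours), the independent store on Thread 1 can be performed first, by the compiler or by the hardware:

void thread1_reordered() {
  x.store(1, std::memory_order_relaxed);       // x := 1 does not depend on the loads
  int r3 = y.load(std::memory_order_relaxed);  // r3 := y
  z.store(r3, std::memory_order_relaxed);      // z := r3 keeps its data dependency
}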
4.8. JCTC8
// Init: x := 0; y := 0;
// Thread 0:
r1 := x;
r2 := (1 + (r1 * r1 - r1));
y := r2
// Thread 1:
r3 := y;
x := r3
Allowed, no cycle.
Interthread analysis could determine that x and y are always either 0 or 1, and thus determine that r2 is always 1. Once this determination is made, the write of 1 to y could be made early on Thread 0.
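For instance (our sketch, function name ours), once the analysis concludes that r2 is always 1, the store to y can be made early on Thread 0:

void thread0_optimised() {
  y.store(1, std::memory_order_relaxed);       // write of 1 to y made early
  int r1 = x.load(std::memory_order_relaxed);  // r1 is 0 or 1 in this program
  int r2 = 1;                                  // 1 + (r1*r1 - r1) is 1 for r1 in {0, 1}
}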
4.9. JCTC9
// Init: x := 0; y := 0;
// Thread 0:
x := 2
// Thread 1:
r1 := x;
y := (1 + (r1 * r1 - r1))
// Thread 2:
r3 := y;
x := r3
Allowed, no cycle.
Similar to test case 8, except that x is not always 0 or 1. However, a compiler might determine that the read of x on Thread 1 will never see the write on Thread 0 (perhaps because Thread 0 will be scheduled after Thread 1). Thus, the compiler can determine that r1 will always be 0 or 1.
4.10. JCTC10
// Init: x := 0; y := 0; z := 0;
// Thread 0:
r3 := z;
if (r3 == 1) { x := 1 }
// Thread 1:
r2 := y;
if (r2 == 1) { x := 1 }
// Thread 2:
z := 1
// Thread 3:
r1 := x;
if (r1 == 1) { y := 1 }
Forbidden, cycle in sdep ∪ rf. No values from thin air.
4.11. JCTC11
// Init: w := 0; x := 0; y := 0; z := 0;
// Thread 0:
r1 := z;
w := r1;
r2 := x;
y := r2
// Thread 1:
r4 := w;
r3 := y;
z := r3;
x := 1
Allowed, no cycle. Observable through compiler and hardware reordering.
4.12. JCTC12
// Init: x := 0; y := 0; a0 := 1; a1 := 2;
// Thread 0:
r1 := x;
if (r1 == 0) { a0 := 0 } else { if (r1 == 1) { a1 := 0 } };
r2 := a0;
y := r2
// Thread 1:
r3 := y;
x := r3
Forbidden, cycle in sdep ∪ rf. No values from thin air.
4.13. JCTC13
// Init: x := 0; y := 0;
// Thread 0:
r1 := x;
if (r1 == 1) { y := 1 }
// Thread 1:
r2 := y;
if (r2 == 1) { x := 1 }
Forbidden, cycle in sdep ∪ rf. No values from thin air.
4.14. JCTC14
We skip this as it involves loops.
4.15. JCTC15
We skip this as the test probes Java’s weak coherence-order guarantees, and is not relevant to C++.
4.16. JCTC16
We skip this as the test probes Java’s weak coherence-order guarantees, and is not relevant to C++.
4.17. JCTC17
We skip this as the test probes Java’s weak coherence-order guarantees, and is not relevant to C++.
4.18. JCTC18
See § 2.6 Java Causality Test Case 18.
4.19. JCTC19
We skip this as it probes thread joining, which is out-of-scope for this paper.
4.20. JCTC20
We skip this as it probes thread joining, which is out-of-scope for this paper.