P1780R2
Modular Relaxed Dependencies: A new approach to the Out-Of-Thin-Air Problem

Draft Proposal,

This version:
http://wg21.link/P1780r2
Authors:
(University of Kent)
(University of Kent)
(University of Kent)
(Ecole Polytechnique)
(University of Kent)
(University of Kent)
Audience:
SG1
Project:
ISO/IEC JTC1/SC22/WG21 14882: Programming Language — C++

Abstract

This note describes an alteration of the existing concurrency specification that allows compiler optimisations while forbidding thin-air values. It is a pleasingly small change that introduces a semantic dependency relation, tracking program dependencies that must not be removed by compiler optimisation. The rest of the existing concurrency definition remains largely unaltered. There are two proposals for the calculation of semantic dependency, together with a logic that shows this approach recovers the -- currently broken -- ability to formally verify concurrent code. We present the context, the ideas behind the change, reference the academic work supporting this approach, and the textual changes required to the standard.

1. Changelog

1.1. Revision 0 (Cologne June 2019)

1.2. Revision 1 (Belfast October 2019)

1.3. Revision 2 (Kona November 2022)

2. Recap: The out-of-thin-air problem

From its introduction in 2011, the C++ concurrency model has faced a great deal of scrutiny, leading to various refinements of the standard. [P0668R5] Revising the C++ Memory Model, describes changes needed to incorporate the fixes of [RC11], the latest revision of the existing C++ concurrency design, solving most known problems.

This process of scrutiny and refinement has produced a mature model, but it has also uncovered major problems. The out-of-thin-air problem, discussed most recently in [P1217R1] allows unintuitive and unobservable behaviours and makes reasoning formally about the correctness of concurrent code impossible. The thin air problem does not yet have a convenient solution: existing solutions -- like the Promising Semantics [Promising] -- discard the mature concurrency model of the standard and start afresh with a different paradigm. Implementing such a change would require a wide-ranging rewrite of the standard.

The out-of-thin-air problem arises from an under-specification of which program dependencies ought to order memory accesses. The trouble is neatly explained by example.

LB
// Thread 1
r1 = x.load(memory_order::relaxed);
y.store(1, memory_order::relaxed);

// Thread 2
r2 = y.load(memory_order::relaxed);
x.store(1, memory_order::relaxed);

In the example above, called Load Buffering (LB), C++ allows the execution where r1 and r2 have value 1 at the end of execution. This outcome must be allowed to permit efficient compilation of relaxed accesses to plain loads and stores on the Power and ARM architectures, where the corresponding behaviour is allowed.

LB+datas (P1217R1 OOTA Example 1, JCTC4)
// Thread 1
r1 = x.load(memory_order::relaxed);
y.store(r1, memory_order::relaxed);

// Thread 2
r2 = y.load(memory_order::relaxed);
x.store(r1, memory_order::relaxed);

In the example above, taken from [P1217R1], there are now data dependencies from load to store and no explicit write of any value. Surprisingly, the outcome 42 is allowed by the C++ standard -- we say that the value 42 is conjured from thin-air.

LB+ctrls (P1217R1 OOTA Example 2, JCTC13)
// Thread 1
r1 = x.load(memory_order::relaxed);
if(r1 == 42) {
  y.store(42, memory_order::relaxed);
}

// Thread 2
r2 = y.load(memory_order::relaxed);
if(r2 == 42) {
  x.store(42, memory_order::relaxed);
}

A similar behaviour can be exhibited using control dependencies, as above. The thin-air outcomes of LB+datas and LB+ctrls are the result of a design choice that aims to permit compiler optimisation: some optimisations remove syntactic dependencies, so if C++ bestows ordering to no dependencies, the compiler is free to remove what it likes.

LB+false+ctrl (JCTC6)
// Thread 1
r1 = x.load(memory_order::relaxed);
if(r1 == 42) {
  y.store(42, memory_order::relaxed);
} else {
  y.store(42, memory_order::relaxed);
}

// Thread 2
r2 = y.load(memory_order::relaxed);
if(r2 == 42) {
  x.store(42, memory_order::relaxed);
}

In Thread 1 above, a compiler may spot that, whatever value is read, the write will be made regardless, optimising to the program below, and allowing both threads to read 42:

// Thread 1
r1 = x.load(memory_order::relaxed);
y.store(42, memory_order::relaxed);

As it stands, the standard allows both threads to read 42 and hence it permits the optimisation, but it also allows the thin-air outcomes in LB+datas and LB+ctrls. [P1217R1] explains that these outcomes cannot occur in practice and must be forbidden to enable reasoning and formal verification.

3. Our Solution

[P0422R0] highlights a series of litmus tests, borrowed from Java, that constrain the ideal behaviour of an improved semantics (e.g. JCTC4, JCTC13 and JCTC6 above). The paper makes the observation that the thin-air execution of each test features a cycle in (rfsdep) where sdep, semantic dependency, is an as yet undefined relation that captures only the real dependencies that the compiler will leave in place. If only we forbid cycles in (rfsdep), then the model would have the behaviour we desire.

[P1217R1] highlights a simple way to guarantee a lack of thin-air behaviour: forbid the reordering of relaxed loads with following stores. This is equivalent to forbidding cycles in (rfsb), and is sufficient to avoid cycles in (rfsdep). This approach is controversial because it may have a substantial overhead, especially on GPUs.

We have worked as academics on building a definition for sdep which can be calculated over an input program. We have found two such approaches, one presented in an earlier version of this paper, and in academic literature as [MRD], and another in the academic literature which uses a different approach to build sdep [PWT]. These models are complex and by no means industrially verified, but it leads us to agree with [P0422R0] and propose a standard wording change which incorporates an idea of sdep. In further work, we adapted established concurrent reasoning techiques (Hoare triples with Owicki-Gries global invariants) to work above the C++ model with sdep, establishing that this approach once again enables one to prove C++ code correct [owicki-gries].

The standard describes the three stages of working out the concurrent behaviour of a program:

  1. Generate a list of pre-executions from the source of a program, each corresponding to an arbitrary choice of read values.

  2. Pair each pre-execution with reads-from and modification order relations that capture the dynamic memory behaviour of the program, and filter, keeping those that are consistent with the rules of the memory model.

  3. Finally, check for races: race free programs have the consistent executions calculated in step 2, racy programs have undefined behaviour.

We augment this scheme by parameterising sdep in step 1, and by providing a technical note on constraints which a good C++ implementation should put on sdep—for example, forbidding the out-of-thin-air execution by always including sdep edges in JCTC4.

This approach is distinct from [Promising] and other suggested academic models, because nearly all of the existing wording on memory consitency can be preserved. It also is certainly compatible with existing implementations, as constraints to place on the "shape" of sdep would be provided in a technical note, which can be written with taste for current implementations of C++.

4. Conclusion

We have found several possible solutions in academic literature to the thin-air problem, but none look like they can be plucked and placed into the standard without considerable re-writes of the existing memory model text. We consider this to be undesirable, and propose a fix as was originally suggested in [P0422R0], but with the added knowledge that sdep can be calculated and used for reasoning. This aproach has several benefits:

5. Proposed Standard Change

[...]

Multi-threaded executions and data races

[...]

Thin-air restriction

Some side effects are semantically dependent on evaluation operations. An evaluation A is causally before an evaluation B if:

  • A is a side effect on an atomic object M, and B is an evaluation of M that takes its value, or

  • A is semantically dependent on B, or

  • for some operation X, A is causally before X and X is causally before B.

The implementation shall ensure that no program execution demonstrates a cycle in the causally before relation.

[Note: Semantic dependency is implementation defined, TRxxxx gives a specification on precisely what semantic dependencies should be preserved by an implementation. —end note]

[...]

Order and consistency
[...]

Implementations should ensure that no “out-of-thin-air” values are computed that circularly depend on their own computation.

[Note: For example, with x and y initially zero,

// Thread 1:
r1 = y.load(memory_order::relaxed);
x.store(r1, memory_order::relaxed);
// Thread 2:
r2 = x.load(memory_order::relaxed);
y.store(r2, memory_order::relaxed);

should not produce r1 == r2 == 42, since the store of 42 to y is only possible if the store to x stores 42, which circularly depends on the store to y storing 42. Note that without this restriction, such an execution is possible.

[Note: The recommendation similarly disallows r1 == r2 == 42 in the following example, with x and y again initially zero:

// Thread 1:
r1 = x.load(memory_order::relaxed);
if (r1 == 42) y.store(42, memory_order::relaxed);
// Thread 2:
r2 = y.load(memory_order::relaxed);
if (r2 == 42) x.store(42, memory_order::relaxed);

end note]

[...]

References

Informative References

[MRD]
Marco Paviotti; et al. Modular Relaxed Dependencies in Weak Memory Concurrency. URL: https://link.springer.com/chapter/10.1007%2F978-3-030-44914-8_22
[OWICKI-GRIES]
Daniel Wright; Mark Batty; Brijesh Dongol. Owicki-Gries Reasoning for C11 Programs with Relaxed Dependencies. URL: https://link.springer.com/chapter/10.1007/978-3-030-90870-6_13
[P0422R0]
Paul E. McKenney, Alan Jeffrey, Ali Sezgin, Tony Tye. Out-of-Thin-Air Execution is Vacuous. 27 July 2016. URL: https://wg21.link/p0422r0
[P0668R5]
Hans-J. Boehm, Olivier Giroux, Viktor Vafeiades. Revising the C++ memory model. 9 November 2018. URL: https://wg21.link/p0668r5
[P1217R1]
Hans-J. Boehm. Out-of-thin-air, revisited, again. 10 March 2019. URL: https://wg21.link/p1217r1
[Promising]
Jeehoon Kang; et al. A Promising Semantics for Relaxed-Memory Concurrency. URL: https://people.mpi-sws.org/~viktor/papers/popl2017-promising.pdf
[PWT]
Alan Jeffrey; et al. The leaky semicolon: compositional semantic dependencies for relaxed-memory concurrency. URL: https://dl.acm.org/doi/10.1145/3498716
[RC11]
Ori Lahav; et al. Repairing sequential consistency in C/C++11. URL: https://dl.acm.org/doi/10.1145/3062341.3062352