P2426R0
Coherence order and program dependencies

Draft Proposal,

This version:
https://graymalk.in/p2426r0-coherence-and-dependencies/
Authors:
Simon Cooksey (University of Kent)
Daniel Wright (University of Kent)
Mark Batty (University of Kent)
James Riely (University of Chicago)
Alan Jeffrey (Roblox)
Anton Podkopaev (JetBrains Research)
Ilya Kaysin (JetBrains Research)
Project:
ISO/IEC JTC1/SC22/WG21 14882: Programming Language — C++

Abstract

The interaction between coherence order and dependencies that the compiler should preserve is not obvious. This paper attempts to outline some of the interesting programs we have constructed while exploring this problem.

1. Recap

There has been much discussion in SG1 about how to solve the so-called out-of-thin-air problem, where the memory model of C++ permits nonsense executions only with some non-normative weasel wording as warding against these behaviours.

This is necessary as the C++ memory model must be weak enough to permit existing important compiler optimisations, like code hoisting.

void thread1(void) {
  int r1 = atomic_load_explicit(&x, memory_order_relaxed);
  if(r1 == 0) {
    atomic_store_explicit(&y, 1, memory_order_relaxed);
  } else {
    atomic_store_explicit(&y, 1, memory_order_relaxed);
  }
}
void thread2(void) {
  int r2 = atomic_load_explicit(&y, memory_order_relaxed);
  atomic_store_explicit(&x, r2, memory_order_relaxed);
}

In this program thread 1 can be optimised to hoist the store of y outside a conditional, yielding:

void thread1(void) {
  int r1 = atomic_load_explicit(&x, memory_order_relaxed);
  atomic_store_explicit(&y, 1, memory_order_relaxed);
}

This optimisation can be witnessed in GCC and LLVM with -O2.

A final execution where x == 1 && y == 1 is now obviously observable, by executing thread1 first, and then executing thread2.

If we modify thread1() to instead to a dependent store:

void thread1(void) {
  int r1 = atomic_load_explicit(&x, memory_order_relaxed);
  if(r1 == 0) {
    /* Now we store 0 in the true branch */
    atomic_store_explicit(&y, 0, memory_order_relaxed);
  } else {
    atomic_store_explicit(&y, 1, memory_order_relaxed);
  }
}

The hardware and compiler may not re-order the load and store, as there is a "real" control dependency. A final execution where x == 1 && y == 1 would be somehow 'self satisfying', and should be forbidden.

In each of these programs, we can draw a cycle in dependency (dep) and reads from (rf). It would be convinient to forbid such cycles, but there was a control dependency in the first program too -- that dependency could clearly be ignored by a compiler. The challenge is, which dep edges should be preserved and forbidden from forming cycles? Semantic dependencies (sdep) are exactly the dependencies which should have some semantic force, rather than merely being present in the syntax. Unfortunately, it is hard to presicely define a rigorous definition of sdep, although recent academic work has had a go. [MRD], which has been presented here in [P1780r0], gives a complex calculation for semantic dependency; and [PWP] provides a logic which evaluates to a sdep relation.

2. The Problem

In the process of studying [MRD] and [PWP], we have noticed that there is a space of choices to be made about which dependencies should be promoted to sdep, and even ones which are not obviously present in the syntactic structure of the program. A key tuning point on both these semantics is exactly the strength implied by modification order (mo) — although the exact mechanisms for this are quite complex and out of scope for discussion in this paper. Instead, we will focus on a discussion of which sdep edges would be desirable for C++, using this paper to solicit feedback from the SG1 audience.

We have found some programs which create dependencies around mo edges, and it is not clear what the correct semantics for these programs should be.

3. Dead code elimination

In our first example, we will look at how dead code elimination can create questions about preservation of mo. Consider these two definitions of a function which stores to x dependent on some boolean:
void f1(bool b) {
  if (b) {
    x.store(2, memory_order_relaxed);
  }
  x.store(1, memory_order_relaxed);
  x.store(2, memory_order_relaxed);
  if (!b) {
    x.store(1, memory_order_relaxed);
  }
  x.store(3, memory_order_relaxed);
}
void f2(bool b) {
  if (!b) {
    x.store(1, memory_order_relaxed);
  }
  x.store(2, memory_order_relaxed);
  x.store(1, memory_order_relaxed);
  if (b) {
    x.store(2, memory_order_relaxed);
  }
  x.store(3, memory_order_relaxed);
}

TODO: The more I think about it, the more this example is messing me up. I don’t understand why it can be optimised to x = 1; x = 2; x = 3.

In each, if b is true, then the program should store values to x of 2, 1, 2, 3 in that order -- but dead store removal can eliminate the 1st unconditional store of each program, meaning f1 has the stores x = 1; x = 2; x = 3;, and f2 has the stores x = 2; x = 1; x = 3. For a given true/false value of b these two functions should evaluate equivalently, but in the presence of compiler optimisations we can observe a different pattern of stores to x.

gcc and clang are extremely conservative optimising around relaxed atomics. See https://godbolt.org/z/E46axT9vo
;; Clang output, f2 is identical. Clang uses some clever arithmetic around the bool
;; to calculate the values for stores and elide control flow -- but all 4 stores 
;; are emitted. No dead store elimination.
f1(bool):                                 # @f1(bool)
        mov     eax, 2
        sub     eax, edi
        add     edi, 1
        mov     dword ptr [rip + x], edi
        mov     dword ptr [rip + x], eax
        mov     dword ptr [rip + x], edi
        mov     dword ptr [rip + x], 3
        ret

In the semantics of C++ should dead code elimination be forbidden here, and how should this occur?

References

Informative References

[MRD]
Marco Paviotti; et al. Modular Relaxed Dependencies in Weak Memory Concurrency. URL: https://link.springer.com/chapter/10.1007%2F978-3-030-44914-8_22
[P1780r0]
Mark Batty, Simon Cooksey, Scott Owens, Anouk Paradis, Marco Paviotti, Daniel Wright. Modular Relaxed Dependencies: A new approach to the Out-Of-Thin-Air Problem. 14 June 2019. URL: https://wg21.link/p1780r0
[PWP]
Radha Jagadeesen; Alan Jeffrey; James Riely. Pomsets with preconditions: a simple model of relaxed memory. URL: https://dl.acm.org/doi/10.1145/3428262