Generated code for std::atomic load/store operations

Hi,

I wrote a small test program to analyze the generated code for load/store operations with the different memory orders of the new std::atomic type.

#include <atomic>
std::atomic<size_t> v(42);
__declspec(noinline) size_t load_relaxed() { return v.load(std::memory_order_relaxed); }
__declspec(noinline) size_t load_acquire() { return v.load(std::memory_order_acquire); }
__declspec(noinline) size_t load_consume() { return v.load(std::memory_order_consume); }
__declspec(noinline) size_t load_seq_cst() { return v.load(std::memory_order_seq_cst); }
__declspec(noinline) void store_relaxed(size_t arg) { v.store(arg, std::memory_order_relaxed); }
__declspec(noinline) void store_release(size_t arg) { v.store(arg, std::memory_order_release); }
__declspec(noinline) void store_seq_cst(size_t arg) { v.store(arg, std::memory_order_seq_cst); }
int main(int argc, char* argv[])
{
    size_t x = 0;
    x += load_relaxed();
    x += load_acquire();
    x += load_consume();
    x += load_seq_cst();
    
    store_relaxed(x);
    store_release(x);
    store_seq_cst(x);
    
    return (int)x;
}
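
I compiled with full optimization and had the compiler emit assembly listings, with a command line along these lines (/FAs writes the assembler listing; file name and exact switches here are just an example):

icl /O2 /FAs /c atomics.cpp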

The result with the Intel Composer XE 2013 looks as follows:

with the Intel atomic header (__USE_INTEL_ATOMICS defined)

v.load(std::memory_order_relaxed);
lea rax,[v (013FE33020h)]
mov rax,qword ptr [rax]

v.load(std::memory_order_acquire);
lea rax,[v (013FE33020h)]
mov rax,qword ptr [rax]
lfence

v.load(std::memory_order_seq_cst);
lea rax,[v (013FE33020h)]
mfence
mov rax,qword ptr [rax]
mfence

v.store(arg, std::memory_order_relaxed);
lea rdx,[v (013FE33020h)]
mov qword ptr [rdx],rax

v.store(arg, std::memory_order_release);
lea rdx,[v (013FE33020h)]
sfence
mov qword ptr [rdx],rax

v.store(arg, std::memory_order_seq_cst);
lea rdx,[v (013FE33020h)]
xchg rax,qword ptr [rdx]

with the Microsoft atomic header

v.load(std::memory_order_relaxed);
v.load(std::memory_order_acquire);
v.load(std::memory_order_seq_cst);
lea rdi,[v (013FA93020h)]
mov rax,qword ptr [rdi]
retry:
mov rdx,rax
or rdx,rcx
lock cmpxchg qword ptr [rdi],rdx
jne retry (013FA91081h)

v.store(arg, std::memory_order_relaxed);
v.store(arg, std::memory_order_release);
mov qword ptr [v (013FA93020h)],rcx

v.store(arg, std::memory_order_seq_cst);
lea rcx,[v (013FA93020h)]
xchg rax,qword ptr [rcx]
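
To make explicit what that load sequence does: the retry loop is an atomic fetch-or with 0, i.e. a locked compare-and-swap loop used merely to read the value. A rough C++ equivalent (my own sketch using the documented _InterlockedCompareExchange64 intrinsic, not the header's actual internals):

#include <intrin.h>

__int64 msvc_style_load(volatile __int64* p)
{
    __int64 expected = *p;                 // mov rax,qword ptr [rdi]
    for (;;)
    {
        __int64 desired = expected | 0;    // or rdx,rcx (rcx holds 0)
        __int64 prev = _InterlockedCompareExchange64(p, desired, expected);
        if (prev == expected)
            return prev;                   // the CAS succeeded
        expected = prev;                   // jne retry
    }
}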

The generated code for the atomic loads with the Microsoft header is something I will have to report to Microsoft (this implementation is a catastrophe from a performance point of view).
But what I don't understand is why the generated code with the Intel header contains all these lfence/sfence instructions.
Especially: why does v.store(arg, std::memory_order_release) require an sfence before the write operation? Write operations are guaranteed to be executed in program order anyway, right?
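
To illustrate what I mean: since x86 does not reorder stores with older stores, I would expect a release store to need nothing more than a compiler-level barrier before a plain mov, something like this sketch (_ReadWriteBarrier is the MSVC compiler-only fence; whether this is exactly what the header should emit is of course my assumption):

#include <cstddef>
#include <intrin.h>

volatile size_t g;

void store_release_plain(size_t arg)
{
    _ReadWriteBarrier();   // compiler barrier only, emits no instruction
    g = arg;               // plain mov qword ptr [g], arg
}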

Thanks,
Manuel

Hi,

Here's a partial response to your post. I don't have an answer about the use of the fence instructions, hope to have more information about that soon.

You should not define the __USE_INTEL_ATOMICS symbol. The correct value of this symbol is determined according to the version of Visual Studio you are using. For VS2012, it is defined to be 0, which forces use of Microsoft's atomic header.

Microsoft introduced the atomic header in Visual Studio 2012. Intel has been shipping its own support for atomic operations and an Intel-supplied header file. If you are using atomics with Visual Studio 2012, then you need to use the Microsoft definition of atomic.

In my experiments, the Microsoft-generated code uses a different locking mechanism than what you quoted above. (I'm using Microsoft (R) C/C++ Optimizing Compiler Version 17.00.40825.2 for x64, compiling with cl -c /FAs.) Here's what I see:


; 25   :      size_t x = 0;
        mov     QWORD PTR x$[rsp], 0

; 27   :      x += load_relaxed();
        call    ?load_relaxed@@YA_KXZ                   ; load_relaxed
        mov     rcx, QWORD PTR x$[rsp]
        add     rcx, rax
        mov     rax, rcx
        mov     QWORD PTR x$[rsp], rax

...Later on there's a call to do the atomic store. As I understand it, Microsoft's support uses a lock object. If an atomic object is accessed by both Intel- and Microsoft-generated code, then the locking mechanism needs to be the same to ensure correct access. If you are using VS2012, you're going to see calls to e.g. load_relaxed, and calls to do the atomic store, even in the Intel-compiled code.
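
To sketch why the mechanisms must match (my own illustration, not the actual header internals): suppose one side guards a 16-byte object with a spinlock while the other reads it with plain loads. The plain reader ignores the lock and can observe a half-written value:

#include <atomic>

struct pair64 { unsigned long long lo, hi; };

std::atomic_flag guard = ATOMIC_FLAG_INIT;
pair64 shared_pair;

pair64 locked_load()                    // scheme A: lock-based access
{
    while (guard.test_and_set(std::memory_order_acquire)) { } // spin
    pair64 r = shared_pair;
    guard.clear(std::memory_order_release);
    return r;
}

pair64 bare_load()                      // scheme B: ignores the lock
{
    return shared_pair;                 // may tear between lo and hi
}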

Hi,

I'm using Visual Studio 2012 (v11.0.50727.1) with Microsoft (R) C/C++ Optimizing Compiler Version 17.00.50727.1 for x64. I defined __USE_INTEL_ATOMICS to force the compiler to use the Intel atomic header, because I wanted to compare the generated code for both implementations.
Apparently my description of what I was doing and what I wanted to analyze wasn't clear enough.

I compiled the sample program from my first posting with full optimization, so the compiler inlined the .store/.load calls. I declared my own load_* and store_* functions with noinline to prevent inlining of those. This makes it easier for me to analyze the generated code, since there are no reorderings/interleavings or other optimizations between the calls to load_*/store_*.

The generated assembler code I posted was from the actual load/store calls inside my small helper functions. In the main function there are of course the calls to load_*/store_*, but inside these functions everything is inlined, resulting in the code posted above.

Thanks for the further information.

The main point I wanted to make is that users shouldn't define the __USE_INTEL_ATOMICS symbol.

We're still investigating your question about the fence instructions; DPD200238776 is tracking the issue. Thanks for bringing it up.
