Generated code for std::atomic load/store operations

Hi,

I wrote a small test program to analyze the generated code for load/store operations with different memory orders on the new std::atomic type.

#include <atomic>
std::atomic<size_t> v(42);
__declspec(noinline) size_t load_relaxed() { return v.load(std::memory_order_relaxed); }
__declspec(noinline) size_t load_acquire() { return v.load(std::memory_order_acquire); }
__declspec(noinline) size_t load_consume() { return v.load(std::memory_order_consume); }
__declspec(noinline) size_t load_seq_cst() { return v.load(std::memory_order_seq_cst); }
__declspec(noinline) void store_relaxed(size_t arg) { v.store(arg, std::memory_order_relaxed); }
__declspec(noinline) void store_release(size_t arg) { v.store(arg, std::memory_order_release); }
__declspec(noinline) void store_seq_cst(size_t arg) { v.store(arg, std::memory_order_seq_cst); }
int main(int argc, char* argv[])
{
    size_t x = 0;
    x += load_relaxed();
    x += load_acquire();
    x += load_consume();
    x += load_seq_cst();
    
    store_relaxed(x);
    store_release(x);
    store_seq_cst(x);
    
    return (int)x;
}

The result with the Intel Composer XE 2013 looks as follows:

with Intel atomic header (__USE_INTEL_ATOMICS)

v.load(std::memory_order_relaxed);
lea rax,[v (013FE33020h)]
mov rax,qword ptr [rax]

v.load(std::memory_order_acquire);
lea rax,[v (013FE33020h)]
mov rax,qword ptr [rax]
lfence

v.load(std::memory_order_seq_cst);
lea rax,[v (013FE33020h)]
mfence
mov rax,qword ptr [rax]
mfence

v.store(arg, std::memory_order_relaxed);
lea rdx,[v (013FE33020h)]
mov qword ptr [rdx],rax

v.store(arg, std::memory_order_release);
lea rdx,[v (013FE33020h)]
sfence
mov qword ptr [rdx],rax

v.store(arg, std::memory_order_seq_cst);
lea rdx,[v (013FE33020h)]
xchg rax,qword ptr [rdx]

with Microsoft atomic header

v.load(std::memory_order_relaxed);
v.load(std::memory_order_acquire);
v.load(std::memory_order_seq_cst);
lea rdi,[v (013FA93020h)]
mov rax,qword ptr [rdi]
retry:
mov rdx,rax
or rdx,rcx
lock cmpxchg qword ptr [rdi],rdx
jne retry (013FA91081h)

v.store(arg, std::memory_order_relaxed);
v.store(arg, std::memory_order_release);
mov qword ptr [v (013FA93020h)],rcx

v.store(arg, std::memory_order_seq_cst);
lea rcx,[v (013FA93020h)]
xchg rax,qword ptr [rcx]
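
For comparison, here is a rough C++ equivalent of the load sequence above (my sketch, not the actual header source): the lock cmpxchg retry loop behaves like an atomic fetch_or with 0, i.e. every load is turned into a full read-modify-write.

#include <atomic>
std::atomic<size_t> v(42);
size_t load_like_microsoft_header()
{
    // OR-ing with 0 leaves the value unchanged, but still takes the
    // cache line exclusive and executes a locked instruction per load
    return v.fetch_or(0, std::memory_order_seq_cst);
}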

The generated code for the atomic loads with the Microsoft header is something I have to report to Microsoft (this implementation is a catastrophe from a performance point of view).
But what I don't understand is why the generated code with the Intel header contains all these lfence/sfence instructions.
In particular: why does v.store(arg, std::memory_order_release) require an sfence before the write operation? Write operations are guaranteed to be executed in program order on x86 anyway, right?
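
For the release store, here is what I would have expected (a sketch of my expectation, not actual compiler output): on x86-64 ordinary stores already become visible in program order, so preventing compiler reordering should be enough and a plain mov should suffice.

#include <atomic>
std::atomic<size_t> ready(0);
size_t payload;

void publish(size_t value)
{
    payload = value;                           // ordinary store
    ready.store(1, std::memory_order_release); // expected: plain mov, no sfence
}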

Thanks,
Manuel

Hi,

Here's a partial response to your post. I don't have an answer about the use of the fence instructions; I hope to have more information about that soon.

You should not define the __USE_INTEL_ATOMICS symbol. The correct value of this symbol is determined according to the version of Visual Studio you are using. For VS2012, this symbol is defined to be 0, which forces use of Microsoft's atomic header.

Microsoft introduced the atomic header in Visual Studio 2012. Intel has been shipping its own support for atomic operations and an Intel-supplied header file. If you are using atomics with Visual Studio 2012, then you need to use the Microsoft definition of atomic.
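
If you want to verify which implementation a translation unit picked up, something along these lines should work (a sketch based on the symbol described above; I haven't verified the exact mechanics):

#include <atomic>
#if defined(__USE_INTEL_ATOMICS) && __USE_INTEL_ATOMICS
#pragma message("Intel atomic implementation in use")
#else
#pragma message("Microsoft atomic implementation in use")
#endif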

In my experiments, the Microsoft-generated code uses a different locking mechanism than what you quoted above. (Microsoft (R) C/C++ Optimizing Compiler Version 17.00.40825.2 for x64, compiled with cl -c /FAs.) Here's what I see:


; 25   :      size_t x = 0;
        mov     QWORD PTR x$[rsp], 0
; 26   :
; 27   :      x += load_relaxed();
        call    ?load_relaxed@@YA_KXZ                   ; load_relaxed
        mov     rcx, QWORD PTR x$[rsp]
        add     rcx, rax
        mov     rax, rcx
        mov     QWORD PTR x$[rsp], rax
; 28   :

...Later on there's a call to do the atomic store. As I understand it, Microsoft's support uses a lock object. If there's an atomic object which is accessed by both Intel- and Microsoft-generated code, then the locking mechanism needs to be the same to ensure correct access. If you are using VS2012, you're going to see calls to e.g. load_relaxed, and calls to do the atomic store, even in Intel-compiled code.

Hi,

I'm using Visual Studio 2012 (v11.0.50727.1) with Microsoft (R) C/C++ Optimizing Compiler Version 17.00.50727.1 for x64. I defined __USE_INTEL_ATOMICS to force the compiler to use the Intel atomic header, because I wanted to compare the generated code of both implementations.
Apparently my description of what I was doing and what I wanted to analyze wasn't clear enough.

I compiled the sample program from my first posting with full optimization, so the compiler inlined the .store/.load calls. I declared my own load_* and store_* functions as noinline to prevent them from being inlined themselves. This makes it easier to analyze the generated code, since there are no reorderings/interleavings or other optimizations between the calls to load_*/store_*.

The generated assembler code I posted comes from the actual load/store calls inside my small helper functions. In the main function there are of course calls to load_*/store_*, but inside these functions everything is inlined, resulting in the code posted above.

Thanks for the further information.

The main point I wanted to make is that users shouldn't define the __USE_INTEL_ATOMICS symbol.

We're still investigating your question about fence instructions, DPD200238776 is tracking the issue. Thanks for bringing it up.
