Memory reordering: Can loads be reordered with earlier stores to different but encompassing location?

Posted by syljak

In Intel's processor manual, section 8.2.3.4, it is stated that loads may be reordered with earlier stores to different locations, but not with earlier stores to the same location.

So I understand that the following two operations can be reordered:

x = 1;
y = z;

And that the following two operations can not be reordered:

x = 1;
y = x;

But what happens when the store and the load are for different locations, but the load encompasses the store completely, e.g.:

typedef union {
  uint64_t shared_var;
  uint32_t individual_var[2];
} my_union_t;

my_union_t var;
var.shared_var = 0;

var.individual_var[1] = 1;
int y = var.shared_var;

So can 'y' in this case be 0?
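A minimal single-threaded sketch of the snippet above, written in C, where reading a union member other than the one last written is permitted. The helper name `store_half_then_load_whole` is hypothetical, and `y` is widened to `uint64_t` here because an `int` would typically truncate the value to its low 32 bits. This only shows what the storing thread itself reads back; it does not exercise what another processor may observe.

```c
#include <stdint.h>

typedef union {
  uint64_t shared_var;
  uint32_t individual_var[2];
} my_union_t;

/* Hypothetical helper: repeats the stores and the load from the question
   in a single thread and returns what the wide load reads back. */
uint64_t store_half_then_load_whole(void) {
  volatile my_union_t var;     /* volatile keeps the compiler from folding the accesses */
  var.shared_var = 0;          /* 64-bit store */
  var.individual_var[1] = 1;   /* 32-bit store into one half of the same 8 bytes */
  return var.shared_var;       /* 64-bit load that covers the 32-bit store */
}
```

On a little-endian x86 target the function should return 4294967296 (1 << 32) and not 0, because a thread always sees its own earlier stores in program order.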

Posted by Sergey Kostrov
Quoting syljak: ...But what happens when the store and the load are for different locations, but the load encompasses the store completely, e.g.:

typedef union {
  uint64_t shared_var;
  uint32_t individual_var[2];
} my_union_t;

my_union_t var;
var.shared_var = 0;

var.individual_var[1] = 1;
int y = var.shared_var;

So can 'y' in this case be 0?

No. I use C unions extensively and they are at the core of some union-based functionality. It would be a complete technical disaster if a CPU reordered these two assignments.

Best regards,
Sergey

Posted by Sergey Kostrov
Quoting syljak: In Intel's processor manual, section 8.2.3.4, it is stated that loads may be reordered with earlier stores to different locations...

The two union members shown in your example are not at different memory address locations.

Please take a look in a debugger (Memory window) at what happens when the first assignment is done, and then at how the second assignment changes the value at the same memory address for a variable of type 'my_union_t'.

Best regards,
Sergey

Posted by jimdempseyatthecove

Syljak,

Keep in mind that the inner levels of the CPU deal in cache line sized data with respect to loads and stores.

Then, modifying your original post, assume that your var spanned two cache lines (var.individual_var[0] in one cache line, [1] in the next cache line). Assume further that the var components are not in cache.

var.shared_var = 0;
post read into cache RAM containing [0]
post read into cache RAM containing [1]
post modify [0] portion of cache line containing [0]
post modify [1] portion of cache line containing [1]
(writes may reorder)

var.individual_var[1] = 1;
cache hit on [1], read of cache line stalls while write pending for same cache line
when prior write completes, post write of [1] portion of cache line containing [1]
Note: write ordering assures the [0] portion is written first by the = 0 statement

int y = var.shared_var;
cache hit on [0] portion
cache hit on [1] portion, but write may be pending, so pipeline may stall for write to cache to complete

IOW you will never see 0 in y (unless CPU microcode is defective)

The above is descriptive as opposed to technical. The CPU designers are very good at their job. In the above circumstance, the refetch may bypass the write completion (at least for same core).
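The split placement assumed above can be sketched with plain address arithmetic. This is only an illustration under a stated assumption: a 64-byte cache line (the constant `ASSUMED_LINE_SIZE` and both helper names are hypothetical, and the real line size depends on the processor).

```c
#include <stddef.h>
#include <stdint.h>

#define ASSUMED_LINE_SIZE 64u  /* assumption: 64-byte cache lines */

/* Index of the (assumed) cache line that a byte address falls in. */
uint64_t line_index(uint64_t addr) {
  return addr / ASSUMED_LINE_SIZE;
}

/* Nonzero if the object [addr, addr + size) touches more than one line.
   size must be at least 1. */
int straddles_line(uint64_t addr, size_t size) {
  return line_index(addr) != line_index(addr + size - 1);
}
```

For an 8-byte var placed at address 60, `straddles_line(60, 8)` is nonzero: individual_var[0] occupies bytes 60..63 in line 0 and individual_var[1] occupies bytes 64..67 in line 1, while neither 4-byte half straddles a line on its own.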

Jim Dempsey

www.quickthreadprogramming.com
Posted by Sergey Kostrov
Quoting jimdempseyatthecove: ...Then modifying your original post, assume that your var spanned two cache lines (var.individual_var[0] in one cache line, [1] in the next cache line)...

Thank you, Jim! It is a very good example.

Best regards,
Sergey

Posted by Sergey Kostrov
Hi everybody,

I'd like to revive interest in that really great post; here are the results of my investigation.

Quoting jimdempseyatthecove: ...Then modifying your original post, assume that your var spanned two cache lines (var.individual_var[0] in one cache line, [1] in the next cache line)...

Jim, is it really possible? Please take a look at my comments.

This is a C union, not a C struct. Because of this, the memory for the 'shared' 64-bit component of the union is shared with the two 32-bit 'individual' components (an array of two) of the same union.

Also, C++ compilers for Windows platforms have 8-byte default alignment. If a _declspec( align(#) ) specifier is not used, as in our case, a C++ compiler should align data on natural boundaries, and in this case that is 8-byte alignment.

sizeof( my_union_t ) = 8

I'd like to rephrase your question:

Is it possible that, for a 64-bit variable V, the low part VL (32-bit) will be in cache line A and the high part VH (32-bit) will be in cache line B?

Best regards,
Sergey

Posted by Sergey Kostrov

Here is a test-case:

...

typedef unsigned __int64	RTuint64;
typedef __int32			RTint32;

...

typedef union tagMyUnion_t
{
	RTuint64 shared_var;
	RTint32 individual_var[2];
} MyUnion_t;

MyUnion_t uVar;
RTuint64 uiVar64;
RTint32 iVar32;

printf( ">> Size of Data type <<\n" );
printf( "\tSizeof( MyUnion_t ) = %ld bytes\n\n", ( RTint )sizeof( MyUnion_t ) );

printf( ">> Alignment Requirements of Data types <<\n" );
printf( "\t__alignof( MyUnion_t ) = %d bytes\n", __alignof( uVar ) );
printf( "\t__alignof( RTuint64  ) = %d bytes\n", __alignof( uiVar64 ) );
printf( "\t__alignof( RTint32   ) = %d bytes\n\n", __alignof( iVar32 ) );

// Case 1: write the 64-bit member, then read all three members back.
printf( ">> Case 1 <<\n" );
uVar.shared_var = 0;
printf( "\tuVar.shared_var = 0\n\n" );
printf( "\tMyUnion_t.shared_var        = %I64d\n", ( RTuint64 )uVar.shared_var );
printf( "\tMyUnion_t.individual_var[0] = %ld\n", ( RTint32 )uVar.individual_var[0] );
printf( "\tMyUnion_t.individual_var[1] = %ld\n", ( RTint32 )uVar.individual_var[1] );

// Case 2: write only the first 32-bit half.
printf( ">> Case 2 <<\n" );
uVar.individual_var[0] = 1;
printf( "\tuVar.individual_var[0] = 1\n\n" );
printf( "\tMyUnion_t.shared_var        = %I64d\n", ( RTuint64 )uVar.shared_var );
printf( "\tMyUnion_t.individual_var[0] = %ld\n", ( RTint32 )uVar.individual_var[0] );
printf( "\tMyUnion_t.individual_var[1] = %ld\n", ( RTint32 )uVar.individual_var[1] );

// Case 3: write only the second 32-bit half.
printf( ">> Case 3 <<\n" );
uVar.individual_var[1] = 1;
printf( "\tuVar.individual_var[1] = 1\n\n" );
printf( "\tMyUnion_t.shared_var        = %I64d\n", ( RTuint64 )uVar.shared_var );
printf( "\tMyUnion_t.individual_var[0] = %ld\n", ( RTint32 )uVar.individual_var[0] );
printf( "\tMyUnion_t.individual_var[1] = %ld\n", ( RTint32 )uVar.individual_var[1] );

// Case 4: write both 32-bit halves.
printf( ">> Case 4 <<\n" );
uVar.individual_var[0] = 55;
uVar.individual_var[1] = 77;
printf( "\tuVar.individual_var[0] = 55\n" );
printf( "\tuVar.individual_var[1] = 77\n\n" );
printf( "\tMyUnion_t.shared_var        = %I64d\n", ( RTuint64 )uVar.shared_var );
printf( "\tMyUnion_t.individual_var[0] = %ld\n", ( RTint32 )uVar.individual_var[0] );
printf( "\tMyUnion_t.individual_var[1] = %ld\n", ( RTint32 )uVar.individual_var[1] );

// Case 5: copy the 64-bit member into 'y' and print 'y' itself.
printf( ">> Case 5 <<\n" );
RTint64 y = uVar.shared_var;
printf( "\tRTint64 y = uVar.shared_var\n\n" );
printf( "\tVariable 'y'                = %I64d\n", ( RTint64 )y );

...

Posted by Sergey Kostrov

Here is the output of the test case:

...

>> Size of Data type <<
        Sizeof( MyUnion_t ) = 8 bytes

>> Alignment Requirements of Data types <<
        __alignof( MyUnion_t ) = 8 bytes
        __alignof( RTuint64  ) = 8 bytes
        __alignof( RTint32   ) = 4 bytes

>> Case 1 <<
        uVar.shared_var = 0

        MyUnion_t.shared_var        = 0
        MyUnion_t.individual_var[0] = 0
        MyUnion_t.individual_var[1] = 0

>> Case 2 <<
        uVar.individual_var[0] = 1

        MyUnion_t.shared_var        = 1
        MyUnion_t.individual_var[0] = 1
        MyUnion_t.individual_var[1] = 0

>> Case 3 <<
        uVar.individual_var[1] = 1

        MyUnion_t.shared_var        = 4294967297
        MyUnion_t.individual_var[0] = 1
        MyUnion_t.individual_var[1] = 1

>> Case 4 <<
        uVar.individual_var[0] = 55
        uVar.individual_var[1] = 77

        MyUnion_t.shared_var        = 330712481847
        MyUnion_t.individual_var[0] = 55
        MyUnion_t.individual_var[1] = 77

>> Case 5 <<
        RTint64 y = uVar.shared_var

        Variable 'y'                = 330712481847

...
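The shared_var values in this output follow from the little-endian layout of the union: individual_var[0] is the low 32 bits and individual_var[1] is the high 32 bits. A small sketch of that arithmetic (the helper name `combine_halves` is hypothetical, and the little-endian layout is an assumption that holds on x86):

```c
#include <stdint.h>

/* Value of the 64-bit member when the union is laid out little-endian:
   'low' is individual_var[0], 'high' is individual_var[1]. */
uint64_t combine_halves(uint32_t low, uint32_t high) {
  return ((uint64_t)high << 32) | low;
}
```

For Case 4 this gives 77 * 4294967296 + 55 = 330712481847, and for Case 3 it gives 1 * 4294967296 + 1 = 4294967297, matching the printed values.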

Posted by Sergey Kostrov
Quoting syljak: ...
typedef union {
uint64_t shared_var;
uint32_t individual_var[2];
} my_union_t;

my_union_t var;

var.shared_var = 0;
var.individual_var[1] = 1;

int y = var.shared_var;

So can 'y' in this case be 0?

>> Case 1 <<

If the assignments are done in the above order ('shared' first, 'individual[1]' second, and then 'y'), the output is as follows:

var.shared_var = 4294967296
var.individual_var[0] = 0
var.individual_var[1] = 1

y = 4294967296

>> Case 2 <<

If the assignments are reordered ('individual[1]' first, 'shared' second, and then 'y'), the output is as follows:

var.shared_var = 0
var.individual_var[0] = 0
var.individual_var[1] = 0

y = 0

Posted by jimdempseyatthecove

>>Then, C++ compilers for Windows platforms have 8-byte default alignment. If a _declspec( align(#) ) specifier is not used, as in our case, a C++ compiler should align data on natural boundaries and in that case this is 8-byte alignment.

The above is a false assumption. There is nothing in the C++ (or C) specification that assures such an alignment without a compiler-supported alignment directive (one that is also enforceable). In the event that a given version of a compiler (and runtime system) provides natural alignment, there is no assurance that the next version, or some other vendor's compiler, will assure such alignment.

On Windows 7 x64, VS 2010, 32-bit application:

// my_union.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"
#include <iostream>

#define uint64_t __int64
#define uint32_t __int32

typedef union {
  uint64_t shared_var;
  uint32_t individual_var[2];
} my_union_t;

int _tmain(int argc, _TCHAR* argv[])
{
	my_union_t	var1;
	char	c;
	my_union_t	var2;

	// Print the stack addresses of the two unions.
	std::cout << &var1 << " " << &var2 << std::endl;
	// displays 0037FBF0 0037FBD4

	// Print the addresses of two heap-allocated unions.
	my_union_t* avar1 = new my_union_t;
	my_union_t* avar2 = new my_union_t;
	std::cout << &avar1->shared_var << " " << &avar2->shared_var << std::endl;

	return 0;
}

Note the comment after the cout, the var2 address is not on a multiple of 8. This is what was displayed on the above mentioned system.

When compiling as 64-bit application we see:

000000000028FE28 000000000028FE48


Which is aligned on natural boundary.
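The alignment of each displayed address can be checked by isolating its lowest set bit. A small sketch of that check (the helper name `largest_pow2_alignment` is hypothetical):

```c
#include <stdint.h>

/* Largest power of two that divides addr (addr must be nonzero).
   addr & (~addr + 1) keeps only the lowest set bit. */
uint64_t largest_pow2_alignment(uint64_t addr) {
  return addr & (~addr + 1);
}
```

Applied to the values above, 0037FBF0 is 16-byte aligned and 0037FBD4 is only 4-byte aligned in the 32-bit build, while 000000000028FE28 and 000000000028FE48 are both 8-byte aligned in the 64-bit build.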

---------------

The natural alignment of a union is not the focus of this thread. Instead, the focus of this thread is "Can loads be reorder... with respect to writes". The assumptions in my post were meant to construct a scenario in which your question could plausibly be tested. Create a test scenario that tests the memory-ordering issue of your concern.

Note, the assumptions in my post specified a cache line split between individual_var[0] and [1], but not inside either individual_var[0] or [1]; IOW the union was at least 32-bit aligned.

The argument you wish (need) to resolve is not specific to your my_union_t but rather in general to reads and writes.

Jim Dempsey

www.quickthreadprogramming.com
Posted by Sergey Kostrov
Quoting jimdempseyatthecove: >>Then, C++ compilers for Windows platforms have 8-byte default alignment. If a _declspec( align(#) ) specifier is not used, as in our case, a C++ compiler should align data on natural boundaries and in that case this is 8-byte alignment.

The above is a false assumption.

[SergeyK] Can we trust MSDN? It looks like the answer is no, and a software developer should always verify such statements.

...Note, the assumptions in my post specified a cache line split between individual_var[0] and [1], but not inside either individual_var[0] or [1], IOW the union was at least 32-bit aligned...

I didn't mean inside the 'individual_var' members. There is another member of the union, 'shared_var', and it is a 64-bit data type.

I wanted to stress that these three members share the same 8-byte memory block in the union. Also, there is a mapping between these members of the union:

32-bit individual_var[0] is the lower part of 64-bit shared_var

and

32-bit individual_var[1] is the higher part of 64-bit shared_var

It means that the union 'my_union_t' cannot be split across two cache lines. Sorry for a little deviation from the main subject.

Best regards,
Sergey

Posted by Sergey Kostrov
Quoting jimdempseyatthecove: ...The argument you wish (need) to resolve is not specific to your my_union_t but rather in general to reads and writes...

Jim Dempsey

Let me repeat my question:

Is it possible that, for a 64-bit variable V, the low part VL (32-bit) will be in cache line A and the high part VH (32-bit) will be in cache line B?

Best regards,
Sergey

Posted by jimdempseyatthecove

>>It means that the union 'my_union_t' cannot be split across two cache lines. Sorry for a little deviation from the main subject

Did you read the comment in the source code I posted? It contains the displayed values for the addresses of two of your my_union_t variables (x32 app).

std::cout << &var1 << " " << &var2 << std::endl;
// displays 0037FBF0 0037FBD4

The first var lies on a 16-byte aligned address; the second lies on a 4-byte aligned address. This means your 64-bit variable shared_var could be split across two cache lines.
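Where a split has to be ruled out, the alignment can be requested explicitly instead of relying on compiler defaults. A minimal sketch, assuming a C11 compiler with `<stdalign.h>` (the thread itself mentions the vendor-specific _declspec( align(#) ) form); the helper names are hypothetical:

```c
#include <stdalign.h>
#include <stdint.h>

typedef union {
  uint64_t shared_var;
  uint32_t individual_var[2];
} my_union_t;

/* Nonzero if p is a multiple of 8. */
int is_8_byte_aligned(const void *p) {
  return ((uintptr_t)p % 8u) == 0;
}

/* Hypothetical check: a local that is explicitly aligned with C11 alignas. */
int aligned_local_is_8_byte_aligned(void) {
  char c = 0;                   /* odd-sized neighbour, as in the posted program */
  alignas(8) my_union_t var2;   /* explicit request instead of relying on defaults */
  var2.shared_var = (uint64_t)c;
  return is_8_byte_aligned(&var2);
}
```

An 8-byte object on an 8-byte boundary cannot cross a cache line boundary when the line size is a multiple of 8, which is the case for a 64-byte line.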

Jim

www.quickthreadprogramming.com
Posted by Sergey Kostrov
Quoting jimdempseyatthecove: ...std::cout << &var1 << " " << &var2 << std::endl;
// displays 0037FBF0 0037FBD4

The first var lies on a 16-byte aligned address; the second lies on a 4-byte aligned address.

[SergeyK] Yes, I read it and even verified it with Windows Calculator.

This means your 64-bit variable shared_var could be split across two cache lines.

[SergeyK] I see. Thank you, Jim.

Best regards,
Sergey

Posted by sureshgupta22

I've created a small test case in C++ that uses some OpenMP functionality and compiled it with the Intel C++ compiler.
I could post the source code if you need it; please let me know.
