Corruption with the optimization compiler option

Why does optimization transform the code so much that it changes the outcome, and how can I stop it?

For more information about compiler optimizations, see the Optimization Notice.

If the results change entirely, it's probably because optimization makes the result more sensitive to latent bugs, e.g. uninitialized data, pointer overruns, and so on.

However, you should also investigate safe options such as adding /fp:source, which disables optimizations that fall outside the C and C++ standards.

Besides turning on options that may flag uninitialized data and the like, other investigative tools include enabling vectorization loop by loop.
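
For example, here is a minimal sketch of isolating one loop at a time (an assumption on my part: the Intel C++ compiler is being used, where #pragma novector and #pragma vector always act on the loop that follows; the function below is purely illustrative):

// Illustrative only: keep every loop scalar except the one under test,
// then move the pragmas around to bisect which loop changes the results.
void Process(double* a, double* b, int n){
#pragma novector            // leave this loop unvectorized for now
    for(int i=0;i<n;++i){ a[i]=a[i]*2.0; }
#pragma vector always       // ask the vectorizer to vectorize this loop
    for(int i=0;i<n;++i){ b[i]=a[i]+b[i]; }
}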

 

I managed to get my code working with optimization on.

Now, if I may continue on this thread: what I'm trying to do is create a simple RAM bandwidth test. I'm using a C# application for a basic interface and a very small DLL with the following methods, which I P/Invoke and time with the Stopwatch in .NET 4.0:

unsigned long long* data;
bool Create(){
    data=new unsigned long long[134217728];
    return true;
}
bool Destroy(){
    delete[] data;
    return true;
}
bool Write(unsigned long long dval){
    for(auto i=0;i<134217728;++i){data[i]=dval;}
    return true;
}
bool Read(unsigned long long dval){
    auto errxor=dval;
    for(auto i=0;i<134217728;++i){errxor=data[i]^errxor;}
    return errxor!=dval;
}

The write speed is pretty good but the read performance is lacking. I have AVX intrinsics and parallelization enabled and loop unrolling set to 32, yet on inspection of the assembly code it appears to be using 128-bit instructions.

I am unsure of the best way to test the read performance; could you suggest any improvements?

Why did you set the unrolling value to 32?

If you have a Haswell CPU it can perform two load/store address operations per clock. I am not sure whether unrolling 32x will be helpful in your case. You can try setting the unroll factor to 2 and measuring the memory-read performance.

movaps xmm0,xmmword ptr[esi] //esi == input pointer

movaps xmm1,xmmword ptr[esi+16]
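
On the source side, the unroll factor can also be requested per loop with a pragma. A minimal sketch (assuming the Intel C++ compiler, where #pragma unroll(n) applies to the loop that follows; the function wrapper is only illustrative):

// Illustrative only: ask for a 2x unroll of this one loop instead of a
// global /Qunroll:32 setting.
void WriteUnroll2(unsigned long long* data, unsigned long long dval){
#pragma unroll(2)
    for(int i=0;i<134217728;++i){ data[i]=dval; }
}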

The Read() function will also execute more slowly because of the bitwise XOR operator, since each value of the data[i] array is XORed with the errxor variable. If I am not mistaken, there could also be an issue of loads blocked by store forwarding in errxor = data[i]^errxor;

Can you run VTune analysis on your code?

I have a Sandy Bridge E i7 3820 4.3GHz with quad channel 2133MHz RAM.
A loop does other work, like incrementing the loop variable, and that creates overhead, so the more you can do per iteration the less impact the overhead has; that's why I'm unrolling it. It doesn't do a whole lot, but high numbers do squeeze out a few hundred more MB/s. I'm only trying to learn about this stuff; it's nothing important.

I want it to detect errors too but can't think of a better way to do it. The XOR is faster than "if(data[i]!=dval)error=true;" by about 18 GB/s to 16 GB/s (with /Qparallel off), and I know my RAM will go about 50 GB/s.

What information would you like from the VTune analysis?

The loop logic can be executed in parallel: Port 0 can increment the integer loop variable and Port 5 can calculate the branch, although of course execution of the loop code must wait for the branch result. If the loop was vectorized successfully, then Port 1 could handle the vector integer arithmetic.

VTune analysis could shed more light on performance issues.

By the way, if you want, you can also measure the loop overhead on its own, without executing the code inside the loop. Moreover, unrolling by such a large factor could increase register pressure.

Without modifying the Read or Write methods, the write performance is slower with a std::vector, down from about 16 GB/s to about 10 GB/s.

After inspecting the disassembly, the vector version has no SIMD instructions applied and is not being unrolled. With loop unrolling set to 16, this is the disassembly of the Write loop with a std::vector:

0FC5126C  mov         esi,eax  
0FC5126E  inc         eax  
0FC5126F  shl         esi,4  
0FC51272  mov         edi,dword ptr [data (0FC54980h)]  
0FC51278  mov         dword ptr [edi+esi],ecx  
0FC5127B  mov         dword ptr [edi+esi+4],edx  
0FC5127F  mov         edi,dword ptr [data (0FC54980h)]  
0FC51285  mov         dword ptr [edi+esi+8],ecx  
0FC51289  mov         dword ptr [edi+esi+0Ch],edx  
0FC5128D  cmp         eax,4000000h  
0FC51292  jb          Write+0Ch (0FC5126Ch)

and this is a normal array:

0F621080  vmovntdq    xmmword ptr [ecx+esi*8],xmm0  
0F621085  vmovntdq    xmmword ptr [ecx+esi*8+10h],xmm0  
0F62108B  vmovntdq    xmmword ptr [ecx+esi*8+20h],xmm0  
0F621091  vmovntdq    xmmword ptr [ecx+esi*8+30h],xmm0  
0F621097  vmovntdq    xmmword ptr [ecx+esi*8+40h],xmm0  
0F62109D  vmovntdq    xmmword ptr [ecx+esi*8+50h],xmm0  
0F6210A3  vmovntdq    xmmword ptr [ecx+esi*8+60h],xmm0  
0F6210A9  vmovntdq    xmmword ptr [ecx+esi*8+70h],xmm0  
0F6210AF  vmovntdq    xmmword ptr [ecx+esi*8+80h],xmm0  
0F6210B8  vmovntdq    xmmword ptr [ecx+esi*8+90h],xmm0  
0F6210C1  vmovntdq    xmmword ptr [ecx+esi*8+0A0h],xmm0  
0F6210CA  vmovntdq    xmmword ptr [ecx+esi*8+0B0h],xmm0  
0F6210D3  vmovntdq    xmmword ptr [ecx+esi*8+0C0h],xmm0  
0F6210DC  vmovntdq    xmmword ptr [ecx+esi*8+0D0h],xmm0  
0F6210E5  vmovntdq    xmmword ptr [ecx+esi*8+0E0h],xmm0  
0F6210EE  vmovntdq    xmmword ptr [ecx+esi*8+0F0h],xmm0  
0F6210F7  add         esi,20h  
0F6210FA  cmp         esi,eax  
0F6210FC  jb          Write+50h (0F621080h) 

The compiler decided to use only one XMM register (XMM0) to hold the function argument, and I think that CPU Port 2 and Port 3 can still issue two memory stores per clock. I do not know why the compiler did not precalculate the pointer offsets in advance, which would probably have reduced the load on the AGU.

 

vmovntdq  xmmword ptr [ecx+esi],xmm0

vmovntdq xmmword ptr [ecx+esi+16],xmm0

vmovntdq xmmword ptr [ecx+esi+32],xmm0

// unrolling code continues

add esi,256

cmp esi,eax

 

So I need to find out why it's not using the other registers...

The best option is to perform VTune analysis and post the results.

What type of analysis in VTune?

That might explain why I'm getting a higher read speed:

77B611F0  vpxor       xmm0,xmm0,xmmword ptr [edi+eax*8]  
77B611F5  vpxor       xmm1,xmm0,xmmword ptr [edi+eax*8+10h]  
77B611FB  vpxor       xmm2,xmm1,xmmword ptr [edi+eax*8+20h]  
77B61201  vpxor       xmm3,xmm2,xmmword ptr [edi+eax*8+30h]  
77B61207  vpxor       xmm4,xmm3,xmmword ptr [edi+eax*8+40h]  
77B6120D  vpxor       xmm5,xmm4,xmmword ptr [edi+eax*8+50h]  
77B61213  vpxor       xmm6,xmm5,xmmword ptr [edi+eax*8+60h]  
77B61219  vpxor       xmm7,xmm6,xmmword ptr [edi+eax*8+70h]  
77B6121F  vpxor       xmm0,xmm7,xmmword ptr [edi+eax*8+80h]  
77B61228  vpxor       xmm1,xmm0,xmmword ptr [edi+eax*8+90h]  
77B61231  vpxor       xmm2,xmm1,xmmword ptr [edi+eax*8+0A0h]  
77B6123A  vpxor       xmm3,xmm2,xmmword ptr [edi+eax*8+0B0h]  
77B61243  vpxor       xmm4,xmm3,xmmword ptr [edi+eax*8+0C0h]  
77B6124C  vpxor       xmm5,xmm4,xmmword ptr [edi+eax*8+0D0h]  
77B61255  vpxor       xmm6,xmm5,xmmword ptr [edi+eax*8+0E0h]  
77B6125E  vpxor       xmm0,xmm6,xmmword ptr [edi+eax*8+0F0h]  
77B61267  add         eax,20h  
77B6126A  cmp         eax,ecx  
77B6126C  jb          Read+50h (77B611F0h) 

It's using all of the XMM registers, but 256-bit AVX uses YMM registers, each twice the width of an XMM, so wouldn't it be using YMM if it were really using AVX?

Taking a guess about what you may have done or plan to do, you may need to specify and assert 32-byte alignment to get YMM moves for data which aren't defined as __m256 types. Your code excerpt in #3 above would produce 16-byte alignment (you have no control, although it may happen to be 32-byte aligned). On Sandy Bridge (Core i7-2), splitting moves down to 128 bits is a big advantage when there are misalignments, and may still prove better on Ivy Bridge (Core i7-3).

In my intrinsics code I make the switch to AVX instructions conditional on building for AVX2, as I find the SSE2 intrinsics running faster with 50% of data not aligned on early AVX platforms. In turn, I make the use of SSE2 intrinsics conditional on building for SSE3, since early SSE2 platforms performed poorly on unaligned data.  Intel C++ promotes those SSE2 intrinsics to AVX-128 such as you show when AVX is set, and drops the vzeroupper which has to be inserted afterwards in order to use MSVC or gcc to run SSE2 with adequate performance on an AVX platform.
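
A minimal sketch of that kind of conditional compilation (assumptions on my part: the usual predefined macros __AVX2__ and __SSE3__ are available, and the function and buffer names below are purely illustrative):

#include <cstddef>
#include <immintrin.h>
// Illustrative only: choose the widest path the build target justifies,
// in the spirit described above. Remainder elements are omitted for brevity.
void FillBuffer(long long* p, long long v, size_t n){
#if defined(__AVX2__)
    __m256i w=_mm256_set1_epi64x(v);   // 256-bit path only for AVX2-class targets
    for(size_t i=0;i+4<=n;i+=4) _mm256_storeu_si256((__m256i*)(p+i), w);
#elif defined(__SSE3__)
    __m128i w=_mm_set1_epi64x(v);      // 128-bit SSE2 intrinsics from SSE3 targets up
    for(size_t i=0;i+2<=n;i+=2) _mm_storeu_si128((__m128i*)(p+i), w);
#else
    for(size_t i=0;i<n;++i) p[i]=v;    // plain scalar fallback
#endif
}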

My code is still the same as it is in post #3; could you show me how to do this alignment?

I tried "#pragma vector aligned" but it made no difference.

Any improvements to my code are quite welcome.

I set the vectorizer diagnostic level to 6 and this is what it said:

warning : vectorization support: reference data has unaligned access
warning : vectorization support: unaligned access used inside loop body
warning : vectorization support: streaming store was generated for data
warning : LOOP WAS VECTORIZED
warning : vectorization support: reference data has unaligned access
warning : vectorization support: unaligned access used inside loop body
warning : vectorization support: unroll factor set to 4
warning : LOOP WAS VECTORIZED
warning : vectorization support: reference data has unaligned access
warning : vectorization support: unaligned access used inside loop body
warning : REMAINDER LOOP WAS VECTORIZED

What does it want me to do?

Quote (CommanderLake wrote):

That might explain why I'm getting a higher read speed (vpxor disassembly quoted above): it's using all of the XMM registers, but 256-bit AVX uses YMM registers, so wouldn't it be using YMM if it were really using AVX?

Actually, in this case two loads can probably be executed in parallel, but only one of them will be XORed at a time because of the dependency between iterations. I suppose that this kind of unrolling puts a lot of pressure on register usage, but it could transfer more data into cache lines (probably by hardware prefetching) while the CPU is busy executing vpxor instructions. Also, the vpxor instruction is probably cached because of its high frequency of usage.

Regarding VTune, start with the Bandwidth analysis.

http://software.intel.com/sites/products/documentation/doclib/iss/2013/amplifier/lin/ug_docs/GUID-96C7C275-28FB-484F-AE3B-7304C0DE91C2.htm

Quote (CommanderLake wrote):

My code is still the same as it is in post #3; could you show me how to do this alignment? I tried "#pragma vector aligned" but it made no difference. I set the vectorizer diagnostic level to 6 (output quoted above); what does it want me to do?

That means the loop was vectorized using XMM registers and streaming stores were used to transfer the data.

If your data are defined without alignment, where the compiler can see that, the pragma vector aligned may be ignored. Of course, if the compiler doesn't see the lack of alignment and applies the pragma, your code may break. You ask us to suggest improvements without showing us what you are currently using.

I said my code is still pretty much what it was in post #3, but I just tried splitting the array into 2 and even 4 arrays and got a performance improvement; it doesn't seem to want to go over 20 GB/s, though.

What I was originally trying to get at with this thread was the error checking: the compiler assumes errxor is of no significance, so the optimization makes it harder to do the error checking.

"//data0=(unsigned long long*)_mm_malloc(268435456,16);"

You can pass 64 as the alignment argument to _mm_malloc(), but it should go together with allocating the memory in multiples of one page (4 KB).
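
A minimal sketch of what that allocation could look like in place of new[] (assumptions: _mm_malloc/_mm_free as declared in xmmintrin.h, and the same buffer sizes as in the code above):

#include <xmmintrin.h>   // _mm_malloc / _mm_free

unsigned long long* data0;
unsigned long long* data1;
bool Create(){
    // 67108864 elements * 8 bytes = 512 MiB, already a multiple of the 4 KB
    // page size; 64-byte alignment also covers 16/32-byte SSE/AVX needs.
    data0=(unsigned long long*)_mm_malloc(67108864ull*sizeof(unsigned long long),64);
    data1=(unsigned long long*)_mm_malloc(67108864ull*sizeof(unsigned long long),64);
    return data0!=0&&data1!=0;
}
bool Destroy(){
    _mm_free(data0);
    _mm_free(data1);
    return true;
}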

I managed to get 21 GB/s read and 19.3 GB/s write with 2 arrays; I think that must be as fast as a single thread can go:

#include <memory.h>
#include <ia32intrin.h>
unsigned long long* data0;
unsigned long long* data1;
bool Create(){
    data0=new unsigned long long[67108864];
    data1=new unsigned long long[67108864];
    return true;
}
bool Destroy(){
    delete[] data0;
    delete[] data1;
    return true;
}
bool Write(unsigned long long dval){
    #pragma simd
    for(long i=0;i<67108864;++i){data1[i]=data0[i]=dval;}
    return true;
}
//volatile unsigned long long errxor;
bool Read(unsigned long long dval, unsigned long long errxor){
    #pragma simd
    for(long i=0;i<67108864;++i){errxor=errxor^data1[i]^data0[i];}
    return errxor!=dval;
}

Write disassembly:

vmovdqu     xmmword ptr [ebx+eax*8],xmm0
vmovntdq    xmmword ptr [ecx+eax*8],xmm0
add         eax,2  
cmp         eax,edx  
jb          Write+50h (0FDA10A0h)  

Read disassembly:

vpxor       xmm0,xmm0,xmmword ptr [ebx+eax*8]  
vpxor       xmm1,xmm0,xmmword ptr [edx+eax*8]  
vpxor       xmm2,xmm1,xmmword ptr [ebx+eax*8+10h]
vpxor       xmm3,xmm2,xmmword ptr [edx+eax*8+10h]
vpxor       xmm4,xmm3,xmmword ptr [ebx+eax*8+20h]
vpxor       xmm5,xmm4,xmmword ptr [edx+eax*8+20h]
vpxor       xmm6,xmm5,xmmword ptr [ebx+eax*8+30h]
vpxor       xmm0,xmm6,xmmword ptr [edx+eax*8+30h]
add         eax,8  
cmp         eax,ecx  
jb          Read+47h (0FDA1127h)  

Bandwidth analysis attached; /Qparallel is off and loop unrolling is decided by the compiler. This is the code used:

#include <memory.h>
#include <ia32intrin.h>
unsigned long long* data0;
unsigned long long* data1;
bool Create(){
    data0=new unsigned long long[67108864];
    data1=new unsigned long long[67108864];
    return true;
}
bool Destroy(){
    delete[] data0;
    delete[] data1;
    return true;
}
bool Write(unsigned long long dval){
#pragma simd
//#pragma parallel
    for(long i=0;i<67108864;++i){data1[i]=data0[i]=dval;}
    return true;
}
bool Read(unsigned long long dval, unsigned long long errxor){
#pragma simd
//#pragma parallel
    for(long i=0;i<67108864;++i){errxor=errxor^data1[i]^data0[i];}
    return errxor!=dval;
}

I still can't make the error checking work in the Read method; please help.

Attachment: Bandwidth.7z (5.49 MB)

Check this out: I maxed out the write speed at about 55 GB/s by adding my own AVX intrinsics and enabling /Qparallel:

bool Write(double dval){
    __m256d ymm0=_mm256_set1_pd(dval);
#pragma parallel
    for(long i=0;i<134217728;i+=4){
        // data0 must be 32-byte aligned for _mm256_store_pd
        _mm256_store_pd((double*)(data0+i), ymm0);
    }
    return true;
}

And here are those big juicy YMMs!

inc         ebx
vmovntpd    ymmword ptr [edi+edx],ymm0
vmovntpd    ymmword ptr [edi+edx+20h],ymm0
add         edi,40h
cmp         ebx,esi
jb          Write+17Bh (0F2C11BBh)

It seems that you finally got a real improvement in the memory speed of your test case.
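
If you want to request the streaming stores explicitly rather than rely on the compiler converting _mm256_store_pd, there is a matching intrinsic. A minimal sketch (assumptions: the destination is 32-byte aligned, e.g. from _mm_malloc(size, 32), and the function name is only illustrative):

#include <immintrin.h>
// Illustrative only: explicit non-temporal 256-bit stores; the count n must
// be a multiple of 4 doubles.
void WriteStream(double* dst, double dval, long n){
    __m256d v=_mm256_set1_pd(dval);
    for(long i=0;i<n;i+=4){
        _mm256_stream_pd(dst+i, v);   // compiles to vmovntpd
    }
    _mm_sfence();                     // order the non-temporal stores
}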

 

I'm still having trouble with the Read method and error checking. How do I make the following parallelizable (if that's even a word)?

bool Read(double dval, double errxor){
    double errxord[4]={errxor, errxor, errxor, errxor};
    __m256d ymm0=_mm256_loadu_pd(errxord);
    for(long i=0;i<134217728;i+=4){
        // cast needed if data0 is still declared as unsigned long long*
        ymm0=_mm256_xor_pd(ymm0, _mm256_loadu_pd((double*)(data0+i)));
    }
    return ymm0.m256d_f64[0]!=dval||ymm0.m256d_f64[1]!=dval||ymm0.m256d_f64[2]!=dval||ymm0.m256d_f64[3]!=dval;
}

OpenMP provides an XOR reduction; if you don't like OpenMP, you will have to think about the analogy. Each thread does a private reduction, and the results are combined at the end, probably in tree fashion if a significant number of threads is used.
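
A minimal sketch of that XOR reduction with OpenMP (assumptions: the integer version of the buffer from earlier posts, a build with /Qopenmp, and illustrative names):

#include <omp.h>
// Illustrative only: each thread XOR-reduces its share of the array into a
// private copy of errxor; OpenMP combines the private copies at the end.
bool ReadOMP(const unsigned long long* data, long n, unsigned long long dval){
    unsigned long long errxor=dval;
#pragma omp parallel for reduction(^:errxor)
    for(long i=0;i<n;++i){
        errxor^=data[i];
    }
    return errxor!=dval;   // same XOR error-check scheme as the scalar Read
}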

You could parallelize the for loop by giving each thread its own subset of the loop indices, for example thread 0 takes i = 0, n, 2n, ..., thread 1 takes i = 1, n+1, 2n+1, ..., and so on for n threads, up to the array length. The hardest part will be the thread synchronisation.

I've dumped the error checking; now I just want the fastest way to read RAM. How do I do that?

This is what I've tried:

for(long i=0;i<134217728;i+=4){
    volatile __m256i ymm0=_mm256_set_epi64x(data0[i], data0[i+1], data0[i+2], data0[i+3]);
}

for(long i=0;i<134217728;++i){
    volatile __int64 ptr=data0[i];
}

Both of them read into and then out of the register, and I think that's what's slowing it down; why can't I just read memory without doing anything else?

Hey, look: I got the assembly from the memory-read test loop in AIDA64. How do they do that?

sub         rsi,200h  
movntdqa    xmm0,xmmword ptr [rsi+170h]
movntdqa    xmm1,xmmword ptr [rsi+160h]
movntdqa    xmm2,xmmword ptr [rsi+150h]
movntdqa    xmm3,xmmword ptr [rsi+140h]
movntdqa    xmm4,xmmword ptr [rsi+130h]
movntdqa    xmm5,xmmword ptr [rsi+120h]
movntdqa    xmm6,xmmword ptr [rsi+110h]
movntdqa    xmm7,xmmword ptr [rsi+100h]
movntdqa    xmm8,xmmword ptr [rsi+0F0h]
movntdqa    xmm9,xmmword ptr [rsi+0E0h]
movntdqa    xmm10,xmmword ptr [rsi+0D0h]
movntdqa    xmm11,xmmword ptr [rsi+0C0h]
movntdqa    xmm12,xmmword ptr [rsi+0B0h]
movntdqa    xmm13,xmmword ptr [rsi+0A0h]
movntdqa    xmm14,xmmword ptr [rsi+90h]
movntdqa    xmm15,xmmword ptr [rsi+80h]
movntdqa    xmm0,xmmword ptr [rsi+70h]  
movntdqa    xmm1,xmmword ptr [rsi+60h]  
movntdqa    xmm2,xmmword ptr [rsi+50h]  
movntdqa    xmm3,xmmword ptr [rsi+40h]  
movntdqa    xmm4,xmmword ptr [rsi+30h]  
movntdqa    xmm5,xmmword ptr [rsi+20h]  
movntdqa    xmm6,xmmword ptr [rsi+10h]  
movntdqa    xmm7,xmmword ptr [rsi]  
movntdqa    xmm8,xmmword ptr [rsi-10h]  
movntdqa    xmm9,xmmword ptr [rsi-20h]  
movntdqa    xmm10,xmmword ptr [rsi-30h]
movntdqa    xmm11,xmmword ptr [rsi-40h]
movntdqa    xmm12,xmmword ptr [rsi-50h]
movntdqa    xmm13,xmmword ptr [rsi-60h]
movntdqa    xmm14,xmmword ptr [rsi-70h]
movntdqa    xmm15,xmmword ptr [rsi-80h]
sub         rsp,200h  
jne         0000000000490180  

 

I do not understand what you want to achieve. Do you mean not using registers for the memory transfer?

As I said in post #3, I just want to know how to benchmark RAM. It seems I will have to use inline assembly; can you tell me how I can read and write RAM using inline AVX assembly, without the optimizer interfering or doing any sort of processing that will slow it down?

Code inside an _asm{} block is not optimized. The compiler will probably add xmmword ptr or ymmword ptr to your code. You can write a set of simple SSE and AVX inline-assembly routines that only read and write memory. Allocate the memory with _mm_malloc() aligned on 64 bytes, and allocate it in blocks of the page size. For measuring execution speed my favourite option is the rdtsc intrinsic.
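
A minimal sketch of what such a read routine could look like (assumptions on my part: a 32-bit build where the MSVC/Intel _asm syntax is available and the inline assembler accepts SSE4.1 mnemonics, a hypothetical global byteCount that is a multiple of 64, and a buffer from _mm_malloc(byteCount, 64); note that rdtsc returns reference cycles rather than nanoseconds, and that on ordinary write-back memory movntdqa behaves like a regular load):

#include <intrin.h>   // __rdtsc

// Hypothetical globals used only for this sketch.
extern unsigned char* data0;      // 64-byte-aligned buffer
extern unsigned int   byteCount;  // buffer size in bytes, multiple of 64

unsigned long long ReadPass(){
    unsigned long long t0=__rdtsc();
    _asm{
        mov   esi, data0          // buffer base pointer
        mov   ecx, byteCount      // bytes to read
        shr   ecx, 6              // 64 bytes per iteration
    read_loop:
        movntdqa xmm0, [esi]      // four 16-byte loads per iteration
        movntdqa xmm1, [esi+16]
        movntdqa xmm2, [esi+32]
        movntdqa xmm3, [esi+48]
        add   esi, 64
        dec   ecx
        jnz   read_loop
    }
    return __rdtsc()-t0;          // elapsed reference cycles for one full pass
}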

By the way, you could get a better answer on the ISA forum by asking Dr. McCalpin; he is the author of the STREAM benchmark.

What do you mean by "allocating memory in blocks of page size"?

Post edited because I just realized my mistake: I shouldn't have used "#pragma vector always".

Allocate the memory in multiples of the page size (4 KB).
