Is _mm256_load_ps slower than _mm_load_ps?


I tried to improve the performance of some simple code via SSE and AVX, but I found that the AVX code needs more time than the SSE code:

#include <immintrin.h>   // SSE/AVX intrinsics
#include <malloc.h>      // _aligned_malloc / _aligned_free (MSVC)
#include <ctime>
#include <iostream>

void testfun()
{
    int dataLen = 4800;
    int N = 10000000;

    float *buf1 = reinterpret_cast<float*>(_aligned_malloc(sizeof(float)*dataLen, 32));
    float *buf2 = reinterpret_cast<float*>(_aligned_malloc(sizeof(float)*dataLen, 32));
    float *buf3 = reinterpret_cast<float*>(_aligned_malloc(sizeof(float)*dataLen, 32));

    for (int i = 0; i < dataLen; i++)
    {
        buf1[i] = 1;
        buf2[i] = 1;
        buf3[i] = 0;
    }

    clock_t t;
    clock_t timePassed;

    //========================= SSE CODE =====================================
    t = clock();
    __m128 *p1, *p2, *p3;
    for (int j = 0; j < N; j++)
    {
        p1 = (__m128 *)buf1;
        p2 = (__m128 *)buf2;
        p3 = (__m128 *)buf3;
        for (int i = 0; i < dataLen/4; i++)
        {
            *p3 = _mm_add_ps(_mm_mul_ps(*p1, *p2), *p3);
            p1++; p2++; p3++;
        }
    }
    timePassed = clock() - t;
    std::cout << "SSE time used: " << timePassed << "ms" << std::endl;

    for (int i = 0; i < dataLen; i++) { buf3[i] = 0; }

    //========================= AVX CODE =====================================
    t = clock();
    __m256 *pp1, *pp2, *pp3;
    for (int j = 0; j < N; j++)
    {
        pp1 = (__m256*)buf1;
        pp2 = (__m256*)buf2;
        pp3 = (__m256*)buf3;
        for (int i = 0; i < dataLen/8; i++)
        {
            *pp3 = _mm256_add_ps(_mm256_mul_ps(*pp1, *pp2), *pp3);
            pp1++; pp2++; pp3++;
        }
    }
    timePassed = clock() - t;
    std::cout << "AVX time used: " << timePassed << "ms" << std::endl;

    _aligned_free(buf1);
    _aligned_free(buf2);
    _aligned_free(buf3);   // buf3 was never freed in the original
}

I changed dataLen and got different results:

dataLen = 400    SSE time:   758 ms    AVX time:   483 ms    SSE > AVX
dataLen = 2400   SSE time:  4212 ms    AVX time:  2636 ms    SSE > AVX
dataLen = 2864   SSE time:  6115 ms    AVX time:  6146 ms    SSE ~= AVX
dataLen = 3200   SSE time:  8049 ms    AVX time:  9297 ms    SSE < AVX
dataLen = 4000   SSE time: 10170 ms    AVX time: 11690 ms    SSE < AVX

My L1 cache is 32 KB and my L2 cache is 1 MB. It seems that sometimes a 256-bit load is slower than a 128-bit load. Why? The result is the same if I change the code to use explicit load intrinsics such as _mm256_load_ps, _mm_load_ps, _mm_add_ps, and so on.

 

jimdempseyatthecove:

You haven't indicated your processor. My guess is you are on Sandy Bridge.

The next-generation Haswell will correct for this in your example with a 2x-wider data bus between the CPU and the L1/L2/L3.

See http://www.realworldtech.com/haswell-cpu/5/ for some insight.

Also, Haswell can perform FMA (Fused Multiply-Add) in one instruction (... = (B*C) + A).
And depending on the Haswell CPU, you can have up to 4 memory banks.

Use Sandy Bridge for learning how to use _mm256 (and for converting applications).
Use Haswell for production code.

Sandy Bridge gave you a year to adapt your code for Haswell (and later).

Jim Dempsey

 

www.quickthreadprogramming.com
iliyapolak:

Hi Zheng

Do you have a transition penalty between the SSE and AVX-256 code? Maybe during the execution of your code, SSE and AVX-256 instructions got intermixed in the YMM registers?

>>float *buf1 = reinterpret_cast<float*>(_aligned_malloc(sizeof(float)*dataLen, 32));
>>
>>float *buf2 = reinterpret_cast<float*>(_aligned_malloc(sizeof(float)*dataLen, 32));
>>
>>float *buf3 = reinterpret_cast<float*>(_aligned_malloc(sizeof(float)*dataLen, 32));

Did you verify that these three pointers are really aligned on a 32-byte boundary? Also, you've overcomplicated the memory allocation; why do you need the reinterpret_cast C++ operator?

Quote:

jimdempseyatthecove wrote:

You haven't indicated your processor. My guess is you are on Sandy Bridge.

Use Haswell for production code.

Sandy Bridge gave you a year to adapt your code for Haswell (and later).

Jim Dempsey

 

 

Hi Jim, thanks for your reply!

Yes, my processor is a Sandy Bridge (Xeon E3-1225 v2). Others can reproduce my results on Sandy Bridge processors.

For extra-large arrays and a simple calculation, like

for(int i=0; i<1000000000; i++)

{

A[i] += B[i]*C[i];

}

This loop is memory-bandwidth limited, so it cannot fully show the FMA's advantage.

I've tested my memory read and write speed; it is about 20 GB/s (DDR3, dual channel). I think this speed is fast, but still not fast enough.

Can I get faster?

iliyapolak:

>>>This loop is memory-bandwidth limited, so it cannot fully show the FMA's advantage.>>>

Can you run a VTune analysis on that code to see where the pipeline stalls are? Here I mean front-end pipeline stalls.

Quote:

iliyapolak wrote:

Hi Zheng

Do you have a transition penalty between the SSE and AVX-256 code? Maybe during the execution of your code, SSE and AVX-256 instructions got intermixed in the YMM registers?

I have looked at the assembly code: it uses the XMM registers in the SSE code and the YMM registers in the AVX code!

Also, I can comment out the SSE code or comment out the AVX code, and the result is the same!

Quote:

iliyapolak wrote:

>>>This loop is memory-bandwidth limited, so it cannot fully show the FMA's advantage.>>>

Can you run a VTune analysis on that code to see where the pipeline stalls are? Here I mean front-end pipeline stalls.

Hi, I don't have this software...

I got the conclusion just from some tests. For example, I reduce the size of the array and increase the iterations of the loop

(keeping the amount of calculation the same), and then I get different performance.

When the array can be stored entirely in the cache, the performance is the best!

Quote:

Sergey Kostrov wrote:

>>float *buf1 = reinterpret_cast<float*>(_aligned_malloc(sizeof(float)*dataLen, 32));
>>
>>float *buf2 = reinterpret_cast<float*>(_aligned_malloc(sizeof(float)*dataLen, 32));
>>
>>float *buf3 = reinterpret_cast<float*>(_aligned_malloc(sizeof(float)*dataLen, 32));

Did you verify that these three pointers are really aligned on a 32-byte boundary? Also, you've overcomplicated the memory allocation; why do you need the reinterpret_cast C++ operator?

 

Hi, I have checked the allocated memory; the pointers really are aligned on a 32-byte boundary.

If I change the alignment to 16 bytes, then the allocated memory is 16-byte aligned but not 32-byte aligned.

"reinterpret_cast<float*>" just converts the address to a float pointer. I am not so familiar with it...

Could you please give an example of creating 32-byte-aligned memory? Thanks!

iliyapolak:

>>>Hi, I don't have this software...

I got the conclusion just from some tests. For example, I reduce the size of the array and increase the iterations of the loop>>>

You can download a trial version of Parallel Studio. It is hard to say what exactly the limiting factor is in your case without collecting CPU counter data.

jimdempseyatthecove:

On Sandy Bridge the internal data path is still 128 bits. See: http://www.realworldtech.com/sandy-bridge/6/ 
Similar issue with Ivy Bridge.

Haswell has expanded the internal data path to 256 bits. See: http://www.hardwaresecrets.com/printpage/Inside-the-Intel-Haswell-Microarchitecture/1777

Jim Dempsey

www.quickthreadprogramming.com
John D. McCalpin:

Intel has not made definitive statements on this, but I can think of a few reasons why the 128-bit loads might be more efficient than the 256-bit loads on Sandy Bridge processors (and presumably Ivy Bridge processors as well).

(1) Two 128-bit loads can be issued to the two load ports in a single cycle, while a 256-bit load is issued to one port, which is then occupied for two cycles. It is certainly plausible that the former case allows better low-level scheduling.

(2) When the data is bigger than the L1 Dcache: there is evidence that the L1 Dcache can either provide 32 bytes/cycle to the core *or* receive 32 bytes/cycle to reload a line from the L2, but not both at the same time. Again, having independent 128-bit loads that can execute in a single cycle might allow better scheduling with respect to the timing of the L1 Dcache refills from the L2 than having 256-bit loads that occupy the L1 for two cycles.

(3) Intel has not disclosed enough details about the L1 Data Cache banking to really understand what is going on there.  It is possible that 256-bit loads hit bank conflicts more often than 128-bit loads, or that the impact of these delays is larger (because the 256-bit loads occupy a port for two cycles instead of one cycle).

(4) For data bigger than L1:  The L1 hardware prefetcher is activated by "streams" of load addresses.  Using 128-bit loads gets you a "stream" of loads faster than using 256-bit loads. Since the hardware prefetchers have to start all over again for every 4 KiB page (64 cache lines), being able to start prefetching from the L2 even a few cycles earlier might make a noticeable difference.

(5) It is important to check the assembly code carefully when using intrinsics!  Although these "look like" inline assembly, they are not, and the compiler may perform high-level optimizations that you don't expect.  I think that the differences seen here are real, but some of the details may depend on exactly what code the compiler decides to generate.

John D. McCalpin, PhD "Dr. Bandwidth"

The core LS performance isn't impacted by cache size or the hardware prefetcher in terms of answering this question. If you ran and measured the load latency for the given instructions, you'd find that 256-bit loads contained within a cache line have a latency of 9 clks on SB/IB, which is 2 clks greater than the non-256-bit loads. Now if your 256-bit load spans a cache-line boundary, you pay a penalty of 21 clks on SB/IB (it's 13 clks on HSW). That's why this is not prudent on SB/IB.

If you are using 128-bit vectors and you're not 16B aligned, a load crosses a cache-line boundary 1/4 of the time; with 256-bit loads it happens 1/2 the time. If you're 16B aligned, you won't pay this cache-line-spanning penalty in SSE/AVX-128, but you will in AVX-256 whenever you're not 32B aligned. To align, just do it in C code: over-allocate your malloc by a small buffer, round the returned address up to the next 32-byte boundary, and feed your code from that offset.

perfwise

iliyapolak:

I suppose that in the case of a 256-bit load, two additional cycles are needed to physically transport the additional 16 bytes of data.

>>...
>>could you please give an example of creating 32-byte-aligned memory?..

For the Release configuration, try the _mm_malloc and _mm_free intrinsic functions. Here is an example:

...
int *piMemoryBlock1 = NULL;
...
piMemoryBlock1 = ( int * )_mm_malloc( 777 * sizeof( int ), 32 );
...
// Some Processing
...
if( piMemoryBlock1 != NULL )
_mm_free( piMemoryBlock1 );
...

PS: I use these two intrinsic functions in ~95% of all memory allocation / de-allocation cases.

PS2: In the Debug configuration I use Microsoft's _malloc_dbg and _free_dbg functions, since they help detect memory leaks and buffer overflows.
