Using load to register __m512

We have the following sample code to run on Xeon Phi.

#include <iostream>
#include <memory.h>
#include <immintrin.h>
using namespace std;

void f(float* _amatr)
{
    __m512 a;
    a = _mm512_load_ps(_amatr + 1);
    _mm512_store_ps(_amatr + 1, a);
}

int main(int argc, char* argv[])
{
    __attribute__((aligned(64))) float _amatr[256];
    for (int i = 0; i < 256; i++)
        _amatr[i] = i + 1;
    f(_amatr);
    return 0;
}

We can load 16 single-precision floating-point elements into an __m512 register from _amatr, _amatr+16, _amatr+32, and so on.
But when we try to load 16 elements from an offset that is not a multiple of 16 (_amatr+1, etc.), we get a segmentation fault.
Here http://software.intel.com/en-us/comment/1762336#comment-1762336 it is said that this is because _amatr[0] is 64-byte aligned and _amatr[1] is not. We don't understand this. Can you clarify it?

Thanks.


Use _mm512_loadu_ps / _mm512_storeu_ps

"u" for unaligned

Jim Dempsey

www.quickthreadprogramming.com

Quote:

jimdempseyatthecove wrote:

Use _mm512_loadu_ps / _mm512_storeu_ps

I have not found these intrinsics in the documentation, and the compiler doesn't build code that uses them. I suppose you suggested them by analogy with AVX.

They are listed in the Intel Intrinsics Guide-windows-v3.0.1.

Did you forget to #include "zmmintrin.h"? (note the "z")

You will also need to include the (cross) compiler optimization switch to indicate the target is (or code contains offloads for) Xeon Phi.

Jim Dempsey


Fine, there is a description of how to use _mm512_loadu_ps in the Intel Intrinsics Guide-windows-v3.0.1, but I still can't find the _mm512_loadu_ps definition in zmmintrin.h, although _mm512_load_ps is defined there.

Compiler version is "icc (ICC) 14.0.0 20130728"

This may require an update; someone from Intel might step in here. For now, see if you can #define the intrinsic to use _mm512_mask_expandload_ps or _mm512_mask_loadu_ps (with the mask set to all ones, selecting every float). Or write the macro to use _mm256_loadu_ps... just as a workaround until a fix is made.

Or write an inline assembler statement for the missing intrinsic.

Jim Dempsey


Jim - I have not found that doc. Can you send me the location, please, and I will follow up on that?

Here's guidance I received regarding intrinsic use:

The intrinsics _mm512_load_ps and _mm512_store_ps require the address parameter to be 64-byte aligned.
 
The array _amatr is explicitly 64-byte aligned because of this:
  
 __attribute__((aligned(64))) float _amatr[256];
 
(_amatr+1) is *not* 64-byte aligned, because it has 4 bytes offset from the aligned pointer (so it is 64*n+4).
 
In order to do unaligned load/store on KNC, the following intrinsics can be used:
 

void f( float* _amatr)
{
 __m512 a;
 a = _mm512_loadunpacklo_ps(a, _amatr+1);
 a = _mm512_loadunpackhi_ps(a, _amatr+1+16);
 _mm512_packstorelo_ps(_amatr+1, a);
 _mm512_packstorehi_ps(_amatr+1+16, a);
}

 


Kevin, Jim,

many thanks.

Now everything is clear.

Jim, we have Xeon Phi 5110P Knights Corner coprocessors, and, as far as I know, only the intrinsics listed here http://software.intel.com/en-us/node/461060 are supported on this architecture.

The _mm512_loadu_ps / _mm512_storeu_ps / _mm512_mask_expandloadu_ps intrinsics are described in the "AVX-512" section ( http://software.intel.com/en-us/node/485315 , http://software.intel.com/en-us/node/485288 ), and I think AVX-512 instructions are supported only on the next generation, Xeon Phi Knights Landing.

Is my understanding correct?

>>

a = _mm512_loadunpacklo_ps(a, _amatr+1);  
a = _mm512_loadunpackhi_ps(a, _amatr+1+16);  
_mm512_packstorelo_ps(_amatr+1, a);  
_mm512_packstorehi_ps(_amatr+1+16, a);  
 

Those intrinsics are not listed in the previously linked v3.0.1 intrinsics guide.

Jim Dempsey


Relatively little of the Xeon Phi information has been integrated into the standard documentation, but it is usually not too hard to find the Xeon Phi specific documents.

The detailed descriptions of the _mm512_loadunpacklo and _mm512_loadunpackhi instructions (and macros) made my head hurt.  I still can't tell exactly what happens if you happen to give an aligned address to the instruction pair.  It looks like both of the instructions will load the same data into the register, but I could be mistaken.

The reason that I was puzzled is that the earlier compilers put tests around the "vloadunpackhd" instructions so that they would not execute if the pointer happened to be aligned.  The compiler does not put the test in there any longer.   The fact that it worked before (with only vloadunpackld) and still works now (with both instructions, but vloadunpackhd executed after vloadunpackld) suggests that either of the instructions will load all of the data into the register correctly if the pointer happens to be 16 Byte aligned.

John D. McCalpin, PhD "Dr. Bandwidth"

There’s a good discussion in James’ blog: http://software.intel.com/en-us/blogs/2013/avx-512-instructions confirming availability in Knights Landing.

The Xeon Phi intrinsics are (also) in the C/C++ User and Reference guides. http://software.intel.com/en-us/intel-software-technical-documentation. I was not familiar with these separate intrinsic guides before.

John - I’ll try to get clarification.

It was also advised that the correct usage is to include immintrin.h rather than zmmintrin.h.

Thank you all for your posts.

John, my apologies for the delayed follow-up. Here's some additional explanation from Development regarding your earlier comments.

The lo/hi pair of loadunpack instructions works as follows:

1) Unaligned case:

In the picture above (the attached unaligned.jpg), loadunpacklo(addr) will load the yellow part (elements starting from ‘addr’ up to the first 64-byte-aligned address following ‘addr’), and loadunpackhi(addr+64) will load the blue part (elements starting from the first 64-byte-aligned address following ‘addr’).

2) 64-byte-aligned case:

In the case of a 64-byte-aligned address ‘addr’, loadunpacklo(addr) will load all 16 elements starting from ‘addr’, and loadunpackhi(addr+64) will be a no-op, i.e. it will load nothing.

Regarding the comment: "The reason that I was puzzled is that the earlier compilers put tests around the "vloadunpackhd" instructions so that they would not execute if the pointer happened to be aligned."

The “earlier compilers” here probably means the KNF compilers (and maybe a very early KNC compiler). On KNF there was a hardware problem where loadunpackhi produced a fault for a 64-byte-aligned address. So, as a workaround, the compiler for KNF generated such a run-time check to avoid executing loadunpackhi with an aligned address. KNC hardware does not have this problem, so it is safe to always use the lo/hi pair of instructions.

Thanks for the clarification! (Were the pictures supposed to be included in the post?)

It does not really matter very much whether the vloadunpackhi is a nop or if it reloads the same data, but getting rid of nagging uncertainties like this one helps keep the brain a little clearer when trying to work through the meaning of the assembly-language output of the compiler.  

And you are correct that the compare and branch around the vloadunpackhi was something that I saw back in the KNF days.

John D. McCalpin, PhD "Dr. Bandwidth"

Glad that helped.

Yes, I intended the earlier attached images to be embedded in my earlier reply, where they now appear; however, the underlying Forum implementation was garbling the image URLs and displaying only a large empty box with a red-X missing-image symbol. I found a workaround: upload the image files to a different post (like this one) and then use the URLs of those file attachments to embed the images in the earlier post. Crazy, but it worked.

Attachments:

unaligned.jpg (9.98 KB)
aligned.jpg (11.15 KB)
