Why is it so difficult to write AVX code on MIC?


Hello,

I am writing AVX code to perform complex multiplication. The code is listed below:

  1 typedef std::complex<float> Value;
  2 void Benchmark::gridKernel(const int support,
  3                            const Value C[],
  4                            Value grid[], const int gSize)
  5 {
  6     int Nvec=8;
  7     int nBlock,nrest,sSize_b;
  8
  9     nrest=sSize%Nvec;
 10     nBlock=(sSize-nrest)/Nvec;
 11     sSize_b=sSize-nrest;
 12 …
 13     for (int dind = bs; dind <= be; ++dind) {
 14 …
 15                 gind=…
 16                 cind=…
 17             Value gridc[sSize_b],Cc[sSize_b];
 18             for (int suppu = 0; suppu < sSize_b; suppu++) {
 19                gridc[suppu] = grid[gind+suppu];
 20                Cc[suppu]    = C[cind+suppu];
 21             }
 22             const Value d = samples[dind].data;
 23             for (int suppu = 0; suppu < nBlock; suppu++) {
 24               int sl=suppu*Nvec;
 25               __m512 sam = _mm512_load_ps(( Real *) &Cc[sl]);
 26               __m512 *gridptr = (__m512 *) &gridc[sl];
 27               __m512 data_r = _mm512_set1_ps(d.real());
 28               __m512 data_i = _mm512_set1_ps(d.imag());
 29               __m512 t7 = _mm512_mul_ps(data_r, sam);
 30               __m512 t6 = _mm512_mul_ps(data_i, sam);
 31               __m512 t8 = _mm512_swizzle_ps(t6,_MM_SWIZ_REG_CDAB);
 32               __m512 t7c= t7;
 33               __m512 t9 = _mm512_mask_sub_ps(t7c, 0x5555, t7, t8);
 34               __m512 t9c= t9;
 35               __m512 t10= _mm512_mask_add_ps(t9c, 0xAAAA, t9, t8);
 36               gridptr[0] = _mm512_add_ps(gridptr[0], t10);
 37             }//end suppu
 38
 39             for(int suppu=0;suppu<sSize_b;suppu++){
 40                 grid[gind+suppu]=gridc[suppu];
 41             }
 42
 43             for (int suppu = sSize_b; suppu < sSize; suppu++) {
 44                 grid[gind+suppu] += d * C[cind+suppu];
 45             }
 46     }//end dind
 47 }

As you can see above, this code multiplies “C” by “d” and adds the results into the array “grid”. The memory for the arrays “grid” and “C” is allocated in another function with the following code:

grid = (Value *) _mm_malloc(gSize*gSize*sizeof(Value),64);
if(grid == NULL) exit (1);
C = (Value *) _mm_malloc(sizeofC*sizeof(Value),64);
if(C == NULL) exit (1);

These two arrays are 64-byte aligned. This code runs correctly on MIC.
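For readers following the intrinsic sequence in rows 29 to 36 of the listing, here is a scalar sketch of what those rows compute on one 16-float vector. It is plain C++ rather than MIC intrinsics, and the names `Lanes`, `swapPairs` and `mulAccumulate` are made up for illustration. It assumes the interleaved layout re0, im0, re1, im1, ..., so mask 0x5555 selects the even (real) lanes and 0xAAAA the odd (imaginary) lanes.

```cpp
#include <array>
#include <cassert>
#include <complex>
#include <cstddef>

// Hypothetical stand-in for one 512-bit register: 16 floats holding
// 8 interleaved complex values (re0, im0, re1, im1, ...).
using Lanes = std::array<float, 16>;

// Emulates _MM_SWIZ_REG_CDAB: swap the two floats inside each adjacent pair.
Lanes swapPairs(const Lanes& v) {
    Lanes out{};
    for (std::size_t i = 0; i < 16; i += 2) {
        out[i] = v[i + 1];
        out[i + 1] = v[i];
    }
    return out;
}

// Emulates rows 29 to 36: returns grid + d * sam, lane by lane.
Lanes mulAccumulate(const Lanes& grid, const Lanes& sam, std::complex<float> d) {
    Lanes t7{}, t6{};
    for (std::size_t i = 0; i < 16; ++i) {
        t7[i] = d.real() * sam[i];  // data_r * sam
        t6[i] = d.imag() * sam[i];  // data_i * sam
    }
    const Lanes t8 = swapPairs(t6);
    assert(t8[0] == t6[1]);  // sanity check on the swizzle emulation

    Lanes out = grid;
    for (std::size_t i = 0; i < 16; ++i) {
        // Even lanes (mask 0x5555): real part, dr*re - di*im.
        // Odd lanes  (mask 0xAAAA): imaginary part, dr*im + di*re.
        const float t10 = (i % 2 == 0) ? t7[i] - t8[i] : t7[i] + t8[i];
        out[i] += t10;
    }
    return out;
}
```

Each pair of output lanes then holds grid + d * C for one complex element, which is the same result the scalar loop in rows 43 to 45 produces.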

You may be curious why I use two temporary arrays, “gridc” and “Cc”, to hold pieces of the arrays “grid” and “C” before the computation. That adds many memory copy and memory set operations and reduces performance. The reason is that I get an error if I delete that code (rows 17 to 21 and rows 39 to 41) and replace rows 25 to 26 with the following code:

__m512 sam = _mm512_load_ps(( Real *) &C[cind + sl]);
__m512 *gridptr = (__m512 *) &grid[gind + sl];

The result is a “Segmentation fault (signal 11)” error when it runs on the MIC card. The icpc version is 14.0.2.144 Build 20140120.

I don’t know where this error comes from or how to solve it.

Any advice?

Shaohua

Are you completely sure that the addresses from which you are loading are 64-byte aligned? (You don't show the calculations of cind and gind, so we can't tell the alignment properties of "&C[cind + sl]".)

For now you could add some assertions (or simply code) to check the alignment, something like this (composed into here, so not compiled or tested!):

#define FULLY_ALIGNED(p) (((uintptr_t(p)) & ((1<<6)-1)) == 0)

ASSERT (FULLY_ALIGNED(&C[cind + sl]));

etc.

Hello James,

Thanks for your quick reply.

I checked the alignment as you advised. The output is below:

void Benchmark::gridKernel(int, const std::complex<float> *, std::complex<float> *, int): Assertion `( ( ( uintptr_t(&C[cind+sl]) ) & ( (1<<6)-1 )) == 0 )' failed.

So it seems that &C[cind+sl] is actually not 64-byte aligned, is it? But I do align C with _mm_malloc(sizeofC*sizeof(Value),64). Why doesn’t this work?

Thanks a lot!

Shaohua

Hello James,

I also checked the alignment of the temporary arrays “Cc” and “gridc”, and the results show they are 64-byte aligned. So these local arrays are aligned automatically, but the allocated arrays are not correctly aligned, although I used _mm_malloc(..,64) to allocate them.

The array “C” is a private member of the class Benchmark; its memory is allocated and initialized in another member function of the same class.

I am very confused. Why doesn’t _mm_malloc(..,64) work? What should I do?

Best regards,

Shaohua

If you allocated C with the aligned malloc it will be aligned (you can easily check, of course), but that doesn't mean that &C[cind+sl] is properly aligned. Consider the case where cind == 1, for instance.

Hello James,

Yes, you are right. I just tested the alignment of each element of C and found that &C[0] is 64-byte aligned, but &C[1] is not. So how can I make sure every element in C is 64-byte aligned?

Regards,

Shaohua

Hello James,

Because the size of complex<float> is 8 bytes, if &C[0] is 64-byte aligned, then &C[i] is 64-byte aligned only if i%8 == 0. In the general case (cind+sl)%8 != 0, so &C[cind+sl] may not be 64-byte aligned. That’s why I cannot run my code without the temporary arrays Cc and gridc.
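The arithmetic above can be checked with a small sketch. The helper names `isAligned64` and `alignedIndices` are hypothetical, and the sketch assumes sizeof(std::complex<float>) is 8, as stated in the thread.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <cstdint>
#include <vector>

typedef std::complex<float> Value;  // 8 bytes: two floats

// True when p lies on a 64-byte boundary.
inline bool isAligned64(const void* p) {
    return (reinterpret_cast<std::uintptr_t>(p) & 63u) == 0;
}

// Collects the indices i in [0, n) for which &base[i] is 64-byte aligned.
std::vector<std::size_t> alignedIndices(const Value* base, std::size_t n) {
    assert(base != nullptr);
    std::vector<std::size_t> hits;
    for (std::size_t i = 0; i < n; ++i) {
        if (isAligned64(&base[i])) hits.push_back(i);
    }
    return hits;
}
```

With a base pointer that is itself 64-byte aligned, only every eighth index (0, 8, 16, ...) is reported, so an aligned load from &C[cind+sl] is safe only when cind+sl is a multiple of 8.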

However, the temporary arrays add many memory copy and memory set operations and reduce performance. Is there a better solution for this case?

Regards,

Shaohua

Surely the point of doing all this is that you want to be operating on eight complex numbers at once? (So cind should always be a multiple of eight).

I assume that before you started into this machine code level optimization you did investigate how well the compiler can vectorize this (when you tell it about alignments, appropriate "restrict"s and so on).

 typedef std::complex<float> Value;

sizeof(Value) == 2*sizeof(float) == 8

 const Value C[],

C is an array of values, you tested C for 64-byte alignment

512-bit/64-byte vector boundaries of C are at intervals

[0], [8], [16], ...

 int Nvec=8;

 for (int suppu = 0; suppu < nBlock; suppu++) {
   int sl=suppu*Nvec; //0, 8, 16, ...

With sl stepping by the vector stride, the C[sl] accesses are OK.

If the error is caused on the line referencing C[sl], then a likely culprit is that the number of elements in C is not a multiple of 8.

Or....

if C happens to be a row of a 2D array (held as 1D) that is 64-byte aligned, but the row length in the 1D representation is not a multiple of 8 elements of Value, then the first row would be aligned properly on the first call (when you looked at the alignment) but the second row will not be aligned on the second call.

Did you insert the asserts into your code or simply look using the debugger?

Write a macro that performs the assert with a print, then use the macro in front of all statements referencing memory. Something will show up. You can then define the macro as empty to noop it for full runtime speed.
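A minimal sketch of such a macro follows. The names `CHECK_ALIGNMENT`, `reportAlignment` and `ASSERT_ALIGNED` are hypothetical, and this variant only prints a message and keeps running; add an abort after the print if a hard stop is wanted.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// Hypothetical switch: compile with -DCHECK_ALIGNMENT=0 to turn the checks into no-ops.
#ifndef CHECK_ALIGNMENT
#define CHECK_ALIGNMENT 1
#endif

// Prints a diagnostic when p is not 64-byte aligned; returns whether it was aligned.
inline bool reportAlignment(const void* p, const char* expr, const char* file, int line) {
    const bool ok = (reinterpret_cast<std::uintptr_t>(p) & 63u) == 0;
    if (!ok) {
        std::fprintf(stderr, "%s:%d: %s = %p is not 64-byte aligned\n",
                     file, line, expr, p);
    }
    return ok;
}

#if CHECK_ALIGNMENT
#define ASSERT_ALIGNED(p) ((void)reportAlignment((p), #p, __FILE__, __LINE__))
#else
#define ASSERT_ALIGNED(p) ((void)0)
#endif
```

Placing ASSERT_ALIGNED(&C[cind + sl]); and ASSERT_ALIGNED(&grid[gind + sl]); in front of the loads would then name the offending expression and source line the first time a misaligned address shows up.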

Jim Dempsey

Hello Jim,

C is a 1D array. I have tested the alignment of C. The results show that &C[0] is 64-byte aligned, but in the general case &C[cind+sl] is not. That’s why I cannot run my code without the temporary arrays Cc and gridc.

However, the temporary arrays add many memory copy and memory set operations and reduce performance. Is there a better solution for this case?

Thanks a lot!

Shaohua.

Can you reformulate your data sizes such that gind and cind will always be a multiple of the vector size?

Often this can be done by having non-equal work or adding cells and discarding extra results (which are computed for free).
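One way to read "adding cells" is to pad each row length up to a multiple of the vector size. The thread does not show how gind and cind are computed, so the row-based layout below and the names `roundUp` and `paddedOffset` are purely illustrative assumptions.

```cpp
#include <cassert>
#include <cstddef>

// Rounds n up to the next multiple of vec.
inline std::size_t roundUp(std::size_t n, std::size_t vec) {
    assert(vec > 0);
    return (n + vec - 1) / vec * vec;
}

// Hypothetical padded layout: each row occupies roundUp(rowLength, vec)
// elements, so the first element of every row sits at a multiple of vec.
inline std::size_t paddedOffset(std::size_t row, std::size_t rowLength, std::size_t vec) {
    return row * roundUp(rowLength, vec);
}
```

For example, a row of 129 complex values would be stored in 136 slots; the 7 extra slots are computed along with the rest and their results discarded.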

Jim Dempsey

Hello Jim,

I used another method to fix the problem. Remember that the size of grid[gind] is 8 bytes, and the array grid is 64-byte aligned because it was allocated with _mm_malloc(..,64).

Given a gind, if (gind%8) != 0, then the computation from gind to (gind + 8 - gind%8 - 1) uses the non-vectorized code. The memory address is 64-byte aligned at (gind + 8 - gind%8), so the computation beginning there uses the AVX code.

With this method, vectorization begins at the first 64-byte-aligned element at or after gind. The speedup is 2.7x compared with the non-vectorized version.
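The split described above can be sketched in portable C++. This is not the poster's code: the function name `accumulatePeeled` is hypothetical, the inner block loop is a scalar stand-in for the intrinsic sequence, and the sketch assumes grid is at least 8-byte aligned so that a 64-byte boundary can be reached in whole elements.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <cstdint>

typedef std::complex<float> Value;

// Accumulates grid[i] += d * C[i] for i in [0, n) in three phases:
// scalar prologue, 8-element blocks starting on a 64-byte boundary of grid,
// scalar remainder. Only the alignment of grid chooses the split, as in the
// post; C may still be misaligned inside the block phase.
void accumulatePeeled(Value* grid, const Value* C, Value d, std::size_t n) {
    const std::size_t vec = 8;  // 8 complex<float> values = 64 bytes
    const std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(grid);
    assert(addr % sizeof(Value) == 0);  // assumption stated above

    // Number of elements before grid reaches a 64-byte boundary.
    const std::size_t misalign = (addr / sizeof(Value)) % vec;
    std::size_t head = (vec - misalign) % vec;
    if (head > n) head = n;

    std::size_t i = 0;
    for (; i < head; ++i) grid[i] += d * C[i];  // scalar prologue
    for (; i + vec <= n; i += vec) {            // aligned blocks
        for (std::size_t k = 0; k < vec; ++k) grid[i + k] += d * C[i + k];
    }
    for (; i < n; ++i) grid[i] += d * C[i];     // scalar remainder
}
```

Called as accumulatePeeled(&grid[gind], &C[cind], d, sSize), the block phase is where an aligned store to grid would be valid. An aligned load from C would additionally need cind and gind to be congruent modulo 8.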

Regards,

Shaohua