Hello,

I am writing vector-intrinsics code (512-bit, for the MIC) to perform complex multiplication. The code is listed below.

1  typedef std::complex<float> Value;
2  void Benchmark::gridKernel(const int support,
3                             const Value C[],
4                             Value grid[], const int gSize)
5  {
6      int Nvec=8;                          // 8 complex floats = 16 floats = one 512-bit register
7      int nBlock,nrest,sSize_b;
8
9      nrest=sSize%Nvec;                    // elements left over after the full vector blocks
10     nBlock=(sSize-nrest)/Nvec;           // number of full vector blocks
11     sSize_b=sSize-nrest;                 // elements covered by the vector loop
12     …
13     for (int dind = bs; dind <= be; ++dind) {
14         …
15         gind=…
16         cind=…
17         Value gridc[sSize_b],Cc[sSize_b];
18         for (int suppu = 0; suppu < sSize_b; suppu++) {
19             gridc[suppu] = grid[gind+suppu];
20             Cc[suppu] = C[cind+suppu];
21         }
22         const Value d = samples[dind].data;
23         for (int suppu = 0; suppu < nBlock; suppu++) {
24             int sl=suppu*Nvec;
25             __m512 sam = _mm512_load_ps(( Real *) &Cc[sl]);
26             __m512 *gridptr = (__m512 *) &gridc[sl];
27             __m512 data_r = _mm512_set1_ps(d.real());
28             __m512 data_i = _mm512_set1_ps(d.imag());
29             __m512 t7 = _mm512_mul_ps(data_r, sam);
30             __m512 t6 = _mm512_mul_ps(data_i, sam);
31             __m512 t8 = _mm512_swizzle_ps(t6,_MM_SWIZ_REG_CDAB);   // swap neighbouring lanes (re <-> im)
32             __m512 t7c= t7;
33             __m512 t9 = _mm512_mask_sub_ps(t7c, 0x5555, t7, t8);   // even (real) lanes: t7 - t8
34             __m512 t9c= t9;
35             __m512 t10= _mm512_mask_add_ps(t9c, 0xAAAA, t9, t8);   // odd (imaginary) lanes: t9 + t8
36             gridptr[0] = _mm512_add_ps(gridptr[0], t10);
37         }//end suppu
38
39         for(int suppu=0;suppu<sSize_b;suppu++){
40             grid[gind+suppu]=gridc[suppu];
41         }
42
43         for (int suppu = sSize_b; suppu < sSize; suppu++) {       // scalar remainder
44             grid[gind+suppu] += d * C[cind+suppu];
45         }
46     }//end dind
47 }
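The arithmetic in rows 27 to 35 can be followed lane by lane. The sketch below emulates that sequence in plain C++ for a single complex value stored as two lanes {re, im}. The function name mul_by_lanes is hypothetical and is not part of the original code; it only mirrors the intrinsic steps so they can be checked on any machine.

```cpp
#include <array>
#include <complex>

// Emulates rows 27-35 on one complex value held as two lanes {re, im}.
// Lane 0 stands for the even (real) lanes selected by mask 0x5555,
// lane 1 for the odd (imaginary) lanes selected by mask 0xAAAA.
std::complex<float> mul_by_lanes(std::complex<float> d, std::complex<float> c) {
    std::array<float, 2> sam = {c.real(), c.imag()};
    std::array<float, 2> t7  = {d.real() * sam[0], d.real() * sam[1]};  // row 29
    std::array<float, 2> t6  = {d.imag() * sam[0], d.imag() * sam[1]};  // row 30
    std::array<float, 2> t8  = {t6[1], t6[0]};                          // row 31: swap neighbours
    std::array<float, 2> t9  = {t7[0] - t8[0], t7[1]};                  // row 33: subtract on lane 0 only
    std::array<float, 2> t10 = {t9[0], t9[1] + t8[1]};                  // row 35: add on lane 1 only
    return std::complex<float>(t10[0], t10[1]);
}
```

For d = 1+2i and c = 3+4i this gives -5+10i, the same as d * c computed with std::complex.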

As you can see above, this code multiplies each element of “C” by “d” and adds the results into the array “grid”. The memory for the arrays “grid” and “C” is allocated in another function with the following code:

grid = (Value *) _mm_malloc(gSize*gSize*sizeof(Value),64);
if(grid == NULL) exit (1);
C = (Value *) _mm_malloc(sizeofC*sizeof(Value),64);
if(C == NULL) exit (1);

Both arrays are 64-byte aligned. This code runs correctly on the MIC.
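For reference, the per-sample update that the kernel performs can be written in scalar form as the sketch below. The function name accumulateScalar is hypothetical, and gind, cind and sSize are passed in as parameters because the original listing computes them in elided code; this is only a baseline for comparing results, not the kernel itself.

```cpp
#include <complex>
#include <cstddef>

typedef std::complex<float> Value;

// Scalar reference for one sample:
//   grid[gind + i] += d * C[cind + i]   for i in [0, sSize)
// gind, cind and sSize are assumptions supplied by the caller.
void accumulateScalar(Value grid[], const Value C[], Value d,
                      std::size_t gind, std::size_t cind, std::size_t sSize) {
    for (std::size_t suppu = 0; suppu < sSize; ++suppu) {
        grid[gind + suppu] += d * C[cind + suppu];
    }
}
```

Running this next to the vectorized loop on the same inputs is one way to confirm that both paths produce the same “grid” contents.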

You may wonder why I use two temporary arrays, “gridc” and “Cc”, to hold pieces of “grid” and “C” before the computation, since that adds many memory copies and reduces performance. The reason is that the direct version fails. If I delete that code (rows 17 to 21 and rows 39 to 41) and replace rows 25 and 26 with the following,

__m512 sam = _mm512_load_ps(( Real *) &C[cind + sl]);
__m512 *gridptr = (__m512 *) &grid[gind + sl];

I get a “Segmentation fault (signal 11)” error when it runs on the MIC card. The icpc version is 14.0.2.144 Build 20140120.

I don’t know where this error comes from or how to fix it.
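One thing that can be checked is whether the addresses in the modified rows 25 and 26 are still 64-byte aligned. _mm512_load_ps expects a 64-byte aligned address, and although the base pointers come from _mm_malloc with 64-byte alignment, &C[cind + sl] and &grid[gind + sl] keep that alignment only when the index is a multiple of 8. Whether this is the actual cause here is an assumption, not a confirmed diagnosis. A minimal sketch of such a check, with hypothetical helper names:

```cpp
#include <complex>
#include <cstddef>
#include <cstdint>

typedef std::complex<float> Value;

// True if p is a multiple of 64 bytes, as an aligned 512-bit load/store expects.
bool isAligned64(const void *p) {
    return reinterpret_cast<std::uintptr_t>(p) % 64 == 0;
}

// For a 64-byte aligned base array of Value (8 bytes each), &base[index]
// stays 64-byte aligned only when index is a multiple of 8.
bool offsetKeepsAlignment(std::size_t index) {
    return (index * sizeof(Value)) % 64 == 0;
}
```

Calling isAligned64(&C[cind + sl]) and isAligned64(&grid[gind + sl]) just before the load, and printing gind and cind when either returns false, would show whether the failing iterations are the ones with unaligned offsets.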

Any advice?

Shaohua