# ZGEMM on Intel MIC

While implementing ZGEMM on Intel's MIC, I have run into the following problem and would appreciate any help. Say I need to compute OUT = M*IN, where OUT, M, and IN are complex doubles. I have multiplied 4 rows of M by a single column of IN and obtained the following 4 vectors:

O1 -> |a8|a7|a6|a5|a4|a3|a2|a1| // M1*IN0

O2 -> |b8|b7|b6|b5|b4|b3|b2|b1| // M2*IN0

O3 -> |c8|c7|c6|c5|c4|c3|c2|c1| // M3*IN0

O4 -> |d8|d7|d6|d5|d4|d3|d2|d1| // M4*IN0

I have to rearrange it into:

O1_new ->|d2|d1|c2|c1|b2|b1|a2|a1|

O2_new ->|d4|d3|c4|c3|b4|b3|a4|a3|

O3_new ->|d6|d5|c6|c5|b6|b5|a6|a5|

O4_new ->|d8|d7|c8|c7|b8|b7|a8|a7|

which is similar to a transpose. How can I achieve this using the C intrinsics for Larrabee with the fewest cycles?
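For reference, the requested rearrangement can be written as a plain scalar loop. This is only a sketch to pin down the index mapping, not an optimized MIC routine. The name `rearrange_ref` and the array layout (index 0 is the rightmost, lowest element in the diagrams above) are assumptions made for illustration.

```c
#include <stddef.h>

/* Hypothetical scalar reference for the rearrangement in the question.
   in[r][i]  : element i (0 = lowest lane, drawn rightmost) of O(r+1)
   out[k][i] : element i of O(k+1)_new
   Each pair of adjacent doubles is moved as a unit. */
void rearrange_ref(const double in[4][8], double out[4][8])
{
    for (size_t k = 0; k < 4; ++k)          /* output vector */
        for (size_t r = 0; r < 4; ++r)      /* source vector */
            for (size_t j = 0; j < 2; ++j)  /* position inside the pair */
                out[k][2 * r + j] = in[r][2 * k + j];
}
```

Viewed this way, the operation is a 4x4 transpose whose elements are 128-bit pairs of doubles, which is handy as a correctness check for any intrinsic version.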

Hi Bharat N.,
You may try something like the code below.
My suggestion is to try ZGEMM from Intel MKL for MIC.
Thanks, Evgueni.

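```
// Passing the same register twice with a count of 8 rotates it by eight
// 32-bit elements, i.e. swaps its two 256-bit halves. The write mask then
// picks which half of the destination is overwritten (0x00ff = low, 0xff00 = high).
__m512d y2 = (__m512d)_mm512_mask_alignr_epi32(a2, 0x00ff, a0, a0, 8);
__m512d y0 = (__m512d)_mm512_mask_alignr_epi32(a0, 0xff00, a2, a2, 8);
__m512d y3 = (__m512d)_mm512_mask_alignr_epi32(a3, 0x00ff, a1, a1, 8);
__m512d y1 = (__m512d)_mm512_mask_alignr_epi32(a1, 0xff00, a3, a3, 8);
```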

Hi Evgueni,

I can't use Intel's MKL since I'm trying to code this on my own :)

Regarding the method that you have outlined, two queries:

1. Is _mm512_mask_blend_pd() available? It isn't mentioned in the Intel manual.

2. Won't this be extremely slow, or lose precision, since we are casting to an integer type and then casting back?

While the compiler doc team answers your question re _mm512_mask_blend_pd, you may use the following equivalent sequence.

The type cast from __m512i to __m512d does not convert from int to double. It only tells the compiler how to interpret the 512 bits under the cast.
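The same idea can be illustrated on a single scalar value. This is only an analogy in portable C, assuming IEEE 754 doubles; it is not KNC code, and `bits_of` and `double_of` are hypothetical helper names. Reinterpreting the bits leaves them untouched, whereas a value conversion such as `(uint64_t)x` would truncate.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret the 64 bits of a double as an integer: the bits are copied
   unchanged, so nothing is rounded or lost. */
uint64_t bits_of(double x)
{
    uint64_t u;
    memcpy(&u, &x, sizeof u);
    return u;
}

/* Reinterpret 64 integer bits as a double: again a pure bit copy. */
double double_of(uint64_t u)
{
    double x;
    memcpy(&x, &u, sizeof x);
    return x;
}
```

A round trip through `bits_of` and `double_of` returns exactly the original double, which is the scalar counterpart of casting `__m512d` to `__m512i` and back around an integer-typed intrinsic.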

```
__m512d a1 = _mm512_mask_mov_pd(w1, 0x33, _mm512_swizzle_pd(w0, _MM_SWIZ_REG_BADC));
__m512d y2 = (__m512d)_mm512_mask_alignr_epi32(a2, 0x00ff, a0, a0, 8);
__m512d y0 = (__m512d)_mm512_mask_alignr_epi32(a0, 0xff00, a2, a2, 8);
__m512d y3 = (__m512d)_mm512_mask_alignr_epi32(a3, 0x00ff, a1, a1, 8);
__m512d y1 = (__m512d)_mm512_mask_alignr_epi32(a1, 0xff00, a3, a3, 8);
```
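To check the data movement on an ordinary host, the sequence can be emulated with plain arrays. This is a sketch under stated assumptions, not the KNC code itself: `vec8`, `swizzle_badc`, `mask_mov`, `mask_rot_halves` and `transpose4x4_pairs` are hypothetical names, `w[0..3]` is taken to be O1..O4 from the question, and only the `a1` line comes from the thread. The `a0`, `a2` and `a3` lines (including the `0xCC` mask) are filled in by symmetry and should be treated as a guess.

```c
#include <stddef.h>

typedef struct { double e[8]; } vec8;   /* e[0] is the lowest lane */

/* Emulates _mm512_swizzle_pd(v, _MM_SWIZ_REG_BADC): within each group of
   four doubles, swap the low pair with the high pair. */
static vec8 swizzle_badc(vec8 v)
{
    vec8 r;
    for (size_t i = 0; i < 8; ++i)
        r.e[i] = v.e[i ^ 2];
    return r;
}

/* Emulates _mm512_mask_mov_pd(old, k, v): lane i takes v where bit i is set. */
static vec8 mask_mov(vec8 old, unsigned k, vec8 v)
{
    for (size_t i = 0; i < 8; ++i)
        if (k & (1u << i))
            old.e[i] = v.e[i];
    return old;
}

/* Emulates _mm512_mask_alignr_epi32(old, k16, v, v, 8). Rotating by eight
   32-bit elements swaps the two 256-bit halves. k16 is the 16-bit dword
   mask; this sketch only tests the even bit of each double, which is
   enough for the masks 0x00ff and 0xff00 used here. */
static vec8 mask_rot_halves(vec8 old, unsigned k16, vec8 v)
{
    for (size_t i = 0; i < 8; ++i)
        if (k16 & (1u << (2 * i)))
            old.e[i] = v.e[(i + 4) % 8];
    return old;
}

/* w[0..3] are O1..O4; y[0..3] receive O1_new..O4_new. */
void transpose4x4_pairs(const vec8 w[4], vec8 y[4])
{
    /* Step 1: interleave pairs inside each 256-bit half.
       Only a1 appears in the thread; a0, a2, a3 are assumed by symmetry. */
    vec8 a0 = mask_mov(w[0], 0xCC, swizzle_badc(w[1]));
    vec8 a1 = mask_mov(w[1], 0x33, swizzle_badc(w[0]));
    vec8 a2 = mask_mov(w[2], 0xCC, swizzle_badc(w[3]));
    vec8 a3 = mask_mov(w[3], 0x33, swizzle_badc(w[2]));

    /* Step 2: exchange 256-bit halves, mirroring the alignr lines. */
    y[2] = mask_rot_halves(a2, 0x00ff, a0);
    y[0] = mask_rot_halves(a0, 0xff00, a2);
    y[3] = mask_rot_halves(a3, 0x00ff, a1);
    y[1] = mask_rot_halves(a1, 0xff00, a3);
}
```

With these assumptions the emulation gives `y[0]` = |d2|d1|c2|c1|b2|b1|a2|a1| and so on, matching the layout requested in the question: eight operations in total for the four vectors.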

The _mm512_mask_blend_epi32, _mm512_mask_blend_epi64, _mm512_mask_blend_ps, _mm512_mask_blend_pd (present in zmmintrin.h) are missing from the C++ User Guide. I reported this to the Documentation team under the internal tracking id noted below.

(Internal tracking id: DPD200242070)

(Resolution Update on 11/27/2013): This defect is fixed in the Intel C++ Composer XE 2013 SP1 Initial Release (2013.1.0.080 - Linux)

Hi Evgueni,

Thanks for the help! It works correctly; I now have to see how it affects performance.

Hi Frank,

Thanks for looking into this. It is odd that more intrinsics are available for integers than for doubles. Are there no permute or shuffle intrinsics for doubles that work on two vectors? Or is it OK to use the corresponding integer ones, like _mm512_shuffle_epi32/_mm512_permute4f128_epi32?

Thanks,

Bharat.

In regard to your comment/questions (to Frank), our Development team replied:

On KNC, only 32-bit versions of the shuffle and permute are available, with the semantics defined by KNC EAS. The shuffle/permute instructions on KNC are non-typed, and it does not have shuffle/permute instructions operating on 64-bit elements. Here is the list of shuffle/permute intrinsics for KNC:

extern __m512i __ICL_INTRINCC _mm512_shuffle_epi32(__m512i, _MM_PERM_ENUM);

extern __m512i __ICL_INTRINCC _mm512_permutevar_epi32(__m512i, __m512i);

extern __m512i __ICL_INTRINCC _mm512_permute4f128_epi32(__m512i, _MM_PERM_ENUM);

extern __m512 __ICL_INTRINCC _mm512_permute4f128_ps(__m512, _MM_PERM_ENUM);
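Because these operations are non-typed, a permute at 128-bit granularity moves pairs of doubles intact even though it is declared on 32-bit elements. The host-side sketch below illustrates that point only. `permute_lanes128` is a hypothetical helper that takes explicit lane selectors rather than a `_MM_PERM_ENUM` value, so it does not reproduce the exact encoding of `_mm512_permute4f128_epi32`.

```c
#include <string.h>

/* Hypothetical emulation of a 128-bit-lane permute on untyped 512-bit data:
   destination lane i receives source lane sel[i], each sel[i] in 0..3.
   The data is moved as raw bytes, so the element type does not matter. */
void permute_lanes128(const void *src, void *dst, const int sel[4])
{
    unsigned char tmp[64];
    for (int i = 0; i < 4; ++i)
        memcpy(tmp + 16 * i, (const unsigned char *)src + 16 * sel[i], 16);
    memcpy(dst, tmp, sizeof tmp);
}
```

Applied to an array of eight doubles with selectors {1, 0, 3, 2}, this swaps neighbouring pairs of doubles, and every value comes out bit-for-bit unchanged.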