How to use ‘__svml_sincosf16’ and ‘__svml_sincosf16_mask’ from user space


I noticed that there is no user-level intrinsic ‘_mm512_sincos_ps’ or ‘_mm512_mask_sincos_ps’ defined in zmmintrin.h.

However, I have just found out that the Intel compiler emits ‘__svml_sincosf16’ and ‘__svml_sincosf16_mask’ when it autovectorises code and finds ‘cos’ and ‘sin’ operations on the same value.

I have been doing some tests, and if I declare my own ‘_mm512_sincos_ps’ at user-space level, the Intel compiler recognises it and translates it into the appropriate ‘__svml_sincosf16’ call. However, the result of my code is incorrect, maybe because the parameters of my function are not the expected ones.

Could anyone please tell me why ‘_mm512_sincos_ps’ has not been defined in zmmintrin.h, and what the expected parameters are so that I can declare it appropriately?

 Thank you in advance.

Best regards.

(Using icc 14.0.1)

Barcelona Supercomputing Center

Guys,

I would simply like to note that if you have issues with Intel intrinsic functions, it is better to post a description of the problem on the ISA forum. We're on the Intel C++ compiler forum at the moment.

However, the question is very interesting and I'll take a look.

Thanks in advance for posting in the right IDZ forums.

>>...I have been doing some tests and if I define my own ‘_mm512_sincos_ps’ at user-space level, Intel compiler recognises it and
>>translates it into the appropriate ‘__svml_sincosf16’. However, the result of my code is incorrect, maybe because the parameters of
>>my function are not the expected...

Could you post a complete test case in order to reproduce the problem?

Please also upload zmmintrin.h for verification ( I could have a different release / update... ).

>>...it is better to post a description of the problem / issue on ISA forum...

Here is the web-link: http://software.intel.com/en-us/forums/intel-isa-extensions

Why wouldn't you combine _mm512_sin_ps and _mm512_cos_ps intrinsic functions in a macro or in a naked function like _mm512_sincos_ps ( inputs are two arguments )?

The same applies to the 2nd function you need ( inputs are six arguments ).

Another solution is to implement what you need using inline assembler.

Hi Diego

If you plan to use inline assembly to implement your own trigonometric functions, I can share my code with you. My implementation uses SSE, but you can rewrite it to use AVX. I did not implement argument reduction.

Attachment: vecttrigfunctions.cpp (22.32 KB)

Thank you for your reply!

I agree. If someone could move the post to the ISA forum that would be great. I wouldn't like to replicate the post.

I need to generate something similar to what the Intel compiler does, so I should use the '__svml_sincosf16' function. Using sin and cos separately would have a significant impact on performance.

Let me show you this example:

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#pragma omp declare simd
float __attribute__((noinline)) sin_cos(float a, float *cos)
{
    float sin;
    sin = sinf(a);
    *cos = cosf(a);
    return sin;
}

int main(int argc, char *argv[])
{
    float cos[16];
    float sin[16];
    float input = atof(argv[1]);   /* first command-line argument */
    int i;

#pragma omp simd
    for (i = 0; i < 16; i++)
    {
        sin[i] = sin_cos(input + i, &cos[i]);
    }
    for (i = 0; i < 16; i++)
    {
        printf("%d: sin=%f, cos=%f\n", i, sin[i], cos[i]);
    }
    return 0;
}

I compile it with: icc -fopenmp -O3 sincos.c -S -mmic

(note that I use MIC, not AVX2).

In the generated code, the main function calls the following function:

# -- Begin  _ZGVMN16vv_sin_cos.U
# mark_begin;
# Threads 4
        .align    16,0x90
        .globl _ZGVMN16vv_sin_cos.U
_ZGVMN16vv_sin_cos.U:
# parameter 1: %zmm0
# parameter 2: %zmm1
# parameter 3: %zmm2
..B2.1:                         # Preds ..B2.0 Latency 17
        pushq     %rbp                                          #5.1
        movq      %rsp, %rbp                                    #5.1
        andq      $-64, %rsp                                    #5.1
        subq      $320, %rsp                                    #5.1 c1
        vmovaps   %zmm23, 64(%rsp)                              #5.1 c5
        vmovaps   %zmm1, %zmm23                                 #5.1 c9
        vmovaps   %zmm20, 128(%rsp)                             #5.1 c9
        vmovaps   %zmm2, %zmm20                                 #5.1 c13
        call      __svml_sincosf16                              #8.11 c17
..B2.10:                        # Preds ..B2.1 Latency 29
        vmovaps   %zmm1, %zmm3                                  #8.11 c1
        movl      $255, %eax                                    #9.6 c1
        kmov      %eax, %k1                                     #9.6 c5
        movl      $43690, %eax                                  #9.6 c5
        kmov      %eax, %k2                                     #9.6 c9
        movl      $21845, %eax                                  #9.6 c9
        kmov      %k5, %ecx     ...

I just wanted to know how I could call this '__svml_sincosf16' function from user space.

As I said, if you declare a function "_mm512_sincos_ps", the Intel compiler translates it to this "__svml_sincosf16". But the generated code is not correct:

#include <immintrin.h>
#include <stdio.h>
#include <stdlib.h>

/* User-side declaration; zmmintrin.h does not provide it. */
extern __m512 __ICL_INTRINCC _mm512_sincos_ps(__m512 *, __m512);

__m512 __attribute__((noinline)) sin_cos(__m512 a, __m512 *cos)
{
    __m512 sin;
    sin = _mm512_sincos_ps(cos, a);
    return sin;
}

int main(int argc, char *argv[])
{
    float __attribute__((aligned(64))) cos[16];
    float __attribute__((aligned(64))) sin[16];
    __m512 *vsin = (__m512 *) &sin;
    __m512 *vcos = (__m512 *) &cos;
    __m512 input = _mm512_add_ps(_mm512_set1_ps(atof(argv[1])), _mm512_set_ps(15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0));
    _mm512_store_ps(vsin, sin_cos(input, vcos));
    int i;
    for (i = 0; i < 16; i++)
    {
        printf("%d: sin=%f, cos=%f\n", i, sin[i], cos[i]);
    }
    return 0;
}

sin_cos:
# parameter 1: %zmm0
# parameter 2: %rdi
..B2.1:                         # Preds ..B2.0 Latency 1
..___tag_value_sin_cos.19:                                      #6.1
        jmp       __svml_sincosf16                              #9.11 c1
        .align    16,0x90
..___tag_value_sin_cos.21:                                      #
                                # LOE
# mark_end;

Any idea?


>>...As I said, if you declare a function "_mm512_sincos_ps", Intel Compiler translates it to this "__svml_sincosf16".
>>But the generated code is not correct...

If this is a bug, or an undocumented feature of the Intel C++ compiler, then I wouldn't expect a quick fix ( it could actually take many weeks if not months ), and you should go ahead with a workaround based on a call to the __svml_sincosf16 function.

Another workaround could be based on the MKL vectorized functions; take a look at mkl_vml_functions.h:
...
/* Sine & cosine: r1[i] = sin(a[i]), r2[i]=cos(a[i]) */
_MKL_API( void,VSSINCOS,(const MKL_INT *n, const float a[], float r1[], float r2[]) )
_MKL_API( void,VDSINCOS,(const MKL_INT *n, const double a[], double r1[], double r2[]) )
_mkl_api( void,vssincos,(const MKL_INT *n, const float a[], float r1[], float r2[]) )
_mkl_api( void,vdsincos,(const MKL_INT *n, const double a[], double r1[], double r2[]) )
_Mkl_Api( void,vsSinCos,(const MKL_INT n, const float a[], float r1[], float r2[]) )
_Mkl_Api( void,vdSinCos,(const MKL_INT n, const double a[], double r1[], double r2[]) )

_MKL_API( void,VMSSINCOS,(const MKL_INT *n, const float a[], float r1[], float r2[], MKL_INT64 *mode) )
_MKL_API( void,VMDSINCOS,(const MKL_INT *n, const double a[], double r1[], double r2[], MKL_INT64 *mode) )
_mkl_api( void,vmssincos,(const MKL_INT *n, const float a[], float r1[], float r2[], MKL_INT64 *mode) )
_mkl_api( void,vmdsincos,(const MKL_INT *n, const double a[], double r1[], double r2[], MKL_INT64 *mode) )
_Mkl_Api (void,vmsSinCos,(const MKL_INT n, const float a[], float r1[], float r2[], MKL_INT64 mode) )
_Mkl_Api (void,vmdSinCos,(const MKL_INT n, const double a[], double r1[], double r2[], MKL_INT64 mode) )
...

>>...As I said, if you declare a function "_mm512_sincos_ps", Intel Compiler translates it to this "__svml_sincosf16".
>>But the generated code is not correct...

Does it crash the test application?

It does not crash, but the cosine is not computed.


I'll try to investigate this week...

I see that immintrin.h has two _mm256_sincos_xx intrinsic functions:
...
extern __m256 __ICL_INTRINCC _mm256_sincos_ps(__m256 *, __m256);
extern __m256d __ICL_INTRINCC _mm256_sincos_pd(__m256d *, __m256d);
...
and I can't explain so far why zmmintrin.h does not have 512-bit versions of these intrinsic functions.

Any comments from Intel Software Engineers?

I would also try to use these IPP functions as a workaround:
...
IPPAPI( IppStatus, ippsSinCos_32f_A11, (const Ipp32f a[],Ipp32f r1[],Ipp32f r2[],Ipp32s n))
IPPAPI( IppStatus, ippsSinCos_32f_A21, (const Ipp32f a[],Ipp32f r1[],Ipp32f r2[],Ipp32s n))
IPPAPI( IppStatus, ippsSinCos_32f_A24, (const Ipp32f a[],Ipp32f r1[],Ipp32f r2[],Ipp32s n))
IPPAPI( IppStatus, ippsSinCos_64f_A26, (const Ipp64f a[],Ipp64f r1[],Ipp64f r2[],Ipp32s n))
IPPAPI( IppStatus, ippsSinCos_64f_A50, (const Ipp64f a[],Ipp64f r1[],Ipp64f r2[],Ipp32s n))
IPPAPI( IppStatus, ippsSinCos_64f_A53, (const Ipp64f a[],Ipp64f r1[],Ipp64f r2[],Ipp32s n))
...
because I have no answer so far.

The user could perhaps call the functions __svml_sincosf16 and __svml_sincosf16_mask directly.
But this will be difficult to do from a program written in C, since these functions have a special interface.
They return their results in two registers, which is very difficult (or impossible) to express in a C program.
The synopsis of these functions is something like the following:
    (sin_res, cos_res) = __svml_sincosf16(source)
    (sin_res, cos_res) = __svml_sincosf16_mask(sin_dest, cos_dest, mask, source)

Maybe you should try using assembly code in your C program, calling the function the same way the compiler does, in order to make the generated code correct.

 

Best, Qiao
