# loop is not vectorised when it contains call to fmod() (Windows)

Greetings,

I wonder why fmod() is not vectorised (i.e. treated as an intrinsic function) when I use the multithreaded debug DLL as the CRT. The assembler output contains a call to the CRT's fmod().

c:\path\to\echo2.hpp(159,32): message : loop was not vectorized: statement cannot be vectorized
c:\path\to\echo2.hpp(159,32): message : vectorization support: call to function fmod cannot be vectorized

When I use the multithreaded static CRT, the assembler output contains a few floating-point instructions in place of the fmod() call, and the whole loop is vectorised.

I would be happy if fmod() were treated as an intrinsic function and vectorised (at least in release builds). The same goes for the other C/C++ floating-point functions, of course.
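
To make the case concrete, here is a minimal sketch of the kind of loop involved. The function name `wrap_phase`, the container and the data are assumptions for illustration only; the actual loop at echo2.hpp(159) is not shown in this thread.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the loop in echo2.hpp: reduce every sample
// modulo `period` (for non-negative input this wraps it into [0, period)).
void wrap_phase(std::vector<double>& phase, double period)
{
    for (std::size_t i = 0; i < phase.size(); ++i) {
        // In this thread, a loop of this kind produced "call fmod" and the
        // "call to function fmod cannot be vectorized" message with the DLL CRT.
        phase[i] = std::fmod(phase[i], period);
    }
}
```

Calling `wrap_phase(v, 2.0)` on a vector holding 0.5, 2.5 and 7.25 leaves 0.5, 0.5 and 1.25 in place. Whether a given compiler vectorises this exact sketch is not something verified here.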

--
With best regards,
VooDooMan
-
If you find my post helpful, please rate it and/or select it as a best answer where applies. Thank you.

I suppose fmod() may be expected to prefer accuracy over speed in debug mode.

Quote:

Tim Prince wrote:

I suppose fmod() may be expected to prefer accuracy over speed in debug mode.

I use IPO, HLO and the like even in debug mode, because my application generates sound in real time. Release-mode execution is of course faster, but the release build takes too long. Once I am sure in debug mode that there is no bug, I switch to the release build.

In either case, I am using the "precise" floating-point model.

I'm not sure I got your point, but I expect both accuracy and speed, even in debug mode: without optimizations, the application cannot even finish starting up (I mean its internal initialization) within about 5 minutes.

--
With best regards,
VooDooMan

Maybe I was not clear enough. This issue is present even in release builds.

When I use the DLL version of the CRT, there is a call to the CRT's fmod(). When I use the STATIC version of the CRT, there is no such call; fmod() is implemented inline in a few assembler instructions.

I'm afraid the CALL instruction adds overhead compared with the inline implementation and, what is worse, it prevents vectorisation of the whole loop.

--
With best regards,
VooDooMan

I checked Agner's instruction tables and unfortunately found no information about the latency of the CALL instruction in CPU cycles.

The Intel Optimization Manual states that the latency of the CALL instruction is 5 cycles.

The CALL overhead itself is fine. The point is that a loop containing such function calls is not vectorised.

--
With best regards,
VooDooMan

@Intel

This is a feature request: when e.g. "#pragma intrinsic(fmod)" is present, emit inline code for floating-point library functions instead of calls into Microsoft's DLL CRT. This is related to https://software.intel.com/en-us/forums/topic/405440 , except that ICC should provide its own instructions in the DLL CRT case (instead of CALL instructions) when it sees the intrinsic pragma for the function. This already happens automatically with the "STATIC" CRT, but I'd like to see it with the DLL CRT as well, because static linking is a very time-consuming operation.

With the inline code in place, I guess the whole loop containing it would then be vectorised automatically (unlike when CALL instructions into the CRT DLL are present).

--
With best regards,
VooDooMan

NB: the point is that with the "DLL CRT", ICC does not vectorise loops that call functions such as "fmod()", whereas with the "static CRT" it does. I consider this a performance problem, because the issue is present in release profiles as well.

--
With best regards,
VooDooMan

Are you using the 14.0 or the 15.0 compiler? I'm trying to reproduce the issue with a simple test case, but so far I couldn't.

It's great that you've found a workaround with "#pragma intrinsic(fmod)". Could you send a test case?

Thanks again,

Jennifer

Please disregard my post, since I have found that there is still a "CALL fmod". I must have been blind.

But I still encourage you to implement an inlined intrinsic form without a CALL to the slow (multiversioned) MSVC library:

#pragma intrinsic(any_FP_function)

where "any_FP_function" would be any floating-point function, emitted as inline instructions instead of a CALL to the (slow, multiversioned) MSVC library. I would like to see inline instructions in the output instead of "CALL fmod". It might be faster... This is a feature request.

"#pragma intrinsic(fn)" is not supported by ICC, but it is supported by MSVC.

This is related to the feature request at https://software.intel.com/en-us/forums/topic/405440
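
For reference, a sketch of how the pragma is written for MSVC. The helper name `cycle_position` is hypothetical, and the preprocessor guard is my assumption about how to keep the same source building with compilers that do not honour the pragma (such as ICC, per this thread). Even with MSVC, the pragma is only a request, so the generated code may still contain a call.

```cpp
#include <math.h>

// MSVC-only request to use the intrinsic form of fmod instead of a plain
// CRT call. Guarded out for ICC, which (per this thread) does not support it.
#if defined(_MSC_VER) && !defined(__INTEL_COMPILER)
#pragma intrinsic(fmod)
#endif

// Hypothetical helper: position inside one cycle of length `period`.
double cycle_position(double t, double period)
{
    return fmod(t, period);
}
```

With other compilers the guard simply removes the pragma, so the source stays portable.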

--
With best regards,
VooDooMan

Quote:

jimdempseyatthecove wrote:

Maybe you can use this:

https://software.intel.com/sites/products/documentation/doclib/iss/2013/compiler/cpp-lin/GUID-DAFA16CE-DB78-4FFD-9C1E-4AE0EA96CEA7.htm

Thank you Jim, but I'd like the code to be more portable, even though most compilers have a header file for these intrinsics.

--
With best regards,
VooDooMan

Yes, we have a feature request (DPD200042138) for "#pragma intrinsic". I will associate this thread with the existing FR.

But "icl" should use the intrinsic when possible. It is strange that the intrinsic fmod is not used. Maybe try adding "/Qfast-transcendentals" to the compilation.

Jennifer

Quote:

Jennifer J. (Intel) wrote:

But "icl" should use the intrinsic when possible. It is strange that the intrinsic fmod is not used. Maybe try adding "/Qfast-transcendentals" to the compilation.

Jennifer

Yes, this command-line argument helps a lot. But I still have "call fmod" instructions in the assembly output. Worse, I am unable to create a reproducer for this issue.

--
With best regards,
VooDooMan

Quote:

Jennifer J. (Intel) wrote:

But "icl" should use the intrinsic when possible. It is strange that the intrinsic fmod is not used. Maybe try adding "/Qfast-transcendentals" to the compilation.

Jennifer

Great news! I was able to make a reproducer. Please see the attachment "fmod.7z", select the "Release|x64" profile and observe the messages:

1>c:\Users\vdmn\Documents\develop\Recorder7.1\tmp\fmod\fmod\fmod.cpp(56,2): message : loop was not vectorized: unsupported loop structure
1>c:\Users\vdmn\Documents\develop\Recorder7.1\tmp\fmod\fmod\fmod.cpp(196,9): message : vectorization support: call to function fmod cannot be vectorized
1>c:\Users\vdmn\Documents\develop\Recorder7.1\tmp\fmod\fmod\fmod.cpp(196,9): message : loop was not vectorized: type conversion prohibits vectorization

My concern is the "vectorization support: call to function fmod cannot be vectorized" message above, which corresponds to this in the .asm output:

;c:\Users\vdmn\Documents\develop\Recorder7.1\tmp\fmod\fmod\fmod.cpp:196.9
$LN246:
0007a e8 fc ff ff ff   call fmod

This should be computed inline as an intrinsic instead of through a call to the MSVC CRT library, which could be slow compared with code generated for my CPU via /QxHOST. I am on an i7 Haswell with ICC 14.0.

I believe an intrinsic implementation would be much faster than the MSVC CRT library call.

## Attachments:

Attachment size: 5.6 MB
--
With best regards,
VooDooMan

Quote:

Jennifer J. (Intel) wrote:

But "icl" should use the intrinsic when possible. It is strange that the intrinsic fmod is not used. Maybe try adding "/Qfast-transcendentals" to the compilation.

Jennifer

Adding this didn't help. The *.asm dumps read the same. Moreover, I read that this option can cause the loss of the last few ULPs in floating-point calculations, which is not desirable in my case.

--
With best regards,
VooDooMan

Just a side note: there are more transcendental functions (not only fmod) that are not computed inline as if they were intrinsics; there are calls to the Windows CRT instead. I'd like to exploit AVX2 (and earlier instruction sets) to compute them, or even vectorise them if possible. I'd also like to get rid of the "CALL" instruction, which could flush the instruction cache and which computes the result slowly compared with what my CPU is capable of, since the CRT is "universal".

--
With best regards,
VooDooMan

This thread has become rather confusing.  The built-in partial support for math functions (other than sqrt) is for x87 long double non-vector.  I doubt there is any feasible way or incentive to make simd math functions in-line.  Even scalar simd math functions from external library are likely to run faster than x87 intrinsics.

/Qfast-transcendentals is usually associated with vectorization using Intel svml (short vector) library.  It wouldn't be expected to produce in-lining, but it should produce significant performance gains for any reasonable loop length (even shorter or longer loops than are optimum for in-line vector code).   A reason for making it optional is the possibility that svml is less accurate (possibly up to 4 ULP error).  fast-transcendentals is set off by options like /fp:source but then can be re-enabled.

As Intel wrote off high performance x87 math intrinsics with the introduction of SSE2, which now has become the default architecture for Intel compilers even in ia32 mode, the possibility of optimizing x87 math functions seems difficult to support.

If there are specific math functions where currently ICL links to Microsoft math library (if that is what was meant above) it may take some real evidence that improvements are possible, if for example there is an opportunity to augment svml.
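
For comparison purposes only, here is a sketch of a call-free formulation that could be benchmarked against the CRT fmod() in such a loop. It is a swapped-in technique (x - trunc(x / y) * y), not what icl, the CRT or svml do, and the names `fmod_approx` and `wrap_all` are hypothetical. Unlike fmod() it is not exact: rounding in x / y can give a wrong result when |x / y| is large or when the quotient lands very close to an integer, so it would only suit cases such as wrapping a phase that stays within a few periods.

```cpp
#include <cmath>
#include <cstddef>

// Approximation of fmod(x, y) without a library call to fmod.
// NOT bit-exact with fmod; the error grows with |x / y|.
inline double fmod_approx(double x, double y)
{
    return x - std::trunc(x / y) * y;
}

// Hypothetical loop applying the approximation to a whole buffer.
void wrap_all(double* data, std::size_t n, double period)
{
    for (std::size_t i = 0; i < n; ++i) {
        data[i] = fmod_approx(data[i], period);
    }
}
```

Whether a particular compiler vectorises this loop depends on the target and options and is not verified here; the vectorization report would have to confirm it.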

Quote:

Tim Prince wrote:

The built-in partial support for math functions (other than sqrt) is for x87 long double non-vector.  I doubt there is any feasible way or incentive to make simd math functions in-line.

Yes, long double is a problem. But what about double? That was my question.

Quote:

Tim Prince wrote:

Even scalar simd math functions from external library are likely to run faster than x87 intrinsics.

Really? Even with /QxHOST, when ICC knows my CPU's (Haswell, x64 target) "metrics"? I don't deploy my application; my build is bound to my own machine (though the code is written in a portable way).

Quote:

Tim Prince wrote:

/Qfast-transcendentals is usually associated with vectorization using Intel svml (short vector) library.  It wouldn't be expected to produce in-lining, but it should produce significant performance gains for any reasonable loop length (even shorter or longer loops than are optimum for in-line vector code).   A reason for making it optional is the possibility that svml is less accurate (possibly up to 4 ULP error).  fast-transcendentals is set off by options like /fp:source but then can be re-enabled.

Jennifer from Intel recommended the above... but I finally dropped it from my command line.

Thanks!

--
With best regards,
VooDooMan

Quote:

Tim Prince wrote:

As Intel wrote off high performance x87 math intrinsics with the introduction of SSE2, which now has become the default architecture for Intel compilers even in ia32 mode, the possibility of optimizing x87 math functions seems difficult to support.

I am not speaking of the ia32 mode you mentioned, but of intel64 mode (x64 on the Haswell architecture).

Or did you mean by ia32 all Intel-based architectures, even x64?

Quote:

Tim Prince wrote:

If there are specific math functions where currently ICL links to Microsoft math library (if that is what was meant above) it may take some real evidence that improvements are possible, if for example there is an opportunity to augment svml.

I will try to play with SVML. Can you point me to better resources and tutorials than Google would give? Or is it better to just Google?

TIA!

--
With best regards,
VooDooMan

Quote:

Marián "VooDooMan" Meravý wrote:

Quote:

Tim Prince wrote:

As Intel wrote off high performance x87 math intrinsics with the introduction of SSE2, which now has become the default architecture for Intel compilers even in ia32 mode, the possibility of optimizing x87 math functions seems difficult to support.

I am not speaking of the ia32 mode you mentioned, but of intel64 mode (x64 on the Haswell architecture).

Or did you mean by ia32 all Intel-based architectures, even x64?

Quote:

Tim Prince wrote:

If there are specific math functions where currently ICL links to Microsoft math library (if that is what was meant above) it may take some real evidence that improvements are possible, if for example there is an opportunity to augment svml.

I will try to play with SVML. Can you point me to better resources and tutorials than Google would give? Or is it better to just Google?

TIA!

Yes, I meant ia32 in the sense used by the Intel 32-bit "ia32" compiler, where /arch:IA32 or -mia32 (no longer a default) selects x87 code compatible with CPUs prior to P4.  The Intel64 compilers had an undocumented feature where it was possible to set either debug or -mp/Op options and get x87 code, but that has been dropped, while gnu compilers use x87 code for x86_64 only for long double data types.  With the increasing efficiency of simd support and lack of register level compatibility between SSE or AVX and x87, it's usually difficult to make effective use of mixed simd double and x87 long double.

I see that the SVML abbreviation has been used in other ways than in the Intel short vector math library, but you should find some interesting references on it by web search.  Intel doesn't document it thoroughly, as the intended use is by automatic invocation by Intel vectorizing compilers, under the fast-transcendentals option.

>>>Yes, I meant ia32 in the sense used by the Intel 32-bit "ia32" compiler, where /arch:IA32 or -mia32 (no longer a default) selects x87 code compatible with CPUs prior to P4.  The Intel64 compilers had an undocumented feature where it was possible to set either debug or -mp/Op options and get x87 code, but that has been dropped, while gnu compilers use x87 code for x86_64 only for long double data types.  With the increasing efficiency of simd support and lack of register level compatibility between SSE or AVX and x87, it's usually difficult to make effective use of mixed simd double and x87 long double.>>>

Thanks for providing this information. I tried to figure out which circumstances will force ICC to emit x87 code.

>>>With the increasing efficiency of simd support and lack of register level compatibility between SSE or AVX and x87, it's usually difficult to make effective use of mixed simd double and x87 long double.>>>

At the hardware level they are handled by two different execution stacks, FP SIMD and x87, so unless there is hardware support in the form of a specific instruction to convert long double to double, I do not know how it could be implemented.
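
As a side check, a small sketch for finding out whether long double is actually wider than double in a given build. The names `double_bits`, `long_double_bits` and `long_double_is_wider` are hypothetical. With MSVC, long double has the same representation as double, so the answer there is "no"; with gnu compilers on x86_64 it is normally the x87 extended format.

```cpp
#include <limits>

// Mantissa width of each type as seen by the current compiler.
// 53 bits is IEEE double, 64 bits is the x87 extended format,
// 113 bits would be IEEE quad precision.
constexpr int double_bits = std::numeric_limits<double>::digits;
constexpr int long_double_bits = std::numeric_limits<long double>::digits;

// True when long double carries more precision than double, i.e. when
// long double arithmetic cannot simply be done in double SIMD registers.
constexpr bool long_double_is_wider()
{
    return long_double_bits > double_bits;
}
```

If `long_double_is_wider()` is false for the build in question, the mixed x87/SIMD concern does not arise, because long double code is then plain double code.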

>>>Yes, long double is problem. But what about double? That was my question>>>

Do you really need extended precision of long double in your application?

@iliyapolak

I know that just because I can doesn't mean I should, but I simply want to.

Removing the 128-bit float would mean deleting parts of the code (in many cpp units), because the user can choose at startup whether to use 32-, 64- or 128-bit floats.

So why not?

--
With best regards,
VooDooMan