Dear Intel Community

How is standards , range , accuracy trigonometric functions in Intel C++ Compiler ? How is Your advise ?

For more complete information about compiler optimizations, see our Optimization Notice.

Your question seems to cover a variety of topics, many of which probably don't interest you, if you haven't taken the time to read a few bits of references such as

https://software.intel.com/sites/products/documentation/studio/composer/...

https://software.intel.com/en-us/articles/consistency-of-floating-point-...

-fast-transcendentals option involves accepting fewer bits of accuracy in the interest of accepting vectorized math functions. In some corner cases it may increase roundoff error from 0.5 ULP with that option off to nearly 4 ULP with the option set.

Maybe if you would check some of these references about icc you could ask more specific questions.

**Quote:**

iliyapolakwrote:

Those implement long double precision (64-bit precision). They have the somewhat peculiar feature of working only for |x| < (1<<64). Within that range, range reduction doesn't use any extra precision beyond what is provided by the accident of Pi being correct to 66 bits due to the 65th and 66th bits being zero.

Although icc brings its own math libraries for float and double data types, long double relies on glibc provided with gcc on linux and is not fully covered on Windows. You could check this by examining the object files built by the compiler, using nm on linux or dumpbin on Windows (where you would see long double implemented as double). So you could look up references on glibc to learn about the linux long double math functions.

I was under assumption that x87 trigo instruction implementation uses 80-bit precision PI value.

I have found this thread very informative.

Of course precision of long double is 64-bit.

Thanks for correcting me.

Regarding my previous post I wanted to add that if x87 fsin,fcos implementation is based on Taylor approximation with the Horner scheme then also pre-computed coefficients should have 64-bit precision. Intel vectorized implementation if it is based on Horner scheme then coefficients probably have precision of 53-bit. It could be interested to compare the results from both of implementation when the argument is close to PI/2 and -PI/2.

I still think that x87 trigo which is implemented as a micro-code assists probably have hardcoded value of PI which cannot be set by OS to 53-bit. I could be wrong on my assumption.

I do want to discuss the subject of Chebyshev economized polynomials and the like, and the methods for organizing polynomial evaluation faster but as accurate as Horner (published millenia before Horner re-popularized the method), in upcoming blogs. It's not as important a subject as it might otherwise be due to the possibility of gaining better performance yet by tabular interpolation, which of course is 53-bit double for vectorizability.

You may remember the notorious "Pentium fdiv bug" which showed that even x87 division involved some kind of tabular lookup as long as 2 decades ago, with such a big table that ordinary test suites might not expose errors.

As Iliya said, the x87 precision mode doesn't affect the built-in x87 math functions (other than sqrt) which use long double. But Intel gave up on making those performance competitive, beginning with P4.

The x87 doesn't implement any serious protection against accuracy degradation during range reduction such as you would see in Moshier's implementations on netlib.org, so the long double accuracy is retained only over a small interval. There are non-vector math library implementations which should avoid accuracy degradation due to range reduction in sin(). Intel appears to have devoted much effort to finding methods which are sufficiently accurate without costing performance, but you will note that default vectorized math functions are rated at up to 4 ULP error vs. 0.5 ULP for non-vector, and reduced accuracy vector library functions are offered.

>>>Intel appears to have devoted much effort to finding methods which are sufficiently accurate without costing performance, but you will note that default vectorized math functions are rated at up to 4 ULP error vs. 0.5 ULP for non-vector, and reduced accuracy vector library functions are offered>>>

I think that most non hard real-time applications will not benefit from 64-bit precision.

**Quote:**

Rafał B.wrote:how is precision , and is counted together precision Taylor complex trigonometric function and Horner calculated in x87 ?

The quoted title below partly answers your question. Just gogle it this is pdf doc.

"Accurate Floating Point Arithmetic through Hardware Error-Free Transformations"

Consider following scenario where there are two implementation of Taylor approximation: naive and Horner scheme.

The naive implementation simply implements in code Taylor formula and Horner scheme uses pre-calculated coefficient without lengthy div operation. By comparing the results you can expect that Horner scheme will outperform in terms of resulting precision and accuracy the naive implementation.

The culprit of lesser precision can be usage of division operation in naive method.

Btw, I have in MASM assembly both of those implementation and I plan to test for accurracy of the result.

**Quote:**

Rafał B.wrote:x87 is that only method for representation trigonometric functions in Intel Core i5 , Intel Xenon E5 - 2609 0 ?

If you mean machine code instructions then yes. On the other hand Intel has special and transcendental functions vector libraries which have lesser accuracy ,but have greater throughput.

Btw, there are three version of those function where HA (high accurate) has precision < 1.0 ULP.

https://software.intel.com/sites/products/documentation/hpc/mkl/vml/func...

**Quote:**

Rafał B.wrote:

How are translated trigonometric functions of x87 on Intel C + +?

Compiler emits opcodes for x87 built-in functions like fcos , fsin etc....

IIRC for compilation of 64-bit executables containing floating point code compiler will not emit x87 opcodes.

@Rafal B

If you are interested in AMD K5 implementation of transcendental function google "K5 Transcendental Functions"

Complex transcendentals involve combinations such that the 4 ULP claim would be best case applying relative to the nominal 24-bit float or 53-bit double precision for the vector math function library, or close to 0.5 ULP for the non-vector implementations.

The glibc implementation of expl() (presumably used on linux by both icc and gcc) isn't necessarily done carefully enough to quote any specific error estimate relative to the nominal 64-bit precision (if you over-ride any measures your compiler may take to set 53-bit precision). That isn't so difficult to improve upon with your own implementation, but you can expect poor performance in comparison with double vectorized libraries. If you're running under Visual Studio, you don't have satisfactory long double math libraries.

You don't have an option to request x87 based float or double libraries except by running IA32 mode with the 32-bit compilers, where you give up on vectorization. In current Intel Windows C++, -Qlong-double enables use of x87 to support long double even when using SSE2 to support double. According to the doc, __LONG_DOUBLE_SIZE() macro is available to test whether long double is set to 53-bit precision (macro returning 64) or 64-bit (macro returning 80). I don't know whether this should track any change of setting you make by using intrinsics or asm to change it (or does it simply reflect the option used to compile that object). The old option /Qpc80 to set 64-bit precision mode seems to have been deprecated, but I don't see it in the current deprecation list.

MKL offers VSL vector math functions in various accuracies. I haven't had the incentive to use those.

You brought up Horner polynomial as a possible way of implementing math functions, but I'm not getting a sense of your specific interest. Horner-like implementation of economized polynomials (e.g. Chebyshev economization) looks more competitive with likely tabular methods, by using fma implementation on corei7-4, and it offers a little extra precision also by taking advantage of fma. Horner popularized the algorithm in English language literature but it was published long before in other languages. I'm trying to look up my drafts on the subject. It's peripheral to the subject of Intel C++ and I'm having difficulty thinking how much might be topical. This x87 stuff was a diversion from my impression of the original question on the thread.

As we've brought up x87 code and polynomial evaluation (pretty much off forum topic), here's a snippet of old tanh() code (one of the few elementary functions where you want to augment the curve fits buried in the x87 ROM):

__inline_mathcodeNP (tanh, __x, \

if(__fabsl(__x) <= .34657){ \

long double __x2 = __x * __x; \

long double __x4 = __x2*__x2; \

return __x + __x2*__x*( \

-0.3333333333333333333028L \

+__x2*(0.133333333333333321200L \

+__x2*-0.5396825396825207695E-01L \

+__x4*(0.218694885360028124E-01L \

+__x2*-0.88632355226515778E-02 \

+__x4*(0.3592127817609080E-02 \

+__x2*-0.14558300258105E-02) \

+__x4*__x4*(0.5899693119329E-03 \

+__x2*-0.238614526828E-03 \

+__x4*(0.9399418484E-04 \

+__x2*-0.294863013E-04)))));} \

return 1 - 2 / (expl(__x + __x) + 1))

I suppose what someone called naive would be individual calculation of each of the expanded powers of __x, relying on the compiler to extract any commonality, but not specifying an order of evaluation to control roundoff. The coefficients are derived by Chebyshev economization of a polynomial fit for tanh(__x)/__x over the interval |__x| < .34657 ; where the formula 1-2/(exp(2x)-1) would suffer cancellation error.

The full latency of evaluation of the polynomial series is meant to be around 0.2 of what is involved in a plain Horner series. There would be more instances of underflow than in the Horner series but not nearly as many as in the "naive" evaluation.

and here's an example of how to make expl() consistently accurate to 62 bit or better precision by using the internal curve fit and log base 2(e) while limiting roundoff error :

# define __exp_code \

long double __value; \

long double __exponent; \

long double __temp; \

__asm ("fldl2e" : "=t" (__temp) ); \

__x -= .693147178739309310913*(__exponent = rintl(__x * __temp)); \

__x = (__x-1.82063599850414622404E-09*__exponent)*__temp; \

__asm ("f2xm1" : "=t" (__value) : "0" (__x)); \

__asm ("fscale" : "=t" (__value) : "0" (__value+1), "u" (__exponent)); \

return __value

__inline_mathcode_ (long double, __expl, __x, __exp_code)

The 2 double constants (of opposite sign) should add up to a long double accurate value of natural log of 2 and might be replaced by the corresponding same signed pair M_LN2LO and M_LN2HI seen in some (non-standard?) math.h implementations.

Now maybe you begin to appreciate having some math function work done for you.

>>>You brought up Horner polynomial as a possible way of implementing math functions, but I'm not getting a sense of your specific interest. Horner-like implementation of economized polynomials (e.g. Chebyshev economization) looks more competitive with likely tabular methods,>>>

I plan to implement math functions by using also Chebyshev economization. Later I would like to perform various tests of speed and accuracy.

Suitable implementations of Chebyshev economization give a small improvement in accuracy when comparing with a Taylor series, due to improved convergence of the polynomial within the chosen range. As in my posting above, in the case of the leading term linear in x, it's likely to be appropriate to economize based on fitting f(x)/x, using enough terms that the leading coefficient comes out to 1.0 in the data type.

More important for accuracy is arranging the series for accurate order of evaluation. Horner's method accomplishes that inherently for the most part, at least in the case where the magnitude of terms decreases with order (the reduced interval must be chosen to accomplish that). For the case where the leading coefficient is 1.0 or even 2.0, improved accuracy may be obtained by adding the entire leading term separately at the end (more so with fma, where the roundoff after the last multiplication is eliminated). It's convenient when this also is faster.

At the risk of repeating myself, a pure Horner evaluation is uncompetitive for performance, given that current CPUs offer pipeline capability to evaluate several terms in parallel. I could imagine an algorithm to optimize polynomial evaluation far better than compilers do, under Fortran style rules for re-association.

The question of accuracy comparison between Chebyshev economized polynomial and a true minimax polynomial is more difficult but perhaps inconsequential. A minimax polynomial should have slightly less maximum error than a Chebyshev economized one of the same number of terms, but a good Chebyshev polynomial is more accurate over most of the range. Even at the point where the minimax is more accurate if evaluated in extra precision, it may not be so when evaluated in target precision. Also, a satisfactory minimax polynomial requires a more accurate reference implementation of the function, and the extra calculation may be tedious if using a multiple precision library, where ordinary long double may be sufficient to fit a double data type Chebyshev polynomial, and ancient utilities like bc work as well. I translated published Chebyshev economization software to bc myself.

gnu mpc and mpfr have greatly improved the selection of choices since I started playing with this, and the chances of getting a high quality math library by default with your compiler are better as well (with possible exception of simd math libraries, which are a big part of the value proposition for Intel compilers).

I've never found an easily accessible documentation of the practical side of this stuff, although it's clear that experts have worked on it for decades, so I feel somewhat justified in writing about it myself, if I can find out what people will find useful. Plauger's "The Standard C Library" was an excellent reference but didn't spark any subsequent interest in discussing the subject. Things were more difficult back when IBMers had to allow for hex floating point and Plauger had to deal with the mixed byte order of VAX.

Re: #21. "I've never found an easily accessible documentation of the practical side of this stuff..."

You might consider looking at these two books by Jean-Michel Muller and some members of the group at ENS Lyon. (The name of the research team changes aperiodically.) One of the best collections of floating-point experts anywhere, including some of the major contributors to mpfr. Also look for various journal and conference publications from the group.

*Handbook of Floating-Point Arithmetic*

Birkhauser Boston, dec. 2009. 572 p. 62 illus., ISBN: 978-0-8176-4704-9

@Book{MullerEtAl2010,

title = {Handbook of Floating-Point Arithmetic},

author = {Muller, Jean-Michel and Brisebarre, Nicolas and de Dinechin, Florent and Jeannerod, Claude-Pierre and Lef{\`e}vre, Vincent and Melquiond, Guillaume and Revol, Nathalie and Stehl{\'e}, Damien and Torres, Serge},

publisher = {{B}irkh\"auser {B}oston },

pages = {572 },

note = {{ACM} {G}.1.0; {G}.1.2; {G}.4; {B}.2.0; {B}.2.4; {F}.2.1.,

ISBN 978-0-8176-4704-9}, year = {2010}, }

**Elementary Functions: Algorithms and Implementaton**

Springer Science & Business Media, 2006. 288 p, ISBN: 0817644083, 9780817644086

@book{muller2006elementary,

title={Elementary Functions: Algorithms and Implementation},

author={Muller, J.M.},

isbn={9780817644086},

lccn={2005048094},

series={Computer Science},

url={http://books.google.com/books?id=Mx-\_kaANJBEC},

year={2006},

publisher={Birkhauser Boston}

}

*Computer Approximations* by Cheney, Hart *et al* is a classic text in the field. It presents much information about the theoretical unpinnings of many of the mathematical techniques which can be used to generate approximations for various functions (e.g, Remez' algorithm). But it was first published in 1968. It is completely out of date with respect to software implementation techniques for math libraries on modern processors.

>>>But it was first published in 1968. It is completely out of date with respect to software implementation techniques for math libraries on modern processors.>>>

Yes that is true if you are judging this book in the context of modern math libraries implementations.

I posted my white paper on Horner polynomial and related topics at

https://docs.google.com/viewer?a=v&pid=sites&srcid=ZGVmYXVsdGRvbWFpbnx0c...

Thanks Tim.

Someone notified me that they are blocked from reading google sites (not surprisingly) so I am trying t attach a copy here.

Attachment | Size |
---|---|

Download Expression evaluation optimization_0.docx | 21.64 KB |

Very interesting and informative white paper.

**Quote:**

Rafał B.wrote:Where x87 were correct ( jobs , industry ) ?

Are you asking about the specific programming field which used x87 code?

I think that x87 code was part of some libraries which targeted older CPUs before the introduction of SSE SIMD unit.

One place where you can see actually source code of such a library is in MASM, that one specific floating point library is fully written in x87 assembly.

**Quote:**

Rafał B.wrote:How is smoothing function x87 by AVX 256 / AVX 512 ?

- errors calculation

- methods

- example

- sources

What do you mean?

Are you referring to this function http://en.wikipedia.org/wiki/Smoothing

**Quote:**

Rafał B.wrote:How is relation between x87 and SEE SIMD ?

What relation are you talking about?

**Quote:**

iliyapolakwrote:

Quote:

Rafał B.wrote:How is relation between x87 and SEE SIMD ?

What relation are you talking about?

IMO he can't understand distinction between old 8087 (and alike) co-processors, and later implement on single chip of Pentium III (where support of ICC has been dropped as @Tim Prince told).

Of course, white paper from @Tim Prince is very educative, side note: Thank you very much Tim!

x87 is old standard adopted by Intel and AMD, maybe few others. SIMD is SSE, SSE2, SSE3, SSSE3, SSE4, SSE4.1, AES_new_instructions, ... et cetera... please don't correct me with spelling issues and other in this enumeration, I'm just in hurry.

x87 can do a little low precision, and limited number of instructions/mathematical calculation instructions, like float 32 bit divide only (though very, and really very few others, like add, and substract).

IMO he can't understand distinction between old 8087 co-processors, and later implement on single chip of Pentium III (where support of ICC has been dropped as @Tim Prince specified.

Modern CPU whether it is form Intel or AMD can do much more in meaning like sin() function. and SSE (or SSSE) can do it in parallel. Remember "SIMD" acronym is saying "Singe Instruction (are calculating) Multiple Data" so they add parallelization, in meaning it will NOT do anything by obsolete x87 instructions, but take (depending on precision of float or double or long double you choose) multiple data like 4x 16-bit floats and in one time applies calculation on all of them at once, i.e. by single instruction. This is called SIMD ("Single Instruction -> Multiple Data") or "SIMD parallelization". And this is crucial distinction between x87 (like 20 years old) and modern SIMD.

Actually, you are talking about x87... that is obsolete technology, and inline assembler in C/++ will need instructions to clear the state of FP stack of modern CPU's, because they have higher halves of data.

IMO you were actually asking "what's the difference between ***SCALAR*** and ***SIMD/VECTORISED*** instructions.".

Both, in scalar, and SIMD, the command-line accounts "/fp:fast" or "/fp:fast=2", or "/fp:precise" (and others), (on Windows, see docs for unices platforms equivalents).

The best you can check, is to do sample program with epsilon calculation, and compare results form different ICC command-line arguments for building...

NB: the Epsilon calculation is crucial for your purposes as you are talking about accuracy of generated code, but it depends on whether you chose float, double, or long-double@32-bit program (80-bits long), or long-double@x64 program (128-bit long). You should run Epsilon calculation and decide whether it fits your needs... (float/double/long double data type).

Please, play around whit these suggestions, and you will see the results... I.e. in the 32-bit float in some calculations it is OK, while in others you need higher bit-wise-length precision...

>>>Modern CPU whether it is form Intel or AMD can do much more in meaning like sin() function. and SSE (or SSSE) can do it in parallel. Remember "SIMD" acronym is saying "Singe Instruction (are calculating) Multiple Data" so they add parallelization, in meaning it will NOT do anything by obsolete x87 instructions, but take (depending on precision of float or double or long double you choose) multiple data like 4x 16-bit floats and in one time applies calculation on all of them at once, i.e. by single instruction. This is called SIMD ("Single Instruction -> Multiple Data") or "SIMD parallelization". And this is crucial distinction between x87 (like 20 years old) and modern SIMD.>>>

I would like to add that "Multiple Data" can represent some vector composed of 2-4 scalars which themselves can model some specific abstraction of real physical object.

For example: RGBA integer values of pixel or 3D float coordinates and velocity scalar value of particle.

**Quote:**

Rafał B.wrote:Tahank You for Your answer .

Which degree and the measuring points were used in the model described in Taylor x87?

IIRC the polynomial is up to 9th degree.

**Quote:**

Rafał B.wrote:Thank You for Your answer

How is shaped the rest of the Taylor series in x87?

Please download and read following document "AMD K5 Transcendental Functions" in this document you will find a plenty info about the trigo function implementation.

Rate Us