Performance of sqrt

Posted by Christian M.

Hello,

I am using the intrinsic for the square root. I know from the Optimization manual that I could use the reciprocal square root and an approximation algorithm, but I need the accuracy.

The thing is that AVX shows no improvement over SSE. The Intrinsics Guide gave me some hints. Is it true that the square root operation is not pipelined, for both SSE and AVX? At least latency and throughput indicate this. I mean, AVX processes twice the data per operation, but double the latency and half the throughput combined means the same performance? Is it so?

My test system is an i5-2410M. In the Intrinsics Guide (I updated to the newest version) I only find latency and throughput for Sandy Bridge. Has the performance of these instructions improved in Ivy Bridge? Could anyone explain the CPUID(s) a little bit? Does 06_2A mean Sandy Bridge, or does it not? Does this apply to all Sandy Bridge CPUs (regardless of Desktop or Mobile, or i3, i5, i7)?

For CPUID(s) I found: http://software.intel.com/en-us/articles/intel-architecture-and-processo...

Does the intrinsics guide refer to a combination of family and model number? What about model numbers not mentioned in the intrinsics guide like Ivy Bridge?

Posted by iliyapolak

 >>>Could anyone explain the CPUID(s) a little bit? 06_2A means Sandy Bridge or does it >>>

That means 32 nm Sandy Bridge microarchitecture.

Please look at this link, which is more related to the speed of execution (a comparison between SSE sqrt(x) and invsqrt multiplied by x):

http://stackoverflow.com/questions/1528727/why-is-sse-scalar-sqrtx-slower-than-rsqrtx-x

Posted by Christian M.

Quote:

iliyapolak wrote:

 >>>Could anyone explain the CPUID(s) a little bit? 06_2A means Sandy Bridge or does it >>>

That means 32 nm Sandy Bridge microarchitecture.

That already brings me closer.

But what about Ivy Bridge and other CPUIDs that are not mentioned? Does anybody have any tips?

Posted by iliyapolak

>>>The thing is that AVX shows no improvement over SSE>>>

Maybe the exact microcode implementation of the sqrt algorithm is the same when the AVX and SSE instructions are compared.

Posted by Christian M.

I think the AVX sqrt implementation just calls the SSE implementation for the lower and upper halves of the YMM register, as the latency is doubled for double the amount of data.

But I am not sure whether this holds for all Sandy Bridge CPUs, or only because I test on a middle-class Sandy Bridge for mobile notebooks.

Posted by Tim Prince

As Christian hinted, the hardware implementation of IEEE divide and sqrt on Sandy and Ivy Bridge sequences the operands into AVX-128 pairs, so it's likely there is little performance gain for AVX-256 vs. SSE/SSE2 or AVX-128.  Ivy Bridge greatly reduces the latency so there may not be an incentive to employ the reciprocal and iteration option. That is termed a throughput optimization for single/float precision, in that it opens up opportunity for instruction level parallelism in a loop which has significant work in instructions other than divide/sqrt.
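
To make that reciprocal-and-iteration option concrete, here is a minimal SSE sketch (illustrative only, not the compiler's actual code path): _mm_rsqrt_ps gives roughly 12 correct bits, one Newton-Raphson step refines the estimate, and a final multiply by x yields an approximate sqrt.

#include <xmmintrin.h>

// Approximate sqrt(x) for 4 packed floats via rsqrt + one Newton-Raphson step.
// r' = 0.5 * r * (3 - x * r * r) refines the ~12-bit rsqrt estimate,
// and sqrt(x) ~= x * r'.  Note: x == 0 needs special handling, because
// _mm_rsqrt_ps(0) returns +infinity.
static inline __m128 sqrt_ps_approx( __m128 x )
{
    const __m128 half  = _mm_set1_ps( 0.5f );
    const __m128 three = _mm_set1_ps( 3.0f );
    __m128 r = _mm_rsqrt_ps( x );
    r = _mm_mul_ps( _mm_mul_ps( half, r ),
                    _mm_sub_ps( three, _mm_mul_ps( x, _mm_mul_ps( r, r ) ) ) );
    return _mm_mul_ps( x, r );
}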

Posted by iliyapolak

@Tim

Is it possible to obtain information about the exact algorithm used to calculate sqrt values on Intel CPUs?

Posted by Sergey Kostrov

Christian,

Let me know if you need real performance numbers for different sqrt functions and floating-point types ( 6 tests in total / 5 different C++ compilers ). I can do it for Intel Core i7-3840QM ( Ivy Bridge / 4 cores ) and older CPUs, for example Intel Pentium 4.

Posted by Sergey Kostrov

>>...As latency is doubled for double data amount...

With SSE, performance numbers are almost the same for the following test cases:

[ Test-case 1 ]

RTfloat fA = 625.0f;
mmValue.m128_f32[0] = ( RTfloat )fA;
mmValue.m128_f32[1] = ( RTfloat )0.0f; // since this is 0.0 sqrt shouldn't be calculated
mmValue.m128_f32[2] = ( RTfloat )0.0f; // since this is 0.0 sqrt shouldn't be calculated
mmValue.m128_f32[3] = ( RTfloat )0.0f; // since this is 0.0 sqrt shouldn't be calculated
mmResult = _mm_sqrt_ps( mmValue );

[ Test-case 2 ]

RTfloat fA = 625.0f;
mmValue.m128_f32[0] = ( RTfloat )fA;
mmValue.m128_f32[1] = ( RTfloat )fA;
mmValue.m128_f32[2] = ( RTfloat )fA;
mmValue.m128_f32[3] = ( RTfloat )fA;
mmResult = _mm_sqrt_ps( mmValue );

Posted by Sergey Kostrov

>>...06_2A means Sandy Bridge or does it not?..

I'll take a look.

In general, you need to get more detailed information like:
...
CPU Brand String: Intel(R) Atom(TM) CPU N270 @ 1.60GHz
CPU Vendor : GenuineIntel
Stepping ID = 2
Model = 12
Family = 6
Extended Model = 1
...
and then to "map" these numbers to codes in the manual.

Posted by Christian M.

Quote:

TimP (Intel) wrote:

As Christian hinted, the hardware implementation of IEEE divide and sqrt on Sandy and Ivy Bridge sequences the operands into AVX-128 pairs, so it's likely there is little performance gain for AVX-256 vs. SSE/SSE2 or AVX-128.  Ivy Bridge greatly reduces the latency so there may not be an incentive to employ the reciprocal and iteration option. That is termed a throughput optimization for single/float precision, in that it opens up opportunity for instruction level parallelism in a loop which has significant work in instructions other than divide/sqrt.

The point about Ivy Bridge is really interesting. With add and mul, Sandy Bridge already allows quite good instruction level parallelism. If one result is not directly based on the preceding operations, one can fill the pipeline very well and get nearly one result per clock, I suppose.

Can one also find the Ivy Bridge optimizations in the Intrinsics Guide? I cannot find the appropriate CPUID. If 06_2A is Sandy Bridge, then according to the table from http://software.intel.com/en-us/articles/intel-architecture-and-processo..., Ivy Bridge should have 06_3A. But I can't find it in the Intrinsics Guide for any instructions (I have not checked every one, only those that are important to me).

Quote:

Sergey Kostrov wrote:

Christian,

Let me know if you need real performance numbers for different sqrt functions and floating-point types ( 6 tests in total / 5 different C++ compilers ). I can do it for Intel Core i7-3840QM ( Ivy Bridge / 4 cores ) and older CPUs, for example Intel Pentium 4.

This would be great! I am especially interested in the performance of the precise square root operation. Different CPUs would be a good indicator. I wonder whether the results also differ within a CPU family.

You mentioned that I should "map" these numbers to codes in the manual. Which manual are you talking about, exactly?

Posted by Sergey Kostrov

>>... Which manual are you talking about exactly?

Please take a look at: http://www.intel.com/content/www/us/en/processors/architectures-software...

Posted by Sergey Kostrov

Here are a couple more links & tips:

- You need to look at Intel 64 and IA-32 Architectures Optimization Reference Manual, APPENDIX C, INSTRUCTION LATENCY AND THROUGHPUT

- Try to use msinfo32.exe utility ( it provides some CPU information )

- http://ark.intel.com -> http://ark.intel.com/products/52224/Intel-Core-i5-2410M-Processor-3M-Cac...

Note: Take a look at a datasheet for your i5-2410M CPU in a Quick Links section ( on the right side of the web page )

- http://software.intel.com/en-us/forums/topic/278742

Posted by iliyapolak

>>>I am especially interested on the performance of the precise square root operation>>>

Here you have a very interesting discussion about hardware-accelerated sqrt calculation:

http://stackoverflow.com/questions/1528727/why-is-sse-scalar-sqrtx-slowe...

Posted by iliyapolak

  >>>With add and mul Sandy Bridge already allows quite good instruction level parallelism>>>

Sandy Bridge really improved instruction level parallelism by adding one or two new ports to the execution cluster. So, for example, when your code has an fp add (one vector addition) and an fp mul (one vector multiplication) that are not interdependent, they can be executed simultaneously.
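
A small illustrative sketch of that situation (a hypothetical loop, just to show two chains with no dependency between them, so the packed add and packed multiply can issue in the same cycle on different ports):

#include <xmmintrin.h>

// Two independent dependency chains: the packed add and the packed multiply
// do not feed each other, so the scheduler can dispatch them to the FP add
// and FP mul ports in the same cycle.  Assumes n is a multiple of 4.
void add_mul_independent( const float *a, const float *b, float *sum, float *prod, int n )
{
    for ( int i = 0; i < n; i += 4 )
    {
        __m128 va = _mm_loadu_ps( a + i );
        __m128 vb = _mm_loadu_ps( b + i );
        _mm_storeu_ps( sum  + i, _mm_add_ps( va, vb ) );  // FP add port
        _mm_storeu_ps( prod + i, _mm_mul_ps( va, vb ) );  // FP mul port
    }
}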

Posted by Tim Prince

">>>I am especially interested on the performance of the precise square root operation.>>>

Follow this link : http://stackoverflow.com/questions/1528727/why-is-sse-scalar-sqrtx-slowe..."

These imprecise operations are available via Intel compiler options:

/Qimf-accuracy-bits:bits[:funclist]
          define the relative error, measured by the number of correct bits,
          for math library function results
            bits     - a positive, floating-point number
            funclist - optional comma separated list of one or more math
                       library functions to which the attribute should be
                       applied

So you can request the 13-bit accuracy implementation of divide and sqrt. Iterative methods with less than full precision can be produced by requesting 20-, 40-, or 49-bit accuracy. 22-bit accuracy is the default for single-precision vectorization; -Qprec-div -Qprec-sqrt (implied by /fp:source|precise) changes the default to 24/53-bit accuracy. Beginning with Harpertown, the IEEE instructions, referred to as "native" in your references, have been quite competitive for SSE/SSE2. The original Core 2 Duo with the slower divide and sqrt is no longer in production. I turned mine in after 4.5 years rather than re-install Windows a 4th time.

The x87 divide and sqrt also support a trade-off between speed and precision, by setting 24-, 53- (default for Intel and Microsoft compilers) or 64- (hardware default, /Qpc80) bit precision mode.

You also have the choice, since SSE, of gradual underflow (/Qftz-) to maintain precision in the presence of partial underflow.  Sandy Bridge removes the performance penalty for /Qftz- in most common situations.  This was done in part because it's not convenient to set abrupt underflow when using Microsoft or gnu compilers.
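
For reference, when the compiler option is not available, abrupt underflow can also be toggled at run time through the MXCSR helper macros; a minimal sketch (leaving both modes off corresponds to gradual underflow, i.e. /Qftz-):

#include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE
#include <pmmintrin.h>   // _MM_SET_DENORMALS_ZERO_MODE (SSE3 and later)

// Abrupt underflow for SSE/AVX code: flush denormal results to zero (FTZ)
// and treat denormal inputs as zero (DAZ).
static void set_abrupt_underflow( int enable )
{
    _MM_SET_FLUSH_ZERO_MODE( enable ? _MM_FLUSH_ZERO_ON : _MM_FLUSH_ZERO_OFF );
    _MM_SET_DENORMALS_ZERO_MODE( enable ? _MM_DENORMALS_ZERO_ON : _MM_DENORMALS_ZERO_OFF );
}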

All these options are more than most developers are willing to bargain for (and QA test).  That's one of the reasons for availability of IEEE standard compliant instructions and for progress at the hardware level in making them more efficient.

Posted by Christian M.

>>>Follow this link : http://stackoverflow.com/questions/1528727/why-is-sse-scalar-sqrtx-slowe...

Quite an interesting discussion; it provides a lot of information.

I found the following discussion about square root and AVX: http://stackoverflow.com/questions/8924729/using-avx-intrinsics-instead-...

Someone mentions something about instruction emulation. Is it true that a low-end processor (let's take a Sandy Bridge i3) has different or fewer execution units than a Sandy Bridge i7?

Posted by Christian M.

>>>These imprecise operations are available via Intel compiler options ...

Wow, this information is new to me. I did not know one could control accuracy.

>>> Beginning with Harpertown, the IEEE instructions, referred to as "native" in your references, have been quite competitive for SSE/SSE2.

So that was the first time IEEE-compliant instructions provided quite good speed compared to other SSE/SSE2 versions?

And regarding x87: I found that some compilers use only the x87 FPU in 32-bit mode, and when the same code is compiled for 64-bit mode, SSE is used (scalar version only). Is this also something that can be controlled? For some algorithms high accuracy might be useful. The x87 FPU provides the most precision, with 80 bits; this cannot be achieved with SSE anymore.

Posted by iliyapolak

>>>For some algorithms high accuracy might be useful. x87 fpu provides most precision with 80 bit. This can not be achieved with SSE any more.>>>

Yes, because it is the developer's decision and/or the project's constraints that determine whether to favor precision over vectorization of the code.

Posted by iliyapolak

>>>Is it true that low end processor (lets take an i3 Sandy Bridge) has other execution units or less than an i7 Sandy Bridge?>>>

I'm not sure if a Core i3 has fewer execution units than a Core i7. I think the main differences are cache size, TDP, the number of physical and logical cores (HT), and more aggressive overclocking.

Posted by Sergey Kostrov

>>>>These imprecise operations are available via Intel compiler options...

That is correct. However, from my point of view and experience, a more flexible way is to control precision at run time.

>>
>>Wow, this information is quite new. I did not know one could control accuracy...

Please take a look at the _control87 CRT function.
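
A minimal sketch of run-time precision control with _control87 (MSVC CRT, <float.h>; the precision field affects x87 code, so it is mainly relevant to 32-bit builds that use the x87 FPU):

#include <float.h>

static void demo_precision_control( void )
{
    // Read the current control word, switch the x87 precision field to
    // 24-bit (single precision), then restore it.  _PC_53 / _PC_64 select
    // 53-/64-bit precision; /Qpc80 corresponds to the 64-bit setting.
    unsigned int old_cw = _control87( 0, 0 );   // mask 0: no change, just read
    _control87( _PC_24, _MCW_PC );              // 24-bit precision for x87 ops
    // ... x87 arithmetic here rounds to 24-bit significands ...
    _control87( old_cw, _MCW_PC );              // restore the precision field
}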

Note: We recently had a very good discussion regarding precision issues and, if interested, please take a look at:

Forum topic: Mathimf and windows
Web-link: http://software.intel.com/en-us/forums/topic/357759

Posted by Sergey Kostrov

Sorry, I forgot to specify the forum's name...

>>Note: We recently had a very good discussion regarding precision issues and, if interested, please take a look at:
>>
>>Forum topic: Mathimf and windows

It is in the Intel C++ compiler forum.

Posted by Sergey Kostrov

>>...Ivy Bridge should have 06_3A. But I can't find it in the Intrinsics guide for any instructions (I have not checked
>>every but those that are imporant for me)...

Christian, Please take a look at Table 3-18. Highest CPUID Source Operand for Intel 64 and IA-32 Processors ( page 212 ) in

Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 2 (2A, 2B & 2C): Instruction Set Reference, A-Z

Order Number: 325383-044US
August 2012

Posted by Sergey Kostrov

>>...Ivy Bridge should have 06_3A. But I can't find it in the Intrinsics guide for any instructions (I have not checked
>>every but those that are imporant for me)...

This is what my CPUID test case displays:
...
CPU Brand String: Intel(R) Core(TM) i7-3840QM CPU @ 2.80GHz
CPU Vendor: GenuineIntel
Stepping ID = 9
Model = 10
Family = 6
Extended Model = 3
...

Posted by Sergey Kostrov

>>>>...Let me know if you need real performance numbers for different sqrt functions and floating-point types...
>>>>
>>This would be great! I am especially interested on the performance of the precise square root operation. Different CPUs would be
>>a good indicator...

In general, all tests are based on the following for loop:
...
int iNumberOfIterations = 16777216; // 2^24

g_uiTicksStart = ::GetTickCount();
for( int t = 0; t < iNumberOfIterations; t++ )
{
...
}
g_uiTicksEnd = ::GetTickCount();
printf( RTU(" - %ld ticks\n"), ( int )( g_uiTicksEnd - g_uiTicksStart ) );
...
for Microsoft C++ compiler, Debug and Release configurations, and without any optimizations.

Posted by Sergey Kostrov

[ Microsoft C++ compiler / Debug configurations ]

CPU: Intel(R) Core(TM) i7-3840QM CPU @ 2.80GHz

User Sqrt - RTfloat
Calculating the Square Root of 625.000 - 296 ticks
625.000^0.5 = 25.000

User Sqrt - RTdouble
Calculating the Square Root of 625.000 - 281 ticks
625.000^0.5 = 25.000

CrtSqrt - RTfloat
Calculating the Square Root of 625.000 - 577 ticks
625.000^0.5 = 25.000

CrtSqrt - RTdouble
Calculating the Square Root of 625.000 - 593 ticks
625.000^0.5 = 25.000

HrtSqrt - RTdouble
Calculating the Square Root of 625.000 - 593 ticks
625.000^0.5 = 25.000

SSE Sqrt - RTfloat
Calculating the Square Root of 625.000 - 343 ticks
625.000^0.5 = 25.000

F32vec4 class - RTfloat
Calculating the Square Root of 625.000 - 3011 ticks
625.000^0.5 = 25.000

CPU: Intel(R) Pentium(R) 4 CPU 1.60GHz

User Sqrt - RTfloat
Calculating the Square Root of 625.000 - 984 ticks
625.000^0.5 = 25.000

User Sqrt - RTdouble
Calculating the Square Root of 625.000 - 969 ticks
625.000^0.5 = 25.000

CrtSqrt - RTfloat
Calculating the Square Root of 625.000 - 2422 ticks
625.000^0.5 = 25.000

CrtSqrt - RTdouble
Calculating the Square Root of 625.000 - 2500 ticks
625.000^0.5 = 25.000

HrtSqrt - RTdouble
Calculating the Square Root of 625.000 - 2672 ticks
625.000^0.5 = 25.000

SSE Sqrt - RTfloat
Calculating the Square Root of 625.000 - 1406 ticks
625.000^0.5 = 25.000

F32vec4 class - RTfloat
Calculating the Square Root of 625.000 - 11187 ticks
625.000^0.5 = 25.000

Posted by Sergey Kostrov

[ Microsoft C++ compiler / Release configurations ]

CPU: Intel(R) Core(TM) i7-3840QM CPU @ 2.80GHz

User Sqrt - RTfloat
Calculating the Square Root of 625.000 - 281 ticks
625.000^0.5 = 25.000

User Sqrt - RTdouble
Calculating the Square Root of 625.000 - 297 ticks
625.000^0.5 = 25.000

CrtSqrt - RTfloat
Calculating the Square Root of 625.000 - 93 ticks
625.000^0.5 = 25.000

CrtSqrt - RTdouble
Calculating the Square Root of 625.000 - 94 ticks
625.000^0.5 = 25.000

HrtSqrt - RTdouble
Calculating the Square Root of 625.000 - 93 ticks
625.000^0.5 = 25.000

SSE Sqrt - RTfloat
Calculating the Square Root of 625.000 - 318 ticks
625.000^0.5 = 25.000

F32vec4 class
Calculating the Square Root of 625.000 - 406 ticks
625.000^0.5 = 25.000

CPU: Intel(R) Pentium(R) 4 CPU 1.60GHz

User Sqrt - RTfloat
Calculating the Square Root of 625.000 - 985 ticks
625.000^0.5 = 25.000

User Sqrt - RTdouble
Calculating the Square Root of 625.000 - 969 ticks
625.000^0.5 = 25.000

CrtSqrt - RTfloat
Calculating the Square Root of 625.000 - 406 ticks
625.000^0.5 = 25.000

CrtSqrt - RTdouble
Calculating the Square Root of 625.000 - 406 ticks
625.000^0.5 = 25.000

HrtSqrt - RTdouble
Calculating the Square Root of 625.000 - 406 ticks
625.000^0.5 = 25.000

SSE Sqrt - RTfloat
Calculating the Square Root of 625.000 - 1422 ticks
625.000^0.5 = 25.000

F32vec4 class - RTfloat
Calculating the Square Root of 625.000 - 1953 ticks
625.000^0.5 = 25.000

Posted by Sergey Kostrov

>>...I wonder whether the results also differ within a CPU family...

I can't verify it. However, when it comes to precision, if 53-bit precision is set then the results must be the same for all CPUs.

Posted by Tim Prince

In the particular case where your operands can be expressed exactly in 12 bits of precision, it seems that your accuracy doesn't vary among these methods. Accuracy of the sqrt reciprocal approximation varies between AMD CPU families, but I think Intel tried to keep it the same.

If you wished to test accuracy of sqrt without going through an exhaustive list of cases, you could try something like the Paranoia benchmark.

The earliest AMD families had a 14-bit approximation which would be sufficient to obtain 52 bits after 2 iterations; this has been considered at Intel but I don't know of it ever being adopted.

Posted by iliyapolak

Thanks for posting the sqrt(x) test case.

What is this sqrt(x) implementation "User Sqrt - RTfloat"?

Do you have results for SSE sqrt(x) where x = double primitive type?

Posted by Sergey Kostrov

>>What is this sqrt(x) implementation "User Sqrt - RTfloat"?..

It is based on a classic iterative method and I'll provide more details later.

>>Do you have results for SSE sqrt(x) where x = double primitive type?..

No. If you decide to test it you will need to use:

__m128d _mm_sqrt_pd( __m128d )

Note: It is the same as SQRTPD instruction.
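
A minimal sketch of such a test for doubles, following the same style as the float test cases above (the m128d_f64 field is the MSVC accessor for __m128d):

#include <emmintrin.h>

__m128d mmValue  = _mm_set_pd( 625.0, 2.0 );   // element 1 = 625.0, element 0 = 2.0
__m128d mmResult = _mm_sqrt_pd( mmValue );     // SQRTPD: two double-precision square roots
// mmResult.m128d_f64[1] == 25.0, mmResult.m128d_f64[0] == sqrt( 2.0 )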

Posted by Sergey Kostrov

Hi everybody. The next three test results demonstrate what the latest version of the Intel C++ compiler can do...

Posted by Sergey Kostrov

[ Intel C++ compiler 13.0.089 / 32-bit / Release Configuration ]

CPU: Intel(R) Core(TM) i7-3840QM CPU @ 2.80GHz

Optimization: Maximize Speed (/O2)
Code Generation:
Add Processor-Optimized Code Path: Intel(R) Core(TM) processor family with Intel(R) Advanced Vector Extensions support (/QaxAVX)
Intel Processor-Specific Optimization: Intel(R) Core(TM) processor family with Intel(R) Advanced Vector Extensions support (/QxAVX)

Floating Point Model: Precise (/fp:precise)

User Sqrt - RTfloat
Calculating the Square Root of 625.000 - 265 ticks
625.000^0.5 = 25.000

User Sqrt - RTdouble
Calculating the Square Root of 625.000 - 203 ticks
625.000^0.5 = 25.000

CrtSqrt - RTfloat
Calculating the Square Root of 625.000 - 93 ticks
625.000^0.5 = 25.000

CrtSqrt - RTdouble
Calculating the Square Root of 625.000 - 94 ticks
625.000^0.5 = 25.000

HrtSqrt - RTdouble
Calculating the Square Root of 625.000 - 93 ticks
625.000^0.5 = 25.000

SSE Sqrt - RTfloat
Calculating the Square Root of 625.000 - 47 ticks
625.000^0.5 = 25.000

F32vec4 class - RTfloat
Calculating the Square Root of 625.000 - 47 ticks
625.000^0.5 = 25.000

Note 1: 47 ticks for 2^24 iterations!
Note 2: 1 sec is 1000 ticks.

Posted by Sergey Kostrov

[ Intel C++ compiler 13.0.089 / 32-bit / Release Configuration ]

CPU: Intel(R) Core(TM) i7-3840QM CPU @ 2.80GHz

Optimization: Maximize Speed (/O2)
Code Generation:
Add Processor-Optimized Code Path: Intel(R) Core(TM) processor family with Intel(R) Advanced Vector Extensions support (/QaxAVX)
Intel Processor-Specific Optimization: Intel(R) Core(TM) processor family with Intel(R) Advanced Vector Extensions support (/QxAVX)

Floating Point Model: Fast (/fp:fast)

User Sqrt - RTfloat
Calculating the Square Root of 625.000 - 140 ticks
625.000^0.5 = 25.000

User Sqrt - RTdouble
Calculating the Square Root of 625.000 - 188 ticks
625.000^0.5 = 25.000

CrtSqrt - RTfloat
Calculating the Square Root of 625.000 - 93 ticks
625.000^0.5 = 25.000

CrtSqrt - RTdouble
Calculating the Square Root of 625.000 - 94 ticks
625.000^0.5 = 25.000

HrtSqrt - RTdouble
Calculating the Square Root of 625.000 - 93 ticks
625.000^0.5 = 25.000

SSE Sqrt - RTfloat
Calculating the Square Root of 625.000 - 47 ticks
625.000^0.5 = 25.000

F32vec4 class - RTfloat
Calculating the Square Root of 625.000 - 47 ticks
625.000^0.5 = 25.000

Posted by Sergey Kostrov

[ Intel C++ compiler 13.0.089 / 32-bit / Release Configuration ]

CPU: Intel(R) Core(TM) i7-3840QM CPU @ 2.80GHz

Optimization: Maximize Speed (/O2)
Code Generation:
Add Processor-Optimized Code Path: Intel(R) Core(TM) processor family with Intel(R) Advanced Vector Extensions support (/QaxAVX)
Intel Processor-Specific Optimization: Intel(R) Core(TM) processor family with Intel(R) Advanced Vector Extensions support (/QxAVX)

Floating Point Model: Fast=2 (/fp:fast=2) [Intel C++]

User Sqrt - RTfloat
Calculating the Square Root of 625.000 - 140 ticks
625.000^0.5 = 25.000

User Sqrt - RTdouble
Calculating the Square Root of 625.000 - 187 ticks
625.000^0.5 = 25.000

CrtSqrt - RTfloat
Calculating the Square Root of 625.000 - 94 ticks
625.000^0.5 = 25.000

CrtSqrt - RTdouble
Calculating the Square Root of 625.000 - 94 ticks
625.000^0.5 = 25.000

HrtSqrt - RTdouble
Calculating the Square Root of 625.000 - 93 ticks
625.000^0.5 = 25.000

SSE Sqrt - RTfloat
Calculating the Square Root of 625.000 - 47 ticks
625.000^0.5 = 25.000

F32vec4 class - RTfloat
Calculating the Square Root of 625.000 - 46 ticks
625.000^0.5 = 25.000

Posted by Sergey Kostrov

>>>>What is this sqrt(x) implementation "User Sqrt - RTfloat"?..
>>
>>It is based on a classic iterative method and I'll provide more details later.

Here it is:
...
RTint iNumberOfIterations = _RTNUMBER_OF_TESTS_0016777216; // 2^24

// Sub-Test 1 - User Sqrt - RTfloat
{
CrtPrintf( RTU("User Sqrt - RTfloat\n") );
RTfloat fA = 625.00f;
RTfloat fG = 625.00f;

RTfloat fQ = 0.0f;

CrtPrintf( RTU("Calculating the Square Root of %.3f"), fA );

g_uiTicksStart = SysGetTickCount();
for( RTint t = 0; t < iNumberOfIterations; t++ )
{
fQ = 0.0f;

while( RTtrue )
{
if( ( fQ - fG ) > -0.00001f )
break;
fQ = fA / fG;
fG = ( 0.5f * fG + 0.5f * fQ );
}
}
CrtPrintf( RTU(" - %ld ticks\n"), ( RTint )( SysGetTickCount() - g_uiTicksStart ) );
CrtPrintf( RTU("%.3f^0.5 = %.3f\n"), fA, fG );
}
...

Posted by iliyapolak

>>>SSE Sqrt - RTfloat
Calculating the Square Root of 625.000 - 47 ticks
625.000^0.5 = 25.000>>>

It is interesting: which of the sqrt calculation methods does the hardware-accelerated SSE instruction use?

Posted by Sergey Kostrov

>>...which of the sqrt calculation methods does hardware accelerated SSE instruction use?

The last two, SSE Sqrt and F32vec4 class. I will do an evaluation of AVX sqrt-intrinsic functions:
...
/*
* Square Root of Double-Precision Floating-Point Values
* **** VSQRTPD ymm1, ymm2/m256
* Performs an SIMD computation of the square roots of the two or four packed
* double-precision floating-point values in the source operand and stores
* the packed double-precision floating-point results in the destination
*/
extern __m256d __cdecl _mm256_sqrt_pd(__m256d a);

/*
* Square Root of Single-Precision Floating-Point Values
* **** VSQRTPS ymm1, ymm2/m256
* Performs an SIMD computation of the square roots of the eight packed
* single-precision floating-point values in the source operand and stores the
* packed single-precision floating-point results in the destination
*/
extern __m256 __cdecl _mm256_sqrt_ps(__m256 a);
...

some time later, and I'd like to verify Christian's statement: '...The thing is that AVX shows no improvement over SSE...'
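
A minimal sketch of such a comparison, reusing the GetTickCount harness and iNumberOfIterations from the earlier tests (illustrative only; it assumes an AVX-enabled code path such as /QxAVX, and in an optimized build the results should be consumed so the loops are not eliminated):

#include <immintrin.h>

__m128 mmValue4 = _mm_set1_ps( 625.0f );
__m256 mmValue8 = _mm256_set1_ps( 625.0f );
__m128 mmResult4;
__m256 mmResult8;

g_uiTicksStart = ::GetTickCount();
for( int t = 0; t < iNumberOfIterations; t++ )
    mmResult4 = _mm_sqrt_ps( mmValue4 );             // 4 floats per SQRTPS
printf( " SSE sqrt - %ld ticks\n", ( int )( ::GetTickCount() - g_uiTicksStart ) );

g_uiTicksStart = ::GetTickCount();
for( int t = 0; t < iNumberOfIterations; t++ )
    mmResult8 = _mm256_sqrt_ps( mmValue8 );          // 8 floats per VSQRTPS
printf( " AVX sqrt - %ld ticks\n", ( int )( ::GetTickCount() - g_uiTicksStart ) );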

Posted by Sergey Kostrov

Christian, have you seen this picture on the Wiki: en.wikipedia.org/wiki/Ivy_Bridge_(microarchitecture)#Roadmap?

>>...Has performance of this commands improved in Ivy Bridge?

As I promised I'll do a verification and results will be posted ( unfortunately, only for Ivy Bridge ).

Posted by iliyapolak

>>>The last two, SSE Sqrt and F32vec4 class. I will do an evaluation of AVX sqrt-intrinsic functions:>>>

It seems that I formulated my question wrongly. I wanted to ask which of the mathematical algorithms used to calculate sqrt is implemented in hardware/microcode by the SSE sqrt instructions. I have found this paper: "Fast Floating Point Square Root".

Posted by Sergey Kostrov

>>...which of the mathematical algorithms used to calculated sqrt is implemented in hardware/microcode by
>>the SSE sqrt instructions...

I won't be surprised if it is a highly optimized version of the Newton-Raphson square root algorithm; it would be nice to hear from Intel software engineers.

>>...As I promised I'll do a verification and results will be posted ( unfortunately, only for Ivy Bridge )...

Iliya, do you have a computer with a CPU that supports AVX? I need an independent verification of my test results, and I really have lots of questions for Christian with regard to his statement:

...The thing is that AVX shows no improvement over SSE...

Posted by Sergey Kostrov

...The thing is that AVX shows no improvement over SSE...

Christian, how did you come to that conclusion? Could you follow up, please?

My test results show that AVX-based sqrt is ~6x faster than SSE-based sqrt.

Posted by iliyapolak

>>>Iliya, do you have a computer with a CPU that support AVX? I need an independent verification of my test results and I really have lots of questions to Christian with regard to his statement:>>>

Sorry Sergey, but I still have only a Core i3. I can run your tests for SSE verification only.

Posted by iliyapolak

>>...which of the mathematical algorithms used to calculated sqrt is implemented in hardware/microcode by >> the SSE sqrt instructions...

>>>I won't be surprised if it is a highly optimized version of Newton-Raphson Square Root algorithm and it would be nice to hear from Intel software engineers.>>>

Yes, I thought the same. Looking at the algorithm, one can see that it performs a costly (for the hardware) division on every iteration, so I think Intel engineers probably optimized this part of the algorithm.

>>>CrtSqrt - RTdouble Calculating the Square Root of 625.000 - 94 ticks 625.000^0.5 = 25.000>>>

An interesting case is the CRT sqrt function, which is slower than its SSE and AVX counterparts. I suppose that, when disassembled, it calls the x87 FSQRT instruction, which itself has a latency of 10-24 core clock cycles (as reported by Agner's tables). It would be nice to test the FSQRT accuracy against the AVX VSQRTPD result. FSQRT can use long double precision for the intermediate calculation stage in order to diminish rounding errors and preserve the accuracy of the result. The longer execution time of the library sqrt function is probably due to the additional C code which wraps the FSQRT instruction and performs input checking.

@Sergey, can you force the compiler to inline calls to the CRT sqrt function?

Posted by Sergey Kostrov

>>...CRT sqrt function which is slower than SSE and AVX counterparts. I suppose when disassembled it calls fsqrt x87 instruction
>>which itself has the latency of 10-24 core clock cycles(as reported by Agner tables).

There are two issues: call overhead ( parameter verification, etc. ), and it could ( possibly ) depend on the setting of the _set_SSE2_enable function ( I didn't verify it ).

>>It would be nice to test the FSQRT accuracy against the AVX VSQRTPD result.

Yes, but this is another set of tests and I won't have time for it.

>>FSQRT can use long double precision types for intermediate calculation stage in order to diminish rounding errors and
>>to preserve accuracy of the result.

HrtSqrt is actually based on it.

>>Longer execution time of Library sqrt function is probably due to additional C code which wraps FSQRT instruction and
>>performes an input checking.

Yes and this is what I called '...a call overhead...' before.

Posted by Sergey Kostrov

>>...can you force compiler to inline calls to CRT sqrt function?..

Yes, but it won't improve performance significantly (!) since '...parameters verifications, etc...' must be done anyway in the testing for loop.

Posted by Sergey Kostrov

>>...My test results show that AVX-based sqrt is ~6x faster than SSE-based sqrt.

Here are the results:

[ Intel C++ compiler 13.0.089 / 32-bit / Release Configuration ]

CPU: Intel(R) Core(TM) i7-3840QM CPU @ 2.80GHz

Optimization: Maximize Speed (/O2)
Code Generation:
Add Processor-Optimized Code Path: Intel(R) Core(TM) processor family with Intel(R) Advanced Vector Extensions support (/QaxAVX)
Intel Processor-Specific Optimization: Intel(R) Core(TM) processor family with Intel(R) Advanced Vector Extensions support (/QxAVX)

Floating Point Model: Fast=2 (/fp:fast=2) [Intel C++]

User Sqrt - RTfloat
Calculating the Square Root of 625.000 - 140 ticks
625.000^0.5 = 25.000

User Sqrt - RTdouble
Calculating the Square Root of 625.000 - 188 ticks
625.000^0.5 = 25.000

CrtSqrt - RTfloat
Calculating the Square Root of 625.000 - 94 ticks
625.000^0.5 = 25.000

CrtSqrt - RTdouble
Calculating the Square Root of 625.000 - 93 ticks
625.000^0.5 = 25.000

HrtSqrt - RTdouble
Calculating the Square Root of 625.000 - 94 ticks
625.000^0.5 = 25.000

SSE Sqrt - RTfloat
Calculating the Square Root of 625.000 - 47 ticks ( to calculate square roots for 4 floats )
625.000^0.5 = 25.000

F32vec4 class - RTfloat
Calculating the Square Root of 625.000 - 47 ticks ( to calculate square roots for 4 floats )
625.000^0.5 = 25.000

AVX Sqrt - RTfloat
Calculating the Square Root of 625.000 - 15 ticks ( to calculate square roots for 8 floats )
625.000^0.5 = 25.000

Posted by Sergey Kostrov

>>...My test results show that AVX-based sqrt is ~6x faster than SSE-based sqrt.

>>...
>>SSE Sqrt - RTfloat
>>Calculating the Square Root of 625.000 - 47 ticks ( to calculate square roots for 4 floats )
>>625.000^0.5 = 25.000
>>...
>>AVX Sqrt - RTfloat
>>Calculating the Square Root of 625.000 - 15 ticks ( to calculate square roots for 8 floats )
>>625.000^0.5 = 25.000

This is how I did the assessment:

- Normalization factor is 2 = 8 ( floats ) / 4 ( floats ).
- Then, ( 47 ( ticks ) / 15( ticks ) ) * 2 ~= 6

Posted by iliyapolak

>>>- Normalization factor is 2 = 8 ( floats ) / 4 ( floats ).
- Then, ( 47 ( ticks ) / 15( ticks ) ) * 2 ~= 6>>>

Thanks for clarifying this. I was wondering how you got a 6x improvement in execution speed.
