Performance of sqrt


Hello,

I am using the intrinsic for square root. I know from the Optimization Reference Manual that I could use the reciprocal square root plus an approximation algorithm, but I need the accuracy.

The thing is that AVX shows no improvement over SSE. The Intrinsics Guide gave me some hints. Is it true that the square-root operation is not pipelined, for both SSE and AVX? At least the latency and throughput figures indicate this. I mean, AVX handles twice the data per operation, but double the latency combined with half the throughput means the same overall performance, doesn't it?

My test system is an i5-2410M. In the Intrinsics Guide (I updated to the newest version) I only find latency and throughput for Sandy Bridge. Has the performance of these instructions improved in Ivy Bridge? Could anyone explain the CPUID(s) a little? Does 06_2A mean Sandy Bridge? And does that cover all Sandy Bridge CPUs (regardless of desktop or mobile, or i3, i5, i7)?

For CPUID(s) I found: http://software.intel.com/en-us/articles/intel-architecture-and-processo...

Does the Intrinsics Guide refer to a combination of family and model number? What about model numbers not mentioned in the Intrinsics Guide, like Ivy Bridge?


 >>>Could anyone explain the CPUID(s) a little bit? 06_2A means Sandy Bridge or does it >>>

That means 32 nm Sandy Bridge microarchitecture.

Please look at this link, which is more related to the speed of execution (a comparison between SSE sqrt(x) and rsqrt(x) multiplied by x):

http://stackoverflow.com/questions/1528727/why-is-sse-scalar-sqrtx-slower-than-rsqrtx-x

Quote:

iliyapolak wrote:

 >>>Could anyone explain the CPUID(s) a little bit? 06_2A means Sandy Bridge or does it >>>

That means 32 nm Sandy Bridge microarchitecture.

This brings me already closer.

But what about Ivy Bridge and other unmentioned CPUID(s). Does anybody have some tips?

>>>The thing is that AVX shows no improvement over SSE>>>

Maybe the microcode implementation of the sqrt algorithm is exactly the same when the AVX and SSE instructions are compared.

I think the AVX sqrt implementation just calls the SSE implementation for the lower and upper halves of the YMM register, since latency doubles for double the data.

But I am not sure whether this holds for all Sandy Bridge CPUs, or only because I am testing on a mid-range mobile Sandy Bridge.

As Christian hinted, the hardware implementation of IEEE divide and sqrt on Sandy and Ivy Bridge sequences the operands into AVX-128 pairs, so it's likely there is little performance gain for AVX-256 vs. SSE/SSE2 or AVX-128.  Ivy Bridge greatly reduces the latency so there may not be an incentive to employ the reciprocal and iteration option. That is termed a throughput optimization for single/float precision, in that it opens up opportunity for instruction level parallelism in a loop which has significant work in instructions other than divide/sqrt.

@Tim

Is it possible to obtain information about the exact algorithm used to calculate sqrt values on Intel CPUs?

Christian,

Let me know if you need real performance numbers for different sqrt functions and floating-point types ( 6 tests in total / 5 different C++ compilers ). I can do it for Intel Core i7-3840QM ( Ivy Bridge / 4 cores ) and older CPUs, for example Intel Pentium 4.

>>...As latency is doubled for double data amount...

With SSE, the performance numbers are almost the same for the following test cases:

[ Test-case 1 ]

RTfloat fA = 625.0f;
mmValue.m128_f32[0] = ( RTfloat )fA;
mmValue.m128_f32[1] = ( RTfloat )0.0f; // since this is 0.0 sqrt shouldn't be calculated
mmValue.m128_f32[2] = ( RTfloat )0.0f; // since this is 0.0 sqrt shouldn't be calculated
mmValue.m128_f32[3] = ( RTfloat )0.0f; // since this is 0.0 sqrt shouldn't be calculated
mmResult = _mm_sqrt_ps( mmValue );

[ Test-case 2 ]

RTfloat fA = 625.0f;
mmValue.m128_f32[0] = ( RTfloat )fA;
mmValue.m128_f32[1] = ( RTfloat )fA;
mmValue.m128_f32[2] = ( RTfloat )fA;
mmValue.m128_f32[3] = ( RTfloat )fA;
mmResult = _mm_sqrt_ps( mmValue );

>>...06_2A means Sandy Bridge or does it not?..

I'll take a look.

In general, you need to get more detailed information like:
...
CPU Brand String: Intel(R) Atom(TM) CPU N270 @ 1.60GHz
CPU Vendor : GenuineIntel
Stepping ID = 2
Model = 12
Family = 6
Extended Model = 1
...
and then to "map" these numbers to codes in the manual.

Quote:

TimP (Intel) wrote:

As Christian hinted, the hardware implementation of IEEE divide and sqrt on Sandy and Ivy Bridge sequences the operands into AVX-128 pairs, so it's likely there is little performance gain for AVX-256 vs. SSE/SSE2 or AVX-128.  Ivy Bridge greatly reduces the latency so there may not be an incentive to employ the reciprocal and iteration option. That is termed a throughput optimization for single/float precision, in that it opens up opportunity for instruction level parallelism in a loop which has significant work in instructions other than divide/sqrt.

The Ivy Bridge situation is really interesting. With add and mul, Sandy Bridge already allows quite good instruction-level parallelism. If one result does not depend directly on the operations before it, one can fill the pipeline very well and get nearly one result per clock, I suppose.

Can the Ivy Bridge optimizations also be found in the Intrinsics Guide? I cannot find the appropriate CPUID. If 06_2A is Sandy Bridge, then according to the table from http://software.intel.com/en-us/articles/intel-architecture-and-processo..., Ivy Bridge should have 06_3A. But I can't find it in the Intrinsics Guide for any instructions (I have not checked every one, only those that are important to me).

Quote:

Sergey Kostrov wrote:

Christian,

Let me know if you need real performance numbers for different sqrt functions and floating-point types ( 6 tests in total / 5 different C++ compilers ). I can do it for Intel Core i7-3840QM ( Ivy Bridge / 4 cores ) and older CPUs, for example Intel Pentium 4.

This would be great! I am especially interested in the performance of the precise square-root operation. Different CPUs would be a good indicator. I wonder whether the results also differ within a CPU family.

You mentioned that I should "map" these numbers to codes in the manual. Which manual are you talking about exactly?

>>... Which manual are you talking about exactly?

Please take a look at: http://www.intel.com/content/www/us/en/processors/architectures-software...

Here are a couple of more links & tips:

- You need to look at Intel 64 and IA-32 Architectures Optimization Reference Manual, APPENDIX C, INSTRUCTION LATENCY AND THROUGHPUT

- Try to use msinfo32.exe utility ( it provides some CPU information )

- http://ark.intel.com -> http://ark.intel.com/products/52224/Intel-Core-i5-2410M-Processor-3M-Cac...

Note: Take a look at a datasheet for your i5-2410M CPU in a Quick Links section ( on the right side of the web page )

- http://software.intel.com/en-us/forums/topic/278742

>>>I am especially interested in the performance of the precise square root operation>>>

Here you have a very interesting discussion about hardware-accelerated sqrt calculation:

http://stackoverflow.com/questions/1528727/why-is-sse-scalar-sqrtx-slowe...


  >>>With add and mul Sandy Bridge already allows quite good instruction level parallelism>>>

Sandy Bridge really improved instruction-level parallelism by adding one or two new ports to the execution cluster. So, for example, when your code has an fp add (one vector addition) and an fp mul (one vector multiplication) that do not depend on each other, they can be executed simultaneously.

>>>I am especially interested in the performance of the precise square root operation.>>>

>>>Follow this link : http://stackoverflow.com/questions/1528727/why-is-sse-scalar-sqrtx-slowe...>>>

These imprecise operations are available via Intel compiler options:

/Qimf-accuracy-bits:bits[:funclist]
          define the relative error, measured by the number of correct bits,
          for math library function results
            bits     - a positive, floating-point number
            funclist - optional comma separated list of one or more math
                       library functions to which the attribute should be
                       applied

So you can request the 13-bit accuracy implementation of divide and sqrt. Iterative methods with less than full precision can be produced by requesting 20-, 40-, or 49-bit accuracy. 22-bit accuracy is the default for single-precision vectorization; -Qprec-div -Qprec-sqrt (implied by /fp:source|precise) changes the default to 24/53-bit accuracy. Beginning with Harpertown, the IEEE instructions, referred to as "native" in your references, have been quite competitive for SSE/SSE2. The original Core 2 Duo with the slower divide and sqrt is no longer in production. I turned mine in after 4.5 years rather than re-install Windows a 4th time.

The x87 divide and sqrt also support a trade-off between speed and precision, by setting 24-, 53- (default for Intel and Microsoft compilers) or 64- (hardware default, /Qpc80) bit precision mode.

You also have the choice, since SSE, of gradual underflow (/Qftz-) to maintain precision in the presence of partial underflow.  Sandy Bridge removes the performance penalty for /Qftz- in most common situations.  This was done in part because it's not convenient to set abrupt underflow when using Microsoft or gnu compilers.

All these options are more than most developers are willing to bargain for (and QA test).  That's one of the reasons for availability of IEEE standard compliant instructions and for progress at the hardware level in making them more efficient.

>>>Follow this link : http://stackoverflow.com/questions/1528727/why-is-sse-scalar-sqrtx-slowe...

Quite an interesting discussion; it provides a lot of information.

I found the following discussion about square root and AVX: http://stackoverflow.com/questions/8924729/using-avx-intrinsics-instead-...

One commenter mentions something about instruction emulation. Is it true that a low-end processor (let's take an i3 Sandy Bridge) has different, or fewer, execution units than an i7 Sandy Bridge?

>>>These imprecise operations are available via Intel compiler options ...

Wow, this information is new to me. I did not know one could control accuracy.

>>> Beginning with Harpertown, the IEEE instructions, referred to as "native" in your references, have been quite competitive for SSE/SSE2.

So this was the first time the IEEE-compliant instructions provided quite good speed compared to the other SSE/SSE2 approaches?

And regarding x87: I found that some compilers use only the x87 FPU in 32-bit mode, while the same code compiled for 64-bit mode uses SSE (scalar version only). Can this also be controlled? For some algorithms high accuracy might be useful; the x87 FPU provides the most precision with 80 bits, which cannot be achieved with SSE anymore.

>>>For some algorithms high accuracy might be useful. x87 fpu provides most precision with 80 bit. This can not be achieved with SSE any more.>>>

Yes, because it is the developer's decision and/or the project constraints that favor precision over vectorization of the code.

>>>Is it true that low end processor (lets take an i3 Sandy Bridge) has other execution units or less than an i7 Sandy Bridge?>>>

I'm not sure if a Core i3 has fewer execution units than a Core i7. I think the main differences are cache size, TDP, the number of physical and logical cores (HT), and more aggressive overclocking.

>>>>These imprecise operations are available via Intel compiler options...

That is correct. However, from my point of view and experience, a more flexible way to control precision is precision control at run-time.

>>
>>Wow, this information is quite new. I did not know one could control accuracy...

Please take a look at a _control87 CRT-function.

Note: We recently had a very good discussion regarding precision issues and, if interested, please take a look at:

Forum topic: Mathimf and windows
Web-link: http://software.intel.com/en-us/forums/topic/357759


Sorry, I forgot to specify a forum's name...

>>Note: We recently had a very good discussion regarding precision issues and, if interested, please take a look at:
>>
>>Forum topic: Mathimf and windows

It is in Intel C++ compiler forum.

>>...Ivy Bridge should have 06_3A. But I can't find it in the Intrinsics Guide for any instructions (I have not checked
>>every one, only those that are important to me)...

Christian, Please take a look at Table 3-18. Highest CPUID Source Operand for Intel 64 and IA-32 Processors ( page 212 ) in

Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 2 (2A, 2B & 2C): Instruction Set Reference, A-Z

Order Number: 325383-044US
August 2012

>>...Ivy Bridge should have 06_3A. But I can't find it in the Intrinsics Guide for any instructions (I have not checked
>>every one, only those that are important to me)...

This is what my CPUID test case displays:
...
CPU Brand String: Intel(R) Core(TM) i7-3840QM CPU @ 2.80GHz
CPU Vendor: GenuineIntel
Stepping ID = 9
Model = 10
Family = 6
Extended Model = 3
...

>>>>...Let me know if you need real performance numbers for different sqrt functions and floating-point types...
>>>>
>>This would be great! I am especially interested on the performance of the precise square root operation. Different CPUs would be
>>a good indicator...

In general, all tests are based on the following for loop:
...
int iNumberOfIterations = 16777216; // 2^24

g_uiTicksStart = ::GetTickCount();
for( int t = 0; t < iNumberOfIterations; t++ )
{
...
}
g_uiTicksEnd = ::GetTickCount();
printf( RTU(" - %ld ticks\n"), ( int )( g_uiTicksEnd - g_uiTicksStart ) );
...
for Microsoft C++ compiler, Debug and Release configurations, and without any optimizations.

[ Microsoft C++ compiler / Debug configurations ]

CPU: Intel(R) Core(TM) i7-3840QM CPU @ 2.80GHz

User Sqrt - RTfloat
Calculating the Square Root of 625.000 - 296 ticks
625.000^0.5 = 25.000

User Sqrt - RTdouble
Calculating the Square Root of 625.000 - 281 ticks
625.000^0.5 = 25.000

CrtSqrt - RTfloat
Calculating the Square Root of 625.000 - 577 ticks
625.000^0.5 = 25.000

CrtSqrt - RTdouble
Calculating the Square Root of 625.000 - 593 ticks
625.000^0.5 = 25.000

HrtSqrt - RTdouble
Calculating the Square Root of 625.000 - 593 ticks
625.000^0.5 = 25.000

SSE Sqrt - RTfloat
Calculating the Square Root of 625.000 - 343 ticks
625.000^0.5 = 25.000

F32vec4 class - RTfloat
Calculating the Square Root of 625.000 - 3011 ticks
625.000^0.5 = 25.000

CPU: Intel(R) Pentium(R) 4 CPU 1.60GHz

User Sqrt - RTfloat
Calculating the Square Root of 625.000 - 984 ticks
625.000^0.5 = 25.000

User Sqrt - RTdouble
Calculating the Square Root of 625.000 - 969 ticks
625.000^0.5 = 25.000

CrtSqrt - RTfloat
Calculating the Square Root of 625.000 - 2422 ticks
625.000^0.5 = 25.000

CrtSqrt - RTdouble
Calculating the Square Root of 625.000 - 2500 ticks
625.000^0.5 = 25.000

HrtSqrt - RTdouble
Calculating the Square Root of 625.000 - 2672 ticks
625.000^0.5 = 25.000

SSE Sqrt - RTfloat
Calculating the Square Root of 625.000 - 1406 ticks
625.000^0.5 = 25.000

F32vec4 class - RTfloat
Calculating the Square Root of 625.000 - 11187 ticks
625.000^0.5 = 25.000

[ Microsoft C++ compiler / Release configurations ]

CPU: Intel(R) Core(TM) i7-3840QM CPU @ 2.80GHz

User Sqrt - RTfloat
Calculating the Square Root of 625.000 - 281 ticks
625.000^0.5 = 25.000

User Sqrt - RTdouble
Calculating the Square Root of 625.000 - 297 ticks
625.000^0.5 = 25.000

CrtSqrt - RTfloat
Calculating the Square Root of 625.000 - 93 ticks
625.000^0.5 = 25.000

CrtSqrt - RTdouble
Calculating the Square Root of 625.000 - 94 ticks
625.000^0.5 = 25.000

HrtSqrt - RTdouble
Calculating the Square Root of 625.000 - 93 ticks
625.000^0.5 = 25.000

SSE Sqrt - RTfloat
Calculating the Square Root of 625.000 - 318 ticks
625.000^0.5 = 25.000

F32vec4 class
Calculating the Square Root of 625.000 - 406 ticks
625.000^0.5 = 25.000

CPU: Intel(R) Pentium(R) 4 CPU 1.60GHz

User Sqrt - RTfloat
Calculating the Square Root of 625.000 - 985 ticks
625.000^0.5 = 25.000

User Sqrt - RTdouble
Calculating the Square Root of 625.000 - 969 ticks
625.000^0.5 = 25.000

CrtSqrt - RTfloat
Calculating the Square Root of 625.000 - 406 ticks
625.000^0.5 = 25.000

CrtSqrt - RTdouble
Calculating the Square Root of 625.000 - 406 ticks
625.000^0.5 = 25.000

HrtSqrt - RTdouble
Calculating the Square Root of 625.000 - 406 ticks
625.000^0.5 = 25.000

SSE Sqrt - RTfloat
Calculating the Square Root of 625.000 - 1422 ticks
625.000^0.5 = 25.000

F32vec4 class - RTfloat
Calculating the Square Root of 625.000 - 1953 ticks
625.000^0.5 = 25.000

>>...I wonder whether the results also differ within a CPU family...

I can't verify that. However, when it comes to precision: if 53-bit precision is set, then the results must be the same on all CPUs.

In the particular case where your operands can be expressed exactly in 12 bits of precision, it seems that your accuracy doesn't vary among these methods. The accuracy of the sqrt reciprocal approximation varies between AMD CPU families, but I think Intel tried to keep it the same.

If you wished to test accuracy of sqrt without going through an exhaustive list of cases, you could try something like the Paranoia benchmark.

The earliest AMD families had a 14-bit approximation which would be sufficient to obtain 52 bits after 2 iterations; this has been considered at Intel but I don't know of it ever being adopted.

Thanks for posting sqrt(x) test case.

What is this sqrt(x) implementation "User Sqrt - RTfloat"?

Do you have results for SSE sqrt(x) where x = double primitive type?

>>What is this sqrt(x) implementation "User Sqrt - RTfloat"?..

It is based on a classic iterative method and I'll provide more details later.

>>Do you have results for SSE sqrt(x) where x = double primitive type?..

No. If you decide to test it you will need to use:

__m128d _mm_sqrt_pd( __m128d )

Note: It is the same as SQRTPD instruction.

Hi everybody. The next three test results demonstrate what the latest version of the Intel C++ compiler can do...

[ Intel C++ compiler 13.0.089 / 32-bit / Release Configuration ]

CPU: Intel(R) Core(TM) i7-3840QM CPU @ 2.80GHz

Optimization: Maximize Speed (/O2)
Code Generation:
Add Processor-Optimized Code Path: Intel(R) Core(TM) processor family with Intel(R) Advanced Vector Extensions support (/QaxAVX)
Intel Processor-Specific Optimization: Intel(R) Core(TM) processor family with Intel(R) Advanced Vector Extensions support (/QxAVX)

Floating Point Model: Precise (/fp:precise)

User Sqrt - RTfloat
Calculating the Square Root of 625.000 - 265 ticks
625.000^0.5 = 25.000

User Sqrt - RTdouble
Calculating the Square Root of 625.000 - 203 ticks
625.000^0.5 = 25.000

CrtSqrt - RTfloat
Calculating the Square Root of 625.000 - 93 ticks
625.000^0.5 = 25.000

CrtSqrt - RTdouble
Calculating the Square Root of 625.000 - 94 ticks
625.000^0.5 = 25.000

HrtSqrt - RTdouble
Calculating the Square Root of 625.000 - 93 ticks
625.000^0.5 = 25.000

SSE Sqrt - RTfloat
Calculating the Square Root of 625.000 - 47 ticks
625.000^0.5 = 25.000

F32vec4 class - RTfloat
Calculating the Square Root of 625.000 - 47 ticks
625.000^0.5 = 25.000

Note 1: 47 ticks for 2^24 iterations!
Note 2: 1 sec is 1000 ticks.

[ Intel C++ compiler 13.0.089 / 32-bit / Release Configuration ]

CPU: Intel(R) Core(TM) i7-3840QM CPU @ 2.80GHz

Optimization: Maximize Speed (/O2)
Code Generation:
Add Processor-Optimized Code Path: Intel(R) Core(TM) processor family with Intel(R) Advanced Vector Extensions support (/QaxAVX)
Intel Processor-Specific Optimization: Intel(R) Core(TM) processor family with Intel(R) Advanced Vector Extensions support (/QxAVX)

Floating Point Model: Fast (/fp:fast)

User Sqrt - RTfloat
Calculating the Square Root of 625.000 - 140 ticks
625.000^0.5 = 25.000

User Sqrt - RTdouble
Calculating the Square Root of 625.000 - 188 ticks
625.000^0.5 = 25.000

CrtSqrt - RTfloat
Calculating the Square Root of 625.000 - 93 ticks
625.000^0.5 = 25.000

CrtSqrt - RTdouble
Calculating the Square Root of 625.000 - 94 ticks
625.000^0.5 = 25.000

HrtSqrt - RTdouble
Calculating the Square Root of 625.000 - 93 ticks
625.000^0.5 = 25.000

SSE Sqrt - RTfloat
Calculating the Square Root of 625.000 - 47 ticks
625.000^0.5 = 25.000

F32vec4 class - RTfloat
Calculating the Square Root of 625.000 - 47 ticks
625.000^0.5 = 25.000

[ Intel C++ compiler 13.0.089 / 32-bit / Release Configuration ]

CPU: Intel(R) Core(TM) i7-3840QM CPU @ 2.80GHz

Optimization: Maximize Speed (/O2)
Code Generation:
Add Processor-Optimized Code Path: Intel(R) Core(TM) processor family with Intel(R) Advanced Vector Extensions support (/QaxAVX)
Intel Processor-Specific Optimization: Intel(R) Core(TM) processor family with Intel(R) Advanced Vector Extensions support (/QxAVX)

Floating Point Model: Fast=2 (/fp:fast=2) [Intel C++]

User Sqrt - RTfloat
Calculating the Square Root of 625.000 - 140 ticks
625.000^0.5 = 25.000

User Sqrt - RTdouble
Calculating the Square Root of 625.000 - 187 ticks
625.000^0.5 = 25.000

CrtSqrt - RTfloat
Calculating the Square Root of 625.000 - 94 ticks
625.000^0.5 = 25.000

CrtSqrt - RTdouble
Calculating the Square Root of 625.000 - 94 ticks
625.000^0.5 = 25.000

HrtSqrt - RTdouble
Calculating the Square Root of 625.000 - 93 ticks
625.000^0.5 = 25.000

SSE Sqrt - RTfloat
Calculating the Square Root of 625.000 - 47 ticks
625.000^0.5 = 25.000

F32vec4 class - RTfloat
Calculating the Square Root of 625.000 - 46 ticks
625.000^0.5 = 25.000

>>>>What is this sqrt(x) implementation "User Sqrt - RTfloat"?..
>>
>>It is based on a classic iterative method and I'll provide more details later.

Here it is:
...
RTint iNumberOfIterations = _RTNUMBER_OF_TESTS_0016777216; // 2^24

// Sub-Test 1 - User Sqrt - RTfloat
{
    CrtPrintf( RTU("User Sqrt - RTfloat\n") );
    RTfloat fA = 625.00f;
    RTfloat fG = 625.00f;

    RTfloat fQ = 0.0f;

    CrtPrintf( RTU("Calculating the Square Root of %.3f"), fA );

    g_uiTicksStart = SysGetTickCount();
    for( RTint t = 0; t < iNumberOfIterations; t++ )
    {
        fQ = 0.0f; // was 0.0L; a long double literal makes no sense for an RTfloat

        // Note: fG is not reset here, so after the first pass the converged
        // guess is reused and the while loop exits after a single iteration.
        while( RTtrue )
        {
            if( ( fQ - fG ) > -0.00001f )
                break;
            fQ = fA / fG;
            fG = ( 0.5f * fG + 0.5f * fQ );
        }
    }
    CrtPrintf( RTU(" - %ld ticks\n"), ( RTint )( SysGetTickCount() - g_uiTicksStart ) );
    CrtPrintf( RTU("%.3f^0.5 = %.3f\n"), fA, fG );
}
...

>>>SSE Sqrt - RTfloat
Calculating the Square Root of 625.000 - 47 ticks
625.000^0.5 = 25.000>>>

It is interesting: which of the sqrt calculation methods does the hardware-accelerated SSE instruction use?

>>...which of the sqrt calculation methods does hardware accelerated SSE instruction use?

The last two, SSE Sqrt and F32vec4 class. I will do an evaluation of AVX sqrt-intrinsic functions:
...
/*
* Square Root of Double-Precision Floating-Point Values
* **** VSQRTPD ymm1, ymm2/m256
* Performs an SIMD computation of the square roots of the two or four packed
* double-precision floating-point values in the source operand and stores
* the packed double-precision floating-point results in the destination
*/
extern __m256d __cdecl _mm256_sqrt_pd(__m256d a);

/*
* Square Root of Single-Precision Floating-Point Values
* **** VSQRTPS ymm1, ymm2/m256
* Performs an SIMD computation of the square roots of the eight packed
* single-precision floating-point values in the source operand and stores the
* packed single-precision floating-point results in the destination
*/
extern __m256 __cdecl _mm256_sqrt_ps(__m256 a);
...

some time later, and I'd like to verify Christian's statement: '...The thing is that AVX shows no improvement over SSE...'

Christian, have you seen this picture on the Wiki: en.wikipedia.org/wiki/Ivy_Bridge_(microarchitecture)#Roadmap ?

>>...Has performance of this commands improved in Ivy Bridge?

As I promised, I'll do a verification and the results will be posted (unfortunately, only for Ivy Bridge).

>>>The last two, SSE Sqrt and F32vec4 class. I will do an evaluation of AVX sqrt-intrinsic functions:>>>

It seems that I formulated my question wrongly. I wanted to ask which of the mathematical algorithms used to calculate sqrt is implemented in hardware/microcode by the SSE sqrt instructions. I have found the paper "Fast Floating Point Square Root".

>>...which of the mathematical algorithms used to calculated sqrt is implemented in hardware/microcode by
>>the SSE sqrt instructions...

I won't be surprised if it is a highly optimized version of the Newton-Raphson square-root algorithm, and it would be nice to hear from Intel software engineers.

>>...As I promised I'll do a verification and results will be posted ( unfortunately, only for Ivy Bridge )...

Iliya, do you have a computer with a CPU that supports AVX? I need an independent verification of my test results, and I really have lots of questions for Christian with regard to his statement:

...The thing is that AVX shows no improvement over SSE...

...The thing is that AVX shows no improvement over SSE...

Christian, how did you come to that conclusion? Could you follow up, please?

My test results show that AVX-based sqrt is ~6x faster than SSE-based sqrt.

>>>Iliya, do you have a computer with a CPU that support AVX? I need an independent verification of my test results and I really have lots of questions to Christian with regard to his statement:>>>

Sorry Sergey, but I still have only a Core i3. I can run your tests for SSE verification only.

>>...which of the mathematical algorithms used to calculated sqrt is implemented in hardware/microcode by >> the SSE sqrt instructions...

>>>I won't be surprised if it is a highly optimized version of Newton-Raphson Square Root algorithm and it would be nice to hear from Intel software engineers.>>>

Yes, I thought the same. Looking at the algorithm, one can see that it performs a costly (for the hardware) division on every iteration, so I think Intel engineers probably optimized that part of the algorithm.

>>>CrtSqrt - RTdouble Calculating the Square Root of 625.000 - 94 ticks 625.000^0.5 = 25.000>>>

An interesting case is the CRT sqrt function, which is slower than its SSE and AVX counterparts. I suppose that, when disassembled, it calls the x87 FSQRT instruction, which itself has a latency of 10-24 core clock cycles (as reported by Agner Fog's tables). It would be nice to test FSQRT accuracy against the AVX VSQRTPD result. FSQRT can use long-double precision for the intermediate calculation stage in order to diminish rounding errors and preserve the accuracy of the result. The longer execution time of the library sqrt function is probably due to additional C code that wraps the FSQRT instruction and performs input checking.

@Sergey, can you force the compiler to inline calls to the CRT sqrt function?

>>...CRT sqrt function which is slower than SSE and AVX counterparts. I suppose when disassembled it calls fsqrt x87 instruction
>>which itself has the latency of 10-24 core clock cycles(as reported by Agner tables).

There are two issues: call overhead ( parameter verification, etc ) and it could ( possibly ) depend on the setting of the _set_SSE2_enable function ( I didn't verify it ).

>>It would be nice to test the FSQRT accuracy against the AVX VSQRTPD result.

Yes, but this is another set of tests and I won't have time for it.

>>FSQRT can use long double precision types for intermediate calculation stage in order to diminish rounding errors and
>>to preserve accuracy of the result.

HrtSqrt is actually based on it.

>>Longer execution time of Library sqrt function is probably due to additional C code which wraps FSQRT instruction and
>>performes an input checking.

Yes and this is what I called '...a call overhead...' before.

>>...can you force compiler to inline calls to CRT sqrt function?..

Yes, but it won't improve performance significantly (!) since '...parameters verifications, etc...' must be done anyway in the testing for loop.

>>...My test results show that AVX-based sqrt is ~6x faster than SSE-based sqrt.

Here are results:

[ Intel C++ compiler 13.0.089 / 32-bit / Release Configuration ]

CPU: Intel(R) Core(TM) i7-3840QM CPU @ 2.80GHz

Optimization: Maximize Speed (/O2)
Code Generation:
Add Processor-Optimized Code Path: Intel(R) Core(TM) processor family with Intel(R) Advanced Vector Extensions support (/QaxAVX)
Intel Processor-Specific Optimization: Intel(R) Core(TM) processor family with Intel(R) Advanced Vector Extensions support (/QxAVX)

Floating Point Model: Fast=2 (/fp:fast=2) [Intel C++]

User Sqrt - RTfloat
Calculating the Square Root of 625.000 - 140 ticks
625.000^0.5 = 25.000

User Sqrt - RTdouble
Calculating the Square Root of 625.000 - 188 ticks
625.000^0.5 = 25.000

CrtSqrt - RTfloat
Calculating the Square Root of 625.000 - 94 ticks
625.000^0.5 = 25.000

CrtSqrt - RTdouble
Calculating the Square Root of 625.000 - 93 ticks
625.000^0.5 = 25.000

HrtSqrt - RTdouble
Calculating the Square Root of 625.000 - 94 ticks
625.000^0.5 = 25.000

SSE Sqrt - RTfloat
Calculating the Square Root of 625.000 - 47 ticks ( to calculate square roots for 4 floats )
625.000^0.5 = 25.000

F32vec4 class - RTfloat
Calculating the Square Root of 625.000 - 47 ticks ( to calculate square roots for 4 floats )
625.000^0.5 = 25.000

AVX Sqrt - RTfloat
Calculating the Square Root of 625.000 - 15 ticks ( to calculate square roots for 8 floats )
625.000^0.5 = 25.000

>>...My test results show that AVX-based sqrt is ~6x faster than SSE-based sqrt.

>>...
>>SSE Sqrt - RTfloat
>>Calculating the Square Root of 625.000 - 47 ticks ( to calculate square roots for 4 floats )
>>625.000^0.5 = 25.000
>>...
>>AVX Sqrt - RTfloat
>>Calculating the Square Root of 625.000 - 15 ticks ( to calculate square roots for 8 floats )
>>625.000^0.5 = 25.000

This is how I did the assessment:

- Normalization factor is 2 = 8 ( floats ) / 4 ( floats ).
- Then, ( 47 ( ticks ) / 15( ticks ) ) * 2 ~= 6

>>>- Normalization factor is 2 = 8 ( floats ) / 4 ( floats ).
- Then, ( 47 ( ticks ) / 15( ticks ) ) * 2 ~= 6>>>

Thanks for clarifying this. I was wondering how you got a 6x improvement in speed of execution.
