Performance of sqrt

Btw, the hardware-accelerated AVX and SSE sqrt implementations also perform input checking and validation, but this is done at the microcode/hardware level, and the latency is lower (as expected) than for the library sqrt (where instruction decoding and dispatch to the execution units take some time).

>>>There are two issues: a call overhead ( parameters verifications, etc )>>>

@Sergey
It seems to me that my last message was improperly formatted. The double post above relates to the quoted sentence.

>>...I was wondering how did you get a 6x improvement in speed of execution...

Do you want me to post a test case for SSE and AVX sqrt intrinsics?

Quote:

Sergey Kostrov wrote:

>>...I was wondering how did you get a 6x improvement in speed of execution...

Do you want me to post a test case for SSE and AVX sqrt intrinsics?

No thanks. I looked at your explanation and I understood how it was calculated.

Sorry for the late answer,

thanks for all the input. As to the Wikipedia link: does this mean Ivy Bridge is Sandy Bridge with only a shrunk structure (and maybe some minor improvements)?

As to the tests: I read the answers in the thread once again in more detail. But I think you get AVX to be a lot faster than SSE. Should I test my test code? Or let me run your code on my machine? Maybe my test system with a Sandy Bridge i5-2410M is slower?

In addition, my test results are quite stable: I tested VS2010, VS2012 and the Intel compiler (a week ago, so not the newest version; I think there was an update in the last few days). In all of them I get a speedup of 2 for SSE and AVX when double is used and 4 for float. I always used the standard Release configuration, 32-bit and 64-bit, plus two other configurations with /arch:AVX for the MS compiler and /QxAVX and /QaxAVX for the Intel compiler.

Or is it because of my time measurement? I use the function clock() and calculate the difference between two variables with static_cast<double>(End - Start) / static_cast<double>(CLOCKS_PER_SEC).
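For readers who want to reproduce the measurement, here is a minimal sketch of the clock()-based timing described above. The helper names elapsed_seconds and time_scalar_sqrt are hypothetical and not taken from the original test project; the loop is the "normal version" shown further down, and the per-repetition times are summed as the post describes.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <ctime>
#include <vector>

// Converts two clock() readings to seconds, using the formula quoted in the post.
double elapsed_seconds(std::clock_t start, std::clock_t end)
{
    return static_cast<double>(end - start) / static_cast<double>(CLOCKS_PER_SEC);
}

// Hypothetical helper: runs the scalar sqrt loop `repetitions` times and
// returns the sum of the measured times in seconds.
double time_scalar_sqrt(const std::vector<double>& input,
                        std::vector<double>& result,
                        int repetitions)
{
    assert(input.size() == result.size());
    double total = 0.0;
    for (int r = 0; r < repetitions; ++r)
    {
        const std::clock_t start = std::clock();
        for (std::size_t k = 0; k < input.size(); ++k)
        {
            result[k] = std::sqrt(input[k]);
        }
        const std::clock_t end = std::clock();
        total += elapsed_seconds(start, end);
    }
    return total;
}
```

Note that clock() in the Microsoft CRT reports elapsed wall-clock time, whereas on POSIX systems it reports processor time, so numbers from different platforms are not directly comparable.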

My normal version:

    for (size_t k = 0; k < mInput.size(); k++)
    {
        mResult[k] = std::sqrt(mInput[k]);
    }

My SSE version:

    // Doubles per 128-bit register (2); the loop assumes mInput.size() is a multiple of incr.
    size_t const incr = 128 / (8 * sizeof(double));

    for (size_t k = 0; k < mInput.size(); k += incr)
    {
        __m128d val = _mm_loadu_pd(&mInput[k]);

        val = _mm_sqrt_pd(val);

        _mm_storeu_pd(&mResult[k], val);
    }

My AVX version:

    // Doubles per 256-bit register (4); the loop assumes mInput.size() is a multiple of incr.
    size_t const incr = 256 / (8 * sizeof(double));

    for (size_t k = 0; k < mInput.size(); k += incr)
    {
        __m256d val = _mm256_loadu_pd(&mInput[k]);

        val = _mm256_sqrt_pd(val);

        _mm256_storeu_pd(&mResult[k], val);
    }
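The SSE and AVX excerpts above step through the vector in whole registers, which is fine for the 5000000-element input used here (a multiple of both 2 and 4). For other sizes the last iteration would read and write past the end. A minimal sketch of the SSE variant with a scalar tail follows; it assumes an x86 target with SSE2, and the function name sqrt_sse2 is hypothetical, not part of the posted test code.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <emmintrin.h>  // SSE2: _mm_loadu_pd, _mm_sqrt_pd, _mm_storeu_pd
#include <vector>

// Hypothetical helper: square root of every element, two doubles at a time,
// with a scalar loop for the elements that do not fill a whole register.
void sqrt_sse2(const std::vector<double>& input, std::vector<double>& result)
{
    assert(input.size() == result.size());

    const std::size_t incr = 128 / (8 * sizeof(double));  // 2 doubles per register
    const std::size_t n = input.size();
    const std::size_t vec_end = n - (n % incr);            // last index covered by full registers

    for (std::size_t k = 0; k < vec_end; k += incr)
    {
        __m128d val = _mm_loadu_pd(&input[k]);   // unaligned load, as in the post
        val = _mm_sqrt_pd(val);
        _mm_storeu_pd(&result[k], val);
    }

    // Scalar tail: at most incr - 1 elements.
    for (std::size_t k = vec_end; k < n; ++k)
    {
        result[k] = std::sqrt(input[k]);
    }
}
```

The same pattern applies to the AVX loop with an increment of 4.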

Additional information:

size of input vector: 5000000

repeating test: 5000

I summed up the times of all test repetitions. Some additional code is used to avoid "wrong" compiler optimization. In the beginning the code was nearly optimized away because I did not use the result data.

So one more post from me in a row:

I also checked the disassembly of my code, but there was nothing unexpected.

As to the CRT sqrt function: simply call std::sqrt, debug it and go to the disassembly window. Now single-step. This way you can get into the CRT assembler code. There is a lot of input checking done before sqrt is called.

What code exactly is generated differs with the settings. VS2010, for example, generates FPU instructions for 32-bit and scalar SSE for 64-bit. The Intel compiler generates scalar SSE for both 32-bit and 64-bit.

It looks like you may have made the test big enough to exceed cache, so it might be expected to be bandwidth limited.

Did you take care to make at least the store aligned?  If the loads are unaligned, and you don't have at least corei7-3, splitting them explicitly into 128-bit loads is expected to be faster (although maybe not worth the effort if you want readable intrinsics).

As you commented earlier, now that you have shown code excerpts, there is nothing here to produce better performance with AVX on current platforms.

Thanks for the feedback and the test case!

>>...Should I test my test code? Or let me run your code in my machine? Maybe my test system with Sandy Bridge
>>i5-2410M is slower?

I'm not trying to discredit your results; I would simply like to see that Intel's AVX provides an advantage over SSE.

I think the best way to proceed is a consolidated set of tests ( SSE vs. AVX ) in a Visual Studio project and I'll take care of it. I'll upload the project as soon as it is ready.

>>...Some additional code is used to avoid "wrong" compiler optimization. In the beginning code was optimized nearly away as
>>I did not use result data...

I had the same problem and I'll create a new thread on Intel C++ compiler forum some time later.

>>>As to the CRT sqrt function: simply call std::sqrt, debug it and go to the disassembly window. Now single-step. This way you can get into the CRT assembler code. There is a lot of input checking done before sqrt is called>>>

Thanks for your advice, but I prefer to work with the IDA Pro disassembler and WinDbg.

>>>What code exactly is generated differs with the settings. VS2010, for example, generates FPU instructions for 32-bit and scalar SSE for 64-bit. The Intel compiler generates scalar SSE for both 32-bit and 64-bit>>>

That's true.

>>What code exactly is generated differs with the settings. VS2010, for example, generates FPU instructions for 32-bit and scalar SSE for 64-bit. The Intel compiler generates scalar SSE for both 32-bit and 64-bit>>>

At least since VS2005, there has been an /arch:SSE2 option. VS2012 adds the /arch:AVX option and limited use of parallel SSE2 or AVX instructions.

Thanks Tim

Quote:

TimP (Intel) wrote:

It looks like you may have made the test big enough to exceed cache, so it might be expected to be bandwidth limited.

Did you take care to make at least the store aligned?  If the loads are unaligned, and you don't have at least corei7-3, splitting them explicitly into 128-bit loads is expected to be faster (although maybe not worth the effort if you want readable intrinsics).

As you commented earlier, now that you have shown code excerpts, there is nothing here to produce better performance with AVX on current platforms.

I tried with other data sizes, too. This did not have much influence. I reduced the data amount to 1 MB (a third of the 3 MB L3 cache). There AVX was 3% faster than SSE.

Sorry, I forgot to mention: all data for all tests was aligned to 32 bytes. I used a user-defined allocator to ensure this for the STL vector container.
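The allocator itself was not posted at this point in the thread, so the following is only a sketch of how 32-byte alignment for a std::vector could be arranged. The names AlignedAllocator and is_aligned are hypothetical, and the over-allocation technique is an assumption chosen for portability to C++11; it is not necessarily what the original aligned_alloc.h does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <vector>

// Hypothetical allocator: over-allocates, rounds the address up to a multiple
// of Alignment, and stores the raw pointer just in front of the aligned block.
template <typename T, std::size_t Alignment = 32>
struct AlignedAllocator
{
    static_assert((Alignment & (Alignment - 1)) == 0, "Alignment must be a power of two");
    static_assert(Alignment >= sizeof(void*), "Alignment must hold a pointer slot");

    using value_type = T;

    AlignedAllocator() noexcept = default;
    template <typename U>
    AlignedAllocator(const AlignedAllocator<U, Alignment>&) noexcept {}

    // Needed because of the non-type template parameter.
    template <typename U>
    struct rebind { using other = AlignedAllocator<U, Alignment>; };

    T* allocate(std::size_t n)
    {
        const std::size_t bytes = n * sizeof(T) + Alignment + sizeof(void*);
        void* raw = ::operator new(bytes);
        const std::uintptr_t start = reinterpret_cast<std::uintptr_t>(raw) + sizeof(void*);
        const std::uintptr_t mask = static_cast<std::uintptr_t>(Alignment - 1);
        const std::uintptr_t aligned = (start + mask) & ~mask;
        reinterpret_cast<void**>(aligned)[-1] = raw;  // remember what to free
        return reinterpret_cast<T*>(aligned);
    }

    void deallocate(T* p, std::size_t) noexcept
    {
        ::operator delete(reinterpret_cast<void**>(p)[-1]);
    }
};

template <typename T, typename U, std::size_t A>
bool operator==(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) noexcept { return true; }

template <typename T, typename U, std::size_t A>
bool operator!=(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) noexcept { return false; }

// Hypothetical check used to confirm the alignment of a buffer.
bool is_aligned(const void* p, std::size_t alignment)
{
    assert(alignment != 0);
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

With such an allocator, a std::vector<double, AlignedAllocator<double> > has a data() pointer that is a multiple of 32, which is what the aligned load and store intrinsics require.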

Would using prefetch increase performance a little? I think I saw an example of that in the forum here some time ago.

Quote:

iliyapolak wrote:

>>>As to the CRT sqrt function: simply call std::sqrt, debug it and go to the disassembly window. Now single-step. This way you can get into the CRT assembler code. There is a lot of input checking done before sqrt is called>>>

Thanks for your advice, but I prefer to work with the IDA Pro disassembler and WinDbg.

I will check out that tool. But I think it is not free.

Quote:

Sergey Kostrov wrote:

Thanks for the feedback and the test case!

>>...Should I test my test code? Or let me run your code in my machine? Maybe my test system with Sandy Bridge
>>i5-2410M is slower?

I'm not trying to discredit your results; I would simply like to see that Intel's AVX provides an advantage over SSE.

I think the best way to proceed is a consolidated set of tests ( SSE vs. AVX ) in a Visual Studio project and I'll take care of it. I'll upload the project as soon as it is ready.

I do not want to discredit your tests either. It is just interesting that our results differ that much. At first I was surprised by my results. Later I found the link I already provided. I am doing some research for a student research project, so I just wanted to make sure my results agree with other people's results.

With an FIR filter, for example, I get quite a good speedup from AVX: twice as fast for double and nearly twice as fast for float. But as soon as sqrt is involved in the algorithm, AVX might not gain more than 10% in speed.

Please let me know if you open another thread for the test you mentioned, or post the link here.

Maybe the CPU documentation gives some hints. One processor might have a lower latency than another for the same instruction.

// EDIT:
I could not find instruction latency information when I go to my processor's page and then to the datasheets. There are volumes 1 and 2, but neither of them mentions anything like instruction latency. I only found some information at http://www.agner.org/optimize/instruction_tables.pdf, but Ivy Bridge is not listed there.

// EDIT: Google is your friend, I think this is quite helpful: http://www.intel.com/content/dam/doc/manual/64-ia-32-architectures-optim...

Here I get latency and throughput for different CPU families. And what is important: for AVX, Ivy Bridge has nearly twice the throughput for division and square root compared to Sandy Bridge.

Quote:

Sergey Kostrov wrote:

Here are a couple of more links & tips:

- You need to look at Intel 64 and IA-32 Architectures Optimization Reference Manual, APPENDIX C, INSTRUCTION LATENCY AND THROUGHPUT

- Try to use msinfo32.exe utility ( it provides some CPU information )

- http://ark.intel.com -> http://ark.intel.com/products/52224/Intel-Core-i5-2410M-Processor-3M-Cac...

Note: Take a look at a datasheet for your i5-2410M CPU in a Quick Links section ( on the right side of the web page )

- http://software.intel.com/en-us/forums/topic/278742

I tried to find it in the datasheets. There are volumes 1 and 2, but I could not find latency information or anything related to instructions.

>>>I will check that tool. But I think it is not free.>>>

The full version of IDA is not free, but you can download a stripped-down version which is free. WinDbg is free.

>>I tried with other data sizes, too. This did not have much influence. I reduced the data amount to 1 MB (a third of the
>>3 MB L3 cache). There AVX was 3% faster than SSE.

Thanks for the note, Christian.

It matches my results. Since you have a system with a Sandy Bridge CPU and I have a system with an Ivy Bridge CPU, the throughput ratio is 1 to 2. If your 3% result is multiplied by 2 we get 6, that is, a performance improvement of 6%. Is that wrong?

>>...Would using prefetch increase performance a little bit? I think I saw an example therefore in the forum here, some time ago.

Yes, especially for data sets larger than 64 KB, and you will need to re-implement your main for loop. There are lots of posts on the IDZ forums on that subject; just enter _mm_prefetch in the search box. Please take a look at a recent thread on prefetching ( it has some code and test data ): software.intel.com/en-us/forums/topic/352880.
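As an illustration of what such a re-implemented loop might look like, here is a rough sketch that issues a software prefetch a fixed distance ahead of the current load. It assumes an x86 target with SSE; the function name sqrt_sse_prefetch and the prefetch distance of 64 floats are hypothetical choices, not values from the thread, and whether this helps at all depends on the platform and has to be measured.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>
#include <xmmintrin.h>  // SSE: _mm_loadu_ps, _mm_sqrt_ps, _mm_storeu_ps, _mm_prefetch

// Hypothetical helper: SSE sqrt over floats with a prefetch hint for data
// that will be read a few iterations later.
void sqrt_sse_prefetch(const std::vector<float>& input, std::vector<float>& result)
{
    assert(input.size() == result.size());

    const std::size_t incr = 4;     // floats per 128-bit register
    const std::size_t ahead = 64;   // assumed prefetch distance in elements (256 bytes)
    const std::size_t n = input.size();
    const std::size_t vec_end = n - (n % incr);

    for (std::size_t k = 0; k < vec_end; k += incr)
    {
        // Only hint addresses that are still inside the vector.
        if (k + ahead < n)
        {
            _mm_prefetch(reinterpret_cast<const char*>(&input[k + ahead]), _MM_HINT_T0);
        }
        __m128 val = _mm_loadu_ps(&input[k]);
        val = _mm_sqrt_ps(val);
        _mm_storeu_ps(&result[k], val);
    }

    // Scalar tail for sizes that are not a multiple of 4.
    for (std::size_t k = vec_end; k < n; ++k)
    {
        result[k] = std::sqrt(input[k]);
    }
}
```

This sketch issues one hint per register for simplicity; since a cache line holds 16 floats, a tuned version would hint once per line.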

>>...I think the best way to proceed is a consolidated set of tests ( SSE vs. AVX ) in a Visual Studio project...

Christian,

I hope that on Monday I will upload a project with my tests. Then you will need to add your tests with STL containers, of course as soon as you have time. When everything is ready ( sources tuned, etc ) a set of new tests on our systems could be done and new results posted.

>>>With an FIR filter, for example, I get quite a good speedup from AVX: twice as fast for double and nearly twice as fast for float>>>

This could be due to easily vectorized code and wider registers.

Sergey,

if you have your project ready, please let me know. I will help you integrate my code and we can run a test on different platforms.

The prefetch sounds interesting. As soon as I have time, I will try to integrate this, too.

It is still strange that you get a speedup of 6, which means AVX is 6 times faster. This is far from my results. But we tested on different architectures. Ivy Bridge doubles the throughput for the instruction and has decreased latency.

>>...It is still strange that you get a speedup of 6, which means AVX is 6 times faster. This is far from my results. But we
>>tested on different architectures. Ivy Bridge doubles the throughput for the instruction and has decreased latency.

Hi Christian,

Our "test strategies" are different ( and that's OK! ) and you will see that in the test project soon. I'm not going to over-complicate it and will keep it as simple as possible.

Best regards,
Sergey

PS: Sorry for the delay ( too many different things for Monday... )

Hello Sergey,

it's OK. I am really interested in your test strategy. I hope you see this message. Lately I have often had trouble because my posts got stuck in the spam filter. I think you missed my last post, too.

Kind regards,

Christian

Hi everybody,

Please find attached a Visual Studio 2008 project ( Professional Edition ) with the Intel C++ compiler set by default. Tests for three sqrt functions are currently implemented:

- CRT sqrt
- SSE sqrt ( intrinsic )
- AVX sqrt ( intrinsic )

Christian, please add your STL code ( as soon as you have time ) and upload the project for tests. Thanks in advance, and let me know if you have any issues or questions. You're free to modify and improve the code.

Best regards,
Sergey

Attachment: sqrttestapp.zip ( application/zip, 4.94 KB )

Hi

Thanks for posting the sqrt test case source code. IIRC, a few months ago one of the forum users advised not to call a tested function with a constant value. The value should be pseudo-random within the proper input range. It would be interesting to see how such an approach affects the speed of execution.

>>...a few months ago one of the forum users advised not to call a tested function with a constant value...

The iteration counter t could also be used. A call to the rand CRT function would create additional overhead unless it is done before the main for loop. Anyway, it is not a problem to test and see if the results are different.

>>>A call to the rand CRT function would create additional overhead unless it is done before the main for loop>>>

Calling rand and srand from within the for loop is not suitable because of the function call overhead. Also, casting the loop counter will incur some overhead.

It might take one or two more days to add this. The days are too short.

But to give a short overview: I create a vector and fill it with random data. This part is not included in the time measurement. Then I operate on it. After taking the time, I pick a random item and store it in a volatile variable. This way the compiler does not optimize anything away, even in the release configuration.
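A minimal sketch of that scheme, under the assumption that a global volatile variable is used as the sink. The names g_sink, fill_random, scalar_sqrt and consume_random_element are hypothetical and only illustrate the idea; the timing calls would go around scalar_sqrt, with the fill before and the sink store after the timed region.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical sink: a store to a volatile object is an observable side
// effect, so the compiler must keep the computation that feeds it.
volatile double g_sink = 0.0;

// Fills the input with pseudo-random values in [0, 1000]; not part of the timed region.
void fill_random(std::vector<double>& v, unsigned seed)
{
    std::srand(seed);
    for (std::size_t k = 0; k < v.size(); ++k)
    {
        v[k] = static_cast<double>(std::rand()) / RAND_MAX * 1000.0;
    }
}

// The operation under test (this is the part that would be timed).
void scalar_sqrt(const std::vector<double>& input, std::vector<double>& result)
{
    assert(input.size() == result.size());
    for (std::size_t k = 0; k < input.size(); ++k)
    {
        result[k] = std::sqrt(input[k]);
    }
}

// After timing: pick one random element and store it in the volatile sink.
double consume_random_element(const std::vector<double>& result)
{
    assert(!result.empty());
    const std::size_t idx = static_cast<std::size_t>(std::rand()) % result.size();
    const double picked = result[idx];
    g_sink = picked;
    return picked;
}
```

Because the index is only known at run time, the compiler cannot tell which element is needed and therefore cannot drop the loop that computes the results.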

Would it be OK to move to a Visual Studio 2010 project? I work with VS2010 and VS2012 and do not have 2008 installed.

>>...It might take one or two days more, to add this. The days have too little time.

Thanks for the update.

>>...Would it be ok, to go to Visual Studio 2010 project? I work with VS2010 and VS2012 and have no 2008 installed?..

Yes. I will "port" the project back to Visual Studio 2008 ( opposite case... ).

>>...But I think you get AVX to be a lot of faster then SSE...

As of today I have two test cases, and you have the 2nd one in the VS 2008 project attached a couple of days ago. I see a 3x difference on the Ivy Bridge system ( the application was compiled with Intel C++ compiler XE 2013 ) and I can't confirm a 6x difference between SSE2 and AVX sqrt calculations. Also, I consider the 2nd case to be better implemented.

As Tim noted you could have some negative impact related to cache lines and I think you need to use VTune to analyse your processing on Sandy Bridge.

So here we go with the tests.

Unfortunately I see a problem. You use very high iteration counts, whereas I now create a vector containing all elements and operate on it; the same goes for the result. Maybe we should change this to a combination of a vector of a certain size and iterating over it. But then it is hard to get comparable results. Do you have an idea? We could create a vector with a size of 1 MB so it fits in L3 for sure, then operate on all elements and repeat this to reach the overall iteration count?

Feel free to make comments on the code.

Attachment: sqrttestapp.zip ( application/zip, 16.84 KB )

>>...Do you have some idea?..

Let me take a look at the updated sources and I'll post an updated project some time next week. Thank you!

>>...Unfortunately I see a problem. You use very high iteration counts...

If I use a number that is less than 2^24, the tests for AVX sqrt will execute in less than 15 ticks. I have an Intel Core i7-3840QM ( Ivy Bridge ) and it is very fast. So there is nothing wrong here, and you can use a lower number. That is why there is a piece of code like:


...
//	int iNumberOfIterations =  16777216;			// 2^24
//	int iNumberOfIterations =  33554432;			// 2^25
//	int iNumberOfIterations =  67108864;			// 2^26
//	int iNumberOfIterations = 134217728;			// 2^27
	int iNumberOfIterations = 268435456;			// 2^28
...

Ah, this sounds quite good.

Let me know when you have the updated project. I will run it on two Sandy Bridge systems (a mobile i5 and a high-end desktop i7).

Here is the updated Visual Studio 2008 project ( Professional Edition ) with the Intel C++ compiler set by default. Please do a code review and, if everything looks good, we're ready to do testing.

Best regards,
Sergey

Attachment: sqrttestapp.zip ( application/zip, 6.84 KB )

[ Example of Output ]

64-bit Windows platform

Notes:
- Processing is Normalized - Tests calculate 8 sqrt values per iteration
- Number of iterations is 33554432

Tests started

CRT Sqrt - float
Calculating the Square Roots - Done in xx ticks
625.000^0.5 = 25.000

SSE Sqrt - float
Calculating the Square Roots - Done in xx ticks
625.000^0.5 = 25.000

AVX Sqrt - float
Calculating the Square Roots - Done in xx ticks
625.000^0.5 = 25.000

STL vector size: 67108864 ( float elements )
Number of tests: 4

STL vector: STL Sqrt - float
Calculating the Square Roots
Test 1: xxx ticks
Test 2: xxx ticks
Test 3: xxx ticks
Test 4: xxx ticks
Average: xxx ticks

STL vector: SSE Sqrt - float
Calculating the Square Roots
Test 1: xx ticks
Test 2: xx ticks
Test 3: xx ticks
Test 4: xx ticks
Average: xx ticks

STL vector: AVX Sqrt - float
Calculating the Square Roots
Test 1: xx ticks
Test 2: xx ticks
Test 3: xx ticks
Test 4: xx ticks
Average: xx ticks

Tests completed

Press ESC to Exit...

I think the code is quite good now, so let's start the tests and see what we get.

>>...Think code is quite good now, so lets start the tests and see what we get.

I'll post my results today in the afternoon. Thanks, Christian.

Thanks, Sergey.

I think the whole combination of our test scenarios gives quite a good overview.

>>...I think the whole combination of our test scenarios gives quite an good overview.

If you wish I could e-mail my test results in a private message. Would you be able to create a combined report before posting it?

You mean I run the test on another machine and then we post it together?

>>You mean I run the test on another machine and then we post it together?

Yes.

- I do the test on Ivy Bridge and email you results
- You do the test on Sandy Bridge and create a combined report
- You post results to the thread

Does it look good?

Yes, that's fine. I think I can run the tests this evening and post the results tomorrow.

And please also email me the exe, so that we test the same thing. You work with VS2008 and the Intel compiler?

>>...And please email me also the exe, so we test the same thing. You work with VS2008 and Intel Compiler?

Yes. I'll build binaries ( 32-bit and 64-bit Release Configurations ) for tests and pack them into a zip-archive. My test results also will be included.

Note: You will need run-time DLLs ( Redistributable Package ) for Visual Studio 2008 and you can get it from Download.Microsoft.com.

>>>>...And please email me also the exe, so we test the same thing. You work with VS2008 and Intel Compiler?
>>
>>Yes. I'll build binaries ( 32-bit and 64-bit Release Configurations ) for tests and pack them into a zip-archive. My test
>>results also will be included.

Done. Please check your private messages.
Best regards,
Sergey

Here you find the test results, based on the project provided above. All additional information can be found in the output itself.

///////////////////////////////////////////////////////////////////////////////
    CONSOLE APPLICATION : SqrtTestApp Project Overview
///////////////////////////////////////////////////////////////////////////////

Release Notes:

    6. Tests on Sandy Bridge system:

    >> 32-bit <<

        32-bit Windows platform

        Notes:
        - Processing is Normalized - Tests calculate 8 sqrt values per iteration
        - Number of iterations is 33554432

        Tests started

        CRT Sqrt - float
                Calculating the Square Roots - Done in 47 ticks
                625.000^0.5 = 25.000

        SSE Sqrt - float
                Calculating the Square Roots - Done in 172 ticks
                625.000^0.5 = 25.000

        AVX Sqrt - float
                Calculating the Square Roots - Done in 31 ticks
                625.000^0.5 = 25.000

        STL vector size: 67108864 ( float elements )
        Number of tests: 4

        STL vector: STL sqrt - float
                Calculating the Square Roots
                Test  1: 327 ticks
                Test  2: 343 ticks
                Test  3: 344 ticks
                Test  4: 327 ticks
                Average: 335 ticks

        STL vector: SSE sqrt - float
                Calculating the Square Roots
                Test  1: 93 ticks
                Test  2: 94 ticks
                Test  3: 78 ticks
                Test  4: 94 ticks
                Average: 89 ticks

        STL vector: AVX sqrt - float
                Calculating the Square Roots
                Test  1: 94 ticks
                Test  2: 78 ticks
                Test  3: 93 ticks
                Test  4: 94 ticks
                Average: 89 ticks

        Tests completed

        Press ESC to Exit...

    >> 64-bit <<

        64-bit Windows platform

        Notes:
        - Processing is Normalized - Tests calculate 8 sqrt values per iteration
        - Number of iterations is 33554432

        Tests started

        CRT Sqrt - float
                Calculating the Square Roots - Done in 47 ticks
                625.000^0.5 = 25.000

        SSE Sqrt - float
                Calculating the Square Roots - Done in 187 ticks
                625.000^0.5 = 25.000

        AVX Sqrt - float
                Calculating the Square Roots - Done in 16 ticks
                625.000^0.5 = 25.000

        STL vector size: 67108864 ( float elements )
        Number of tests: 4

        STL vector: STL sqrt - float
                Calculating the Square Roots
                Test  1: 328 ticks
                Test  2: 343 ticks
                Test  3: 343 ticks
                Test  4: 343 ticks
                Average: 339 ticks

        STL vector: SSE sqrt - float
                Calculating the Square Roots
                Test  1: 78 ticks
                Test  2: 78 ticks
                Test  3: 93 ticks
                Test  4: 78 ticks
                Average: 81 ticks

        STL vector: AVX sqrt - float
                Calculating the Square Roots
                Test  1: 78 ticks
                Test  2: 94 ticks
                Test  3: 93 ticks
                Test  4: 94 ticks
                Average: 89 ticks

        Tests completed

        Press ESC to Exit...

    5. Sandy Bridge system:

        OS Name    Microsoft Windows 7 Home Premium
        Version    6.1.7601 Service Pack 1 Build 7601
        Other OS Description    Not Available
        OS Manufacturer    Microsoft Corporation
        System Name    DANIELA-LAPTOP
        System Manufacturer    SAMSUNG ELECTRONICS CO., LTD.
        System Model    RV420/RV520/RV720/E3530/S3530/E3420/E3520
        System Type    x64-based PC
        Processor    Intel(R) Core(TM) i5-2410M CPU @ 2.30GHz, 2301 MHz, 2 Core(s), 4 Logical Processor(s)
        BIOS Version/Date    Phoenix Technologies Ltd. 03PQ, 08.07.2011
        SMBIOS Version    2.6
        Windows Directory    C:\Windows
        System Directory    C:\Windows\system32
        Boot Device    \Device\HarddiskVolume1
        Locale    Austria
        Hardware Abstraction Layer    Version = "6.1.7601.17514"
        User Name    Daniela-Laptop\Daniela
        Time Zone    Central European Time
        Installed Physical Memory (RAM)    6.00 GB
        Total Physical Memory    5.98 GB
        Available Physical Memory    4.28 GB
        Total Virtual Memory    12.0 GB
        Available Virtual Memory    10.3 GB
        Page File Space    5.98 GB
        Page File    C:\pagefile.sys

    4. Tests on Ivy Bridge system:

    >> 32-bit <<

        ..\SqrtTestApp\Release>SqrtTestApp32.exe
        32-bit Windows platform

        Notes:
        - Processing is Normalized - Tests calculate 8 sqrt values per iteration
        - Number of iterations is 33554432

        Tests started

        CRT Sqrt - float
                Calculating the Square Roots - Done in 62 ticks
                625.000^0.5 = 25.000

        SSE Sqrt - float
                Calculating the Square Roots - Done in 109 ticks
                625.000^0.5 = 25.000

        AVX Sqrt - float
                Calculating the Square Roots - Done in 31 ticks
                625.000^0.5 = 25.000

        STL vector size: 67108864 ( float elements )
        Number of tests: 4

        STL vector: STL sqrt - float
                Calculating the Square Roots
                Test  1: 343 ticks
                Test  2: 359 ticks
                Test  3: 343 ticks
                Test  4: 359 ticks
                Average: 351 ticks

        STL vector: SSE sqrt - float
                Calculating the Square Roots
                Test  1: 62 ticks
                Test  2: 47 ticks
                Test  3: 47 ticks
                Test  4: 62 ticks
                Average: 54 ticks

        STL vector: AVX sqrt - float
                Calculating the Square Roots
                Test  1: 47 ticks
                Test  2: 62 ticks
                Test  3: 47 ticks
                Test  4: 47 ticks
                Average: 50 ticks

        Tests completed

    >> 64-bit <<

        ..\SqrtTestApp\x64\Release>SqrtTestApp64.exe
        64-bit Windows platform

        Notes:
        - Processing is Normalized - Tests calculate 8 sqrt values per iteration
        - Number of iterations is 33554432

        Tests started

        CRT Sqrt - float
                Calculating the Square Roots - Done in 47 ticks
                625.000^0.5 = 25.000

        SSE Sqrt - float
                Calculating the Square Roots - Done in 109 ticks
                625.000^0.5 = 25.000

        AVX Sqrt - float
                Calculating the Square Roots - Done in 31 ticks
                625.000^0.5 = 25.000

        STL vector size: 67108864 ( float elements )
        Number of tests: 4

        STL vector: STL sqrt - float
                Calculating the Square Roots
                Test  1: 359 ticks
                Test  2: 343 ticks
                Test  3: 359 ticks
                Test  4: 343 ticks
                Average: 351 ticks
        
        STL vector: SSE sqrt - float
                Calculating the Square Roots
                Test  1: 47 ticks
                Test  2: 62 ticks
                Test  3: 47 ticks
                Test  4: 47 ticks
                Average: 50 ticks

        STL vector: AVX sqrt - float
                Calculating the Square Roots
                Test  1: 47 ticks
                Test  2: 47 ticks
                Test  3: 47 ticks
                Test  4: 47 ticks
                Average: 47 ticks

        Tests completed

    3. Ivy Bridge system:

        OS Name                            Microsoft Windows 7 Professional
        Version                            6.1.7601 Service Pack 1 Build 7601
        Other OS Description             Not Available
        OS Manufacturer                    Microsoft Corporation
        System Name                        DELLPM
        System Manufacturer                Dell Inc.
        System Model                    Precision M4700
        System Type                        x64-based PC
        Processor                        Intel(R) Core(TM) i7-3840QM CPU @ 2.80GHz, 2801 Mhz, 4 Core(s), 8 Logical Processor(s)
        BIOS Version/Date                Dell Inc. A05, 08/10/2012
        SMBIOS Version                    2.7
        Windows Directory                C:\Windows
        System Directory                C:\Windows\System32
        Boot Device                        \Device\HarddiskVolume2
        Locale                            Canada
        Hardware Abstraction Layer        Version = "6.1.7601.17514"
        User Name                        DellPM\Admin
        Time Zone                        Mountain Standard Time
        Installed Physical Memory (RAM)    16.0 GB
        Total Physical Memory            15.9 GB
        Available Physical Memory        14.3 GB
        Total Virtual Memory            47.9 GB
        Available Virtual Memory        46.3 GB
        Page File Space                    32.0 GB
        Page File                        C:\pagefile.sys

    2. When int iNumberOfIterations = 268435456 ( 2^28 ) there is Microsoft C++
       exception: std::length_error ( Vector is too long ) and application crashes

       Fixed. A different way of processing is used now.

    1. Renamed aligned_alloc.h to AlignedAlloc.h

///////////////////////////////////////////////////////////////////////////////

Here is a short overview of the tests implemented by Christian and Sergey to measure the performance of square root calculations:

- Different sqrts are tested on two systems: Sandy Bridge and Ivy Bridge

- Sandy Bridge configuration:

Processor: Intel(R) Core(TM) i5-2410M CPU @ 2.30GHz, 2301 MHz, 2 Core(s), 4 Logical Processor(s)
OS Name: Microsoft Windows 7 Home Premium ( 64-bit )
Physical Memory (RAM): 6.00 GB

- Ivy Bridge configuration:

Processor: Intel(R) Core(TM) i7-3840QM CPU @ 2.80GHz, 2801 Mhz, 4 Core(s), 8 Logical Processor(s)
OS Name: Microsoft Windows 7 Professional ( 64-bit )
Physical Memory (RAM): 16.00 GB

- Both systems support AVX instruction set

- On both systems the same executable was executed ( compiled on Ivy Bridge ) in order to make results as consistent as possible ( Thanks, Christian for that idea! )

- There are 6 tests in total and here are consolidated results for a quick comparison:

Sandy Bridge vs. Ivy Bridge - 32-bit configuration

1. CRT Sqrt - float - Done in 47 ticks vs. 62 ticks
2. SSE Sqrt - float - Done in 172 ticks vs. 109 ticks
3. AVX Sqrt - float - Done in 31 ticks vs. 31 ticks
4. STL vector: STL sqrt - float - Average: 335 ticks vs. 351 ticks
5. STL vector: SSE sqrt - float - Average: 89 ticks vs. 54 ticks
6. STL vector: AVX sqrt - float - Average: 89 ticks vs. 50 ticks

Sandy Bridge vs. Ivy Bridge - 64-bit configuration

1. CRT Sqrt - float - Done in 47 ticks vs. 47 ticks
2. SSE Sqrt - float - Done in 187 ticks vs. 109 ticks
3. AVX Sqrt - float - Done in 16 ticks vs. 31 ticks
4. STL vector: STL sqrt - float - Average: 339 ticks vs. 351 ticks
5. STL vector: SSE sqrt - float - Average: 81 ticks vs. 50 ticks
6. STL vector: AVX sqrt - float - Average: 89 ticks vs. 47 ticks

- Tests 1, 2, 3 demonstrate what a developer could expect when only a couple of sqrts need to be calculated ( 1 value, or 4 values, or 8 values )

- Tests 4, 5, 6 demonstrate what a developer could expect when sqrts of a large vector need to be calculated

- Attached is a zip file with the source code ( Visual Studio 2008 Professional Edition with Intel C++ compiler XE 13.0.0.089 set )

Attachment: sqrttestapp.zip ( application/zip, 8.2 KB )

I'd like to clarify a couple of things:

- Tests 1, 2 and 3 were executed 33554432 ( 2^25 ) times
- STL vector size: 67108864 float elements ( 64M single-precision floating point values, i.e. 256 MB of data )
- The Win32 API function GetTickCount was used to measure time intervals
- 1 sec = 1000 ticks

Please see the code for more details.
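To compare the consolidated numbers more easily, the tick counts can be turned into speedup ratios and element throughput. The sketch below uses the 1 sec = 1000 ticks convention stated above; the function names speedup and elements_per_second are hypothetical. Keep in mind that GetTickCount typically has a resolution of roughly 10 to 16 ms, and the measured values here (16, 31, 47, 62, 78, 94, 109) cluster at multiples of about 15.6, so ratios between short runs are coarse.

```cpp
#include <cassert>
#include <cstddef>

// Ratio of two tick counts; a value above 1 means the candidate is faster
// than the baseline.
double speedup(double baseline_ticks, double candidate_ticks)
{
    assert(candidate_ticks > 0.0);
    return baseline_ticks / candidate_ticks;
}

// Elements processed per second, with 1 tick = 1 ms.
double elements_per_second(std::size_t elements, double ticks)
{
    assert(ticks > 0.0);
    return static_cast<double>(elements) / (ticks / 1000.0);
}
```

For example, with the 32-bit Sandy Bridge averages, speedup(335, 89) is roughly 3.8 for the SSE vector test over the STL sqrt test, while speedup(89, 89) is exactly 1 for AVX over SSE. On Ivy Bridge, speedup(54, 50) is 1.08 for the same AVX over SSE comparison, and elements_per_second(67108864, 89) comes out at about 7.5e8 elements per second for the Sandy Bridge SSE run.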
