64bit code works significantly slower than 32bit code

Hi All,

Here are some benchmarks (time that test requires to complete) of the code being compiled and tested on x86 and x64 platforms using MSVC and Intel C++ Compiler.

MSVC 32 bit: 195.9 seconds
Intel 32 bit: 178.1 seconds (17.8 seconds faster, very good)

MSVC 64 bit: 194.9 seconds (execution time is almost the same as MSVC 32 bit code)
Intel 64 bit: 187.3 seconds (nine seconds slower than Intel 32 bit code)

MSVC code does not degrade when compiled for x64, while Intel code becomes slower for x64. Are there any tricks to make 64bit Intel code as fast as 32bit Intel code?

Thanks in advance.

Thanks Romant,

Could you let us know which optimization options you have set in your project? For example, which optimization level do you use? Do you use SIMD optimization? Which target machine is set in the project configuration? All of these options affect your program's performance.

Hope it can help you!

Thanks,
Wise

Thanks Romant,

Unfortunately, I can't download your png files from your URL. Could you send them to me through email? My email address is wise.chen@intel.com

Thanks,
Wise

Thanks Romant,

I got your configuration screenshots from a team member and did not see any issue in them. Two more questions:
1. Could you describe the configuration of the machine on which you ran the tests?
2. Could you provide the 'Linker' options set in your project property pages?

Thanks,
Wise

Wise,

Machine: Intel Core i7 920 CPU (overclocked to 3.33 GHz), XP64 English OS.

Linker command line (32bit), excluded list of input libs:
/INCREMENTAL:NO /nologo /NODEFAULTLIB:"libcmt.lib" /NODEFAULTLIB:"libcmtd.lib" /NODEFAULTLIB:"libcpmt.lib" /NODEFAULTLIB:"libcpmtd.lib" /TLBID:1 /DEBUG /SUBSYSTEM:WINDOWS /LARGEADDRESSAWARE /OPT:REF /OPT:ICF /ENTRY:"wWinMainCRTStartup" /MACHINE:X86 /FIXED:NO

Linker command line (64bit), excluded list of input libs:
/INCREMENTAL:NO /nologo /NODEFAULTLIB:"libcmt.lib" /NODEFAULTLIB:"libcmtd.lib" /NODEFAULTLIB:"libcpmt.lib" /NODEFAULTLIB:"libcpmtd.lib" /TLBID:1 /DEBUG /SUBSYSTEM:WINDOWS /LARGEADDRESSAWARE /OPT:REF /OPT:ICF /ENTRY:"wWinMainCRTStartup" /MACHINE:X64 /FIXED:NO

Interprocedural optimization is enabled in both configurations. Please, let me know if this info is not enough.

The timings that you display do not strike me as justifying the attribute "significantly slower". There are many compiler options that can be used to tune your application. A cursory look at your screenshots showed me that you had not selected the option to generate code specific to your CPU.

Quoting mecej4
The timings that you display do not strike me as justifying the attribute "significantly slower". There are many compiler options that can be used to tune your application. A cursory look at your screenshots showed me that you had not selected the option to generate code specific to your CPU.

My question is: what must I do to make 64bit code as fast as 32bit code? I'm open to any suggestions and experiments, as Intel C++ Compiler is a new product for me.

That is a pretty difficult question to answer without some representative example to analyze. Is there a particular part of the code that is noticeably slower? Could you post a small kernel that shows a similar problem?

Thanks!
Dale

Unfortunately, I can't extract a piece of code to analyze ... I can only describe what the code does. It is a number of sets of nested loops over a number of arrays, each array containing about 500 thousand floats. In other words, pure number crunching that involves simple arithmetic operations at most.

Thanks Romant,

It is hard for us to tune your code without your source code.
I read your C/C++ and Linker configuration. Judging only from the optimization settings, please enable 'Interprocedural Optimization' under the Linker's Optimization settings and try again.

Hope it can help you.

Thanks,
Wise

Quoting romant73
Unfortunately, I can't extract a piece of code to analyze ... I can only describe what the code does. It is a number of sets of nested loops over a number of arrays, each array containing about 500 thousand floats. In other words, pure number crunching that involves simple arithmetic operations at most.

Try this:

Run a profiler to find the hot spots of your application. Select a few of the hottest for further consideration. Select the same functions for both the 32-bit and x64 builds.

Set a break point in the hot spot(s) in each configuration. Note, doing this in a fully optimized program may be difficult. Break on entry into the function, and not on the hottest statement. If necessary, to avoid some inlining, you may need to add a dummy function call that is not inlined, break in that function, and then step out after the break.
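The dummy-call trick could look roughly like the sketch below. The names `profile_marker`, `g_marker_hits` and `sum_products` are made up for illustration and are not from the original code; `__declspec(noinline)` is the MSVC-style spelling (also accepted by the Intel compiler on Windows), and the `__attribute__` form is assumed for GCC-compatible compilers.

```cpp
#include <cstddef>

// Pick the compiler-specific "never inline" attribute.
#if defined(_MSC_VER)
#define NOINLINE __declspec(noinline)
#else
#define NOINLINE __attribute__((noinline))
#endif

// Volatile counter so the marker has a side effect the optimizer must keep.
volatile int g_marker_hits = 0;

// Hypothetical marker: set the break point here, then step out
// to land inside the optimized hot function that called it.
NOINLINE void profile_marker() {
    g_marker_hits = g_marker_hits + 1;
}

// Hypothetical hot function standing in for one of the real loops.
float sum_products(const float* a, const float* b, std::size_t n) {
    profile_marker();  // break-point anchor, called once per entry
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}
```

Placing the marker call at function entry rather than inside the loop keeps its cost out of the timing being studied. Remove it again before taking final measurements.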

When in the function containing the hot spot, open a disassembly window and copy its contents to an editor (or take a screenshot of the disassembled code). Do the same for the other configuration. In other words, get the disassembly for both the 32-bit and x64 builds. You can also compile with the option to produce an ASM listing (produce the listing with code bytes).

By comparing the two you can see which is performing more work than the other and/or has more bytes per instruction.

Often you may find that x64 is converting a 32-bit int to 64-bit in the process of producing array indexes. In these situations, consider promoting the loop index to intptr_t, as this is an integer with the same size as an addressing register.
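As a rough illustration of the index promotion, here is a minimal sketch. The kernel and the names `axpy_int_index` and `axpy_ptr_index` are invented for this example and are not taken from the original code. Whether the compiler really emits extra widening instructions for the `int` version depends on the compiler and options, so compare the disassembly of both variants.

```cpp
#include <cstdint>

// Hypothetical kernel with a 32-bit int loop index. On x64 the compiler
// may have to widen i to 64 bits before using it in an address calculation.
void axpy_int_index(float* out, const float* a, const float* b,
                    float k, int n) {
    for (int i = 0; i < n; ++i)
        out[i] = a[i] * k + b[i];
}

// Same kernel with a pointer-sized index: intptr_t is 32 bits wide in a
// 32-bit build and 64 bits wide in an x64 build, so no widening is needed.
void axpy_ptr_index(float* out, const float* a, const float* b,
                    float k, std::intptr_t n) {
    for (std::intptr_t i = 0; i < n; ++i)
        out[i] = a[i] * k + b[i];
}
```

Both functions compute the same values. Only the width of the loop counter differs, so any difference in the generated code comes from the index handling.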

Another difference is that x64 almost always uses the XMM registers, whereas 32-bit code may use them selectively. These registers are used by the SSE instructions. While SSE is generally faster than the FPU, there are some cases where it is not. You might keep this in mind when you compare the code.

Also, if you notice 0x66 or 0x67 bytes in the instructions (usually at the front), these are the operand-size and address-size override prefixes. An excess of these will increase the number of code bytes in your loops. At some point this may cause some loops to no longer fit within the L1 instruction cache.

Don't forget that aggressive inlining can often cause some loops to spill out of the L1 instruction cache as well.

Jim Dempsey

Jim, thank you very much for describing the strategy, definitely, I will try low level analysis out.
