VS2005 vs Intel C++ Compiler, w.r.t. SSE2+

I've just read this thread, and have also heard by word of mouth, that the Intel compiler does better with SSE intrinsics than MSVC. We currently use VS2005, and are supporting SSE2 and higher only. The MSVC compiler does a less-than-stellar job of avoiding XMM register spills, etc.

I am wondering if there's a nice how-to on using the Intel compiler with the VS2005 IDE, or at least on using the Intel compiler for the SIMD-critical portions of our code.

I currently have the latest evaluation version of the Intel compiler.

Thanks,
William


Apparently I should've looked in the help docs (or even just opened up VS after installation). I'll get back to you. :)

Thanks,
William

Ok, after conversion I had some linker errors, as well as a bunch of internal compiler errors in one of our vcprojs (now icprojs), but overall it was a pretty painless procedure. The linker errors are fixed, but the ICEs couldn't be avoided, so I'm just avoiding those projects for now.

Anyhow, at runtime, I'm getting a misaligned vector load exception. Here is the code causing it. If I remove the static const, it works, but this is clearly not optimal. Any ideas why it's not aligning the global vector? (This works in VS2005.)

template <u32 floatAsIntX, u32 floatAsIntY, u32 floatAsIntZ, u32 floatAsIntW>
__forceinline const __m128 VecConstant()
{
static const __declspec(align(16)) u32 s_vect[4] = { floatAsIntX, floatAsIntY, floatAsIntZ, floatAsIntW };
return *(__m128*)(&s_vect);
}

EDIT: It seems to be __declspec() getting ignored, since aligned-vector members of other classes are not getting aligned either.

This is horrible, but it seems that moving the __declspec(align(16)) before the "static" keyword fixed the alignment for that one particular case... :-/ Still working on the other misalignments, which may be due to our own allocator.

Sorry for being a 1-man thread lately. Will post problems as they arise.

Ok, the problem seems to be the following (compiler bug?):

pPtr = new CMyClass[m_amount];

where CMyClass has the __declspec(align(16)) specifier (and additionally has a 16-byte-aligned object as its first member). CMyClass has zero virtual functions. In the CMyClass constructor, the first member is set to zero, which calls _mm_setzero_ps() internally, causing a misalignment exception -- that first member isn't aligned properly -- everything is shifted over by one word.

The problem is with the new[] operator. The word stored at the beginning of the array (to hold the size of the array) isn't 16-byte-aligned, and is pushing everything to the right by 1 word instead of by 4 words.

Has no one else discovered this problem?

EDIT: Anyhow, I got around it using _aligned_malloc() + placement new for now, but a fix would be better of course.

Unless you take specific action to replace Microsoft's implementation of new[], you should get the same run time library implementation with either Microsoft or Intel C++. On a 64-bit OS, new[] should produce a 16-byte aligned allocation, but in the 32-bit implementation, it does not. I don't like this myself, but the decision is out of our hands. 4-byte alignment is expected on Windows, 8-byte on linux. I believe _aligned_malloc() is the recommended solution for 32-bit Windows.

Tim, the problem is that the Intel compiler has a bug (which I happened to report a long time ago, issue #458830) where if you use placement new and aligned malloc() you still do not get aligned memory back when allocating with new[], because the compiler stores the number of array elements at the beginning of the allocated memory and increments the pointer it returns. Try compiling the attached test.cpp with MSVC and then with ICC. Note that support says that the issue has been fixed, but obviously the fix is not available in 10.1.021.

To cut a long story short, I would also like to see that bug resolved as quickly as possible, instead of having to write additional code. We have already been waiting three months for that fix, and that is mighty slow for such a serious showstopper.

William, if you want to compare the code generated by MSVC and ICC from your intrinsics, it is very important that you set /arch:SSE2 (in addition to /O2 or /Ox) for the MSVC compiler, or otherwise you will get suboptimal code because the compiler will still use x87 floating point and perform some needless type conversions.

Attachment: test.cpp
Regards,
Igor Levicki

If you are concerned about needless type conversions with cl /arch:SSE2 when using float data types, you will likely need the /fp:fast option (which is less aggressive than the ICL default).
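The flags discussed above translate into command lines roughly like the following (the file name is hypothetical; the ICL options are the ones quoted later in this thread):

```shell
# MSVC: /arch:SSE2 makes cl use SSE2 for float math, and /fp:fast avoids
# the needless type conversions mentioned above.
cl /O2 /arch:SSE2 /fp:fast simd_kernel.cpp

# Intel C++ on Windows (icl), options as posted below:
icl /O3 /Qipo /QxT /Qparallel simd_kernel.cpp
```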

Hi,

The most optimal way to overload new and delete to ensure alignment in classes in C++ is given below. In terms of performance, _mm_malloc works better than _aligned_malloc, as the behaviour of _mm_malloc is well known to the Intel compiler.

Best Regards,

Lars Petter Endresen

//==================================================================

// Fix to make _mm_ functions work within classes in Microsoft C++

#include <xmmintrin.h>

#define _aligned_free(a) _mm_free(a)

#define _aligned_malloc(a, b) _mm_malloc(a, b)

void* operator new(size_t bytes) { return _mm_malloc(bytes,16); }

void* operator new[](size_t bytes) { return _mm_malloc(bytes,16); }

void operator delete(void* ptr) { _mm_free(ptr); }

void operator delete[](void* ptr) { _mm_free(ptr); }

//==================================================================

Lars, with all due respect, you have completely disregarded what I wrote above. There is a bug in the Intel compiler preventing what you are suggesting from working correctly.

Regards,
Igor Levicki

Dear Igor,

Sorry, I did not mean to disregard your point. Indeed your sample code crashes, and I agree that this should be fixed. However, I have never experienced this problem myself, as I chose to implement things in a different manner. So for me my overload trick works every time! Also, I noticed that removing the destructor in your code resolved the problem; I cannot figure out why. Can you? If you implement new and delete yourself anyway, maybe you can avoid the destructor altogether? This should fix your problem.

The topic of this thread is "VS2005 vs. Intel C++ Compiler, w.r.t. SSE2+" and Intel C++ Compiler has so many advantages over VS2005 that your particular problem may seem a little out of focus (as you know, all compilers have known limitations). To address the focus of this topic more, try to compile the code below with Intel and Microsoft C++. With Intel C++ 10.0 this code is automatically vectorized, parallelized, unrolled, jammed and cache blocked. The result is astonishing [Intel Core2 CPU 6400 @ 2.13GHz]:

  • VS2005 C++: 75.29 seconds
  • Intel 10.0 C++: 9.09 seconds

Compiler options used are:

  • Microsoft C++: "/O2 /fp:fast /arch:SSE2 /STACK:1000000000"
  • Intel C++: "-Qipo -O3 -QxT -Qparallel /STACK:1000000000"

Best Regards,

Lars Petter Endresen

 #include <stdio.h>
 #include <stdlib.h>
 #include <time.h>
 #define SIZE 4000
 int main()
 { 
 	int i, j, k;
 	float a[SIZE][SIZE],b[SIZE][SIZE],c[SIZE][SIZE];
 	clock_t start, finish;
 	for(i = 0; i < SIZE; i++){
 		for(j = 0; j < SIZE; j++){
 			b[i][j] = a[i][j] = (float)rand()/RAND_MAX;
 			c[i][j] = 0.0;
 		}
 	}
 	start = clock();
 	for(i = 0; i < SIZE; i++){
 		for(k = 0; k < SIZE; k++){
 			for(j = 0; j < SIZE; j++){
 				c[i][j] += a[i][k]*b[k][j];
 			}
 		}
 	}
 	finish = clock();
 	printf("%.2f seconds\n", (double)(finish - start)/CLOCKS_PER_SEC);
 	return 0;
 }

Also, I noticed that removing the destructor in your code resolved the problem, I cannot figure out why? Can you? If you anyway implement new and delete yourself, maybe you can avoid the destructor altogether?

Removing the destructor should not make any difference in that code; the fact that it does suggests that the bug may be more complex than I described.

As for new and delete, my point is that I do not want to (and other people shouldn't have to) implement new and delete to get alignment to work.

In my opinion, the compiler should honor __declspec(align(16)) if present even if you allocate the object via new[], or even better, properly align all known data types (including __m128) without the need for any explicit alignment directives.

As for avoiding destructor, what if you need to derive a class which also has to do some house-keeping of its own? What if you need virtual destructor? Sorry, removing it is not a solution, just a workaround and a lousy one at best because it might break in the future.

As for the topic of this thread, I am surprised that you find my posting here off-topic, yet you seem to have missed the topic yourself. He asked how much better ICC is compared to MSVC with regard to SIMD intrinsic code generation, and I have explained how to get the best code from SIMD intrinsics using MSVC so he could do a fair comparison on his own. I also warned him about the bug because it seemed to me from his last post that he was also affected.

Regards,
Igor Levicki

A number of other advantages of Intel C++ over Microsoft C++,

  1. Intel C++ is more ISO standard conformant than Microsoft C++.
  2. Intel C++ gives more warnings than Microsoft C++.
  3. Intel C++ is platform independent; the same compiler is available for Windows, Linux and Mac. Microsoft C++ is not.
  4. Intel C++ contains a useful code coverage tool. Microsoft C++ does not.
  5. Intel C++ gives up to 10x speedup relative to Microsoft C++, typically in floating-point or multimedia-intensive code.
  6. Intel C++ supports SIMD calculations. Microsoft C++ does not.
  7. Intel C++ is reported to be used by (surprise...) Microsoft themselves.
  8. Intel C++ has special compiler switches to trap uninitialized variables at runtime. Microsoft C++ does not.
  9. Intel C++ optimizes better for size than Microsoft C++, as SIMD instructions can reduce code size too.
  10. Intel C++ with automatic vectorization is automatically adapted to the future AVX (http://softwareprojects.intel.com/avx/) 256-bit instruction set. Microsoft C++ does not support automatic vectorization.

I think that I disagree fundamentally with your software development philosophy, Igor; an engineer should rather adapt to the tools at hand than look for situations in which the tools fail. If you have 2 hours to complete a software release, you had better remove your destructor and take the pragmatic approach to make things work - all tools have a vast number of limitations, and investigating all of them may delay any software project to a point where your customers become very unhappy. Writing a C++ compiler is notoriously difficult, in particular if you desire both maximum performance and correct behaviour according to the language standard. This is particularly true for some of the more advanced topics in the C++ language standard, like templates and polymorphism. So, to achieve maximum performance, remember to follow two simple rules:

  1. Write compiler friendly code.
  2. Use the right compiler.

When it comes down to code involving floating point instructions like "mulps" and "divps", I am a little surprised that developers still hand-optimize code using intrinsics; why not rely on automatic vectorization here? Then you do not need to rewrite your software every time the SIMD vector length is increased. Today, the SIMD vector length is 128 bits, with AVX it will be 256 bits, and later it will be extended to 512 and even 1024 bits. However, there are situations where SSEx intrinsics may be unavoidable, and thus I support your point that the alignment issue should be fixed. I did some more testing, and I found that the failure is seen only when the destructor is present - maybe the destructor is the origin of the problem?

Best Regards,

Lars Petter Endresen

Ok, let's see which claims are bogus:

2. Serious developers use lint anyway, so a chatty compiler isn't an advantage.

7. Saying this without citing the source is just a rumor.

8. In debug mode all uninitialized variables are trapped anyway.

9. This is not true. MSVC generates smaller code and uses less memory for constants.

10. It is not fair to compare this, because the AVX specification has only just appeared in public, while Intel has had it internally for quite some time.

Furthermore, saying that MSVC doesn't support SIMD is misleading. You can use intrinsics with equal success in MSVC.

I think that I disagree fundamentally with your software development philosophy Igor, as an engineer better should adapt to the tools at hand than to look for situations in which the tools fail.

Lars, your logic is flawed. I wasn't actively looking for this failure; I stumbled upon it because my perfectly legal and moral C++ code crashed. I have a workaround, but I want a permanent fix, because if I rely on a workaround, the workaround itself might stop working when the fix is introduced.

If you have 2 hours to complete a software release, you better remove your destructor and take the pragmatical approach to make things work

You still haven't answered my question: what if I need the destructor? Did it cross your mind that the code sample I gave here and on premier support is vastly simplified?

As for "writing compiler friendly code", that's a double-edged sword. What is friendly for one compiler might not be friendly at all for another.

I am a little surprised that developers still hand optimize code using intrinsics, why not rely on automatic vectorization here?

It is actually very simple:

  1. The compiler cannot vectorize everything (try type conversions, for example, or code like a[b[i]]).
  2. Code written with intrinsics will work almost as fast if you compile it using MSVC.
  3. New compiler versions may introduce regressions where something that vectorized earlier does not vectorize anymore, or it suddenly has lower performance because engineers had to make a trade-off somewhere.

I did some more testing, and I found that the failure is seen only when the destructor is present - maybe the destructor is the origin of the problem?

If you bothered to read my post about the bug more carefully, or at least to run that sample code in a debugger you would have noticed that the compiler stores the number of array elements at the beginning of the allocated memory and then increments the pointer and passes that incremented (and thus unaligned) pointer when it returns from new[].

That number is used so the compiler knows how many times it has to invoke the destructor. If you remove the destructor then most likely the number of array elements does not get stored and the alignment stays correct.

EDIT:

Lars, I am having trouble getting your test code to work with ICC 10.1.021. It compiles but it crashes with exception 0xC00000FD even though I passed /STACK option. Same happens if I remove /STACK, compile to object and then link separately with /STACK.

Regards,
Igor Levicki

> Code written with intrinsics will work almost as fast if you compile it using MSVC.

Igor, I challenge you to beat the 4000x4000 float matrix multiplication I posted earlier in this thread using any tool you want: MSVC, intrinsics or inline assembly. If you come even close to the performance of Intel C++ I will be utterly surprised, because even the hand-optimized Intel MKL library is slower. Oh, there are many such examples where no human being in a limited amount of time can beat the Intel FORTRAN or C++ compiler. Nor do most people want to dig so deep down into intrinsics or assembly code to resolve such standard tasks as matrix multiplication!

BTW, I would suggest reading the excellent manuals of Agner Fog before you start...

Best Regards,

Lars Petter

> It compiles but it crashes with exception 0xC00000FD even though I passed /STACK option.

Sorry for the confusion Igor, this is an option to the linker.

Then I guess I am at a slight advantage, because I read those manuals a long time ago.

By the way, I know that /STACK is a linker option. However, code compiled with ICC 10.1.021 still crashes with said exception. Any ideas?

Regards,
Igor Levicki

Igor Levicki:

As for avoiding destructor, what if you need to derive a class which also has to do some house-keeping of its own? What if you need virtual destructor? Sorry, removing it is not a solution, just a workaround and a lousy one at best because it might break in the future.

My experience is that advanced C++ and high performance are not always good friends. For the sake of software structure you may use advanced language features like inheritance, virtual functions and templates, but if you are not careful, this may also lead to poor performance - it is a common misconception among most C++ programmers that C++ actually is so useful in supercomputing. They have all been misled by Bjarne Stroustrup's The C++ Programming Language (2007). He states that "Consequently, the standard library provides a vector - called valarray - designed specifically for speed of the usual numeric vector operations." Following his code (pp 662-674),

 #include <valarray>
 #include <cstdlib>
 #include <cstdio>
 #include <ctime>
 #define SIZE 4000
 template<class T> class Slice_iter
 {
 	std::valarray<T> *v;
 	std::slice s;
 	size_t curr;
 	T& ref(size_t i) const {return (*v)[s.start()+i*s.stride()]; }
 public:
 	Slice_iter(std::valarray<T> *vv, std::slice ss): v(vv), s(ss), curr(0) { }
 	T& operator[](size_t i){ return ref(i); }
 	T& operator[](size_t i) const { return ref(i); }
 };
 template<class T> class Cslice_iter
 {
 	std::valarray<T> *v;
 	std::slice s;
 	size_t curr;
 	T& ref(size_t i) const {return (*v)[s.start()+i*s.stride()]; }
 public:
 	Cslice_iter(std::valarray<T> *vv, std::slice ss): v(vv), s(ss), curr(0) { }
 	T& operator[](size_t i) const { return ref(i); }
 };
 class Matrix 
 {
 	std::valarray< float > *v;
 	size_t r, c;
 public:
 	Matrix(size_t x, size_t y){r=x;c=y;v = new std::valarray< float >(0.0f,x*y);}
 	Slice_iter< float > row(size_t i ) { return Slice_iter< float >(v, std::slice(i*c, c, 1)); }
 	Cslice_iter< float > row(size_t i ) const { return Cslice_iter< float >(v, std::slice(i*c, c, 1)); }
 	Slice_iter< float > operator[](size_t i) {return row(i);}
 	Cslice_iter< float > operator[](size_t i) const {return row(i);}
 };
 int main()
 {
 	int i, j, k;
 	Matrix a(SIZE,SIZE),b(SIZE,SIZE),c(SIZE,SIZE);
 	clock_t start, finish;
 	for(i = 0; i < SIZE; i++){
 		for(j = 0; j < SIZE; j++){
 			b[i][j] = a[i][j] = (float)rand()/RAND_MAX;
 			c[i][j] = 0.0;
 		}
 	}
 	start = clock();
 	for(i = 0; i < SIZE; i++){
 		for(k = 0; k < SIZE; k++){
 			for(j = 0; j < SIZE; j++){
 				c[i][j] += a[i][k]*b[k][j];
 			}
 		}
 	}
 	finish = clock();
 	printf("%.2f seconds\n", (double)(finish - start)/CLOCKS_PER_SEC);
 	return 0;
 }

extends the 4000x4000 float matrix multiplication simulation time to 620.2 seconds, even with the latest Intel C++ 10.0 compiler. I have seen developers arguing that the Intel C++ Compiler is no better than other compilers, when the actual problem is that the software they have written is completely resistant to any compiler optimizations, exactly like the above Matrix suggested by Bjarne Stroustrup.

Best Regards,

Lars Petter Endresen.

Sorry for the confusion Igor - you need to set a global environment variable KMP_STACKSIZE to 512m. When you have an executable that runs, please report back the fastest of 5 successive runs, because the program runs better when the caches are warm.

To help you out Igor, I have posted a showstopper to premier support: "Cannot align class with destructor", Issue Number 481167. My experience with premier support is that they usually resolve issues quickly, in particular if I manage to convince them that the issue is of importance for the general public. I never write any code with destructors; that's why I never saw your crash, Igor (I like classes and the many other features of C++ like copy-constructors, and in particular the composition paradigm, which is good for performance).

He states that "Consequently, the standard library provides a vector - called valarray - designed specifically for speed of the usual numeric vector operations."

Well, I haven't tested those and I do not intend to use them. I have noticed, however, that ICC generally has poorer performance than even MSVC when working with templates and some of the more bizarre aspects of C++, such as that array class you are mentioning.

I have seen developers argueing that Intel C++ Compiler is no better than other compilers when the actual problem is that the software they have written is completely resistant to any compiler optimizations, exactly like the above Matrix suggested by Bjarne Stroustrup.

And they are most likely right: not only is it not better, it seems to be worse.

Sorry for the confusion Igor - you need to set a global environment variable KMP_STACKSIZE to 512m.

I solved that by making the arrays global.

To help you out Igor, I have posted a showstopper to premier support "Cannot align class with destructor Issue Number 481167".

It annoys me to no end when someone skims over my posts when I take so much time to write them as precisely as possible... Lars why haven't you read my first post? I have already reported that alignment problem as a showstopper three months ago, and now they have two reports which may actually delay the fix until they figure out it is a duplicate.

As for your code sample, MSVC takes 61.67 seconds. ICC takes 7.55 seconds on Core 2 Duo E8200 (2.66GHz).

A simple change (which took me less than 2 minutes) in the assembler code generated by MSVC, from:

	lea	edi, DWORD PTR [edx*4]
	npad	3
$LL20@mat_mul_c:
; Line 17
	movss	xmm0, DWORD PTR [ecx]
	mulss	xmm0, DWORD PTR [esi]
	addss	xmm0, DWORD PTR [eax]
	movss	DWORD PTR [eax], xmm0
	movss	xmm0, DWORD PTR [ecx+4]
	mulss	xmm0, DWORD PTR [esi]
	addss	xmm0, DWORD PTR [eax+4]
	movss	DWORD PTR [eax+4], xmm0
	movss	xmm0, DWORD PTR [ecx+8]
	mulss	xmm0, DWORD PTR [esi]
	addss	xmm0, DWORD PTR [eax+8]
	movss	DWORD PTR [eax+8], xmm0
	movss	xmm0, DWORD PTR [ecx+12]
	mulss	xmm0, DWORD PTR [esi]
	addss	xmm0, DWORD PTR [eax+12]
	movss	DWORD PTR [eax+12], xmm0

To:

	lea	edi, DWORD PTR [edx*4]
	movss	xmm1, dword ptr [esi]
	shufps	xmm1, xmm1, 0
$LL20@mat_mul_c:
; Line 17
	movaps	xmm0, xmmword ptr [ecx]
	mulps	xmm0, xmm1
	addps	xmm0, xmmword ptr [eax]
	movaps	xmmword ptr [eax], xmm0

Brings the time down from 61.67 to 42.64 seconds and it is still single-threaded and not cache optimized. Should I continue?

Regards,
Igor Levicki
Igor Levicki:

And they are most likely right: not only is it not better, it seems to be worse.

Let us put ourselves in the shoes of the Intel Compiler Team - should they honor stupid people writing stupid software, or smart people writing smart software? I mean, they have limited resources and must make some choices about what the most important features of their compiler are. A couple of years ago they "merged" the advanced optimizations of Compaq Visual Fortran, first into Intel Visual Fortran and then into Intel C++, meaning that similar semantics in C++ and FORTRAN give almost exactly the same assembly code and performance (try it yourselves!), in particular if you write FORTRAN-style C code, using features in C99 like restrict. But this is not the approach of most C++ programmers, who have been misled by many C++ books like "Effective C++" and "More Effective C++".

To summarize, depending on the problem at hand, use:

  1. Intel C++ without intrinsics,
  2. Intel C++ with intrinsics,
  3. Microsoft C++ with/without intrinsics.

In large software projects I prefer to use Microsoft C++ for the code that is not performance critical because it compiles so quickly, but I try to write as much as possible of the performance-critical code without using intrinsics. My experience is also that many well-written codes written for other platforms or systems give excellent performance "out of the box" with Intel C++, because many developers have known for a long time what is meant by "compiler friendly".

You do not need to improve the matrix multiplication any more, my point is that in many cases it is quite difficult to match the compiler with your own code (implementing unroll-and-jam may also be a laborious process...)

Best Regards,

Lars Petter Endresen

Igor,

Take a look at an independent FORTRAN compiler benchmark below; you'll easily figure out which compiler is the preferred choice. If someone took the effort to translate all these benchmarks to C99 using restrict, the performance advantage of Intel C++ would be the same - I do not think Microsoft C++ would be able to compete in this "formula one league" of supercomputing compilers.

Best Regards,

Lars Petter Endresen

Benchmark run times (fastest is best):

Benchmark       Absoft    ftn95     g95       gfortran  intel     Lahey     Nag       pgi
                10.0.8    5.20.0    0.91      4.3.0     10.1.011  7.10.0    5.0       7.1-6
AC              27.57     29.86     31.76     19.61     11.78     33.84     38.49     34.22
AERMOD          39.06     78.82     96.88     52.32     35.54     60.81     87.58     41.99
AIR             13.93     29.48     18.12     15.89     10.61     19.72     16.13     15.00
CAPACITA        69.47     121.63    84.06     61.80     65.88     95.49     88.95     68.31
CHANNEL         6.69      11.50     22.66     3.27      3.72      7.73      6.63      3.78
DODUC           74.19     127.84    86.29     74.35     51.78     89.10     113.25    72.95
FATIGUE         13.34     37.07     98.65     21.21     13.93     27.05     31.93     18.40
GAS_DYN         7.87      69.73     69.04     14.33     4.93      23.70     71.43     53.50
INDUCT          95.03     182.72    111.08    92.63     84.99     170.05    160.80    83.64
LINPK           25.34     25.78     26.33     25.41     25.31     25.81     25.60     25.94
MDBX            24.83     55.41     24.09     21.81     23.31     38.84     26.99     23.34
NF              33.71     57.03     59.16     37.28     28.96     45.31     36.42     32.34
PROTEIN         58.98     116.69    106.97    62.35     58.04     96.71     85.88     73.88
RNFLOW          44.42     54.12     51.04     40.52     50.44     45.31     55.05     52.43
TEST_FPU        19.81     30.70     33.91     17.36     14.79     20.99     22.18     17.63
TFFT            3.95      5.50      4.27      3.67      3.56      3.96      4.34      4.05
Geometric Mean  24.94     46.68     44.11     24.98     20.35     34.89     37.49     28.42

Compiler Switches

Absoft
f95 -V -m32 -Ofast -speed_math=9 -WOPT:if_conv=off -LNO:fu=9:full_unroll_size=7000 -march=em64t -H60 -xINTEGER -stack:0x8000000

FTN95
ftn95 /p6 /optimise (slink was used to increase the stack size)

g95
g95 -march=nocona -ffast-math -funroll-loops -O3

gfortran
gfortran -march=native -funroll-loops -O3

Intel
ifort /O3 /Qipo /QxT /Qprec-div- /link /stack:64000000

Lahey
lf95 -inline (35) -o1 -sse2 -nstchk -tp4 -ntrace -unroll (6) -zfm

NAG
f95 -O4 -V

PGI
pgf90 -Bstatic -V -fastsse -Munroll=n:4 -Mipa=fast,inline -tp core2

As you two have pointed out, the lack of auto-vectorization in MSVC prevents it from competing in performance with other compilers, for the applications under discussion. That has not prevented MSVC from taking 90% of the commercial C++ compiler market; some customers consider lack of support for restrict and auto-vectorization a virtue, and more don't care. C++ standards don't recognize restrict, so each compiler which implements it does so in its own way.

As for auto-vectorization of C++ STL, g++ has taken the lead there (using __restrict__). Still, in many cases, ICL may catch up with liberal use of #pragma ivdep.

Apart from the lack of automatic vectorization and restrict, MSVC has:

  • No Loop Permutation nor Interchange
  • No Loop Distribution
  • No Loop Fusion
  • No Data Prefetching
  • No Scalar Replacement
  • No Unroll and Jam
  • No Loop Blocking nor Tiling
  • No Partial-Sum Optimization
  • No Loadpair Optimization
  • No Predicate Optimization
  • No Loop Versioning with Low Trip-Count Check
  • No Loop Reversal
  • No Profile-Guided Loop Unrolling
  • No Loop Peeling
  • No Data Transformation: Malloc Combining and Memset Combining
  • No Loop Rerolling
  • No Memset and Memcpy Recognition
  • No Statement Sinking for Creating Perfect Loopnests

MSVC is surely not a supercomputing compiler.

Best Regards,

Lars Petter Endresen

That's an interesting wish list, possibly taken from a document on Intel C++ 9 for Itanium. Do you have a benchmark which demonstrates presence and usefulness of each optimization?

Some of those optimizations have been specific to Intel compilers for Itanium; some were present in a past version and not retained; some may be about to make an appearance in compilers for Xeon.

Loop reversal is one of my pet wishes. I have had Intel and gnu compiler issues on file for years on this. There isn't sufficient incentive for it, and many others on the list, in a non-vector, non-auto-parallel compiler. Current compilers sometimes vectorize loops by techniques which don't work as well as reversal.

MSVC deals with certain optimizations only by use of options which aren't compatible with a mixed build with Fortran or C compilers other than MSVC. Your implied point about MSVC not being aimed at supercomputing is one I would agree with.

An excellent FORTRAN 77 code I was so lucky to work with some time ago was so well implemented that I saw many loop transformations in action. Actually, many "timeless" functions in that code were written more than 20 years ago, and are impossible to improve any further. I cannot easily remember which of the loop transformations (taken from the Intel 10.1.021 Compiler Help) we did not take advantage of, but this software really is an interesting case study.

Best Regards,

Lars Petter Endresen

should they honor stupid people writing stupid software or smart people writing smart software?

Lars, I wouldn't call people who know how to use advanced C++ features stupid. That just isn't right.

I see that you are heavily biased against C++. I believe that C++ has its own place for certain tasks, and that a compiler which bears the title "Intel C++ Compiler" should perform equally well with C++ code as it does with Fortranized C code. That makes further discussion pointless.

Furthermore, I do not understand why you are advocating use of ICC so fiercely when the original poster only asked how it works with intrinsics. In my opinion it is rude to be so pushy about something you like, and to try to persuade others into it. Saying that ICC does a better job at code optimization and pointing a person to a free trial is enough. You sound like a used car salesman.

You do not need to improve the matrix multiplication any more, my point is that in many cases it is quite difficult to match the compiler with your own code (implementing unroll-and-jam may also be a laborious process...)

I just shaved off 20 seconds with a simple tweak that took a minute to do. I am sure I could at least match it if I had enough time to waste.

Regards,
Igor Levicki

Indeed you are right. "Stupid" is not the right word in this context. Sorry. You have a valid point. Let us instead call the developers more or less misled, and the software they write more or less compiler friendly. Regarding the point about being pushy I also agree, but the current situation is that MSVC has 90% of the market, and the C++ literature is currently being overwhelmed with books that put forward all the terrific capabilities of advanced C++, including STL and all the bizarre language features you know about.

Now, when we have a test case where the "less is more" naive C implementation is on the order of ~100x faster than the implementation in the "C++ Bible", I think it is fair to be a little pushy - this is more reminiscent of David vs Goliath than anything else. My main point is that MSVC is not as suitable for supercomputing as many other compilers, including GCC, which has a very interesting ongoing vectorization project. Try compiling the test case with GCC and you will see that MSVC is not doing so well.

The best developers I know are C++ developers who use many of the advanced features of the language, but they are aware of the potential side effects of their choices. In most situations MSVC using C++ with STL is an excellent choice, since most software in use is not performance critical. To master C++ on all levels, from high-level abstractions to the low-level SIMD and ILP instructions generated by the compiler, is quite tricky, but compilers are making progress, and for example Boost may be better optimized with many future compilers.
