(Really) slow performance with OpenMP + Template

Hi all,

I see very strange performance behaviour in my small hand-written matrix multiplication code and have no idea where to start, especially since g++ generates code with the expected performance. Any suggestions would be welcome.

I have the following two functions:

[cpp]template <int block_size, typename datatype>
inline void matrix_mul_par_blocked (datatype ** A, datatype ** B, datatype ** C, const int size) {

	#pragma omp parallel for
	for (int x_block = 0; x_block < size; x_block += block_size) {
		for (int y_block = 0; y_block < size; y_block += block_size) {

			// calculate block C_sub of C ((x_block, y_block), (x_block+block_size-1, y_block+block_size-1))
			for (int X = 0; X < size; X += block_size) {

				// (i, j) are the coordinates for the block in C_sub
				for (int i = x_block; i < x_block + block_size; ++i) {
					for (int k = X; k < X + block_size; ++k) {
						//#pragma ivdep
						for (int j = y_block; j < y_block + block_size; ++j) {
							C[i][j] += A[i][k] * B[k][j];
						}
					}
				}

			}
		}
	}
}[/cpp]

and
[cpp]template <int block_size, typename datatype>
inline void matrix_mul_seq_blocked (datatype ** A, datatype ** B, datatype ** C, const int size) {

	for (int x_block = 0; x_block < size; x_block += block_size) {
		for (int y_block = 0; y_block < size; y_block += block_size) {

			// calculate block C_sub of C ((x_block, y_block), (x_block+block_size-1, y_block+block_size-1))
			for (int X = 0; X < size; X += block_size) {

				// (i, j) are the coordinates for the block in C_sub
				for (int i = x_block; i < x_block + block_size; ++i) {
					for (int k = X; k < X + block_size; ++k) {
						//#pragma ivdep
						for (int j = y_block; j < y_block + block_size; ++j) {
							C[i][j] += A[i][k] * B[k][j];
						}
					}
				}

			}
		}
	}
}[/cpp]

These two functions should be identical except for the OpenMP pragma. I compile with the following parameters:
icpc -xHost -fast -ansi-alias -openmp -ipo main.cpp -o locality
On my Core i7 I get the following performance numbers (Note that the parallel version was running with just one thread).
matrix_mul_par_blocked:

parallel 1 threads + 1024*1024 blocked: 2.44271, 6707.3 MBytes/s 0.87914 GFlop/s
parallel 1 threads + 512*512 blocked: 2.43206, 6736.69 MBytes/s 0.882991 GFlop/s
parallel 1 threads + 256*256 blocked: 2.4443, 6702.94 MBytes/s 0.878567 GFlop/s
parallel 1 threads + 128*128 blocked: 2.45334, 6678.24 MBytes/s 0.87533 GFlop/s
parallel 1 threads + 64*64 blocked: 2.4501, 6687.08 MBytes/s 0.87649 GFlop/s
parallel 1 threads + 32*32 blocked: 2.51444, 6515.96 MBytes/s 0.85406 GFlop/s
parallel 1 threads + 16*16 blocked: 2.60151, 6297.88 MBytes/s 0.825476 GFlop/s
parallel 1 threads + 8*8 blocked: 2.7293, 6003 MBytes/s 0.786825 GFlop/s
parallel 1 threads + 4*4 blocked: 3.31869, 4936.88 MBytes/s 0.647087 GFlop/s
matrix_mul_seq_blocked:

sequential + 1024*1024 blocked: 0.390563, 41949.7 MBytes/s 5.49843 GFlop/s
sequential + 512*512 blocked: 0.370856, 44178.9 MBytes/s 5.79061 GFlop/s
sequential + 256*256 blocked: 0.424052, 38636.8 MBytes/s 5.0642 GFlop/s
sequential + 128*128 blocked: 0.362724, 45169.3 MBytes/s 5.92043 GFlop/s
sequential + 64*64 blocked: 0.332515, 49273 MBytes/s 6.45831 GFlop/s
sequential + 32*32 blocked: 0.438981, 37322.8 MBytes/s 4.89197 GFlop/s
sequential + 16*16 blocked: 0.450894, 36336.7 MBytes/s 4.76272 GFlop/s
sequential + 8*8 blocked: 0.579346, 28280.2 MBytes/s 3.70674 GFlop/s
sequential + 4*4 blocked: 1.09335, 14985.1 MBytes/s 1.96413 GFlop/s
There is a performance difference of almost a factor of 10 between the two versions. Note that this factor is identical for different matrix sizes.

Edit: datatype can be either float or double; the performance problem stays the same.

Quoting - jbreitbart

There is a performance difference of almost a factor of 10 between both versions. Note that this performance difference is identical for different matrix sizes.

Your icpc option implies aggressive multiple-level loop optimizations "hpo" which are disabled when you set -openmp. If the performance drops below g++, with same thread numbers, you have grounds for filing a performance issue on premier.intel.com, if you care to do so.
Among things to look for (including opt-report and in .S code) would be:
Does icpc make an appropriate assumption about loop lengths? Can you help it with a pragma, e.g.:
#pragma loop count avg(block_size) // use actual value for block_size
Does icpc gain (in the non-openmp case) by switching to a dot product organization, or by other loop nest swaps or even second guessing your blocking?
Do KMP_AFFINITY settings help with OpenMP, particularly if you have HT enabled?
Are you running 1 thread per core? icpc would default to 2 threads per core, if HT is enabled.

Quoting - Jens

On my Core i7 I get the following performance numbers (Note that the parallel version was running with just one thread).

One thread for the parallel version? It should be slower in that case because there is some overhead, but I'm not sure how much.

How about the performance with 4 threads?

Jennifer

If you are up for some experimentation try the following

int nBlocks = size / block_size;
#pragma omp parallel for
for (int x_block_n = 0; x_block_n < nBlocks; ++x_block_n) {
	int x_block = x_block_n * block_size;
	// ... rest of the loop nest unchanged ...
}

IOW use a stride of 1 in the parallel loop.

The second thing to do is to _remove_ inline from the function.
Too often I have found that the compiler's attempts to preserve registers across an inlined call increase register pressure. Removing inline costs a little time in the call and return (with save and restore of registers) but relieves register pressure while executing the nested loops.

Should one of these suggestions show improvement then I suggest you submit a report to premier support showing the difference.

Jim Dempsey

www.quickthreadprogramming.com

Quoting - tim18

Your icpc option implies aggressive multiple-level loop optimizations "hpo" which are disabled when you set -openmp. [...]

I have written all my different matrix multiplication versions in header files, which are included in only one main .cpp file, which gets compiled with the options named above. I guess hpo is disabled for all functions as soon as -openmp is passed, not only for the ones that use OpenMP? If that is true, then this should not be my problem, but I should probably test the sequential version without -openmp and see how it performs.

I have included some measurements taken when compiling my code with g++ (options: -O3 -fopenmp) at the end of this post. It does not show the same performance behavior, even though the sequential version is slower than that of icpc. Surprisingly, the parallel version with 1 thread is even faster than the sequential version.

Thanks for the other suggestions. I had a look at the opt-report but could not find anything that looked suspicious to me, but then again it is the first time I have looked at such a report. How do I generate the .S code, and what is it? Sorry, but searching for ".S" didn't work out that well.

Regarding the more generic optimizations: yes, it is possible to improve the performance by a great deal. IIRC the fastest sequential version provides between 13 and 15 GFlop/s on matrices that don't fit in the cache. However, the parallel versions of these functions suffer from the same behavior as the one posted in my first post, and the posted one looks much simpler.

-Jens

measurements done with g++:

sequential + 1024*1024 blocked: 1.72641, 9490.19 MBytes/s 1.2439 GFlop/s
sequential + 512*512 blocked: 1.71174, 9571.58 MBytes/s 1.25457 GFlop/s
sequential + 256*256 blocked: 1.72388, 9504.13 MBytes/s 1.24572 GFlop/s
sequential + 128*128 blocked: 1.75157, 9353.89 MBytes/s 1.22603 GFlop/s
sequential + 64*64 blocked: 1.68367, 9731.13 MBytes/s 1.27548 GFlop/s
sequential + 32*32 blocked: 1.72891, 9476.51 MBytes/s 1.2421 GFlop/s
sequential + 16*16 blocked: 2.14699, 7631.16 MBytes/s 1.00023 GFlop/s
sequential + 8*8 blocked: 1.86026, 8807.39 MBytes/s 1.1544 GFlop/s
sequential + 4*4 blocked: 2.24707, 7291.28 MBytes/s 0.955683 GFlop/s

parallel 1 threads + 1024*1024 blocked: 1.26326, 12969.6 MBytes/s 1.69996 GFlop/s
parallel 1 threads + 512*512 blocked: 1.24727, 13135.9 MBytes/s 1.72174 GFlop/s
parallel 1 threads + 256*256 blocked: 1.26574, 12944.3 MBytes/s 1.69663 GFlop/s
parallel 1 threads + 128*128 blocked: 1.28263, 12773.7 MBytes/s 1.67427 GFlop/s
parallel 1 threads + 64*64 blocked: 1.21504, 13484.3 MBytes/s 1.76741 GFlop/s
parallel 1 threads + 32*32 blocked: 1.25454, 13059.8 MBytes/s 1.71177 GFlop/s
parallel 1 threads + 16*16 blocked: 1.82677, 8968.83 MBytes/s 1.17556 GFlop/s
parallel 1 threads + 8*8 blocked: 1.87928, 8718.21 MBytes/s 1.14271 GFlop/s
parallel 1 threads + 4*4 blocked: 2.25431, 7267.85 MBytes/s 0.952612 GFlop/s

parallel 4 threads + 1024*1024 blocked: 1.26349, 12967.3 MBytes/s 1.69965 GFlop/s
parallel 4 threads + 512*512 blocked: 0.624334, 26242.4 MBytes/s 3.43964 GFlop/s
parallel 4 threads + 256*256 blocked: 0.32315, 50700.9 MBytes/s 6.64547 GFlop/s
parallel 4 threads + 128*128 blocked: 0.321153, 51016.2 MBytes/s 6.68679 GFlop/s
parallel 4 threads + 64*64 blocked: 0.303879, 53916.2 MBytes/s 7.0669 GFlop/s
parallel 4 threads + 32*32 blocked: 0.314014, 52176 MBytes/s 6.83881 GFlop/s
parallel 4 threads + 16*16 blocked: 0.462128, 35453.4 MBytes/s 4.64695 GFlop/s
parallel 4 threads + 8*8 blocked: 0.469638, 34886.4 MBytes/s 4.57264 GFlop/s
parallel 4 threads + 4*4 blocked: 0.564996, 28998.4 MBytes/s 3.80088 GFlop/s

Quoting - Jennifer Jiang (Intel)

One thread for the parallel version? It should be slower in that case because there is some overhead, but I'm not sure how much.

How about the performance with 4 threads?

The performance with 4 threads is at the end of this post. It is not scaling that well. In this test case the matrix size was 1024*1024.

I probably should have made my time measurements clearer. The first column in my measurements is the runtime in seconds, so there is a ~2 second overhead -- I guess this is too much just for a #pragma omp parallel for. Furthermore, the difference is not a constant factor but increases with matrix size. For example, for a matrix of size 2048*2048 the difference is about 18 seconds (10 seconds vs. 28 seconds).

-Jens

Performance with 4 threads:

parallel 4 threads + 1024*1024 blocked: 3.62622, 4518.2 MBytes/s 0.59221 GFlop/s
parallel 4 threads + 512*512 blocked: 1.8004, 9100.19 MBytes/s 1.19278 GFlop/s
parallel 4 threads + 256*256 blocked: 0.909177, 18020.7 MBytes/s 2.36201 GFlop/s
parallel 4 threads + 128*128 blocked: 0.902628, 18151.4 MBytes/s 2.37915 GFlop/s
parallel 4 threads + 64*64 blocked: 0.895904, 18287.7 MBytes/s 2.397 GFlop/s
parallel 4 threads + 32*32 blocked: 0.893991, 18326.8 MBytes/s 2.40213 GFlop/s
parallel 4 threads + 16*16 blocked: 0.909282, 18018.6 MBytes/s 2.36174 GFlop/s
parallel 4 threads + 8*8 blocked: 0.938087, 17465.3 MBytes/s 2.28922 GFlop/s
parallel 4 threads + 4*4 blocked: 1.04614, 15661.3 MBytes/s 2.05276 GFlop/s

The performance with 8 threads is (almost) identical.

Quoting - jimdempseyatthecove

If you are up for some experimentation [...] use a stride of 1 in the parallel loop. The second thing to do is to _remove_ inline from the function.

I have made the changes you suggested, but they hardly made any difference in my tests, and the parallel version is still slower than the sequential version. I also tried to change the template parameter into a function parameter (which is possible here, but not in all my matrix-matrix multiplication versions), but this again made no difference in performance.

-Jens

Sorry for again replying to my own thread. I am afraid I have misled you all in the wrong direction, as the problem does not come directly from the functions shown above. If I call matrix_mul_par_blocked directly, its performance is almost as good as that of its sequential counterpart. However, I normally use some template metaprogramming to generate code for the different block sizes. Here is the code for the parallel version; the code for the sequential version is identical except that it calls the sequential matrix multiplication function.

template <int max, int max_block_size>
struct caller_par_blocked {

	template <typename T>
	caller_par_blocked (T A, T B, T C, T C_ref, const int size, const int t) {
		if (max == max_block_size) std::cout << "*********" << std::endl;
		timeval tv1, tv2;

		if (max <= size) {
			gettimeofday(&tv1, NULL);
			matrix_mul_par_blocked<max> (A, B, C, size);
			gettimeofday(&tv2, NULL);
			std::cout << "parallel " << t << " threads + " << max << "*" << max << " blocked: "
			     << timediff(tv2, tv1) << ", " << memory_bandwidth(size, timediff(tv2, tv1))
			     << " MBytes/s " << g_flops (size, timediff(tv2, tv1)) << " GFlop/s" << std::endl;
			compare_matrix (C, C_ref, size);
			fill_zero (C, size);
		}

		caller_par_blocked<max / 2, max_block_size> c(A, B, C, C_ref, size, t);
	}
};

template <int max_block_size>
struct caller_par_blocked<2, max_block_size> {
	template <typename T>
	caller_par_blocked (T /*A*/, T /*B*/, T /*C*/, T /*C_ref*/, const int /*size*/, const int /*t*/) {}
};

In my main function I just call

caller_par_blocked<1024, 1024> (A, B, C, C_ref, size, t);

to get the function calls for all block sizes. Can anyone explain why this caller struct affects performance the way it does? And why this only happens when the called function contains an OpenMP pragma?

-Jens

So far, you've demonstrated that guesses with incomplete or misleading information lead us nowhere. Continuing in that vein, guessing that you've changed the level of indirection in the pointers you pass, it's absolutely essential for the compiler to be able to see its way around potential pointer aliasing in the inner loop(s). If the compiler is able to see that a conversion to an inner loop dot product is possible, that should help tremendously with optimization.
Both g++ and icpc have adopted extensions of the C99 restrict concept for this reason. icpc also has a noalias option which may permit optimizations which aren't safe according to C++ standard.
In a case like this, if you reorganize your loops so as to use inner_product, you may get full optimization without any noalias extensions. Of course, you should do better yet with MKL library calls, given that your problem is large enough to consider cache blocking and OpenMP.
