Improve the performance of a sum

Hello Intel,

I am using a scientific calculation code, and I want to improve it a little if possible. I checked the code with Amplifier. The most time-consuming (heavily used) code is this:

double a = 0.0;
 for(j = 0; j < n; j++) a += w[j]*fi[((index[j] + i)<<ldf) + k];

To me it is just a dot product between w and fi. I am wondering:

1. Will the Intel compiler do it automatically? (I mean, treat the loop as the dot product of two vectorized arrays.)

2. Is there a way to improve the code? (For example, define another array a1 of the same size as w, store all the products in a1 (an unrolled loop?), and do the summation at the end.)

3. Other suggestions?

I am using Parallel Composer 2013 with Visual Studio. Any idea will be appreciated! :)


SSE4.1 or later would support such optimization better than the earlier instruction sets. You would want to assure that the variables which presumably are loop invariant have local definitions so they are not subject to aliasing. Of course you would assure that your code is standard compliant so you can set -ansi-alias.
Did you look up the icc pragmas? e.g.
#pragma simd reduction(+: a)
If the loop is big enough you may wish to add threaded reduction on top of vector reduction, using OpenMP or Cilk+
What do -opt-report and the guide options say?
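For illustration, here is how the reduction pragma attaches to the loop from the original post, wrapped in a function (a sketch only; I've used the standard `#pragma omp simd` spelling, which is the OpenMP 4.0 analogue of icc's `#pragma simd`, and the surrounding function signature is my invention):

```cpp
#include <cassert>

// Indirect dot product from the original post. Declaring "a" as a
// reduction variable tells the compiler it may reorder the additions
// and keep partial sums in vector lanes.
double indirect_dot(const double* w, const double* fi,
                    const int* index, int n, int i, int ldf, int k)
{
    double a = 0.0;
#pragma omp simd reduction(+:a)
    for (int j = 0; j < n; j++)
        a += w[j] * fi[((index[j] + i) << ldf) + k];
    return a;
}
```

Without OpenMP enabled the pragma is simply ignored and the loop runs as plain scalar code, so correctness does not depend on it.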

Could you elaborate on " You would want to assure that the variables which presumably are loop invariant have local definition so are not subject to aliasing"?
Thank You.

For example,
for(int j=0.... would give j local scope and avoid the compiler being concerned about side effects on it. Likewise, i, k, and ldf should ideally have local scope, as a does. If you can't use restrict qualifiers on w[] and index[], as you mentioned, you would require #pragma ivdep.
#pragma simd reduction would ignore all such dependencies, obviating the need for #pragma ivdep, but also ignoring proven dependencies, so you should make an effort to get the loop optimized or obtain compiler diagnostics without simd first.

Thanks for your comment, TimP. SSE4.1 did improve the performance, but still not very significantly.

In your post there are a lot of OpenMP-related issues. I didn't know OpenMP, but I think I get the general idea now.

I am just curious about serial optimization of the loop, I mean speeding up the case where the numbers are not next to each other in memory. I also asked the same question on Stack Overflow, and it seems that rearranging the data is the only choice.

I know that I can get some performance improvement by using more than one core. But I want to improve it if possible before I move on to OpenMP or MPI. And I do think that even if I use OpenMP, inefficient cache usage will still be the problem.

Does OpenMP provide more than one thread to work on the non-aliased data set? Could you please tell me some more about your idea?

Indirect access such as you have quoted certainly raises issues of cache locality which can't be discussed without more knowledge of your application.
The point of OpenMP is to parallelize your application across multiple threads. Taking several threads together, there is typically more aggregate cache resource than when running a single thread, so it should help in handling larger data sets efficiently. With the indirect access, it's not necessarily a fait accompli. OpenMP is more often suited to nested loop situations where the inner loop is vectorized and the outer one partitioned (automatically) among threads.
With MPI, you partition your data set so that each CPU has access only to a part of it. It may work well as an outer level of parallelization on top of OpenMP.
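A minimal sketch of that nesting (names and signature are hypothetical; without -fopenmp the pragma is ignored and the loop simply runs serially):

```cpp
#include <cassert>

// The outer loop over "rows" (here, successive values of i) is
// partitioned among threads by OpenMP; the inner indirect dot
// product is left to the vectorizer.
void all_rows(const double* w, const double* fi, const int* index,
              int n, int rows, int ldf, int k, double* out)
{
#pragma omp parallel for
    for (int i = 0; i < rows; i++) {
        double a = 0.0;
        for (int j = 0; j < n; j++)
            a += w[j] * fi[((index[j] + i) << ldf) + k];
        out[i] = a;
    }
}
```

Each iteration of the outer loop is independent (it writes only out[i]), which is exactly what makes the parallel-for legal.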

OK. I'll try OpenMP.
But I still don't understand where the benefit of using more than one thread comes from.
The code that I am working with is a huge calculation in general; the data is around several GB in memory. The loop may apply to big vectors.
For example, think about a 3D matrix with index-x, index-y and index-z, whose elements are stored in memory with x fastest, y medium and z slowest (x1y1z1 -> x2y1z1 -> x3y1z1 -> ... -> x1y2z1 -> x2y2z1 -> ...). But when I want to take a slice of it (for fixed x, y) and multiply it with a given vector, the only thing the CPU can do is load a portion of the matrix, select the piece that is needed, and apply the multiplication. There is data that is not useful at this time and should be ignored.
Then how can multi-threading on a single processor speed this up? Or are there other ways to optimize?

>>...SSE 4.1 did improve the performance. But still not very significant...
.
Did you try to unroll the loop? I have a similar piece of code ( not exactly, of course ) and I remember that 4-in-1 unrolling improved the performance of summation. I'll provide some numbers if you're interested and try to do some testing with your code.
.
>>...the data is around some GB in the memory...
.
How much memory do you have on your system? Is it 32-bit or 64-bit platform?
.
I wouldn't expect a significant improvement in performance if some parts of your data are saved in a Virtual Memory file during the processing due to lack of available Physical Memory.

>>...for(j = 0; j < n; j++) a += w[j]*fi[((index[j] + i) << ldf) + k];
.
Could you provide some technical details for i, ldf and k variables? Could I use any values for test purposes? Thanks in advance.
.
Note: I wonder if the editor problem with the 'less-than' character could be fixed?

>>...Did you try to unroll the loop? I have a similar piece of code ( not exactly, of course ) and I remember that unrolling 4-in-1 improved performance of
>>summation. I'll provide some numbers...
.
Here are results ( two matrices addition / sizes 2048x2048 / single-precision data type 'float' ):
.
- 4-in-1 unroll improves performance by ~3%
- 8-in-1 unroll improves performance by ~4%
.
I always use hard-coded for-loop unrolling because I can't depend on C++ compilers' pragma directives due to portability issues.

>>...Or there are other optimization ways?
.
Here is a list of different techniques I would try:
.
- if accuracy is not a crucial issue, use the single-precision data type 'float' instead of the double-precision data type 'double'
- the distribution of values in the 'index' array needs to be analyzed ( it creates random access to the 'fi' array )
- lock the 'index' array in memory using the 'VirtualLock' Win32 API function
- software prefetch ( as an inline assembler instruction ) could improve performance by 0.5 - 1.0%
- raising the priority of the process / thread to Above Normal or High could improve performance by 0.5 - 1.0% ( do not change priority to Real Time )
- warm up your data by pre-scanning the arrays before processing ( it will load as much data as possible into memory )
.
Let me know if you have any questions.

The following may work better on systems supporting gather.
On systems without gather it may make better use of compiler temporaries.
double a = 0.0;

int* fi_index = (int*)_alloca(n*sizeof(int));

// construct indexes

for(j = 0; j < n; j++) fi_index[j] = ((index[j] + i) << ldf) + k;

// use indexes

for(j = 0; j < n; j++) a += w[j]*fi[fi_index[j]];

*** Please fix this forum such that one can copy and paste sample code without having to figure out html
*** Or take our time to edit and post and reedit and repost and reedit and post, ....
*** Your time taken once will save our time over and over and over again.

Jim Dempsey

www.quickthreadprogramming.com
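For anyone copying Jim's two-pass sketch: the forum ate part of the first loop (it needs an `=`). A compilable version, with std::vector standing in for _alloca so the sketch is portable, might look like:

```cpp
#include <cassert>
#include <vector>

double two_pass_dot(const double* w, const double* fi, const int* index,
                    int n, int i, int ldf, int k)
{
    std::vector<int> fi_index(n);   // stand-in for the _alloca scratch space
    // construct indexes
    for (int j = 0; j < n; j++)
        fi_index[j] = ((index[j] + i) << ldf) + k;
    // use indexes
    double a = 0.0;
    for (int j = 0; j < n; j++)
        a += w[j] * fi[fi_index[j]];
    return a;
}
```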

>>*** Please fix this forum such that one can copy and paste sample code without having to figure out html
>>*** Or take our time to edit and post and reedit and repost and reedit and post, ....
>>*** Your time taken once will save our time over and over and over again.
.
Agree with Jim. On the new IDZ forum the 'Preview' button for a post was deleted for some unknown reason. Why did you do it? I always used the 'Preview' button before submitting a new post, and it allowed me to fix some little issues in the text or source code.

Quote:

Sergey Kostrov wrote:

>>...SSE 4.1 did improve the performance. But still not very significant...
.
Did you try to unroll the loop? I have a similar piece of code ( not exactly, of course ) and I remember that unrolling 4-in-1 improved performance of summation. I'll provide some numbers if interested and try to do some testing with your codes.
.
>>...the data is around some GB in the memory...
.
How much memory do you have on your system? Is it 32-bit or 64-bit platform?
.
I wouldn't expect a significant improvement in performance if some parts of your data are saved in a Virtual Memory file during the processing due to lack of available Physical Memory.


I did: /Qunroll:100000000.
But I don't think the unrolling would contribute. You see, each time the sum is "a += ...". There is a dependency here, meaning step n+1 depends on the previous step. If a were an array, you could unroll the loop, but here a is just a double variable. That's my understanding of unrolling. I added /Qunroll:100000000, but maybe it doesn't take any effect.

I have 4GB free in my system (Windows 7 64-bit, 8GB total, and I am using a 64-bit configuration).

Quote:

Sergey Kostrov wrote:

>>...for(j = 0; j < n; j++) a += w[j]*fi[((index[j] + i) << ldf) + k];
.
Could you provide some technical details for i, ldf and k variables? Could I use any values for test purposes? Thanks in advance.
.
Note: I wonder if the editor problem with the 'less-than' character could be fixed?

Sorry, it is a bit complicated. I don't have a simple example for you to try out.

Here is my understanding:
Overall this is a piece of code for calculating an eigenvalue problem using the conjugate gradient method. Maybe the code above is part of preparing the matrix to solve. The matrix in the test calculation, for example, is a million-by-million sparse matrix, and the non-zero elements number about 3 million I believe (some sort of tri-diagonal). But in actual usage the size may go to 10 million or even bigger.

I started to focus on it because I checked it with Amplifier. By looking at the result, I know the two most time-consuming functions are MKL functions (the dot product and another one I don't know very well; I used the Intel version). That loop is the most significant time consumer other than those.

Quote:

Sergey Kostrov wrote:

>>...Or there are other optimization ways?
.
Here is a list of different techniques I would try:
.
- if accuracy is not a crucial issue, use the single-precision data type 'float' instead of the double-precision data type 'double'
- the distribution of values in the 'index' array needs to be analyzed ( it creates random access to the 'fi' array )
- lock the 'index' array in memory using the 'VirtualLock' Win32 API function
- software prefetch ( as an inline assembler instruction ) could improve performance by 0.5 - 1.0%
- raising the priority of the process / thread to Above Normal or High could improve performance by 0.5 - 1.0% ( do not change priority to Real Time )
- warm up your data by pre-scanning the arrays before processing ( it will load as much data as possible into memory )
.
Let me know if you have any questions.

Using float actually works for shrinking the time, but for the time being I'd like to keep the precision. All the matrices and vectors are already in memory, so there is no loading issue. I believe the bottleneck now is the cache.

I never did raise the priority. I should try it.

Could you explain more:
1. "lock in memory 'index' array using a 'VirtualLock' Win32 API function"
2. "software prefetch ( as inline assembler instruction ) could improve performance by 0.5 - 1.0%"

Thanks a lot for your nice response!

Quote:

jimdempseyatthecove wrote:

The following may work better on systems supporting gather.
On systems without gather it may make better use of compiler temporaries.
double a = 0.0;

int* fi_index = (int*)_alloca(n*sizeof(int));

// construct indexes

for(j = 0; j < n; j++) fi_index[j] = ((index[j] + i) << ldf) + k;

// use indexes

for(j = 0; j < n; j++) a += w[j]*fi[fi_index[j]];

*** Please fix this forum such that one can copy and paste sample code without having to figure out html
*** Or take our time to edit and post and reedit and repost and reedit and post, ....
*** Your time taken once will save our time over and over and over again.

Jim Dempsey

Could you explain more about "systems supporting gather"? Does Windows 7 64-bit support it? What is "gather"?
In the code, the index is computed before the vector product. What is the reason for doing so?

Thanks in advance!

>>...I did. /Qunroll:100000000
.
That number looks too big and I don't think it would help. I always use 4-in-1. Here is a small example:


// Without Unrolling ( 1-in-1 )
for( int i = 0; i < N; i++ )
{
   a += b[i];
}

// With Unrolling ( 4-in-1 )
for( int i = 0; i < N; i+=4 )
{
   a += ( b[i] + b[i+1] + b[i+2] + b[i+3] );
}


Note: A verification that N % 4 equals 0 needs to be done, and if it is not 0, additional processing is required.
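One way to sketch that remainder handling (the function name and signature are mine):

```cpp
#include <cassert>

double sum_unrolled4(const double* b, int N)
{
    double a = 0.0;
    int i = 0;
    // 4-in-1 unrolled main loop; runs while at least 4 elements remain
    for (; i + 4 <= N; i += 4)
        a += ( b[i] + b[i+1] + b[i+2] + b[i+3] );
    // remainder loop handles the N % 4 leftover elements
    for (; i < N; i++)
        a += b[i];
    return a;
}
```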

Hi Jim,
.
>>*** Please fix this forum such that one can copy and paste sample code without having to figure out html
.
I think IDZ developers are trying to force us C/C++ software developers to do more HTML programming. It looks like they are following Renee James' keynote statements about HTML5.
.
Best regards,
Sergey

>>...I did. /Qunroll:100000000
.
Here is a summary from the Intel C++ compiler documentation regarding loop unrolling:
.
Loop Unrolling
.
The benefits of loop unrolling are as follows:
- Unrolling eliminates branches and some of the code.
- Unrolling enables you to aggressively schedule (or pipeline) the loop to hide latencies if you have enough free registers to keep variables live.
- For processors based on the IA-32 architectures, the processor can correctly predict the exit branch for an inner loop that has 16 or fewer iterations, if that number of iterations is predictable and there are no conditional branches in the loop. Therefore, if the loop body size is not excessive, and the probable number of iterations is known, unroll inner loops for the processors, until they have a maximum of 16 iterations
- A potential limitation is that excessive unrolling, or unrolling of very large loops, can lead to increased code size.
.
There is an option /Qunroll-aggressive, and here is a summary of it:
.
...
On systems using IA-64 architecture, you can only specify a value of 0.
...
and
...
This option determines whether the compiler uses more aggressive unrolling for certain loops. The positive form of the option may improve performance.
On IA-32 architecture and Intel® 64 architecture, this option enables aggressive, complete unrolling for loops with small constant trip counts.
On IA-64 architecture, this option enables additional complete unrolling for loops that have multiple exits or outer loops that have a small constant trip count.
...
.
There is also an option -funroll-all-loops and by default it is in OFF state.
.
[SergeyK Note] There is a very small performance improvement from changing unrolling from 4-in-1 to 8-in-1, and I always use 4-in-1. Regarding your concerns about unrolling: I would try it before making a final statement.

>>Could you explain more:
>>1. "lock in memory 'index' array using a 'VirtualLock' Win32 API function"
.
Here is a quote from MSDN:
...
The VirtualLock function locks the specified region of the process's virtual address space into physical memory, ensuring that subsequent access to the region will not incur a page fault.
...
I also recommend that you look at:
.
Intel threading Building Blocks forum: 'Working with Memory Mapped Files?' thread
.
Web-link: http://software.intel.com/en-us/forums/topic/276995
.
Please take a look.

>>...2. "software prefetch ( as inline assembler instruction ) could improve performance by 0.5 - 1.0%"
.
I use the inline assembler instruction 'prefetcht0' instead of a call to the intrinsic function '_mm_prefetch' ( a more portable form with some additional call overhead ). In general it looks like:
.


...

	_asm prefetcht0	[ ptAddress ]

...
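For GCC or Clang users, the same hint can be written with the compiler's `__builtin_prefetch` (my substitution, not Sergey's code; MSVC and icc users would keep `_mm_prefetch` or the inline assembly above):

```cpp
#include <cassert>

double sum_with_prefetch(const double* b, int N)
{
    double a = 0.0;
    for (int i = 0; i < N; i++) {
        // Hint: start fetching data a couple of cache lines ahead;
        // rw=0 means read, locality=3 is roughly prefetcht0.
        if (i + 16 < N)
            __builtin_prefetch(&b[i + 16], 0, 3);
        a += b[i];
    }
    return a;
}
```

The prefetch is only a hint: removing it never changes the result, only (possibly) the timing.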

>>...I never did raise priority. And I should try it.
.
These Win32 API functions should be used:
.
GetCurrentProcess
SetPriorityClass
GetCurrentThread
SetThreadPriority

>>>>...for(j = 0; j < n; j++) a += w[j]*fi[((index[j] + i) << ldf) + k];
>>>>.
>>>>Could you provide some technical details for i, ldf and k variables?
>>
>>Sorry, it is a bit complicated.
.
There are lots of smart guys here, and if you really want to get help, some generic details will help significantly. Could you tell us what the value of 'n' is in the 'for' statement?

Quote:

FortCpp wrote:

Could you explain more about "systems supporting gather"? Does windows 7 64bit support? What is "gather"?
In the code the index is computed before the vector product. What's the reason of doing so?

Thanks in advance!


In the next generation of AVX for Haswell there is a new feature whereby the AVX instruction set permits gather.
Consult: http://software.intel.com/sites/default/files/m/f/7/c/36945
for the nitty-gritty
Note that when your system and compiler support AVX gather, the compiler may generate gather directly from the source code, or you can alternatively use intrinsic functions to perform the gather at a lower level.
Consider the AVX instruction for gathering 4 doubles.
This instruction specifies a 4-wide (dp) AVX destination register, an array address, a ymm register containing 4 indices into the array, and a ymm register containing a mask (to enable or disable loading the array elements specified by the indices).
In the suggestion I made, should you pre-create an array of indices (either all at once, or 4 at a time), then your code can perform the loads more efficiently:
One fetch of 4 indices
One instruction to fetch into a 4-wide register from 4 separate locations in an array (still 4 fetches)
But note that all 4 data elements are now packed into an AVX register without additional instructions.
Now the vector operation(s) can be performed directly (e.g. sum, dot product, ...)
---
On systems without AVX gather, pre-building the array of indices may run faster (give it a try; it will be faster to experiment than to write these back-and-forth forum posts). The reason is that the compiler optimizer may better see what you are doing and generate more efficient code. Try to keep the size of the array of indices small enough to fit in L1 cache. Nest the loop an additional level in the event you exceed the L1 cache size.
Jim Dempsey

www.quickthreadprogramming.com

Quote:

Sergey Kostrov wrote:

>>...I did. /Qunroll:100000000
.
Here is a summary from the Intel C++ compiler documentation regarding loop unrolling:
.
Loop Unrolling
.
The benefits of loop unrolling are as follows:
- Unrolling eliminates branches and some of the code.
- Unrolling enables you to aggressively schedule (or pipeline) the loop to hide latencies if you have enough free registers to keep variables live.
- For processors based on the IA-32 architectures, the processor can correctly predict the exit branch for an inner loop that has 16 or fewer iterations, if that number of iterations is predictable and there are no conditional branches in the loop. Therefore, if the loop body size is not excessive, and the probable number of iterations is known, unroll inner loops for the processors, until they have a maximum of 16 iterations
- A potential limitation is that excessive unrolling, or unrolling of very large loops, can lead to increased code size.
.
There is an option /Qunroll-aggressive, and here is a summary of it:
.
...
On systems using IA-64 architecture, you can only specify a value of 0.
...
and
...
This option determines whether the compiler uses more aggressive unrolling for certain loops. The positive form of the option may improve performance.
On IA-32 architecture and Intel® 64 architecture, this option enables aggressive, complete unrolling for loops with small constant trip counts.
On IA-64 architecture, this option enables additional complete unrolling for loops that have multiple exits or outer loops that have a small constant trip count.
...
.
There is also an option -funroll-all-loops and by default it is in OFF state.
.
[SergeyK Note] There is a very small performance improvement from changing unrolling from 4-in-1 to 8-in-1, and I always use 4-in-1. Regarding your concerns about unrolling: I would try it before making a final statement.

I set a big number for loop unrolling because I thought the compiler would decide whether and how the unrolling should be done. Maybe I should go with the "unroll all loops" option.
You really gave me a lesson on loop unrolling.
I thought unrolling just repeats the loop body several times; I didn't know the loop could be unrolled as you explained ( a += ( b[i] + b[i+1] + b[i+2] + b[i+3] ); ).
I don't think I will manually unroll the loop. But you are right, I should unroll the loop.

Quote:

Sergey Kostrov wrote:

>>...2. "software prefetch ( as inline assembler instruction ) could improve performance by 0.5 - 1.0%"
.
I use the inline assembler instruction 'prefetcht0' instead of a call to the intrinsic function '_mm_prefetch' ( a more portable form with some additional call overhead ). In general it looks like:
.


...

	_asm prefetcht0	[ ptAddress ]

...


I am not sure when to use it. Does it tell the processor to "prefetch" data into cache?

Quote:

Sergey Kostrov wrote:

>>...I never did raise priority. And I should try it.
.
These Win32 API functions should be used:
.
GetCurrentProcess
SetPriorityClass
GetCurrentThread
SetThreadPriority

OK. I'll take this into account.

Quote:

Sergey Kostrov wrote:

>>>>...for(j = 0; j < n; j++) a += w[j]*fi[((index[j] + i) << ldf) + k];
>>>>.
>>>>Could you provide some technical details for i, ldf and k variables?
>>
>>Sorry, it is a bit complicated.
.
There are lots of smart guys here, and if you really want to get help, some generic details will help significantly. Could you tell us what the value of 'n' is in the 'for' statement?


True. I want to tell you more, but it is a code piece from a free software package. This is the innermost layer of the self-consistent iteration loop. And if you browse the code, you can see the author actually did some smart optimization. When I checked it with Amplifier, I learned that the most time-consuming part (except MKL routines) is the loop. To extract a better picture, I have to track through some functions. It is a mixed Fortran/C program; the C part is written so it can be called by Fortran subroutines, aiming at speeding up the code. I'll try to make it clearer over the weekend or maybe early next week. I am sorry, but it is not that easy for me. Just give me some time, and maybe I can post it here. And thanks for your kindness, Sergey.

Quote:

FortCpp wrote:

Hello Intel,

double a = 0.0;
 for(j = 0; j < n; j++) a += w[j]*fi[((index[j] + i)<<ldf) + k];

3. Other suggestions?

How about a small working sample program with dummied-up data that exhibits approximately the same behavior?
This may yield better results for you and require less time from the other readers.
Jim Dempsey
Jim Dempsey

www.quickthreadprogramming.com

>>... It is a fortran C mixed program...
.
Please take into account that I don't have a Fortran compiler. As far as I know, Jim has one. So a pure C or C++ test case should work for everybody.

>>...Does it tell the processor to "prefetch" data into cache?
.
Yes. There is a constant flow of posts regarding that subject on the 'Software Tuning, Performance Optimization & Platform Monitoring' forum.
.
Web-link is: http://software.intel.com/en-us/forums/software-tuning-performance-optim...
.
A complete description of the instruction is in the 'Instruction Set Reference Manual' volume 2B ( from N to Z ). During the last 12 months I read lots of posts regarding that subject, and there are two groups of developers with different points of view:
.
- It helps
- It doesn't help
.
About 9 months ago I did a set of tests, and when 'prefetcht0' was used it improved performance by 0.5 - 1.0%. However, it depends on the algorithm! In my case it was a Strassen Heap-Based Complete algorithm for matrix multiplication. Since you have VTune it will be much easier for you to detect whether it helps.
.
Here is a very simple example:


...
   int iArray[64] = { 0 };
   int iSum = 0;

   _asm prefetcht0 [ iArray ]

   for( int i = 0; i < 64; i++ )
      iSum += iArray[i];
...

Here is a more advanced example:


///////////////////////////////////////////////////////////////////////////////
// Test1143 - Research on Data Prefetching

#if defined ( _RTTEST_1143_INCLUDED )
//	#define _RTPREFETCH_DATA( pAddress, iOffset )
//	#define _RTPREFETCH_DATA( pAddress, iOffset )	_mm_prefetch( ( ( RTchar * )( pAddress ) ) + iOffset, _MM_HINT_T0 )
//	#define _RTPREFETCH_DATA( pAddress )			_asm prefetcht0 [ ##pAddress ]
	#define CrtDataPrefetchT0( pAddress )
	#define CrtDataPrefetchT1( pAddress )
	#define CrtDataPrefetchT2( pAddress )
	#define CrtDataPrefetchNTA( pAddress )
//	#define CrtDataPrefetchT0( pAddress )			_asm prefetcht0 [ ##pAddress ]
//	#define CrtDataPrefetchT1( pAddress )			_asm prefetcht1 [ ##pAddress ]
//	#define CrtDataPrefetchT2( pAddress )			_asm prefetcht2 [ ##pAddress ]
//	#define CrtDataPrefetchNTA( pAddress )			_asm prefetchnta [ ##pAddress ]
	#define _RTPREFETCH_DATASIZE					64
	#define _RTVECTOR_SIZE							2048
#endif

#if defined ( _RTTEST_1143_INCLUDED )
RTvoid RunTest1143( RTvoid )
{
	CrtPrintf( g_szMessageTestStart, 1143 );
	g_uiTicksStart = SysGetTickCount();

	// Sub-Test 1
	{
	///*
		#if ( defined ( _WIN32_MSC ) || defined ( _WIN32_ICC ) )
		RTint iSize1 = sizeof( __m128 );
		RTint iSize2 = sizeof( __m128d );

		__m128 *pVec1 = ( __m128 * )_mm_malloc( _RTVECTOR_SIZE * sizeof( __m128 ), 16 );
		__m128 *pVec2 = ( __m128 * )_mm_malloc( _RTVECTOR_SIZE * sizeof( __m128 ), 16 );
		__m128 *pVec3 = ( __m128 * )_mm_malloc( _RTVECTOR_SIZE * sizeof( __m128 ), 16 );

		RTint i;
		for( i = 0; i < _RTVECTOR_SIZE; i++ )
		{
			pVec1[i].m128_f32[0] = pVec1[i].m128_f32[1] = pVec1[i].m128_f32[2] = pVec1[i].m128_f32[3] = 2.0f;
			pVec2[i].m128_f32[0] = pVec2[i].m128_f32[1] = pVec2[i].m128_f32[2] = pVec2[i].m128_f32[3] = 2.0f;
			pVec3[i].m128_f32[0] = pVec3[i].m128_f32[1] = pVec3[i].m128_f32[2] = pVec3[i].m128_f32[3] = 0.0f;
		}

		g_uiTicksStart = SysGetTickCount();
		for( RTuint t = 0; t < ( ( RTuint )_RTNUMBER_OF_TESTS * 256 ); t++ )
		{
			CrtDataPrefetchT0( pVec1 );
			CrtDataPrefetchT0( pVec2 );
			CrtDataPrefetchT0( pVec3 );

			for( i = 0; i < _RTVECTOR_SIZE; i += 4 )
			{
				pVec3[i  ] = _mm_mul_ps( pVec1[i  ], pVec2[i  ] );
				pVec3[i+1] = _mm_mul_ps( pVec1[i+1], pVec2[i+1] );
				pVec3[i+2] = _mm_mul_ps( pVec1[i+2], pVec2[i+2] );
				pVec3[i+3] = _mm_mul_ps( pVec1[i+3], pVec2[i+3] );
			}
		}
		CrtPrintf( RTU("SSE-Based Vector Multiplication - %ld ticks\n"), ( RTint )( SysGetTickCount() - g_uiTicksStart ) );

		if( pVec1 != RTnull )
			_mm_free( pVec1 );
		if( pVec2 != RTnull )
			_mm_free( pVec2 );
		if( pVec3 != RTnull )
			_mm_free( pVec3 );
		#endif
	//*/
	}

	CrtPrintf( g_szMessageTestCompleted, ( RTint )( SysGetTickCount() - g_uiTicksStart ) );
	CrtPrintf( g_szMessageTestEnd, 1143 );
}
#endif

Here is another tip. I've noticed that:


   for( int i = 0; i < N; i+=4 )                         // 1st form
   {
      a += ( b[i] + b[i+1] + b[i+2] + b[i+3] );
   }

is faster than:

   for( int i = 0; i < N; i+=4 )                         // 2nd form
   {
      a += b[i];
      a += b[i+1];
      a += b[i+2];
      a += b[i+3];
   }
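The reason the 1st form wins is that the three additions inside the parentheses are independent of a, so they are off the serial a += dependency chain. The same idea can be pushed further with separate partial sums (a sketch, assuming N % 4 == 0; note this reorders the floating-point additions, so the result can differ in the last bits):

```cpp
#include <cassert>

double sum_4accum(const double* b, int N)   // assumes N % 4 == 0
{
    // Four independent accumulators -> four independent dependency chains,
    // combined once after the loop.
    double a0 = 0.0, a1 = 0.0, a2 = 0.0, a3 = 0.0;
    for (int i = 0; i < N; i += 4) {
        a0 += b[i];
        a1 += b[i+1];
        a2 += b[i+2];
        a3 += b[i+3];
    }
    return ( a0 + a1 ) + ( a2 + a3 );
}
```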

Quote:

Sergey Kostrov wrote:

Here is another tip. I've noticed that:


   for( int i = 0; i < N; i+=4 )                         // 1st form
   {
      a += ( b[i] + b[i+1] + b[i+2] + b[i+3] );
   }

is faster than:

   for( int i = 0; i < N; i+=4 )                         // 2nd form
   {
      a += b[i];
      a += b[i+1];
      a += b[i+2];
      a += b[i+3];
   }

If you allow a vectorizing compiler to optimize "riffling," you shouldn't have to be concerned about this. If you don't allow the compiler to perform such optimizations, the normal icc option -fp:fast can undo your efforts (and degrade accuracy). The point is that when you add values sequentially in scalar mode, even if the result is registerized, you can add only once per 4 clock cycles or so.

Hi Tim, Thanks for your comments.
.
>>... If you don't allow the compiler to perform such optimizations,
.
Yes, I don't allow it.
.
>>the normal icc option -fp:fast can undo your efforts (and degrade accuracy).
.
A Floating Point Model 'Precise' ( /fp:precise ) is always used on all VS projects I have.

Yes, icc options such as /fp:precise or /fp:source should observe the order of addition specified in the source code, including disabling the optimization of sum reduction. Unlike gcc and ifort, icc has no command-line option for "vectorized" sum reduction while otherwise complying with language standards on evaluation order. This may be one of the reasons for #pragma simd reduction (which overrides fp:precise in that loop).

Hi 'FortCpp',

I've created a simplified test case that simulates your calculations. There is about a 1% performance improvement if the floating-point model is changed from 'Precise' ( /fp:precise ) to 'Fast' ( /fp:fast ). It needs to be verified in your software if you decide to use the 'Fast' floating-point model.

>>...I don't think I will manually unroll the loop. But you are right, I should unroll the loop...

As you can see in both cases test 'UnRolled Loops - 4-in-1 - B' has the best times. More information will be submitted soon.

>> Deterministic Tests <<
>> Array size: 32MB 'double' elements <<

*** Set of Full Sum Tests ***
Full Sum : Rolled Loops - 1-in-1
Sum is: 0.000000
Calculated in 609 ticks

Full Sum : UnRolled Loops - 4-in-1 - A
Sum is: 0.000000
Calculated in 609 ticks

Full Sum : UnRolled Loops - 4-in-1 - B // <= Best Time ( ~23% faster than 1-in-1 test case )
Sum is: 0.000000
Calculated in 469 ticks

Full Sum : UnRolled Loops - 8-in-1 - A
Sum is: 0.000000
Calculated in 1172 ticks

Full Sum : UnRolled Loops - 8-in-1 - B
Sum is: 0.000000
Calculated in 844 ticks

>> Non-Deterministic Tests <<
>> Array size: 32MB 'double' elements <<

*** Set of Full Sum Tests ***
Full Sum : Rolled Loops - 1-in-1
Sum is: 0.000000
Calculated in 2219 ticks

Full Sum : UnRolled Loops - 4-in-1 - A
Sum is: 0.000000
Calculated in 2203 ticks

Full Sum : UnRolled Loops - 4-in-1 - B // <= Best Time ( ~6% faster than 1-in-1 test case )
Sum is: 0.000000
Calculated in 2078 ticks

Full Sum : UnRolled Loops - 8-in-1 - A
Sum is: 0.000000
Calculated in 2860 ticks

Full Sum : UnRolled Loops - 8-in-1 - B
Sum is: 0.000000
Calculated in 2687 ticks

Here is additional information.

>>...
>>double a = 0.0;
>>
>>for( j = 0; j < n; j++ )
>> a += w[j] * fi[ ( (index[j] + i) << ldf ) + k ];
>>
>>To me it is just a dot product between w and fi.
>>...

Access to the 'w' array is sequential. However, access to the 'fi' array may not be sequential ( it could be random ), and it depends on the content of the 'index' array. That is why some time ago I asked a question about the distribution of data in the 'index' array.

I simulated sequential and random access through the 'index' array, and in the 2nd case the calculation of the sum was almost 5 times slower.

These are results of two tests that demonstrate the ranges of values I got when accessing 'index' array:

*** Set of Full Sum Tests ***

>> Test 1 - Array size: 4096 'double' elements <<
>> Sequential - Deterministic Processing starts <<

Index Value: 0
Index Value: 1
Index Value: 2
Index Value: 3
Index Value: 4
...
Index Value: 1340
Index Value: 1341
Index Value: 1342
Index Value: 1343
Index Value: 1344
...
Index Value: 2690
Index Value: 2691
Index Value: 2692
Index Value: 2693
Index Value: 2694
...
Index Value: 4091
Index Value: 4092
Index Value: 4093
Index Value: 4094
Index Value: 4095

>> Deterministic Processing ends <<

>> Test 2 - Array size: 4096 'double' elements <<
>> Random - Non-Deterministic Processing starts <<

Index Value: 3733
Index Value: 432
Index Value: 2213
Index Value: 1508
Index Value: 2045
...
Index Value: 3841
Index Value: 3760
Index Value: 269
Index Value: 2187
Index Value: 2194
...
Index Value: 1213
Index Value: 781
Index Value: 261
Index Value: 2983
Index Value: 1967
Index Value: 274
...
Index Value: 2171
Index Value: 3864
Index Value: 1573
Index Value: 1214
Index Value: 2577

>> Non-Deterministic Processing ends <<

If the 'index' array is large (say, tens of megabytes) and holds random values, then values from the 'fi' array will be fetched from different parts of memory on every iteration.

I verified my own statement:

>>...
>>- raising the priority of the process / thread to Above Normal or High could improve performance by 0.5 - 1.0% ( do not change priority to Real Time )
>>...

There is about a 6% - 8% performance improvement if the process priority is changed from 'Normal' to 'High' or 'Realtime'. Take into account that this needs to be verified in your software if you decide to use a priority boost.

>> Array size: 32MB 'double' elements <<
>> Random - Non-Deterministic Tests <<

*** Set of Full Sum Tests ***
Full Sum : Rolled Loops - 1-in-1
Sum is: 0.000000
Calculated in 2187 ticks

Full Sum : UnRolled Loops - 4-in-1 - A
Sum is: 0.000000
Calculated in 2172 ticks

Full Sum : UnRolled Loops - 4-in-1 - B // <= Best Time ( without priority boost - ~6% faster than 1-in-1 test-case )
Sum is: 0.000000
Calculated in 2062 ticks

Full Sum : UnRolled Loops - 8-in-1 - A
Sum is: 0.000000
Calculated in 2844 ticks

Full Sum : UnRolled Loops - 8-in-1 - B
Sum is: 0.000000
Calculated in 2656 ticks

Process Priority High
Full Sum : UnRolled Loops - 4-in-1 - B
Sum is: 0.000000
Calculated in 2047 ticks

Process Priority Realtime
Full Sum : UnRolled Loops - 4-in-1 - B // <= Best Time ( with priority boost - ~8% faster than 1-in-1 test-case )
Sum is: 0.000000
Calculated in 2016 ticks

Take into account that large amounts of memory need to be pre-allocated before a priority change.

And two more things:

- In my tests I simulated the two types of access but didn't populate the 'w' and 'fi' arrays with random values; they were initialized to 0.0. That is why the sum is always:
...
Sum is: 0.000000
...

- Let me know if you need the source code of my tests


You might consider the following:

double a = 0.0;
int scale = 1 << ldf;
int offset = (i << ldf) + k;
// assuming stack large enough...
// use alloca to allocate a 16-byte aligned array
int* reindex = (int*)(((intptr_t)alloca(n*sizeof(int)+15)+15)&(~(intptr_t)15));
// assure that your index array is 16-byte aligned
_ASSERT((((intptr_t)index) & 15) == 0);
#pragma simd
for(int j = 0; j < n; j++)
    reindex[j] = index[j] * scale + offset;
// since w alignment is unknown and fi[reindex[j]] is unknown
// we cannot use #pragma simd here
// however, on compilers and processors supporting scatter/gather
// the following loop can be vectorized.
// On compilers (and processors) not supporting scatter/gather
// the following loop is unrollable and may be partially vectorized.
for(int j = 0; j < n; j++)
    a += w[j]*fi[reindex[j]];

Jim Dempsey
(down with html comment edit boxes)

www.quickthreadprogramming.com

Jim suggested you create an index array like...

>>// construct indexes
>>
>>for(j = 0; j < n; j++) fi_index[j] = ((index[j] + i) << ldf) + k;
>>
>>// use indexes
>>
>>for(j = 0; j < n; j++) a += w[j]*fi[fi_index[j]];

So, I didn't try to test it due to lack of time, but it is clear that it will improve performance. Overall, if all our proposals are combined I expect a final performance improvement of ~10% to 15%. It would be nice to hear from you about the total calculation time for the non-optimized version of that free software. By the way, what software are you talking about?

>>...it is clear that it will improve performance.

I've verified it in a set of my test cases for 16MB and 32MB array sizes, and the results are as follows:

- the improvement was about 4% to 6%, depending on array size ( so, it really works! ), compared to

>>...
>>Full Sum : UnRolled Loops - 4-in-1 - B // <= Best Time ( without priority boost - ~6% faster than 1-in-1 test-case )
>>Sum is: 0.000000
>>Calculated in 2062 ticks
>>...

- it also outperformed a test-case with a priority boost to Realtime

- on a 32-bit Windows platform the total amount of allocated memory was about 1.3GB ( the Virtual Memory manager was busy, but that is not applicable in your case )

- the indexing solution can be used as long as 'i', 'ldf' and 'k' behave like constants; if they are not, reindexing will be needed

>>...Overall, if all our proposals are combined I expect a final performance improvement of ~10% to 15%...

I think an increase in performance greater than 15% could be achieved.

Thanks a lot, Sergey, for all your updates. I was busy this week with a precision issue. Thanks, TimP and Jim, as well.
I tried to extract a piece of code from the source. I still may not have a better description, but I can provide more information about the loop. What I did was follow the loop in debug mode.
Here is the story:

After many calls, I came to a subroutine. The purpose of the code is to "apply a Hamiltonian (matrix) to a vector". Then, following the subroutines, I came to this:


call operate_ri_vec(op%stencil%size, wre(1), nri_loc, ri(1, ini), imin(ini), imax(ini), pfi(1), LOGLDF, pfo(1))


This is Fortran code, but it calls a C function. The definition of the C function is:

void operate_ri_vec_(const int * opn,
                     const double * restrict w,
                     const int * opnri,
                     const int * opri,
                     const int * rimap_inv,
                     const int * rimap_inv_max,
                     const double * restrict fi,
                     const int * ldfp,
                     double * restrict fo){


And I believe some of the Fortran parameters (except int, double, and char) are passed as addresses.
So op%stencil%size (a field of some structure) is a number whose value is 7. wre is a 1D integer array. nri_loc = 11722. pfi and pfo are 1D double arrays. ri, imin and imax are integer arrays. LOGLDF is a macro defined as 1. Then I moved on to follow the function.

const int ldf = ldfp[0];


Inside the function, I think this tells me that ldf is 1 and it is a const. And then,

for (l = 0; l < 11722; l++) {
    index = opri + 7 * l;
    ......


I've replaced some of the variables with their values. opri is another address. The loop that I mentioned earlier is actually inside the loop above, if I am correct.
double a = 0.0;
for(j = 0; j < n; j++) a += w[j]*fi[((index[j] + i)<<ldf) + k];

Hi Sergey and Timp,

I am catching up. I saw you had some discussion regarding /fp:precise. Since I have an accuracy issue in my calculations, could you tell me more about it?

Also, when I benchmarked the same code compiled on different platforms (Win64 vs. Ubuntu 64-bit), the results looked exactly the same at the beginning of the iterations, but later on they differed slightly from each other. Is this kind of issue related to the floating-point model?

Quote:

Sergey Kostrov wrote:

>>...I don't think I will manually unroll the loop. But you are right, I should unroll the loop...

As you can see, in both cases the 'UnRolled Loops - 4-in-1 - B' test has the best times. More information will be submitted soon.


I'll apply it. So you used /fp:precise here as well?

Quote:

Sergey Kostrov wrote:

And two more things:

- In my tests I simulated the two types of access but didn't populate the 'w' and 'fi' arrays with random values; they were initialized to 0.0. That is why the sum is always:
...
Sum is: 0.000000
...

- Let me know if you need the source code of my tests


That would be very nice! Can you send me your source? YLQK9@mail.missouri.edu
