Improve the performance of a sum


>>...
>>#if VEC_SIZE > 1
>>for (; i < rimap_inv_max[l]; i++){
>>  int k;
>>  for(k = 0; k < (1<<ldf); k++){
>>    double a = 0.0;
>>    for(j = 0; j < n; j++) a += w[j]*fi[((index[j] + i)<<ldf) + k];
>>    fo[(i<<ldf) + k] = a;
>>  }
>>}
>>#endif

I have two questions:

Do you know a starting value for 'i', values for 'rimap_inv_max[l]' and 'n' variables? I also would like to confirm that 'ldf'=1, correct?

I'll go through your last 3 posts tomorrow.

>>The code is open source project and you can find it here: http://www.tddft.org/trac/octopus/browser/trunk/src/grid/operate_inc.c

Not to be overly hard on the programmer/author, but the code is rather sophomoric.

First off, the number of available SSE and AVX registers is 16. Xeon Phi (AKA Knights Corner), which currently is a co-processor, has 32 AVX2-extended registers (64 bytes / 8 doubles / 16 floats each). So currently, unless your code is running on a Xeon Phi co-processor, the "#if DEPTH >= 32" section is likely counterproductive.

Secondly, the programmer is computing byte offsets, "LOAD(fi + indexj + 1*VEC_SIZE + k)". Compiler optimizations are often confused by these tactics. A better method is (untested code):


#if DEPTH >= 32
for (; i < (rimap_inv_max[l] - unroll32 + 1); i += unroll32)
{
	int k;
	for(k = 0; k < (1 << ldf); k += 32*VEC_SIZE)
	{
		register VEC_TYPE a0, a1, a2, a3;
		register VEC_TYPE a4, a5, a6, a7;
		register VEC_TYPE a8, a9, aa, ab;
		register VEC_TYPE ac, ad, ae, af;
		register VEC_TYPE b0, b1, b2, b3;
		register VEC_TYPE b4, b5, b6, b7;
		register VEC_TYPE b8, b9, ba, bb;
		register VEC_TYPE bc, bd, be, bf;
		a0 = a1 = a2 = a3 = VEC_ZERO;
		a4 = a5 = a6 = a7 = VEC_ZERO;
		a8 = a9 = aa = ab = VEC_ZERO;
		ac = ad = ae = af = VEC_ZERO;
		b0 = b1 = b2 = b3 = VEC_ZERO;
		b4 = b5 = b6 = b7 = VEC_ZERO;
		b8 = b9 = ba = bb = VEC_ZERO;
		bc = bd = be = bf = VEC_ZERO;
		for(j = 0; j < n; j++)
		{
			register VEC_TYPE wj = VEC_SCAL(w[j]);
			int indexj = (index[j] + i) << ldf;
			VEC_TYPE* fiVec = (VEC_TYPE*)(fi + indexj + k);
			a0 = VEC_FMA(wj, LOAD(fiVec     ), a0);
			a1 = VEC_FMA(wj, LOAD(fiVec +  1), a1);
			a2 = VEC_FMA(wj, LOAD(fiVec +  2), a2);
			a3 = VEC_FMA(wj, LOAD(fiVec +  3), a3);
			a4 = VEC_FMA(wj, LOAD(fiVec +  4), a4);
			a5 = VEC_FMA(wj, LOAD(fiVec +  5), a5);
			a6 = VEC_FMA(wj, LOAD(fiVec +  6), a6);
			a7 = VEC_FMA(wj, LOAD(fiVec +  7), a7);
			a8 = VEC_FMA(wj, LOAD(fiVec +  8), a8);
			a9 = VEC_FMA(wj, LOAD(fiVec +  9), a9);
			aa = VEC_FMA(wj, LOAD(fiVec + 10), aa);
			ab = VEC_FMA(wj, LOAD(fiVec + 11), ab);
			ac = VEC_FMA(wj, LOAD(fiVec + 12), ac);
			ad = VEC_FMA(wj, LOAD(fiVec + 13), ad);
			ae = VEC_FMA(wj, LOAD(fiVec + 14), ae);
			af = VEC_FMA(wj, LOAD(fiVec + 15), af);
			b0 = VEC_FMA(wj, LOAD(fiVec + 16), b0);
			b1 = VEC_FMA(wj, LOAD(fiVec + 17), b1);
			b2 = VEC_FMA(wj, LOAD(fiVec + 18), b2);
			b3 = VEC_FMA(wj, LOAD(fiVec + 19), b3);
			b4 = VEC_FMA(wj, LOAD(fiVec + 20), b4);
			b5 = VEC_FMA(wj, LOAD(fiVec + 21), b5);
			b6 = VEC_FMA(wj, LOAD(fiVec + 22), b6);
			b7 = VEC_FMA(wj, LOAD(fiVec + 23), b7);
			b8 = VEC_FMA(wj, LOAD(fiVec + 24), b8);
			b9 = VEC_FMA(wj, LOAD(fiVec + 25), b9);
			ba = VEC_FMA(wj, LOAD(fiVec + 26), ba);
			bb = VEC_FMA(wj, LOAD(fiVec + 27), bb);
			bc = VEC_FMA(wj, LOAD(fiVec + 28), bc);
			bd = VEC_FMA(wj, LOAD(fiVec + 29), bd);
			be = VEC_FMA(wj, LOAD(fiVec + 30), be);
			bf = VEC_FMA(wj, LOAD(fiVec + 31), bf);
		}
		VEC_TYPE* foVec = (VEC_TYPE*)(fo + (i << ldf) + k);
		STORE(foVec     , a0);
		STORE(foVec +  1, a1);
		STORE(foVec +  2, a2);
		STORE(foVec +  3, a3);
		STORE(foVec +  4, a4);
		STORE(foVec +  5, a5);
		STORE(foVec +  6, a6);
		STORE(foVec +  7, a7);
		STORE(foVec +  8, a8);
		STORE(foVec +  9, a9);
		STORE(foVec + 10, aa);
		STORE(foVec + 11, ab);
		STORE(foVec + 12, ac);
		STORE(foVec + 13, ad);
		STORE(foVec + 14, ae);
		STORE(foVec + 15, af);
		STORE(foVec + 16, b0);
		STORE(foVec + 17, b1);
		STORE(foVec + 18, b2);
		STORE(foVec + 19, b3);
		STORE(foVec + 20, b4);
		STORE(foVec + 21, b5);
		STORE(foVec + 22, b6);
		STORE(foVec + 23, b7);
		STORE(foVec + 24, b8);
		STORE(foVec + 25, b9);
		STORE(foVec + 26, ba);
		STORE(foVec + 27, bb);
		STORE(foVec + 28, bc);
		STORE(foVec + 29, bd);
		STORE(foVec + 30, be);
		STORE(foVec + 31, bf);
	}
}
#endif

Jim Dempsey

www.quickthreadprogramming.com

GRIPE:
One cannot paste .cpp code into

paste here
without getting extra line breaks.
Please fix.
Also note the C/C++ {Left Shift} does not paste in this block as well.
Please fix.

Jim Dempsey

www.quickthreadprogramming.com

Hi Jim,

It is hard to see your optimization without comparing it to the original code. Do you mean the addition of the 'indexj' variable?

...
int indexj = (index[j] + i) << ldf;
VEC_TYPE* fiVec = (VEC_TYPE*)(fi + indexj + k);
...

The recommended edits replace

LOAD(fi + indexj + n*VEC_SIZE + k)

with

LOAD(fiVec + n)

Where n is sequenced from 0 to 31, and fiVec is a VEC_TYPE* as opposed to the original fi, which is an FPTYPE* (double* or float*).

The compiler has optimizations specifically built for pointer, pointer+1, pointer+2, ...
The original code "should" optimize to the same, but this is not always the case.
To verify, run the test (about 10 minutes of work).

Jim Dempsey

www.quickthreadprogramming.com

Quote:

jimdempseyatthecove wrote:

You might consider the following:

double a = 0.0;
int scale = 1 << ldf;
int offset = (i << ldf) + k;
// assuming stack large enough...
// use alloca to allocate a 16-byte aligned array
int* reindex = (int*)(((intptr_t)alloca(n*sizeof(int)+15)+15)&(~(intptr_t)15));
// assure that your index array is 16-byte aligned
_ASSERT((((intptr_t)index) & 15) == 0);
#pragma simd
for(int j = 0; j < n; j++)
    reindex[j] = index[j] * scale + offset;
// since w alignment is unknown and fi[reindex[j]] is unknown
// we cannot use #pragma simd here
// however, on compilers and processors supporting scatter/gather
// the following loop can be vectorized.
// On compilers (and processors) not supporting scatter/gather
// the following loop is unrollable and may be partially vectorized.
for(int j = 0; j < n; j++)
    a += w[j]*fi[reindex[j]];

Jim Dempsey
(down with html comment edit boxes)


Thanks Jim. I just took your code and applied it, but the improvement is not significant. I think your idea is to calculate the index before the vector product is done. I don't know why it doesn't help more.

When your indexes are totally random, precomputing a table of indexes _may_ be faster on systems without gather. When you upgrade to a processor that supports gather and recompile for it, the same code should be faster. This future-proofs your code.

When you have a random index combined with a series of sequential references (index, index+1, index+2, ...), then it may be best to compute a reference/pointer to the indexed location (p) and then use p[0], p[1], p[2], ...

The code looks cleaner, and it may run faster.

Jim Dempsey

www.quickthreadprogramming.com

Here are a couple of follow ups.

>>...
>>Full Sum : UnRolled Loops - 4-in-1 - B // <= Best Time ( ~6% faster than 1-in-1 test case )
>>Sum is: 0.000000
>>Calculated in 2078 ticks
>>
>>Full Sum : UnRolled Loops - 8-in-1 - A
>>Sum is: 0.000000
>>Calculated in 2860 ticks
>>...
>>
>>I'll apply it. So you used /fp:precise here as well?

Yes.

>>...I saw you have some discussion regarding /fp:precise. Since I have an accuracy issue in my calculations, could you educate me
>>more about it?

** Please take a look at these topics in MSDN: **

...- /fp (Specify Floating-Point Behavior)
...- Floating Point Numbers

** Ways to control Floating Point Models ( modes ) are as follows: **

- C++ Compiler options /fp:[ precise | except | fast | strict ]

- Precise, Except, Fast and Strict Floating Point models can be controlled in source code using the 'float_control' pragma directive:

...#pragma float_control ( precise, on )
...#pragma float_control ( except, off )
...#pragma float_control ( fast, off )
...#pragma float_control ( strict, off )

- A set of CRT functions, '_control87', '_controlfp' and '__control87_2', allow you to change Floating Point modes such as precision, rounding and infinity. You could also look at the '_status87' and '_fpreset' CRT functions. When I use '_control87' I always select default modes.

- The use of SSE instructions in CRT functions such as sin and cos can be controlled with the '_set_SSE2_enable' CRT function.

- Take into account that some of these features or functions are specific to the Microsoft / Intel C++ compilers.

>>...I saw the results looked exactly the same at the beginning of the iterations, but later on they are slightly different from each other. Is
>>such an issue related to the floating-point model?

I think so, and it is possibly related to different rounding modes. Could you post a small example?

Remember that, due to limitations of the IEEE 754 standard, rounding errors can accumulate very quickly! Please take a look at a very simple example:

Let's say two square 8x8 matrices A and B need to be multiplied.

// Matrix A - 8x8 - 'float' type:

101.0 201.0 301.0 401.0 501.0 601.0 701.0 801.0
901.0 1001.0 1101.0 1201.0 1301.0 1401.0 1501.0 1601.0
1701.0 1801.0 1901.0 2001.0 2101.0 2201.0 2301.0 2401.0
2501.0 2601.0 2701.0 2801.0 2901.0 3001.0 3101.0 3201.0
3301.0 3401.0 3501.0 3601.0 3701.0 3801.0 3901.0 4001.0
4101.0 4201.0 4301.0 4401.0 4501.0 4601.0 4701.0 4801.0
4901.0 5001.0 5101.0 5201.0 5301.0 5401.0 5501.0 5601.0
5701.0 5801.0 5901.0 6001.0 6101.0 6201.0 6301.0 6401.0

// Matrix B - 8x8 - 'float' type:

101.0 201.0 301.0 401.0 501.0 601.0 701.0 801.0
901.0 1001.0 1101.0 1201.0 1301.0 1401.0 1501.0 1601.0
1701.0 1801.0 1901.0 2001.0 2101.0 2201.0 2301.0 2401.0
2501.0 2601.0 2701.0 2801.0 2901.0 3001.0 3101.0 3201.0
3301.0 3401.0 3501.0 3601.0 3701.0 3801.0 3901.0 4001.0
4101.0 4201.0 4301.0 4401.0 4501.0 4601.0 4701.0 4801.0
4901.0 5001.0 5101.0 5201.0 5301.0 5401.0 5501.0 5601.0
5701.0 5801.0 5901.0 6001.0 6101.0 6201.0 6301.0 6401.0

// Matrix C = Matrix A * Matrix B - 8x8 - 'float' type

13826808.0 14187608.0 14548408.0 14909208.0 15270008.0 15630808.0 15991608.0 16352408.0
32393208.0 33394008.0 34394808.0 35395608.0 36396408.0 37397208.0 38398008.0 39398808.0
>50959604.0< 52600404.0 54241204.0 55882004.0 57522804.0 59163604.0 60804404.0 62445204.0
69526008.0 71806808.0 74087608.0 76368408.0 78649208.0 80930008.0 83210808.0 85491608.0
88092408.0 91013208.0 93934008.0 96854808.0 99775608.0 102696408.0 105617208.0 108538008.0
106658808.0 110219608.0 113780408.0 117341208.0 120902008.0 124462808.0 128023608.0 131584408.0
125225208.0 129426008.0 133626808.0 137827616.0 142028400.0 146229216.0 150430000.0 154630816.0
143791600.0 148632416.0 153473200.0 158314016.0 163154800.0 167995616.0 172836416.0 177677200.0

I marked one of the several incorrect values as '>50959604.0<'; due to a rounding error it is NOT equal to '50959608.0'. If you change the 'float' type to 'double', the result of the matrix multiplication will be correct. For ALL values of matrix C, the last two digits must be '08'; they can't be '...00', '...04' or '...16'.

I always recommend reading Intel's article on that subject:

'Consistency of Floating-Point Results using the Intel(R) Compiler' By Dr. Martyn J. Corden and David Kreitzer

I can upload it if you are not able to find the article.

The '_control87' / '_controlfp' family of functions would have no effect on compilations in x64 mode or with /arch:SSE2 or AVX. I think Sergei means use of x86/ia32 mode without /arch: when he mentions using them with default options.
Intel compilers have /fp:source as a rough equivalent to Microsoft /fp:fast. Yet, the Intel compilers make their more aggressive /fp:fast the default, while you must set /fp:fast to see competitive optimizations with MSVC (particularly in 32-bit mode).
Intel Fortran interprets /fp:precise to be the same as /fp:source, but they are different for Intel C++.

I used to think the differences between the Intel and MSVC interpretations were partly explained by auto-vectorization, but now VS2012 has some auto-vectorization (apparently not including vector reductions).

>>I always recommend to read Intel's article related to that subject:
>>
>>'Consistency of Floating-Point Results using the Intel(R) Compiler' By Dr. Martyn J. Corden and David Kreitzer
>>
>>I could upload if you won't be able to find the article.

The article is attached.

Attachments:

fp-consistency.pdf (401.35 KB)

Here are a couple of very useful web-links to look at:
http://en.wikipedia.org/wiki/Single_precision

http://www.binaryconvert.com

http://www.binaryconvert.com/convert_float.html

There is a strange re-formatting issue.

>>...That would be very nice! Can you send me your source? YLQK9@mail.missouri.edu

Please check your e-mail and I also enclosed the sources.

Attachments:

sumtests.cpp.txt (11.38 KB)

>>...The precision issue is the most important thing right now...

On this web page:

http://www.tddft.org/programs/octopus/wiki/index.php/Manual:Installation

there is a statement

>>...
>>testsuite/
>>Used to check your build. You may also use the files in here as samples of how to do various types of calculations.

- Do you have some reference data to compare with?
- How did you detect that precision loss?
- Could you post some technical details?

Hi everybody,

I'd like to share the results of my additional test, and I'm very impressed: a ~15-year-old Borland C++ compiler outperformed all (!) modern C++ compilers in a test case where all optimizations were disabled:

...
>> With Borland C++ compiler <<
>> Non-Deterministic Tests <<
>> Array size: 32MB 'double' elements <<

*** Set of Full Sum Tests ***
Full Sum : Rolled Loops - 1-in-1
Sum is: 0.000000
Calculated in 2110 ticks

Full Sum : UnRolled Loops - 4-in-1 - A
Sum is: 0.000000
Calculated in 2094 ticks

Full Sum : UnRolled Loops - 4-in-1 - B // <= Best Time ( without priority boost )
Sum is: 0.000000
Calculated in 2000 ticks

Full Sum : UnRolled Loops - 8-in-1 - A
Sum is: 0.000000
Calculated in 2671 ticks

Full Sum : UnRolled Loops - 8-in-1 - B
Sum is: 0.000000
Calculated in 2313 ticks

Process Priority High
Full Sum : UnRolled Loops - 4-in-1 - B
Sum is: 0.000000
Calculated in 1984 ticks

Process Priority Realtime
Full Sum : UnRolled Loops - 4-in-1 - B // <= Best Time ( with priority boost )
Sum is: 0.000000
Calculated in 1953 ticks
...
