https://software.intel.com/de-de/forums/topic/335715/feed
https://software.intel.com/de-de/comment/1715788#comment-1715788
<a id="comment-1715788"></a>
<div class="field field-name-comment-body field-type-text-long field-label-hidden"><div class="field-items"><div class="field-item even" property="content:encoded"><p>I made a simpler program (one dimension). All the data are aligned, but I still need unaligned loads for the shifted accesses (not a problem on an i7 or newer).</p>
<p>But the non-SSE loop is two times faster. I don't understand why...</p>
<p><pre class="brush: cpp">
#include &lt;stdio.h&gt;     /* printf, fopen, fprintf */
#include &lt;time.h&gt;      /* clock */
#include &lt;emmintrin.h&gt; /* SSE2 intrinsics */
//#include
#define n1 1004
#define niter 200000
int i,j,t;
double U0[n1] __attribute__ ((aligned(16)));
double U1[n1] __attribute__ ((aligned(16)));
double Dx,Dy,Lx,Ly,InvDxDx,Dt,alpha,totaltime,Stab,DtAlpha,DxDx;
__m128d vmmx00;
__m128d vmmx01;
__m128d vmmx02;
__m128d vmmx10;
__m128d va;
__m128d vb;
__m128d vc;
__m128d vd;
clock_t time0,time1;
FILE *f1;
int main()
{
/* ---- GENERAL ---- */
alpha = 0.4;
totaltime = 1.0/100.0;
Dt = totaltime/((niter-1)*1.0);
Lx = 1.0;
Dx = Lx/((n1-1)*1.0);
InvDxDx = 1.0/(Dx*Dx);
DxDx = Dx*Dx;
Stab = alpha*Dt*(InvDxDx);
DtAlpha = Dt*alpha;
/* Stability if result <= 0.5 */
printf("Stability factor : %f\n",Stab);
for( i = 0; i < n1; i++){U0[i] = 0.0;}
U0[1] = 1.0;
U0[2] = 1.0;
U0[3] = 1.0;
U0[n1-2] = 2.0;
// for ( i = 0; i < n1; i++) {
// for ( j = i + 1; j < n2; j++) {
// std::swap(U0[i][j], U0[j][i]);
// }
//}
va = _mm_set1_pd(-2.0);
vb = _mm_set1_pd(InvDxDx);
vd = _mm_set1_pd(DtAlpha);
time0=clock();
for( t = 0; t < niter; t++)
{
for( i = 2; i < n1-2; i+=2)
{
//printf("%d %d\n",i,j);
//fflush(stdout);
vmmx00 = _mm_load_pd(&U0[i]);
vmmx01 = _mm_loadu_pd(&U0[i+1]);
vmmx02 = _mm_loadu_pd(&U0[i-1]);
vmmx10 = _mm_mul_pd(va,vmmx00); // U1[i][j] = -2.0*U0[i][j];
vmmx10 = _mm_add_pd(vmmx10,vmmx01); // U1[i][j] = U1[i][j] + U0[i+1][j];
vmmx10 = _mm_add_pd(vmmx10,vmmx02); // U1[i][j] = U1[i][j] + U0[i-1][j];
vmmx10 = _mm_mul_pd(vb,vmmx10); // U1[i][j] = U1[i][j] * InvDxDx;
vmmx10 = _mm_mul_pd(vd,vmmx10); // U1[i][j] = U1[i][j] * DtAlpha;
vmmx10 = _mm_add_pd(vmmx10,vmmx00); // U1[i][j] = U1[i][j] + U0[i][j];
_mm_store_pd(&U1[i],vmmx10);
// U1[i][j] = U0[i][j] + DtAlpha*( (U0[i+1][j]-2.0*U0[i][j]+U0[i-1][j])*InvDxDx
}
for( i = 2; i < n1-2; i+=2)
{
//printf("%d %d\n",i,j);
//fflush(stdout);
vmmx00 = _mm_load_pd(&U1[i]);
vmmx01 = _mm_loadu_pd(&U1[i+1]);
vmmx02 = _mm_loadu_pd(&U1[i-1]);
vmmx10 = _mm_mul_pd(va,vmmx00); // U0[i][j] = -2.0*U1[i][j];
vmmx10 = _mm_add_pd(vmmx10,vmmx01); // U0[i][j] = U0[i][j] + U1[i+1][j];
vmmx10 = _mm_add_pd(vmmx10,vmmx02); // U0[i][j] = U0[i][j] + U1[i-1][j];
vmmx10 = _mm_mul_pd(vb,vmmx10); // U0[i][j] = U0[i][j] * InvDxDx;
vmmx10 = _mm_mul_pd(vd,vmmx10); // U0[i][j] = U0[i][j] * DtAlpha;
vmmx10 = _mm_add_pd(vmmx10,vmmx00); // U0[i][j] = U0[i][j] + U1[i][j];
_mm_store_pd(&U0[i],vmmx10);
// U1[i][j] = U0[i][j] + DtAlpha*( (U0[i+1][j]-2.0*U0[i][j]+U0[i-1][j])*InvDxDx
}
}
time1=clock();
printf("Loop 0, total time : %f\n", (double)(time1-time0));
f1 = fopen ("out0.dat", "wt");
for( i = 1; i < n1-1; i++)
{
fprintf (f1, "%d\t%f\n", i, U0[i]);
}
fclose (f1);
// REF
for( i = 0; i < n1; i++){U0[i] = 0.0;}
U0[1] = 1.0;
U0[2] = 1.0;
U0[3] = 1.0;
U0[n1-2] = 2.0;
time0=clock();
for( t = 0; t < niter; t++)
{
for( i = 2; i < n1-2; i++)
{
U1[i] = U0[i] + DtAlpha* (U0[i+1]-2.0*U0[i]+U0[i-1])*InvDxDx;
}
for( i = 2; i < n1-2; i++)
{
U0[i] = U1[i] + DtAlpha* (U1[i+1]-2.0*U1[i]+U1[i-1])*InvDxDx;
}
}
time1=clock();
printf("Loop ref, total time : %f\n", (double)(time1-time0));
f1 = fopen ("outref.dat", "wt");
for( i = 1; i < n1-1; i++)
{
fprintf (f1, "%d\t%f\n", i, U0[i]);
}
fclose (f1);
return 0;
}
</pre></p>
<p>I really don't understand where I made a mistake... Does anyone have an idea?</p>
</div></div></div>Tue, 20 Nov 2012 14:52:20 +0000, benoit.leveugle, comment 1715788 at https://software.intel.com
https://software.intel.com/de-de/comment/1715694#comment-1715694
<a id="comment-1715694"></a>
<div class="field field-name-comment-body field-type-text-long field-label-hidden"><div class="field-items"><div class="field-item even" property="content:encoded"><p>After some tests, the first loop is consistently the fastest when using GCC's -O3 optimisations. I think the compiler automatically unrolls the loops because they are simple in this case.</p>
<p>I need to take a look at the icc compiler, because it sometimes makes excellent optimisations.</p>
<p>I also tried to write the SIMD code, but it doesn't work; the executable stops running.</p>
<p><pre class="brush: cpp">
double *cU0[n1][n2];
double *cU1[n1][n2];
/* simd */
__attribute__ ((aligned(16))) __m128d va;
__attribute__ ((aligned(16))) __m128d vb;
__attribute__ ((aligned(16))) __m128d vc;
__attribute__ ((aligned(16))) __m128d vd;
__attribute__ ((aligned(16))) __m128d vmmx00;
__attribute__ ((aligned(16))) __m128d vmmx01;
__attribute__ ((aligned(16))) __m128d vmmx02;
__attribute__ ((aligned(16))) __m128d vmmx03;
__attribute__ ((aligned(16))) __m128d vmmx04;
__attribute__ ((aligned(16))) __m128d vmmx10;
__attribute__ ((aligned(16))) __m128d vmmx20;
[...]
va = _mm_set1_pd(-2.0);
vb = _mm_set1_pd(InvDxDx);
vc = _mm_set1_pd(InvDyDy);
vd = _mm_set1_pd(DtAlpha);
for( t = 0; t < niter; t++)
{
/* even */
for( i = 1; i < n1-1; i++)
{
for( j = 1; j < n2-1; j+=2)
{
// Need five variables : [i][j],[i][j+1],[i][j-1],[i+1][j],[i-1][j]
// respectively : vxmm0,vxmm1,vxmm2,vxmm3,vxmm4
// can be optimized after (re-used of already loaded values)
vmmx00 = _mm_load_pd(cU0[i][j]);
vmmx01 = _mm_load_pd(cU0[i][j+1]);
vmmx02 = _mm_load_pd(cU0[i][j-1]);
vmmx03 = _mm_load_pd(cU0[i+1][j]);
vmmx04 = _mm_load_pd(cU0[i-1][j]);
vmmx10 = _mm_mul_pd(va,vmmx00); // U1[i][j] = -2.0*U0[i][j];
vmmx10 = _mm_add_pd(vmmx10,vmmx03); // U1[i][j] = U1[i][j] + U0[i+1][j];
vmmx10 = _mm_add_pd(vmmx10,vmmx04); // U1[i][j] = U1[i][j] + U0[i-1][j];
vmmx10 = _mm_mul_pd(vb,vmmx10); // U1[i][j] = U1[i][j] * InvDxDx;
vmmx20 = _mm_mul_pd(va,vmmx00); // U2[i][j] = -2.0*U0[i][j];
vmmx20 = _mm_add_pd(vmmx20,vmmx01); // U2[i][j] = U2[i][j] + U0[i][j+1];
vmmx20 = _mm_add_pd(vmmx20,vmmx02); // U2[i][j] = U2[i][j] + U0[i][j-1];
vmmx20 = _mm_mul_pd(vc,vmmx20); // U2[i][j] = U2[i][j] * InvDyDy;
vmmx10 = _mm_add_pd(vmmx10,vmmx20); // U1[i][j] = U1[i][j] + U2[i][j];
vmmx10 = _mm_mul_pd(vd,vmmx10); // U1[i][j] = U1[i][j] * DtAlpha;
vmmx10 = _mm_add_pd(vmmx10,vmmx00); // U1[i][j] = U1[i][j] + U0[i][j];
_mm_store_pd(cU1[i][j],vmmx10);
[...]
</pre></p>
<p>Can you spot the error? It seems correct to me, and I don't have a good debugger here with me...</p>
</div></div></div>Mon, 19 Nov 2012 22:29:00 +0000, benoit.leveugle, comment 1715694 at https://software.intel.com
https://software.intel.com/de-de/comment/1715660#comment-1715660
<a id="comment-1715660"></a>
<div class="field field-name-comment-body field-type-text-long field-label-hidden"><div class="field-items"><div class="field-item even" property="content:encoded"><p>Just before testing further: as I thought, your loops are not the same as the original ones. U1 needs U0 to be fully calculated before being processed, and vice versa. Look at U0[i][j+1], for example. I made the same mistake when I first tried to optimise the loop. Because it is a convergence calculation, the results will be the same in the end, but the incorrect loops will take longer to converge.</p>
<p>I got the code running with GCC 4.7 on an i7 860. I will now test your loops and try to correct my SSE instructions. Results in a few hours.</p>
</div></div></div>Mon, 19 Nov 2012 19:10:09 +0000, benoit.leveugle, comment 1715660 at https://software.intel.com
https://software.intel.com/de-de/comment/1715575#comment-1715575
<a id="comment-1715575"></a>
<div class="field field-name-comment-body field-type-text-long field-label-hidden"><div class="field-items"><div class="field-item even" property="content:encoded"><p>>>>If you review the test-case 'LOOP 0' you will see that there are dependencies between U0 and U1 arrays.>>><br />
Yes, of course.<br />
A few months ago I had a very interesting "conversation" with the user "bronxz"; the main subject of the discussion was the advantage a CPU has over a GPU in graphics rendering. We agreed that for complex logic involving extensive branching and management of memory and caches, a CPU has the advantage. A GPU is very useful when nicely vectorized data without complex interdependencies is passed to it for processing.<br />
P.S.<br />
After reviewing my sine function thread, sadly I cannot find those posts (everything was lost).</p>
</div></div></div>Mon, 19 Nov 2012 06:43:38 +0000, iliyapolak, comment 1715575 at https://software.intel.com
https://software.intel.com/de-de/comment/1715557#comment-1715557
<a id="comment-1715557"></a>
<div class="field field-name-comment-body field-type-text-long field-label-hidden"><div class="field-items"><div class="field-item even" property="content:encoded"><p>>>...Please look at my sine functions thread. I'm posting some code test-cases...</p>
<p>I'll take a look soon. Thanks.</p>
</div></div></div>Sun, 18 Nov 2012 21:51:52 +0000, Sergey Kostrov, comment 1715557 at https://software.intel.com
https://software.intel.com/de-de/comment/1715556#comment-1715556
<a id="comment-1715556"></a>
<div class="field field-name-comment-body field-type-text-long field-label-hidden"><div class="field-items"><div class="field-item even" property="content:encoded"><p>Hi Iliya,</p>
<p>>>...if the calculations could be easily vectorized and be independent from each other just like pixels...</p>
<p>If you review the test-case 'LOOP 0' you will see that there are dependencies between U0 and U1 arrays.</p>
</div></div></div>Sun, 18 Nov 2012 21:48:10 +0000, Sergey Kostrov, comment 1715556 at https://software.intel.com
https://software.intel.com/de-de/comment/1715552#comment-1715552
<a id="comment-1715552"></a>
<div class="field field-name-comment-body field-type-text-long field-label-hidden"><div class="field-items"><div class="field-item even" property="content:encoded"><p>>>>I agree and it will require a complete re-design, re-testing of some subsystems. It makes sense if CUDA will give a performance improvement in 50x or 100x.>>></p>
<p>I forgot to add: only if the calculations can be easily vectorized and are independent of each other, just like pixels.</p>
<p>@Sergey<br />
Please look at my sine functions thread. I'm posting some code test-cases.</p>
</div></div></div>Sun, 18 Nov 2012 20:24:00 +0000, iliyapolak, comment 1715552 at https://software.intel.com
https://software.intel.com/de-de/comment/1715540#comment-1715540
<a id="comment-1715540"></a>
<div class="field field-name-comment-body field-type-text-long field-label-hidden"><div class="field-items"><div class="field-item even" property="content:encoded"><p>>>>> CUDA will be very helpful in your case...<br />
>>>><br />
>>Unfortunately no, and that is the challenge.</p>
<p>I agree and it will require a complete re-design, re-testing of some subsystems. It makes sense if CUDA will give a performance improvement in 50x or 100x.</p>
</div></div></div>Sun, 18 Nov 2012 17:21:44 +0000, Sergey Kostrov, comment 1715540 at https://software.intel.com
https://software.intel.com/de-de/comment/1715539#comment-1715539
<a id="comment-1715539"></a>
<div class="field field-name-comment-body field-type-text-long field-label-hidden"><div class="field-items"><div class="field-item even" property="content:encoded"><p>Thanks for the feedback!</p>
<p>>>...I gave you a simple loop. In normal time, Dx and Dy are not constant, so the loop can be more complicated...</p>
<p>I suspected that.</p>
<p>>>...In fact, because the solver can perform this loop more than 300 times per iteration, and considering that a run can be more than<br />
>>100,000 iterations, precision is extremely important...</p>
<p>I don't know if you tried the 'long double' type, but I wouldn't recommend even trying it. I've done lots of tests and it doesn't help in my case. The test is multiplication of very big matrices.</p>
</div></div></div>Sun, 18 Nov 2012 17:14:00 +0000, Sergey Kostrov, comment 1715539 at https://software.intel.com
https://software.intel.com/de-de/comment/1715533#comment-1715533
<a id="comment-1715533"></a>
<div class="field field-name-comment-body field-type-text-long field-label-hidden"><div class="field-items"><div class="field-item even" property="content:encoded"><p>A bit late, my apologies. I was travelling and could not get a secure internet connection. I am downloading your code and will take a look at it tomorrow.</p>
<p>>>> CUDA will be very helpful in your case. Could chemistry calculations be easily vectorized like video processing?<br />
Unfortunately no, and that is the challenge. Another team is working on it, and they are facing difficulties with the Fortran 77 code, which uses a lot of "go to" instructions.</p>
<p>>>> Very interesting info, but it is too advanced for me. I have only written 1D numerical integrators. By reading your code, it could be implemented relatively easily with the help of intrinsics.<br />
Yes. The equations are relatively complex, but the main core is always the same: derivative calculations, which are the same as in image processing.</p>
<p>>>> Is it an Open Source project?<br />
Not yet, but it is on the way to being, yes. We are currently rewriting some parts of the code to make it more "usable". Then we will release the sources, which could be used in OpenFOAM, for example (an open-source fluid mechanics solver).</p>
<p>>>> - Review all equations because in some cases they could be normalized in order to reduce number of multiplications ( it is related to variables InvDxDx, InvDyDy, DtAlpha )<br />
I gave you a simple loop. Normally, Dx and Dy are not constant, so the loop can be more complicated: you need to calculate each term one at a time. But yes, I think we could comment the main loop and rewrite it in a more optimized way.</p>
<p>>>> - I was very impressed with performance results when MinGW C/C++ compiler was used<br />
Yes, they have made impressive improvements lately.</p>
<p>>>> - Review a Test-Case #3 ( your SSE codes ) because you've declared a couple of 128-bit variables on the stack and it affects performance ( some memory for these variables will be allocated and de-allocated more than ~208,080,000 times )<br />
I will try this and report results tomorrow.</p>
<p>>>> >>... I also attached a zip-file with outputs for all versions of codes...<br />
I used Microsoft's WinDiff utility to compare results.<br />
>>> I've done a quick test with a 20-bit precision Fixed Floating Point ( FFP ) type instead of a 53-bit precision Double-Precision floating point 'double' type. I could say that was a right decision to use 'double' data type because there was a significant loss in precision of results if FFP types are used.<br />
Thank you. I don't know this program, so I will take a look.<br />
In fact, because the solver can perform this loop more than 300 times per iteration, and considering that a run can be more than 100,000 iterations, precision is extremely important. For example, we tried to calculate using single-precision reals, and the results diverged after only 12 iterations. And if you add chemistry calculations, it explodes immediately due to the exponential operations. On top of that, the BICGSTAB solver converges to a precision of 10^-15, which cannot be achieved in single precision.</p>
<p>I will report tomorrow the results on my Core i7 Xeon.</p>
</div></div></div>Sun, 18 Nov 2012 16:21:03 +0000, benoit.leveugle, comment 1715533 at https://software.intel.com