FPU - Weird Performance

fozi_b@yahoo.com:

Hi, I'm Diogo Teixeira from Portugal, a computer science student in Lisbon. I own two Intel P4 processors, one in my desktop and one in my laptop.

I'm developing a software renderer in C++, and I'm getting weird results on Intel CPUs; I've tested the P2, P3, and P4 so far. The strangest results come from the Pentium 4.

It is based on Chris Hecker's perspective texture mapping articles; the code does not take advantage of SIMD instructions, only simple FPU computations are made for each triangle. I was able to strip off most of the code and reduce the problem to three files.

The small source can be found at:
http://fozi.codingcorner.net/RenderSoft_Source.rar

Performance comparison table:

AMD XP 1500+ (1.3 GHz) .... 412 fps
AMD XP 2600+ (2.0 GHz) .... 607 fps
Intel P4 1.5 GHz .......... 6 fps
Intel P4 2.5 GHz .......... 14 fps

As you can see, these numbers are extremely weird; I think it might have something to do with some FPU state I'm not aware of. I've checked and changed my timing code, and the numbers remained the same. I've read Intel docs that advise developers to use SIMD code, but even without it I think the performance shouldn't be that low, should it?
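
For reference, the FPS counting is roughly of this shape (a simplified sketch, not the exact code from the archive; RenderFrame is a stand-in for the real render call):

#include <windows.h>   // timeGetTime; link with winmm.lib
#include <cstdio>

void RenderFrame() { /* hypothetical: draw one frame here */ }

int main()
{
    int frames = 0;
    DWORD last = timeGetTime();       // millisecond timer from winmm
    for (;;)
    {
        RenderFrame();
        ++frames;
        DWORD now = timeGetTime();
        if (now - last >= 1000)       // report once per second
        {
            std::printf("%d fps\n", frames);
            frames = 0;
            last = now;
        }
    }
}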

So the problem must lie in my code; any help on this would be highly appreciated!

Thanks in advance,
Diogo Teixeira

Sergey Maidanov (Intel):

Dear Diogo Teixeira,

Unfortunately, you didn't specify which compiler you used to compile your benchmarking program, so I chose the recent Intel C/C++ Compiler v8.0 to reproduce your issue.

I was able to reproduce the performance numbers you obtained, using two systems: an Intel Xeon processor (2.2 GHz) and an AMD Athlon XP processor (1.6 GHz):

AMD Athlon XP (1.6 GHz) - 514 fps
Intel Xeon (2.2 GHz) - 12 fps

The compiler switch used was /O2, so you can see that the performance numbers are rather similar to what you obtained. We are continuing the investigation, but I have some notes to add right now.

To get substantial performance benefits on Intel processors, I recommend using advanced compiler optimization switches when compiling your benchmark. For example, if you try the compiler switch /QxN (again assuming the Intel C/C++ Compiler v8.0), you should get performance numbers similar to what I get:

Intel Xeon (2.2 GHz) - 544 fps

The command I used to compile your benchmark is the following:
icl /QxN main.cpp RenderSoft.cpp winmm.lib

Notice that if you use the /QxN switch, your code will not run on AMD processors. You may try /QxK on AMD instead, but in my experiment it doesn't help AMD performance:

AMD Athlon XP (1.6 GHz) - 455 fps.

If you try one more advanced optimization switch, /Qip, in addition to /QxN, you might get even more performance benefit:

Intel Xeon (2.2 GHz) - 590 fps
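
The corresponding invocation simply adds the switch to the command line above:

icl /QxN /Qip main.cpp RenderSoft.cpp winmm.lib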

Another suggestion is to slightly modify your code to improve its performance:

void RenderSoft::RenderSolidPolygon(Poly *pPolygon)
{
    int i, a = 0, b = 1, c = 2, n = pPolygon->dwNumVertices;
    for (i = 0; i < n - 2; i++)
    {
        TLVertex *tmp;
#if 0
        /* This is the original code */
        TLVertex *v1 = &pPolygon->VertexList[a % n];
        TLVertex *v2 = &pPolygon->VertexList[(b++) % n];
        TLVertex *v3 = &pPolygon->VertexList[(c++) % n];
#else
        /* This is the proposed code */
        TLVertex *v1 = &pPolygon->VertexList[0];
        TLVertex *v2 = &pPolygon->VertexList[b++];
        TLVertex *v3 = &pPolygon->VertexList[c++];
#endif
        ....
    }
}

The reason is that you use a modulo by n, which compiles to an expensive integer division. In my view, it is better to rewrite this fragment as shown above (a more general alternative is sketched below this list). The % can be removed because:
1) a remains unchanged in the loop (loop invariant) and equals 0, so a%n is always 0;
2) b is always less than n in the loop, so b%n is always equal to b;
3) c is always less than n in the loop, so c%n can be replaced with just c.
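
More generally, if an index really could reach n (not the case in this loop), a compare-and-reset is still much cheaper than %. A hypothetical sketch:

// Sketch: advance a wrapping index without the integer division
// that % implies; one compare instead of a div instruction.
static inline int wrap_next(int idx, int n)
{
    int next = idx + 1;
    return (next >= n) ? 0 : next;
}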

Using this modification I obtained the following figures with /QxN:

Intel Xeon (2.2 GHz) - 675 fps

Using /QxN /Qipo I got:

Intel Xeon (2.2 GHz) - 708 fps

Sergey

pshelepugin:

Diogo and Sergey,
I looked through this stuff and found out that the issue is the Infs and NaNs the benchmark operates on. Actually, every second time, the parameters passed to the hottest function cause a division by zero, and then there are a few operations on Infs. I reduced the benchmark to the following test case (it reproduces the same sequence of operations as the original function from the test):

#include <stdio.h>

/* Called with arguments: 1 0 0 0 1 0 0 0 */
float render(float botx, float topx, float boty, float topy,
             float ypre, float topz, float gradsdzdy, float gradsdzdx)
{
    float z;
    float fw = botx - topx;                 /* (1 - 0) */
    float fh = 1.0f / (boty - topy);        /* 1/(0 - 0) = Inf */
    float fp_x = ((fw * ypre) * fh) + topx; /* ((1*1)*Inf) + 0 = Inf */
    float xpre = fp_x - topx;               /* (Inf - 0) = Inf */
    float fp_xstep = fw * fh;               /* (1*Inf) = Inf */

    /* NaN = (0 + (1*0) + (Inf*0)) */
    z = (topz + (ypre * gradsdzdy) + (xpre * gradsdzdx));
    return z;
}

int main()
{
    int i;
    float a;
    for (i = 0; i < 100000000; i++)
        a = render(1, 0, 0, 0, 1, 0, 0, 0);
    printf("%f\n", a);
    return 0;
}
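
To reproduce the comparison, assuming the test case is saved as test.cpp, the two builds are simply:

icl /O2 test.cpp
icl /QxN test.cpp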

This code is 104 times slower when compiled with /O2 than with /QxN. So in such cases (i.e., when the arguments are Infs and NaNs), the x87 FPU is much slower than SSE instructions.
Diogo, don't you think that something is wrong with the input parameters? Are you sure that this application operates on correct data?
Thanks.
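
If coincident vertices really can occur in the data, one defensive option is to reject degenerate edges before taking the reciprocal. A sketch using the names from the test case above (the epsilon value is an assumption):

#include <math.h>

/* Sketch: refuse to compute 1/(boty - topy) for an edge with
   (almost) no height, so fh can never become Inf. */
int safe_height_recip(float boty, float topy, float *fh)
{
    float dy = boty - topy;
    if (fabsf(dy) < 1e-6f)   /* hypothetical epsilon */
        return 0;            /* caller skips this edge */
    *fh = 1.0f / dy;
    return 1;
}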

fozi_b@yahoo.com:

Sorry for replying so late. I managed to partially solve the problem by eliminating the divisions by zero some weeks ago. The difference in performance was still considerable; I didn't think the difference between SSE and non-SSE code could be that big.


smaidano, thanks a lot for the tips! It didn't cross my mind that the mods could be affecting performance that much, maybe because the scenes I tested had very low polygon counts. Lesson learned!


pshelepugin, it took me a while to find that same problem. About the consistency of the data, could it be a problem with denormal values? I read somewhere that they can affect FPU performance considerably.
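
To check for that, I could use something like the following (a sketch; the helper names are mine):

#include <float.h>
#include <math.h>

/* Denormal: nonzero but smaller in magnitude than FLT_MIN. */
int is_denormal(float v)
{
    return v != 0.0f && fabsf(v) < FLT_MIN;
}

/* Inf or NaN: |v| is not <= FLT_MAX (NaN compares false to everything). */
int is_inf_or_nan(float v)
{
    return !(fabsf(v) <= FLT_MAX);
}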


With both your contributions, the problem with that specific piece of code is now solved. I will check the rest of the code for these same problems, hoping to get fair performance on all my test machines. Thanks again, guys.
