SSE slower ?

I have been trying to optimize my 3D rendering application using SSE for the P3 and P4. I have successfully switched my Vector and Triangle classes over to use F32vec4 and the SIMD intrinsics. My application now runs slower than before. Obviously, my approach is wrong. What can I do to find out where the slowdown has happened?

To achieve a good speedup you have to do more than just use the SSE opcodes. Particularly important is the layout of the data in your 3D model database. If that's not already the case, organize your data as SoA (Structure of Arrays) and align the data to 16-byte boundaries. Unless the data are statically organized in a SIMD-friendly format like SoA, dynamic swizzling code (aka shuffling) will be a major performance killer, and you can very well end up with a slowdown.

Typical AoS (Array of Structures):

struct Pt3D {
    float x, y, z;
};

class Pts3DAoS {
    Pt3D *pts; // dynamic allocation omitted
};

Typical SoA (Structure of Arrays):

class Pts3DSoA {
    float *x, *y, *z; // dynamic allocation omitted
};
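As a rough sketch of why the SoA layout suits SSE: with separate x, y and z arrays, four dot products can be computed per iteration without any shuffling. The function name dot_soa and its parameters are made up for illustration, n is assumed to be a multiple of 4, and unaligned loads are used so the sketch does not depend on how the arrays were allocated (with 16-byte aligned data, _mm_load_ps and _mm_store_ps could be used instead).

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cstddef>

// Hypothetical helper: dot[i] = ax[i]*bx[i] + ay[i]*by[i] + az[i]*bz[i].
// Assumption: n is a multiple of 4 (no scalar tail loop in this sketch).
void dot_soa(const float* ax, const float* ay, const float* az,
             const float* bx, const float* by, const float* bz,
             float* dot, std::size_t n)
{
    for (std::size_t i = 0; i < n; i += 4) {
        // Each load brings in the same component of four different vectors.
        __m128 x = _mm_mul_ps(_mm_loadu_ps(ax + i), _mm_loadu_ps(bx + i));
        __m128 y = _mm_mul_ps(_mm_loadu_ps(ay + i), _mm_loadu_ps(by + i));
        __m128 z = _mm_mul_ps(_mm_loadu_ps(az + i), _mm_loadu_ps(bz + i));
        // All four lanes carry useful results; no shuffles are needed.
        _mm_storeu_ps(dot + i, _mm_add_ps(_mm_add_ps(x, y), z));
    }
}
```

Whether this beats the scalar version in a given application still has to be measured.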

I really don't have much information here, but here are some next steps.
(i) Try looking at perfmon or other system-wide performance tools. This is a basic first step, and I know it is really obvious, but it is also easy to forget (I have). Make sure that you are actually close to 100% CPU bound and are not waiting on I/O, a semaphore, or the like.
(ii) If you haven't already done so, please consider using the VTune analyzer to profile your app before and after. This will give you an indication of where the hotspot is and what happened to it.
(iii) A search on Google should give you some good articles. I have seen articles on this in the past (search for writeups by my esteemed colleague Haim).
(iv) Please let the compiler make it easy for you. Does autovectorization work on your algorithm? It is MUCH easier than tediously putting in intrinsics. Please see the compiler documentation to get you started; look for the Optimization guide and the compiler user guide for more information.
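As an illustration of point (iv), here is the kind of loop an autovectorizing compiler can usually handle: unit-stride accesses, no dependence between iterations, and pointers declared not to overlap. The function scale_add is a made-up example, and __restrict is a widely supported compiler extension rather than standard C++. Whether the loop actually gets vectorized depends on the compiler and its settings, so check the compiler's vectorization report or the generated assembly.

```cpp
#include <cstddef>

// Hypothetical example: out[i] = a[i] * s + b[i].
// __restrict (a compiler extension) promises that the arrays do not overlap,
// which removes one common obstacle to autovectorization.
void scale_add(float* __restrict out,
               const float* __restrict a,
               const float* __restrict b,
               float s, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a[i] * s + b[i];
}
```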

Well - I hope that helped.


You should also check this link that has a bunch of good guides on SSE code use:

SSE App notes

Another really good one is:
Power Programming - SIGGRAPH 2001

Aaron Coday

Message Edited by on 12-09-2005 10:42 AM

Thanks for all the suggestions.
Currently I have declared some of my simple types like this:

typedef struct Vector {
float x,y,z;
} Vector;

typedef struct Triangle {
Vector v0;
Vector edge1;
Vector edge2;
} Triangle;

typedef struct Ray {
Vector origin;
Vector dir;
} Ray;

and my Vector operations are all declared as macros like this:

#define VCT_ADD( v3, v1, v2 ) \
(v3).x = (v1).x + (v2).x, \
(v3).y = (v1).y + (v2).y, \
(v3).z = (v1).z + (v2).z

You would think that if I just changed to the following

typedef __m128 Vector;

and used the intrinsics in all my macros, I would get a speed increase. But I don't. Is it possible that the compiler is doing a better job optimizing my simple types than I will be able to do trying to roll my own SSE with intrinsics?


> a speed increase. But I don't. Is it possible that
> the compiler is doing a better job optimizing my
> simple types than I will be able to do trying to roll
> my own SSE with intrinsics?

Have a look at the generated ASM to see whether your original version was already using SSE packed opcodes like ADDPS. For a better speedup you may consider processing four triangles at a time; it will allow more independent execution paths and make better use of the deep out-of-order (OOO) window of the P4.
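To make the cost of the one-__m128-per-Vector approach concrete, here is a minimal sketch of a dot product on a single packed xyz vector. The helper name dot_packed and the (x, y, z, unused) lane layout are assumptions for illustration, not the original poster's code. Only three of the four lanes do useful work, and the horizontal sum needs two shuffles and two scalar adds, which is the kind of overhead that can cancel out the gain from the packed multiply.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical helper: dot product of two xyz vectors, each stored in one
// __m128 with lanes (x, y, z, unused).
float dot_packed(__m128 a, __m128 b)
{
    __m128 m = _mm_mul_ps(a, b);                               // per-lane products
    __m128 y = _mm_shuffle_ps(m, m, _MM_SHUFFLE(1, 1, 1, 1));  // broadcast lane 1
    __m128 z = _mm_shuffle_ps(m, m, _MM_SHUFFLE(2, 2, 2, 2));  // broadcast lane 2
    __m128 s = _mm_add_ss(_mm_add_ss(m, y), z);                // lane 0 = x + y + z products
    return _mm_cvtss_f32(s);                                   // extract lane 0
}
```

Processing four vectors (or triangles) per operation avoids this horizontal step, because each lane then holds the same component of a different vector.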

Bronx - Good points, all. I find your mailings to be insightful and thanks for contributing.
I hope you will continue to post to this forum!

