I have been trying to optimize my 3D rendering application using SSE for the P3 and P4. I have successfully switched my Vector and Triangle classes over to use F32vec4 and the SIMD intrinsics, but my application now runs slower than before. Obviously, my approach is wrong. What can I do to find out where the slowdown has happened?
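To make the question concrete, here is a minimal sketch of how one routine could be timed in isolation, scalar against SSE, before reaching for a profiler. Everything in it is an assumption for illustration: `dot_scalar`, `dot_sse`, and `time_dot` are hypothetical names, not code from my application, and a 4-component dot product simply stands in for whatever the Vector and Triangle classes do in their hot loops. It uses the raw `xmmintrin.h` intrinsics rather than F32vec4, and falls back to the scalar path when SSE is not available at compile time.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_SSE 1
#else
#define HAVE_SSE 0
#endif

// Hypothetical stand-in for one hot routine: a 4-component dot product.
float dot_scalar(const float* a, const float* b) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}

// Same computation with SSE intrinsics (scalar fallback without SSE).
float dot_sse(const float* a, const float* b) {
#if HAVE_SSE
    // Unaligned loads, so the caller's data needs no special alignment.
    __m128 p = _mm_mul_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
    // Horizontal sum of the four products using SSE1 instructions only.
    __m128 shuf = _mm_shuffle_ps(p, p, _MM_SHUFFLE(2, 3, 0, 1));
    __m128 sums = _mm_add_ps(p, shuf);
    shuf = _mm_movehl_ps(shuf, sums);
    sums = _mm_add_ss(sums, shuf);
    return _mm_cvtss_f32(sums);
#else
    return dot_scalar(a, b);
#endif
}

using DotFn = float (*)(const float*, const float*);

// Runs fn over consecutive 4-float groups in data, repeats times, and
// returns the elapsed wall-clock seconds. The accumulated result is
// written to *sink so the compiler cannot discard the loop.
double time_dot(DotFn fn, const float* data, std::size_t count,
                int repeats, float* sink) {
    assert(count >= 8 && repeats > 0);
    float acc = 0.0f;
    auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r) {
        for (std::size_t i = 0; i + 8 <= count; i += 4) {
            acc += fn(data + i, data + i + 4);
        }
    }
    auto stop = std::chrono::steady_clock::now();
    *sink = acc;
    return std::chrono::duration<double>(stop - start).count();
}
```

Calling `time_dot(dot_scalar, ...)` and `time_dot(dot_sse, ...)` on the same buffer with a large repeat count gives two numbers to compare. If the SSE version loses even here, the cost is probably inside the routine itself (for example the horizontal add, or unaligned loads); if it wins here but the full application is still slower, the overhead is more likely in how data moves in and out of the vector types.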