I have some FLOP-intensive code that reaches only around 25% - 30% of the theoretical peak performance on a Core i7-920.
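For reference, the 25% - 30% figure can be sanity-checked with a back-of-the-envelope peak estimate. The sketch below rests on several assumptions: single precision, the i7-920's nominal 2.66 GHz clock with Turbo Boost ignored, 4 cores, and one 4-wide SSE multiply plus one 4-wide SSE add issued per core per cycle. The function names are made up for illustration and are not part of the attached project.

```cpp
#include <cassert>

// Rough theoretical peak in GFLOP/s.
// ASSUMPTION: every core can issue `ops_per_cycle` SIMD instructions per cycle,
// each operating on `simd_width` floats (for SSE single precision: 4 floats,
// one multiply + one add per cycle, so ops_per_cycle = 2).
double peak_gflops(int cores, double clock_ghz, int simd_width, int ops_per_cycle) {
    assert(cores > 0 && clock_ghz > 0.0 && simd_width > 0 && ops_per_cycle > 0);
    return cores * clock_ghz * simd_width * ops_per_cycle;
}

// Fraction of the theoretical peak that a measured throughput represents.
double fraction_of_peak(double measured_gflops, double peak) {
    assert(peak > 0.0);
    return measured_gflops / peak;
}
```

Under those assumptions, peak_gflops(4, 2.66, 4, 2) comes out at about 85 GFLOP/s, so a measured 22-23 GFLOP/s is roughly 26% - 27% of peak.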
I'm attaching the code as a Visual Studio project. In electro_SSE.cpp, you will find three implementations of int CalcField_CPU_T_Curvature. The first is a generic template, completely unaware of vectorization. The second is my (quite awful, I would say) attempt at manually vectorizing everything with inline assembly. The third is a more elegant implementation using intrinsics, which also reduces the number of memory loads by a factor of 4 compared with the second version.
All three run at the same speed of 22-23 GFLOP/s. The only difference between the first implementation and the other two is that the first causes the compiler to generate rsqrt at one point, while the other two use the more precise sqrt.
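To make the rsqrt versus sqrt distinction concrete, here is a minimal sketch using SSE intrinsics. It is not taken from electro_SSE.cpp; the function names are hypothetical, and it assumes an x86 compiler that provides xmmintrin.h. The approximate version maps to a single rsqrtps instruction with about 12 bits of precision, while the precise version needs a sqrtps followed by a divps.

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cassert>
#include <cmath>

// Fast approximate 1/sqrt(x) on four floats (rsqrtps, roughly 12 bits of precision).
inline __m128 rsqrt_approx(__m128 x) {
    return _mm_rsqrt_ps(x);
}

// Full single-precision 1/sqrt(x): sqrtps followed by divps.
inline __m128 rsqrt_precise(__m128 x) {
    return _mm_div_ps(_mm_set1_ps(1.0f), _mm_sqrt_ps(x));
}

// Largest relative difference between the two versions over four positive inputs.
// Hypothetical helper, only here to show how far apart the results are.
float max_relative_gap(const float in[4]) {
    float approx[4], precise[4];
    const __m128 x = _mm_loadu_ps(in);
    _mm_storeu_ps(approx, rsqrt_approx(x));
    _mm_storeu_ps(precise, rsqrt_precise(x));
    float worst = 0.0f;
    for (int i = 0; i < 4; ++i) {
        assert(in[i] > 0.0f);  // 1/sqrt is only meaningful for positive inputs here
        const float gap = std::fabs(approx[i] - precise[i]) / precise[i];
        if (gap > worst) worst = gap;
    }
    return worst;
}
```

Calling max_relative_gap on a few positive values should show a gap well below 0.1%, which is small but may still matter if the field calculation needs full single precision.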
This limitation in performance makes me believe that the CPU may not be capable of more for this specific dataset, as I have already expressed here:
Some have requested that I post a simplified version of the source code, so here is an isolated test case that will work on its own