Using SSE on a single Nehalem core gives my program a 100% speedup. The speedup contributed by SSE decreases radically as more cores on the same CPU are used, i.e. when every rank of the MPI program runs the SSE code on the cores of one chip. If I run the same program on several Nehalems joined by InfiniBand and use only 1 core per CPU (x4 or x8), I can once again show that SSE provides a good speedup.
I believe that the SSE slowdown on a multicore CPU, compared with using cores from multiple CPUs, is due to saturation of the memory bandwidth. How can I prove or disprove this? Is there a simple way to detect when the chip's memory bandwidth saturates, for example by detecting VM page faults or something similar?
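One way to test the bandwidth hypothesis might be a STREAM-style triad probe. The C sketch below is only an illustration and not part of my program: the function name `triad_bandwidth_gbs`, its parameters, and the checksum output are assumptions made for the example. It times `a[i] = b[i] + s * c[i]` over three large arrays and reports the best rate, counting two reads and one write per element.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Wall-clock seconds from a monotonic clock. */
static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
}

/*
 * Hypothetical helper: run the triad a[i] = b[i] + s * c[i] over n doubles,
 * reps times, and return the best observed rate in GB/s (1e9 bytes/s).
 * Returns -1.0 if allocation fails or the elapsed time is not measurable.
 * *checksum receives a value derived from the results so that the compiler
 * has a reason to keep the stores.
 */
double triad_bandwidth_gbs(size_t n, int reps, double *checksum)
{
    assert(n > 0 && reps > 0 && checksum != NULL);

    double *a = malloc(n * sizeof *a);
    double *b = malloc(n * sizeof *b);
    double *c = malloc(n * sizeof *c);
    if (!a || !b || !c) {
        free(a); free(b); free(c);
        return -1.0;
    }

    /* Touch every page before timing so page faults are not measured. */
    for (size_t i = 0; i < n; i++) {
        a[i] = 0.0; b[i] = 1.0; c[i] = 2.0;
    }

    double best = 1e300;
    double sum = 0.0;
    for (int r = 0; r < reps; r++) {
        double s = (double)(r + 1);      /* vary the scalar on each pass */
        double t0 = now_seconds();
        for (size_t i = 0; i < n; i++)
            a[i] = b[i] + s * c[i];
        double dt = now_seconds() - t0;
        if (dt < best)
            best = dt;
        sum += a[n / 2];                 /* consume one result per pass */
    }
    *checksum = sum;

    free(a); free(b); free(c);
    if (best <= 0.0)
        return -1.0;

    /* Two arrays read and one written per pass. */
    double bytes = 3.0 * (double)n * (double)sizeof(double);
    return bytes / best / 1e9;
}
```

The idea would be to call this from every MPI rank at the same time, with `n` chosen so that the three arrays are far larger than the last-level cache, first with 1 rank per chip and then with all cores of one chip. If the per-rank figure falls roughly in proportion to the number of ranks on a chip while the sum over ranks stays flat, the memory bus is the likely bottleneck; if the per-rank figure holds steady, the cause is probably elsewhere.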
I would imagine that this effect is even more severe with AVX, as AVX will demand even more memory bandwidth. I am still stuck with the AVX emulator, but I will get round to testing it in practice when I figure out how to fit 1155 pins into a 1156 socket.