I have code that VTune reports as having a high level of load blocks due to store overlap (caused by 4K aliasing). I implemented the same code using SSE, and this bottleneck seems to have disappeared.
But I couldn't find any information on whether the 4K aliasing bottleneck affects SSE code. All the examples I found use non-SSE code. Is there any documentation on whether SSE load/store instructions are somehow immune to this problem?
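
For context, this is a minimal sketch of the address condition as I understand it: a load can be flagged as 4K-aliased with an earlier store when the two addresses are different but share the same low 12 bits, i.e. they sit a multiple of 4096 bytes apart. The helper name `same_4k_offset` is hypothetical and only illustrates the check. It is not my actual code, and it ignores access width and timing, which presumably also matter to the hardware.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (assumption, for illustration only):
 * returns true when two distinct addresses have identical low 12 bits,
 * i.e. they differ by a nonzero multiple of 4096 bytes. */
static bool same_4k_offset(const void *store_addr, const void *load_addr)
{
    uintptr_t s = (uintptr_t)store_addr;
    uintptr_t l = (uintptr_t)load_addr;

    /* XOR clears the bits that match; masking with 0xFFF keeps bits 0..11. */
    return s != l && ((s ^ l) & (uintptr_t)0xFFF) == 0;
}
```

With this check, a store to `buf` followed by a load from `buf + 4096` would match, while a load from `buf + 64` would not.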