Hi everyone. I'm using the Intel C++ v7 compiler with the SSE SIMD intrinsics. Below is an example of the kind of simple loop my application contains:
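// Three 64x64 float arrays, each 16-byte aligned for SSE
StorVec1 = (float*) _mm_malloc((64*64)*sizeof(float), 16);
StorVec2 = (float*) _mm_malloc((64*64)*sizeof(float), 16);
ResultVec = (float*) _mm_malloc((64*64)*sizeof(float), 16);
// Multiply element-wise, four floats per iteration
for(int x=0; x<4096; x+=4)
    *(__m128*) (ResultVec+x) = _mm_mul_ps(*((__m128 *)(StorVec1+x)), *((__m128 *)(StorVec2+x)));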
Although the data access patterns are all very regular and predictable (which, I believe, is ideal for the P4's automatic hardware prefetcher), I'd like to experiment with software prefetching (according to VTune, my clockticks per instructions retired ratio, i.e. CPI, is 1.27). Could someone please give me a head start by posting a couple of lines of code showing how to implement software prefetching for the above loop? I believe I have to use the "_mm_prefetch(...)" intrinsic with the "_MM_HINT_T0" argument, since the result of the operation will be used again in the following loop. Thanks everyone!
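For reference, here is a rough sketch of what I imagine the prefetching version might look like. It is only a guess under stated assumptions, not something I have measured: the function name multiply_with_prefetch, the 64-byte cache line size, and the prefetch distance of four lines ahead are all made up for illustration and would need tuning with VTune. It issues one _mm_prefetch per input array per cache line, and skips the prefetch near the end of the arrays so no out-of-range address is ever formed.

```cpp
#include <xmmintrin.h>  // SSE intrinsics: _mm_prefetch, _mm_load_ps, _mm_mul_ps, ...
#include <cassert>
#include <cstddef>

// Assumed tuning values (illustrative, not measured):
// a 64-byte cache line holds 16 floats; prefetch 4 lines ahead.
const std::size_t kFloatsPerLine = 16;
const std::size_t kPrefetchAhead = 4 * kFloatsPerLine;

// Hypothetical helper: result[i] = a[i] * b[i] for i in [0, n).
// All pointers must be 16-byte aligned and n a multiple of 16.
void multiply_with_prefetch(const float* a, const float* b,
                            float* result, std::size_t n)
{
    assert(n % kFloatsPerLine == 0);

    for (std::size_t x = 0; x < n; x += kFloatsPerLine) {
        // One prefetch per input stream per cache line, skipped
        // when the target would lie past the end of the arrays.
        if (x + kPrefetchAhead < n) {
            _mm_prefetch(reinterpret_cast<const char*>(a + x + kPrefetchAhead),
                         _MM_HINT_T0);
            _mm_prefetch(reinterpret_cast<const char*>(b + x + kPrefetchAhead),
                         _MM_HINT_T0);
        }

        // Process the current cache line: four SSE multiplies of 4 floats each.
        for (std::size_t i = x; i < x + kFloatsPerLine; i += 4) {
            _mm_store_ps(result + i,
                         _mm_mul_ps(_mm_load_ps(a + i), _mm_load_ps(b + i)));
        }
    }
}
```

With the arrays from my loop above, the call would be multiply_with_prefetch(StorVec1, StorVec2, ResultVec, 64*64). I don't know whether this actually beats the hardware prefetcher on such a regular access pattern, which is really what I'd like to find out.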