I'm trying my hand at optimizing my app for the Pentium 4 using Intel C++ v7 (for the first time). Very briefly, the app extracts square regions from a pair of digital images, does a gigantic amount of math on them, and saves the data as ASCII files. The region dimensions are always divisible by 4 (e.g. 32x32, 64x64, or 128x128 pixels), so they are prime candidates for SIMD optimizations. I'm allocating memory as long 1-D "float" arrays using "_mm_malloc(...)", aligned on 16-byte boundaries, as follows (for a 64 x 64 element float array, for example):
Region1 = (float*) _mm_malloc(64 * 64 * sizeof(float), 16);
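To make the flat layout concrete, here is a minimal sketch of how a strip like this might be allocated and then processed four floats at a time with SSE intrinsics. The helper names alloc_region and scale_strip, and the choice of a simple scaling operation, are made up for illustration and are not part of the actual app.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics, _mm_malloc, _mm_free

// Hypothetical helper: allocate a side x side region as one flat,
// 16-byte-aligned strip of floats. Release it with _mm_free.
float* alloc_region(std::size_t side)
{
    return static_cast<float*>(_mm_malloc(side * side * sizeof(float), 16));
}

// Hypothetical helper: multiply every element of the strip by `factor`,
// four floats per iteration. `strip` must be 16-byte aligned and
// `count` must be a multiple of 4.
void scale_strip(float* strip, std::size_t count, float factor)
{
    assert(count % 4 == 0);
    const __m128 f = _mm_set1_ps(factor);       // broadcast factor to all 4 lanes
    for (std::size_t i = 0; i < count; i += 4) {
        __m128 v = _mm_load_ps(strip + i);      // aligned load of 4 floats
        _mm_store_ps(strip + i, _mm_mul_ps(v, f));
    }
}
```

Because the strip length is a multiple of 4 and the base address is 16-byte aligned, every load and store in the loop can use the aligned forms, with no scalar cleanup loop at the end.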
Since my region dimensions are always divisible by 4, I felt that a long strip, or "vector", of data was most suitable for SIMD single-precision floating-point operations. However, while perusing the Intel Technology Journal articles on the P3, I found an article entitled "Applications Tuning for the Streaming SIMD Extensions" in which the authors claim this is inefficient memory usage, and that I would be better served by a "hybrid" data layout as follows:
float A, B, C;
This is supposed to maintain the SIMD-friendly data ordering (SoA rather than AoS) while increasing the likelihood that data members are located in the same memory page. They present the results of an experiment in which the above approach showed the best overall performance, while the SoA approach (my original algorithm) was "fully memory limited; it was unable to approach the best performance despite the natural SIMD ordering of data". It will take a concentrated effort to rewrite my app to use the hybrid data ordering as presented by the authors, but if I can get a noticeable improvement in memory usage I believe it is worth it. I was thinking of defining structures for each 32- or 64-element strip, as in (for 32x32 regions):
float region1, region2;
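A minimal sketch of what such a per-strip structure might look like in practice, written in plain C++ so the layout is easy to see. Everything here is an assumption for illustration: the array sizes, the names RegionStrip, pack_strips, and dot_regions, and the use of C++11 alignas (an older compiler would need its own alignment attribute instead). The sum of products is only a stand-in for the real math.

```cpp
#include <cassert>
#include <cstddef>

const std::size_t kSide = 32;  // assumed region width and height

// Hypothetical hybrid strip: one row of each region stored back to back,
// so matching rows of the two regions sit close together in memory.
struct alignas(16) RegionStrip {
    float region1[kSide];
    float region2[kSide];
};

// Copy two flat kSide x kSide regions into kSide hybrid strips.
void pack_strips(const float* flat1, const float* flat2, RegionStrip* strips)
{
    assert(flat1 && flat2 && strips);
    for (std::size_t row = 0; row < kSide; ++row) {
        for (std::size_t col = 0; col < kSide; ++col) {
            strips[row].region1[col] = flat1[row * kSide + col];
            strips[row].region2[col] = flat2[row * kSide + col];
        }
    }
}

// Example computation that touches both regions row by row:
// the sum of element-wise products over the whole region pair.
float dot_regions(const RegionStrip* strips)
{
    float sum = 0.0f;
    for (std::size_t row = 0; row < kSide; ++row)
        for (std::size_t col = 0; col < kSide; ++col)
            sum += strips[row].region1[col] * strips[row].region2[col];
    return sum;
}
```

Each strip is 256 bytes, so every inner array still starts on a 16-byte boundary and remains usable with aligned SIMD loads, while a loop that reads both regions walks a single contiguous block instead of two separate allocations.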
Not having any experience in these matters, I would welcome any comments and opinions. Thanks everyone, and I look forward to hearing your views on this!