Intel C++ intrinsics SIMD and data layout

Hello all.

I'm trying my hand at optimizing my app for the Pentium 4 using Intel C++ v7 (for the first time). Very briefly, the app extracts square regions from a pair of digital images, does a gigantic amount of math on them, and saves data as ASCII files. The region dimensions are always divisible by 4 (e.g. 32x32, 64x64, or 128x128 pixels), and so are prime candidates for SIMD optimizations. I'm allocating memory as long 1-D "float" arrays using "_mm_malloc(...)" and aligning along 16-byte boundaries as follows (for a 64 x 64 element float array, for example):

Region1 = (float*) _mm_malloc((64*64) * sizeof(float), 16);
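To make the long-vector approach concrete, here is a minimal sketch of how two such aligned strips might be processed four floats at a time. The helper names alloc_region and multiply_regions are hypothetical (they are not from my app), and the element-wise multiply simply stands in for the real math:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics; GCC, Clang and Intel C++ also declare _mm_malloc/_mm_free here */

/* Hypothetical helper: allocate n floats on a 16-byte boundary. */
static float *alloc_region(size_t n)
{
    float *p = (float *) _mm_malloc(n * sizeof(float), 16);
    assert(p != NULL && ((uintptr_t) p % 16) == 0);
    return p;
}

/* Hypothetical helper: dst[i] = a[i] * b[i].
   Assumes n is a multiple of 4 and all pointers are 16-byte aligned. */
static void multiply_regions(float *dst, const float *a, const float *b, size_t n)
{
    size_t i;
    assert(n % 4 == 0);
    for (i = 0; i < n; i += 4) {
        __m128 va = _mm_load_ps(a + i);   /* aligned load of 4 floats */
        __m128 vb = _mm_load_ps(b + i);
        _mm_store_ps(dst + i, _mm_mul_ps(va, vb));
    }
}
```

Memory obtained this way has to be released with _mm_free rather than free.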

Since my regions are always divisible by 4 I felt that a long strip, or "Vector" of data was most suitable for SIMD single precision floating-point operations. However, while perusing the Intel Technology Journal articles on the P3, I found an article entitled "Applications Tuning for the Streaming SIMD Extensions" wherein they claim this is inefficient memory usage, and that I would be better served using a "hybrid" data layout as follows:

struct {
float A[8], B[8], C[8];
} Hybrid_data;

This is supposed to maintain the SIMD friendly data ordering (SoA rather than AoS) while increasing the likelihood that data members are located in the same memory page. They present the results of an experiment wherein the above approach showed the best overall performance while the SoA approach (my original algorithm) was "fully memory limited; it was unable to approach the best performance despite the natural SIMD ordering of data". It will take a concentrated effort to rewrite my app to use the Hybrid data ordering as presented by the authors, but if I can get a noticeable improvement in memory usage I believe it is worth it. I was thinking of defining structures for each 32 or 64 element strip, as in (for 32x32 regions):

struct {
float region1[32], region2[32];
} Hybrid_data[32];
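As a rough sketch of what working with that layout could look like, the code below gives the struct a name and walks it row by row. HybridRow, alloc_hybrid and hybrid_dot are hypothetical names of mine, and the sum of products is only a placeholder for the real computation. Each row is 64 floats (256 bytes), so if the array starts on a 16-byte boundary, every strip inside it is aligned as well:

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>

#define ROW_LEN  32   /* floats per strip, for 32x32 regions */
#define NUM_ROWS 32

/* One strip from each region stored side by side (the hybrid layout). */
typedef struct {
    float region1[ROW_LEN];
    float region2[ROW_LEN];
} HybridRow;

/* Hypothetical helper: allocate all rows on a 16-byte boundary. */
static HybridRow *alloc_hybrid(void)
{
    HybridRow *rows = (HybridRow *) _mm_malloc(NUM_ROWS * sizeof(HybridRow), 16);
    assert(rows != NULL);
    return rows;
}

/* Placeholder computation: sum of region1[i] * region2[i] over every row. */
static float hybrid_dot(const HybridRow *rows)
{
    __m128 acc = _mm_setzero_ps();
    float lanes[4];
    size_t r, i;
    for (r = 0; r < NUM_ROWS; ++r) {
        for (i = 0; i < ROW_LEN; i += 4) {
            __m128 v1 = _mm_load_ps(rows[r].region1 + i);
            __m128 v2 = _mm_load_ps(rows[r].region2 + i);
            acc = _mm_add_ps(acc, _mm_mul_ps(v1, v2));
        }
    }
    _mm_storeu_ps(lanes, acc);  /* horizontal sum of the four partial sums */
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

The inner loop is the same as it would be for two separate long vectors; only the addressing changes, so whether this helps depends entirely on the memory behaviour and would need to be measured.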

Not having any experience in these matters, I am asking anyone for any comments and opinions. Thanks everyone, and I look forward to hearing your views on this!

Welcome to the forum!

Sorry for the rather late reply. Still catching up from the holidays :-)

Here is what I refer to on a constant basis:

The "Intel Pentium 4 and Intel Xeon Processor Optimization Manual" could be of value to you. The set of three Software Developer's Manuals is something I refer to all the time as well.

Please note that the Intel compiler provides you with many features that may make it easier for you to accomplish what you are trying to do. More info at
Please see - for a quick introduction to optimizing with the Intel compiler.

Reading your writeup, you may particularly benefit from auto-vectorization and auto-parallelization. Please read up about them. The user's guide for the Intel compiler is also online. (Look for "user guide" on the right-hand side of the screen for either Windows or Linux.)

I am eager to hear how your progress goes.


Hi Ganesh.

Thanks for the links. Being new at this, it's taking me a while to sort out all the info. I found the P4 and Xeon manual only partially helpful since most of the examples were in assembly language (which I don't know).

Regarding the issue of data structures, I wrote a test program that included the approach I detailed above as well as the hybrid data approach and found, surprisingly, that the performance of the hybrid structure was lousy! I went back to the long vector approach and will try to fine-tune it by prefetching.
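For anyone curious, this is roughly the kind of prefetching I have in mind, as a minimal sketch only. The function name scale_region, the scaling operation, and the PREFETCH_AHEAD distance are all placeholder assumptions; the right distance has to be found by measurement:

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>

#define PREFETCH_AHEAD 64  /* floats to look ahead; an untuned guess */

/* Hypothetical example: multiply every element of a long aligned vector by
   factor, hinting the cache about data needed a few iterations later.
   Assumes n is a multiple of 4 and data is 16-byte aligned. */
static void scale_region(float *data, size_t n, float factor)
{
    const __m128 vf = _mm_set1_ps(factor);
    size_t i;
    assert(n % 4 == 0);
    for (i = 0; i < n; i += 4) {
        /* Only hint addresses that are still inside the array. */
        if (i + PREFETCH_AHEAD < n)
            _mm_prefetch((const char *) (data + i + PREFETCH_AHEAD), _MM_HINT_T0);
        _mm_store_ps(data + i, _mm_mul_ps(_mm_load_ps(data + i), vf));
    }
}
```

This issues a hint on every iteration to keep the sketch short; a tuned version would issue one per cache line. A prefetch is only a hint, so it does not change the results, only (possibly) the timing.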
