usage of "_mm_prefetch(...)"

usage of "_mm_prefetch(...)"

Hi everyone. I'm using the Intel C++ v7 compiler with the SIMD intrinsics (SSE). Below is an example of the kind of simple loop my application contains:

StorVec1 = (float*) _mm_malloc((64*64)*4, 16);
StorVec2 = (float*) _mm_malloc((64*64)*4, 16);
ResultVec = (float*) _mm_malloc((64*64)*4, 16);
.
.
.
for(int x=0; x<4096; x+=4)
{
    *(__m128*)(ResultVec+x) = _mm_mul_ps(*(__m128*)(StorVec1+x), *(__m128*)(StorVec2+x));
}

Although the data access patterns are all very regular and predictable (which, I believe, is optimal for the P4's automatic hardware prefetcher), I'd like to experiment with software prefetching (according to VTune, my clockticks per instructions retired (CPI) ratio is 1.27). Could someone please give me a head start by posting a couple of lines of code showing how to implement software prefetching for the above loop? I believe I have to use the _mm_prefetch(...) intrinsic with the _MM_HINT_T0 argument, since the result of the operation will be used again in the following loop. Thanks everyone!
- Lionel


The Pentium 4 and Xeon optimization manual I pointed to in the earlier reply has an example I hope will help you.

http://developer.intel.com/design/Pentium4/manuals/

Thanks again for using the forum.

Cheers,
Ganesh

#include <xmmintrin.h>

// put this macro in some common header :
#define PREFETCH_T0(addr,nrOfBytesAhead) _mm_prefetch(((char *)(addr))+(nrOfBytesAhead),_MM_HINT_T0)

#define LIONELK_FETCH_DIST 128 // tune this value according to real-world experiments

void LionelK()
{
    __m128 *StorVec1  = (__m128*) _mm_malloc((64*64)*4, 16),
           *StorVec2  = (__m128*) _mm_malloc((64*64)*4, 16),
           *ResultVec = (__m128*) _mm_malloc((64*64)*4, 16);

    for(int x=0; x<1024; x+=2) // unrolled twice to fetch 32 bytes at a time, OK for PIII cache lines
    {
        // prefetch LIONELK_FETCH_DIST bytes ahead in each stream; the last few iterations
        // prefetch past the end of the buffers, which is harmless since PREFETCH is only a hint
        PREFETCH_T0(ResultVec+x, LIONELK_FETCH_DIST);
        PREFETCH_T0(StorVec1+x,  LIONELK_FETCH_DIST);
        PREFETCH_T0(StorVec2+x,  LIONELK_FETCH_DIST);

        ResultVec[x]   = _mm_mul_ps(StorVec1[x],   StorVec2[x]);
        ResultVec[x+1] = _mm_mul_ps(StorVec1[x+1], StorVec2[x+1]);
    }

    _mm_free(StorVec1);
    _mm_free(StorVec2);
    _mm_free(ResultVec);
}

Thank you bronx!

- Lionel

Lionel,

On nearly all hardware currently in existence, you would be unlikely to get a noticeable pickup in performance by using _mm_prefetch() unless your memory access pattern is non-sequential *and* you know what cache line you would need to pull from main memory at least 100-200 CPU cycles before you need it.
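For illustration, here is a rough sketch of that kind of case (the Record layout, function name and PF_DIST value are placeholders to be tuned, not code from a real application): the records live at scattered addresses, so the hardware prefetcher cannot follow them, but the pointer array is sequential, so the address needed PF_DIST iterations ahead is already known.

#include <xmmintrin.h>
#include <stddef.h>

#define PF_DIST 16   // how many iterations ahead to prefetch; must be tuned per CPU

struct Record { float weight; /* ... more fields ... */ };

float accumulate_weights(Record * const *recs, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i)
    {
        if (i + PF_DIST < n)   // don't read the pointer array out of bounds
            _mm_prefetch((const char *)recs[i + PF_DIST], _MM_HINT_T0);
        sum += recs[i]->weight;
    }
    return sum;
}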

Alex

I've spent almost a day on integrating PREFETCHT0-based code into our existing data processing library, and I don't see a big increase in performance. I would say it is faster by ~1%.

A performance increase from loop unrolling (4-in-1) was more than 5%!

// HrtDataPrefetchT0
#ifdef _RTENABLE_HRT_DATAPREFETCH_SUPPORT

template < class T > inline RTvoid HrtDataPrefetchT0( T *pAddress )
{
    // load the pointer value into a register first; "prefetcht0 [ pAddress ]"
    // would prefetch the stack slot holding the pointer, not the data it points to
    _asm
    {
        mov         eax, pAddress
        prefetcht0  [eax]
    }
}

#undef  CrtDataPrefetch
#define CrtDataPrefetch HrtDataPrefetchT0

#endif

The function HrtDataPrefetchT0 is declared as a template because it is used in a template-based library.
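For reference, the same helper can also be sketched with the portable _mm_prefetch intrinsic instead of MSVC inline assembly (this is just an alternative sketch, not the code actually used in the library; among other things it also compiles for x64, where _asm is not available):

#include <xmmintrin.h>

template < class T > inline void HrtDataPrefetchT0( T *pAddress )
{
    // the intrinsic takes the target address directly, so no inline assembly is needed
    _mm_prefetch( reinterpret_cast<const char *>( pAddress ), _MM_HINT_T0 );
}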

Best regards,
Sergey

Hi, a high CPI is not always a sign of bad performance. Here is a quote from a VTune optimization guide:

Optimized code (i.e., SSE instructions) may actually raise the CPI and increase the stall %, but it will increase performance. CPI and stalls are just general guidance on efficiency; the real measure of efficiency is the work taking less time, and "work" is not something that can be measured by the CPU.

Also consider that if you are working on large datasets, your performance can also be memory bound rather than CPU bound.
Alex

On recent CPU models, according to my limited understanding, issuing a software prefetch on a data stream should eliminate hardware prefetch on that stream. So, your own strategy would have to be superior to the standard hardware based strategy, and might have to be adjusted for each CPU.
On certain early CPU models, software prefetch could be used to jump start hardware prefetch, but such a tactic would backfire on recent CPUs.
You may need to shut off one or more varieties of hardware generated prefetch entirely, e.g. cache line pairing aka alternate sector prefetch.

Among situations where you might hope to gain by software prefetch:
Frequent TLB misses, if you know far enough ahead which page to prefetch (and few of them are needed)
So many data streams in use that the limit on hardware prefetched streams for your CPU is exceeded, yet the cache locality is unusually good considering number of streams
Those 2 cases are likely mutually exclusive.

Access with indirection:
If you turn on the compiler's prefetch option, it should generate software prefetch for a single level of indirection, e.g. for(i=0;i<n;i++) a[i] = b[ind[i]]; but such prefetch may not be optimum for your case.
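Hand-written, such a prefetch might look roughly like the sketch below (illustrative only, not actual compiler output; DIST is a placeholder distance that would need tuning):

#include <xmmintrin.h>

#define DIST 32   // iterations ahead; tune per CPU and per loop

void gather(float *a, const float *b, const int *ind, int n)
{
    for (int i = 0; i < n; i++)
    {
        // the future index ind[i+DIST] is already available, so the line holding
        // b[ind[i+DIST]] can start its trip from memory while current work proceeds
        if (i + DIST < n)
            _mm_prefetch((const char *)&b[ind[i + DIST]], _MM_HINT_T1);
        a[i] = b[ind[i]];
    }
}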

Alexander, Tim, thank you very much for the information!

>>...So, your own strategy would have to be superior to the standard hardware based strategy, and
>>might have to be adjusted for each CPU...

That's the worst case, because the library is so portable (99.99% of the code is in C++) that no attempt is made to detect the CPU model. I would prefer to stay as neutral as possible regarding different CPU models.

Best regards,
Sergey

That's an interesting discovery. There is support for intrinsic functions on embedded platforms. If somebody is interested, take a look at the MSDN article "Intrinsic Functions for Device Compilers". Here are some technical details:

Header file: Cmnintrin.h

For example, there is '__prefetch' intrinsic function:

...
void __cdecl __prefetch( void * );
...

and it is supported on x86, ARM, SH-4 and MIPS platforms.

My 2 cents:

1. Prefetching the destination when you have large amounts of data flowing doesn't make sense, because by the time the next loop gets to work on that data it will have been evicted from the cache.
2. Prefetching manually for a linear RAM access pattern doesn't make sense.
3. Prefetching manually requires an optimal prefetch distance to be calculated, which is more trial and error than exact science, and depends on many factors such as the CPU family/model and the particular algorithm for which the data is being prefetched.
4. Prefetching algorithms in CPUs come with no guarantee that they will do anything useful on that particular CPU, much less on a different one.

I also think that before even trying to prefetch, the data layout must already be optimized, and loop blocking should be used if the dataset doesn't fit into the largest available cache, to maximize data locality.
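As an illustration of the loop-blocking point, here is a minimal sketch (with made-up names, not taken from anyone's actual library): the j and k loops are split into tiles so the touched parts of B and C stay resident in cache while they are reused. C is assumed to be zero-initialized, and a real implementation would derive the tile size from the actual cache sizes.

#define BLOCK 64

void matmul_blocked(const float *A, const float *B, float *C, int n)
{
    // work on BLOCK x BLOCK tiles of B so each tile is reused while still in cache
    for (int jj = 0; jj < n; jj += BLOCK)
        for (int kk = 0; kk < n; kk += BLOCK)
            for (int i = 0; i < n; i++)
                for (int j = jj; j < jj + BLOCK && j < n; j++)
                {
                    float sum = C[i*n + j];
                    for (int k = kk; k < kk + BLOCK && k < n; k++)
                        sum += A[i*n + k] * B[k*n + j];
                    C[i*n + j] = sum;
                }
}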

-- Regards, Igor Levicki. If you find my post helpful, please rate it and/or select it as a best answer where it applies. Thank you.

Igor's comments relate directly to topics which I mentioned above, and I agree entirely with them.

I too agree with Igor. This agreement was borne out of several attempts to incorporate prefetching to improve the performance of several synthetic test programs. In almost all of the test scenarios prefetching failed to improve performance, and in the few where some improvement was observed, it was on the order of 1%-2%. Quite often, the code placement of loops will cause a variance of 1%-2% in performance, and I had not investigated whether prefetching or code placement was the source of the observed difference.

There is a potential programming scenario where you can "prefetch" without _mm_prefetch(). Whether this is useful will depend on how you manipulate the data. Most "modern" multi-threaded processors have a cache hierarchy and thread participation in that hierarchy. Example: on Sandy Bridge, HT siblings share the L1 and L2 caches. If you can coordinate thread teams into cache-sharing groups, such as threads in the same core (HT siblings) or threads in the same socket (multi-processor with each processor containing an independent L3), and if your algorithm is constructed so that one of the threads in the team fetches data to be used by the other thread(s) in the team (and optionally becomes a late partner in the process), you might observe an improvement in throughput.
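A bare-bones sketch of that arrangement (illustrative only: BLOCK, touch_block and do_work are placeholder names, count is assumed to be a multiple of BLOCK, and the code that pins the helper to the worker's HT sibling is omitted):

#include <immintrin.h>
#include <atomic>
#include <thread>
#include <cstddef>

const size_t BLOCK = 4096;                   // floats per block, tune to the shared cache size

static void touch_block(const float *p)
{
    for (size_t i = 0; i < BLOCK; i += 16)   // 16 floats = one 64-byte cache line
        _mm_prefetch((const char *)(p + i), _MM_HINT_T0);
}

void process_with_helper(const float *data, size_t count, void (*do_work)(const float *, size_t))
{
    std::atomic<size_t> touched(0);          // number of blocks the helper has already fetched

    std::thread helper([&] {
        for (size_t b = 0; b * BLOCK < count; ++b) {
            touch_block(data + b * BLOCK);
            touched.store(b + 1, std::memory_order_release);
        }
    });

    for (size_t b = 0; b * BLOCK < count; ++b) {
        while (touched.load(std::memory_order_acquire) <= b)
            _mm_pause();                     // wait for the helper; a real version would also throttle it
        do_work(data + b * BLOCK, BLOCK);
    }
    helper.join();
}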

This too, will require tuning experimentation (as was discussed by Igor).

Jim Dempsey

www.quickthreadprogramming.com

Here are some statistics based on my tests tuning a Strassen's heap-based algorithm for matrix multiplication:

Tested on computer: Pentium 4 CPU / 1.60 GHz

Matrix size: 1,024x1,024
Matrix threshold value: 64x64
Environment: Win32 on Windows XP / single CPU / single thread / no OpenMP support

~1% improvement - Data prefetching
Note: the actual range was ~0.5% to ~1.5%

~5% improvement - Loop unrolling (4-in-1)

~2% improvement - Process priority boost to real-time
Note: mouse, keyboard and Task Manager are all preempted! Nothing can be done until all
calculations are done.

PS: A ~15-year-old Borland C++ compiler, v5.5.1, demonstrated the best performance, and it was ~4.4 seconds.

Architectural details have evolved over the years, so it's not surprising that a current compiler didn't get the best result on an old P4.

Interaction of software and hardware prefetch on early P4 CPUs was entirely different from current CPUs. Prefetch optimization for the first P4 would usually be wrong for later CPUs.

The gcc option to unroll by 4 accounts for several instances of performance superior to icc on Nehalem and Westmere CPUs. By the time this situation was evident, Intel compilers were looking ahead to AVX, and some code generation practices which should have helped these CPUs were judged inadvisable for Sandy Bridge.

In spite of some of our managers calling 5% changes in performance significant, one has to question the idea of investing effort in such a small change in performance which may go away with the next software or hardware update.

Sorry, I corrected a misspelled word 'menagers' when quoting you...

>>...In spite of some of our managers calling 5% changes in performance significant, one has to
>>question the idea of investing effort in such a small change in performance which may go away
>>with the next software or hardware update...

A 5% improvement in performance is considered a very good improvement on a real-time embedded platform.

Guys, I don't think that all of you understood what I'm "fighting" for as a software developer. This is not
about "wasting" time on a software community forum:

1. It is about sharing some knowledge with software developers and giving away for free some good, even if
small, pieces of assembler/C/C++ code;

Note: If you don't like some piece of code, don't use it and don't complain, right?

2. It is about comparing different opinions with your own opinion and pushing yourself to do some
practical programming in order to prove or disprove some technical ideas or solutions;

Note: Guys, could you provide some code to everybody (not just me) and demonstrate that the
'prefetchtx' instruction that Intel invented, designed and already uses on many CPUs
doesn't work, or is absolutely useless?

3. It is about optimizing, as well as possible, highly portable CRT-based C/C++ code;

4. It is about achieving the best possible performance of that C/C++ code before any C/C++
compiler optimization is applied;

5. It is about achieving the smallest possible delays in processing in a real-time environment involving
linear algebra mathematics.

Thank you for your feedback!

Best regards,
Sergey

5% is certainly worth going after; less than 1% is not. You will never know a priori whether the effort spent on prefetching will come out at -x% or +y% until you expend the effort. Prefetch tuning should be your last measure of optimization.
For an application running on an embedded real-time platform (IOW the target platform is fixed), and assuming the embedded processor is an Atom (IA-32 or Intel 64), your final phase of optimization (prefetch) should incorporate the use of VTune to find out where the memory stalls are located and what type of stalls they are. Then, as you attempt to improve performance through prefetching, you can compare reports as opposed to simply comparing end results. The comparison of reports may show that inserting a prefetch at point A in the code exacerbates performance at point B. This knowledge, knowing why a particular attempt at prefetching didn't work, will help you strategize your next attempt. Running VTune may interfere with your performance monitoring; therefore, be prepared to use either a software emulator or ICE to obtain the cache-miss information completely transparently to the execution of your test code.

The use of prefetchtx isn't useless; the tricky part is determining where you can make use of it.

>>3... highly portable CRT-based C/C++ codes

prefetch would not be highly portable. CPU architecture (internal, not abstract) affects where prefetch makes a difference.

>> 4. It is about achieving the best possible performance of that C/C++ code before any C/C++
compiler optimization is applied

Prefetching design will likely have to be determined after compiler optimizations are applied.

>>5... achieving the smallest possible delays...

No argument here. If your embedded system is running on an Atom processor with HT, I would suggest you experiment with running your prefetching on one HT sibling of a core, coordinated with execution on the other HT sibling of the same core. This way you can actually enforce prefetching as opposed to merely suggesting it. This does come at the expense of losing the availability of that hardware thread.

Jim Dempsey

www.quickthreadprogramming.com

Good comments, thank you!

>>...your final phase optimization ( prefetch ) should incorporate use of VTune...

Exactly! That is why I'm already communicating with Patrick Fay (Intel) on the "Software Tuning,
Performance Optimization and Platform Monitoring" sub-forum.

Best regards,
Sergey

Sergey

Intel also has a product: Intel Architecture Code Analyzer.

You might give that tool a look too.

It was written to observe pipeline latencies with SSE and AVX but it works with other instructions as well.
Of particular interest to you is not necessarily prefetchtx, but rather interleaving the instructions so as to overlap latency time (L1, L2, L3, RAM) with productive work.

Jim Dempsey

www.quickthreadprogramming.com

Dear colleagues,

This thread dates back a long way, but I think the topic is not clarified elsewhere. I am optimizing some performance-critical code, and I found that _mm_prefetch helps significantly if I use it in a similar way to the examples in Intel's manual. However, I tried this:

char ** data_beg;
char ** data_end;
// loop 1
for (int i = 0; i < n; ++i) {
  for (char *adr = data_beg[i]; adr < data_end[i]; adr += 32) {
    _mm_prefetch(adr, _MM_HINT_T1);
  }
}
// loop 2
for (int i = 0; i < n; ++i) {
  compute(data_beg[i], data_end[i]);
}

The purpose of the first loop is to get the data into the cache so that the computation runs warm. It does its job: loop 2 is sped up about twofold compared to a cold-cache start. However, loop 1 takes about the same time as the warm-started loop 2! So _mm_prefetch takes about as long as if it were waiting for the memory accesses to complete. But loop 1 does not actually read the memory at the addresses data_beg[i] through data_end[i], and the manual states that the prefetch instruction is very lightweight. Why, then, does it take time comparable to the fairly non-trivial computation in loop 2?

I observe about the same behaviour on a smaller scale with the L1 cache (_MM_HINT_T0) too. I would appreciate it if you could point out what I might be missing from the manual, or how you interpret these experiments. The CPU is an Intel Core i7.

Thanks

Hi Alexander S.

Even though the prefetch instruction itself is very lightweight, your prefetch is inside a nested loop, so do not ignore the loop overhead (it requires some instructions like mov, add, cmp, ...).

At least, the code below should be somewhat better, I guess... but it still has a loop, and I do not think the overhead can be ignored. It just reduces some of the overhead compared to your original version.

// merged loop
for (int i = 0; i < n; ++i) {
  for (char *adr = data_beg[i]; adr < data_end[i]; adr += 32) {
    _mm_prefetch(adr, _MM_HINT_T1);
  }
  compute(data_beg[i], data_end[i]);
}

Thanks,

Shenghong

