usage of "_mm_prefetch(...)"

lionelk's picture

Hi everyone. I'm using Intel C++ v7 compiler with the SIMD intrinsics (SSE). Below is an example of the types of simple loops my application contains:

StorVec1 = (float*) _mm_malloc((64*64)*4, 16);
StorVec2 = (float*) _mm_malloc((64*64)*4, 16);
ResultVec = (float*) _mm_malloc((64*64)*4, 16);
.
.
.
for(int x=0; x<4096; x+=4)
{
*(__m128*) (ResultVec+x) = _mm_mul_ps(*((__m128 *)(StorVec1+x)), *((__m128 *)(StorVec2+x)));
}

Although the data access patterns are all very regular and predictable (which, I believe, is optimal for the P4's automatic hardware prefetcher), I'd like to experiment with software prefetching (according to VTune, my clocks per instruction retired ratio is 1.27). Could someone please give me a head start by posting a couple of lines of code showing how to implement software prefetching for the above loop? I believe I have to use the "_mm_prefetch(...)" intrinsic with the "_MM_HINT_T0" argument, since the result of the operation will be used again in the following loop. Thanks everyone!
- Lionel

GaneshRao (Intel)'s picture

The Pentium 4 and Xeon optimization manual I pointed to in the earlier reply has an example that I hope will help you.

http://developer.intel.com/design/Pentium4/manuals/

Thanks again for using the forum.

Cheers,
Ganesh

bronx's picture


// put this macro in some common header (it needs <xmmintrin.h> for _mm_prefetch) :
#define PREFETCH_T0(addr,nrOfBytesAhead) _mm_prefetch(((const char *)(addr))+(nrOfBytesAhead),_MM_HINT_T0)



#define LIONELK_FETCH_DIST 128 // tune this value according to real world experiments

void LionelK()
{
__m128 *StorVec1 = (__m128*) _mm_malloc((64*64)*4, 16),
*StorVec2 = (__m128*) _mm_malloc((64*64)*4, 16),
*ResultVec = (__m128*) _mm_malloc((64*64)*4, 16);

for(int x=0; x<1024; x+=2) // Unrolled twice to fetch 32 bytes at a time, OK for PIII cache lines
{
PREFETCH_T0(ResultVec+x,LIONELK_FETCH_DIST);
PREFETCH_T0(StorVec1+x,LIONELK_FETCH_DIST);
PREFETCH_T0(StorVec2+x,LIONELK_FETCH_DIST);

ResultVec[x] = _mm_mul_ps(StorVec1[x],StorVec2[x]);
ResultVec[x+1] = _mm_mul_ps(StorVec1[x+1],StorVec2[x+1]);
}
}

lionelk's picture

Thank you bronx!

- Lionel

Alex T's picture

Lionel,

On nearly all hardware currently in existence, you would be unlikely to get a noticeable pickup in performance by using _mm_prefetch() unless your memory access pattern is non-sequential *and* you know what cache line you would need to pull from main memory at least 100-200 CPU cycles before you need it.

Alex

Sergey Kostrov's picture

I've spent almost a day integrating PREFETCHT0-based code into our existing data processing library and I don't see a big increase in performance. I would say it is ~1% faster.

The performance increase from loop unrolling ( 4-in-1 ) was more than 5%!

// HrtDataPrefetchT0
#ifdef _RTENABLE_HRT_DATAPREFETCH_SUPPORT

template < class T > inline RTvoid HrtDataPrefetchT0( T *pAddress )
{
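// Load the pointer value first: "prefetcht0 [ pAddress ]" would prefetch the
// stack slot that holds the pointer, not the data it points to.
_asm mov eax, pAddress
_asm prefetcht0 [ eax ]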
};

#undef CrtDataPrefetch
#define CrtDataPrefetch HrtDataPrefetchT0

#endif

The function HrtDataPrefetchT0 is declared as a template because it is used in a template-based library.

Best regards,
Sergey
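
For readers who have not seen "4-in-1" unrolling, here is a minimal sketch of the idea. It is not Sergey's library code: the function name DotUnrolled4 and the dot-product kernel are assumptions chosen only for illustration. The loop body handles four elements per iteration and keeps four independent accumulators, so consecutive multiply-adds do not wait on each other.

```cpp
#include <cstddef>

// Hypothetical example (not from the library discussed above):
// dot product with the loop body unrolled 4-in-1.
float DotUnrolled4(const float* a, const float* b, std::size_t n)
{
    // Four independent accumulators break the dependency chain on a single sum.
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    float sum = (s0 + s1) + (s2 + s3);
    for (; i < n; ++i)            // remainder when n is not a multiple of 4
        sum += a[i] * b[i];
    return sum;
}
```

Note that the unrolled version adds the terms in a different order than a simple loop, so floating-point results can differ in the last bits.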

Alexander Weggerle (Intel)'s picture
Hi, A high CPI is not always a sign of bad performance. A quote from a VTune optimization guide:
"Optimized code (i.e: SSE instructions) may actually lower the CPI, and increase stall % but it will increase the performance. CPI and stalls are just general guidance of efficiency; the real measure of efficiency is work taking less time. Work is not something that can be measured by the CPU."

Also consider that if you are working on large datasets, your performance can be memory bound rather than CPU bound.
Alex
TimP (Intel)'s picture

On recent CPU models, according to my limited understanding, issuing a software prefetch on a data stream should eliminate hardware prefetch on that stream. So, your own strategy would have to be superior to the standard hardware based strategy, and might have to be adjusted for each CPU.
On certain early CPU models, software prefetch could be used to jump start hardware prefetch, but such a tactic would backfire on recent CPUs.
You may need to shut off one or more varieties of hardware generated prefetch entirely, e.g. cache line pairing aka alternate sector prefetch.

Among situations where you might hope to gain by software prefetch:
Frequent TLB misses, if you know far enough ahead which page to prefetch (and few of them are needed)
So many data streams in use that the limit on hardware prefetched streams for your CPU is exceeded, yet the cache locality is unusually good considering number of streams
Those 2 cases are likely mutually exclusive.

Access with indirection:
If you turn on the compiler's prefetch option, it should generate software prefetch for a single level of indirection, e.g. for(i=0;i... but such prefetch may not be optimum for your case.
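
A hand-written version of prefetching through one level of indirection might look like the sketch below. This is not the loop from the post above (which did not survive) and not what any particular compiler generates. The names SumIndirect and kAhead are hypothetical, and the look-ahead of 16 elements is an assumed starting value that would need tuning.

```cpp
#include <cstddef>
#include <xmmintrin.h>   // _mm_prefetch, _MM_HINT_T0

// Hypothetical gather-sum: data[] is read through index[], so the hardware
// prefetcher cannot predict the data[] addresses, but the index[] array
// tells us in advance which element will be needed kAhead iterations later.
float SumIndirect(const float* data, const int* index, std::size_t n)
{
    const std::size_t kAhead = 16;   // assumed tuning value, not a recommendation
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        if (i + kAhead < n)          // keep the index[] read itself in bounds
            _mm_prefetch(reinterpret_cast<const char*>(&data[index[i + kAhead]]),
                         _MM_HINT_T0);
        sum += data[index[i]];
    }
    return sum;
}
```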

Sergey Kostrov's picture

Alexander, Tim, thank you very much for the information!

>>...So, your own strategy would have to be superior to the standard hardware based strategy, and
>>might have to be adjusted for each CPU...

That's the worst case, because the library is so portable ( 99.99% of the code is in C++ ) that no attempt is made to detect the CPU model. I would prefer to be as neutral as possible regarding different CPU models.

Best regards,
Sergey

Sergey Kostrov's picture

That's an interesting discovery. There is support for intrinsic functions on embedded platforms. If somebody is interested, take a look at MSDN's article "Intrinsic Functions for Device Compilers". Here are some technical details:

Header file: Cmnintrin.h

For example, there is '__prefetch' intrinsic function:

...
void __cdecl __prefetch( void * );
...

and it is supported on x86, ARM, SH-4 and MIPS platforms.
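
For a library that has to stay compiler-neutral, one possible approach is a small wrapper macro that picks whatever prefetch primitive the compiler offers and compiles to nothing elsewhere. The sketch below is only an illustration: the names PORTABLE_PREFETCH and SumWithPrefetch are hypothetical, it uses the GCC/Clang builtin and the SSE intrinsic rather than '__prefetch' from Cmnintrin.h, and the 64-byte cache line and 64-element look-ahead are assumptions.

```cpp
#include <cstddef>

#if defined(__GNUC__) || defined(__clang__)
  // GCC/Clang builtin: (address, 0 = read, 3 = keep in all cache levels).
  #define PORTABLE_PREFETCH(p) __builtin_prefetch((p), 0, 3)
#elif defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
  #include <xmmintrin.h>
  #define PORTABLE_PREFETCH(p) _mm_prefetch(reinterpret_cast<const char*>(p), _MM_HINT_T0)
#else
  #define PORTABLE_PREFETCH(p) ((void)0)   // unknown target: no prefetch at all
#endif

// Hypothetical usage: sum an int array, issuing one hint per assumed 64-byte line.
long SumWithPrefetch(const int* v, std::size_t n)
{
    long sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if ((i & 15) == 0 && i + 64 < n)   // 16 ints = 64 bytes; stay inside the array
            PORTABLE_PREFETCH(v + i + 64);
        sum += v[i];
    }
    return sum;
}
```

The wrapper only hides the spelling of the instruction. The prefetch distance still depends on the CPU, so this does not make the tuning itself portable.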

Igor Levicki's picture

My 2 cents:

1. Prefetching the destination when you have large amounts of data flowing doesn't make sense, because by the time the next loop gets to work on the same data it will have been evicted from the cache.
2. Prefetching manually for a linear RAM access pattern doesn't make sense.
3. Prefetching manually requires an optimal prefetch distance to be calculated, which is more trial and error than exact science and depends on many factors, such as CPU family / model and the particular algorithm for which it should prefetch the data.
4. Prefetching algorithms in CPUs do not have a guarantee that they will do anything useful on that particular CPU, much less on a different one.
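
The trial-and-error search in point 3 can be sketched as a kernel that takes the prefetch distance as a run-time parameter, plus a small timing helper. This is only an illustration: MulWithDistance and TimeDistance are hypothetical names, the kernel assumes n is a multiple of 4 (any tail is left untouched), and a single timed pass like this is far too noisy to draw conclusions from without many repetitions.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>
#include <xmmintrin.h>

// Hypothetical kernel: element-wise product with a tunable prefetch distance,
// given in floats. distance == 0 disables software prefetch (the baseline).
void MulWithDistance(const float* a, const float* b, float* out,
                     std::size_t n, std::size_t distance)
{
    for (std::size_t i = 0; i + 4 <= n; i += 4) {
        if (distance != 0 && i + distance < n) {
            _mm_prefetch(reinterpret_cast<const char*>(a + i + distance), _MM_HINT_T0);
            _mm_prefetch(reinterpret_cast<const char*>(b + i + distance), _MM_HINT_T0);
        }
        _mm_storeu_ps(out + i, _mm_mul_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i)));
    }
}

// Times one pass over n floats for the given distance; returns seconds.
double TimeDistance(std::size_t n, std::size_t distance)
{
    std::vector<float> a(n, 1.5f), b(n, 2.0f), out(n, 0.0f);
    const auto t0 = std::chrono::steady_clock::now();
    MulWithDistance(a.data(), b.data(), out.data(), n, distance);
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}
```

One would call TimeDistance for distances such as 0, 16, 32, 64 and so on, on each target CPU, and keep a non-zero distance only if it beats the baseline consistently.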

I also think that before even trying to prefetch, data layout must already be optimized, and loop blocking should be used if dataset doesn't fit into largest available cache to maximize data locality.

-- Regards, Igor Levicki
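
Igor's suggestion of loop blocking can be illustrated with the kind of two-stage computation from the original question, where the result of one loop is consumed by the next. The sketch below is an assumption-laden example, not anyone's production code: MulThenSumBlocked is a hypothetical name, the second stage (a plain sum) is invented for illustration, and kBlock = 4096 floats (16 KB per array) is an assumed value to be tuned to the target cache.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical two-stage computation, processed block by block so that
// stage 2 reads each block of result[] while it is still in cache,
// instead of after the whole array has been streamed through.
// b and result must be at least as long as a.
float MulThenSumBlocked(const std::vector<float>& a,
                        const std::vector<float>& b,
                        std::vector<float>& result)
{
    const std::size_t kBlock = 4096;   // assumed block size in floats; tune per CPU
    const std::size_t n = a.size();
    float total = 0.0f;
    for (std::size_t start = 0; start < n; start += kBlock) {
        const std::size_t end = std::min(start + kBlock, n);
        for (std::size_t i = start; i < end; ++i)   // stage 1 on this block
            result[i] = a[i] * b[i];
        for (std::size_t i = start; i < end; ++i)   // stage 2 on the block just written
            total += result[i];
    }
    return total;
}
```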
TimP (Intel)'s picture

Igor's comments relate directly to topics which I mentioned above, and I agree entirely with them.

jimdempseyatthecove's picture

I too agree with Igor. This agreement was born out of several attempts to incorporate prefetching to improve the performance of several synthetic test programs. In almost all the test scenarios prefetching failed to improve performance. In the few where some improvement was observed, it was on the order of 1%-2%. Quite often, code placement of loops will cause a variance of 1%-2% in performance. I did not investigate whether prefetching or code placement was the source of the observed difference.

There is a potential programming scenario where you can "prefetch" without _mm_prefetch(). Whether this is useful will depend on how you manipulate the data. On most "modern" multi-threaded processors you have a cache hierarchy and thread participation in this hierarchy. Example: on Sandy Bridge, HT siblings share the L1 and L2 caches. If you can coordinate thread teams into cache-sharing groups, such as threads in the same core (HT siblings) or threads in the same socket (multi-processor, with each processor containing an independent L3), and if your algorithm is constructed so that one of the threads in the team fetches data to be used by the other thread(s) in the team (and optionally becomes a late partner in the process), you might observe an improvement in throughput.

This too, will require tuning experimentation (as was discussed by Igor).

Jim Dempsey

Blog: The Parallel Void

www.quickthreadprogramming.com
Sergey Kostrov's picture

Here are some statistics based on my tests tuning a Strassen's Heap Based algorithm for matrix multiplication:

Tested on computer: Pentium 4 CPU / 1.60 GHz

Matrix size: 1,024x1,024
Matrix Threshold Value: 64x64
Environment: Win32 on Windows XP / Single CPU / Single Thread / No OpenMP support

~1% improvement - Data Prefetching
Note: actual range was ~0.5% to ~1.5%

~5% improvement - Loop Unrolling ( 4-in-1 )

~2% improvement - Process Priority Boost to Real-Time
Note: Mouse, Keyboard, Task Manager are all preempted! Nothing can be done until all
calculations are done

PS: A ~15-year-old Borland C++ compiler v5.5.1 demonstrated the best performance, at ~4.4 seconds.

TimP (Intel)'s picture

Architectural details have evolved over the years, so it's not surprising that a current compiler didn't get the best result on an old P4.

Interaction of software and hardware prefetch on early P4 CPUs was entirely different from current CPUs. Prefetch optimization for the first P4 would usually be wrong for later CPUs.

The gcc option to unroll by 4 accounts for several instances of performance superior to icc on Nehalem and Westmere CPUs. By the time this situation was evident, Intel compilers were looking ahead to AVX, and some code generation practices which should have helped these CPUs were judged inadvisable for Sandy Bridge.

In spite of some of our managers calling 5% changes in performance significant, one has to question the idea of investing effort in such a small change in performance which may go away with the next software or hardware update.

Sergey Kostrov's picture

Sorry, I corrected a misspelled word 'menagers' when quoting you...

>>...In spite of some of our managers calling 5% changes in performance significant, one has to
>>question the idea of investing effort in such a small change in performance which may go away
>>with the next software or hardware update...

A 5% improvement in performance is considered a very good improvement on a real-time embedded platform.

Guys, I don't think all of you understood what I'm "fighting" for as a software developer. This is not
about "wasting" some time on the software community forum:

1. It is about sharing some knowledge with software developers and giving away for free some good, even if
small, pieces of Assembler/C/C++ code;

Note: If you don't like some piece of code, don't use it and don't complain, right?

2. It is about comparing different opinions with your own and forcing yourself to do some
practical programming in order to prove or disprove some technical ideas or solutions;

Note: Guys, could you provide some code to everybody ( not just me ) demonstrating
that the 'prefetchtx' instruction, which Intel invented, designed and already uses on many CPUs,
doesn't work or is absolutely useless?

3. It is about the best possible optimization of highly portable CRT-based C/C++ code;

4. It is about achieving the best possible performance of this C/C++ code before C/C++
compiler optimization is applied;

5. It is about achieving the smallest possible delays in processing in a real-time environment involving
linear algebra.

Thank you for your feedback!

Best regards,
Sergey

jimdempseyatthecove's picture

5% is certainly worth going after; less than 1% is not. You will never know a priori whether the effort spent on prefetching will yield -x% or +y% until you expend the effort. Prefetch tuning should be your last measure of optimization.
For an application running on an embedded real-time platform (IOW the target platform is fixed), and assuming the embedded processor is Atom (IA32 or Intel64), your final phase of optimization (prefetch) should incorporate the use of VTune to find out where the code's memory stalls are located and what type of stalls they are. Then, as you attempt to improve performance through prefetching, you can compare reports as opposed to simply comparing end results. The comparison of reports may show that inserting a prefetch at point A in the code degrades performance at point B in the code. Knowing why a particular attempt at prefetching didn't work will help you strategize your next attempt. Running VTune may interfere with your performance monitoring; therefore, be prepared to use either a software emulator or an ICE to obtain the cache-miss information completely transparently to the execution of your test code.

The use of prefetchx isn't useless, the tricky part is in determining where you can make use of the prefetchx.

>>3... highly portable CRT-based C/C++ codes

Prefetch would not be highly portable. CPU architecture (internal, not abstract) affects where prefetch would make a difference.

>> 4. it is about achieving as better as possible performance of these C/C++ codes before a C/C++
compiler optimization is applied

Prefetching design will likely have to be determined after compiler optimizations are applied.

>>5... achieving as smallest as possible delays...

No argument here. If your embedded system is running on an Atom processor with HT, I would suggest you experiment with scripting your prefetching to run on one HT sibling of a core, coordinated with execution on the other HT sibling of the same core. This way you can actually enforce prefetching as opposed to merely suggesting it. This does come at the expense of losing the availability of that hardware thread.

Jim Dempsey

Sergey Kostrov's picture

Good comments, thank you!

>>...your final phase optimization ( prefetch ) should incorporate use of VTune...

Exactly! That is why I'm already communicating with Patrick Fay ( Intel ) on the "Software Tuning,
Performance Optimization and Platform Monitoring" sub-forum.

Best regards,
Sergey

jimdempseyatthecove's picture

Sergey

Intel also has a product: Intel Architecture Code Analyzer.

You might give that tool a look too.

It was written to observe pipeline latencies with SSE and AVX but it works with other instructions as well.
Of particular interest to you is not necessarily prefetchn, but rather interleaving the instructions so that the latency time (L1, L2, L3, RAM) is filled with productive work.

Jim Dempsey
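
As a rough illustration of interleaving loads with productive work, the multiply loop from the start of this thread can be software-pipelined by hand: the loads for the next block are issued before the multiply and store of the current one. This is only a sketch under stated assumptions. MulPipelined is a hypothetical name, n is assumed to be a multiple of 4, and out-of-order CPUs and optimizing compilers often perform this kind of reordering on their own, so any gain has to be measured.

```cpp
#include <cstddef>
#include <xmmintrin.h>

// Hypothetical software-pipelined element-wise product.
// n must be a multiple of 4; unaligned loads/stores are used so the
// pointers need no special alignment.
void MulPipelined(const float* a, const float* b, float* out, std::size_t n)
{
    if (n < 4) return;
    __m128 va = _mm_loadu_ps(a);           // prologue: load the first block
    __m128 vb = _mm_loadu_ps(b);
    std::size_t i = 0;
    for (; i + 8 <= n; i += 4) {
        const __m128 nextA = _mm_loadu_ps(a + i + 4);   // start the next loads early
        const __m128 nextB = _mm_loadu_ps(b + i + 4);
        _mm_storeu_ps(out + i, _mm_mul_ps(va, vb));     // work on data already loaded
        va = nextA;
        vb = nextB;
    }
    _mm_storeu_ps(out + i, _mm_mul_ps(va, vb));         // epilogue: last block
}
```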

