Software Tuning, Performance Optimization & Platform Monitoring

Throwing all optimizations at 4-level nested FORs


One question that I couldn't answer interests me:
How can the fragment below be sped up?
It is an etude with 4 nested FORs; what pragmas does Intel provide for it?
Also, what compiler options are there to speed up FORs?
Currently I use 12.1, but if there are new or improved ones I would immediately move to 15.
My desire is to throw every available optimization at it in order to make it faster.
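
The fragment itself is not part of this excerpt, so here is only a minimal sketch of where such pragmas would go. The kernel (a small 2-D convolution, which is a natural 4-level FOR nest) and the name conv2d are hypothetical stand-ins, not the poster's code. `#pragma ivdep` and `#pragma vector always` are Intel compiler hints and are guarded so other compilers skip them; the OpenMP `collapse(2)` line only takes effect when OpenMP is enabled. Options along the lines of -O3 and -xHost would typically be tried alongside, but whether any of this helps depends on the real loop body.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 4-level loop nest: "valid" 2-D convolution of an h x w image
 * with a kh x kw kernel. All arrays are flat and row-major. */
void conv2d(const float *restrict in, const float *restrict k,
            float *restrict out, int h, int w, int kh, int kw)
{
    int oh = h - kh + 1;   /* output height */
    int ow = w - kw + 1;   /* output width  */
    assert(oh > 0 && ow > 0);

    /* The two outer loops are independent, so they can be merged and
     * shared among threads when OpenMP is switched on. */
#ifdef _OPENMP
#pragma omp parallel for collapse(2)
#endif
    for (int y = 0; y < oh; ++y) {
        for (int x = 0; x < ow; ++x) {
            float acc = 0.0f;
            for (int ky = 0; ky < kh; ++ky) {
                /* Intel-specific hints: assume no loop-carried dependence
                 * and ask the vectorizer not to give up on cost grounds. */
#ifdef __INTEL_COMPILER
#pragma ivdep
#pragma vector always
#endif
                for (int kx = 0; kx < kw; ++kx)
                    acc += in[(y + ky) * w + (x + kx)] * k[ky * kw + kx];
            }
            out[y * ow + x] = acc;
        }
    }
}
```

The `restrict` qualifiers matter as much as the pragmas here: they tell the compiler the three arrays do not overlap, which is usually a precondition for vectorizing the innermost loop.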

Multi-threaded L3 cache performance


I have found a lot of interesting reads about cache bandwidth performance modeling and benchmarking, and of course there is a lot to read about the multi-threaded STREAM benchmark.

So here I am trying to understand the multi-threaded, or multi-core performance of the L3 cache. (too many posts about performance analysis start this way ;)

Let's say I want to check the speed of SSE2 vector transfers to the registers from various cache levels:
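
The original listing is not included in this excerpt. As a rough single-threaded sketch of what such a measurement could look like (not the poster's code), the functions below read a buffer with aligned SSE2 loads and report an approximate GB/s figure; the buffer size decides which cache level is exercised. The names sse2_read_pass and sse2_read_gbps are hypothetical, the timing uses plain clock(), and the empty asm statement is a GNU-style compiler barrier (accepted by gcc, clang and icc) assumed here to keep the repeated passes from being merged.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Read n 32-bit words (n a multiple of 4, buf 16-byte aligned) with SSE2
 * loads. Returning the sum keeps the loads from being discarded. */
uint32_t sse2_read_pass(const uint32_t *buf, size_t n)
{
    __m128i acc = _mm_setzero_si128();
    for (size_t i = 0; i < n; i += 4)
        acc = _mm_add_epi32(acc, _mm_load_si128((const __m128i *)(buf + i)));
    uint32_t lanes[4];
    _mm_storeu_si128((__m128i *)lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}

/* Rough read bandwidth in GB/s for a buffer of `bytes` bytes. Pick `bytes`
 * to fit the cache level of interest (e.g. smaller than L1, L2 or L3). */
double sse2_read_gbps(size_t bytes, int reps)
{
    assert(bytes >= 16 && bytes % 16 == 0 && reps > 0);
    size_t n = bytes / sizeof(uint32_t);
    uint32_t *buf = aligned_alloc(16, bytes);
    if (buf == NULL)
        return -1.0;
    for (size_t i = 0; i < n; ++i)
        buf[i] = 1;

    volatile uint32_t sink = sse2_read_pass(buf, n);  /* warm-up pass */
    clock_t t0 = clock();
    for (int r = 0; r < reps; ++r) {
        /* Compiler barrier: forces the buffer to be re-read on every pass. */
        __asm__ __volatile__("" : : "r"(buf) : "memory");
        sink += sse2_read_pass(buf, n);
    }
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
    free(buf);
    (void)sink;
    return secs > 0.0 ? (double)bytes * reps / secs / 1e9 : 0.0;
}
```

For the multi-core case one would run one such loop per thread, each on its own buffer and pinned to its own core, and add up the per-thread figures; that part is left out of the sketch.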

Performance Monitoring Counters


I am currently using an Intel(R) Xeon(R) X7350 @ 2.93GHz. As this architecture has only five PMCs, I can't read more than five performance events at a time. Even worse, when I installed PAPI 5.4.0, it reads only two events simultaneously. So I have a question in mind: "Is there any way that we can use other (general-purpose or special-purpose) registers as PMUs?"

I want to capture six to seven events and their counts while an application is running. I wish to estimate the power consumed by this application using the values of these counters.
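
One common way around a small number of counters is to time-share them among events and scale each raw count by the fraction of time the event was actually counted; Linux perf_event, for instance, reports time_enabled and time_running for this purpose. The sketch below shows that scaling plus a simple linear power model on top of it. The function names, the weights and the idle power are all hypothetical and would have to be fitted against real power measurements on the target machine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Estimate the full-run count of an event that was only counted for
 * time_running out of time_enabled (counter time-sharing). */
double scale_multiplexed(uint64_t raw, uint64_t time_enabled,
                         uint64_t time_running)
{
    if (time_running == 0)
        return 0.0;  /* the event was never scheduled onto a counter */
    return (double)raw * (double)time_enabled / (double)time_running;
}

/* Hypothetical linear model: power = idle + sum(weight[i] * rate[i]),
 * where rate[i] is events per second and weight[i] is joules per event. */
double estimate_power_watts(const double *rates, const double *weights,
                            size_t n, double idle_watts)
{
    assert(rates != NULL && weights != NULL);
    double watts = idle_watts;
    for (size_t i = 0; i < n; ++i)
        watts += weights[i] * rates[i];
    return watts;
}
```

Scaled counts are estimates, so six or seven events spread over two programmable counters will be noisier than events measured in separate runs of the same workload.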

Uncore PMUs on Xeon E5000 processors


I have a Xeon E5645 system and I am curious whether PMUs exist to monitor memory controller events. There seems to be very good and extensive documentation for the E5-2600 family. In the datasheets for the E5000 family, however, there does not seem to be any mention of uncore PMUs.

Looking forward to your feedback. 




Is there any way to detect whether Intel RST is running or whether an mSATA drive exists?


My product had a case where a customer can't detect the mSATA drive in Windows 7, which causes our product to stop working.

As far as I know, Intel provides its own driver (iastora.sys), which enables some features not found natively. These features are mostly found in the Rapid Storage Technology UI, which allows RAID volumes to be managed and monitored from within the OS itself.

As the title says, my question is: is there any way (interface/SDK) to detect the presence of an mSATA drive or whether Intel RST is running?



Could you explain to me the difference between the two events UOPS_RETIRED.ALL_PS and UOPS_RETIRED.RETIRE_SLOTS_PS on Sandy Bridge?

I would expect these events to give approximately the same numbers, since the number of used slots should agree with the number of retired uops over a given period of time. The data below shows that the number of used retirement slots is about 20% lower than the number of uops retired.

Is it possible for uops to retire without using a slot?

UOPS_RETIRED.ALL_PS - This event counts the number of micro-ops retired.

Unable to generate 'GPA' data with my Intel HD Graphics 4000

I'm trying to profile the execution of an OpenCL kernel on Intel HD Graphics 4000. I've installed the 30-day trial of both VTune and INDE.

If I right-click on my Project in VS2013 > Intel VTune Amplifier XE 2015 > New Analysis, I see this message in the window that opens:

Some GPU metrics are currently disabled by the BIOS. See the product Release Notes for details.

Intel Memory Latency checker w/ Windows support released

We just released v2.3 of Intel Memory Latency checker. This adds support for Windows, while previous versions already supported Linux. In addition, single-socket Xeon processors (E3) are also supported.

Intel Memory Latency checker can be used to measure latencies and bandwidth on Intel Xeon processors.

