Author's Blogs

Use VTune™ Amplifier XE 2015 to analyze MPI Hybrid code
By Peter Wang (Intel) Posted on 06/03/15 0
Traditional OpenMP* is a fork-join parallel programming technology. The program first runs serial code on a single master thread. Later, the master thread creates multiple threads and assigns sub-tasks to them in a parallel region, then waits until all threads complete their sub-tasks to me...
How to profile MPI processes on all nodes?
By Peter Wang (Intel) Posted on 05/26/15 1
VTune(TM) Amplifier XE 2015 can analyze MPI processes in hybrid code on a cluster system. This means that VTune Amplifier runs a parallel MPI program on N ranks to collect performance data, then identifies which hot function on which rank consumed the most CPU time. First of all, you need to set to...
Which hardware PMU events to use to calculate FLOPS on the Intel(R) Xeon Phi(TM) coprocessor?
By Peter Wang (Intel) Posted on 04/20/15 0
FLOPS means floating-point operations per second, a metric used in High Performance Computing. In general, Intel(R) VTune(TM) Amplifier XE only provides a metric named Cycles Per Instruction (average CPI), which measures performance for general programs. In this article, I use matrix1.c a...
Why didn't remote data collector work from OS X* to Linux?
By Peter Wang (Intel) Posted on 03/30/15 0
I wrote an article introducing the remote data collector in VTune(TM) Amplifier XE. That data collector supports Windows* and Linux* hosts (the target is always a Linux* server). Now an OS X* host also supports collecting performance data from a Linux platform. You may: 1. Install vtune_amplifie...
VTune™ Amplifier XE 2015 Update 2 supports driverless hardware event-based sampling with call stack info
By Peter Wang (Intel) Posted on 03/15/15 1
In general, the VTune drivers are built and loaded on the Linux* system automatically during installation of the VTune™ Amplifier XE product, so that hardware PMU event-based sampling can work. However, sometimes the drivers fail to build or load for one of the reasons below: 1. There ...
Reducing overload when using basic hotspots analysis
By Peter Wang (Intel) Posted on 01/29/15 0
Problem: When a user ran VTune(TM) Amplifier XE's basic hotspots analysis on a huge (complicated) application, it sometimes took more than one hour to generate the VTune result. The system appeared to freeze during the finalization period. Without basic hotspots, the application ran sh...
Use Pause / Resume API of VTune™ Amplifier XE 2015 in your Intel® Xeon Phi™ program
By Peter Wang (Intel) Posted on 01/06/15 0
I previously wrote an article about using the Pause & Resume API for Xeon Phi™ programs in VTune Amplifier XE 2015 Beta, which had some limitations. These limitations have been removed in the 2015 initial release, which is why I am rewriting it for your reference. Steps: (You can use atta...
Practice an example of profiling applications on Intel® Xeon Phi™ coprocessor on the server from a client machine
By Peter Wang (Intel) Posted on 12/16/14 1
Scenario: A Linux* server with an Intel(R) Xeon Phi(TM) coprocessor card runs a customized Linux* system with no X11 support, so the VTune™ Amplifier XE GUI cannot work on this server. The user should collect and analyze the result from another machine (client box). Solutions: 1. If you can install...
Split huge function if called by loop for best utilizing Instruction Cache
By Peter Wang (Intel) Posted on 11/16/14 0
Instruction cache misses are a major issue that increases Front End stalls. This usually happens in an application with a large hot code section containing many mispredicted branches, which results in many ICache misprediction stalls; the stalls increase with the number of times the hot code section is called. The solut...
Practice of using VTune™ Amplifier XE 2015 on GPU for OpenCL™ kernel analysis
By Peter Wang (Intel) Posted on 10/15/14 0
Intel® SDK for OpenCL™ Applications can build applications that run on Intel® HD Graphics. You can use VTune™ Amplifier XE to analyze an OpenCL™ application's performance on the GPU side, covering: 1. GPU usage 2. GPU hardware metrics 3. OpenCL™ kernel execution. You need to set the environment to build/r...