How to profile MPI processes on all nodes?
By Peter Wang (Intel) Posted on 05/26/15 0
VTune(TM) Amplifier XE 2015 can analyze MPI processes, including those in hybrid codes, on a cluster system. This means that VTune Amplifier runs a parallel MPI program on N ranks to collect performance data, then identifies which hot function on which rank consumed the most CPU time. First of all, you need to set to...
VTune™ Amplifier XE 2015 Update 2 supports driverless hardware event-based sampling with call stack info
By Peter Wang (Intel) Posted on 03/15/15 1
In general, the VTune drivers are built and loaded on the Linux* system automatically during installation of the VTune™ Amplifier XE product, after which hardware PMU event-based sampling can work. Sometimes, however, the VTune drivers fail to build or load for one of the reasons below: 1. There ...
Reducing overload when using basic hotspots analysis
By Peter Wang (Intel) Posted on 01/29/15 0
Problem: When the user ran VTune(TM) Amplifier XE's basic hotspots analysis on a huge (complicated) application, profiling sometimes took more than one hour to generate the VTune result. The system appeared to freeze during the finalization period. Without using basic hotspots, the application ran sh...
Use Pause / Resume API of VTune™ Amplifier XE 2015 in your Intel® Xeon Phi™ program
By Peter Wang (Intel) Posted on 01/06/15 0
I previously wrote an article about using the Pause & Resume API for Xeon Phi™ programs in VTune Amplifier XE 2015 Beta, which had some limitations. These limitations have been removed in the 2015 initial release, which is why I am rewriting it for your reference. Steps: (You can use atta...
Practice an example of profiling applications on Intel® Xeon Phi™ coprocessor on the server from a client machine
By Peter Wang (Intel) Posted on 12/16/14 1
Scenario: A Linux* server with an Intel(R) Xeon Phi(TM) coprocessor card runs a customized Linux* system with no X11 support, so the VTune™ Amplifier XE GUI cannot work on this server. The user should collect/analyze the result from another machine (client box). Solutions: 1. If you can install...
Split huge function if called by loop for best utilizing Instruction Cache
By Peter Wang (Intel) Posted on 11/16/14 0
Instruction cache misses are a major issue that increases Front End stalls. Typically, an application has a large hot code section with many mispredicted branches, which results in many ICache misprediction stalls; the stalls increase with the number of times the hot code section is called. The solut...
Practice of using VTune™ Amplifier XE 2015 on GPU for OpenCL™ kernel analysis
By Peter Wang (Intel) Posted on 10/15/14 0
The Intel® SDK for OpenCL™ Applications can build applications that run on Intel® HD Graphics. VTune™ Amplifier XE can analyze an OpenCL™ application’s performance on the GPU side, covering: 1. GPU usage 2. GPU hardware metrics 3. OpenCL™ kernel execution. You need to set up the environment to build/r...
Easier using Pause & Resume API on Intel(R) Xeon Phi(TM) processors in VTune(TM) Amplifier XE 2015
By Peter Wang (Intel) Posted on 08/31/14 0
I once wrote an article introducing the use of the Pause/Resume API on the Intel® Xeon Phi™ coprocessor. Intel® VTune™ Amplifier XE 2015 is now ready. There are two changes for using this feature in the 2015 version. Setting all the environment variables required in the 2013 version is NOT necessary with the new ver...
Using VTune(TM) Amplifier XE 2015 Beta co-existed with XE 2013
By Peter Wang (Intel) Posted on 08/04/14 0
When you install the older VTune(TM) Amplifier XE 2013, the installer detects any prior product and asks you to uninstall it first, then installs the new update. If you now install VTune(TM) Amplifier XE 2015 Beta, the installer does not ask this and installs the XE 2015 version directly. This means XE 2015 v...
Using Intel® TSX with VTune(TM) Amplifier XE 2015 Beta to measure transaction time & abort in your code?
By Peter Wang (Intel) Posted on 07/12/14 2
When developing multithreaded applications, the user should protect critical (sensitive) code areas called by threads, so that threads access shared memory without data conflicts. Most of the time, the user might use critical_section, mutex, semaphore, atomic, events, or other “locks” to protect crit...
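As a minimal sketch of the lock-based protection this excerpt describes (not code from the original post, and not using Intel TSX or VTune), the Python snippet below guards a shared counter with a mutex. The function name `run_workers` and its parameters are hypothetical and chosen only for illustration.

```python
import threading


def run_workers(num_threads: int, increments: int) -> int:
    """Increment a shared counter from several threads under a lock.

    Hypothetical helper for illustration; returns the final counter value.
    """
    counter = 0
    lock = threading.Lock()  # the "lock" that protects the critical section

    def worker() -> None:
        nonlocal counter
        for _ in range(increments):
            # Critical section: only one thread at a time may update
            # the shared counter.
            with lock:
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


if __name__ == "__main__":
    # Every increment is serialized by the lock, so none are lost.
    print(run_workers(4, 10_000))
```

Every update here is serialized by the lock whether or not threads actually conflict, which is the kind of cost that motivates measuring transactional alternatives.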