By Nathan G Greeneltch, Published: 02/05/2018, Last Updated: 02/05/2018
Intel® VTune™ Amplifier is source-code profiling software, popular in the High Performance Computing (HPC) community for its versatile and accurate sampling as well as its low collection overhead. Software stack sampling, thread profiling, and low-level hardware event sampling are all available. Along with a command line interface, Intel VTune Amplifier also has a mature and convenient graphical user interface. A user can “mouse around” and effectively dig through their code, mapping bottlenecks to specific lines in the source.
The Python* language has a reputation for convenience in scripting work, but not necessarily for fast execution at runtime. As such, the HPC community has not concentrated heavily on tuning and profiling Python code. With the widespread adoption of the language for machine learning and data science, and the subsequent proliferation of deep learning as a viable solution in many engineering environments, it is time to get serious about Python profiling. Toward this goal, Intel VTune Amplifier continues to add new and exciting features aimed directly at Python developers. In addition to source line-level Python granularity, Intel VTune Amplifier provides navigable visual representations of Python memory analysis and mixed-code threading and scheduling. The table below compares Intel VTune Amplifier to other commonly used Python profilers. More information on profiling Python code with Intel VTune Amplifier can be found here.
| Feature | cProfile | Line_profiler | Intel® VTune™ Amplifier |
|---|---|---|---|
| Profiling technology | Event | Instrumentation | Sampling, hardware events (native code), instrumentation (Python code) |
| Analysis granularity | Function-level | Line-level | Line-level, call stack, time windows, hardware events |
| Intrusiveness | Medium (1.3-5x) | High (4-10x) | Low (1.05-1.3x) |
| Mixed language programs | Python | Python | Python, Cython, C++, Fortran |
In addition to the standalone version, Intel VTune Amplifier is included in multiple software suites available through software.intel.com: namely, the Professional and Cluster editions of Intel® Parallel Studio XE, Intel® System Studio, and Intel® Media Server Studio. All versions come with priority support.
This article will demonstrate creating and tuning Python code using Intel VTune Amplifier. The subject of the code is calculation of the covariance matrix. We will begin with a naïve approach, and incrementally tune the code to run faster. Finally, we will show code that gives Python users out-of-the-box speed increases by using covariance implementations built into Numpy* and the Intel® Data Analytics Acceleration Library (Intel® DAAL). See the full script at the end of the article for all code snippets.
The covariance (aka variance-covariance) matrix represents the mathematical generalization of variance to multiple dimensions. It consists of feature variances along the diagonal and element-wise covariances along the off-diagonal. Typically each element is normalized by the number of examples in the dataset. A common application of the covariance matrix is to decorrelate input data by providing a new basis set for projection in a compact way. This popular technique for dimensionality reduction is called Principal Component Analysis (PCA). See below for an n x n covariance matrix, where n is the number of features, N is the number of examples in the dataset, and x1…xn are each feature’s deviations from the mean (xi - µn, for all i data points):
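In the notation just defined (n features, N examples, µj the mean of feature j), each entry of the covariance matrix can be written out as:

```latex
\Sigma_{jk} \;=\; \frac{1}{N} \sum_{i=1}^{N} \left(x_{ij} - \mu_j\right)\left(x_{ik} - \mu_k\right),
\qquad
\mu_j \;=\; \frac{1}{N} \sum_{i=1}^{N} x_{ij}
```

The diagonal entries Σjj are the feature variances and the off-diagonal entries Σjk are the pairwise covariances, matching the description above.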
Below is a naïve implementation of a function to find the covariance matrix. We will use Intel VTune Amplifier’s “Basic Hotspots” analysis to find the slowest execution lines in the code. NOTE: All code is included in the full script at the end of the document.
def naive(fullArray):
    print('Calculate by Hand Naive For Loops')
    start = time.time()
    # initialize results array
    result = np.zeros((numCols, numCols), dtype=float)
    # initialize norm arrays list
    normArrays = []
    # calculate norm arrays and populate norm arrays list
    for i in range(numCols):
        normArrays.append(np.zeros((numRows, 1), dtype=float))
        for j in range(numRows):
            normArrays[i][j] = fullArray[:, i][j] - np.mean(fullArray[:, i])
    # calculate covariance and populate results array
    for i in range(numCols):
        for j in range(numCols):
            result[i, j] = sum(p*q for p, q in zip(
                normArrays[i], normArrays[j]))/(numRows)
    end = time.time()
    print('overall runtime = ' + str(end - start))
    print(result[:5, :5])
Let’s begin by launching the Intel VTune Amplifier graphical user interface. Click the “New Analysis” button (blue box) on the top menu bar, as shown in Figure 1.
Next, as shown in Figure 2 below, we will:
Select the target type “Launch Application” (Blue Box) to tell VTune we want to launch a new process and analyze it while it runs.
Fill the “Application” and “Application Parameters” (Green Box) with the Python interpreter location and commands, respectively. The commands should include our script location, name, and input arguments; exactly as we would in a normal command line pass. Here we’ve included the arguments “100 1000” for the array shape, so we will be working on a randomly generated array of 100k elements.
Finally click on the “Choose Analysis” button (Yellow Box) to continue.
In the next window (Figure 3), we can choose our analysis type.
We do this by highlighting our choice in the “Algorithm Analysis” area (Blue Box). The Basic Hotspots analysis type will give us a nice overview of our execution environment and allow us to quickly chase down the slowest parts of the code, aka the “hotspots”.
Next, click “Start” (Yellow Box) at the top right to launch the Python application and start the Intel VTune Amplifier analysis.
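The same collection can also be started from the command line. The sketch below assumes the 2018-era `amplxe-cl` driver is on the PATH (newer releases rename it `vtune`) and that the full script from the end of this article is saved as `covariance_script.py`:

```shell
# Collect Basic Hotspots data while running the naive covariance
# function on a randomly generated 100 x 1000 array, then print a
# summary report from the result directory.
amplxe-cl -collect hotspots -result-dir r000hs -- \
    python covariance_script.py naive 100 1000
amplxe-cl -report hotspots -result-dir r000hs
```

The result directory can afterwards be opened in the graphical interface for the same drill-down views shown in the figures.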
Note that the graphical interface of Intel VTune Amplifier does not surface Python errors your script might raise. Ensure that your Python script runs free of errors before profiling.
After data collection, the first window we land on is the “Summary” tab, as shown in Figure 4. Here we see elapsed time, CPU time, thread count, hotspot function names, and a CPU usage histogram. Each piece of information is presented alongside a (?) symbol that, when hovered over, displays help and explanation for the line item. The CPU usage histogram is of particular note when analyzing Python code. This graph is a visual summary of how work is spread across different threads. Python execution is held to a single thread by the Global Interpreter Lock (GIL), so initially we will see no extra logical CPUs in the histogram (see Blue Box).
Next we will look at the “Bottom-up” tab inside the Intel VTune Amplifier graphical interface; see Figure 5. This window shows function calls, along with corresponding columns for utilization time, spin time, and overhead time. The default sorting is by Effective Utilization. The red “Poor” horizontal bars denote the “hotspots” in the application: the function calls responsible for the slowest pieces of application execution. Now let’s drill down into the function and call stack in the leftmost column by clicking on the small triangles (Red Arrows 1 and 2 in the image below) to expand/collapse each function. At any point in the drill-down process, the user is presented with the full call stack in the rightmost column (see yellow box below). We can dig into the source code by mapping hot function calls directly to source lines; this is done by double-clicking on the function line in either the left or right column. A last note before we move on relates to the threading summary at the bottom of the window (see blue box): again, Python is restricting the application to a single thread.
Double-clicking our troublesome function call brings up our source code, and defaults to highlighting the responsible line of code (see red arrow in Figure 6). In this example, the line contains a nested double for-loop over a zipped array. Intel VTune Amplifier is telling us that our naïve approach to finding the covariance matrix is less than optimal. We can use this information to make a code change and try to speed up this section of the code.
Intel VTune Amplifier has helped us identify the slowest execution point in our code, so let’s make a code change to speed it up! Figure 7 is a zoomed-in view of the troublesome line inside the graphical interface.
First we will map the source line from Intel VTune Amplifier code viewer to the same line in our script, opened in our text editor of choice. See Figure 8:
Now we can replace the innermost nest with a Numpy* multiplication call (np.multiply). The new code is shown in Figure 9. Numpy* is a powerful library with highly vectorized implementations of common mathematical operations. Thus we expect our hardware to be utilized more effectively when this updated line of code is executed. For the sake of demonstration, we will call this new function “Some Vectorization” (as opposed to the “Naïve” function with which we started). The next step is to collect data in Intel VTune Amplifier again with the updated script. NOTE: All code is included in the full script at the end of the document.
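A minimal illustration of the change, using small throwaway arrays rather than the article’s data: the Python-level generator expression is swapped for a vectorized multiply.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# Original pattern: the multiply-and-sum runs element by element
# in the Python interpreter
slow = sum(p * q for p, q in zip(a, b))

# Updated pattern (as in the "Some Vectorization" function):
# one vectorized multiply, summed afterward
fast = sum(np.multiply(a, b))

print(slow, fast)  # both equal the dot product, 32.0
```

Both expressions compute the same value; the vectorized version pushes the element-wise work down into Numpy’s* compiled loops.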
Above we identified the slowest line in our naïve covariance matrix function, and replaced it with a vectorized multiply method from Numpy*. We called the updated function “Some Vectorization” to differentiate from the original “Naïve” version. We will now execute the “Some Vectorization” function application and recollect data with Intel VTune Amplifier. Figure 10 below is the result of this data collection. I’ve superimposed a screenshot from the “Naïve” data collection at the top right (yellow box) for reference. The wall time has dropped from 15.4s to 7.1s, so we are making progress. Again we will note the number of threads is still only 1, so we are executing all of our math in serial mode, thanks to Python’s GIL.
During the Naïve data collection analysis, we focused on the Bottom-up tab inside the Intel VTune Amplifier graphical interface. This time, I will introduce the Top-down Tree tab instead; see Figure 11. Since most Python analysis will encompass execution of mixed code (i.e., Python + native), the Top-down Tree is particularly relevant to Python users. Here we can start with the Python-layer interpreter calls and familiar Python functions that users recognize from their own scripts, and drill down into the Fortran or C layers, for instance. The collapsible function stack in the left column and full call stack in the right column are similar to the Bottom-up tab, with the difference being the descending order of calls. The function stacks are again sorted by effective utilization time, so the hotspots are at the top of the list. If we expand the function stack until we see our updated function (called “SomeVec” in the screenshot below), we see that ~90% of utilization time is due to this function. Further expansion identifies a call to np.mean (see Red Arrow #1) as a particularly hot/slow spot in the execution. A priori knowledge of the code suggests we shouldn’t spend such a large amount of execution time in the np.mean method. Again we can double-click on the highlighted row, or on the function in the full call stack in the right column (see Red Arrow #2), to bring up the source code.
The source code tab appears and the relevant code line is highlighted, as shown in Figure 12. As expected, it includes an np.mean() call. Jump to Figure 13 for a zoomed-in version of the code line.
If we analyze this for-loop, we can see the broken logic. Let’s open the source code in our editor of choice and see if we can do better.
The problem here (see red arrow in Figure 14) is that our innermost loop iterates through each row and recomputes the mean of the entire column at every iteration. Furthermore, the subtraction is done element by element, one row per inner-loop iteration. A more efficient way to accomplish this piece of math is to find the mean of each column just once, in the outer loop, and use Numpy’s* subtract method to vectorize the subtraction. Then pass the resultant array (“normArrays”) through to subsequent operations in the inner loop.
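The fix, sketched on a small throwaway column: compute the column mean once, outside the row loop, and let Numpy* perform the subtraction in one vectorized call.

```python
import numpy as np

col = np.array([1.0, 2.0, 3.0, 4.0])

# Original pattern: the mean is recomputed on every row iteration,
# and the subtraction happens one element at a time
norm_loop = np.zeros_like(col)
for j in range(len(col)):
    norm_loop[j] = col[j] - np.mean(col)

# Updated pattern: mean computed once, subtraction vectorized
norm_vec = np.subtract(col, np.mean(col))

print(norm_vec)  # [-1.5 -0.5  0.5  1.5]
```

Both produce the same normalized column; the vectorized version avoids len(col) redundant mean computations.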
Let’s reorganize the code to be more efficient with the code changes denoted by the arrows in Figure 15. The next step is to collect data in Intel VTune Amplifier again with the updated script. We will call this new updated function “More Vectorization” to differentiate from the previous “Naïve” and “Some Vectorization” versions. NOTE: All code is included in full script at end of document.
Above we identified the slowest line in our “Some Vectorization” covariance matrix function, and replaced it with a better-organized for-loop and the vectorized subtract method from Numpy*. We called the updated function “More Vectorization” to differentiate it from the previous “Naïve” and “Some Vectorization” versions. We will now execute the “More Vectorization” function and recollect data with Intel VTune Amplifier. Figure 16 shows the result of the data collection. I’ve superimposed screenshots from the “Naïve” and “Some Vectorization” data collections at the top right (yellow and green boxes) for reference. The wall time has dropped from 15.4s to 7.1s to 1.4s, so we are making great progress with our code adjustments. One final time, we note that the number of threads is still only 1, so we are executing all of our math in serial mode, thanks to Python’s GIL.
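For comparison, the fully out-of-the-box route mentioned at the start of the article collapses the whole hand-rolled function into a single Numpy* call; this mirrors the “numpy” mode in the full script at the end of the document.

```python
import numpy as np

# Same seeded random data as the full script (numRows x numCols)
seeded = np.random.RandomState(42)
fullArray = seeded.rand(1000, 100)

# rowvar=False -> columns are features; bias=True -> normalize by N,
# matching the hand-rolled versions above
result = np.cov(fullArray, rowvar=False, bias=True)
print(result.shape)  # (100, 100)
```

This delegates both the mean-centering and the multiply-accumulate to Numpy’s* compiled internals.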
Intel® VTune™ Amplifier is a powerful profiler with a large footprint in the HPC community. This article demonstrated Python profiling of a covariance matrix calculation. The “Basic Hotspots” analysis type was chosen for this work. A naïve implementation was improved upon by correcting loop logic and adding faster Numpy* method calls where appropriate. Intel® continues to add Python compatibility with each new release of Intel® VTune™ Amplifier. Check the product page and Python profiling page for more details and how to acquire it. Speed up your Python code with drop-in replacements of common Python libraries by installing the Intel® Distribution for Python*.
Below is a script you can use to reproduce the results from this article. Also included are Numpy* and Intel® Data Analytics Acceleration Library (Intel® DAAL) versions of the covariance computation, the latter using the free PyDAAL Python module. The PyDAAL customUtils can be found at the PyDAAL tutorials GitHub page and are pulled from the gentle introduction series for PyDAAL.
'''covariance_script.py'''
import numpy as np
import sys
argv = sys.argv
import ctypes
import time
# Define size of random matrix and idx to slice
mode = argv[1]  # modes_avail: ['naive', 'someVec', 'moreVec', 'numpy', 'pydaal']
numCols = int(argv[2])
numRows = int(argv[3])
numTotal = numRows * numCols
try:
    repeats = int(argv[4])  # not currently used
except:
    repeats = 1
''' Begin Defining Functions'''
# define number of threads if optional argv[5] is passed
try:
    nThreads = int(argv[5])
    print('nThreads is ' + str(nThreads))
    if mode == 'pydaal':
        Environment.getInstance().setNumberOfThreads(nThreads)
        print("DAAL set to use %d threads" % nThreads)
    else:
        mkl_rt = ctypes.CDLL('libmkl_rt.so')
        mkl_get_max_threads = mkl_rt.mkl_get_max_threads
        mklOrigThreads = mkl_get_max_threads()
        print('Original mkl nThreads set to ' + str(mklOrigThreads))
        def mkl_set_num_threads(cores):
            mkl_rt.mkl_set_num_threads(ctypes.byref(ctypes.c_int(cores)))
            mkl_rt.mkl_set_dynamic(ctypes.byref(ctypes.c_int(0)))
        mkl_set_num_threads(nThreads)
        mklNewThreads = mkl_get_max_threads()
        print('mkl nThreads set to ' + str(mklNewThreads))
except:
    pass
#"naive" covariance function
def naive(fullArray):
print('Calculate by Hand Naive For Loops')
start = time.time()
#initialize results array
result = np.zeros((numCols, numCols), dtype=float)
# initialize norm arrays list
normArrays = []
# calculate norm arrays and populate norm arrays dict
for i in range(numCols):
normArrays.append(np.zeros((numRows, 1), dtype=float))
for j in range(numRows):
normArrays[i][j]=fullArray[:, i][j]np.mean(fullArray[:, i])
# calculate covariance and populate results array
for i in range(numCols):
for j in range(numCols):
result[i,j] = sum(p*q for p,q in zip(
normArrays[i],normArrays[j]))/(numRows)
end = time.time()
print('overall runtime = ' + str(end  start))
print(result[:5, :5])
#"some vectorization" covariance function
def someVec(fullArray):
print('Calculate by Hand Some Vectorization')
start = time.time()
#initialize results array
result = np.zeros((numCols, numCols), dtype=float)
# initialize norm arrays list
normArrays = []
# calculate norm arrays and populat norm arrays dict
for i in range(numCols):
normArrays.append(np.zeros((numRows, 1), dtype=float))
for j in range(numRows):
normArrays[i][j]=fullArray[:, i][j]np.mean(fullArray[:, i])
# calculate covariance and populat results array
for i in range(numCols):
for j in range(i+1, numCols):
result[i,j] = sum(np.multiply(normArrays[i],normArrays[j]))/(numRows)
end = time.time()
print('overall runtime = ' + str(end  start))
print(result[:5, :5])
#"more vectorization" covariance function
def moreVec(fullArray):
print('Calculate by Hand More Vectorization')
start = time.time()
#initialize results array
result = np.zeros((numCols, numCols), dtype=float)
# initialize norm arrays list
normArrays = []
# calculate norm arrays and populat norm arrays dict
for i in range(numCols):
normArrays.append(np.zeros((numRows, 1), dtype=float))
normArrays[i]=np.subtract(fullArray[:, i], np.mean(fullArray[:, i]))
for j in range(i+1):
result[i,j] = sum(np.multiply(normArrays[i],normArrays[j]))/(numRows)
# calculate covariance and populat results array
for i in range(numCols):
for j in range(i+1, numCols):
result[i,j] = sum(np.multiply(normArrays[i],normArrays[j]))/(numRows)
end = time.time()
print('overall runtime = ' + str(end  start))
print(result[:5, :5])
# numpy covariance function
def numpy(fullArray):
    print('Calculate with Numpy Library')
    start = time.time()
    for i in range(repeats):
        result = np.cov(fullArray, rowvar=False, bias=True)
    end = time.time()
    print('overall runtime = ' + str(end - start))
    print(result[:5, :5])
# imports specifically for pydaal
if mode == 'pydaal':
    try:
        from customUtils import getBlockOfNumericTable, getArrayFromNT, serialize, deserialize
        from daal.data_management import HomogenNumericTable
        from daal.services import Environment
        from daal.algorithms.covariance import Batch, data, covariance
    except:
        print('PyDAAL not found, skipping daal analysis')
# intel daal covariance function
def pydaal(fullArray):
    try:
        def daal_cov(nT):
            # Create algorithm to compute dense variance-covariance matrix in batch mode
            algorithm = Batch()
            # Set input arguments of the algorithm
            algorithm.input.set(data, nT)
            # Get computed variance-covariance matrix
            result = algorithm.compute()
            return result
        print("success")
        print('Calculate with PyDAAL Library')
        nT = HomogenNumericTable(fullArray)
        start = time.time()
        for i in range(repeats):
            res = daal_cov(nT)
        end = time.time()
        print('overall runtime = ' + str(end - start))
        result = getArrayFromNT(res.get(covariance))
        print(result[:5, :5])
    except:
        print('PyDAAL not found, skipping daal analysis')
def main():
    ''' Run Options Section '''
    print("Initializing Data Matrix with %d million elements" % (numTotal/1e6))
    ''' Data Set Creation Section '''
    seeded = np.random.RandomState(42)
    fullArray = seeded.rand(numRows, numCols)
    ''' Run '''
    if mode == 'naive':
        naive(fullArray)
    elif mode == 'someVec':
        someVec(fullArray)
    elif mode == 'moreVec':
        moreVec(fullArray)
    elif mode == 'numpy':
        numpy(fullArray)
    elif mode == 'pydaal':
        pydaal(fullArray)
# Call Main Function
main()
| System Configuration | |
|---|---|
| CPU | Intel(R) Xeon(R) E5/E7 v3 Processor, code named Haswell |
| Frequency | 2.3 GHz |
| Logical CPU Count | 64 |
| Memory | 32 GB DDR4 |
| OS | Ubuntu 4.4.0.112 |
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804