Profiling Python* With Intel® VTune™ Amplifier: A Covariance Demonstration

By Nathan G Greeneltch, Published: 02/05/2018, Last Updated: 02/05/2018

Introduction

Intel® VTune™ Amplifier is source-code profiling software, popular in the High Performance Computing (HPC) community for its versatile and accurate sampling as well as its low collection overhead. Software stack sampling, thread profiling, and low-level hardware event sampling are all available. Along with a command-line interface, Intel VTune Amplifier also has a mature and convenient graphical user interface: a user can “mouse around” and effectively dig through their code, mapping bottlenecks to specific lines in the source.

The Python* language has a reputation for convenience in scripting work, but not necessarily for executing fast at runtime. As such, the HPC community has not concentrated heavily on tuning and profiling Python code. With the widespread adoption of the language for machine learning and data science, and the subsequent proliferation of deep learning as a viable solution in many engineering environments, it is time to get serious about Python profiling. Towards this goal, Intel VTune Amplifier continues to add new and exciting features aimed directly at Python developers. In addition to source line-level Python granularity, Intel VTune Amplifier provides navigable visual representations of Python memory analysis and mixed-code threading and scheduling. The table below compares Intel VTune Amplifier to other commonly used Python profilers. More information on profiling Python code with Intel VTune Amplifier can be found here.

 

Feature                 | cProfile        | Line_profiler   | Intel® VTune™ Amplifier
------------------------|-----------------|-----------------|------------------------
Profiling technology    | Event           | Instrumentation | Sampling, hardware events (native code), instrumentation (Python code)
Analysis granularity    | Function-level  | Line-level      | Line-level, call stack, time windows, hardware events
Intrusiveness           | Medium (1.3-5x) | High (4-10x)    | Low (1.05-1.3x)
Mixed language programs | Python          | Python          | Python, Cython, C++, Fortran
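For a quick sense of what the function-level tools in this table report, cProfile can be driven from a few lines of standard-library code. A minimal sketch (the slow_sum function is a made-up stand-in workload, not from this article's script):

```python
import cProfile
import io
import pstats


def slow_sum(n):
    # Deliberately unvectorized loop to give the profiler something to measure
    total = 0
    for i in range(n):
        total += i * i
    return total


profiler = cProfile.Profile()
profiler.enable()
slow_sum(100000)
profiler.disable()

# Dump function-level statistics, sorted by cumulative time
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats()
report = stream.getvalue()
print(report)
```

Note that the report identifies hot functions but not the hot lines inside them; that function-level granularity is exactly the limitation the table records.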

 

Where to get

In addition to the stand-alone version, Intel VTune Amplifier is included in multiple software suites offered through software.intel.com; namely, the Professional and Cluster editions of Intel® Parallel Studio XE, Intel® System Studio, and Intel® Media Server Studio. All versions come with priority support.

Demonstration Details

This article demonstrates creating and tuning Python code using Intel VTune Amplifier. The subject of the code is calculation of the covariance matrix. We will begin with a naïve approach and gradually tune the code to run faster. Finally, we will show code for Python users to get out-of-the-box speed increases from the covariance implementations built into Numpy* and the Intel® Data Analytics Acceleration Library (Intel® DAAL). See the full script at the end of the article for all code snippets.

Mathematics Explanation

Equation for the covariance matrix

The covariance (aka variance-covariance) matrix represents the mathematical generalization of variance to multiple dimensions. It consists of feature variances along the diagonal and element-wise covariances along the off-diagonal. Typically each element is normalized by the number of examples in the dataset. A common application of the covariance matrix is to decorrelate input data by providing, in a compact way, a new basis set for projection. This popular technique for dimensionality reduction is called Principal Component Analysis (PCA). See below for an n x n covariance matrix, where n is the number of features, m is the number of examples in the dataset, and x1 through xn are each feature’s expectant deviations (x - µn, over all data points):
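Written out in that notation, with each xi the length-m vector of a feature's deviations from its mean µi, every matrix element is a normalized dot product:

```latex
\Sigma \;=\; \frac{1}{m}
\begin{bmatrix}
x_1 \cdot x_1 & x_1 \cdot x_2 & \cdots & x_1 \cdot x_n \\
x_2 \cdot x_1 & x_2 \cdot x_2 & \cdots & x_2 \cdot x_n \\
\vdots        & \vdots        & \ddots & \vdots        \\
x_n \cdot x_1 & x_n \cdot x_2 & \cdots & x_n \cdot x_n
\end{bmatrix},
\qquad
\Sigma_{ij} \;=\; \frac{1}{m} \sum_{k=1}^{m} \left(x_{ki} - \mu_i\right)\left(x_{kj} - \mu_j\right)
```

The diagonal entries are the feature variances and the off-diagonal entries are the pairwise covariances, each normalized by the number of examples m.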

 

“NAÏVE” Implementation and Intel® VTune™ Amplifier

Code Introduction

Below is a naïve implementation of a function to find the covariance matrix. We will use Intel VTune Amplifier to find the slowest execution lines in the code, using the “Basic Hotspots” feature. NOTE: All code is included in full script at end of document.

 

def naive(fullArray):
	print('Calculate by Hand Naive For Loops')
	start = time.time()

	#initialize results array
	result = np.zeros((numCols, numCols), dtype=float)

	# initialize norm arrays list
	normArrays = []

	# calculate norm arrays and populate norm arrays list
	for i in range(numCols):
		normArrays.append(np.zeros((numRows, 1), dtype=float))
		for j in range(numRows):
			normArrays[i][j]=fullArray[:, i][j]-np.mean(fullArray[:, i])

			
	# calculate covariance and populate results array
	for i in range(numCols):
		for j in range(numCols):
			result[i,j] = sum(p*q for p,q in zip(
							  normArrays[i],normArrays[j]))/(numRows)

	end = time.time()
	print('overall runtime = ' + str(end - start))
	print(result[:5, :5])

 

Launch Data Collection with Intel® VTune™ Amplifier

Let’s begin with a Launch of the Intel VTune Amplifier graphical user interface. Click on the “New Analysis” button (blue box) on the top menu bar as shown in Figure 1.

Python Profiling With Intel VTune Figure 1
Figure 1

 

Next, as shown in Figure 2 below, we will:

1. Select the target type “Launch Application” (Blue Box) to tell VTune we want to launch a new process and analyze it while it runs.

2. Fill in “Application” and “Application Parameters” (Green Box) with the Python interpreter location and commands, respectively. The commands should include our script location, name, and input arguments, exactly as we would on a normal command line. Here we’ve included the arguments “100 1000” for the array shape, so we will be working on a randomly generated array of 100k elements.

3. Finally, click the “Choose Analysis” button (Yellow Box) to continue.
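The same collection can also be launched from a terminal with the Intel VTune Amplifier command-line tool. A sketch, assuming the amplxe-cl binary is on the PATH; the interpreter path, script name, and “naive 100 1000” arguments are placeholders matching this article's setup:

```shell
# Launch the script under Basic Hotspots collection from the command line
# (paths and arguments are placeholders for your own environment)
amplxe-cl -collect hotspots -- /usr/bin/python covariance_script.py naive 100 1000
```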

Python Profiling With Intel VTune Figure 2
Figure 2

 

In the next window (Figure 3), we can choose our analysis type.

We do this by highlighting our choice in the “Algorithm Analysis” area (Blue Box). The Basic Hotspots analysis type will give us a nice overview of our execution environment and allow us to quickly chase down the slowest parts of the code, aka the “hotspots”.

Next we will click “Start” (Yellow Box) on the top right to launch the Python application and start the Intel VTune Amplifier analysis.

Python Profiling With Intel VTune Figure 3
Figure 3

 

Note that the Intel VTune Amplifier graphical interface does not surface Python errors your script might raise. Ensure that your Python script runs free of errors before profiling.

“Basic Hotspots” Summary View

After data collection, the first window we land on is the “Summary” tab, as shown in Figure 4. Here we see elapsed time, CPU time, thread count, hotspot function names, and the CPU usage histogram. Each piece of information is presented alongside a (?) symbol that, when hovered over, displays help and explanation for the line item. The CPU usage histogram is of particular note when analyzing Python code. This graph is a visual summary of how work is spread across different threads. Python execution is held to a single thread by the Global Interpreter Lock (GIL), so initially we will see no extra logical CPUs in the histogram (see Blue Box).

Python Profiling With Intel VTune Figure 4
Figure 4

 

“Basic Hotspots” Bottom-up View

Next we will look at the “Bottom-up” tab inside the Intel VTune Amplifier graphical interface; see Figure 5. This window shows function calls, along with corresponding columns of utilization time, spin time, and overhead time. The default sorting is by effective utilization. The red “Poor” horizontal bars denote the “hotspots” in the application: the function calls responsible for the slowest pieces of application execution. Now let’s drill down into the function and call stack in the left-most column by clicking on the small triangles (Red Arrows 1 and 2 in the image below) to expand/collapse each function. At any point in the drill-down process, the user is presented with the full call stack in the right-most column (see Yellow Box below). We can dig into the source code by mapping the hot function calls directly to source lines; this is done by double-clicking on the function line in either column. A last note before we move on relates to the threading summary at the bottom of the window (see Blue Box): again, Python is restricting the application to a single thread.

Python Profiling With Intel VTune Figure 5
Figure 5

 

Source Code Review and Change

Double-clicking our troublesome function call brings up our source code, with the responsible line highlighted by default (see red arrow in Figure 6). In this example, the line includes a nested double for-loop through a zipped array. Intel VTune Amplifier is telling us that our naïve approach to finding the covariance matrix is less than optimal. We can use this information to make a code change and try to speed up this section of the code.

Python Profiling With Intel VTune Figure 6
Figure 6

 

Intel VTune Amplifier has helped us identify the slowest execution point in our code, so let’s make a code change to speed it up! Figure 7 is a zoomed-in view of the troublesome line inside the graphical interface.

Python Profiling With Intel VTune Figure 7
Figure 7

 

First we will map the source line from Intel VTune Amplifier code viewer to the same line in our script, opened in our text editor of choice. See Figure 8:

Python Profiling With Intel VTune Figure 8
Figure 8

 

Now we can replace the innermost nest with a Numpy* multiplication call (np.multiply). The new code is shown in Figure 9. Numpy* is a powerful library with highly vectorized implementations of common mathematical operations, so we expect our hardware to be utilized more effectively when this updated line of code is executed. For the sake of demonstration, we will call this new function “Some Vectorization” (as opposed to the “Naïve” function with which we started). The next step is to collect data in Intel VTune Amplifier again with the updated script. NOTE: All code is included in full script at end of document.
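The substitution is easy to sanity-check outside the profiler. A minimal sketch, with small stand-in arrays in place of the script’s normArrays entries, confirming that the np.multiply version matches the original zip-based generator sum:

```python
import numpy as np

rng = np.random.RandomState(0)
a = rng.rand(8, 1)  # stand-in for normArrays[i]
b = rng.rand(8, 1)  # stand-in for normArrays[j]
numRows = 8

# Original inner line: Python-level loop over zipped elements
slow = sum(p * q for p, q in zip(a, b)) / numRows

# Replacement: one vectorized multiply, then a single reduction
fast = sum(np.multiply(a, b)) / numRows

print(slow, fast)
```

The two reductions agree; the difference is only that the multiply now happens in one vectorized native call instead of row by row in the interpreter.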

Python Profiling With Intel VTune Figure 9
Figure 9

 

“Some Vectorization” Implementation and Intel® VTune™ Amplifier

“Basic Hotspots” Summary View

Above we identified the slowest line in our naïve covariance matrix function, and replaced it with a vectorized multiply method from Numpy*. We called the updated function “Some Vectorization” to differentiate from the original “Naïve” version. We will now execute the “Some Vectorization” function application and recollect data with Intel VTune Amplifier. Figure 10 below is the result of this data collection. I’ve superimposed a screenshot from the “Naïve” data collection at the top right (yellow box) for reference. The wall time has dropped from 15.4s to 7.1s, so we are making progress. Again we will note the number of threads is still only 1, so we are executing all of our math in serial mode, thanks to Python’s GIL.

Python Profiling With Intel VTune Figure 10
Figure 10

 

“Basic Hotspots” Top-down View

During the Naïve data collection analysis, we focused on the Bottom-up tab inside the Intel VTune Amplifier graphical interface. This time, I will introduce the Top-down Tree tab instead; see Figure 11. Since most Python analysis will encompass execution of mixed code (i.e., Python + native), the Top-down Tree is particularly relevant to Python users. Here we can start with the Python-layer interpreter calls and familiar Python functions that users recognize from their own scripts, and drill down into the Fortran or C layers, for instance. The collapsible function stack in the left column and full call stack in the right column are similar to the Bottom-up tab, the difference being the descending order of calls. The function stacks are again sorted by effective utilization time, so the hotspots are at the top of the list. If we expand the function stack until we see our updated function (called “SomeVec” in the screenshot below), we see that ~90% of utilization time is due to this function. Further expansion identifies a call to np.mean (see Red Arrow #1) as a particularly hot/slow spot in the execution. A priori knowledge of the code suggests we shouldn’t spend such a large amount of execution time in the np.mean method. Again we can double-click on the highlighted row, or on the function in the full call stack in the right column (see Red Arrow #2), to bring up the source code.

Python Profiling With Intel VTune Figure 11
Figure 11

 

Source Code Review and Change

The source code tab appears with the relevant code line highlighted, as shown in Figure 12. As expected, it includes an np.mean() call. Jump to Figure 13 for a zoomed-in version of the code line.

Python Profiling With Intel VTune Figure 12
Figure 12

 

Analyzing this for-loop reveals the flawed logic. Let’s open the source code in our editor of choice and see if we can do better.

Python Profiling With Intel VTune Figure 13
Figure 13

 

The problem here (see red arrow in Figure 14) is that the innermost loop iterates through each row and recomputes the mean of the entire column on every iteration. Furthermore, the subtraction is done element-wise at each inner-loop iteration. A more efficient way to accomplish this piece of math is to find each column’s mean once, in the outer loop, and use Numpy’s* subtract method to vectorize the subtraction, then pass the resulting array (“normArrays”) through to subsequent operations in the inner loop.
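That fix can be sketched in isolation. Assuming a small random array in place of the script’s fullArray, the once-per-column mean plus vectorized subtract produces the same centered columns as the row-by-row version:

```python
import numpy as np

rng = np.random.RandomState(42)
fullArray = rng.rand(6, 3)  # stand-in: numRows=6, numCols=3
numRows, numCols = fullArray.shape

# Inefficient pattern: recompute the column mean for every single row
slow = np.zeros_like(fullArray)
for i in range(numCols):
    for j in range(numRows):
        slow[j, i] = fullArray[j, i] - np.mean(fullArray[:, i])

# Efficient pattern: one mean per column, one vectorized subtract
fast = np.zeros_like(fullArray)
for i in range(numCols):
    fast[:, i] = np.subtract(fullArray[:, i], np.mean(fullArray[:, i]))

print(np.allclose(slow, fast))
```

The results are identical; the efficient version simply calls np.mean once per column instead of once per element.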

Python Profiling With Intel VTune Figure 14
Figure 14

 

Let’s reorganize the code to be more efficient with the code changes denoted by the arrows in Figure 15. The next step is to collect data in Intel VTune Amplifier again with the updated script. We will call this new updated function “More Vectorization” to differentiate from the previous “Naïve” and “Some Vectorization” versions. NOTE: All code is included in full script at end of document.

Python Profiling With Intel VTune Figure 15
Figure 15

 

“More Vectorization” Implementation and Intel® VTune™ Amplifier

Above we identified the slowest line in our “Some Vectorization” covariance matrix function, and replaced it with a better-organized for-loop and the vectorized subtract method from Numpy*. We called the updated function “More Vectorization” to differentiate it from the previous “Naïve” and “Some Vectorization” versions. We will now execute the “More Vectorization” application and recollect data with Intel VTune Amplifier. Figure 16 shows the result of the data collection. I’ve superimposed screenshots from the “Naïve” and “Some Vectorization” data collections at the top right (yellow and green boxes) for reference. The wall time has dropped from 15.4s to 7.1s to 1.4s, so we are making great progress with our code adjustments. For the final time, we note that the thread count is still only 1, so we are executing all of our math in serial mode, thanks to Python’s GIL.

Python Profiling With Intel VTune Figure 16
Figure 16

 

Conclusion and Call to Action

Intel® VTune™ Amplifier is a powerful profiler with a large footprint in the HPC community. This article demonstrated Python profiling of a covariance matrix code execution. The “Basic Hotspots” analysis type was chosen for this work. A naïve implementation was improved upon by correcting inefficient loop logic and adding faster Numpy* method calls where appropriate. Intel® continues to add Python compatibility with each new release of Intel® VTune™ Amplifier. Check the product page and Python profiling page for more details and acquisition options. Speed up your Python code with drop-in replacements of common Python libraries by installing the Intel® Distribution for Python*.

Script With All Covariance (+ Numpy* & PyDAAL) Code

Below is a script you can use to reproduce the results from this article. Also included are Numpy* and Intel® Data Analytics Acceleration Library (Intel® DAAL) implementations of the covariance matrix computation, the latter using the free PyDAAL Python module. The PyDAAL customUtils helpers can be found on the PyDAAL tutorials GitHub page and are pulled from the Gentle Introduction series for PyDAAL.

'''covariance_script.py'''


import numpy as np
import sys
argv = sys.argv

import ctypes
import time

# Define size of random matrix and idx to slice
mode = argv[1]  #modes_avail ['naive','someVec','moreVec','numpy','pydaal']
numCols = int(argv[2])
numRows = int(argv[3])
numTotal = numRows * numCols

try:
	repeats = int(argv[4]) # not currently used
except:
	repeats = 1


''' Begin Defining Functions'''

#define number of threads if optional argv[5] is passed
try:
	nThreads = int(argv[5])
	print('nThreads is ' + str(nThreads))
	if mode == 'pydaal':
		# imported here so Environment exists before the later pydaal import block runs
		from daal.services import Environment
		Environment.getInstance().setNumberOfThreads(nThreads)
		print("DAAL set to use %d threads" % nThreads)
	else:
		mkl_rt = ctypes.CDLL('libmkl_rt.so')
		mkl_get_max_threads = mkl_rt.mkl_get_max_threads
		mklOrigThreads = mkl_get_max_threads()
		print('Original mkl nThreads set to ' + str(mklOrigThreads))
		
		def mkl_set_num_threads(cores):
			mkl_rt.mkl_set_num_threads(ctypes.byref(ctypes.c_int(cores)))
			
		mkl_rt.mkl_set_dynamic(ctypes.byref(ctypes.c_int(0)))
		mkl_set_num_threads(nThreads)
		mklNewThreads = mkl_get_max_threads()
		print('mkl nThreads set to ' + str(mklNewThreads))
except:
	pass

	
#"naive" covariance function
def naive(fullArray):
	print('Calculate by Hand Naive For Loops')
	start = time.time()

	#initialize results array
	result = np.zeros((numCols, numCols), dtype=float)

	# initialize norm arrays list
	normArrays = []

	# calculate norm arrays and populate norm arrays list
	for i in range(numCols):
		normArrays.append(np.zeros((numRows, 1), dtype=float))
		for j in range(numRows):
			normArrays[i][j]=fullArray[:, i][j]-np.mean(fullArray[:, i])

			
	# calculate covariance and populate results array
	for i in range(numCols):
		for j in range(numCols):
			result[i,j] = sum(p*q for p,q in zip(
							normArrays[i],normArrays[j]))/(numRows)

	end = time.time()
	print('overall runtime = ' + str(end - start))
	print(result[:5, :5])

	
#"some vectorization" covariance function
def someVec(fullArray):
	print('Calculate by Hand Some Vectorization')
	start = time.time()

	#initialize results array
	result = np.zeros((numCols, numCols), dtype=float)


	# initialize norm arrays list
	normArrays = []

	# calculate norm arrays and populate norm arrays list
	for i in range(numCols):
		normArrays.append(np.zeros((numRows, 1), dtype=float))
		for j in range(numRows):
			normArrays[i][j]=fullArray[:, i][j]-np.mean(fullArray[:, i])

	# calculate covariance and populate results array
	# (same loop bounds as naive(); only the inner line changed to np.multiply)
	for i in range(numCols):
		for j in range(numCols):
			result[i,j] = sum(np.multiply(normArrays[i],normArrays[j]))/(numRows)

	end = time.time()
	print('overall runtime = ' + str(end - start))
	print(result[:5, :5])

	
#"more vectorization" covariance function
def moreVec(fullArray):
	print('Calculate by Hand More Vectorization')
	start = time.time()

	#initialize results array
	result = np.zeros((numCols, numCols), dtype=float)

	# initialize norm arrays list
	normArrays = []

	# calculate norm arrays (one vectorized subtract per column) and fill the lower triangle
	for i in range(numCols):
		normArrays.append(np.zeros((numRows, 1), dtype=float))
		normArrays[i]=np.subtract(fullArray[:, i], np.mean(fullArray[:, i]))
		for j in range(i+1):
			result[i,j] = sum(np.multiply(normArrays[i],normArrays[j]))/(numRows)

	# calculate the remaining upper-triangle covariance entries
	for i in range(numCols):
		for j in range(i+1, numCols):
			result[i,j] = sum(np.multiply(normArrays[i],normArrays[j]))/(numRows)

	end = time.time()
	print('overall runtime = ' + str(end - start))
	print(result[:5, :5])


# numpy covariance function
def numpy(fullArray):
	print('Calculate with Numpy Library')
	start = time.time()
	for i in range(repeats):
		result = np.cov(fullArray, rowvar=False, bias=True)
	end = time.time()
	print('overall runtime = ' + str(end - start))
	print(result[:5, :5])

	
#imports specifically for pydaal
if mode == 'pydaal':
	try:
		from customUtils import getBlockOfNumericTable, getArrayFromNT, serialize, deserialize
		
		from daal.data_management import HomogenNumericTable
		from daal.services import Environment
		
		from daal.algorithms.covariance import Batch, data, covariance
		
	except ImportError:
		print('PyDAAL not found, skipping daal analysis')

	
# intel daal covariance function
def pydaal(fullArray):
	try:
		def daal_cov(nT):
			# Create algorithm to compute dense variance-covariance matrix in batch mode
			algorithm = Batch()
			# Set input arguments of the algorithm
			algorithm.input.set(data, nT)
			# Get computed variance-covariance matrix
			result = algorithm.compute()
			
			return result
		print("success")
		print('Calculate with PyDAAL Library')
		
		nT = HomogenNumericTable(fullArray)
		
		start = time.time()
		for i in range(repeats):
			res = daal_cov(nT)
			
		end = time.time()
		print('overall runtime = ' + str(end - start))
		result = getArrayFromNT(res.get(covariance))
		print(result[:5, :5])
	except:
		print('PyDAAL not found, skipping daal analysis')


def main():

	'''Run Options Section '''

	print("Initializing Data Matrix with %.1f million elements" % (numTotal / 1e6))

	''' Data Set Creation Section '''
	seeded = np.random.RandomState(42)
	fullArray = seeded.rand(numRows, numCols)

	''' Run '''
	if mode == 'naive':
		naive(fullArray)
	elif mode == 'someVec':
		someVec(fullArray)
	elif mode == 'moreVec':
		moreVec(fullArray)
	elif mode == 'numpy':
		numpy(fullArray)
	elif mode == 'pydaal':
		pydaal(fullArray)

#Call Main Function
main()
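As a final sanity check on the hand-written math, the centered-column arithmetic used throughout the script should agree with Numpy’s built-in estimator when bias=True (dividing by the number of rows, as the script does). A small self-contained sketch with a stand-in array:

```python
import numpy as np

rng = np.random.RandomState(42)
fullArray = rng.rand(50, 4)  # stand-in: 50 examples, 4 features
numRows = fullArray.shape[0]

# Center each column, then form the n x n matrix of dot products / numRows,
# mirroring what the naive/someVec/moreVec functions compute element by element
centered = fullArray - fullArray.mean(axis=0)
manual = centered.T.dot(centered) / numRows

# Numpy's one-line equivalent used by the script's numpy() mode
builtin = np.cov(fullArray, rowvar=False, bias=True)

print(np.allclose(manual, builtin))
```

Both paths produce the same matrix; the hand-written versions exist only to give the profiler something to optimize.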

Hardware Specs

CPU: Intel(R) Xeon(R) E5/E7 v3 Processor (code named Haswell)
Frequency: 2.3 GHz
Logical CPU Count: 64
Memory: 32 GB DDR4
OS: Ubuntu 4.4.0.112

 

 

Product and Performance Information

1

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804