# Quick start to performance analysis and code optimization with Intel® Cilk™ Plus or OpenMP* and Intel® System Studio or Intel® Parallel Studio XE

### Introduction

This article demonstrates a code-optimization methodology that can help developers improve the performance of their programs.
In this post we look at Intel® System Studio and how it can help make our software run faster. The same optimizations can be made with Intel® Parallel Studio XE.
In this article we will cover the following topics:
• How to quickly get started with the VTune Amplifier performance analyzer to measure the running time of a program and identify bottlenecks.
• How to quickly get started with Intel's C/C++ language extensions to improve performance by adding parallelism to new or legacy code.

### Using the code

Suppose we have the following program.

At each iteration the program calculates the number of set bits of a byte and of its eight adjacent bytes, storing the nine counts in a 3x3 matrix, and then tests whether that matrix is a magic square. (In recreational mathematics, a magic square is a matrix in which the numbers in each row, in each column, and in the main and secondary diagonals all add up to the same number.)

```
#include <cstdlib>

const int g_iWidth = 2005;
const int g_iHight = 100000;
const int g_iNUM_ROWS = 21;
unsigned int g_uiMatrix[3][3] = { 0 };
const unsigned char arr[] = { '0', '1', '2', '3', '4', '5', '6', '7', '8', '9' };

void Init_Array(unsigned char * _ptrMatrix)
{
    int iSizeArray = g_iWidth * g_iHight;
    for (int i = 0; i < iSizeArray; ++i)
    {
        _ptrMatrix[i] = arr[rand() % 10];
    }
}

unsigned int CountNumberOfSetBits(unsigned char ucData)
{
    unsigned int uiCount = 0;
    while (ucData)
    {
        if (ucData % 2 == 1) {
            uiCount++;
        }
        ucData = ucData >> 1;
    }
    return uiCount;
}

bool isMagicSquare()
{
    unsigned int uiSize = 3;
    unsigned int uiRow, uiColumn = 0;
    unsigned int uiSum, uiSum1, uiSum2;
    bool bFlag = false;
    uiSum = 0;

    // The sum of the main diagonal is used as the reference value.
    for (uiRow = 0; uiRow < uiSize; uiRow++) {
        for (uiColumn = 0; uiColumn < uiSize; uiColumn++) {
            if (uiRow == uiColumn) {
                uiSum = uiSum + g_uiMatrix[uiRow][uiColumn];
            }
        }
    }
    // Every row must add up to the reference sum.
    for (uiRow = 0; uiRow < uiSize; uiRow++) {
        uiSum1 = 0;
        for (uiColumn = 0; uiColumn < uiSize; uiColumn++) {
            uiSum1 = uiSum1 + g_uiMatrix[uiRow][uiColumn];
        }
        if (uiSum == uiSum1) {
            bFlag = true;
        }
        else {
            bFlag = false;
            break;
        }
    }
    if (!bFlag) {
        return false; // a row mismatch already disqualifies the square
    }
    // Every column must add up to the reference sum as well.
    for (uiRow = 0; uiRow < uiSize; uiRow++) {
        uiSum2 = 0;
        for (uiColumn = 0; uiColumn < uiSize; uiColumn++) {
            uiSum2 = uiSum2 + g_uiMatrix[uiColumn][uiRow];
        }
        if (uiSum == uiSum2) {
            bFlag = true;
        }
        else {
            bFlag = false;
            break;
        }
    }

    return bFlag;
}

void FindOutCoordinatesOftheCenterOfTheMagicSquares(unsigned char * _ptrMatrix)
{
    for (int i = 1; i < g_iHight - 1; ++i)
    {
        for (int j = 1; j < g_iWidth - 1; ++j)
        {
            g_uiMatrix[0][0] = CountNumberOfSetBits(_ptrMatrix[(g_iWidth * (i - 1)) + (j - 1)]);
            g_uiMatrix[0][1] = CountNumberOfSetBits(_ptrMatrix[(g_iWidth * (i - 1)) + j]);
            g_uiMatrix[0][2] = CountNumberOfSetBits(_ptrMatrix[(g_iWidth * (i - 1)) + (j + 1)]);
            g_uiMatrix[1][0] = CountNumberOfSetBits(_ptrMatrix[(g_iWidth * i) + (j - 1)]);
            g_uiMatrix[1][1] = CountNumberOfSetBits(_ptrMatrix[(g_iWidth * i) + j]);
            g_uiMatrix[1][2] = CountNumberOfSetBits(_ptrMatrix[(g_iWidth * i) + (j + 1)]);
            g_uiMatrix[2][0] = CountNumberOfSetBits(_ptrMatrix[(g_iWidth * (i + 1)) + (j - 1)]);
            g_uiMatrix[2][1] = CountNumberOfSetBits(_ptrMatrix[(g_iWidth * (i + 1)) + j]);
            g_uiMatrix[2][2] = CountNumberOfSetBits(_ptrMatrix[(g_iWidth * (i + 1)) + (j + 1)]);

            bool bRes = isMagicSquare();
            if (bRes)
            {
                //std::cout << "Center of magic square in the [" << i << "," << j << "]\n";
            }
        }
    }
}

int main()
{
    unsigned char * ptrToMatrix = nullptr;
    ptrToMatrix = new unsigned char[g_iWidth * g_iHight]; // allocates a 200,500,000-byte array
    Init_Array(ptrToMatrix);

    FindOutCoordinatesOftheCenterOfTheMagicSquares(ptrToMatrix);

    delete[] ptrToMatrix;
    return 0;
}
```

What the code above actually does doesn't matter much; it is simply a deliberately heavy piece of code for our investigation.

We use the standard compiler optimization options in the Visual Studio development environment: the /O2 option selects a predefined set of options that optimizes for speed. Let's build the project to create an executable.

### Intel® System Studio Installation

First we need to install Intel® System Studio by following the installation instructions. (If you have already installed Intel System Studio, simply skip this section.) For details, see the Intel® System Studio Installation Guide.

### Intel® VTune™ Amplifier

Now let's start Intel VTune Amplifier to understand where our application spends its time, to identify hotspots - the most time-consuming program units (bottlenecks) - and to detect how they are called. To analyze our executable in Intel® VTune Amplifier, we need to create a new project:

• Click the "New Project" button. The "Create a Project" dialog box opens; specify a name for the project, e.g. "Test-01".
• Click the "New Analysis" button on the VTune Amplifier toolbar. The "New Amplifier Result" tab opens with the Analysis Type window active.
• In the left pane of the Analysis Type window, locate the analysis tree, select "Basic Hotspots" under the "Algorithm Analysis" folder, and click the "Project Properties" button on the right command bar. The "Project Properties" dialog box opens with the "Target" tab selected by default.
• Select the "Launch Application" option as the target type and choose our application to analyze (in general it can be either a binary file or a script).
• Click the "Source Search" tab, then the "..." button, to specify the directories searched for the source files required for data finalization.
• Click the "Start" button on the right command bar.

After running ConsoleApplication1.exe under VTune Amplifier, we may see something like this: the "Summary" tab of the Basic Hotspots analysis shows an elapsed time of 36.756 s for our sample application. The FindOutCoordinatesOftheCenterOfTheMagicSquares function (which shows up at the top of the list as the hottest function) took 33.607 seconds to execute.

What do the different colors on the timeline tell us?

• Gray (Idle): By default, if the CPU time summed over all threads is less than 50% of the CPU time of one fully utilized core, such CPU usage is classified as idle.
Formula: Σ(i=1..ThreadsCount) CPUTime(T, i) / T < 0.5, where CPUTime(T, i) is the total CPU time of thread i over the interval T.
• Red (Poor): By default, poor usage is when the number of simultaneously running CPUs is less than or equal to 50% of the target CPU usage.
• Orange (Ok): By default, OK usage is when the number of simultaneously running CPUs is between 51% and 85% of the target CPU usage.
• Green (Ideal): By default, ideal usage is when the number of simultaneously running CPUs is between 86% and 100% of the target CPU usage.

The CPU Usage Histogram above shows poor usage, because this sample does not make use of parallelism. A multi-threaded design is therefore preferable here for processing heavy tasks more quickly and effectively.
Click the FindOutCoordinatesOftheCenterOfTheMagicSquares link in the "Top Hotspots" table.
The "Bottom-up" tab opens the Hotspots (Hotspots by CPU Usage) viewpoint, which identifies the most time-consuming functions and lets us analyze their call flow bottom-up - from a function to its parent functions. Double-clicking the first line in the table makes VTune Amplifier open the source file, positioned at the most time-consuming line of the FindOutCoordinatesOftheCenterOfTheMagicSquares function. Now we can see all the bottlenecks of this function, which takes most of the time and contains the largest amount of problematic code. Therefore, if we improve the performance of this function, we'll get a performance benefit for the whole program.

### Optimization by using OpenMP

Let’s now see how we can use OpenMP to optimize our app.

OpenMP (Open Multi-Processing) is an application programming interface (API) that provides a portable, multi-platform, and scalable model for C, C++, and Fortran developers of shared-memory parallel applications. It works on most platforms, processor architectures, and operating systems, including Windows*, Linux*, OS X*, and others. OpenMP consists of a set of compiler directives, library routines, and environment variables that influence run-time behavior.

OpenMP functionality is available with Intel Parallel Studio XE. It may not be available with Intel System Studio unless additional products are installed. Now let's parallelize the outer loop using OpenMP compiler directives.
Take a look at the FindOutCoordinatesOftheCenterOfTheMagicSquares() function. All we need to do is add #pragma omp parallel for num_threads(8) before the outer for loop.
So, let's share the loop among 8 threads:

```
void FindOutCoordinatesOftheCenterOfTheMagicSquares(unsigned char * _ptrMatrix)
{
    // Note: all threads write to the shared global g_uiMatrix, which is a
    // data race; for correct results the 3x3 matrix should be made per-thread.
#pragma omp parallel for num_threads(8)
    for (int i = 1; i < g_iHight - 1; ++i)
    {
        for (int j = 1; j < g_iWidth - 1; ++j)
        {
            g_uiMatrix[0][0] = CountNumberOfSetBits(_ptrMatrix[(g_iWidth * (i - 1)) + (j - 1)]);
            g_uiMatrix[0][1] = CountNumberOfSetBits(_ptrMatrix[(g_iWidth * (i - 1)) + j]);
            g_uiMatrix[0][2] = CountNumberOfSetBits(_ptrMatrix[(g_iWidth * (i - 1)) + (j + 1)]);
            g_uiMatrix[1][0] = CountNumberOfSetBits(_ptrMatrix[(g_iWidth * i) + (j - 1)]);
            g_uiMatrix[1][1] = CountNumberOfSetBits(_ptrMatrix[(g_iWidth * i) + j]);
            g_uiMatrix[1][2] = CountNumberOfSetBits(_ptrMatrix[(g_iWidth * i) + (j + 1)]);
            g_uiMatrix[2][0] = CountNumberOfSetBits(_ptrMatrix[(g_iWidth * (i + 1)) + (j - 1)]);
            g_uiMatrix[2][1] = CountNumberOfSetBits(_ptrMatrix[(g_iWidth * (i + 1)) + j]);
            g_uiMatrix[2][2] = CountNumberOfSetBits(_ptrMatrix[(g_iWidth * (i + 1)) + (j + 1)]);

            bool bRes = isMagicSquare();
            if (bRes)
            {
                //std::cout << "Center of magic square in the [" << i << "," << j << "]\n";
            }
        }
    }
}
```

After running the new version of ConsoleApplication1.exe under VTune Amplifier, we may see something like this: VTune Amplifier shows that the total elapsed time of the program was reduced from 36.756 to 9.479 seconds - almost 4 times faster than the first version.

### Optimization by using Intel® Cilk™ Plus

Intel Cilk Plus is an extension of C/C++ that simplifies the development of applications for parallel tasks. The extension originated in 1994 in the MIT Computer Science laboratory (http://supertech.csail.mit.edu/cilk/); in 2009 the technology was acquired by Intel. Today it offers a quick and easy way to harness the power of both multi-core and vector processing. This tool provides a simple yet surprisingly powerful model for parallel programming, while its runtime and template libraries offer a well-tuned environment for building parallel applications.

Let's look at how Intel Cilk Plus can help us optimize the code in the FindOutCoordinatesOftheCenterOfTheMagicSquares function.

#### I. Customizing Development Settings in Microsoft* Visual Studio

First of all we need to configure our project to use the Intel® C++ compiler. Let's select "Use Intel C++" from the Project menu.
This action modifies the solution file, so you may be asked to check out the .sln file (if you use a source control system). In the Confirmation dialog box press OK. "Intel C++ 15.0" is displayed beside the project name when the project is set to use Intel C++.

#### II. Connect the header files

The project's source code needs the Cilk Plus header, so we should include: #include <cilk/cilk.h>

#### III. Adding 'cilk_for'

cilk_for is a keyword for parallelizing loops: a replacement for the normal C/C++ 'for' loop that permits the loop iterations to run in parallel.
The general cilk_for syntax is:

```
cilk_for (declaration and initialization;
          conditional expression;
          increment expression)
```

The following rules and requirements apply to cilk_for:

• There is exactly one loop control variable, which in C++ (though not in C) MUST be declared in the cilk_for statement itself, rather than before it.
• The conditional expression MUST use one of the following comparison operators: <, <=, !=, >=, >
• The termination expression and loop control variable can appear on either side of the comparison operator, but the loop control variable cannot occur within the termination expression. The termination expression value must not change from one iteration to the next.
• The increment expression adds to or subtracts from the loop control variable using one of the following supported operations: +=, -=, ++, -- (the amount of the increment must not change from one iteration to the next).
• The body of a cilk_for MUST NOT modify the loop control variable, because the loop body executes in parallel.

Now let’s parallelize the outer loop using cilk_for. Take a look at the function, FindOutCoordinatesOftheCenterOfTheMagicSquares(). All we need to do is replace the 'for' with 'cilk_for':

```
void FindOutCoordinatesOftheCenterOfTheMagicSquares(unsigned char * _ptrMatrix)
{
    // Note: all workers write to the shared global g_uiMatrix, which is a
    // data race; for correct results the 3x3 matrix should be made per-worker.
    cilk_for (int i = 1; i < g_iHight - 1; ++i)
    {
        for (int j = 1; j < g_iWidth - 1; ++j)
        {
            g_uiMatrix[0][0] = CountNumberOfSetBits(_ptrMatrix[(g_iWidth * (i - 1)) + (j - 1)]);
            g_uiMatrix[0][1] = CountNumberOfSetBits(_ptrMatrix[(g_iWidth * (i - 1)) + j]);
            g_uiMatrix[0][2] = CountNumberOfSetBits(_ptrMatrix[(g_iWidth * (i - 1)) + (j + 1)]);
            g_uiMatrix[1][0] = CountNumberOfSetBits(_ptrMatrix[(g_iWidth * i) + (j - 1)]);
            g_uiMatrix[1][1] = CountNumberOfSetBits(_ptrMatrix[(g_iWidth * i) + j]);
            g_uiMatrix[1][2] = CountNumberOfSetBits(_ptrMatrix[(g_iWidth * i) + (j + 1)]);
            g_uiMatrix[2][0] = CountNumberOfSetBits(_ptrMatrix[(g_iWidth * (i + 1)) + (j - 1)]);
            g_uiMatrix[2][1] = CountNumberOfSetBits(_ptrMatrix[(g_iWidth * (i + 1)) + j]);
            g_uiMatrix[2][2] = CountNumberOfSetBits(_ptrMatrix[(g_iWidth * (i + 1)) + (j + 1)]);

            bool bRes = isMagicSquare();
            if (bRes)
            {
                //std::cout << "Center of magic square in the [" << i << "," << j << "]\n";
            }
        }
    }
}
```

Let's rebuild the project, run it again, and analyze our program under Intel VTune Amplifier to evaluate the parallelization in our code and understand how effectively the application works now.

Notice: If you receive the error message "The program can't start because cilkrts20.dll is missing from your computer. Try reinstalling the program to fix this problem." when you try to start the program, make sure the following paths have been added to the Path variable in the "System variables" section of the Windows Environment Variables:

• C:\Program Files (x86)\Intel\System Studio 2015 for Windows.0.012\redist\intel64\compiler
• C:\Program Files (x86)\Intel\System Studio 2015 for Windows.0.012\redist\ia32\compiler
• C:\Program Files (x86)\Intel\System Studio 2015 for Windows.0.012\bin\ia32

If the above paths are not in the Path variable, open cmd.exe with administrator privileges and add them. After running the new version of ConsoleApplication1.exe under VTune Amplifier, we may see something like this: the total elapsed time of the app is now 5.115 seconds, a big improvement over where we started.
It is more than 7 times faster than the initial version (36.756 s)! We can also see that the parallelization is utilized efficiently: the initial version ran on a single thread, while the new version uses a total of 8 threads.

### Optimization Notice

Intel warns in revision #20110804 that:
"Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice."

### Summary

• In this article we have examined some basic principles of using the VTune Amplifier tool. We have demonstrated how to analyze code to identify bottlenecks and to measure the running time of applications.
• After optimizing the application with OpenMP, the total elapsed time of the program was reduced from 36.756 to 9.479 seconds, which is almost 4 times faster. Nice!
• Adding the cilk_for keyword to parallelize the calls to the kernel showed good scaling and resulted in a further speedup on the same system. After optimizing with Intel Cilk Plus, the total elapsed time was reduced from 36.756 to 5.115 seconds - more than 7 times faster, a really great result and an illustration of the power and simplicity of Intel Cilk Plus.

For more complete information about compiler optimizations, see our Optimization Notice.