Published: 11/01/2017, Last Updated: 09/15/2017

This article presents use cases and provides examples which make use of the Intel® Math Kernel Library (Intel® MKL). Arduino Create*, a cloud-based IDE and UP2*, a single board computer based on Intel’s Apollo Lake platform, are used for the examples. The use cases are intended to expose the user to the capabilities provided by the Intel® MKL and the examples provide short code samples to implement the use cases.

**Note**: any hardware device that meets the Intel® MKL requirements can be used as the target hardware.

- UP2 (recommended) or
- Hardware device containing an Intel processor with SSE2 (Streaming SIMD Extensions 2) support

See Intel® MKL requirements for supported hardware.

- Download the Intel® MKL
- Create an Arduino Create* account

Software applications that require mathematical functions such as a matrix multiplication can achieve an increase in performance (faster response times) by leveraging the Intel® MKL. See https://software.intel.com/content/www/us/en/develop/tools/math-kernel-library.html/features/benchmarks for a complete review and comparison of Intel® MKL benchmarks for Intel® Core™ i7, Intel® Xeon® and Intel® Xeon Phi™ processors.

With an offering of hundreds of functions, developers can choose which areas make the most sense to optimize based on the requirements of the application. *Figure 1* below shows the components of the Intel® MKL, two of which are highlighted and are the focus of this article.

**Figure 1 **– *Components offered in the Intel® MKL*

Component | Name | Description |
---|---|---|
BLAS | Basic Linear Algebra Subprograms | GEMM - General Matrix Multiplication |
LAPACK | Linear Algebra Package | SVD - Singular Value Decomposition |

**Table 1** - *MKL Components referenced in this article*

The performance improvements gained by using the Intel® MKL can only be achieved on an IA-32 or Intel® 64 architecture that supports, at a minimum, the SSE2 instruction set. This includes most CPUs released since the Pentium® 4 processor.

The Intel® MKL basic requirements can be found here: https://software.intel.com/content/www/us/en/develop/articles/intel-mkl-111-system-requirements.html

This section provides an overview of three use cases related to data compression and image manipulation. The code samples section provides the code required to implement these examples.

Being closest to the origin of data collected in the real world, Internet of Things (IoT) devices are ideal candidates for response time optimization. Two ways to improve response time on an edge device are to 1) execute the code on the device itself or 2) reduce the size of the data that needs to be transferred for analysis.

Making a decision on the device itself allows for the quickest response time because it avoids forwarding data to another service or the cloud for analysis and awaiting a response. Alternatively, there are scenarios where it is not possible to compute at the edge, in which case the data must be forwarded to a separate system for analysis. Reducing the size of the data by compressing it helps to improve response time in those situations. Finally, filtering and scientific functions are required at the edge more than ever, so giving developers the ability to quickly compute, transform, and derive results is critical for near real-time responses.

Two example areas that can benefit from a performance boost at the edge include:

- Image manipulation and analysis
- Data compression

Because images can be represented in memory in a matrix form, they can be manipulated and analyzed using familiar and powerful linear algebra algorithms. A few types of transformations and their real world applications include: Scaling (Matrix Multiplication), Translations (Matrix Addition and Transpose), and Compression using Singular Value Decomposition.

The Intel® Developer Zone offers many resources targeted to assist developers with the Intel® MKL, including a nice overview of matrix fundamentals.

Using matrix addition and subtraction, the elements of a matrix can be shifted and filtered individually. In the field this can be used to reduce contrast, change colors, or shift pixel values entirely. Below is an example that adds 100,000 to every element of a grayscale image whose value is below 100,000.

**Figure 4 **–* Before and after effects of adding 100,000 to each pixel that has a value < 100,000*

The values in a matrix can be transposed, or flipped around the main diagonal, to change the orientation of an image. Here is an example output of transposing a 200x200 grayscale bitmap image. Look closely and you will notice it is not simply rotated, but instead reflected across the diagonal.

**Figure 5** – *Before and after Matrix Transform of 24 bit grayscale image*

A simple grayscale bitmap image that is 200x200 will take up roughly 40K of memory: 200*200 = 40,000 pixels, and at one byte per pixel that is 40,000 bytes, not including the one kilobyte header. (If each pixel were stored in 3 or 4 bytes, the same image would take 120,000 or 160,000 bytes.)

The Singular Value Decomposition theorem states that any MxN matrix can be factored into three matrices which, when truncated and recombined, provide an approximation of the original matrix that is often acceptable for the intended purpose. The MKL has a built-in routine for computing the SVD of a matrix, but here is a quick overview of how it works.

Any given matrix A that represents an image of MxN pixels (where M = rows and N = columns) can be represented by the product of three matrices, A = U S V^{T}: the columns of U are the left singular vectors, the rows of V^{T} are the right singular vectors, and S is a diagonal matrix of real singular values.

To calculate the SVD, the eigenvectors of AA^{T} and A^{T}A form the columns of U and V respectively. Similarly, the singular values in S are the square roots of the nonzero eigenvalues of AA^{T} or A^{T}A. Once calculated, the three matrices can be truncated by keeping only the first k singular values, together with the matching k columns of U and k rows of V^{T}, where k is less than min(M, N). Multiplying the truncated matrices together yields a close approximation of the original. Figure 5.2 below shows SVD used in the real world to reduce an image size by up to 90%.

**Figure 5.2** – *Before and after Matrix compression of 24 bit grayscale image from 40K to 4K*

Arduino Create is a cloud-based IDE for programming IoT devices. Complete with a management dashboard, users can now remotely manage and program their IoT devices, pushing code effortlessly as if the devices were directly connected. To get started, visit http://create-intel.arduino.cc/ and create an account.

The examples in this section use the UP2 hardware running the Ubuntu 16.04 operating system with Intel® MKL 2017. However, the overview applies to any compatible, supported stack.

**Figure 6 **– *Screenshot of Arduino Create devices dashboard*

Before getting started, the Intel® MKL libraries will need to be installed on the board and configured properly.

To install the Intel® MKL on Ubuntu, visit http://software.intel.com/mkl and follow the download instructions. Registration is required prior to downloading, but it is completely free. Select the Intel® Performance Libraries for the operating system you will be working with, along with the latest version. Next, click on the Intel® MKL link to start the download.

**Figure 7** – *Screenshot of Intel’s MKL download options*

After downloading, install the library by unpacking the archive, running the installer, and setting the environment variables. The default installation folder is /opt/intel/mkl/.

`tar -zxvf [Name of MKL file].tgz`

`cd [unpacked folder]`

`sudo ./install.sh` OR `sudo ./install_GUI.sh` (if running on a desktop GUI)

`cd [install folder]` (Default is /opt/intel/mkl/bin/)

`source ./mklvars.sh intel64` (This script sets the environment variables for your platform. It must be sourced into the current shell rather than run with sudo, otherwise the variables are lost when the script exits.)

**Figure 8** – *Installation instructions after downloading Intel® MKL for Linux*

The Intel® MKL comes with many examples that help developers get up and running as fast as possible. Explore the /examples/ subfolder under the default installation folder (typically /opt/intel/mkl) and modify code to suit your specific requirements. The code examples outlined in this document have been taken from the Intel® MKL default examples and migrated to work with the Arduino style program structure. Migration to Arduino from C simply means ensuring that the standard setup() and loop() functions are available, and also that the <ArduinoMKL.h> is referenced, which is a header wrapper for the Intel® MKL libraries. Note that any libraries referenced in code must be locatable on the Arduino cloud during compile time. Table 2 below provides an example of the common actions taken to migrate example code from the Intel® MKL into Arduino*.

Intel® MKL Example C source | Arduino Migrated source |
---|---|
`#include "MKLSpecificHeader.h"` `int main(int argc, char **argv) { func(); }` | `#include "ArduinoMKL.h"` `void setup() { func(); }` `void loop() { }` |

**Table 2** - *Example code structure migration required for MKL C to Arduino*

Now that the Intel® MKL is installed, return to the Arduino Create Cloud IDE (URL) and run a sample application that leverages the MKL. On the navigation menu, select Libraries and then search for ‘MKL’. Open and explore the mkl-lab-solution example, which demonstrates simple matrix multiplication using DGEMM (Double Precision General Matrix Multiplication).

**Figure 9** –* Screenshot of Arduino Create Libraries with search requested for MKL*

Next, open the Serial Monitor window by clicking Monitor on the navigation menu on the left. This will bring up the familiar debugging window available in the standard Arduino IDE that allows interfacing with the program as well as printing out debug statements. The Arduino Create IDE should now show both the source code and the Serial Monitor window as shown in Figure 10.

**Figure 10 **– *Arduino Create Editor and Monitor window*

At this stage, the program can either be Verified or Uploaded directly to the board. Figure 11 shows an example of the mkl-lab-solution ready for upload to a device named ‘Up2Ubuntu’. During this process, the MKL is actually compiled in the cloud as part of the verification process. The MKL libraries are dynamically linked and referenced when executed on the target device.

**Figure 11** – *Arduino Create Upload sketch to device*

As shown in Figure 12, the bottom pane will show the compiler output concluded by a results summary that indicates program size and percentage of storage space used.

**Figure 12** –* Build output from Arduino Create shows the Process ID number (PID)*

By logging into the target platform, the process ID can be verified using `ps -A` and even monitored by running `top -p 2001` (substituting the PID reported in the build output, 2001 in this example).

**Figure 13** –* output of ps –A shows the matching processID is indeed executing*

In the Arduino Create IDE, notice the monitor window is requesting size of the matrices to multiply. Using a value less than seven will show the output of the matrix multiplication, allowing you to manually verify the results if desired. Explore the code and try out different values. Figure 14 shows the output execution of the sample lab solution.

**Figure 14** – *Output of mkl-lab-solution with a matrix less than 7*

Now that we have the basic example working, the code can be modified to examine the performance differences when using standard matrix multiplication and compare the results against the MKL’s DGEMM. We will refactor the example code into a few functions to help with readability, implement a very basic CMM (Classic Matrix Multiplication) algorithm, and provide a testing interface to vary matrix size and number of runs. The code snippet in Figure 18 provides a general guideline to test out the performance differences with redundant code from the mkl-lab-solution omitted.

Intel’s Software Developer Zone offers many resources targeted to assist developers with the MKL, including a nice overview of matrix fundamentals.

The signature of the specific matrix multiplication routine used in this example, cblas_dgemm, is shown below:

`void cblas_dgemm (const CBLAS_LAYOUT Layout, const CBLAS_TRANSPOSE transa, const CBLAS_TRANSPOSE transb, const MKL_INT m, const MKL_INT n, const MKL_INT k, const double alpha, const double *a, const MKL_INT lda, const double *b, const MKL_INT ldb, const double beta, double *c, const MKL_INT ldc);`

**Figure 15** – *MKL BLAS GEMM Routine Signature Definition*

```
// … includes
void setup()
{
  // Let the MKL use every available thread
  max_threads = mkl_get_max_threads();
  printf(" Requesting Intel(R) MKL to use %i thread(s) \n\n", max_threads);
  mkl_set_num_threads(max_threads);
  …
  printf("\n\nEnter matrix size OR -1 to exit");
  scanf("%d", &N);
  MM_Standard();
  MM_Optimized();
}

// C = A * B using the MKL's DGEMM (row-major storage, no transposition)
void Dgemm_multiply(double* a, double* b, double* c, int N)
{
  double alpha = 1.0, beta = 0.0;
  // a is passed before b so the product matches the manual loop in MM_Standard()
  cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, N, N, N, alpha, a, N, b, N, beta, c, N);
}

// Time the MKL-optimized multiplication
void MM_Optimized()
{
  start = clock();
  Dgemm_multiply(a, b, c, N);
  stop = clock();
  time_Optimized = (double)(stop - start) / CLOCKS_PER_SEC;
}

// Time the classic triple-loop multiplication (CMM)
void MM_Standard()
{
  int row, col, k;
  start = clock();
  for (row = 0; row < N; row++) {
    for (col = 0; col < N; col++) {
      double sum = 0.0;  // accumulate locally so stale values in c are not added in
      for (k = 0; k < N; k++) {
        sum += a[N*row+k] * b[N*k+col];
      }
      c[N*row+col] = sum;
    }
  }
  stop = clock();
  time_Manual = (double)(stop - start) / CLOCKS_PER_SEC;
}

void loop() {
  exit(0);
}
```

**Figure 18** -* Sample code snippets used to compare CMM and MKL DGEMM*

Adding two matrices together can be accomplished using the mkl_domatadd() routine. Matrix addition can be applied to images to incorporate interesting effects, as shown in the introductory section. The example code below adds the matrices ‘Matrix_Input’ and ‘Matrix_Fade’ together and stores the output in an array called ‘Matrix_Out’.

`void mkl_domatadd (char ordering, char transa, char transb, size_t m, size_t n, const double alpha, const double * A, size_t lda, const double beta, const double * B, size_t ldb, double * C, size_t ldc);`

**Figure 19** -* MKL BLAS Matrix Addition Routine Signature Definition*

```
// Pseudocode for adding matrices
…
// Row-major storage: the leading dimension of each matrix is its row length (width)
mkl_domatadd ('R', 'N', 'N', height, width, 1.0, Matrix_Input, width, 1.0, Matrix_Fade, width, Matrix_Out, width);
```

**Figure 20** – *Code example for Matrix Addition*

Another example usage of the MKL BLAS is transposing a matrix of data. Transposing a matrix converts the row values to column values, an operation that is foundational to other, more complex linear algebra operations. In the MKL, transposition is bundled into the matrix copy routines, which can copy a matrix while at the same time transposing all or part of it. https://software.intel.com/content/www/us/en/develop/tools/math-kernel-library.html-developer-reference-c-mkl-imatcopy

`void mkl_simatcopy (const char ordering, const char trans, size_t rows, size_t cols, const float alpha, float * AB, size_t lda, size_t ldb);`

**Figure 21** - *MKL BLAS Matrix Copy Signature Definition*

```
Example of using mkl_simatcopy transposition
Source matrix:
----------
1 1 1 1 1 1 1 1 1 1
0 1 0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0
0 0 0 1 0 0 0 0 0 0
0 0 0 0 1 0 0 0 0 0
0 0 0 0 0 1 0 0 0 0
0 0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 0 0 1
-----------
Transposed matrix:
1 0 0 0 0 0 0 0 0 0
1 1 0 0 0 0 0 0 0 0
1 0 1 0 0 0 0 0 0 0
1 0 0 1 0 0 0 0 0 0
1 0 0 0 1 0 0 0 0 0
1 0 0 0 0 1 0 0 0 0
1 0 0 0 0 0 1 0 0 0
1 0 0 0 0 0 0 1 0 0
1 0 0 0 0 0 0 0 1 0
1 0 0 0 0 0 0 0 0 1
```

**Figure 22** – *Output of Transposed matrix example*

```
size_t n=10, m=10; /* rows, cols of source matrix */
float src[]= {
1,1,1,1,1,1,1,1,1,1,
0,1,0,0,0,0,0,0,0,0,
0,0,1,0,0,0,0,0,0,0,
0,0,0,1,0,0,0,0,0,0,
0,0,0,0,1,0,0,0,0,0,
0,0,0,0,0,1,0,0,0,0,
0,0,0,0,0,0,1,0,0,0,
0,0,0,0,0,0,0,1,0,0,
0,0,0,0,0,0,0,0,1,0,
0,0,0,0,0,0,0,0,0,1
};
printf("\nExample of using mkl_simatcopy transposition\n\n");
printf("Source matrix:\n----------\n");
print_matrix(n, m, 's', src);
//Copy matrix and transpose using Row-major order
mkl_simatcopy('R' /* row-major ordering */,
'T' /* A will be transposed */,
n /* rows */,
m /* cols */,
1. /* scales the input matrix */,
src /* source matrix */,
m /* src_lda */,
n /* dst_lda */);
printf("\n-----------\nTransposed matrix:\n");
print_matrix(n, m, 's',src);
```

A final use case worth mentioning is related to data compression. When dealing with large matrices at the edge, it can be very beneficial to reduce the size of a matrix of data. Large matrices can take up a lot of memory, which is a concern for local storage and during transport if the data needs to be sent to another location over lower bandwidth mediums. Using a popular linear algebra theorem, Singular Value Decomposition, a matrix can be significantly reduced in size to help solve these problems at the edge.

The MKL has built-in support for computing the SVD of matrices of both real and complex numbers, and includes example source code in the MKLHOME/examples/lapacke/ folder. As with all examples in this article, the code can be migrated from the C examples deployed with the MKL directly to Arduino through the migration steps outlined in *Table 2*.

The specific routine leveraged in this example can be referenced here: https://software.intel.com/en-us/node/521150. The signature below is the single precision variant. The code in this example calls the double precision variant, LAPACKE_dgesvd, which takes the same arguments with double in place of float.

`lapack_int LAPACKE_sgesvd( int matrix_layout, char jobu, char jobvt, lapack_int m, lapack_int n, float* a, lapack_int lda, float* s, float* u, lapack_int ldu, float* vt, lapack_int ldvt, float* superb );`

**Figure 24 **- *MKL LAPACK General Singular Value Decomposition Signature Definition*

**Figure 25**– *Arduino Create Monitor SVD output*

```
void setup() {
…
info = LAPACKE_dgesvd( LAPACK_ROW_MAJOR, 'A', 'A', m, n, a, lda,
s, u, ldu, vt, ldvt, superb );
/* Check for convergence */
if( info > 0 ) {
printf( "The algorithm computing SVD failed to converge.\n" );
exit( 1 );
}
/* Print singular values */
print_matrix( "Singular values", 1, n, s, 1 );
/* Print left singular vectors */
print_matrix( "Left singular vectors (stored columnwise)", m, n, u, ldu );
/* Print right singular vectors */
print_matrix( "Right singular vectors (stored rowwise)", n, n, vt, ldvt );
….
```

**Figure 26 **– *Code snippet for calling the dgesvd routine*

The Math Kernel Library offered by Intel provides highly optimized math functions and algorithms that are designed and optimized for Intel hardware. For applications that use complex math functions such as matrix algebra or Singular Value Decomposition, and that require faster response times than hand-written code typically provides, the MKL can deliver much quicker results. Whether running in the cloud or at the edge, Intel offers hardware designed to provide the optimizations demanded by math intensive, scientific applications.

A core capability that makes the Intel® MKL enhancements possible is code vectorization through SSE instructions. **Code vectorization** means that a single instruction operates on several data elements at once, packed side by side in the SSE registers, rather than on one element at a time. SSE (Streaming SIMD Extensions, where SIMD stands for Single Instruction, Multiple Data) is an architectural implementation of this idea. Introduced with the release of the Intel® Pentium® III processor in 1999, SSE added eight dedicated 128 bit floating point registers and 70 new instructions, enhancing performance for operations that are replicated across different sets of data.

Another feature leveraged by Intel® MKL is Advanced Vector Extensions (AVX). It is an additional instruction set included in the Intel® Core™ microarchitecture; however, it is not available on Pentium® or Celeron® processors.

To determine the hardware capabilities of a particular CPU, take a look at /proc/cpuinfo:

`cat /proc/cpuinfo | grep 'avx'`

`cat /proc/cpuinfo | grep 'avx2'`

`cat /proc/cpuinfo | grep 'sse'`

**Figure 2** – *Unix commands to determine processor capabilities*

Developers interested in using SSE instructions directly should check out the Intel® Intrinsics Guide. It documents the C APIs for SSE and SSE2, giving a developer direct access to SIMD features without requiring experience with assembly language.

**Note**: this product is completely separate from the Intel® MKL, however it is worth mentioning while on the discussion of SSE.

Intel’s Link Line Advisor is a web-based tool to help developers quickly construct linker flags (for linking libraries) that meet their platform requirements. Your platform requirements are the inputs to the advisor (for example, 64 bit integer interface layer) and the output is a link line and compiler options.

**Figure 3** – *Screenshot of Intel’s Link Line Advisor that helps to build linker flags*

Matt Chandler has been a senior software and applications engineer at Intel since 2004. He is currently working on scale enabling projects for the Internet of Things, including software vendor support for the smart buildings, device security, and retail digital signage vertical segments.

Intel® Math Kernel Library 2017 Install Guide

Intel® Math Kernel Library Cookbook

Intel® Math Kernel Library In-Depth Training

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804