# Extending R with Intel MKL

Introduction

The Using Intel MKL with R article discusses building R with the Intel® Math Kernel Library (Intel® MKL) BLAS and LAPACK to improve the performance of those parts of R that rely on matrix computations.  Since the BLAS and LAPACK routines are a long-standing set of functions with a standard API, the R build scripts have a built-in facility to use these. Setting R to use Intel MKL can provide significant performance improvement on parts of R, but there are other functions in Intel MKL that can provide great performance as well. This article will give you a quick tutorial on how to extend R by writing a wrapper for a function in Intel MKL and call it from your R script.

The wrapper

Let's begin by selecting a simple function in Intel MKL that is not a part of the BLAS or LAPACK. For this exercise we'll choose the vdPow function which takes two arrays, x and y, and creates a new array z whose nth element is x[n] to the y[n] power.

```#include "R.h"

#include "Rdefines.h"

#include "Rinternals.h"

#include "mkl.h"

SEXP mkl_vdpow(SEXP N, SEXP X, SEXP Y)

{

SEXP Z;

int *n;

int nn;

double *x, *y, *z;

/* Convert input arguments to C types */

PROTECT(N = AS_INTEGER(N));

PROTECT(X = AS_NUMERIC(X));

PROTECT(Y = AS_NUMERIC(Y));

n = INTEGER_POINTER(N);

x = NUMERIC_POINTER(X);

y = NUMERIC_POINTER(Y);

/* Allocate memory to store results */

nn = *n;

PROTECT(Z = NEW_NUMERIC(nn));

z = NUMERIC_POINTER(Z);

vdPow(nn, x, y, z);

UNPROTECT(4);

return Z;

}```

Building the wrapper

On Windows this is easily done by obtaining the Windows package and Rtools from CRAN. I downloaded R 2.15.2 and Rtools version 3.0.0.1927 and then created a file to set my environment as follows:

```SET R_TOOLS_ROOT="C:\Program Files\R\Rtools"

SET MINGW64_ROOT=%R_TOOLS_ROOT%\gcc-4.6.3

SET MINGW64_CMPLR_ROOT=%MINGW64_ROOT%\i686-w64-mingw32

SET PATH=%MINGW64_ROOT%\bin;%MINGW64_CMPLR_ROOT%\bin;%R_TOOLS_ROOT%\bin;%MINGW64_ROOT%\libexec\gcc\i686-w64-mingw32\4.6.3;%PATH%

SET LIB=%MINGW64_CMPLR_ROOT%\lib64;%MINGW64_CMPLR_ROOT%\lib;%LIB%```

I then build using the following commands:

```SET R_HOME="C:\Program Files\R\R-2.15.2"

SET MKL_ROOT="C:\Program Files (x86)\Intel\Composer XE 2013"

c:\"program files"\r\rtools\gcc-4.6.3\bin\gcc -c -O2 -Wall  -std=gnu99 -I%R_HOME%\include -I%MKL_ROOT%\mkl\include pow_wrp.c -o pow_wrp.o

c:\"program files"\r\rtools\gcc-4.6.3\bin\gcc -shared -Wl,-soname,pow_wrp.so -o pow_wrp.so pow_wrp.o %R_HOME%\bin\i386R.dll %MKL_ROOT%\mkl\lib\ia32\mkl_rt.lib %MKL_ROOT%\compiler\lib\intel64\libiomp5md.lib```

Calling the function from R

Here then is a simple script that calls the function we've just built

```dyn.load("c:/full_path_or_directory_on_path/pow_wrp.so")

mkl_pow <- function(n, x, y) .External("mkl_vdpow", n, x, y)

n <- 1000

x <- runif(n, min=2, max=10)

y <- runif(n, min=-2, max=-1)

z <- mkl_pow(n, x, y)```

Performance

Let's see what sort of performance difference the use of Intel MKL can make by comparing this new power function with the simple means available within R. To do the similar operation in R we could use a loop to perform each scalar operation. The following is the result of running the interactively in the 32-bit RGui that comes with the R-2.15.2 pre-built Windows package on my laptop (Intel MKL 11.0 update 2, R-2.15.2, 2-core Intel® Core™ i5-540M processor (2.53GHz) 4GB RAM, Microsoft Windows 7):

```> dyn.load("c:/a/collateral/kb/r/files/pow_wrp.so")

> mkl_pow <- function(n, x, y) .Call("mkl_vdpow", n, x, y)

> n <- 1000000

> x <- runif(n, min=2, max=10)

> y <- runif(n, min=-2, max=-1)

> start <- proc.time()

> z <- mkl_pow(n, x, y)

> end1 <- proc.time() - start

> end1

user system elapsed

0.03 0.02 0.05

> i <- n

> start <- proc.time()

> repeat{ z[i] <- x[i]^y[i] i <- i - 1 if (i==0) break() }

> end2 <- proc.time() - start

> end2

user system elapsed

5.49 0.03 5.17 ```

For this very large, size the Intel MKL function makes a large difference. The loop approach takes 5.49 seconds while the Intel MKL function takes 0.03 seconds. For smaller problems the performance will be less, but for systems with higher core-counts, compute intensive functions can show even greater speed-up.

Given the potential for large performance improvements, it may be worthwhile finding out the functions available in Intel MKL

For more complete information about compiler optimizations, see our Optimization Notice.

Top

The post is just an example on how you could use MKL beyond BLAS and LAPACK from R.

And to prove that MKL performance could be much better comparing to pure R here are the results of more interesting comparison of built-in R 3.0.1 vector power operation with MKL 11.3 Beta.

```n <- 10000000
x <- runif(n, min=2, max=10)
y <- runif(n, min=-2, max=-1)
start_time <- proc.time()
z_mkl <- mkl_pow(n, x, y)
total_time_mkl <- proc.time() - start_time

start_time <- proc.time()
z_r <- x^y
total_time_r <- proc.time() - start_time

i <- n
z_r_loop<-numeric(n)
start <- proc.time()
start_time <- proc.time()
repeat{ z_r_loop[i] <- x[i]^y[i]; i <- i - 1; if (i==0) break() }
total_time_r_loop <- proc.time() - start_time

print(total_time_mkl)
print(total_time_r)
print(total_time_r_loop)
print(all(z_r == z_r_loop))
print(all(abs(z_mkl - z_r) < 10.0^-16))```

The results of the script running on Haswell, 4 cores, Red Hat Enterprise Linux Server 6.4:

```> print(total_time_mkl)
user  system elapsed
0.070   0.018   0.034
> print(total_time_r)
user  system elapsed
1.052   0.031   0.271
> print(total_time_r_loop)
user  system elapsed
25.257   0.034  25.058
> print(all(z_r == z_r_loop))
[1] TRUE
> print(all(abs((z_mkl - z_r)/z_r) < 10.0^-10))
[1] TRUE```

R takes 0.271 seconds when R+MKL takes 0.034 seconds.

It might be that the example is a bit contrived, but the real R way of performing powers of a vector is certainly not to use a loop, but of course to use vector operations. This yields a speedup similar to the one presented. The interesting comparison would be between mkl_pow and the built in element-wise vector power operation.

```n <- 1000000
x <- runif(n, min=2, max=10)
y <- runif(n, min=-2, max=-1)
start <- proc.time()
z1 <- x^y
end1 <- proc.time() - start
end1

i <- n
z<-numeric(n)
start <- proc.time()
repeat{ z[i] <- x[i]^y[i]; i <- i - 1; if (i==0) break() }
end2 <- proc.time() - start
end2

cat("Vector version\n")
print(end1)
cat("Loop version\n")
print(end2)
cat("Identical results?\n")
print(all(z1==z))```

Yields for me in vanilla RStudio on my MacBook Pro Retina 2012:

```Vector version
user  system elapsed
0.040   0.002   0.041
Loop version
user  system elapsed
3.737   0.018   3.761
Identical results?
[1] TRUE```