Extending R with Intel MKL


The Using Intel MKL with R article discusses building R with the Intel® Math Kernel Library (Intel® MKL) BLAS and LAPACK to improve the performance of those parts of R that rely on matrix computations.  Since the BLAS and LAPACK routines are a long-standing set of functions with a standard API, the R build scripts have a built-in facility to use these. Setting R to use Intel MKL can provide significant performance improvement on parts of R, but there are other functions in Intel MKL that can provide great performance as well. This article will give you a quick tutorial on how to extend R by writing a wrapper for a function in Intel MKL and call it from your R script. 

The wrapper

Let's begin by selecting a simple function in Intel MKL that is not a part of the BLAS or LAPACK. For this exercise we'll choose the vdPow function which takes two arrays, x and y, and creates a new array z whose nth element is x[n] to the y[n] power.


#include "R.h"

#include "Rdefines.h"

	#include "Rinternals.h"

	#include "mkl.h"

SEXP mkl_vdpow(SEXP N, SEXP X, SEXP Y)


	  SEXP Z;

	  int *n;

	  int nn;

	  double *x, *y, *z;


	  /* Convert input arguments to C types */









	  /* Allocate memory to store results */

	  nn = *n;




	  vdPow(nn, x, y, z);




	  return Z;


Building the wrapper

On Windows this is easily done by obtaining the Windows package and Rtools from CRAN. I downloaded R 2.15.2 and Rtools version and then created a file to set my environment as follows:

SET R_TOOLS_ROOT="C:\Program Files\R\Rtools"


	SET MINGW64_CMPLR_ROOT=%MINGW64_ROOT%\i686-w64-mingw32

	SET PATH=%MINGW64_ROOT%\bin;%MINGW64_CMPLR_ROOT%\bin;%R_TOOLS_ROOT%\bin;%MINGW64_ROOT%\libexec\gcc\i686-w64-mingw32\4.6.3;%PATH%


I then build using the following commands:

SET R_HOME="C:\Program Files\R\R-2.15.2"

	SET MKL_ROOT="C:\Program Files (x86)\Intel\Composer XE 2013"

	c:\"program files"\r\rtools\gcc-4.6.3\bin\gcc -c -O2 -Wall  -std=gnu99 -I%R_HOME%\include -I%MKL_ROOT%\mkl\include pow_wrp.c -o pow_wrp.o

	c:\"program files"\r\rtools\gcc-4.6.3\bin\gcc -shared -Wl,-soname,pow_wrp.so -o pow_wrp.so pow_wrp.o %R_HOME%\bin\i386R.dll %MKL_ROOT%\mkl\lib\ia32\mkl_rt.lib %MKL_ROOT%\compiler\lib\intel64\libiomp5md.lib

Calling the function from R

Here then is a simple script that calls the function we've just built


mkl_pow <- function(n, x, y) .External("mkl_vdpow", n, x, y)

n <- 1000

	x <- runif(n, min=2, max=10)

	y <- runif(n, min=-2, max=-1)

z <- mkl_pow(n, x, y)


Let's see what sort of performance difference the use of Intel MKL can make by comparing this new power function with the simple means available within R. To do the similar operation in R we could use a loop to perform each scalar operation. The following is the result of running the interactively in the 32-bit RGui that comes with the R-2.15.2 pre-built Windows package on my laptop (Intel MKL 11.0 update 2, R-2.15.2, 2-core Intel® Core™ i5-540M processor (2.53GHz) 4GB RAM, Microsoft Windows 7):

> dyn.load("c:/a/collateral/kb/r/files/pow_wrp.so")

	> mkl_pow <- function(n, x, y) .Call("mkl_vdpow", n, x, y)

	> n <- 1000000

	> x <- runif(n, min=2, max=10)

	> y <- runif(n, min=-2, max=-1)

	> start <- proc.time()

	> z <- mkl_pow(n, x, y)

	> end1 <- proc.time() - start

	> end1

	user system elapsed

	0.03 0.02 0.05

	> i <- n

	> start <- proc.time()

	> repeat{ z[i] <- x[i]^y[i] i <- i - 1 if (i==0) break() }

	> end2 <- proc.time() - start

	> end2

	user system elapsed

	5.49 0.03 5.17 

For this very large, size the Intel MKL function makes a large difference. The loop approach takes 5.49 seconds while the Intel MKL function takes 0.03 seconds. For smaller problems the performance will be less, but for systems with higher core-counts, compute intensive functions can show even greater speed-up. 

Given the potential for large performance improvements, it may be worthwhile finding out the functions available in Intel MKL. Below is the diagram presented on the Intel MKL product site and is a good map of the type of functions that one will find in the library.


[[{"type":"media","view_mode":"media_large","fid":"165976","attributes":{"height":"250","width":"480","class":"media-image media-element file-media-large"},"link_text":null}]]


For more complete information about compiler optimizations, see our Optimization Notice.



The post is just an example on how you could use MKL beyond BLAS and LAPACK from R.

And to prove that MKL performance could be much better comparing to pure R here are the results of more interesting comparison of built-in R 3.0.1 vector power operation with MKL 11.3 Beta.

n <- 10000000
x <- runif(n, min=2, max=10)
y <- runif(n, min=-2, max=-1)
start_time <- proc.time()
z_mkl <- mkl_pow(n, x, y)
total_time_mkl <- proc.time() - start_time

start_time <- proc.time()
z_r <- x^y
total_time_r <- proc.time() - start_time

i <- n
start <- proc.time()
start_time <- proc.time()
repeat{ z_r_loop[i] <- x[i]^y[i]; i <- i - 1; if (i==0) break() }
total_time_r_loop <- proc.time() - start_time

print(all(z_r == z_r_loop))
print(all(abs(z_mkl - z_r) < 10.0^-16))

The results of the script running on Haswell, 4 cores, Red Hat Enterprise Linux Server 6.4:

> print(total_time_mkl)
   user  system elapsed
  0.070   0.018   0.034
> print(total_time_r)
   user  system elapsed
  1.052   0.031   0.271
> print(total_time_r_loop)
   user  system elapsed
 25.257   0.034  25.058
> print(all(z_r == z_r_loop))
[1] TRUE
> print(all(abs((z_mkl - z_r)/z_r) < 10.0^-10))
[1] TRUE

R takes 0.271 seconds when R+MKL takes 0.034 seconds.

andersgsjogren's picture

It might be that the example is a bit contrived, but the real R way of performing powers of a vector is certainly not to use a loop, but of course to use vector operations. This yields a speedup similar to the one presented. The interesting comparison would be between mkl_pow and the built in element-wise vector power operation.

n <- 1000000
x <- runif(n, min=2, max=10)
y <- runif(n, min=-2, max=-1)
start <- proc.time()
z1 <- x^y
end1 <- proc.time() - start 

i <- n
start <- proc.time()
repeat{ z[i] <- x[i]^y[i]; i <- i - 1; if (i==0) break() }
end2 <- proc.time() - start

cat("Vector version\n")
cat("Loop version\n")
cat("Identical results?\n")

Yields for me in vanilla RStudio on my MacBook Pro Retina 2012:

Vector version
   user  system elapsed 
  0.040   0.002   0.041 
Loop version
   user  system elapsed 
  3.737   0.018   3.761 
Identical results?
[1] TRUE


Add a Comment

Have a technical question? Visit our forums. Have site or software product issues? Contact support.