Distributed DAAL vs ScaLAPACK in MKL

It looks like some of the DAAL analysis routines are also found in MKL. For instance, QR decomposition in DAAL has the same functionality as p?geqrf in MKL (distributed QR decomposition). Is there any guidance on the differences between them: do they use the same back-end, and when would I choose the DAAL implementation over MKL?


Hi Andrey,

Intel(R) Data Analytics Acceleration Library (Intel(R) DAAL) provides building blocks that can be used at all stages of the data analytics flow, from data reading through modeling and decision making. The algorithmic component of the library includes algorithms for data analysis, model training, and prediction, in particular summary statistics, regression, and classification algorithms. The library provides matrix factorizations because those blocks are important for data analysis. While the same matrix factorizations are available in Intel(R) Math Kernel Library (Intel(R) MKL), there are differences between the Intel DAAL and Intel MKL QR implementations:

  • In the analytics area, data sets are generally heterogeneous, and Intel DAAL supports such data via its Data Management component, while Intel MKL supports homogeneous data/matrices. So, when your application requires the factorization of a heterogeneous data set, you can call Intel DAAL QR factorization directly, without an intermediate data-conversion step.
  • Some analytics applications require processing of streaming data, which arrives in blocks over time. Intel DAAL QR factorization supports this scenario: once the next data block is available, you provide it to the library, which then updates the factors using only this block, without making multiple passes over the whole (growing) data set.
  • Intel DAAL QR factorization supports distributed computation. However, internally it does not rely on any communication technology such as MPI. The library provides APIs for use on the compute nodes; transferring data and intermediate results between nodes using a specific communication technology is the responsibility of the user. The documentation describes the whole distributed computational flow for QR factorization, and MPI/Spark samples are provided for it.
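To make the block-wise flow above concrete, here is a minimal NumPy sketch of the general tall-and-skinny QR merge idea (a generic illustration of the technique, not Intel DAAL's actual internals or API): each row-block yields a local p x p triangular factor, and a QR of the stacked local factors recovers the R of the full matrix. The streaming case applies the same update as each new block arrives.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 4

# Two row-blocks of a tall matrix, as they might live on two nodes
# (or arrive as two streaming chunks).
A1 = rng.standard_normal((50, p))
A2 = rng.standard_normal((60, p))
A = np.vstack([A1, A2])

# Step 1 (local, per block): reduced QR of each block yields a p x p R.
_, R1 = np.linalg.qr(A1)
_, R2 = np.linalg.qr(A2)

# Step 2 (merge): a QR of the stacked R factors gives the R of the
# whole matrix. In the streaming case the same update is applied per
# new block: R = qr(vstack([R, new_block])).
_, R = np.linalg.qr(np.vstack([R1, R2]))

# R matches the R of the full matrix up to row signs; R^T R is unique
# and equals A^T A, which is an easy way to check the merge.
_, R_full = np.linalg.qr(A)
assert np.allclose(R.T @ R, R_full.T @ R_full)
```

Only the small p x p factors need to travel between nodes in step 2, which is what makes the user-managed communication in DAAL's distributed mode cheap.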

Intel DAAL relies on Intel MKL kernels where possible to bring additional performance gains to analytics applications.

Please let me know if this answers your question.



Great, that is exactly what I needed. Thank you, Andrey!


I am trying to understand how to perform a QR decomposition in a distributed environment. I've read your explanation, the description of the algorithm, and the sample provided. There are some questions that I haven't been able to resolve. I think my questions are so closely related to your discussion that it is not worth creating a new post.

In the description of the algorithm, if I understood it well, the matrix is decomposed by rows and the blocks are portions of the total matrix. For example, having two blocks of size n_1 x p and n_2 x p means that the original matrix was (n_1 + n_2) x p. Knowing that, should I assume that there is no way to execute a distributed QR decomposition of square matrices? (The description says that each n_X has to be greater than or equal to p.) Is this an algorithm that is only valid for thin matrices? I understand that this is the most common case in big data, but even though I am more interested in more typical "MKL" behaviour, I find the way DAAL interacts with data, supporting Cassandra, HDFS, and heterogeneous inputs, really useful.
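As a quick illustration of why the n_X >= p constraint matters (plain NumPy, not the DAAL API): a reduced QR of a block with fewer rows than columns produces only an n_X x p factor rather than a square triangle, so a merge step could never reduce the stacked local factors to a single p x p R.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 5

# A block with at least p rows yields a square p x p triangular factor,
# which is what the combine step needs.
tall = rng.standard_normal((8, p))
_, R_tall = np.linalg.qr(tall)      # shape (5, 5)

# A block with fewer than p rows yields only an n_X x p factor, so the
# per-block output no longer has a fixed p x p shape.
short = rng.standard_normal((3, p))
_, R_short = np.linalg.qr(short)    # shape (3, 5)

print(R_tall.shape, R_short.shape)
```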

Finally, you said that the library provides documentation describing the whole distributed computational flow in QR factorization and provides MPI/Spark samples for it. Are there any available examples using PyDAAL (I only found examples in Java)? Here I found an example using Spark. I assume that the configuration used is the one set in Spark, so no additional parameters have to be set. Nevertheless, I am more interested in running PyDAAL distributed examples (in particular, in a cluster with a queue system). Is there any example of how to launch a distributed execution using PyDAAL?

Thank you very much in advance.

Hello, thanks for your comments and questions.

Yes, the present version of Intel DAAL provides QR decomposition for tall-and-skinny matrices, where the number of rows is much bigger than the number of columns. At the same time, we are considering ways to extend QR support to other matrix shapes by analyzing possible use cases. If you could provide extra details on your typical matrix sizes, data type (or types, as you mention that heterogeneous input might be important), the dense or sparse nature of the data, and the communication technology (say, MPI*), it would help us better understand your use case.

And for samples, including one that demonstrates the Python API for the library, please have a look at https://software.intel.com/en-us/intel-daal-support/code-samples.

Let us know if you need more details or any help from our side.




Hi Andrey,

Thanks a lot for your fast and clear response.

I'm trying to perform a distributed QR decomposition of a square dense matrix of double-precision real numbers. I'm just in the research phase, so I don't have a concrete use case.

Only one last question. In the example shown here there is no reference to either Spark or MPI. If I understood the example and your previous comment correctly, this is just a generic example, and the developer should add the communications and handle them by hand, isn't it? The user is responsible for performing the communications.

If I am wrong, is there a way to launch this example directly through some kind of configuration, indicating where the compute nodes are located?



Hi Ramon,

Yes, this is a design decision for Intel DAAL: internally, the library does not rely on any communication technology. It provides APIs to be used on the nodes of the cluster to compute partial results and to combine the results for a given algorithm. This makes it possible to use the library with different communication technologies, including MPI*, Spark*, etc. And yes, it is the responsibility of the user to perform the communication. The example you refer to is intended to give an initial idea of the distributed QR API, while the qr_fast_distributed_mpi.py sample available at the link I provided earlier demonstrates how to use distributed QR together with MPI*.

Please let me know if this clarifies things.



