Recent posts
https://software.intel.com/en-us/recent/943978
What performance should I expect from the following code
https://software.intel.com/en-us/forums/intel-math-kernel-library/topic/472140
<p>Consider the following two pieces of code:</p>
<p>/* Perform LU factorization and store in DSS_handle */<br />for(k = 0; k < N; k++){<br />gettimeofday(&stTime, NULL);<br />//DSS solver options<br />MKL_INT solOpt = (MKL_DSS_DEFAULTS | MKL_DSS_REFINEMENT_OFF) | MKL_DSS_TRANSPOSE_SOLVE;<br />MKL_INT nRhs = 3;<br />dss_solve_real(DSS_handle, solOpt, bufferRHS, nRhs, bufferX3);<br />dssSolCnt++;<br />gettimeofday(&endTime, NULL);<br />dssSolTime += (double)(endTime.tv_sec*1000000 + endTime.tv_usec - stTime.tv_sec*1000000 - stTime.tv_usec);<br />/* Do some other things */<br />}</p>
<p>For this code, dssSolTime, which represents the time required to perform the forward and backward solutions, is 19.87 sec for a 3408 * 3408 matrix.</p>
<p>Now, if I do the same calculations sequentially using the following code,</p>
<p>/* Perform LU factorization and store in DSS_handle */<br />for(k = 0; k < N; k++){<br />gettimeofday(&stTime, NULL);<br />//DSS solver options<br />MKL_INT solOpt = (MKL_DSS_DEFAULTS | MKL_DSS_REFINEMENT_OFF) | MKL_DSS_TRANSPOSE_SOLVE;<br />MKL_INT nRhs = 1;<br />dss_solve_real(DSS_handle, solOpt, bufferRHS, nRhs, bufferX3);<br />dss_solve_real(DSS_handle, solOpt, bufferRHS+numOfEqs, nRhs, bufferX3+numOfEqs);<br />dss_solve_real(DSS_handle, solOpt, bufferRHS+2*numOfEqs, nRhs, bufferX3+2*numOfEqs);<br />dssSolCnt++;<br />gettimeofday(&endTime, NULL);<br />dssSolTime += (double)(endTime.tv_sec*1000000 + endTime.tv_usec - stTime.tv_sec*1000000 - stTime.tv_usec);<br />/* Do some other things */<br />}</p>
<p>it completes the computations much faster, and dssSolTime is 2.04 sec for the same matrix (almost 10 times faster than when I ask dss_solve_real to solve for all right-hand-side vectors at once).</p>
<p>I assumed that dss_solve_real is smart enough to create three threads to solve for all right-hand-side vectors simultaneously. Therefore, I expected the first code to be three times faster than the second. But the huge performance degradation implies that I may be missing something here. So I would appreciate it if you could let me know whether or not dss_solve_real can solve for three right-hand-side vectors in parallel. Also, kindly let me know what I should logically expect from these codes and which one should be faster.</p>
<p>Thanks</p>
Wed, 04 Sep 13 15:35:35 -0700 Pouya Z. 472140
dss_solve_real takes more time to solve a linear system
https://software.intel.com/en-us/forums/intel-math-kernel-library/topic/472066
<p>I need to solve a system of linear equations with three right-hand-side vectors. Initially, I was using the sequential version of MKL (linking with "libmkl_sequential.a") and solving for each rhs vector sequentially as:</p>
<p>dss_solve_real(DSS_handle, solOpt, rhs1, 1, x1);</p>
<p>dss_solve_real(DSS_handle, solOpt, rhs1 + numOfVars, 1, x1 + numOfVars);</p>
<p>dss_solve_real(DSS_handle, solOpt, rhs1 + 2*numOfVars, 1, x1 + 2*numOfVars);</p>
<p>where numOfVars represents the number of variables.</p>
<p>Then, I decided to ask dss_solve_real to solve for all rhs vectors at once, and I assumed that it would lead to roughly a 3 times improvement. So, I linked the code with "libmkl_intel_thread.a" and used the following code:</p>
<p>dss_solve_real(DSS_handle, solOpt, rhs1, 3, x1);</p>
<p>To my surprise, the timing is very weird. The sequential version takes 0.548 sec, while solving for all rhs vectors at once takes 5.024 sec, which is almost 10 times longer than the sequential version.</p>
<p>I feel there is something wrong here and I may need to set some environment variables. So, please let me know if you have had a similar experience.</p>
<p>Any help is appreciated.</p>
Tue, 03 Sep 13 16:05:06 -0700 Pouya Z. 472066
Which Routine to Use for Sparse Matrix-Sparse Vector Multiplication
https://software.intel.com/en-us/forums/intel-math-kernel-library/topic/419414
<p>I have a huge matrix with very few non-zero elements. Some of the columns and rows may also be completely zero. This matrix should be multiplied by a very long vector which has only a few non-zero elements. I know that mkl_?cscmv performs sparse matrix-vector multiplication, but apparently only the matrix can be sparse. I am wondering if there is any MKL routine to calculate the product of such a matrix and vector.</p>
<p>Thanks in advance for your help</p>
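<p>For reference, the product described above can be written by hand when the matrix is stored in CSC form: only the columns selected by the non-zero entries of the vector need to be visited. The following is a minimal sketch, not an MKL routine; the function name, the 0-based indexing, and the (index, value) representation of the sparse vector are all assumptions made for illustration.</p>

```c
#include <assert.h>
#include <stddef.h>

/* y += A * x, with A in CSC form (0-based indices) and x sparse.
   col_ptr has n+1 entries; row_idx and val hold the non-zeros column by column.
   x is given as x_nnz (index, value) pairs; y is a dense array with one
   entry per row of A. (Hand-written sketch, not an MKL routine.) */
static void csc_times_sparse_vec(const int *col_ptr, const int *row_idx,
                                 const double *val,
                                 const int *x_idx, const double *x_val,
                                 int x_nnz, double *y)
{
    assert(col_ptr != NULL && row_idx != NULL && val != NULL && y != NULL);
    for (int k = 0; k < x_nnz; k++) {
        int j = x_idx[k];        /* column selected by this non-zero of x */
        double xj = x_val[k];
        for (int p = col_ptr[j]; p < col_ptr[j + 1]; p++)
            y[row_idx[p]] += val[p] * xj;   /* scatter column j scaled by x[j] */
    }
}
```

<p>The work is proportional to the number of non-zeros in the selected columns, not to the size of the matrix. The result is accumulated into a dense y, which has to be zeroed by the caller first.</p>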
Mon, 19 Aug 13 19:15:26 -0700 Pouya Z. 419414
How to set affinity while using MKL in sequential mode
https://software.intel.com/en-us/forums/intel-math-kernel-library/topic/404745
<p>I have written a multi-threaded code using pthreads. Each thread calls an instance of dss_solve_real separately. I link the code with the following libraries to make sure that MKL works in sequential mode:</p>
<p>$(MKLROOT)/lib/intel64/libmkl_intel_ilp64.a $(MKLROOT)/lib/intel64/libmkl_sequential.a $(MKLROOT)/lib/intel64/libmkl_core.a -lm -lpthread </p>
<p>Also, I have disabled KMP_AFFINITY using: </p>
<p>env KMP_AFFINITY=disabled</p>
<p>The number of threads for MKL is also set manually in the code using:</p>
<p>mkl_set_num_threads(1);</p>
<p>I use the following code to set the affinity for each thread. This piece of code is executed at the beginning of each thread's function:</p>
<p>pthread_t curThread = pthread_self();<br />cpu_set_t cpuset;<br />CPU_ZERO(&cpuset);<br />CPU_SET(threadCPUNum[threadData->numOfCurThread], &cpuset);<br />sched_setaffinity(curThread, sizeof(cpuset), &cpuset);</p>
<p>In this code, threadCPUNum[threadData->numOfCurThread] represents the number of the CPU to which the current thread will be bound.</p>
<p>In order to make sure that MKL respects my CPU affinity settings, I initially bind all the threads to CPU0 by setting all elements of threadCPUNum array to zero. However, monitoring CPU utilization reveals that MKL does not pay attention to sched_setaffinity and uses different processors.</p>
<p>I would like to know what I am missing here and how I can force the MKL function (dss_solve_real) to bind to a specific CPU.</p>
<p>Thanks in advance for your help.</p>
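<p>One detail worth checking in the snippet above: on Linux, sched_setaffinity expects a kernel thread ID (pid_t) as its first argument, whereas pthread_self() returns a pthread_t. Whether this explains the observed behavior cannot be confirmed from the post alone. The following is a minimal sketch of pinning with pthread_setaffinity_np instead, which does take a pthread_t; the helper names are hypothetical and the code assumes Linux with glibc.</p>

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to a single CPU.
   Returns 0 on success, otherwise an error number.
   (pin_calling_thread is a hypothetical helper name.) */
static int pin_calling_thread(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

/* Return the lowest-numbered CPU the calling thread is currently
   allowed to run on, or -1 if the mask cannot be read. */
static int first_allowed_cpu(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    if (pthread_getaffinity_np(pthread_self(), sizeof(set), &set) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &set))
            return cpu;
    return -1;
}
```

<p>Checking the return value matters here: a failed call leaves the thread free to run on any CPU, which would look the same as the affinity setting being ignored. The mask actually in effect can be read back with pthread_getaffinity_np.</p>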
Wed, 31 Jul 13 15:33:40 -0700 Pouya Z. 404745
Use one LU factorization in several instances of mkl_dss_solve
https://software.intel.com/en-us/forums/intel-math-kernel-library/topic/404629
<p>I am using the Intel MKL library to solve a system of linear equations (A*x = b) with multiple right-hand side (rhs) vectors. The rhs vectors are generated asynchronously by a separate routine and therefore, it is not possible to solve for them all at once.</p>
<p>In order to speed up the program, a multi-threaded design is used where each thread is responsible for solving for a single rhs vector. Since the matrix A is always constant, the LU factorization should be performed once and the factors used subsequently in all threads. So, I factor A using the following command</p>
<p>dss_factor_real(handle, opt, data);</p>
<p>and pass the handle to the threads to solve the problems using the following command:</p>
<p>dss_solve_real(handle, opt, rhs, nRhs, sol);</p>
<p>However, I found out that it is not thread-safe to use the same handle in several instances of dss_solve_real. Apparently, for some reason, the MKL library changes the handle in each instance, which creates a race condition. I read the MKL manual but could not find anything relevant. Since it is not logical to factorize A for each thread, I am wondering if there is any way to overcome this problem and use the same handle everywhere.</p>
<p>Thanks in advance for your help</p>
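<p>If the shared handle really cannot be used by several threads at once, one conservative workaround is to serialize the solve calls with a mutex. The following is a minimal sketch of that pattern only; solve_fn is a hypothetical signature standing in for a wrapper around dss_solve_real, and fake_solve is a stand-in used purely for demonstration, not MKL code.</p>

```c
#include <pthread.h>

/* Guards every solve call that uses the shared handle. */
static pthread_mutex_t solve_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical signature standing in for a wrapper around dss_solve_real. */
typedef int (*solve_fn)(void *handle, const double *rhs, double *sol);

/* Run one solve while holding the lock, so that only one thread
   touches the shared handle at a time. Returns the solver's status. */
static int locked_solve(solve_fn solve, void *handle,
                        const double *rhs, double *sol)
{
    pthread_mutex_lock(&solve_lock);
    int status = solve(handle, rhs, sol);
    pthread_mutex_unlock(&solve_lock);
    return status;
}

/* Stand-in solver for demonstration only: copies rhs[0] to sol[0] and
   bumps a call counter stored in the "handle", mimicking a solver that
   modifies internal state on every call. */
static int fake_solve(void *handle, const double *rhs, double *sol)
{
    long *calls = handle;
    *calls += 1;
    sol[0] = rhs[0];
    return 0;
}

/* Thread body: performs 10000 locked solves on the shared handle. */
static void *worker(void *handle)
{
    double rhs = 1.0, sol = 0.0;
    for (int i = 0; i < 10000; i++)
        locked_solve(fake_solve, handle, &rhs, &sol);
    return NULL;
}
```

<p>The trade-off is that the solve phase itself no longer runs in parallel; the threads only overlap in the work they do outside the lock, such as generating the rhs vectors.</p>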
Mon, 29 Jul 13 14:55:19 -0700 Pouya Z. 404629