Creating 8 VSLStreamStatePtr streams degrades MKL dtrsm performance (test code included, issue still open)

I originally wanted to generate random numbers in multiple threads, using the following code:

#define nstreams 8
VSLStreamStatePtr stream[nstreams];

int k;
for (k = 0; k < nstreams; k++)
{
    vslNewStream(&stream[k], VSL_BRNG_MT2203 + k, seed);
}

However, I found that once I create 8 VSLStreamStatePtr streams, the performance of other MKL functions is affected (about 5 times slower than normal). The affected functions are:

dtrsm("Right", "Upper", "No transpose", "Nunit", ...);
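
For readers unfamiliar with this call: with these arguments, dtrsm solves X * A = alpha * B for X, where A is an n-by-n upper triangular matrix with a non-unit diagonal, and overwrites B with X. The following is a minimal reference sketch of that operation in plain C++, not MKL's implementation. The function name trsm_right_upper is hypothetical, and column-major storage with leading dimensions lda and ldb is assumed, as in BLAS.

```cpp
#include <cassert>

// Reference sketch of dtrsm("Right", "Upper", "No transpose", "Non-unit", ...):
// solves X * A = alpha * B and overwrites B with X.
// A is n x n upper triangular, B is m x n, both stored column-major.
// Hypothetical helper for illustration only; this is not MKL code.
void trsm_right_upper(int m, int n, double alpha,
                      const double* a, int lda,
                      double* b, int ldb) {
    assert(lda >= n && ldb >= m);  // BLAS leading-dimension requirements
    for (int j = 0; j < n; ++j) {
        double* bj = b + j * ldb;  // column j of B
        for (int i = 0; i < m; ++i) bj[i] *= alpha;
        for (int k = 0; k < j; ++k) {
            const double akj = a[k + j * lda];
            const double* bk = b + k * ldb;  // already-solved column k of X
            for (int i = 0; i < m; ++i) bj[i] -= akj * bk[i];
        }
        const double ajj = a[j + j * lda];  // diagonal entry, assumed nonzero
        for (int i = 0; i < m; ++i) bj[i] /= ajj;
    }
}
```

With the sizes used later in this thread (m = 26, n = 3, lda = ldb = 29), each call performs only a few hundred floating-point operations, so any fixed per-call overhead may show up very clearly in the timings.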

 

 

 


Hi jian,

I have some questions about your issue:

1. How did you measure the performance? Did you enable MKL_VERBOSE, or did you write a program that reads the clock time?
2. What are your problem sizes for trsv, gemv, and trsm, and what seed do you use for random number generation? Could you please provide a reproducer (just a sample case) that we can investigate?

Thanks.

Best regards,
Fiona

Hi, Fiona

Thanks for your response.

I need to correct my report: ONLY dtrsm is affected by creating 8 new streams with vslNewStream.

Here is the test code and the result on my machine:

Result (elapsed seconds, as reported by difftime):

Before new 8 VSLStreamStatePtr time: 1

After new 8 VSLStreamStatePtr time: 12

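Code (C++):

#include <cstring>   // memcpy
#include <ctime>     // time, difftime
#include <iostream>  // std::cout

#include "mkl_vsl_functions.h"
#include "mkl_vsl_defines.h"
#include "mkl_blas.h"
#include "mkl_service.h"

int MmatrixARows=26;     // passed as m: number of rows of B
int NmatrixBColumns=3;   // passed as n: number of columns of B (A is n x n)
double alpha=1;
int ldm=29;              // leading dimension used for both A and B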

double matrixA[87]={0.00311007,-1.12899e-05,-0.000141499,-1.82698e-14,-0.000785694,-1.98974e-14,-0.000778519,-2.71811e-14,-2.29056e-14,-2.7844e-14,-2.24393e-14,-3.12059e-14,-1.26095e-14,-4.47909e-10,-7.97785e-19,-1.74566e-07,-4.15789e-10,-2.17286e-29,-1.56053e-10,-1.34911e-12,-2.19906e-27,-3.5138e-09,-1.0398e-07,-5.29274e-06,-4.9252e-07,-8.93104e-05,-3.71938e-05,-1.28896e-09,-1.17735e-07,-3.26114e-08,0.00289051,-0.000149547,-2.34128e-13,-1.92531e-13,-0.000706043,-2.17513e-13,-3.48327e-13,-0.000670723,-3.56823e-13,-2.8756e-13,-3.99905e-13,-1.61591e-13,-5.73997e-09,-2.17241e-19,-1.42301e-08,-5.32835e-09,-5.47231e-30,-8.81321e-10,-3.39771e-13,-5.53829e-28,-8.70999e-10,-2.05336e-08,-5.30965e-06,-4.83782e-07,-8.84461e-05,-3.73614e-05,-1.5481e-11,-1.15641e-07,-4.02283e-07,-4.25622e-07,0.00303418,-0.00079269,-2.05185e-13,-2.71747e-13,-2.3181e-13,-0.000797297,-3.1283e-13,-3.80276e-13,-3.06461e-13,-4.2619e-13,-1.72212e-13,-6.11725e-09,-1.91907e-19,-1.90223e-07,-5.67858e-09,-4.75132e-30,-1.02734e-09,-2.95006e-13,-4.80861e-28,-5.38704e-10,-1.5342e-08,-1.98894e-06,-6.03159e-06,-8.35693e-05,-4.29327e-05,-1.62871e-10,-1.44174e-06};

double matrixB_ori[87]={-1.82698e-14,-0.000785694,-1.98974e-14,-0.000778519,-2.71811e-14,-2.29056e-14,-2.7844e-14,-2.24393e-14,-3.12059e-14,-1.26095e-14,-4.47909e-10,-7.97785e-19,-1.74566e-07,-4.15789e-10,-2.17286e-29,-1.56053e-10,-1.34911e-12,-2.19906e-27,-3.5138e-09,-1.0398e-07,-5.29274e-06,-4.9252e-07,-8.93104e-05,-3.71938e-05,-1.28896e-09,-1.17735e-07,-3.26114e-08,0.00289051,-0.000149547,-2.34128e-13,-1.92531e-13,-0.000706043,-2.17513e-13,-3.48327e-13,-0.000670723,-3.56823e-13,-2.8756e-13,-3.99905e-13,-1.61591e-13,-5.73997e-09,-2.17241e-19,-1.42301e-08,-5.32835e-09,-5.47231e-30,-8.81321e-10,-3.39771e-13,-5.53829e-28,-8.70999e-10,-2.05336e-08,-5.30965e-06,-4.83782e-07,-8.84461e-05,-3.73614e-05,-1.5481e-11,-1.15641e-07,-4.02283e-07,-4.25622e-07,0.00303418,-0.00079269,-2.05185e-13,-2.71747e-13,-2.3181e-13,-0.000797297,-3.1283e-13,-3.80276e-13,-3.06461e-13,-4.2619e-13,-1.72212e-13,-6.11725e-09,-1.91907e-19,-1.90223e-07,-5.67858e-09,-4.75132e-30,-1.02734e-09,-2.95006e-13,-4.80861e-28,-5.38704e-10,-1.5342e-08,-1.98894e-06,-6.03159e-06,-8.35693e-05,-4.29327e-05,-1.62871e-10,-1.44174e-06,-1.87786e-13,-2.49161e-13,-0.00079269};

double matrixB[87];

// The statements below are meant to run inside a function such as main().
int sweepCount = 1000000;
time_t time1, time2, time3, time4;

// Baseline: time the dtrsm loop before any VSL stream exists.
time(&time1);
for(int count = 0; count < sweepCount; ++count) {
    memcpy(matrixB, matrixB_ori, sizeof(double)*87);
    dtrsm("Right", "Upper", "No transpose", "Nunit", &MmatrixARows, &NmatrixBColumns, &alpha, matrixA, &ldm, matrixB, &ldm);
}
time(&time2);
std::cout<<" Before new 8 VSLStreamStatePtr time: "<<difftime(time2, time1)<<std::endl;

// Create 8 VSL streams; the timed loops never use them.
VSLStreamStatePtr ptr_[8];
for(int i = 0; i < 8; ++i) {
    vslNewStream(&ptr_[i], VSL_BRNG_MT2203 + i, 1);
}

// Same loop again, now with the 8 streams alive.
time(&time3);
for(int count = 0; count < sweepCount; ++count) {
    memcpy(matrixB, matrixB_ori, sizeof(double)*87);
    dtrsm("Right", "Upper", "No transpose", "Nunit", &MmatrixARows, &NmatrixBColumns, &alpha, matrixA, &ldm, matrixB, &ldm);
}
time(&time4);

std::cout<<"After new 8 VSLStreamStatePtr time: "<<difftime(time4, time3)<<std::endl;
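
Note that time() typically has one-second resolution, so readings such as 1 and 12 are coarse. A finer-grained measurement could look like the sketch below, which uses std::chrono::steady_clock from the C++ standard library. The helper name average_ns_per_call is hypothetical, and the work callable stands in for the memcpy plus dtrsm loop body above.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Runs `work` `iterations` times and returns the average wall-clock time
// per call in nanoseconds. Hypothetical helper, for illustration only.
template <typename Work>
double average_ns_per_call(Work&& work, std::size_t iterations) {
    assert(iterations > 0);
    using clock = std::chrono::steady_clock;  // monotonic clock
    const auto start = clock::now();
    for (std::size_t i = 0; i < iterations; ++i) {
        work();
    }
    const auto stop = clock::now();
    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(iterations);
}
```

For example, passing a lambda that performs the memcpy and the dtrsm call with 1000000 iterations, once before and once after the vslNewStream loop, would give the per-call cost in each case.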

 

Thanks

 

Hi, Fiona

Can you reproduce the issue? Or do you need more details, such as the library version and CPU information?

 

Thanks. 

 

Hi Jian,

I can reproduce your problem, and we are investigating. I will get back to you soon.

Best regards,
Fiona

Best Reply

Root cause analysis points to a problem with the internal mkl_serv_allocate() routine. The issue has been escalated, and we will keep you updated on its status.

 

Hi, Gennady

I am glad to hear that. Thanks for your help.

Quote:

Gennady F. (Intel) wrote:

Root cause analysis points to a problem with the internal mkl_serv_allocate() routine. The issue has been escalated, and we will keep you updated on its status.

 
