Hi there,
We are using MKL on a Red Hat Linux network server with a Xeon processor that has 32 physical cores (64 logical cores). The application uses a thread pool to handle network requests in parallel, and each request is handled independently. Performance improves with more threads:
4 threads : 45 seconds
8 threads : 23 seconds
16 threads : 15 seconds
24 threads : 14 seconds
32 threads : 15 seconds
However, the performance always plateaus at around 16 threads (14 to 15 seconds) and gets slightly worse again at 32 threads. When I replace the MKL cblas_sgemm function with the ATLAS one, the performance keeps improving linearly from 1 thread to 32 threads.
Limiting the MKL thread count, either by calling mkl_set_num_threads(1) at the beginning of the main function or by setting the environment variable to 1, does not help either; we get the same result. A multi-process solution also shows the same problem(??). In another experiment, sleeping for a small amount of time before each call to the MKL cblas_sgemm gives linear, but not ideal, scaling. It looks like there is some resource contention inside the MKL cblas_sgemm implementation. Or are we missing something here?
Any comments or suggestions are highly appreciated, and thanks very much in advance!
Thanks,
Yu