Difficulty Using micnativeloadex

I have been running a benchmark code that calls an MKL subroutine and have noticed a strange performance difference between two methods of running it natively on the MIC card. The first method is to log on to the MIC card directly and run the code as follows (I have already copied over the required libraries).

$ ssh root@mic0

$ export OMP_NUM_THREADS=120

$ export KMP_AFFINITY=scatter

$ ./a.out

The performance using this method is about 480 GFLOPS. However, if I use the micnativeloadex utility instead, I get much worse performance.

$ /opt/intel/mic/bin/micnativeloadex a.out -d 0 -e "OMP_NUM_THREADS=120 KMP_AFFINITY=scatter"

Running this way, I achieve only 340 GFLOPS. This is the performance I would get if I set neither environment variable and used the defaults, but I thought the -e option was supposed to pass them along. What am I doing wrong?

The environment variables are being passed along. I wrote up a little test case to show that. If you set KMP_AFFINITY to verbose,scatter, you will see that the environment variable has the effect you expect. If you run this code, you will find that with micnativeloadex it runs as uid 400 out of a directory in /tmp, not as yourself out of your home directory. Uid 400 is a special user named micuser that is used by micnativeloadex and by code using the offload programming model.

That said, I don't know why your code is slower under micnativeloadex. Are you timing only a small kernel inside your code, so that the measurement excludes the cost of transferring the code and libraries to the card and the cost of any reads or writes, in particular writes to standard out? Is there any more information you can post here?

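#define _GNU_SOURCE   /* needed for get_current_dir_name() in glibc */
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <omp.h>

/* getenv() returns NULL for an unset variable; print a placeholder instead. */
static const char *env_or_unset(const char *name)
{
    const char *value = getenv(name);
    return value ? value : "(unset)";
}

int main(int argc, char* argv[])
{
    char *cwd = get_current_dir_name();   /* buffer is malloc'd by glibc */

    printf("OMP_NUM_THREADS = %s\n", env_or_unset("OMP_NUM_THREADS"));
    printf("KMP_AFFINITY = %s\n", env_or_unset("KMP_AFFINITY"));
    printf("current directory = %s\n", cwd ? cwd : "(unknown)");
    printf("user id = %d\n", (int)getuid());
    free(cwd);

    /* Report the size of the thread team the runtime actually creates. */
    #pragma omp parallel
    {
        #pragma omp single
        {
            printf("omp_get_num_threads = %d\n", omp_get_num_threads());
        }
    }
    return 0;
}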

It appears to be an issue related to MKL. The benchmark code calls the multi-threaded DGEMM subroutine. To answer your question: I time only the kernel itself, using the dsecnd() timer function provided by MKL, and no other portion of the code.

However, your post did give me an idea. I wrote an OpenMP matrix-matrix multiplication program that does not call MKL, and it performed the same under both methods of execution, just as you would expect. It seems that something strange is happening with MKL on the MIC.

Thank you for your help.
