MKL Performance Issue

DubitoCogito's picture

I have been running an MKL DGEMM benchmark in native mode on a KNC card, but have noticed strange behavior. The performance is inconsistent and varies quite a bit. I tried multiple thread affinity settings and noticed the same behavior with varying numbers of threads and threads per core. The test consists of calling DGEMM on a set of 6,000 by 6,000 matrices a total of 1,000 times so I can compare calls. During my testing the performance varied by as much as 100 GFLOPS. A Google search did not reveal much. However, I did find a Dr. Dobb's article that noted the unusual behavior and attributed it to OS jitter, but did not elaborate. Has anyone else noticed this behavior?
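
For reference, a minimal sketch of the kind of timing loop described above, assuming cblas_dgemm and gettimeofday (the actual attached benchmark may differ):

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>
    #include <mkl.h>

    /* Sketch: time ITERS back-to-back DGEMM calls on N x N matrices and
     * report GFLOPS per call so individual calls can be compared. */
    #define N     6000
    #define ITERS 1000

    static double wtime(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec * 1.0e-6;
    }

    int main(void)
    {
        size_t elems = (size_t)N * N, i;
        int it;
        double *A = (double *)mkl_malloc(elems * sizeof(double), 64);
        double *B = (double *)mkl_malloc(elems * sizeof(double), 64);
        double *C = (double *)mkl_malloc(elems * sizeof(double), 64);
        for (i = 0; i < elems; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

        double flops = 2.0 * N * N * N;   /* floating-point operations per DGEMM call */
        for (it = 0; it < ITERS; it++) {
            double t0 = wtime();
            cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                        N, N, N, 1.0, A, N, B, N, 0.0, C, N);
            double t1 = wtime();
            printf("iter %4d: %.1f GFLOPS\n", it, flops / (t1 - t0) * 1.0e-9);
        }

        mkl_free(A); mkl_free(B); mkl_free(C);
        return 0;
    }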

loc-nguyen (Intel)'s picture

Hi DubitoCogito,

Would you like to elaborate on the scenario in which you observe such a big performance variation? For example, do you see the variation when both the number of threads and the thread affinity stay unchanged?

What are the settings of OMP_NUM_THREADS and KMP_AFFINITY before and after that scenario?

Regards.

DubitoCogito's picture

I apologize for my delayed reply. The cluster was offline for a few weeks for hardware and software upgrades. I reran my tests and am still seeing unexpected variations in performance. Here is the information for a particular run.

OMP_NUM_THREADS=120

KMP_AFFINITY=compact

KMP_PLACE_THREADS=60c,2t,0O

I ran the code with these parameters multiple times during different login sessions and saw performance between 615 and 711 GFLOPS. Also, I was the only person using the system and MIC card during the testing.

Here is a plot showing the unusual performance I am seeing running natively on the MIC card.

Attachments:
mkl-dgemm.pdf (35.51 KB)
loc-nguyen (Intel)'s picture

Hello DubitoCogito,

Here are some observations:

- These settings indicate using 2 threads per core with “compact” affinity. The resources on each core are under-utilized, because each core is capable of running 4 threads. Furthermore, the 2 OpenMP threads may migrate within the same core, which leads to non-optimal data locality. I am not sure whether this explains the performance variation. Would you like to try KMP_AFFINITY=compact,granularity=fine while keeping the same OMP_NUM_THREADS and KMP_PLACE_THREADS?

- GEMM is not tuned to be run with 2 threads per core. It also has a lock-add-based inter-core barrier, which can be rather slow when used to synchronize threads running on different cores. Again, I am not sure whether this explains the performance discrepancy.

It would be very useful for us to see the benchmarking code, the output from "micinfo" command, and also the compiler version.

Regards,

Tim Prince's picture

You could run the micsmc core utilization view as a visual indicator that thread placement is the same each time and that no unexpected processes are competing. As you mentioned, MKL is tuned to use all cores, 4 threads per core, on such large problems.

DubitoCogito's picture

- Performance with more than 2 threads per core was much worse, presumably because of resource contention. I know MKL runs with the maximum number of threads by default (meaning 240 threads on a 60-core chip), but I want to run with multiple MPI tasks per card. The average performance in GFLOPS with a total of 120 OpenMP threads was as follows: 2 threads per core = 698, 3 threads per core = 428, 4 threads per core = 403. I used the settings below. Of course, the performance fluctuates, but running with only 2 threads per core has always been the fastest configuration for whatever reason.

OMP_NUM_THREADS=120
KMP_AFFINITY=compact
KMP_PLACE_THREADS=60c,2t,0O or 40c,3t,0O or 30c,4t,0O

- The granularity=fine option is the default, so MKL kernel threads should not be migrating between hardware threads on a core. I also verified this by setting KMP_SETTINGS=1 and looking at the output: KMP_AFFINITY="noverbose,warnings,respect,granularity=fine,compact,0,0".

- The source code is based on an example I found somewhere on the Intel website. I uploaded the file as a *.txt file because the forum would not allow me to upload a file with the *.c extension.

- I am using Intel Composer XE v2013.1.117 which includes ICC v13.0.1 20121010.

I have attached the output from micinfo, KMP_SETTINGS, and a copy of the code I am running. Thank you for your help. I really appreciate your suggestions.

Edit: I used the following command to compile the code: icc -O3 -mmic -openmp -restrict dgemm.c -mkl

Attachments:
micinfo.txt (2.2 KB)
kmp-settings.txt (1.6 KB)
dgemm.txt (1.68 KB)
DubitoCogito's picture

I cannot run micsmc because some libraries are missing. I will check with the system administrator.

DubitoCogito's picture

I am unable to run micsmc in GUI mode because X11 is not configured on the cluster. However, I periodically ran 'micsmc -c' during execution and it showed the expected core utilization levels for each configuration. Also, the output from the OpenMP library (set KMP_AFFINITY=verbose,...) showed the correct kernel to hardware thread mapping. From the information I have been able to gather, the kernel threads appear to be placed on the correct MIC cores. I have attached a copy of the thread mapping output.

Attachments:
mic-core.txt (26.71 KB)
Roman Dubtsov (Intel)'s picture

Hello DubitoCogito,

I ran the reproducer and found out that to get the highest stable performance with this test it is necessary to 

  1. Enable tsc timer by running 'echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource' on the coprocessor.
  2. Enable transparent huge pages by running 'echo always > /sys/kernel/mm/transparent_hugepage/enabled'. Alternatively, you could use mmap(MAP_HUGETLB) to allocate the memory for the matrices; the latter approach has the advantage that you *always* know whether the memory is allocated with 2M pages or not (see the sketch after this list).
  3. Use KMP_AFFINITY=compact,granularity=fine and OMP_NUM_THREADS=240.
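
For the MAP_HUGETLB alternative in item 2, a rough sketch (assuming huge pages have been reserved on the coprocessor; the helper name is illustrative, not part of the attached code):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    /* Sketch: allocate a buffer of n_elems doubles backed by explicit 2M pages.
     * mmap() returns MAP_FAILED if no huge pages are available, so the caller
     * can fall back to a regular allocation. */
    static double *alloc_huge(size_t n_elems)
    {
        size_t huge  = 2UL * 1024 * 1024;
        size_t bytes = (n_elems * sizeof(double) + huge - 1) / huge * huge;

        void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap(MAP_HUGETLB)");
            return NULL;
        }
        return (double *)p;
    }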

The final performance should be about 750 GFlops.

A side note regarding the test. For the N=6000 used in the test this is not relevant, but I think it is useful enough to be mentioned. To get the best performance on KNC it is important to avoid 4K aliasing. This means it is necessary to make sure that the matrix leading dimension multiplied by the matrix element size (8 bytes for double) is not divisible by 4096. For DGEMM the only matrix that is sensitive to this effect is C.
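
One way to follow that rule, as a sketch (the helper below is hypothetical, not part of the attached code): pad the leading dimension so that LDC times sizeof(double) is not a multiple of 4096 while keeping each column 64-byte aligned.

    #include <stddef.h>

    /* Hypothetical helper: choose a leading dimension for an n-row matrix of
     * doubles such that ld * sizeof(double) is not a multiple of 4096, while
     * keeping ld a multiple of 8 doubles (64 bytes) for aligned columns. */
    static size_t pad_ld(size_t n)
    {
        size_t ld = (n + 7) / 8 * 8;      /* round up to 8 doubles = 64 bytes */
        if (ld * sizeof(double) % 4096 == 0)
            ld += 8;                      /* shift by 64 bytes to break the 4K periodicity */
        return ld;
    }

The C matrix would then be allocated as pad_ld(N) * N doubles and passed to DGEMM with ldc = pad_ld(N). For N = 6000, 6000 * 8 = 48000 bytes is already not a multiple of 4096, which matches the note above that this case is not affected.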

DubitoCogito's picture

Thank you for the information. I realize the MKL kernels are tuned to use all MIC cores, but I am trying to understand the reason for the 100+ GFLOPS performance fluctuation I have seen during testing. I want to understand why the performance appears to be unstable with fewer threads.

DubitoCogito's picture

I switched to the tsc clock source as you suggested, and the results compared with the default micetc are a bit different. The tsc source gives a ~5% higher FLOP count. My understanding of the different clock sources was that micetc provided a more stable measurement. I tried reducing the total number of iterations, thinking it could be a cumulative latency effect, but got the same results. I also increased the dimensions of the matrices so that each DGEMM call took more than 1 second, because I thought the time might be too short to measure accurately, but still got the same results. I would assume there was a reason for making micetc the default. Which source should I consider valid?

Could you please explain the 4K aliasing issue in more detail?

Thank you for your help.

Roman Dubtsov (Intel)'s picture

Re: tsc. Here's a quote from the MPSS readme.txt:

  • When using the micetc clock source and calling gettimeofday on multiple threads, the time to call gettimeofday is more than 100x slower than when using the TSC clock source.
  • At times the kernel may declare the tsc clocksource to be unstable and select jiffies instead, this will degrade the timing resolution to 10ms. Users can check if this happened by checking the current clocksource device in /sys/devices/system/clocksource/clocksource0/current_clocksource, Users will have to restart the coprocessor to get back to using tsc as the clocksource.

Re: 4K aliasing. I should probably have written 'cache thrashing'. Consider the (parts of the) C columns C(i:i+8,j) (column-major format) with i = 1, 8, 16, ..., N and j = 1..N. The start address of column segment (i,j) is C + LDC*j + i. This means that if the value of LDC in bytes is a multiple of 4K, then the columns with the same i will have the same start address modulo the 4K page size and thus the same index in the L1 cache. Since DGEMM updates a set of adjacent C columns with the same i simultaneously, it will be able to use only a part of the L1 cache in the process. Due to the way the algorithm works, this is not as important for A and B.

Re: your original question. It's hard to tell for sure without instrumenting / profiling the MKL code and looking into what's happening. I'll do this if I find some time next week.

DubitoCogito's picture

Thank you for the explanation of 4K aliasing. It was very informative.

Yes, I had already read that section of the MPSS release notes. However, I thought it would not be applicable to my situation because I do not call the timer function from within a parallel region. In fact, I have no OpenMP directives in the code. The gettimeofday() calls surround the call to the multi-threaded MKL DGEMM routine. Given my understanding of how OpenMP typically works, I would assume the threading environment is not initialized until the dgemm_() call, and that this is done internally by MKL, but I could be mistaken. Also, I have not had this issue with a multi-threaded matrix-matrix multiplication code I wrote using OpenMP; it has consistently given similar results with both clock sources, without the same difference.
