Benchmarking MKL LAPACK on a ccNUMA system

I work with a large ccNUMA SGI Altix system. One of our users is trying to benchmark some LAPACK routines on our system and is getting disappointing scaling: it stops scaling after 4 threads.

The test I am running diagonalizes a 4097x4097 matrix of double-precision floats using the routine DSYEV.
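
For reference, the core DSYEV call pattern (workspace query followed by the actual call) looks roughly like this. This is only a minimal sketch, assuming eigenvectors are wanted (jobz = 'V') and using illustrative names such as mat and eig rather than the identifiers in the attached code:

program dsyev_sketch
  implicit none
  integer, parameter :: n = 4097
  double precision, allocatable :: mat(:,:), eig(:), work(:)
  double precision :: wkopt(1)
  integer :: lwork, info

  allocate(mat(n,n), eig(n))
  ! ... fill mat with the real symmetric matrix to be diagonalized ...

  ! Workspace query: lwork = -1 makes DSYEV return the optimal size in wkopt(1)
  call dsyev('V', 'U', n, mat, n, eig, wkopt, -1, info)
  lwork = int(wkopt(1))
  allocate(work(lwork))

  ! Eigenvalues are returned in eig; eigenvectors overwrite mat when jobz = 'V'
  call dsyev('V', 'U', n, mat, n, eig, work, lwork, info)
  if (info /= 0) print *, 'dsyev failed, info = ', info
end program dsyev_sketch

With -mkl=parallel, all the threading happens inside this single dsyev call.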

From analysing the hotspots in VTune, I find that almost all the time is spent as overhead and spin time in the functions:

[OpenMP dispatcher] <- pthread_create_child and [OpenMP fork].

The code was compiled with ifort using the options -O3 -openmp -g -traceback -xHost -align -ansi-alias -mkl=parallel, with version 13.1.0.146 of the compiler and version 11 of MKL. The system is made up of 8-core Sandy Bridge Xeon sockets.

The code was run with the environment variables:

OMP_NUM_THREADS=16
MKL_NUM_THREADS=16
KMP_STACKSIZE=2gb
OMP_NESTED=FALSE
MKL_DYNAMIC=FALSE
KMP_LIBRARY=turnaround
KMP_AFFINITY=disabled

It is also run with the SGI NUMA placement command 'dplace -x2', which pins the threads to their cores.

So I suspect that something is wrong with the MKL options, or that the library isn't configured properly for our system. I have attached the code used.

Does anybody have any ideas on this?

Jim

Attachment: code.tar.gz (2.02 KB)

>>...One of our users is trying to benchmark some LAPACK routines on our system and is getting some disappointing
>>scaling - stops scaling after 4 threads.
>>
>>The test I am running is of diagonalizing a 4097x4097 matrix of double precision floats. It uses the routine DSYEV...

It seems to me that only a small performance advantage can be achieved for a 4097x4097 matrix (I would rate it as small). Here are two questions:

- Why 4097x4097 and not 4096x4096?
- Did you try larger matrix sizes, like 16Kx16K, 32Kx32K, and so on?

Hello again. Yes, the user had tried larger matrices and got similar scaling problems. When he ran the same code on a different machine, he managed to get it to scale beyond 8 threads for 16kx16k. I reran the code with a 16kx16k matrix with 4, 8, and 16 OMP threads on our ccNUMA system. The profiling results for 4 threads are:
[OpenMP fork]          1414.601s    1414.601s
[OpenMP dispatcher]    1165.936s    1165.936s
[OpenMP worker]         153.393s     153.393s
lapack_dsyev             45.606s       0s
diag                      2.468s       0s

Where the first column is CPU time and the second is Overhead and spin time. The results for 8 and 16 threads show a similar trend.

Nearly all the time is spent idle even for 4 threads. It can't be because there isn't enough work to do, surely?

So does anyone have any ideas on this?

>>...Nearly all the time is spent idle even for 4 threads. It can't be because there isn't enough work to do, surely?

I agree that something is wrong, and here are some more questions:

- How much memory does the system have?
- Could you verify how physical and virtual memory were used during these tests? (If you're on a Linux system, try a graphical utility similar to the Windows Task Manager.)

I'll verify your test code on my Ivy Bridge system (see *) with Intel C++ Compiler XE 13.1.0.149 [IA-32 & X64] (Update 2) and MKL version 11.0.3.

( * ) - Intel Core i7-3840QM ( Ivy Bridge / 4 cores / 8 logical CPUs / ark.intel.com/compare/70846 )

James,

I compiled your test case on a Windows 7 Professional 64-bit OS with the 64-bit Fortran compiler using the following command line:

ifort.exe /O3 /Qopenmp /QxHost /Qmkl:parallel /Qansi-alias Diag.f90

but execution fails because a matrix.chk file is Not found:

..\DiagTestApp>Diag.exe

Read the Hamilton-matrix...
forrtl: severe (29): file not found, unit 11, file ..\DiagTestApp\matrix.chk

Image PC Routine Line Source

Diag.exe 00000001400659C7 Unknown Unknown Unknown
Diag.exe 0000000140061383 Unknown Unknown Unknown
Diag.exe 0000000140034FA6 Unknown Unknown Unknown
Diag.exe 000000014001A975 Unknown Unknown Unknown
Diag.exe 00000001400195B0 Unknown Unknown Unknown
Diag.exe 000000014000B6E9 Unknown Unknown Unknown
Diag.exe 0000000140001985 Unknown Unknown Unknown
Diag.exe 0000000140001076 Unknown Unknown Unknown
Diag.exe 00000001400F814C Unknown Unknown Unknown
Diag.exe 000000014004EC2F Unknown Unknown Unknown
kernel32.dll 0000000076B5652D Unknown Unknown Unknown
ntdll.dll 000000007724C521 Unknown Unknown Unknown
...

Sorry, that's the input file containing the matrix. The 16kx16k one is ~2GB in size, so I didn't include it initially. I'll upload it tomorrow when I go back to work; apparently we're allowed up to 4GB on here...

>>...Sorry, that's the input file containing the matrix. the 16kx16k one is ~2gb in size so I didn't include it initially. I'll upload it
>>tomorrow when I go back to work apparently we're allowed up to 4gb on here...

Is there any chance to modify the source code to generate some random values, or some suitable numbers, to get a solution? I think that would be the best option... Anyway, on my side the application (initial version) is ready for testing. My system has 32GB of physical memory and 96GB of virtual memory, and I think it will be able to handle your test case.

Hi

Sorry for the delay. The user's code for generating the matrices is leviathan in complexity and takes forever. However, all one needs for dsyev is a real symmetric matrix, so I wrote my own code (attached) that generates a simple 16kx16k Fiedler matrix: A(i,j) = abs(i-j). It outputs an unformatted Fortran file called 'matrix.chk'; use this as the input file for the other program.
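
A minimal sketch of such a generator is shown below, assuming the whole matrix is written as a single unformatted record; the attached rsym-matrix-gen.f90 may differ in details such as the record layout or whether the dimension is written first:

program gen_fiedler
  implicit none
  integer, parameter :: n = 16000
  double precision, allocatable :: a(:,:)
  integer :: i, j

  allocate(a(n,n))
  do j = 1, n
     do i = 1, n
        a(i,j) = dble(abs(i - j))   ! Fiedler matrix: A(i,j) = |i - j|
     end do
  end do

  ! Write the whole matrix as one unformatted Fortran record to matrix.chk
  open(unit=11, file='matrix.chk', form='unformatted', status='replace')
  write(11) a
  close(11)
end program gen_fiedler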

I can confirm that this also gives the same problems on our system as our users matrix.

Attachment: rsym-matrix-gen.f90 (919 Bytes)

>>...I wrote my own code (attached) that generates a simple 16kx16k Fiedler matrix: A(i,j,) = abs(i-j). This will output a file in
>>unformatted fortran called 'matrix.chk'. Use this as the input file for the other program...

I'll let you know results of my tests and thank you for the matrix generation program.

Did you manage to get anywhere with it?

J

>>>>...One of our users is trying to benchmark some LAPACK routines on our system and is getting some
>>>>disappointing scaling - stops scaling after 4 threads...
>>
>>Did you manage to get anywhere with it?

Yes and I'll post my results soon.

Could you provide some technical details about the hardware your user is using?

It's a ~200-socket ccNUMA machine. Each socket is an 8-core Intel Xeon E5-4650L with about 7.5GB of RAM per core. You request cores and memory for jobs using the MOAB scheduler.

Here are the results on my Ivy Bridge system.

[ Hardware ]

Dell Precision Mobile M4700
Intel Core i7-3840QM ( Ivy Bridge / 4 cores / 8 logical CPUs / ark.intel.com/compare/70846 )
Size of L3 Cache = 8MB ( shared between all cores for data & instructions )
Size of L2 Cache = 1MB ( 256KB per core / shared for data & instructions )
Size of L1 Cache = 256KB ( 32KB per core for data & 32KB per core for instructions )
Windows 7 Professional 64-bit
32GB of RAM
96GB of VM

[ 64-bit application on Windows 7 Professional 64-bit OS ]

Command line to compile: ifort.exe /O3 /Qopenmp /QxHost /Qmkl:parallel /Qansi-alias /align:array32byte Diag.f90

[ Number of CPUs used: 4 ( 4 threads ) ]

Read the Hamilton-matrix...
allocation of mat of 16000x 16000
Read the Hamilton-matrix...
...end!
Diagonalization with dsyev:
real 1096.2s ; cpu 4380.5s
...done!
FIN!

[ Number of CPUs used: 2 ( 2 threads ) ]

Read the Hamilton-matrix...
allocation of mat of 16000x 16000
Read the Hamilton-matrix...
...end!
Diagonalization with dsyev:
real 1454.9s ; cpu 2908.5s
...done!
FIN!

[ Number of CPUs used: 1 ( 1 thread ) ]

Read the Hamilton-matrix...
allocation of mat of 16000x 16000
Read the Hamilton-matrix...
...end!
Diagonalization with dsyev:
real 2532.1s ; cpu 2529.7s
...done!
FIN!

[ Summary ]

1 CPU - real 2532.1s ; cpu 2529.7s
2 CPUs - real 1454.9s ; cpu 2908.5s
4 CPUs - real 1096.2s ; cpu 4380.5s
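
Note: relative to the single-thread run, that is a speedup of roughly 2532.1 / 1454.9 = 1.74x with 2 threads and 2532.1 / 1096.2 = 2.31x with 4 threads.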

[ 32-bit application on Windows 7 Professional 64-bit OS ]

Command line to compile: ifort.exe /O3 /Qopenmp /QxHost /Qmkl:parallel /Qansi-alias /align:array32byte Diag.f90

[ Number of CPUs used: 4 ( 4 threads ) ]

Read the Hamilton-matrix...
/Error diag 41 trying to allocate arryas mat and e
diag, arryas mat and e - out of memory

[ Number of CPUs used: 2 ( 2 threads ) ]

Read the Hamilton-matrix...
/Error diag 41 trying to allocate arryas mat and e
diag, arryas mat and e - out of memory

[ Number of CPUs used: 1 ( 1 thread ) ]

Read the Hamilton-matrix...
/Error diag 41 trying to allocate arryas mat and e
diag, arryas mat and e - out of memory

[ Summary ]

1 CPU - N/A
2 CPU - N/A
4 CPU - N/A

Note: As you can see, the test with the 32-bit application failed (a 16000x16000 double-precision matrix alone needs about 2GB, which a 32-bit process cannot allocate as a single array).

>>[ Summary ]
>>
>>1 CPU - real 2532.1s ; cpu 2529.7s
>>2 CPUs - real 1454.9s ; cpu 2908.5s
>>4 CPUs - real 1096.2s ; cpu 4380.5s

I could only confirm that performance scaling for cases with 1 CPU, 2 CPUs and 4 CPUs looks right. Unfortunately, I don't have a system with greater than 4 CPUs.

This is a short follow-up: I wonder if Intel software engineers could verify scalability on a system with 8, 16, or even more CPUs? Thanks in advance.

Note: Take into account that a set of environment variables was provided:
...
OMP_NUM_THREADS=16
MKL_NUM_THREADS=16
KMP_STACKSIZE=2gb
OMP_NESTED=FALSE
MKL_DYNAMIC=FALSE
KMP_LIBRARY=turnaround
KMP_AFFINITY=disabled
...

Hi Sergey, James,

Thanks a lot for the test. Just a quick thought:

Some BLAS functions are threaded with OpenMP, but in order to keep good performance they start at most 4 threads. As the function gesv presumably depends on BLAS functions, the scalability of your test would be limited to 4. We will check it again and let you know the details.

Best Regards,

Ying  

Indeed, please let us know ASAP; we have a spare 1800 threads that apparently can never be utilized by MKL.

Also if blas is limited to 4 threads, how can it ever fully utilize a Xeon Phi card?

o_0

>>...Also if blas is limited to 4 threads, how can it ever fully utilize a Xeon Phi card?..

These MKL thread limitations do not look right and I see inconsistency because some MKL functions on my system used 8 threads instead of 4. Here is a really small example:
...
C = MATMUL( A, B ) ! Calculate product of two dense matrices
...
and 8 threads were used.

MKL matrix multiply is hand-optimized to maximize effective use of multiple threads. For matrix dimensions of 4096, it can effectively use at least 244 threads on the Intel(c) Xeon Phi(tm). That version of MKL won't perform efficiently on matrices with dimensions less than 32, but it is possible to use a number of threads corresponding to the problem size effectively with host MKL or by compiling from source code with OpenMP. For a problem so small that 4 threads would be the limit, single-threaded in-line expansion, e.g. Fortran MATMUL, should be better than launching a threaded job, e.g. via MKL.

By the way, the ifort -opt-matmul option (MKL support for MATMUL) isn't available on Intel(c) Xeon Phi(tm). What is available is "automatic offload", where MKL function calls on the host are executed on the coprocessor, subject to environment variable settings and a sufficiently large problem size.
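
To illustrate the trade-off, here is a hedged sketch contrasting the intrinsic MATMUL (in-line and single-threaded unless -opt-matmul maps it to MKL) with an explicit call to MKL's threaded DGEMM; the array names and the size n are illustrative only:

program matmul_vs_dgemm
  implicit none
  integer, parameter :: n = 4096
  double precision, allocatable :: a(:,:), b(:,:), c(:,:)

  allocate(a(n,n), b(n,n), c(n,n))
  call random_number(a)
  call random_number(b)

  ! Intrinsic MATMUL: fine for small matrices where threading overhead dominates
  c = matmul(a, b)

  ! Explicit MKL DGEMM: threaded according to MKL_NUM_THREADS, worthwhile once
  ! the matrices are large enough to amortize the thread fork
  call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
end program matmul_vs_dgemm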

Okay, TimP.

Xeon Phi cards aside - is it true that a routine like DSYEV is only parallel to 4 threads in MKL on Xeon processors?

Also, let us not forget that my main problem is that when running MKL DSYEV on my system, even with 4 threads, all the threads spend most of the time idle and only about 1% of the time in DSYEV. I still don't know why this is. When you run my program with 4 threads on your machines through VTune hotspots, does it also show that the threads are idle 99% of the time?

>>...when running MKL DYSEV on my system even on 4 threads, all the threads spend most of the time idle and
>>about 1% of the time in DYSEV. I still don't know why this is...

I'll repeat tests on my Ivy Bridge with 4 CPUs and provide you additional technical details for comparison.

Note: It looks like processing in that case is Memory or I/O bound, and it is Not CPU bound.

On a 4-core platform MKL defaults to 4 threads even if 8 logical CPUs are visible.

>>...on a 4 core platform mkl defaults to 4 threads even if 8 logical are visible...

I do not confirm this (for a 64-bit Windows platform / non-NUMA) and I'll provide lots of technical details as soon as all my verifications are completed.

>>...we have a spare 1800 threads that apparently can never be utilized by MKL...

Actually you can, but by using a different method, which I call Application Based Partitioning (ABP).

Hi James, 

No, my guess was wrong; MKL version 11 has no such limitation for that function (we had done this for small data sizes before).

I did a test on one machine: Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz, 2 packages, 8 cores each, HT disabled, so 2x8 = 16 threads in total.

run with matrix 4096x4096

export MKL_NUM_THREADS=2

 real      27.0s ; cpu      53.9s

export MKL_NUM_THREADS=4

 real      15.2s ; cpu      60.7s

export MKL_NUM_THREADS=8

 real       9.5s ; cpu      75.8s

export MKL_NUM_THREADS=16

 real       7.4s ; cpu     117.2s.

So the problem should not be here.
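
For what it's worth, the same sweep can also be scripted inside a single run with MKL's thread-control routine instead of re-exporting MKL_NUM_THREADS each time. A minimal sketch, where diagonalize() is a hypothetical placeholder for the DSYEV test, not a routine in the attached code:

program thread_sweep
  implicit none
  integer :: nthreads

  nthreads = 2
  do while (nthreads <= 16)
     call mkl_set_num_threads(nthreads)   ! MKL service routine
     print *, 'MKL threads:', nthreads
     ! call diagonalize()                 ! run and time the DSYEV test here
     nthreads = nthreads * 2
  end do
end program thread_sweep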

Best Regards,

Ying

Okay, that's good to hear that MKL routines can use more than 4 threads!

Thanks for your timings; they are a good comparison and help push me closer to the source of the problem. I ran the same thing on our system as you did, Ying, and here are my timings:

run with matrix 4096x4096

export MKL_NUM_THREADS=2

real      27.3s ; cpu      54.2s

export MKL_NUM_THREADS=4

real      15.9s ; cpu      62.3s

export MKL_NUM_THREADS=8

real      11.3s ; cpu      88.4s

export MKL_NUM_THREADS=16

real      13.6s ; cpu     212.6s

These were run with the other options:

OMP_NUM_THREADS= # 2,4,8,16
MKL_NUM_THREADS= # 2,4,8,16
KMP_STACKSIZE=2gb
OMP_NESTED=FALSE
MKL_DYNAMIC=FALSE
KMP_LIBRARY=turnaround
KMP_AFFINITY=disabled

So our results agree up to 8 threads. At 16, however, things start to look different on my machine. With 16 threads the job spans two sockets on my machine, and the sockets are connected via a NUMA link, unlike your machine, where the two packages have uniform access to memory.

So basically this MKL routine doesn't seem to scale beyond a single socket on our machine, which is the problem the user reported. Please provide some comments and suggest what I should do next.

 

Here are results of another set of tests:

[ 4 OMP & KMP threads ]

C:\WuTemp\FortTestApp1\x64\Release>Diag.exe
Read the Hamilton-matrix...
allocation of mat of 16000x16000
Read the Hamilton-matrix...
...end!
Diagonalization with dsyev:
real 1466.5s ; cpu 11263.2s
...done!
FIN!

Note: Total number of Win32 threads used during processing was 64 ( plus 1 thread for the main process ).

[ 16 OMP & KMP threads ]

C:\WuTemp\FortTestApp1\x64\Release>Diag.exe
Read the Hamilton-matrix...
allocation of mat of 16000x16000
Read the Hamilton-matrix...
...end!
Diagonalization with dsyev:
real 1435.0s ; cpu 11043.2s
...done!
FIN!

Note: Total number of Win32 threads used during processing was 64 ( plus 1 thread for the main process ).

[ 32 OMP & KMP threads ]

Read the Hamilton-matrix...
allocation of mat of 16000x16000
Read the Hamilton-matrix...
...end!
Diagonalization with dsyev:
real 1469.4s ; cpu 11306.9s
...done!
FIN!

Note: Total number of Win32 threads used during processing was 64 ( plus 1 thread for the main process ).

Command line options:

/nologo /O3 /QaxAVX /QxAVX /Qparallel /heap-arrays1024 /Qopt-matmul- /arch:AVX /fp:fast=2 /module:"x64\Release\\" /object:"x64\Release\\" /Fd"x64\Release\vc90.pdb" /libs:static /threads /Qmkl:parallel /c

[ Screenshot 1 ]

Attachment: diagtestapp1.jpg (152.25 KB)

[ Screenshot 2 ]

Attachment: diagtestapp2.jpg (158.48 KB)

Number of OMP and KMP threads vs. calculation time ( 4 cores / 8 logical CPUs ):

04 - Calculated ( in seconds ): ~338
08 - Calculated ( in seconds ): ~338
16 - Calculated ( in seconds ): ~330
32 - Calculated ( in seconds ): ~329
64 - Calculated ( in seconds ): Failed to calculate and errors are as follows:

...
OMP: Error #136: Cannot create thread.
OMP: System error #1455: The paging file is too small for this operation to complete.
OMP: Error #178: Function GetExitCodeThread() failed:
OMP: System error #6: The handle is invalid.
...

[ Screenshot 3 - OMP Errors: 136, 1455 and 178 ]

Attachment: pagingfiletoosmall.jpg (189.11 KB)

Even though the previous post is not related directly to the subject of the thread, I'll provide a reproducer and instructions on how the problem can be reproduced.

>>...on a 4 core platform mkl defaults to 4 threads even if 8 logical are visible...

Tim,
As you can see in the screenshots in my previous posts, 64 worker threads were created, plus one thread for the main application (65 in total).

>>...when running MKL DYSEV on my system even on 4 threads, all the threads spend most of the time idle and about 1% of
>>the time in DYSEV...

James,
Utilization of all 8 logical cores was ~100%, and this is simply proof that there is some issue with NUMA.

Sergey,

Sorry, but your tests only demonstrate that your Diag.exe does not scale beyond 4 threads at all (the times with 8, 16 and 32 threads are the same as with 4 threads), probably because you only seem to have 8 cores in the system. And the 100% utilisation of all cores does not say anything about issues with NUMA; idle spinning can produce that just as well.

A.kaliazin,

I stated from the beginning of the investigation that a set of tests would be done on an Ivy Bridge system with 4 cores and 8 logical CPUs. Another comment of mine was as follows:

...
...I could only confirm that performance scaling for cases with 1 CPU, 2 CPUs and 4 CPUs looks right. Unfortunately,
I don't have a system with greater than 4 CPUs...
...

However, I know what CPU-, memory-, or I/O-bound processing looks like, and my other statement regarding CPU utilization was:

...
...It looks like processing in that case is Memory or I/O bound, and it is Not CPU bound...
...

I know that my tests could be considered generic because I don't have a NUMA system. If you have a NUMA system, please try the test application.

As MKL can use the resources of the 4 cores fully with 1 thread per core, it's hardly surprising that more threads don't improve performance.
