Pardiso on WIN64 using only one thread

Hello,

I have the exact same code running on Linux64 and Win64. Everything works well in Linux64. But in Win64, even though I set OMP_NUM_THREADS and MKL_NUM_THREADS to 2, Pardiso reports

< Parallel Direct Factorization with #processors: > 1

And this happens both with in-core as well as out-of-core. I'm using version 10.3, build 20110314.

Do I need to do anything else other than set the above 2 environment variables?

Thanks.

That's strange. You don't need to do anything else. What is the problem size and the type of matrix you are solving? We need to check it.

--Gennady

Gennady,

The matrix has about 200,000 equations with about 8 million nonzeros. It's a symmetric indefinite matrix. Right before the first call to Pardiso I print

OMP_NUM_THREADS= 2
MKL_NUM_THREADS= 2
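A minimal sketch of how such a printout could be produced in C. The helper names `describe_env` and `print_thread_env` are hypothetical, not part of MKL. Note that these values only show what the environment requests; the thread count actually used is whatever PARDISO itself reports.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Formats "NAME=value" (or "NAME=(unset)") into buf.
   Returns 1 if the variable is set, 0 otherwise. */
int describe_env(const char *name, char *buf, size_t len)
{
    const char *value;
    assert(name != NULL && buf != NULL && len > 0);
    value = getenv(name);
    snprintf(buf, len, "%s=%s", name, value ? value : "(unset)");
    return value != NULL;
}

/* Hypothetical helper: call just before the first PARDISO call
   to log the thread-related environment settings. */
void print_thread_env(void)
{
    static const char *names[] = { "OMP_NUM_THREADS", "MKL_NUM_THREADS" };
    char line[256];
    size_t i;
    for (i = 0; i < sizeof names / sizeof names[0]; ++i) {
        describe_env(names[i], line, sizeof line);
        puts(line);
    }
}
```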

Here's all the printout:

=== PARDISO is running in In-Core mode, because iparam(60)=0 ===

================ PARDISO: solving a symmetric indef. system ================
The local (internal) PARDISO version is : 103000115
1-based array indexing is turned ON
PARDISO double precision computation is turned ON
METIS algorithm at reorder step is turned ON
Single-level factorization algorithm is turned ON
Scaling is turned ON

Summary PARDISO: ( reorder to reorder )
================

Times:
======
Time spent in calculations of symmetric matrix portrait(fulladj): 0.233544 s
Time spent in reordering of the initial matrix(reorder) : 3.419899 s
Time spent in symbolic factorization(symbfct) : 0.900987 s
Time spent in allocation of internal data structures(malloc) : 0.139583 s
Time spent in additional calculations : 1.692345 s
Total time spent : 6.386359 s

Statistics:
===========
< Parallel Direct Factorization with #processors: > 1
< Hybrid Solver PARDISO with CGS/CG Iteration >

< Linear system Ax = b>
#equations: 219057
#non-zeros in A: 7798701
non-zeros in A (%): 0.016252

#right-hand sides: 0

< Factors L and U >
#columns for each panel: 128
#independent subgraphs: 0
< Preprocessing with state of the art partitioning metis>
#supernodes: 29451
size of largest supernode: 3438
number of nonzeros in L 77824417
number of nonzeros in U 1
number of nonzeros in L+U 77824418
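As a sanity check on the statistics above, the 0.016252 figure is consistent with the reported nonzero count expressed as a percentage of all n*n matrix entries. A small sketch, with `density_percent` being a hypothetical helper name and the formula being an inference from the numbers rather than something stated in the output:

```c
#include <assert.h>
#include <stdio.h>

/* Percentage of nonzero entries in an n-by-n matrix,
   given the reported number of nonzeros nnz. */
double density_percent(long long n, long long nnz)
{
    assert(n > 0 && nnz >= 0);
    return 100.0 * (double)nnz / ((double)n * (double)n);
}

/* Prints the density for the system reported in this thread. */
void print_reported_density(void)
{
    printf("density = %f %%\n", density_percent(219057LL, 7798701LL));
}
```

With n = 219057 and nnz = 7798701 the result rounds to the value in the printout.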

Gennady,

I added the call

mkl_set_num_threads(2);

right before calling Pardiso the first time. Still, I get #processors: 1. This happens in both Win32 and Win64.

Any thoughts? Could I have inadvertently set some parameter incorrectly that could be triggering this behavior?

-Arthur

Hi Arthur,

Does your Win32/Win64 system have 2 or more physical cores? MKL checks this and sets the number of threads to 1 if the system has only 1 physical core.

Regards,
Konstantin

Konstantin,

The machine has 2 physical processors. When running Pardiso the Windows task manager displays usage at around 50%.

Is there a way to turn on some internal debugging so we can get more information on this as MKL is running?

-Arthur

Just in case: if you mean you have hyperthreading enabled, remember that MKL tries to maximize performance by using just 1 thread per pair of hyperthread logical processors, unless you override this by setting MKL_DYNAMIC. The term "physical processor" is more likely to refer to a complete core, which would support a pair of logical processors when hyperthreading is enabled.

Tim,

Thanks for the reply.

I'll be very honest: your answer blew me away. I'm new to OMP, so I had never heard of either OMP_DYNAMIC or MKL_DYNAMIC before.

To the best of my knowledge, my machine has 2 processors, and I assume each has a single core. Each processor is an Intel Xeon, and Dell describes them as "C8508 Processor, 80546K, 3.0G, 2M, XNI 800, N0", where C8508 is the Dell part number (probably not too useful for you).

Given that, I tried mkl_set_dynamic(0) and mkl_set_dynamic(1). It made no difference. In both cases, during the matrix factorization, I only see one processor at work (task manager showing 50% utilization).

1) Should I see any difference between the two mkl_set_dynamic calls?

2) Is the fact that I see only 50% utilization of the CPU with the task manager a true indication that only one CPU is being used? I always believed this is the case, but maybe I don't have all the facts.

3) Is there a way to *force* MKL to use 2 processors, even if it believes it's better off with only 1? All I want to see is that everything is being done correctly. Once I know that's the case then I'll let MKL make its own smarter decisions.

Thanks again.

-Arthur

Apparently, it's an "Irwindale" single-core Hyper-Threading CPU. These were probably available in both dual-CPU and single-CPU platforms. Typically, floating-point performance of the dual-CPU platform was reduced by 15% when Hyper-Threading was left enabled, even on Linux (worse on Windows, not so bad on single CPU). You can check your BIOS setup screen to see whether Hyper-Threading is enabled. If it is enabled and you see just 2 processors in Task Manager, there's only 1 CPU, and running 1 thread would show 50% in Task Manager, even though you get more performance than you would with 2 threads.

OpenMP dynamic is a different facility from MKL dynamic.
I think I've confused you about MKL_DYNAMIC. See this earlier post specifically about how to get MKL to use all the HyperThreads by setting MKL_DYNAMIC=FALSE and specifying MKL_NUM_THREADS.

Tim,

Here's the information:

Number of processors = 2
Multi-core capable = NO
Hyperthreading capable = YES

Hyperthreading is OFF (it's the factory default and has never been changed)

Given the above information, I believe that what the Task Manager is showing me is the 2 CPUs on the machine.

So now, the question: what do I need to do - if at all possible - to get Pardiso to use the 2 processors in parallel?

Thanks.

-Arthur

Hi Arthur,

From the information you provided, it seems your computer is not multi-core. So the MKL strategy is to use only 1 thread to achieve optimal performance. If you still want to use 2 threads, please call the following functions prior to calling any MKL function (but you will most likely not get a performance improvement):

mkl_set_num_threads( 2 );
mkl_set_dynamic( false );

or set the environment variables:

set MKL_NUM_THREADS=2
set MKL_DYNAMIC=false

Regards,
Konstantin

Konstantin,

Does this mean that MKL will not parallelize across multiple processors? If I had a 4-processor machine, each processor single core, I wouldn't be able to benefit from the parallelization in MKL?

-Arthur

Hi Arthur,

MKL is able to run in parallel across multiple processors. MKL sets the number of threads equal to the total number of physical cores available in your system. In your case, the number of physical cores is equal to 1 on Windows:

Number of processors = 2 // Number of logical processors is 2
Multi-core capable = NO // No multi-core, which means 1 physical core
Hyperthreading capable = YES // Hyperthreading, which means 2 logical processors per physical core

To make sure, you can run the 'systeminfo' command under 'cmd' and report the info about the processor. I just tried running PARDISO on my dual-core laptop (physically dual-core) and it reported 2 threads.

As for your example (a 4-processor machine, each processor single core): did you mean a 4-socket system? I'm sure that MKL will use 4 threads in this case, as 4 cores will be available.

Best regards,
Konstantin

Konstantin,

Here's the relevant result from systeminfo:

System Manufacturer: Dell Inc.
System Model: Precision WorkStation 670
System Type: x64-based PC
Processor(s): 2 Processor(s) Installed.
[01]: EM64T Family 15 Model 4 Stepping 3 GenuineIntel ~2993 Mhz
[02]: EM64T Family 15 Model 4 Stepping 3 GenuineIntel ~2993 Mhz

I guess I'm still confused. Is there any combination of parameters/environment variables that I can set on this machine that will show #processors = 2? Or is this misleading and it is really using both processors?

-Arthur

OK, it seems this is not the best way to determine precise information about the system. In fact, I checked that systeminfo reports logical processors (so it will give the same information for 2 single-core processors, 1 dual-core processor and, for instance, a single-core processor with hyperthreading ON). However, I checked MKL on a system with 2 single-core processors (rather old Nocona processors) and MKL PARDISO reported 2 threads.

Let's try another way to obtain precise info about your system: can you install the free CPU-Z tool, available here?

http://www.cpuid.com/softwares/cpu-z.html

On the CPU tab it reports (at the very bottom) the number of Cores and Threads for each processor. If the two numbers differ, it means that hyperthreading is ON on your system.

Regards,
Konstantin

Konstantin,

Here's the information from CPU-Z:

Processor #1:
Core Speed: 2793.1 MHz
Multiplier x14.0
Bus Speed 199.5MHz
Rated FSB 798.1 MHz
L1 Data 16 KBytes 8-way
Trace 12 Kuops 8-way
Level 2 2048 KBytes 8-way
Cores 1
Threads 1

The data for Processor #2 is identical.
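The reasoning used in this thread to interpret the Cores/Threads counts can be sketched as a few small C functions. This is only an illustration of the arithmetic; the struct and function names are hypothetical and nothing here queries the hardware.

```c
#include <assert.h>
#include <stdio.h>

/* Per-package counts as read from the bottom of the CPU-Z "CPU" tab. */
struct package_info {
    int cores;   /* physical cores in the package */
    int threads; /* logical processors in the package */
};

/* Sum of physical cores over all packages (sockets). */
int total_physical_cores(const struct package_info *pkgs, int count)
{
    int i, total = 0;
    assert(pkgs != NULL && count > 0);
    for (i = 0; i < count; ++i)
        total += pkgs[i].cores;
    return total;
}

/* Sum of logical processors over all packages. */
int total_logical_processors(const struct package_info *pkgs, int count)
{
    int i, total = 0;
    assert(pkgs != NULL && count > 0);
    for (i = 0; i < count; ++i)
        total += pkgs[i].threads;
    return total;
}

/* Returns 1 if any package exposes more threads than cores,
   which indicates that hyperthreading is ON. */
int hyperthreading_enabled(const struct package_info *pkgs, int count)
{
    int i;
    assert(pkgs != NULL && count > 0);
    for (i = 0; i < count; ++i)
        if (pkgs[i].threads > pkgs[i].cores)
            return 1;
    return 0;
}

/* Overall CPU utilization (in percent) shown when exactly one
   thread is fully busy on a machine with `logical` processors. */
double one_thread_utilization_percent(int logical)
{
    assert(logical > 0);
    return 100.0 / (double)logical;
}
```

For two packages with Cores 1 and Threads 1 each, this gives 2 physical cores, hyperthreading off, and 50% utilization for a single busy thread, which matches what Task Manager shows.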

-Arthur

OK, now it seems really strange...

Could you please run the following program on your Windows machine (I compiled it under MS VS 2008):

#include "stdafx.h"
#include "mkl.h"

int _tmain(int argc, _TCHAR* argv[])
{
    /* Thread count MKL picks by default */
    printf("\nthreads = %d\n", mkl_get_max_threads());

    /* Request 1, 2 and 4 threads and print what MKL reports each time */
    mkl_set_num_threads(1);
    printf("\nthreads = %d\n", mkl_get_max_threads());
    mkl_set_num_threads(2);
    printf("\nthreads = %d\n", mkl_get_max_threads());
    mkl_set_num_threads(4);
    printf("\nthreads = %d\n", mkl_get_max_threads());

    /* Disable dynamic adjustment of the thread count */
    mkl_set_dynamic(false);
    printf("\nthreads = %d\n", mkl_get_max_threads());
    return 0;
}

Konstantin,

Here's the output of your program. I also echoed the important environment variables before running the program.

C:\>set OMP_NUM_THREADS
OMP_NUM_THREADS=2

C:\>set MKL_NUM_THREADS
MKL_NUM_THREADS=2

C:\>exam1.exe

threads = 1

threads = 1

threads = 1

threads = 1

threads = 1

Arthur, thank you for the information!

It looks either like a bug or like you've linked MKL with the sequential layer (mkl_sequential.lib instead of mkl_intel_thread.lib). Could you please report your link line? If you use Visual Studio, it would be great if you sent the contents of the "Project -> Properties -> Linker -> Command Line" item of the main menu.

Regards,
Konstantin
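The symptom discussed here (the reported thread count stays at 1 no matter what is requested) can be turned into a small reusable check. The sketch below is written against function pointers so it compiles without MKL; the `fake_*` functions are hypothetical stand-ins for a sequential and a threaded layer, and `responds_to_thread_request` is an illustrative name, not an MKL function.

```c
#include <assert.h>
#include <stdio.h>

typedef void (*set_threads_fn)(int);
typedef int (*get_max_threads_fn)(void);

/* Requests `wanted` threads (wanted > 1) and returns 1 if the library
   then reports more than one thread, 0 if it stays at one thread. */
int responds_to_thread_request(set_threads_fn set_threads,
                               get_max_threads_fn get_max,
                               int wanted)
{
    assert(set_threads != NULL && get_max != NULL && wanted > 1);
    set_threads(wanted);
    return get_max() > 1;
}

/* Stand-in for a sequential layer: ignores every request. */
void fake_sequential_set(int n) { (void)n; }
int fake_sequential_get(void) { return 1; }

/* Stand-in for a threaded layer: remembers the request. */
static int fake_thread_count = 1;
void fake_threaded_set(int n) { fake_thread_count = n; }
int fake_threaded_get(void) { return fake_thread_count; }
```

In a real program one would presumably pass `mkl_set_num_threads` and `mkl_get_max_threads` (the two calls used in the test program in this thread), wrapping them in small functions if they cannot be passed as function pointers directly. A result of 0 suggests checking the link line for a sequential library.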

Konstantin,

BINGO!!! I think you got to the bottom of the problem!!! Here's what I'm linking with:

mkl_solver_lp64_sequential.lib mkl_intel_lp64.lib mkl_sequential.lib mkl_core.lib

I'll look at the documentation to check the list of libraries I need to include.

Thanks!

-Arthur

Well,

I'm clearly still doing something wrong. I'm now getting the following error:

MKL FATAL ERROR on load the function mkl_blas_xdswap

I guess I need some guidance on which libraries EXACTLY to use if I'm compiling with VS 2008, with both OpenMP and with Windows threads, on both 32 and 64 bit platforms.

In fact, if you could tell me all environment variables I have to set for my command prompt mode that would also help.

Thanks.

-Arthur

Please see the Intel MKL Link Line Advisor.

Gennady,

I now have the BLAS half-running properly.

I have 2 ways of building my code: (1) inside Visual Studio 2008 and (2) from the command line.

When I build everything from the command line everything works fine. However, when I run within Visual Studio in DEBUG mode I simply get wrong answers. The BLAS routines in my code (not using Pardiso now... just testing BLAS for the moment) are generating incorrect results when run from within VS.

These are my compiler flags inside VS:

/Od
/I "C:\Program Files (x86)\Intel\ComposerXE-2011\mkl\include"
/D "WIN32"
/D "_DEBUG"
/D "_LIB"
/FD
/EHsc
/RTC1
/MD
/openmp
/Fo"x64\Debug\" /Fd"x64\Debug\vc90.pdb"
/W3 /nologo /c /Wp64 /ZI /errorReport:prompt

And these are the relevant link flags, and, at the very end, the Intel libraries I'm using.

/INCREMENTAL:NO
/NOLOGO
/LIBPATH:"C:\Program Files (x86)\Intel\ComposerXE-2011\mkl\lib\intel64"
/LIBPATH:"C:\Program Files (x86)\Intel\ComposerXE-2011\compiler\lib\intel64"
/MANIFEST
/MANIFESTFILE:"x64\Debug\exam.exe.intermediate.manifest"
/MANIFESTUAC:"level='asInvoker' uiAccess='false'"
/DEBUG
/SUBSYSTEM:CONSOLE
/LARGEADDRESSAWARE
/DYNAMICBASE:NO
/FIXED:No
/MACHINE:X64
/ERRORREPORT:PROMPT
mkl_intel_lp64.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib

Any thoughts on what may be causing the BLAS in my code to generate incorrect results?

Also, I found out that if I set MKL_NUM_THREADS=1 along with OMP_NUM_THREADS=1 then it all works fine. I have the same code running on Linux, and there it works just fine with more than 1 thread.

Thanks.

-Arthur

Hi Arthur,

I'm glad that the linking problem has been resolved. OK, let's go further.

Am I right that you have code calling BLAS that works well on Linux and works well on Windows when compiled from the command line? And that it fails when compiled from VS, but works well everywhere if you set MKL_NUM_THREADS=1 and OMP_NUM_THREADS=1?

A few questions:
- Is the code in C or Fortran?
- Which compiler do you use on the command line and with VS? Intel, MS?
- Please send how you compiled your code via the command line.
- Can you send the code (or a piece of the code which can be used as a reproducer)?

Regards,
Konstantin

Konstantin,

First of all, thanks for all the help so far. You've been incredible. I truly appreciate it.

Now for your questions:

1) Yes: the code works on Linux, and also works on Windows when compiled from the command line. It also works when MKL_NUM_THREADS = OMP_NUM_THREADS = 1. The fact that it does work with more than one thread if built from the command line is reassuring; it tells me that I don't have a data conflict of any kind in my code, and the problem may be elsewhere.

2) The program is in C.

3) I use VS compiler, 2008

4) Unfortunately, I'm unable to send the code - it's many thousands of lines of code with BLAS calls throughout its many parts. But this prompted me to consider the alternative: a stand-alone program that calls the BLAS, that I can build inside and outside VS and that I can share with you. I'll try to find some time to do that.

5) One important point I should share: from the command line I'm building optimized code. In VS, a debug version. I had trouble locating the Microsoft OpenMP DLL and after a few online searches concluded that one had to "#undef _DEBUG" right before "#include ". I wonder whether this is an issue. If I don't do this, I get the error "VCOMP90.DLL not found" at runtime. If there's a way to get around this problem without the "#undef" hack I described above, I'd love to know it. It could be that this is what's causing my problems.

6) From the command line, these are the flags I use to compile:

/O2
/w
/W4
/EHsc
/nologo
/c
/openmp
/LD
/EHsc
/D_CONSOLE
/D__LIB__
/DFAR=
/D_WINDOWS
/MD

Maybe one issue is the include files - more specifically, omp.h. I am compiling with /openmp and I'm using VS compiler. However, in order to link with the BLAS I'm linking with libiomp5md.lib. Could this be an issue?

I also did the following: the only BLAS Level 3 function I'm using - and which will use both processors - is DGEMM. I have very few locations in my code where this function is called, and the matrices are not that large (at least, not in the small problem I'm running). So I wrote the arguments A, B, C, M, N, K, alpha, beta, LDA, LDB, and LDC to a file, with C being written both before and after DGEMM is called.

Then I compared the results for OMP_NUM_THREADS/MKL_NUM_THREADS set to 1 and 2. Up to a point the numbers are identical, but after a few calls, the values before the calculation are all identical, but the resulting value of C is different (!!!) The scalar parameters are as follows:

M= 69, N= 3, K= 12, LDA= 81, LDB= 3, LDC= 69
alpha= -1.00000000000000e+000, beta= 1.00000000000000e+000

A, B, and C are identical before DGEMM is called. But the result C is different in the two cases.
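One way to decide which of the two results is right is to compare both against a plain reference computation. The sketch below assumes column-major storage and the "N", "T" case (C := alpha*A*B^T + beta*C); `ref_dgemm_nt` and `max_abs_diff` are hypothetical helpers, not BLAS routines. Threaded and sequential results may legitimately differ by rounding because the summation order can change, so the comparison should use a tolerance rather than exact equality.

```c
#include <assert.h>
#include <stddef.h>

/* Reference C := alpha*A*B^T + beta*C, column-major storage.
   A is m-by-k (lda >= m), B is n-by-k (ldb >= n), C is m-by-n (ldc >= m). */
void ref_dgemm_nt(int m, int n, int k, double alpha,
                  const double *a, int lda,
                  const double *b, int ldb,
                  double beta, double *c, int ldc)
{
    int i, j, p;
    assert(lda >= m && ldb >= n && ldc >= m);
    for (j = 0; j < n; ++j) {
        for (i = 0; i < m; ++i) {
            double sum = 0.0;
            for (p = 0; p < k; ++p)   /* B^T(p,j) is stored as B(j,p) */
                sum += a[i + (size_t)p * lda] * b[j + (size_t)p * ldb];
            c[i + (size_t)j * ldc] = alpha * sum + beta * c[i + (size_t)j * ldc];
        }
    }
}

/* Largest absolute entrywise difference between two m-by-n matrices. */
double max_abs_diff(int m, int n,
                    const double *x, int ldx,
                    const double *y, int ldy)
{
    int i, j;
    double worst = 0.0;
    for (j = 0; j < n; ++j) {
        for (i = 0; i < m; ++i) {
            double d = x[i + (size_t)j * ldx] - y[i + (size_t)j * ldy];
            if (d < 0.0)
                d = -d;
            if (d > worst)
                worst = d;
        }
    }
    return worst;
}
```

Feeding the dumped A, B, and initial C through `ref_dgemm_nt` and calling `max_abs_diff` against each dumped result would show whether the 1-thread or the 2-thread output departs from the reference by more than rounding error.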

Now what?

Thanks (again and again...)

-Arthur

Arthur,

- It would be great if you could give us an example which we can check on our side.
- Second, the leading dimension LDA == 81. Is that correct?

--Gennady

Gennady,

I believe the data I sent you is correct, including LDA. Do you see a problem with this number?

I'm attaching 2 files. Both contain the same A and B (C is initially zero). One file has the resulting C when running in 1 processor and the other one with 2 processors.

Also, the first 2 arguments to DGEMM are "N" and "T".
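For reference, the standard DGEMM constraints on the leading dimensions (column-major storage) can be checked with a few lines of C. `dgemm_leading_dims_ok` is a hypothetical helper, not a BLAS or MKL routine.

```c
#include <assert.h>

static int max_int(int a, int b) { return a > b ? a : b; }

/* Returns 1 if lda, ldb and ldc satisfy the DGEMM requirements for
   column-major storage, 0 otherwise. transa/transb are 'N' for
   no transpose; any other value is treated as transposed. */
int dgemm_leading_dims_ok(char transa, char transb,
                          int m, int n, int k,
                          int lda, int ldb, int ldc)
{
    /* Rows of A and B as they are stored in memory */
    int rows_a = (transa == 'N' || transa == 'n') ? m : k;
    int rows_b = (transb == 'N' || transb == 'n') ? k : n;
    assert(m >= 0 && n >= 0 && k >= 0);
    return lda >= max_int(1, rows_a)
        && ldb >= max_int(1, rows_b)
        && ldc >= max_int(1, m);
}
```

With "N", "T", M = 69, N = 3, K = 12, LDA = 81, LDB = 3, LDC = 69 the check passes: LDA only has to be at least M, so 81 is a legal value when A is a 69-row block of a larger 81-row array.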

-Arthur

Attachment: dgemm.out.tar.gz (application/octet-stream, 26.96 KB)

Hi Arthur,

It looks like using the MS and Intel OpenMP libraries in a single application is the issue. Does your code need OpenMP, or is it just needed for MKL? If it's needed for MKL only, I would try switching off any OpenMP flags in the MS compiler and just linking with the Intel MKL libraries and with libiomp5.

And another thing: I would recommend using the Intel C/C++ compiler if you use OpenMP, MKL, etc.

Regards,
Konstantin

Konstantin,

I tested your hypothesis, and you are right. If I turn off the OpenMP from VS but still use MKL_NUM_THREADS=2, then everything works fine.

I guess this means that I either use the Intel Compiler with OpenMP if I want both OpenMP in my code and the parallel BLAS - which I haven't tested yet; or I have to sacrifice either the use of OpenMP in my code or the parallel BLAS from MKL.

Honestly, neither is a very good solution. Ideally, I should be able to use MKL without having to sacrifice any of the tools I'm currently using. I guess there's no workaround for this, is there?

-Arthur
