Best practices for calling MKL from C#

I have a few newbie questions about the best practices for using MKL in a managed environment.

  • It's said that by default, MKL will try to pick the best number of threads. Does this mean it is done every time a BLAS function is invoked, because the functions are stateless?
    • If not, could you please explain what actually happens?
    • If yes, can I request MKL to estimate the best number of threads once, and use that number for all subsequent function calls?
  • Are there any tricks for minimizing the overhead of calling MKL from a managed environment?

Thanks.

To the best of my knowledge, you can limit the number of threads by setting OMP_NUM_THREADS or MKL_NUM_THREADS. If you always want 1 thread per MKL call, linking mkl_sequential could be more efficient. When HyperThreading is recognized, MKL will not choose more threads than physical cores unless you override MKL_DYNAMIC. MKL can choose a smaller number of threads on each call according to the size of the problem. It will not remember a choice from a previous call, and I doubt you could capture its choice of thread count in order to lower the limit you have set.
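
For a C# host, one way to apply these settings without touching the system configuration might be to set the variables inside the process before MKL is first loaded. This is a minimal sketch: the class name `MklEnvironment` and the policy of using `Environment.ProcessorCount` are illustrative assumptions, and it assumes the native runtime reads its environment only when it initializes, so the call has to come before the first P/Invoke into MKL.

```csharp
using System;

static class MklEnvironment
{
    // Call once at startup, before the first P/Invoke into mkl_rt,
    // so the native runtime can see the values when it initializes (assumption).
    public static void Configure(int maxThreads, bool dynamic)
    {
        Environment.SetEnvironmentVariable("MKL_NUM_THREADS", maxThreads.ToString());
        Environment.SetEnvironmentVariable("MKL_DYNAMIC", dynamic ? "TRUE" : "FALSE");
    }

    static void Main()
    {
        // Hypothetical policy: cap MKL at the logical processor count reported by .NET.
        Configure(Environment.ProcessorCount, dynamic: true);

        // Echo the values back to confirm they are set for this process.
        Console.WriteLine("MKL_NUM_THREADS=" + Environment.GetEnvironmentVariable("MKL_NUM_THREADS"));
        Console.WriteLine("MKL_DYNAMIC=" + Environment.GetEnvironmentVariable("MKL_DYNAMIC"));
    }
}
```

Setting the same variables in the shell or in the system environment before launching the application avoids the ordering concern entirely.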

I'm not sure what you mean about reducing the overhead from a managed environment. As always, you could tinker with KMP_BLOCKTIME if you want to keep the thread pool active for more or less time than the default 200 milliseconds, which may help you take advantage of your KMP affinity settings.

iliyapolak wrote:

>>>I'm not sure what you mean about reducing the overhead from a managed environment>>>

Probably the overhead of JIT compilation of the managed code, where the .NET Framework runtime (implemented in a DLL) translates bytecode into machine code.

iliyapolak wrote:

>>>Are there any tricks for minimizing the overhead of calling MKL from a managed environment?>>>

I suppose that the .NET Framework JIT compiler will be able to optimize hot spots in the managed code. For example, where a C# wrapper function calls into an MKL library function, the MSIL call instruction will be translated into a native call instruction and cached, so the JIT compiler will not repeat the translation the next time.

>>...It's said that by default, MKL will try to pick the best number of threads...

That can be considered a correct statement when the /Qmkl:parallel compiler option is used. In that case a threaded version of MKL will be used and the maximum number of threads will be created for processing.

[ From Intel C++ compiler help ]
...
/Qmkl[:]
link to the Intel(R) Math Kernel Library (Intel(R) MKL) and bring
in the associated headers
parallel - link using the threaded Intel(R) MKL libraries. This
is the default when /Qmkl is specified
...

For example, if an application is compiled with a threaded version of MKL ( compiler option /Qmkl:parallel ) and runs on an Intel CPU with 4 cores ( 8 logical CPUs ), then 8 threads will be created for processing. Note that the total number of threads in the application will then be 9:

- 1 thread is for main process
- 8 threads for processing of some MKL functionality

Note: It is verified on a system with Intel Core i7-3840QM / Ivy Bridge / 4 cores / 8 logical CPUs / ark.intel.com/compare/70846.

>>...Are there any tricks for minimizing the overhead of calling MKL from a managed environment?..

If you're experiencing performance issues when calling MKL from a managed environment, then you need to consider a pure C/C++ application that uses MKL. I don't think it is possible to get rid of the overhead of the additional software layer ( the managed environment ) on which .NET is based.

iliyapolak wrote:

It is hard to say without testing what performance penalty to expect when MKL code is called from a .NET application, but I think what really matters is that MKL is compiled to native machine code and is thoroughly optimized. So there will be some overhead in transferring control from .NET to MKL.
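
The transition cost itself can sometimes be trimmed on the managed side. The sketch below is one possible approach, not a measured recommendation: it assumes the default LP64 interface of mkl_rt (so MKL_INT maps to a 32-bit `int`), uses the standard CBLAS enum values 101 (row major) and 111 (no transpose), and must be compiled with unsafe code enabled. `[SuppressUnmanagedCodeSecurity]` skips the security stack walk on .NET Framework, and pinning the arrays once with `fixed` keeps the marshaller from pinning them again on each of the many calls. The class names and the tiny 2x3 problem are illustrative only.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Security;

[SuppressUnmanagedCodeSecurity] // avoids the per-call security stack walk on .NET Framework
static unsafe class Native
{
    // Raw pointers are blittable, so no per-call array marshalling is needed.
    [DllImport("mkl_rt", CallingConvention = CallingConvention.Cdecl, ExactSpelling = true)]
    public static extern void cblas_dgemv(int layout, int trans, int m, int n,
        double alpha, double* a, int lda, double* x, int incx,
        double beta, double* y, int incy);
}

static class Program
{
    const int CblasRowMajor = 101; // CBLAS layout enum value
    const int CblasNoTrans = 111;  // CBLAS transpose enum value

    static unsafe void Main()
    {
        const int m = 2, n = 3;
        double[] a = { 1, 2, 3, 4, 5, 6 }; // 2x3 matrix, row major
        double[] x = { 1, 1, 1 };
        double[] y = new double[m];

        // Pin the buffers once for the whole loop instead of once per call.
        fixed (double* pa = a, px = x, py = y)
        {
            for (int i = 0; i < 1000000; i++)
            {
                // y = 1.0 * A * x + 0.0 * y
                Native.cblas_dgemv(CblasRowMajor, CblasNoTrans, m, n,
                                   1.0, pa, n, px, 1, 0.0, py, 1);
            }
        }

        Console.WriteLine(y[0] + " " + y[1]);
    }
}
```

For very small matrices the fixed cost of each native call can dominate the arithmetic, so batching work into fewer, larger MKL calls is likely to matter more than any attribute.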

Regarding the number of threads:

I do want to let MKL decide the best number of threads. However, I don't want MKL to do it in every function call. So the question is: if the cost of determining the best number of threads is not negligible, can I set things up so that MKL determines that number only once?

Quote:

MKL_DYNAMIC

MKL_DYNAMIC being TRUE means that Intel MKL will always try to pick what it considers the best number of threads, up to the maximum specified by the user. MKL_DYNAMIC being FALSE means that Intel MKL will not deviate from the number of threads the user requested, unless there are reasons why it has no choice. The value of MKL_DYNAMIC is by default set to TRUE, regardless of OMP_DYNAMIC, whose default value may be FALSE.

From: http://software.intel.com/en-us/articles/intel-math-kernel-library-intel-mkl-intel-mkl-100-threading#1

>>...I don't want MKL to do it in every function call...

Take into account that MKL could be used in 3 ways:

- sequential, or
- parallel, or
- cluster

So, it is not possible to mix, for example, sequential with parallel, and so on. Another thing is that not all MKL functions are threaded.

I didn't get your message. Or maybe I didn't make myself clear enough. So let me explain my question again.

In my C# code, I have

[DllImport("mkl_rt")]
public static extern void cblas_dgemv(/* arguments omitted */);

The main program calls cblas_dgemv about 1 million times. I don't know whether MKL tries to pick the best number of threads 1 million times or not. It seems like it does, because the reported runtime for each iteration varies quite a bit.

In that case, I just want MKL to pick the optimal number of threads once and use that number in every dgemv call.

>>... I don't know if MKL tries to pick the best number of threads...

There is a very simple way to verify it:

- Open Windows Task Manager
- Make sure that the Threads column is shown on the Processes tab
- Start your application
- Start processing
- Monitor how many threads ( actually OpenMP threads, which are based on Win32 threads on Windows platforms ) are used during processing
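
The same check can be made from inside the C# application. This is a small sketch using `System.Diagnostics.Process`; the class name `ThreadCountProbe` is illustrative, and the MKL calls themselves are left out. The count covers every OS thread in the process, including the CLR's own helper threads, so only the difference between the two readings says anything about MKL.

```csharp
using System;
using System.Diagnostics;

static class ThreadCountProbe
{
    // Number of OS threads currently in this process (managed and native).
    static int CurrentThreadCount()
    {
        using (Process p = Process.GetCurrentProcess())
        {
            return p.Threads.Count;
        }
    }

    static void Main()
    {
        int before = CurrentThreadCount();
        Console.WriteLine("Threads before MKL work: " + before);

        // ... call into MKL here (omitted in this sketch) ...

        int after = CurrentThreadCount();
        Console.WriteLine("Threads after MKL work:  " + after);
        Console.WriteLine("Difference: " + (after - before));
    }
}
```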

>>...It seems like it because the runtime report for each iteration varies quite a bit...

Post it for review.

>>... I just want MKL picking the optimal number of threads only once...

In that case, before processing starts, try calling the OpenMP function omp_set_num_threads( uiNumThreads ); uiNumThreads can be set from 1 to 16,384 with a stack size of 32KB.
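
From C#, a comparable effect might be obtained through MKL's own service functions instead of the OpenMP runtime. The sketch below assumes that mkl_rt exports the C entry points `MKL_Set_Dynamic`, `MKL_Set_Num_Threads` and `MKL_Get_Max_Threads` under exactly these names; the value 4 is a hypothetical choice that you would replace with a number found by benchmarking. Note that this fixes a thread count chosen by the caller. It does not ask MKL to compute a best value once, since, as mentioned earlier in the thread, MKL does not expose or remember its per-call choice.

```csharp
using System;
using System.Runtime.InteropServices;

static class MklService
{
    // Assumed to be exported by mkl_rt under these names (C interface, by-value arguments).
    [DllImport("mkl_rt", CallingConvention = CallingConvention.Cdecl)]
    public static extern void MKL_Set_Dynamic(int enabled);

    [DllImport("mkl_rt", CallingConvention = CallingConvention.Cdecl)]
    public static extern void MKL_Set_Num_Threads(int threads);

    [DllImport("mkl_rt", CallingConvention = CallingConvention.Cdecl)]
    public static extern int MKL_Get_Max_Threads();
}

static class Program
{
    static void Main()
    {
        int threads = 4; // hypothetical value; pick it by timing your own workload

        // Turn off the dynamic adjustment so MKL keeps to the requested count
        // where it can, then set that count once for all later calls.
        MklService.MKL_Set_Dynamic(0);
        MklService.MKL_Set_Num_Threads(threads);

        Console.WriteLine("MKL max threads: " + MklService.MKL_Get_Max_Threads());

        // ... the cblas_dgemv loop would follow here ...
    }
}
```

If every call works on a small problem, a fixed count of 1 is worth including in the benchmark, since thread start-up and synchronization can cost more than the computation.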

iliyapolak wrote:

There is also a more advanced method to verify whether OpenMP threads are mapped directly to raw Windows threads. It involves an API monitor and/or the logexts.dll tracking extension for WinDbg: you simply observe the number of calls to the CreateThread function issued from the MKL library. For many users this is overkill, so, as Sergey suggested, you can use Task Manager to verify the number of created threads.
