MKL 10.3 memory use when called from multiple (boost) threads

I can optionally split my computation (which calls MKL) across multiple threads using boost::thread, or I can run a single "worker" routine from the main thread.

I am finding that MKL does not free memory when I use boost::thread (these map to standard Windows thread objects).

The basic loop is

LOOP i = 1..N
    CREATE <n> THREADS TO SOLVE PROBLEM <i>
    PRINT MKL_MEM_STATS
END LOOP

When running multiple threads that call MKL, I see the following (from mkl_mem_stat) after each set of threads has run to completion:

mkl_mem_stat:36077592 buffers 13
mkl_mem_stat:65724824 buffers 21
mkl_mem_stat:95372056 buffers 29
mkl_mem_stat:125019288 buffers 37
mkl_mem_stat:154666520 buffers 45
mkl_mem_stat:184313752 buffers 53
mkl_mem_stat:213960984 buffers 61
mkl_mem_stat:243608216 buffers 69
mkl_mem_stat:273255448 buffers 77
mkl_mem_stat:302902680 buffers 85
mkl_mem_stat:332549912 buffers 93

When calling the identical computational code using only the main thread, I see:

mkl_mem_stat:14823616 buffers 4
mkl_mem_stat:14823616 buffers 4
mkl_mem_stat:14823616 buffers 4
mkl_mem_stat:14823616 buffers 4
mkl_mem_stat:14823616 buffers 4
mkl_mem_stat:14823616 buffers 4
......
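The growth in the multi-threaded listing is perfectly regular, which a quick difference check makes visible. The sketch below is plain C++ with the first four readings hard-coded from the output above; the `Reading` struct and the `deltas` helper are illustrative names, not part of MKL.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One line of mkl_mem_stat output: bytes held by MKL and number of buffers.
struct Reading {
    std::int64_t bytes;
    int buffers;
};

// Differences between consecutive readings.
std::vector<Reading> deltas(const std::vector<Reading>& r) {
    std::vector<Reading> d;
    for (std::size_t i = 1; i < r.size(); ++i) {
        d.push_back({r[i].bytes - r[i - 1].bytes,
                     r[i].buffers - r[i - 1].buffers});
    }
    return d;
}

// First four readings from the multi-threaded run quoted above.
const std::vector<Reading> kThreadedRun = {
    {36077592, 13}, {65724824, 21}, {95372056, 29}, {125019288, 37}};
```

Every step adds 29647232 bytes and 8 buffers, which is exactly twice the steady 14823616 bytes / 4 buffers of the main-thread run. That would be consistent with each pass leaving behind the buffers of two worker threads, although the thread count <n> is not stated in the post.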

Note that this is memory used by MKL. It is not a leak in my code; these are INTERNAL MKL memory allocations, which occur in the call stack below.

msvcrt!malloc+70
mkl_mc3!mkl_serv_allocate+3ea
mkl_mc3!mkl_blas_xzgemm3m+5f7
mkl_core!mkl_blas_xzgemm3m+c5
mkl_intel_thread!mkl_blas_zgemm3m+46c7
libiomp5md!_kmp_invoke_microtask+8c
libiomp5md!_kmp_fork_call+9d87
libiomp5md!_kmp_fork_call+10f4
libiomp5md!_kmpc_fork_call+6c
mkl_intel_thread!mkl_blas_zgemm3m+df9


Hi Vasci,

As I understand it, this is expected behaviour when MKL is called from multiple threads, as the MKL User's Guide describes:

Intel MKL Memory Management Software
Intel MKL has memory management software that controls memory buffers for the use by the library functions. New buffers that the library allocates when your application calls Intel MKL are not deallocated until the program ends. To get the amount of memory allocated by the memory management software, call the mkl_mem_stat() function. If your program needs to free memory, call mkl_free_buffers().

If another call is made to a library function that needs a memory buffer, the memory manager again allocates the buffers and they again remain allocated until either the program ends or the program deallocates the memory. This behavior facilitates better performance. However, some tools may report this behavior as a memory leak.

The memory management software is turned on by default. To turn it off, set the MKL_DISABLE_FAST_MM environment variable to any value or call the mkl_disable_fast_mm() function. Be aware that this change may negatively impact performance of some Intel MKL routines, especially for small problem sizes.

If your program needs to free memory, you may call mkl_free_buffers() in each (boost) thread.

Best Regards,
Ying

I do not see anything in the documentation you quote above, or anything I can find in the user guide, implying that MKL allocates buffers on a per-thread basis and never deallocates them when the thread exits. That is the behaviour I am seeing.

Note this does not apply to OMP threads.

The obvious solution is to call mkl_free_buffers after all threads are complete - this works.

Note that you cannot simply call mkl_free_buffers when each thread completes. mkl_free_buffers frees ALL buffers, which will affect other running threads. It can only be done safely when all threads are complete.

I am attaching a sample program that shows the issue. This may be expected behavior, but it is not clear in the documentation.

The output is
looping over dgemms in main thread mkl_mem_stat:15869824 buffers 7
looping over dgemms in main thread mkl_mem_stat:21147648 buffers 8
looping over dgemms in main thread mkl_mem_stat:21147648 buffers 8
looping over dgemms in main thread mkl_mem_stat:21147648 buffers 8
looping over dgemms in main thread mkl_mem_stat:21147648 buffers 8
looping over dgemms in main thread mkl_mem_stat:21147648 buffers 8
looping over dgemms in main thread mkl_mem_stat:21147648 buffers 8
looping over dgemms in main thread mkl_mem_stat:21147648 buffers 8
looping over dgemms in main thread mkl_mem_stat:21147648 buffers 8
looping over dgemms in main thread mkl_mem_stat:21147648 buffers 8
after threads run mkl_mem_stat:190401536 buffers 80

The includes should read as follows (the cpp tags did not work):


#include "mkl_cblas.h"

#include "stdlib.h"

#include "stdio.h"

#include "mkl_service.h"

#include "windows.h"

#include "process.h"

#include "iostream"

using namespace std;
static void simple_dgemm(int n)

{
	double* A;

    double* B;

    double* C;

	double beta=1.0;

	int n2=n*n;

    /* Allocate host memory for the matrices */

    A = (double*)malloc(n2 * sizeof(A[0]));

    B = (double*)malloc(n2 * sizeof(B[0]));

    C = (double*)malloc(n2 * sizeof(C[0]));
    /* Fill the matrices with test data */

    for (int i = 0; i < n2; i++) {

        A[i] = rand() / (double)RAND_MAX;

        B[i] = rand() / (double)RAND_MAX;

        C[i] = rand() / (double)RAND_MAX;

    }
	cblas_dgemm (CblasColMajor, CblasNoTrans, CblasNoTrans, n, n, n, 1.0,A, n, B, n, beta, C, n);

	free(A);

    free(B);

    free(C);
}
void thread_func1(void *s)
{

	for(int i=0;i < 10 ;i++){

	 simple_dgemm(1000);

	}

}

static void PrintMKLMemoryUse(const char *msg)

{

	int nBuffers;

	MKL_INT64 allocatedMem=mkl_mem_stat(&nBuffers);

	cout << msg << " mkl_mem_stat:" << allocatedMem << " buffers " << nBuffers << endl;
}

#define NTHREADS 10

int main(int argc, char* argv[])

{

	// run something to get MKL running.

	for(int i=0;i < NTHREADS; i++){

		thread_func1(NULL);

		PrintMKLMemoryUse( "looping over dgemms in main thread ");

	}
	HANDLE hT[NTHREADS];

	int tmp;

	for(int i=0;i < NTHREADS ; i++){

		hT[i] = (HANDLE)_beginthread(thread_func1,0,&tmp);

	}

	WaitForMultipleObjects(NTHREADS, hT, TRUE, INFINITE);

	PrintMKLMemoryUse( "after threads run ");

	return 0;

}

Hi Vasci

Thank you for your test case. We will investigate it and get back to you soon.

Thanks
Ying

FYI, I opened Premier Support case #686263.

Also, as I pointed out, this is NOT an issue with OpenMP threads. For example, no "extra" memory use is seen after the following loop.


#pragma omp parallel for

	for(int i=0; i < NTHREADS ; i++){

		thread_func1(NULL);

	}

	PrintMKLMemoryUse( "after omp threads run ");

OK, for those interested, the solution is to call
"mkl_thread_free_buffers" at the end of your thread.
But I would point out that this **critical** information is mentioned nowhere in the section 'Managing Performance and Memory' in the User's Guide. The function is documented in the reference manual.
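One way to make sure the per-thread release always runs at the end of the thread, even on an early return or an exception, is a small scope guard. The sketch below is an assumption about how one might structure it, not something from the MKL documentation: `ScopeExit` and `thread_body` are hypothetical names, and the release step is a callable so the sketch builds without MKL. In the real program it would be mkl_thread_free_buffers.

```cpp
#include <utility>

// Calls a cleanup function when the enclosing scope ends.
// Hypothetical helper, not part of MKL.
template <typename F>
class ScopeExit {
public:
    explicit ScopeExit(F f) : f_(std::move(f)) {}
    ~ScopeExit() { f_(); }
    ScopeExit(const ScopeExit&) = delete;
    ScopeExit& operator=(const ScopeExit&) = delete;

private:
    F f_;
};

// Shape of a thread entry point: `work` makes the MKL calls and
// `release` would be mkl_thread_free_buffers.
template <typename Work, typename Release>
void thread_body(Work work, Release release) {
    ScopeExit<Release> guard(std::move(release));
    work();
}  // the guard's destructor runs here, on the thread that did the work
```

For the sample program above, the body of each thread could then be written as `thread_body([] { simple_dgemm(1000); }, mkl_thread_free_buffers);`.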
