Difference between multithreading DLL vs EXE under intel c++ compiler?

Hi everyone, my project was built under Microsoft Visual Studio 2005 (C++). I used the Intel C++ compiler to accelerate my program, and the speed boost is about 30%. Recently I realized that the computation in my program could be divided into two independent parts, so I put them into two independent threads. After the main thread gets the signal that the two computation threads have finished their jobs, it performs the final calculation and then the program exits. I ran my program under Windows XP on a dual-core Intel CPU. When the program was run as an executable, another 50% speed boost was achieved as expected, and the peak CPU usage was about 80%. However, when I changed the program into a dynamic link library (DLL) and called it from another executable, the speed dropped back to the level of the initial single-threaded program (but after Intel C++ optimization).

Then I changed the project compiler back to Visual C++ and compiled the DLL again. I found that the speed was boosted by a factor of 2. So I am curious: is there any problem with the configuration of the Intel C++ compiler for a multithreading DLL, or are there some tricks with the Intel C++ compiler in this situation?

Any help will be highly appreciated!

The Intel C++ compiler behaves the same as the Microsoft compiler. If you provide the sample code, we can take a look.

How were you performing your parallelization? (auto-parallelization, OpenMP, TBB, pthreads, other)?

Where were you performing your parallelization? (DLL or .exe)

A multi-threaded DLL (e.g. the C runtime multi-threaded DLL) refers to the DLL being safe to call from a multi-threaded application (i.e. this does not necessarily mean the DLL creates and/or uses multiple threads).

Jim Dempsey

www.quickthreadprogramming.com

Quoting - jimdempseyatthecove
How were you performing your parallelization? (auto-parallelization, OpenMP, TBB, pthreads, other)?

Please let us know your answer to Jim's question.

In case you're using /Qparallel or /Qopenmp,
try to link the new openmp runtime libs. See this kb for how - http://software.intel.com/en-us/articles/how-to-use-intelr-compiler-openmp-compatibility-libraries-on-windows/

Another thing to try is following and see how the performance goes:
1. compile with icl
2. link with vc's openmp runtime lib.

Jennifer

Thank you all for the kind suggestions. I have pasted partial pseudocode below to show how I implemented the multithreading in my DLL. I spent a lot of time extracting the code related to multithreading, and there may be some errors in it. Please let me know if anything is unclear. The whole program compiles and links correctly, and it produces exactly the result I expect.

/* -------------------------------------- */
/* stdafx.h
/* -------------------------------------- */
#include "GlobalMultiThreadClass.h"

extern HANDLE Task1_Thread;
extern HANDLE Task2_Thread;

extern HANDLE Task1_EventHandle;
extern HANDLE Task2_EventHandle;

extern TASK1_PARA_QUEUE Task1_Queue;
extern TASK2_PARA_QUEUE Task2_Queue;

extern GlobalMultiThreadClass GMTObject;


/* -------------------------------------- */
/* GlobalMultiThreadClass.h
/* -------------------------------------- */
class GlobalMultiThreadClass
{
public:
	GlobalMultiThreadClass();
	virtual ~GlobalMultiThreadClass();
};


/* -------------------------------------- */
/* GlobalMultiThreadClass.cpp
/* -------------------------------------- */
#include "stdafx.h"
#include "GlobalMultiThreadClass.h"

#include "Task1_Thread_Function.h"
#include "Task2_Thread_Function.h"

HANDLE Task1_Thread;
HANDLE Task2_Thread;

GlobalMultiThreadClass::GlobalMultiThreadClass()
{
	Task1_EventHandle = CreateEvent(NULL,FALSE,TRUE,NULL);
	Task2_EventHandle = CreateEvent(NULL,FALSE,TRUE,NULL);

	Task1_Thread = CreateThread(NULL,0,Task1_Thread_Function,NULL,0,NULL);
	Task2_Thread = CreateThread(NULL,0,Task2_Thread_Function,NULL,0,NULL);
}

GlobalMultiThreadClass::~GlobalMultiThreadClass()
{
	WaitForSingleObject(Task1_Thread,INFINITE);
	CloseHandle(Task1_Thread);
	CloseHandle(Task1_EventHandle);

	WaitForSingleObject(Task2_Thread,INFINITE);
	CloseHandle(Task2_Thread);
	CloseHandle(Task2_EventHandle); 
}


/* -------------------------------------- */
/* Task1_Thread_Function.h
/* -------------------------------------- */
#include "stdafx.h"

HANDLE Task1_EventHandle;
TASK1_PARA_QUEUE Task1_Queue;

UINT Task1_Thread_Function(LPVOID lpParameter);

/* -------------------------------------- */
/* Task1_Thread_Function.cpp
/* -------------------------------------- */
#include "Task1_Thread_Function.h"

UINT Task1_Thread_Function(LPVOID lpParameter)
{
	bool RUN = true;
	while(RUN)
	{
		if(Task1_Queue.size()==0){
			Sleep(1);
			continue;
		}
		else
		{
			// fetch the parameters in queue
			// ...
			
			Task1_Queue.pop_front();
			
			// task1 execution
			// ...
			
			SetEvent(Task1_EventHandle);
		}
	}
	return 0;
}
/* -------------------------------------- */
/* Task2_Thread_Function.h
/* -------------------------------------- */
#include "stdafx.h"

HANDLE Task2_EventHandle;
TASK2_PARA_QUEUE Task2_Queue;

UINT Task2_Thread_Function(LPVOID lpParameter);

/* -------------------------------------- */
/* Task2_Thread_Function.cpp
/* -------------------------------------- */
#include "Task2_Thread_Function.h"

UINT Task2_Thread_Function(LPVOID lpParameter)
{
	bool RUN = true;
	while(RUN)
	{
		if(Task2_Queue.size()==0){
			Sleep(1);
			continue;
		}
		else
		{
			// fetch the parameters in queue
			// ...
			
			Task2_Queue.pop_front();
			
			// task2 execution
			// ...
			
			SetEvent(Task2_EventHandle);
		}
	}
	return 0;
}


/* -------------------------------------- */
/* main_thread.h
/* -------------------------------------- */
#include "stdafx.h"
#include "Task1_Thread_Function.h"
#include "Task2_Thread_Function.h"

bool Main_Thread_Process();

/* -------------------------------------- */
/* main_thread.cpp
/* -------------------------------------- */
#include "main_thread.h"

bool Main_Thread_Process()
{
	// Pre-process here
	// ...
	
	WaitForSingleObject(Task1_EventHandle,INFINITE);
	ResetEvent(Task1_EventHandle);

	WaitForSingleObject(Task2_EventHandle,INFINITE);
	ResetEvent(Task2_EventHandle);

	Task1_Queue.push_back(&Task1_Para);
	Task2_Queue.push_back(&Task2_Para);
		
	WaitForSingleObject(Task1_EventHandle,INFINITE);
	SetEvent(Task1_EventHandle);

	WaitForSingleObject(Task2_EventHandle,INFINITE);
	SetEvent(Task2_EventHandle);
	
	// Post-computation here
	// ...
	
	// Quit	
	return true;
}

/* -------------------------------------- */
/* dll.h
/* -------------------------------------- */
#include "stdafx.h"
#include "main_thread.h"

GlobalMultiThreadClass GMTObject;

// other dll export functions definition here
// ...

__declspec(dllexport) int Export_Function(LPTSTR lpszMsg);

/* -------------------------------------- */
/* dll.cpp
/* -------------------------------------- */
#include "dll.h"

__declspec(dllexport) int Export_Function(LPTSTR lpszMsg)
{
	// ...
	
	Main_Thread_Process();
	
	// ...
}

Hi Jim, I also tried the OpenMP support in Visual Studio 2005 to accelerate my program, and removed all the multithreading code from the project. It looks like this:

main_thread.cpp

main_thread_function()
{
//...

#pragma omp parallel sections
{
#pragma omp section
Task1_Execution();

#pragma omp section
Task2_Execution();
}

//...
}

and call main_thread_function() from the DLL export function. When the VC compiler was used (with the OpenMP option switched on), the speed was boosted by 50%, while when the Intel C++ compiler was used (OpenMP option also on), there was no speed improvement :(.

Finally I found that a C/C++ compilation option can have a large impact on execution speed with the Intel C++ compiler: the "Enable C++ Exceptions" option in Visual Studio 2005. When the option is set to "Yes with SEH Exceptions (/EHa)", the dynamic link library runs much slower, while with "Yes (/EHsc)" or "No" it runs much faster, and the last option gives the fastest speed. I think it's strange, as I didn't get a similar result when I organized the program as an .exe file. Now the multithreading DLL built with the Intel compiler is faster than the one built with the Visual Studio compiler (by about 20~30% with C++ exceptions disabled, and 10~15% with /EHsc), as expected. Thank you anyway for the kind help and suggestions above.

Quoting - gobball

Finally I found that a C/C++ compilation option can have a large impact on execution speed with the Intel C++ compiler: the "Enable C++ Exceptions" option in Visual Studio 2005. When the option is set to "Yes with SEH Exceptions (/EHa)", the dynamic link library runs much slower, while with "Yes (/EHsc)" or "No" it runs much faster, and the last option gives the fastest speed.

Hi gobball,
could you tell me the version of the Intel C++ compiler you're using? If it's the latest compiler, I'd like to find out why this happens and make sure it's not a bug.

thanks,
Jennifer

Hi Jennifer, I am using the latest version (11.1) of the Intel C++ compiler, which was downloaded from the Intel website. I would like to provide some detailed information about the speed and compilation options in my project:

The core code of the project does image processing. Multithreading is used to boost the execution speed on a dual-core Intel CPU, and the Intel C++ compiler was expected to provide faster code than the Visual C++ compiler.

First I processed a 512x512 image with my program. For the EXE project (the core code was compiled into an executable and used to process the images) with multithreading (one main thread plus two extra parallel computation threads):

-----------------------------------------------------
Compiler        Exception Option      Time
-----------------------------------------------------
VC (2005)       No                    1.3 sec
VC (2005)       /EHsc                 1.3 sec
VC (2005)       /EHa                  1.3 sec

Intel (11.1)    No                    1.0 sec
Intel (11.1)    /EHsc                 1.0 sec
Intel (11.1)    /EHa                  2.0 sec

Then I processed a 256x256 image with my program. The results were listed below:

-----------------------------------------------------
Compiler        Exception Option      Time
-----------------------------------------------------
VC (2005)       No                    0.25 sec
VC (2005)       /EHsc                 0.25 sec
VC (2005)       /EHa                  0.25 sec

Intel (11.1)    No                    0.215 sec
Intel (11.1)    /EHsc                 0.23 sec
Intel (11.1)    /EHa                  0.45 sec

For the DLL project I got similar results to the EXE. I must apologize for misleading everyone, since it seems the question I asked in this thread is not related to any difference between compiling an EXE and a DLL with multithreading. Instead, it seems to be related to how the Intel C++ compiler behaves under different C++ exception-handling options. Hope this helps.

What I find to be a major benefit of the Intel C++ Compiler (and the debugger is great) is that you can build a solution (library) as optimized C++ in VS 2008 and save it as a .dll. This superfast, optimized .dll can then be included in a .NET assembly, where the less performance-critical parts can be coded in C# (or IronPython for that matter). So the difference is clearly that you can use a .dll in conjunction with other parts of an assembly in VS, but a .exe would have to be spawned, and that is of course not what I want anyway. Of course there are other differences as well. A .dll doesn't contain startup code etc., so it is naturally faster to include in an application.

"Small chance of success, certainty of death... What are we waiting for?"

Quoting - dvyy
"Enable C++ Exceptions" should be intel c++ compiler bug !! report it !!

You don't have to go to Premier Support to report. I'm working on the test case right now and will send to compiler engineer once it's done.

Jennifer

Quoting - gobball
-----------------------------------------------------
Compiler        Exception Option      Time
-----------------------------------------------------
VC (2005)       No                    0.25 sec
VC (2005)       /EHsc                 0.25 sec
VC (2005)       /EHa                  0.25 sec

Intel (11.1)    No                    0.215 sec
Intel (11.1)    /EHsc                 0.23 sec
Intel (11.1)    /EHa                  0.45 sec

Hello all,
I'm unable to duplicate this performance issue using a small testcase.

  1. VS2005-release (/EHsc)

starting ......
task1: iter=1000, queue.size=1
task2: iter=2000, queue.size=1
task1: 109973080.000000
task2: 200633520.000000
total used tick count: 202090000.000000
end......

  2. Icl 11.1.038 /EHsc

starting ......
task1: iter=1000, queue.size=1
task2: iter=2000, queue.size=1
task1: 16315880.000000
task2: 16905680.000000
total used tick count: 18281200.000000
end......

  3. Icl /EHa

starting ......
task1: iter=1000, queue.size=1
task2: iter=2000, queue.size=1
task1: 16401420.000000
task2: 16370620.000000
total used tick count: 17835540.000000
end......

If any of you could duplicate with a testcase, could you please attach to this thread?

Thanks,
Jennifer

I have a simple repro (we've hit this problem as well in our code).

#include "stdio.h"

extern "C" unsigned __int64 _rdtsc();
#pragma intrinsic(_rdtsc)

class D
{
public:
D(size_t s = 0) {}
~D() {}
};

double f[100000];
unsigned __int64 t;

int main()
{
D d();

t = _rdtsc();
double totalSum = 0;
for (size_t i = 0; i < 100000; ++i)
{
totalSum = totalSum + f[i];
}
printf("\nSumVector done in %I64i (sum=%f)\n", _rdtsc() - t, totalSum);
}

compile /O2 /EHa:
"SumVector done in 745769 (sum=0.000000)"

replace D d(); with D d(10);
"SumVector done in 1448634 (sum=0.000000)"

Performance drops 2x due to the loop not being vectorized anymore. Apparently, having to unwind in the presence of this totally hollow object causes the optimizer some grief. Interestingly enough, the problem also seems to go away if you keep D d(10), but remove the explicit destructor definition from the class.

Curious

Modify your test program to print out the address of f[0]. This will check data alignment.

Then modify your test program to add a 2nd void class
...
class D2
{
public:
D2(size_t s = 0) {}
~D2() {}
};

..
int main()
{
D d();
D2 d2();

This may affect code alignment (inserting a 2nd "null" dtor).

www.quickthreadprogramming.com
