auto-parallelization in Windows

Hi all,

I have used Intel C++ Compiler 11.0 to compile our code on Linux and it works fine. The compiler is able to auto-parallelize, vectorize, and also parallelize the OpenMP parts successfully. Now I have ported the code to Windows and am using MS Visual Studio 2008 to compile it with Intel C++ Compiler 11.0 for Windows. On Windows, however, the OpenMP parts are parallelized successfully, but it seems that auto-parallelization is not active. I call the compiler with these options:

/c /O3 /Og /Ob2 /Ot /Qipo /GA /EHsc /RTC1 /MT /GS /fp:fast=2 /Fo"x64\Release/"
/W1 /nologo /Qopenmp /Qfp-speculation:fast /Qparallel
/Quse-intel-optimized-headers /Qprof_gen /Qprof_dir "x64\Release"
/Qopt-report-file:"C:\Documents and Settings\Sophia\My Documents\Visual Studio
2008\Projects\CBSM\opt.txt" /Qopenmp-lib:compat

I should also say that the compiler on Windows could not detect the OpenMP pragmas until I added /Qopenmp-lib:compat. As I said, there are many loops that are auto-parallelized and vectorized by the auto-parallelization feature on Linux, but the same code is not auto-parallelized on Windows. Besides, I found out that there are two features in Intel Visual Fortran for Windows called "High Performance Parallel Optimizer (HPO)" and "Automatic Vectorizer". Are they also included in the Intel C++ Compiler on Windows and/or Linux?

Thanks


In the 11.1 version, /Qopenmp-lib:compat is the default, but it is not in 11.0; that's why you need to add /Qopenmp-lib:compat explicitly.

HPO and auto-vectorization are also included in the Intel C++ Compiler on Windows, Linux, and Mac OS.

The differences you saw between Windows and Linux may or may not be a bug. Please provide a test case; if it is a bug, we can fix it.

Thanks,
Jennifer

I don't know what you mean by a test case, since this is a big project, but I will try to reproduce it on a small piece of code too. Can you verify that I have used the correct combination of compiler options, so that auto-parallelization should be active? Also, I didn't find any property for HPO or auto-vectorization in the project properties window or the Intel C++ optimization section of Visual Studio 2008. Which options should I use to activate them? I can add them manually.

Thanks,

D.

I see.

From the option list, you have /Qprof-gen. It is used for profiling the code, and it disables optimizations, because the intention is to use /Qprof-use later.

First of all, I'd recommend upgrading to 11.1.065.

About auto-vectorization: /arch:SSE2 is the default in the 11.x release, so you should get auto-vectorization with SSE2 instructions. But if you want to target newer processors, or any specific processor with Intel SSE3 or SSE4, you can use one of the following (11.1 release):
. /arch:[IA32,SSE2,SSE3,SSSE3,SSE4.1]
. /Qax[SSE2,SSE3,SSSE3,SSE4.1,SSE4.2,AVX]
. /Qx[SSE2,SSE3,SSSE3,SSE4.1,SSE4.2,AVX,SSE3_ATOM]
See this article for more info about those: http://software.intel.com/en-us/articles/performance-tools-for-software-developers-intel-compiler-options-for-sse-generation-and-processor-specific-optimizations/

So you could use:
/c /O3 /Og /Ob2 /Ot /Qipo /GA /EHsc /RTC1 /MT /GS /fp:fast=2 /Fo"x64\Release/"
/W1 /nologo /Qopenmp /Qfp-speculation:fast /Qparallel
/Quse-intel-optimized-headers /Qprof_gen /Qprof_dir "x64\Release"
/Qopt-report-file:"C:\Documents and Settings\Sophia\My Documents\Visual Studio
2008\Projects\CBSM\opt.txt" /Qopenmp-lib:compat
==>

/c /O3 /Og /Ob2 /Ot /Qipo /GA /EHsc /RTC1 /MT /GS /fp:fast=2 /Fo"x64\Release/"
/W1 /nologo /Qopenmp /Qfp-speculation:fast /Qparallel
/Quse-intel-optimized-headers /Qprof_use /Qopt-report-file:"C:\Documents and Settings\Sophia\My Documents\Visual Studio
2008\Projects\CBSM\opt.txt" /Qopenmp-lib:compat /QxSSE2

/O3 enables the high-level optimizations provided by the Intel compiler.

Jennifer

I have uploaded a test case. It was built in Microsoft Visual Studio 2008 Professional Edition. I also could not get a report on whether the OpenMP loops are parallelized or not, but if you run the program you will see that the OpenMP region is parallelized, while the other two loops use just one processor, although I expected them to be parallelized/vectorized by the auto-parallelization feature. So I would like this simple project to:

1. Use the auto-parallelization feature.
2. Report correctly which loops are parallelized/vectorized by OpenMP or by auto-parallelization.
3. Have auto-vectorization activated (and let me know how).
4. Have the High Performance Optimizer activated (and let me know how you used it).

By the way, I have upgraded to the 11.1 version, but nothing changed and I still cannot use the auto-parallelization features.

Thanks,

D.

Attachments: 

OMPTest.zip (223.63 KB)

Thanks for the testcase!

Loop #1 (the OpenMP loop) can be simplified as follows:


	// Loop #1
	#pragma omp parallel for reduction (+:nSum1)
	for (i=nStart; i<=100*nEnd; ++i)
		nSum1+=i;

For loops #2 and #3: if you change the code as below, the loops will be auto-parallelized. This is a bug and I'll file a ticket for it.

int __inline getsum(int i)
{
    int nSum = 0;
    nSum += (int)sqrt(cos((sqrt(i * 1.22234) * 2.445)));
    .......
    return nSum;
}

int main()
{
    ...
    // Loop #2
    int k;
    for (i = 0; i <= 100*100000; ++i)
        for (j = 0; j <= 10000; ++j)
            nSum2 += getsum(i);
    ...
}

To see if a loop is auto-parallelized, use the options /Qparallel /Qpar-report3.
To see if a loop is auto-vectorized, use the option /Qvec-report3. Auto-vectorization is enabled by these options: /arch:[IA32|SSE2|SSE3|SSE4...], /Qax[...], /Qx[...]
Please refer to this article for more details on targeting different architectures.
The Intel C++ compiler has a "parallel lint" feature that can diagnose existing and potential issues with OpenMP parallelization; the option is /Qdiag-enable:sc-parallel[n].
To use HPO, use /O3. You can see more detail with /Qopt-report.

Thanks,
Jennifer

Quoting Jennifer Jiang (Intel)
The auto-vectorization is enabled with those options: /arch:[IA32|SSE2|SSE3|SSE4...], /Qax[...], /Qx[...]

/arch:IA32 (available only with the 32-bit compiler) disables auto-vectorization (at least with respect to float data types), as do the options /Od, /Os, and /O1.
/fp:source and the like disable certain auto-vectorizations. I don't know whether sum reduction vectorizations, such as those mentioned in this thread, are disabled only for float data types.

Thanks Tim for correcting me.

Yes, Tim is right.

Jennifer

This bug is fixed in 14.0 and 15.0. The 2nd and 3rd loops can all be auto-parallelized now.

Jennifer
