Starting Guide

Hello Friend,

I am new to Intel Composer XE. I have good knowledge of parallel programming on GPUs with CUDA and OpenCL. I want to learn Intel Composer XE: icc, MKL and IPP. I have read all the installation guides and tutorials, but can anyone suggest how I should start programming?

That means:

How do I use a single core and multiple cores of my processor?

How do I divide my execution across different cores?

Please help me! I am using an Intel i7 processor.

Thanks,

Hi,
Welcome to x86 parallel programming. I suggest starting with our compiler User and Reference Guides to learn how to use the Intel compiler (ICC).
http://software.intel.com/sites/products/documentation/doclib/stdxe/2013...

This document discusses the various multi-threading models supported by the Intel ICC compiler, which you can use to run your code efficiently on all the available cores of your system. Have you purchased an Intel compiler or another suite such as Parallel Studio? If not, you can download the evaluation copy from http://software.intel.com/en-us/intel-parallel-studio-XE-2013-evaluation... . Please feel free to post your questions.

Regards,
Sukruth H V
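
As a rough illustration of one of the threading models mentioned above, here is a minimal OpenMP sketch in C++ that spreads a summation loop across the available cores. The function name `parallel_sum` is made up for this example. The pragma only takes effect when OpenMP is enabled at compile time (for example `-fopenmp` with GCC, or `-openmp` with the 2013-era Intel compiler); without it the loop simply runs on one core and gives the same result.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the thread): sums the elements of `data`.
// With OpenMP enabled the iterations are divided among the cores;
// without it the loop runs serially and returns the same value.
long long parallel_sum(const std::vector<int>& data)
{
    long long total = 0;
    const long long n = static_cast<long long>(data.size());

    // Each thread accumulates into a private copy of `total`;
    // OpenMP adds the private copies together when the loop ends.
#ifdef _OPENMP
#pragma omp parallel for reduction(+ : total)
#endif
    for (long long i = 0; i < n; ++i) {
        total += data[static_cast<std::size_t>(i)];
    }
    return total;
}
```

Calling `parallel_sum` on a vector holding 1,2,3,4,5,6,7,8,9,0 returns 45. How the iterations are split between cores is left to the OpenMP runtime here; a `schedule` clause could control that, but it is not needed for a uniform loop like this one.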

It's very good; I want more.
I am checking the sample examples.

Take a look at the folder [ CompilerDIR ]\Samples\en_US\C++... ( or so ); you will find several C/C++ examples there.

Hello,

take a look at our Content Library:
http://software.intel.com/en-us/search/site

Search for everything you're interested in (e.g. OpenMP*, Intel(R) Threading Building Blocks, Intel(R) Cilk(TM) Plus, ...).
Make sure to select the proper filters, as you'll get flooded with results otherwise. I suggest filtering for "Article", "Blog post" and "Courseware". Those three can also contain interesting examples.

Best regards,

Georg Zitzlsberger

Thanks :)

Hello,
Can anyone give me a link where all MKL library functions are available WITH DESCRIPTIONS? I have read the following link:
http://software.intel.com/sites/products/documentation/hpc/mkl/mkl_userg...

I've been searching myself, and I don't think there is a single satisfactory guide to "all" MKL functions. In several categories (e.g. BLAS, LAPACK, FFTW), compatibility with open source libraries is maintained, so literature on those libraries is applicable.
Specific questions about MKL should be posed on the MKL forum http://software.intel.com/en-us/forums/intel-math-kernel-library
Unfortunately, today I don't see the additional references which ought to be at the top of that forum.

Okay, thank you very much!!! :)

When I install Intel Composer XE, it gives me the following message:
ERROR: Package is corrupted. Installation cannot continue.
The full output follows (I downloaded the package from the Intel site).
Please tell me what I have to do.

Step no: 6 of 7 | Installation
--------------------------------------------------------------------------------
Each component will be installed individually. If you cancel the installation,
components that have been completely installed will remain on your system. This
installation may take several minutes, depending on your system and the options
you selected.
--------------------------------------------------------------------------------
Installing Amplifier XE Command line interface component...
ERROR: Package is corrupted. Installation cannot continue.
--------------------------------------------------------------------------------
Installing Inspector XE Command line interface component...
ERROR: Package is corrupted. Installation cannot continue.
--------------------------------------------------------------------------------
Installing Advisor XE Command line interface component...
ERROR: Package is corrupted. Installation cannot continue.
--------------------------------------------------------------------------------
Installing Intel C++ Compiler XE 13.0 Update 1 on IA-32 component... failed
--------------------------------------------------------------------------------
Installing Intel C++ Compiler XE 13.0 Update 1 on Intel(R) 64 component... done
--------------------------------------------------------------------------------
Installing Intel Debugger 13.0 Update 1 on IA-32 component... failed
--------------------------------------------------------------------------------
Installing Intel Debugger 13.0 Update 1 on Intel(R) 64 component... done
--------------------------------------------------------------------------------
Installing Intel Math Kernel Library 11.0 Update 1 on IA-32 component... failed
--------------------------------------------------------------------------------
Installing Intel Math Kernel Library 11.0 Update 1 on Intel(R) 64 component... done
--------------------------------------------------------------------------------
Installing Intel Integrated Performance Primitives 7.1 Update 1 on IA-32
component... failed
--------------------------------------------------------------------------------
Installing Intel Integrated Performance Primitives 7.1 Update 1 on Intel(R) 64
component... done
--------------------------------------------------------------------------------
Installing Intel Threading Building Blocks 4.1 Update 1 core files and examples
component... done
--------------------------------------------------------------------------------
Finalizing installation... done
--------------------------------------------------------------------------------

Hello,

in that case I'd recommend downloading it again. There are known issues where the download is interrupted. Please always verify the size of the downloaded tarball. In addition, you can compare the MD5 sum with those of the latest Intel(R) Composer XE 2013 Update 1 packages:


$ md5sum l_ccompxe_2013.1.117.tgz

8796a1a1e5c98107ca69c75a7aa2b379  l_ccompxe_2013.1.117.tgz

$ md5sum l_fcompxe_2013.1.117.tgz

355c201ef30167580e5b0dfc217fbbe8  l_fcompxe_2013.1.117.tgz

Use the link "Start download with a download manager" when downloading the packages. This should always work.

Best regards,

Georg Zitzlsberger

Thanks, I will download a new file and then try to install it.

Hi Friend Georg Zitzlsberger,
I am trying to write a program that calculates the aggregate sum of the elements of a vector.
Example:

Suppose A={1,2,3,4,5,6,7,8,9,0}
Result = 45
The best time I get is 140 to 150 milliseconds for 600,000,000 (600 million) numbers.
Can we reduce the execution time?

I have written the following program.

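// The header names were stripped when this was posted; these are the ones the code needs.
#include <cilk/cilk.h>
#include <cilk/reducer_opadd.h>
#include <cstdlib>
#include <ctime>
#include <iostream>

unsigned int compute(unsigned int i)
{
	return i; // return a value computed from i
}

int main(int argc, char* argv[])
{
	unsigned int n = 400000000;
	int *A = (int *)malloc(sizeof(int)*n);
	for(unsigned int i = 0; i < n; i++)
	{
		A[i]=1;
	}
	// Element type assumed to be int; the template argument was also stripped.
	cilk::reducer_opadd<int> total;

	// Sum A[0..n-1] in parallel
	std::clock_t start = std::clock();
	cilk_for(unsigned int i = 0; i < n; ++i) // i < n keeps the reads inside the array
	{
		total += A[i];
	}

	std::cout << "Total (" << total.get_value()
	<< ") is correct";
	std::cout << "Total Time : "<<( double( std::clock() - start ) /double(CLOCKS_PER_SEC/1000)) <<'\n';
	free(A);
	return 0;
}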
Command for execution : #icc -fast -parallel filename.c filename

Is there any MKL function for the aggregate sum of vector elements?

Do you mean the BLAS ?asum functions? Evidently, with only 9 elements, those will be slower than any reasonable in-lined method, such as accumulate() or __sec_reduce_add(). Even the compilers' inline optimizations for sum reduction may not be effective for such a short vector, and it may be worthwhile to prevent the compiler from using AVX.
If you succeed in forcing threading on such a small case, you may succeed in running slower than MKL.
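
For reference, the in-lined accumulate() approach mentioned above could look like the following minimal sketch. The wrapper name `serial_sum` is hypothetical; `std::accumulate` is the standard library algorithm. (`__sec_reduce_add()` belongs to the Intel Cilk Plus array notation and is not shown because it needs the Intel compiler.)

```cpp
#include <numeric>
#include <vector>

// Hypothetical wrapper (not from the thread): serial sum of all elements.
// The 0LL initial value makes the accumulator a long long, so adding up
// hundreds of millions of ints cannot overflow a 32-bit int.
long long serial_sum(const std::vector<int>& data)
{
    return std::accumulate(data.begin(), data.end(), 0LL);
}
```

The type of the initial value decides the type of the running sum, which is why `0LL` is used rather than `0`.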

Yes friend, I gave only a small sample of the program. I used 600,000,000 elements and computed their sum, and it gave the result in 140 to 150 ms. I want more optimization.

Hello,

I doubt that this simple example has much room for improvement. The compiler should already create sufficiently fast code.
Using threading for this (trivial) workload is likely to add overhead. The Intel(R) Cilk(TM) Plus runtime takes care of the right balancing of grain size, though.

Hence, I don't see further room for improvement that would justify the effort for this example.

Best regards,

Georg Zitzlsberger

So that means the code is already optimized.
Thanks,

How can I download videos from www.software.intel.com?

Hello,

you cannot download the videos. You need a Flash* player to view them.

Best regards,

Georg Zitzlsberger

A common practice is to reduce the frequency of reductions:

#include "stdafx.h"
// The header names below were stripped by the forum software; these are the ones the code needs.
#include <cilk/cilk.h>
#include <cilk/reducer_opadd.h>
#include <algorithm>
#include <cstdlib>
#include <ctime>
#include <iostream>

int _tmain(int argc, _TCHAR* argv[])
{
	unsigned int n = 400000000;
	int *A = (int *)malloc(sizeof(int)*n);
	for(unsigned int i = 0; i < n; i++)
	{
		A[i]=1;
	}

	// Element type assumed to be int; the template argument was also stripped.
	cilk::reducer_opadd<int> total;

	// Version 1: one reducer update per element
	total.set_value(0);
	clock_t start = clock();
	cilk_for(unsigned int i = 0; i < n; ++i)
	{
		total += A[i];
	}
	std::cout << "Total (" << total.get_value()
	<< ") is correct";
	std::cout <<  "Total Time : "<< ( double( clock() - start ) /double(CLOCKS_PER_SEC/1000)) << '\n';

	//==============
	// Version 2: sum each block of 65536 elements locally,
	// then update the reducer once per block
	total.set_value(0);
	start = clock();
	cilk_for(unsigned int i = 0; i < n; i+=65536)
	{
		unsigned int jend = std::min(i + 65536, n);
		int my_total = 0;
		for(unsigned int j=i; j < jend; ++j)
			my_total += A[j];
		total += my_total;
	}
	std::cout << "Total (" << total.get_value()
	<< ") is correct";
	std::cout << " Total Time : "<< ( double( clock() - start ) /double(CLOCKS_PER_SEC/1000)) << '\n';
	free(A);
	return 0;
}

Total (400000000) is correctTotal Time : 2944
Total (400000000) is correct Total Time : 354

Jim Dempsey

www.quickthreadprogramming.com
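
One caveat when reproducing timings like the ones above: clock() reports processor time on Linux (added up over all threads of the process), whereas Microsoft's implementation reports elapsed wall-clock time, so numbers from the two platforms are not directly comparable for threaded code. A minimal sketch of a wall-clock timer using standard C++11 follows; the helper name `elapsed_ms` is made up for this example.

```cpp
#include <chrono>

// Hypothetical helper (not from the thread): runs `work` once and returns
// the elapsed wall-clock time in milliseconds, using a monotonic clock.
template <typename Work>
double elapsed_ms(Work&& work)
{
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

It can wrap any callable, for example `elapsed_ms([&] { /* summation loop */ })`. `steady_clock` is chosen over `system_clock` because it never jumps backwards when the system time is adjusted.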

Thanks, my friend jimdempseyatthecove!!!

Hi Friend,

I have used the following function:

double *A = (double *) malloc (SIZE*sizeof(double));
result = cblas_dasum (vec_size, A, incX);

But how can I use the above function with the int data type, and how would I use it with void?
int *A = (int *) malloc (SIZE*sizeof(int));
result = cblas_dasum (vec_size, A, incX);

Swapnil,

You cannot apply BLAS/Cblas functions such as 'dasum' to arrays of integers, characters, etc. Even a function returning an integer, such as 'idamax', which returns the index of the largest (in absolute value) element of an array, takes an array argument that must be one of the real/complex types.
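
Assuming the ?asum result is still wanted for integer data, one possible workaround is to copy the integers into a double array first. The sketch below uses made-up helper names (`to_double`, `abs_sum`); `abs_sum` is a plain C++ stand-in with the same meaning as ?asum (the sum of absolute values), so that the example does not depend on the MKL headers.

```cpp
#include <cmath>
#include <vector>

// Hypothetical helper (not from the thread): copies integer data into a
// double array, the element type that cblas_dasum expects.
std::vector<double> to_double(const std::vector<int>& src)
{
    return std::vector<double>(src.begin(), src.end());
}

// Plain C++ stand-in with ?asum semantics: the sum of absolute values.
// With MKL available, cblas_dasum(n, x.data(), 1) could take its place.
double abs_sum(const std::vector<double>& x)
{
    double total = 0.0;
    for (double v : x) {
        total += std::fabs(v);
    }
    return total;
}
```

Note that ?asum adds absolute values, so it only equals the plain sum when no element is negative. The conversion also costs an extra pass over the data and twice the memory, so for a simple integer sum a direct loop is likely the better choice.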

Jim,

Until they fix the forum software so that '<' and '>' do not get devoured in silence, please use '&lt;' and '&gt;', or substitute the double quote, ", for the '<' and '>' characters. In your code above, a number of header file names have been blanked out, and it is hard to guess all of them.

Thanks.

I'd like to follow up...

>>... i have used 600,000,000 element and find sum of that number and it was giving result in 140 to 150 ms i want more
>>optimzation...

Do you mean a faster calculation of a sum of the vector elements?

I think for a data set of 572MB, calculation times of 140 to 150 ms are not too bad; however, it is not clear what CPU you're using.

Hi Sergey,
I am using Intel I7 (4 core Processor) (8 Threads). and 8 GB RAM and Ubuntu 12.04 OS.
Thanks,

It would be interesting if Swapnil could post disassembled code of his vector or array summing function.

>>...I am using Intel I7 (4 core Processor) (8 Threads). and 8 GB RAM and Ubuntu 12.04 OS.

I'll do a test on my Dell Precision Mobile ( Intel Core i7-3840QM / Ivy Bridge / 4 cores / 8 logical processors / 16GB ). I simply would like to verify how long it takes to get a sum for a 572MB vector of doubles. To be honest, your numbers 140 - 150ms are impressive ( looks too fast ).

>>>I'll do a test on my Dell Precision Mobile>>>

Do you have a laptop or desktop computer?

>>>To be honest, your numbers 140 - 150ms are impressive ( looks too fast ).>>>

Judging by the raw performance, measured in GFLOPS, of a single hyperthreaded core: one logical core can handle the integer work (for example the loop counter, with fused cmp/jmp and dec instructions executed on Port 5), while the second logical core handles the double-precision additions, four doubles at a time, with 256-bit AVX vector instructions. When the core is not saturated, it is possible to achieve a theoretical throughput of 8 DP flops per cycle on Port 1, i.e. ~24 GFLOPS on a single core. Multiplied by four physical cores, you can reach almost 96 GFLOPS.

Here is a good article: http://software.intel.com/en-us/forums/topic/291765

>>>>I'll do a test on my Dell Precision Mobile...
>>
>>Do you have a laptop or desktop computer.

This is a 15.6-inch laptop.

>>>>...I am using Intel I7 (4 core Processor) (8 Threads). and 8 GB RAM and Ubuntu 12.04 OS.
>>
>>I'll do a test on my Dell Precision Mobile ( Intel Core i7-3840QM / Ivy Bridge / 4 cores / 8 logical processors / 16GB ). I simply
>>would like to verify how long it takes to get a sum for a 572MB vector of doubles. To be honest, your numbers 140 - 150ms are
>>impressive ( looks too fast ).

I've just completed a set of tests and numbers look right. Here are my results:
...
Succesfully Allocated 4.47GB
Initializing the array...
Done

[ Test 1 ] Calculating Sum of 600000000 elements ( Rolled Loops 1-in-1 )...
Sum of 600000000 elements calculated in: 0.718000 secs
Sum of 600000000 elements: 600000000.000000
Start: 4358917 ticks
End : 4359635 ticks

[ Test 2 ] Calculating Sum of 600000000 elements ( Unrolled Loops 4-in-1 )...
Sum of 600000000 elements calculated in: 0.327000 secs
Sum of 600000000 elements: 600000000.000000
Start: 4359635 ticks
End : 4359962 ticks
...
[ Note ]
A test was executed ( forced ) on one CPU ( #3 ) with all C++ compiler optimizations turned off in a one threaded 64-bit application ( without OpenMP ).

Here are the source codes of the test:
...
int _ARRAY_SIZE = 600000000;

double dSum;
DWORD dwStart;
DWORD dwEnd;

double *pdData = NULL;
pdData = ( double * )malloc( ( _ARRAY_SIZE * sizeof( double ) ) );
if( pdData != NULL )
{
	_tprintf( _T("Succesfully Allocated %.2fGB\n"), ( ( double )( _ARRAY_SIZE * sizeof( double ) ) / 1024 / 1024 / 1024 ) );

	int i;

	_tprintf( _T("Initializing the array...\n") );
	for( i = 0; i < _ARRAY_SIZE; i++ )
	{
		pdData[i] = 1.0L;
	}
	_tprintf( _T("Done\n\n") );

	dSum = 0.0L;
	_tprintf( _T("[ Test 1 ] Calculating Sum of %d elements ( Rolled Loops 1-in-1 )...\n"), _ARRAY_SIZE );
	dwStart = ::GetTickCount();
	for( i = 0; i < _ARRAY_SIZE; i++ )
	{
		dSum += pdData[i];
	}
	dwEnd = ::GetTickCount();
	_tprintf( _T("Sum of %d elements calculated in: %f secs\n"), _ARRAY_SIZE, ( float )( dwEnd - dwStart ) / 1000.0f );
	_tprintf( _T("Sum of %d elements: %f\n"), _ARRAY_SIZE, dSum );
	_tprintf( _T("Start: %d ticks\nEnd : %d ticks\n\n"), dwStart, dwEnd );

	dSum = 0.0L;
	_tprintf( _T("[ Test 2 ] Calculating Sum of %d elements ( Unrolled Loops 4-in-1 )...\n"), _ARRAY_SIZE );
	dwStart = ::GetTickCount();
	for( i = 0; i < _ARRAY_SIZE; i+=4 )
	{
		dSum += ( pdData[i] + pdData[i+1] + pdData[i+2] + pdData[i+3] );
	}
	dwEnd = ::GetTickCount();
	_tprintf( _T("Sum of %d elements calculated in: %f secs\n"), _ARRAY_SIZE, ( float )( dwEnd - dwStart ) / 1000.0f );
	_tprintf( _T("Sum of %d elements: %f\n"), _ARRAY_SIZE, dSum );
	_tprintf( _T("Start: %d ticks\nEnd : %d ticks\n\n"), dwStart, dwEnd );
}
else
{
	_tprintf( _T("Failed to Allocate %.2fGB\n"), ( ( double )( _ARRAY_SIZE * sizeof( double ) ) / 1024 / 1024 / 1024 ) );
}
...
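
As a rough sanity check on the sizes and times in this test, here is a small sketch with made-up helper names (`size_gib`, `throughput_gib_per_s`). 600,000,000 doubles occupy about 4.47 GiB, which matches the allocation message above, and the same count of 4-byte ints would occupy about 2.24 GiB. The 572 figure quoted earlier corresponds to the element count divided by 1024², not to a byte count.

```cpp
#include <cstddef>

// Hypothetical helper (not from the thread): size in GiB of `count`
// elements that are `elem_size` bytes each.
double size_gib(std::size_t count, std::size_t elem_size)
{
    return static_cast<double>(count) * static_cast<double>(elem_size)
           / (1024.0 * 1024.0 * 1024.0);
}

// Hypothetical helper: average memory throughput in GiB/s when `gib`
// of data is read once in `seconds`.
double throughput_gib_per_s(double gib, double seconds)
{
    return gib / seconds;
}
```

Under these assumptions, reading 4.47 GiB in 0.327 s corresponds to roughly 13.7 GiB/s on one core, and reading 2.24 GiB in 150 ms to roughly 14.9 GiB/s. A summation loop does so little arithmetic per element that memory throughput, rather than computation, is the likely limit in both cases.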

>>...i got max through put time 140 to 150 milliseconds for 600,000,000 (600 million number)
>>Can we reduce time for execution time?

Unroll your loops manually or with a pragma directive. It is simple and effective; in my unoptimized test above it improved performance by about 2.2x (0.718 s down to 0.327 s). You could combine unrolling with OpenMP, which should also improve performance.
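
A minimal sketch of manual 4-way unrolling, under two assumptions that go beyond the test code above: the array length does not have to be a multiple of four (a tail loop handles the remainder), and four independent accumulators are used so that consecutive additions do not wait on each other. The function name `unrolled_sum` is made up for this example.

```cpp
#include <cstddef>

// Hypothetical helper (not from the thread): 4-way unrolled sum with
// independent accumulators and a tail loop for leftover elements.
double unrolled_sum(const double* data, std::size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    std::size_t i = 0;

    // Main loop: four elements per iteration, one accumulator each.
    for (; i + 4 <= n; i += 4) {
        s0 += data[i];
        s1 += data[i + 1];
        s2 += data[i + 2];
        s3 += data[i + 3];
    }

    double total = (s0 + s1) + (s2 + s3);

    // Tail loop: at most three remaining elements.
    for (; i < n; ++i) {
        total += data[i];
    }
    return total;
}
```

Because floating-point addition is not associative, the result can differ in the last bits from a strictly sequential sum. With optimizations enabled, a compiler may already unroll and vectorize the simple loop by itself, so the gain from doing this by hand should be measured rather than assumed.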

>>>'ve just completed a set of tests and numbers look right. Here are my results:>>>

Can you post the disassembled code of your test case?
I'm mainly interested in the unrolled loop.

Soon I will create a new thread in which I plan to compare FFT algorithms compiled by the Intel and Microsoft compilers for speed of execution.
I have already done a very quick comparison between the Intel compiler with full optimization and the Microsoft compiler with optimization disabled, and as expected the Intel compiler was faster.
This test was done on an input data array of 2048 sine function elements.
I plan to add various levels of complexity to my test cases and post the results.

>>Can you post dissasembled code of your test case?

Iliya, I posted the source codes, right? You could compile it and study in a debugger, etc, right?

>>>Iliya, I posted the source codes, right? You could compile it and study in a debugger, etc, right?>>>

Yes, but for the next few days I will probably not have access to the Internet (I'm moving to a new town).

@Sergey

Regarding my planned FFT-related compiler testing: do you have any ideas for interesting test cases? I mean the size of the data sets, what kind of functions to test, the use of random data and so on.

Hello Everyone,
My internet has not been working for the last 2 days; sorry for the late response to everyone's posts.
I have seen that everyone is interested in seeing the aggregate sum code with optimization.

It is as follows:
#include <stdio.h>
#include <omp.h>
#include <stdlib.h>
#include <sys/time.h>   // gettimeofday, struct timeval
#define SIZE 2000000000

struct timeval starttime;
// start timer function implementation
void startTimer()
{
	gettimeofday(&starttime,0);
}
// end timer function implementation: returns the elapsed time in milliseconds
double endTimer()
{
	struct timeval endtime;
	gettimeofday(&endtime,0);

	return (endtime.tv_sec - starttime.tv_sec)*1000.0 + (endtime.tv_usec - starttime.tv_usec)/1000.0;
}

int main ()
{
long int sum=0;
int *A = (int *)malloc(sizeof(int)*SIZE);
int i;
for(i=0;i

>>>check on every ones system and post every one his own result with system configuration i am interested to see how it work on different system>>>

It would be interesting to see the speed of execution when a floating-point array is summed. You could take a step of, for example, 0.00000001 and sum it. I bet that at least two ports will be used, Port 1 and Port 5.

@Swapnil

Do you want to participate in my compiler comparison test? I would like to measure the speed of execution achieved by the Intel and Microsoft compilers. I plan to run various test cases in which a hard-to-optimize FFT algorithm is used.

@ iliyapolak
I would like to participate in your compiler comparison test.

Thank you. Later today I will create a new thread solely for the purpose of the FFT testing.

Hi Swapnil!
I'm posting preliminary results of one of my tests. In this test I compared code fully optimized by the Intel compiler to an unoptimized version of the same algorithm compiled by the Microsoft compiler. This code fills a 4096-element array with sine values and performs an FFT on these values.
Here is the code:

#include "stdafx.h"
// The header names below were stripped by the forum software; these are the ones the code needs.
#include <windows.h>
#include <stdio.h>
#include <math.h>

#define SWAP(a,b) temp=(a);(a)=(b);(b)=temp
void fourier1(double data[],unsigned long nn, int isign);

int _tmain(int argc, _TCHAR* argv[])
{
	int i,q;
	LONGLONG start,end;
	const unsigned long MaxIter = 1e+6;
	// fourier1 indexes data[1..2*nn], so one extra element is needed; test[0] is unused
	double test[4096 + 1];

	test[0] = 0.0;
	for( i = 0;i < 4096;i++)test[i+1] = sin((double)i);

	start = GetTickCount64();
	for(q = 0;q < MaxIter;q++){
		fourier1(test,2048,1);
	}
	end = GetTickCount64();

	// %lld matches the 64-bit tick values
	printf("Intel compiler testcase start value is %lld \n",start);
	printf("Intel compiler testcase end value is %lld \n",end);
	printf("Intel compiler resulting overhead is %lld \n",(end-start));

	//for( i = 1;i <= 2048;i++)printf("FFT test-case 1, fourier transform of sin() = %.17f\n",test[i]);
	return 0;
}

void fourier1(double data[],unsigned long nn,int isign){

	unsigned long n,mmax,m,j,istep,i;
	double wtemp,wr,wpr,wpi,wi,theta,temp,tempi;

	n = nn<<1;
	j = 1;

	// Bit-reversal reordering of the input
	for( i = 1;i < n;i += 2){
		if(j < i){
			SWAP(data[j],data[i]);
			SWAP(data[j+1],data[i+1]);
		}

		m = n >> 1;
		while(m >= 2 && j > m){
			j -= m;
			m >>= 1;
		}
		j+=m;
	}

	// Butterfly passes over sub-transforms of doubling length
	mmax = 2;
	while(n > mmax){
		istep=mmax << 1;
		theta = isign*(6.28318530717959/mmax);
		wtemp = sin(0.5*theta);
		wpr = -2.0*wtemp*wtemp;
		wpi = sin(theta);
		wr = 1.0;
		wi = 0.0;
		for(m = 1;m < mmax;m+=2){
			for(i = m;i<=n;i+=istep){
				j = i+mmax;
				temp = wr*data[j]-wi*data[j+1];
				tempi = wr*data[j+1]+wi*data[j];
				data[j] = data[i] - temp;
				data[j+1] = data[i+1] - tempi;
				data[i] += temp;
				data[i+1] += tempi;
			}

			// Trigonometric recurrence: rotate (wr, wi) by the angle theta
			wr = (wtemp=wr)*wpr-wi*wpi+wr;
			wi = wi*wpr+wtemp*wpi+wi;
		}
		mmax = istep;
	}
}

Result for Intel compiler test case(average of 3 consecutive runs).

Intel compiler testcase start value is 2915299
Intel compiler testcase end value is 3003331
Intel compiler resulting overhead is 88032 msec

So that is ~0.088032 milliseconds per loop iteration.

Strangely, the Microsoft compiler test did not complete in 15000 milliseconds and I was forced to terminate it. Another run will be performed with 100k loop iterations.

Fortunately, the Microsoft compiler test completed, with a whopping 162490 milliseconds per 10000 loop iterations, which means 16.249 msec per loop iteration.
Here is the result:

Microsoft compiler FFT size 4096 testcase start value is 7108747
Microsoft compiler FFT size 4096 testcase end value is 7271237
Microsoft compiler FFT size 4096 testcase resulting overhead is 162490 msec.

I'm mildly curious as to which Microsoft version you consider as "the" Microsoft version. MSVC in VS2012 is the first to make any use of simd instructions, but of course you must specify /arch:SSE2 (preferably AVX) if you wish this in the 32-bit version.
Specification of unsigned long rather than int looks like an unnecessary handicap, as well as having differing meaning on non-Windows platforms.

Hi Tim
For my test I used Visual Studio 2010, and I chose to completely disable optimization in the VS 2010 C/C++ compiler. This was done solely for the sake of comparison between those two compilers. As I wrote in my previous post, I will soon create a thread in which I will test both compilers.
I was not aware that the VS 2010 compiler does not use SIMD vector instructions when optimization settings are chosen.

@Tim
It is very strange that the VS 2010 compiler build did not complete 1e6 loop iterations in a timely manner. In order to minimize function call overhead, I used a trigonometric recurrence to calculate the sin and cos harmonics.
Here are the results for 1e4 loop iterations with optimization enabled.

Array filling with 4096 sin function values
Starting 1e4 loop iterations
[Microsoft VS2010 compiler - optimization on FFT size] 4096 test 3 start value is 8773574 msec
[Microsoft VS2010 compiler - optimization on FFT size] 4096 test 3 end value is 8914271 msec
[Microsoft VS2010 compiler - optimization on FFT size] 4096 test 3 resulting overhead is 140697 msec

And for comparison, here is the same code compiled by the Intel C/C++ compiler, operating on the same data set: the data array is filled with 4096 sine values and the loop is iterated 1e4 times. The results:

Intel compiler testcase start value is 9510476 msec
Intel compiler testcase end value is 9511396 msec
Intel compiler resulting overhead is 920 msec
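
The trigonometric recurrence mentioned above (the wr/wi update at the end of the loop over m in fourier1) can be looked at in isolation. The sketch below, with the made-up function name `twiddles`, generates cos(k·theta) and sin(k·theta) for successive k with only two calls to sin() in total, instead of one sin() and one cos() call per value.

```cpp
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper (not from the thread): returns the pairs
// (cos(k*theta), sin(k*theta)) for k = 0 .. count-1, computed with the
// same recurrence that fourier1 uses for wr and wi.
std::vector<std::pair<double, double>> twiddles(double theta, std::size_t count)
{
    const double wtemp = std::sin(0.5 * theta);
    const double wpr = -2.0 * wtemp * wtemp;  // equals cos(theta) - 1
    const double wpi = std::sin(theta);
    double wr = 1.0;                          // cos(0)
    double wi = 0.0;                          // sin(0)

    std::vector<std::pair<double, double>> out;
    out.reserve(count);
    for (std::size_t k = 0; k < count; ++k) {
        out.emplace_back(wr, wi);
        // Rotate (wr, wi) by theta using the angle-addition formulas.
        const double prev_wr = wr;
        wr = prev_wr * wpr - wi * wpi + prev_wr;
        wi = wi * wpr + prev_wr * wpi + wi;
    }
    return out;
}
```

Writing cos(theta) − 1 as −2·sin²(theta/2) avoids the cancellation that a direct subtraction would suffer for small angles. Rounding errors still accumulate slowly with k, so the values drift slightly from those returned by sin() and cos().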

For anyone still interested in FFT testing, I have new, more accurate results. Instead of calling the fourier() routine 1e6 times and measuring the time of execution, I used the compiler intrinsic function __rdtsc() to measure the first for-loop block (responsible for dividing the data into odd and even parts) and the while-loop block (the main execution body) of the function.

For a 4096-point FFT of the sine function, the execution time was ~212145 nanoseconds, i.e. 212 microseconds.

Later I will continue my evaluation of the various functions being transformed and the time needed to accomplish that.
