Same code, same compiler options, performance is poorer with Intel than with Visual

Hi,
After reading so much about how Intel C++ is better than other compilers, I decided to test it myself (I have some real code to optimize).
I tried many option combinations (O1, O2, Ox; SSE2, SSE3, SSE4.1, SSE4.2; data alignment; IPO; auto-parallelization with and without the loop level set). I have C++0x enabled, and I am using the restrict keyword for Intel (Visual does not recognize it). The optimization diagnostic level is set to 3, and I am compiling for x64.

And, after a dozen or so checks, I can say that the Intel build runs 0.3 fps (about 4.8%) slower than the Visual one.

The auto-parallelizer actually makes things slower than the serial version (half as fast, to be exact). I think this is because my functions are small but called very often.
Unsurprisingly, OpenMP performed similarly to the auto-parallelizer.

gcc 3.4.6 is about 30% slower. I will also run the tests on gcc 4.x.x and Open64.

Do you have any ideas what else could improve performance, or why it is still slower than Visual?

I am using Intel v.11 and Visual Studio 2005.

Thanks for help.


Joe,

Have you run a profiler to help determine where the problem is?

The problem may not be related to the code you compile. For example, if you are compiling x64 code but linking a 32-bit library for graphics, then the x64 code will perform an operation called a "thunk" each time it calls into the library (to transition and pass arguments from x64 to 32-bit and back when necessary).

Data and code alignment may affect the performance (usually does).

Don't blame the compiler when the circumstances are beyond its control.

Jim Dempsey

www.quickthreadprogramming.com

Quoting - joex26
Hi,
After reading so much about how Intel C++ is better than other compilers, I decided to test it myself (I have some real code to optimize).
I tried many option combinations (O1, O2, Ox; SSE2, SSE3, SSE4.1, SSE4.2; data alignment; IPO; auto-parallelization with and without the loop level set). I have C++0x enabled, and I am using the restrict keyword for Intel (Visual does not recognize it). The optimization diagnostic level is set to 3, and I am compiling for x64.

And, after a dozen or so checks, I can say that the Intel build runs 0.3 fps (about 4.8%) slower than the Visual one.

The auto-parallelizer actually makes things slower than the serial version (half as fast, to be exact). I think this is because my functions are small but called very often.
Unsurprisingly, OpenMP performed similarly to the auto-parallelizer.

gcc 3.4.6 is about 30% slower. I will also run the tests on gcc 4.x.x and Open64.

Do you have any ideas what else could improve performance, or why it is still slower than Visual?

I am using Intel v.11 and Visual Studio 2005.

Thanks for help.


Hi

Use WMI to post the (public) characteristics of your machine before running down a product without justification.
A compiler requires several tests on different, specifically appropriate hardware before a generalized evaluation.
About the GNU compiler: you would do better to use a recently upgraded version on the penguin operating system rather than on Bill's side; otherwise it is as if you eat soup with a fork.
Some people sell toothbrushes to birds and present the advice as an effective benefit.
Is the penguin (not an old one) a bird with teeth? Maybe it requires a flag (-bite) to appreciate the performance...

Example, compiling with an "optimized" result flag:
C:\Program Files\Bill_Icc /SSE** /bite optimized_pig.cpp

An objective answer????
Kind regards

Dear all,
Thank you for your responses. Please don't take my statements as disrespect toward Intel's compiler.
I just posted the real results from my tests, averaged over 12 runs for each option set.

I am not using any 32-bit libraries, so thunking should not be an issue here. I will post the WMI data tomorrow when I am back at work.

I profiled with the AQTime profiler, so I am aware of the hotspots in the software. The code is actually already heavily optimized, so maybe there is not much left for the Intel compiler to do?

I am developing software based on this:
http://research.nokia.com/research/mobile3D

Quoting - joex26
I am developing software based on this:
http://research.nokia.com/research/mobile3D

See if you can provide a test case so we can investigate.

Jennifer

Quoting - joex26
Dear all,
Thank you for your responses. Please don't take my statements as disrespect toward Intel's compiler.
I just posted the real results from my tests, averaged over 12 runs for each option set.

I am not using any 32-bit libraries, so thunking should not be an issue here. I will post the WMI data tomorrow when I am back at work.

I profiled with the AQTime profiler, so I am aware of the hotspots in the software. The code is actually already heavily optimized, so maybe there is not much left for the Intel compiler to do?

I am developing software based on this:
http://research.nokia.com/research/mobile3D

Hi Joex26
Seriously, without the joking:
I think you would do better to run the test with the pair ICC 11.x and VC2008, or the latest versions.
You will probably discover approximately the same result if your code has not been written specifically to favor the ICC side.

See if you can change some while loops into for loops, with a correct private local pointer, to enable vectorization.
Also see TBB and OpenMP used elsewhere, for example on the SCTP side, etc.
Also, if you can use one of the latest processor types, such as a Core i7, Atom, or ULV, that is better.
You are welcome here, with free choice of your own assessment and of your favorite compiler.
With or without the ICC compiler, success to every member of your team.
Computers the size of a phone have good potential for success, now or tomorrow, I hope.
Kind regards
After reflection, I add this to the exchange (with flowers):
a precaution: if tomorrow I buy a Nokia phone, may I not find a penguin inserted in it, optimized to bite my ear.

Bustaf:
Your utterance is basically babbling. What did you smoke, man?
Please read my text slowly; then you will notice which compilers I used.

Rolling loops back up so they can be unrolled automatically? That must be next-century technology; I did not know about it.

The only two terms in your text that make sense are OpenMP and TBB.

I tried OpenMP, but it runs slower, similarly to automatic parallelization. And TBB behaves similarly to OpenMP, so I skipped it.

And I have no connection with Nokia, so your attitude toward Nokia phones matters to me about as much as last year's snow.

And yeah, you are great - you have a brown belt!

Quoting - joex26

Bustaf:
Your utterance is basically babbling. What did you smoke, man?
Please read my text slowly; then you will notice which compilers I used.

Rolling loops back up so they can be unrolled automatically? That must be next-century technology; I did not know about it.

The only two terms in your text that make sense are OpenMP and TBB.

I tried OpenMP, but it runs slower, similarly to automatic parallelization. And TBB behaves similarly to OpenMP, so I skipped it.

And I have no connection with Nokia, so your attitude toward Nokia phones matters to me about as much as last year's snow.

And yeah, you are great - you have a brown belt!

Hi
"I tried OpenMP but it works slower, similarly to automatic parallelization."
Then you need to learn how the source code must be written correctly.
About while() versus for():
with a while loop the compiler has no information about where the iteration space can be cut or divided into chunks -
started in pairs, odd/even, or from both sides with ++ and --.
About TBB: if you are working in C++ it is a very nice library for reducing or simplifying source code.
Also, OpenMP is not reserved only for loops; use it elsewhere too, for example to modify an old socket with the new SCTP,
create a group of dummy virtual addresses, divide the work into asynchronous chunks, and observe the result (out via the default gateway):
that is not -4%, it is a +25/30% improvement.
About phones, from Nokia or other brands: I think the serious subject is that this type of object
can now be used as a computer, which is an opportunity to come out of the financial crisis
(it deserves respect).
It is a disappointment that you do not discern the difference between fun and reality.
If you have a sample to show your level publicly in a browser from your HTTP server,
I have one too; then everyone can evaluate which of the two of us must go back to school.
I do not like using this kind of language; I was just forced to by your aggressive answer. (Supposed smoker.)
Kind regards

Joe,

Would it be possible for you to post sample code showing the problem?

Auto-parallelization, OpenMP and TBB have different characteristics; I would not place OpenMP and TBB in the same category. While I am not promoting TBB, I suggest you not discount it so quickly by saying it works similarly to OpenMP.

You might try profiling the code to see what is happening. Intel has VTune, but you can also run AMD's CodeAnalyst (CA) using timer-based sampling on Intel processors. CA is a free download and IMHO is simpler to use. Also, Intel has a demo of Parallel Advisor that might provide insight into the bottleneck.

From the symptom description I suspect something other than compiler optimizations is at play.

RE: thunk

Several months ago I downloaded the Havok Smoke demo (32-bit). My system runs Windows XP x64. While I can compile and run 32-bit applications, the OpenGL display drivers required thunking to transition between 32-bit and 64-bit code. The result was an abysmal frame rate. When obscuring the display window (or minimizing it), performance was restored. There may be an option in the Performance Monitor to count thunks; that would tell you whether this is affecting your performance.

Jim Dempsey

www.quickthreadprogramming.com

Jim,
I used a profiler to find the bottlenecks in the software. I am working on the encoder side. You can download the software from the link I provided to get a wider view.

The most problematic functions are findSad2_16x and findSad2_8x.
They take two thirds of the total time. Unfortunately I cannot be precise about the timings because my evaluation period for AQTime has ended. I will continue profiling with another profiler and send the data later.

These two functions have similar while loops:

#ifdef INTEL
int32 findSad2_8x(u_int8 *restrict orig, u_int8 *restrict ref, int w, int blkHeight, int32 sad, int32 bestSad)
#else
int32 findSad2_8x(u_int8 *orig, u_int8 *ref, int w, int blkHeight, int32 sad, int32 bestSad)
#endif
{
#ifndef VECTORIZATION
    int j;

    j = 0;
    do {
        sad += ABS_DIFF(orig[j*MBK_SIZE+0], ref[j*2*w+0]);
        sad += ABS_DIFF(orig[j*MBK_SIZE+1], ref[j*2*w+2]);
        sad += ABS_DIFF(orig[j*MBK_SIZE+2], ref[j*2*w+4]);
        sad += ABS_DIFF(orig[j*MBK_SIZE+3], ref[j*2*w+6]);
        sad += ABS_DIFF(orig[j*MBK_SIZE+4], ref[j*2*w+8]);
        sad += ABS_DIFF(orig[j*MBK_SIZE+5], ref[j*2*w+10]);
        sad += ABS_DIFF(orig[j*MBK_SIZE+6], ref[j*2*w+12]);
        sad += ABS_DIFF(orig[j*MBK_SIZE+7], ref[j*2*w+14]);
        j++;
    } while (sad < bestSad && j < blkHeight);
#endif
#ifdef VECTORIZATION
    int j, i;
    for (j = 0; j < blkHeight; ++j) {
        for (i = 0; i < 8; ++i)
            sad += ABS_DIFF(orig[j*MBK_SIZE+i], ref[j*2*w+i*2]);
        if (sad >= bestSad)
            break;
    }
#endif
    return sad;
}

where ABS_DIFF is a macro fetching results from a LUT:
#define ABS_DIFF(a, b) ((absDiff+MAX_DIFF)[(int)(a) - (int)(b)])

static const u_int8 absDiff[2*MAX_DIFF+1] = {
255,254,253,252,251,250,249,248,247,246,245,244,243,242,241,240,
......
.....
and so on....

as you can see, I tried to change the while to a for (to help the vectorizer), but it ran slower that way.
Maybe the break; statement is preventing the vectorizer from working properly; I don't know.
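If the break really is the obstacle, one workaround - a sketch only, since MBK_SIZE, the real integer types, and the ABS_DIFF table are not fully shown in the post - is to give the inner loop a fixed trip count and test the early-exit condition once per row, outside it. Here the LUT is replaced by a plain absolute difference, which can vectorize where a table lookup cannot:

```cpp
#include <cstdint>

// Illustrative absolute difference, standing in for the ABS_DIFF LUT macro.
static inline int absdiff(int a, int b) { return a > b ? a - b : b - a; }

// Sketch: the inner 8-wide loop has a countable trip count (vectorizable);
// the sad >= bestSad early exit is hoisted to once per row.
int sad_rows(const uint8_t *orig, const uint8_t *ref,
             int w, int blkHeight, int sad, int bestSad)
{
    const int MBK_SIZE = 16;            // assumed macroblock stride
    for (int j = 0; j < blkHeight; ++j) {
        int rowSad = 0;
        for (int i = 0; i < 8; ++i)     // fixed trip count, no break inside
            rowSad += absdiff(orig[j * MBK_SIZE + i], ref[j * 2 * w + i * 2]);
        sad += rowSad;
        if (sad >= bestSad)             // early exit checked per row
            break;
    }
    return sad;
}
```

Whether this wins depends on how often the early exit fires mid-row; the per-row check does slightly more work in the worst case.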

I also tried OpenMP on the 'for' loops in these two functions, but as blkHeight can only be 8 or 16, there is barely enough work even for one thread before the function returns, so I assume that splitting across a few threads adds so much overhead that the whole program ends up running slower.
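That overhead argument can also be handled mechanically: OpenMP's if() clause makes the parallel region conditional, so the runtime falls back to serial execution when the trip count is too small to amortize thread startup. A minimal sketch (the threshold of 64 and the function name are illustrative, not from the encoder):

```cpp
// With if(blkHeight > 64) the loop only goes parallel for large trip counts;
// for blkHeight of 8 or 16 it runs serially in the calling thread, avoiding
// the thread-team overhead described above. The pragma is simply ignored
// when OpenMP is not enabled, so the function is correct either way.
int sum_rows(const int *rowSad, int blkHeight)
{
    int sad = 0;
    #pragma omp parallel for reduction(+:sad) if(blkHeight > 64)
    for (int j = 0; j < blkHeight; ++j)
        sad += rowSad[j];
    return sad;
}
```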

Since I read that TBB also targets data parallelism, I concluded that it would not help much in this case.

About thunking: during the next profiling round I will take a look at the thunk counter.

Options for Visual:
/Ox /Oi /Ot /GL /I "C:\Program Files (x86)\boost\boost_1_39" /I ".\include" /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_VC80_UPGRADE=0x0600" /D "_ATL_MIN_CRT" /D "_MBCS" /GF /FD /EHsc /MT /arch:SSE2 /fp:fast /GR- /Fp".\Release/MVCEncoder.pch" /Fo".\Release/" /Fd".\Release/" /W3 /nologo /c /Wp64 /Zi /TP /errorReport:prompt

/OUT:".\Release/MVCEncoder.exe" /INCREMENTAL:NO /NOLOGO /LIBPATH:"C:\Program Files (x86)\boost\boost_1_39\lib" /MANIFEST /MANIFESTFILE:"x64\Release\MVCEncoder.exe.intermediate.manifest" /DEBUG /PDB:".\Release/MVCEncoder.pdb" /SUBSYSTEM:CONSOLE /OPT:NOWIN98 /LTCG /MACHINE:X64 /ERRORREPORT:PROMPT kernel32.lib user32.lib gdi32.lib winspool.lib comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib uuid.lib odbc32.lib odbccp32.lib

And for Intel:
/c /O1 /Og /Oi /Ot /Qipo /GA /I "C:\Program Files (x86)\boost\boost_1_39" /I ".\include" /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_VC80_UPGRADE=0x0600" /D "_ATL_MIN_CRT" /D "_MBCS" /GF /EHsc /MT /GS /arch:SSE3 /fp:fast /Fo".\Release/" /W3 /nologo /Wp64 /Zi /TP /Quse-intel-optimized-headers /Qstd=c++0x /Qrestrict /Qopt-report:3 /Qopt-report-file:"x64\Release/MVCEncoder.rep"

kernel32.lib user32.lib gdi32.lib winspool.lib comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib uuid.lib odbc32.lib odbccp32.lib /OUT:".\Release/MVCEncoder.exe" /INCREMENTAL:NO /nologo /LIBPATH:"C:\Program Files (x86)\boost\boost_1_39\lib" /MANIFEST /MANIFESTFILE:"x64\Release\MVCEncoder.exe.intermediate.manifest" /TLBID:1 /DEBUG /PDB:".\Release/MVCEncoder.pdb" /SUBSYSTEM:CONSOLE /OPT:NOWIN98 /IMPLIB:"C:\Users\michal\projects\VisualStudio2005\MVCTwoThreads\Release\MVCEncoder.lib" /MACHINE:X64

O1 works fastest for Intel.

Thanks for your advice.

Download CodeAnalyst from http://developer.amd.com/CPU/CODEANALYST/Pages/default.aspx
Use timer-based profiling (works on IA32 and EM64T).

It looks like the problem is that the LUT access is not vectorizing because it requires a gather operation. Therefore I do not think SSE is, or can be, used effectively with your LUT until later instruction-set versions (AVX) supporting scatter/gather.

However, consider replacing your LUT with the SSE instruction PSADBW and its intrinsic:

__m128i _mm_sad_epu8(__m128i a, __m128i b)

It computes the absolute differences of the 16 unsigned 8-bit values of a and b.

Then use a horizontal add to get the increment for sad.
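A sketch of this suggestion, with the caveat that the reference row in findSad2_8x is strided (every other pixel), so the eight bytes must be gathered into a contiguous buffer before PSADBW can be applied; the helper name and buffer handling here are illustrative, not from the encoder:

```cpp
#include <emmintrin.h>  // SSE2: _mm_sad_epu8 (PSADBW)
#include <cstdint>

// SAD of one 8-pixel row without the LUT.
static inline int sad8_row(const uint8_t *orig, const uint8_t *ref_strided)
{
    // PSADBW needs contiguous data; ref is sampled with a stride of 2
    // (ref[j*2*w + i*2] in the post), so gather the 8 bytes first.
    uint8_t origBuf[16] = {0};
    uint8_t refBuf[16]  = {0};
    for (int i = 0; i < 8; ++i) {
        origBuf[i] = orig[i];
        refBuf[i]  = ref_strided[i * 2];
    }
    __m128i a = _mm_loadu_si128((const __m128i *)origBuf);
    __m128i b = _mm_loadu_si128((const __m128i *)refBuf);
    // _mm_sad_epu8 yields two 16-bit partial sums (lanes 0 and 4);
    // the upper 8 bytes are zero here, so lane 4 contributes 0.
    __m128i s = _mm_sad_epu8(a, b);
    return _mm_cvtsi128_si32(s) + _mm_extract_epi16(s, 4);
}
```

The manual gather may eat into the gain, so it is worth measuring against the scalar LUT version rather than assuming a win.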

Jim Dempsey

www.quickthreadprogramming.com

Just an update to this issue.

I checked the loop in findSad2_16x(). It is not vectorizable, but the code generated by icl should not run slower, so I've filed an issue report with the compiler team.
I'll let you know when there is any progress.

Jennifer

Quoting - Jennifer Jiang (Intel)


I checked the loop in findSad2_16x(). It is not vectorizable, but the code generated by icl should not run slower, so I've filed an issue report with the compiler team.

32-bit ICL is doing some spills in that loop and not optimizing out the multiply, which can be corrected by cutting back on the source-level unrolling. gcc can be tricky, depending on whether you specified a level of unrolling aggressiveness appropriate to your CPU (different for Penryn and Core i7, for example).

Quoting - Jennifer Jiang (Intel)

Just an update to this issue.

I checked the loop in findSad2_16x(). It is not vectorizable, but the code generated by icl should not run slower, so I've filed an issue report with the compiler team.
I'll let you know when there is any progress.

Jennifer

Hi
I don't know if this is related to the low performance (the OpenMP subject) mentioned before in that specific situation, but I observe a threading problem with ICC.
I add a simple function to show where the problem is.

Remark: personally I do not have the same problem, because I drive my threading manually, old school (with the pthread library), under g++ or icc.

The function searches for the relation between a text and an array of words;
there is a loop that lets each word search run as a separate chunk.
Situation: you want to determine from a text string whether your search relates to a specific sector.
Example array:
{"network","wireless","lan","wan","gateway","mask","cid","datagram","stream","tcp","socket", etc.}
If you add a flag (shared, as an integer), each chunk can observe a halt flag.
You decide that x words (nps in the sample) suffice to determine the sector relation,
much as a search engine determines an activity relation for a deal.

If the flag reaches the value x (nps in the sample) before all chunks have found their occurrences, the chunks can be halted
(those threads are now unnecessary, so they are halted instead of working for the wind).
(The asynchronous improvement here comes from the random probability of an early match, not from the number of calls.)
In contrast to g++, which gives varying times as the required number of relations (nps) changes,
run the test with ICC and you observe the time unchanged whether you increase or decrease the number of relations.
I think the threading is not working correctly... logically the time must vary with the required number of relations.

Unless I am mistaken, the function is simple and clear enough to make this evident, I think.

SAMPLE:

// (include list reconstructed -- the forum stripped the angle-bracketed
//  header names; these are the headers the sample actually needs)
#include <cstring>
#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <omp.h>
#include <sched.h>
#include <pthread.h>
#include <unistd.h>

////////////////////////////////////////////////////////////////////////////////
//VOID COUNT_ARRAY_OCCURS FOR COUNT ARRAY OCCURRENCES ASYNCHRONY IN AN STRING//
////////////////////////////////////////////////////////////////////////////////
int
count_array_occurs (char *a, char **b, int c, int d)
{
// A IS GLOBAL STRING WHERE MUST COUNTED OCCURRENCE
// B IS ARRAY OF WORDS OCCURRENCES AS MUST COUNTED IN A
// C IS SIZE CALLED OF ARRAY
// D IS NUMBER PROBABILITY RELATION REQUIRED TO AN DEDUCTION
int la = strlen (a);
int lc[c];
int noc[c];
int pos[la][c];
int j;
int x;
int k;
int p = 0;
for (int i = 0; i <= c - 1; i++)
{
lc[i] = strlen (b[i]);
noc[i] = 0;
}
//omp_set_nested (c);
#pragma omp parallel shared(a,p) private(j,k,x)
{
for (int i = 0; i <= c - 1; i++)
{
// WRONG ADDED JUST FOR VERIFICATION CROSS AND KMP_AFFINITY 0
// if((i % 2) ==0){p=0;}else{p=1;}cpu_set_t mask;CPU_ZERO(&mask);CPU_SET(p,&mask);
// if (sched_setaffinity(0, sizeof(cpu_set_t), &mask) <0) {perror("sched_setaffinity");}

if (p > d) //FLAG HALT AS SATISFACT POINT
{
i = c - 1;
}
#pragma omp sections nowait
{

#pragma omp section
for (j = 0; j <= la - 1; j++)
{
x = 0;
if (a[j] == b[i][0])
{
for (k = j; k <= j + lc[i] - 1; k++)
{
if (a[k] == b[i][x] && x <= lc[i] - 1)
{
x++;
}
if (x == lc[i])
{
noc[i]++;
pos[noc[i]][i] = k - x;
p++;

}
}
}
}
}
}
}
for (int i = 0; i <= c - 1; i++)
{
std::cout << noc[i] << " <-OCC-> " << b[i] << std::endl;
int m = 1;
while (pos[m][i] != 0)
{
std::cout << pos[m][i] << " <-POS-> " << b[i] << " <-TO-> " << pos[m][i] + strlen (b[i]) << std::endl;
m++;
}
}
return (0);
}

// JUST ADDED AS TEST TO VERIFY WITHOUT LOOP (AS DEFINED)
///////////////////////////////////////////////////////////////////////////////
//VOID BADGERS_LOOP_2_OCCURS FOR COUNT 2 OCCURRENCES ASYNCHRONY IN AN STRING//
///////////////////////////////////////////////////////////////////////////////
//A IS STRING WHERE MUST COUNTED OCCURRENCE
//B IS IS FIRST WORD OCCURRENCE COUNTED IN A
//C IS IS SECOND WORD OCCURRENCE COUNTED IN A
int
badgers_loop_2_occurs (char *a, char *b, char *c)
{
int la = strlen (a);
int lb = strlen (b);
int lc = strlen (c);
int j;
int x;
int k;
int noc = 0;
int noc1 = 0;
#pragma omp parallel shared(a) private(j,k,x)
{
#pragma omp sections nowait
{
#pragma omp section
for (j = 0; j <= la - 1; j++)
{
x = 0;
if (a[j] == b[0])
{
for (k = j; k <= j + lb - 1; k++)
{
if (a[k] == b[x] && x <= lb - 1)
{
x++;
}
if (x == lb)
{
noc++;
}
}
}
}
#pragma omp section
for (j = 0; j <= la - 1; j++)
{
x = 0;
if (a[j] == c[0])
{
for (k = j; k <= j + lc - 1; k++)
{
if (a[k] == c[x] && x <= lc - 1)
{
x++;
}
if (x == lc)
{
noc1++;
}
}
}
}
}
}
return noc + noc1; // NOC + NOC1 IS SUM OF OCCURRENCES
}

int
main (int argc, char *argv[])
{
char testocc[4096];
char *strtab[20] = { "smoker", "system", "Run", "quotes", "not", "a", "sample is :", "on", "the", "server", "to", "re", "ll", "faults", "pre", "ch", "na", " is", ",", "." };
int nps = 100;
strcpy (testocc,
" \nString testocc sample is:\n (Run the system preparation tool on the server to check the client system for the prerequisite software and again, if necessary, to install any missing software.The simplest way to install the software on a server is to use the batch installer.This method have effect to list table, is part of method.Your choice is selected and you must submit to extend.Thank you and sorry for all syntax faults) open The quote opening character (s).close The quote closing character "
"(s).If undefined, it will equal open.open2 The quote opening character (s) for quoteswithin quotes.If undefined, it will equal open.Close2 The quote closing character (s) for quotes within quotes.If undefined, it will equal close.Repeat 2 \nString testocc sample is:\n (Run the system preparation tool on the server to check the client system for the prerequisite software and again, if necessary, to install any missing software.The simplest way to install the software on a server is to use the batch installe"
"r.This method have effect to list table, is part of method.Your choice is selected and you must submit to extend.Thank you and sorry for all syntax faults)open The quote opening character (s).close The quote closing character (s).If undefined, it will equal open.open2 The quote opening character (s) for quoteswithin quotes.If undefined, it will equal open.Close2 The quote closing character (s) for quoteswithin quotes.If undefined, it will equal close.Repeat 2 \nString testocc sample is:\n (Run the system "
"preparation tool on the server to check the client system for the prerequisite software and again, if necessary, to install any missing software.The simplest way to install the software on a server is to use the batch installer.This method have effect to list table, is part of method.Your choice is selected and you must submit to extend.Thank you and sorry for all syntax faults) open The quote opening character (s).close The quote closing character (s).If undefined, it will equal open.open2 The quote openi"
"ng character (s) for quotes within quotes.If undefined, it will equal open.Close2 The quote closing character (s) for quotes within quotes.If undefined, it will equal close.Repeat 2\nString testocc sample is:\n (Run the system preparation tool on the server to check the client system for the prerequisite software and again, if necessary, to install any missing software.The simplest way to install the software on a server is to use the batch installer.This method have effect to list table, is part of method"
".Your choice is selected and you must submit to extend.Thank you and sorry for all syntax faults) open The quote opening character (s).close The quote closing character (s).If undefined, it will equal open.open2 The quote opening character (s) for quotes within quotes.If undefined, it will equal open.Close2 The quote closing character (s) for quotes within quotes.If undefined, it will equal close.Repeat 2 \nString testocc sample is:\n (Run the system preparation tool on the server to check the client syste"
"m for the prerequisite software and again, if necessary, to install any missing software.The simplest way to install the software on a server is to use the batch installer.This method have effect to list table, is part of method.Your choice is selected and you must submit to extend.Thank you and sorry for all syntax faults) open The quote opening character (s).close The quote closing character (s).If undefined, it will equal open.open2 The quote opening character (s) for quotes within quotes.If undefined, "
"it will equal open.Close2 The quote closing character (s) for quotes within quotes.If undefined, it will equal close.");

std::cout << testocc << std::endl;
std::cout << "\nFunction count_array_occurs\n" << std::endl;
int res = count_array_occurs (testocc, strtab, 20, nps);
std::cout << "\nbadgers_loop_2_occurs (testocc, \"a\", \"system\") result:" << std::endl;
std::cout << badgers_loop_2_occurs (testocc, "a", "system") << std::endl;
}

END SAMPLE

Remark: add the flag -Wno-write-strings with the GNU compiler to disable the string warnings (C++).

Information about the time utility used:
on a uniprocessor, the difference between the real time and the total processor time, that is:
real - (user + sys)
is the sum of all of the factors that can delay the program, plus the program's own unattributed costs.
On an SMP, an approximation would be as follows:
real * number_of_processors - (user + sys)

The problem is not a bad time as such (that would just be a wrong result); it is the absence of variability.
I have also tested with a random size for the i loop, to see if there is a pthread_key_t (first/last) problem, but it is the same????

GNU COMPILER nps=100
real 0m0.017s
user 0m0.004s
sys 0m0.000s

GNU COMPILER nps=1000
real 0m0.036s
user 0m0.000s
sys 0m0.000s

ICC COMPILER nps=100
real 0m0.202s
user 0m0.000s
sys 0m0.004s

ICC COMPILER nps=1000
real 0m0.202s
user 0m0.012s
sys 0m0.004s

(Libraries)
Build (shared) with ICC (same result if static):
linux-vdso.so.1 => (0x00007ffff83fe000)
libpthread.so.0 => /lib/libpthread.so.0 (0x00007f8feff2e000)
libimf.so => /opt/intel/Compiler/11.0/081/lib/intel64/libimf.so (0x00007f8fefbd8000)
libsvml.so => /opt/intel/Compiler/11.0/081/lib/intel64/libsvml.so (0x00007f8ff0193000)
libm.so.6 => /lib/libm.so.6 (0x00007f8fef955000)
libiomp5.so => /opt/intel/Compiler/11.0/081/lib/intel64/libiomp5.so (0x00007f8fef7c5000)
libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0x00007f8fef4be000)
libintlc.so.5 => /opt/intel/Compiler/11.0/081/lib/intel64/libintlc.so.5 (0x00007f8fef380000)
libgcc_s.so.1 => /lib/libgcc_s.so.1 (0x00007f8fef169000)
libc.so.6 => /lib/libc.so.6 (0x00007f8feee16000)
libdl.so.2 => /lib/libdl.so.2 (0x00007f8feec12000)
/lib64/ld-linux-x86-64.so.2 (0x00007f8ff014a000)

Build with GNU g++:
linux-vdso.so.1 => (0x00007fffe93ff000)
libpthread.so.0 => /lib/libpthread.so.0 (0x00007f50e0fa9000)
libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0x00007f50e0ca2000)
libm.so.6 => /lib/libm.so.6 (0x00007f50e0a1f000)
libgomp.so.1 => /usr/lib/libgomp.so.1 (0x00007f50e0817000)
libgcc_s.so.1 => /lib/libgcc_s.so.1 (0x00007f50e0600000)
libc.so.6 => /lib/libc.so.6 (0x00007f50e02ad000)
/lib64/ld-linux-x86-64.so.2 (0x00007f50e11c5000)
librt.so.1 => /lib/librt.so.1 (0x00007f50e00a4000)

Remark: I used this somewhat older machine; it just has the original kernel and all original libraries (no rebuild), from a default Debian 5 install.
The GNU compiler is also the original one, not a recent snapshot.
(I must make the test with a Core i7 or another new 4-core machine, and with the latest ICC versions; I just need to find the time...)
Maybe the problem is this machine type???

Target: x86_64-linux-gnu
gcc version 4.3.2 (Debian 4.3.2-1.1)

Machine used:
debian:/# cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 4
model name : Intel Pentium 4 CPU 3.20GHz
stepping : 3
cpu MHz : 2800.000
cache size : 2048 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 1
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall lm constant_tsc pebs bts pni monitor ds_cpl est cid cx16 xtpr
bogomips : 6403.60
clflush size : 64
cache_alignment : 128
address sizes : 36 bits physical, 48 bits virtual
power management:

processor : 1
vendor_id : GenuineIntel
cpu family : 15
model : 4
model name : Intel Pentium 4 CPU 3.20GHz
stepping : 3
cpu MHz : 2800.000
cache size : 2048 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 1
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall lm constant_tsc pebs bts pni monitor ds_cpl est cid cx16 xtpr
bogomips : 6398.20
clflush size : 64
cache_alignment : 128
address sizes : 36 bits physical, 48 bits virtual
power management:

Parallel to parallel....
You know the metaphor about the programmer who wants to improve absolutely even what is already good?
"To prove how big his love is, he kissed her so strongly that he swallowed her eye."

Kind r..e.g....ar........ds

Joe,

I suggest you change your programming style a little by adding comments to your {}'s so that your scopes are obvious. This way you may eliminate programming errors (or avoid making assumptions you ought not to make).

Example

//omp_set_nested (c);
#pragma omp parallel  shared(a,p) private(j,k,x)
{
 for (int i = 0; i <= c - 1; i++)
 {
  // WRONG ADDED JUST FOR VERIFICATION CROSS AND KMP_AFFINITY 0
  // if((i % 2) ==0){p=0;}else{p=1;}cpu_set_t mask;CPU_ZERO(&mask);CPU_SET(p,&mask);
  // if (sched_setaffinity(0, sizeof(cpu_set_t), &mask) <0) {perror("sched_setaffinity");}

  if (p > d)                //FLAG HALT AS SATISFACT POINT                            
  {
   i = c - 1;
  }
  #pragma omp sections nowait        
  {
   #pragma omp section
   for (j = 0; j <= la - 1; j++)
   {
    x = 0;
    if (a[j] == b[i][0])
    {
     for (k = j; k <= j + lc[i] - 1; k++)
     {
      if (a[k] == b[i][x] && x <= lc[i] - 1)
      {
       x++;
      }
      if (x == lc[i])
      {
       noc[i]++;
       pos[noc[i]][i] = k - x;
       p++;
      } // if (x == lc[i])
     } // for (k = j; k <= j + lc[i] - 1; k++)
    } // if (a[j] == b[i][0])
   } // for (j = 0; j <= la - 1; j++)
   // end #pragma omp section
  } // end #pragma omp sections nowait
 } // for (int i = 0; i <= c - 1; i++)
} // end #pragma omp parallel
// begin serial code
for (int i = 0; i <= c - 1; i++)
{
std::cout << noc[i] << "   <-OCC->   " << b[i] << std::endl;
int m = 1;
while (pos[m][i] != 0)
{
std::cout << pos[m][i] << "   <-POS->   " << b[i] << "   <-TO->   " << pos[m][i] + strlen (b[i]) << std::endl;
m++;
}
}
return (0);
}

In the above you can now clearly see that you have

omp parallel
omp sections
omp section ?????? one section ???
end omp sections
end omp parallel

The effect of the above is that only one thread is doing any productive work.

Jim Dempsey

www.quickthreadprogramming.com

Also, are you missing an omp for on your for(i= loop?

If you intend to run each thread using for(i= then you must resolve race conditions with

noc[i]++;
pos[noc[i]][i] = k - x;

where you have concurrent access to same array location by multiple threads.
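A minimal illustration of the kind of protection this requires (names are illustrative, not from the posted sample): without the atomic, concurrent increments of a shared counter can be lost.

```cpp
// Each thread increments the shared counter 'hits'. The '#pragma omp atomic'
// serializes just the increment, so no updates are lost; remove it and the
// result can come up short under contention. Without OpenMP enabled the
// pragmas are ignored and the loop simply runs serially, still correct.
int count_hits(int iterations)
{
    int hits = 0;
    #pragma omp parallel for
    for (int i = 0; i < iterations; ++i) {
        #pragma omp atomic
        hits++;
    }
    return hits;
}
```

For the posted code, the same idea applies to noc[i]++ and the pos[][] update, except that those two statements must be protected together (e.g. with a critical section), since the index written to pos depends on the incremented noc[i].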

Jim

www.quickthreadprogramming.com

Quoting - jimdempseyatthecove

Also are you missing an omp for on your for(i= loop?

If you intend to run each thread using for(i= then you must resolve race conditions with

noc[i]++;
pos[noc[i]][i] = k - x;

where you have concurrent access to same array location by multiple threads.

Jim

1] First, I am not Joe.
2] Sorry, I believe you have understood absolutely nothing.
Your suggestions read as literature, completely beside the point.
Please post code in your own style, with the same functionality and specific results, using time as the reference.
Also clean your glasses, to see what you have not seen.
You can play the professor, with technical Shakespearean style and hidden literature, with others, but not with me.
Trace all the threads and you will probably understand how far off you are...

Bustaf

Quoting - bustaf

1] First, I am not Joe.
2] Sorry, I believe you have understood absolutely nothing.
Your suggestions read as literature, completely beside the point.
Please post code in your own style, with the same functionality and specific results, using time as the reference.
Also clean your glasses, to see what you have not seen.
You can play the professor, with technical Shakespearean style and hidden literature, with others, but not with me.
Trace all the threads and you will probably understand how far off you are...

Bustaf

Bustaf,

Sorry about calling you Joe (joex26 started this thread).

I will add additional comments to the reformatted version of _your_ code

//omp_set_nested (c);
   // ^^ your comment to turn off nesting is OK
   // vv pragma to begin parallel region
#pragma omp parallel  shared(a,p) private(j,k,x)
{
   // ^^ scoping brace for parallel region
   // -- all threads in the team are running through this region
   // vv each thread executes the following for loop
 for (int i = 0; i <= c - 1; i++)
 {
   // -- each thread arrives here with i=0,1,2,...,c-1
   // -- and arrives here at uncontrollable times
  // WRONG ADDED JUST FOR VERIFICATION CROSS AND KMP_AFFINITY 0
  // if((i % 2) ==0){p=0;}else{p=1;}cpu_set_t mask;CPU_ZERO(&mask);CPU_SET(p,&mask);
  // if (sched_setaffinity(0, sizeof(cpu_set_t), &mask) <0) {perror("sched_setaffinity");}

  if (p > d)                //FLAG HALT AS SATISFACT POINT
  {
   i = c - 1;
  }
   // vv pragma to divide the current team up into sections
  #pragma omp sections nowait
  {
   // ^^ brace to begin scope of sections
   // (and implicit first section)
   // -- first thread of team reaching sections executes this section
   // *** CAUTION ***
   // *** this sections has nowait AND is enclosed in a loop,
   // *** making it possible that a thread may enter this sections
   // *** on iteration n+1 _prior_ to other team members entering
   // *** the sections on iteration n. This violates "all threads must pass
   // *** through sections" (although eventually they will).
   // *** The specification is silent on whether an implementation
   // *** can work around this.
   // vv due to the lack of statements between sections { and pragma omp section,
   // vv the following section is the 1st section of the sections
   // vv and therefore redundant
   #pragma omp section
   // vv no brace (ok), therefore the following for statement is section 1
   for (j = 0; j <= la - 1; j++)
   {
    x = 0;
    if (a[j] == b[i][0])
    {
     for (k = j; k <= j + lc[i] - 1; k++)
     {
      if (a[k] == b[i][x] && x <= lc[i] - 1)
      {
       x++;
      }
      if (x == lc[i])
      {
       noc[i]++;
       pos[noc[i]][i] = k - x;
       p++;
      } // if (x == lc[i])
     } // for (k = j; k <= j + lc[i] - 1; k++)
    } // if (a[j] == b[i][0])
   } // for (j = 0; j <= la - 1; j++)
   // end #pragma omp section
  } // end #pragma omp sections nowait
 } // for (int i = 0; i <= c - 1; i++)
} // end #pragma omp parallel
// begin serial code
for (int i = 0; i <= c - 1; i++)

Comments

The sections with nowait inside a loop within a parallel region
(and without a barrier) is operating under unspecified rules.

If you were to remove nowait (or add a barrier) then only one
thread would perform productive work (as explained earlier).
This would be equivalent to merging the sections and section into a single
structured block.

If you want each thread to enter the for(j loop then you must resolve
the possibility that multiple threads may concurrently execute
"noc[i]++" with the same value of i, in which case the result
is indeterminate. A similar (but not quite the same) issue exists with
"pos[noc[i]][i] = k - x" when multiple threads execute the statement
at the same time with the same value of i, in which case you would
be using an indeterminate value in noc[i] as a subscript to store
a value of "k-x" which may also differ between threads.

Unless you want gibberish in noc and pos, the above code (as you wrote it)
is senseless.

Jim Dempsey

www.quickthreadprogramming.com

Quoting - jimdempseyatthecove


Instead of trying to drown the fish in nonsense literature,
please post code in your own style, with the same functionality and concrete results,
using timing as the reference. That would settle this difference with reality, not literature.

-rwxr-xr-x 1 daemon staff 82094 Nov 7 06:35 ficat5 ICC COMPILED
real 0m0.202s (slow: the system must read 82094 bytes, about 4x the GNU size)
user 0m0.004s
sys 0m0.000s

-rwxr-xr-x 1 daemon staff 21410 Nov 7 06:37 ficat5 GNU COMPILED
real 0m0.097s
user 0m0.004s
sys 0m0.004s

Sorry, I have other tasks and little time to spend with you.

Bustaf

I forgot...
Just compile exactly what I wrote, without the -parallel flag (so vectorization does not cross in), and read your screen:

ipo: remark #11001: performing single-file optimizations
ipo: remark #11005: generating object file /tmp/ipo_icpcKZn3QH.o
ficat5.cc(165): (col. 16) remark: OpenMP DEFINED SECTION WAS PARALLELIZED.
ficat5.cc(165): (col. 16) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
ficat5.cc(46): (col. 1) remark: OpenMP DEFINED SECTION WAS PARALLELIZED.
ficat5.cc(34): (col. 1) remark: OpenMP DEFINED REGION WAS PARALLELIZED.

You can also add -par-report=3 if you want:

ipo: remark #11001: performing single-file optimizations
ipo: remark #11005: generating object file /tmp/ipo_icpcEbE1oi.o
ficat5.cc(165): (col. 16) remark: OpenMP DEFINED SECTION WAS PARALLELIZED.
ficat5.cc(165): (col. 16) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
procedure: main
ficat5.cc(46): (col. 1) remark: OpenMP DEFINED SECTION WAS PARALLELIZED.
ficat5.cc(34): (col. 1) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
procedure: count_array_occurs
procedure: __sti__$E

So maybe the compiler used ghost threads and wrote fictitious diagnostics just to make you happy?
Or, as this shows, is the compiler the dummy? The catastrophic result is not a dummy.

One more thing...

As the OpenMP documentation says (https://computing.llnl.gov/tutorials/openMP/#Introduction):

OMP_IN_PARALLEL
Purpose:

In C/C++, it returns a non-zero integer if called from inside a parallel region, and zero otherwise.

Declare an integer ipa, and after this part of the code:

pos[noc[i]][i] = k - x;
p++;

add:

ipa = omp_in_parallel();
std::cout << " <-IPA-> " << ipa << std::endl;

Run the program and see whether it prints 0 or 1.
Also move this check to just before #pragma omp sections nowait,
run the program again, and see whether it prints 0 or 1.

I think the problem is that you are deliberately playing dumb: you understood that I want to assign each array element
to a specific new machine thread. I wrote "chunks" meaning parts of the loop, in my language, not one individual thread.
I am not stupid enough to take an approach that can only decrease performance, the barrier side included.
I do not see the point of aligning arrays to the machine thread count when the same size is largely sufficient.
Using a barrier for the same task with all separate threads just extends each sem_wait until the semaphore answers 0,
for nothing.

You play your literature against my handicapped control of this language,
but I think every serious programmer has understood what I meant,
and several have probably already used omp_in_parallel() to check that what I said is true.

Bustaf,

Make these small changes to your code and run it. You may then begin to understand what is happening.

#pragma omp parallel           
{
  int iTeamMember = omp_get_thread_num();
  int nTeamMembers = omp_get_num_threads();
  printf("parallel iTeamMember = %d, nTeamMembers = %d\n",iTeamMember,nTeamMembers);
  ...
  for (int i = 0; i <= c - 1; i++)
  {
    printf("for(i= with i=%d, iTeamMember = %d\n",i,iTeamMember);
    ...
    printf("prior to sections iTeamMember = %d\n",iTeamMember);
    #pragma omp sections nowait           
    {
      #pragma omp section
      { // ** add brace to include print in section
        printf("begin section iTeamMember = %d\n",iTeamMember);
        for (j = 0; j <= la - 1; j++)   
        {
          ...
        } // for (j = 0; j <= la - 1; j++)   
        printf("end section iTeamMember = %d\n",iTeamMember);
      } // ** add brace to close section
    } // end #pragma omp sections nowait   
    printf("following sections iTeamMember = %d\n",iTeamMember);
  } // for (int i = 0; i <= c - 1; i++)   
} // end #pragma omp parallel   

Jim Dempsey

www.quickthreadprogramming.com

Quoting - jimdempseyatthecove


Jim,
I see exactly what I expected: a perfect alternation of 0 and 1.
Even on an i7 965 you will see that omp_in_parallel() always returns 1, and never 0, with this function.
There is no need to add extra threads for nothing. If you think a barrier is better, write
code in your own style with the same functionality and concrete results,
using timing as the reference.

I think that is what users are waiting for, not the fog that you use.
Bustaf

Bustaf,

Your complaint is the icc compiler has poor performance with this code as compared with gcc and when using OpenMP.

My efforts here are to show you that you have an error (bug) in your coding and a false expectation of what your code will do.

Your code, as written, and run with OpenMP enabled, will run the parallel region using all threads, however your sections only has one section and that section will run on one thread. All threads but one will be performing no productive work except for incrementing and testing the i in the for(i= loop that is outside the sections. The code in parallel will have lower performance than this code in serial (one thread). You also have a flow-control error with respect to the loop holding the sections with nowait, which permits a sections entry and exit sequencing problem.

Your complaint seems to be reduced to:

icc runs code with a bug in it slower than gcc runs code with a bug in it.

I suggest you fix the code first, then compare the performance second.

Jim Dempsey

www.quickthreadprogramming.com

Quoting - jimdempseyatthecove


Jim,
I have never had the slightest intention of calling the value of the ICC compiler into question.
Just read part of my other exchanges and you will understand; that impression is probably mistaken.
I have observed that some users report lower performance with ICC.
I used this tool only to see whether a problem exists, and I
added a real function (not dummy literature that hides the solution)
in order to understand.
Personally I do not use tools like OpenMP (as written at the top).
I write all my threading manually with a library, and I have no problem with ICC,
GNU, or other kinds of compilers.

I too am seriously looking for where the problem comes from, since the time is worse with ICC (with OpenMP used in this function),
but without your explanation that there is a bug, I believe the problem is not in the function as given.
I note that you seem unable to write a simple barrier-mode version of the function,
since you have not answered despite being asked several times.

You must prove, with a concrete function, that what you write is better.

Observe the reality instead of inventing a new dummy bug to hide the one that exists.
Bustaf

Bustaf,

From: http://www.openmp.org/mp-documents/spec30.pdf

>>>
2.5.2 sections Construct

summary

The sections construct is a noniterative worksharing construct that contains a set of
structured blocks that are to be distributed among and executed by the threads in a team.
Each structured block is executed once by one of the threads in the team in the context
of its implicit task.
<<<

Read the last sentence above: "Each structured block is executed once by one of the threads in the team in the context of its implicit task."

In your code, as written, the "context of the implicit task" is the context of the "#pragma omp parallel" region.
Within that context you have a for loop
within that for loop you have a "#pragma omp sections"
within the sections there is but one section

Literal translation of sentence "Each structured block is executed once by one of the threads in the team in the context of its implicit task." as it applies to your code is:

One of the threads of the thread team of the "#pragma omp parallel" region will execute the inner most section on the first iteration of the immediate encompassing for loop surrounding the sections but within the parallel region. As to which thread of the team executes the section this is dependent on implementation.

On the second iteration of the immediate encompassing for loop surrounding the sections but within the parallel region, all threads, even the thread which executed the section on the first iteration, will be inhibited from entering the section (because one thread of the team has executed the structured block once in the context of its implicit task).

This is not to say that some implementation, not literally following the standard, would not permit the same thread of the team to re-enter a section that it had previously entered (as a result of a subsequent iteration of the immediately encompassing for loop surrounding the sections but within the parallel region). But to do so would mean this compiler is non-compliant with the standard as written.

As your code is currently written, a compliant compiler would produce the following behaviour:

The "#pragma omp parallel" establishes a thread team of n threads
The enclosed for(i= loop is executed by all threads of the team
The "#pragma omp sections" is entered by all threads of the team on all iterations of the for(i= loop
On the first iteration of the for(i= loop one thread, and only one thread of the team, will execute the single section that exists within the sections.
The "#pragma omp sections" is exited by all threads of the team, but none of which are blocked (due to nowait).
The threads of the team that are not the thread currently executing the one section will likely begin the next iteration of the for(i= loop during the execution of the section by whatever thread was chosen to execute that section.
On the second and subsequent iterations of the for(i= loop, all threads, even the thread that executed the section the first time, will be inhibited from executing that section (since it has already been executed once within the context of its implicit task).
The for(i= loop expires
The "#pragma omp parallel" exits terminating the thread team (but not necessarily the thread pool).

Other than the section being executed once, all the threads are performing a lot of unproductive work avoiding the section.

Jim Dempsey

www.quickthreadprogramming.com

Quoting - jimdempseyatthecove


Jim,
Please correct the function I wrote, in whatever way you think best, so that ICC
gives the same or better results than G++.
That is all I am waiting for...

The times with the function I wrote are:

GNU COMPILER nps=100
real 0m0.017s
user 0m0.004s
sys 0m0.000s

GNU COMPILER nps=1000
real 0m0.036s
user 0m0.000s
sys 0m0.000s

ICC COMPILER nps=100
real 0m0.202s
user 0m0.000s
sys 0m0.004s

ICC COMPILER nps=1000
real 0m0.202s
user 0m0.012s
sys 0m0.004s

Bustaf

Quoting - bustaf


When timing this example, only one thread out of the team is performing useful work; the remaining threads cause interference, unnecessary overhead, and intrusion upon other applications that may be running on the system. Therefore the measurement you have obtained is more an indication of the adverse effect of ineffective programming than of the time it takes to perform the actual work. This is one problem with your statistics and your expectations of performance measurements.

A second problem is you are measuring the wall clock time (real) and user and sys times of an application that runs for a very short time period. This is not to say that this measurement is not important because you may have a requirement to run a very short lived program many times. What this does mean is your focus on the problem is in the wrong place. That is you are measuring the time to establish and disband the OpenMP threading environment rather than timing the work done within that environment.

If I were to venture a guess at the root cause of the discrepancy you are seeing, it is the time it takes to shut down the parallel runtime at the end of the program, which is governed by what is called the "blocking time" policy. This is the time a thread stays active waiting for the application to enter the next parallel region (and your application does not have a next parallel region). Some implementations set this block time (run time spent actively looking for entry to the next parallel region) to 0ms, while other implementations set it on the order of 200ms. Is that 200ms on the order of the timing discrepancy in your tests?

The blocking-time policy can be altered through environment variables. Which variable and which settings apply depend on the OpenMP version and vendor: some use KMP_BLOCKTIME (e.g. KMP_BLOCKTIME=200) while others use OMP_WAIT_POLICY.

Your program still has the problem of starting a parallel thread pool only to perform useful work by one thread and by that one thread on only one of its iterations. IOW you have an ineffective program.

Jim Dempsey

www.quickthreadprogramming.com

Quoting - jimdempseyatthecove

Quoting - bustaf
Jim,

Jim,
Rectify the function that i have wrote , same you think better for that ICC
give same or better result that G++
Is only that i wait...

Time with function that i have write is:

GNU COMPILER nps=100
real 0m0.017s
user 0m0.004s
sys 0m0.000s

GNU COMPILER nps=1000
real 0m0.036s
user 0m0.000s
sys 0m0.000s

ICC COMPILER nps=100
real 0m0.202s
user 0m0.000s
sys 0m0.004s

ICC COMPILER nps=1000
real 0m0.202s
user 0m0.012s
sys 0m0.004s

Bustaf

When timing this example only one thread out of the thread team is performing useful work, the remainder of the threads are causing interference, unnecessary overhead, and intrusion upon other applications that may be running on the system. Therefore the measurement you have obtained is more of an indication of the adverse effect of ineffective programming more than the time it takes to perform the actual work. This is one problem with your statistics and expectations of performance measurements.

A second problem is that you are measuring the wall clock time (real) and the user and sys times of an application that runs for a very short period. This is not to say that this measurement is unimportant, because you may have a requirement to run a very short-lived program many times. What it does mean is that your focus is in the wrong place: you are measuring the time to establish and disband the OpenMP threading environment rather than timing the work done within that environment.

[...]

Jim Dempsey

Sorry, Jim
Your answer is only fog to hide what you cannot solve.
I think all real programmers have understood.
I want to end this subject; discussing with you is like arguing with the wind, just spectacle as a result.
Also, I am very busy with other, more important tasks.
Bustaf

Bustaf,
Jim is correct that you are seeing the effect of the default KMP_BLOCKTIME setting, which you could change yourself at the end of your application if it is important to terminate quicker. Since, as you point out, the discussion you started by abruptly hijacking the thread is a waste of time, perhaps next time you might consider confining your remarks to the subject at hand, or starting your own thread with a stated subject.
The original poster may have been looking for information, not to have the thread hijacked, and it would be common courtesy to act as if that were so.

Quoting - tim18
Bustaf,
Jim is correct that you are showing the effect of the default KMP_BLOCKTIME setting, which you could change yourself at the end of your application if it's important to terminate quicker. As you point out that the discussion you have started by hijacking the thread abruptly is a waste of time, perhaps next time you might consider confining your remarks to the subject at hand, or starting your own thread with a stated subject.
The original poster may have been looking for information, not to have the thread hijacked, and it would be common courtesy to act as if that were so.

Hi Jim, Tim
Fortunately, Tim has added something new.

Yes, it is true: the results are now roughly similar with KMP_BLOCKTIME set to 10.
I had tested with what I thought was the correct flag before giving my results.
Sorry Jim, since I am very busy I did not take the time to read everything carefully (translating takes me a long time, with my bad English).
I am very happy that the problem is solved.
Congratulations, Jim.
But I still do not agree with you about the barrier, and about some other points of your analysis.
When I have time I will run other, more complex tests to check this more thoroughly with these specific tools.
Kind regards

TIMES ADDED

Best ICC over several repeats, nps = 100:
real 0m0.016s
user 0m0.008s
sys 0m0.004s

Best ICC over several repeats, nps = 1000:
real 0m0.050s
user 0m0.008s
sys 0m0.004s

Best GNU over several repeats, nps = 100:
real 0m0.016s
user 0m0.000s
sys 0m0.004s

Best GNU over several repeats, nps = 1000:
real 0m0.049s
user 0m0.012s
sys 0m0.004s

Kind regards
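"Best over several repeats" can be collected mechanically; a sketch in shell, where `sleep 0.05` stands in for the actual benchmark binary (`./ficat7 1000 2` in this thread):

```shell
# Best-of-five wall-clock timing, as in the tables above.
best=999999999
for i in 1 2 3 4 5; do
  s=$(date +%s%N)                 # nanoseconds (GNU date)
  sleep 0.05                      # stand-in for ./ficat7 1000 2
  e=$(date +%s%N)
  ms=$(( (e - s) / 1000000 ))
  [ "$ms" -lt "$best" ] && best=$ms
done
echo "best: ${best} ms"
```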

Quoting - tim18
Bustaf,
[...]

Hi all (it's still me...)
With several threads requested (8 in the sample, just for demonstration; not matched to the array and string sizes), and strangely with the default parameters, running this sample under both GNU and ICC shows that ICC works differently at loop level 3: affinity is not respected, and all threads report as thread 0 at that level.
Is an extra parameter missing, or is a specifically sized one required?
Buffalo (the GNU compiler) shows all the threads perfectly: aleatory, different threads aligned with the occurrences, etc.

Sorry joex26, the compilers need to be brought into agreement before I can answer.
Already:
About the -30%:
(Read the msys source to understand that including Buffalo Bill in your comparison has no reference value.) The Buffalo Bill (mingw, tmingw) compiler is an extraordinary product.

About what you wrote:
Auto-parallelizer actually makes things slower than linear (half slower to be exact).
I think it is because my functions are small but called very often.
Obviously OpenMP had similar performance to auto-parallerizer.

About "it is because my functions are small": I think not.
It is probably easier to use fork, or signals directly, for a hard task of that kind.

About the 4% difference:
Just show me where your compiler performs better, and I will answer according to your evaluation. Actually, here in my country, only the H1N1 vaccination campaign is showing real success...

About Auto:
Look at how much the system as a whole is improved, not only the time result.

Buffalo Bill (Tmingw), downloaded as a binary, also works correctly at level 3;
only sched.h is incomplete for driving affinity, and the DLL is absent, so it must be compiled with -static.

I am waiting for the magic parameter I am probably missing (and hope to receive it...).

I also installed the latest ICC on a newer standard Intel 2-core machine (a friend's, running a different distro, OpenSUSE), but the ICC thread problem is the same; the only change is that the Buffalo result for this sample is much better on this newer machine.

I have read this link, but without better results (affinity):
http://www.intel.com/software/products/compilers/docs/fmac/doc_files/sou...
Tim, it would be better if you reviewed what you wrote; "thread hijacked" is incorrect and vulgar.
Do not play this game with an old penguin, even if a few teeth are already missing...

Please, joex26 (as a friend, I hope), or any other user who wants to: can you tell me whether this sample also shows all threads as 0 with ICC on Windows, the same as I currently see only with ICC on Linux?
My opinion about ICC is unchanged; I still think it is a very, very good compiler.
I forgot...
It would be curious to see what this sample gives on a U100 (Atom 270, 1 core)... (with the Buffalo compiler, of course).

Kind regards
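For the affinity question, both runtimes expose environment variables; a hedged sketch (the exact values here are illustrative, taken from the two runtimes' documented syntax):

```shell
# Intel OpenMP runtime: bind threads to cores (Intel's KMP_AFFINITY syntax)
export KMP_AFFINITY="granularity=fine,compact"
# GNU libgomp equivalent: pin threads to logical CPUs 0-3
export GOMP_CPU_AFFINITY="0-3"
echo "KMP_AFFINITY=$KMP_AFFINITY"
echo "GOMP_CPU_AFFINITY=$GOMP_CPU_AFFINITY"
```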

SAMPLE START:
////////////////////////////////////////////////////////
//FOR BUFFALO COMPILER DEPEND #
///////////////////////////////////////////////////////
//g++ -ansi -O1 -Wno-write-strings -m64 -lpthread -fopenmp -mtune=core2 -march=core2 -fomit-frame-pointer ficat7.cc -o ficat7
//g++ -ansi -O3 -Wno-write-strings -ftree-vect-loop-version -m64 -lpthread -fthread-jumps -fopenmp -mtune=core2 -march=core2 -fomit-frame-pointer ficat7.cc -o ficat7
//g++ -ansi -O3 -mssse3 -Wno-write-strings -m64 -lpthread -fopenmp -mtune=core2 -march=core2 -fomit-frame-pointer ficat7.cc -o ficat7

////////////////////////////////////////////////////////
//FOR BUFFALO BILL COMPILER DEPEND # (DOWNLOAD tdm-mingw-1.908.0-4.4.1-2.exe)
///////////////////////////////////////////////////////
//g++ -O3 -Wno-write-strings -static -lgomp -lpthread -fopenmp ficat10.cc -o ficat10
//set omp_stacksize=6000K

//////////////////////////////////////////////////
// FOR INTEL COMPILER DEPEND #
//////////////////////////////////////////////////
//LANG=C; export LANG
//KMP_BLOCKTIME=0;export KMP_BLOCKTIME
//LD_LIBRARY_PATH=/opt/intel/Compiler/11.1/059/lib/intel64;export LD_LIBRARY_PATH
//./icpc -ansi -fast -static -O2 -axSSSE3 -lpthread -openmp -openmp-lib=compat -march=core2 -mtune=core2 ficat7.cc -o ficat7
//FOR BINARY AS SHARED ICC
//./icpc -fPIC -ansi -shared-intel -axSSSE3 -Os -fast -openmp -openmp-lib=compat -openmp-report -march=core2 -mtune=core2 -fomit-frame-pointer ficat7.cc -o ficat7

#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <ctime>
#include <iostream>
#include <omp.h>
#include <pthread.h>
#include <unistd.h>
int
aleat (int u)
{
int r = rand ();
if (r < 0)
r = -r;
return 1 + r % u;
}
unsigned long long rdtsc() //RESERVED
{
  unsigned int lo, hi;
  __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
  return ((unsigned long long)hi << 32) | lo;
}
////////////////////////////////////////////////////////////////////////////////
//VOID COUNT_ARRAY_OCCURS FOR COUNT ARRAY OCCURRENCES ASYNCHRONY IN AN STRING//
////////////////////////////////////////////////////////////////////////////////
int
count_array_occurs (char *a, char **b, int c, int d,int e)
{
// A IS GLOBAL STRING WHERE MUST COUNTED OCCURRENCE
// B IS ARRAY OF WORDS OCCURRENCES AS MUST COUNTED IN A
// C IS SIZE CALLED OF ARRAY
// D IS (ARG 1) NUMBER PROBABILITY RELATION REQUIRED TO AN DEDUCTION
// E IS (ARG 2) TO SIZE NUMBER THREAD USED (DEPEND MACHINE INCREASE IF SEG FAULT)

// ficat10.exe 20 8 or ficat10.exe 120 80 ficat10.exe 50 25 etc....
// FOR RUNTIME OTHER MACHINE WITHOUT BUFFALO BILL REQUIRED DOWNLOAD (mingwm10.dll and pthreadgc2.dll)
// REPEAT SEVERAL TIMES WITH SAME ARG TO DISCOVER EFFECT ALEATORY WITH DEPEND HALT POINT

int la = strlen (a); //SIZE STRING RECEIVED

int lc[c]; //ARRAY SIZED TO C
int noc[c]; //ARRAY SIZED TO C
int pos[la][c]; //ARRAY SIZED TO 2 CHUNKS: SIZE STRING & NUMBER ELEMENT REQUIRED FINDING
int j;
int x;
int k;
int p = 0;
int af = 0;
int gtn=0;
int v=0;
int i=0;

char * ve[c]; //NEW ARRAY FOR RECEIVE ALEATORY (REQUIRED FOR RECEIVED IS STATIC ELEMENT MEMORY POINTED)
// JUSTFIED BY THE WARN WITHOUT (-Wno-write-strings)
omp_set_nested (0); //
omp_set_dynamic (0); //
//INCREASE num_threads IF SEGMENTATION FAULT (ENLARGE OFFERT POSSIBLE ACCORDED SYSTEM (SHOW CR))
omp_set_num_threads (e); // 2 AS DEFAULT
//NUM THREAD DEPEND FOR FREE THAT YOU NOT HAVE USE (IMPROVE FOR ALL SYSTEM)
//kmp_set_blocktime (0); //FOR LEADED //UNKOWN FOR BUFFALO
#pragma omp parallel for // FOR ARRAY ALEATORY
for (v = 0; v <= c-1 ; v++) //LOOP TO INITIALIZE ARRAY MIXED SALAD (ALEATORY)
{
ve[v] = b[v]; // index by the loop variable; the shared counter i raced between threads
lc[v] = strlen (ve[v]);
noc[v] = 0;
}
#pragma omp parallel shared(a,p) private(j,k,x)
{
for (int i = 0; i <= c-1 ; i++) //LOOP FIRST LEVEL
{
if (p >=d ) //FLAG HALT POINT if (p > d || gtn ==0 ) TO VERIFY ICC ALWAY REAL THREAD 0
{
i = c -1;
}

//WROTE COMMENT NOWAIT FOR QUANTITY THREADS IN THE WIND BUT INCREASE NPS TO 2000 OR ADD AN COMPLEMENT FLAG LESS LOOP SIMULATE

#pragma omp sections //nowait // MAKE TEST WITH AND WITHOUT NOWAIT
{
#pragma omp section
for (j = 0; j <= la ; j++) //SECONDARY LEVEL
{
//gtn = omp_get_thread_num (); //HERE THIS LEVEL INTEL ICC && BUFFALO SHOW THREAD
//std::cout << " " << gtn; // UNCOMENT TO SHOW ALL THREADS OK OR 0
x = 0;
if (a[j] == ve[i][0])
{
for (k = j; k <= j + lc[i]-1 ; k++) //THIRD LEVEL LOOP,HERE STRANGE WITH ICC [ 0 ] [ # 2 ]
{
if (a[k] == ve[i][x] && x <= lc[i]-1 ) // && FOR HALT BEFORE MORE OCCURENCES EXISTING
{
x++;
}
if (x == lc[i] && noc[i]<=d-1 && p<=d-1) // && FOR HALT BEFORE MORE OCCURENCES EXISTING
{
noc[i]++;
pos[noc[i]][i] = k+1 - x;
p++;
gtn = omp_get_thread_num (); //HERE INTEL SHOW ZERO THREAD, NOT A NAIL, BUFFALO SHOW NICE ALL THREAD ...
std::cout << " [ " << gtn << " ]"; //THREAD ACCORDED BY SYSTEM
//std::cout << " [ # " << omp_get_num_procs () << " ]";
// (REMOVED IN THIS SAMPLE) HERE BUFFALO GIVE QUANTITY 1 # AS CORRECT, BUT ICC GIVE 2 # FALSE , ALSO GHOST THREAD AS 0 ??
//(REMOVED IN THIS SAMPLE)AS GIVEN 1 # IS THREAD USING EACH PROCESSOR (CORE) ALTERNATE WITH BUFFALO.
// I ADD OTHER BETTER SAMPLE AFTER FOR LINUX SEVERAL # WHEN I HAVE TIMES
}

}

}

}

}

}
std::cout << "\n" << std::endl;
}
// std::cout << "\n" << std::endl;
for (int i = 0; i <= c-1 ; i++)
{
if (noc[i]!=0)
{
std::cout << "\n" << noc[i] << " <-OCC-> " << ve[i] << "\n" << std::endl;
int m = 1;
while (pos[m][i] != 0)
{
std::cout << pos[m][i] << " <-POS-> " << ve[i] << " <-TO-> " << pos[m][i] + strlen (ve[i]) << std::endl;
m++;
}
}
}
for (v = 0; v <= c-1 ; v++) //LOOP TO INITIALIZE ARRAY MIXED SALAD (ALEATORY)
{
ve[v] = NULL;
}
return (0);
}

int
main (int argc, char *argv[])
{
char testocc[68096]; // USE THIS SIZE ICC OR GREATER TO INCREASE RESERVED SPACE (no Segmentation fault with trace)
char * strtab [22] = { "any","forum", "Non", "system", "otherwise", "vulgar", "not", "state", "computer", "on", "the", "server", "to","e","Intel", "faults", "pre", "ch", "na", "is", ",", "." };
int nps = atoi(argv[1]);
int nt=atoi(argv[2]);

strcpy (testocc,"Personal Non-Commercial Use: This Web Site is for personal and non-commercial use. Unless otherwise specified or as provided in these Terms, you may not modify, copy, distribute, transmit, display, perform, reproduce, publish, license, create derivative works from, transfer, or sell any information, software, products or services obtained from the MaterialsnNo Unlawful or Prohibited Use: You agree that you will not use the Web Sites or Material for any purpose that is unlawful or prohibited by these Terms of Use. You may not:n 1. upload, post, email, transmit or otherwise make available any content that is unlawful, harmful, threatening, abusive, harassing, tortuous, defamatory, vulgar, obscene, libelous, invasive of another's privacy, hateful, or racially, ethnically or otherwise objectionable;n 2. use the Web Sites, Materials, Services or activities to "stalk" or otherwise harass or harm another;n 3. impersonate any person or entity, including, but not limited to, an Intel ofcial, forum leader, guide or host, or falsely state or otherwise misrepresent your affiliation with a person or entity or collect or store personal data about other users in connection with the prohibited conduct and activities;n 4. forge headers or otherwise manipulate identifiers in order to disguise the origin of any content transmitted through the Web Site or Materials;n 5. upload, post, email, transmit or otherwise make available any content that you do not have a right to make available under any law or under contractual or fiduciary relationships (such as inside information, proprietary and confidential information learned or disclosed as part of employment relationships or under nondisclosure agreements);n 6. upload, post, email, transmit or otherwise make available any content that infringes any patent, trademark, trade secret, copyright or other proprietary rights ("Rights") of any party;n 7. 
upload, post, email, transmit or otherwise make available any unsolicited or unauthorized advrtising, promotional materials, "junk mail," "spam," "chain letters," "pyramid schemes," or any other form of solicitation;n 8. upload, post, email, transmit or otherwise make available any material that contains software viruses or any other computer code, files or programs designed to interrupt, destroy or limit the functionality of any computer software or hardware or telecommunications equipment;n 9. you may not use the Web Site or Materials in any manner that could damage, disable, overburden, or impair any Intel server, or network(s) connections, disobey any requirements, procedures, policies or regulations of networks connected to the Web Site or Materials or interfere with any other party's use and enjoyment of the Web Sites or Materials;n 10. you may not attempt to gain unauthorized access to any Web Site or Material, other accounts, computer systems or networks connected to any Intel server or Materials, through hacking, password mining or any other means or obtain or attempt to obtinany materials or information through any means not intentionally made available through the Web Sites or Materials;n 11. intentionally or unintentionally violate any applicable local, state, national or international law, including, but not limited to, regulations promulgated by the U.S. Securities and Exchange Commission, any rules of any national or other securities exchange, including, without limitation, the New York Stock Exchange, the American Stock Exchange or the NASDAQ, and any regulations having the force of law; and/orn 12. 
provide material support or resources (or to conceal or disguise the nature, location, source, or ownership of material support or resources) to any organization(s) designated by the United States government as a foreign terrorist organization pursuant to section 219 of the Immigration and Nationality Act.nUS Government Restricted Rights: The Materials including Intel Software are provided with "RESTRICTED RIGHTS." Use, duplication, or disclosure by the Governmnt is sbject to restrictions as set forth in FAR52.227-14 and DFAR252.227-7013 et seq. or its successor. Use of the Materials by the Government constitutes acknowledgment of Intel 's proprietary rights in themn");

std::cout << "\nFunction count_array_occurs\n" << std::endl;
count_array_occurs (testocc, &strtab[0], 22, nps, nt);
std::cout << "\nIn sound coffe....: onSmokerOver isToFogLower()\n 2,4,8 or ++ cores, you have probably an using exact for your money, No ?.. " << std::endl;
std::cout << "\nOpenMp is very very nice product. I think, No ?.., when you have the default parameters accorded !!!" << std::endl;

}

Several fixes have been made for this performance issue, and they are included in the 15.0 release. Please give it a try and see how it goes.

Thanks,

Jennifer
