Reducer performance regression on Windows laptop compared to Linux

The attached code is the eigenvalue sample.
I find something strange: when the program runs on my Windows laptop, the Cilk version is several times slower than the serial one, while on the Linux server the Cilk version runs twice as fast as the serial version.
I compared the environments as follows:

OS: Windows 7
Compiler: version 14.0.1
CPU: Intel(R) Core(TM) i5 M560 @ 2.67GHz x2
Time cost: Serial: 1.817 sec; cilk_for: 24.292 sec

OS: Red Hat Enterprise Linux 6.0
Compiler: version 14.0.1 (cannot find 13)
CPU: Intel(R) Core(TM) i7-2600K @ 3.40GHz x4
Time cost: Serial: 1.028 sec; cilk_for: 0.622 sec

Command line:
icc JCBMatrixEigen.cpp timer.cpp -O3
icl JCBMatrixEigen.cpp timer.cpp /O3

I profiled this code in VTune and found that 90% of the runtime is spent in the Cilk reducer's wait and spin time.

However, there is no problem when the code is compiled and run on the Linux machine.

Attachment: src_0.zip (5.64 KB)

Hi,

    Do you know how many worker threads you are using for each version?  I think those two machines have different numbers of cores, so it would be helpful to eliminate one of the variables.   If you explicitly set the CILK_NWORKERS environment variable to 2 (and 4), what numbers do you get from the two versions?
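
For reference, the worker count can be set from the shell before launching the binary; a quick sketch (the Windows executable name is taken from the runs below, and the Linux binary is the default a.out produced by the icc command above):

REM Windows (cmd.exe)
set CILK_NWORKERS=2
JCBMatrixEigen.exe

# Linux (bash)
export CILK_NWORKERS=4
./a.out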

If the Windows laptop has only 2 cores, but Cilk Plus is using 4 threads, and there are other programs running in the background on the Windows laptop, and turbo frequency scaling is kicking in, I could imagine the running time getting substantially worse. But the results do seem somewhat mysterious...
Thanks,

Jim

Follow Jim Sukha's suggestion to use CILK_NWORKERS=2 on the Core(TM) i7 machine. Your program may be sensitive to work partitioning and the number of threads.

Looking at your code, it has other issues. In particular, you are overusing the reducer (an unnecessary number of reductions). Consider:


	cilk::reducer_max_index<int, double> fm_reducer;
	fm_reducer.set_value(-1, 0.0);

	cilk_for (int i = 1; i <= n-1; i++)
	{
	  // Track the row's maximum locally; touch the reducer only once per i.
	  int j = 0;
	  double value = fabs(a[i*n + j]);
	  int index = i*n + j;
	  for (j = 1; j <= i-1; j++)
	  {
	    double value2 = fabs(a[i*n + j]);
	    if (value2 > value)
	    {
	      value = value2;
	      index = i*n + j;
	    }
	  } // for (j = 1; j <= i-1; j++)
	  fm_reducer.calc_max(index, value);
	} // cilk_for (int i = 1; i <= n-1; i++)

Jim Dempsey

 

www.quickthreadprogramming.com

I built and ran the test on my 16-core Nehalem system running Windows 7.  When prompted for the implementation, I chose "0" to run all the tests.

My first observation is that when running test 3 (Array Notation + Auto-vectorized), memory usage goes through the roof, to the point of making my system unresponsive.  Memory usage suddenly climbed to ~6GB and hung there, though it was hard to tell, since Windows Task Manager stopped updating until I aborted the process.  Since the system has 6GB of physical memory, clearly the system is being paged to death.

I'm troubled by the fact that inside JCB_EigenValue_an you have a call to malloc for a_abs with no matching free.  I added a call to free the allocated memory after the statement

double value = a_abs[index];

which appears to be the last use of a_abs.  Now the program completes without thrashing my system.  Memory usage tops out at about 1GB.
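
In context the change amounts to something like this (a rough sketch only; the surrounding code in JCB_EigenValue_an is abbreviated, and the names are taken from the post):

	double *a_abs = (double *)malloc(n * n * sizeof(double));
	/* ... fill a_abs and locate the index of the maximum element ... */
	double value = a_abs[index];
	free(a_abs);   /* added: a_abs is not used after this point */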

Anyway, here's my results:

==================== Test 1: SERIAL ====================
loop time:10001
Out of max loop, cannot fit the eps requirement.
Algorithm runs 2.311766 sec

========== Test 2: CILK_FOR + Auto-vectorized ==========
loop time:10001
Out of max loop, cannot fit the eps requirement.
Algorithm runs 1.126811 sec

======= Test 3: Array Notation + Auto-vectorized =======
loop time:10001
Out of max loop, cannot fit the eps requirement.
Algorithm runs 11.774041 sec

The message "Out of max loop, cannot fit the eps requirement" is telling me that your algorithm is not converging properly, but since all of the implementations are exiting with the same message, I didn't look into that.

I don't see why you're allocating/freeing memory inside your while loop.  You really don't want to time how quickly the system can allocate large blocks of memory, do you? Pulling the malloc/free of a_abs out of the loop dropped Test 3's time to 5.976137 sec.
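
Hoisting the allocation out of the convergence loop is the usual pattern; a minimal sketch, where converged and max_loop are hypothetical stand-ins for your actual loop condition:

	double *a_abs = (double *)malloc(n * n * sizeof(double));  /* allocate once, before the while loop */
	int loop = 0;
	while (!converged && loop < max_loop)
	{
	    /* ... recompute a_abs and apply the Jacobi rotation for this sweep ... */
	    loop++;
	}
	free(a_abs);  /* free once, after the loop */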

However, your question was about the performance of the cilk_for test, not the array notation version. I'm seeing a speedup on my Windows system (with 16 workers) of 2.05, which is certainly comparable to your speedup on RHEL6 of 1.65.  Without knowing the details of your test systems, I can't say for certain that this is good, but it's certainly better than the speedup you're reporting of 0.07. :o)

Given that you're looping over 500 elements in your cilk_for, you really aren't doing very much work to amortize the spawn/steal overhead.  You should probably review Jim's article on "Why is Cilk Plus not speeding up my program?"  The first section, "Parallel regions with insufficient work," covers exactly this issue.
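
One related knob is the grainsize pragma, which increases the amount of work each stolen chunk carries. A sketch (64 is just an illustrative value; the right number depends on how heavy the loop body is):

	#pragma cilk grainsize = 64
	cilk_for (int i = 1; i <= n-1; i++)
	{
	    /* ... loop body unchanged ... */
	}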

    - Barry

Attachment: jcb.h (7.58 KB)

I too tried this test and froze my laptop; I had to power off to recover.

I couldn't find a reference to the quoted model number; I too suspected that perhaps you have Hyper-Threading and should adjust CILK_NWORKERS to the number of cores.  I suppose more people are looking to peg their performance meter than to optimize performance, so (unlike the threaded MKL library) there's no point in the Cilk runtime trying to optimize the number of workers automatically.

On my laptop, your cilk_for version gave about a 60% speedup, without tinkering with CILK_NWORKERS and the like.  I wasn't ready to test much, given the need to undo the trap you set.

Thanks for your comments.

1. This problem does not happen on the Windows server machine (confirmed on a Haswell Windows box) or the Linux server machine, which do not have the same macro-architecture as the laptop even though they may share a similar micro-architecture name.

2. Even if I set CILK_NWORKERS=1 or =2, it does not take effect at runtime when I run the binary in CMD or in VTune. I think this is a defect in how the runtime picks up the environment setting, as I can also find an issue report about the environment variable not working.

C:\Users\qiaominq\Documents\src>set CILK_NWORKERS=1

C:\Users\qiaominq\Documents\src>JCBMatrixEigen.exe
The dimension of matrix is 500.
Please choose a specific implementation to be executed:
0) All test
1) SERIAL (Auto-vectorized)
2) CILK_FOR + Auto-vectorized
3) Array_Notation + Auto-vectorized
2
========== Test 2: CILK_FOR + Auto-vectorized ==========
loop time:10001
Out of max loop, cannot fit the eps requirement.
Algorithm runs 17.362321 sec

Press any key to continue . . .

C:\Users\qiaominq\Documents\src>set CILK_NWORKERS=2

C:\Users\qiaominq\Documents\src>JCBMatrixEigen.exe
The dimension of matrix is 500.
Please choose a specific implementation to be executed:
0) All test
1) SERIAL (Auto-vectorized)
2) CILK_FOR + Auto-vectorized
3) Array_Notation + Auto-vectorized
2
========== Test 2: CILK_FOR + Auto-vectorized ==========
loop time:10001
Out of max loop, cannot fit the eps requirement.
Algorithm runs 10.557831 sec

Press any key to continue . . .

Yes, it seems to be a problem on a laptop chip with Hyper-Threading; I wanted to confirm this point with 'set CILK_NWORKERS', which does not work, as you can see.

My colleague also confirmed this problem on his laptop. Could anyone set CILK_NWORKERS to the number of physical cores to reproduce and look into this performance regression?

Thanks,

Qiao 

More updates.

Even after I add the worker-count setting in code as below and compile with "icl JCBMatrixEigen.cpp timer.cpp -O2" or -O3,

#ifdef __INTEL_COMPILER
    /* __cilkrts_set_param requires #include <cilk/cilk_api.h> */
    if (0 != __cilkrts_set_param("nworkers", "2"))
    {
        printf("Failed to set worker count\n");
        return 1;
    }
    printf("2) CILK_FOR + Auto-vectorized\n");
    printf("3) Array_Notation + Auto-vectorized\n");
#endif // __INTEL_COMPILER

the performance degradation is still visible in CMD:

==================== Test 1: SERIAL ====================
loop time:10001
Out of max loop, cannot fit the eps requirement.
Algorithm runs 2.337160 sec

Press any key to continue . . .

C:\Users\qiaominq\Documents\src>JCBMatrixEigen.exe
The dimension of matrix is 500.
Please choose a specific implementation to be executed:
0) All test
1) SERIAL (Auto-vectorized)
2) CILK_FOR + Auto-vectorized
3) Array_Notation + Auto-vectorized
2
========== Test 2: CILK_FOR + Auto-vectorized ==========
loop time:10001
Out of max loop, cannot fit the eps requirement.
Algorithm runs 12.263659 sec

Press any key to continue . . .

My laptop has 2 physical cores, so setting the worker count to 2 should be OK.

Thanks for reproducing this problem on a laptop. It seems that limiting the workers to the number of physical cores, as I do in the code, still incurs a lot of spinning time accessing the reducer.

 

More updates.

After I added

printf("nworkers : %d \n", __cilkrts_get_nworkers());

I am told that 2 workers are created at runtime:

nworkers : 2
2) CILK_FOR + Auto-vectorized
3) Array_Notation + Auto-vectorized
2
========== Test 2: CILK_FOR + Auto-vectorized ==========
loop time:10001
Out of max loop, cannot fit the eps requirement.
Algorithm runs 9.007638 sec

After confirming that there really are 2 worker threads plus the main thread at runtime, the Cilk reducer is still the hotspot; its overhead and spin time account for 50% of the total CPU time. I have attached a screenshot of the VTune profile.

Please have a look at this degradation on the Windows laptop.

The most interesting thing is that even when I set nworkers to 1,

nworkers : 1
2) CILK_FOR + Auto-vectorized
3) Array_Notation + Auto-vectorized
2
========== Test 2: CILK_FOR + Auto-vectorized ==========
loop time:10001
Out of max loop, cannot fit the eps requirement.
Algorithm runs 15.122207 sec   //serial time is 1.9xxxxx sec

Press any key to continue . . .

VTune says the reducer is still the main hotspot.

 

Thanks, Qiao

 

Attachment: vtune2thread.PNG (143.12 KB)

If the time on 1 worker is significantly slower than the time executing serially, then the problem may be that a compiler optimization being applied to the serial version is not being applied to the cilk_for loop.  The use of hyperthreading may be unrelated to the problem at hand.

In particular, I would guess that for the slow version, the compiler is not hoisting the reducer lookup out of an inner for loop in your code, AND because it is not hoisting the lookup out, it is also not vectorizing the inner loop.   See items #5 and #6 from the article that Barry had linked to for some additional details.

Assuming that is the problem, there is still the question of why your Windows laptop is showing the problem but the Linux version isn't.   My first guess would be different header files for cilk/reducer.h or the max reducer, although if you are using the 14.0 compiler for both, that would seem a bit unlikely to me...   Do icc -V and icl /V give the same version information?

 

If you are looking to see whether the compiler behaves differently between Windows and Linux, or for missed vectorization, opt-report would be a place to start.  As the rest of us showed a gain from the posted cilk_for on Windows with the same compiler, that doesn't seem a likely answer.
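
For example (the exact flag spellings vary between compiler versions, so check icc -help / icl /help):

icc JCBMatrixEigen.cpp timer.cpp -O3 -opt-report -vec-report2
icl JCBMatrixEigen.cpp timer.cpp /O3 /Qopt-report /Qvec-report2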

Among the reasons why fashions apparently are shifting back from Cilk(tm) Plus to OpenMP is the new provision of #pragma omp parallel for simd for the case where the same loop is to be both vectorized and parallelized.

Apparently official advice has been given to write the strip mining explicitly in the source code in order to combine cilk_for with CEAN; for those cases where at least equally good results are obtained with OpenMP, it seems like beating one's head against a wall.  Cilk(tm) Plus doesn't seem designed for many of the situations where OpenMP is suitable, and we've spent much time discussing this without moving forward.

Thanks to Jim Dempsey and everyone else for the valuable advice.

It turns out the latest 14.0 SP1 on Windows has fixed this problem; with that version, both 2 and 4 threads outperform the serial run.

Jim Dempsey's code suggestion is technically right for reducing the number of reducer lookups; however, its performance is the same as my original code. Maybe the many branches caused by 'if(value2 > value)' are the problem.

Thanks again for all your input.

Regards,

Qiao

 

Here was the reasoning for my speculations:
The VTune data says that the top item for time spent is in the __cilkrts_hyper_lookup function, which is for reducer lookups.  
The data you had above suggests that the CILK_NWORKERS=1 time is much slower than the serial running time:
"Algorithm runs 15.122207 sec   //serial time is 1.9xxxxx sec"
This time difference shouldn't be attributable to runtime scheduling, because the Cilk Plus runtime does not create any additional threads or do anything interesting at all when CILK_NWORKERS=1.  Also, for CILK_NWORKERS=2, you said "Algorithm runs 9.007638 sec".   That is fairly close to the 2x speedup over the 15.1 second time, which is what we would expect.    Usually, in cases like this, the likely candidate for large performance differences between serial code and CILK_NWORKERS=1 is some compiler optimization which is not being applied to the cilk_for loop in the same way as in the serial code, or the reducer is interacting in an unexpected way. 

But if we've ruled out vectorization, then something else is going on, which is still a bit mysterious to me...
What happens if you replace the cilk_for with a normal for loop, but still use the reducer?   I noticed that the serial version of the code and the cilk_for version may have slightly different inner loop bodies.
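
In other words, keep the reducer but drop the parallelism; a sketch based on Jim Dempsey's snippet above:

	cilk::reducer_max_index<int, double> fm_reducer;
	fm_reducer.set_value(-1, 0.0);
	for (int i = 1; i <= n-1; i++)   /* plain for: no work stealing, same reducer calls */
	{
	    /* ... same body as the cilk_for version ... */
	}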

Cheers,

Jim

 

PS.  Of course OpenMP may be an option, if it works well for this situation.   But from my perspective, there is some unexplained behavior going on with a cilk_for loop, and it is useful for us in the Cilk Plus forum to track down what is going on.   Moreover, for this program, I have yet to see anything which is *fundamentally* different about Cilk Plus which makes it less suitable than OpenMP.   There might, however, be deficiencies in implementations which can cause performance differences.   I don't quite see why #pragma simd on a cilk_for loop shouldn't also be legal... 

The reason that I sometimes suggest doing ugly things like manually hoisting reducer lookups out of serial loops, etc., is that for reasons I don't fully understand, some versions of compilers are known to miss certain kinds of optimizations for cilk_for loops that should apply equally to both OpenMP and cilk_for loops.
 

Never mind, I see you've fixed your problem. :)
Cheers,

Jim

@Jim Sukha (Intel)

Thanks for your follow-up. After I changed the compiler from 13.0 to the latest 14.0 SP1, the performance is good now on both Windows and Linux. I also observed the cilk_for version performing worse than the serial one if '/Zi' is added to include debug symbols for further profiling in VTune; in that case you will observe a lot of reducer lookups in VTune.

As for @jimdempseyatthecove's suggestion, hoisting fm_reducer.calc_max(index, value); out of the inner for loop is actually beneficial, and it achieves slightly better performance than my original code.

And I think that if we could reduce the number of comparisons at 'if(value2 > value)', we would gain more performance.

 

Thanks,

Qiao

The compiler team knows about the missed opportunity to hoist the reducer view lookup. They have promised us that they'd fix it.

   - Barry
