Strange cilkplus behaviour with gcc compiler?

Hello,

I would appreciate any help anyone can give. I have been using the Cilk Plus support built into the gcc compiler (Red Hat 6.2.1-2) and have had some good results, but there is some strange behaviour I don't understand, and I could use some pointers on whether I'm doing something wrong. Here is a simple test:

#include <unistd.h>
#include <stdio.h>
#include <cilk/cilk.h>
#include <cilk/cilk_api.h>

int dummyfunc(){
	printf("This is an empty function ");
	return 0;	/* declared int, so return a value */
}

int main(){
	int p = 12289; int k=11;
	for(int j=0;j<4;j++){
		for(int i=k;i!=1;i*=k);
	}
	
	for(int i=0;i<4;i++) cilk_spawn dummyfunc();
	cilk_sync;
}

Then I compile like:

<machine>% gcc -fcilkplus -c test.c -o test.o
<machine>% g++ -fcilkplus -lcilkrts test.o -o test

And here are some results:

<machine>% setenv CILK_NWORKERS 1

<machine>% perf stat ./test

This is an empty function This is an empty function This is an empty function This is an empty function 
 Performance counter stats for './test':

      12922.468472      task-clock (msec)         #    0.998 CPUs utilized          
             1,308      context-switches          #    0.101 K/sec                  
                 1      cpu-migrations            #    0.000 K/sec                  
               321      page-faults               #    0.025 K/sec                  
    38,768,870,903      cycles                    #    3.000 GHz                    
    30,151,782,133      stalled-cycles-frontend   #   77.77% frontend cycles idle   
     9,083,979,840      stalled-cycles-backend    #   23.43% backend  cycles idle   
    21,524,149,285      instructions              #    0.56  insns per cycle        
                                                  #    1.40  stalled cycles per insn
     4,303,701,602      branches                  #  333.040 M/sec                  
            48,129      branch-misses             #    0.00% of all branches        

      12.943003310 seconds time elapsed

<machine>% setenv CILK_NWORKERS 8

<machine>% perf stat ./test

This is an empty function This is an empty function This is an empty function This is an empty function 
 Performance counter stats for './test':

     107029.355833      task-clock (msec)         #    7.984 CPUs utilized          
            10,882      context-switches          #    0.102 K/sec                  
                32      cpu-migrations            #    0.000 K/sec                  
               394      page-faults               #    0.004 K/sec                  
   308,709,974,102      cycles                    #    2.884 GHz                    
   108,096,162,060      stalled-cycles-frontend   #   35.02% frontend cycles idle   
    48,663,214,367      stalled-cycles-backend    #   15.76% backend  cycles idle   
   441,641,535,720      instructions              #    1.43  insns per cycle        
                                                  #    0.24  stalled cycles per insn
    90,218,344,192      branches                  #  842.931 M/sec                  
        51,095,827      branch-misses             #    0.06% of all branches        

      13.405537268 seconds time elapsed

<machine>% setenv CILK_NWORKERS 32

<machine>% perf stat ./test

This is an empty function This is an empty function This is an empty function This is an empty function 
 Performance counter stats for './test':

     392491.711496      task-clock (msec)         #   15.965 CPUs utilized          
       551,420,816      context-switches          #    1.405 M/sec                  
               367      cpu-migrations            #    0.001 K/sec                  
               546      page-faults               #    0.001 K/sec                  
 1,060,481,304,342      cycles                    #    2.702 GHz                    
   385,856,059,460      stalled-cycles-frontend   #   36.38% frontend cycles idle   
   232,571,157,589      stalled-cycles-backend    #   21.93% backend  cycles idle   
 1,404,659,473,232      instructions              #    1.32  insns per cycle        
                                                  #    0.27  stalled cycles per insn
   277,478,980,098      branches                  #  706.968 M/sec                  
       383,200,719      branch-misses             #    0.14% of all branches        

      24.583960051 seconds time elapsed

The machine has 16 hardware threads. The point is that the program uses as many CPUs as it can, up to the number of workers, even while it is executing code that has nothing to do with Cilk. It didn't seem like this was happening earlier, and I can't figure out what is going on.

Any help would be great.

Matthew

 

 


Without looking too deeply at the details, here are some general thoughts:

The compiler generates code to start up the Cilk runtime if it sees any cilk_spawn or cilk_for within the function. Regardless of how much spawning is going on, the runtime will spin up P threads, where P is the value of CILK_NWORKERS. These workers will each saturate a CPU, given the chance, looking for work to do. There is an exponential backoff, but I'm guessing that your program runs too quickly to notice that. Note, however, that if anything else is happening on the machine, the idle workers will yield to that other work. CPU utilization is thus deceptively high on an unloaded computer.

The tradeoff (which we may not have gotten quite right) is between finding and executing work as aggressively as possible and saving energy/keeping CPU utilization small when idle.
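To make the spin-then-back-off idea concrete, here is a minimal sketch of what an idle worker loop with exponential backoff can look like. It is not the Cilk runtime's actual code: the names `next_backoff_us`, `idle_until_work` and `work_after_n`, and the constants `SPIN_TRIES` and `MAX_SLEEP_US`, are all made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sched.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical policy constants, not taken from any real runtime. */
enum { SPIN_TRIES = 1000, MAX_SLEEP_US = 1000 };

/* Double the sleep interval, saturating at cap_us. */
static unsigned next_backoff_us(unsigned cur_us, unsigned cap_us) {
    assert(cap_us > 0);
    if (cur_us == 0) return 1;
    if (cur_us > cap_us / 2) return cap_us;
    return cur_us * 2;
}

/* Sleep for the given number of microseconds. */
static void sleep_us(unsigned us) {
    struct timespec ts = {
        .tv_sec = us / 1000000u,
        .tv_nsec = (long)(us % 1000000u) * 1000L
    };
    nanosleep(&ts, NULL);
}

/* Poll try_steal() until it reports work. The first SPIN_TRIES failures
 * only yield the CPU (the worker still looks 100% busy on an idle box);
 * after that the worker sleeps for exponentially growing intervals.
 * Returns the number of failed polls. */
static unsigned long idle_until_work(bool (*try_steal)(void *), void *ctx) {
    unsigned long failures = 0;
    unsigned backoff_us = 0;
    while (!try_steal(ctx)) {
        failures++;
        if (failures <= SPIN_TRIES) {
            sched_yield();
        } else {
            backoff_us = next_backoff_us(backoff_us, MAX_SLEEP_US);
            sleep_us(backoff_us);
        }
    }
    return failures;
}

/* Example work source: reports work once the counter reaches zero. */
static bool work_after_n(void *ctx) {
    unsigned *remaining = ctx;
    if (*remaining == 0) return true;
    (*remaining)--;
    return false;
}
```

The longer the spin phase and the lower the sleep cap, the faster a worker picks up new work, and the more CPU it burns while there is none. That is the tradeoff described above.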

Hello,
Thanks for the comment. I will try this on a few more compilers (clang cilk plus and cilk 5.4.6, not plus).

This test program runs for about 12 seconds, so the workers didn't seem to be spinning down. And the serial portion seems to take longer when these threads are looking for work to do?

In the original larger problem there is a bunch of serial work followed by parallel work. I time both of them, and noticed that the serial work takes a performance hit when running with more than one worker.
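For reference, one way to time a phase so that spinning workers show up is to record both wall-clock time and process-wide CPU time around it. The sketch below uses only POSIX `clock_gettime`. The names `phase_time`, `time_phase` and `serial_work` are hypothetical and are not the code from my larger program.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

/* Wall-clock and process CPU time spent in one phase, in seconds. */
typedef struct {
    double wall_s;
    double cpu_s;
} phase_time;

/* Difference b - a in seconds. */
static double ts_diff_s(struct timespec a, struct timespec b) {
    return (double)(b.tv_sec - a.tv_sec) + (double)(b.tv_nsec - a.tv_nsec) * 1e-9;
}

/* Run phase(arg) and measure it. CLOCK_PROCESS_CPUTIME_ID sums the CPU
 * time of every thread in the process, so busy-waiting worker threads
 * are included even if phase() itself is serial. */
static phase_time time_phase(void (*phase)(void *), void *arg) {
    struct timespec w0, w1, c0, c1;
    clock_gettime(CLOCK_MONOTONIC, &w0);
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &c0);
    phase(arg);
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &c1);
    clock_gettime(CLOCK_MONOTONIC, &w1);
    phase_time t = { ts_diff_s(w0, w1), ts_diff_s(c0, c1) };
    return t;
}

/* Stand-in for a serial phase: sums the first *arg integers. */
static void serial_work(void *arg) {
    unsigned n = *(unsigned *)arg;
    volatile unsigned sink = 0;  /* volatile keeps the loop from being optimized away */
    for (unsigned i = 0; i < n; i++) sink += i;
}
```

If `cpu_s / wall_s` comes out well above 1 for a phase that is supposed to be serial, other threads in the process were burning CPU during it. This is the same ratio that perf reports as "CPUs utilized".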

Also I'm wondering if CPU turbo boost could come into play.

 

With clang cilkplus the same thing happens, but I found moving the cilk keywords into a called function solves the problem:

void cilk_func(){
	for(int i=0;i<4;i++) cilk_spawn dummyfunc();
	cilk_sync;
}

int main(){
	int k=11;
	for(int j=0;j<4;j++)
		for(int i=k;i!=1;i*=k);
	
	cilk_func();
	printf("\n");
	return 0;
}

Moving the serial work into another function instead, and leaving the cilk keywords in main, doesn't seem to solve the problem.

Matthew

 

The latest open-source Cilk runtime alleviates this issue to some degree (not perfectly), so it is worth trying the mainline version of GCC.
