Why is cilk_for slower than for?

Hello,

I'm trying to use cilk_for to parallelize the following function:

#include <cilk/cilk.h>
#include <cilk/cilk_api.h>
#include <cilk/reducer_opadd.h>
#include <cmath>

void ROGERSSNP(double *Geno, int Markers, int Individuals, double *RogersDistance, int *nCommonMarkers)
{
    double total = 0;
    int common;
    int const NA = 999;
    int position = 0;
    for (int line1 = 0; line1 < Individuals-1; ++line1)
    {
        for (int line2 = line1+1; line2 < Individuals; ++line2)
        {
            // declare reducers
            cilk::reducer_opadd<double> reducer_total(0.0);
            cilk::reducer_opadd<int> reducer_common(Markers);
            #pragma cilk grainsize = Markers / (4 * __cilkrts_get_nworkers())
            cilk_for (int marker = 0; marker != Markers; ++marker)
            {
                if ( (Geno[line1 * Markers + marker] != NA) &&
                     (Geno[line2 * Markers + marker] != NA) )
                {
                    reducer_total += fabs( Geno[line1 * Markers + marker] - Geno[line2 * Markers + marker] );
                }
                else
                {
                    --reducer_common;
                }
            }
            RogersDistance[position] = 0.5 * reducer_total.get_value() / ((double)reducer_common.get_value());
            nCommonMarkers[position] = reducer_common.get_value();
            ++position;
        }
    }
}

The execution time with cilk_for is 4 times slower than the serial code.

I tried to play with the grainsize but it does not make a big difference.
The processor is a 4-core i5 notebook.

How can this phenomenon be explained?

How can I accelerate this function?

Thank you for your help!


Do you know how big "Markers" and "Individuals" are, and how long on average each iteration of the inner loop takes to run?

I suspect that the main reason you are seeing no speedups is that your innermost loop does not have a lot of parallelism.  How large is Markers?   There is almost no work in the body of the inner loop, so Markers would need to be fairly large to see speedups.   Also, while we do try to optimize reducers, there is some overhead associated with them.

One thing to check is to also compare the timings of the following 3 implementations:

(a) Serial implementation --- drop all the Cilk keywords, and replace reducers with normal variables.
(b) Cilk implementation, run with environment variable CILK_NWORKERS=1
(c) Cilk implementation, run normally.

If there is a significant performance difference between (a) and (b), then it is possible that there are some compiler optimizations that the Cilk code is missing or unable to do, but the normal serial code is doing.   If the difference is mostly between (b) and (c), then the loop may not have enough parallelism and the scheduling overheads may be a factor.
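For concreteness, the three runs could look like this from a shell (the binary names rogers_serial and rogers_cilk are hypothetical; substitute your own build outputs):

```shell
# (a) build with all Cilk keywords removed, run normally
./rogers_serial

# (b) Cilk build, pinned to a single worker via the environment
CILK_NWORKERS=1 ./rogers_cilk

# (c) Cilk build with the default worker count (one per core)
./rogers_cilk
```

Comparing the wall-clock times of (a) vs (b) isolates per-call Cilk overhead; (b) vs (c) isolates actual parallel speedup.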

The "#pragma cilk grainsize = Markers / (4 * __cilkrts_get_nworkers())" should not be necessary and is likely to only hurt you.   If you ever find yourself with code where calling __cilkrts_get_nworkers() actually improves the performance of Cilk Plus code compared to not using it, please let us know.   In almost all the cases that I've seen, there is usually an equally good or better way to write the code without it that is cleaner / more efficient.

Is there any reason why you can't parallelize the outermost loop instead, or both of the outer loops?   It seems like the only dependence between loop iterations is on the value of "position"?

How big / small is Individuals?      If you can afford to store twice as many sums, you may be able to calculate a 2-d index for position based on "line1" and "line2".   (e.g., something like "position = line1 * Individuals + (line2 - line1 - 1)"?)  I haven't been careful in my calculation, so you will want to think about it more.

Alternatively, if the particular value of "position" you have now is important, you might also consider appending the sums to a list reducer, and copying the answer back into the RogersDistance array.   I wonder if there is a way to quickly calculate "position" from "line1" and "line2", but the math is not coming to me right now.

Actually, your loops look like an n-body kind of calculation, where you want to iterate over all pairs of (line1, line2) in parallel in a triangular region of the 2d grid?   (Except you are also storing output into an array?)    I can't find the link at the moment, but there are ways to traverse this kind of triangular region in parallel using a divide-and-conquer algorithm.

Hope that gives you a few ideas to consider.   Let me see if I can find the appropriate links and get back to you.
Cheers,

Jim

Two additional comments:

1.  After some digging, I remembered that the formulas for triangular numbers can help you in calculating position.   Suppose you have a triangular region with points (i, j), arranged as follows:

(0, 0)
(1, 0), (1, 1)
(2, 0), (2, 1), (2, 2)
(3, 0), (3, 1), (3, 2), (3, 3)
(4, 0), (4, 1), (4, 2), (4, 3), (4, 4)
...

To flatten this region into a 1d array, you can use the formula "position = i*(i+1)/2 + j".   That gives you the positions
0
1,   2
3,   4,   5
6,   7,   8,   9
10, 11, 12, 13, 14

Notice that the first position in each row is a triangular number, and the formula i*(i+1)/2 calculates triangular numbers.   The number "j" represents an offset in each row.

2.  Just FYI, I found the link to the blog post describing the algorithm I was thinking about earlier for traversing a triangular region, as shown above, in parallel for an n-body problem, without races.    But I think it may actually be more complicated than what you need.
The blog post describes code that guarantees that no (i, j) in the triangle runs in parallel with another (x, y) if i == x or j == y. 

http://software.intel.com/en-us/articles/a-cute-technique-for-avoiding-certain-race-conditions

If your code was correct with all pairs (i, j) in the triangle running in parallel, then you could use two cilk_for loops.

Jim

Thank you Jim for your explanations, here you will find the answers on your questions :

Do you know how big "Markers" and "Individuals" are, and how long on average each iteration of the inner loop takes to run?

-- Markers is in the range 50 000 - 10 000 000
-- Individuals is in the range 2 000 - 400 000
-- the CPU has 48 cores

I suspect that the main reason you are seeing no speedups is that your innermost loop does not have a lot of parallelism.  How large is Markers?   There is almost no work in the body of the inner loop, so Markers would need to be fairly large to see speedups.   Also, while we do try to optimize reducers, there is some overhead associated with them.

-- yes, the innermost loop is quite simple; the most time-consuming operation will be sqrt(x) in later versions

One thing to check is to also compare the timings of the following 3 implementations:

(a) Serial implementation --- drop all the Cilk keywords, and replace reducers with normal variables.
(b) Cilk implementation, run with environment variable CILK_NWORKERS=1
(c) Cilk implementation, run normally.

-- while doing a simple test (Markers = 2000, Individuals = 1000, 4-core CPU):
-- (a) is 4 times faster than (b)
-- (c) is 2 times faster than (b)

If there is a significant performance difference between (a) and (b), then it is possible that there are some compiler optimizations that the Cilk code is missing or unable to do, but the normal serial code is doing.   If the difference is mostly between (b) and (c), then the loop may not have enough parallelism and the scheduling overheads may be a factor.

-- perhaps that is it; I did not know that the compiler cannot do the same optimizations on sequential and Cilk code.

The "#pragma cilk grainsize = Markers / (4 * __cilkrts_get_nworkers())" should not be necessary and is likely to only hurt you.   If you ever find yourself with code where calling __cilkrts_get_nworkers() actually improves the performance of Cilk Plus code compared to not using it, please let us know.   In almost all the cases that I've seen, there is usually an equally good or better way to write the code without it that is cleaner / more efficient.

-- here everything is consistent; my modifications of the grainsize do not improve the computing time

Is there any reason why you can't parallelize the outermost loop instead, or both of the outer loops?   It seems like the only dependence between loop iterations is on the value of "position"?

-- the only reason was that I did not know how to parallelize the outermost loops because of the line1, line2 dependences. Thank you for explaining the method; I shall try to implement it.

Actually, your loops look like an n-body kind of calculation, where you want to iterate over all pairs of (line1, line2) in parallel in a triangular region of the 2d grid?   (Except you are also storing output into an array?)    I can't find the link at the moment, but there are ways to traverse this kind of triangular region in parallel using a divide-and-conquer algorithm.

-- yes, that's it! I compute a triangular region of the 2d grid; the full matrix is symmetric, so I represented this matrix as a vector of size n*(n-1)/2 in order to store it efficiently in RAM. The diagonal of the matrix is M(i,i) = 0 for i = 1..Individuals.
 

I made the modifications and it is still 3-4 times slower!

The modified code is here:

void ROGERSSNP(double *Geno, int Markers, int Individuals, double *RogersDistance, int *nCommonMarkers)
{
    // NA values
    int const NA = 999;
    // temp values
    double total;
    int common;
    // position transformation
    int n = Individuals*(Individuals-1)/2;  // compressed size
    int *pos_line1, *pos_line2;             // (i,j) of current position storage
    int position;                           // counter

    // precompute positions (i,j)
    pos_line1 = new int[n];
    pos_line2 = new int[n];
    position = 0;
    for (int line1 = 0; line1 != Individuals-1; ++line1)
        for (int line2 = line1+1; line2 != Individuals; ++line2)
        {
            pos_line1[position] = line1;
            pos_line2[position] = line2;
            ++position;
        }

    // zero the diagonal
    cilk_for (int i = 0; i != Individuals; ++i)
    {
        RogersDistance[i * Individuals + i] = 0.0;
        nCommonMarkers[i * Individuals + i] = 0;
    }

    // compute distance for each position
    cilk_for (int position = 0; position != n; ++position)
    {
        // declare reducers
        cilk::reducer_opadd<double> reducer_total(0.0);
        cilk::reducer_opadd<int> reducer_common(Markers);
        // loop over every marker
        cilk_for (int marker = 0; marker != Markers; ++marker)
        {
            // if the marker has an NA value, decrement the number of
            // common markers for this pair and skip the difference
            if ( ((int)Geno[pos_line1[position] * Markers + marker] != NA) &&
                 ((int)Geno[pos_line2[position] * Markers + marker] != NA) )
            {
                // both values present: accumulate the difference
                reducer_total += fabs( Geno[pos_line1[position] * Markers + marker] - Geno[pos_line2[position] * Markers + marker] );
            }
            else
            {
                --reducer_common;
            }
        }
        // if there are no common markers at all, we divide 0 by 0, which gives a NaN
        common = reducer_common.get_value();
        total = 0.5 * reducer_total.get_value() / ((double)common);
        RogersDistance[ pos_line1[position] * Individuals + pos_line2[position] ] = total;
        RogersDistance[ pos_line2[position] * Individuals + pos_line1[position] ] = total;
        nCommonMarkers[ pos_line1[position] * Individuals + pos_line2[position] ] = common;
        nCommonMarkers[ pos_line2[position] * Individuals + pos_line1[position] ] = common;
    }

    delete[] pos_line1;
    delete[] pos_line2;
}

Can you rig up a driver so we can run it here?  Without a working program we're only guessing.  Even if the data is just generated randomly, having something to work with would be helpful.

   - Barry

The first thing I would try is eliminating the inner cilk_for, changing the reducers back to normal variables, but keeping the outer cilk_for loop.    The outer loop seems to have quite a bit of parallelism already, and that might be enough?

Since you were seeing (a) as much faster than (b), I think there may be some optimizations that are not happening, plus some overhead for reducer accesses, etc.      What do the times look like when CILK_NWORKERS=2 and CILK_NWORKERS=4?

Another thing to check is whether the difference is vectorization.   It could be that in the serial version of your code, the innermost loop is vectorizing, but when you use cilk_for and reducers, it is not.    That could easily explain a 3-4x performance difference.   (Are you using icc or gcc?)  There is a vectorization report flag (I think something like "--vec-report=3"?) that you can use with icc, which will report whether loops are vectorizing.
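The check could look something like the following (file and binary names are placeholders; the exact flag spelling varies across icc versions, so consult your compiler's documentation):

```shell
# Ask icc for a per-loop vectorization report; on compilers of this
# era the flag is spelled -vec-report=3 (or -vec-report3). The report
# says which loops vectorized and why others did not.
icc -O2 -vec-report=3 your_program.cpp -o your_program
```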

Is it possible to include the complete sample test program you are using and describe how you are compiling the program?  [Is it just "main" that is missing?]    Also, which version of the compiler are you using?  (icc -V or gcc -v)    If the suggestion above doesn't quite resolve your problem, we might be able to dig a little bit deeper with a complete program.

Cheers,

Jim

Please see the code in the attachments.

On my PC, cilk_for is 5 times slower than the sequential version.

Thank you for your help and explanations!

Here are the files.

Attachments:

Attachment (size)
Download rogers.cpp (722 bytes)
Download rogerscilk.cpp (2.03 KB)
Download compare.cpp (2.03 KB)

I looked through the attached test code for a bit. I think I've spotted a few things that seem to be causing problems. In decreasing order of importance, here is what I've observed.

1. The loop in ROGERSSNP_Cilk2 looks like the following:

cilk_for(int position = 0; position < n; ++position)
{
    total = 0.0; common = *Markers;
    for (marker = 0; marker < *Markers; ++marker)
        if ( (Geno[pos_line1[position] * (*Markers) + marker] != 999) &&
             (Geno[pos_line2[position] * (*Markers) + marker] != 999) )
            total += fabs( Geno[pos_line1[position] * (*Markers) + marker] - Geno[pos_line2[position] * (*Markers) + marker] );
        else --common;
    RogersDistance[position] = 0.5 * total / common;
    nCommonMarkers[position] = common;
}

There is actually a race on several of the loop variables: total, common, and marker. Currently, they are declared outside the body of the cilk_for loop, which means parallel iterations are all trying to access the same variables. Even ignoring the fact that the loop computes the wrong answer, this sharing will really hurt performance and most likely explains the slowdowns you are seeing.

If you declare the loop variables in the body of the loop, things should improve significantly.

By the way, I would consider trying out the Cilkscreen race detector that is available on http://cilkplus.org. It is easy to make errors when experimenting and changing code back and forth, and the tool may help you catch errors like this more quickly. (You will probably need to scale the input size down a bit, though, when running through cilkscreen.)
http://www.cilkplus.org/download#block-views-cilk-tools-block-1

2. I noticed that you are using clock() for timing measurements. I have observed that implementations of this function vary as to whether they report wall-clock time or total processor time. When I tried running the test program on Windows, it reported wall-clock time, which is what you want to measure. When I tried on my Ubuntu Linux machine, however, the test program seemed to be reporting processor time. Thus, if the program runs for 5 seconds using 4 processors, it reports 20. I'm not sure which platform you are using, but it is something to keep in mind. As a sanity check to figure out which way your platform works, you might "time" the running of the program from the command line, or perhaps use gettimeofday() to read wall-clock time explicitly.

3. I did observe a 5x slowdown in ROGERSSNP_Cilk1 when running on 1 worker using cilk_for and reducers. The compiler seems to be doing a better job optimizing, and also manages to vectorize the innermost loop when it is serial. I would guess the presence of reducers is probably part of the reason.

Here is the final loop that I ended up with. You probably don't need to split the body of the cilk_for loop into a separate function, but it helped me catch all the variables that were being declared outside the loop body.

void ROGERS_serial_loop_body(double *Geno, int *Markers, int *Individuals,
                             double *RogersDistance, int *nCommonMarkers,
                             int line1, int line2, int position) {
    double total = 0.0;
    double common = *Markers;
    int marker;
    for (marker = 0; marker < *Markers; ++marker)
        if ( (Geno[line1 * (*Markers) + marker] != 999) && (Geno[line2 * (*Markers) + marker] != 999) )
            total += fabs( Geno[line1 * (*Markers) + marker] - Geno[line2 * (*Markers) + marker] );
        else --common;
    RogersDistance[position] = 0.5 * total / common;
    nCommonMarkers[position] = common;
}

void ROGERSSNP_Cilk3(double *Geno, int *Markers, int *Individuals, double *RogersDistance, int *nCommonMarkers)
{
    // precompute positions (i,j)
    int n = (*Individuals)*(*Individuals-1)/2;  // compressed size
    int *pos_line1 = new int[n];
    int *pos_line2 = new int[n];
    int pos = 0;
    for (int line1 = 0; line1 < *Individuals-1; ++line1)
        for (int line2 = line1+1; line2 < *Individuals; ++line2)
        {
            pos_line1[pos] = line1;
            pos_line2[pos] = line2;
            ++pos;
        }

    // compute distance for each position
    cilk_for (int position = 0; position < n; ++position)
    {
        ROGERS_serial_loop_body(Geno, Markers, Individuals,
                                RogersDistance, nCommonMarkers,
                                pos_line1[position], pos_line2[position],
                                position);
    }

    delete[] pos_line1;
    delete[] pos_line2;
}

Hope that helps. Cheers,

Jim

Thank you so much!
