Private and shared variables with Cilk Plus

Hello,

I need to know how private and shared variables are handled in a cilk_for loop in Cilk Plus.
For example, take this serial code:

int z=0;
int a= 0;
int n=0;
....

for (i=0; i<n; i++) {
    a = a + x[i];
    y[z] = x[i];
    z++;
}

With OpenMP, the parallel code is:

#pragma omp parallel for private(i) \
                         shared(x, y, n) \
                         reduction(+: a, z)
for (i=0; i<n; i++) {
    a = a + x[i];
    y[z] = x[i];
    z++;
}

Now, is the following parallel code with Cilk Plus correct?

cilk::reducer_opadd<int> z;
cilk::reducer_opadd<int> a;

#pragma SIMD private(i) \
        shared(x, y, n)
cilk_for (int i=0; i<n; i++) {
    a = a + x[i];
    y[z] = x[i];
    z++;
}

Is this Cilk Plus parallel code correct?

Thank you for your help,


Hello Amina,
I don't think #pragma simd clauses are supported for cilk_for loops.

Thanks,

Balaji V. Iyer.

No, it's not correct.

As Balaji pointed out, you can't use #pragma SIMD with a cilk_for loop.

A private copy of the cilk_for loop index variable i is automatically created for you by the compiler. For other variables, the way to declare a private copy is to declare it inside the loop body, as in the sketch below. All other variables are shared.
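Here's a minimal sketch of that rule (t is a scratch variable I've introduced for illustration):

cilk_for (int i = 0; i < n; ++i) {
    int t = 2 * x[i];  // t is declared (and initialized) in the loop body,
                       // so each iteration gets its own private copy
    y[i] = t;          // x, y, and n are shared
}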

As you guessed, using a reducer is the Cilk solution to using shared variables concurrently in a parallel region. They allow you to have separate views of a variable which are reduced to the correct value at the end of the parallel calculation.

In your sample code fragment, using z to index into y will be a problem. The value of a reducer is only correct at the end of the parallel region. If you attempt to access the value of a reducer within the parallel region you'll get a partial result.

Consider if you've got two workers. Both will start with their view of z set to the identity value for addition, which is 0. Now both workers start incrementing their views of z. When the parallel region is complete, the two views will be added together, so accessing z after the parallel region will be correct. But using z to index into y within the parallel region results in a race.
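To make that concrete, here's a minimal sketch of the safe pattern, reading the reducer only after the loop:

cilk::reducer_opadd<int> z(0);
cilk_for (int i = 0; i < n; ++i) {
    z += 1;                     // fine: each worker updates its own view
    // y[z.get_value()] = ...;  // wrong: mid-loop this is only a partial result
}
int total = z.get_value();      // correct here: all views have been reduced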

- Barry

Here are some points where I think the analogy is difficult or confusing:
  • It's hard to see what role you intend for z when you declare it as a reduction. It looks more like a firstprivate/lastprivate variable in OpenMP (assuming you require a defined value after the loop) than like a reduction.
  • #pragma simd reduction is analogous to the OpenMP syntax, and it's lumped under the Cilk Plus umbrella, yet it's currently a vector reduction directive, not one for threaded parallelism. simd reduction may be useful where you set -fp:source to get standards compliance, yet wish to specify particular vector reductions. #pragma simd private already includes facilities resembling the vector analogue of OpenMP firstprivate/lastprivate.
  • Where you have a vectorizable reduction big enough to benefit from threaded as well as simd parallelism, it's often useful to explicitly program a threaded outer loop / vector inner loop organization, as in the sketch below.
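For instance, here's a hedged sketch of that organization: a threaded cilk_for outer loop over chunks, with a vectorized inner reduction (CHUNK and partial are names I've made up for illustration):

const int CHUNK = 1024;                    // made-up chunk size
cilk::reducer_opadd<float> sum(0.0f);
cilk_for (int i = 0; i < n; i += CHUNK) {  // threaded outer loop
    int end = (i + CHUNK < n) ? i + CHUNK : n;
    float partial = 0.0f;                  // private per strand
    #pragma simd reduction(+:partial)      // vectorized inner reduction
    for (int j = i; j < end; ++j)
        partial += x[j];
    sum += partial;                        // one reducer update per chunk
}
float total = sum.get_value();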

Thank you for this good explanation.

So, as you said, "the way to declare a private copy is to declare it inside the loop body", but my private variable should be initialized.

Is it correct to initialize it inside the loop?

All of the standard C/C++ initialization and scoping rules apply to variables declared in the loop body. So it's fine to initialize them inside the loop.

Perhaps it's easier to understand if I explain some of the magic. The compiler creates a lambda function from the loop body and passes its address, along with the iteration count and the context, to the Cilk runtime. The runtime recursively breaks the range into halves, spawning one half and calling the other, which allows other workers to steal portions of the loop. When the range is small enough, the runtime calls the loop-body lambda to iterate over the remaining range (also called a grain). Conceptually, it works something like the sketch below.
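This is an illustration of the strategy, not the actual runtime code:

void run_loop(void (*body)(int, int), int lo, int hi, int grain)
{
    if (hi - lo <= grain) {
        body(lo, hi);                           // small enough: run this chunk serially
        return;
    }
    int mid = lo + (hi - lo) / 2;
    cilk_spawn run_loop(body, lo, mid, grain);  // spawn one half (stealable)
    run_loop(body, mid, hi, grain);             // call the other half
    cilk_sync;
}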

- Barry

Hello Barry,

My problem is that I have many global variables, and they are initialized.
So I can't declare them inside the for loop (they are used by other functions).

Can I use a holder reducer to declare my global variables as private for each thread?

You can take a look at my program; I have to parallelize the first loop (prince2.cpp, the FPtree::DepthFGGrowth function).
This loop uses global variables such as gpprefix_itemset, gnprefix_len, gpitem_bitmap, ..., which should be declared as private variables.

Hi, I wanted to come back to your original code example, because I think it might help you in understanding some of the issues you might encounter in parallelizing your larger program. Suppose you wanted to parallelize the original loop:

for (i=0; i<n; i++) {
    a = a + x[i];
    y[z] = x[i];
    z++;
}

Here is one way to do so using Cilk:

cilk::reducer_opadd<int> a_reducer(a);
int z0 = z;
_Cilk_for (int i = 0; i < n; i++) {
    a_reducer += x[i];
    y[z0 + i] = x[i];
}
if (n >= 0) {
    z = z0 + n;
}
a = a_reducer.get_value();

In this case, each of the variables falls into a familiar programming pattern.

  1. The variable "a" is being updated with an associative operation "+" in each loop iteration. The parallel version above copies the value of "a" into a reducer variable "a_reducer" before the loop, updates the reducer variable within the loop, and then stores the value of "a" back after the loop. Alternatively, you might be able to eliminate the temporary and simply change "a" to a global reducer variable, depending on how "a" is used elsewhere in the code.
  2. The array "x" is only read in parallel, so the variable can be accessed as a normal global variable.
  3. The variable "z" in the original serial loop exhibits a loop-carried dependency, i.e., as written, each iteration in the loop depends on the value of z from the previous iteration. To parallelize the loop, however, you need to eliminate this dependency, so that you can execute loop iterations in parallel.

    In this case, since the original loop increments "z" once in each iteration, we can simply increment the initial value of z (stored in z0) by "i" to get the value of z in the ith iteration. Also, after the loop, we store the final value back into "z". We are able to eliminate the loop-carried dependence because it is easy to calculate the initial value of "z" for each iteration.

    Unfortunately, eliminating loop-carried dependencies may not always be easy, depending on the original program. If each iteration in the original serial loop incremented "z" not by 1, but by a random amount, then parallelizing the loop would be more difficult.

  4. The array "y" is being modified, but each of the iterations executing in parallel modifies a different place in the array, so there are no races, i.e., one strand is not trying to modify the same location another strand is trying to read or write in parallel. Thus, we can leave "y" as a normal global variable.

Holders are most often used as a form of "strand-local" storage, i.e., for variables that are conceptually local, but which are inconvenient or expensive to convert from globals to locals.
For example, suppose we had the following loop:

int q;
for (i=0; i<n; i++) {
    q = 3*i + 2;
    q++;
    y[q] = x[q];
}

In this example, we can convert "q" into a holder and the "for" into a cilk_for.
We can use a holder because the manipulations of "q" are completely local to a parallel strand of execution (in this example, each loop iteration). Since q's initial value is set at the beginning of each strand, there are no dependencies between strands because of q.
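A hedged sketch of that conversion (assuming the cilk::holder interface in <cilk/holder.h>, where operator() returns the current strand's view):

#include <cilk/holder.h>

cilk::holder<int> q_holder;   // per-strand storage replacing the global q
cilk_for (int i = 0; i < n; i++) {
    int &q = q_holder();      // this strand's private view of q
    q = 3*i + 2;              // initial value is set at the start of the strand
    q++;
    y[q] = x[q];
}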

Alternatively, we could have just declared "q" as a local variable inside the loop instead of using a holder. But if creating the object "q" was expensive, for example, we might want to use a holder instead of creating and destroying it in each loop iteration.

From a quick glance through your example code, it seems that the function you are trying to parallelize is sufficiently complicated that you will likely need to understand and explain how each of the global variables is being accessed and modified. If you can categorize how each global variable is used, it may then be easier to determine what changes you need to make, and whether each variable fits a particular usage pattern for a hyperobject (e.g., reducer, holder, a read-only global variable, etc.).

One thing I did notice --- for the global variables in the FPtree::DepthFGGrowth function, it seems as though DepthFGGrowth is a function that recursively calls itself? If that is true, you may need to be more careful, since global variables might be modified at multiple levels in the recursion. In this case, a holder may not be sufficient, and you might need to convert some of your global variables into local ones by making copies at appropriate points.

For example, "gnprefix_len" seems to be incremented near the beginning of a loop iteration, and then decremented again at the end of the iteration. You may need to pass the value of this variable into DepthFGGrowth as an extra argument. This would create copies of "gnprefix_len" and eliminate races on this global variable.

On a side note, "gnprefix_len" may fit the pattern of a "splitter hyperobject." Splitters are another kind of hyperobject proposed along with reducers and holders in the original hyperobject research:
http://dl.acm.org/citation.cfm?doid=1583991.1584017
As far as I know, however, splitters are not implemented in Cilk Plus or any of the existing dialects of Cilk, in part because it is difficult to implement a generic splitter object efficiently.

As Jim mentioned, you need to take things on a case-by-case basis. And a good understanding of your existing, serial code is essential.

We've approached this with other customers as an iterative process: adding parallelism using the Cilk keywords, and then running the Cilkscreen race detector over the resulting parallel implementation. One of the things that sets Cilkscreen apart is that it uses metadata created by the compiler to understand the parallelism embodied by your application. This allows it to detect races for any possible scheduling of the parallel code, not just the one that ran in a single instance.

Once you've identified a race, you then have to choose how to deal with it:

  • Some (few) races are benign. For example, if you're setting a global flag, it doesn't matter which strand does it. You can use Cilkscreen's "fake locks" to tell Cilkscreen to ignore a benign race.
  • Many races can be modified to use Cilk reducers. But it's important to remember that in general you should not be using the intermediate state of a reducer.
  • Some races are loop-carried dependencies, and Jim gave an excellent example of handling these.
  • Some races can only be handled using locks. The usual warnings about locks for any parallel programming technique apply here. One of the nicest things about reducers is that they guarantee you get the same result regardless of the schedule. For example, a list reducer will give you the exact same result when run on 1 worker or 8 (see the sketch after this list), which makes verifying the results easy. Unfortunately, that isn't true if you use locks.
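Here's a minimal sketch of that determinism, using the reducer_list_append wrapper from <cilk/reducer_list.h> (compute() is a placeholder function I've made up):

#include <cilk/reducer_list.h>

cilk::reducer_list_append<int> out;
cilk_for (int i = 0; i < n; ++i) {
    out.push_back(compute(i));  // each worker appends to its own view
}
// The views are concatenated in serial order, so this list is identical
// whether the loop ran on 1 worker or 8.
std::list<int> result = out.get_value();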

As Jim noted, your use of gnprefix_len and the arrays it indexes into may be a problem. You'll need to carefully examine what you're trying to accomplish with it. We looked at implementing splitters in Cilk++ and couldn't come up with an efficient implementation. However, a custom reducer may be able to help you; without a thorough understanding of your algorithm, I can't say.

- Barry
