Reducer and Lock.

I have been trying to use a cilk_for loop in my Cilk++ code without data races. The first thing I tried was a reducer, because after reading the Cilk++ manual I had the impression that reducers perform better than locks.

The problem I have is that I cannot get rid of data races by reducer (code is given below). Maybe I am not using the reducer right?

$ cat Rng.3.cilk
#include <cilk.h>
#include <reducer_opadd.h>
#include <iostream>

cilk::reducer_opadd<int> iWorker;

void printThread()
{
    iWorker.set_value( cilk::current_worker_id() );
    std::cout << "Thread number is " << iWorker.get_value() << std::endl;
}

int cilk_main()
{
    cilk_for(int i = 0; i < 8; i++)
    {
        printThread();
    }
    return 0;
}

$ ./Rng.3.exe
Thread number is 0
Thread number is 0
Thread number is 3
Thread number is 3
Thread number is Thread number is 3
1
Thread number is 7
Thread number is 2

The Cilk++ code above with the reducer compiles fine, but it has data races, which Cilkscreen also detects. The code below with a lock, by contrast, has no data races, as verified by Cilkscreen.

$ cat Rng.5.cilk
#include <cilk.h>
#include <cilk_mutex.h>
#include <iostream>

void printThread()
{
    std::cout << "Thread number is " << cilk::current_worker_id() << std::endl;
}

int cilk_main()
{
    cilk::mutex mut;

    cilk_for(int i = 0; i < 8; i++)
    {
        mut.lock();
        printThread();
        mut.unlock();
    }
    return 0;
}

$ ./Rng.5.exe
Thread number is 0
Thread number is 0
Thread number is 1
Thread number is 1
Thread number is 7
Thread number is 2
Thread number is 1
Thread number is 4

Question 1: Is there any way to fix my Cilk++ code with the reducer to avoid data races? And how substantial is the performance difference between a reducer and a lock?

Question 2: I had the impression that cilk::current_worker_id() does not actually represent the thread number. Is there something similar to omp_get_thread_num() in OpenMP? I am eventually trying to use Cilk++ to do the same task as the OpenMP code below.

$ cat ranOpenMP.cpp
#include <omp.h>
#include "RngStream.h"
#include <iostream>
using std::cout; using std::endl;
using std::cin;

int main(){

    int nP = omp_get_num_procs();
    omp_set_num_threads(nP);

    unsigned long seed[6] = {1,2,3,4,5,6}; // initial seed

    RngStream::SetPackageSeed (seed); // initialize
    RngStream RngArray[nP]; // create an array of objects

    // create the same number of random numbers as the number of threads
    // using the method RandU01() in the class RngStream

    for(int i = 0; i < nP; i++)
    {
        cout << "For array element " << i << endl;
        cout << "The random number is " << RngArray[i].RandU01() << endl;
    }

    cout << "----------------------------------" << endl;

    int myRank;
    RngStream::SetPackageSeed (seed);
    RngStream RngArrayNew[nP];

    // use OpenMP to create a parallel version so each thread creates a random number

    #pragma omp parallel for private(myRank)
    for (int i = 0; i < nP; i++)
    {
        myRank = omp_get_thread_num(); // get the thread number
        #pragma omp critical
        {
            cout << "For thread " << myRank << endl;
            cout << "The random number is " << RngArrayNew[myRank].RandU01() << endl;
        }
    }

    return 0;
}

$ g++ -c -fopenmp RngStream.cpp
$ g++ -c -fopenmp ranOpenMP.cpp

$ g++ -fopenmp ranOpenMP.o RngStream.o -o rOMP
$ ./rOMP
For array element 0
The random number is 0.0010095
For array element 1
The random number is 0.701702
For array element 2
The random number is 0.476142
For array element 3
The random number is 0.0469012
For array element 4
The random number is 0.667972
For array element 5
The random number is 0.860096
For array element 6
The random number is 0.60096
For array element 7
The random number is 0.740441
----------------------------------
For thread 0
The random number is 0.0010095
For thread 3
The random number is 0.0469012
For thread 2
The random number is 0.476142
For thread 4
The random number is 0.667972
For thread 1
The random number is 0.701702
For thread 7
The random number is 0.740441
For thread 5
The random number is 0.860096
For thread 6
The random number is 0.60096

Thank you very much for your time and help.

Hailong


You are assuming that use of an ostream is thread safe. It's not, which is probably what Cilkscreen is reporting. Try using the ostream reducer instead. It will buffer your output until all of the output that logically comes before it has been written, and then write it out.

Some implementations of rand() are thread safe, but you can't predict which thread a Cilk++ strand will execute on. If you expect to be able to set the seed of a random number sequence and repeat it, you'll be disappointed.
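
A repeatable pattern does exist, though: key each random stream on the deterministic loop index rather than on the worker id. Here is a minimal sketch, assuming the same RngStream class used in the OpenMP code above; the streams array and the loop bound of 8 are illustrative:

#include <cilk.h>
#include "RngStream.h"

int cilk_main()
{
    unsigned long seed[6] = {1, 2, 3, 4, 5, 6};
    RngStream::SetPackageSeed(seed);
    RngStream streams[8];        // one independent stream per iteration

    double results[8];
    cilk_for (int i = 0; i < 8; i++)
    {
        // Iteration i touches only streams[i] and results[i], so there
        // is no race, and the value for iteration i is the same on
        // every run regardless of which worker executes it.
        results[i] = streams[i].RandU01();
    }
    return 0;
}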

You're using the intermediate values of a reducer. A properly written reducer guarantees that you'll have the same output at the end of a parallel region as you would if the code were run serially. There are no guarantees that the intermediate values will be the same, since the Cilk runtime may schedule the strands in the parallel region in any order. In this particular case, your use of the op_add reducer will be OK since you're basically using it as a temporary, but it's a bad habit to get into.
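
To make the contrast concrete, here is a minimal sketch of the intended end-of-region pattern, assuming the SDK's reducer_opadd.h: accumulate inside the loop and read the value only after the cilk_for completes:

#include <cilk.h>
#include <reducer_opadd.h>
#include <iostream>

int cilk_main()
{
    cilk::reducer_opadd<long> total;   // starts at the identity, 0

    cilk_for (int i = 0; i < 1000; i++)
    {
        total += i;   // each strand updates its own view; no race
    }

    // Read only after the parallel region; the result matches the
    // serial sum no matter how the strands were scheduled.
    std::cout << "sum = " << total.get_value() << std::endl;
    return 0;
}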

RE: reducer vs. lock performance - By using a lock, you're serializing your code. So you'll get no performance benefit from using Cilk++. Or any other threading package, for that matter. In fact, your performance will be even worse due to the overhead of contention for the lock.

Cilk++ workers are implemented using threads, but there's no guarantee that a worker will remain on a given thread. The worker number is just what the documentation states: a small integer assigned to the worker. You really shouldn't be depending on it or using it to index into arrays or the like. Reducers are a much better answer in most cases, since they are lock-free and usually remove false-sharing issues.

The true magic of reducers is that a properly written reducer gives you the same result at the end of the parallel region as running the region serially. Review the example in the documentation on the use of reducers to gather a list of letters in a cilk_for loop. The list ends up in order, a-z.
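
If you don't have the manual handy, that example looks roughly like this; a sketch assuming the SDK's reducer_list_append in reducer_list.h:

#include <cilk.h>
#include <reducer_list.h>
#include <iostream>
#include <list>

int cilk_main()
{
    cilk::reducer_list_append<char> letters;

    cilk_for (int i = 0; i < 26; i++)
    {
        letters.push_back((char)('a' + i));   // appended in logical order
    }

    // get_value() yields a-z in order, exactly as a serial run would.
    const std::list<char> &result = letters.get_value();
    for (std::list<char>::const_iterator it = result.begin();
         it != result.end(); ++it)
        std::cout << *it;
    std::cout << std::endl;
    return 0;
}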

This is a very powerful result. It means that you can verify that the code is working properly by running it once serially, and again in parallel, and simply diff the result. But it does take a bit of work to get your mind around.

- Barry

Thanks Barry. I think you are right about ostream. Your detailed explanations gave me a much better understanding of how Cilk++ threads work. I have also looked at some of the reducer examples in the documentation but have not found what you meant by 'using the ostream reducer instead'.

Below is some code I wrote that simply outputs the loop variable values in two different ways.

#include <cilk.h>
#include <iostream>
using namespace std;

void f(int j)
{
    cout << "i = " << j << endl;
}

int cilk_main()
{
    cilk_for(int i = 0; i < 8; i++)
    {
        f(i); // output i value by calling a function
        cout << "i = " << i << endl; // output i directly
    }
    return 0;
}

There exists a data race due to cout, I think. Could you please let me know how to use the ostream reducer? I am sorry that I keep throwing code at you. Any advice will be greatly appreciated.

Regards,
Hailong

If you look in the Cilk++ include directory, there should be a copy of reducer_ostream.h. You'd use it in place of your direct uses of cout. For example:

#include <cilk.h>
#include <reducer_ostream.h>
#include <iostream>
using namespace std;

// Initialize global cout_reducer to write to std::cout
cilk::reducer_ostream cout_reducer(std::cout);

void f(int j)
{
    cout_reducer << "i = " << j << endl;
}

int cilk_main()
{
    cilk_for(int i = 0; i < 8; i++)
    {
        // output i value by calling a function
        f(i);
        // output i directly
        cout_reducer << "i = " << i << endl;
    }
    return 0;
}

- Barry

Great, thanks Barry. It works! I ran the program over and over again and did not see data races any more. The result is consistently correct, as below.

$ ./Rng.6.exe
i = 0
i = 0
i = 1
i = 1
i = 2
i = 2
i = 3
i = 3
i = 4
i = 4
i = 5
i = 5
i = 6
i = 6
i = 7
i = 7

One weird thing is that when I use cilkscreen, it gives me some errors. I am not sure what the issue is, because apparently there are no races in the results above.

$ /usr/local/cilk/bin/cilkscreen ./Rng.6.exe
i = 0
i = 0

Race condition on location 0x7f1f84df3b38
write access at 0x7f1f84b75ef1: (/home/build/64/build/x86_64-unknown-linux-gnu/libstdc++-v3/include/bits/locale_facets.h:1166, _ZNKSt9basic_iosIcSt11char_traitsIcEE5widenEc+0x61)
read access at 0x7f1f84b75eb0: (/home/build/64/build/x86_64-unknown-linux-gnu/libstdc++-v3/include/bits/locale_facets.h:869, _ZNKSt9basic_iosIcSt11char_traitsIcEE5widenEc+0x20)
called by 0x401a2d: (__cilk_loop_c_001+0x5d)
called by 0x7f1f84e22631: (__cilk_spawn_001+0x51)
called by 0x7f1f84e22720: (_Z15cilkscreen_loopIPFQQvPvmmEmEQbT_S0_T0_+0xb0)
called by 0x7f1f84e22c6e: (_ZN4cilk13cilk_for_loopEQvPFQQvPvmmES0_mm+0x15e)
called by 0x40184d: (_Z9cilk_mainQiiPPc+0x4d)
called by 0x402463: (_ZN4cilk9main_wrapEQiPv+0x53)
called by 0x7f1f84e215d1: (_Z20cilk_run_wrapper_intQvPv+0x51)
called by 0x7f1f84e2370a: (__cilkrts_init_helper+0x5)

Race condition on location 0x7f1f84df3b59
write access at 0x7f1f8434fb6b: (memcpy+0x10b)
read access at 0x7f1f84b75f32: (/home/build/64/build/x86_64-unknown-linux-gnu/libstdc++-v3/include/bits/locale_facets.h:870, _ZNKSt9basic_iosIcSt11char_traitsIcEE5widenEc+0xa2)
called by 0x401a2d: (__cilk_loop_c_001+0x5d)
called by 0x7f1f84e22631: (__cilk_spawn_001+0x51)
called by 0x7f1f84e22720: (_Z15cilkscreen_loopIPFQQvPvmmEmEQbT_S0_T0_+0xb0)
called by 0x7f1f84e22c6e: (_ZN4cilk13cilk_for_loopEQvPFQQvPvmmES0_mm+0x15e)
called by 0x40184d: (_Z9cilk_mainQiiPPc+0x4d)
called by 0x402463: (_ZN4cilk9main_wrapEQiPv+0x53)
called by 0x7f1f84e215d1: (_Z20cilk_run_wrapper_intQvPv+0x51)
called by 0x7f1f84e2370a: (__cilkrts_init_helper+0x5)

Race condition on location 0x7f1f84df3b43
write access at 0x7f1f8434fb44: (memcpy+0xe4)
read access at 0x7f1f84b75f32: (/home/build/64/build/x86_64-unknown-linux-gnu/libstdc++-v3/include/bits/locale_facets.h:870, _ZNKSt9basic_iosIcSt11char_traitsIcEE5widenEc+0xa2)
called by 0x401a2d: (__cilk_loop_c_001+0x5d)
called by 0x7f1f84e22631: (__cilk_spawn_001+0x51)
called by 0x7f1f84e22720: (_Z15cilkscreen_loopIPFQQvPvmmEmEQbT_S0_T0_+0xb0)
called by 0x7f1f84e22c6e: (_ZN4cilk13cilk_for_loopEQvPFQQvPvmmES0_mm+0x15e)
called by 0x40184d: (_Z9cilk_mainQiiPPc+0x4d)
called by 0x402463: (_ZN4cilk9main_wrapEQiPv+0x53)
called by 0x7f1f84e215d1: (_Z20cilk_run_wrapper_intQvPv+0x51)
called by 0x7f1f84e2370a: (__cilkrts_init_helper+0x5)
i = 1
i = 1
i = 2
i = 2
i = 3
i = 3
i = 4
i = 4
i = 5
i = 5
i = 6
i = 6
i = 7
i = 7
3 errors found by Cilkscreen
Cilkscreen suppressed 39 duplicate error messages

Thank you very much for all the help! Hope you have a great afternoon.

-Hailong

Hailong,

The races you are seeing are caused by the fact that the C++ library's implementation of stream I/O does a lazy initialization of the locale on the first translation of an integer to the stream (note the locale_facets.h frames in your Cilkscreen report). It is a bug in the ostream reducer in that we don't take that into account. The race is real and must be prevented.

There are two simple work-arounds. If you don't mind a little extra output, all you need to do is put:

std::cout << "Performing " << 8 << " iterations\n";

(or something similar that does numerical output) before entering the cilk_for loop.

If this extra output is not acceptable, you can add the following lines before the first cilk_for loop (e.g., you can do this first thing in cilk_main()):

{
    // Translate one integer to a string stream; this forces the lazy
    // locale initialization to happen before any parallel output.
    std::ostringstream oss;
    oss << 0;
}

This forced initialization needs to be done only once for the entire program, regardless of how often you enter Cilk.
-Pablo
