Debugging a program with a large number of races detected


I wrote a C++ program in Visual Studio 2005 that worked fine in serial mode. When I converted it according to Cilk's programmer's guide, I removed all but a few global variables and carefully checked every use of the remaining ones to make sure that none of them is ever written to in the program. But the race detector still reported over 200 warnings or errors. I have the following specific questions regarding this problem:

1. How do I extract more information from the report generated by the race detector? All I got is a bunch of memory addresses that I apparently have no way to monitor in Visual C++ 2005. I heard that there's a 'data access breakpoint' that can do this but didn't find it.

2. I compared the output of the program with that of a serial version and found them to be identical. If the results are deterministic, doesn't that mean there is no race, or that any races are of the benign kind?

3. I can't debug the parallel version. I added the cilk_stub header to the source file, but building the project then produced errors saying something about "non-cilk routine cannot call cilk routine". They occur every time a function is called in the source file. What have I missed?

Any idea would be appreciated.


I do not know the answer, but I am also very interested in it. I have never really known how to read cilkscreen output. I am using Linux, and I thought it would be easier on Windows. As in the example in the documentation, if you see something like this on Windows:

Race condition on location 004367C8
write access at 004268D8: (c:\sum.cilk:8, sum.exe!f+0x1a)
read access at 004268CF: (c:\sum.cilk:8, sum.exe!f+0x11)
called by 004269B4: (c:\sum.cilk:14, sum.exe!cilk_main+0xd0)

It means a suspicious data race is happening in the program "sum.cilk": both the write and the read are on line 8, in a function called from line 14 (those are the numbers in the parentheses).

I am using the Linux version, so I do not get those numbers in the parentheses. Does your report give you any of that information?

I would suggest not printing anything in spawned code. That would be a race between different threads on the I/O.

Thanks, but what I'm getting does not contain any convenient line number information. There are actually only three varieties:
"unnamedImageEntryPoint+0x22b2b"
"_cilk_s::for_loop_closure::operator()+0x1f5"
and something about stack overflow. Again, there's no mention of any immediately readable location.

After testing my parallel version against a serial equivalent and finding the output to be identical, I believe that the race detector has stumbled upon trivial or irrelevant things that look like race bugs. Note that my serial program did have some global variables that had been inadvertently redefined and used locally in functions, which are genuine potential race bugs. After I rectified these mistakes and checked every single use of the few global variables in my entire program, I still got warnings from racescreen. As a matter of fact, I got 211 warnings instead of 200 after correcting all the mistakes.

I suspect that this is a minor problem with racescreen.

Also, do you by any chance know how to deal with the 'non-cilk routine cannot call cilk routine' error while compiling a serial debugging version of the parallel program?

Sorry, I have not run into this problem. I usually use the runtime option -cilk_set_worker_count=1 to debug.

I don't know of any bugs in the Cilkscreen Race Detector. If you don't mind sharing your code, we may be able to give more information. According to your description, it would seem as if there may be a conflict between iterations of a loop when calling something that invokes operator().
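To make the loop-iteration conflict concrete, here is a minimal sketch in plain C++, not taken from your code. The names `g_scratch`, `squares_shared_scratch` and `squares_local_scratch` are made up for illustration, and the ordinary `for` loops stand in for a parallel loop so the sketch compiles without Cilk. If the first loop were run in parallel, every iteration would read and write the same location, which is the kind of access pattern a race detector reports; the second version keeps the temporary private to each iteration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical shared temporary: every iteration writes it, so iterations
// running in parallel would all conflict on this one location.
static int g_scratch = 0;

// Would be racy as a parallel loop: each iteration reads and writes g_scratch.
std::vector<int> squares_shared_scratch(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {  // stand-in for a parallel loop
        g_scratch = in[i] + 1;
        out[i] = g_scratch * g_scratch;
    }
    return out;
}

// Restructured: the temporary is local to the loop body, and each iteration
// writes only its own element out[i].
std::vector<int> squares_local_scratch(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {  // stand-in for a parallel loop
        const int scratch = in[i] + 1;
        out[i] = scratch * scratch;
    }
    return out;
}
```

Run serially, both functions return the same values, so comparing against serial output would not tell the two apart.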

It is possible that the race is benign (in which case you can suppress the warning). However, getting the same output on several runs does not mean that you will never see different output. A race condition by definition may generate different output depending on the order in which the tasks are scheduled, and the schedule that produces a different result may be very unlikely (though not impossible).
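As a rough illustration of why matching output proves little, the following sketch replays two schedules of the same pair of increments by hand, in plain single-threaded C++. The `Strand` type and both function names are hypothetical and exist only for this example; nothing here uses the Cilk runtime.

```cpp
// Hypothetical model: each strand performs "counter = counter + 1" as a
// separate read step and write step, so the steps can be interleaved by hand.
struct Strand {
    int local = 0;
    void read(const int& shared) { local = shared; }
    void write(int& shared) const { shared = local + 1; }
};

// Schedule A: the first strand finishes before the second starts,
// which is the order a serial run (or a lucky parallel run) sees.
int run_serial_order() {
    int counter = 0;
    Strand a, b;
    a.read(counter); a.write(counter);
    b.read(counter); b.write(counter);
    return counter;
}

// Schedule B: both strands read before either writes, so one update is lost.
int run_interleaved_order() {
    int counter = 0;
    Strand a, b;
    a.read(counter); b.read(counter);
    a.write(counter); b.write(counter);
    return counter;
}
```

`run_serial_order()` returns 2 and `run_interleaved_order()` returns 1. A program with this race can match its serial output on every run in which the scheduler happens to pick the first order.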

You should be able to eliminate the "non cilk routine" error by commenting out the "extern cilk" lines.

steve
