Another week, another customer sharing how badly parallel programming has broken their test suites. So where are we going wrong?
First, it is natural to feel confused and disoriented when a failure "escapes" into the wild. In other words, when a bug is not caught by our test suites, it is natural to DEMAND a plan to correct this failure!
Second, it is common to fall back on a simple belief: an "escape" means we need MORE tests.
This is where parallelism trips us up.
From what I've seen, parallel programming can cause "escapes" for two reasons:
- The program has timing-dependent bugs in it (due to concurrency), so it does not behave the same from run to run. This non-deterministic behavior undermines the whole premise of a test suite.
- The program has new code, introduced by parallelism (concurrency), that needs test coverage.
Okay, number (2) is straightforward: this is when we need more tests. Covering all of the program's code with tests is just standard practice.
Number (1) on the other hand is a nasty one.
Unfortunately, this is where I see more and more "testing experts" going haywire.
Would you believe that one company's solution was to run their test suite more often, hoping to get lucky and catch a timing-dependent error on at least one of the runs? But what if it only fails one time in a hundred? Customers will hit that failure, but your test suite probably won't. And the company was upset at how much longer testing took!
I kid you not. Running a test suite over and over is something I've seen tried by multiple companies. When I hear this I ask, "Does that work for you?" And the answer is always "NO." I've been told longer versions, always "NO," and they come down to "NO" for two reasons: (a) timing dependencies can't be forced to show up when you run under test, and (b) if a run does fail, good luck reproducing the failure in a debugger to figure out how to fix what you caught.
I hate bugs that don't reproduce when you go to debug them... and that's generally what happens when a test suite catches a timing-dependent error for you.
The bottom line: despite investing in your application test suites for years, with multicore processors you are going to have very serious problems ("escapes") unless you use a direct method to find and eliminate timing-dependent errors (such as potential data races and deadlocks).
There is only one solution I know of that can root out timing-dependent failures in a real program (complete with libraries, static and dynamic) and make fixing them easy by isolating each issue to its source code. So I can only recommend one product, and it's an easy one to recommend because it has helped so many people: Intel® Parallel Studio for Windows* (or Intel® Thread Checker for Linux*). We've seen no other solution come close, and Intel Parallel Studio has been proving extremely effective!
No matter how great your test suite is, once multicore processors enter the mix you need Intel Parallel Studio's Inspector.
Threaded programs open the door to race conditions and deadlocks.
Intel Parallel Studio has proven able to find potential race conditions and deadlocks, show the precise cause in your source code, and thereby let developers eliminate these nasty problems.
Once eliminated, a tried and true test suite is as useful as it ever was.
Without doing this, you can no longer trust your test suites to protect you, or to assure you that you are ready to release to your customers.
For Linux developers, Intel Thread Checker does the key things needed. On Windows, it is Intel Parallel Studio, with the key tool in it being Intel Parallel Inspector.
Okay, this ended up reading a bit like an advertisement. My apologies. This product deserves that from me, and you don't have to buy a copy to see if you agree. Evaluation copies are free - and you'll find bugs in your application during the evaluation period.
Without it, your test suites aren't what they used to be.
And please - don't just run the tests over and over hoping they'll fail. It won't work, and you'll get headaches debugging the failures you do detect.