The First Rule of Programming Club is "NEVER use a variable before it has been initialized." Case in point:
From time to time I'm asked, as part of my job, to optimize/parallelize some code. I had been profiling an open source application when I found out from the authors that they had a new beta version in the works. I asked if I might be allowed to try that one, since my work hadn't gotten too far on the release version. I received a URL to download the code and I started my attempt to build it.
Once I'd gotten past the problems I created for myself, I started to run the same test case I'd been using thinking I could compare the release version with the newer software to judge how much performance improvement had been found (and if they'd already implemented some of the ideas on code improvement I had formulated). There is a multi-step process to go through and after some preprocessing steps on the data, the computationally intensive parts of the execution (starting with Step 2.5) are reached. The first time I attempted a full run, Step 2.5 crashed quickly after getting started. After carefully reviewing the parameters I had used and comparing them to the instructions and example scripts, I found I’d messed up something in the early steps. Accordingly, I created a new data and results directory and started again.
This run-through Step 2.5 ran just fine. I was feeling good, but skeptical; thus, I tried it again.
In fact, no matter how different I tried things, all executions of Step 2.5 resulted in a crash. One thing did work and that was when I used the GNU compiler to build the software. However, I wanted to see how the Intel compiler (and Intel MPI) worked with the code. Plus, eventually I want to vectorize some of the computations and need the Intel compiler so that I have the latest vector code generation technology.
While I looked over my runs again, I wrote back to the authors with some of the selected data files that were generated in the previous and current application steps to see if they could better pinpoint what might have gone wrong (including user error on my part). After some back and forth, they identified the cause of the Step 2.5 crash in the second iteration as being wildly inflated data that was output from the first iteration (the processing typically runs 25 iterations within Step 2.5).
We started delving into the code and, using that oldest of debug tools (the print statement), we tracked the cause of these excessive values to a random “radius” value being generated and reset. The authors supplied new random number generation software in case their old Numerical Recipes-based code might have some of the algorithm’s defined constants overwritten by something else. I installed the new routines, recompiled, reran, and still got crashes. Both sets of the pseudo-random number generation routines worked just fine.
I looked at the point where the random number is set during the class initialization routine and realized that the called function is returning a float while the class member (radius) is a double. Could there be a problem with data size conversion leading to only half of the bits in the double radius being populated by the returned float? Changing the data type returned by the random number generation routine still resulted in crashes.
I tracked the use of the radius value forward and backward through the code trying to determine where it is set and where something might overwrite a portion of the bits in some accidental way. By Friday afternoon, after banging away at the problem for five days (stretched out longer than it might have been since the authors are on the other side of the Atlantic Ocean and I had other projects that I needed to be making progress on), I finally looked deeper into the organization of the class structure. Maybe, if I changed the order of things around, whatever was inadvertently trashing part of the radius value might trash something else? No soap.
I then looked beyond the class declaration to the constructor. It was empty. Not a problem since this just means a default constructor is used for all the elements of the class when it is instantiated. Along with the radius I’m tracking the class contains some std::vector elements, a couple more doubles, and a few other things. Above and beyond, even with the default constructor, there is an initialization method that sets values into the elements of the class.
(And now, dear reader, if you’ve stuck with my ramblings this far, your patience is about to be rewarded since I’ve now reached the point in my narrative where the whole purpose is revealed.)
When I reviewed the parts of the initialization method all looked well. The last line is a function call to reset the radius element with a random value. I once more looked at that function and then it hit me like a Monty Python 16-ton weight. The “reset” code is not replacing the contents of the radius, it is adding a new randomly generated value to the current value stored there. For this one part of the class, the “initialization” code is USING the value in the variable BEFORE initializing that variable to a known value. A violation of the First Rule of Programming Club. Thus, if the garbage polluting the bits of the allocated space of the radius element are interpreted as some immense value (along the order of 6.9e+253), then adding a small random number between zero and one won’t have much effect on the value used later in the computation.
Adding a single line to the constructor to set the radius to zero fixed the problem. The first call to the reset function will now simply store the first generated random value. Other calls to “reset” the radius will work as expected since the value is known to be constrained within an acceptable range.
I went through a lot of (unnecessary) exotic scenarios trying to track down a simple programmer mistake. If you’re confronted by a mysterious value or error, take the advice of Dr. Theodore Woodward, professor at the University of Maryland School of Medicine, who instructed his medical interns: "When you hear hoofbeats, think of horses not zebras". Programming Club members should first eliminate simple “horses” as the cause, regardless of how "they would never do that" this assumption might seem; after you've done that, then start looking for stripes. Also, be sure your printer/monitor/computer is plugged in before you call tech support.