Graduate Intern at Intel - Parallel Mandelbrot

Three years ago, as I was finishing my undergraduate degree, I was hired as an intern at Intel to work with the Intel Software College. During those six months, I gained an appreciation for the usefulness of and the growing need for parallel programming. When I decided to go back to school for a graduate degree, I actively sought out the professors who focused on parallel computing, and I took the courses they offered. Since parallelism can be applied to pretty much all areas of computer science, I decided to narrow my focus to parallelism in 3D graphics, rendering, and visualization. I am currently finishing my master’s degree at Brigham Young University, and I am again working as an intern for the same group at Intel.

The first project my mentor handed me was a small demo, just to whet my appetite a bit. The project was a parallelized visualization of the Mandelbrot set written in DirectX, which we needed to convert to a more platform-independent technology. The original demo worked by drawing 2D dots in a 3D space, so to make the conversion as straightforward as possible, our first attempt was with OpenGL.

Converting the project to OpenGL was fairly simple, though we had to download, compile, and install freeglut on the Many-core Testing Lab (MTL). Installing freeglut wasn’t too difficult, and we soon had a working serial version. As soon as we tried to parallelize the project, however, we ran into trouble: an OpenGL rendering context is current on only one thread at a time (in our case, the master thread), so draw commands issued from the other threads are silently ignored. We were left with a program that drew only the portion of the picture computed by the master thread, and it took several hours of digging online to figure out that OpenGL was at the heart of the issue.
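
To make the failure concrete, here is roughly what the broken version looked like. This is a reconstruction, not the original demo code; the window setup, 512-row loop, and coordinates are illustrative.

    #include <GL/freeglut.h>
    #include <omp.h>

    void display()                       // registered with glutDisplayFunc()
    {
        glClear(GL_COLOR_BUFFER_BIT);
        glBegin(GL_POINTS);

        // The naive parallelization: let every OpenMP thread draw its own
        // rows. The OpenGL context is current only on the master thread,
        // so in practice only the master thread's glVertex2f calls produce
        // pixels; the others are dropped. (Issuing GL calls from threads
        // without a current context is undefined behavior to begin with.)
        #pragma omp parallel for
        for (int row = 0; row < 512; ++row) {
            // ... Mandelbrot iteration for the row elided ...
            glVertex2f(0.0f, row / 512.0f);
        }

        glEnd();
        glutSwapBuffers();
    }

    int main(int argc, char** argv)
    {
        glutInit(&argc, argv);
        glutInitDisplayMode(GLUT_DOUBLE | GLUT_RGB);
        glutCreateWindow("broken parallel draw");
        glutDisplayFunc(display);
        glutMainLoop();
        return 0;
    }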

Once we discovered the OpenGL drawing dilemma, my mentor hit upon a temporary solution: instead of having each thread draw as soon as it finished computing, he simply had each thread store the draw commands it generated. Then, once all the threads had finished computing their portion of the set, he had the master thread draw the set in the order it was computed. This solution had the effect of showing graphically how the threads divided the work among themselves, which was a key aspect of the demo; still, it bothered me that the rendering happened only after all the computation was done.
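
The buffering looked roughly like this. It’s a minimal sketch rather than the actual demo code; the Dot struct, the per-thread buffers, and the per-thread coloring are my reconstruction of the idea.

    #include <GL/freeglut.h>
    #include <omp.h>
    #include <vector>

    struct Dot { float x, y; int thread; };    // one buffered draw command

    std::vector<std::vector<Dot> > buffers;    // one buffer per thread

    void computeSet(int width, int height)
    {
        buffers.resize(omp_get_max_threads());

        #pragma omp parallel for schedule(dynamic)
        for (int row = 0; row < height; ++row) {
            int tid = omp_get_thread_num();
            for (int col = 0; col < width; ++col) {
                // ... Mandelbrot iteration for (col, row) elided ...
                // Coordinates mapped to GL clip space; no GL calls here.
                Dot d = { 2.0f * col / width  - 1.0f,
                          2.0f * row / height - 1.0f, tid };
                buffers[tid].push_back(d);
            }
        }
    }

    // Runs later, on the master thread that owns the OpenGL context.
    void drawBuffered()
    {
        glClear(GL_COLOR_BUFFER_BIT);
        glBegin(GL_POINTS);
        for (size_t t = 0; t < buffers.size(); ++t)
            for (size_t i = 0; i < buffers[t].size(); ++i) {
                // Color by computing thread to visualize the work division.
                glColor3f((buffers[t][i].thread % 8) / 8.0f, 0.5f, 1.0f);
                glVertex2f(buffers[t][i].x, buffers[t][i].y);
            }
        glEnd();
        glutSwapBuffers();
    }

    int main(int argc, char** argv)
    {
        glutInit(&argc, argv);
        glutInitDisplayMode(GLUT_DOUBLE | GLUT_RGB);
        glutCreateWindow("buffered parallel Mandelbrot");
        computeSet(512, 512);            // all threads compute, nobody draws
        glutDisplayFunc(drawBuffered);   // master thread draws afterward
        glutMainLoop();
        return 0;
    }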

After leaving the project alone for a while, I eventually remembered the OpenCV library. OpenCV stores its images as an array of bytes, and it’s more than happy to give you a pointer to that array so you can manipulate the image yourself using pointer arithmetic. OpenCV also provides a very simple interface for displaying an image. With OpenCV, each thread in a multi-threaded program can manipulate its own piece of the array without interfering or interacting with the other threads, so it seemed like a great solution. However, that meant I’d have to download, compile, and install it, too.
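
In case it’s useful, here’s a minimal sketch of that usage with a modern cv::Mat (the image size and colors are illustrative, not from the original demo). The key point is that once you have the raw pointer, threads can fill disjoint rows without making a single OpenCV call until display time.

    #include <opencv2/opencv.hpp>

    int main()
    {
        const int width = 800, height = 600;
        cv::Mat image(height, width, CV_8UC3);        // packed BGR bytes

        unsigned char* pixels = image.data;           // raw pixel buffer

        // Each row starts image.step bytes after the previous one, so a
        // thread that owns a row can write its bytes without touching
        // any other thread's data.
        for (int row = 0; row < height; ++row) {
            unsigned char* p = pixels + row * image.step;
            for (int col = 0; col < width; ++col) {
                p[3 * col + 0] = 255;                 // blue
                p[3 * col + 1] = 128;                 // green
                p[3 * col + 2] = 0;                   // red
            }
        }

        cv::imshow("canvas", image);                  // the simple display interface
        cv::waitKey(0);
        return 0;
    }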

Installing OpenCV on the Ubuntu clone of the Many-core Testing Lab proved quite simple; the same was not true for the original Red Hat machine, though we eventually succeeded in installing a working version of OpenCV there as well. I converted the Mandelbrot project to use an OpenCV image instead of rendering with OpenGL, and the project became much simpler and smaller.

My final step in the Mandelbrot project was to parallelize it. Using OpenMP was easy (a single #pragma on the outer for-loop, and voilà!). Parallelizing with Cilk was a bit trickier because of compatibility issues between our build of OpenCV and Cilk. We only hit these difficulties in the Linux environment, though: on Windows we didn’t have to compile the libraries ourselves, so we experienced no compatibility issues. I have not yet tested the project on the Windows MTL clone (so far I have used a six-core desktop machine).
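
Here is a self-contained sketch of what the OpenMP version boils down to, assuming the OpenCV image from above; the coordinate ranges and iteration count are illustrative. The Cilk version would swap the pragma’d for-loop for a cilk_for.

    #include <opencv2/opencv.hpp>

    int main()
    {
        const int width = 1024, height = 768, maxIter = 256;
        cv::Mat image(height, width, CV_8UC1);        // grayscale canvas

        // The single pragma that parallelizes the demo: every row is
        // independent, so the outer loop splits cleanly across threads.
        #pragma omp parallel for schedule(dynamic)
        for (int row = 0; row < height; ++row) {
            unsigned char* p = image.data + row * image.step;
            double ci = -1.2 + 2.4 * row / height;
            for (int col = 0; col < width; ++col) {
                double cr = -2.0 + 3.0 * col / width;
                double zr = 0.0, zi = 0.0;
                int iter = 0;
                while (iter < maxIter && zr * zr + zi * zi < 4.0) {
                    double tmp = zr * zr - zi * zi + cr;   // z = z*z + c
                    zi = 2.0 * zr * zi + ci;
                    zr = tmp;
                    ++iter;
                }
                p[col] = (unsigned char)(255 * iter / maxIter);
            }
        }

        cv::imshow("Mandelbrot (OpenMP)", image);
        cv::waitKey(0);
        return 0;
    }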

I did make one small change after parallelizing the project, simply as a matter of good design. I factored out all the code the three versions have in common (so I wouldn’t have to maintain three code files with a large section of identical code in each) and created three *.cpp files, each housing a main method with its own version of the for-loop: un-parallelized for the serial version, OpenMP, and Cilk. Hopefully any future changes will only need to touch those three small *.cpp files. After making this final change, I compared the project’s average runtimes using 1, 2, 4, 8, 16, and 32 threads. The speedup from one thread to two is sub-linear because of the overhead of creating and destroying the threads, even though they are created and destroyed only once; from that point on, the speedup is very nearly linear. In the Linux version I had to use OpenMP’s omp_get_wtime() function to measure the runtime, since ctime’s clock() function counts the CPU time of all the threads added together rather than elapsed wall-clock time.
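
The clock() pitfall is easy to demonstrate with a toy program (the busy-loop below is just a stand-in for the real computation): on Linux, with N busy threads, clock() reports roughly N times the elapsed time, while omp_get_wtime() reports the wall-clock time a user actually waits.

    #include <cstdio>
    #include <ctime>
    #include <omp.h>

    int main()
    {
        clock_t c0 = clock();
        double  w0 = omp_get_wtime();

        // Stand-in for the Mandelbrot computation: keep every thread busy.
        double sink = 0.0;
        #pragma omp parallel for reduction(+ : sink)
        for (long i = 0; i < 200000000L; ++i)
            sink += i * 1e-9;

        double wall = omp_get_wtime() - w0;
        double cpu  = (double)(clock() - c0) / CLOCKS_PER_SEC;

        // clock() accumulates CPU time across all the threads, so cpu
        // comes out roughly (thread count) times wall.
        std::printf("wall %.3f s, cpu %.3f s (checksum %.1f)\n",
                    wall, cpu, sink);
        return 0;
    }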

In the end, my main takeaway from this project was that OpenGL can be tricky to parallelize, and I didn’t even end up using OpenGL. I was, however, quite pleased to have a 32-core machine at my disposal. The supercomputer at the university is used quite frequently and runs a job queue, so if a professor has submitted a large job that takes hours to run, you’re out of luck. Intel’s MTL machine, by contrast, is always available and requires no special batch-file syntax or job submission; it’s just straight Linux (or Windows).
