My First Experience Programming… in Parallel!

If you had told me that in the course of one summer I would be able to transform a piece of serial code into parallel code, I probably would have stared at the sky waiting to see a flying pig. Though a slight exaggeration, my attitude towards parallelism as an undergraduate student (having only ever coded in serial) was nothing short of intimidated. Nonetheless, at the beginning of the summer, when I was presented with the opportunity to code up an Intel Threading Challenge problem in parallel, I took the challenge. At that point, my understanding of parallel programming consisted of a definition and a list of its benefits; actually creating parallelized code was a foggy mystery to me.

The specific programming challenge problem I was given was to create a program that detects all “small brain” numbers in a range set by the user. A small brain number is an n-digit number for which the sum of its digits, each raised to the power n, equals the original number. Take for example the number 1634:

1^4 + 6^4 + 3^4 + 4^4 = 1634.

Thus, 1634 qualifies as a small brain number.

If we test the value 436:

4^3 + 3^3 + 6^3 = 307

Since 436 ≠ 307, 436 is NOT a small brain number.
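The two worked examples above can be checked numerically with a few lines of code. This is my own quick sketch (the helper `pow_int` is hypothetical, not part of the challenge code); it uses integer multiplication rather than floating-point `pow` to keep the arithmetic exact:

```cpp
// Raise an integer base to a small non-negative integer power
// using repeated multiplication (exact, no floating point).
int pow_int(int base, int exp) {
    int p = 1;
    for (int k = 0; k < exp; ++k) p *= base;
    return p;
}

// The 1634 example: each digit raised to the 4th power.
int small_brain_sum_1634() {
    return pow_int(1, 4) + pow_int(6, 4) + pow_int(3, 4) + pow_int(4, 4);
}

// The 436 example: each digit raised to the 3rd power.
int small_brain_sum_436() {
    return pow_int(4, 3) + pow_int(3, 3) + pow_int(6, 3);
}
```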

Gravitating towards what I knew how to do, I began by coding the problem up in C++, in serial. Here is the basic structure of my code, organized by method.

numdigits:

  • To get the ball rolling, I started by creating a helper method, called numdigits, to count the number of digits in a number. This was accomplished by taking the floor of the base-10 logarithm of the number and adding one.
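The digit-count helper can be sketched in one line. This is my own minimal version of the idea described above; the zero guard is my addition, since log10 is undefined at 0:

```cpp
#include <cmath>

// Number of base-10 digits in n: floor(log10(n)) + 1 for n >= 1.
int numdigits(int n) {
    if (n <= 0) return 1;  // guard (my addition; not in the original code)
    return (int)std::log10((double)n) + 1;
}
```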



test_value:

  • Now that I had a method to calculate the exponent, the next step was to calculate the base for each component of the sum. This meant creating code to break the value into individual digits.


    • This was achieved by repeatedly dividing the n-digit value by 10 and storing each remainder in a vector. The end result was a vector containing each digit of the base-10 integer. To mark the end of the digits, I inserted an arbitrary two-digit value, 56, in the index after the last digit (a remainder of a division by 10 can never have two digits, so it can never be mistaken for one).




  • I now had a vector holding each individual digit of the n-digit value, and a variable storing n. At this point, I created a loop to calculate the sum of each digit raised to the power of n.


    • I carried this calculation out with a loop that kept a running sum of each entry (each digit) in the vector raised to the power of n. The loop stopped when it reached the two-digit sentinel value described in the previous step.




  • Now, to determine whether the value was a small brain number, I compared the sum calculated in the previous step against the original value.


    • If identical, I printed the value. If not, I repeated the steps above on the next value in the range until the upper limit was reached.
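The steps above can be sketched as a single standalone function. This is my own illustration of the approach described, not the post's exact code (the name `test_one_value` is made up); it splits the value into digits, appends the sentinel 56, then accumulates the powered digits until the sentinel is hit:

```cpp
#include <cmath>
#include <vector>

// Returns true if value is a small brain number, following the
// vector-plus-sentinel scheme described in the bullets above.
bool test_one_value(int value) {
    int num_digits = (int)std::log10((double)value) + 1;
    std::vector<int> digits;
    for (int n = value; n > 0; n /= 10)
        digits.push_back(n % 10);   // remainders of division by 10 are the digits
    digits.push_back(56);           // sentinel: a remainder can never be two digits
    int sum = 0;
    for (int pos = 0; digits[pos] != 56; ++pos)
        sum += (int)std::lround(std::pow((double)digits[pos], num_digits));
    return sum == value;            // small brain if the sum matches the original
}
```

The `lround` call rounds the floating-point power back to the nearest integer before summing, avoiding truncation surprises from `pow`.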



main:

  • Now that I had the heavy-lifting portion of the code completed, I filled in the main method with code asking the user for a lower and upper limit. A timer made it easy to compute execution times.


The above was the basic structure of my code and thought process as I coded the problem up in serial. When I ran my code, I was delighted to find that I was producing the correct results.

Below you will find my serial source code:


  

// Smallbrain_numbers.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"
#include <math.h>
#include <omp.h>
#include <iostream>
#include <vector>
#include <ctime>

using namespace std;

//Initialize variables
int min_value = 0;
int max_value = 0;
int i = 0;
int count = 0;
int lower;

//Method to return the number of digits in an integer
int numdigits(int n)
{
    return log10((double)n) + 1;
}

//Method to test each value in given range and print the integer if it is a small brain number
void test_value(int a, int b, int m)
{
    while (a < b)
    {
        int *calculations;
        calculations = (int*)malloc((m + 1) * sizeof(int));
        for (int temp = a; temp < a + 10; temp++)
        {
            int sum = 0;
            int pos = 0;
            int j = 0;
            int number = temp;
            if (number > b)
            {
                break;
            }
            int num_digits = numdigits(number);
            while (j < num_digits)
            {
                calculations[pos++] = (number % 10);
                number = number / 10;
                j = j + 1;
            }
            calculations[pos] = 56; //sentinel marking the end of the digits
            pos = 0;

            while (calculations[pos] != 56)
            {
                sum = sum + pow((double)calculations[pos], num_digits);
                ++pos;
            }
            if (sum == temp)
            {
                cout << temp << "\n";
            }
            a = a + 1;
        }
        free(calculations);
    }
}

int _tmain(int argc, _TCHAR* argv[])
{
    cout << "Please enter a lower range\n";
    cin >> min_value;
    cout << "Please enter a higher range\n";
    cin >> max_value;

    cout << "Smallbrain numbers between " << min_value << " and " << max_value << ":\n";
    double start = omp_get_wtime(); //clock();
    test_value(min_value, max_value, numdigits(max_value));
    double stop = omp_get_wtime(); //clock();
    double time = stop - start;
    cout << "The time taken to execute is: " << time << "\n";
}



At this point, I was faced with the task of converting this code from serial to parallel. Not sure where to start, I began by researching the broad topic of parallel programming. I realized the first decision I needed to make was which programming model to use. The options I was considering were Intel® Threading Building Blocks, OpenMP, and Intel® Cilk Plus. From my research, I decided on OpenMP, frankly, because the model seemed to click with me more than the others. From what I could gather, the changes this model demanded from my serial code were minimal, and since my serial code contained many loops, OpenMP's strength at parallelizing loops seemed like a good fit. The learning curve was quick once I realized that OpenMP consists of adding compiler directives, or pragma statements, rather than completely remodeling the serial code (which was the impression I had been under). I was starting to see the light at the end of the tunnel.

Noticing that the bulk of my code was in a large while loop, I deduced that tasks would be an excellent way to parallelize it. Thus, I followed the cookie-cutter format I observed in my research and translated my code to the parallel version you will find below, inserting pragma statements around the large while loop. The #pragma omp parallel directive marks the beginning of a parallel region, where the master thread creates a team of worker threads. The single construct indicates that only one thread will execute the rest of the code in test_value(), to avoid multiple threads altering the same data at the same time and creating potential errors.
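The pattern boils down to a few lines. Here is my own stripped-down sketch of it, using a made-up `square_range` function rather than the challenge code: one thread walks the range and spawns a task per value, firstprivate(a) gives each task its own copy of the loop variable, and the implicit barrier at the end of the parallel region guarantees all tasks have finished. If OpenMP is not enabled, the pragmas are ignored and the code simply runs serially with the same result.

```cpp
// Fills out[0..hi-lo-1] with the squares of lo..hi-1 using OpenMP tasks.
void square_range(int lo, int hi, int *out) {
    #pragma omp parallel            // the master thread creates a team of threads
    {
        #pragma omp single          // a single thread walks the range...
        {
            for (int a = lo; a < hi; ++a) {
                #pragma omp task firstprivate(a)  // ...spawning one task per value
                {
                    out[a - lo] = a * a;          // each task writes its own slot
                }
            }
        }
    }                               // implicit barrier: all tasks are done here
}
```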

As with all seemingly perfect code, I soon found out there was a bug somewhere in mine. When I ran my code, it produced no results at all. After a few attempts at rearranging, changing, and even renaming parts of my code, all futile, I was stumped. Then it was suggested that I try Intel® Parallel Inspector. Having only used a classic debugger to solve my serial programming problems in the past, I tried to avoid it. But having exhausted every option I could think of (some even ridiculous), I ran the Inspector threading analysis, looking specifically for deadlocks and data races. I was floored by how easy the report it produced was to read. It showed me exactly where the problem was stemming from, down to the line number. The culprit was the variable a, which holds the lower end of the range at the start of test_value() and supplies each value to test until the upper end is reached. The solution was to give each task its own copy of the variable a, via the firstprivate clause, to avoid deadlocks and data races.

I ran my code again and found I was still not getting the correct values. At this point, I reran the Inspector analysis, this time looking for memory problems. Once again, I was pointed to a section of my code where, sure enough, I found an error: I was trying to assign values to the calculations array after freeing it. The solution was as simple (once I found it, of course) as moving the call to free calculations outside the loop. I ran Inspector once more, checking for both data races and memory problems, and found no errors. As expected, my program finally executed with the correct results. I was thrilled!

But could I do better? Having warmed up to the Intel® Parallel Studio tools, I turned to Amplifier for an answer. The answer was yes. Once again, the report Amplifier produced was extremely useful, pointing me to the part of my code causing poor performance: the method test_value(). My goal became to reduce overhead by making the tasks coarser-grained. With some guidance, I was given the suggestion to change my approach from one task per value to one task per sub-range of values. The reasoning is that the cost of setting up a task should be small compared to the work that task completes; essentially, each task should deliver enough bang for the buck spent creating it. To make this change, I added a for-loop so that each task tests 10 values. I ran Amplifier again and was thrilled to find that performance improved greatly.
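Continuing the hypothetical `square_range` sketch from earlier, the chunking change can be illustrated like this (my illustration, not the challenge code): each task now covers ten values instead of one, so the task-creation cost is amortized over more work. As before, the pragmas are ignored when OpenMP is off, and the function still produces the same results serially.

```cpp
// Fills out[0..hi-lo-1] with squares, giving each OpenMP task a
// chunk of 10 values to amortize task-creation overhead.
void square_range_chunked(int lo, int hi, int *out) {
    #pragma omp parallel
    {
        #pragma omp single
        {
            for (int a = lo; a < hi; a += 10) {
                #pragma omp task firstprivate(a)   // one task per 10-value chunk
                {
                    for (int v = a; v < a + 10 && v < hi; ++v)
                        out[v - lo] = v * v;       // ten values of work per task
                }
            }
        }
    }   // implicit barrier: all chunks finished
}
```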

Below you will find my parallelized version of the source code with the performance enhancements:


    

// Smallbrain_numbers.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"
#include <math.h>
#include <omp.h>
#include <iostream>
#include <vector>
#include <ctime>

using namespace std;

//Initialize variables
int min_value = 0;
int max_value = 0;
int i = 0;
int count = 0;
int lower;

//Method to return the number of digits in an integer
int numdigits(int n)
{
    return log10((double)n) + 1;
}

//Method to test each value in given range
void test_value(int a, int b, int m)
{
    #pragma omp parallel
    {
        #pragma omp single
        {
            while (a < b)
            {
                #pragma omp task firstprivate(a)
                {
                    int *calculations;
                    calculations = (int*)malloc((m + 1) * sizeof(int));
                    for (int temp = a; temp < a + 10; temp++)
                    {
                        int sum = 0;
                        int pos = 0;
                        int j = 0;
                        int number = temp;
                        if (number > b)
                        {
                            break;
                        }
                        int num_digits = numdigits(number);
                        while (j < num_digits)
                        {
                            calculations[pos++] = (number % 10);
                            number = number / 10;
                            j = j + 1;
                        }
                        calculations[pos] = 56; //sentinel marking the end of the digits
                        pos = 0;

                        while (calculations[pos] != 56)
                        {
                            sum = sum + pow((double)calculations[pos], num_digits);
                            ++pos;
                        }
                        if (sum == temp)
                        {
                            cout << temp << "\n";
                        }
                    }
                    free(calculations); //freed outside the loop, after all writes
                }
                a = a + 10;
            }
        }
    }
}

int _tmain(int argc, _TCHAR* argv[])
{
    cout << "Please enter a lower range\n";
    cin >> min_value;
    cout << "Please enter a higher range\n";
    cin >> max_value;

    cout << "Smallbrain numbers between " << min_value << " and " << max_value << ":\n";
    double start = omp_get_wtime(); //clock();
    test_value(min_value, max_value, numdigits(max_value));
    double stop = omp_get_wtime(); //clock();
    double time = stop - start;
    cout << "The time taken to execute is: " << time << "\n";
}




At this point, I paused to appreciate that the same person who wouldn't have touched a parallel computing program with a ten-foot pole had just completed their first parallelized piece of code. The beauty of this moment was that I didn't want to stop there! This project equipped me not only with skills but with a newfound curiosity about, and appreciation for, the power and potential of parallel computing. More than the piece of code I had just written, I felt my biggest accomplishment was breaking down the barriers of intimidation that parallel programming presented to me. In my opinion, it is important for all students to have a similar eye-opening experience and the opportunity to break down those barriers for themselves. This familiarization can start anywhere: converting serial code from an old course assignment or a personal project, or even taking your own stab at the small brain numbers problem. So, what are you waiting for?


