Calling C Code (Uses OpenMP) from Python

Hello all,

I'm a high school student working on running some Python code. It's a huge computational task, so we actually have to run it on the Stampede supercomputer to make sure it completes in time. We will be taking advantage of the MIC architecture there. My major problem is the following: because the code takes so long to run, we were thinking of porting some of the Python over to C in order to take advantage of OpenMP parallelization. Somebody has already done this successfully using Fortran, but I am doing it in C because we think there is better support for C for some of the things we are doing at Stampede. I am having a lot of trouble linking the C code and the Python together. So far I have been working on just the basics in order to get an idea of it. I am uploading all my files (the forum won't accept a .py file, so I'm attaching the Python source as .txt instead) and will include my command-line build options later in this post.

I am using ctypes to import the shared object file and then get the functions out of there. The error message I'm getting when trying to execute sayhello() from Python is:

Traceback (most recent call last):
File "calltestpython.py", line 3, in <module>
lib = cdll.LoadLibrary('./libtest.so')
File "/usr/lib/python2.7/ctypes/__init__.py", line 443, in LoadLibrary
return self._dlltype(name)
File "/usr/lib/python2.7/ctypes/__init__.py", line 365, in __init__
self._handle = _dlopen(self._name, mode)
OSError: ./libtest.so: undefined symbol: __kmpc_end_serialized_parallel

I'm guessing that this is being introduced simply by using the OpenMP stuff, since commenting out sayhello() and running everything else works with no problem. Can somebody give me some guidance on how to avoid this error?

The commands I use to compile are the following:

icc -fPIC -openmp --vec-report=3 -O3 -c test.c -o test.o

icc -shared -lc -o libtest.so test.o

If somebody could also give me guidance on what these flags are actually doing, that would be wonderful.

Thank you in advance for all the support!!

Attachments:
test.c (438 bytes)
pythonsource.txt (439 bytes)

Add -openmp to the second icc command line as in:     icc -shared -lc -o libtest.so test.o -openmp

You can refer to the Intel® C++ Compiler XE 13.1 User and Reference Guide for details on the compiler options.

For the options you are using, here’s a brief summary.

-fPIC - Instructs the compiler to generate position independent code
-openmp - Enables OpenMP support
-vec-report=3 - Enables the vectorization report level 3 (NOTE: You have an extraneous leading hyphen)
-O3 - Enables Level 3 optimizations
-c - Instructs compiler to compile to object and stop
-o - Names the resulting output file (object file when used with -c, or shared library when used with -shared)
-shared - Instructs compiler to produce a shared library (instead of the normal executable like a.out)
-lc - Instructs the linker to link libc (NOTE: This is not needed for your simple case here at least)
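
Putting those notes together (adding -openmp to the link step, fixing the extra hyphen, and dropping the -lc you don't need), your two commands would look something like:

icc -fPIC -openmp -vec-report=3 -O3 -c test.c -o test.o
icc -shared -openmp -o libtest.so test.o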

Hi Kevin,

Thank you very much for your help!! Everything seems to be up and working normally. Just to confirm, does this prove that parallelization through OpenMP and other benefits such as vectorization will work even when calling the C code from Python? Also, as a side note, I seem to get an unusual thing printed at the end of the execution. It will go through and print the number of threads and say "Hello" and "World" from each of the threads, but at the end of execution it outputs a strange number. There does not seem to be any pattern to this number. Any insight as to why this is the case? Sample numbers: -791840512, 1124002048, -624273152, -1335559936, 1740424472.

Yes, your case shows you are able to compile your functions using the Intel C/C++ compiler and call those functions from python. You should be able to exploit the Intel compiler’s features like parallelization w/OpenMP, high-level optimization, vectorization, and even offloading to the Xeon Phi™ from within your C/C++ functions.  With a variant of your small example, I was able to offload some simple code to the Xeon Phi™ from within sayhello().

Re: the random number output, I don’t exactly know. Yours is the first case of python that I’ve toyed with. It is related to calling your function from within the print(). If you only call as:

lib.sayhello()        (instead of:  print(lib.sayhello()) )

then the random number is not produced. I’d have to dig a bit deeper into the python for a more complete/accurate answer.

PS - The omp_get_num_threads() call in sayhello() is not positioned where I believe you want it. As it is originally, it will always print 1 thread even when more threads are running inside the parallel region. I think you want to restructure it as shown below.

void sayhello()
{
   #pragma omp parallel
   {
      #pragma omp master
      {
         int total = omp_get_num_threads();
         printf("We have %d threads!", total);
      }
      int ID = omp_get_thread_num();
      printf("Hello (from %d)\n", ID);
   }
}
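
If you do want print(lib.sayhello()) to show something meaningful rather than that stray number, one option (just a sketch on my part; I haven't dug into the ctypes side of it) is to have the C function return a value explicitly, for example the thread count:

#include <stdio.h>
#include "omp.h"

/* Sketch only: return the thread count so the Python print() has a
   well-defined value to display. */
int sayhello()
{
   int total = 0;
   #pragma omp parallel
   {
      #pragma omp master
      total = omp_get_num_threads();

      printf("Hello (from %d)\n", omp_get_thread_num());
   }
   return total;
}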

Ok awesome, that was the goal. Would you be able to post that offload code onto this site? Although I have an idea as to how to do that, it'd be great if I could have an example that deals with code I have written myself. Ok, I will suffice with just returning whatever value I wanted. Yeah, I realized that could potentially be an issue after making the changes you described. I decided to use single instead, since I don't need it to be specifically the master thread that calls the method.

As for another point, I decided to follow this intro to OpenMP (http://openmp.org/mp-documents/omp-hands-on-SC08.pdf) and tried the calcPi example as an exercise in writing parallel/vector code. Two issues I'm running into: the rangeRand function and the timing. The first issue is probably because I'm most accustomed to Java; the syntax is not too different, but I've never formally learned C/C++. When I put rangeRand above calcPi(), I encounter no problems, but if it goes after, I get the following error message when compiling:

declaration is incompatible with previous "rangeRand" (declared at line 42)

If you could help me out with this even though it is not strictly for the MIC architecture, I would very much appreciate it. Second issue: I have included some timing code, as you can see. When testing with the parallel pragmas included, I get relatively high times (0.40313220024108887, 0.33976197242736816, 0.29760003089904785, 0.2806398868560791, 0.26195597648620605), but I get much lower times when I do not include the parallel pragmas (0.004934072494506836, 0.00485992431640625, 0.004122018814086914, 0.004211902618408203, 0.004331111907958984). My understanding is that I should be getting lower times when going parallel, since I would be able to do all those iterations in parallel and the number of iterations in each of those parallel threads is reduced. What can I do to correct this problem?

Thanks again for all your help!

Attachments:
test.c (1.12 KB)
pythonsource.txt (545 bytes)

Pardon the delayed reply and length.

Regarding the warning: as written, the code does not provide a declaration or prototype for rangeRand() before its first use at line 41 (i.e. x = rangeRand(-1.0, 1.0); ). The compiler therefore creates an implicit prototype based on that use, and the implicit prototype is incompatible with the actual rangeRand() definition appearing later at line 52 (i.e. double rangeRand(double fMin, double fMax) ). In C, when no declaration is seen first, the implicit prototype assumes a return type of int (e.g. int rangeRand(...) in your program) and nothing is assumed about the arguments. “int” is not compatible with “double” in your program; hence the warning.

The use should come after either the actual function definition/declaration or a function prototype definition. So, either move the rangeRand() definition/declaration above calcPi() (as you noted doing to silence the error earlier) or leave the rangeRand() definition/declaration appearing after calcPi() (just as it is now) and add a function prototype declaration for rangeRand() either after the #include statements or just above the function calcPi() definition/declaration; whichever is your preference. For example you could do something like the following:

double rangeRand(double fMin, double fMax);     // add this function prototype
double calcPi()
{
<body of calcPi>
}
double rangeRand(double fMin, double fMax)
{
<body of rangeRand>
} 

The Monte Carlo method for estimating the value of Pi is a popular choice to use with OpenMP as is the use of rand(). Before discussing runtimes, I first noticed the program produced incorrect results when run with OpenMP. That is caused by not using a reduction on the variable Ncirc with OpenMP. Review foil #127 from the presentation you cited and note the reduction clause on the #pragma omp parallel for. It is also unclear why the iterations are limited to num_trials / omp_get_num_threads(). Again, consider replicating the code as shown on foil #127 and allowing the for loop to iterate for num_trials and increasing num_trials to 10000 (or greater).

With regard to runtimes, the longer runtimes with OpenMP vs. no OpenMP relate to using rand(). There are numerous discussions (Google: Monte Carlo simulation Pi openmp rand -OR- Google: openmp rand slower site:intel.com) about rand() not being threadsafe and about some implementations providing thread safety via a lock in the function; thus invocations from the various threads are serialized.

There are a variety of alternatives to rand(). By using the threadsafe version of random() from the cited presentation (foil #131) you will observe a large speedup vs. rand(), with OpenMP vs. no OpenMP, and when increasing the number of threads (via setting OMP_NUM_THREADS). Try setting num_trials = 100000 (for a more dramatic effect use 1000000) when using your rand() and increasing the number of threads from 8, to 32, to 64. Repeat when using LCG_rangeRand(). You should observe a large speedup from switching RNGs, and a further speedup from increasing the thread count when using the threadsafe LCG_rangeRand() function.
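
For example, from the shell before launching the Python script (assuming bash and the script name from your earlier traceback):

export OMP_NUM_THREADS=32
python calltestpython.py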

I thought about offering an example using an MKL RNG and then found an existing post in our forum, Monte Carlo method for estimating the value of Pi, that already combines an MKL RNG with offload. I briefly discussed the user's code with Development, who indicated it is a valid approach for parallelization/offloading. They thought some fine tuning may be possible and will share that in forthcoming post(s) to the thread. You can correspond with Anwar via this thread if his code is of interest to you or if you have any questions about his method.

My modified version of your code is posted below. It includes the new function, LCG_rangeRand() (i.e. the random() from foil #131), changes to use it, and the trivial offload code in sayhello().
 
I hope this helps.

#include <stdio.h>
#include <stdlib.h>   /* for rand(), RAND_MAX */
#include <math.h>     /* for pow() */
#include <time.h>
#include "omp.h"

double rangeRand(double fMin, double fMax);
double LCG_rangeRand();

double mean(double a, double b)
{
   return (a+b)/2.0;
}

double geomean(double a, double b)
{
   return pow(a * b, 0.5);
}

void sayhello()
{
   #pragma omp parallel
   {
      #pragma omp single
      {
         int total = omp_get_num_threads();
         printf("We have %d threads!\n", total);
      }
      int ID = omp_get_thread_num();
//-orig      printf("Hello (from %d)\n", ID);
      printf("%d ", ID);
   }
   printf("\n");

   #pragma offload target(mic:0)
   {
      #pragma omp parallel
      {
         #pragma omp single
         {
            #ifdef __MIC__
               printf("MIC: thread_id: %d of %d threads!\n",
                      omp_get_thread_num(), omp_get_num_threads());
               fflush(0);
            #else
               printf("HOST: thread_id: %d of %d threads!\n",
                      omp_get_thread_num(), omp_get_num_threads());
               fflush(0);
            #endif
         }
      }
   }
}

double calcPi()
{
//-orig static long num_trials = 1000;
   static long num_trials = 100000;
   long i;
   long Ncirc = 0;
   double pi, x, y;
   double r = 1.0;

   #pragma omp parallel for private(x,y) reduction(+:Ncirc)
//-orig for(i = 0; i < num_trials/omp_get_num_threads(); i++) {
   for(i = 0; i < num_trials; i++) {
//-orig    //printf("Current ID: %d\n", omp_get_thread_num());
//-orig    x = (double)rangeRand(-1.0, 1.0);
//-orig    y = (double)rangeRand(-1.0, 1.0);
      x = LCG_rangeRand();
      y = LCG_rangeRand();
      if(x*x + y*y <= r*r) {
         Ncirc++;
      }
   }
   pi = 4.0 * ((double) Ncirc) / ((double) num_trials);
   return pi;
//-orig //printf("With %lu trials, we have calculated pi as %f.\n", num_trials, pi);
}

double rangeRand(double fMin, double fMax)
{
   double f = (double)rand() / RAND_MAX;
   return fMin + f * (fMax - fMin);
   printf("%f NUM\n", f);
}

static long MULTIPLIER = 1366;
static long ADDEND = 150889;
static long PMOD = 714025;
long random_last = 0;
#pragma omp threadprivate(random_last)

double LCG_rangeRand()
{
   long random_next;
   random_next = (MULTIPLIER * random_last + ADDEND) % PMOD;
   random_last = random_next;
   return ((double)random_next/(double)PMOD);
}

double mult(double a)
{
   return a * 30;
}

No problem with the time! I appreciate the effort of putting all this material together; the length of the post is no issue. Thank you very much for clarifying the function prototypes! I have done what you suggested, putting the function prototype after the include statements, and it works perfectly.

With regard to doing num_trials/omp_get_num_threads(), I think this was to exploit the parallelism, no? Say num_trials is 100 and we have 10 threads. Instead of 1 thread doing 100 trials, I could instead have each thread do 10 trials and accumulate those results. Am I missing something in that interpretation? I think one possible explanation is that the parallelism takes this into account when dividing up the chunks of the loop. Let me know if this is correct. I tried replicating the code on slide #127 of the presentation I sent you, but I got an error saying that the symbol seed is undefined. How can I access that function?

I have been testing with n = 100. I seem to be getting the same results as before for OpenMP vs. non-OpenMP code, while using the new random method does give faster run times.

(Using LCG_rangeRand(), OpenMP):  [0.2852480411529541, 0.009171009063720703, 0.009347915649414062, 0.013112068176269531, 0.008533000946044922]

(Using rangeRand(), OpenMP): [0.34800100326538086, 0.01278996467590332, 0.012365102767944336, 0.015981197357177734, 0.012634992599487305]

(Using LCG_rangeRand(), non OpenMP): [0.00013899803161621094, 0.00013518333435058594, 0.00015997886657714844, 0.00013494491577148438, 0.00013494491577148438]

(Using rangeRand(), non OpenMP): [0.0002269744873046875, 0.0002579689025878906, 0.0002148151397705078, 0.00021386146545410156, 0.00021409988403320312]

Interesting to note that it seems as though using rangeRand() vs. LCG_rangeRand() does still matter in the "non OpenMP" runs; that should help in debugging. In order to do OpenMP vs. non-OpenMP, I just comment out the pragma lines in calcPi and LCG_rangeRand; is this the correct way to compare performance? I will look into that post and ask questions as necessary once I finish working out these issues.

The offload compilation does not seem to work on my side, as I get the following error messages when compiling with my script compilescript.sh (which I am attaching to this post):

Compiling with argument -n being "test"
Compiling first step.......
test.c(66): (col. 2) remark: loop was not vectorized: existence of vector dependence.
icc: warning #10362: Environment configuration problem encountered.  Please check for proper MPSS installation and environment setup.
test.c(2): catastrophic error: *MIC* cannot open source file "stdio.h"
  #include <stdio.h>
                    ^

compilation aborted for test.c (code 4)
Done with first step!
Compiling second step.......
Done with second step!

Is there something in my compile script that is incorrect? Or does the problem lie in my C source code?

Attachments:
compilescript.txt (476 bytes)
test.c (2.46 KB)

It is unclear where the seed()/random() references were resolved exactly. You do not need them since you have your own RNG.

To compile/run without OpenMP, just exclude the -openmp option on each icc command-line.

The #pragma omp for divides the work (all loop iterations) among the team of threads.

I think N=100 is too small to overcome the overhead of OpenMP and I don’t believe the answer (pi) will be correct as you increase the number of threads.
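
To illustrate (a minimal standalone sketch, not your exact code): with a plain #pragma omp parallel for over the full trip count, the iterations are split across the team automatically, so dividing the trip count by omp_get_num_threads() yourself means the team collectively performs only a fraction of the intended trials.

#include <stdio.h>
#include "omp.h"

int main(void)
{
   long num_trials = 100;
   long i;
   long count = 0;

   /* OpenMP hands each thread a subset of the 100 iterations;
      every iteration still runs exactly once. */
   #pragma omp parallel for reduction(+:count)
   for (i = 0; i < num_trials; i++) {
      count++;
   }

   /* count is 100 no matter how many threads run the loop. If the loop
      bound were num_trials/omp_get_num_threads(), the whole team would
      execute only that reduced number of iterations in total. */
   printf("iterations executed: %ld\n", count);
   return 0;
}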

The offload compilation fails because you appear to not be using a system with MPSS installed, so perhaps you are not yet compiling on Stampede. Offload compilation won’t be of much use on a system w/o a Xeon Phi™.

Ok, I will just omit those references from code.

Ok, I edited my compile script so that I can also compile without OpenMP; I then changed num_trials to 100,000. It then seems as though I am still getting the same performance:

With OpenMP: [0.12504792213439941, 0.12320494651794434, 0.12214112281799316, 0.12220382690429688, 0.12228703498840332]

Without OpenMP: [0.12225794792175293, 0.12264800071716309, 0.12170004844665527, 0.12144899368286133, 0.12245392799377441]

Is 100,000 also still too small a test size? Also, this may help clarify things for me: why did changing the RNG we were using (LCG_rangeRand() vs. rangeRand()) change the performance so drastically (2x in terms of time) even when we weren't using OpenMP?

Yes, sorry, this was my mistake. I am able to compile with icc on my local machine so I was trying to do offloading there too, but yes, I should be doing this on Stampede. I compiled it fine, and the offload code is working great. Thanks! I did have a couple questions about it though. First, why do we have the else clause to the if ("#ifdef __MIC__")? Isn't it guaranteed that we would be on the MIC if this piece of code runs since we have offloaded it to the MIC? Also, what is the __MIC__ thing that you are checking? Why does that if statement work? Lastly, why do we have the fflush(0) line there? Wouldn't it be printed either way?

Thanks again for going through all this with me!!

"Thanks again for going through all this with me!!"

You’re very welcome. It’s my pleasure.

So, lots to cover here.

100000 is not too small. Stick w/that value.

Congratulations, you helped identify a bug in our compiler!  (See supporting details below.) When compiling without -openmp, the presence of the #pragma offload in sayhello() re-enables OpenMP, and all code appearing after the #pragma offload is compiled with OpenMP enabled. Ouch!  So the 2x slowdown when using rangeRand() vs. LCG_rangeRand() is what I mentioned earlier: rand() is likely made threadsafe using a lock. This also explains why you are not seeing any performance difference when compiling with and without -openmp. I confirmed this is already fixed in our Composer XE 2013 SP1 (14.0) release (next month), so you will want/need to work around this for now when using the current Composer XE 2013 (13.1) release without OpenMP.

Depending on the OpenMP API calls used, when compiling without OpenMP enabled, one can explicitly disable offload (and avoid this bug) using the -offload=none option for the non-OpenMP build.

Unfortunately my modified sayhello() prevents you from adding -offload=none option for the non-OpenMP build because the calls to omp_get_num_threads() & omp_get_thread_num() are unresolved. Since sayhello() is really no longer needed you could completely comment it out or remove it from your code (and python script). If you want to keep it and only enable the body of it when using OpenMP, then you can #ifdef the body of the code with the _OPENMP predefine as shown below so that the body is only active when compiling with -openmp.

void sayhello()
{
#ifdef _OPENMP
    <body of sayhello>
#endif
} 

Details of the offload/openmp bug.

Given t4.c contains:

#include "omp.h"
void foo()
{
   #pragma omp parallel num_threads(1)
   { }
   #pragma offload target(mic:0)
   { }
   #pragma omp parallel num_threads(2)
   { }
}
        
void bar()
{
   #pragma omp parallel num_threads(3)
   { }
} 

Compilation without -openmp should report all #pragma omp directives as unrecognized; however, with the offload compilation (enabled by default due to the presence of the #pragma offload) it only disables OpenMP for the first #pragma omp directive appearing before the #pragma offload. Only when offload is explicitly disabled (via -offload=none) are all the #pragma omp directives flagged as unrecognized.

$ icc -V
Intel(R) C Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 13.1.3.192 Build 20130607
$ icc -c t4.c
t4.c(4): warning #161: unrecognized #pragma
     #pragma omp parallel num_threads(1)
             ^
$ icc -c t4.c -offload=none
t4.c(4): warning #161: unrecognized #pragma
     #pragma omp parallel num_threads(1)
             ^
t4.c(6): warning #161: unrecognized #pragma
     #pragma offload target(mic:0)
             ^
t4.c(8): warning #161: unrecognized #pragma
     #pragma omp parallel num_threads(2)
             ^
t4.c(14): warning #161: unrecognized #pragma
     #pragma omp parallel num_threads(3)
             ^

Regarding your offload questions:

"First, why do we have the else clause to the if ("#ifdef __MIC__")?   Isn't it guaranteed that we would be on the MIC if this piece of code runs since we have offloaded it to the MIC? Also, what is the __MIC__ thing that you are checking? Why does that if statement work? Lastly, why do we have the fflush(0) line there? Wouldn't it be printed either way?"

In the current Composer XE 2013 (i.e. 13.1 compiler) release, offload is “optional” by default, which means code designated for offload is allowed to “fall back” to the host CPU and execute there when a Xeon Phi™ card is unavailable; thus the else clause will indicate when the offload code block executes on the host. In the CXE 2013 SP1 (i.e. 14.0 compiler) release (next month), the offload default changes to “mandatory”, where an offload application will by default fail when a Xeon Phi™ card is unavailable.

__MIC__ is a predefine the compiler defines when the offload compilation is enabled. Offload compilation is enabled automatically when the compiler detects the presence of offload keywords (e.g. #pragma offload) in the user's source code. One can use this predefine to conditionally compile code specifically for the MIC card, such that code guarded by the predefine is only included in the MIC-side executable. For my usage in sayhello(), the if block appears in the MIC-side executable and the else block appears in the host-side executable; thus you receive a different printed message depending on where the offload block executes. I must add that my demonstrated usage directly inside the #pragma offload block is not permitted, due to problems that may occur if the predefine is not properly used under a #pragma offload. The supported usage of the predefine is within a function that is completely offloaded. So, don't take my lead and violate that restriction. I was just trying to help demonstrate when code was running on the MIC card vs. the host.
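
For reference, a minimal sketch of the supported pattern (the function names here are just my illustration; check the User and Reference Guide for the exact offload attribute syntax): test __MIC__ inside a function that is itself compiled for the coprocessor, and call that function from the offload region:

#include <stdio.h>

/* The target(mic) attribute causes this function to be compiled for both
   the host and the coprocessor; __MIC__ is then defined only in the
   MIC-side compilation of the function. */
__attribute__((target(mic)))
void report_where(void)
{
#ifdef __MIC__
   printf("running on the coprocessor\n");
   fflush(0);   /* flush MIC-side output back to the host */
#else
   printf("running on the host\n");
#endif
}

void sayhello_where(void)
{
   #pragma offload target(mic:0)
   {
      report_where();
   }
}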

fflush(0) is necessary to ensure all I/O buffers associated with the MIC-side execution are flushed back to the host process. It is needed as a by-product of the underlying MPSS/uOS design. Without it you can lose output from within offloaded code. The fflush(0) is not needed in the else block since that executes on the host. Sorry, just a cut-n-paste oversight on that.

PS - I touched base with one of the authors of the presentation cited earlier and he directed me to the source code for the examples. The Code Exercises link under the Tutorials and Technical Articles section contains the exercises, including seed()/random(). He also noted a mistake on the foil: it shows seed() taking three arguments when it actually takes only two.

Hope that helps!
