How do I know in which core my thread is running

Hello guys.

I'm trying to scale a for loop but I'm getting even worse results.

My serial code runs in 30s but my OpenMP version takes 200s.

This is my pragma.

int procs = omp_get_num_procs();
#pragma omp parallel for num_threads(procs)\
shared (c, u, v, w, k, j, i, nx, ny) \
reduction(+: a, b, c, d, e, f, g, h, i)

And these are my OpenMP exports:

export OMP_NUM_THREADS=5
export KMP_AFFINITY=verbose,scatter 

And this is the verbose output when running on 1 node with 8 cores:

OMP: Info #149: KMP_AFFINITY: Affinity capable, using global cpuid instr info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7}
OMP: Info #156: KMP_AFFINITY: 8 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #159: KMP_AFFINITY: 2 packages x 4 cores/pkg x 1 threads/core (8 total cores)
OMP: Info #160: KMP_AFFINITY: OS proc to physical thread map ([] => level not in map):
OMP: Info #168: KMP_AFFINITY: OS proc 0 maps to package 0 core 0 [thread 0]
OMP: Info #168: KMP_AFFINITY: OS proc 4 maps to package 0 core 1 [thread 0]
OMP: Info #168: KMP_AFFINITY: OS proc 2 maps to package 0 core 2 [thread 0]
OMP: Info #168: KMP_AFFINITY: OS proc 6 maps to package 0 core 3 [thread 0]
OMP: Info #168: KMP_AFFINITY: OS proc 1 maps to package 1 core 0 [thread 0]
OMP: Info #168: KMP_AFFINITY: OS proc 5 maps to package 1 core 1 [thread 0]
OMP: Info #168: KMP_AFFINITY: OS proc 3 maps to package 1 core 2 [thread 0]
OMP: Info #168: KMP_AFFINITY: OS proc 7 maps to package 1 core 3 [thread 0]
OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {0}
OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {1}
OMP: Info #147: KMP_AFFINITY: Internal thread 2 bound to OS proc set {4}
OMP: Info #147: KMP_AFFINITY: Internal thread 3 bound to OS proc set {5}
OMP: Info #147: KMP_AFFINITY: Internal thread 4 bound to OS proc set {2}
OMP: Info #147: KMP_AFFINITY: Internal thread 5 bound to OS proc set {3}
OMP: Info #147: KMP_AFFINITY: Internal thread 6 bound to OS proc set {6}
OMP: Info #147: KMP_AFFINITY: Internal thread 7 bound to OS proc set {7}

Why is there such a difference between the serial and OpenMP versions?

I need to know where each thread is running because I think the first core is spawning 5 threads on itself, but I wanted each core to have 1 thread.

Am I writing the pragma before the for loop correctly?

 

thanks in advance!

General questions about optimization strategy seem to be frowned upon here, particularly when they imply a need to refer to the optimization reports of the compiler of your choice.

If your question is whether the setting of num_threads for a parallel region over-rides the environment variable, that is the expectation.
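A minimal sketch of how one might check this, assuming an OpenMP-capable compiler (for example built with `-fopenmp` or `-qopenmp`). The function name `team_size_with_clause` is hypothetical and only for illustration; it reports the team size seen inside a region that carries a num_threads clause, regardless of what OMP_NUM_THREADS says. The runtime may still deliver fewer threads than requested.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: returns the team size observed inside a parallel
   region that requests `requested` threads through the num_threads clause.
   Built without OpenMP support, it simply returns 1. */
int team_size_with_clause(int requested)
{
    int observed = 1;
    assert(requested > 0);
#ifdef _OPENMP
    #pragma omp parallel num_threads(requested)
    {
        /* One thread records the size of the team it belongs to. */
        #pragma omp single
        observed = omp_get_num_threads();
    }
#endif
    return observed;
}
```

With OMP_NUM_THREADS=5 exported, calling `team_size_with_clause(omp_get_num_procs())` on the 8-core node above would be expected to report up to 8, which matches the 8 bound threads in the verbose output.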

If you are calculating 9 sums in parallel with fewer threads than sums, it would appear to make sense to localize 1 or 2 sums to each thread. That holds even if your compiler is able to perform simd reduction when all sums are spread across all threads, particularly if you have a reason to assign the threads alternating among CPUs as you showed.

You would want to examine the compiler reports, at least to the extent of assuring that you have simd optimized reduction in each thread.

If you are suggesting that 5 threads may be as good or better than 8 when calculating 9 sums in parallel, that may be so if you localize 1 or 2 sums to each thread.
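A rough sketch of what localizing sums to threads could look like, under the assumption that the nine sums can be computed independently of each other (the original loop body is not shown, so this may not apply directly). The function `localized_sums` and the flat row-major `data` layout are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NSUMS 9  /* nine sums, as in the reduction clause of the question */

/* Hypothetical layout: data holds NSUMS rows of n terms each, row-major.
   Each outer iteration owns one complete sum, so no cross-thread
   reduction is needed; the inner loop is left to simd reduction. */
void localized_sums(const double *data, size_t n, double out[NSUMS])
{
    assert(data != NULL && out != NULL);
    #pragma omp parallel for schedule(static)
    for (int s = 0; s < NSUMS; ++s) {
        double acc = 0.0;
        #pragma omp simd reduction(+:acc)
        for (size_t i = 0; i < n; ++i)
            acc += data[(size_t)s * n + i];
        out[s] = acc;
    }
}
```

With 5 threads and a static schedule, each thread would handle 1 or 2 of the 9 sums. Whether this beats a single loop with a nine-variable reduction depends on the real loop body and is worth confirming with the compiler's vectorization report.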

Tim is correct, this forum is for questions about the implementation of the runtime. You have also posted on StackOverflow, and I have already commented there :-)

Ok. My apologies for posting it here. 

I've already posted it there.

Thanks a lot!

I can't think of a search strategy to find this on stackoverflow.  Better titles are important.  Not that I'm too happy with the climate on stackoverflow where people try to discourage questioners from reading my responses.

Look for OpenMP tagged questions on StackOverflow http://stackoverflow.com/questions/tagged/openmp . (You can also have StackOverflow send you a list of OpenMP tagged questions every day, should you be so inclined [which is why I saw it in the first place]).

Although this question was not precisely related to the topic of the implementation of the Intel OpenMP runtime library, it is a question that is certain to come up when trying to understand what the OS does to an OpenMP program if you don't pin the threads.  I don't know of any "software" solutions to the question, but there is a delightfully elegant hardware solution that has been supported by Linux on x86 processors for the last few years.

Intel processors starting with Nehalem and that other vendor's processors going back to at least Family 10h support a new variant of the RDTSC ("Read Time-Stamp-Counter") instruction called "RDTSCP". The RDTSCP instruction reads the Time-Stamp-Counter, but also returns an additional register with the contents of the Machine-Specific-Register MSR_TSC_AUX, which contains a value that is unique to each logical processor. Linux kernels since about 2.6.34 have defined this MSR to include the logical processor number and (in multi-socket systems) the socket number where that logical processor is located.

User-mode execution of the RDTSCP instruction is enabled by default on all Linux distributions that I have checked, so this enables a user to get the value of the Time-Stamp-Counter, the logical processor number, and (for multi-socket systems) the socket number in a single instruction. 

The appended C function provides a simple interface to this functionality, returning the Time-Stamp-Counter as the function return value and overwriting the two integer input arguments with the current chip (socket) number and current core (logical processor) number.

unsigned long tacc_rdtscp(int *chip, int *core)
{
    unsigned a, d, c;

    /* RDTSCP returns the TSC in EDX:EAX and MSR_TSC_AUX in ECX */
    __asm__ volatile("rdtscp" : "=a" (a), "=d" (d), "=c" (c));
    *chip = (c & 0xFFF000) >> 12;   /* socket number */
    *core = c & 0xFFF;              /* logical processor number */

    return ((unsigned long)a) | (((unsigned long)d) << 32);
}
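To make the bit layout used above easier to check on its own, here is a small sketch that decodes a TSC_AUX value without executing RDTSCP. The helper name `decode_tsc_aux` is hypothetical; the masks are the same ones used in the function above and assume the Linux encoding described earlier.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: split a TSC_AUX value (the ECX output of RDTSCP)
   into a socket number (bits 12..23) and a logical processor number
   (bits 0..11), assuming the Linux layout described above. */
void decode_tsc_aux(unsigned aux, int *chip, int *core)
{
    assert(chip != NULL && core != NULL);
    *chip = (int)((aux & 0xFFF000u) >> 12);
    *core = (int)(aux & 0xFFFu);
}
```

For example, a raw value of 0x1005 decodes to socket 1, logical processor 5. Note that an unpinned thread can migrate right after the instruction executes, so the result describes where the thread was at that moment, not where it will stay.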

 

John D. McCalpin, PhD
"Dr. Bandwidth"
