White Rabbits - part 2

By jimdempseyatthecove (8 posts) on November 25, 2009 at 4:22 pm

This is a continuation of my White Rabbits blog. Let’s go looking for some White Rabbits.

The above represents a series of 10 test runs, with average, of a program running 5 different tests per run. Run on a quad core Intel Q6600.

Test 1 is single threaded

// t1 single thread for loops
__int64 t1 = __rdtsc();
for(int i=0; i<arraySize; ++i)
A[i] = ((double)i) / 1000.0;
for(int i=0; i<arraySize; ++i)
B[i] = sin(A[i]);
for(int i=0; i<arraySize; ++i)
C[i] = cos(A[i]);
t1 = __rdtsc() - t1;
std::cout << " " << t1;

Where arraySize is 1024*1024

Test 2 is multi-threaded (4 threads) using OpenMP

// t2 multi-threaded for loop
// using all threads
// without background thread
__int64 t2 = __rdtsc();
#pragma omp parallel for
for(int i=0; i<arraySize; ++i)
A[i] = ((double)i) / 1000.0;
#pragma omp parallel for
for(int i=0; i<arraySize; ++i)
B[i] = sin(A[i]);
#pragma omp parallel for
for(int i=0; i<arraySize; ++i)
C[i] = cos(A[i]);
t2 = __rdtsc() - t2;
std::cout << " " << t2;

Test 3 is multi-threaded (4 threads) using OpenMP with the OpenMP chunk size

// t3 loops with chunk size
// using all threads
// without background thread
__int64 t3 = __rdtsc();
#pragma omp parallel for schedule(dynamic, chunk)
for(int i=0; i<arraySize; ++i)
A[i] = ((double)i) / 1000.0;
#pragma omp parallel for schedule(dynamic, chunk)
for(int i=0; i<arraySize; ++i)
B[i] = sin(A[i]);
#pragma omp parallel for schedule(dynamic, chunk)
for(int i=0; i<arraySize; ++i)
C[i] = cos(A[i]);
t3 = __rdtsc() - t3;
std::cout << " " << t3;

Tests 2 and 3 are run with the system relatively unloaded. Some instances of other applications are loaded but not busy working.

Tests 4 and 5 are run with a load of one non-OpenMP thread within the application

// startup non-OpenMP thread
// simulate one compute thread
// running other program running
done = true;
HANDLE h = (HANDLE)_beginthread(&tieUpThread, 0, 0);

with

void tieUpThread(void* context)
{
done = false;
while(!done)
_mm_pause();
}

Test 4 is the same as test 2 but run with the additional load.
Test 5 is the same as test 3 but run with the additional load.

In observing the charted lines, the separations of the lines are the White Rabbits.

Test 2, in 4 of the 10 runs, showed scaling between 3.5 and 4x that of single threaded program. This is good scaling. However, test 2, in 5 of the 10 runs, showed scaling between 2 and 3x that of single threaded program. Not as good scaling. And in test 2, 1 of the 10 runs, showed about 1.5x that of the single threaded program. The average for test 2 being 2.87x that of the single threaded program. This is likely a benchmark that you will not find posted to promote someone’s threading prowess. I am posting it here, to illustrate White Rabbits.

Test 3, showed tighter grouping and at greater than 3x improvement over serial performance.

Test 4, has a precipitous drop when the additional thread load comes in to play and when you make no programming consideration for this possibility.

Test 5, mitigates some of the performance hit by using the OpenMP schedule(dynamic, chunk) where chunk is set to:

int chunk = arraySize / maxThreads / 4;

Test 3, (prior to additional thread load) also used schedule(dynamic, chunk) and the average time benefited over coding without chunk. Your experience may vary.

The above runs were of a synthetic program that generally will not be typical of your OpenMP programs. However, the test data does illustrate the White Rabbits in the program.

In a more complex OpenMP application you may use nested levels. The following is an expansion on the test program

Test 6 is the same as test 4, except run one Nest Level deeper (and with additional thread load).
Test 7 is the same as test 5 (with chunk), except run one Nest Level deeper (and with additional thread load).
Test 8 the same as test 6, except run with one less thread (with knowledge of busy thread).
Test 9 the same as test 7, except run with one less thread (with knowledge of busy thread).

As you can see from the chart, using OpenMP chunking can improve performance. And more significantly, knowledge of activities of other threads within your application can yield significant improvement in performance. You will have to construct a rule for determining how many threads to remove from your parallel for loops. Perhaps at least one per level. (More White Rabbit management.)

Let’s see what happens when we run the second test program, but configured to use a threading toolkit that provides enhanced programming capabilities. We will run the same functional test using QuickThread.

As you can see from the chart plots, even with QuickThread, the program still has White Rabbits. But you can also see, with appropriate programming for the run-time environment, you can corral the White Rabbits within a narrow band. Test 2 and 3 for lightly loaded system, and test 6 and 7 for moderately loaded system (one additional compute bound thread).

Let’s see what the QuickThread code looks like

Test 1 is the same serial code

Test 2

// t2 multi-threaded for loop
// using all threads
// without background thread
__int64 t2 = __rdtsc();
parallel_for(
0, arraySize,
[&](int iBegin, int iEnd)
{
for(int i=iBegin; i<iEnd; ++i)
A[i] = ((double)i) / 1000.0;
});
parallel_for(
0, arraySize,
[&](int iBegin, int iEnd)
{
for(int i=iBegin; i<iEnd; ++i)
B[i] = sin(A[i]);
});
parallel_for(
0, arraySize,
[&](int iBegin, int iEnd)
{
for(int i=iBegin; i<iEnd; ++i)
C[i] = cos(A[i]);
});
t2 = __rdtsc() - t2;
std::cout << " " << t2;

With Intel C++, or any compiler with C++0x Lambda function capabilities, you can convert your OpenMP program relatively easy by using Lambda functions.

Test 3 (QuickThread equivalent to OpenMP chunking)

// t3 loops with chunk size
// using all threads
// without background thread
__int64 t3 = __rdtsc();
parallel_for(
chunk, 0, arraySize,
[&](int iBegin, int iEnd)
{
for(int i=iBegin; i<iEnd; ++i)
A[i] = ((double)i) / 1000.0;
});
parallel_for(
chunk, 0, arraySize,
[&](int iBegin, int iEnd)
{
for(int i=iBegin; i<iEnd; ++i)
B[i] = sin(A[i]);
});
parallel_for(
chunk, 0, arraySize,
[&](int iBegin, int iEnd)
{
for(int i=iBegin; i<iEnd; ++i)
C[i] = cos(A[i]);
});
t3 = __rdtsc() - t3;
std::cout << " " << t3;

The QuickThread statement to launch compute bound thread within thread pool of application

// Create a QuickThread thread
// within this application
// to simulate other task running
// using our thread pool
parallel_task(tieUpThread, 0);

Test 8 (non-chunking but reduced thread count)

// t8 reduced thread count without chunking
__int64 t8 = __rdtsc();
parallel_for(
AllWaitingThreads$,
0, arraySize,
[&](int iBegin, int iEnd)
{
for(int i=iBegin; i<iEnd; ++i)
A[i] = ((double)i) / 1000.0;
});
parallel_for(
AllWaitingThreads$,
0, arraySize,
[&](int iBegin, int iEnd)
{
for(int i=iBegin; i<iEnd; ++i)
B[i] = sin(A[i]);
});
parallel_for(
AllWaitingThreads$,
0, arraySize,
[&](int iBegin, int iEnd)
{
for(int i=iBegin; i<iEnd; ++i)
C[i] = cos(A[i]);
});
t8 = __rdtsc() - t8;
std::cout << " " << t8;

Special note about 8.

Note the use of AllWaitingThreads$ as a parameter to parallel_for. When you use this on the outer nest level, all threads are available, and the iteration space is divided up by the number of threads. When use of AllWaitingThreads$ as a parameter is nested, the parallel_for loop is partitioned by the number of waiting threads + current thread (assuming you do not request for the thread pool to be selected elsewhere (e.g. on/in different socket).

Test 9, reduced thread count with chunking

// t9 reduced thread count with chunking
__int64 t9 = __rdtsc();
parallel_for(
AllWaitingThreads$,
chunk, 0, arraySize,
[&](int iBegin, int iEnd)
{
for(int i=iBegin; i<iEnd; ++i)
A[i] = ((double)i) / 1000.0;
});
parallel_for(
AllWaitingThreads$,
chunk, 0, arraySize,
[&](int iBegin, int iEnd)
{
for(int i=iBegin; i<iEnd; ++i)
B[i] = sin(A[i]);
});
parallel_for(
AllWaitingThreads$,
chunk, 0, arraySize,
[&](int iBegin, int iEnd)
{
for(int i=iBegin; i<iEnd; ++i)
C[i] = cos(A[i]);
});
t9 = __rdtsc() - t9;
std::cout << " " << t9;

Integrating QuickThread into this OpenMP program was relatively easy when used with Intel C++ Lambda function support.

Now consider what you could do with QuickThread to make the last two parallel_for loops run in parallel. For this we use parallel_invoke to fork and join the thread stream.

// t10 all of our pool threads, with chunking
// and with second two loops in parallel
__int64 t10 = __rdtsc();
parallel_for(
chunk, 0, arraySize,
[&](int iBegin, int iEnd)
{
for(int i=iBegin; i<iEnd; ++i)
A[i] = ((double)i) / 1000.0;
});
// run two parallel for loops in parallel
parallel_invoke(
AllThreads$,
// Lambda task 1
[&]() {
parallel_for(
chunk, 0, arraySize,
[&](int iBegin, int iEnd)
{
for(int i=iBegin; i<iEnd; ++i)
B[i] = sin(A[i]);
});
},
// Lambda task 2
[&]() {
parallel_for(
chunk, 0, arraySize,
[&](int iBegin, int iEnd)
{
for(int i=iBegin; i<iEnd; ++i)
C[i] = cos(A[i]);
});
}
);
t10 = __rdtsc() - t10;
std::cout << " " << t10;

And chart with data to add test 10

With the additional level of parallel programming control within QuickThread it is a relatively trivial programming job to corral the White Rabbits and to produce significantly faster code.

Visit me at The Parallel Void.

Jim Dempsey

Categories: Parallel Programming

For more complete information about compiler optimizations, see our Optimization Notice.

Comments (9)

November 25, 2009 4:47 PM PST


Jim Dempsey
Aaron,

When I preview the sample pieces of code the code is indented for better readability. Something is bung-ing up the indents when the code goes through your finalization process. Also, this post is not the last revision, close enough. A sentence below the first chart was missing. Something to the effect

The seperation of the chart lines is/are observatons of White Rabbits. (add) also the curves do not look more like a square wave. Idealized values for 1st chart 1, 4, 4, 3, 3. The deviation from this is due to other White Rabbits.

From first article White Rabbits are due to

Thread teams not starting at same time,
Thread teams not ending at same time
Default SpinWait time at end of OpenMP region.
Thread preemption by other applications on system

Jim Dempsey




Jim
November 25, 2009 11:57 PM PST

Dmitriy Vyukov
Dmitriy VyukovTotal Points:
43,814
Black Belt
Awesome!
It would be interesting to see what QuickThread can do for a larger NUMA system. For example, if I run 2 separate jobs in parallel on a 2 processor NUMA system, I really would like to keep first job on first processor and second job on second processor. But I guess that Cilk-style randomized schedulers found in most parallel support libraries will tend to squash everything into a total mess...
November 26, 2009 7:22 AM PST

jimdempseyatthecove
jimdempseyatthecoveTotal Points:
73,324
Black Belt
With the current version of QuickThread you can do that with the addition of a little user code. A future release of QuickThread will integrate this into the thread scheduler. This will make it transparent to the programmer and make more efficient use of the processor.

Assume the user has a 2 socket NUMA system. Dual 4-core processors with HT. Assume further, as you suggest, your interest is in running only 1 or 2 instances of an application that run in one socket. You can use the existence of or absence of a file as an indication of presence of other thread.

if( CreateFirstAppFile() )
{
// we are the first app
YourEntryPoint();
CloseFirstAppFile();
return;
}
// we are the second or later file
parallel_task( M1$, YourEntryPoint);
return;

Then in YourEntryPoint and all deeper functions use any of the qtPlacements from L0$ through M0$.

Note, the M0$ of the task launched with parallel_task( M1$, YourEntryPoint); is the “other’ socket.

parallel_for( M0$, iBegin, iEnd, fn, arg);

Will restrict the for loop to the current socket. So will using L3$ until the point of a socket having multiple L3 caches.
November 26, 2009 10:44 PM PST

Andy Idsinga (Intel)
Andy Idsinga (Intel)
Hey Jim, when you embed your source code in your posts, put it all inside an html pre block - The HTML PRE tag. This will preserve your formatting for the code block.

Cheers :)
November 27, 2009 2:41 AM PST

Dmitriy Vyukov
Dmitriy VyukovTotal Points:
43,814
Black Belt
Andy, unfortunately it does not work for ISN blogs. Moreover code looks perfectly formatted just before publication, and when it goes live you see such a mess.
November 27, 2009 2:49 AM PST

Dmitriy Vyukov
Dmitriy VyukovTotal Points:
43,814
Black Belt
Jim, it's cool. However I was talking about a bit simpler situation - I explicitly run 2 independent jobs inside of single application - for example I need to sort 2 equally sized arrays. I guess it may be expressed with something along the lines of:
parallel_invoke(OneEach_M0$, job1, job2);
And then each job issues subtasks with L3$.
November 27, 2009 4:54 AM PST

jimdempseyatthecove
jimdempseyatthecoveTotal Points:
73,324
Black Belt
I misunderstood jobs as being seperate applications. Now as to your suggestion of use of parallel_invoke.
My suggestion would be to use parallel_task instead

. // schedule job2 to L3 with most available threads
. // excepting L3 where my thread runs
. parallel_task( L3$ | ExcludeMyCacheLevel$, job2);
. // returns here while job2 runs in other L3
. job1(); // run job1 inparallel with job2
. parallel_wait(); // wait for job 2

(or you could "parallel_task( L3$, job1);" if you want to run other code between the job1 and the parallel_wait)

The reason for suggesting parallel_task over parallel_invoke is that parallel_task uses a qtControl object (task level qtControl is implicit when programmer does not explicitly supply their own qtControl object). The parallel_invoke is designed to be a low weight fork and join operation with the current thread always being a team member, and it always blocks till completion. With these restrictions, I can bypass some of the asychronous tasking code within the scheduler. The bypassing of this code, and the lack of the qtControl object comes with this tradeoff: The cache/memory selector on paralllel_invoke is the prefered affinity qualifier, not the required affinity qualifier. Whereas enqueue with qtControl (implicit or explicit) provides for the ability to pass on a required affinity DeQueue qualifier.
November 27, 2009 5:01 AM PST

jimdempseyatthecove
jimdempseyatthecoveTotal Points:
73,324
Black Belt
In the above post I tried using

(dot)(space)(space)(space))(space)(space)(space)text

as an attempt at inserting an indentation to the code snip.
Here is inserting HTML code in "Leave a comment"

<Pre>
left
ten spaces in
left
</Pre>


See what comes out

Jim
November 30, 2009 10:07 AM PST

Andy Idsinga (Intel)
Andy Idsinga (Intel)
Jim, Dimitry - re the PRE block - I used this in my recent post re javascript objects (check it out formatted and colored). Weird, not sure why it doesn't work for you :(
Looks like it was stripped from the comments too.

Trackbacks (0)


Leave a comment  

To obtain technical support, please go to Software Support.
Name (required)*

Email (required; will not be displayed on this page)*

Your URL (optional)


Comment*