Compiler Support for the Workqueuing Model Extends OpenMP*


Introduction

Intel's workqueuing model extends OpenMP* to parallelize a far greater range of applications.

Programs with irregular patterns of dynamic data structures or with complicated control structures like recursion are hard to parallelize efficiently. The workqueuing model allows the user to exploit irregular parallelism, beyond that possible with OpenMP*. The model is now an integral part of the Intel® C++ high-performance compiler, more particularly the OpenMP* parallelizer. In this article, I'll provide an overview of the workqueuing model and describe the multithreaded code generated by the compiler.


Workqueuing Model

The workqueuing model allows the user to parallelize control structures that are beyond the scope of those supported by the OpenMP model, while attempting to fit into the framework defined by OpenMP. The workqueuing model is a flexible mechanism for specifying units of work that are not pre-computed at the start of the worksharing construct.

For single, for, and sections constructs, all work units that can be executed are known at the time the construct begins execution. The workqueuing pragmas taskq and task relax this restriction by specifying an environment (the taskq) and the units of work (the tasks) separately. Many control structures exhibit the pattern of separated work iteration and work creation; these structures are naturally parallelized with the workqueuing model. Some common cases are C++ iterators, while loops, and recursive functions.
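As a concrete illustration of the recursive case, here is a minimal sketch of a tree traversal parallelized with the workqueuing pragmas. The names (Node, walk, double_all) are illustrative, not from the article; note also that a compiler without the Intel extension simply ignores the pragmas, so the code then runs sequentially with the same result:

```c
#include <stddef.h>

typedef struct Node { int value; struct Node *left, *right; } Node;

/* The traversal (work iteration) stays in the taskq block, while
   the independent per-node work (work creation) becomes a task. */
static void walk(Node *n)
{
    if (n == NULL) return;
    #pragma intel omp taskq
    {
        #pragma intel omp task captureprivate(n)
        {
            n->value *= 2;   /* independent per-node work */
        }
        walk(n->left);
        walk(n->right);
    }
}

void double_all(Node *root)
{
    #pragma intel omp parallel taskq shared(root)
    {
        walk(root);
    }
}
```

The captureprivate(n) clause gives each task its own copy of the node pointer at the moment the task is created, so the traversal can move on while the task waits to run.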

The taskq pragma specifies the environment within which the enclosed units of work (tasks) are to be executed. From among all the threads that encounter a taskq pragma, one is chosen to execute it initially. Conceptually, the taskq pragma causes an empty queue to be created by the chosen thread, and then the code inside the taskq block is executed in a single-threaded mode.

All the other threads wait for work to be enqueued on the conceptual queue. The task pragma specifies a unit of work, potentially executed by a different thread. When a task pragma is encountered lexically within a taskq block, the code inside the task block is conceptually enqueued on the queue associated with the taskq. The conceptual queue is disbanded when all work enqueued on it finishes, and when the end of the taskq block is reached.

To preserve sequential semantics, there is an implicit barrier at the completion of the taskq. The user is responsible for ensuring either that no dependencies exist or that dependencies are appropriately synchronized, either between the task blocks, or between code in a task block and code in the taskq block outside of the task blocks.
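When tasks do update shared state, the update itself must be guarded. A minimal sketch of that pattern (Item and list_sum are illustrative names; a standard OpenMP critical section guards the shared accumulator, and compilers without the Intel extension ignore the pragmas and run the loop sequentially, which yields the same sum):

```c
#include <stddef.h>

typedef struct Item { int value; struct Item *next; } Item;

/* Every task adds into the shared variable total, so the update is
   synchronized with a critical section; the list pointer p is
   captured privately per task, as in the article's main example. */
int list_sum(Item *p)
{
    int total = 0;
    #pragma intel omp parallel taskq shared(p, total)
    {
        while (p != NULL)
        {
            #pragma intel omp task captureprivate(p)
            {
                #pragma omp critical
                total += p->value;
            }
            p = p->next;
        }
    }
    return total;
}
```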


Workqueuing Constructs

The syntax, semantics, and clauses are designed to resemble OpenMP worksharing constructs:

#pragma intel omp taskq [clause[[,]clause]...]

structured-block

 

where clause can be:

private ( variable-list )

firstprivate ( variable-list )

lastprivate ( variable-list )

reduction ( operator : variable-list )

ordered

nowait

#pragma intel omp task [clause[[,]clause]...]

structured-block

 

where clause can be:

private ( variable-list )

captureprivate ( variable-list )

 

The clauses have meanings similar to those of the OpenMP worksharing constructs. The only new clause is captureprivate, which is like firstprivate except that the original objects referenced in the construct retain their values. Note that lastprivate, firstprivate, and private objects of a taskq are also implicitly captureprivate on each enclosed task.
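The effect of enqueue-time capture can be sketched in plain C (record_t, run_task, and capture_demo are illustrative names, mimicking the per-task record the compiler generates):

```c
/* Plain-C analogue of captureprivate: the value the variable holds
   when the task is created is copied into a per-task record, so a
   later update of the original variable is not seen by the task. */
typedef struct { int p; } record_t;

static int run_task(const record_t *t)
{
    return t->p * 10;
}

int capture_demo(void)
{
    int p = 1;
    record_t t = { p };   /* value of p captured at creation time */
    p = 2;                /* subsequent update does not reach the task */
    return run_task(&t);  /* computes 1 * 10 */
}
```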

One can also combine the OpenMP parallel construct with taskq:

#pragma intel omp parallel taskq [clause[[,]clause]...]

structured-block

 

where clause can be:

if ( scalar-expression )

num_threads ( integer-expression )

copyin ( variable-list )

default ( shared | none )

shared ( variable-list )

private ( variable-list )

firstprivate ( variable-list )

lastprivate ( variable-list )

reduction ( operator : variable-list )

ordered

 


Compiler Implementation

We tightly integrated the implementation of the workqueuing model into the compiler's existing OpenMP framework, reusing code as much as possible. As a result, the back-end support was completed in merely three man-weeks.

Most of the new code went into the multithreaded code generator that lowers workqueuing constructs into equivalent IL0 (the compiler's intermediate representation) statements and calls to the Intel® OpenMP run-time library. We also ensured that the compiler technologies that we had already developed for OpenMP are equally applicable to the new code.


Code Generation Example

Let's see how the compiler generates multithreaded IL0 code given the while loop below:

void test1(LIST p)
{
    #pragma intel omp parallel taskq shared(p)
    {
        while (p != NULL)
        {
            #pragma intel omp task captureprivate(p)
            {
                do_work1(p);
            }
            p = p->next;
        }
    }
}

 

The parallel taskq pragma specifies an environment for the while loop in which to enqueue the units of work specified by the enclosed task pragma. The loop's control structure and the enqueuing are executed single-threaded, while the other threads in the team participate in dequeuing the work from the taskq queue and executing it. The captureprivate clause ensures that a private copy of the link pointer p is captured at the time each task is enqueued, thus preserving the sequential semantics.

First, the compiler's front-end generates an IL0 representation of the workqueuing code as shown below, where the while loop has been lowered into if and goto statements, and each workqueuing pragma has been converted into a pair of IL0 begin/end directives that define the boundaries of the construct:

void test1(p)
{
    DIR_PARALLEL_TASKQ SHARED(p)
    if (p != 0)
    {
L1:
        DIR_TASK CAPTUREPRIVATE(p)
        do_work1(p)
        DIR_END_TASK
        p = p->next
        if (p != 0)
        {
            goto L1
        }
    }
    DIR_END_PARALLEL_TASKQ
    return
}

 

Next, the OpenMP back-end creates new data structures to handle private and shared variables. Two struct types are defined for each taskq construct. The first one, shared_t, holds the pointers to its shared, firstprivate, lastprivate, and reduction variables.

The other struct, thunk_t, has a pointer to shared_t, in addition to fields that hold the private copies of variables listed in the taskq as private, firstprivate, lastprivate, captureprivate, or reduction. At compile-time, a pointer to each struct is created, while the actual objects they point to are instantiated at run-time by invoking workqueuing library routines.

For this example, the symbol table entries generated are as follows:

typedef struct shared_t {
    ...                    // fields for internal use
    p_ptr;                 // pointer to shared variable p
};
auto struct shared_t *shareds;

typedef struct thunk_t {
    ...                    // fields for internal use
    struct shared_t *shr;  // := shareds
    p;                     // captureprivate p
};
auto struct thunk_t *taskq_thunk;

 

In this example, the automatic pointers shareds and taskq_thunk are allocated outside of the taskq's threaded entry. In addition to the private variables, thunk_t holds enough information about the code and data of a taskq so that its execution, if suspended due to a full queue, can be resumed later.

Finally, the OpenMP back-end lowers the IL0 directives into multithreaded IL0 code that explicitly calls Intel OpenMP run-time library routines to enqueue/dequeue the tasks and to manage and synchronize the threads. As a result of the code transformations, three threaded entries, or T-entries¹, are inserted into the original function test1(): test1_par_taskq() corresponds to the semantics of the "parallel" portion of the combined parallel taskq pragma; test1_taskq() corresponds to the "taskq" portion of the combined pragma; and nested within it is test1_task(), which corresponds to the enclosed task construct:

void test1(p)
{
    __kmpc_fork_call(test1_par_taskq, &p)
    goto L3

    T-entry test1_par_taskq(p_ptr)
    {
        ...
    }

    T-entry test1_taskq(taskq_thunk)
    {
        ...
        T-entry test1_task(task_thunk)
        {
            ...
        }
        ...
    }

L3:
    return
}

 

The Intel OpenMP library routine __kmpc_fork_call() invoked in test1() creates a team of threads at run-time. All threads execute test1_par_taskq(), but only one thread proceeds to execute test1_taskq(), which is a skeletal version of the while loop whose main purpose is to enqueue, on every iteration, the work specified in test1_task(). All the other threads become worker threads that dequeue the tasks and execute them.

Let's look at these T-entries, starting with test1_par_taskq():

T-entry test1_par_taskq(p_ptr)
{
    taskq_thunk = __kmpc_taskq(test1_taskq, 4, 4, &shareds)
    shareds->p_ptr = p_ptr
    if (taskq_thunk != 0)
    {
        test1_taskq(taskq_thunk)
    }
    __kmpc_end_taskq(taskq_thunk)
    T-return
}

 

The parameter p_ptr is a pointer to the shared variable p. All threads call the library routine __kmpc_taskq() to instantiate shared_t (in shareds) and to attempt to instantiate thunk_t (for taskq_thunk), but only one thread succeeds in that attempt, and only it proceeds to call the T-entry test1_taskq(). The other threads receive a null taskq_thunk, fail the test, and call __kmpc_end_taskq() to become worker threads. The T-entry test1_taskq() is shown below:

T-entry test1_taskq(taskq_thunk)
{
    if (taskq_thunk->status == 1)
    {
        goto L2
    }
    if (*(taskq_thunk->shr->p_ptr) != 0)
    {
L1:
        task_thunk = __kmpc_task_buffer(taskq_thunk, test1_task)
        task_thunk->p = *(taskq_thunk->shr->p_ptr)
        if (__kmpc_task(task_thunk) != 0)
        {
            __kmpc_taskq_task(taskq_thunk, 1)
            T-return
        }
        ...
L2:
        *(taskq_thunk->shr->p_ptr) = (*(taskq_thunk->shr->p_ptr))->next
        if (*(taskq_thunk->shr->p_ptr) != 0)
        {
            goto L1
        }
    }
    __kmpc_end_taskq_task(taskq_thunk)
    T-return
}

 

The main purpose of test1_taskq() is to enqueue each task as it runs through the loop's control structure. The loop's test and its pointer update are carried over from the original code, with access to the shared p accomplished through the expression *(taskq_thunk->shr->p_ptr).

Before enqueuing the task for an iteration, the library routine __kmpc_task_buffer() is called to create task_thunk and initialize it based on taskq_thunk. The address of the T-entry test1_task() is also stored in task_thunk so that the worker thread that dequeues it can run the task. To satisfy the captureprivate semantics, the value of the shared variable p is copied into the task's private copy of p stored in its task_thunk.

The actual enqueuing of a task is done by the library routine __kmpc_task(). A return value of zero means that the queue is not full, allowing the next task to be enqueued. A non-zero return value indicates that the queue is becoming full; the execution of the taskq is suspended (the thread becomes a worker thread) and will be resumed later, potentially by a different thread.

For this purpose, the library routine __kmpc_taskq_task() enqueues the taskq_thunk itself. Later, the worker thread that dequeues it will resume execution of the taskq at the label L2. To accomplish this, a jump-table value (the integer 1 in the example) is passed in as an argument to __kmpc_taskq_task().

This value is stored in taskq_thunk and must uniquely identify this call site, because there may be several tasks enclosed in the same taskq. This value must also be non-zero, because zero is reserved for the first execution of the taskq.

Based on this value, the jump table at the top of test1_taskq() determines whether the current execution is the first one or a continuation of a suspended run, and transfers execution to the correct location accordingly.
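The dispatch itself amounts to a small switch at the top of the T-entry. A hedged sketch of the idea (resume_t and dispatch are illustrative names, not the compiler's):

```c
typedef struct { int status; } resume_t;

/* Status 0 selects the first execution of the taskq; each non-zero
   value resumes a suspended run at the label recorded for its call
   site. Returns which path was taken, purely for illustration. */
int dispatch(resume_t *t)
{
    int path;
    switch (t->status) {
    case 1: goto L2;   /* continuation of a suspended run */
    default: break;    /* status 0: first execution */
    }
    path = 0;          /* ...enqueue the first task here... */
    goto done;
L2:
    path = 1;          /* ...advance loop control, keep enqueuing... */
done:
    return path;
}
```

A taskq with several enclosed tasks would have one case label per suspension site, which is why each jump-table value must uniquely identify its call site.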

After a worker thread dequeues a task_thunk, it extracts the address of test1_task() and proceeds to execute the task contained. This T-entry is shown below:

T-entry test1_task(task_thunk)
{
    do_work1(task_thunk->p)
    T-return
}

 

 

¹ Conceptually, a T-entry is similar to a function entry, despite some subtle differences.


Conclusion

The workqueuing model is Intel's simple yet powerful extension to OpenMP that greatly increases the range of applications that can be parallelized with OpenMP. Please refer to http://www.caspur.it/ewomp02/PAPERI/EWOMP02-03.pdf* for more information on this topic, including a more detailed description of the compiler implementation and the run-time libraries, as well as performance results.


Additional Resources

Related Articles

Rich Gerber's three-part OpenMP* series:

 



For more complete information about compiler optimizations, see our Optimization Notice.