Would you take the OpenMP 3.0 related training described in this design document? Why or why not?

There are some new and exciting features in OpenMP 3.0 that are just hitting the big-name compilers (GNU, Microsoft, Intel, etc.). The biggest new feature in OpenMP 3.0 is TASKS!

I am looking at restructuring our existing OpenMP course to also incorporate how to use the task constructs in OpenMP 3.0.

Below is the instructional design document I think targets our objective, audience, etc -

BUT - this is only my idea.

Would you take the class described below (higher-level descriptions near the top of the doc; the actual questions and answers planned for the class near the bottom)? Would you consider incorporating the described 3 hours of lecture plus 1.5 hours of lab materials into your parallel programming curriculum? Why or why not?

OpenMP* 3.0 Programming

Instructional Design Document

Version: Internal OpenMP 3_0 Programming.doc

1. Course Name: OpenMP* 3.0 Programming

2. Writers: Bob Chesebrough, Jay Hoeflinger

3. Targeted availability: Q4 2008

4. Brief Course Description

Proposed Course Duration: 3 hours lecture, 1.5 hour of lab

During this hands-on module an experienced C/C++ programmer learns how to get started using OpenMP* directives to parallelize common functions and loops, thereby simplifying the introduction of threads in applications.

The first section of the module introduces you to the most common feature of OpenMP: work sharing for loops. The second section shows you how to exploit non-loop parallelism, including the new task constructs in OpenMP 3.0.

The final section discusses the usage of synchronization methods, library functions, and environment variables.

After successful completion of the course, participants will be able to modify C/C++ code to achieve parallelism using the new OpenMP 3.0 features, available in various compilers (from GNU*, Microsoft*, Intel, and others); this course, however, will focus exclusively on the Intel compiler for labs and demos.

5. Needs Analysis

OpenMP* 3.0 introduces a powerful new feature called tasks, which is an easy way to parallelize irregular problems in your code, such as unbounded loops, recursive algorithms, and producer/consumer code.

Tasks will simplify coding for developers, and this course provides a focused and effective introduction to tasks, as well as other powerful and more traditional OpenMP* features.

With one or two additional lines, programmers will now be able to parallelize patterns in their code that would otherwise have required 20 or so more lines of traditional, complicated explicit threading code.

This new power of OpenMP* greatly expands the effective usefulness of xxxxxxxxxx.

6. Subject Matter Experts (SMEs)

a. Jay Hoeflinger, Xinmin Tian, Bob Chesebrough, others

7. Learner Analysis

The ideal student for this module is an adult learner at a university who, in addition to exhibiting the learning characteristics of adult learners, also has the following traits:

  1. is a programmer in the C, C++, or Fortran compiled programming languages, with between 1 and 3 years (or equivalent) of programming experience in one or more of those three languages
  2. is at a beginner to intermediate programming level
  3. could be a freshman-, sophomore-, or junior-level programmer (advanced 1st, 2nd, or 3rd year of college)
  4. routinely writes simple sorting or computation programs (between 10 and 50 lines) from scratch in a day or less, with no difficulty whatsoever
  5. these short programs routinely compile with few or no problems, and the student is well able to resolve any problems to reach a successful compile
  6. if working with Linux code: is comfortable as a standard user of Linux; understands basic Linux file and directory permissions; is able to successfully compile and link Linux code; can identify and stop Linux system processes; is familiar with a favorite Linux shell (similar to the Bourne or C shells); understands basic Linux command-line commands such as ls, tar, sar, etc.
  7. if working with MS Windows* code: is comfortable using the process viewer; is familiar with the MS Visual Studio* development environment; is comfortable changing environment variables
  8. probably has no knowledge of compiler optimization strategies; may or may not have wondered whether the compiler could provide more benefit
  9. may or may not understand the need to quickly parallelize code that is largely serial

8. Context Analysis

The purpose of a Context Analysis is to identify and describe the environmental factors that inform the design of this course. These environmental factors include:

a. Media Selection

i. No Tapes, CDs, or DVDs are available or provided

ii. Electronic files are provided

1. Can be printed out for classroom use if desired

2. Lecture presentation is .PPT format

a. includes instructor notes

3. Lab Guide is .DOC format

a. includes all planned hands-on labs

b. Document is labeled Student Workbook

4. Instructor Guide

a. 5-10 pages

b. homework labs with solutions

c. classroom questions with answers

d. tips on teaching material

5. Suggested class binaries included in tar format

a. instructor or students can substitute their own binaries for suggested ones

b. substitution may be optimal, in particular if student is using code they wrote from scratch

b. Learning Activities

i. Lectures include optional demos for the instructor

ii. Hands-on labs for students

1. Labs are designed as student homework but can be also done during class time if preferred

iii. Class Q+A

c. Participant Materials and Instructor/Leader Guides

i. There is a short Lab Guide with this module

ii. There is a short Lecture presentation with this module

1. Minimal instructor notes are included in PPT Notes sections

iii. An archive of class binaries, if no customized or student binaries are available

d. Packaging and production of training materials

i. Materials are posted to the Multi-Core Courseware Content from Intel site, for worldwide use and alteration:


ii. Aside from typical programming courses, these materials would be well suited as modules for the following curricula:


• Parallel versions of all functions

• Algorithms for computational biology and similar courses with some emphasis on programming

• Programming for any applied science

• Courses already teaching Win32 or POSIX threading implementation models

e. Training Schedule

a. If taught in a 3-day faculty training setting, the materials may be reduced to accommodate time constraints

i. The full downloadable module is currently 3 hours of lecture and 1.5 hours of hands-on labs

1. For instructor-led faculty training this may be reduced to 1 hour of lecture and 1 hour of labs

ii. Class size is not restricted in any way by the course materials themselves:

1. Students require access to a recent multi-core system running a supported Linux OS

2. That system must have Intel software (VTune Analyzer, Compilers, TBB) or equivalent installed on it

3. Students require access to either instructor-provided or their own binaries of interest on that server

f. Other References

a. Parallel Programming in C with MPI and OpenMP, Michael J. Quinn. ISBN-10: 0072822562; ISBN-13: 978-0072822564.

b. Using OpenMP: Portable Shared Memory Parallel Programming, Barbara Chapman, Gabriele Jost, Ruud van der Pas, David J. Kuck. ISBN: 0262533022.

c. Parallel Programming in OpenMP, Rohit Chandra et al. ISBN-13: 978-1558606715.

d. OpenMP.org (for the OpenMP 3.0 specification)

9. Task Analysis

The relevant Job/Task Analysis for this material is defined by the Software Engineering Body of Knowledge (SWEBOK) and can be viewed in detail here:


The primary Bodies of Knowledge (BKs) used include, but are not limited to:

  • Software Design BK

    • Key issues in Software Design
    • Data persistence, etc.
  • Software Construction BK
    • Software Construction Fundamentals
    • Managing Construction
    • Practical Considerations (Coding, Construction Testing, etc.)

Relevant IEEE standards for relevant job activities include but are not limited to:

  • Standards in Construction, Coding, Construction Quality IEEE12207-95

(IEEE829-98) IEEE Std 829-1998, IEEE Standard for Software Test Documentation, IEEE, 1998.

(IEEE1008-87) IEEE Std 1008-1987 (R2003), IEEE Standard for Software Unit Testing, IEEE, 1987.

(IEEE1028-97) IEEE Std 1028-1997 (R2002), IEEE Standard for Software Reviews, IEEE, 1997.

(IEEE1517-99) IEEE Std 1517-1999, IEEE Standard for Information Technology-Software Life Cycle Processes- Reuse Processes, IEEE, 1999.

(IEEE12207.0-96) IEEE/EIA 12207.0-1996//ISO/IEC12207:1995, Industry Implementation of Int. Std. ISO/IEC 12207:95, Standard for Information Technology-Software Life Cycle Processes, IEEE, 1996.

10. Concept Analysis

Demonstrate how the compiler can take advantage of new Core 2 processor features.

  • Parallel vs. Serial construction and implementation
  • OpenMP* strategies:
    • Tasks
    • Work sharing
    • Synchronization
    • Library functions/APIs
  • Rapid parallelization of recursive Fibonacci, quicksort, and linked-list processing
  • Rapid parallelization of bounded loop-centric code
  • Parallelization of unbounded while loops
  • Using an optimizing compiler
    • Intel Compiler
    • GNU Compiler (gcc)
    • Microsoft Visual Studio (MS compiler)

11. Specifying Learning Objectives

Given course materials, hardware and software, students will examine sample function hierarchies, data structures/arrays, and loops operating on those arrays, and learn how to parallelize them using tasks or work sharing.

Students will study examples and will learn how to use simple techniques to parallelize:
1) for loops
2) unbounded while loops
3) parallel sections
4) processing linked lists
5) recursive functions.

Students will learn how to use simple scheduling clauses to improve parallel performance on irregular computation patterns, such as those found in Mandelbrot computations, to divide work among threads evenly.

12. Constructing Criterion Items

Q: What are the advantages of using OpenMP directives rather than using explicit threading to parallelize an application?

A: Any of the following could be considered valid answers:
1) Portability. OpenMP-based programs are source compatible and can be compiled on Windows, Linux, or Mac (and other) platforms; only a re-compile on the new platform is required. With explicit threads, syntax changes would generally be required.
2) Ease of implementation. The parallelism inherent in many applications can be leveraged with a few simple OpenMP pragmas added to the code. Explicit threading, particularly in the case of domain decomposition, requires a redesign of the region of interest.
3) Incremental approach to parallelism. Small sections of code can be parallelized with a pragma and then tested. If the application does not behave properly, all that is required for most OpenMP apps is to re-compile WITHOUT the /openmp compiler switch, and the original sequential behavior is restored. With explicit threads this is not possible, because whole sections of code are written differently for explicit threads than for the sequential version.

Q: In a few lines of added code (added openmp directives), demonstrate how to parallelize the following linked list processing code

node * p = head;
while (p) {
    processwork(p);
    p = p->next;
}

A: Below is one solution (extra { } added for clarity):

#pragma omp parallel
{
    #pragma omp single
    {
        node * p = head;
        while (p) {
            #pragma omp task firstprivate(p)
            processwork(p);
            p = p->next;
        }
    }
}

Q: In one or two lines of added code (added OpenMP directives), demonstrate how to parallelize the following pi calculation code using a work-sharing construct and a reduction clause.

static long num_steps = 100000; double step, pi;
void main()
{
    int i;
    double x, sum = 0.0;
    step = 1.0/(double) num_steps;
    for (i = 0; i < num_steps; i++) {
        x = (i + 0.5)*step;
        sum = sum + 4.0/(1.0 + x*x);
    }
    pi = step * sum;
    printf("Pi = %f\n", pi);
}

A: One solution:

static long num_steps = 100000; double step, pi;
void main()
{
    int i;
    double x, sum = 0.0;
    step = 1.0/(double) num_steps;
    #pragma omp parallel for private(i,x) reduction(+:sum)
    for (i = 0; i < num_steps; i++) {
        x = (i + 0.5)*step;
        sum = sum + 4.0/(1.0 + x*x);
    }
    pi = step * sum;
    printf("Pi = %f\n", pi);
}

Q: What are OpenMP tasks?

A: Tasks are independent units of work which may be deferred or may be executed immediately, at the discretion of the runtime system.

Q: What three main items is a task composed of?

A: Tasks are composed of: code to execute, a data environment, and internal control variables (ICVs).

Q: OpenMP tasks can be used to parallelize producer/consumer types of code, for example. Name one other type of code that can be parallelized with OpenMP tasks.

A: OpenMP tasks can be used to parallelize irregular problems in your code, such as unbounded loops, recursive algorithms, and producer/consumer code.

Q: What directive can be used to force a task to wait until all of its child tasks complete?

A: #pragma omp taskwait

Q: How can a developer indicate to an OpenMP program that a particular variable needs a separate copy for each thread executing the program?

A: Using the OpenMP private clause, or the OpenMP firstprivate clause (which also initializes each thread's copy from the original value).

Q: How can a developer indicate to an OpenMP program that a particular variable needs to be shared among the threads executing the program?

A: Using the OpenMP shared clause

Q: What issues can arise from stipulating that a variable is shared among threads?

A: Data race conditions can arise that may (or will) cause your application to operate incorrectly.

Q: What advantages are there in multi-threading an application?

A: 1) Improved user responsiveness, 2) improved performance

Q: What is a pragma?

A: An instruction to the compiler to compile your program in a certain way.

Q: What value should you receive when you call omp_get_num_threads if it is called outside a parallel region?

A: 1

Q: Assuming you've added OpenMP pragmas to your code, what compiler switch would you use to enable OpenMP?

A: -openmp or /Qopenmp

Q: Assuming you've added OpenMP pragmas to your code, what header file do you need to include to use the OpenMP library functions?

A: omp.h
Q: How can a developer determine if his application is utilizing all available cores?

A: Use a process monitor (like Windows Task Manager or Windows Perfmon) or a profiling tool such as the VTune Analyzer.

Q: How can a developer quickly analyze unfamiliar code to identify performance bottlenecks?

A: Use profiling tools (such as VTune Performance Analyzer) to identify performance bottlenecks using either sampling technology or call graph technology.

Q: How can a developer determine what portions of his application would get the greatest benefit from parallelism?

A: Use profiling tools (such as the VTune Performance Analyzer) to identify loops or functions that take the greatest percentage of execution time, and analyze these areas for parallelization opportunities.

Q: How can a developer determine if his threading implementation is efficiently using all cores (is the workload fairly balanced among the cores)?

A: Use a tool, such as the VTune Performance Analyzer's Thread view, to identify load imbalances. Visually inspect the thread viewer to determine whether all the threads are doing equal work or a small fraction of the threads is doing the majority of the work.

Q: How can I correct an obvious load imbalance in my threaded implementation of the Mandelbrot code?

A: Experiment with different OpenMP scheduling strategies (dynamic, guided, static) to mitigate thread imbalance and achieve the best performance.

13. Expert Appraisal: A Live Meeting capture of an SME demo walkthrough of the material will be available Dec 1, 2008. URL is TBD.

Materials will be reviewed for technical accuracy by Jay Hoeflinger, the chief architect of the OpenMP 3.0 spec. Materials will also be reviewed by one external academic reviewer.

14. Developmental Testing: alphas and betas of this material will be posted to the ISC content page by Nov 1, 2008.

IPA is targeted for Sept 24, 2008.

15. Production:

Blueprint target date: 6 weeks from now, Oct 30, 2008

Approval by PDT is required pending general availability

All materials will be posted to the Multi-Core Courseware Content from Intel site by Dec 3, 2008 (Target POR).
