Introduction to Parallel Programming
Version: October 17, 2008 v.2
1. Module Name: Introduction to Parallel Programming
2. Writers: CAT Staff
3. Targeted availability: [Intel internal date] (Coincident with Faculty Training 2.1)
4. Brief Module Description
Proposed duration: two hours
This lecture-only module is the briefest possible introductory survey of parallel programming, providing a lexicon of essential vocabulary and important basic concepts. For learners who have little or no exposure to parallelism in compiled programming languages, it is a focused, accelerated way to provide a much-needed prerequisite foundation for further study.
5. Needs Analysis
For decades, the vast majority of software developers, whether trained in universities or in industry, have learned to program sequentially. With the recent and ubiquitous release of many-core processors, the software industry and academia are facing a major paradigm shift. To continue to teach only sequential programming and design going forward is really to teach the history of computer science.
All developers going forward in the 21st century must have a working vocabulary of parallel design and implementation concepts well within their grasp; further, they need practical, hands-on experience with these powerful parallel platforms.
Why is the hardware industry adopting multi-core processors now, and why are we taking this approach to modern CPU design? Simply put, a significant thermal barrier has been reached. In the past, scaling CPU frequencies ever higher was the primary way to increase performance. But ramping CPU frequency came at a cost: the power consumption of the hardware grew roughly as the square of the frequency. Continued ramping of CPU frequency is therefore no longer sustainable.
Now, instead, performance improvements must be achieved by taking advantage of multiple cores running at a reasonable frequency. This translates into the need for software developers to be trained in various methods of parallel design, implementation, and testing.
This new generation of software designers and developers needs to be trained to take advantage of parallelism. This course lays down a foundation of concepts and courseware that faculty can take back to their university curricula to generate up-to-date materials to train this new generation.
6. Subject Matter Experts (SMEs): [Intel Confidential List]
7. Learner Analysis
The ideal student for this module is an adult learner at a university, who in addition to exhibiting the learning characteristics of adult learners, has also the following traits:
- Should have between 1 and 3 years (or equivalent) of programming experience in a compiled language such as C/C++. For attendees with little C/C++ experience, 1-3 years with Java or C# will suffice.
- Routinely develops short algorithmic modules that are integrated into a larger application without difficulty
- These modules routinely compile with few or no problems, and the student is able to resolve any issues to achieve a successful compile
- Working with MS Windows* code: is comfortable using the process viewer; is familiar with the Microsoft Visual Studio* development environment; is comfortable changing environment variables
- May already be actively seeking ways to use current available resources more effectively to solve problems related to better software performance.
- Has the ability to learn from a lecture/discussion environment alone.
- Has an ability to generalize from examples.
- Demonstrates a willingness to tackle a difficult concept and deal with complexity.
- May or may not have an understanding of the issues of parallel programming, but is at least familiar with one concurrent programming method
- Currently instructs or plans to instruct adult students who fit the learner description earlier in this section
- Is currently using a successful programming curriculum, or intends to create or teach one soon
8. Context Analysis
The purpose of a Context Analysis is to identify and describe the environmental factors that inform the design of this module. Environmental factors include:
- Media Selection: the lecture presentation will be in Microsoft* PowerPoint* format, including speaker notes and references to more detailed content. Since there are no labs in this module, no lab guide or document will be provided.
- Learning Activities: Lecture-only presentation; discussion of all content and included analysis is encouraged between students and between students and instructor. Thought experiments, other classroom activities or exercises as warranted.
- Participant Materials and Instructor/Leader Guides: Instructor notes are included in the PowerPoint Notes sections. A recorded presentation and lecture notes for the slides, narrated by the course author, will be made available to internal Intel instructor candidates (and may be made available to external academics through the Intel Academic Community website).
- Streaming video of expert delivery posted to web; transcript of expert delivery included with .ppt
- Packaging and production of training materials: Materials are posted to Intel Academic Community WIKI website, for worldwide use and alteration. Archived version sent to Courses Available web site.
Training Schedule: The module is two (2) hours of lecture.
Class size is not restricted in any way by the course materials.
References: Quinn, Michael J., Parallel Programming in C with MPI and OpenMP, McGraw-Hill, 2003.
9. Task Analysis
The relevant Job/Task Analysis for this material is defined by the Software Engineering Body of Knowledge (SWEBOK).
The primary Bodies of Knowledge (BKs) used include, but are not limited to:
- Software Design BK
  - Key issues in Software Design (Concurrency)
  - Data persistence, etc.
- Software Construction BK
  - Software Construction Fundamentals
  - Managing Construction
  - Practical Considerations (Coding, Construction Testing, etc.)
IEEE standards relevant to these job activities include, but are not limited to:
- Standards in Construction, Coding, Construction Quality IEEE12207-95
- (IEEE829-98) IEEE Std 829-1998, IEEE Standard for Software Test Documentation, IEEE, 1998.
- (IEEE1008-87) IEEE Std 1008-1987 (R2003), IEEE Standard for Software Unit Testing, IEEE, 1987.
- (IEEE1028-97) IEEE Std 1028-1997 (R2002), IEEE Standard for Software Reviews, IEEE, 1997.
- (IEEE1517-99) IEEE Std 1517-1999, IEEE Standard for Information Technology-Software Life Cycle Processes- Reuse Processes, IEEE, 1999.
- (IEEE12207.0-96) IEEE/EIA 12207.0-1996//ISO/IEC12207:1995, Industry Implementation of Int. Std. ISO/IEC 12207:95, Standard for Information Technology-Software Life Cycle Processes, IEEE, 1996.
10. Concept Analysis
- Serial vs. parallel computing
- Domain and Task Decomposition
- Shared-memory Programming Model, Threads
- OpenMP and its scheduling clauses
  - static, dynamic, guided
- Race condition; lock; semaphore; mutex; critical section; deadlock
- Loop Transformations
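As a concrete illustration of the OpenMP scheduling clauses listed above, a minimal parallel loop might look like the following (the function name is ours; compiled without OpenMP, the pragma is simply ignored and the loop runs serially with the same result):

```c
/* Sum 1..n in parallel. schedule(static) gives each thread a fixed,
   contiguous chunk of iterations; dynamic and guided instead hand out
   chunks at run time. reduction(+:total) gives each thread a private
   copy of total and combines the copies at the end. */
long sum_first(int n) {
    long total = 0;
    #pragma omp parallel for schedule(static) reduction(+:total)
    for (int i = 1; i <= n; i++)
        total += i;
    return total;
}
```

Changing schedule(static) to schedule(dynamic) or schedule(guided) alters only how iterations are distributed among threads, not the result.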
11. Specifying Learning Objectives
- Using the vocabulary and examples from the training, define parallel computing and its importance in modern software code
- Describe domain decomposition and describe at least 2 situations that are best suited for its use
- Describe task decomposition and describe at least 2 situations that are best suited for its use
- Describe pipelining
- Using the vocabulary and examples from the training, describe the features of a shared-memory programming model
- Describe dependency analysis and its role in identifying parallelism
- Provide a detailed overview of how threads can be used to implement domain and task decomposition, as well as pipelining.
- Describe the roles of shared and private variables in a shared-memory programming model.
- Describe the role and importance of OpenMP.
- Using the vocabulary and examples from your own code, define the following: race condition, locks, semaphore, mutex, critical section, and deadlock
12. Constructing Criterion Items
Q: What is parallel computing?
A: Parallel computing is the simultaneous use of more than one set of instructions, processes, threads, or processors (or all of these) to execute a program. Said another way, multiple parts of the work to be done are performed at the same time, instead of sequentially, to reduce the overall execution time of the code.
Q: To implement parallelism in your code, what must you do first?
A: Study your code to identify independent operations or activities that can be performed simultaneously, and then balance the computational load while minimizing redundant computations and unneeded synchronization events.
Q: Three ways of implementing parallelism are discussed in class. What are those ways?
A: Domain decomposition, task decomposition, and pipelining.
Q: Define domain decomposition, task decomposition, and pipelining.
A: Domain decomposition divides the data elements of your code into pieces and then assigns threads to the pieces. Task decomposition divides the tasks that your code performs among the available computing resources. Pipelining uses the output of one function as the direct input of the next function.
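The domain-decomposition case can be sketched with OpenMP (a minimal example; the function name is ours): the array is divided into pieces, and each thread scales the elements of its piece independently.

```c
/* Domain decomposition: the data (an array) is divided into pieces,
   and each thread works on its own piece. Compiled without OpenMP,
   the pragma is ignored and the loop simply runs serially. */
void scale_array(double *data, int n, double factor) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        data[i] *= factor;
}
```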
Q: When parallelizing your serial code, when is it best to use the domain decomposition methodology?
A: If your program contains repetitive tasks that are performed on large data sets.
Q: When parallelizing your serial code, when is it best to use the task decomposition methodology?
A: If your program contains many tasks, and those tasks are independent of one another.
Q: When parallelizing your serial code, when is it best to use the pipeline model?
A: If your program contains many sequential tasks, where the output of one task becomes the input for the next.
Q: What is meant by the phrase shared-memory model?
A: A shared-memory model enables multiple program activities to access a shared primary memory location on the computer. This sharing allows activities to communicate and synchronize with each other. Further, some parts of a parallel program may require exclusive access to certain data elements; this model allows you to privatize such data elements to provide that exclusive access.
Q: What is meant by synchronization?
A: Synchronization is the mechanism that prevents simultaneous and therefore conflicting updates to shared information, which ensures data reliability and integrity.
Q: What is a process?
A: A process is an instance of a computer program in some state of execution. The operating system is responsible for creating and managing processes on the system.
Q: What are two kinds of processes?
A: User-defined processes, and system-defined processes.
Q: What is a thread?
A: A thread (or thread of execution) is a way for a program to fork (or split) work to be done into two or more simultaneously running tasks. Further, a thread is a unit of code execution within a process.
Q: Which requires less system overhead, threads or processes?
A: Threads. They are created more quickly and can interact with each other with less overhead. (Referencing shared-memory locations takes less time than sending messages.)
Q: What are some advantages of the Fork/Join Model for Threading?
A: Only the master thread creates additional threads, which reduces the need for synchronization during program execution; the model supports incremental parallelization; and less code rewriting is typically required.
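The fork/join pattern itself can be sketched with POSIX threads (a minimal illustration; the worker function and its names are ours): the master thread forks one worker, then joins it before using the result.

```c
#include <pthread.h>

/* Worker: squares the integer its argument points to. */
static void *square(void *arg) {
    int *p = (int *)arg;
    *p = *p * *p;
    return NULL;
}

/* Fork one worker thread, then join it before reading the result. */
int run_worker(int x) {
    pthread_t tid;
    int value = x;
    pthread_create(&tid, NULL, square, &value);  /* fork */
    pthread_join(tid, NULL);                     /* join */
    return value;
}
```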
Q: What is a shared variable? What is a private variable?
A: A shared variable is one that can be accessed by each thread within a process. A private variable is one that is specific to an individual thread (no other thread can access the private variable).
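The shared/private distinction maps directly onto OpenMP: a minimal sketch (function name is ours) in which the accumulator is shared by all threads, while a temporary declared inside the loop body is private to each thread.

```c
/* 'total' is shared by all threads (with a reduction to avoid a race);
   'tmp' is declared inside the parallel loop, so each thread gets its
   own private copy. Without OpenMP the loop runs serially. */
double dot(const double *x, const double *y, int n) {
    double total = 0.0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; i++) {
        double tmp = x[i] * y[i];  /* private to each thread */
        total += tmp;
    }
    return total;
}
```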
Q: What is OpenMP?
A: OpenMP is an API for shared-memory parallel programming consisting of a set of compiler directives, library routines, and environment variables that are used with a shared-memory computer to specify shared-memory parallelism. The OpenMP standard was introduced in 1997 as a Fortran-based standard for writing portable, multithreaded applications; the standard for C and C++ first appeared in 1998.
Q: When using OpenMP, how do you add parallelism to your source code?
A: By adding the required pragmas to your source code, and, if needed, by using a set of OpenMP runtime routines.
Q: What is a pragma?
A: A pragma is a C/C++ compiler directive, or instruction. OpenMP pragmas typically tell the compiler to parallelize the code immediately following the pragma.
Q: What is a typical OpenMP pragma?
A: All OpenMP pragmas begin with #pragma omp. A typical example is #pragma omp parallel for, which parallelizes the loop that immediately follows.
Q: OpenMP runtime routines can be used to set and retrieve environment information. Give an example of a header file that you might include in your project to use these runtime routines.
A: omp.h (included with #include <omp.h>).
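A minimal sketch of calling an OpenMP runtime routine, guarded so that it also compiles when OpenMP is disabled (the fallback value of 1 is our choice):

```c
#ifdef _OPENMP
#include <omp.h>   /* declares the OpenMP runtime routines */
#endif

/* Report the maximum number of threads OpenMP may use; if the code
   is compiled without OpenMP support, fall back to 1. */
int max_threads(void) {
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}
```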
Q: What makes OpenMP well-suited for implementing parallel programming?
A: It is available for a wide variety of computer platforms and compilers; it is available for UNIX as well as Windows; it enables scalability of applications through shared-memory parallel programming; it simplifies fork/join programming, making it well suited for domain decomposition; and it provides a widely supported, cross-platform API.
Q: Is the following statement true or false? Explain your answer.
OpenMP does not check for errors such as deadlocks or race conditions.
A: True. Software analysis tools are required to check for these and other errors.
Q: What is a race condition?
A: Nondeterministic behavior of a program that results from multiple threads accessing a shared variable without synchronization. If multiple threads attempt to write to a shared variable, the program can yield incorrect output. Programs with race conditions may still execute correctly on trivial data sets with a small number of threads; erroneous output may appear only as the number of threads and the execution time increase.
Q: How can you prevent a race condition in your code?
A: By ensuring that only one thread at a time references and updates shared data.
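This prevention strategy can be sketched with an OpenMP critical section (a minimal example; the function name is ours): each thread accumulates into its own private partial sum, and only the final update of the shared total is serialized.

```c
/* Sum of squares 1..n. Each thread keeps a private partial sum;
   #pragma omp critical ensures only one thread at a time updates the
   shared total, preventing a race condition. Without OpenMP the
   pragmas are ignored and the code runs serially. */
long sum_squares(int n) {
    long total = 0;
    #pragma omp parallel
    {
        long partial = 0;
        #pragma omp for
        for (int i = 1; i <= n; i++)
            partial += (long)i * i;
        #pragma omp critical
        total += partial;
    }
    return total;
}
```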
Q: What is meant by synchronization, or mutual exclusion?
A: Using a locking mechanism of some kind to ensure that only one thread at a time references and updates shared data.
Q: What is a lock?
A: A lock is a shared synchronization variable whose status indicates whether a region of code or data is in use by a thread; other threads must wait until the lock is released before accessing it.
Q. What is a critical section?
A: A region of code that threads execute in a mutually exclusive manner. Because critical sections incur synchronization overhead, they should be kept as small as possible.
Q: Do Win32 and POSIX APIs provide a set of locks that enable you to preempt race conditions in a program using the general threading model?
A: Yes. These locks include semaphore, signal, mutex, and critical section.
Q: What is a semaphore?
A: A synchronization object that has an associated count. The count is used to limit access to a particular critical section to a specified number of threads. A semaphore can synchronize two or more threads from multiple processes.
Q: What is a mutex?
A: Mutual exclusion object. An object that is created so that multiple program threads can take turns sharing the same resource, but not simultaneously.
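A minimal POSIX mutex sketch (the function and variable names are ours): threads take turns updating a shared counter, but never simultaneously.

```c
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long counter = 0;

/* Only one thread at a time may hold the mutex, so the increment of
   the shared counter cannot be interleaved with another thread's. */
void add_one(void) {
    pthread_mutex_lock(&lock);
    counter += 1;
    pthread_mutex_unlock(&lock);
}

long get_counter(void) { return counter; }
```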
Q: What is a critical section?
A: The Windows critical section is an intra-process mutex. It synchronizes threads within the same process and is faster than a semaphore or a mutex.
Q: What is a deadlock?
A: A deadlock is a situation in which a thread or process blocks waiting for a condition that will never become true, for example when two threads each hold a lock that the other needs.
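A common way to avoid deadlock, acquiring all locks in one fixed global order so that no circular wait can form, can be sketched with POSIX mutexes (a minimal illustration; the names are ours):

```c
#include <pthread.h>

static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

/* Every caller takes lock_a before lock_b; because no thread ever
   holds lock_b while waiting for lock_a, a circular wait (and thus
   a deadlock) cannot occur. */
void update_both(long *x, long *y) {
    pthread_mutex_lock(&lock_a);
    pthread_mutex_lock(&lock_b);
    *x += 1;
    *y += 1;
    pthread_mutex_unlock(&lock_b);
    pthread_mutex_unlock(&lock_a);
}
```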
Q: What is a dependence graph?
A: A graph in which nodes represent tasks or statements and edges represent the dependences between them. It is used to identify possible parallel regions in a program and to determine whether task or domain decomposition is feasible for a given region of code.
13. Expert Appraisal: Live meeting capture of a SME demo walkthrough of material will be available by [Intel internal date].
Materials will be reviewed by [Intel internal date]. Materials will be reviewed by at least one external Academic reviewer via ISC Curriculum Advisory Council.
14. Developmental Testing: alpha and beta versions of this material posted to the ISC content page by [Intel listed date]
[Intel Internal Dates]