Beta Course Design Document: Practical Hands-on Architecture for Applications Programmers -- YOUR FEEDBACK WELCOMED!

Hi All,

Below please find a beta version of the Course Design Document for a short module that introduces Intel Core 2 architecture from a software developer's perspective. ALL FEEDBACK IS FAIR GAME! Please read and comment freely.



Practical Hands-on Architecture for Applications Programmers

Course Design Document

Version: CD_AAP_BPR_1_0.doc

1. Course Name: Practical Hands-on Architecture for Applications Programmers

2. Writer: [Intel Confidential list]

3. Targeted availability: Q2 2008 (TBD)

4. Brief Course Description

Proposed Course Duration: 3 hours lecture, and 3 hours of homework

This new three-hour architecture module is written specifically with experienced software developers in mind. It demonstrates how developers can take advantage of the hardware features already at their disposal by using different and innovative programming techniques, compiler optimizations, or both. The homework is designed as hands-on labs that unify the course objectives with the students' own code.

The module interleaves compiler and performance analyzer topics with essential architecture topics. General architecture topics include exploiting the architecture by taking advantage of: SSE (how to get instruction-level parallelism in a single core); multi-core architecture; program organization via cache utilization; efficient data structures and loop performance; and other topics important to best-performance computing.

After successful completion of the course lecture and lab materials, the student will be able to articulate the importance of many-core programming. Further, students will be able to identify the number of cores their application is using and how efficiently it is using those cores, and to modify places in their own code to take advantage of their platform's feature set.

5. Needs Analysis

Until the recent release of many-core processors, programmers have been taught to program generally within higher levels of abstraction: simply put, they didn't need to care about the hardware they were writing to. In the past, programmers have been taught to pay little attention to their compilers and operating systems, focusing instead on their own applications. In this approach the programmer relied on the existing abstraction layers to generically handle the extant hardware.

Now in the 21st century, there have been many hardware changes with which the professional lower-level software writer has not yet caught up. In practice, this includes virtually all programmers.

Why is the hardware industry taking this approach to modern CPU design? Simply put, the thermal barrier has been reached. In the past, higher performance was reached by scaling CPU frequencies higher and higher. Now, instead, performance improvements must be achieved by successfully implementing various parallelization strategies.

This has resulted in a very large gap between the richness of features offered by the hardware and the capability of most developers to fully take advantage of that rich feature set. The business opportunity here is to inform a subset of budding developers who can then influence the industry's lower tier of software, so that they can more easily bring the currently hidden hardware feature set to a broader, larger programming community.

Additionally, the current list of CPU product differentiators at large includes faster clock speeds, more cores, and better power management, to name a few. If the bulk of the programming community (including the lower-level software developers) doesn't gain any advantage with 8 cores over 1 or 2 cores, then there is no further perceived benefit and the sales value proposition crumbles. This must be avoided.

6. Subject Matter Experts (SMEs)

Subject Matter Experts who contributed material and time to the content of this courseware including structural, language, and technical edits are:

a. [Intel Confidential list]

Important notes about SMEs

An SME is a difficult person to locate. By definition, an SME is able to:

Demonstrate/coach any performance to a mastery level

Determine what is necessary and sufficient to reach mastery

Explain the importance of any task or topic important to mastery

Identify training deficiencies that can solve a training problem

Show/explain contextual needs of the learning task or topic

Identify any relevant attributes of any important concepts

Articulate the levels of mastery and standards needed to solve a training problem

7. Learner Analysis

Learner characteristics influence the very core of class design, including the choices of language and terminology, prerequisites, selected learning activities, how to gain learner feedback and how to evaluate the learners, the format of the instructional materials, the training schedule itself, and the choices of media and instructional equipment, to name a few.

The ideal student for this module is an adult learner at a university who, in addition to exhibiting the learning characteristics of adult learners, also has the following traits:

  • An experienced programmer in the C/C++ or Fortran compiled programming languages, with between 1 and 5 years (or equivalent) of programming experience in one or more of those three languages
    • Could be an advanced freshman, sophomore, or junior level programmer (advanced 1st, 2nd, or 3rd year of college)
  • A programmer who routinely writes simple sorting or computation programs (between 10 and 100 lines) from scratch in a day or less, with no difficulty whatsoever
    • These short programs routinely compile with few or no problems, or the student is well able to solve the problems to reach a successful compile
  • A programmer who may or may not have an interest in writing compilers, libraries, operating systems, or web services
  • May or may not have previous experience with application profiling tools
  • If working with Linux code: is comfortable as a standard user of Linux; understands basic Linux file and directory permissions; is able to successfully compile and link Linux code; can identify and stop Linux system processes; is familiar with a favorite Linux shell (similar to Bourne or C shells); understands basic Linux command line commands such as ls, lf, tar, sar, etc.
    • Does not require an understanding of X Windows
  • If working with MS Windows* code: is comfortable using the process viewer; is familiar with the MS Visual Studio Developer Environment; is comfortable changing environment variables
  • May or may not have knowledge of compiler optimization strategies; may or may not have wondered if the compiler could provide more benefit
  • May have expressed an interest in learning more about all the resources that current microarchitecture can provide to achieve tangible performance improvements in their code; is not content to write a program on an 8-core machine and have only one core do all the work while 7 cores sit idle
  • May or may not have a general or vague curiosity about how the CPU works
  • May have already tried to improve on 30 frames per second with regard to graphics output, and is able to improve that rate when necessary based on current knowledge
  • May already be actively seeking ways to use currently available resources more effectively to solve even more challenging problems

a. Special notes for Train the Trainer learners/attendees

Train the Trainer (TTT) attendees are special cases: they likely have more experience than the usual target audience for this class, and they have the immediate goal of teaching this class in a live classroom environment with targeted students.

Ideal TTT candidates for this material have the following traits:

i. Currently instruct or plan to instruct adult students who fit in the learner description earlier in this section

ii. Currently using a successful programming curriculum, or intend to soon create or teach one

iii. NOTE: There are no inherent limitations for instructors based on the experience or lack of it with regard to these objectives and content

Further, the course materials will use Intel software tools to easily illuminate important concepts, but those concepts can be explained and exploited using many other tools. This training is NOT just for Intel tools: for example, any modern compiler that is many-core aware could be substituted.

8. Context Analysis

The purpose of a Context Analysis is to identify and describe the environmental factors that inform the design of this course. These environmental factors include:

a. Media Selection

i. No Tapes, CDs, or DVDs are available or provided

ii. Electronic files are provided

1. Can be printed out for classroom use if desired

2. Lecture presentation is .PPT format

a. includes instructor notes

3. Lab Guide is .DOC format

a. includes all planned hands-on labs

b. Document is labeled Student Workbook

4. Instructor Guide

a. 5-10 pages

b. homework labs with solutions

c. classroom questions with answers

d. tips on teaching material

5. Suggested class binaries included in tar format

a. instructor or students can substitute their own binaries for suggested ones

b. substitution may be optimal, in particular if student is using code they wrote from scratch

b. Learning Activities

i. Lectures include optional demos for the instructor

ii. Hands-on labs for students

1. Labs are designed as student homework but can be also done during class time if preferred

iii. Class Q+A

c. Participant Materials and Instructor/Leader Guides

i. There is a short Lab Guide with this module

ii. There is a short Lecture presentation with this module

1. Minimal instructor notes are included in PPT Notes sections

iii. An archive of class binaries, if no customized or student binaries are available

d. Packaging and production of training materials

i. Materials are posted to Intel Curriculum Wiki, for worldwide use and alteration:

ii. Aside from typical programming courses, these materials would be well suited as modules for the following curricula:

introduction to digital systems

introduction to algorithms: statistics

parallel versions of all functions

algorithms for computational biology

programming for any applied science

e. Training Schedule

i. The module is 3 hours of lecture and 3 hours of homework labs

ii. Class size is not restricted in any way by the course materials themselves:

1. Students require access to a recent many-core system, running a supported Linux OS

2. that system has Intel (VTune Analyzer, Compilers, TBB) or equivalent software installed on it

3. Students require access to either instructor-provided or their own binaries of interest on that server

f. Other References

i. VTune Performance Analyzer Essentials: Measurement and Tuning Techniques for Software Developers, by James Reinders, Intel Press, ISBN 0-9743649-5-9

ii. The Software Optimization Cookbook: High-performance Recipes for the Intel Architecture, by Rich Gerber, Intel Press, ISBN 0-9712887-1-4

iii. Intel Architecture Software Developer's Manual, Volume 1: Basic Architecture, ISBN 1-55512-272-8

iv. Programming with POSIX Threads, by David R. Butenhof, Addison Wesley, ISBN 0-201-63392-2

v. Online help for all applications, User Guides, Getting Started Guides, Reference Manuals

vi. The Software Vectorization Handbook, by Aart J.C. Bik, Intel Press

9. Task Analysis

The relevant Job/Task Analysis for this material is defined by the Software Engineering Body of Knowledge (SWEBOK) and can be viewed in detail here:

The primary Bodies of Knowledge (BKs) used include, but are not limited to:

  • Software Requirements BK
  • Software Design BK
    • Key issues in Software Design (Concurrency)
    • Data persistence, etc.
  • Software Construction BK
    • Software Construction Fundamentals
    • Managing Construction
    • Practical Considerations (Coding, Construction Testing, etc.)

Relevant IEEE standards for relevant job activities include but are not limited to:

  • Standards in Construction, Coding, Construction Quality IEEE12207-95

(IEEE829-98) IEEE Std 829-1998, IEEE Standard for Software Test Documentation, IEEE, 1998.

(IEEE1008-87) IEEE Std 1008-1987 (R2003), IEEE Standard for Software Unit Testing, IEEE, 1987.

(IEEE1028-97) IEEE Std 1028-1997 (R2002), IEEE Standard for Software Reviews, IEEE, 1997.

(IEEE1517-99) IEEE Std 1517-1999, IEEE Standard for Information Technology-Software Life Cycle Processes- Reuse Processes, IEEE, 1999.

(IEEE12207.0-96) IEEE/EIA 12207.0-1996//ISO/IEC12207:1995, Industry Implementation of Int. Std. ISO/IEC 12207:95, Standard for Information Technology-Software Life Cycle Processes, IEEE, 1996.

10. Concept Analysis

  • Demonstrate how an instance of the compiler can take advantage of new Core 2 processor features.
  • Parallel vs. Serial
  • SSE, which allows you to take advantage of instruction-level parallelism in each core.

    • What is SSE? SSE2, SSE3, SSE4, etc.?
  • application performance profiling
  • compiler defaults and advanced settings

    • USING an optimizing compiler
      • Intel Compiler
      • GNU Compiler (gcc)
      • MS VISUAL STUDIO (MS Compiler)

  • Writing a many-core Hello World program
  • Using SSE-enabled and parallel-enabled libraries
  • Loop level optimizations and efficient array layout
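
As an illustration of the many-core Hello World concept listed above, here is a minimal sketch in C using OpenMP. The function name hello_from_all_threads is hypothetical and not taken from the course materials. The OpenMP runtime calls are guarded by _OPENMP so the same source still builds, and runs on one thread, when OpenMP is not enabled.

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: prints one greeting per thread and returns
   how many greetings were printed. Without OpenMP enabled, the
   pragma is ignored and exactly one greeting is printed. */
int hello_from_all_threads(void)
{
    int greetings = 0;

    /* Each thread adds 1 to its own copy; the copies are summed at the end. */
    #pragma omp parallel reduction(+:greetings)
    {
        int id = 0;
        int total = 1;
#ifdef _OPENMP
        id = omp_get_thread_num();
        total = omp_get_num_threads();
#endif
        printf("Hello World from thread %d of %d\n", id, total);
        greetings += 1;
    }
    return greetings;
}
```

Calling hello_from_all_threads() from main on a many-core system with OpenMP enabled should typically print one line per available core, in no guaranteed order.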

11. Specifying Learning Objectives

Multiple cores: Given the class materials, software, and hardware, examine a sample Mandelbrot computation to identify the effective use of available system cores. Further, use this data to define and implement an optimization strategy. Students will then confirm that their strategy resulted in a performance improvement utilizing all available cores.

Single core: Given the class materials, software, and hardware, understand the importance of SSE with respect to instruction-level parallelism within a single core. Students will be able to improve the performance of a wide range of applications by means of this feature set. Additionally, students will be able to identify loops that are good candidates for significant performance improvement. Students will be able to optimize performance both by manual code manipulation and compiler optimizations.

Data structures and loops: Given course hardware and software, students will examine data structures and arrays, and the loops operating on those arrays, and learn how to coordinate their data access patterns in a more effective way, resulting in significant performance increases. To accomplish this, students will use available compiler and analyzer tools to diagnose, find, and improve targeted portions of code. Students will be able to optimize performance both by manual code manipulation and compiler optimizations. Further, students will verify the improved performance of their suggested solution using available tools.
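
To make the point about loops and array access patterns concrete, here is a minimal sketch in C. Both function names are hypothetical. The two functions compute the same sum over a matrix stored row by row, but the first walks memory contiguously while the second strides across rows, which tends to use the cache less effectively on large arrays. Actual timings depend on the machine and compiler and are not claimed here.

```c
#include <stddef.h>

/* The matrix is stored row by row in one flat array:
   element (i, j) lives at a[i * cols + j]. */

/* Inner loop moves along a row: consecutive addresses, cache friendly. */
double sum_row_order(const double *a, size_t rows, size_t cols)
{
    double sum = 0.0;
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            sum += a[i * cols + j];
    return sum;
}

/* Inner loop moves down a column: each access jumps by cols elements. */
double sum_column_order(const double *a, size_t rows, size_t cols)
{
    double sum = 0.0;
    for (size_t j = 0; j < cols; j++)
        for (size_t i = 0; i < rows; i++)
            sum += a[i * cols + j];
    return sum;
}
```

A lab exercise could time both versions on a large matrix with a profiler and then interchange the loops of the slower one.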

12. Constructing Criterion Items

Q: Why is a working knowledge of threading important?

A: Essentially all computer platforms built from now on will have multiple cores, and it is essential for programmers to know how to access these cores. Threading is one way to do this.

Q: Why did the industry move towards creating chips with multiple cores rather than continuing to ramp processor speeds?

A: Continuing to ramp processor speeds has become too costly in terms of power consumption and heat dissipation. Adding multiple cores can add computational power with minimal increase in power consumption, as compared to ramping the frequency.

Q: How do the cores share access to memory?

A: All the cores are ultimately tied to the same main memory via the front-side bus.

Q: How can a developer indicate to an OpenMP program that a particular variable needs to be copied to each thread executing the program?

A: Using the OpenMP private clause.

Q: How can a developer indicate to an OpenMP program that a particular variable needs to be shared among the threads executing the program?

A: Using the OpenMP shared clause.

Q: What issues can arise from stipulating that a variable is shared among threads?

A: Data Race conditions can arise that may (will) cause your application to operate incorrectly.
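
The private and shared clauses and the data race issue from the preceding items can be sketched in C as follows. The function names scale_array and sum_to_n are hypothetical. In sum_to_n, having every thread update a shared total without protection would be a data race; the reduction clause is one common way to avoid it, used here as an assumption about how a lab solution might look.

```c
/* Multiplies each element of in by factor and stores it in out.
   The arrays are shared among threads; tmp is private, so each
   thread works on its own copy and cannot disturb the others. */
void scale_array(double *out, const double *in, int n, double factor)
{
    double tmp;
    #pragma omp parallel for shared(out, in, factor) private(tmp)
    for (int i = 0; i < n; i++) {
        tmp = in[i] * factor;
        out[i] = tmp;
    }
}

/* Sums the integers 1..n. A plain shared total updated by all
   threads would race; reduction gives each thread a private
   partial sum and combines the partial sums at the end. */
long sum_to_n(int n)
{
    long total = 0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 1; i <= n; i++) {
        total += i;
    }
    return total;
}
```

Without OpenMP enabled the pragmas are ignored and both functions run serially with the same results.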

Q: What advantages are there in multi-threading an application?

A: 1) Improved user responsiveness, 2) improved performance

Q: What new instructions or compiler intrinsics can I use to help me search quickly for minimum values in an array?

A: A new SSE4 instruction (PHMINPOSUW) that performs a horizontal minimum search.
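
To show what a horizontal minimum search returns, here is a portable scalar model in C of the result PHMINPOSUW produces for eight unsigned 16-bit values: the minimum value and the index of its first occurrence. The type min_pos and the function horizontal_min_u16 are hypothetical names for illustration. On SSE4.1 hardware the same result is available through the _mm_minpos_epu16 intrinsic declared in smmintrin.h; this sketch deliberately avoids it so that it builds on any platform.

```c
#include <stdint.h>

/* Result of a horizontal minimum search over 8 unsigned words. */
typedef struct {
    uint16_t value;   /* smallest of the eight inputs */
    unsigned index;   /* position (0..7) of its first occurrence */
} min_pos;

/* Scalar reference model: scans the eight words once. The strict
   comparison keeps the lowest index when the minimum appears twice. */
min_pos horizontal_min_u16(const uint16_t v[8])
{
    min_pos r = { v[0], 0 };
    for (unsigned i = 1; i < 8; i++) {
        if (v[i] < r.value) {
            r.value = v[i];
            r.index = i;
        }
    }
    return r;
}
```

The hardware instruction does this comparison across all eight lanes at once, which is why it helps when searching arrays for minimum values.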

Q: How can I coax the compiler into doing high level loop optimizations for me?

A: Compile with the -O3 or /O3 switch.

Q: What is the purpose of the -parallel compiler switch?

A: It's the auto-parallelization flag.

Q: What is a pragma?

A: An instruction to the compiler to compile your program in a certain way.

Q: Name two operations that can be invoked via compiler pragmas that the developer can use to help the compiler take advantage of multiple cores, take advantage of SSE instructions, or align data better to get better performance.

A: #pragma omp parallel for, #pragma ivdep, #pragma vector aligned, #pragma vector always

Q: How can I get a compiler to automatically thread loops without having to manually insert OpenMP* directives?

A: Auto-parallelization is implemented by the Intel compiler using the -parallel or /Qparallel compiler switch.

Q: What does SSE stand for?

A: Streaming SIMD Extensions

Q: Are SSE/vectorization optimizations performed across cores or within each core?

A: SSE/vectorization optimizations are applied within a single core.

Q: What advantage is there in using SSE instructions (either by manually coding them or by auto-vectorizing with a compiler)?

A: Utilizing more of the available hardware units or more of the available register width within a single core (multiple ALUs, FPUs, registers, etc.) enables higher performance (can be on the order of 3-16X for some codes).
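
As a small illustration of using the register width within one core, here is a sketch in C that adds two float arrays four elements at a time with SSE intrinsics. The function name add_arrays is hypothetical. The SSE path is compiled only when the compiler defines __SSE__ (an assumption that holds for GCC and Clang on x86); otherwise the plain scalar loop handles everything, so the result is the same either way.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* dst[i] = a[i] + b[i] for i in [0, n). */
void add_arrays(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
#if defined(__SSE__)
    /* One 128-bit register holds four floats, so each pass of this
       loop performs four additions with a single instruction. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   /* unaligned load of 4 floats */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));
    }
#endif
    /* Scalar loop: handles the leftover elements, or the whole
       array when SSE is not available. */
    for (; i < n; i++) {
        dst[i] = a[i] + b[i];
    }
}
```

An auto-vectorizing compiler can often generate similar code from the scalar loop alone, which is the trade-off between manual coding and compiler switches discussed in the items that follow.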

Q: What is the advantage of manually coding to take advantage of many of the new streaming instructions available on Intel processors?

A: The advantage is potentially large (10X) performance gains.

Q: How can I use the compiler to take advantage of many of the new streaming instructions available on Intel processors?

A: Use auto-vectorization switches such as -axT, -axP, and -axW.

Q: What is the disadvantage of manually coding to take advantage of many of the new streaming instructions available on Intel processors?

A: The disadvantage is reduced portability of the application and a larger maintenance burden.

Q: How can a developer manually code to take advantage of many of the new streaming instructions available on Intel processors?

A: Use available compiler intrinsics or even assembly language.

Q: What is the difference between the -xP and -axP flags?

A: The -axP flag gives you a generic version of the code in addition to a processor-specific vectorized version (the -xP flag gives only the processor-specific vectorized version).

Q: What value should you receive when you call omp_get_num_threads if it is called outside a parallel region?

A: 1
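
The behavior in the item above can be checked with a short sketch in C. Both function names are hypothetical. The OpenMP runtime calls are guarded by _OPENMP so the code also builds when OpenMP is not enabled, in which case both functions simply report one thread.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Called from serial code: the team consists of the calling
   thread only, so OpenMP reports 1. */
int threads_outside_parallel(void)
{
#ifdef _OPENMP
    return omp_get_num_threads();
#else
    return 1;
#endif
}

/* Called inside a parallel region: reports the size of the team.
   The single construct lets exactly one thread write the result. */
int threads_inside_parallel(void)
{
    int count = 1;
    #pragma omp parallel
    {
        #pragma omp single
        {
#ifdef _OPENMP
            count = omp_get_num_threads();
#endif
        }
    }
    return count;
}
```

On a many-core system with OpenMP enabled, threads_inside_parallel() typically returns the number of available cores unless the thread count has been overridden.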

Q: Assuming you've added OpenMP pragmas to your code, what compiler switch would you use to enable OpenMP?

A: -openmp or /Qopenmp

Q: Assuming you've added OpenMP pragmas to your code, what header file do you need to include to use OpenMP?

A: omp.h

Q: How can a developer determine if his application is utilizing all available cores?

A: Use a process monitor (like Windows Task Manager or Windows Perfmon) or a profiling tool such as VTune Analyzer.

Q: How can a developer quickly analyze unfamiliar code to identify performance bottlenecks?

A: Use profiling tools (such as VTune Performance Analyzer) to identify performance bottlenecks using either sampling technology or call graph technology.

Q: How can a developer determine what portions of his application would get the greatest benefit from parallelism?

A: Use profiling tools (such as VTune Performance Analyzer) to identify loops or functions that take the greatest percentage of time to execute and analyze these areas for opportunities for parallelization.

Q: How can a developer determine if his threading implementation is efficiently using all cores (is the workload fairly balanced among the cores)?

A: Use a tool, such as the VTune Performance Analyzer's Thread view, to identify load imbalances. Visually inspect the thread viewer to determine if all the threads are doing equal work or if a small fraction of threads is doing the majority of the work.

Q: How can I correct an obvious load imbalance in my threaded implementation of Mandelbrot code?

A: Experiment with different OpenMP scheduling strategies (dynamic, guided, static) to mitigate thread imbalance and achieve best performance.
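
The scheduling experiment described above might look like the following sketch in C. This is not the course's Mandelbrot lab code; the function names, the sampled region, and the choice of summing iteration counts are all assumptions made for illustration. Rows near the set need many more iterations than rows far from it, so equal-sized static chunks can leave some threads idle; schedule(dynamic) hands out rows as threads become free.

```c
/* Iterations before the point (cr, ci) escapes, capped at max_iter. */
static int mandel_iters(double cr, double ci, int max_iter)
{
    double zr = 0.0, zi = 0.0;
    int k = 0;
    while (k < max_iter && zr * zr + zi * zi <= 4.0) {
        double t = zr * zr - zi * zi + cr;
        zi = 2.0 * zr * zi + ci;
        zr = t;
        k++;
    }
    return k;
}

/* Total iteration count over a width x height grid covering
   [-2, 1) x [-1.5, 1.5). Swap schedule(dynamic) for
   schedule(static) or schedule(guided) to compare load balance. */
long mandel_total_iters(int width, int height, int max_iter)
{
    long total = 0;
    #pragma omp parallel for schedule(dynamic) reduction(+:total)
    for (int y = 0; y < height; y++) {
        double ci = -1.5 + 3.0 * y / height;
        for (int x = 0; x < width; x++) {
            double cr = -2.0 + 3.0 * x / width;
            total += mandel_iters(cr, ci, max_iter);
        }
    }
    return total;
}
```

The result is the same for every schedule; only the distribution of work among threads changes, which is what the thread view in a profiler would reveal.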

13. Expert Appraisal: Live meeting capture of a SME demo walkthrough of material will be available July 30, 2008. URL is TBD. Optional Webinar on the material may be completed: if so, the Webinar will be announced on the ISC Community Forum.

14. Developmental Testing: alpha and beta versions of this material will be posted to the ISC WIKI by July 30, 2008

15. Production:

Blueprint Target Date: 6 weeks from now, July 30, 2008

Approval by PDT required pending general availability

All materials will be posted to the ISC WIKI by August 31, 2008 (Target POR).


Nice set of criterion items, if you ask me, BobC. Good job.


Looks good to me. Can I get it a little sooner, please?

