Beta Course Design Document: Using the VTune(TM) Performance Analyzer 8.0 for Linux* -- YOUR FEEDBACK WELCOMED!

Hi Team,

Below please find a beta version of the Course Design Document for a short module that introduces Intel's performance analyzer on Linux*. ALL FEEDBACK IS FAIR GAME! Please read and comment freely.




Using the Intel VTune Performance Analyzer 8.0 for Linux*

Beta Course Design Document

Version: CD_VTL_BPL_1_0.doc

1. Module Name: Using the Intel VTune Performance Analyzer 8.0 for Linux*

2. Writers: [Intel Confidential List]

3. Targeted availability: June 2008

4. Brief Description

Proposed Duration: three hours

This three-hour module is a densely packed (and perhaps the briefest possible) introduction to the VTune analyzer for Linux. It is designed for students who have experience with neither the analyzer nor application profiling, although students with such experience would probably find the material a good refresher.

The module consists of one hour of lecture accompanied by two hours of hands-on labs, and is specifically designed for developers who are currently writing compiled code for contemporary many-core systems.

The module teaches students two important approaches to application profiling: the first from the processor's point of view (called sampling), and the second from a method-by-method, flow-of-control perspective (known as call graph). The analyzer tool itself collects data about your Linux application, ranging from system-wide statistics all the way down to information about specific objects or functions of interest in the application itself. The analyzer then summarizes, interprets, and presents that information to you.

After successful completion of the course lab activities, the student will be able to use the analyzer, from either the command line or the graphical user interface (GUI), to create sampling and call graph performance analysis experiments on their code, locate potential performance bottlenecks, and determine possible code optimizations.

Further, and just as importantly, they will be able to measure, and therefore confirm, that any optimization strategies they implement are in fact improving their application's performance.

5. Needs Analysis

Until the recent release of many-core processors, programmers were generally taught to program within higher levels of abstraction: simply put, they didn't need to care about the hardware they were writing for. Historically, they were instead taught to pay a little attention to their compilers, operating systems (OS), and applications of choice, and not much else, and even those only in minimal detail. In this approach the programmer relied on the existing abstraction layers to handle the underlying hardware generically.

Now, in the 21st century, there have been many hardware changes with which professional lower-level software writers have not yet caught up. Specifically, this includes OS writers, compiler writers, and high-performance enterprise or network application writers, as well as general application writers.

This has resulted in a very large gap between the richness of features offered by the hardware and the capability of most developers to take full advantage of that feature set. The business opportunity here is to inform a subset of budding developers who can then influence the industry's lower tier of software, so that they can more easily bring the currently hidden hardware feature set to a broader programming community.

Additionally, the current list of CPU product differentiators at large includes faster clock speeds, more cores, and better power management, to name a few. If the bulk of the programming community (including the lower-level software developers) doesn't gain any advantage with 8 cores over 1 or 2 cores, then there is no further perceived benefit and the sales value proposition crumbles. This must be avoided.

As beginning (or nearly beginning) students are taught programming structures for perhaps the very first time, the role and strategies of parallelism must also be introduced. This is a new way of thinking in the classroom, and it requires contemporary tools to assist not only with determining the best optimization strategies, but also with providing objective data to confirm that the strategies, once implemented, result in actual performance improvement.

It is well known that when analyzing an application in the real world, one needs to identify bottlenecks. After correcting them, one must look for algorithmic efficiencies; without a profiling tool and methodology, this is arduous and prone to error. Further, when one examines code written by another, the asymptotic inefficiencies may not be obvious at first (or even later): using a tool to identify them provides significant time savings.
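To make the point concrete, here is a small illustrative example (not taken from the course labs) of the kind of hidden asymptotic inefficiency a profiler can surface. Both C functions compute the same total of running prefix sums; the first does it in quadratic time, the second in linear time, and on large inputs a sampling profiler would flag the inner loop of the first as a hotspot:

```c
#include <stddef.h>

/* Quadratic: recomputes the prefix sum from scratch for every element.
   On large inputs, a sampling profiler would show the inner loop here
   as the dominant hotspot. */
long total_prefix_sums_slow(const int *a, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        long prefix = 0;
        for (size_t j = 0; j <= i; j++)
            prefix += a[j];
        total += prefix;
    }
    return total;
}

/* Linear: carries the running prefix forward, doing the same work once. */
long total_prefix_sums_fast(const int *a, size_t n) {
    long total = 0, prefix = 0;
    for (size_t i = 0; i < n; i++) {
        prefix += a[i];
        total += prefix;
    }
    return total;
}
```

Reading either function in isolation, the inefficiency is easy to miss; measuring both with a profiler makes the difference unmistakable.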

Modern CPUs typically include a piece of hardware called a performance monitoring unit (PMU), which allows analysis of low-level events in the CPU itself, at any given time, as your applications run. One advantage of this previously unavailable functionality is that performance analysis can now draw on a much richer array of performance data. You need a proper tool to take advantage of it: such a tool shows how your application interacts with the CPU, so that you can make effective optimization choices.

Such powerful profiling tools are by necessity complex, and the course materials that provide first exposure must include not only a full spectrum of hands-on experiences, but also a thorough examination of self-help tools that will serve the student well beyond the classroom.

6. Subject Matter Experts(SMEs)

Subject Matter Experts who contributed material and time to the content of this courseware, including structural, language, and technical edits, are:

a. [Intel Confidential list]

Important notes about SMEs

An SME is a difficult person to locate. By definition, an SME is able to:

  • Demonstrate/coach any performance to a mastery level
  • Determine what is necessary and sufficient to reach mastery
  • Explain the importance of any task or topic important to mastery
  • Identify training deficiencies underlying a training problem
  • Show/explain contextual needs of the learning task or topic
  • Identify any relevant attributes of any important concepts
  • Articulate the levels of mastery and standards needed to solve a training problem

Just as importantly, SMEs can differentiate the subject matter that MUST be taught from that which SHOULD be taught, and from that which COULD be taught. Lastly, SMEs help define learning objectives and behavior with regard to probability of error, and the consequences of a given error, in relation to the student's work life.

7. Learner Analysis

Learner characteristics influence the very core of class design, including the choices of language and terminology, prerequisites, selected learning activities, how to gain learner feedback and how to evaluate the learners, the format of the instructional materials, the training schedule itself, and the choices of media and instructional equipment, to name a few.

The ideal student for this module is an adult learner at a university who, in addition to exhibiting the learning characteristics of adult learners, also has the following traits:

  • A programmer in the C/C++ or Fortran compiled programming languages, with between 0 and 3 years of programming experience (or the equivalent) in one or more of those three languages
    • could be a freshman-, sophomore-, or junior-level programmer (1st-, 2nd-, or 3rd-year college student), or an advanced younger student
  • A programmer who routinely writes simple sorting or computation programs (between 10 and 100 lines) from scratch in a day or less, with no difficulty whatsoever
    • these short programs routinely compile with few or no problems, or the student is well able to solve the problems to a successful compile
  • A programmer who may have a primary interest in writing applications for contemporary many-core servers
  • May or may not have programming experience with Java* or .NET
  • May or may not have previous experience with application profiling tools
  • Is comfortable as a standard user of Linux; understands basic Linux file and directory permissions; is able to successfully compile and link Linux code; can identify and stop Linux system processes; is familiar with a favorite Linux shell (similar to the Bourne or C shells); understands basic Linux command line commands such as ls, lf, tar, sar, etc.
    • does not require an understanding of X Windows
  • May or may not have knowledge of compiler optimization strategies; may or may not have wondered if the compiler could provide more benefit
  • May or may not have previous application profiling experience in either Windows* or Linux
  • May have expressed an interest in learning more about all the resources that current microarchitecture can provide to achieve tangible performance improvements in their code; not content to write (or soon write) a program on an 8-core machine that leaves one core doing all the work while 7 cores sit idle
  • May have a general or vague curiosity about how the CPU works
  • May already be actively seeking ways to use currently available resources more effectively to solve even more challenging problems
  • Is a student of algorithms in any of a variety of scientific disciplines (not just programming) who has an interest in asymptotic efficiency, or Big-O analysis, and seeks real-world applications with defined loop structures (adding real-world examples to pencil-and-paper theory)
  • Is currently, or will soon be, an application developer

a. Special notes for Train the Trainer learners/attendees

Train the Trainer (TTT) attendees are special cases: they likely have more experience than the usual target audience for this class, and they have the immediate goal of teaching this class in a live classroom environment with targeted students.

Ideal TTT candidates for this material have the following traits:

i. Currently instruct or plan to instruct adult students who fit in the learner description earlier in this section

ii. Currently using a successful programming curriculum, or intend to soon create or teach one

iii. NOTE: There are no inherent limitations for instructors based on experience, or lack of it, with regard to these objectives and content

Further, the course materials will use Intel software tools to easily illuminate important concepts, but those concepts can be explained and exploited using many other tools. This training is NOT just for Intel tools: for example, any modern application profiler that is many-core aware could be substituted.

8. Context Analysis

The purpose of a Context Analysis is to identify and describe the environmental factors that inform the design of this module. The Environmental Factors for the module include:

a. Media Selection

i. No tapes, CDs, or DVDs are available or provided

ii. Electronic files are provided

1. Can be printed out for classroom use if desired

2. Lecture presentation is .PPT format

a. includes instructor notes

3. Lab Guide is .DOC format

a. includes all planned hands-on labs

b. Document is labeled Student Workbook

4. Suggested class binaries included in tar format

a. instructor or students can substitute their own binaries for suggested ones

b. substitution may be optimal, in particular if student is using code they wrote from scratch

b. Learning Activities

i. Lectures include optional demos for the instructor

ii. Hands-on labs for students

1. Labs are designed as student homework but can be also done during class time if preferred

iii. Class Q+A

c. Participant Materials and Instructor/Leader Guides

i. There is a short Lab Guide with this module

ii. There is a short Lecture presentation with this module

1. Minimal instructor notes are included in PPT Notes sections

iii. An archive of class binaries, if no customized or student binaries are available

d. Packaging and production of training materials

i. Materials are posted to Intel Curriculum Wiki, for worldwide use and alteration:


e. Training Schedule

i. The module is 1 hour of lecture and optional demo and 2 hours of homework labs

ii. Class size is not restricted in any way by the course materials themselves:

1. the labs require access to a Linux* server running a supported Linux OS

2. the server must have the analyzer installed on it

3. Students require access to either instructor-provided or their own binaries of interest on that server

a. the analyzer can use remote data collectors to obtain performance data on multiple servers, but the materials do not cover this ability

f. Other References

i. VTune Performance Analyzer Essentials: Measurement and Tuning Techniques for Software Developers, by James Reinders, Intel Press, ISBN 0-9743649-5-9

ii. The Software Optimization Cookbook: High-performance Recipes for the Intel Architecture, by Rich Gerber, Intel Press, ISBN 0-9712887-1-4

iii. Intel Architecture Software Developer's Manual, Volume 1: Basic Architecture, ISBN 1-55512-272-8

iv. Programming with POSIX Threads, by David R. Butenhof, Addison Wesley, ISBN 0-201-63392-2

9. Task Analysis

The relevant Job/Task Analysis for this material is defined by the Software Engineering Body of Knowledge (SWEBOK) and can be viewed in detail here:

The primary Bodies of Knowledge (BKs) used include, but are not limited to:

  • Software Requirements BK
  • Software Design BK
    • Key issues in Software Design (Concurrency)
    • Data persistence, etc.
  • Software Construction BK
    • Software Construction Fundamentals
    • Managing Construction
    • Practical Considerations (Coding, Construction Testing, etc.)

Relevant IEEE standards for relevant job activities include but are not limited to:

  • Standards in Construction, Coding, Construction Quality IEEE12207-95

(IEEE829-98) IEEE Std 829-1998, IEEE Standard for Software Test Documentation, IEEE, 1998.

(IEEE1008-87) IEEE Std 1008-1987 (R2003), IEEE Standard for Software Unit Testing, IEEE, 1987.

(IEEE1028-97) IEEE Std 1028-1997 (R2002), IEEE Standard for Software Reviews, IEEE, 1997.

(IEEE1517-99) IEEE Std 1517-1999, IEEE Standard for Information Technology-Software Life Cycle Processes- Reuse Processes, IEEE, 1999.

(IEEE12207.0-96) IEEE/EIA 12207.0-1996//ISO/IEC12207:1995, Industry Implementation of Int. Std. ISO/IEC 12207:95, Standard for Information Technology-Software Life Cycle Processes, IEEE, 1996.

10. Concept Analysis

  • Passage of time: clockticks
  • Program control flow changes: mispredictions, speculation, retired, executed
  • Instruction mix: ratios?
  • Memory accesses: MMX Instructions and Streaming SIMD Extensions (SSE) Events
  • Execution flow
  • Application (compiled code, Fortran, C++)
  • Call graph, sampling
  • Serial code
  • Parallelization between cores (ways to share memory, benefits, costs, ways to make simple serial code parallel)
  • Compiler ways to make parallel code
  • OpenMP ways to make parallel code
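As a hedged sketch of the last two concepts above (making simple serial code parallel, and OpenMP as one way to do it), the following C fragment shows a serial reduction loop and the same loop annotated with an OpenMP pragma. The function names are illustrative, not from the course materials:

```c
/* Serial version: one core does all the work. */
double sum_serial(const double *x, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* OpenMP version: the pragma splits the loop iterations across cores
   and combines the per-core partial sums.  Compile with -fopenmp (gcc)
   to enable it; without that flag the pragma is ignored and the loop
   simply runs serially, producing the same result. */
double sum_parallel(const double *x, int n) {
    double s = 0.0;
    #pragma omp parallel for reduction(+:s)
    for (int i = 0; i < n; i++)
        s += x[i];
    return s;
}
```

The point for this module is that the profiler can then measure whether the parallel version actually spreads the clockticks across cores.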

11. Specifying Learning Objectives (Terminal Objectives)

a. Given the class lab hardware and software, use the First Use Wizard to determine a sampling hotspot in a sample section of code selected either by the instructor or by the student. The student must be able to list the function or object name of the hotspot when finished, as well as provide the listed number of instructions retired. Further, this experiment should take no longer than 20 minutes to complete.

b. Given the class lab hardware and software, use the analyzer's graphical help system to identify three available included documents, and further, to define their best intended uses, in 15 minutes or less.

c. Given the class lab hardware and software, use the Sampling Wizard to drill down to the primary hotspot of the gzip binary or another provided binary. This experiment will take 15 minutes or less, and the student will indicate which function in the binary takes the most time to complete, which function in the binary has the highest CPI, and which line of source has the most clocktick samples.

d. Given the class lab hardware and software, modify the sampling collector on the server to change the default event values to Loads Retired, and then drill down to the matrix binary or another provided binary. Further, in 10 minutes or less, the student will report the samples per second the analyzer is now collecting, with calibration first off, then on. Further, students will report any differences between calibration on and off.

e. Given the class lab hardware and software, use the Call Graph Configuration Wizard for Linux to establish and run a call graph experiment using either a provided or student-owned application. Further, within 15 minutes of starting the lab, the student will identify the function of the application in which the most time is spent, and which functions call it.

f. Given the class lab hardware and software, use both the sampling and call graph technologies to create an optimization strategy, implement it, and use the analyzer to confirm an improvement in performance, using either the provided or a student-supplied application. Further, the testing of the strategy and implementation must be completed within 30 minutes of starting the lab.

g. Given the class lab hardware and software: run the vtqfagent command and examine the log file contents; use the man pages for vtl, vtlec, sampling, callgraph, code, and source to identify their best uses; and finally, rerun all sampling and call graph exercises from the previous labs using the vtl command line command instead of the GUI. This entire lab must be completed within 30 minutes, and upon successful completion the student will compare GUI and CLI results and determine their consistency, if any.

h. Threading objective, thread view

12. Criterion Items

Q: What is a hotspot?

A: A hotspot is a place in your code where a given CPU event occurs more often than at any other spot. Hotspots can point to bottlenecks in your code.

Q: What is the difference between a hotspot and a bottleneck?

A: A bottleneck is a section of your code where execution is unnecessarily slowed down due to programmer choices. If a bottleneck is related to microarchitectural functionality, then finding the hotspot that points to it is important.

Q: What is the primary job of an application profiler?

A: To collect system-wide and application performance data about your application; to analyze that data; and to present it to you.

Q: For any particular application profiler, how do you know which environments it supports?

A: By reading its release notes.

Q: What is sampling?

A: Sampling is a CPU event-based functionality in which your application is analyzed in almost real time, from the CPU's point of view.

Q: What is call graph?

A: Call Graph is functionality that allows you to examine the flow of control of your application.

Q: What are the default events selected for sampling?

A: Clockticks and instructions retired.

Q: What is the purpose of the First Use Wizard?

A: To simplify the sampling or call graph technologies for profiling novices by creating projects, activities, and results in a more automatic way.

Q: What is the purpose of the analyzer's Online Help?

A: For quick assistance while you are trying to run the analyzer. It provides basic information to help you get your profiling experiment running.

Q: What is the purpose of the analyzer's Getting Started Guide?

A: To provide a basic introduction to users who have done application profiling before, but who have not yet used the analyzer.

Q: What is the purpose of the analyzer's User's Guide?

A: To provide a logical introduction to profiling for users who have never profiled before.

Q: What is the purpose of the analyzer's Reference Manual?

A: It is an encyclopedic, though abridged, listing of primary and secondary processor events that may well be of interest to developers as they run profiling experiments on their code.

Q: What are the key advantages of the analyzers sampling technology?

A: No need to modify your code in any way; use your regular build configuration with all current optimization switches on; it is system-wide, at least until you begin to drill down; and there is low CPU overhead.

Q: What is the sample-after value?

A: Interrupting the CPU every single time a particular event occurs while your application is running would be prohibitively expensive. Instead, the sample-after value is a selected number of events after which the context registers of the CPU are examined. The sample-after value is what gives the sampling technology its low CPU overhead.
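A toy simulation may help make the mechanism concrete. This is not analyzer code, just a conceptual C sketch of how counting events against a sample-after threshold keeps overhead low: only one in every sample_after events triggers the (expensive) interrupt-and-record step.

```c
/* Conceptual simulation (not analyzer code): the PMU counts events and
   fires an interrupt only every `sample_after` events, so only a tiny
   fraction of events cost anything. */
long samples_taken(long events, long sample_after) {
    long counter = 0, samples = 0;
    for (long e = 0; e < events; e++) {
        counter++;
        if (counter == sample_after) {  /* interrupt fires here */
            samples++;                  /* record IP, PID/TID, module */
            counter = 0;                /* hardware counter resets */
        }
    }
    return samples;
}
```

With 1,000,000 events and a sample-after value of 200,000, only 5 samples are taken.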

Q: What is actually written during a sampling interrupt of the CPU?

A: The instruction pointer (CS:IP); the process and thread identifiers that correspond to the IP; and the module containing this IP. (To get the IP resolved into functions and lines of code, the module has to contain debug information.)

Q: What are clockticks?

A: Clockticks are events used for time measurement. Divide them by the processor's frequency to get readings in seconds.

Q: Why does a profiler consider clockticks?

A: Because many event counts only make sense as a ratio per time interval.

Q: What is the most commonly used ratio?

A: CPI (clockticks/instruction)
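These two answers boil down to two simple ratios, sketched here as C helpers (illustrative names, not analyzer APIs): seconds are clockticks divided by the processor's frequency, and CPI is clockticks divided by instructions retired.

```c
/* Wall time represented by a clocktick count, for a given core frequency. */
double ticks_to_seconds(double clockticks, double hz) {
    return clockticks / hz;
}

/* Cycles per instruction: the ratio the lecture calls CPI. */
double cpi(double clockticks, double instructions_retired) {
    return clockticks / instructions_retired;
}
```

For example, 3 billion clockticks on a 3 GHz core is 1 second of wall time, and 2 billion clockticks spent retiring 1 billion instructions is a CPI of 2.0.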

Q: What is a retired event?

A: A retired event includes only events that occur due to instructions that are committed to the machine state. (Loads that occur on mispredicted execution paths would not be counted.)

Q: Does the analyzer report on manycore CPUs?

A: Yes. The analyzer can break down sampling views based on individual processors or individual threads.

Q: How does the call graph functionality work?

A: It works by monitoring function entries. The analyzer performs binary instrumentation at the function level, in which code added to your function prologs collects and stores IP and timing information. Later, these data are used to perform the analysis and display the graph showing the flow of control. This functionality is not system-wide like sampling; it profiles only the selected module and those modules called from it. Binaries that have been instrumented for call graph run more slowly than the same binary without instrumentation, and this is expected.
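A hand-written analogue may clarify what the automatic instrumentation injects. In this hedged C sketch (all names invented; the analyzer does this at the binary level with no source changes), each function prolog records an entry event in a small table, which is what later lets the tool reconstruct who called whom and how often:

```c
#include <string.h>

#define MAX_FUNCS 16

static const char *func_name[MAX_FUNCS];
static long        func_calls[MAX_FUNCS];

/* The kind of hook call-graph instrumentation injects into each prolog. */
static void profile_enter(const char *name) {
    for (int i = 0; i < MAX_FUNCS; i++) {
        if (func_calls[i] && strcmp(func_name[i], name) == 0) {
            func_calls[i]++;            /* seen before: bump its count */
            return;
        }
        if (func_calls[i] == 0) {       /* empty slot: register it */
            func_name[i] = name;
            func_calls[i] = 1;
            return;
        }
    }
}

/* Query the table, as a stand-in for the analyzer's call-graph view. */
long calls_to(const char *name) {
    for (int i = 0; i < MAX_FUNCS; i++)
        if (func_calls[i] && strcmp(func_name[i], name) == 0)
            return func_calls[i];
    return 0;
}

static long helper(long x) {
    profile_enter("helper");            /* "injected" prolog */
    return x * x;
}

long work(long n) {
    profile_enter("work");              /* "injected" prolog */
    long acc = 0;
    for (long i = 0; i < n; i++)
        acc += helper(i);
    return acc;
}
```

After one call to work(3), the table shows work entered once and helper entered three times; the real instrumentation also timestamps each entry so time can be attributed along the call chain.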

Q: What are the advantages of the call graph technology?

A: Identifies algorithmic problems through the calls hierarchy data and critical execution paths; instrumentation is automatic and requires no user intervention.

Q: What are the significant differences between the sampling and call graph technologies?

A1: Sampling shows system-wide hotspots. Call graph shows you how you got there: which function called the hotspot.

A2: Sampling allows you to understand any mismatches between the microarchitecture and your code. Call graph identifies the workflow, pointing to the algorithmic issues.

A3: Call graph shows time spent in every function of every module called from the specified application.

Q: What is Pack and Go?

A: An analyzer capability that allows you to archive and share profiling experiment results with users on other servers.

Q: What is a standard approach for optimizing applications that uses the full capabilities of the analyzer?

A: Find hotspots regarding time (cycles) by sampling for clockticks; use call graph to find out how you got there; look for non-optimal resource utilization via CPI; then find hotspots regarding expensive occurrence events (L2/L3 cache misses, branch mispredictions, pipeline flushes).

Q: What is a good overall optimizing plan for persons who don't have much profiling experience or knowledge of the CPU architecture?

A: Use sampling to find your worst problem (hotspot); use a more aggressive compiler optimization strategy on that object or module or file; use the profiler to measure any changes you made.

Q: What is the name of the VTune Quality Feedback Agent and what is it used for?

A1: /opt/intel/vtune/bin/vtqfagent

A2: It is used to provide diagnostic information about a server, in an automated fashion, in a text file that Intel support might ask to see when troubleshooting issues.

Q: What command do you use to invoke the command line help?

A: vtl help

Q: In each case, match the vtl command with its description:

  1. vtl help # answer a.
  2. vtl help -c sampling # answer b.
  3. vtl version # answer c.
  4. vtl query -lc # answer d.
  5. vtl show # answer e.
  6. vtl show -a # answer f.
  7. vtl delete # answer g.
  8. vtl view -gui # answer h.

  a. list out the command help
  b. list supported processor events for your server
  c. list the software version
  d. list the installed collectors
  e. show the hidden project
  f. show all details
  g. delete a specific
  h. delete the last activity
  i. call the project up into the Eclipse* interface for viewing in the GUI

Q: Fully interpret the following command line:

$ vtl activity lab3 -c sampling -o "-ec en=LOADS_RETIRED:sa=200000 -calibration:no" -app ./matrix run

A: Start an activity using the sampling collector on the binary called matrix and change the default counter to Loads Retired, while setting the sample-after value to 200000; further, do not calibrate the experiment, and launch it as soon as ENTER is pressed.

Q: What are the advantages of the CLI version of the analyzer over the GUI version?

A: The vtqfagent diagnostic log; the concise man pages; and the ability to script data collection experiments to run automatically, without human intervention.

Q: What are the similarities between the CLI version of the analyzer and the GUI version?

A: You can run sampling and call graph data collection activities; you can do remote data collection; and both provide graphical views (although the plug-in viewers are optional in the CLI version).

Q: What is a remote agent, or Remote Data Collector?

A: An RDC is a small piece of server code that runs on a server on your network other than the one you are currently logged into. It listens for sampling and call graph experiment requests from your analyzer installation, collects data, and sends it across the network back to you for analysis and presentation.

Q: In an analyzer display, you see modules labeled OTHERXX. What does this mean?

A: The OTHERXX listing is a placeholder for samples when the analyzer cannot locate the loaded module. XX refers to the addressing mode: for example, 32 for 32-bit, 64 for 64-bit. The code may be a bit of BIOS code, or runtime-generated code.

Q: Is it possible to run a sampling experiment on specific lines of code in your application?

A: Yes, with included Ring3 APIs on MS Windows*. You insert the resume and pause calls, start the analyzer in pause mode, and as the program runs it will resume and stop data collection at the precise locations of your resume and pause calls.

Q: What are the three component pieces of the Ring3 API?

A: VTuneAPI.h (definitions), VTuneAPI.lib (import library), and VTuneAPI.DLL which provides runtime support (MS Windows* only)

13. Expert Appraisal: Live meeting capture of a SME demo walkthrough of material will be available June 30, 2008. URL is TBD. Optional Webinar on the material may be completed: if so, the Webinar will be announced on the ISC Community Forum.

14. Developmental Testing: alphas and betas of this material posted to the ISC WIKI by June 30, 2008

15. Production:

Blueprint Target Date: 6 weeks from now, June 30, 2008

Approval by ISC PDT (governing body) required pending general availability

16. All materials currently targeted to be posted to the ISC WIKI by July 31, 2008 (Target POR):


This is looking good - thanks for posting this

I may have some additions to criterion items around load balancing

GREAT! Send them on, Bob!


