Intel® Graphics Performance Analyzers Instrumentation Walkthrough

NOTE: This article was written using Intel GPA 4.3; though not the latest version of the product, many of the techniques outlined here are useful with recent versions of GPA. To download the latest release, see the Intel GPA Home Page.


Intel® GPA Platform Analyzer Overview

Intel® Graphics Performance Analyzers (Intel® GPA) Platform Analyzer is an instrumentation-based tool. The fundamental data element of the instrumentation API is a task. A task is a logical group of work on a specific thread. A task may correspond to code in functions, scope blocks, case blocks in switch statements, or any significant piece of code as determined by the developer. The instrumentation API provides functionality to describe various constructs such as dependencies between tasks. Instrumented tasks are displayed in a timeline view by Intel GPA Platform Analyzer. Besides your defined tasks, you’ll see other information displayed on the timeline. Intel® graphics drivers, the DirectX* interceptor used by Intel GPA, and other Intel libraries like the Intel® Media SDK come pre-instrumented and will display relevant information on Intel GPA Platform Analyzer. Even if you don’t add any instrumentation to your code, you will at the very least see pre-instrumented libraries and/or graphics driver information. By default, you will be able to see the amount of time and the order in which frames are processed on the CPU and the GPU. This is helpful when determining if the application is CPU or GPU bound.

Figure 1. Intel® GPA Platform Analyzer user interface


Intel GPA Platform Analyzer is made up of several panels, numbered in the figure above. Panel 1 is the timeline view, which displays threads and the tasks on each thread. Panels 2 and 3 show metrics related to the tasks selected in the timeline. Panel 4 shows hardware tracks for each of the processor cores. Panel 4 is not displayed by default, but can be enabled from the Profiles window of the Intel GPA Monitor as described in Appendix A: Intel® GPA Monitor. Refer to the Intel GPA help file for further details about the user interface.

The workflow for using Intel GPA Platform Analyzer is simple. Identify sections of code whose execution time and order are better understood visually. Good candidates for instrumentation are AI simulation, rigid body physics, collision detection, rendering, or any other system that requires detailed understanding. Add instrumentation to the identified sections as described in the rest of this article. Run the Intel GPA Monitor and the instrumented application. From the Intel® GPA System Analyzer HUD overlaid on screen, capture a trace for Intel GPA Platform Analyzer. Open the captured trace in Intel GPA Platform Analyzer and begin analysis. Instrumentation can be used to visually confirm the execution order and distribution of tasks, to profile an application, and even to describe complex relationships between systems. Unlike tools such as Intel® GPA Frame Analyzer that focus on specific frames, Intel GPA Platform Analyzer casts a much wider net and shows not only the GPU activity but also what is happening on the CPU. As the name suggests, Intel GPA Platform Analyzer gives a fuller picture of the platform.

Leading Game Middleware Instrumented for Intel® GPA Platform Analyzer

Most games in the industry use middleware solutions for various subsystems such as user interface, physics, vegetation, and many others. Because middleware companies generally specialize in solving a particular problem, they are able to provide high-quality solutions. Game developers want to release a game with many subsystems that work together to create an engaging and fun experience for the players. Game middleware can be thought of as a black box that takes some input, does some work with the input, and returns results that can be used in a game. Of course, nothing is free, and even highly optimized middleware products can end up being the bottleneck, or at the very least taking up more CPU time than desired. In some cases, profiling the game can help the developer understand which issues are holding up the middleware.

We’ve worked with some of the leading middleware companies in the game industry to take off the veil and show developers exactly what the middleware is doing behind the scenes. The following middleware companies are a few that have instrumented their latest products for Intel GPA Platform Analyzer:

Figure 2. Instrumented game middleware


If you are using any of these middleware solutions, you’ll see details about the work each is doing when you capture a trace for Intel GPA Platform Analyzer, as shown in Figures 3-5 below. The instrumentation is probably not enabled on the “Release” version of these products, but is likely associated with a “Development” or “Profile” build. Refer to each product’s help or support for further information about enabling instrumentation for Intel GPA Platform Analyzer.

The benefits of knowing and being able to see what these middleware products are doing range from being able to confirm whether middleware is the bottleneck, to being able to see what effect different input has on the middleware without having to add any instrumentation to your own code. The following screen captures show three examples of middleware products’ instrumentation. These traces are from tech samples created by each of the middleware providers to show their own product features.

Figure 3. Autodesk Scaleform GFx* 4

Figure 4. Unity Web Player*

Figure 5. Geomerics Enlighten*

Simple and flexible instrumentation API

With a better idea of how Intel GPA Platform Analyzer displays an instrumented project, it is time to instrument some code. Intel GPA Platform Analyzer comes with 32-bit and 64-bit libs found in <Intel GPA install dir>\sdk\libs. Use the appropriate lib according to the project’s build requirements. To begin adding instrumentation, include the ittnotify.h header in the files that will have instrumented code, and wrap the code in __itt_task_begin/__itt_task_end calls.

// include the header in the file that will be instrumented 
#include <ittnotify.h>
static __itt_domain* g_pDomain = __itt_domain_create( "Domain.Name" ); 
  
void System::DoWork( void ) 
{ 
 __itt_string_handle* szStringHandle = __itt_string_handle_create("System::DoWork"); 
 __itt_task_begin( g_pDomain, __itt_null, __itt_null, szStringHandle ); 
 // do work 
 __itt_task_end( g_pDomain ); 
}

Calls to __itt_task_begin/__itt_task_end can be nested if necessary to create a hierarchical construct. The following sections discuss some organization techniques available in the instrumentation API (task groups, markers, and relations) that help organize instrumented code.
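The nesting of begin/end calls can be sketched without the SDK. The following is a minimal illustration, not Intel's implementation: TraceLog, task_begin, and task_end are hypothetical stand-ins for the ITT runtime and the __itt_task_begin/__itt_task_end calls, used here only so the example compiles on its own. It records each begin and end together with its nesting depth, which is roughly the hierarchy a timeline view would present.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the ITT runtime: collects events in order.
struct TraceLog {
    std::vector<std::string> events;
    std::size_t depth = 0;  // number of tasks currently open
};

// Stand-in for __itt_task_begin: logs the task name, indented by depth.
inline void task_begin(TraceLog& log, const std::string& name) {
    log.events.push_back(std::string(log.depth * 2, ' ') + "begin " + name);
    ++log.depth;
}

// Stand-in for __itt_task_end: closes the most recently opened task.
inline void task_end(TraceLog& log) {
    assert(log.depth > 0 && "task_end without a matching task_begin");
    --log.depth;
    log.events.push_back(std::string(log.depth * 2, ' ') + "end");
}

// One parent task containing two child tasks.
inline void update_frame(TraceLog& log) {
    task_begin(log, "Frame::Update");
    task_begin(log, "Physics::Step");
    task_end(log);
    task_begin(log, "AI::Think");
    task_end(log);
    task_end(log);
}
```

Calling update_frame on an empty TraceLog leaves depth at zero and six events in the log, with the two child tasks indented one level under Frame::Update.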

Create or Integrate into Instrumentation System

Most game engines have an instrumentation/profiling system. Intel® Instrumentation and Tracing Technology (Intel® ITT) can integrate well into instrumentation systems. If your game engine does not have an instrumentation system, you can build it using the Intel ITT instrumentation API described above. Most instrumentation systems use macros to mark code sections for profiling. All the examples above call the Intel ITT functions directly. You can abstract the calls by either creating your own macros or integrating them into your existing instrumentation system. Most instrumentation systems work in one of two ways:

  1. Begin/End markers: Delimit specific code sections with begin/end calls.
  2. Scoped objects: Macros called at the beginning of the profiled code section create an instrumentation object that goes out of scope at the end of the code section.

The Intel ITT instrumentation API lends itself out of the box to the Begin/End scenario. As shown in the examples above, __itt_task_begin and __itt_task_end are the two major API calls. The second scenario can be accomplished by creating a class that calls __itt_task_begin in the constructor and __itt_task_end in the destructor, creating an instrumentation system that supports both scenarios. For example, here’s some code that should accomplish the feat:

#include <ittnotify.h>

// Scoped task: begins in the constructor, ends in the destructor.
class GPAScopedTask
{
  public:
  GPAScopedTask( __itt_domain* pDomain, const char* szTaskName )
  : m_pDomain( pDomain )
  { 
    __itt_string_handle* pTaskName = __itt_string_handle_createA( szTaskName );
    __itt_task_begin( m_pDomain, __itt_null, __itt_null, pTaskName ); 
  }
  ~GPAScopedTask( void ) 
  { 
    __itt_task_end( m_pDomain ); 
  }
  
  private:
  __itt_domain* m_pDomain;
};

// Scoped form: TaskName is a C string.
#define GPA_TASK( Domain, TaskName ) \
  GPAScopedTask _gpa_scoped_task_( Domain, TaskName )
// Begin/End form: TaskName is an __itt_string_handle*.
#define GPA_TASK_BEGIN( Domain, TaskName ) \
  __itt_task_begin( Domain, __itt_null, __itt_null, TaskName )
#define GPA_TASK_END( Domain ) __itt_task_end( Domain )
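One reason to prefer the scoped form is that the task is closed on every exit path, including early returns. The sketch below illustrates this under stated assumptions: Recorder and ScopedTask are hypothetical stand-ins that log strings instead of calling __itt_task_begin and __itt_task_end, so the example builds without the SDK, and load_level is an invented function.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the ITT runtime: collects events in order.
struct Recorder {
    std::vector<std::string> events;
};

// Same shape as a scoped task class, but logging instead of calling ITT.
class ScopedTask {
public:
    ScopedTask(Recorder& rec, const char* name) : m_rec(rec) {
        assert(name != nullptr);
        m_rec.events.push_back(std::string("begin ") + name);  // __itt_task_begin in a real build
    }
    ~ScopedTask() {
        m_rec.events.push_back("end");  // __itt_task_end in a real build
    }
    // Non-copyable: a copy would end the same task twice.
    ScopedTask(const ScopedTask&) = delete;
    ScopedTask& operator=(const ScopedTask&) = delete;

private:
    Recorder& m_rec;
};

// Invented example function with an early return.
inline bool load_level(Recorder& rec, bool file_exists) {
    ScopedTask task(rec, "Level::Load");
    if (!file_exists) {
        return false;  // the destructor still records "end"
    }
    // ... parse the level here ...
    return true;
}
```

Both the failing and the succeeding call leave a balanced begin/end pair in the recorder. Deleting the copy operations is a suggestion of this sketch, not something the article's class does.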
 

Organize Your Instrumentation

Using the calls and concepts presented in this document, you can begin instrumenting at will. This section offers some suggestions for organizing your instrumentation. As you get going, you will quickly realize that you need an organization scheme in order to understand what your code is doing. A consistent naming scheme for your tasks is a good place to start, and the instrumentation API provides some useful organization constructs of its own. With these, you can organize your instrumented code well enough not just to make sense of it, but also to begin making assessments. If your product is a middleware solution, using the following organization schemes will make the lives of your licensees much easier.

A sane task naming scheme will keep your task hierarchy understandable and decipherable. An easy way to keep your task naming scheme sane is to either use the function name or use __FUNCTION__. This is easy to implement, and will not only give you names that you're familiar with, but will also help you identify code sections that might be a bottleneck.
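As an illustration of function-based naming, the sketch below wraps a scoped task in a macro that supplies the enclosing function's name. It uses the standard __func__ identifier rather than __FUNCTION__, which is a compiler extension; the exact string is implementation-defined, though mainstream compilers give the plain function name for a free function. NameLog, NamedScope, and TRACE_FUNCTION are hypothetical names, and the log stands in for the ITT calls so the example compiles without the SDK.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the ITT runtime: collects task names in order.
struct NameLog {
    std::vector<std::string> events;
};

// Scoped task that logs instead of calling __itt_task_begin/__itt_task_end.
class NamedScope {
public:
    NamedScope(NameLog& trace, const char* name) : m_trace(trace) {
        assert(name != nullptr);
        m_trace.events.push_back(std::string("begin ") + name);
    }
    ~NamedScope() { m_trace.events.push_back("end"); }
    NamedScope(const NamedScope&) = delete;
    NamedScope& operator=(const NamedScope&) = delete;

private:
    NameLog& m_trace;
};

// Two-step concatenation so __LINE__ expands before pasting, giving each
// scope object a distinct variable name.
#define TRACE_CONCAT_INNER(a, b) a##b
#define TRACE_CONCAT(a, b) TRACE_CONCAT_INNER(a, b)

// Names the task after the enclosing function.
#define TRACE_FUNCTION(trace) \
    NamedScope TRACE_CONCAT(trace_scope_, __LINE__)((trace), __func__)

inline void update_physics(NameLog& trace) {
    TRACE_FUNCTION(trace);
    // ... step the simulation here ...
}

inline void update_ai(NameLog& trace) {
    TRACE_FUNCTION(trace);
    update_physics(trace);  // nested call shows up as a nested task
}
```

With a macro like this, renaming a function renames its task automatically, so task names in the trace stay in step with the code.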

Once you have a sane naming scheme, the next step is to associate your instrumentation with an appropriate domain. The domain is one of the parameters passed into __itt_task_begin. With domains, you are able to control which tasks are saved into the trace from the Profiles window in the Intel GPA Monitor. For this reason, name your domain something that correctly describes the associated tasks. In the examples above, "Domain.Name" was used, but yours should be clearer. For example, if you're instrumenting a middleware solution, use "CompanyName.ProductName" as the domain name. You can also have multiple domains active and associate tasks appropriately. A list of all domains will appear in the Profiles window.

After you've associated tasks with the appropriate domains, you will be ready to begin making assessments about where time is spent. The concept of task groups is useful for organizing tasks, as well as for understanding performance at a subsystem level. Create a task group per subsystem and associate tasks from that subsystem with the task group. Creating the task group per frame will give you an idea of where time is spent per frame, as well as a visual representation of dips and spikes. If you are instrumenting middleware, create a task group per frame and associate the middleware tasks with it. This will not only help licensees understand how much time your solution takes per frame, but will also help results make more sense when several middleware solutions exist in a single game. Name your task group following your domain naming scheme, CompanyName.ProductName.

Now you should be ready not only to begin adding instrumentation to your game, middleware solution, or whatever code you wish, but to add it in a way that will help you take advantage of the visual representation.

The rest (and then some) of the instrumentation API

In its latest release, Intel GPA supports the Intel® Instrumentation and Tracing Technology (Intel® ITT) API, a unified instrumentation API with other Intel® tools. Intel ITT provides several constructs for organizing code instrumentation. This section will describe the use of three of these constructs: task groups, markers, and relations. Task groups can be useful to describe collections of tasks that may all serve a similar purpose, like AI. A task group could encompass all tasks over several threads that involve AI. Markers represent events in the execution time. Markers can be used to signal specific events such as calling ID3DDevice::Present. Relations can be used to describe complex interactions between tasks, such as dependencies between tasks even across multiple threads.

Task groups are useful to define logical groups of work. For example, AI tasks may be executed on several threads, and thus not easily contained within a task or nested task hierarchy. With task groups, the execution time of AI tasks can be aggregated and easily accessible from Intel GPA Platform Analyzer. Creating task groups is achieved with __itt_task_group, and then tasks can be added to the task group with a call to __itt_relation_add_to_current or __itt_relation_add. Read the Intel GPA SDK Reference to understand the subtle difference between these two functions.
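To make the aggregation idea concrete, here is a small model of what grouping provides. It is an illustration only and does not use or reproduce the ITT API: TaskRecord, its fields, total_time_by_group, and the sample data are all hypothetical. Each finished task carries the name of the group it was added to, and durations are summed per group regardless of which thread ran the task.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical record of one finished task, as a trace might store it.
struct TaskRecord {
    std::string name;       // task name, e.g. "AI::Pathfind"
    std::string group;      // task group the task was added to
    int thread_id;          // thread the task ran on
    long long duration_us;  // measured duration in microseconds
};

// Sums task durations per group, ignoring which thread ran each task.
inline std::map<std::string, long long>
total_time_by_group(const std::vector<TaskRecord>& tasks) {
    std::map<std::string, long long> totals;
    for (const TaskRecord& task : tasks) {
        assert(task.duration_us >= 0);
        totals[task.group] += task.duration_us;
    }
    return totals;
}

// Invented sample data: AI work spread over two threads.
inline std::vector<TaskRecord> sample_frame() {
    return {
        {"AI::Pathfind", "AI", 1, 400},
        {"AI::Perceive", "AI", 2, 250},
        {"Physics::Step", "Physics", 1, 900},
    };
}
```

For the sample frame, the AI group totals 650 microseconds even though its two tasks ran on different threads, which is the kind of per-subsystem figure a task group is meant to surface.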

Markers, as the name suggests, help to point to when discrete events occur in the execution time. For example, the end of a frame is usually defined by a call to D3DDevice::Present. Adding a marker when Present is called adds a visual representation for the end of the frame in the timeline view of Intel GPA Platform Analyzer. Markers can be created with several scopes: global, process, thread, and task. The Intel GPA SDK Reference provides more details for choosing the appropriate scope. Adding a marker is as simple as calling __itt_marker.

    // include the header in the file that will be instrumented
    #include <ittnotify.h>
    static __itt_domain* g_pDomain = __itt_domain_create( "Domain.Name" );

    void System::DoWork( void )
    {
      __itt_string_handle* szStringHandle = __itt_string_handle_create("System::DoWork");
      __itt_task_begin( g_pDomain, __itt_null, __itt_null, szStringHandle );
      // do work
      // end of frame: add the marker, then call Present
      __itt_string_handle* szEndFrameMarker = __itt_string_handle_create("EndFrameMarker");
      __itt_marker( g_pDomain, __itt_null, szEndFrameMarker, __itt_marker_scope_task );
      D3DDevice::Present(); // placeholder for the application's actual Present call
      __itt_task_end( g_pDomain );
    }

As described earlier, tasks are a logical group of work on a specific thread. Tasks are associated with a specific section of code that takes some amount of time to execute. When instrumenting code, it might be important to describe more symbolic relations between tasks beyond log levels and categories. The Intel ITT API provides functionality to describe other semantic relations between tasks such as:

  • __itt_relation_is_dependent_on
  • __itt_relation_is_sibling_of
  • __itt_relation_is_parent_of
  • __itt_relation_is_continuation_of
  • __itt_relation_is_child_of
  • __itt_relation_is_continued_by
  • __itt_relation_is_predecessor_to

With these relations, instrumenting a task scheduling system, for example, can fully describe the distribution of work on different threads and the dependencies between tasks.

As demonstrated above, instrumenting code is not difficult with the Intel GPA Platform Analyzer instrumentation API, and the benefits of understanding how the code is behaving are vast. Being able to understand at a high level what the low-level code is doing is important when targeting heterogeneous platforms. Visualizing work distribution and execution order is important when work can be done on multiple CPU threads, and will become more important when work is distributed amongst various computing devices.

Appendix A : Intel® GPA Monitor

This appendix provides more detail about the Intel® GPA Monitor; in particular, some of the features that are relevant to Intel GPA Platform Analyzer traces. This appendix is by no means an extensive description of all the features of GPA Monitor. Refer to the Intel GPA help file, which is the definitive guide for all Intel GPA features. For the purpose of this document, this appendix covers three features of Intel GPA Monitor:

  1. Running applications from Intel GPA Monitor
  2. Enabling Hardware Context Data
  3. Viewing and enabling domains

 

Running applications from Intel® GPA Monitor
In order for Intel® GPA to attach to your instrumented application and provide the functionality to capture traces and frames and execute state overrides, you must run your application from the Intel GPA Monitor, as shown in Figure 6 below.

Figure 6. Running applications from Intel® GPA Monitor


To run an application, enter the path to the executable, any command line parameters, and the working folder, then click the Run button. The “Add/Edit Profiles…” button opens the Profiles window, which is described in more detail in the following sections.

Enabling Hardware Context Data
Panel 4 in Figure 1 shows the Hardware Context Data of a trace in Intel GPA Platform Analyzer. This panel is not enabled by default, but is easily enabled from the Profiles window of the Intel GPA Monitor, as shown in Figure 7 below.

Figure 7. Enabling hardware context data in the Profiles window


Viewing and enabling domains
Throughout the document, the concept of domains is described and used. The full list of domains associated with an instrumented application can be viewed in the Domains tab of the Profiles window, as shown in Figure 8 below. The application must first be run from the Analyze Application window, as shown in Figure 6 above. While the application is running, open the Domains tab in the Profiles window to view and enable the domains you want to include in the Intel GPA Platform Analyzer trace.

Figure 8. Domains tab in the Profiles window


* All screenshots in this document were captured using Intel® Graphics Performance Analyzers 4.3


About the Author

Omar A. Rodriguez is a software engineer in the Intel Software and Services Group, where he supports Intel graphics solutions in the Visual Computing Software Division. He holds a B.S. in Computer Science from Arizona State University. Omar is not the lead guitarist for the Mars Volta.

For more complete information about compiler optimizations, see the Optimization Notice.