Maximum FPS: Creating VTune™ Analyzer Performance DLLs

October 6, 2009 12:00 AM PDT



Introduction

Welcome back to Maximum FPS! Last month we discussed some tips for speeding up your code. This month's column is written by Will Damon, who examines a way to quickly improve the quality and performance of applications by creating performance DLLs that plug into the Intel® VTune™ Performance Analyzer. Though aimed mainly at performance tuning, performance DLLs may also help speed up development, debug code, and even add features. So, let's welcome Will as he gets started, and we'll see how all this magic works and how it can work for you!


Overview

Intel® VTune™ Performance Analyzer is a software package used for achieving higher levels of performance in software running on Intel processors. Part of what it can do is track and graph performance counters that exist on the processor. This column takes a slightly different approach and looks at a similar construct in software space, the performance DLL. With this tool, you as the developer can define specific parts of your application to track, graph the results, seek out and destroy performance bottlenecks, and potentially find bugs you didn't even know existed!

VTune Performance Analyzer 4.5 or greater uses these performance DLLs to monitor performance counters in hardware or applications. A performance counter is a variable that may be implemented in hardware, or software. Performance DLLs enable the tracking of performance counters in applications, operating systems, device drivers, and hardware; however, for the purposes of this column, we are only concerned with monitoring an application.

So how does it all work? How does VTune analyzer know how a specific application is performing with respect to very specific data and/or updates? How can I monitor the performance of something like world-updates in my Direct3D* application? What else can I monitor for performance fine-tuning? Let's see if we can start to answer these questions by taking a high-level look at the architecture in order to see how the pieces fit together. Then we will move into a step-by-step tutorial to show exactly how you can start utilizing this great technology.


Architecture

The high-level view of interactions between VTune™ analyzer, an application, and a performance DLL is relatively simple, and is best shown with a diagram, as Figure 1 depicts.

Figure 1: Process Interaction

When VTune analyzer starts, it loads several performance DLLs, one or more of which can be custom built. Then, when a sampling session is run, your application (myApp) is launched and may also load one or more of the same performance DLLs. During runtime, myApp updates the specified performance counters, and at each sampling interval VTune analyzer collects the information from the performance DLL. The information that VTune analyzer collects is later used to generate a graph, called a chronology, that visually displays how an application is performing over time. Your application is responsible for loading any performance DLLs and calling the appropriate update functions. In our example, we will dynamically load (and unload) our own performance DLL.

Now that we have a general understanding of what is going on under the hood, let's see how we can use this to monitor our application. We'll illustrate the concept by developing an example performance DLL for a simple Direct3D* application.


5 Steps to Creating a Performance DLL

The tools we will use here include Intel® VTune™ Performance Analyzer 5.0, and Microsoft* Visual C++* 6.0. You can download a fully functional 30-day evaluation copy of VTune analyzer from the Intel website at http://www.intel.com/cd/software/products/asmo-na/eng/219690.htm. Assuming we have the necessary tools, an application to monitor, and some ideas of what we want to track, our next step is to create a performance DLL. Here is the list of the steps we are going to go through:

  1. Download and Install the Performance DLL SDK
  2. Create a new project in Visual C++ using the Performance DLL AppWizard
  3. Customize the project
  4. Compile and build the project
  5. Debug and Test

 

Step 1: Download and Install the Performance DLL SDK

This step is easy, and will only take us a minute.

The performance DLL SDK, provided by Intel, is free to download at http://software.intel.com/en-us/articles/intel-vtune-amplifier-xe/. The entire SDK is only 1 MB, so it will not take long to download. Also on the site is a white paper that introduces performance DLLs and contains all the background information we covered earlier. Once the download is complete, double click the self-extracting archive to run the setup program.

After the SDK is installed we are ready to move on to step 2.

Step 2: Create a new project

Start up Visual C++*, select File->New, and select Performance DLL Wizard. Enter the name of the project (I called the project "myDll"), and click OK. A dialog box will come up telling you about the project you just created. Read it over, click OK, and MSVC will generate the code for your project. If you are interested in the specifics of all the generated files, refer to the SDK help; we will modify myDll.rc, myDll.cpp, and myDll.def.

Figure 2: Create a new project with the Performance DLL Wizard

Step 3: Customize the project

The SDK help is fairly thorough about explaining what we need to do to customize our project, but for the sake of completeness we will step through the process here. Customizing the project is a two-step process:

  1. Define the objects and counters the performance DLL will export
  2. Customize the necessary methods of CMyProjectModule (in our case CMyDllModule)

 

Defining Objects and Counters

Let's start by selecting the Resource View tab in the workspace window. Here we can create resource strings for all the objects and counters we wish to support. For this example, we are modifying one of the pre-defined counters (I modified two in the example project). Right click on MYDLL Counter 1, select properties, and change the caption to whatever you like. I chose to make the first counter a frame rate counter for this example. Next, right click on "Text explaining the meaning of Counter 1", and change the caption to something meaningful. Figure 3 shows the result:

Figure 3: Modifying the string table

Now select the File View tab, put on your coding hat, and open myDll.cpp.

Customizing CMyDllModule Methods

For our example, almost everything we need is already set up. By default the AppWizard builds one object definition and four performance counter definitions into the performance DLL project. We are going to use the provided definitions in our example to implement our first counter.

We begin by adding a static member variable to the main DLL class, CMyDllModule. This variable will be the frame rate performance counter. We make this variable static because we want to ensure that only one instance of the counter exists no matter how many processes instantiate CMyDllModule. In order to track, update, and reset a single variable among multiple processes, namely the VTune analyzer and the application we want to monitor, we must set up a shared data segment in the performance DLL. This is achieved by adding a few #pragma statements after the class definition to specify what data is shared.
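As a rough illustration of such a shared data segment, the sketch below uses the Microsoft-specific data_seg pragma together with a linker /SECTION directive. The section name .shared and the stand-in class body are assumptions made for the example, not the wizard-generated code. The pragmas are guarded so the fragment still compiles with other compilers, where the counter is simply an ordinary static variable.

```cpp
#include <cstdint>

// Stand-in for the wizard-generated class; only the new member is shown.
class CMyDllModule
{
public:
    static std::uint32_t mFrameCnt;   // frame counter shared across processes
};

#ifdef _MSC_VER
#pragma data_seg(".shared")           // hypothetical section name
#endif

// The variable is explicitly initialized so that it is placed in the
// named section rather than in the default uninitialized-data section.
std::uint32_t CMyDllModule::mFrameCnt = 0;

#ifdef _MSC_VER
#pragma data_seg()                    // back to the default data section
// Mark the section Read, Write, Shared so every process that maps the DLL
// sees the same copy of mFrameCnt.
#pragma comment(linker, "/SECTION:.shared,RWS")
#endif
```

Only data that is safe to share between processes belongs in such a section; pointers, for instance, are meaningful only in the process that created them.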

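Because more than one process reads and writes the shared counter, access to it must be synchronized, and I use a mutex object for that. For background on mutex objects, see the MSDN Library at http://msdn.microsoft.com/library/default.asp*. The snippet below shows how I set up the mutex object.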

HANDLE gMutex;

// DllMain - the global instance's constructor will run before this
BOOL WINAPI DllMain(HINSTANCE hinst, DWORD reason, LPVOID reserved)
{
    TCHAR sz_msg[100];

    gMutex = CreateMutex(NULL, FALSE, "myDll mutex");
    if (NULL == gMutex)
    {
        wsprintf(sz_msg, "CreateMutex error: %d.", GetLastError());
        MessageBox(NULL, sz_msg, "Error", MB_OK);
        return FALSE;
    }
    // ...

 

I declare a global HANDLE for the mutex object, and create the mutex in DllMain(). Because the mutex is named, the first process to call CreateMutex() actually creates the mutex object; each subsequent call returns a handle to the existing object.

Next we add the code into CMyDllModule::collect() to write the performance counter to VTune analyzer. At each sampling interval, VTune analyzer calls collect() for each enabled performance DLL. After that, we implement the method to update our performance counter, and we're done. Don't forget to add any performance counter update methods to myDll.def; otherwise, the application you are monitoring will fail to get the address of the update method. Note that because we are counting frame updates, we also reset the performance counter, mFrameCnt, to zero each time it is collected. Below is a snippet that shows what I added to CMyDllModule::collect().

void CMyDllModule::collect(DWORD object, DataStream& dstream)
{
    // ...
    else if (object == MY_OBJECT)
    {
        // ...
        for (int idx = 0; idx < NUM_COUNTERS; idx++)
        {
            if (g_counters[idx].active)
            {
                if (0 == idx)
                {
                    DWORD wait_result = WaitForSingleObject(gMutex, 5000L);
                    if (WAIT_OBJECT_0 == wait_result)
                    {
                        dstream.write_counter32(mFrameCnt);
                        mFrameCnt = 0;
                        ReleaseMutex(gMutex);
                    }
                }
                // ...
                else
                    dstream.write_counter32(idx+1); // COUNTER_X has data value X
            }
        }
        // ...

 

Since we know that the zeroth counter is the frame counter, we can hardcode it into the collection. If the current counter is active and is the zeroth counter, we wait for the mutex object. If we gain ownership of the mutex before the five-second timeout expires, we write the performance counter and reset it; otherwise, it simply doesn't get collected or reset for that interval. I could have added some extra error checking and recovery, but this is sufficient for demonstration purposes. After we reset the performance counter, we release the mutex so other processes (e.g. myApp) can gain access to the shared data. Implementing the performance counter update method works in a similar fashion: wait for the mutex object, increment the performance counter, and release the mutex.
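The update and collect sides of this pattern can be sketched in portable C++ as a single-process analogue. This is not the DLL code itself: std::mutex stands in for the named Win32 mutex, a plain global stands in for the shared mFrameCnt, and a non-blocking try-lock stands in for the five-second timed wait. The names updateFrameCount and collectFrameCount are hypothetical.

```cpp
#include <cstdint>
#include <mutex>

namespace sketch {

std::mutex gLock;              // stands in for the named Win32 mutex (gMutex)
std::uint32_t gFrameCnt = 0;   // stands in for the shared mFrameCnt

// Called by the application once per rendered frame:
// wait for the lock, increment the counter, release the lock.
void updateFrameCount()
{
    std::lock_guard<std::mutex> guard(gLock);  // released when guard goes out of scope
    ++gFrameCnt;
}

// Called by the collector once per sampling interval.
// Returns true and stores the count if the lock was obtained;
// returns false, leaving the counter untouched, if it was not.
bool collectFrameCount(std::uint32_t& out)
{
    std::unique_lock<std::mutex> guard(gLock, std::try_to_lock);
    if (!guard.owns_lock())
        return false;          // skipped: neither collected nor reset
    out = gFrameCnt;
    gFrameCnt = 0;             // counting restarts for the next interval
    return true;
}

} // namespace sketch
```

In the real performance DLL, the update function would use WaitForSingleObject() and ReleaseMutex() on gMutex, as collect() does, and its name would need an entry in the EXPORTS section of myDll.def so the monitored application can look it up.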

Step 4: Compile and Build the Project

Now that we have all the pieces in place, we can compile and build our performance DLL! If for some reason you had to reconstruct the project workspace, the SDK help has some tips on setting the correct compile and link settings, otherwise you are good to go.

Step 5: Debug and Test

In order to debug a performance DLL, we first have to set up the system to do so. To do this, we add a DWORD value named DebugChron, set to 0x1, under the VTune analyzer Globals registry key: HKEY_CURRENT_USER\Software\VB and VBA Program Settings\VTune 4.0\Globals. Once we are convinced the performance DLL is correct, we can restore the system to "normal operation mode" by setting DebugChron to 0x0. If we want to launch a performance DLL in debug mode from within Visual C++, we must specify vtunecca.exe as the executable to debug. When VTune analyzer starts under normal operation mode, vtunecca launches as a separate process and is responsible for collecting data. To debug, we'll have to launch the debugger, and then run VTune analyzer. Keep in mind that we will not be able to fully debug the performance DLL until we modify our application to update counters.

For an application to monitor, see "Fast AGP Writes for Dynamic Vertex Data", or feel free to monitor any project whose source you have access to. I included a slightly modified version of Dean's project in the corresponding download for this column. Using a performance DLL from an application's standpoint only requires a few small adjustments:

  • Call LoadLibrary("myDll") where it makes sense, usually in a class constructor or one-time initialize method
  • Retrieve a pointer to the address of any performance counter update methods via GetProcAddress()
  • Call the performance counter update method where appropriate
  • Call FreeLibrary() when you no longer need the library loaded in your application. This is usually upon class destruction via the class destructor or the class destroy() method
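The four adjustments above can be sketched as a small wrapper class. This is an illustration under assumptions rather than the code from the download: the class name PerfCounterClient, the export name UpdateFrameCount, and its void() signature are all hypothetical, and the sketch uses modern C++ rather than Visual C++ 6.0 syntax. The Win32 calls are guarded so that on other platforms, or when the DLL is missing, the client simply stays inactive.

```cpp
#ifdef _WIN32
#include <windows.h>
#endif

// Signature assumed for the exported update function.
typedef void (*UpdateFn)();

class PerfCounterClient
{
public:
    // Load the DLL and look up the update function once, at construction.
    explicit PerfCounterClient(const char* dllName)
    {
#ifdef _WIN32
        mDll = LoadLibraryA(dllName);
        if (mDll)
            mUpdate = reinterpret_cast<UpdateFn>(
                GetProcAddress(mDll, "UpdateFrameCount"));  // hypothetical export
#else
        (void)dllName;  // not a Windows build: the client stays inactive
#endif
    }

    // Unload the DLL when the client is destroyed.
    ~PerfCounterClient()
    {
#ifdef _WIN32
        if (mDll)
            FreeLibrary(mDll);
#endif
    }

    PerfCounterClient(const PerfCounterClient&) = delete;
    PerfCounterClient& operator=(const PerfCounterClient&) = delete;

    // True only if both the DLL and the export were found.
    bool active() const { return mUpdate != nullptr; }

    // Call once per frame; does nothing if the DLL or export is missing.
    void frameRendered() const { if (mUpdate) mUpdate(); }

private:
    UpdateFn mUpdate = nullptr;
#ifdef _WIN32
    HMODULE mDll = nullptr;
#endif
};
```

An application would construct one PerfCounterClient("myDll") during initialization and call frameRendered() at the end of each frame. Treating a failed load as a no-op lets the same build run with or without the performance DLL installed.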

 

That's all there is to it. Once that code is in place, we are ready to launch VTune analyzer, set some options, and check out the cool chronologies of our application! We can also debug the performance DLL if things like performance counters don't seem to be correct (e.g. a graph for frame rates being flat at 0 could mean that the frame rate counter isn't being updated, or collected).
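When reading the frame counter's chronology, keep in mind that collect() resets the counter at every sample, so the value written each time is the number of frames counted during one sampling interval, not frames per second. The helper below is a minimal sketch of the conversion; the function name and the idea of passing the interval in milliseconds are assumptions, and the interval itself is whatever the sampling session was configured with.

```cpp
// Converts a per-interval frame count into frames per second.
// intervalMs is the sampling interval in milliseconds (an assumed input).
double framesPerSecond(unsigned long framesInInterval, unsigned long intervalMs)
{
    if (intervalMs == 0)
        return 0.0;  // avoid dividing by zero on a bad interval
    return framesInInterval * 1000.0 / intervalMs;
}
```

For example, 30 frames counted over a 500 ms interval corresponds to 60 frames per second.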


Setting Options in Intel® VTune™ Performance Analyzer

Here are some configuration steps we need to follow in order to tell VTune analyzer to use our performance DLL:

  • Under Configure->Options select Sampling: Advanced and check Collect Chronology Data (Figure 4)
  • Select Sampling: Chronology Objects, and check MYDLL (Figure 5)
  • Hit close to get out of the dialog
  • Run a sampling session
  • When your application exits, or the sampling session completes, you can select Chronologies in the workspace view bar
  • Select myDll, check the counters you want graphed (Figure 6), and finally select the Graphs tab

 

Congratulations! You are now looking at the recorded run-time performance of your application! If you are satisfied that the DLL is working properly, don't forget to return your system to "normal operation mode" for further testing of your application.

Figure 4: Advanced options to select collect chronology data

 

Figure 5: Check "MYDLL Object" under Chronology Objects

 

Figure 6: Select counters to view before selecting the Graphs tab

 



About the Authors

Will Damon was a Technical Marketing Engineer within Intel's Software Solutions Group. He has a bachelor's degree in Computer Science from Virginia Polytechnic Institute and State University*, where he graduated with honors. He has been with Intel for over a year, helping game developers enable their titles to achieve the highest performance possible on Intel® Pentium® 4 processor-based PCs. He welcomes email regarding optimization, mathematics, physics, artificial intelligence, or anything else related to real-time 3D graphics, and gaming.

Dean Macri's research has focused on tessellating NURBS surfaces in real-time, simulating cloth surfaces in real-time and procedurally generating 3D content. Currently, he is helping game developers achieve maximum performance in their titles.