Analyzing and Optimizing Performance of Windows* Store Apps Using Intel® VTune™ Amplifier and GPUView


The objective of this article is to show a method of analyzing and optimizing the frame rate performance of a Windows* Store app using the VTune™ Amplifier performance profiler, GPUView, and the Visual Studio* Debugger tools.

Table of Contents

  1. Introduction
  2. Reference Material
  3. Analyzing Performance Using VTune Amplifier
  4. Analyzing GPU Performance Using GPUView
  5. Summary

1. Introduction

Windows Store apps run on multiple form factors from Ultrabook™ devices to tablets. Users expect a compelling user experience (UX) on any form factor. For example, if a Windows Store app runs smoothly on a laptop, then users expect the app to run smoothly on a tablet as well. To demonstrate how to optimize and analyze performance, a Windows Store app called MathMania was created in Visual Studio. MathMania is an educational game that involves constructing math equations by selecting tiles that are moving on the screen.

2. Reference Material

Below are links to Intel® Developer Zone blogs discussing touch APIs, collision detection, image kinetics, and game loop APIs using MathMania as a demonstration vehicle:

3. Analyzing Performance Using VTune Amplifier

3.1 Threads

Windows Store apps have a UI thread and a composition thread. The composition thread is able to help out with rendering if the UI thread is blocked. To learn more about the UI thread, composition thread, and responsiveness, please see the links below:

The CPU is in charge of giving the GPU work to do. If the CPU is too busy with computations, then the GPU may be blocked from doing enough work. “Enough work” is characterized as the GPU achieving 60 frames per second (fps), where fps is a rendering benchmark. The Visual Studio debugger has the capability to capture real-time frame rates for the UI thread and composition thread. To learn more about the API for using this capability, please see the link below:

3.2 Vertical Synchronization (VSync) Interval

VSync allows video rendering to occur only as fast as the screen refresh rate. VSync is always enabled in C#-based Windows Store apps like MathMania. As a consequence, whenever a frame takes longer than one VSync interval to prepare, MathMania misses that refresh and its frame rate drops below 60 fps. To learn more about VSync, please see the link below:
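The arithmetic behind the 60 fps target can be sketched in a few lines. The Python below is not MathMania code (MathMania is written in C#). It assumes a 60 Hz display and a simplified model in which every frame takes the same time to prepare and a late frame waits for the next refresh. The name effective_fps is hypothetical.

```python
import math

REFRESH_HZ = 60.0                        # assumed display refresh rate
VSYNC_INTERVAL_MS = 1000.0 / REFRESH_HZ  # about 16.67 ms per refresh


def effective_fps(frame_time_ms, refresh_hz=REFRESH_HZ):
    """Frame rate seen on screen when every frame takes frame_time_ms with VSync on."""
    interval_ms = 1000.0 / refresh_hz
    # A frame that is not ready at a refresh waits for the next one,
    # so its cost is rounded up to a whole number of VSync intervals.
    intervals_used = max(1, math.ceil(frame_time_ms / interval_ms))
    return refresh_hz / intervals_used


if __name__ == "__main__":
    for ms in (10.0, 20.0, 40.0):
        print(f"{ms:5.1f} ms per frame -> {effective_fps(ms):.0f} fps")
```

Under this model a frame that needs 20 ms, only slightly more than one interval, is displayed at 30 fps rather than 50 fps, which is why small overruns are so visible to the user.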

3.3 Stopwatch API

When analyzing performance in MathMania, the Stopwatch API is used to determine how long the MoveTiles routine takes to complete and how much time elapses between calls to the Render routine. In MathMania, Render is the routine invoked before every frame. Render invokes MoveTiles to compute the next position of the game tiles on the screen. Figure 3.1 below shows the Visual Studio Debugger with a breakpoint that triggers if the frame time is longer than the VSync interval of approximately 17 ms (1/60 of a second at a 60 Hz refresh rate). It was determined that calls to MoveTiles were taking longer than a VSync interval.

Figure 3.1: Using a Stopwatch at Runtime (image taken from Visual Studio* 2012)
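The measurement pattern described above can be sketched as follows. This is a Python analogue, not the C# code from Figure 3.1: time.perf_counter stands in for the .NET Stopwatch, and the FrameTimer class, its render method, and the slow_frames list are hypothetical names. Instead of hitting a debugger breakpoint, the sketch records every frame that exceeds the VSync budget.

```python
import time

VSYNC_INTERVAL_MS = 1000.0 / 60.0  # assumed 60 Hz refresh, about 16.67 ms


class FrameTimer:
    """Times the gap between render calls and the duration of the tile update."""

    def __init__(self, budget_ms=VSYNC_INTERVAL_MS, clock=time.perf_counter):
        self.budget_ms = budget_ms
        self._clock = clock          # injectable so the timer can be tested
        self._last_render = None
        self.slow_frames = []        # (gap_ms, move_ms) for frames over budget

    def render(self, move_tiles):
        # Time elapsed since the previous call to render (None on the first call).
        now = self._clock()
        gap_ms = None
        if self._last_render is not None:
            gap_ms = (now - self._last_render) * 1000.0
        self._last_render = now

        # Time spent inside the tile update itself.
        start = self._clock()
        move_tiles()
        move_ms = (self._clock() - start) * 1000.0

        # Equivalent of the conditional breakpoint: flag frames over budget.
        over_gap = gap_ms is not None and gap_ms > self.budget_ms
        if over_gap or move_ms > self.budget_ms:
            self.slow_frames.append((gap_ms, move_ms))
        return gap_ms, move_ms
```

Measuring both values separately helps distinguish a slow update routine from a frame that was delayed for some other reason.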

To learn more about the Stopwatch API, please see the link below:

3.4 Performance Analysis and Optimization of CPU Code with VTune Amplifier

To analyze and optimize the performance of MathMania, the VTune™ Amplifier XE 2013 performance profiler was used. Figure 3.2 below shows system traces captured while MathMania was running, using the Advanced Hotspots analysis in the VTune Amplifier. It is observed that the MoveTiles routine is the top hotspot for CPU computation time and that the overall CPU idle time is very high, as shown in the Elapsed Time area. To learn more about VTune Amplifier XE 2013, please see the link below:

Figure 3.2: Analysis Summary of MathMania (Screenshot taken from VTune™ Amplifier XE 2013)

Figure 3.3 below provides a deeper picture of the call stack and how the CPU and GPU components relate. It is observed that the GPU Usage bar at the bottom of the figure shows the GPU is waiting for work. In addition, the CPU Time bar for the computational work is not high. In conclusion, the figure shows that the routine MathMania::GameBoard::MoveTiles has the largest CPU utilization relative to other API calls.

Figure 3.3: Bottom-up View of MathMania (Screenshot taken from VTune™ Amplifier XE 2013)

Now that the suspect function has been identified in the call stack, its source code can be viewed by double-clicking the entry pertaining to MoveTiles. Below is pseudocode showing the high-level algorithm used in the code snippet in Figure 3.4.

For each tile x
	Clear collision flags and adjust tile velocity if recently flung
	For each tile y
		If y is a valid tile that isn’t x
			If x and y are about to collide
				Change movement direction so that the tiles move apart

	Move tile x based on its (possibly changed) movement direction

Observe that line 167 in Figure 3.4 below corresponds to line 5 of the pseudocode above (the collision test). The VTune Amplifier results show that this line of code takes the most CPU time. It is determined that the UX of MathMania is limited not only by the MoveTiles execution time, but also by the work levels of the GPU and the CPU.

Figure 3.4: Source Code Analysis (Screenshot taken from VTune™ Amplifier XE 2013)***
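To make the cost of the pseudocode concrete, here is a minimal Python sketch of the same nested loop. It is not the MathMania source shown in Figure 3.4: the Tile fields, the box-overlap test in about_to_collide, and the rule for reversing velocity are all assumptions, and the step that clears collision flags and adjusts fling velocity is omitted. The function returns the number of collision tests so that the quadratic growth is visible.

```python
from dataclasses import dataclass


@dataclass
class Tile:
    x: float
    y: float
    vx: float
    vy: float
    size: float = 1.0
    valid: bool = True


def about_to_collide(a, b):
    """Assumed criterion: the tiles' boxes overlap at their next positions."""
    reach = (a.size + b.size) / 2
    dx = (a.x + a.vx) - (b.x + b.vx)
    dy = (a.y + a.vy) - (b.y + b.vy)
    return abs(dx) < reach and abs(dy) < reach


def move_tiles(tiles):
    """One update step; returns how many pairwise collision tests were run."""
    checks = 0
    for a in tiles:                      # for each tile x
        for b in tiles:                  # for each tile y
            if b is a or not b.valid:    # y must be a valid tile that isn't x
                continue
            checks += 1
            if about_to_collide(a, b):
                # Reverse any velocity component that points toward b.
                if (b.x - a.x) * a.vx > 0:
                    a.vx = -a.vx
                if (b.y - a.y) * a.vy > 0:
                    a.vy = -a.vy
        # Move tile x using its (possibly changed) direction.
        a.x += a.vx
        a.y += a.vy
    return checks
```

With n valid tiles the collision test runs n × (n − 1) times per frame, for example 90 times for 10 tiles, which is consistent with that line dominating the CPU time attributed to MoveTiles.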

4. Analyzing GPU Performance Using GPUView

GPUView is a free tool for analyzing GPU performance that comes with the Performance Toolkit included in the Windows* Assessment and Deployment Kit (ADK). To learn more about the ADK and GPUView please see the links below:

Windows has a window manager called the dwm.exe process. Dwm.exe provides a baseline to help determine how an app is rendering. To learn more about dwm.exe, please see the link below:

GPUView gives another perspective on how CPU and GPU performance relate by showing the render rate of dwm.exe. Figure 4.1 below shows the output from GPUView while MathMania is running. There is a wide window where dwm.exe refreshes the screen at approximately 60 fps, as discussed in one of the blogs mentioned in Section 2 above. In this window, there are noticeable gaps in the GpuWork performed. Additionally, GpuWork appears only where the Context CPU Queue entry shows CPU work. At the top of the figure, the Flip Queue indicates whether the GPU is working or idle. These observations suggest that the frame rate for MathMania can be improved by giving the CPU more work.

Figure 4.1: GPUView Snapshot Relating CPU, GPU, and dwm.exe (Image taken from GPUView)

Figure 4.2 below shows a zoomed-in view on the graph. This view helps compare the intervals for GpuWork and VSync State. It is observed that the GPU is missing VSync intervals, that the CPU is not blocking the UI thread, and that the CPU is underutilized.

Figure 4.2: Zooming into the Graph in GPUView
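The idea of a missed VSync interval can also be expressed numerically. The Python sketch below assumes a list of frame presentation times in milliseconds, however they are obtained; it does not read GPUView traces, and the function name and the sample timestamps are made up for illustration.

```python
def count_missed_vsyncs(present_times_ms, refresh_hz=60.0):
    """Count refreshes that passed without a new frame being presented."""
    interval_ms = 1000.0 / refresh_hz
    missed = 0
    for prev, cur in zip(present_times_ms, present_times_ms[1:]):
        # Round each gap to a whole number of VSync intervals;
        # every interval beyond the first is a missed refresh.
        intervals = max(1, round((cur - prev) / interval_ms))
        missed += intervals - 1
    return missed


if __name__ == "__main__":
    # Hypothetical timestamps: one gap of two intervals, one of three.
    sample = [0.0, 16.7, 33.3, 66.7, 83.3, 133.3]
    print(count_missed_vsyncs(sample))
```

A count of zero over a long capture corresponds to a steady 60 fps on a 60 Hz display; any other value marks places worth zooming into in the timeline.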

5. Summary

This article showed a method of analyzing and optimizing the performance of MathMania. Tools such as VTune Amplifier, GPUView, and the Visual Studio Debugger were used in the analysis. The analysis presented graphs obtained from the tools, along with algorithm details, to help isolate the cause of lower frame rates and determine where optimizations could be made.

About the Author

David Medawar is a Software Engineer with Intel Corporation. He worked on app enabling for Windows 8 and Android* and now on development for design rule checkers. He has been with the company for eight years, and his former development experience includes system BIOS and boot-loader enabling.

About the Editor

Mike Rylee is a Software Engineer with Intel Corporation. He currently works on app enabling for Windows 8 and Android.


Intel, the Intel logo, Ultrabook, and VTune are trademarks of Intel Corporation in the U.S. and/or other countries.
Copyright © 2013 Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.


**This sample source code includes XAML code automatically generated by Visual Studio IDE and is released under the Intel OBL Sample Source Code License (MS-LPL Compatible)
***This sample source code is released under the Microsoft Limited Public License (MS-LPL)