Using Intel® Graphics Performance Analyzer (GPA) to analyze Intel® Media Software Development Kit-enabled applications
The 2nd Generation Intel® Core™ family of processors provides hardware-accelerated media encode, decode and preprocessing capabilities called Intel® Quick Sync Video technology. The Intel® Media Software Development Kit (Intel® Media SDK) provides developers with a standard application programming interface (API) to create high-performance video solutions for consumer and professional uses based on Intel® Quick Sync Video technology.

The Intel Media SDK provides developers with:
- Highly optimized routines for delivering maximum video performance on 2nd Generation Intel Core processors.
- Built-in support for upcoming video capabilities in future Intel® platforms.
- Faster time to market with easy-to-access APIs and reduced development time.
Intel® Graphics Performance Analyzers (Intel® GPA)
Intel® GPA is a suite of software tools that provides platform-level graphics performance analysis to help developers optimize application performance. Intel GPA has the following major components:
- Intel® GPA Frame Analyzer is a powerful, intuitive, best-in-class single frame analysis and optimization tool.
- Intel® GPA System Analyzer Heads-up Display (HUD) and Standalone provide straightforward initial analysis and interactive Microsoft Direct3D* pipeline state overrides.
- Intel® GPA Platform Analyzer provides a timeline view for analysis of tasks, threads, Microsoft DirectX*, OpenCL™ and GPU-accelerated media applications in context.
- Intel® GPA Media Performance Analyzer: See how efficiently your code utilizes hardware acceleration on Intel® Core™ processor-based PCs with Intel® HD Graphics, or run real-time analysis of encode and decode metrics to get in-depth, real-time media performance analysis.

Figure 1: Intel® GPA
If Intel GPA is installed on the system, there is usually an Intel GPA icon on the taskbar. If there is no icon, start Intel GPA from the Microsoft Windows* Start menu; this places the icon on the taskbar. Right-click the Intel GPA icon; one of the options is "Media Performance..." (Figure 1). Select this option to open the Intel® GPA Media Performance Analyzer window, shown below:
Figure 2: Intel® GPA Media Performance Analyzer
To demonstrate how to use the Intel® GPA Media Performance Analyzer to optimize an Intel Media SDK-enabled application, we will use the sample encode application shipped with Intel Media SDK distributions. The application uses two Intel Media SDK API calls to encode a frame:
- Submit the frame to Intel HD Graphics for encoding:
  MFXVideoENCODE_EncodeFrameAsync(mfxSession session, mfxEncodeCtrl *ctrl, mfxFrameSurface1 *surface, mfxBitstream *bs, mfxSyncPoint *syncp);
- Receive the encoded frame:
  MFXVideoCORE_SyncOperation(mfxSession session, mfxSyncPoint syncp, mfxU32 wait);
MFXVideoENCODE_EncodeFrameAsync() is an asynchronous (non-blocking) API call used by the application to submit an uncompressed (NV12) frame to the Intel Media SDK for encoding. It takes the session handle, encode settings (mfxEncodeCtrl), input frame (mfxFrameSurface1), and output buffer (mfxBitstream), and returns a sync point (mfxSyncPoint). The application later passes that sync point to MFXVideoCORE_SyncOperation() to retrieve the encoded frame from the Intel Media SDK.
Consider the following two scenarios to understand how the data flows in Intel HD Graphics between different encode stages, and how the Intel GPA Media Performance Analyzer helps to optimize the application.
- Scenario A: we will use the simple encode application to
  - Read a raw frame from the input file;
  - Submit the frame to Intel Media SDK using EncodeFrameAsync API;
  - Retrieve the encoded frame from Intel Media SDK;
  - Write the encoded frame to the file.
- Scenario B: we will use the simple encode application to
  - Read 3 raw frames from the input file;
  - Submit these frames to Intel Media SDK using EncodeFrameAsync API;
  - Retrieve the encoded frames from Intel Media SDK;
  - Write the encoded frames to the file.
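The control flow of the two scenarios can be sketched as a single loop that differs only in how many frames it keeps in flight. This is a minimal illustration, not the sample application's code: `FakeEncoder`, `drain`, and `encode_with_depth` are hypothetical names, and the `submit`/`sync` methods are stand-ins that only record call order where a real application would call MFXVideoENCODE_EncodeFrameAsync() and MFXVideoCORE_SyncOperation().

```cpp
#include <cassert>
#include <deque>
#include <vector>

// Hypothetical stand-in for an encode session. It does no encoding; it only
// tracks how many frames are in flight and the order they are retrieved in.
struct FakeEncoder {
    int next_sync_point = 0;        // stand-in for an mfxSyncPoint value
    int in_flight = 0;              // frames submitted but not yet retrieved
    int max_in_flight = 0;          // peak number of frames in flight
    std::vector<int> output_order;  // sync points in retrieval order

    // Models the non-blocking submit (EncodeFrameAsync in the real SDK).
    int submit(int frame_index) {
        (void)frame_index;
        ++in_flight;
        if (in_flight > max_in_flight) max_in_flight = in_flight;
        return next_sync_point++;
    }

    // Models the blocking retrieve (SyncOperation in the real SDK).
    void sync(int sync_point) {
        assert(in_flight > 0);
        --in_flight;
        output_order.push_back(sync_point);
    }
};

// Retrieve every pending frame, oldest first.
void drain(FakeEncoder& enc, std::deque<int>& pending) {
    while (!pending.empty()) {
        enc.sync(pending.front());
        pending.pop_front();
        // write the encoded frame to the output file (omitted)
    }
}

// Submit frames in batches of `depth`, then retrieve the whole batch.
// depth == 1 mirrors Scenario A; depth == 3 mirrors Scenario B.
FakeEncoder encode_with_depth(int frame_count, int depth) {
    assert(frame_count >= 0 && depth >= 1);
    FakeEncoder enc;
    std::deque<int> pending;
    for (int frame = 0; frame < frame_count; ++frame) {
        // read raw frame `frame` from the input file (omitted)
        pending.push_back(enc.submit(frame));
        if (static_cast<int>(pending.size()) == depth) drain(enc, pending);
    }
    drain(enc, pending);  // retrieve any frames left in a partial batch
    return enc;
}
```

Calling `encode_with_depth(300, 1)` never has more than one frame in flight, while `encode_with_depth(300, 3)` keeps up to three. In a real application, each in-flight frame would presumably also need its own input surface and output bitstream buffer, which this sketch leaves out.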
The input YUV file is 1920x1080 with 300 frames and is encoded to H.264/AVC at 8 Mbps with Constant Bit Rate (CBR) settings. The important encoder configuration parameters for the Intel Media SDK are set as follows:

    // set mfx parameters
    mfxEncParams.mfx.CodecId = MFX_CODEC_AVC;
    mfxEncParams.IOPattern = MFX_IOPATTERN_IN_SYSTEM_MEMORY;
    mfxEncParams.mfx.FrameInfo.ChromaFormat = MFX_CHROMAFORMAT_YUV420;
    mfxEncParams.mfx.FrameInfo.FourCC = MFX_FOURCC_NV12;
    mfxEncParams.mfx.FrameInfo.PicStruct = MFX_PICSTRUCT_PROGRESSIVE;
    mfxEncParams.mfx.TargetUsage = MFX_TARGETUSAGE_BALANCED;
    mfxEncParams.mfx.RateControlMethod = MFX_RATECONTROL_CBR;
    mfxEncParams.mfx.NumThread = 0;
    mfxEncParams.mfx.EncodedOrder = 0;
    mfxOption.CAVLC = MFX_CODINGOPTION_OFF; // CABAC

Figure 3 shows the changes in the Intel GPA Media Performance Analyzer window when we run the simple encode application for Scenario A:

Figure 3: Scenario A - Intel® GPA Media Performance Analyzer View

Figure 4: Scenario B - Intel® GPA Media Performance Analyzer View
Table 1: Performance Comparison of Scenario A and Scenario B
| | GPU Usage | EU Usage | MFX Usage | Frames/sec |
| --- | --- | --- | --- | --- |
| Scenario A (single synchronized frame encoding) | 62% | 51% | 10% | 41* fps |
| Scenario B (multiple asynchronous frame encoding) | 98% | 83% | 25% | 108* fps |
Note: * The performance numbers above in no way indicate the minimum or maximum performance that can be achieved with Intel Quick Sync Video. They were obtained with a sample application for the purpose of understanding the Intel GPA tool.
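To read the Frames/sec column of Table 1 in per-frame terms, the small sketch below converts a frame rate to an average time per frame and computes the throughput ratio between the two scenarios. The helper names `frame_time_ms` and `speedup` are illustrative only and are not part of the Intel Media SDK or Intel GPA.

```cpp
#include <cassert>

// Average wall-clock time per frame, in milliseconds, for a given frame rate.
double frame_time_ms(double fps) {
    assert(fps > 0.0);
    return 1000.0 / fps;
}

// Throughput ratio of an improved configuration over a baseline.
double speedup(double baseline_fps, double improved_fps) {
    assert(baseline_fps > 0.0);
    return improved_fps / baseline_fps;
}
```

With the values from Table 1, `frame_time_ms(41)` is about 24.4 ms, `frame_time_ms(108)` is about 9.3 ms, and `speedup(41, 108)` is about 2.6.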
It is easy to imagine that multiple asynchronous frame encoding will yield better performance than single synchronized frame encoding. However, what is the optimal value for AsyncDepth (an Intel Media SDK parameter) to get the best performance? Often a developer runs multiple asynchronous encoding sessions but sees no performance benefit, or even a performance loss. The Intel GPA Media Performance Analyzer helps by visualizing concurrency between hardware processing elements and logical (API) calls.
If you click on the "Capture" button while the application is running, the Intel GPA Media Performance Analyzer will capture a detailed execution trace of the application. Let us run Scenario A of the sample encode application and capture the trace to understand its execution inside Intel HD Graphics. The Tracing Duration setting specifies the length of the trace to capture, in milliseconds; 1000 ms or 2000 ms is more than enough to understand the performance issues of the application. Click on the Capture button and start the simple encode application. After 1000 ms, tracing stops and the trace opens in a separate window. Figure 5 shows the trace from the simple encode application for Scenario A:

Figure 5: Scenario A - Trace View
Here we click inside a track and zoom in (Figure 6 below). This expands the details about the tracks.

Figure 6: Scenario A - Expanded Track View

Figure 7: Scenario A – Expanded Frame Info
Let us understand these tracks in more detail.
Application Track:
Zooming in a little more, we can see that the Application Track has two function calls, MFX_EncodeFrameAsync and MFX_SyncOperation. MFX_EncodeFrameAsync takes only 0.0237 ms, while MFX_SyncOperation takes 20.75 ms.
Figure 8: Scenario A - Application Track
MSDK Track
The MSDK Track has a main track called "Encode Submit", which has multiple subtracks (see Figure 9). In its first step, Encode Submit locks the frame, copies it to the graphics unit, and then unlocks it. The time this copy takes depends on the frame size; in our case, the frame is a 1920x1080 YUV buffer. The second step issues DXVA commands to the graphics unit to encode the frame. That is where the graphics unit starts encoding the frame.
Figure 9: Scenario A - MSDK Track
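To give a sense of how much data the frame copy in Encode Submit moves, the sketch below computes the size of one tightly packed NV12 frame. The function name `nv12_frame_bytes` is an illustrative assumption, and the calculation ignores any pitch or alignment padding, which real surfaces typically have, so the actual copy may be somewhat larger.

```cpp
#include <cassert>
#include <cstdint>

// Bytes in one tightly packed NV12 frame: a full-resolution Y plane followed
// by an interleaved UV plane subsampled by two in each dimension (4:2:0).
std::uint64_t nv12_frame_bytes(std::uint32_t width, std::uint32_t height) {
    assert(width % 2 == 0 && height % 2 == 0);
    const std::uint64_t luma = static_cast<std::uint64_t>(width) * height;
    const std::uint64_t chroma = luma / 2;  // U and V together
    return luma + chroma;
}
```

For the 1920x1080 input used here, `nv12_frame_bytes(1920, 1080)` is 3,110,400 bytes, or roughly 3 MB copied per frame.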
GPU Encode Track
There are two GPU Encode tracks (Figure 10). This is because there are two different hardware blocks used in the encoding process to perform two separate tasks.
Figure 10: Scenario A - GPU Encode Track
Motion Estimation Track
The Motion Estimation Track is actually the kernel software which runs on the Execution Units (EUs) of the graphics unit. This kernel is executed while adaptively invoking the motion estimation acceleration hardware. The actual behavior depends on the TargetUsage set by the application. Here we can see 4 stages inside the Motion Estimation Track with MFX_TARGETUSAGE_BALANCED mode. The performance of this track depends on the usage parameter, but it is also affected by whether graphics Intel Turbo Boost Technology is on or off.
Coding Track
The Coding Track is executed on the independent coding acceleration hardware, which is separate from the EUs. Because motion estimation and coding are on independent hardware, the Motion Estimation Track and Coding Track can work in parallel. On the Intel GPA Platform Analyzer, these two processes are serialized, but this is only because of a logical dependency: the Coding Track needs the motion vectors and macroblock types from the Motion Estimation Track. In other words, this serialization is done by the driver software, not by hardware logic. If there is no dependency, the hardware can work in parallel. This is a very important point to consider when optimizing the encoding performance. The Motion Estimation Track is separated into multiple stages to improve performance by breaking the entire motion estimation process into several pieces. So, multiple frames which are in the encoding stage (Motion Estimation Track or Coding Track) can be encoded in parallel. In this case, motion estimation for the current frame can work in parallel with coding of the last frame. Intel GPA Media Performance Analyzer helps the developer to understand whether the encode hardware inside the graphics unit is used optimally.
End of Encode_Query
Encode_Query messages are issued by the Intel Media SDK to check whether encoding is complete or not. If you look at Figure 10, after Encode Submit, there are tasks at regular intervals in the MSDK Track; these tasks are Encode_Query tasks. Figure 11 shows the expanded form of the MSDK Track near Encode_Query.
Figure 11: Scenario A - Encode Query
An application can get the best performance from Intel Quick Sync Video technology by fully utilizing the hardware acceleration capabilities in Intel HD Graphics. For example, in Scenario A:
- There is a gap of ~4.8 ms between the end of the first frame's encoding and the start of the second frame's encoding (Figure 12). The graphics unit is idle during this gap, so the application should submit more work to it.

Figure 12: Scenario A - Idle Time in Intel® HD Graphics
- The graphics unit encode capabilities are not overlapped optimally. As mentioned earlier, parts of motion estimation and encoding can run in parallel because they use different hardware units. Figure 12 shows that GPU Encode Track 1 and GPU Encode Track 2 are mostly working serially. Roughly 12 ms could be recovered if these were overlapped.
- First, by submitting two frames in parallel, we can eliminate the CPU bottleneck.
- Second, by using more than two frames in parallel, we can fully overlap the two Encode tracks.
Figure 13: Scenario A - Add Parallelism
Let us look at the Intel GPA Media Performance Analyzer trace for Scenario B (Figure 14).
In Scenario B, the application is using the Motion Estimation Track and the Coding Track in parallel. There is no idle time in between them. Multiple frames are submitted for encoding to the graphics unit, which also limits the idle time between different hardware encoding units.
Figure 14: Scenario B - Expanded View
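The benefit of overlapping the Motion Estimation Track and the Coding Track can be illustrated with a simple two-stage pipeline model. This is a rough sketch under idealized assumptions (fixed per-frame stage times, no submission or copy overhead, no idle gaps); the function names and any stage durations passed to them are hypothetical and are not taken from the traces.

```cpp
#include <algorithm>
#include <cassert>

// Total time when each frame runs motion estimation and then coding, and the
// next frame starts only after the previous frame is completely finished.
double serialized_ms(int frames, double me_ms, double coding_ms) {
    assert(frames >= 0 && me_ms >= 0.0 && coding_ms >= 0.0);
    return frames * (me_ms + coding_ms);
}

// Total time when motion estimation of the next frame overlaps coding of the
// current frame. Once the pipeline is full, the slower stage sets the pace.
double overlapped_ms(int frames, double me_ms, double coding_ms) {
    assert(frames >= 0 && me_ms >= 0.0 && coding_ms >= 0.0);
    if (frames == 0) return 0.0;
    return me_ms + coding_ms + (frames - 1) * std::max(me_ms, coding_ms);
}
```

For example, with made-up stage times of 8 ms for motion estimation and 4 ms for coding, three frames take 36 ms serialized and 28 ms overlapped. The saving grows with the number of frames because only the slower stage remains on the critical path.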
It is clear that Scenario B is able to hide the idle time on the graphics unit. The performance of Scenario A is 41 fps, and the performance of Scenario B is 108 fps. However, Scenario B saves only 7-16 ms of hardware time, which does not account for a performance boost of more than double. Where is the extra performance coming from? To find out, let us compare the traces of the two scenarios.
Figure 15 shows the combined traces from both scenarios.
Figure 15: Scenario A and Scenario B Comparison
Summary
Let me summarize the key points that we learned from analyzing Intel GPA Media Performance Analyzer traces to get the best encode performance from Intel® Quick Sync Video:
- The "Encode Submit" in the MSDK Track should finish before the Motion Estimation for the previous frame finishes (Figure 14). If the application can do that, it ensures that the Motion Estimation unit has another frame to work on as soon as the current frame is finished.
- If the requirement in the first point is hard to achieve due to application complexity, try to complete "Encode Submit" before the Coding Track for the previous frame completes. This enables the application to achieve nearly 100% GPU utilization.
- If even the second point is not feasible, try to achieve 90% GPU utilization; then Intel Turbo Boost Technology on the graphics unit can benefit application performance.
*OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.
12-20-2011
The Intel® GPA Media Performance Analyzer helps the developer to understand and fine-tune performance gaps when using the Intel® Media SDK. This white paper provides detailed usage examples and tips on using the Intel GPA Media Performance Analyzer to understand performance issues in Intel Media SDK-enabled applications.
