Using Intel® Graphics Performance Analyzer (GPA) to analyze Intel® Media Software Development Kit-enabled applications
The 2nd Generation Intel® Core™ family of processors provides hardware-accelerated media encode, decode and preprocessing capabilities called Intel® Quick Sync Video technology. The Intel® Media Software Development Kit (Intel® Media SDK) provides developers with a standard application programming interface (API) to create high-performance video solutions for consumer and professional uses based on Intel® Quick Sync Video technology.
The Intel Media SDK provides developers with:
- Highly optimized routines for delivering maximum video performance on 2nd Generation Intel Core processors.
- Built-in support for upcoming video capabilities in future Intel® platforms.
- Faster time to market with easy-to-access APIs and reduced development time.
Intel® Graphics Performance Analyzers (Intel® GPA)
Intel® GPA is a suite of software tools that provides platform-level graphics performance analysis to help developers optimize application performance. Intel GPA has the following major components:
- Intel® GPA Frame Analyzer is a powerful, intuitive, best-in-class single frame analysis and optimization tool.
- Intel® GPA System Analyzer Heads-up Display (HUD) and Standalone provide straightforward initial analysis and interactive Microsoft Direct3D* pipeline state overrides.
- Intel® GPA Platform Analyzer provides a timeline view for analysis of tasks, threads, Microsoft DirectX*, OpenCL™ and GPU-accelerated media applications in context.
- Intel® GPA Media Performance Analyzer shows how efficiently your code uses hardware acceleration on Intel® Core™ processor-based PCs with Intel® HD Graphics, and provides in-depth, real-time analysis of encode and decode metrics.
The Media Performance Analyzer window provides real-time Intel® HD Graphics usage information. The GPU General tab shows the overall usage percentage of Intel HD Graphics. The GPU Execution Unit Engine (EU) tab provides total real-time usage of the execution units. The table under the GPU Execution Unit Engine tab provides the usage of various components of the GPU execution units. The right-hand side shows the total usage of the Multi-Format Codec Engine (MFX Engine) and its components.
Figure 2: Intel® GPA Media Performance Analyzer
To demonstrate how to use the Intel® GPA Media Performance Analyzer to optimize an Intel Media SDK-enabled application, we will use the sample encode application shipped with Intel Media SDK distributions. The encode path of this sample uses two Intel Media SDK API calls:
- Submit the frame to Intel HD Graphics for encoding:
MFXVideoENCODE_EncodeFrameAsync(mfxSession session, mfxEncodeCtrl *ctrl, mfxFrameSurface1 *surface, mfxBitstream *bs, mfxSyncPoint *syncp);
- Receive the encoded frame:
MFXVideoCORE_SyncOperation(mfxSession session, mfxSyncPoint syncp, mfxU32 wait);
MFXVideoENCODE_EncodeFrameAsync() is an asynchronous (non-blocking) API call used by the application to submit an uncompressed (NV12) frame to the Intel Media SDK for encoding. It takes the session handle, encode settings (mfxEncodeCtrl), input frame (mfxFrameSurface1), and output buffer (mfxBitstream) as input, and returns a sync point (mfxSyncPoint). The application then passes that sync point to MFXVideoCORE_SyncOperation() to wait for the encode to finish and retrieve the encoded frame from the Intel Media SDK.
Consider the following two scenarios to understand how the data flows in Intel HD Graphics between different encode stages, and how the Intel GPA Media Performance Analyzer helps to optimize the application.
- Scenario A: we will use the simple encode application to
- Read a raw frame from the input file;
- Submit the frame to Intel Media SDK using EncodeFrameAsync API;
- Retrieve the encoded frame from Intel Media SDK;
- Write the encoded frame to the file.
- Scenario B: we will use the simple encode application to
- Read 3 raw frames from the input file;
- Submit these frames to Intel Media SDK using EncodeFrameAsync API;
- Retrieve the encoded frames from Intel Media SDK;
- Write the encoded frames to the file.
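The difference between the two scenarios can be sketched with a small timing model. This is a minimal C++ sketch, not Intel Media SDK code: the `Costs` structure, the `simulate_total_ms` function, and every duration passed to it are hypothetical, and the GPU is modeled as a single queue that encodes one frame at a time.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical per-frame costs in milliseconds. None of these values come
// from the article or from measurements.
struct Costs {
    double read_ms;    // CPU: read one raw frame from the input file
    double submit_ms;  // CPU: hand the frame to the encoder (non-blocking)
    double encode_ms;  // GPU: encode one frame
    double write_ms;   // CPU: write one encoded frame to the output file
};

// Simulated wall-clock time to encode `frames` frames when the application
// submits `depth` frames before it starts waiting for results.
// depth = 1 mirrors Scenario A; depth = 3 mirrors Scenario B.
double simulate_total_ms(int frames, int depth, const Costs& c) {
    double cpu_now = 0.0;      // virtual clock of the application thread
    double gpu_free_at = 0.0;  // time at which the GPU drains its queue
    for (int first = 0; first < frames; first += depth) {
        const int batch = std::min(depth, frames - first);
        std::vector<double> done_at;  // completion time of each submitted frame
        for (int i = 0; i < batch; ++i) {
            cpu_now += c.read_ms + c.submit_ms;
            const double start = std::max(cpu_now, gpu_free_at);
            gpu_free_at = start + c.encode_ms;
            done_at.push_back(gpu_free_at);
        }
        for (double t : done_at) {
            cpu_now = std::max(cpu_now, t);  // blocking wait for this frame
            cpu_now += c.write_ms;
        }
    }
    return cpu_now;
}
```

With a depth of 1 the GPU sits idle while the application reads and writes files; with a depth of 3 the reads for the second and third frames overlap with the encode of the first. The model still leaves a gap between batches, so it is only meant to show the direction of the effect, not its size.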
The input YUV file is 1920x1080 with 300 frames and encoded to H.264/AVC for 8Mbps with Constant Bit Rate (CBR) settings. The important encoder configuration parameters for the Intel Media SDK are set as follows:
// set mfx parameters
mfxEncParams.mfx.CodecId = MFX_CODEC_AVC;
mfxEncParams.IOPattern = MFX_IOPATTERN_IN_SYSTEM_MEMORY;
mfxEncParams.mfx.FrameInfo.ChromaFormat = MFX_CHROMAFORMAT_YUV420;
mfxEncParams.mfx.FrameInfo.FourCC = MFX_FOURCC_NV12;
mfxEncParams.mfx.FrameInfo.PicStruct = MFX_PICSTRUCT_PROGRESSIVE;
mfxEncParams.mfx.TargetUsage = MFX_TARGETUSAGE_BALANCED;
mfxEncParams.mfx.RateControlMethod = MFX_RATECONTROL_CBR;
mfxEncParams.mfx.NumThread = 0;
mfxEncParams.mfx.EncodedOrder = 0;
mfxOption.CAVLC = MFX_CODINGOPTION_OFF; // CABAC
Figure 3 shows the changes in the Intel GPA Media Performance Analyzer window when we run the simple encode application for Scenario A:
Figure 4 shows the Intel® GPA Media Performance Analyzer output when Scenario B is run:
Figure 3: Scenario A - Intel® GPA Media Performance Analyzer View
In Scenario B, GPU usage is 98%, while in Scenario A, GPU usage is 62%. Similarly, execution unit and MFX unit usage are higher in Scenario B.
Figure 4: Scenario B - Intel® GPA Media Performance Analyzer View
Table 1: Performance Comparison of Scenario A and Scenario B
| |GPU Usage|EU Usage|MFX Usage|Frames/sec|
|---|---|---|---|---|
|Scenario A (single synchronized frame encoding)|62%|51%|10%|41* fps|
|Scenario B (multiple asynchronous frame encoding)|98%|83%|25%|108* fps|
Note: * The performance numbers above in no way indicate the minimum or maximum performance that can be achieved with Intel Quick Sync Video. They were obtained with a sample application solely to illustrate the Intel GPA tool.
It is easy to imagine that multiple asynchronous frame encoding will yield better performance than single synchronized frame encoding. However, what is the optimal value for AsyncDepth (an Intel Media SDK parameter) to get the best performance? Developers often run multiple asynchronous encoding sessions but see no performance benefit, or even a performance loss. The Intel GPA Media Performance Analyzer helps by visualizing concurrency between hardware processing elements and logical (API) calls.
If you click on the "Capture" button while the application is running, the Intel GPA Media Performance Analyzer will capture a detailed execution trace of the application. Let us run Scenario A of the sample encode application and capture the trace to understand its execution inside Intel HD Graphics. The Tracing Duration setting specifies the length of the trace to capture, in milliseconds; 1000 ms or 2000 ms is more than enough to understand the performance issues of the application. Click on the Capture button and start the simple encode application. After 1000 ms, tracing stops and the trace opens in a separate window. Figure 5 shows the trace from a simple encode application for Scenario A:
The trace provides a system-wide view of how the application code works with Intel Media SDK and how media-related workloads execute on Intel HD Graphics. The trace is organized in horizontal "tracks", each of which represents a processing thread (running on a CPU core), an Intel HD Graphics hardware block, or the duration of a blocking API call. The user can easily zoom in or out horizontally with the mouse wheel or the "-" and "=" keys. The bottom right corner provides the list of panels that can show more details about the trace when a user selects part of it. For example, in the picture below, the Statistics panel is selected and shown in the top right panel.
Figure 5: Scenario A - Trace View
Here we click inside a track and zoom in (Figure 6 below). This expands the details about the tracks.
The Statistics and Summary panels provide more details about the selected region. The Summary panel tells us that MFX_SyncOperation is selected. This can be easily confirmed by zooming in more on the Track panel.
Figure 6: Scenario A - Expanded Track View
Figure 7 shows the encoding of one H.264/AVC frame. The first track, msdk_sample.exe (let us call it the Application Track), runs the MFX_SyncOperation API. The corresponding tasks in the Intel Media SDK are labeled as the MSDK Track. Two GPU Encode tracks show the movement of frames inside the GPU and the execution of different functions. The first GPU Encode track is labeled as the Motion Estimation Track, and the other is labeled as the Coding Track.
Figure 7: Scenario A – Expanded Frame Info
Let us understand these tracks in more detail.
Application Track
We zoom in a little bit more and can see that the Application Track has two function calls, MFX_EncodeFrameAsync and MFX_SyncOperation. MFX_EncodeFrameAsync takes only 0.0237 ms, and MFX_SyncOperation takes 20.75 ms.
The MFX_EncodeFrameAsync call is asynchronous and returns immediately, but the MFX_SyncOperation call waits until encoding finishes for the frame submitted by MFX_EncodeFrameAsync. This is one of the issues in Scenario A, where the application waits for each frame to complete before submitting the next frame for encoding. The encode itself does not take that much time, but there is additional overhead associated with the operation, e.g., copying the frame from CPU memory to graphics unit memory and getting the encoded frame back from the graphics unit.
Figure 8: Scenario A - Application Track
MSDK Track
The MSDK Track has a main track called "Encode Submit", which has multiple subtracks (see Figure 9). In the first step, Encode Submit locks the frame, copies it to the graphics unit, and then unlocks it. The duration of this copy depends on the frame size; in our case, the frame is a 1920x1080 YUV buffer. The second step issues DXVA commands to the graphics unit to encode the frame. That is where the graphics unit starts encoding the frame.
The DXVA2_Execute time may differ for every frame type (I-frame, B-frame, or P-Frame) as the driver may attach other information with the command (e.g., information needed to manage reference frames, etc.).
Figure 9: Scenario A - MSDK Track
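To make the dependence of the Encode Submit copy on frame size concrete, the amount of data in one NV12 frame can be computed directly. NV12 stores a full-resolution 8-bit luma plane followed by an interleaved chroma plane at half resolution in each dimension. The helper name `nv12_frame_bytes` is illustrative, and the calculation ignores any pitch or alignment padding that a real surface allocation may add.

```cpp
#include <cstddef>

// Payload size in bytes of one 8-bit NV12 frame: one luma byte per pixel plus
// one U byte and one V byte per 2x2 block of pixels. Padding is not included.
constexpr std::size_t nv12_frame_bytes(std::size_t width, std::size_t height) {
    return width * height + 2 * ((width + 1) / 2) * ((height + 1) / 2);
}

// Size of one frame of the 1920x1080 input used in this article.
constexpr std::size_t kFrameBytes = nv12_frame_bytes(1920, 1080);
```

For the 1920x1080 input this is 3,110,400 bytes, or about 3 MB that must be copied to the graphics unit for every frame when the input surfaces are in system memory.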
GPU Encode Track
There are two GPU Encode tracks (Figure 10). This is because there are two different hardware blocks used in the encoding process to perform two separate tasks.
The first GPU Encode track is responsible for motion estimation, which includes motion detection and mode decision. The second track is mainly responsible for bitstream coding, including CABAC, based on the information sent from the Motion Estimation Track. The application cannot control the execution time of these tracks, but it is helpful to understand them for performance purposes.
Figure 10: Scenario A - GPU Encode Track
Motion Estimation Track
The Motion Estimation Track is actually the kernel software which runs on the Execution Units (EUs) of the graphics unit. This kernel is executed while adaptively invoking the motion estimation acceleration hardware. The actual behavior depends on the TargetUsage set by the application. Here we can see 4 stages inside the Motion Estimation Track with MFX_TARGETUSAGE_BALANCED mode. The performance of this track depends on the usage parameter, but it is also affected by whether graphics Intel Turbo Boost Technology is on or off.
Coding Track
The Coding Track is executed on the independent coding acceleration hardware, which is separate from the EUs. Because motion estimation and coding are on independent hardware, the Motion Estimation Track and Coding Track can work in parallel. On the Intel GPA Platform Analyzer, these two processes appear serialized, but this is only because of a logical dependency: the Coding Track needs the motion vectors and macroblock types from the Motion Estimation Track. In other words, this serialization is done by the driver software, not by hardware logic. If there is no dependency, the hardware can work in parallel. This is a very important point to consider when optimizing the encoding performance. The Motion Estimation Track is separated into multiple stages to improve performance by breaking the entire motion estimation process into several pieces. So, multiple frames which are in the encoding stage (Motion Estimation Track or Coding Track) can be encoded in parallel. In this case, motion estimation for the current frame can work in parallel with coding of the previous frame. The Intel GPA Media Performance Analyzer helps the developer understand whether the encode hardware inside the graphics unit is used optimally.
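The benefit of running motion estimation for one frame while the previous frame is being coded can be estimated with the standard two-stage pipeline formula. The sketch below treats each track as a single stage with a fixed duration per frame, which is a simplification (the article notes that motion estimation is itself split into several stages); the function names and any durations passed in are hypothetical.

```cpp
#include <algorithm>

// Total time when each frame finishes coding before motion estimation of the
// next frame starts.
double serialized_ms(int frames, double me_ms, double coding_ms) {
    return frames * (me_ms + coding_ms);
}

// Total time when motion estimation of frame n+1 overlaps coding of frame n.
// The first frame must pass through both stages; after that, the slower stage
// sets the pace, and the last frame still has to be coded.
double pipelined_ms(int frames, double me_ms, double coding_ms) {
    if (frames <= 0) return 0.0;
    return me_ms + (frames - 1) * std::max(me_ms, coding_ms) + coding_ms;
}
```

For example, with a hypothetical 8 ms of motion estimation and 4 ms of coding per frame, three frames take 36 ms serialized and 28 ms pipelined. As the frame count grows, the pipelined time per frame approaches the duration of the slower stage.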
End of Encode_Query
Encode_Query messages are issued by the Intel Media SDK to check whether encoding is complete. If you look at Figure 10, after Encode Submit, there are tasks at regular intervals in the MSDK Track; these are Encode_Query tasks. Figure 11 shows the expanded form of the MSDK Track near Encode_Query.
If the encoded frame is ready, then the graphics driver locks the frame, copies the data from the graphics unit to the destination frame, and unlocks the frame. This copy (graphics unit to CPU) is much faster than the copy of an uncompressed frame to the graphics unit, for two reasons: first, the compressed frame is smaller than the uncompressed frame, and second, the Intel Media SDK uses an optimized data copy based on a combination of MOVNTDQA and MFENCE instructions.
Figure 11: Scenario A - Encode Query
An application can get the best performance from Intel Quick Sync Video technology by fully utilizing the hardware acceleration capabilities in Intel HD Graphics. For example, in Scenario A:
- There is a gap of ~4.8 ms between the end of the first frame's encoding and the start of the second frame's encoding (Figure 12). The application should submit more work to the graphics unit, as the graphics unit is idle during this gap.
Figure 12: Scenario A - Idle Time in Intel® HD Graphics
- The graphics unit encode capabilities are not overlapped optimally. As mentioned earlier, there are parts of motion estimation and encoding that can run in parallel, as they use different hardware units. Figure 12 shows that GPU Encode Track 1 and GPU Encode Track 2 are mostly working serially. About 12 ms could be recovered if these were overlapped.
- First, by submitting two frames in parallel, we can eliminate the CPU bottleneck.
- Second, by using more than two frames in parallel, we can fully overlap the two Encode tracks.
Let us look at the Intel GPA Media Performance Analyzer trace for Scenario B (Figure 14).
Figure 13: Scenario A - Add Parallelism
In Scenario B, the application is using the Motion Estimation Track and the Coding Track in parallel. There is no idle time in between them. Multiple frames are submitted for encoding to the graphics unit, which also limits the idle time between different hardware encoding units.
Figure 14: Scenario B - Expanded View
It is clear that Scenario B is able to hide the idle time on the graphics unit. The performance of Scenario A is 41 fps, and the performance of Scenario B is 108 fps. However, Scenario B saves only 7-16 ms of time on the hardware, which does not account for a performance boost of more than double. Where is the extra performance coming from? To find out, let us compare the traces of the two scenarios.
Figure 15 shows the combined traces from both scenarios.
Figure 15: Scenario A and Scenario B Comparison
Summary
Let me summarize the key points that we learned from analyzing Intel GPA Media Performance Analyzer traces to get the best encode performance from Intel® Quick Sync Video:
- The "Encode Submit" in the MSDK Track should finish before the Motion Estimation for the previous frame finishes (Figure 14). If the application can do that, it ensures that the Motion Estimation unit has another frame to work on as soon as the current frame is finished.
- If the requirement in the first point is hard to achieve due to application complexity, try to complete "Encode Submit" before the Coding Track for the previous frame completes. This enables the application to achieve nearly 100% GPU utilization.
- If even the second point is not feasible, try to achieve 90% GPU utilization; then Intel Turbo Boost Technology on the graphics unit can benefit application performance.
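The first two points above amount to a simple ordering check on three timestamps from the trace. The sketch below expresses that check; the enum, the function name, and the idea of feeding it hand-read timestamps are all assumptions made for illustration.

```cpp
// Where the end of "Encode Submit" for the next frame falls relative to the
// previous frame's progress on the GPU.
enum class SubmitTiming {
    BeforeMotionEstimationEnds,  // first point: motion estimation never starves
    BeforeCodingEnds,            // second point: close to full GPU utilization
    AfterCodingEnds              // GPU goes idle; fall back to the third point
};

// All arguments are timestamps in milliseconds on the same trace timeline.
SubmitTiming classify_submit(double submit_end_ms,
                             double prev_me_end_ms,
                             double prev_coding_end_ms) {
    if (submit_end_ms <= prev_me_end_ms) {
        return SubmitTiming::BeforeMotionEstimationEnds;
    }
    if (submit_end_ms <= prev_coding_end_ms) {
        return SubmitTiming::BeforeCodingEnds;
    }
    return SubmitTiming::AfterCodingEnds;
}
```

Checking a few consecutive frames in a captured trace this way gives a quick indication of which of the three situations an application is in.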
*OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.