By Sam Siewert
Real-time Media and Encoding Tools
Real-time media has become pervasive with Web 2.0 viral video services, Internet Protocol Television (IPTV), mobile media for Web-enabled phones, and interactive systems for digital cable. Digital content creation, production, delivery, and consumption-increasingly in high definition (HD) relative to standard definition (SD) television-is changing communication, with more on-demand services and high demand for digital media in a wide range of resolutions, encoding formats, and transport formats.
This article provides a brief overview of digital media concepts and software tools used for encoding video and audio in compressed formats for playback and transport. Using an open source encoder with a simple workflow to trans-code MPEG4 content into MPEG2, with debugging symbols included, moving on to cover the use of the Intel® VTune™ profiling tools to understand where processor utilization hot-spots are in the software in terms of number of instructions and efficiency in the execution of those instructions. Tracing down a specific hot-spot, I examine cache misses that contribute to inefficiency. Finally, examining the specific hot-spot, I find that it makes use of the
Intel® Streaming SIMD (Single Instruction Multiple Data) Extensions, indicating that the code has employed the extensions to speed-up this section of code already, but also indicating a data cache inefficiency. The article reviews methods to improve cache efficiency, as well as both periodic and event-based profiling methods. In general, the paper provides an introduction to using Intel® VTune™ tools with digital media encoding software and demonstrates the importance of using
Intel® Streaming SIMD Extensions (Intel® SSE) and understanding where code hot-spots are and methods that can be used to make execution of code in hot-spots most efficient.
This article focuses on video, because it has an inherently high data rate based upon the number of bits per pixel (picture element), number of pixels per frame (picture), and frames per second for full-motion video. The typical uncompressed SD encoding of National Television Standards Council (NTSC) video-what everyone knows as
TV-has 29.97 interlaced frames per second typically digitized into 720×480 24-bit color-encoded pixels (8 bits for red, green, and blue [RGB]) at the same frame rate, which is right around a 30 megabytes (MB)/sec data rate if it was not further encoded with compression into a format like MPEG2. HD 1080×1920 resolution at 29.97 frames per second (fps) with the same 24-bit pixel encoding is about 180 MB/sec. Without encoders and decoders, video transport for streaming or even download would clearly not be practical, even with today's broadband networks.
Encoding and decoding is obviously necessary for video content delivery and consumption with players on both desktop and mobile computing devices. I first review a useful open source code base and tool called ffmpeg, which can encode a wide variety of original content into many compressed formats, including MPEG, from uncompressed Portable Pixmap (PPM) frame sequence formats as well as transcode from one encoding to another. Working through a few simple exercises to encode and transcode some open source content, you'll see that this is a very CPU-intensive task. To better understand where that CPU time is being spent and why encoding and transcoding is so CPU intensive, you'll use the Intel® VTune™ Performance
Analyzer to profile the ffmpeg code that you build for Windows*. Finally, I take a quick look at the use of Intel® Integrated Performance Primitives (Intel® IPP) and how you might optimize real-time media encoding tools like
ffmpeg.
MPEG Encoding and Encoding Tools
Historically, digital video content came from studios; but today, most anyone can dabble in content creation and production using HD and SD cameras, a home computer, and some basic post-production tools to edit digital video and audio streams, and then encode them into deliverable playback formats or streaming formats. For this paper, please use your own content or download some open source content from one of these open content sources:
Hopefully, you'll find this project not only informative but also entertaining.
A Brief Overview of Encoding Program and Transport StreamsPSs are container formats like Audio Video Interleave (AVI) and MPEG PSs. They encapsulate a number of elementary streams (ESs) composed of only video or audio into one compressed format that can include multiple ESs. Often, a video and an audio ES are multiplexed together to create what we all think of as a
movie. These container formats or PSs work well for file download and playback with a media player off of local storage devices like disk, DVD, or flash. They typically include time stamps to guide the player on how fast to decode (Decode Time Stamps, or DTSs) and how fast to present (Presentation Time Stamps, or PTSs) to an output video monitor.
TSs are used to encapsulate multiple PSs into a format designed for network delivery-188-byte packets for MPEG. A TS includes the ability to provide multiple video and audio programs, Program Specific Information (PSI) or guide data, and additional timing used to keep decoders synchronized with real-time streaming services (the Program Clock Reference, or PCR). So, TSs often included multiple programs that are "packetized" ESs (PES).
Either way, the audio and video must first be encoded into ESs that can be packetized for further encapsulation into a PS or TS for delivery or playback. The encoding involves steps to compress video that include:
- Pixel compression, often for 8 bits for each color band RGB for 24-bit pixels down to 16-bits or less using color space sub-sampling so that luminance is provided more frequently than chrominance data-for example, YUV 4:2:2, where the Y (luminance) is sampled in every pixel and the UV (color RGB difference signals) are sub-sampled so that in each 4-pixel subset of an image, two UV color difference samples (Cr and Cb) are included every other pixel, also known as YCrCb format. Computationally, this is a fairly simple in-memory reformat of data.
- Frame, 8×8 pixel macroblock encoding, where a Discrete Cosine Transform (DCT) is first applied to each block, followed by Quantization, and finally Huffman Encoding using the Zig-Zag pattern. The details behind this process go beyond the scope of this paper, but the key is to understand the complexity and basic sequence of mathematical and algorithmic operators applied to each video frame.
- Sequence, where temporal redundancy is removed in a Group of Pictures (GOP) so that an I-frame (compressed) is at the beginning of each GOP, but change-only data is encoded into P frames, with B frames between I and P that include only changes between surrounding I and P frames. Here again, this is a somewhat complex reformatting of data in a larger sequence in memory.
So, encoding is fairly memory and CPU intensive and must access large files. For single-stream encoding from SD, uncompressed video data, or from another compressed HD format, the bottleneck is most often the CPU. For encoding from uncompressed HD formats like 1080×1920, the bottleneck may also be the storage subsystem. Here, I focus on single-stream encoding and the CPU bottleneck. You would expect that a large portion of the CPU time would be spent in functions related to encoding and decoding, which you'll see in the Intel® VTune™ analyzer profile.
Open Source Encoding Tools for Linux* or Windows*A huge variety of open source and for-purchase post-production encoding tools exist for both Linux* and Windows*. This paper focuses on encoding and transcoding (going from one encoded format like MPEG2 to another, like MPEG4). You'll use the Big Buck Bunny* 720×1280 24-fps open source content, which you can download to understand digital media encoding, how it uses CPU resources, and how you might optimize it.
You can download the ffmpeg tool used for transcoding in this paper from
http://ffmpeg.arrozcru.org/builds/. I built Subversion revision 16537 of ffmpeg for this article using the MinGW tools and MSYS bash shell on my Windows laptop. The build went off without problems using the recommended build procedure in my MSYS shell:
./configure --enable-debug --enable-memalign-hack --extra-cflags="-fno-common"
make
make install
The
--enable-debug element is important for Intel® VTune™ Performance Analyzer so that you can drill down to the code hotspots, with symbols to identify functions where the most time is spent.
A Basic Encoding Workflow
An
encoding workflow simply defines an input source file, the encoding operations to apply to it, and an output result file. The
ffmpeg tool has many options for both audio and video encoding from either uncompressed original sources or already-encoded sources to many different encodings and file formats. Issuing
ffmpeg -formats on the command line lists all formats available. For the basic resource usage and CPU profiling example in this paper, I transcoded the
Big Buck Bunny* AVI format file, first removing all audio with the command:
ffmpeg -i big_buck_bunny_720p_surround.avi -an -vcodec copy big_buck_no_audio.mp4
Then, to study resource usage during transcoding, I used the following command to transcode a specific number of seconds of the MPEG4 video:
ffmpeg -t 10 -i big_buck_no_audio.mp4 -vcodec mpeg2video big_buck_no_audio_10s.mp2
The
-t 10 argument specifies 10 seconds transcoding the MPEG4 input into an MPEG2 output. The output for the 10-second transcode run looks like this:
Sam Siewert@SAM-LAPTOP /c/content/open-source
$ ffmpeg -t 10 -i big_buck_no_audio.mp4 -vcodec mpeg2video big_buck_no_audio_10s.mp2
FFmpeg version UNKNOWN, Copyright (c) 2000-2009 Fabrice Bellard, et al.
configuration: --enable-memalign-hack --extra-cflags=-fno-common
libavutil 49.12. 0 / 49.12. 0
libavcodec 52.10. 0 / 52.10. 0
libavformat 52.23. 1 / 52.23. 1
libavdevice 52. 1. 0 / 52. 1. 0
built on May 4 2009 02:12:23, gcc: 3.4.5 (mingw-vista special r3)
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'big_buck_no_audio.mp4':
Duration: 00:09:56.45, start: 0.000000, bitrate: 4003 kb/s
Stream #0.0(und): Video: mpeg4, yuv420p, 1280x720 [PAR 1:1 DAR 16:9], 24.00 tb(r)
Output #0, mp2, to 'big_buck_no_audio_10s.mp2':
Stream #0.0(und): Video: mpeg2video, yuv420p, 1280x720 [PAR 1:1 DAR 16:9], q=2-31, 200 kb/s, 24.00 tb(c)
Stream mapping:
Stream #0.0 -> #0.0
Press [q] to stop encoding
frame= 240 fps= 40 q=31.0 Lsize= 1442kB time=9.96 bitrate=1185.9kbits/s
video:1442kB audio:0kB global headers:0kB muxing overhead 0.000000%
Previewing the resulting encoded output files is always an important step, because ultimately, this content will be delivered for playback, and eyeballs are still the best judges of video quality. An excellent open source option for preview and playback is Video LAN Client* (VLC), which has the capability to play back PS or AVI files as well as stream or play digital media TSs or PSs over User Datagram Protocol (UDP) or Real-Time Transport Protocol (RTP). You can download the VLC server/client for Windows or Linux as binaries or source to build from
http://www.videolan.org/vlc/.
Figure 1 shows an example VLC preview window with a scene from the
Big Buck Bunny content that I transcoded to MPEG2 from MPEG4 and removed audio from with the following commands:
ffmpeg -i big_buck_bunny_720p_surround.avi -an -vcodec copy big_buck_no_audio.mp4
ffmpeg -i big_buck_no_audio.mp4 -vcodec mpeg2 big_buck_no_audio.mp2
Figure 1. Big Buck Bunny preview with VLC
Using Intel® VTune™ Performance Analyzer to Profile Encoding
You can download Intel® VTune™ Performance Analyzer for trial from Intel at
http://software.intel.com/en-us/articles/intel-software-evaluation-center/. The trial is good for 30 days and will allow you to find hotspots in your code or in open source code like ffmpeg so that you can understand which functions and even which specific lines of code take the most time during execution. Intel® VTune™ Performance Analyzer works by using the x86 Performance Monitoring Unit (PMU), a hardware performance profiling unit built into x86 cores. You can program the PMU to raise an interrupt on a periodic basis to sample the Instruction Pointer address and key counters such as cycle count, instruction count, and number of cache misses since the last interrupt. Intel® VTune™ Performance Analyzer is ideally suited to software such as media encoders, because the work includes repetition of common functions used to encode or decode each frame and frames in each GOP and employs common functions and primitives repeatedly. Because of repetition in the workload, Intel® VTune™ analyzer periodic sampling profiles the code accurately as long as a test of significant duration is run. The basic concept is that if you randomly sampled what you do during the week enough times, you would correctly determine the amount of time you commute to work, sit at your desk, get up to get coffee, commute home, and so on, which might help point out where you're inefficient in your work day and find hotspots where you waste time so that perhaps you could work fewer hours, get more done, and head home earlier, even avoiding commute traffic. This is the basic idea of Intel® VTune™ Performance Tools analysis.
You run Intel® VTune™ tools on the ffmpeg_g.exe file, which has symbols so that you can drill down into the profile with function names and with path information to the full C source code you built with in the MSYS MinGW environment in Windows. Figure 2 shows the resulting analysis by Intel® VTune™ Performance Environment for basic counter monitors, including CPU time.
Figure 2. Counter monitor results
The data that is more useful than the basic counter monitors is the drill-down by program, thread, module, and finally source code function in terms of where most of the time is spent. Figure 3 shows the top functions where most of the time is spent, which matches the expectation that most of it would be in the encoding and decoding functions and related primitives.
Figure 3. Hotspots by function where most time is spent
The function level information might be helpful enough, but if you want to, you can drill down into the source to see which lines of C code you spend the most time on, as shown in Figure 4 for the
diff_pixels_mmx function, which is one of the most often called functions.
Figure 4. Hotspots analysis of a function
The tuning assistant suggests that your CPI (Clocks Per Instruction) might be high: You spend a larger number of clock cycles than just one per instruction on average, so that may indicate that you have hotspots where you are missing cache and stalling the processor pipeline. To determine this, you require an event-based run rather than a periodic time-based profile. While the Intel® VTune™ analyzer GUI is fairly intuitive, those less familiar with profiling methods and usage of tools like Intel® VTune™ Performance Analyzer will find the built-in Help, the examples and tutorials, and the User Manual to be quite helpful. However, the Intel white paper "
Using Intel® VTune™ Performance Analyzer Events/ Ratios & Optimizing Applications" is an excellent reference that provides overview of profiling theory along with concrete examples. Here you apply the same techniques with a focus on using them to understand the specific requirements and characteristics of digital media encoding software. Once you have the hang of using Intel® VTune™ tools, give an event-based profile run a try, and see if the same
diff_pixels_mmx function comes up as a hotspot for cache misses.
So, running again, you configured to track CPI, L1 data cache miss rate, and L2 data cache miss rate. Intel® VTune™ Performance Analyzer employs event-based counters built into the x86 architecture that are configured to generate interrupts when event counts are incremented so that the tool can sample the counts and associate events like cache misses to specific lines of code and even specific instructions. While this sounds complicated, Intel® VTune™ Performance Analyzer makes this a simple process of selection through the GUI and automatically programs the built-in PMU (Performance Monitoring Unit) registers to generate the interrupts and sampling points to profile the code being run. The cache miss rate is, in fact, found to be high with event-based sampling, as shown in Figure 5 by the L2 unified cache miss count of 87 and the L1 data miss count of 160, so it may make sense to add a prefetch for pixel data to make this code more efficient. Modifying this open source code to prefetch pixel data can be done using compiler directives. Because this case uses GCC (GNU C Compiler), refer to this compiler's directives as documented by GNU in the "
Data Prefetch Support" notes found on gcc.gnu.org. When pre-fetching data, carefully consider how this might affect other data in cache since cache lines must be evicted in order to make room for the pre-fetched data, and, in fact, this could cause a new cache miss hot-spot elsewhere in the code. Good background on this is provided in the GCC Data Prefetch Support notes. This would be an interesting portion of the ffmpeg code to alter with prefetch instructions or compiler directives and to re-profile to examine whether this provided an overall optimization or simply eliminated this hot-spot, but created new ones. The best way to learn how to use Intel® VTune™ tools is to start experimenting with some code changes on your own to see how they influence execution efficiency.
Figure 5. Confirmed cache-miss hotspot
Optimizing Digital Video Encoder Software
Video encoding software can be optimized by making sure that hotspots are executed efficiently. So, as you saw above, in a frequently executed function, you may want to modify code to try and ensure that it doesn't have cache misses and pipeline stalls by either prefetching data into cache or perhaps even locking state information into cache.
In general, you would also want to make sure that you're taking full advantage of vectorized instructions known as
Intel® Streaming SIMD Extensions (Intel® SSE), which are instructions that vectorize instructions so that a single instruction can process multiple data words known as
Single Instruction, Multiple Data (SIMD). The Intel® SSE instructions available with x86 processors were designed for streaming operations that would typically be found in code like ffmpeg. Much of ffmpeg does appear to use Intel® SSE and SIMD Multi-Media Extensions (MMX) to the base instruction set. You can do this can by writing assembly code and either linking it in or using inline assembly pragmas, as you saw in the
diff_pixels_mmx function.
Alternatively, you could use the Intel® Integrated Performance Primitives (Intel® IPP) - a threaded library of functions designed for multimedia applications that is well documented in "
Intel® IPP Getting Started Guide" and a great resource are the code samples found in "
Intel® IPP Sample Code". The Intel® IPP and use of compiler directives will help immensely with code like ffmpeg which can benefit from Intel® SSE and optimized primitives that can be used for code that is data processing intensive.
Summary
Video encoding is critical for content delivery even with gigabit networks, next-generation wireless, and broadband to the home. The need for encoding on the fly in real time is growing based on more on-demand media services, and the processor resources required is increasing based on high-bit rate, high-resolution content being downloaded and streamed more frequently to a broad range of desktop, mobile, and entertainment devices. Today's digital media production, delivery, and consumption require significant computing resources, and developers need to know how to optimize encoding/decoding software. Using Intel® VTune™ Performance Tools, without extensive training or profiling experience, we were able to quickly drill down into a hot-spot in the open source ffmpeg software to understand where most of computational complexity resides in primitive functions used for encoding. Without tools like Intel® VTune™ Performance Analyzer, it's often difficult for programmers to have a good feel for the relative cost of specific data transformation operations involved in encoding and where most of the time is being spent. You must first know where the problems are before attempting to solve them, and you need to have a tool that allows follow-up measurement after code modification to verify that code restructuring was helpful. These tools and methods are truly invaluable to those involved in digital media.
About the Author
Dr. Sam Siewert is the chief technology officer (CTO) of Atrato, Inc. and has worked as a systems and software architect in the aerospace, telecommunications, digital cable, and storage industries. He also teaches as an Adjunct Professor at the University of Colorado at Boulder in the Embedded Systems Certification Program, which he co-founded in 2000. His research interests include high-performance computing and storage systems, digital media, and embedded real-time systems.