Throughout Q1/Q2 of 2013, Intel and ooVoo collaborated to enable multiple hardware-accelerated video conferencing use cases. These include standard person-to-person video conferencing at 720p, multi-party video conferencing (up to 12 participants), and media sharing. The work focused on making the most of the wide range of performance options available on systems running both Intel® Atom™ and 4th generation Intel® Core™ processors.
Materials presented within this paper target Microsoft Windows* 8.x operating systems. Key resources / technologies leveraged during the optimization process include:
- Intel® VTune™ Analyzer - http://software.intel.com/en-us/intel-vtune-amplifier-xe
- Intel® Media SDK - http://software.intel.com/en-us/vcsource/tools/media-sdk
- Intel® Performance Primitives (Intel® IPP) - http://software.intel.com/en-us/intel-ipp
- Microsoft MSDN Reference - http://msdn.microsoft.com/en-us/default.aspx
To provide a complete picture of the optimization process, this paper discusses the overall analysis and testing approach, describes the specific optimizations made, and illustrates before and after quality improvements.
Overall, focus was placed on ensuring video quality first and foremost. No significant emphasis was placed on power-efficiency optimization at this time. The quality improvements enabled by leveraging Intel hardware offloading capabilities are impressive and can be enjoyed in either person-to-person conferencing or in multi-party conferences for primary speakers.
Eliminating sporadic corruption occurring during network spikes and on bandwidth-limited connections was by far the biggest challenge faced during the optimization process. Initially, all HD calls experienced persistent corruption.
Analysis & Testing Approach
Before pursuing optimization efforts, a test plan and analysis approach were defined to ensure repeatability of results. Several key lessons emerged during our initial testing cycles. These included:
- Establish performance baseline – Before beginning to optimize our application, we made sure we had eliminated network inconsistencies and unnecessary traffic / noise, and that lighting and cameras were adequate. With the major variables under control we were able to confirm that our testing approach and results were repeatable. From this baseline, we could begin the optimization process.
- Establish minimum network bandwidth (BW) requirements – Poor quality networks lead to poor quality video conferencing. A minimum of 1-1.5 MBits/sec available for uplink / downlink was required to realize a full HD conference.
- Camera quality is important – Poor quality cameras lead to poor quality video conferencing. As camera quality degrades, the amount of noise, blocking, and other artifacts increases. This also tends to increase the overall BW required to drive the call.
- Define, test, and validate the network – To be certain you are not reporting problems that are actually caused by BW limitations on your network, measure the amount of packet loss, jitter, etc. present on the network before testing begins.
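The network validation step in the last item can be sketched as follows. This is a minimal illustration, not the tooling used during the project: the `Packet` record, the `network_report` function, and the trace format are assumptions. The jitter estimate uses the smoothing rule from RFC 3550 (each new transit-time difference contributes 1/16 of its deviation).

```python
from dataclasses import dataclass

@dataclass
class Packet:
    seq: int        # sender sequence number (hypothetical trace field)
    sent_ms: float  # timestamp when the packet was sent, in ms
    recv_ms: float  # timestamp when the packet was received, in ms

def network_report(received, sent_count):
    """Summarize loss, reordering, and jitter for one test run.

    `received` is the list of packets in arrival order;
    `sent_count` is how many packets the sender transmitted.
    """
    lost = sent_count - len({p.seq for p in received})
    out_of_order = 0
    highest_seq = -1
    jitter = 0.0
    prev = None
    for p in received:
        # A packet is out of order if a higher sequence number already arrived.
        if p.seq < highest_seq:
            out_of_order += 1
        highest_seq = max(highest_seq, p.seq)
        if prev is not None:
            # Difference in transit time between consecutive arrivals.
            d = abs((p.recv_ms - p.sent_ms) - (prev.recv_ms - prev.sent_ms))
            jitter += (d - jitter) / 16.0  # RFC 3550 running estimate
        prev = p
    return {
        "loss_pct": 100.0 * lost / sent_count,
        "out_of_order": out_of_order,
        "jitter_ms": jitter,
    }
```

Running a check like this before each test cycle makes it easier to separate corruption caused by the network from corruption caused by the application.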
The following data was collected during each testing cycle to enable triage / investigation of corruption issues affecting video quality. Initial data collected at the beginning of the optimization process showed that even reliable Ethernet connections experienced corruption problems despite very low packet loss and jitter on a ~17 MBit connection.
|Test||Intvl||Tx Size||BW||Jitter||Lost/Total||Out of||enc fps||dec fps||Quality|
|0.00%||1||15||15||Some corruption and coarse image|
The optimization work was carried out in three major phases: phase 1 focused on quality of service (QoS), phase 2 examined the user interface behavior, and phase 3 took a hard look at the rendering pipeline itself.
- Phase 1 – QoS approach optimization – Resulted in a switch from 1% to 5% acceptable packet loss.
- Phase 2 – User interface optimization – Resulted in the I/O pattern being set to output to system memory rather than video memory (because CPU-centric rendering APIs are used) and in a reduced GDI workload.
- Phase 3 – Rendering pipeline optimization – Resulted in the elimination of per pixel copies, use of Intel IPP for memory copies where possible, and an update to Intel IPP v7.1.
Phase 1 – QoS Approach Optimization
Quality of service algorithms seek to ensure that a consistent level of service is provided to end users. During our initial evaluation of the software, it was unclear if the QoS solution within ooVoo was too aggressive or if our network environment was too unstable. Initial measurements of network integrity indicated that it was unlikely that the network conditions were causing our issues. Jitter was measured at around 1.0-1.8 ms, packet loss was at or near 0%, and very few packets (if any) were being received out of order. All in all, this indicated a potential issue on the QoS side of the application.
To find the root cause of the issue, it was necessary to perform a low-level analysis on the actual bitstream data being sent / received by the ooVoo application. Our configuration for this process was as follows:
Direct analysis of the bitstream being encoded on the transmit side and the bitstream reconstructed on the receiver side indicated that frames were clearly being lost somewhere. Further analysis of the encode bitstream showed that all frames could be accounted for on the transmit side of the call; however, the receive side of the call was not seeing the entire bitstream encoded on the transmit side.
As can be seen from the diagram above, the receiver (decoder) stream is missing frames throughout the entire call. After working closely with the ooVoo development team, it was found that relaxing the QoS to accommodate up to 5% packet loss improved things significantly during point-to-point 720p video calls.
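The frame accounting and the effect of the relaxed threshold can be sketched as follows. This is an illustration under stated assumptions, not the ooVoo implementation: frame identifiers, the function names, and what the QoS logic does once the threshold is exceeded are all hypothetical. Only the 1% and 5% thresholds come from the work described here.

```python
def missing_frames(encoded_ids, decoded_ids):
    """Return the encoded frame IDs that never showed up on the receive side."""
    seen = set(decoded_ids)
    return [frame_id for frame_id in encoded_ids if frame_id not in seen]

def loss_exceeds_threshold(packets_lost, packets_total, max_loss_pct):
    """True when measured packet loss is above the tolerated percentage."""
    if packets_total == 0:
        return False
    return 100.0 * packets_lost / packets_total > max_loss_pct

# Transmit side encoded frames 0..9; the receiver reconstructed only some.
dropped = missing_frames(range(10), [0, 1, 2, 4, 5, 7, 8, 9])

# The same 3% loss trips a 1% threshold but is tolerated at 5%.
strict = loss_exceeds_threshold(3, 100, max_loss_pct=1.0)
relaxed = loss_exceeds_threshold(3, 100, max_loss_pct=5.0)
```

With the stricter setting, even modest loss is treated as a condition the QoS logic must react to; with the relaxed setting, the same call proceeds untouched.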
Phase 2 – User Interface Optimizations
Today, graphics and media developers have a wide variety of APIs to select from to meet engineering needs. Some APIs offer richer feature sets targeting newer hardware while others offer backwards compatibility. Since backwards compatibility was a key requirement for the ooVoo application, legacy APIs developed by Microsoft Corporation such as GDI and DirectShow* were necessary.
The following simplified pipeline illustrates the key area (“Draw UI” in green) where optimizations took place during this phase.
Before diving into the details, a quick word regarding video and system memory is in order. In simple terms, video memory is typically accessible to the GPU while system memory is typically accessible to the CPU. Memory can be copied between video and system memory; however, this comes at a significant performance cost. When working with graphics APIs, it is important to know whether the API you are using is CPU-centric. If it is, then it is critical to set the Intel Media SDK (MSDK) I/O pattern to output to system memory. Failure to do this when using a CPU-centric rendering API may lead to very poor performance. In cases where APIs such as GDI are used to operate on the surface data provided by the MSDK, operations that require surface locking will be particularly costly.
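The decision rule above can be written down as a small sketch. The two flag names are the Intel Media SDK output I/O pattern constants used with the `IOPattern` field of `mfxVideoParam`; everything else (the renderer classification and the function name) is a hypothetical illustration, and a real application would set the flag in native code rather than return a string.

```python
# Names of the Media SDK output I/O pattern flags (set on mfxVideoParam.IOPattern).
OUT_SYSTEM_MEMORY = "MFX_IOPATTERN_OUT_SYSTEM_MEMORY"
OUT_VIDEO_MEMORY = "MFX_IOPATTERN_OUT_VIDEO_MEMORY"

# Hypothetical classification of rendering APIs that touch pixels on the CPU.
CPU_CENTRIC_RENDERERS = {"GDI", "GDI+"}

def choose_output_pattern(renderer):
    """Pick where decoded frames should land for the given rendering API."""
    if renderer in CPU_CENTRIC_RENDERERS:
        # The CPU will read the surface, so avoid a video-to-system copy later.
        return OUT_SYSTEM_MEMORY
    # GPU-centric renderers can consume the surface directly in video memory.
    return OUT_VIDEO_MEMORY
```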
In the case of the ooVoo client application, it was observed that fullscreen rendering required significantly more processing power than when running in windowed mode. This puts us squarely in the case of needing to account for a CPU-centric API in our rendering pipeline.
A detailed look at the overall workload when in fullscreen mode illustrates the following GDI activity (see yellow). Measurements below were made on an Ivy Bridge platform with a total of 4 cores yielding 400% total processing power.
Continuing the investigation, it became clear that there was a significant difference in how the ooVoo application handled windowed versus fullscreen display modes. Note that the GDI workload virtually disappears in windowed mode.
The Intel VTune analyzer was used to identify the area of code where the GDI workload was being introduced. After discussing the issue with the ooVoo team, it became clear that this was unexpected behavior, and the ooVoo team found that the application was using GDI too frequently during fullscreen rendering. The solution was to limit the number of GDI calls made during each frame when in fullscreen mode. Despite the simple nature of this change, significant improvements were observed across the board:
GDI workload reduction impact and observations:
- Limited rendering via legacy APIs such as GDI+ is possible for video conferencing applications if resources are already available in system memory and very limited calls are made to GDI+ during each frame.
- Reduction of GDI+ call frequency within ooVoo application virtually eliminated all GDI overhead.
- Overall CPU utilization for the ooVoo application went down by ~4-5%. Within the app, the percentage of time spent on GDI+ work dropped from 10% to 0.2%.
- Overall workload associated with the ooVoo application is more organized and predictable, with fewer CPU spikes due to less surface locking by GDI.
- System wide reduction in CPU of ~20%.
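One common way to limit per-frame calls into a drawing API is to redraw the UI overlay only when its state changes. The sketch below illustrates that idea; it is an assumption about how such a limit could be implemented, not a description of the actual ooVoo change, and `OverlayRenderer` and `draw_call` are hypothetical names standing in for real GDI drawing code.

```python
_UNSET = object()  # sentinel so that the first frame always draws

class OverlayRenderer:
    """Redraws the UI overlay only when its state differs from the last frame."""

    def __init__(self, draw_call):
        self._draw_call = draw_call  # stand-in for an expensive GDI draw
        self._last_state = _UNSET
        self.draw_count = 0

    def render_frame(self, ui_state):
        # Skip the expensive call when nothing visible has changed.
        if ui_state != self._last_state:
            self._draw_call(ui_state)
            self.draw_count += 1
            self._last_state = ui_state

# 30 video frames, but the overlay changes only once (at frame 10).
calls = []
renderer = OverlayRenderer(calls.append)
for frame in range(30):
    renderer.render_frame("muted" if frame >= 10 else "normal")
```

In this example the drawing function runs twice instead of thirty times, which is the kind of call-frequency reduction the observations above describe.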
The following diagram illustrates the ooVoo application workload and related GDI effort after our optimizations:
Final measurements post optimization follow:
|Metric||Previous Build||Latest Build||Delta / Improvement|
|System CPU Peaks||~200%||~175%||Reduced ~25%|
|ooVoo.exe CPU Peaks||~40%||~37%||Reduced 3%|
|GDI+ Workload||10% of ooVoo.exe||0.12% of ooVoo.exe||Reduced 10%|
|Total System CPU||159% of 400%||137% of 400%||Reduced ~22%|
|ooVoo Total CPU||114% of 400%||110% of 400%||Reduced ~4-5%|
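When reading the table, note that the CPU figures are on a 0-400% scale (four cores at 100% each) and that the deltas are differences in percentage points rather than relative reductions. The small sketch below makes the arithmetic explicit; the function names are illustrative only.

```python
CORES = 4  # the test platform reports 100% per core, 400% in total

def share_of_machine(cpu_percent, cores=CORES):
    """Convert a reading on the 0..(100 * cores) scale to a 0..100 share."""
    return cpu_percent / cores

def point_drop(before, after):
    """Reduction in percentage points, as reported in the table."""
    return before - after

def relative_drop(before, after):
    """Reduction relative to the starting value, in percent."""
    return 100.0 * (before - after) / before

# "Total System CPU" row: 159% of 400% before, 137% of 400% after.
before, after = 159, 137
points = point_drop(before, after)
relative = relative_drop(before, after)
```

For the "Total System CPU" row this gives a drop of 22 percentage points, which corresponds to going from 39.75% to 34.25% of the whole machine, or a relative reduction of about 13.8%.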
Phase 3 – Rendering Pipeline Optimizations
Our final step in the optimization process was to take a hard look at the backend rendering pipeline for any un-optimized copies or pixel format conversions that might be affecting performance. Three key things to watch out for include:
- Per pixel copies – A copy operation executed serially for each pixel. For this type of operation it is generally best to leverage Intel IPP, which includes copy operations optimized for Intel hardware.
- Copies across video / system memory boundaries – Instead of copying MSDK frame data from video to system memory yourself, it is more effective to allow the MSDK to stream to system memory for you.
- FourCC conversions – FourCC color conversions are always expensive. If possible, try to get your data in the format you need and stay there. If converting between YUV / RGB colorspace, you can use either Intel IPP or pixel shaders to speed up the conversion.
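To see why color conversions are costly, consider the arithmetic needed for a single pixel. The sketch below uses the full-range BT.601 (JFIF) equations as an assumption; the coefficients actually used in the ooVoo pipeline are not stated here, and a production path would rely on Intel IPP or a pixel shader rather than code like this.

```python
def clamp8(x):
    """Round and clamp a value to the 0..255 range of an 8-bit channel."""
    return max(0, min(255, int(round(x))))

def yuv_to_rgb(y, u, v):
    """Convert one full-range BT.601 YUV pixel to RGB (assumed coefficients)."""
    r = y + 1.402 * (v - 128)
    g = y - 0.344136 * (u - 128) - 0.714136 * (v - 128)
    b = y + 1.772 * (u - 128)
    return clamp8(r), clamp8(g), clamp8(b)

def conversions_per_second(width, height, fps):
    """Number of per-pixel conversions needed to sustain a video stream."""
    return width * height * fps

# A 720p stream at 15 fps, matching the call configuration discussed here.
per_second = conversions_per_second(1280, 720, 15)
```

At 1280x720 and 15 fps this is 13,824,000 pixel conversions per second, each involving several multiplications, which is why staying in one format or offloading the conversion pays off.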
Early on in the process of profiling the ooVoo application, it was clear that memory copies were affecting performance; however, it was not clear what opportunities existed to address the issue. The ooVoo team performed a detailed code review and found cases where Intel IPP copy operations were not being used, places where per pixel copies were used, and ultimately upgraded to Intel IPP v7.1 to benefit from the latest updates.
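The difference between a per pixel copy and a bulk copy can be illustrated with a short sketch. Python is used here only to show the access pattern; in native code the bulk version would typically be a row-wise `memcpy` or an Intel IPP image copy primitive such as `ippiCopy_8u_C1R`. The function names and the single-plane, 8-bit layout are assumptions made for the example.

```python
def copy_per_pixel(src, src_stride, dst, dst_stride, width, height):
    """Copy an 8-bit plane one pixel at a time (the pattern to avoid)."""
    for row in range(height):
        for col in range(width):
            dst[row * dst_stride + col] = src[row * src_stride + col]

def copy_per_row(src, src_stride, dst, dst_stride, width, height):
    """Copy the same plane one row at a time using bulk slice copies."""
    for row in range(height):
        s = row * src_stride
        d = row * dst_stride
        dst[d:d + width] = src[s:s + width]

# A 3x3 image stored with a source stride of 4 bytes (1 byte of padding per row).
src = bytearray(range(12))
slow = bytearray(9)
fast = bytearray(9)
copy_per_pixel(src, 4, slow, 3, width=3, height=3)
copy_per_row(src, 4, fast, 3, width=3, height=3)
```

Both functions produce identical output, but the row-wise version issues one bulk operation per row instead of one operation per pixel, which is the property that optimized copy routines exploit.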
The results were impressive, giving us our first look at video conferencing at 720p on both Intel Core and Intel Atom platforms. The following before/after shots illustrate the improvements.
Note the elimination of blocky corruption in the facial area:
Configuration: Point to point, 720p, 15 fps, 1-1.5 MBits/sec, IVB:IVB, 4G network
Note the level of detail enabled for primary speaker during multi-party conference.
Configuration: Multi-Party via ooVoo Server, 4 callers + YouTube, 15 fps, IVB
Intel, the Intel logo, Atom, Core, and VTune are trademarks of Intel Corporation in the U.S. and/or other countries.
Copyright © 2013 Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.