Digital media applications are unique in that they generally can consume all the processing power they can get. Unlike other tasks that execute in a few seconds, the rendering of stills, audio and video can take several minutes or even hours. Digital media applications translate increases in performance to increases in end-user productivity, and therefore should be written to take advantage of the latest platform technologies.
The Pentium® 4 Processor with Hyper-Threading Technology (HT technology) delivers performance that dramatically reduces the overall processing time and improves the responsiveness of the system. Processors equipped with Hyper-Threading Technology have multiple logical CPUs per physical package. The state information necessary to support each logical processor is replicated while sharing and/or partitioning the underlying physical processor resources. Multiple threads running in parallel can achieve higher processor utilization and increased throughput.
In Section 1, video production is used as an example of application workflow to show how Hyper-Threading Technology benefits digital media production. Each of the four major steps of video production is examined in detail.
In Section 2, the multi-tasking characteristics of a system with HT technology are considered. When multiple applications are running on a system, HT technology helps reduce stalls and task switching delays caused by the interaction of two or more independent programs.
Section 3 discusses some software and system-level design considerations for optimizing multi-threaded applications in a multitasking environment. In this section we'll look at how application developers can use the Intel compilers to produce optimized code and then use Intel® VTune™ Performance Analyzer to identify and eliminate hotspots that prevent peak performance.
Section 1: Video Production Case Study
Video production is a complex multi-step process that often involves using multiple programs to achieve the desired output. There are four major steps to this process:
- Acquire: Capture movies and pictures, capture audio
- Build/Edit: Edit, mix, preview, store your project
- Render: Apply compression and format the file
- Output: Store the end result on hard drive, or burn to disk
1. Acquire
Digital Video Cameras connect to the PC using Firewire, USB, or through an analog connection. They transmit at a fixed rate of 25 or 30 frames per second (depending on format: PAL or NTSC), so the capture step can never go faster than the actual play time of the video. Five minutes of video takes five minutes to capture.
The data rates are high (about 4 MB per second) and the PC has to keep up with the source, or else dropped frames will result. Dropped frames degrade the quality of the video. Consequently, most software packages warn not to do anything else on your system while capture is under way.
On systems without HT technology enabled, that's still good advice. With HT technology enabled, the multitasking capability of the system is much enhanced: a background task is less likely to be pre-empted by other programs, so the PC can continue to be used for other activities. Video capture is not a very CPU-intensive activity - it typically consumes less than 15% of the capacity of a 3.06 GHz Pentium® 4 processor (Figure 1)*. Why not allow the end-user to use that time for something else?
Some capture applications simultaneously encode the incoming Digital Video stream into Windows Media* or MPEG formats. The advantages are that smaller files are created, and the media is in the desired output format early in the process.
Figure 2 shows DV capture from IEEE 1394, with encoding to MPEG2 for the output. Capture time is still the rate-limiting step, but the CPU is kept very busy with the encoding task. Completion times with and without HT technology were approximately equal, but with HT technology there is more CPU capability available for other tasks to use. This translates into faster UI responsiveness, even under heavy multitasking loads.
It takes a lot of multitasking activity during video capture to cause frame drops on systems enabled for HT technology. The only caveat: while capturing video, watch out for disk conflicts. The I/O rates for DV capture create a continuous demand on the hard disk of about 2-3%. The data rates are not that high, but streaming must be maintained. If these disk updates cannot happen in real time, frame dropping may result. Application developers can avoid some of these problems by locking I/O resources during critical real-time operations, but should do so with the full understanding that other applications may stall as a result.
2. Build/Edit
During the editing phase of production, audio, video, and stills are mixed together from various sources. During video preview, decoders for MP3, MPEG2, AVI, and other formats will typically be running. Individually, these decoders do not demand high CPU utilization. Playback is very smooth - until you throw in additional audio tracks, transitions, and special effects, where multiple codecs and filters must run simultaneously. The more complex transitions can be very CPU intensive and usually involve decoding two or more media streams at the same time. In Figure 3, the peaks seen every 10 seconds are video transitions.
3. Render
Rendering involves taking an edit decision list and creating a video file on the hard disk. Rendering is very CPU intensive - it can use all the performance capability you can throw at it and scales well with faster processors. Audio and video encoders run simultaneously during rendering, so this step is well suited to threading and parallelism.
With the speed of a 3 GHz Pentium® 4 processor and HT technology, it is now possible to encode full resolution NTSC video faster than real-time! In Figure 4 the source was a 180 second DV video. Without HT technology, the video was encoded in 136 seconds. With HT technology enabled, the time needed to encode MPEG2 decreased to 111 seconds. For a one-hour video project, the encode time was about 37 minutes.
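The timings quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the projection assumes that encode time scales linearly with source length.

```cpp
#include <cassert>

// Ratio of source duration to encode time. Values above 1.0 mean the
// encoder runs faster than real time.
double realtime_ratio(double source_seconds, double encode_seconds) {
    assert(encode_seconds > 0.0);
    return source_seconds / encode_seconds;
}

// Projected encode time for a longer project, ASSUMING encode time scales
// linearly with source length (an assumption, not a measured result).
double projected_encode_minutes(double project_minutes,
                                double source_seconds,
                                double encode_seconds) {
    return project_minutes / realtime_ratio(source_seconds, encode_seconds);
}
```

With the figures from the text, realtime_ratio(180, 111) is roughly 1.62 with HT technology and realtime_ratio(180, 136) is roughly 1.32 without it, and projected_encode_minutes(60, 180, 111) gives the 37 minutes quoted for a one-hour project.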
One surprising difference between systems with and without HT technology enabled is the responsiveness of the User Interface. New tasks on enabled systems launch right away and the cursor is rarely in an hourglass. The encoding task is spread across both processors, and there is plenty of headroom for other applications to run. As shown in Figure 4, without Hyper-Threading Technology, the video encode task consumes 70% of CPU clock cycles - leaving limited resources for other programs.
4. Output to media
With video encoding complete, the next step in the process is to create a disk image and write it to CD or DVD so that it can be distributed and played back. This step has two major phases, each with many sub-steps that use different parts of the system.
In the first phase the video and audio files are converted to the proper format. Depending on the playback target, the format may be MPEG2 (for consumer DVD players), MPEG4 (for posting on the web), or VCD (a lower resolution format for writable CDs). In the following example, the output media will be assumed to be high quality DVD-compatible 720x480 video at 30 frames per second. Figure 5 shows the two phases of the Output cycle for writing a DVD.
Phase 1 involves transcoding or re-encoding the audio and video streams into a compatible format. This is a CPU-intensive process, and hard disk activity is also high as files get read in, modified, and then written back out to disk. Without HT technology, CPU consumption is 100% and the encoding phase takes about three times longer.
Phase 2 consists of file operations to prepare the image for burning and then writing it to the media. It is not CPU-intensive since the rate-limiting step is the CD or DVD burner. Multitasking during optical disk writing can be risky. On older systems there is an event known as a "Buffer Under-run" that can occur if the CPU is not able to produce data fast enough to keep the disk writer sufficiently stocked.
This problem has been largely overcome as the new drives have larger input buffers (typically 2 Mbit) and there are now protection mechanisms such as "Burn Proof" technology that ensure the disk will get written properly. Most DVD writers do have buffer under-run protection, but the drives are slower (the fastest is 2.4x) and DVD writing can take up to an hour. Your system is still available, but you should avoid operations involving heavy disk activity.
* A Note about the System Activity Graphs:
The performance graphs in this whitepaper were generated from traces taken with Perfmon - a standard utility program included with the Windows® XP and Windows® 2000 operating systems. Counters were set up to monitor both virtual CPUs, disk, and network activity. The results were exported as a CSV file and graphed from a spreadsheet.
Section 2: Multitasking Software on an HT technology enabled system
Hyper-Threading Technology is enabled by the multi-processing support of the Windows® XP and Windows® 2000 operating systems. Figure 6 shows that HT technology improved the execution times of some common digital media activities - even while the system was under heavy load during a video encoding session.
That said, HT technology does not guarantee your application will run faster. To benefit from HT technology, programs need to have executable sections that can run in parallel. Threading improves the granularity of an application so that operations can be broken up into smaller units whose execution is scheduled and controlled by the operating system. Now two threads can run independently of each other without requiring task switches to get at the resources of the processor.
Figure 7 shows a comparison of code executing on a CPU with HT On versus HT Off. When HT is Off, a process can stall while waiting for I/O to complete or another task to provide information it needs. The CPU is blocked from further execution. When HT is On, other threads continue running and the system does not hang or stall.
Figure 7 - Separate data paths enable the CPU to continue working on other threads, even when one becomes blocked
How the Windows* Operating System handles Multi-threading and Multitasking
Multitasking occurs at the user interface level every time a user runs multiple programs at once. Some applications also perform multitasking internally by creating multiple processes. Each process is given a time-slice during which it executes. Creation of a process involves the creation of an address space and the application's image in memory, which includes a code section, a data section and a stack. Parallel programming using processes requires the creation of two or more processes and an inter-process communication mechanism to coordinate the parallel work.
Threads are tasks that run independently of one another within the context of a process. A thread shares code and data with the parent process but has its own unique stack and architectural state that includes an instruction pointer. Threads require fewer system resources than processes. Intra-process communication is significantly cheaper in CPU cycles than inter-process communication.
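As a minimal sketch of this model, the example below uses standard C++ threads rather than the Windows threading API, and its function names are hypothetical. Two threads of one process read the same shared array, while each keeps its running total in a local variable on its own stack.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Sums data[begin, end) into *out. The accumulator `local` lives on the
// calling thread's own stack; `data` is shared (read-only) by all threads.
void sum_range(const std::vector<int>& data, std::size_t begin,
               std::size_t end, long long* out) {
    assert(begin <= end && end <= data.size());
    long long local = 0;
    for (std::size_t i = begin; i < end; ++i) local += data[i];
    *out = local;  // single write to this thread's own result slot
}

// Splits the work between a worker thread and the calling thread.
long long parallel_sum(const std::vector<int>& data) {
    const std::size_t mid = data.size() / 2;
    long long low = 0, high = 0;
    std::thread worker(sum_range, std::cref(data), std::size_t{0}, mid, &low);
    sum_range(data, mid, data.size(), &high);  // calling thread does the rest
    worker.join();                             // wait before reading `low`
    return low + high;
}
```

No inter-process communication is needed here: the worker's result is visible to the caller as soon as join() returns.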
The life cycle of a thread begins when the application assigns a thread pool and creates a thread from the pool. When invoked, the thread is scheduled by the Windows XP* operating system according to a round-robin mechanism. The next available thread with the highest priority gets to run.
When the thread is scheduled, the Operating System checks to see which virtual processors are available, then allocates resources needed to execute the thread. Each time a thread is dispatched, resources are replicated, divided, or shared to execute the additional threads. When a thread finishes, the operating system idles the unused processor, freeing the resources associated with that thread.
Section 3: HT Technology Software Design Considerations
In a processor with HT technology, software developers should be aware that architectural state is the only resource that is replicated. All other resources are either shared or partitioned between logical processors. This introduces the issue of resource contention, which can degrade performance, or in the extreme case, cause an application to fail. Synchronization between threads is another area where problems can arise. The following section contains a brief discussion of some of the most common issues in multi-threaded software design.
Synchronization is used in threaded programs to prevent race conditions (for example, multiple threads simultaneously updating the same global variable). A spin-wait loop is a common technique used to wait for the availability of a variable or I/O resource.
Consider the case of a master thread that needs 'to know' when a disk write has completed. The master thread and the disk write thread share a synchronization variable in memory. When this variable gets written, it can cause an out-of-order memory violation that forces a performance penalty. Inserting a PAUSE instruction in the master thread read loop can greatly reduce memory order violations.
Spin-wait loops consume execution resources while they are cycling. One solution: If other tasks are waiting to run, the thread performing the spin lock can insert a call to Sleep(0) which releases the CPU. If no tasks are waiting, this thread immediately continues execution.
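The two refinements described above, PAUSE inside the read loop and giving up the CPU when the wait runs long, can be combined. The sketch below is one possible arrangement, not a prescribed implementation: the helper names and the spin limit of 1000 are assumptions, the PAUSE instruction is issued through the _mm_pause intrinsic on x86 builds only, and std::this_thread::yield() is used as a portable stand-in for the Sleep(0) call mentioned in the text.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>
inline void cpu_relax() { _mm_pause(); }  // emits the PAUSE instruction
#else
inline void cpu_relax() {}                // no PAUSE equivalent assumed here
#endif

// Waits until `ready` becomes true. Spins with PAUSE for a bounded number
// of iterations, then yields so that other runnable threads get the CPU.
void spin_until_ready(const std::atomic<bool>& ready,
                      int spins_before_yield = 1000) {
    assert(spins_before_yield > 0);
    int spins = 0;
    while (!ready.load(std::memory_order_acquire)) {
        if (++spins < spins_before_yield) {
            cpu_relax();
        } else {
            std::this_thread::yield();  // portable stand-in for Sleep(0)
            spins = 0;
        }
    }
}

// Hypothetical "disk write" worker: stores a result, then sets the flag
// that the master thread is polling.
int wait_for_worker() {
    std::atomic<bool> ready{false};
    int result = 0;
    std::thread writer([&] {
        result = 42;                                   // the work
        ready.store(true, std::memory_order_release);  // signal completion
    });
    spin_until_ready(ready);  // master thread read loop
    writer.join();
    return result;
}
```

The release store paired with the acquire load guarantees that the master thread sees the worker's result once it observes the flag.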
Another alternative to long spin-wait loops is to replace the loop with a thread-blocking API, such as WaitForMultipleObjects. When this system call is made with its wait-all option set, the thread will not consume execution resources until all of the listed objects are signaled as ready and have been acquired by the thread.
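The same idea can be sketched in portable C++ with a mutex and a condition variable. This is an analogy to a wait-for-all blocking call, not the Windows API itself, and the CompletionGate class is a hypothetical helper. The waiting thread sleeps inside wait_all() and is woken only when every worker has signaled.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper: counts outstanding tasks and lets one thread block
// until all of them have signaled completion.
class CompletionGate {
public:
    explicit CompletionGate(int expected) : remaining_(expected) {
        assert(expected >= 0);
    }
    void signal() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (--remaining_ == 0) cv_.notify_all();
    }
    void wait_all() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return remaining_ <= 0; });  // blocks, no spinning
    }
private:
    std::mutex mutex_;
    std::condition_variable cv_;
    int remaining_;
};

// Starts `task_count` workers, blocks until all have finished their work,
// and returns the sum of their results.
int run_tasks(int task_count) {
    CompletionGate gate(task_count);
    std::vector<int> results(task_count, 0);
    std::vector<std::thread> workers;
    for (int i = 0; i < task_count; ++i) {
        workers.emplace_back([&gate, &results, i] {
            results[i] = i + 1;  // the work
            gate.signal();
        });
    }
    gate.wait_all();
    int total = 0;
    for (int r : results) total += r;
    for (auto& w : workers) w.join();
    return total;
}
```

While blocked, the waiting thread is descheduled by the operating system, so it does not compete with the workers for shared execution resources.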
Avoid 64K aliasing in L1 Cache
The first level data cache (L1) is a shared resource on HT technology processors. Cache lines are mapped on 64 KB boundaries, so if two virtual memory addresses differ by a multiple of 64 KB, they will conflict for the same L1 cache line. Under Microsoft Windows* operating systems, thread stacks are created on megabyte boundaries, and 64K aliasing can occur when these threads access local variables on their stacks. A simple solution is to offset the starting stack address by a variable amount using the _alloca function.
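The aliasing condition and the offset remedy come down to address arithmetic, sketched below. The function names and the 1 KB offset step are assumptions chosen for illustration. In a real thread, the computed offset would be consumed once at thread entry (for example by a stack allocation of that many bytes) before any local variables are used.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::uintptr_t kAliasStride = 64 * 1024;  // 64 KB, per the text

// True when two distinct addresses differ by a multiple of 64 KB and so
// would compete for the same L1 line under the mapping described above.
bool aliases_64k(std::uintptr_t a, std::uintptr_t b) {
    return a != b && (a % kAliasStride) == (b % kAliasStride);
}

// Hypothetical per-thread stack offset: a different multiple of `step`
// bytes for each thread index, always smaller than the 64 KB stride.
std::size_t stack_offset_for(unsigned thread_index, std::size_t step = 1024) {
    assert(step > 0 && step < kAliasStride);
    const std::size_t slots = kAliasStride / step;  // distinct offsets available
    return (thread_index % slots) * step;
}
```

For example, stack bases at 0x100000 and 0x200000 are exactly 1 MB apart and therefore alias. After thread 0 applies an offset of 0 bytes and thread 1 an offset of 1024 bytes, the two addresses no longer share the same position within a 64 KB window.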
False Sharing in the Data Cache
A cache line in the Pentium® 4 processor consists of 64 bytes for write operations, and 128 bytes for reads. False sharing occurs when two threads access different data elements in the same cache line. When one of those threads performs a write operation, the cache line is invalidated, causing the second thread to have to fetch the cache line (128 bytes) again from memory. If this occurs frequently, false sharing can seriously degrade the performance of an application.
False sharing can be diagnosed using the Intel VTune™ Performance Analyzer to monitor the 'machine clear caused by other thread' counter. Some techniques to avoid false sharing include partitioning data structures, creating a local copy of the data structure for each thread, or padding data structures so they are twice the size of a read cache line.
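The padding technique can be sketched with an aligned structure. The example below takes the 128-byte read line size from the text and applies the "twice the size" rule. The type and function names are hypothetical, and the line size should be treated as a parameter on other processors.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <thread>

constexpr std::size_t kReadLineBytes = 128;            // read line size quoted above
constexpr std::size_t kPadBytes = 2 * kReadLineBytes;  // "twice the size" rule

// Each counter is aligned and padded to kPadBytes, so two threads updating
// neighboring counters never touch the same cache line.
struct alignas(kPadBytes) PaddedCounter {
    long value = 0;
};
static_assert(sizeof(PaddedCounter) == kPadBytes, "padding not applied");

// Two threads each increment their own padded counter; the totals are
// combined only after both have finished.
long count_in_parallel(long iterations_per_thread) {
    assert(iterations_per_thread >= 0);
    std::array<PaddedCounter, 2> counters{};
    auto work = [&counters, iterations_per_thread](std::size_t slot) {
        for (long i = 0; i < iterations_per_thread; ++i) {
            counters[slot].value += 1;
        }
    };
    std::thread other(work, std::size_t{1});
    work(0);
    other.join();
    return counters[0].value + counters[1].value;
}
```

Without the alignas padding, both counters would fit in one cache line and every increment by one thread could invalidate the line held by the other.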
Write Combining Buffers
The Intel® NetBurst™ architecture has 6 Write Combine store (WC) buffers, each buffering one cache line. The Write Combine buffers allow code execution to proceed by combining multiple write operations before they are written back to memory through the L1 or L2 caches. If an application is writing to more than 4 cache lines at about the same time, the WC store buffers will begin to be flushed to the second level cache. This is done to help ensure that a WC store buffer is ready to combine data for writes to a new cache line.
To take advantage of the Write Combining buffers, an application should write to no more than 4 distinct addresses or arrays inside an inner loop. On HT technology enabled processors, the WC store buffers are a shared resource; therefore, the total number of simultaneous writes by both threads running on the two logical processors must be considered. If data is being written inside of a loop, it is best to split inner loop code into multiple inner loops, each of which writes no more than two regions of memory.
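The loop-splitting advice can be illustrated as follows. This is a sketch with hypothetical names: fill_fused writes four output arrays in a single inner loop, while fill_split produces the same values in two loops that each write only two regions of memory.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Four separate output arrays ("regions of memory").
struct Planes {
    std::vector<float> a, b, c, d;
};

Planes make_planes(std::size_t n) {
    Planes p;
    p.a.resize(n); p.b.resize(n); p.c.resize(n); p.d.resize(n);
    return p;
}

// Single loop: four output streams are written in every iteration.
void fill_fused(Planes& out, const std::vector<float>& src) {
    assert(out.a.size() == src.size() && out.b.size() == src.size());
    assert(out.c.size() == src.size() && out.d.size() == src.size());
    for (std::size_t i = 0; i < src.size(); ++i) {
        out.a[i] = src[i] + 1.0f;
        out.b[i] = src[i] * 2.0f;
        out.c[i] = src[i] - 1.0f;
        out.d[i] = src[i] * 0.5f;
    }
}

// Split version: each loop writes only two regions of memory.
void fill_split(Planes& out, const std::vector<float>& src) {
    assert(out.a.size() == src.size() && out.b.size() == src.size());
    assert(out.c.size() == src.size() && out.d.size() == src.size());
    for (std::size_t i = 0; i < src.size(); ++i) {
        out.a[i] = src[i] + 1.0f;
        out.b[i] = src[i] * 2.0f;
    }
    for (std::size_t i = 0; i < src.size(); ++i) {
        out.c[i] = src[i] - 1.0f;
        out.d[i] = src[i] * 0.5f;
    }
}
```

The split version reads the source twice, so whether it is faster depends on the workload. It is worth measuring both forms rather than assuming a gain.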
Cache Blocking Techniques
Cache blocking involves structuring data blocks so that they conveniently fit into a portion of the L1 or L2 cache. By controlling data cache locality, an application can minimize performance delays due to memory bus access. This is accomplished by dividing a large array into smaller blocks of memory (tiles) so that a thread can make repeated accesses to that data while it is still in cache. For example, image processing and video applications are well suited to cache blocking techniques because an image can be processed in smaller portions of the total image or video frame. Compilers often use the same technique, by grouping related blocks of instructions close together so they execute from the L2 cache.
The effectiveness of the cache blocking technique is highly dependent on data block size, processor cache size, and the number of times the data is reused. Cache sizes vary based on processor. An application can detect the data cache size using Intel's CPUID instruction and dynamically adjust cache blocking tile sizes to maximize performance. As a general rule, cache block sizes should target approximately one-half to three-quarters the size of the physical cache for systems that are not HT technology enabled and one-quarter to one-half the physical cache size for systems that are.
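A minimal sketch of the technique follows, using a tiled matrix transpose as a stand-in for an image operation. All names are hypothetical. The cache size is passed in as a parameter on the assumption that it was detected elsewhere, and the budget helper uses the low end of each range in the rule of thumb above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Block budget from the rule of thumb in the text, using the low end of
// each range: one-half of the cache without HT, one-quarter with HT.
std::size_t block_budget_bytes(std::size_t cache_bytes, bool ht_enabled) {
    return ht_enabled ? cache_bytes / 4 : cache_bytes / 2;
}

// Largest square tile edge (in elements) such that one source tile and one
// destination tile together fit within the budget.
std::size_t tile_edge(std::size_t budget_bytes, std::size_t elem_bytes) {
    assert(elem_bytes > 0);
    std::size_t edge = 1;
    while (2 * (edge + 1) * (edge + 1) * elem_bytes <= budget_bytes) ++edge;
    return edge;
}

// Transposes a rows x cols row-major image one tile at a time, so each
// tile is read and written while it is still resident in cache.
std::vector<int> transpose_blocked(const std::vector<int>& src,
                                   std::size_t rows, std::size_t cols,
                                   std::size_t tile) {
    assert(tile > 0 && src.size() == rows * cols);
    std::vector<int> dst(src.size());
    for (std::size_t r0 = 0; r0 < rows; r0 += tile)
        for (std::size_t c0 = 0; c0 < cols; c0 += tile)
            for (std::size_t r = r0; r < std::min(r0 + tile, rows); ++r)
                for (std::size_t c = c0; c < std::min(c0 + tile, cols); ++c)
                    dst[c * rows + r] = src[r * cols + c];
    return dst;
}
```

As an illustration, a 512 KB cache on an HT technology enabled system gives a 128 KB budget, and with 4-byte elements tile_edge returns 128, that is, a pair of 128x128 tiles.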
Adjusting Task Priorities for Background Tasks
In some applications there are background activities that run continuously, but have little impact on the responsiveness of the system. In these cases, consider adjusting the task or thread priority downward so that this code only runs when resources become available from higher priority tasks.
Conversely, if an application requires real-time response, it can increase task priority so that it runs ahead of other normal priority tasks. This technique should be used with caution, since it can degrade the responsiveness of the user interface, and may affect the performance of other applications running on the system.
On a multi-processor or HT technology enabled system, load balancing is normally handled by the operating system, which allocates workload to the next available resource. In some cases a virtual CPU will become idle while the other is overloaded. The developer can address a load imbalance such as this by setting Processor Affinity.
Processor affinity allows a thread to specify exactly which processor (or processors) the operating system may select when it schedules the thread for execution. When an application specifies the processor affinity for all of its active threads, it can ensure that load imbalance will not occur among its threads and eliminate thread migration from one virtual processor to another.
Simultaneous Fixed and Floating Point Operations
With HT technology, there are several ALUs (for integer logic), but only one shared floating-point unit. If your application uses floating-point calculations, it may be beneficial to isolate those threads and set the processor affinity of the threads to minimize processor resource contention.
Avoiding Dependence on Timing Loops
Relying on the execution timing between threads as a synchronization technique is not reliable because of speed differences between host systems. Delay loops are sometimes used during initialization as well, and should be avoided for the same reasons.
Software Design Considerations for Multitasking
Most of the rules above also apply to multitasking, plus there are a few additional considerations. Task switches are much slower than thread context switches because each task operates in its own address space. The state of the previously running task must be saved, and data residing in the cache will be invalidated and reloaded.
Hyper-Threading Technology enhances multitasking because the state information for each task is stored on a separate virtual processor. Cache invalidation will still occur, but the need for a task switch is eliminated since both tasks can run at once. Since cache is a shared resource on HT technology enabled processors, all of the above rules regarding data alignment and blocking still apply.
Contention for resources can be a problem when multitasking. It can occur in memory, on the system busses, or on I/O devices. Consider the case of video capture while creating an MP3 file. Both applications use the hard disk intensively, but video capture has to occur in real-time. The result of contention is that the video drops frames, and the MP3 file skips.
Applications should check the status of an I/O device before attempting to pass data to it. If necessary, peripherals can be locked to avoid access by other applications. This makes sense for a CD or DVD writer, which is essentially a single use device. Locking the hard drive is not recommended, since it is a critical OS resource.
Task and thread priority can have a dramatic effect in a multitasking environment. If priority is raised in a task that runs continuously, other tasks will starve until the high priority task releases the processor. Lowering priority on such a task may be the best choice all around. Consider the case of a video encoder which normally takes 100% of the processor. If you lower the priority, the user will be able to use the computer on demand, and the video encode will still run 100% of the time when the CPU is otherwise available.
Load balancing within applications can actually degrade multi-tasking performance. If one application assumes it has full control of both processors, resource contention may occur when a second application attempts to load. This highlights a fundamental issue with multitasking programs: you never know what other software will be running concurrently with your program. It is usually best not to lock up resources that other programs will likely need.
Optimized Compilers and Libraries Help Avoid Multi-threading Problems
The best way to design, implement, and tune for HT technology enabled processors is to start with components or libraries that are thread-safe and designed for use with this technology. The operating system and threading libraries are likely to be optimized already for various processors. Use operating system and/or threading synchronization libraries instead of implementing application-specific mechanisms such as spin-waits. Existing applications can take advantage of enhanced code modules by re-linking or through the use of dynamic link libraries.
Intel compilers enable threading by supporting both OpenMP and auto-parallelization. OpenMP is an industry standard for portable threaded application development, and is effective at threading loop-level parallel problems and function level parallelism. The C++ compiler supports OpenMP API version 1.0 and performs code transformation for shared memory parallel programming.
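A loop-level OpenMP example might look like the sketch below. The function name is hypothetical. When the compiler's OpenMP option is enabled, the loop iterations are divided among threads and the per-thread partial sums are combined by the reduction clause. When it is not enabled, the pragma is ignored and the loop runs serially with the same result.

```cpp
#include <cassert>
#include <climits>
#include <vector>

// Scales every sample by `gain` in place and returns the sum of the
// scaled values. Each iteration touches only its own element, so the
// iterations can safely run in parallel.
long long scale_and_sum(std::vector<int>& samples, int gain) {
    assert(samples.size() <= static_cast<std::size_t>(INT_MAX));
    const int n = static_cast<int>(samples.size());
    long long sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; ++i) {
        samples[i] *= gain;
        sum += samples[i];
    }
    return sum;
}
```

The only shared variable that is written by every iteration is the running sum, which is why it appears in the reduction clause instead of being protected by an application-specific lock.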
The Intel® C++ Compiler for Windows with auto-parallelization uses a high-level symmetric multi-processing (SMP) programming model to enable migration to multiprocessing machines (multiple physical CPUs). This option detects which loops are capable of being executed safely in parallel and automatically generates threaded code for these loops. Automatic parallelization relieves the user from having to deal with the low-level details of iteration partitioning, data sharing, thread scheduling and synchronizations.
Tools to identify performance bottlenecks
The Intel® VTune™ Performance Analyzer allows you to visualize how software utilizes CPU resources. By seeking out 'hotspots' in your code, you can focus on optimizing the sections of code that occupy most of the computation time. VTune enables you to view potential problem areas by memory location, functions, classes, or source files. You can double-click and open the source or assembly view for the hotspot and see more detailed information about the performance of each instruction. The Intel® Thread Checker, which works in conjunction with the Intel VTune™ Performance Analyzer, automates detection of most threading errors such as:
- Deadlocks (detection and prediction)
- Memory access issues
- Race conditions
- Thread stalls (potential deadlocks), waits
- Potential and realized data races / dependencies, improperly synchronized I/O
- Invalid threading-library calls, arguments, returns
- Threaded calls to non-reentrant routines
Summary and Conclusion
Digital video production is among the most demanding of applications you can run on your computer, yet the Pentium® 4 with Hyper-Threading Technology delivers smooth, real-time performance even while other programs are running. These benefits go beyond simple clock rate.
HT technology improves the availability of the CPU for multi-tasking and background processing. The User Interface is noticeably more responsive under heavy system loads. All of these factors contribute to a favorable, and more productive, end-user experience when using your programs with other digital media tools.
Threaded applications take this one step further by improving parallel execution within a task. Application developers can produce optimized code using the Intel Compiler, and then use the VTune Performance Analyzer to identify code hotspots and bottlenecks as described in this paper.
A substantial collection of whitepapers and application notes is available to provide Hyper-Threading technical help for application developers. For more information search on Hyper-Threading at the Intel software developers website /en-us/parallel/.
Software used in this whitepaper
- Pinnacle Studio Version 8
- Ulead Movie Producer Version 7
- Sonic Solutions MyDVD
- Roxio Movie Creator 5
- Roxio EZ CD Creator 5.3
- Musicmatch 7.0