Digital media applications are unique in that they can generally consume all the performance they can get. Unlike other tasks that execute in a few seconds, the rendering of stills, audio and video can take several minutes or even hours. Applications in the digital media space can translate increases in performance to increases in end-user productivity, and it is therefore beneficial for them to take advantage of the latest platform technologies.
The Pentium® 4 Processor with Hyper-Threading Technology delivers performance and architectural features that dramatically reduce the overall processing time and improve the responsiveness of the system. Processors equipped with Hyper-Threading Technology have multiple logical CPUs per physical package. The state information necessary to support each logical processor is replicated, while the underlying physical processor resources are shared or partitioned. Multiple threads running in parallel can achieve higher processor utilization and increased throughput.
In Section 1, Video Production is used as an example of application workflow to show how Hyperthreading Technology benefits digital media production. Each of the four major steps of video production is examined in detail.
In Section 2, the multi-tasking characteristics of a system with HT technology are considered. When multiple applications are running on a system, HT technology helps reduce stalls and task switching delays caused by the interaction of two or more independent programs.
Section 3 discusses some software and system level design considerations for optimizing multi-threaded applications in a multitasking environment. In this section we’ll look at how application developers can use the Intel compilers to develop optimized code and then use the VTune® Performance Analyzer to identify hotspots and optimize the code.
Section 1: Video Production Case Study
Video production is a complex multi-step process that often involves using multiple programs to achieve the desired output. There are four major steps to this process:
- Acquire: Capture movies and pictures, capture audio
- Build/Edit: Edit, mix, preview, store your project
- Render: Apply compression and format the file
- Output: Store the end result on hard drive, or burn to disk
Digital Video Cameras connect to the PC using IEEE 1394 (Firewire), USB, or through an analog connection. They transmit at a fixed rate of 25 or 30 frames per second (depending on format: PAL or NTSC), so the capture step can never go faster than the actual play time of the video. Five minutes of video takes five minutes to capture.
The data rates are sustained (about 3.6 MB per second for a DV stream) and the PC has to keep up with the source, or else dropped frames will result. Dropped frames degrade the quality of the video, so most software packages warn not to do anything else on your system while capture is under way.
On non-Hyper Threaded systems and slower systems that’s still good advice. But with HT technology, multi-tasking capability of the system is enhanced: A background task is less likely to get pre-empted by other programs. Multi-tasking allows the user to continue using their PC for other activities. Video capture is not a CPU intensive activity – it typically consumes less than 15% of a 3.06 GHz Pentium® 4 processor (figure 1). Why not allow the end user to use that time for something else?
Figure 1: DV Capture from IEEE 1394
Some capture applications simultaneously encode the incoming Digital Video stream into Windows Media or MPEG formats. The advantages are that smaller files are created, and the media is in the desired output format early in the process. The disadvantage is that some quality will be lost early in the production cycle.
Figure 2 shows DV capture from IEEE 1394, with encoding to MPEG2 for the output. Capture time is still the rate-limiting step, but the CPU is kept very busy with the encoding task. The completion times of both were approximately equal, but with HT technology there is more CPU capability available for other tasks to use. This translates into faster UI responsiveness, even under heavy multi-tasking loads.
Figure 2: DV Capture from IEEE 1394 with MPEG 2 Encoding
It takes a lot of multi-tasking activity during video capture to cause frame drops on Hyper Threaded systems. The only caveat is to watch out for disk conflicts. The I/O rates during DV capture create a continuous demand on the hard disk of about 2-3%. The data rates are not that high, but streaming must be maintained. If these disk updates cannot happen in real-time, then frame dropping can result. Application developers can avoid some of these problems by locking I/O resources during critical real-time operations – but should do so with the full understanding that other applications may stall as a result.
During the editing phase of production, audio, video, and stills are mixed together from various sources. During video preview, decoders for MP3, MPEG2, AVI, and other formats will be running. Individually, these decoders do not demand high CPU utilization. Playback is very smooth – until you throw in additional audio tracks, transitions and special effects where multiple codecs and filters must run simultaneously. The more complex transitions can be very CPU intensive and usually involve decoding two or more media streams at the same time. In figure 3, the peaks that are seen every 10 seconds are video transitions.
Figure 3: Video Preview with Transitions
Rendering involves taking the edit decision list and creating a video file on the hard disk. Rendering is very CPU intensive: it can use all the performance you can throw at it and scales well with faster processors. Audio and video encoders run simultaneously during rendering, so this step is well suited to threading and parallelism.
Figure 4: Video Encoding to MPEG2
With the speed of a 3 GHz Pentium® 4 processor and HT technology, it is now possible to encode full resolution NTSC video faster than real-time! In Figure 4 the source was a 180 second DV video. Without HT technology, the video was encoded in 136 seconds. With HT technology enabled, the time to encode MPEG2 decreased to 111 seconds. For a one-hour video project, the encode time was about 37 minutes.
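The arithmetic behind these figures can be checked with a short sketch. The function names below are illustrative and not part of any tool used in this paper, and the projection assumes encode time grows linearly with source duration.

```cpp
#include <cassert>

// Ratio of source duration to encode time. Values above 1.0 mean the
// encoder runs faster than real time.
double realtime_factor(double source_seconds, double encode_seconds) {
    assert(encode_seconds > 0.0);
    return source_seconds / encode_seconds;
}

// Scale a measured encode time to a project of a different length,
// assuming encode time is proportional to source duration.
double projected_encode_minutes(double project_minutes,
                                double source_seconds,
                                double encode_seconds) {
    assert(source_seconds > 0.0);
    return project_minutes * encode_seconds / source_seconds;
}
```

With the measurements above, realtime_factor(180, 136) is about 1.32 without HT technology and realtime_factor(180, 111) is about 1.62 with it, and projected_encode_minutes(60, 180, 111) gives the 37 minutes quoted for a one-hour project.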
One surprising difference between Hyper-Threaded and non-Hyper-Threaded systems is the responsiveness of the User Interface. New tasks launch right away and the cursor is rarely in an hourglass. The encoding task is spread across both logical processors, and there is plenty of headroom for other applications to run. Without Hyper-Threading Technology, Figure 4 shows the CPU is 70% consumed with the video encode task, leaving limited resources for other programs.
Output to media
After video encoding is complete, the next part of the process is to create a disk image and write it to CD or DVD so that it can be distributed and played back on a DVD player connected to a television. This step has two major phases, each with many sub-steps that utilize different parts of the system.
In the first phase the video and audio files are converted to the proper format. Depending on the playback target, the format may be MPEG2 (for consumer DVD players), MPEG4 (for posting on the web), or VCD (a lower resolution format for writable CD’s). In the following example, the output media will be assumed to be high quality DVD-compatible 720×480 30 frames per second. Figure 5 shows the two phases of the Output cycle for writing a DVD.
Figure 5: Output to DVD
Phase 1 involves transcoding or re-encoding the audio and video streams into a compatible format. This is a CPU intensive process, and hard disk activity is also high as files get read in, modified, and then written back out to disk. Without Hyperthreading technology, the CPU is 100% consumed and takes about three (3) times longer for the encoding phase.
Phase 2 consists of file operations to prepare the image for burning and then write it to the media. It is not CPU intensive since the rate-limiting step is the CD or DVD burner. Multitasking during optical disk writing can be risky. On older systems there is an event known as a “Buffer Under-run” that can occur if the CPU is not able to produce data fast enough to keep the disk writer sufficiently stocked.
This problem has been largely overcome as the new drives have larger input buffers (typically 2 MB) and there are now protection mechanisms such as “BURN-Proof” technology that ensure the disk will get written properly. Most DVD writers do have buffer under-run protection, but the drives are slower (typically 2x) and DVD writing can take up to an hour. Your system is still available, but you should avoid operations involving heavy disk activity.
Section 2: Multitasking Software in a Hyperthreaded Environment
Figure 6: Video Encoding while Multitasking
Hyper Threading technology is enabled by the multi-processing support in the Windows® XP and Windows® 2000 operating systems. Figure 6 shows that HT technology improved the execution times of some common digital media activities – even while the system was under heavy load during a video encoding session.
Hyper Threading does not guarantee that your application will run faster. To benefit from hyperthreading, programs need to have executable sections that can run in parallel. Threading improves the granularity of an application so that operations can be broken up into smaller units whose execution is scheduled and controlled by the operating system. Now two threads can run independently of each other without requiring task switches to get at the resources of the processor.
Figure 7 shows a comparison of code executing on a CPU with HT On versus HT Off. When HT is off, a process can stall while waiting for I/O to complete or another task to provide information it needs. The CPU is blocked from further execution. When HT is on, other threads continue running so the system does not hang or stall.
Figure 7: Separate data paths enable the CPU to continue working on other threads, even when one becomes blocked
How the Windows® Operating System handles Multi-threading and Multitasking
Multitasking occurs at the user interface level every time a user runs multiple programs at once. Some applications also perform multitasking internally by creating multiple processes. Each process is given a time-slice during which it executes. Creation of a process involves the creation of an address space and the application’s image in memory, which includes a code section, a data section and a stack. Parallel programming using processes requires the creation of two or more processes and an inter-process communication mechanism to coordinate the parallel work.
Threads are tasks that run independently of one another within the context of a process. A thread shares code and data with the parent process but has its own unique stack and architectural state that includes an instruction pointer. Threads require fewer system resources than processes. Intra-process communication is significantly cheaper in CPU cycles than inter-process communication.
The life cycle of a thread begins when the application assigns a thread pool and creates a thread from the pool. Once the thread is ready to run, the Windows® XP scheduler selects it by priority: the highest-priority ready thread runs next, and threads of equal priority share the processor in round-robin fashion.
When the thread is scheduled the Operating System checks to see which logical processors are available, then allocates the necessary resources to execute the thread. Each time a thread is dispatched, resources are replicated, divided, or shared to execute the additional threads. When a thread finishes, the operating system idles the unused processor, freeing the resources associated with that thread.
Section 3: Software Design Considerations for Hyperthreading:
In a processor with HT technology, software developers should be aware that architectural state is the only resource that is replicated. All other resources are either shared or partitioned between logical processors. This introduces the issue of resource contention, which can degrade performance, or in the extreme case – cause an application to fail. Synchronization between threads is another area where problems can arise. The following section contains a brief discussion of some of the most common issues in multi-threaded software design. For more complete information, a collection of technical papers is available at the Intel Developer Services website:
Synchronization is used in threaded programs to prevent race conditions (e.g., multiple threads simultaneously updating the same global variable). A spin-wait loop is a common technique used to wait for the availability of a variable or I/O resource.
Consider the case of a master thread that needs to know when a disk write has completed. The master thread and the disk write thread share a synchronization variable in memory, which the master thread reads in a tight loop. When the disk write thread finally writes this variable, the processor detects a memory order violation among the loads it has issued speculatively and must flush the pipeline, which incurs a performance penalty. Inserting a PAUSE instruction in the master thread’s read loop can greatly reduce memory order violations.
Spin-wait loops consume execution resources while they are cycling. If other tasks are waiting to run, the thread performing the spin-wait can call Sleep(0), which gives up the remainder of its time slice. If no tasks are waiting, this thread immediately continues execution.
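A minimal sketch of such a read loop follows. It assumes an x86 compiler that provides the _mm_pause intrinsic, which emits the PAUSE instruction, and falls back to a no-op elsewhere. The flag and function names are hypothetical and stand in for whatever synchronization variable the application shares.

```cpp
#include <atomic>
#include <cassert>
#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>
#define CPU_PAUSE() _mm_pause()   // emits the PAUSE instruction
#else
#define CPU_PAUSE() ((void)0)     // assumption: no equivalent hint available
#endif

// Hypothetical flag shared by the master thread and the disk write thread.
std::atomic<bool> g_write_done{false};

// Called by the disk write thread once the write has completed.
void signal_write_done() {
    g_write_done.store(true, std::memory_order_release);
}

// Called by the master thread. Spins until the flag is set and returns
// how many times the loop body ran.
unsigned long spin_until_done() {
    unsigned long spins = 0;
    while (!g_write_done.load(std::memory_order_acquire)) {
        CPU_PAUSE();  // hint to the processor that this is a spin-wait loop
        ++spins;
    }
    return spins;
}
```

The PAUSE hint does not change what the loop computes. It only tells the processor that the loads are part of a wait, so the loop exits with less pipeline disruption and leaves more execution resources for the other logical processor.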
Another alternative to long spin-wait loops is to replace the loop with a thread-blocking API, such as WaitForMultipleObjects. Using this system call ensures that the thread will not consume execution resources until the listed objects are signaled: all of them when the bWaitAll parameter is TRUE, or any one of them when it is FALSE.
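The two alternatives can be sketched in portable C++ as follows. This is an analogy under stated assumptions, not the Win32 code itself: std::this_thread::yield stands in for Sleep(0), and a condition variable stands in for a Win32 event passed to a wait function. The type and function names are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Spin-wait that gives up the processor on every iteration, in the
// spirit of calling Sleep(0) inside the loop.
void spin_with_yield(const std::atomic<bool>& done) {
    while (!done.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
}

// Blocking alternative: the waiting thread sleeps inside wait() and
// consumes no execution resources until set() is called.
struct CompletionEvent {
    std::mutex m;
    std::condition_variable cv;
    bool signaled = false;

    void set() {
        {
            std::lock_guard<std::mutex> lock(m);
            signaled = true;
        }
        cv.notify_all();
    }

    void wait() {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [this] { return signaled; });
    }
};
```

The blocking form is usually the better choice when the wait may be long, because the operating system removes the thread from scheduling entirely instead of revisiting it on every time slice.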
Avoid 64K aliasing in L1 Cache:
The first level data cache (L1) is a shared resource on HT technology processors. Cache lines are mapped on 64 KB boundaries, so two virtual memory addresses that are a multiple of 64 KB apart conflict for the same L1 cache line. Under Microsoft Windows® operating systems, thread stacks are created on megabyte boundaries, so 64K aliasing can occur when threads access local variables on their stacks. A simple solution is to offset the starting stack address of each thread by a variable amount using the _alloca function.
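One way to apply a per-thread stack offset is sketched below. The helper names and the choice of 1 KB steps are assumptions made for illustration. The code uses _alloca on Windows and alloca elsewhere.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(_WIN32)
#include <malloc.h>
#define STACK_ALLOC(n) _alloca(n)
#else
#include <alloca.h>
#define STACK_ALLOC(n) alloca(n)
#endif

// True when two addresses are a multiple of 64 KB apart, which is the
// condition described above for an L1 conflict.
bool aliases_64k(std::uintptr_t a, std::uintptr_t b) {
    return ((a ^ b) & 0xFFFFu) == 0;
}

// Hypothetical policy: give each thread a different offset, 1 KB to 8 KB.
std::size_t stack_offset_for(unsigned thread_index) {
    return (static_cast<std::size_t>(thread_index % 8u) + 1u) * 1024u;
}

// Wrap the body of a thread so that its locals start at a shifted stack
// address. The padding lives until this function returns.
template <typename Work>
void run_with_stack_offset(unsigned thread_index, Work work) {
    volatile char* pad = static_cast<volatile char*>(
        STACK_ALLOC(stack_offset_for(thread_index)));
    pad[0] = 0;  // touch the block so the allocation is not optimized away
    work();
}
```

A thread entry function would call run_with_stack_offset with its own index and place the real work inside the callable, so that each thread’s local variables no longer sit at the same offset within a 64 KB region.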
False Sharing in the Data Cache
A cache line in the Pentium® 4 processor consists of 64 bytes for write operations, and 128 bytes for reads. False sharing occurs when two threads access different data elements in the same cache line. When one of those threads performs a write operation, the cache line is invalidated, causing the second thread to have to fetch the cache line (128 bytes) again from memory. If this occurs frequently, false sharing can seriously degrade the performance of an application.
False sharing can be diagnosed using the VTune™ Performance Analyzer to monitor the “machine clear caused by other thread” counter. Some techniques to avoid false sharing include partitioning data structures, creating a local copy of the data structure for each thread, or padding data structures so they are twice the size of a read cache line.
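The padding technique can be sketched with the C++11 alignas specifier. The constant below uses the 128-byte read size quoted above and is an assumption to adjust, for example to 256 when following the twice-the-line-size suggestion. The counter type and array are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed padding size in bytes. Adjust for the target processor.
constexpr std::size_t kPadBytes = 128;

// Each counter is aligned and padded to kPadBytes, so counters owned by
// different threads never share a cache line.
struct alignas(kPadBytes) PerThreadCounter {
    unsigned long count = 0;
};
static_assert(sizeof(PerThreadCounter) == kPadBytes,
              "padding should round the struct up to kPadBytes");

// One slot per logical processor.
PerThreadCounter g_counters[2];

// Each thread updates only its own slot.
void count_events(unsigned thread_index, unsigned long n) {
    assert(thread_index < 2);
    for (unsigned long i = 0; i < n; ++i) {
        ++g_counters[thread_index].count;
    }
}
```

Without the alignas specifier, both counters would fit in a single cache line and every increment by one thread would invalidate the line held by the other.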
Write Combining Buffers
The Intel NetBurst™ architecture has 6 Write Combine store (WC) buffers, each buffering one cache line. The Write Combine buffers allow code execution to proceed by combining multiple write operations before they get written back to memory through the L1 or L2 caches. If an application is writing to more than 4 cache lines at the same time, the WC store buffers will begin to be flushed to the second level cache. This is done to help ensure that a WC store buffer is ready to combine data for writes to a new cache line.
To take advantage of the Write Combining buffers, an application should not write to more than 4 distinct addresses or arrays inside an inner loop. On Hyper-Threading enabled processors, the WC store buffers are a shared resource; therefore, the total number of simultaneous writes by both threads running on the two logical processors must be considered. If data is being written inside of a loop, it is best to split inner loop code into multiple inner loops, each of which writes no more than two regions of memory.
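The loop-splitting advice can be illustrated with a small sketch. The data layout and function names are hypothetical. The first function writes four output arrays in one inner loop, and the second produces the same result with two loops that each write two arrays.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Four separate output arrays, as in a planar image format.
struct Planes {
    std::vector<float> r, g, b, a;
};

// Allocate all four planes with n elements each.
Planes make_planes(std::size_t n) {
    Planes p;
    p.r.assign(n, 0.0f);
    p.g.assign(n, 0.0f);
    p.b.assign(n, 0.0f);
    p.a.assign(n, 0.0f);
    return p;
}

// Original form: a single inner loop writes four distinct regions.
void fill_fused(Planes& out, const std::vector<float>& src) {
    assert(out.r.size() == src.size());
    for (std::size_t i = 0; i < src.size(); ++i) {
        out.r[i] = src[i];
        out.g[i] = src[i] * 0.5f;
        out.b[i] = src[i] * 0.25f;
        out.a[i] = 1.0f;
    }
}

// Split form: each inner loop writes only two regions, leaving write
// combining buffers free for the other logical processor.
void fill_split(Planes& out, const std::vector<float>& src) {
    assert(out.r.size() == src.size());
    for (std::size_t i = 0; i < src.size(); ++i) {
        out.r[i] = src[i];
        out.g[i] = src[i] * 0.5f;
    }
    for (std::size_t i = 0; i < src.size(); ++i) {
        out.b[i] = src[i] * 0.25f;
        out.a[i] = 1.0f;
    }
}
```

The split form reads the source array twice, so it is worth measuring both versions on the target processor before committing to either.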
Cache Blocking Techniques
Cache blocking involves structuring data blocks so that they conveniently fit into a portion of the L1 or L2 cache. By controlling data cache locality, an application can minimize performance delays due to memory bus access. This is accomplished by dividing a large array into smaller blocks of memory (tiles) so that a thread can make repeated accesses to that data while it is still in cache. For example, image processing and video applications are well suited to cache blocking techniques because an image can be processed in smaller portions of the total image or video frame. Compilers often use the same technique, grouping related blocks of instructions close together so they execute from the L2 cache.
The effectiveness of the cache blocking technique is highly dependent on data block size, processor cache size, and the number of times the data is reused. Cache sizes vary based on processor. An application can detect the data cache size using Intel’s CPUID instruction and dynamically adjust cache blocking tile sizes to maximize performance. As a general rule, cache block sizes should target approximately one-half to three-quarters the size of the physical cache for non-Hyper-Threading processors and one-quarter to one-half the physical cache size for a Hyper-Threading enabled processor supporting two logical processors.
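A minimal sketch of tile sizing and blocked processing follows. The cache size is passed in as a parameter, since CPUID detection is outside the scope of the example, and the sizing function takes the lower end of each range given above. The two-pass brightness operation and all names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Pick a tile size from the cache size: one-half of the cache without
// Hyper-Threading, one-quarter with two logical processors sharing it.
std::size_t tile_bytes(std::size_t cache_bytes, bool hyper_threading) {
    return hyper_threading ? cache_bytes / 4 : cache_bytes / 2;
}

// Apply two passes to each tile before moving on, so the second pass
// finds its data still in cache.
void brighten_blocked(std::vector<float>& pixels, std::size_t tile_elems) {
    assert(tile_elems > 0);
    for (std::size_t start = 0; start < pixels.size(); start += tile_elems) {
        const std::size_t end = std::min(start + tile_elems, pixels.size());
        for (std::size_t i = start; i < end; ++i) {
            pixels[i] *= 1.5f;                        // pass 1: apply gain
        }
        for (std::size_t i = start; i < end; ++i) {
            pixels[i] = std::min(pixels[i], 255.0f);  // pass 2: clamp
        }
    }
}
```

For a 512 KB cache shared by two logical processors, tile_bytes returns 128 KB, and dividing that by sizeof(float) gives the element count to pass as tile_elems.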
Adjusting Task Priorities for Background Tasks
In some applications there are background activities that run continuously, but have little impact on the responsiveness of the system. In these cases, consider adjusting the task or thread priority downward so that this code only runs when resources become available from higher priority tasks.
Conversely, if an application requires real-time response, it can increase task priority so that it runs ahead of other normal priority tasks. This technique should be used with caution, since it can degrade the responsiveness of the user interface, and may affect the performance of other applications running on the system.
On a multiple processor system or on Hyper-Threading enabled processors, load balancing is normally handled by the operating system, which allocates workload to the next available resource. In some cases it is possible that one of the logical CPUs becomes idle, while the other is overloaded. In this case a developer can address load imbalance by setting Processor Affinity.
Processor affinity allows a thread to specify exactly which processor (or processors) the operating system may select when it schedules the thread for execution. When an application specifies the processor affinity for all of its active threads, it can ensure that load imbalance will not occur among its threads and eliminate thread migration from one logical processor to another.
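The priority and affinity adjustments described above can be sketched as follows. On Windows the code calls SetThreadPriority and SetThreadAffinityMask on the current thread. On other platforms the wrappers are assumed to be no-ops that return false. The wrapper names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#ifdef _WIN32
#include <windows.h>
#endif

// Build an affinity mask with only the bit for one logical processor set.
std::uint64_t affinity_mask_for(unsigned logical_cpu) {
    assert(logical_cpu < 64);
    return std::uint64_t{1} << logical_cpu;
}

// Restrict the calling thread to a single logical processor.
bool pin_current_thread(unsigned logical_cpu) {
#ifdef _WIN32
    const DWORD_PTR mask =
        static_cast<DWORD_PTR>(affinity_mask_for(logical_cpu));
    return SetThreadAffinityMask(GetCurrentThread(), mask) != 0;
#else
    (void)logical_cpu;  // assumption: no-op outside Windows
    return false;
#endif
}

// Lower the calling thread's priority so that it runs only when
// normal-priority threads leave the processor idle.
bool lower_current_thread_priority() {
#ifdef _WIN32
    return SetThreadPriority(GetCurrentThread(),
                             THREAD_PRIORITY_BELOW_NORMAL) != 0;
#else
    return false;  // assumption: no-op outside Windows
#endif
}
```

A background encoder thread might call lower_current_thread_priority once at startup, while a pair of worker threads might call pin_current_thread(0) and pin_current_thread(1) respectively. Pinning removes the scheduler’s freedom to rebalance, so it is best reserved for cases where a measured imbalance exists.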
Simultaneous Fixed and Floating Point Operations
With Hyper-Threading Technology, there are several ALUs (for integer logic), but only one shared floating-point unit. If your application uses floating-point calculations, it may be beneficial to isolate those threads and set their processor affinity to minimize processor resource contention.
Avoiding Dependence on Timing Loops
Relying on the execution timing between threads as a synchronization technique is not reliable because of speed differences between host systems. Delay loops are sometimes used during initialization as well, and should be avoided for the same reasons.
Software Design Considerations for Multitasking:
Most of the rules above also apply to multitasking, plus there are a few additional considerations. Task switches are much slower than thread context switches because each task operates in its own address space. The state of the previously running task must be saved, and data residing in the cache will be invalidated and reloaded.
Hyperthreading enhances multitasking because the state information for each task is stored on a separate logical processor. Cache invalidation will still occur, but the need for a task switch is eliminated since both tasks can run at once. Since cache is a shared resource on Hyper Threading enabled processors, all of the above rules regarding data alignment and blocking still apply.
Contention for resources can be a problem when multitasking. It can occur in memory, on the system busses, or on I/O devices. Consider the case of video capture while creating an MP3 file. Both applications use the hard disk intensively, but video capture has to occur in real-time. The result of contention is that the video drops frames, and the MP3 file skips.
Applications should check the status of an I/O device before attempting to pass data to it. If necessary, peripherals can be locked to avoid access by other applications. This makes sense for a CD or DVD writer, which is essentially a single use device. Locking the hard drive is not recommended, since it is a critical OS resource.
Task and thread priority can have a dramatic effect in a multitasking environment. If priority is raised in a task that runs continuously, other tasks will starve until the high priority task releases the processor. But lowering priority can be good for background processing tasks. Consider the case of a video encoder – which normally takes 100% of the processor. If you lower the priority, the user will still be able to use their computer on demand, but the video encode will still run 100% of the time when the CPU is otherwise available.
Load balancing within applications can actually degrade multi-tasking performance. If one application assumes it has full control of both processors, resource contention can occur when a second application attempts to load. This highlights a fundamental issue with multitasking programs – you never know what other software will be running concurrently with your program. It is usually best to not lock up resources that other programs are likely to need.
Optimized Compilers and Libraries Help Avoid Multi-threading Problems
The best way to design, implement, and tune for Hyper-Threading enabled processors is to start with components or libraries that are thread-safe and designed for Hyper-Threading enabled processors. The operating system and threading libraries are likely to already be optimized for various processors. Use operating system and/or threading synchronization libraries instead of implementing application-specific mechanisms such as spin-waits. Existing applications can take advantage of enhanced code modules by re-linking or through the use of dynamic link libraries.
Intel compilers enable threading by supporting both OpenMP* and auto-parallelization. OpenMP is an industry standard for portable threaded application development, and is effective at threading loop-level parallel problems and function level parallelism. The C++ compiler supports OpenMP API version 1.0 and performs code transformation for shared memory parallel programming.
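A loop-level OpenMP example is sketched below. The function name and data are hypothetical. The pragma takes effect only when the compiler is invoked with its OpenMP option enabled; otherwise it is ignored and the loop runs serially with the same result.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <vector>

// Multiply every sample by a gain factor. Iterations are independent,
// so the loop can be divided among threads without synchronization.
void apply_gain(std::vector<float>& samples, float gain) {
    assert(samples.size() <= static_cast<std::size_t>(INT_MAX));
    const int n = static_cast<int>(samples.size());
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        samples[i] *= gain;
    }
}
```

The loop counter is a signed int because early OpenMP versions require a signed integer loop variable. The same source builds and behaves correctly with or without threading, which is the portability benefit the standard is designed to provide.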
The Intel® C++ Compiler for Windows with auto-parallelization uses a high-level symmetric multi-processing (SMP) programming model to enable migration to multiprocessing machines (multiple physical CPU’s). This option detects which loops are capable of being executed safely in parallel and automatically generates threaded code for these loops. Automatic parallelization relieves the user from having to deal with the low-level details of iteration partitioning, data sharing, thread scheduling and synchronizations.
Tools to identify performance bottlenecks
The Intel® VTune™ Performance Analyzer allows you to visualize how software utilizes CPU resources. By seeking out ‘hotspots’ in your code, you can focus on optimizing the sections of code that occupy most of the computation time. VTune enables you to view potential problem areas by memory location, functions, classes, or source files. You can double-click and open the source or assembly view for the hotspot and see more detailed information about the performance of each instruction. The Intel® Thread Checker, which works in conjunction with the Intel VTune™ Performance Analyzer, automates detection of most threading errors such as:
- Deadlocks (detection and prediction)
- Memory access issues
- Race conditions
- Thread stalls (potential deadlocks), waits
- Potential and realized data races / dependencies, improperly synchronized I/O
- Invalid threading-library calls, arguments, returns
- Threaded calls to non-reentrant routines
Summary and Conclusion
Digital video production is among the most demanding of applications you can run on your computer, yet the Pentium® 4 processor with Hyperthreading technology delivers smooth, real-time performance even while other programs are running. These benefits go beyond simple clock rate.
HT technology improves the availability of the CPU for multi-tasking and background processing. The User Interface is noticeably more responsive under heavy system loads. All of these factors contribute to a favorable, and more productive, end-user experience when using your programs with other digital media tools.
Threaded applications take this one step further by improving parallel execution within a task. Application developers can produce optimized code using the Intel Compiler, and then use the VTune™ Performance Analyzer to identify code hotspots and bottlenecks as described in this paper.
A substantial collection of whitepapers and application notes are available to provide Hyper Threading technical help for application developers. For more information search on Hyper Threading at the Intel software developer website:
Software used in this whitepaper:
- Pinnacle Studio Version 8
- Ulead Movie Producer Version 7
- Sonic Solutions MyDVD
- Roxio Movie Creator 5
- Roxio EZ CD Creator 5.3
- Musicmatch 7.0
All brands and trademarks are the property of their respective owners.
A Note about the System Activity Graphs:
The performance graphs in this whitepaper were generated from traces taken with Perfmon – a standard utility program with the Windows® XP and Windows® 2000 operating systems. Counters were set up to monitor both virtual CPU’s, disk, and network activity. The results were exported as a CSV file and graphed from a spreadsheet.
About the Author:
Roger Finger is a 21-year employee of Intel Corporation, currently working as a Market Segment Specialist in the Software Solutions Group. In this capacity, he works with ISVs to advise them of new architectural features on Intel processors, providing them with tools and resources to enable features such as Hyper-Threading Technology in their applications. Since the early 1990s, when multimedia first became practical on personal computers, Roger has held positions as applications engineer and product manager for audio/video technologies such as Indeo(tm), ProShare(tm), and the Intel Pocket Concert(tm) MP3 player. “As an amateur videographer, I’m incredibly excited about the tools that are now available to consumers. The cameras are affordable, the software has become much easier to use, and with writable DVD, the PC is an ideal editing and mastering platform for creative expression.”