Section 3: Software Design Considerations for Hyper-Threading Technology:
In a processor with Hyper-Threading (HT) Technology, software developers should be aware that architectural state is the only resource that is replicated. All other resources are either shared or partitioned between the logical processors. This introduces resource contention, which can degrade performance or, in the extreme case, cause an application to fail. Synchronization between threads is another area where problems can arise. The following section briefly discusses some of the most common issues in multi-threaded software design. For more complete information, a collection of technical papers is available at the Intel Developer Services website.
Thread Synchronization:
Synchronization is used in threaded programs to prevent race conditions (e.g., multiple threads simultaneously updating the same global variable). A spin-wait loop is a common technique used to wait for the availability of a variable or I/O resource.
Consider the case of a master thread that needs to know when a disk write has completed. The master thread and the disk write thread share a synchronization variable in memory. While the master thread spins, the processor issues reads of this variable speculatively; when the disk write thread finally writes the variable, the processor can detect a memory order violation and must discard the speculative work, which incurs a performance penalty. Inserting a PAUSE instruction in the master thread's read loop can greatly reduce memory order violations.
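The read loop described above can be sketched as follows. This is a minimal illustration, not code from the original paper: the names cpu_relax and spin_until_set are hypothetical, and the _mm_pause intrinsic is assumed to be the compiler's way of emitting PAUSE on x86 targets (on other targets the helper does nothing).

```cpp
#include <atomic>
#include <cstdint>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>  // _mm_pause
#define SKETCH_HAVE_PAUSE 1
#endif

// Hypothetical helper: tell the processor that this is a spin-wait loop.
// Compiles to the PAUSE instruction on x86; a no-op elsewhere.
inline void cpu_relax() {
#ifdef SKETCH_HAVE_PAUSE
    _mm_pause();
#endif
}

// Spin until another thread stores true into `done`.
// Returns the number of loop iterations that were needed.
inline std::uint64_t spin_until_set(const std::atomic<bool>& done) {
    std::uint64_t spins = 0;
    while (!done.load(std::memory_order_acquire)) {
        cpu_relax();  // PAUSE between successive reads of the shared variable
        ++spins;
    }
    return spins;
}
```

The disk write thread in the example would publish completion with done.store(true, std::memory_order_release).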
Spin-wait loops consume execution resources while they are cycling. If other tasks are waiting to run, the spinning thread can call Sleep(0), which gives up the remainder of its time slice so that another ready thread can run. If no tasks are waiting, the thread immediately continues execution.
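A bounded spin followed by a yield might look like the sketch below. It substitutes the portable std::this_thread::yield() for the Windows Sleep(0) call named in the text; the two are similar in intent but are not the same call. The function name wait_with_yield and the spin_limit parameter are assumptions made for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Poll `ready`; after every `spin_limit` unsuccessful polls, give up the
// time slice so that another runnable thread can use the processor.
// Returns how many times the calling thread yielded.
inline unsigned wait_with_yield(const std::atomic<bool>& ready, unsigned spin_limit) {
    assert(spin_limit > 0);
    unsigned yields = 0;
    unsigned spins = 0;
    while (!ready.load(std::memory_order_acquire)) {
        if (++spins >= spin_limit) {
            std::this_thread::yield();  // portable stand-in for Sleep(0)
            ++yields;
            spins = 0;
        }
    }
    return yields;
}
```

A small spin_limit favors other threads; a large one favors low wake-up latency for the waiting thread. The right value depends on the workload and is best found by measurement.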
Another alternative to long spin-wait loops is to replace the loop with a thread-blocking API, such as WaitForMultipleObjects. Using this system call ensures that the thread will not consume execution resources until the listed objects are signaled (any one of them or all of them, depending on the bWaitAll argument).
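The same idea can be sketched portably with a standard condition variable. This is not WaitForMultipleObjects itself but a stand-in for its wait-for-all behavior; the CompletionGate class and its member names are hypothetical.

```cpp
#include <condition_variable>
#include <mutex>

// Hypothetical helper: blocks a waiting thread until `pending` operations
// have all reported completion. The waiter sleeps instead of spinning.
class CompletionGate {
public:
    explicit CompletionGate(int pending) : pending_(pending) {}

    // Called by a worker when one operation has finished.
    void signal_one() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (pending_ > 0 && --pending_ == 0) {
            done_.notify_all();
        }
    }

    // Blocks until every operation has been signaled.
    void wait_all() {
        std::unique_lock<std::mutex> lock(mutex_);
        done_.wait(lock, [this] { return pending_ == 0; });
    }

    // Number of operations that have not yet completed.
    int pending() {
        std::lock_guard<std::mutex> lock(mutex_);
        return pending_;
    }

private:
    std::mutex mutex_;
    std::condition_variable done_;
    int pending_;
};
```

The master thread would call wait_all() while each worker calls signal_one() when its operation completes.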
Avoid 64K aliasing in L1 Cache:
The first-level data cache (L1) is a shared resource on HT Technology processors. Cache lines are mapped on 64 KB boundaries, so if two virtual memory addresses are a multiple of 64 KB apart, they conflict for the same L1 cache line. Under Microsoft Windows® operating systems, thread stacks are created on megabyte boundaries, and 64K aliasing can occur when these threads access local variables on their stacks. A simple solution is to offset the starting stack address of each thread by a variable amount using the _alloca function.
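The arithmetic behind the aliasing condition can be sketched as follows. The helper names and the 1 KB staggering step are assumptions chosen for illustration, not values from the original paper. On Windows, the computed offset would typically be passed to the compiler's stack allocation function at the top of each thread function.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::uintptr_t kAliasStride = 64 * 1024;  // 64 KB, as described above

// True when two distinct addresses are a multiple of 64 KB apart and
// therefore compete for the same L1 cache line.
inline bool aliases_64k(std::uintptr_t a, std::uintptr_t b) {
    return a != b && (a % kAliasStride) == (b % kAliasStride);
}

// Hypothetical policy: give each thread a different stack offset between
// 1 and 63 steps, so that no offset is itself a multiple of 64 KB when the
// default 1 KB step is used.
inline std::size_t stack_offset_for(unsigned thread_index, std::size_t step = 1024) {
    return (static_cast<std::size_t>(thread_index) % 63 + 1) * step;
}
```

For two stacks that start exactly 1 MB apart, aliases_64k reports a conflict; after subtracting each thread's staggered offset from its stack address, the conflict disappears.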
False Sharing in the Data Cache
A cache line in the Pentium® 4 processor consists of 64 bytes for write operations, and 128 bytes for reads. False sharing occurs when two threads access different data elements in the same cache line. When one of those threads performs a write operation, the cache line is invalidated, causing the second thread to have to fetch the cache line (128 bytes) again from memory. If this occurs frequently, false sharing can seriously degrade the performance of an application.
False sharing can be diagnosed by using the VTune™ Performance Analyzer to monitor the "machine clear caused by other thread" counter. Techniques to avoid false sharing include partitioning data structures, creating a local copy of the data structure for each thread, or padding data structures so they are twice the size of a read cache line.
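Padding can be expressed directly in the type, as in the sketch below. The type names are hypothetical, and the 128-byte constant is an assumption taken from the read line size quoted above; a larger value such as 256 can be substituted to follow the more conservative padding rule.

```cpp
#include <cstddef>

constexpr std::size_t kPadBytes = 128;  // assumed: read line size quoted in the text

// Each counter is aligned to, and therefore occupies, a full kPadBytes
// block, so two counters can never share a cache line.
struct alignas(kPadBytes) PaddedCounter {
    long value = 0;
};

// One slot per logical processor; each thread updates only its own slot.
struct PerThreadCounters {
    PaddedCounter slot[2];
};

static_assert(sizeof(PaddedCounter) == kPadBytes,
              "padding must fill the whole block");
```

The cost of this technique is memory: each counter now consumes 128 bytes instead of the size of a long.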
Write Combining Buffers
The Intel NetBurst™ architecture has 6 write-combining (WC) store buffers, each buffering one cache line. The WC store buffers allow code execution to proceed by combining multiple write operations before they are written back to memory through the L1 or L2 caches. If an application is writing to more than 4 cache lines at the same time, the WC store buffers will begin to be flushed to the second-level cache. This is done to help ensure that a WC store buffer is ready to combine data for writes to a new cache line.
To take advantage of the Write Combining buffers, an application should not write to more than 4 distinct addresses or arrays inside an inner loop. On Hyper-Threading enabled processors, the WC store buffers are a shared resource; therefore, the total number of simultaneous writes by both threads running on the two logical processors must be considered. If data is being written inside of a loop, it is best to split inner loop code into multiple inner loops, each of which writes no more than two regions of memory.
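Splitting an inner loop in this way might look like the following sketch. The Planes structure, the function names, and the scaling arithmetic are hypothetical placeholders; only the loop structure matters.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Four output arrays, standing in for four distinct regions of memory.
struct Planes {
    std::vector<float> r, g, b, a;
    explicit Planes(std::size_t n) : r(n), g(n), b(n), a(n) {}
};

// Single inner loop: four write streams are active at the same time.
inline void scale_fused(const std::vector<float>& src, Planes& out) {
    assert(out.r.size() == src.size());
    for (std::size_t i = 0; i < src.size(); ++i) {
        out.r[i] = src[i] * 1.0f;
        out.g[i] = src[i] * 2.0f;
        out.b[i] = src[i] * 3.0f;
        out.a[i] = src[i] * 4.0f;
    }
}

// Same result, but each loop writes only two regions of memory.
inline void scale_split(const std::vector<float>& src, Planes& out) {
    assert(out.r.size() == src.size());
    for (std::size_t i = 0; i < src.size(); ++i) {
        out.r[i] = src[i] * 1.0f;
        out.g[i] = src[i] * 2.0f;
    }
    for (std::size_t i = 0; i < src.size(); ++i) {
        out.b[i] = src[i] * 3.0f;
        out.a[i] = src[i] * 4.0f;
    }
}
```

The split version reads the source array twice, so whether it is faster depends on the workload and should be confirmed by measurement.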
Cache Blocking Techniques
Cache blocking involves structuring data blocks so that they conveniently fit into a portion of the L1 or L2 cache. By controlling data cache locality, an application can minimize performance delays due to memory bus access. This is accomplished by dividing a large array into smaller blocks of memory (tiles) so that a thread can make repeated accesses to that data while it is still in cache. For example, image processing and video applications are well suited to cache blocking techniques because an image can be processed in smaller portions of the total image or video frame. Compilers often use the same technique, grouping related blocks of instructions close together so they execute from the L2 cache.
The effectiveness of the cache blocking technique is highly dependent on data block size, processor cache size, and the number of times the data is reused. Cache sizes vary based on processor. An application can detect the data cache size using Intel's CPUID instruction and dynamically adjust cache blocking tile sizes to maximize performance. As a general rule, cache block sizes should target approximately one-half to three-quarters the size of the physical cache for non-Hyper-Threading processors and one-quarter to one-half the physical cache size for a Hyper-Threading enabled processor supporting two logical processors.
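The sizing rule and the tiling itself can be sketched as below. All names are hypothetical. The sketch assumes the cache size has already been obtained (for example through CPUID), uses the upper end of each range quoted above, and uses a matrix transpose as a stand-in for any operation that can be processed tile by tile.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Upper end of the ranges quoted above: three-quarters of the cache without
// Hyper-Threading, one-half with two logical processors sharing the cache.
inline std::size_t target_block_bytes(std::size_t cache_bytes, bool hyper_threading) {
    return hyper_threading ? cache_bytes / 2 : cache_bytes * 3 / 4;
}

// Largest square tile side such that one input tile plus one output tile
// fit within `block_bytes`.
inline std::size_t tile_side(std::size_t block_bytes, std::size_t elem_bytes) {
    std::size_t side = 1;
    while ((side + 1) * (side + 1) * 2 * elem_bytes <= block_bytes) {
        ++side;
    }
    return side;
}

// Transpose an n x n row-major matrix one tile at a time, so that each tile
// is reused while it is still resident in the cache.
inline void transpose_blocked(const std::vector<float>& in, std::vector<float>& out,
                              std::size_t n, std::size_t tile) {
    assert(tile > 0 && in.size() == n * n && out.size() == n * n);
    for (std::size_t ii = 0; ii < n; ii += tile) {
        for (std::size_t jj = 0; jj < n; jj += tile) {
            const std::size_t i_end = std::min(ii + tile, n);
            const std::size_t j_end = std::min(jj + tile, n);
            for (std::size_t i = ii; i < i_end; ++i) {
                for (std::size_t j = jj; j < j_end; ++j) {
                    out[j * n + i] = in[i * n + j];
                }
            }
        }
    }
}
```

A caller would compute the tile side once at startup from the detected cache size and pass it to every blocked routine.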
Adjusting Task Priorities for Background Tasks
In some applications there are background activities that run continuously, but have little impact on the responsiveness of the system. In these cases, consider adjusting the task or thread priority downward so that this code only runs when resources become available from higher priority tasks.
Conversely, if an application requires real-time response, it can increase task priority so that it runs ahead of other normal priority tasks. This technique should be used with caution, since it can degrade the responsiveness of the user interface, and may affect the performance of other applications running on the system.
Load Balancing
On a multiple-processor system or on Hyper-Threading enabled processors, load balancing is normally handled by the operating system, which allocates workload to the next available resource. In some cases, one of the logical CPUs may become idle while the other is overloaded. In this case a developer can address the load imbalance by setting processor affinity.
Processor affinity allows a thread to specify exactly which processor (or processors) the operating system may select when it schedules the thread for execution. When an application specifies the processor affinity for all of its active threads, it can prevent load imbalance among those threads and eliminate thread migration from one logical processor to another.
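Setting affinity for the calling thread might be sketched as follows. The helper names are hypothetical. The sketch assumes SetThreadAffinityMask on Windows and sched_setaffinity on Linux, handles at most 64 logical processors, and reports failure on any other platform.

```cpp
#include <cstdint>
#include <initializer_list>

#if defined(_WIN32)
#include <windows.h>
#elif defined(__linux__)
#include <sched.h>
#endif

// Build a bit mask with one bit set per permitted logical processor.
// Processor numbers of 64 or above are ignored in this sketch.
inline std::uint64_t affinity_mask(std::initializer_list<unsigned> cpus) {
    std::uint64_t mask = 0;
    for (unsigned cpu : cpus) {
        if (cpu < 64) {
            mask |= (std::uint64_t{1} << cpu);
        }
    }
    return mask;
}

// Restrict the calling thread to the processors in `mask`.
// Returns true on success; an empty mask is rejected.
inline bool pin_current_thread(std::uint64_t mask) {
    if (mask == 0) {
        return false;
    }
#if defined(_WIN32)
    return SetThreadAffinityMask(GetCurrentThread(),
                                 static_cast<DWORD_PTR>(mask)) != 0;
#elif defined(__linux__)
    cpu_set_t set;
    CPU_ZERO(&set);
    for (unsigned cpu = 0; cpu < 64; ++cpu) {
        if (mask & (std::uint64_t{1} << cpu)) {
            CPU_SET(cpu, &set);
        }
    }
    return sched_setaffinity(0, sizeof(set), &set) == 0;  // 0 = calling thread
#else
    return false;  // no affinity call assumed on other platforms
#endif
}
```

For example, pin_current_thread(affinity_mask({0})) asks the operating system to run the calling thread only on logical processor 0.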
Simultaneous Fixed and Floating Point Operations
With Hyper-Threading Technology, there are several ALUs (for integer logic), but only one shared floating-point unit. If your application uses floating-point calculations, it may be beneficial to isolate the threads that perform them and set their processor affinity so that they contend as little as possible for the shared floating-point unit.
Avoiding Dependence on Timing Loops
Relying on the execution timing between threads as a synchronization technique is not reliable because of speed differences between host systems. Delay loops are sometimes used during initialization as well, and should be avoided for the same reasons.
Software Design Considerations for Multitasking:
Most of the rules above also apply to multitasking, plus there are a few additional considerations. Task switches are much slower than thread context switches because each task operates in its own address space. The state of the previously running task must be saved, and data residing in the cache will be invalidated and reloaded.
Hyper-Threading enhances multitasking because the state information for each task is stored on a separate logical processor. Cache invalidation will still occur, but the need for a task switch is eliminated since both tasks can run at once. Since cache is a shared resource on Hyper-Threading enabled processors, all of the above rules regarding data alignment and blocking still apply.
Contention for resources can be a problem when multitasking. It can occur in memory, on the system buses, or on I/O devices. Consider the case of video capture while creating an MP3 file. Both applications use the hard disk intensively, but video capture has to occur in real time. The result of contention is that the video drops frames and the MP3 file skips.
Applications should check the status of an I/O device before attempting to pass data to it. If necessary, peripherals can be locked to avoid access by other applications. This makes sense for a CD or DVD writer, which is essentially a single use device. Locking the hard drive is not recommended, since it is a critical OS resource.
Task and thread priority can have a dramatic effect in a multitasking environment. If priority is raised in a task that runs continuously, other tasks will starve until the high-priority task releases the processor. Lowering priority, on the other hand, can be good for background processing tasks. Consider the case of a video encoder, which normally takes 100% of the processor. If you lower its priority, the user will still be able to use the computer on demand, and the video encoder will still use 100% of the CPU whenever it is otherwise available.
Load balancing within applications can actually degrade multitasking performance. If one application assumes it has full control of both processors, resource contention can occur when a second application attempts to load. This highlights a fundamental issue with multitasking programs: you never know what other software will be running concurrently with your program. It is usually best not to lock up resources that other programs are likely to need.
Optimized Compilers and Libraries Help Avoid Multi-threading Problems
The best way to design, implement, and tune for Hyper-Threading enabled processors is to start with components or libraries that are thread-safe and designed for Hyper-Threading enabled processors. The operating system and threading libraries are likely to already be optimized for various processors. Use operating system and/or threading synchronization libraries instead of implementing application-specific mechanisms such as spin-waits. Existing applications can take advantage of enhanced code modules by re-linking or through the use of dynamic link libraries.
Intel compilers enable threading by supporting both OpenMP* and auto-parallelization. OpenMP is an industry standard for portable threaded application development, and is effective at threading loop-level and function-level parallelism. The C++ compiler supports OpenMP API version 1.0 and performs code transformation for shared memory parallel programming.
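Loop-level parallelism with OpenMP can be sketched as below. The dot function is a hypothetical example, not taken from the original paper. It assumes the compiler is invoked with its OpenMP option; without that option the pragma is ignored and the loop simply runs serially with the same result.

```cpp
#include <cassert>
#include <vector>

// Dot product of two vectors of equal length. The pragma asks the compiler
// to divide the iterations among threads; the reduction clause gives each
// thread a private partial sum that is combined when the loop ends.
inline double dot(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size());
    double sum = 0.0;
    const int n = static_cast<int>(x.size());
    #pragma omp parallel for reduction(+ : sum)
    for (int i = 0; i < n; ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}
```

The loop body contains no explicit thread creation, locking, or partitioning; those details are handled by the compiler and its runtime library.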
The Intel® C++ Compiler for Windows with auto-parallelization uses a high-level symmetric multi-processing (SMP) programming model to enable migration to multiprocessing machines (multiple physical CPUs). This option detects which loops are capable of being executed safely in parallel and automatically generates threaded code for these loops. Automatic parallelization relieves the user from having to deal with the low-level details of iteration partitioning, data sharing, thread scheduling, and synchronization.
Tools to identify performance bottlenecks
The Intel® VTune™ Performance Analyzer allows you to visualize how software utilizes CPU resources. By seeking out 'hotspots' in your code, you can focus on optimizing the sections of code that occupy most of the computation time. VTune enables you to view potential problem areas by memory location, function, class, or source file. You can double-click a hotspot to open the source or assembly view and see more detailed information about the performance of each instruction. The Intel® Thread Checker, which works in conjunction with the Intel VTune™ Performance Analyzer, automates detection of most threading errors, such as:
- Deadlocks (detection and prediction)
- Memory access issues
- Race conditions
- Thread stalls (potential deadlocks), waits
- Potential and realized data races / dependencies
- Improperly synchronized I/O
- Invalid threading-library calls, arguments, returns
- Threaded calls to non-reentrant routines


