We conducted our performance evaluation with two multimedia applications to examine the performance of multithreaded codes generated by the Intel compiler. The generated codes are highly optimized with architecture-specific, advanced scalar and array optimizations assisted with aggressive memory disambiguation. Our results show that Hyper-Threading technology and the Intel compiler offer a cost-effective performance gain (10%~28%) for our applications on a single processor (SP+HT), and offer up to 2.23x speedup on a dual-processor system with Hyper-Threading technology-enabled (DP+HT). The performance measurement of two multimedia applications SVM and AVSR is carried out on a dual–processor HT-enabled Intel Xeon™ system running at 2.0GHz, with 1024MB memory, an 8K L1-Cache, a 512K L2-Cache, and no L3-Cache. When we measure single-processor performance on a Dual-Processor (DP) system, we disable one physical processor from the BIOS. We disable the support of Hyper-Threading technology from the BIOS in order to measure the performance of our applications on the processor without using Hyper-Threading technology. To use the serial execution time as a base on the system experimentally in our lab setting, we disable one physical processor and Hyper-Threading technology, and run the highly optimized serial codes of applications.
Essentially, the performance scaling is derived from the serial execution (SP) with Hyper-Threading technology disabled and one physical processor disabled on our system. The multithreaded execution is done with three system configurations: (1) SP+HT (Single-Processor with HT-enabled), (2) DP (Dual Processor with HT-disabled), (3) DP+HT (Dual-Processor with HT-enabled). In Figure 13, we show the normalized speedup of our multithreaded execution of the SVMs (2 kernels). The workloads achieved very good performance gain using the Intel OpenMP C++ compiler for data-domain decomposition. For instance, from a single processor with HT-disabled to the single processor with HT-enabled, we achieve speedups ranging from 1.10x to 1.13x with 2-thread run. The speedup ranges from 1.92x to 1.97x for 2-thread run with DP configuration. The speedup ranges from 2.13 to 2.23x for 4-thread run with DP+HT configuration. This indicates that we utilize the microprocessor more efficiently.
Figure 14 shows the speedup of the OpenMP version of the AVSR with different amount of nested parallelism under different system configuration. Again, by changing from a single processor Hyper-Threading technology disabled to the single processor with Hyper-Threading technology-enabled, a speedup ranging from 1.18 to 1.28x is achieved with 2 threads under the SP+HT configuration. The speedup is 1.61x for 4 outer threads, 2.03x for 4 outer, 2 inner threads, and 1.95x for 4 outer, 4 inner threads with the DP configuration. The speedup is 1.57x for 4 outer threads, 1.99x for 4 outer, 2 inner threads, and 1.85x for 4 outer, 4 inner threads with DP+HT configuration. Clearly, we achieved ~2x speedup from a single-CPU system to a dual-CPU system.
One observation we have from Figure 14 is that the best speedup of AVSR workload with DP+HT configuration is 1.97% lower than the best speedup of the AVSR with the DP configuration. It attributes to one cause, that is, only three logical processors are effectively used when the A (2.5%) and O (8.8%) are completed for 4-outer-2-inner-thread execution. This means that the benefit from one physical processor with HT-enabled, which is evidenced with the performance gain under SP+HT configuration, is not enough to counteract the penalty of one idle logical processor caused by the unbalanced load. Our observation applies to the 4-outer-4-inter-thread execution scheme as well. The challenge here is how to exploit parallelism in AV (36.6%), which is one of our future research topics beyond the scope of this article. Another observation we have from Figure 14 is that the speedup from the 4 outer and 2 inner threads is better than the speedup from the 4 outer and 4 inner threads under both DP and DP+HT configurations. This is simply due to the less threading overheads are introduced with a smaller number of threads. Later, we discuss more about controlling parallelism and controlling spin-waiting for getting a good trade-off between benefits and costs. In any case, we have achieved ~2x speedup under both DP and DP+HT configurations.
Furthermore, we characterize the multimedia workloads by using Intel VTune Performance Analyzer under SP, SP+HT, DP, and DP+HT configurations to examine the HT benefits and costs instead of presenting speedup only. As shown in Table 1, although the numbers of instructions retired and cache miss rates (e.g., 2.7% vs 2.9% first-level cache miss rates for the linear SVM) are increased for both applications after threading due to execution resource sharing, cache and memory sharing, and contention, the overall application performance still increases. More specifically, the IPC is improved from 0.77 to 0.83 (8%) for SVM (linear) on SP, 17% for SVM (linear) on DP, 13% for SVM (RBF) on SP, 12% for SVM (RBF) on DP, and 30% for AVSR on SP. These results indicate the processor resource utilization is greatly improved for our multimedia applications with the Hyper-Threading technology.
- "Hyper-Threading Technology for Multimedia Apps, Page 1"
- "Hyper-Threading Technology for Multimedia Apps, Page 2"
- "Hyper-Threading Technology for Multimedia Apps, Page 3"
- "Hyper-Threading Technology for Multimedia Apps, Page 4"
- "Hyper-Threading Technology for Multimedia Apps, Page 5"