posted by Intel Researchers (for OSNews) on Wed 12th Mar 2003 23:25 UTC

"Hyper-Threading Technology for Multimedia Apps, Page 4"
Performance

We conducted our performance evaluation with two multimedia applications to examine the performance of the multithreaded code generated by the Intel compiler. The generated code is highly optimized with architecture-specific, advanced scalar and array optimizations, assisted by aggressive memory disambiguation. Our results show that Hyper-Threading technology and the Intel compiler offer a cost-effective performance gain (10%-28%) for our applications on a single processor (SP+HT), and up to a 2.23x speedup on a dual-processor system with Hyper-Threading technology enabled (DP+HT). The performance of the two multimedia applications, SVM and AVSR, was measured on a dual-processor, HT-enabled Intel Xeon™ system running at 2.0GHz, with 1024MB of memory, an 8K L1 cache, a 512K L2 cache, and no L3 cache. To measure single-processor performance on the dual-processor (DP) system, we disable one physical processor from the BIOS; likewise, we disable Hyper-Threading technology from the BIOS to measure the performance of our applications without it. To obtain the serial execution time used as the baseline in our lab setting, we disable both one physical processor and Hyper-Threading technology, and run the highly optimized serial code of each application.

Essentially, the performance scaling is derived from the serial execution (SP) baseline, with Hyper-Threading technology disabled and one physical processor disabled on our system. The multithreaded executions use three system configurations: (1) SP+HT (single processor with HT enabled), (2) DP (dual processor with HT disabled), and (3) DP+HT (dual processor with HT enabled). Figure 13 shows the normalized speedup of our multithreaded execution of the SVMs (2 kernels). The workloads achieve very good performance gains using the Intel OpenMP C++ compiler for data-domain decomposition. For instance, going from a single processor with HT disabled to a single processor with HT enabled, we achieve speedups ranging from 1.10x to 1.13x with a 2-thread run. The speedup ranges from 1.92x to 1.97x for a 2-thread run with the DP configuration, and from 2.13x to 2.23x for a 4-thread run with the DP+HT configuration. This indicates that we utilize the microprocessor more efficiently.
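The data-domain decomposition mentioned above can be illustrated with a small sketch. This is not the authors' code; it is a hypothetical linear-kernel scoring loop (function name `linear_scores` and the data layout are our assumptions) showing how a single OpenMP pragma splits the set of samples across threads, since each iteration is independent:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical linear-kernel evaluation: score each sample against a weight
// vector. The iterations are independent, so the data domain (the samples)
// can be divided among threads with one OpenMP work-sharing pragma.
std::vector<double> linear_scores(const std::vector<std::vector<double>>& samples,
                                  const std::vector<double>& weights) {
    std::vector<double> scores(samples.size(), 0.0);
    // Each thread processes its own chunk of samples; no synchronization
    // is needed because every iteration writes a distinct element.
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(samples.size()); ++i) {
        double dot = 0.0;
        for (std::size_t j = 0; j < weights.size(); ++j)
            dot += samples[i][j] * weights[j];
        scores[i] = dot;
    }
    return scores;
}
```

Compiled without OpenMP support the pragma is simply ignored and the loop runs serially, which is what makes this style of incremental parallelization low-effort.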

Figure 14 shows the speedup of the OpenMP version of AVSR with different amounts of nested parallelism under different system configurations. Again, moving from a single processor with Hyper-Threading technology disabled to a single processor with it enabled yields a speedup of 1.18x to 1.28x with 2 threads under the SP+HT configuration. With the DP configuration, the speedup is 1.61x for 4 outer threads, 2.03x for 4 outer and 2 inner threads, and 1.95x for 4 outer and 4 inner threads. With the DP+HT configuration, the speedup is 1.57x for 4 outer threads, 1.99x for 4 outer and 2 inner threads, and 1.85x for 4 outer and 4 inner threads. Clearly, we achieved a ~2x speedup going from a single-CPU system to a dual-CPU system.

One observation from Figure 14 is that the best speedup of the AVSR workload with the DP+HT configuration is 1.97% lower than its best speedup with the DP configuration. This is attributable to a single cause: only three logical processors are effectively used once the A (2.5%) and O (8.8%) modules have completed in the 4-outer-2-inner-thread execution. In other words, the benefit of one physical processor with HT enabled, evidenced by the performance gain under the SP+HT configuration, is not enough to counteract the penalty of one idle logical processor caused by the unbalanced load. The same observation applies to the 4-outer-4-inner-thread execution scheme. The challenge here is how to exploit parallelism in AV (36.6%), which is one of our future research topics and beyond the scope of this article. Another observation from Figure 14 is that the speedup with 4 outer and 2 inner threads is better than the speedup with 4 outer and 4 inner threads under both the DP and DP+HT configurations, simply because fewer threads introduce less threading overhead. Later, we discuss controlling parallelism and controlling spin-waiting to get a good trade-off between benefits and costs. In any case, we have achieved a ~2x speedup under both the DP and DP+HT configurations.
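The outer/inner thread scheme discussed above can be sketched as nested OpenMP parallelism. This is a simplified illustration, not the AVSR code: the two "modules" and the function `process` are stand-ins we invented, and the exact nesting API varies by runtime (here the classic `omp_set_nested` call is used, guarded so the code also builds without OpenMP):

```cpp
#include <cassert>
#include <cmath>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Sketch of nested parallelism: two outer tasks (stand-ins for pipeline
// modules) run as OpenMP sections, and each task splits its own data
// domain with an inner parallel for. Without OpenMP the pragmas are
// ignored and everything runs serially with identical results.
void process(std::vector<double>& audio, std::vector<double>& video) {
#ifdef _OPENMP
    omp_set_nested(1);  // allow inner parallel regions to create threads
#endif
    #pragma omp parallel sections num_threads(2)
    {
        #pragma omp section
        {   // outer task 1: inner data-domain decomposition
            #pragma omp parallel for num_threads(2)
            for (long i = 0; i < static_cast<long>(audio.size()); ++i)
                audio[i] = std::sqrt(audio[i]);
        }
        #pragma omp section
        {   // outer task 2: independent data domain, also split inside
            #pragma omp parallel for num_threads(2)
            for (long i = 0; i < static_cast<long>(video.size()); ++i)
                video[i] = video[i] * 0.5;
        }
    }
}
```

The point of the inner level is load balance: when one outer task finishes early, its threads are not stuck idle waiting for the longest-running module.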

Functional decomposition may not deliver the best performance, due to the unbalanced load of the tasks across the processors in the system. Given the inherent variation in granularity from task (or module) to task, it is hard to achieve the best potential performance without exploiting another level of parallelism. Essentially, for media workloads, we can exploit data-parallelism to overcome the limitations of exploiting task-parallelism alone. As Figure 14 shows, by exploiting the inner parallelism with data-domain decomposition, we achieve much better speedups: the performance gain is around 40% with 4 outer and 2 inner threads compared to 4 outer threads (exploiting task-parallelism only). Thus, exploiting nested parallelism is necessary to achieve better load balance and speedup. (Note that the inner parallelism does not always have to be data-parallelism; it can be task-parallelism as well.) On the other hand, Figure 14 also shows that excessive threads introduce extra threading overhead: the performance with 4 inner threads is no better than with 2 inner threads. Therefore, effectively controlling parallelism remains an important aspect of achieving the desired performance on an HT-enabled Intel Xeon processor system, even though the additional potential parallelism could improve processor utilization.

With the Intel compiler and runtime, users can control how much time each thread spends spinning at run-time. The runtime library supports the environment variable KMP_BLOCKTIME, and the spinning time can also be adjusted with the kmp_set_blocktime() API call at runtime. On an HT-enabled processor, more than one thread can execute on the same physical processor at the same time, which means both threads share that processor's resources. This makes spin-waiting extremely expensive, since a thread that is merely waiting takes valuable processor resources away from the other thread that is doing useful work.
Thus, when exploring the use of Hyper-Threading technology, the block time should be very short, so that a waiting thread goes to sleep as soon as possible and the threads still doing useful work can utilize all processor resources more fully. In our previous work, we used Win32 threading library calls to parallelize our multimedia workloads [5]. While we could achieve good performance, multithreading them took a huge amount of effort. With the Intel OpenMP compiler and OpenMP runtime library support, we demonstrated the same or better performance with much less effort; in other words, the programmer intervention needed to parallelize our multimedia applications is quite minor.
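The KMP_BLOCKTIME tuning described above is done from the environment; a minimal sketch follows (the binary name `avsr_app` is our hypothetical placeholder, not the authors' actual executable):

```shell
# KMP_BLOCKTIME (Intel OpenMP runtime) is the time, in milliseconds, an
# idle worker thread keeps spinning before it goes to sleep. A value of 0
# makes waiting threads yield immediately, freeing the shared execution
# resources of an HT-enabled processor for the sibling thread.
export KMP_BLOCKTIME=0

# Hypothetical application launch under the tuned runtime:
# ./avsr_app input.avi

echo "KMP_BLOCKTIME=$KMP_BLOCKTIME"
```

The same setting can be made programmatically with the kmp_set_blocktime() call mentioned above, which overrides the environment variable for the calling thread's subsequent parallel regions.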

Furthermore, we characterized the multimedia workloads using the Intel VTune Performance Analyzer under the SP, SP+HT, DP, and DP+HT configurations to examine the benefits and costs of Hyper-Threading, rather than presenting speedups only. As shown in Table 1, although the number of instructions retired and the cache miss rates (e.g., 2.7% vs. 2.9% first-level cache miss rates for the linear SVM) increase for both applications after threading, due to execution resource sharing, cache and memory sharing, and contention, the overall application performance still increases. More specifically, the IPC improves from 0.77 to 0.83 (8%) for SVM (linear) on SP, by 17% for SVM (linear) on DP, 13% for SVM (RBF) on SP, 12% for SVM (RBF) on DP, and 30% for AVSR on SP. These results indicate that processor resource utilization is greatly improved for our multimedia applications with Hyper-Threading technology.
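The percentage figures above are relative IPC improvements; for instance, the linear SVM's move from 0.77 to 0.83 IPC on SP works out to roughly 8%. A one-line helper (our own illustration, not part of the paper's tooling) makes the arithmetic explicit:

```cpp
#include <cassert>

// Relative improvement in instructions-per-cycle, in percent:
// e.g. 0.77 -> 0.83 is about a 7.8% gain, reported as 8% in Table 1.
double ipc_gain_percent(double before, double after) {
    return (after - before) / before * 100.0;
}
```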

Table of contents
  1. "Hyper-Threading Technology for Multimedia Apps, Page 1"
  2. "Hyper-Threading Technology for Multimedia Apps, Page 2"
  3. "Hyper-Threading Technology for Multimedia Apps, Page 3"
  4. "Hyper-Threading Technology for Multimedia Apps, Page 4"
  5. "Hyper-Threading Technology for Multimedia Apps, Page 5"
