Exploring the Use of HyperThreading Technology for Multimedia Apps

Guest post by Intel Researchers (for OSNews), 2003-03-12

Processors with Hyper-Threading technology can improve application performance by letting a single processor execute instructions from different threads in parallel rather than serially, as if it were two processors. However, this potential improvement can be obtained only if the application is multithreaded through parallelization. This article presents the multithreaded code generation and optimization techniques developed for the Intel C++/Fortran compiler. We conduct a performance study of two multimedia applications parallelized with OpenMP pragmas and compiled with the Intel compiler on Hyper-Threading (HT) technology-enabled Intel single-processor and multi-processor systems.

The performance results show that the threaded code generated by the compiler achieved speedups of up to 1.28x on an HT-enabled single-processor system and up to 2.23x on an HT-enabled dual-processor system. Our three key observations are: (a) the threaded code generated by the Intel compiler yields a good performance gain with the parallelization guided by OpenMP pragmas in multimedia applications; (b) exploiting thread-level parallelism (TLP) causes inter-thread interference in caches and places greater demands on the memory system, but Hyper-Threading technology hides the additional latency and delivers a good performance gain for the whole program; (c) Hyper-Threading technology is effective at exploiting both the task-parallelism and the data-parallelism inherent in multimedia applications.

1. Introduction

Simultaneous Multi-Threading (SMT) [7, 15] was proposed to allow multiple threads to compete for and share all of a processor’s resources, such as caches, execution units, control logic, buses, and memory systems. Hyper-Threading technology (HT) [4] brings the SMT idea to Intel architectures and makes a single physical processor appear as two logical processors with duplicated architecture state but shared physical execution resources. This allows two threads from a single application, or two separate applications, to execute in parallel, increasing processor utilization and reducing the impact of memory latency by overlapping the latency of one thread with the execution of another. Hyper-Threading technology-enabled processors offer significant performance improvements for applications with a high degree of thread-level parallelism, without sacrificing compatibility with existing software or single-threaded performance. These potential gains are obtained, however, only if an application is efficiently multithreaded. The Intel C++/Fortran compilers support OpenMP directive- and pragma-guided parallelization, which significantly increases the range of applications amenable to effective parallelization. A typical example is an application written with OpenMP parallel sections in which section-A calls an integer-intensive routine and section-B calls a floating-point-intensive routine; performance improves when section-A and section-B are scheduled onto two different logical processors that share the same physical processor, so that the processor resources are used more fully with Hyper-Threading technology. OpenMP directives and pragmas have emerged as the de facto standard for expressing thread-level parallelism in applications, as they substantially simplify the notoriously complex task of writing multithreaded applications.

The OpenMP 2.0 standard API [6, 9] supports a multi-platform, shared-memory, parallel programming paradigm in C++/C and Fortran95 on all popular operating systems, such as Windows NT, Linux, and Unix. This paper describes threaded code generation techniques for exploiting parallelism explicitly expressed by OpenMP pragmas/directives. To validate the effectiveness of our threaded code generation and optimization techniques, we also characterize and study two multimedia workloads parallelized with OpenMP pragmas and compiled with the Intel OpenMP C++ compiler on Intel Hyper-Threading architecture. The two workloads, Support Vector Machine (SVM) and Audio-Visual Speech Recognition (AVSR), are optimized for the Intel Pentium 4 processor. One of our goals is to better explain the performance gains that are possible in media applications through exploring the use of Hyper-Threading technology with the Intel compiler.

The remainder of this article is organized as follows. We first give a high-level overview of Hyper-Threading technology. We then present threaded code generation and optimization techniques developed in the Intel C++ and Fortran product compilers for the OpenMP pragma or directive guided parallelization, which includes the exploitation of nested parallelism, and workqueuing model extension for exploiting irregular-parallelism. Starting from Section 4, we characterize and study two workloads of multimedia applications parallelized with OpenMP pragmas and compiled with the Intel OpenMP C++ compiler on Hyper-Threading technology enabled Intel architectures. Finally, we show the performance results of two multimedia applications.

2. Hyper-Threading Technology

Hyper-Threading technology brings the concept of Simultaneous Multi-Threading (SMT) to Intel Architecture. Hyper-Threading technology makes a single physical processor appear as two logical processors; the physical execution resources are shared and the architecture state is duplicated for the two logical processors [4]. From a software or architecture perspective, this means operating systems and user programs can schedule threads to logical CPUs as they would on multiple physical CPUs. From a microarchitecture perspective, this means that instructions from both logical processors will persist and execute simultaneously on shared execution resources [4].

The optimal performance is provided by the Intel NetBurst™ microarchitecture while executing a single instruction stream. A typical thread of code with a typical mix of instructions, however, utilizes only about 50 percent of execution resources. By adding the necessary logic and resources to the processor die in order to schedule and control two threads of code, Hyper-Threading technology makes these underutilized resources available to a second thread, offering increased system and application performance. Systems built with multiple Hyper-Threading enabled processors further improve the multiprocessor system performance, processing two threads for each processor.
Figure 1(a) shows a system with two physical processors that are not Hyper-Threading technology-capable. Figure 1(b) shows a system with two physical processors that are Hyper-Threading technology-capable. In Figure 1(b), with a duplicated copy of the architectural state on each physical processor, the system appears to have four logical processors. Each logical processor contains a complete set of the architecture state. The architecture state consists of registers including the general-purpose register group, the control registers, advanced programmable interrupt controller (APIC) registers, and some machine state registers. From a software perspective, once the architecture state is duplicated, the processor appears to be two processors. The number of transistors required to store the architecture state is a very small fraction of the total. Logical processors share nearly all other resources on the physical processor, such as caches, execution units, branch predictors, control logic, and buses. Each logical processor has its own interrupt controller or APIC. Interrupts sent to a specific logical processor are handled only by that logical processor.

With Hyper-Threading technology, the majority of execution resources are shared by two architecture states (or two logical processors). The rapid execution engine processes instructions from both threads simultaneously. The Fetch and Deliver engine and the Reorder and Retire block partition some of their resources to alternate between the two threads. In short, Hyper-Threading technology improves the performance of multithreaded programs by increasing the utilization of the on-chip resources available in the Intel NetBurst™ microarchitecture.

3. Intel Compiler for Multithreading

The Intel compiler incorporates many well-known and advanced optimization techniques [14] that are designed and extended to fully leverage Intel processor features for higher performance. The Intel compiler has a common intermediate representation (named IL0) for the C++/C and Fortran95 languages, so that OpenMP directive- and pragma-guided parallelization and the majority of optimization techniques are applied through a single high-level intermediate code generation and transformation, irrespective of the source language. In this section, we present several threaded code generation and optimization techniques in the Intel compiler.

Threaded Code Generation Technique

We developed a new compiler technology named Multi-Entry Threading (MET) [3]. The main motivation of the MET compilation model is to keep all newly generated multithreaded code, which is captured by T-entry, T-region, and T-ret nodes, embedded inside the user routine without splitting it into independent subroutines. This method differs from the outlining technique [10, 13], and it leaves more opportunities for later optimizations to improve performance. From the compiler-engineering point of view, the MET technique greatly reduces the complexity of generating separate routines in the Intel compiler. In addition, the MET technique minimizes the impact of the OpenMP parallelizer on all well-known optimizations in the Intel compiler, such as constant propagation, vectorization [8], PRE [12], scalar replacement, loop transformation, profile-feedback guided optimization, and interprocedural optimization.

The code transformations and optimizations in the Intel compiler can be classified into (i) code restructuring and interprocedural optimizations (IPO); (ii) OpenMP directive-guided and automatic parallelization and vectorization; (iii) high-level optimizations (HLO) and scalar optimizations including memory optimizations such as loop control and data transformations, partial redundancy elimination (PRE), and partial dead store elimination (PDSE); and (iv) low-level machine code generation and optimizations such as register allocation and instruction scheduling. In Figure 2, we show a sample program using the parallel sections pragma.
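Figure 2 is not reproduced in this text. As a stand-in, here is a minimal sketch of what a routine using the parallel sections pragma might look like. Only the routine name parfoo comes from the article; the helpers int_work and fp_work, the signature, and the workloads are hypothetical, chosen to echo the integer-intensive and floating-point-intensive sections mentioned in the introduction.

```c
/* Hypothetical integer-intensive routine: sums the integers 0..n-1. */
static long int_work(long n) {
    long s = 0;
    for (long i = 0; i < n; i++) s += i;
    return s;
}

/* Hypothetical floating-point-intensive routine: sums 0.5*i for i in 0..n-1. */
static double fp_work(long n) {
    double s = 0.0;
    for (long i = 0; i < n; i++) s += 0.5 * (double)i;
    return s;
}

/* Each section is an independent unit of work that the runtime may hand
   to a different (logical) processor. If OpenMP is not enabled, the
   pragmas are ignored and the sections run one after the other. */
void parfoo(long n, long *int_result, double *fp_result) {
    #pragma omp parallel sections
    {
        #pragma omp section
        *int_result = int_work(n);

        #pragma omp section
        *fp_result = fp_work(n);
    }
}
```

The two sections write to different locations, so no synchronization beyond the implicit barrier at the end of the construct is needed.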

Essentially, the multithreaded code generator inserts the thread invocation call __kmpc_fork_call(…) with the T-entry node and data environment (source line information loc, thread number tid, etc.) for each parallel loop, parallel sections, or parallel region, and transforms a serial loop, sections, or region into a multithreaded loop, sections, or region, respectively. In this example, the pre-pass first converts the parallel sections to a parallel loop. Then, the multithreaded code generator localizes the loop lower-bound and upper-bound, and privatizes the section id variable for the T-region marked with [T-entry, T-ret] nodes. For the parallel sections in the routine “parfoo”, the multithreaded code generation involves (a) generating a call to the runtime dispatch and initialization routine (__kmpc_dispatch_init) to pass the necessary information to the runtime system; (b) generating an enclosing loop to dispatch loop chunks at runtime through the __kmpc_dispatch_next routine in the library; (c) localizing the loop lower-bound and upper-bound, and privatizing the loop control variable ‘id’, as shown in Figure 3. Given that the granularity of the sections can differ dramatically, the static or static-even scheduling type may not achieve a good load balance. We therefore decided to use the runtime scheduling type for a parallel loop generated by the pre-pass of multithreaded code generation. The decision regarding scheduling type is thus deferred until run-time, and a well-balanced workload can be achieved through the setting of the environment variable OMP_SCHEDULE supported in the OpenMP library.
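The sections-to-loop conversion can be pictured at the source level. The sketch below is not the compiler's generated code (Figure 3 is not reproduced here, and the real output calls the runtime entry points directly); it is a source-level analogue in which the section bodies section_a, section_b, and section_c and the routine name parfoo_as_loop are hypothetical.

```c
/* Hypothetical section bodies; each writes its own slot of out. */
static void section_a(int *out) { out[0] = 10; }
static void section_b(int *out) { out[1] = 20; }
static void section_c(int *out) { out[2] = 30; }

/* Source-level analogue of the pre-pass: three sections become a loop
   over a section id, and a switch selects the section body.
   schedule(runtime) defers the scheduling choice to the OMP_SCHEDULE
   environment variable. */
void parfoo_as_loop(int *out) {
    int id;  /* loop control variable, private to each thread in the loop */
    #pragma omp parallel for schedule(runtime)
    for (id = 0; id < 3; id++) {
        switch (id) {
        case 0: section_a(out); break;
        case 1: section_b(out); break;
        case 2: section_c(out); break;
        }
    }
}
```

With this form, setting OMP_SCHEDULE to a dynamic schedule with chunk size 1 (for example `dynamic,1`) hands out one section id at a time, which helps when the sections differ widely in cost.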

In order to generate efficient threaded code that gains a speedup over optimized uniprocessor code, an effective optimization phase ordering has been designed in the Intel compiler. It makes sure that optimizations that can be effectively enabled before parallelization, such as IPO inlining, code restructuring, Igoto optimizations, and constant propagation, preserve legal OpenMP program semantics and the information necessary for parallelization. It also ensures that all optimizations after the OpenMP parallelization, such as auto-vectorization, loop transformation, PRE, and PDSE, can effectively kick in to achieve good cache locality and to minimize the number of redundant computations and references to memory. For example, given a double-nested OpenMP parallel loop, the parallelizer is able to generate multithreaded code for the outer loop while maintaining the symbol table information, loop structure, and memory reference behavior for the innermost loop. This enables the subsequent auto-vectorization of the innermost loop to fully leverage the Streaming SIMD Extensions (SSE and SSE2) features of Intel processors [3, 8]. A number of efficient threaded-code generation techniques have been developed for OpenMP in the Intel compiler. The following subsections describe some of these techniques.
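To make the double-nested case concrete, here is a small sketch of a loop nest with the shape described: the outer loop carries the OpenMP pragma, and the inner loop is left as a simple unit-stride loop that a vectorizer can still recognize. The routine scale_rows and its arguments are illustrative assumptions, not code from the article.

```c
/* Scales each row of an n-by-m row-major matrix by a per-row factor.
   The outer loop is threaded; the inner loop has unit stride and no
   cross-iteration dependence, so it remains a candidate for
   auto-vectorization after parallelization. */
void scale_rows(float *a, const float *factor, int n, int m) {
    int i, j;
    #pragma omp parallel for private(j)
    for (i = 0; i < n; i++) {
        for (j = 0; j < m; j++) {
            a[i * m + j] *= factor[i];
        }
    }
}
```

Whether the inner loop is actually vectorized depends on the compiler and its options; the point of the sketch is only that threading the outer loop leaves the inner loop's structure intact.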

Support Nested Parallelism

Both static and dynamic nested parallelism are supported by the OpenMP standard. However, most existing OpenMP compilers do not fully support nested parallelism, since an OpenMP-compliant implementation is allowed to serialize the nested inner regions even when nested parallelism is enabled by the environment variable OMP_NESTED or the routine omp_set_nested(). For broad classes of applications, such as image processing and audio/video encoding and decoding algorithms, good performance gains are achieved by exploiting nested parallelism. We provided compiler and runtime library support to exploit static and dynamic nested parallelism. Figure 4(a) shows a sample code with nested parallel regions, and Figure 4(b) shows the pseudo-threaded-code generated by the Intel compiler.

As shown in Figure 4(b), there are two threaded regions, or T-regions, created within the original function nestedpar(). The T-entry __nestedpar_par_region0() corresponds to the semantics of the outer parallel region, and the T-entry __nestedpar_par_region1() corresponds to the semantics of the inner parallel region. For the inner parallel region in the routine nestedpar, the variable id is a shared stack variable. Therefore, it is accessed and shared by all threads through the T-entry argument id_p. Note that the variable id is a private variable for the outer parallel region, since it is a locally defined stack variable.

As we see in Figure 4(b), there are no extra arguments on the T-entry for sharing the local static array ‘a’, and there is no pointer de-referencing inside the T-region for sharing the local static array ‘a’ among all threads in the teams of both the outer and inner parallel regions. This uses the optimization technique presented in [3] for sharing local static data among threads; it is an efficient way to avoid the overhead of argument passing across T-entries.
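Figure 4(a) is not reproduced in this text. The sketch below shows one possible shape for a routine with nested parallelism that has the same sharing pattern: a local static array shared by every thread, and a variable id that is private to the outer region but shared inside the inner one. It uses nested parallel loops rather than bare parallel regions so that the result does not depend on how many threads the runtime grants. Only the names nestedpar, id, and a come from the article; the loop bounds and the computation are assumptions.

```c
#define OUTER 2
#define INNER 2

/* Fills a local static array from two levels of parallel loops and
   returns the sum of its elements. Nested parallelism has to be
   requested (for example through OMP_NESTED or omp_set_nested(), as the
   text notes); otherwise the inner loop is serialized, with the same
   result. */
int nestedpar(void) {
    static int a[OUTER * INNER];  /* local static: shared by all threads */
    int total = 0;

    #pragma omp parallel for num_threads(OUTER)
    for (int i = 0; i < OUTER; i++) {
        int id = i;  /* private to the outer region, shared by the inner one */
        #pragma omp parallel for num_threads(INNER)
        for (int j = 0; j < INNER; j++) {
            a[id * INNER + j] = id * 10 + j;
        }
    }

    for (int k = 0; k < OUTER * INNER; k++) total += a[k];
    return total;
}
```

Each (i, j) pair writes a distinct element of a, so the threads never touch the same location.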

Exploiting Irregular Parallelism

The irregular parallelism inherent in many applications is hard to exploit efficiently. The workqueuing model [1] provides a simple approach for allowing users to exploit irregular parallelism effectively. This model allows a programmer to parallelize control structures that are beyond the scope of those supported by the OpenMP model, while still fitting into the framework defined by the OpenMP specification. In particular, the workqueuing model is a flexible programming model for specifying units of work that are not pre-computed at the start of the worksharing construct. See a simple example in Figure 5.

The parallel taskq pragma specifies an environment for the ‘while loop’ in which to enqueue the units of work specified by the enclosed task pragma. Thus, the loop’s control structure and the enqueuing are executed by a single thread, while the other threads in the team participate in dequeuing the work from the taskq queue and executing it. The captureprivate clause ensures that a private copy of the link pointer p is captured at the time each task is enqueued, hence preserving the sequential semantics. The workqueuing execution model is shown in Figure 6. Essentially, given a program with workqueuing constructs, a team of threads is created when a parallel region is encountered. With the workqueuing execution model, from among all threads that encounter a taskq pragma, one thread (TK) is chosen to execute it initially. All the other threads (Tm, where m=1, …, N and m≠K) wait for work to be enqueued on the work queue. Conceptually, the taskq pragma causes an empty queue to be created by the chosen thread TK; the code inside the taskq block is then executed single-threaded by TK, which enqueues each task it encounters. The task pragma specifies a unit of work, potentially executed by a different thread. When a task pragma is encountered lexically within a taskq block, the code inside the task block is enqueued on the queue associated with the taskq. The conceptual queue is disbanded when all work enqueued on it finishes and the end of the taskq block is reached.
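Figure 5 is not reproduced here, and the taskq and task pragmas described above are an Intel extension. As a rough analogue, the sketch below expresses the same linked-list pattern with the task construct that later entered standard OpenMP (version 3.0). This is a swapped-in technique, not the workqueuing implementation itself: a single thread walks the list and creates tasks, and firstprivate(p) plays the role the article attributes to captureprivate. The node type and do_work are hypothetical.

```c
#include <stddef.h>

struct node {
    int value;
    int result;
    struct node *next;
};

/* Hypothetical unit of work applied to each list element. */
static void do_work(struct node *p) { p->result = p->value * p->value; }

/* One thread executes the while loop and enqueues a task per node; the
   other threads in the team pick tasks up and execute them. */
void process_list(struct node *head) {
    #pragma omp parallel
    {
        #pragma omp single
        {
            struct node *p = head;
            while (p != NULL) {
                /* firstprivate snapshots p at the time the task is created */
                #pragma omp task firstprivate(p)
                do_work(p);
                p = p->next;
            }
        }  /* implicit barrier: all created tasks finish before it is passed */
    }
}
```

The number of list elements is not known when the construct starts, which is exactly the kind of irregular work a loop-based worksharing construct cannot describe.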

The Intel C++ OpenMP compiler has been extended throughout its various components to support the workqueuing model, generating multithreaded code for the workqueuing constructs as an Intel OpenMP extension. More code generation details for the workqueuing constructs are presented in [1]. In the next section, we describe the multimedia applications SVM and AVSR, modified with OpenMP pragmas, which we use to evaluate the multithreaded code generation and optimizations developed in the Intel compiler together with the Intel OpenMP runtime library.

4. Multimedia Workloads

Due to the inherently sequential constitution of the algorithms of multimedia applications, most of the modules in these optimized applications cannot fully utilize all the execution units available in the off-the-shelf microprocessors. Some modules are memory-bound, while some are computation-bound. In this Section, we describe the selected multimedia workloads and discuss our approach of parallelizing the workloads with OpenMP.

Workload Description

Audio-visual Speech Recognition

The second workload that we investigate is audio-visual speech recognition (AVSR). There are many applications that use automatic speech recognition systems, from human-computer interfaces to robotics. While computers are getting faster, speech recognition systems are not robust without special constraints. Often, robust speech recognition requires special conditions, such as a smaller vocabulary or a very clean voice signal.

In recent years, several speech recognition systems that use visual together with audio information showed a significant increase in performance over the standard speech recognition systems. Figure 7 shows a flowchart of the AVSR process. The use of visual features in AVSR is motivated by the bimodality of speech formation and the ability of humans to better distinguish spoken sounds when both audio and video are available. Additionally, the visual information provides the system with complementary features that cannot be corrupted by the acoustic noise of the environment. In our performance study, we use the system developed by Liang et al. (L. Liang, X. Liu, M. Zhao, X. Pi, and A. V. Nefian, “Speaker Independent Audio-Visual Continuous Speech Recognition,” in Proc. of Int’l Conf. on Multimedia and Expo, vol. 2, pp. 25-28, Aug. 2002).

Data-Domain Decomposition

One way of exploiting the parallelism of multimedia workloads is to decompose the work into threads in the data domain. As described in Section 4.1.1, the evaluation of trained SVMs is well-structured and can, thus, be multithreaded at multiple levels. On the lowest level, the dimensionality K of the input data can be very large. Typical values of K range from a few hundred to several thousand. Thus, the vector multiplication in the linear, polynomial, and sigmoid kernels, as well as the L2 distance in the radial basis function kernel, can be multithreaded. On the next level, the evaluations of the individual terms in the sum are independent of each other. Finally, in an application several samples are tested, and each evaluation can be done in parallel. In Figure 8, we show the SVM parallelized by simply adding a parallel for pragma. The programmer intervention needed to parallelize the SVM is minor. The compiler generates the multithreaded code automatically.
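Figure 8 is not reproduced in this text. The following sketch illustrates the outermost level of parallelism described above, with one parallel for over the test samples and a linear kernel inside. The function names, the data layout, and the assumption that coef[i] already holds the signed weight of support vector i are illustrative, not taken from the original workload.

```c
/* Linear kernel: dot product of two dim-dimensional vectors. */
static double linear_kernel(const double *u, const double *v, int dim) {
    double s = 0.0;
    for (int k = 0; k < dim; k++) s += u[k] * v[k];
    return s;
}

/* Evaluates a trained SVM on n_samples inputs.
   sv:      n_sv support vectors, row-major, dim values each
   coef:    per-support-vector weight (assumed to be pre-multiplied by
            the class label)
   samples: n_samples inputs, row-major, dim values each
   out:     one decision value per sample
   The samples are independent, so the outer loop carries the pragma. */
void svm_eval(const double *sv, const double *coef, int n_sv, double bias,
              const double *samples, int n_samples, int dim, double *out) {
    #pragma omp parallel for
    for (int s = 0; s < n_samples; s++) {
        double sum = bias;
        for (int i = 0; i < n_sv; i++) {
            sum += coef[i] * linear_kernel(&sv[i * dim], &samples[s * dim], dim);
        }
        out[s] = sum;
    }
}
```

The single pragma is the only change relative to a serial version, which matches the article's point that the programmer intervention is minor.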

Functional Decomposition

Functional decomposition is another way to multithread an application, in this case to exploit task-parallelism. The AVSR application clearly has four different functional components: audio processing, video processing, audio-video processing, and others. Therefore, a natural scheme for parallelizing the AVSR is to map each functional component to an OpenMP worksharing section [6], as shown in Figure 9. Streams of audio and video data can be broken into pieces and processed in a pipeline. In our multithreaded application, while the audio processing and the video processing are working on the current piece of the data, the AVSR processing is working on the previous piece of the data. We thus parallelized not only the parallel tasks, but also the pipeline tasks.

As with exploiting data-parallelism in the SVM application, the programmer intervention needed to parallelize the AVSR is small. A few OpenMP pragmas are simply added to the original source code. The compiler performs the threaded code generation presented in Section 3 and, together with the OpenMP library support, executes the AVSR application in parallel.
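Figure 9 is not reproduced here. The sketch below shows one way the four-section, pipelined structure could be written: on step t, the audio and video sections produce features for piece t while the audio-video section consumes the features of piece t-1. All function names and the toy arithmetic are hypothetical placeholders for the real processing stages.

```c
#define N_PIECES 4

/* Hypothetical stand-ins for the four functional components. */
static int audio_features(int piece) { return piece + 1; }
static int video_features(int piece) { return 2 * piece; }
static int fuse(int a, int v)        { return a + v; }
static void other_work(int *counter) { *counter += 1; }

/* Runs N_PIECES pieces through a two-stage pipeline and returns the
   accumulated audio-video result. The implicit barrier at the end of
   each sections construct separates one pipeline step from the next. */
int avsr_pipeline(int *others) {
    int a_feat[N_PIECES] = {0};
    int v_feat[N_PIECES] = {0};
    int recognized = 0;

    for (int t = 0; t <= N_PIECES; t++) {
        #pragma omp parallel sections
        {
            #pragma omp section   /* A: audio processing of piece t */
            { if (t < N_PIECES) a_feat[t] = audio_features(t); }

            #pragma omp section   /* V: video processing of piece t */
            { if (t < N_PIECES) v_feat[t] = video_features(t); }

            #pragma omp section   /* AV: fusion of piece t-1 */
            { if (t > 0) recognized += fuse(a_feat[t - 1], v_feat[t - 1]); }

            #pragma omp section   /* O: other miscellaneous processing */
            { other_work(others); }
        }
    }
    return recognized;
}
```

Within one step the sections touch disjoint data (piece t versus piece t-1), so they can run concurrently without extra synchronization.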

Exploiting Dynamic Nested Parallelism

In addition to the functional decomposition of the AVSR application, we exploit the nested data-parallelism in the dynamic extent of the video processing section (or thread). The major motivation for further partitioning this thread into multiple threads is to achieve better load balance. The execution time breakdown of the AVSR workload is shown in Figure 10, in which video processing takes around half of the time. To exploit task-level parallelism of the application on a single processor with Hyper-Threading technology or on a dual-processor system without Hyper-Threading technology, the workload can be balanced well by placing the video processing thread on one processor and the rest on the other processor. However, on a dual-processor system with Hyper-Threading technology, pure functional decomposition cannot balance the load across four logical processors, because video processing alone takes ~50% of the total execution time. We therefore further thread the dot-products of matrices/vectors and the Fourier transform, as shown in Figure 11. Thus, as shown in Figure 12, we have three threading schemes in total in our experiment to evaluate the exploitation of static nested parallelism supported by the Intel compiler and OpenMP runtime library.

Figure 12 shows the application AVSR parallelized with OpenMP pragmas to exploit task and data parallelisms, where, A stands for audio processing, V stands for video processing, AV stands for audio-video processing, and O stands for other miscellaneous processing. Figure 12(a) shows the multi-threading model when we only have four threads via functional decomposition. Figure 12(b) and (c) show the nested parallelism when video processing is further threaded into 2 or 4 threads. The bottom nodes denote the additional threads created for executing the parallel for loop within the dynamic extent of the parallel sections.
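As an illustration of the nested schemes in Figure 12(b) and (c), the sketch below places a threaded dot product inside the dynamic extent of a video section. The names dot and process_frame, the inner_threads parameter, and the placeholder audio work are assumptions made for this example; the real workload also threads the Fourier transform.

```c
/* Dot product threaded with an inner parallel for. When called from
   inside a section with nested parallelism enabled, it forms a nested
   team of inner_threads threads; otherwise it runs serially. */
static double dot(const double *x, const double *y, int n, int inner_threads) {
    double s = 0.0;
    #pragma omp parallel for reduction(+:s) num_threads(inner_threads)
    for (int i = 0; i < n; i++) s += x[i] * y[i];
    return s;
}

/* Outer level: functional decomposition into sections.
   Inner level: data decomposition of the heavy video section only. */
void process_frame(const double *x, const double *y, int n, int inner_threads,
                   double *video_out, int *audio_out) {
    #pragma omp parallel sections
    {
        #pragma omp section   /* V: heavy, split further by dot() */
        *video_out = dot(x, y, n, inner_threads);

        #pragma omp section   /* A: light placeholder work */
        *audio_out = n;
    }
}
```

Calling process_frame with inner_threads set to 2 or 4 corresponds loosely to the two nested schemes; only the heavy section gets the extra threads.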

5. Performance

We conducted our performance evaluation with two multimedia applications to examine the performance of the multithreaded code generated by the Intel compiler. The generated code is highly optimized with architecture-specific, advanced scalar and array optimizations assisted by aggressive memory disambiguation. Our results show that Hyper-Threading technology and the Intel compiler offer a cost-effective performance gain (10%~28%) for our applications on a single processor (SP+HT), and offer up to 2.23x speedup on a dual-processor system with Hyper-Threading technology enabled (DP+HT). The performance measurement of the two multimedia applications, SVM and AVSR, is carried out on a dual-processor HT-enabled Intel Xeon™ system running at 2.0GHz, with 1024MB memory, an 8K L1-Cache, a 512K L2-Cache, and no L3-Cache. To measure single-processor performance on this Dual-Processor (DP) system, we disable one physical processor from the BIOS. To measure performance without Hyper-Threading technology, we disable Hyper-Threading support from the BIOS. To obtain the serial execution time used as the baseline, we disable both one physical processor and Hyper-Threading technology, and run the highly optimized serial code of the applications.

Essentially, the performance scaling is measured relative to the serial execution (SP) with Hyper-Threading technology disabled and one physical processor disabled on our system. The multithreaded execution is done with three system configurations: (1) SP+HT (Single-Processor with HT enabled), (2) DP (Dual-Processor with HT disabled), (3) DP+HT (Dual-Processor with HT enabled). In Figure 13, we show the normalized speedup of our multithreaded execution of the SVMs (2 kernels). The workloads achieved very good performance gains using the Intel OpenMP C++ compiler for data-domain decomposition. For instance, going from a single processor with HT disabled to a single processor with HT enabled, we achieve speedups ranging from 1.10x to 1.13x with a 2-thread run. The speedup ranges from 1.92x to 1.97x for a 2-thread run with the DP configuration, and from 2.13x to 2.23x for a 4-thread run with the DP+HT configuration. This indicates that we utilize the microprocessor more efficiently.

Figure 14 shows the speedup of the OpenMP version of the AVSR with different amounts of nested parallelism under different system configurations. Again, going from a single processor with Hyper-Threading technology disabled to a single processor with Hyper-Threading technology enabled, a speedup ranging from 1.18x to 1.28x is achieved with 2 threads under the SP+HT configuration. With the DP configuration, the speedup is 1.61x for 4 outer threads, 2.03x for 4 outer and 2 inner threads, and 1.95x for 4 outer and 4 inner threads. With the DP+HT configuration, the speedup is 1.57x for 4 outer threads, 1.99x for 4 outer and 2 inner threads, and 1.85x for 4 outer and 4 inner threads. Clearly, we achieved ~2x speedup going from a single-CPU system to a dual-CPU system.

One observation we have from Figure 14 is that the best speedup of the AVSR workload with the DP+HT configuration is 1.97% lower than the best speedup of the AVSR with the DP configuration. We attribute this to one cause: only three logical processors are effectively used once A (2.5%) and O (8.8%) have completed in the 4-outer-2-inner-thread execution. This means that the benefit from one physical processor with HT enabled, which is evidenced by the performance gain under the SP+HT configuration, is not enough to counteract the penalty of one idle logical processor caused by the unbalanced load. Our observation applies to the 4-outer-4-inner-thread execution scheme as well. The challenge here is how to exploit parallelism in AV (36.6%), which is one of our future research topics beyond the scope of this article.

Another observation we have from Figure 14 is that the speedup with 4 outer and 2 inner threads is better than the speedup with 4 outer and 4 inner threads under both the DP and DP+HT configurations. This is simply because a smaller number of threads introduces less threading overhead. Later, we discuss controlling parallelism and controlling spin-waiting to get a good trade-off between benefits and costs. In any case, we have achieved ~2x speedup under both the DP and DP+HT configurations.
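The 1.97% figure is a relative difference between the best speedups reported for Figure 14 (2.03x under DP and 1.99x under DP+HT, both with 4 outer and 2 inner threads). As a quick check of the arithmetic, with the same comparison for the 4-outer-4-inner scheme (1.95x versus 1.85x) alongside it:

```latex
\[
\frac{S_{\mathrm{DP}} - S_{\mathrm{DP+HT}}}{S_{\mathrm{DP}}}
  = \frac{2.03 - 1.99}{2.03} \approx 0.0197 = 1.97\%,
\qquad
\frac{1.95 - 1.85}{1.95} \approx 0.0513 = 5.13\%.
\]
```

The gap between DP and DP+HT is therefore wider with 4 inner threads than with 2.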

Functional decomposition may not deliver the best performance, because the tasks load the processors in the system unevenly. Given the inherent variation in granularity among tasks (or modules), it is hard to achieve the best potential performance without exploiting another level of parallelism. Essentially, for media workloads, we can exploit data-parallelism to overcome the limits of exploiting task-parallelism alone. As we show in Figure 14, by exploiting the inner parallelism with data-domain decomposition, we achieve much better speedups: the performance gain is around 40% of the serial baseline with 4 outer and 2 inner threads compared to 4 outer threads alone, i.e. task-parallelism only (from 1.61x to 2.03x under DP, and from 1.57x to 1.99x under DP+HT). Thus, exploiting nested parallelism is necessary to achieve better load balance and speedup. (Note: the inner parallelism does not always have to be data-parallelism; it can be task-parallelism as well.) On the other hand, Figure 14 also shows that excessive threads introduce extra threading overhead: the performance improvement with 4 inner threads is not better than that with 2 inner threads. Therefore, effectively controlling parallelism is still an important aspect of achieving the desired performance on an HT-enabled Intel Xeon processor system, even though the potential parallelism could improve the processor utilization. With the Intel compiler and runtime, users can control how much time each thread spends spinning at run-time. An environment variable, KMP_BLOCKTIME, is supported in the library. The spinning time can also be adjusted by using the kmp_set_blocktime() API call at runtime. On an HT-enabled processor, more than one thread can be executing on the same physical processor at the same time, so both threads have to share that processor’s resources. This makes spin-waiting extremely expensive, since the thread that is just waiting takes valuable processor resources away from the other thread that is doing useful work.

Thus, when exploring the use of Hyper-Threading technology, the block-time should be very short so that a waiting thread sleeps as soon as possible, allowing the threads that still have useful work to utilize the processor resources more fully. In our previous work, we used Win32 Threading Library calls to parallelize our multimedia workloads [5]. While we could achieve good performance, multithreading them took a huge amount of effort. With the Intel OpenMP compiler and OpenMP runtime library support, we demonstrated the same or better performance with much less effort. In other words, the programmer intervention needed to parallelize our multimedia applications is minor.

Furthermore, we characterize the multimedia workloads by using Intel VTune Performance Analyzer under SP, SP+HT, DP, and DP+HT configurations to examine the HT benefits and costs instead of presenting speedup only. As shown in Table 1, although the numbers of instructions retired and cache miss rates (e.g., 2.7% vs 2.9% first-level cache miss rates for the linear SVM) are increased for both applications after threading due to execution resource sharing, cache and memory sharing, and contention, the overall application performance still increases. More specifically, the IPC is improved from 0.77 to 0.83 (8%) for SVM (linear) on SP, 17% for SVM (linear) on DP, 13% for SVM (RBF) on SP, 12% for SVM (RBF) on DP, and 30% for AVSR on SP. These results indicate the processor resource utilization is greatly improved for our multimedia applications with the Hyper-Threading technology.

Conclusions

In this article, we presented a set of compilation techniques developed in the Intel high-performance compiler for OpenMP pragma- and directive-guided parallelization. We studied two multimedia applications to demonstrate that the multithreaded code generated and optimized by the Intel compiler, together with the well-tuned Intel OpenMP runtime library, is very efficient. The performance improvements achieved on the three system configurations (SP+HT, DP, and DP+HT) are good for the multimedia applications (SVM and AVSR) studied in this article, with speedups of up to 1.28x on the HT-enabled single-processor system and up to 2.23x on the HT-enabled dual-processor system. The performance results and workload characteristics of SVM and AVSR support our three main observations: (a) the multithreaded code generated by the Intel compiler yields a good performance gain with the parallelization guided by the OpenMP pragmas; (b) exploiting thread-level parallelism (TLP) causes inter-thread interference in caches and places greater demands on the memory system; however, Hyper-Threading technology hides the additional latency, so the impact on whole-program performance is very small and is outweighed by the overall performance gain on Hyper-Threading-enabled Intel platforms; (c) Hyper-Threading technology is effective at exploiting both task- and data-parallelism through functional and data decomposition in multimedia applications.

Acknowledgments

The authors thank all members of the Intel compiler team for their contributions to developing the Intel high-performance compiler. In particular, we thank Paul Grey, Hideki Saito, and Dale Schouten for their contributions to the PAROPT projects, Knud J. Kirkegaard for IPO support, Zia Ansari and Kevin B. Smith for PCG support, Max Domeika and Diana King for the C++ FE support, and Bhanu Shankar and Michael Ross for the Fortran FE support. Special thanks go to the library team at KSL for developing the OpenMP runtime library. We would also like to thank Steven Ge and Rainer Lienhart for the development of the speech recognition workloads.

References

[1] E. Su, X. Tian, M. Girkar, G. Haab, S. Shah, and P. Petersen, “Compiler Support for Workqueuing Execution Model for Intel SMP Architectures,” in Proc. of European Workshop on OpenMP (EWOMP), Sep. 2002.
[2] L. Liang, X. Liu, M. Zhao, X. Pi, and A. V. Nefian, “Speaker Independent Audio-Visual Continuous Speech Recognition,” in Proc. of Int’l Conf. on Multimedia and Expo, vol. 2, pp. 25-28, Aug. 2002.
[3] X. Tian, A. Bik, M. Girkar, P. Grey, H. Saito, and E. Su, “Intel OpenMP C++/Fortran Compiler for Hyper-Threading Technology: Implementation and Performance,” Intel Technology Journal, Q1, 2002. (http://www.intel.com/technology/itj)
[4] D. Marr, F. Binns, D. L. Hill, G. Hinton, D. A. Koufaty, J. A. Miller, and M. Upton, “Hyper-Threading Technology Microarchitecture and Architecture,” Intel Technology Journal, Vol. 6, Q1, 2002.
[5] Y.-K. Chen, M. Holliman, E. Debes, S. Zheltov, A. Knyazev, S. Bratanov, R. Belenov, and I. Santos, “Media Applications on Hyper-Threading Technology,” Intel Technology Journal, Q1, 2002.
[6] OpenMP Architecture Review Board, “OpenMP C/C++ Application Program Interface,” V2.0, Mar. 2002. (http://www.openmp.org)
[7] D. M. Tullsen and J. A. Brown, “Handling Long-Latency Loads in a Simultaneous Multithreading Processor,” in Proc. of Micro-34, Dec. 2001.

Authors’ Biographies

Xinmin Tian works on compiler parallelization and optimization. He manages the OpenMP Parallelization group. He holds B.Sc., M.Sc., and Ph.D. degrees in Computer Science from Tsinghua University. He was a postdoctoral researcher in the School of Computer Science at McGill University, Montreal. Before joining Intel Corp., he worked on a parallelizing compiler, code generation, and performance optimization at IBM.

Milind Girkar received a B.Tech. degree from the Indian Institute of Technology, Mumbai, an M.Sc. degree from Vanderbilt University, and a Ph.D. degree from the University of Illinois at Urbana-Champaign, all in Computer Science. Currently, he manages the IA-32 Compiler Development group. Before joining Intel Corp., he worked on an optimizing compiler for the UltraSPARC platform at Sun Microsystems.

Yen-Kuang Chen is a researcher in Microprocessor Research Labs, Intel Corporation. His research interests include computer architectures for emerging audio-visual applications, innovative technologies for intelligent human-computer interfaces, and multimedia signal processing. He received his B.Sc. degree from National Taiwan University and his Ph.D. from Princeton University, both in Electrical Engineering. He is on the editorial board of the Journal of VLSI Signal Processing Systems.

Aart Bik received his M.Sc. degree in Computer Science from Utrecht University, The Netherlands, in 1992 and his Ph.D. degree from Leiden University, The Netherlands, in 1996. In 1997, he was a postdoctoral researcher at Indiana University, Bloomington, Indiana, where he conducted research in high-performance compilers for Java. In 1998, he joined Intel Corporation, where he is currently working in the vectorization and parallelization group.

Ernesto Su received a B.Sc. degree from Columbia University, and M.Sc. and Ph.D. degrees from the University of Illinois at Urbana-Champaign, all in Electrical Engineering. He joined Intel Corp. in 1997 and is currently working in the OpenMP Parallelization group. His research interests include compiler performance optimizations, parallelizing compilers, and computer architectures.

(Performance results were measured using specific computer systems and reflect the approximate performance of Intel products. Any difference in system hardware or software design or configuration may affect actual performance.)

19 Comments

2003-03-13 3:04 am

Anonymous
that use this tech on Linux, BSD, or windows. when will IBM make this tech work in the 970 class proc?
2003-03-13 3:18 am

Anonymous
OpenMP works best on programs where you have a dataset that can be split into chunks and then the chunks assigned to worker threads. Much “multimedia” is great for OpenMP because you can easily segment the data.

What is problematic though is that the language-side support for doing high-performance threading is immature compared to the compiler and some of the high-performance libraries (OpenMP, MPI, et al).

We have many C++ programming frameworks that are not even thread-safe much less amenable to thread-based optimization. Most current C++ GUI frameworks are classic examples of frameworks that were not designed with high-performance threading in mind.

As the hardware gets more and more evolved threading support, I would expect to see languages start tracking these developments and we will see new advanced parallelism constructs in our familiar languages.

To date, I know of only Erlang as having implemented pervasive multi-processing.
2003-03-13 4:06 am

Anonymous
good effort but it’s not in that 50-100 percent speed increase range where you say “wow, real cool”.

It’s nice.
2003-03-13 4:51 am

Anonymous
it is the law!!!
2003-03-13 6:30 am

Anonymous
Think again…:

An Intel P4 2800 costs 439 E, a 3,06 costs 699 E – now do the math and figure how many percent that is for the little increase in speed. I have seen videos from THG where two systems running head to head with video applications are equally fast. One of them is a plain 3,6 GhZ P4, the other 3,06 with HT enabled. Now, what does this tell us? In the above case you pay more than 70% extra for only 7% more CPU-power. With HT, you get 20% free and you don’t care..? – So be it..
2003-03-13 8:48 am

Anonymous
I thought pseudo-code should be written so it was easily readable.
2003-03-13 9:26 am

Anonymous
Who is that ?
2003-03-13 9:29 am

Anonymous
Read on the last page as to who is who. There are 5 of them, there was no space in the db field to mention all of them by name.
2003-03-13 9:35 am

Anonymous
> Who is that ?

Wonders too, who uses OSNews as intel PR

</rant>

To me that looks a bit too “scientific” for the average OSNews reader, but maybe I’m wrong

I’ll definitely read it all sometime.

I really wonder what this thing would give with a multithreading-crazy BeOS (where we don’t need optimizing compilers)…

Btw, I recently noticed VideoLan Client on BeOS was even more multithreaded than native media players

(is it too on other platforms ?)
2003-03-13 9:37 am

Anonymous
>To me that looks a bit too “scientific” for the average OSNews reader

I don’t think so. Supposedly most of our readers are actually programmers/engineers:

http://www.osnews.com/story.php?news_id=2037
2003-03-13 9:44 am

Anonymous
I read the first 2 pages, and then decided that i will try to read it again some other time when i can take some more time to digest it.

some more lay-men explanation with the examples

would have been nice though.

on a side note: i would very much like to run BeOS or OpenBeOS on a dual CPU hyperthreading system. for example dual XEON or so. this would allow 4 threads in parallel.

I already use BeOS on my dual PIII and it rocks.

on the other hand it might be worth waiting for XEON 32/64 bit. i still think intel will release 32 bit compatible CPU’s once AMD starts selling them. they had better, because i will not fork out 4000$ for a single CPU itanium2.

Int
2003-03-13 11:02 am

Anonymous
This looks like a draft for a peer-reviewed journal paper, and hence targeted at a different audience than me and presumably a lot of others. Can’t comment on the facts as I got lost on the 2nd para! I’m sadly not a computer scientist. I’ve got no problems with such stuff appearing on OSNEWS though – makes a break from looking at log files 🙂
2003-03-13 11:40 am

Anonymous
If you want to know what hyperthreading is all about, in understandable English, read the great article at Ars Technica: http://arstechnica.com/paedia/h/hyperthreading/hyperthreading-1.htm…
2003-03-13 12:48 pm

Anonymous
> >To me that looks a bit too “scientific” for the average OSNews reader

> I don’t think so. Supposedly most of our readers are actually programmers/engineers:

Yes, though this looks more like a Ph.D. paper

(don’t have anything against that btw)
2003-03-13 1:05 pm

Anonymous
Page 5 lists the references and authors. The article was written by 5 PhD’s (to include other degrees). Intel has always been very good about providing in depth documentation about their microprocessor architecture. You have to wonder about its usefulness to AMD sometimes.
2003-03-13 1:57 pm

Anonymous
To me that looks a bit too “scientific” for the average OSNews reader, but maybe I’m wrong

Don’t let the math formulas fool you, these kinds of scientific articles always have them but no one really reads them, unless of course there are no source code examples and we really have to
2003-03-13 4:16 pm

Anonymous
I am surprised that OpenMP helps. It would seem the best case would be two instruction streams that are not related. OpenMP is usually used to create threads doing the same operations. In this case it would seem that they would be competing for the same resource. Perhaps this makes up for the lack of registers in a p4. Does having the second state allow more data to flow to the same resources? Anyone know?

I did not see mention of the negatives. Is it just die space or do single streams get a performance/latency hit?

I would imagine with the poor state of smp in most OSS kernels that pretending to have two processors could easily more than make up for that performance increase.

I have seen lots of tests where 2 processors slow the linux kernel down instead of speeding it up.

But perhaps on a HPTC machine having a separate virtual processor to handle os requests might not be too bad.

Anyone know what the big p4 Xeon linux clusters do about hyper threading?
2003-03-13 6:57 pm

Anonymous
Let’s take a close look at their results, referring to figures 13 and 14. The hyperthreading is giving them at best a 13% speed boost over the non-hyperthreading scenario. This is evident in the single processing case. The inherent parallelism of the operation is evident by the fact that they get a nearly factor of 2 speed improvement in the dual processor case. The speedup in the hyperthreaded dual processors is simply an aggregate of the ten percent speed gains within each processor. What Figure 13 is therefore showing is that even in cases where parallelism is excellent in the algorithm, by evidence of the boost in the DP score, we still only get marginal speed improvements with hyperthreading.

Figure 14 shows an inherent problem with trying to fool the system into thinking there are four processors instead of only two as well. As it states, the algorithm is really only working on three processes simultaneously. The system, believing it has four full-fledged processors, is therefore inefficiently distributing idle tasks among the two physical processors, in deference to the four simulated processors. This therefore shows that there can be a functional decrease in speed in a hyperthreaded system. The single processor hyperthreaded case for the algorithm used in Figure 14 did perform very well, but again it is evident that the algorithm itself lends itself to parallelism, by looking at the dual processor case.

This article therefore highlights two things in my mind:

1. OpenMP is effective in parallelizing algorithms “on the fly” so to speak.

2. Hyperthreading does increase performance, but not substantially.

Are there articles on simultaneous thread executions on completely different computations, rather than functionally parallel threads? For example, what kind of speed up would there be if one thread was doing the SVM calculation and the other was doing the AVSR one? Better still, what would happen if we distributed two threads for SVM execution and three for AVSR? Interesting thoughts….
2003-03-14 4:39 pm

Anonymous
Does anyone know if AMD has any tools for openmp in the works?

Anyone else working on this? Will the intel Compiler (in 32 bit only :( ) work on Opteron?

Thanx