posted by Intel Researchers (for OSNews) on Wed 12th Mar 2003 23:25 UTC
Processors with Hyper-Threading technology can improve application performance by letting a single processor process data as if it were two processors, executing instructions from different threads in parallel rather than serially. However, this potential improvement can be obtained only if the application is multithreaded through parallelization techniques. This article presents the multithreaded code generation and optimization techniques developed for the Intel C++/Fortran compiler. We conduct a performance study of two multimedia applications parallelized with OpenMP pragmas and compiled with the Intel compiler on Hyper-Threading (HT) technology-enabled Intel single-processor and multi-processor systems.

The performance results show that the threaded code generated by the compiler achieved a speedup of up to 1.28x on an HT-enabled single-processor system and up to 2.23x on an HT-enabled dual-processor system. Our three key observations are: (a) the threaded code generated by the Intel compiler yields a good performance gain when parallelization is guided by OpenMP pragmas in multimedia applications; (b) exploiting thread-level parallelism (TLP) causes inter-thread interference in caches and places greater demands on the memory system, but Hyper-Threading technology hides the additional latency and delivers a good performance gain for the whole program; (c) Hyper-Threading technology is effective at exploiting both the task parallelism and the data parallelism inherent in multimedia applications.

1. Introduction

Simultaneous Multi-Threading (SMT) [7, 15] was proposed to allow multiple threads to compete for and share all of a processor's resources, such as caches, execution units, control logic, buses, and memory systems. Hyper-Threading technology (HT) [4] brings the SMT idea to Intel architectures and makes a single physical processor appear as two logical processors with duplicated architecture state but shared physical execution resources. This allows two threads from a single application, or from two separate applications, to execute in parallel, increasing processor utilization and reducing the impact of memory latency by overlapping the latency of one thread with the execution of another. Hyper-Threading technology-enabled processors offer significant performance improvements for applications with a high degree of thread-level parallelism without sacrificing compatibility with existing software or single-threaded performance.

These potential performance gains are obtained, however, only if an application is efficiently multithreaded. The Intel C++/Fortran compilers support OpenMP directive- and pragma-guided parallelization, which significantly increases the range of applications amenable to effective parallelism. A typical example is an application written with OpenMP parallel sections in which section-A calls an integer-intensive routine and section-B calls a floating-point-intensive routine. The performance improvement comes from scheduling section-A and section-B onto two different logical processors that share the same physical processor, so that Hyper-Threading technology can fully utilize the processor's resources. OpenMP directives and pragmas have emerged as the de facto standard for expressing thread-level parallelism in applications, as they substantially simplify the notoriously complex task of writing multithreaded applications. The OpenMP 2.0 standard API [6, 9] supports a multi-platform, shared-memory, parallel programming paradigm in C/C++ and Fortran95 on all popular operating systems such as Windows NT, Linux, and Unix.

This paper describes threaded code generation techniques for exploiting parallelism explicitly expressed by OpenMP pragmas and directives. To validate the effectiveness of our threaded code generation and optimization techniques, we also characterize and study two multimedia workloads parallelized with OpenMP pragmas and compiled with the Intel OpenMP C++ compiler on the Intel Hyper-Threading architecture. The two workloads, Support Vector Machine (SVM) and Audio-Visual Speech Recognition (AVSR), are optimized for the Intel Pentium 4 processor. One of our goals is to better explain the performance gains that are possible in media applications by exploring the use of Hyper-Threading technology with the Intel compiler.
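
The parallel-sections scenario described above can be sketched in C roughly as follows. This is a minimal illustration, not code from the studied workloads: the routines `integer_work` and `float_work` and the wrapper `run_sections` are hypothetical names invented here, and their loop bodies are stand-ins for real integer-intensive and floating-point-intensive work. Only the `#pragma omp parallel sections` and `#pragma omp section` constructs come from the OpenMP standard.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical integer-intensive routine (stand-in for "section-A"). */
static unsigned long integer_work(unsigned long n)
{
    unsigned long acc = 0;
    for (unsigned long i = 0; i < n; i++)
        acc += (i * i) % 7u;          /* integer multiply and modulo only */
    return acc;
}

/* Hypothetical floating-point-intensive routine (stand-in for "section-B"). */
static double float_work(unsigned long n)
{
    double acc = 0.0;
    for (unsigned long i = 1; i <= n; i++)
        acc += 1.0 / ((double)i * (double)i);   /* floating-point divide and add */
    return acc;
}

/*
 * Runs the two routines as independent OpenMP sections. When compiled with
 * OpenMP enabled, the runtime may assign each section to a different thread,
 * and the operating system may place those threads on two logical processors.
 * When compiled without OpenMP, the pragmas are ignored and the two calls
 * simply run one after the other with the same results.
 */
static void run_sections(unsigned long n, unsigned long *int_out, double *fp_out)
{
    assert(int_out != NULL && fp_out != NULL);

    unsigned long int_result = 0;
    double fp_result = 0.0;

    #pragma omp parallel sections
    {
        #pragma omp section
        int_result = integer_work(n);

        #pragma omp section
        fp_result = float_work(n);
    }

    *int_out = int_result;
    *fp_out = fp_result;
}
```

Because the two sections write to different variables and share no data, no synchronization is needed beyond the implicit barrier at the end of the sections construct. Whether the two threads actually land on the two logical processors of one physical processor is decided by the runtime and the operating system scheduler, not by this code.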

The remainder of this article is organized as follows. We first give a high-level overview of Hyper-Threading technology. We then present the threaded code generation and optimization techniques developed in the Intel C++ and Fortran product compilers for OpenMP pragma- or directive-guided parallelization, including the exploitation of nested parallelism and a workqueuing model extension for exploiting irregular parallelism. Starting from Section 4, we characterize and study two multimedia workloads parallelized with OpenMP pragmas and compiled with the Intel OpenMP C++ compiler on Hyper-Threading technology-enabled Intel architectures. Finally, we show the performance results of the two multimedia applications.

2. Hyper-Threading Technology

Hyper-Threading technology brings the concept of Simultaneous Multi-Threading (SMT) to Intel Architecture. Hyper-Threading technology makes a single physical processor appear as two logical processors; the physical execution resources are shared and the architecture state is duplicated for the two logical processors [4]. From a software or architecture perspective, this means operating systems and user programs can schedule threads to logical CPUs as they would on multiple physical CPUs. From a microarchitecture perspective, this means that instructions from both logical processors will persist and execute simultaneously on shared execution resources [4].

The Intel NetBurst™ microarchitecture is designed to provide optimal performance when executing a single instruction stream. A typical thread of code with a typical mix of instructions, however, utilizes only about 50 percent of the execution resources. By adding the logic and resources needed on the processor die to schedule and control two threads of code, Hyper-Threading technology makes these underutilized resources available to a second thread, offering increased system and application performance. Systems built with multiple Hyper-Threading-enabled processors further improve multiprocessor system performance by processing two threads on each processor.

Figure 1(a) shows a system with two physical processors that are not Hyper-Threading technology-capable. Figure 1(b) shows a system with two physical processors that are Hyper-Threading technology-capable. In Figure 1(b), with a duplicated copy of the architecture state on each physical processor, the system appears to have four logical processors.

Each logical processor contains a complete set of the architecture state. The architecture state consists of registers, including the general-purpose register group, the control registers, the advanced programmable interrupt controller (APIC) registers, and some machine state registers. From a software perspective, once the architecture state is duplicated, the processor appears to be two processors. The number of transistors required to store the architecture state is a very small fraction of the total. Logical processors share nearly all other resources on the physical processor, such as caches, execution units, branch predictors, control logic, and buses. Each logical processor has its own interrupt controller, or APIC. Interrupts sent to a specific logical processor are handled only by that logical processor.
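
To make the software view concrete, the sketch below shows one way a user program might ask the operating system how many processors it sees, and the arithmetic behind the four logical processors of Figure 1(b). It assumes a Unix-like system where `sysconf(_SC_NPROCESSORS_ONLN)` is available (a widely supported extension, not something the article itself uses). The helper names `logical_processor_count` and `expected_logical_processors` are hypothetical.

```c
#include <unistd.h>

/*
 * Number of processors currently online as reported by the operating system.
 * On a Hyper-Threading-enabled system this count normally reflects logical
 * processors, since the OS schedules threads onto them as it would onto
 * physical CPUs. Returns -1 if the value cannot be determined.
 * Assumption: _SC_NPROCESSORS_ONLN is supported (Linux, BSD, macOS).
 */
static long logical_processor_count(void)
{
    return sysconf(_SC_NPROCESSORS_ONLN);
}

/*
 * Illustrative arithmetic only: with Hyper-Threading, each physical processor
 * appears as two logical processors, so two physical processors appear as four.
 * The physical count and the ht_enabled flag are supplied by the caller; this
 * helper does not detect them.
 */
static long expected_logical_processors(long physical, int ht_enabled)
{
    return ht_enabled ? physical * 2 : physical;
}
```

For the systems in Figure 1, `expected_logical_processors(2, 0)` gives 2 for case (a) and `expected_logical_processors(2, 1)` gives 4 for case (b). The value returned by `logical_processor_count()` depends on the machine and on whether Hyper-Threading is enabled in firmware and by the operating system.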

With Hyper-Threading technology, the majority of execution resources are shared by the two architecture states (that is, the two logical processors). The Rapid Execution Engine processes instructions from both threads simultaneously. The Fetch and Deliver engine and the Reorder and Retire block partition some of their resources to alternate between the two threads. In short, Hyper-Threading technology improves the performance of multithreaded programs by increasing the utilization of the on-chip resources available in the Intel NetBurst™ microarchitecture.
