posted by Intel Researchers (for OSNews) on Wed 12th Mar 2003 23:25 UTC

"Hyper-Threading Technology for Multimedia Apps, Page 2"
Intel Compiler for Multithreading

The Intel compiler incorporates many well-known and advanced optimization techniques [14] that have been designed and extended to fully leverage Intel processor features for higher performance. The Intel compiler uses a common intermediate representation (named IL0) for the C/C++ and Fortran95 languages, so that OpenMP directive- and pragma-guided parallelization and a majority of the optimization techniques are applicable through a single high-level intermediate code generation and transformation pass, irrespective of the source language. In this section, we present several threaded code generation and optimization techniques used in the Intel compiler.

Threaded Code Generation Technique

We developed a new compiler technology named Multi-Entry Threading (MET) [3]. The main motivation of the MET compilation model is to keep all newly generated multithreaded code, delimited by T-entry, T-region and T-ret nodes, embedded inside the user routine without splitting it into independent subroutines. This method differs from the outlining technique [10, 13], and it later provides more optimization opportunities for higher performance. From a compiler-engineering point of view, the MET technique greatly reduces the complexity of generating separate routines in the Intel compiler. In addition, the MET technique minimizes the impact of the OpenMP parallelizer on the well-known optimizations in the Intel compiler, such as constant propagation, vectorization [8], PRE [12], scalar replacement, loop transformation, profile-feedback-guided optimization, and interprocedural optimization.

The code transformations and optimizations in the Intel compiler can be classified into (i) code restructuring and interprocedural optimizations (IPO); (ii) OpenMP directive-guided and automatic parallelization and vectorization; (iii) high-level optimizations (HLO) and scalar optimizations including memory optimizations such as loop control and data transformations, partial redundancy elimination (PRE), and partial dead store elimination (PDSE); and (iv) low-level machine code generation and optimizations such as register allocation and instruction scheduling. In Figure 2, we show a sample program using the parallel sections pragma.

Essentially, the multithreaded code generator inserts the thread invocation call __kmpc_fork_call(…), together with a T-entry node and a data environment (source line information loc, thread number tid, etc.), for each parallel loop, parallel sections, or parallel region, and transforms the serial loop, sections, or region into a multithreaded loop, sections, or region, respectively. In this example, a pre-pass first converts the parallel sections into a parallel loop. Then, the multithreaded code generator localizes the loop lower and upper bounds and privatizes the section id variable for the T-region marked by [T-entry, T-ret] nodes. For the parallel sections in the routine parfoo, multithreaded code generation involves (a) generating a call to a runtime dispatch and initialization routine (__kmpc_dispatch_init) to pass the necessary information to the runtime system; (b) generating an enclosing loop to dispatch loop chunks at runtime through the __kmpc_dispatch_next routine in the library; and (c) localizing the loop lower and upper bounds and privatizing the loop control variable id, as shown in Figure 3. Given that the granularity of the sections can differ dramatically, the static or static-even scheduling type may not achieve good load balance. We therefore decided to use the runtime scheduling type for a parallel loop generated by the pre-pass of multithreaded code generation. The decision regarding the scheduling type is thus deferred until run time, and a well-balanced workload can be achieved based on the setting of the OMP_SCHEDULE environment variable supported by the OpenMP library.

In order to generate efficient threaded code that achieves a speed-up over optimized uniprocessor code, an effective optimization phase ordering has been designed in the Intel compiler. It ensures that optimizations such as IPO inlining, code restructuring, Igoto optimizations, and constant propagation, which can be effectively enabled before parallelization, preserve legal OpenMP program semantics and the information necessary for parallelization. It also ensures that all optimizations applied after OpenMP parallelization, such as auto-vectorization, loop transformation, PRE, and PDSE, can effectively kick in to achieve good cache locality and to minimize the number of redundant computations and references to memory. For example, given a doubly nested OpenMP parallel loop, the parallelizer is able to generate multithreaded code for the outer loop while maintaining the symbol table information, loop structure, and memory reference behavior for the innermost loop. This enables subsequent auto-vectorization of the innermost loop to fully leverage the Streaming SIMD Extensions (SSE and SSE2) of Intel processors [3, 8]. A number of efficient threaded-code generation techniques have been developed for OpenMP in the Intel compiler; the following sub-sections describe some of them.

Support for Nested Parallelism

Both static and dynamic nested parallelism are supported by the OpenMP standard. However, most existing OpenMP compilers do not fully support nested parallelism, since an OpenMP-compliant implementation is allowed to serialize nested inner regions even when nested parallelism is enabled through the environment variable OMP_NESTED or the routine omp_set_nested(). For broad classes of applications, such as image processing and audio/video encoding and decoding algorithms, good performance gains can be achieved by exploiting nested parallelism. We provided compiler and runtime library support to exploit static and dynamic nested parallelism. Figure 4(a) shows a sample code with nested parallel regions, and Figure 4(b) shows the pseudo-threaded-code generated by the Intel compiler. As shown in Figure 4(b), two threaded regions, or T-regions, are created within the original function nestedpar(). The T-entry __nestedpar_par_region0() corresponds to the semantics of the outer parallel region, and the T-entry __nestedpar_par_region1() corresponds to the semantics of the inner parallel region. The variable id is a shared stack variable for the inner parallel region; therefore, it is accessed and shared by all threads through the T-entry argument id_p. Note that id is a private variable for the outer parallel region, since it is a locally defined stack variable. As Figure 4(b) shows, there are no extra arguments on the T-entries for sharing the local static array a, and there is no pointer de-referencing inside the T-regions for sharing a among all threads in the teams of both the outer and inner parallel regions. This uses the optimization technique presented in [3] for sharing local static data among threads; it is an efficient way to avoid the overhead of argument passing across T-entries.

Table of contents
  1. "Hyper-Threading Technology for Multimedia Apps, Page 1"
  2. "Hyper-Threading Technology for Multimedia Apps, Page 2"
  3. "Hyper-Threading Technology for Multimedia Apps, Page 3"
  4. "Hyper-Threading Technology for Multimedia Apps, Page 4"
  5. "Hyper-Threading Technology for Multimedia Apps, Page 5"
