The Intel compiler incorporates many well-known and advanced optimization techniques [14] that are designed and extended to fully leverage Intel processor features for higher performance. The Intel compiler has a common intermediate representation (named IL0) for C++/C and Fortran95 language, so that the OpenMP directive- and pragma-guided parallelization and a majority of optimization techniques are applicable through a single high-level intermediate code generation and transformation, irrespective of the source language. In this Section, we present several threaded code generation and optimization techniques in the Intel compiler.


We developed a new compiler technology named Multi-Entry Threading (MET) [3]. The main motivation of the MET compilation model is to keep all newly generated multithreaded codes, which are captured by T-entry, T-region and T-ret nodes, embedded inside the user-routine without splitting them into independent subroutines. This method is different from outlining [10, 13] technique, and it provides later more optimization opportunities for higher performance. From the compiler-engineering point of view, the MET technique greatly reduces the complexity of generating separate routines in the Intel compiler. In addition, the MET technique minimizes the impact of OpenMP parallelizer on all well-known optimizations in the Intel compiler such as constant propagation, vectorization [8], PRE [12], scalar replacement, loop transformation, profile-feedback guided optimization and interprocedural optimization.
The code transformations and optimizations in the Intel compiler can be classified into (i) code restructuring and interprocedural optimizations (IPO); (ii) OpenMP directive-guided and automatic parallelization and vectorization; (iii) high-level optimizations (HLO) and scalar optimizations including memory optimizations such as loop control and data transformations, partial redundancy elimination (PRE), and partial dead store elimination (PDSE); and (iv) low-level machine code generation and optimizations such as register allocation and instruction scheduling. In Figure 2, we show a sample program using the parallel sections pragma.

In order to generate efficient threaded-code that gains a speed-up over optimized uniprocessor code, an effective optimization phase ordering had been designed in the Intel compiler to make sure that optimizations, such as, IPO inlining, code restructuring, Igoto optimizations, and constant propagation, which can be effectively enabled before parallelization, preserve legal OpenMP program semantics and necessary information for parallelization. It also ensures that all optimizations after the OpenMP parallelization, such as auto-vectorization, loop transformation, PRE, and PDSE, can effectively kick in to achieve a good cache locality and to minimize the number of redundant computations and references to memory. For example, given a double-nested OpenMP parallel loop, the parallelizer is able to generate multithreaded code for the outer loop, while maintaining the symbol table information, loop structure, and memory reference behavior for the innermost loop. This enables the subsequent auto-vectorization for the innermost loop to fully leverage the SIMD Streaming Extension (SSE and SSE2) features of Intel processors [3, 8]. There are a number of efficient threaded-code generation techniques that have been developed for OpenMP in the Intel compiler. The following sub- sections describe some such techniques.
Support Nested Parallelism
Both static and dynamic nested parallelisms are supported by the OpenMP standard. However, most existing OpenMP compilers do not fully support nested parallelism, since the OpenMP-compliant implementation is allowed to serialize the nested inner regions, even when the nested parallelism is enabled by the environment variable OMP_NESTED or routine omp_set_nested(). For broad classes of applications, such as imaging processing and audio/video encoding and decoding algorithms, the good performance gains are achieved by exploiting nested parallelisms. We provided the compiler and runtime library support to exploit static and dynamic nested parallelism. Figure 4(a) shows a sample code with nested parallel regions, and Figure 4(b) does show the pseudo-threaded-code generated by the Intel compiler. As shown in Figure 4(b), there are two threaded regions, or T-regions, created within the original function nestedpar(). T-entry __nestedpar_par_region0() corresponds to the semantics of the outer parallel region, and the T-entry __nestedpar_par_region1() corresponds to the semantics of the inner parallel region. For the inner parallel region in the routine nestedpar, the variable id is a shared stack variable for the inner parallel region. Therefore, it is accessed and shared by all threads through the T-entry argument id_p. Note that the variable id is a private variable for the outer parallel region, since it is a local defined stack variable. As we see in Figure 4(b), there are no extra arguments on the T-entry for sharing local static array ‘a’, and there is no pointer de-referencing inside the T-region for sharing the local static array ’a’ among all threads in the teams of both the outer and inner parallel regions. This uses the optimization technique presented in [3] for sharing local static data among threads; it is an efficient way to avoid the overhead of argument passing across T-entries.
- "Hyper-Threading Technology for Multimedia Apps, Page 1"
- "Hyper-Threading Technology for Multimedia Apps, Page 2"
- "Hyper-Threading Technology for Multimedia Apps, Page 3"
- "Hyper-Threading Technology for Multimedia Apps, Page 4"
- "Hyper-Threading Technology for Multimedia Apps, Page 5"



