The irregular parallelism inherent in many applications is difficult to exploit efficiently. The workqueuing model provides a simple approach that lets users exploit irregular parallelism effectively. It allows a programmer to parallelize control structures beyond the scope of those supported by the OpenMP model, while still fitting into the framework defined by the OpenMP specification. In particular, the workqueuing model is a flexible programming model for specifying units of work that are not pre-computed at the start of the worksharing construct. A simple example is shown in Figure 5.
The parallel taskq pragma specifies an environment for the while loop in which to enqueue the units of work specified by the enclosed task pragma. Thus, the loop's control structure and the enqueuing are executed by a single thread, while the other threads in the team dequeue the work from the taskq queue and execute it. The captureprivate clause ensures that a private copy of the link pointer p is captured at the time each task is enqueued, preserving the sequential semantics. The workqueuing execution model is shown in Figure 6. Given a program with workqueuing constructs, a team of threads is created when a parallel region is encountered. From among all threads that encounter a taskq pragma, one thread (TK) is chosen to execute it initially. All the other threads (Tm, where m = 1, …, N and m ≠ K) wait for work to be enqueued on the work queue. Conceptually, the taskq pragma causes an empty queue to be created by the chosen thread TK, which then executes the code inside the taskq block single-threaded, enqueuing each task it encounters. The task pragma specifies a unit of work, potentially executed by a different thread. When a task pragma is encountered lexically within a taskq block, the code inside the task block is enqueued on the queue associated with the taskq. The conceptual queue is disbanded when all the work enqueued on it has finished and the end of the taskq block has been reached.
Because the underlying algorithms of multimedia applications are largely sequential, most modules in these optimized applications cannot fully utilize all the execution units available in off-the-shelf microprocessors. Some modules are memory-bound, while others are computation-bound. In this section, we describe the selected multimedia workloads and discuss our approach to parallelizing them with OpenMP.
Audio-visual Speech Recognition
The second workload that we investigate is audio-visual speech recognition (AVSR). Automatic speech recognition is used in many applications, from human-computer interfaces to robotics. Although computers keep getting faster, speech recognition systems are not robust without special constraints, such as a small vocabulary or a very clean voice signal.
In recent years, several speech recognition systems that combine visual with audio information have shown significant performance gains over standard speech recognition systems. Figure 7 shows a flowchart of the AVSR process. The use of visual features in AVSR is motivated by the bimodality of speech formation and by humans' ability to distinguish spoken sounds better when both audio and video are available. Additionally, the visual information provides the system with complementary features that cannot be corrupted by the acoustic noise of the environment. In our performance study, we use the system developed by Liang et al. (L. Liang, X. Liu, M. Zhao, X. Pi, and A. V. Nefian, "Speaker Independent Audio-Visual Continuous Speech Recognition," in Proc. of Int'l Conf. on Multimedia and Expo, vol. 2, pp. 25-28, Aug. 2002).
One way to exploit the parallelism of multimedia workloads is to decompose the work into threads in the data domain. As described in Section 4.1.1, the evaluation of trained SVMs is well structured and can therefore be multithreaded at multiple levels. At the lowest level, the dimensionality K of the input data can be very large; typical values of K range from a few hundred to several thousand. Thus, the vector multiplication in the linear, polynomial, and sigmoid kernels, as well as the L2 distance in the radial basis function kernel, can be multithreaded. At the next level, the evaluation of each expression in the sum is independent of the others. Finally, an application tests several samples, and each evaluation can be done in parallel. In Figure 8, we show the SVM parallelized by simply adding a parallel for pragma. The programmer intervention needed to parallelize the SVM is minor; the compiler generates the multithreaded code automatically.
Functional decomposition is another way to multithread an application, exploiting task parallelism. The AVSR application clearly has four functional components: audio processing, video processing, audio-video processing, and the rest. A natural way to parallelize AVSR is therefore to map each functional component to an OpenMP worksharing section, as shown in Figure 9. Streams of audio and video data can be broken into pieces and processed in a pipeline: while the audio and video processing work on the current piece of data, the AVSR processing works on the previous piece. We parallelized not only the independent tasks but also the pipelined tasks. As with exploiting data parallelism in the SVM application, the programmer intervention needed to parallelize AVSR is small: a few OpenMP pragmas are simply added to the original source code. The compiler performs the threaded code generation presented in Section 3, together with the OpenMP library support, to execute the AVSR application in parallel.
In addition to the functional decomposition of the AVSR application, we exploit nested data parallelism in the dynamic extent of the video-processing section (or thread). The main motivation for further partitioning this thread into multiple threads is better load balance. The execution-time breakdown of the AVSR workload is shown in Figure 10, in which video processing takes about half of the time. To exploit the task-level parallelism of the application on a single processor with Hyper-Threading technology, or on a dual-processor system without it, the workload can be balanced well by placing the video-processing thread on one processor and the rest on the other. On a dual-processor system with Hyper-Threading technology, however, pure functional decomposition cannot balance the load, because video processing alone takes ~50% of the total execution time. We therefore further multithread the matrix/vector dot products and the Fourier transform, as shown in Figure 11. Thus, as shown in Figure 12, we evaluate a total of three threading schemes in our experiment to assess the exploitation of static nested parallelism supported by the Intel compiler and OpenMP runtime library.
- "Hyper-Threading Technology for Multimedia Apps, Page 1"