Hyperthreading Technology and Digital Multimedia

Guest post by Roger Finger 2003-03-04 Intel 32 Comments

Digital media applications are unique in that they can generally consume all the performance they can get. Unlike other tasks that execute in a few seconds, the rendering of stills, audio and video can take several minutes or even hours. Applications in the digital media space can translate increases in performance to increases in end-user productivity, and it is therefore beneficial for them to take advantage of the latest platform technologies.

Overview

The Pentium® 4 Processor with Hyper-Threading Technology delivers performance and architectural features that dramatically reduce the overall processing time and improve the responsiveness of the system. Processors equipped with Hyper-Threading Technology have multiple logical CPUs per physical package. The state information necessary to support each logical processor is replicated, while the underlying physical processor resources are shared or partitioned. Multiple threads running in parallel can achieve higher processor utilization and increased throughput.

In Section 1, Video Production is used as an example of application workflow to show how Hyperthreading Technology benefits digital media production. Each of the four major steps of video production is examined in detail.

In Section 2, the multi-tasking characteristics of a system with HT technology are considered. When multiple applications are running on a system, HT technology helps reduce stalls and task switching delays caused by the interaction of two or more independent programs.

Section 3 discusses some software and system level design considerations for optimizing multi-threaded applications in a multitasking environment. In this section we’ll look at how application developers can use the Intel compilers to develop optimized code and then use the VTune® Performance Analyzer to identify hotspots and optimize the code.

Section 1: Video Production Case Study

Video production is a complex multi-step process that often involves using multiple programs to achieve the desired output. There are four major steps to this process:

Acquire: Capture movies and pictures, capture audio
Build/Edit: Edit, mix, preview, store your project
Render: Apply compression and format the file
Output: Store the end result on hard drive, or burn to disk

Acquire

Digital Video Cameras connect to the PC using IEEE 1394 (Firewire), USB, or through an analog connection. They transmit at a fixed rate of 25 or 30 frames per second (depending on format: PAL or NTSC), so the capture step can never go faster than the actual play time of the video. Five minutes of video takes five minutes to capture.

The data rates are high (about 4 MB per second) and the PC has to keep up with the source, or frames will be dropped. Dropped frames degrade the quality of the video, so most software packages warn you not to do anything else on your system while capture is under way.

On systems without Hyper-Threading and on slower systems, that’s still good advice. But with HT technology, the multi-tasking capability of the system is enhanced: a background task is less likely to get pre-empted by other programs, so the user can continue using the PC for other activities. Video capture is not a CPU intensive activity – it typically consumes less than 15% of a 3.06 GHz Pentium® 4 processor (Figure 1). Why not allow the end user to use that time for something else?

Figure 1: DV Capture from IEEE 1394

Some capture applications simultaneously encode the incoming Digital Video stream into Windows Media or MPEG formats. The advantages are that smaller files are created, and the media is in the desired output format early in the process. The disadvantage is that some quality will be lost early in the production cycle.

Figure 2 shows DV capture from IEEE 1394, with encoding to MPEG2 for the output. Capture time is still the rate-limiting step, but the CPU is kept very busy with the encoding task. Completion times with and without HT technology were approximately equal, but with HT technology there is more CPU capability available for other tasks to use. This translates into faster UI responsiveness, even under heavy multi-tasking loads.

Figure 2: DV Capture from IEEE 1394 with MPEG 2 Encoding

It takes a lot of multi-tasking activity during video capture to cause frame drops on Hyper-Threading systems. The only caveat is to watch out for disk conflicts. The I/O during DV capture places a continuous load of about 2-3% on the hard disk. The data rates are not that high, but streaming must be maintained. If these disk updates cannot happen in real time, frames can be dropped. Application developers can avoid some of these problems by locking I/O resources during critical real-time operations – but should do so with the full understanding that other applications may stall as a result.

Build/Edit

During the editing phase of production, audio, video, and stills are mixed together from various sources. During video preview, decoders for MP3, MPEG2, AVI, and other formats will be running. Individually, these decoders do not demand high CPU utilization. Playback is very smooth – until you throw in additional audio tracks, transitions and special effects where multiple codecs and filters must run simultaneously. The more complex transitions can be very CPU intensive and usually involve decoding two or more media streams at the same time. In Figure 3, the peaks that are seen every 10 seconds are video transitions.

Figure 3: Video Preview with Transitions

Render

Rendering involves taking the edit decision list and creating a video file on the hard disk. Rendering is very CPU intensive – it can use all the performance you can throw at it and scales well with faster processors. Audio and video encoders run simultaneously during rendering, so this step is well suited to threading and parallelism.

Figure 4: Video Encoding to MPEG2

With the speed of a 3 GHz Pentium® 4 processor and HT technology, it is now possible to encode full resolution NTSC video faster than real-time! In Figure 4 the source was a 180 second DV video. Without HT technology, the video was encoded in 136 seconds. With HT technology enabled, the time to encode MPEG2 decreased to 111 seconds. For a one-hour video project, the encode time was about 37 minutes.
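The arithmetic behind these figures can be checked with a short sketch. It uses only the three timings quoted for Figure 4 and assumes that encode time scales linearly with the length of the source, which is an assumption made here for illustration rather than something the measurements establish.

```python
# Timings quoted in the article for Figure 4.
SOURCE_SECONDS = 180   # length of the DV source clip
ENCODE_NO_HT = 136     # encode time without HT technology
ENCODE_HT = 111        # encode time with HT technology


def realtime_factor(source_seconds, encode_seconds):
    """Seconds of video encoded per second of wall-clock time (>1 is faster than real time)."""
    return source_seconds / encode_seconds


def projected_minutes(project_minutes, source_seconds, encode_seconds):
    """Scale a measured encode time to a longer project, assuming linear scaling."""
    return project_minutes * encode_seconds / source_seconds


if __name__ == "__main__":
    print(f"HT off: {realtime_factor(SOURCE_SECONDS, ENCODE_NO_HT):.2f}x real time")
    print(f"HT on:  {realtime_factor(SOURCE_SECONDS, ENCODE_HT):.2f}x real time")
    print(f"HT speedup: {ENCODE_NO_HT / ENCODE_HT:.2f}x")
    print(f"One-hour project, HT on: {projected_minutes(60, SOURCE_SECONDS, ENCODE_HT):.0f} minutes")
```

Under that linear assumption, 111 seconds for a 180 second clip corresponds to the 37 minutes quoted for a one-hour project.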

One surprising difference between systems with and without Hyper-Threading is the responsiveness of the User Interface. New tasks launch right away and the cursor is rarely in an hourglass. The encoding task is spread across both logical processors, and there is plenty of headroom for other applications to run. Without Hyper-Threading technology, Figure 4 shows the CPU is 70% consumed with the video encode task – leaving limited resources for other programs.

Output to media

After video encoding is complete, the next part of the process is to create a disk image and write it to CD or DVD so that it can be distributed and played back on a DVD player connected to a television. This step actually has two major phases, each with many sub-steps that utilize different parts of the system.

In the first phase the video and audio files are converted to the proper format. Depending on the playback target, the format may be MPEG2 (for consumer DVD players), MPEG4 (for posting on the web), or VCD (a lower resolution format for writable CDs). In the following example, the output media will be assumed to be high quality DVD-compatible 720×480 at 30 frames per second. Figure 5 shows the two phases of the Output cycle for writing a DVD.

Figure 5: Output to DVD

Phase 1 involves transcoding or re-encoding the audio and video streams into a compatible format. This is a CPU intensive process, and hard disk activity is also high as files get read in, modified, and then written back out to disk. Without Hyper-Threading technology, the CPU is 100% consumed and the encoding phase takes about three times longer.

Phase 2 consists of file operations to prepare the image for burning and then write it to the media. It is not CPU intensive since the rate-limiting step is the CD or DVD burner. Multitasking during optical disk writing can be risky. On older systems there is an event known as a “Buffer Under-run” that can occur if the CPU is not able to produce data fast enough to keep the disk writer’s buffer filled.

This problem has been largely overcome as the new drives have larger input buffers (typically 2 MB) and there are now protection mechanisms such as “Burn Proof” technology that ensure the disk will get written properly. Most DVD writers do have buffer under-run protection, but the drives are slower (typically 2x) and DVD writing can take up to an hour. Your system is still available, but you should avoid operations involving heavy disk activity.

Section 2: Multitasking Software in a Hyperthreaded Environment

Figure 6: Video Encoding while Multitasking

Hyper Threading technology is enabled by the multi-processing support in the Windows® XP and Windows® 2000 operating systems. Figure 6 shows that HT technology improved the execution times of some common digital media activities – even while the system was under heavy load during a video encoding session.

Hyper Threading does not guarantee that your application will run faster. To benefit from hyperthreading, programs need to have executable sections that can run in parallel. Threading improves the granularity of an application so that operations can be broken up into smaller units whose execution is scheduled and controlled by the operating system. Now two threads can run independently of each other without requiring task switches to get at the resources of the processor.

Figure 7 shows a comparison of code executing on a CPU with HT On versus HT Off. When HT is off, a process can stall while waiting for I/O to complete or another task to provide information it needs. The CPU is blocked from further execution. When HT is on, other threads continue running so the system does not hang or stall.

Figure 7: Separate data paths enable the CPU to continue working on other threads, even when one becomes blocked

How the Windows® Operating System handles Multi-threading and Multitasking

Multitasking occurs at the user interface level every time a user runs multiple programs at once. Some applications also perform multitasking internally by creating multiple processes. Each process is given a time-slice during which it executes. Creation of a process involves the creation of an address space and the application’s image in memory, which includes a code section, a data section and a stack. Parallel programming using processes requires the creation of two or more processes and an inter-process communication mechanism to coordinate the parallel work.

Threads are tasks that run independently of one another within the context of a process. A thread shares code and data with the parent process but has its own unique stack and architectural state that includes an instruction pointer. Threads require fewer system resources than processes. Intra-process communication is significantly cheaper in CPU cycles than inter-process communication.
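The sharing model described here can be sketched in a few lines. The example below is illustrative only: the function names are made up for this sketch, and in CPython the global interpreter lock means these threads will not speed up a CPU-bound sum. The point is only that each thread keeps its own local variables on its own stack, while all threads see the same process-level data.

```python
import threading

shared_totals = []               # process-level data, visible to every thread
totals_lock = threading.Lock()   # guards updates to the shared list


def sum_range(start, stop):
    subtotal = 0                 # local variable, private to this thread's stack
    for n in range(start, stop):
        subtotal += n
    with totals_lock:            # cheap intra-process communication
        shared_totals.append(subtotal)


def parallel_sum(limit, workers=2):
    """Sum 0..limit-1 by splitting the range across worker threads."""
    shared_totals.clear()
    step = limit // workers
    threads = []
    for i in range(workers):
        start = i * step
        stop = limit if i == workers - 1 else start + step
        threads.append(threading.Thread(target=sum_range, args=(start, stop)))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(shared_totals)
```

No pipes, sockets, or shared-memory segments are needed here, which is the practical meaning of threads sharing data with the parent process.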

The life cycle of a thread begins when the application assigns a thread pool and creates a thread from the pool. When invoked, the thread gets scheduled by the Windows® XP operating system according to a priority-based, round-robin mechanism: the ready thread with the highest priority gets to run next, and threads of equal priority take turns.

When the thread is scheduled the Operating System checks to see which logical processors are available, then allocates the necessary resources to execute the thread. Each time a thread is dispatched, resources are replicated, divided, or shared to execute the additional threads. When a thread finishes, the operating system idles the unused processor, freeing the resources associated with that thread.

Section 3: Software Design Considerations for Hyperthreading:

In a processor with HT technology, software developers should be aware that architectural state is the only resource that is replicated. All other resources are either shared or partitioned between logical processors. This introduces the issue of resource contention, which can degrade performance, or in the extreme case – cause an application to fail. Synchronization between threads is another area where problems can arise. The following section contains a brief discussion of some of the most common issues in multi-threaded software design. For more complete information, a collection of technical papers is available at the Intel Developer Services website:

www.intel.com/ids

Thread Synchronization:

Synchronization is used in threaded programs to prevent race conditions (e.g., multiple threads simultaneously updating the same global variable). A spin-wait loop is a common technique used to wait for the availability of a variable or I/O resource.

Consider the case of a master thread that needs to know when a disk write has completed. The master thread and the disk write thread share a synchronization variable in memory. When this variable gets written while the master thread is spinning on it, the processor can detect a memory order violation, which forces a performance penalty. Inserting a PAUSE instruction in the master thread’s read loop can greatly reduce memory order violations.

Spin-wait loops consume execution resources while they are cycling. If other tasks are waiting to run, the thread performing the spin lock can insert a call to Sleep(0) which releases the CPU. If no tasks are waiting, this thread immediately continues execution.

Another alternative to long spin-wait loops is to replace the loop with a thread-blocking API, such as WaitForMultipleObjects. Using this system call ensures that the thread will not consume resources until the listed objects are signaled as ready (either all of them or any one of them, depending on the bWaitAll argument).
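The difference between spinning and blocking can be sketched in portable Python. This is an analogy, not the Win32 code path: threading.Event stands in for a waitable object, time.sleep(0) is only loosely comparable to Sleep(0), and there is no Python equivalent of the PAUSE instruction. All function names below are hypothetical.

```python
import threading
import time


def simulated_disk_write(state, done_event, seconds=0.05):
    """Stand-in for the disk write thread: do some work, then publish completion."""
    time.sleep(seconds)       # pretend to write data
    state["done"] = True      # shared synchronization variable
    done_event.set()          # wake any thread blocked in wait()


def wait_by_spinning(state):
    """Spin-wait on the shared variable, yielding the CPU on every pass."""
    spins = 0
    while not state["done"]:
        spins += 1
        time.sleep(0)         # give other runnable threads a chance
    return spins


def wait_by_blocking(done_event, timeout=5.0):
    """Block until signaled; returns False if the timeout expires first."""
    return done_event.wait(timeout)


def run_demo(use_blocking):
    state = {"done": False}
    done_event = threading.Event()
    writer = threading.Thread(target=simulated_disk_write, args=(state, done_event))
    writer.start()
    result = wait_by_blocking(done_event) if use_blocking else wait_by_spinning(state)
    writer.join()
    return result
```

The spinning version returns the number of passes it burned while waiting, while the blocking version consumes no CPU time until the writer signals.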

Avoid 64K aliasing in L1 Cache:

The first level data cache (L1) is a shared resource on HT technology processors. Cache lines are mapped on 64KB boundaries, so if two virtual memory addresses are a multiple of 64KB apart, they will conflict for the same L1 cache line. Under Microsoft Windows® operating systems, thread stacks are created on megabyte boundaries, and 64K aliasing can occur when these threads access local variables on their stacks. A simple solution is to offset the starting stack address by a variable amount using the _alloca function.
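The aliasing condition is simple modular arithmetic, which the following sketch makes explicit. The base address and the 1 KB per-thread stagger are hypothetical values chosen for illustration, and the model ignores details such as stacks growing downward.

```python
ALIAS_STRIDE = 64 * 1024       # addresses this far apart map to the same L1 line
STACK_SPACING = 1024 * 1024    # thread stacks placed on 1 MB boundaries
STACK_BASE = 0x10000000        # hypothetical address of the first thread's stack


def aliases_64k(addr_a, addr_b):
    """True if two distinct addresses are a multiple of 64 KB apart."""
    return addr_a != addr_b and (addr_a - addr_b) % ALIAS_STRIDE == 0


def local_variable_address(thread_index, local_offset, stagger=0):
    """Model the address of the same local variable in different threads.

    stagger is the per-thread stack offset, i.e. the effect of reserving
    thread_index * stagger bytes at the top of each thread's stack.
    """
    return STACK_BASE + thread_index * STACK_SPACING + local_offset + thread_index * stagger


# Same local variable in two threads, no stagger: the addresses alias.
unstaggered = aliases_64k(local_variable_address(0, 128), local_variable_address(1, 128))

# With a 1 KB per-thread stagger the same variable no longer aliases.
staggered = aliases_64k(local_variable_address(0, 128, 1024),
                        local_variable_address(1, 128, 1024))
```

Because 1 MB is itself a multiple of 64 KB, identical stack layouts on megabyte boundaries always collide; the stagger only helps as long as the accumulated offset between two threads is not itself a multiple of 64 KB.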

False Sharing in the Data Cache

A cache line in the Pentium® 4 processor consists of 64 bytes for write operations, and 128 bytes for reads. False sharing occurs when two threads access different data elements in the same cache line. When one of those threads performs a write operation, the cache line is invalidated, causing the second thread to have to fetch the cache line (128 bytes) again from memory. If this occurs frequently, false sharing can seriously degrade the performance of an application.

False sharing can be diagnosed using the VTune Performance Analyzer to monitor the “machine clear caused by other thread” counter. Some techniques to avoid False Sharing include partitioning data structures, creating a local copy of the data structure for each thread, or padding data structures so they are twice the size of a read cache line.
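Whether two per-thread data elements can falsely share a line depends only on their byte offsets, so the padding rule can be checked with integer arithmetic. The sketch below uses the 128-byte read line size given above and pads each element to two lines, following the guideline in the text; the helper names are hypothetical.

```python
READ_LINE = 128    # bytes fetched on a read, per the article
WRITE_LINE = 64    # bytes per line for write operations, per the article


def same_line(offset_a, offset_b, line_size=READ_LINE):
    """True if two byte offsets fall inside the same cache line."""
    return offset_a // line_size == offset_b // line_size


def padded_stride(element_size, line_size=READ_LINE, pad_lines=2):
    """Spacing between per-thread elements after padding.

    Each element is padded to at least pad_lines cache lines, or to the
    next whole number of lines if the element is larger than that.
    """
    minimum = pad_lines * line_size
    rounded = -(-element_size // line_size) * line_size   # round up to a line multiple
    return max(minimum, rounded)


def element_offsets(count, element_size):
    """Byte offset of each thread's element in a padded array."""
    stride = padded_stride(element_size)
    return [i * stride for i in range(count)]
```

For example, two adjacent 8-byte counters at offsets 0 and 8 share a line, while element_offsets(2, 8) places them 256 bytes apart, so they land in different lines even if the array itself does not start on a line boundary.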

Write Combining Buffers

The Intel NetBurst™ architecture has 6 write-combining (WC) store buffers, each buffering one cache line. The WC buffers allow code execution to proceed by combining multiple write operations before they get written back to memory through the L1 or L2 caches. If an application is writing to more than 4 cache lines at the same time, the WC store buffers will begin to be flushed to the second level cache. This is done to help ensure that a WC store buffer is ready to combine data for writes to a new cache line.

To take advantage of the Write Combining buffers, an application should not write to more than 4 distinct addresses or arrays inside an inner loop. On Hyper-Threading enabled processors, the WC store buffers are a shared resource; therefore, the total number of simultaneous writes by both threads running on the two logical processors must be considered. If data is being written inside of a loop, it is best to split inner loop code into multiple inner loops, each of which writes no more than two regions of memory.
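The loop-splitting transformation looks like this in outline. The sketch is schematic: Python lists do not go through write-combining buffers the way native arrays do, so it shows only the shape of the change, not its performance effect. The function names and the four output arrays are made up for illustration.

```python
def fill_fused(n):
    """One inner loop writing to four output arrays at once."""
    a, b, c, d = [0] * n, [0] * n, [0] * n, [0] * n
    for i in range(n):
        a[i] = i
        b[i] = 2 * i
        c[i] = i * i
        d[i] = -i
    return a, b, c, d


def fill_split(n):
    """The same work split into two loops, each writing two arrays."""
    a, b, c, d = [0] * n, [0] * n, [0] * n, [0] * n
    for i in range(n):      # first loop: two write streams
        a[i] = i
        b[i] = 2 * i
    for i in range(n):      # second loop: the other two write streams
        c[i] = i * i
        d[i] = -i
    return a, b, c, d
```

Both versions produce identical results. In native code, the split form keeps each thread to two write streams, so two threads on the two logical processors together stay within the four-stream guideline.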

Cache Blocking Techniques

Cache blocking involves structuring data blocks so that they conveniently fit into a portion of the L1 or L2 cache. By controlling data cache locality, an application can minimize performance delays due to memory bus access. This is accomplished by dividing a large array into smaller blocks of memory (tiles) so that a thread can make repeated accesses to that data while it is still in cache. For example, image processing and video applications are well suited to cache blocking techniques because an image can be processed in smaller portions of the total image or video frame. Compilers often use the same technique, by grouping related blocks of instructions close together so they execute from the L2 cache.

The effectiveness of the cache blocking technique is highly dependent on data block size, processor cache size, and the number of times the data is reused. Cache sizes vary based on processor. An application can detect the data cache size using Intel’s CPUID instruction and dynamically adjust cache blocking tile sizes to maximize performance. As a general rule, cache block sizes should target approximately one-half to three-quarters the size of the physical cache for non-Hyper-Threading processors and one-quarter to one-half the physical cache size for a Hyper-Threading enabled processor supporting two logical processors.
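The sizing rule and the tiled traversal can be sketched as follows. The cache size is passed in as a parameter because querying CPUID is outside the scope of this sketch, and the image is modeled as a list of rows with one integer per pixel. The function names and the brighten operation are hypothetical.

```python
def tile_budget_bytes(cache_bytes, hyper_threading):
    """Return the (low, high) target block size from the rule of thumb above."""
    if hyper_threading:
        return cache_bytes // 4, cache_bytes // 2
    return cache_bytes // 2, cache_bytes * 3 // 4


def tiles(width, height, tile_w, tile_h):
    """Yield (x, y, w, h) rectangles that cover the image tile by tile."""
    for y in range(0, height, tile_h):
        for x in range(0, width, tile_w):
            yield x, y, min(tile_w, width - x), min(tile_h, height - y)


def brighten_tiled(image, tile_w, tile_h, passes=2):
    """Run every pass over one tile before moving on, so the tile is reused
    while it is (ideally) still resident in cache."""
    height, width = len(image), len(image[0])
    for x, y, w, h in tiles(width, height, tile_w, tile_h):
        for _ in range(passes):
            for row in range(y, y + h):
                for col in range(x, x + w):
                    image[row][col] = min(255, image[row][col] + 1)
    return image
```

For an assumed 512 KB cache, tile_budget_bytes(512 * 1024, True) gives a target of 128 KB to 256 KB per block, half of what the same cache would allow without Hyper-Threading.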

Adjusting Task Priorities for Background Tasks

In some applications there are background activities that run continuously, but have little impact on the responsiveness of the system. In these cases, consider adjusting the task or thread priority downward so that this code only runs when resources become available from higher priority tasks.

Conversely, if an application requires real-time response, it can increase task priority so that it runs ahead of other normal priority tasks. This technique should be used with caution, since it can degrade the responsiveness of the user interface, and may affect the performance of other applications running on the system.

Load Balancing

On a multiple processor system or on Hyper-Threading enabled processors, load balancing is normally handled by the operating system, which allocates workload to the next available resource. In some cases it is possible that one of the logical CPUs becomes idle, while the other is overloaded. In this case a developer can address load imbalance by setting Processor Affinity.

Processor affinity allows a thread to specify exactly which processor (or processors) the operating system may select when it schedules the thread for execution. When an application specifies the processor affinity for all of its active threads, it can ensure that load imbalance will not occur among its threads and eliminate thread migration from one logical processor to another.
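As a rough illustration of setting affinity from application code, the sketch below uses Python’s os.sched_setaffinity, which exists only on some platforms (notably Linux). It is not the Win32 mechanism; on Windows the corresponding calls are SetThreadAffinityMask and SetProcessAffinityMask. The helper names are hypothetical, and the demo restores the original affinity when it finishes.

```python
import os


def pin_current_process(cpus):
    """Restrict the caller to the given set of CPU indices, where supported.

    Returns the resulting affinity set, or None if this platform's Python
    build does not expose the scheduler affinity calls.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    os.sched_setaffinity(0, cpus)      # 0 means "the caller"
    return os.sched_getaffinity(0)


def demo():
    """Pin to one currently allowed CPU, then restore the original affinity."""
    if not hasattr(os, "sched_getaffinity"):
        return None
    original = os.sched_getaffinity(0)
    target = {min(original)}           # pick a CPU we are already allowed to use
    try:
        result = pin_current_process(target)
        os.sched_setaffinity(0, original)
    except OSError:
        return None                    # the system refused the request
    return result == target
```

As the multitasking discussion later in the article points out, pinning threads assumes the application knows what else is running, so it is best reserved for cases where a measured imbalance exists.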

Simultaneous Fixed and Floating Point Operations

With Hyper-Threading technology, there are several ALUs (for integer logic), but only one shared floating-point unit. If your application uses floating-point calculations, it may be beneficial to isolate those threads and set the processor affinity of the threads to minimize the processor resource contention.

Avoiding Dependence on Timing Loops

Relying on the execution timing between threads as a synchronization technique is not reliable because of speed differences between host systems. Delay loops are sometimes used during initialization as well, and should be avoided for the same reasons.

Software Design Considerations for Multitasking:

Most of the rules above also apply to multitasking, plus there are a few additional considerations. Task switches are much slower than thread context switches because each task operates in its own address space. The state of the previously running task must be saved and data residing in the cache will be invalidated and reloaded.

Hyperthreading enhances multitasking because the state information for each task is stored on a separate logical processor. Cache invalidation will still occur, but the need for a task switch is eliminated since both tasks can run at once. Since cache is a shared resource on Hyper Threading enabled processors, all of the above rules regarding data alignment and blocking still apply.

Contention for resources can be a problem when multitasking. It can occur in memory, on the system busses, or on I/O devices. Consider the case of video capture while creating an MP3 file. Both applications use the hard disk intensively, but video capture has to occur in real-time. The result of contention is that the video drops frames, and the MP3 file skips.

Applications should check the status of an I/O device before attempting to pass data to it. If necessary, peripherals can be locked to avoid access by other applications. This makes sense for a CD or DVD writer, which is essentially a single use device. Locking the hard drive is not recommended, since it is a critical OS resource.

Task and thread priority can have a dramatic effect in a multitasking environment. If priority is raised in a task that runs continuously, other tasks will starve until the high priority task releases the processor. But lowering priority can be good for background processing tasks. Consider the case of a video encoder – which normally takes 100% of the processor. If you lower the priority, the user will still be able to use their computer on demand, but the video encode will still run 100% of the time when the CPU is otherwise available.

Load balancing within applications can actually degrade multi-tasking performance. If one application assumes it has full control of both processors, resource contention can occur when a second application attempts to load. This highlights a fundamental issue with multitasking programs – you never know what other software will be running concurrently with your program. It is usually best to not lock up resources that other programs are likely to need.

Optimized Compilers and Libraries Help Avoid Multi-threading Problems

The best way to design, implement, and tune for Hyper-Threading enabled processors is to start with components or libraries that are thread-safe and designed for Hyper-Threading enabled processors. The operating system and threading libraries are likely to already be optimized for various processors. Use operating system and/or threading synchronization libraries instead of implementing application-specific mechanisms such as spin-waits. Existing applications can take advantage of enhanced code modules by re-linking or through the use of dynamic link libraries.

Intel compilers enable threading by supporting both OpenMP* and auto-parallelization. OpenMP is an industry standard for portable threaded application development, and is effective at threading loop-level parallel problems and function level parallelism. The C++ compiler supports OpenMP API version 1.0 and performs code transformation for shared memory parallel programming.

The Intel® C++ Compiler for Windows with auto-parallelization uses a high-level symmetric multi-processing (SMP) programming model to enable migration to multiprocessing machines (multiple physical CPUs). This option detects which loops are capable of being executed safely in parallel and automatically generates threaded code for these loops. Automatic parallelization relieves the user from having to deal with the low-level details of iteration partitioning, data sharing, thread scheduling and synchronization.

Tools to identify performance bottlenecks

The Intel® VTune™ Performance Analyzer allows you to visualize how software utilizes CPU resources. By seeking out ‘hotspots’ in your code, you can focus on optimizing the sections of code that occupy most of the computation time. VTune enables you to view potential problem areas by memory location, functions, classes, or source files. You can double-click and open the source or assembly view for the hotspot and see more detailed information about the performance of each instruction. The Intel® Thread Checker, which works in conjunction with the Intel VTune™ Performance Analyzer, automates detection of most threading errors, such as:

Deadlocks (detection and prediction)
Memory access issues
Race conditions
Thread stalls (potential deadlocks), waits
Potential and realized data races / dependencies
Improperly synchronized I/O
Invalid threading-library calls, arguments, returns
Threaded calls to non-reentrant routines

Summary and Conclusion

Digital video production is among the most demanding of applications you can run on your computer, yet the Pentium® 4 processor with Hyperthreading technology delivers smooth, real-time performance even while other programs are running. These benefits go beyond simple clock rate.

HT technology improves the availability of the CPU for multi-tasking and background processing. The User Interface is noticeably more responsive under heavy system loads. All of these factors contribute to a favorable, and more productive, end-user experience when using your programs with other digital media tools.

Threaded applications take this one step further by improving parallel execution within a task. Application developers can produce optimized code using the Intel Compiler, and then use the VTune Performance Analyzer to identify code hotspots and bottlenecks as described in this paper.

A substantial collection of whitepapers and application notes is available to provide Hyper-Threading technical help for application developers. For more information, search on Hyper-Threading at the Intel software developer website:

http://www.intel.com/sites/developer/

Software used in this whitepaper:

Pinnacle Studio Version 8
Ulead Movie Producer Version 7
Sonic Solutions MyDVD
Roxio Movie Creator 5
Roxio EZ CD Creator 5.3
Musicmatch 7.0

All brands and trademarks are the property of their respective owners.

A Note about the System Activity Graphs:

The performance graphs in this whitepaper were generated from traces taken with Perfmon – a standard utility program with the Windows® XP and Windows® 2000 operating systems. Counters were set up to monitor both logical CPUs, disk, and network activity. The results were exported as a CSV file and graphed from a spreadsheet.

About the Author:
Roger Finger is a 21-year employee of Intel Corporation, currently working as a Market Segment Specialist in the Software Solutions Group. In this capacity, he works with ISVs to advise them of new architectural features on Intel processors – providing them with tools and resources to enable features such as Hyper-Threading technology in their applications. Since the early 1990s when multimedia first became practical on personal computers, Roger has held positions as applications engineer and product manager for audio/video technologies such as Indeo(tm), ProShare(tm), and the Intel Pocket Concert(tm) MP3 player. “As an amateur videographer, I’m incredibly excited about the tools that are now available to consumers. The cameras are affordable, the software has become much easier to use, and with writable DVD, the PC is an ideal editing and mastering platform for creative expression.”

32 Comments

2003-03-04 7:36 pm

Anonymous
How about aiming for quieter and cooler systems rather than going for the speed and paying the price of having to live in the same room with a machine that makes as much noise as a jet engine?

Plus I still cant see how having one CPU emulating two CPUs can make much difference? Sure, latency would drop a bit but for the same money or even for less you could get a real dual CPU system.

How about comparing the new Pentium with HT against one of those real dual CPU systems?

Oh and no offence but is this a review or an advertisement?

Hmm… I reached the end of the article. And yes, it is indeed an advertisement
2003-03-04 7:37 pm

Anonymous
I read something a while back about the improvements in performance of the 2.5 smp-kernel when Hyperthreading was turned on. Is it possible to use the smp-kernel on a single-processor machine using HT to fool the kernel into thinking there are more available processors? Are there benefits to this set-up over the straight-up regular kernel?
2003-03-04 7:40 pm

Anonymous
>Oh and no offence but is this a review or an advertisement?

Neither. It is a “paper” about HT, explaining in generic terms what’s up with HT and multimedia. Advertisements are paid, and we certainly weren’t paid for this article (nor did we pay for it). In fact, we have one more HT article to publish next week. And we are looking for articles about P4 optimizations, which are of interest to our developer readers.
2003-03-04 7:43 pm

Anonymous
How about some countering criticism to bring out the slightly bad/worse sides of Intel’s new puppy? I heard it lost to a 1.6ghz AMD CPU in a Tom’s Hardware test once.
2003-03-04 8:16 pm

Anonymous
pnut, hyperthreading shows up to the os as 2*physical processors. To take advantage, you have to have an smp kernel, but it helps if the scheduler is ht aware mostly because of cache reasons. Im not sure how much has been integrated back to 2.4, but a lot has been going on in linux 2.5 with ht scheduling, and freebsd is working on it also. I’ve heard a possible 30% speed up from HT, but I havent done any benchmarks myself since i’m still running on a P3 800 and an Athlon 1ghz just fine.
2003-03-04 8:20 pm

Anonymous
True, I read the same review (or part of it), but you forgot to mention the review said that the AMD processor did outperform it on some tests but was totally smoked in (most) other areas.
2003-03-04 8:25 pm

Anonymous
>How about aiming for quieter and cooler systems […]

Well Intel works also on low power CPU: nothing prevents you to buy one, or to buy a liquid-cooled computer.

It’s about choice, YOUR choice: computers makers provides both possibilities.

>Plus I still cant see how having one CPU emulating two CPUs can make much difference?

Think about it as the next step above normal superscalar CPU: you want to fill all those rarely used execution units by executing several threads at the same time.

Simple economics tells me that SMT CPUs will be much, much cheaper than an SMP setup: much less silicon used, a much simpler motherboard. And most of all: many more single-CPU computers are sold than SMP computers.
2003-03-04 8:56 pm

Anonymous
Great article. Will code written explicitly for hyperthreading processors work on older processors?
2003-03-04 9:23 pm

Anonymous
The whole article looks like it was made by copy & pasting some marketing departments press releases, technical errors included.

If a few choice words were replaced in about every other paragraph, the article would be fine.

Nice to see some of the results with pretty graphs though.
2003-03-04 9:28 pm

Anonymous
Given that 4 processor systems are not generally available without a bank loan, what about a pair of these in a dual system? Any boards support it yet?

Just a thought,

David Stidolph,

Austin, TX
2003-03-04 9:47 pm

Anonymous
Multi-threaded code runs just fine on older processors, though a stall in one thread will block other threads that are attempting to run. Many applications are already multi-threaded and can take advantage of HT technology without modification. For developers, the existing Microsoft threading API’s are all you need to take advantage of this new feature.
2003-03-04 10:00 pm

Anonymous
Alright, the guy’s picture looks anything but 21..maybe 31..but probably 41. Especially considering his experience
2003-03-04 10:04 pm

Anonymous
Although this is not an advertisement and there is no attempt to conceal the fact that it’s by an Intel employee, I find including this paper on a supposedly neutral magazine-style site a bit of a shame. It’s like reading an article about the benefits of quattro four wheel drive in a car magazine that’s written by an employee of Audi. That wouldn’t happen: a journalist would write it. Editorial content here should likewise be written by independent people. Anything else just undermines the credibility of the site.
2003-03-04 10:23 pm

Anonymous
Elver Loho:

While your computer is doing one thing, most of your CPU is going unused. The point of HT is to try and use as much of the processor as possible.

If you want a good technical article on HT, I suggest Ars Technica; I just don’t feel like looking it up, sorry.

pnut:

http://www.kerneltrap.com has some good summaries of Linux discussions on HT. Like someone said, it shows up as 2 processors. A good example of this is Windows. In XP, it shows as 2 logical processors, but in 2k, it shows as 2 physical processors. I cannot remember what it is, maybe cache or programs migrating between processors, but an OS that has a special SMP scheduler for HT can be faster than one that doesn’t. Linus has not merged the HT-specific optimizations. Instead he merged some NUMA (*drool*) stuff, and he believes the same principles work for HT, so it will be an even better solution with less chance of bugs.
2003-03-04 10:37 pm
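
To make the logical versus physical distinction concrete, here is a small sketch that counts both from Linux `/proc/cpuinfo`-style text. It assumes the kernel reports the `processor` and `physical id` fields, which not every kernel version or architecture does; `count_processors` and the sample text are made up for illustration.

```python
def count_processors(cpuinfo_text: str):
    """Return (logical_cpus, physical_packages) from /proc/cpuinfo-style text."""
    logical = 0
    packages = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines between processor entries
        key, value = key.strip(), value.strip()
        if key == "processor":
            logical += 1
        elif key == "physical id":
            packages.add(value)
    # If "physical id" is not reported, fall back to treating every
    # logical CPU as its own package.
    return logical, (len(packages) or logical)


# Hypothetical excerpt for a dual-CPU machine with Hyper-Threading enabled.
SAMPLE = """\
processor : 0
physical id : 0

processor : 1
physical id : 0

processor : 2
physical id : 1

processor : 3
physical id : 1
"""

if __name__ == "__main__":
    print(count_processors(SAMPLE))  # (4, 2)
```

On a real Linux machine the text would come from reading `/proc/cpuinfo`; a scheduler that knows which logical CPUs share a package has the information it needs to keep related threads together.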

Anonymous
So it looks like hardware is taking over the concepts of software to a new degree..

Why don’t we just make better software from the beginning? Threading is almost completely a software issue (and a smart one), so why reinvent a dual-chip as one instead of making the best use of cycles in the first place? I think this greatly overcomplicates the CPU internals (and opens oh-so-many cans of code – er worms..). Athlon chips already run with up to 9 internal RISC processors with an x86 face – no special rewrites needed. I’m sorry, but if your code can’t make the best of that (running @ N*GHz), you might rather invest in some programming lessons instead of a newly-hatched schizophrenic CPU!

Sorry, could you tell – I’m a Be user. Latency? I guess somebody must be having some latency issues….

If all it takes is an SMP kernel, an OpenBeOS implementation would blow everyone(OS) else out of the water.
2003-03-04 10:49 pm

Anonymous
On Friday I installed Red Hat 8.0 on a brand new Dell PE2650 server: dual Xeon 2.4GHz (HT capable), 2GB RAM. After booting, top actually reported 4 CPUs... but... 16MB free! Just a plain RH8 without any application running was using 2GB of RAM! 1.6GB was used in buffers, and there were tons of strange errors in syslog, apparently some context-switching problems. I switched HT off, and it works fine now; I mean, with no application running I have around 1700MB free.
2003-03-04 11:22 pm

Anonymous
I still haven’t encoded any MP3s, let alone a video, so Intel still has a way to go to persuade me to up and buy one, HT or not. I am still inclined to go dual MP; at least I can count on real 2x speed for some threaded apps, or at least responsiveness, but I can also count on 2x heat and noise and a very limited choice of mobo/case without the latest and greatest built-in features (USB2, FW, SATA etc). The Tom’s Hardware article tells me it will only double my 1GHz Athlon speed most of the time without HT, sometimes 3x.

Anyway, I think Intel understands this and is concentrating on the lower end with more and more integrated systems, which can more than satisfy most people’s needs.
2003-03-04 11:38 pm

Anonymous
At the beginning I thought this too, but I’m sure it means that the guy has worked for Intel for 21 years.
2003-03-05 12:31 am

Anonymous
My brother has a cheap Dell Server with 1 P4 Xeon HT. He just needs to add another P4 Xeon HT and voilà!

And he also needs Win2K server but that’s another story
2003-03-05 12:36 am

Anonymous
Why don’t desktop manufacturers provide fast dedicated multimedia processors instead? Realtime video is possible on a reasonably modest CPU (e.g. 1.8 GHz) if you have a hardware MPEG encoder.

A 3 GHz processor is still too slow (hot and noisy) for realtime video editing with software.

Current realtime encoding solutions are still too expensive for most home users.
2003-03-05 12:44 am

Anonymous
eheh@RH8 taking all but 16 megs of 2 gigs of RAM. You seem to be reading the results slightly wrong. Linux (and Windows as well, I’d suppose) uses RAM to cache parts of your hard drive for faster access. So instead of giving all of the RAM to the apps right off the bat and having slower hard drive access, it buffers up a bunch, and when apps allocate memory, they are favoured and less stuff is cached. If you want to know how much RAM the apps are actually taking, read the “used” line from the output of “free”. If you only had 16 megs free you’d swap in a hurry, but that’s not the case, or at least I hope it’s not, or else I’ll have another reason to be biased against Red Hat.
2003-03-05 1:04 am

Anonymous
Notice the wording of the article, it says he is a 21 year employee not a 21 year old employee. The term 21 year employee implies that he’s been there for 21 years.
2003-03-05 1:20 am

Anonymous
>eheh@RH8 taking all but 16 megs of 2 gigs of RAM. You seem to be reading the results slightly wrong. Linux (and Windows as well, I’d suppose) uses RAM to cache parts of your hard drive for faster access. So instead of giving all of the RAM to the apps right off the bat and having slower hard drive access, it buffers up a bunch, and when apps allocate memory, they are favoured and less stuff is cached. If you want to know how much RAM the apps are actually taking, read the “used” line from the output of “free”. If you only had 16 megs free you’d swap in a hurry, but that’s not the case, or at least I hope it’s not, or else I’ll have another reason to be biased against Red Hat.

Used + Free memory equals total available memory. So no – he WAS reading the output of ‘top’ correctly. He never mentions what kernel he was using … not having SMP could be the problem
2003-03-05 1:40 am
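
The disagreement above comes down to arithmetic, which a short sketch may make clearer. The figures are rounded from the Red Hat report earlier in the thread (2GB total, 16MB free, 1.6GB in buffers); the page cache size was not given, so it is assumed to be zero here, and `memory_summary` is a hypothetical helper.

```python
def memory_summary(total_mb, free_mb, buffers_mb, cached_mb):
    """Split 'used' memory into application use and reclaimable cache."""
    used_mb = total_mb - free_mb              # what top and free report as "used"
    reclaimable_mb = buffers_mb + cached_mb   # handed back when apps need memory
    return {
        "used": used_mb,
        "apps": used_mb - reclaimable_mb,
        "available": free_mb + reclaimable_mb,
    }


if __name__ == "__main__":
    # Rounded figures from the thread; cached_mb=0 is an assumption.
    print(memory_summary(total_mb=2048, free_mb=16, buffers_mb=1600, cached_mb=0))
    # {'used': 2032, 'apps': 432, 'available': 1616}
```

Both readings agree on "used"; they differ on whether buffers and cache should count against applications. Older versions of `free` print that split on a "-/+ buffers/cache" row.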

Anonymous
Jacob

Actually HW should take over the SW role in fine grained processing. The more cpu resources there are on a chip, the more likely they will be idle and warming your house without HT. With HT at least they can do more work more often to justify the heat.

Think how much sand can flow through an hourglass. Smaller grains flow through the narrow hole faster. Bigger grains block or even stick. A bigger grain compares to a memory op that isn’t in cache.

You say it complicates the design of the CPU chip; how would you know? It can actually dramatically simplify the design, since a whole slew of other complexities can be thrown away; betting everything on HT will clean up the design. I won’t be including much of that junk (prediction and speculation, out-of-order logic) that was previously the tech of the day in my project. The only reason HT didn’t take off earlier is that SW folks have been avoiding PAR programming and forcing Intel to fix up the clock. Well, that doesn’t work so well; threading will eventually lead to more real, simpler CPUs on a chip instead of one uber-fast monster design.

HT has been around for at least 20 years; only now are most folks getting exposed to it. If HT were done really well, it wouldn’t be limited to 2 or 4 threads but would be open to any number. In addition, the threads should be able to communicate, synchronize and pass messages with each other; then it would look like a modern Transputer.

Eventually the OS will fine tune the scheduling so that cooperating threads of each program will share the HT threads. If HT is used to timeshare a cpu over many single threaded apps, then the cache will also be shared between them and that will slow things down.

Also, the Athlon may have umpteen internal ALUs or units, but it certainly doesn’t have 9 internal processors. Try writing some C code for a benchmark, optimize it, look at the asm output, and measure the number of opcodes that must have been executed per second. Guess what, it’s closer to your clock speed, i.e. you only get about one op per cycle on random memory-intensive apps. Keep everything in cache, and it can get a bit better.

Even a BeOS user should see some benefit, but I would expect it to be much less than the 30% claimed by Intel.

JJ
2003-03-05 2:40 am
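
The measurement JJ describes reduces to one division. The sketch below assumes you already have an instruction count and a wall-clock time for the benchmark; the function names and the sample numbers are hypothetical, not measurements.

```python
def instructions_per_cycle(instructions, seconds, clock_hz):
    """Average instructions executed per clock cycle over a timed run."""
    if seconds <= 0 or clock_hz <= 0:
        raise ValueError("seconds and clock_hz must be positive")
    return instructions / (seconds * clock_hz)


def fraction_of_peak(ipc, units):
    """Share of a hypothetical peak of `units` ops per cycle that was reached."""
    return ipc / units


if __name__ == "__main__":
    # Hypothetical run: 1.2e9 instructions in 1.0 s on a 1.0 GHz CPU.
    ipc = instructions_per_cycle(1.2e9, 1.0, 1.0e9)
    print(f"IPC: {ipc:.2f}")
    print(f"Share of a 9-wide peak: {fraction_of_peak(ipc, 9):.0%}")
```

An IPC near 1 on a chip with many execution units is exactly the idle capacity that running a second thread is meant to soak up.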

Anonymous
The P4 has a 20-stage pipeline and only 8 registers; thus, it gets a lot of stalls because it guessed wrong on a branch or couldn’t process an instruction because the register hadn’t been updated by the previous instruction yet. To get around this, they just added a second set of registers and a second front end to issue instructions. Now when the front end detects a register conflict or doesn’t want to mispredict the branch, it just turns control over to the other front end.

The bad part is that they can show a 30% speed increase! That means that 30% of the time in a non-HT processor is wasted because of processor stalls (can you say bad design?).

The reason you want the kernel to know about HT instead of treating them as 2 processors is simple. Both front ends share the same L2 cache and memory mapping registers; this means that it’s better to run two threads of the same program on the processor because they share memory and will have better cache hits.
2003-03-05 4:07 am

Anonymous
I personally prefer Intel over AMD but this ‘technical paper’ read like a 4 page advertisement/OSNews endorsement.

A little more ‘balanced’ writing about the functions of HT would have made me actually stick around to read pages 2, 3, and 4 instead of feeling like I was watching an infomercial disguised as a documentary.
2003-03-05 4:21 am

Anonymous
If you have, say, two Xeons with HT, making the computer seem to have four CPUs, can Windows XP Pro or 2000 Pro use all four? If not, are there any plans for a patch or something along those lines?
2003-03-05 5:10 am

Anonymous
Joe P: The bad part is that they can show a 30% speed increase! That means that 30% of the time in a non-HT processor is wasted because of processor stalls (can you say bad design?).

It’s excellent design from a marketing standpoint. While the P4’s pipeline may be too deep and its branch predictor too inaccurate for the combination of both to make for an efficient processor, Intel knows one thing: clock speed sells.

I actually did the calculations, and it turns out that the percentage of CPU cycles wasted by the Pentium 4 is approximately equal to the percentage of branch instructions in the code being executed (it’s just how the figures worked out for the P4; it’s not a general rule or anything).

However, Intel has won the clock speed competition hands down. That’s all that matters to them.

Elver Loho: How about aiming for quieter and cooler systems rather than going for the speed and paying the price of having to live in the same room with a machine that makes as much noise as a jet engine?

I’m writing this from a Dell Precision 4550 workstation. It runs completely silent.
2003-03-05 5:30 am
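
One simple way such a figure could come out is a back-of-the-envelope stall model. This is not the poster's calculation, which was not shown; it assumes a base cost of one cycle per instruction, a misprediction penalty equal to the 20-stage pipeline depth mentioned earlier in the thread, and an illustrative 5% misprediction rate.

```python
def wasted_cycle_fraction(branch_fraction, mispredict_rate,
                          penalty_cycles, base_cpi=1.0):
    """Fraction of all cycles spent recovering from mispredicted branches."""
    # Extra cycles per instruction caused by mispredictions.
    stall_cpi = branch_fraction * mispredict_rate * penalty_cycles
    return stall_cpi / (base_cpi + stall_cpi)


if __name__ == "__main__":
    # Every input here is an assumption chosen for illustration.
    for branches in (0.10, 0.20, 0.30):
        wasted = wasted_cycle_fraction(branches, 0.05, 20)
        print(f"{branches:.0%} branches -> {wasted:.1%} of cycles wasted")
```

When the misprediction rate times the penalty is close to 1, as with these assumed values, the wasted fraction works out to b/(1+b) for a branch fraction b, which stays near b while b is small. With other rates or penalties the two numbers drift apart, which fits the poster's caveat that it is not a general rule.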

Anonymous
To clear up some confusion, please go read this:

http://arstechnica.com/paedia/h/hyperthreading/hyperthreading-1.htm…

If I remember right, HT only required like 10% more transistors.

Jacob Munoz: Even if Athlons had 9 internal independent processors, they could not handle it transparently. You would need multi-threaded code or else it would starve all the other processors. Also, saying HT is making hardware around software is like saying SMP is.

I would imagine BeOS could do some nice stuff with HT due to all of its threading. Anybody know if there are even appropriate patches to get it to run?
2003-03-05 7:39 am

Anonymous
@ Elver: well, your question is telling. Just because the cheapest systems that can be had may be loud due to el cheapo fans does not at all mean that you can’t have quiet systems, even with the strongest CPU. If your system is loud, this only means that _you_ personally opted for the cheap fan and now complain about it. Even these little Shuttle barebones are almost silent…
2003-03-05 3:26 pm

Anonymous
As far as I know HT already works with BeOS. I am looking for a good dual processor board with HT. To BeOS that would look like a four way system.

Does anyone know if Intel plans to take HT or MP further, i.e. a single chip that looks like or is four processors?
2003-03-05 5:27 pm

Anonymous
I would guess 2 logical/physical would be the sweet spot for HT, or else you would get too much time wasted from different threads waiting on one another.

Now I know IBM with their Power4 has 2 procs/die, which I guess has some advantages. I’m not sure if Intel will do that.

I would really like to see IBM add HT soon, especially shortly after they release the 970. My next computer will be in two years; I would love a dual HT IBM 970-like chip, but if I had to, I could drop the dual or HT from my dream. IBM said the 970 is meant for desktop/workstation, so whether Apple even uses it or not, IBM might have someone else in mind. If I can’t get that dream, then just a Dual HT Pentium