Linked by David Adams on Wed 4th Aug 2010 18:28 UTC, submitted by estherschindler
Anyone contemplating a new computer purchase (for personal use or business) is confronted with new (and confusing) hardware choices. Intel and AMD have done their best to differentiate the x86 architecture as much as possible while retaining compatibility between the two CPUs, but the differences between the two are growing. One key differentiator is hyperthreading; Intel does it, AMD does not. This article explains what that really means, with particular attention to the way different server OSes take advantage (or don't). Plenty of meaty tech stuff.
To clear things up a little...
by JPowers on Thu 5th Aug 2010 03:06 UTC

A core is normally:
1 set of registers
1 set of execution paths

x86 is CISC (complex instruction set computer), so modern x86 chips break each CISC instruction down into a series of RISC-like (reduced instruction set computer) micro-instructions and then reorder them for execution. For example, an add that reads one operand from memory is split into a load micro-instruction followed by an add micro-instruction. On each clock, one or more of these micro-instructions may be issued to an available execution path.

The problems arise when data needs to be fetched before the next set of instructions can issue (this causes a bubble/hole in the execution path). The other place where problems happen is conditional branches (the compare can take up to 7 clocks to produce a result before the processor knows which instruction is going to be run next). The data-fetch holes can mostly be fixed by intelligent pre-fetching and are not a big issue on modern CPUs. Conditional branches are normally handled by making a best guess and being ready to roll back to the guess point once it's determined that the guess was wrong.

Intel created HT as a way to work around the conditional-branch and bubble problems. Basically, all HT does is add a 2nd set of registers to the same set of execution units. When a bubble is found, the core issues instructions from the 2nd thread. When it finds a conditional branch, it just switches to the 2nd thread until it knows what the correct path is.

The main problem is that good compiler optimization will reduce the number of bubbles. Thus, the only time the threads switch is on conditional branches; however, this is also something that good compilers try to avoid. This is why HT only gives a 20-30% speed boost; a 2nd core would give you something closer to a 95% boost (you lose some speed to memory bandwidth).
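
As an illustration (a toy C sketch, not anything from the article or from real compiler output; the function names are made up): the kind of data-dependent branch a compiler tries to remove, and one common branch-free rewrite of it.

/* Summing the positive entries of an array.  The first version has a
 * data-dependent conditional branch that the CPU must guess; the second
 * computes the same sum branch-free, so there is nothing to mispredict. */
#include <stddef.h>

long sum_positive_branchy(const int *v, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (v[i] > 0)      /* conditional branch: the CPU guesses */
            sum += v[i];   /* and rolls back if it guessed wrong */
    }
    return sum;
}

long sum_positive_branchless(const int *v, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        /* mask is all ones when v[i] > 0, otherwise all zeros */
        long mask = -(long)(v[i] > 0);
        sum += v[i] & mask;
    }
    return sum;
}

A good optimizer will often turn the first form into something like the second on its own (or use conditional-move instructions), which is exactly why well-compiled code leaves HT fewer stalls to fill.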

HT is cheaper, but its main drawback is that you get primary threads and secondary threads. The primary thread runs at full speed and the secondary one runs at about 25% of the primary's speed. Any application that has been optimized for multi-threading will have issues, since it internally breaks the task at hand up into small units and expects each unit to run at the same speed.

Say I need to process 400 transactions and I have 2 HT cores (4 threads), so I create 4 threads and have each thread process 100 transactions. I can't report the task as finished until all 4 threads end; thus, once the threads are started I have to wait for them all to end before continuing. My actual run time would be the time to process 100 transactions (the 2 primary threads) + the time to process 75 transactions (25% of them were processed during the 1st time period, and I'm assuming that the secondary threads become primary threads once the 2 primary threads end). On a 4-core machine the actual run time would be just the time to process 100 transactions.
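
Here's a minimal sketch of that fork/join pattern (assuming POSIX threads; process_transaction() and the counts are placeholders I made up):

/* 400 transactions split evenly across 4 threads, one per logical
 * (HT) processor.  The caller cannot report completion until every
 * thread has joined, so total time is set by the slowest thread. */
#include <pthread.h>
#include <stdio.h>

#define THREADS 4
#define PER_THREAD 100   /* 400 transactions / 4 threads */

static void process_transaction(int id) {
    (void)id;            /* stand-in for the real per-transaction work */
}

static void *worker(void *arg) {
    int first = *(int *)arg;
    for (int i = first; i < first + PER_THREAD; i++)
        process_transaction(i);
    return NULL;
}

int main(void) {
    pthread_t tid[THREADS];
    int first[THREADS];

    for (int t = 0; t < THREADS; t++) {
        first[t] = t * PER_THREAD;
        pthread_create(&tid[t], NULL, worker, &first[t]);
    }
    /* on 2 HT cores this join-all barrier ends up waiting on the two
     * slow secondary threads; on 4 real cores all four workers finish
     * in roughly the time of 100 transactions */
    for (int t = 0; t < THREADS; t++)
        pthread_join(tid[t], NULL);

    puts("all 400 transactions done");
    return 0;
}

Compile with -pthread; the join loop is the "wait for them all to end" step described above.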

If the software assumes that each virtual thread is a core and schedules tasks accordingly, then HT can slow things down instead of speeding them up.
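
That assumption usually comes from how programs count processors. On Linux, for instance, the standard query counts every HT sibling as a full CPU (a small sketch; the printed label is mine):

#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* counts logical processors, so each HT sibling looks like a full core */
    long logical = sysconf(_SC_NPROCESSORS_ONLN);
    printf("logical processors: %ld\n", logical);
    return 0;
}

A thread pool sized from that number creates one worker per logical processor and then expects them all to run at the same speed, which is exactly the mismatch described above.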

If the software is single threaded, then HT will speed things up, because a program getting some time is better than getting no time.

AMD's Bulldozer is HT, but designed more along IBM's lines than Intel's. IBM's design is: create 2 instruction decode-and-issue engines and attach them to the same set of execution paths. If each decode engine can only issue 4 RISC instructions per clock and you have 6 execution paths (2 complex integer, 1 logical integer, 2 simple floating point, 1 complex floating point), why not just have the 2nd decoder issue as many instructions into the unused paths as possible? (It can issue up to 4, but in most cases it should be able to issue at least 2.) IBM did this with some of the PPC chips, and the secondary thread ran at about 60% of the primary's speed.

Edited 2010-08-05 03:16 UTC
