Linked by David Adams on Wed 4th Aug 2010 18:28 UTC, submitted by estherschindler
Anyone contemplating a new computer purchase (for personal use or business) is confronted with new (and confusing) hardware choices. Intel and AMD have done their best to differentiate the x86 architecture as much as possible while retaining compatibility between the two CPUs, but the differences between the two are growing. One key differentiator is hyperthreading; Intel does it, AMD does not. This article explains what that really means, with particular attention to the way different server OSes take advantage (or don't). Plenty of meaty tech stuff.
Thread beginning with comment 435483
RE: To clear things up a little...
by Panajev on Thu 5th Aug 2010 08:24 UTC in reply to "To clear things up a little..."

A core is normally:
1 set of registers
1 set of execution paths

x86 is CISC (complex instruction set computer), so modern x86 CPUs break the CISC instructions down into a series of RISC-like (reduced instruction set computer) micro-operations and then reorder them for execution. On each clock, one or more of these operations may be issued to an available execution path.

The problems arise when data needs to be fetched before the next set of instructions can issue (this causes a bubble/hole in the execution path). The other place where problems happen is a conditional branch: the compare can take up to 7 clocks to produce a result before the processor knows which instruction is going to be run next. The data-fetch holes can mostly be fixed by intelligent prefetching and are not a big issue on modern CPUs. The conditional branch is normally handled by making a best guess and being ready to roll back to the guess point once it's determined that the guess was wrong.
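To put rough numbers on that branch cost, here is a minimal back-of-the-envelope model. The 7-cycle penalty comes from the comment above; the base CPI, branch frequency, and predictor accuracy are illustrative assumptions of mine, not figures from the thread.

```python
# Toy model: effective cycles-per-instruction (CPI) with branch misprediction.
# Assumptions (mine, for illustration): base CPI of 1.0, branches are 20% of
# instructions, and each misprediction costs the 7-cycle compare latency
# mentioned above.

def effective_cpi(base_cpi, branch_frac, predictor_accuracy, penalty):
    mispredict_rate = branch_frac * (1.0 - predictor_accuracy)
    return base_cpi + mispredict_rate * penalty

no_predictor = effective_cpi(1.0, 0.20, 0.0, 7)     # every guess wrong
good_predictor = effective_cpi(1.0, 0.20, 0.95, 7)  # mid-90s accuracy

print(round(no_predictor, 2))    # 2.4 cycles per instruction
print(round(good_predictor, 2))  # 1.07 cycles per instruction
```

With a mid-90s-accuracy predictor the branch penalty nearly disappears, which is why the rollback strategy works so well in practice.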

Intel created HT as a way to solve the conditional-branch and bubble problems. Basically, all HT does is add a 2nd set of registers to the set of execution units. When a bubble is found, the core issues instructions from the 2nd thread. When a conditional branch is found, the core switches to the 2nd thread until the correct path is known.

Again, this seems more like Switch-on-Event multithreading (SoE MT) and not SMT, which are two different kinds of MT implementations altogether. Switching when a branch is encountered seems odd to me, unless you employ something similar to the HW scout threads Niagara uses (more on this later), or you are doing predication like Itanium/EPIC does, working on both paths of the conditional branch and resolving it at the end.
What you described sounds a bit like the good ol' branch delay slot technique: you defer the computation of the branch condition until you are sure the operands are ready, so the pipeline does not stall when the branch needs to be processed. But that is mostly used in in-order CPUs that lack the high-90s-percent-success-rate branch predictors Intel CPUs have now.

Still, switching on a bubble (say, a data-dependency stall or a cache miss) is a way to do MT, but it is not the SMT way... SMT is designed to let the CPU exploit parallelism across threads when the current thread does not have enough work to do.
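The switch-on-bubble idea can be sketched cycle by cycle. This is my own toy illustration, not anything from the article: each stream is a string where 'I' is a ready instruction and '-' is a stall bubble; a single-threaded core wastes the '-' cycles, while a two-thread core issues from the second stream instead.

```python
# Toy cycle-by-cycle sketch of filling pipeline bubbles with a 2nd thread.
# primary/secondary: 'I' = instruction ready to issue, '-' = stall bubble.
# Returns which thread issued on each cycle: 'A' (primary), 'B' (secondary),
# or '-' (nothing available, the cycle is wasted).

def run(primary, secondary):
    issued = []
    s = iter(secondary)
    for slot in primary:
        if slot == 'I':
            issued.append('A')  # primary thread has work: it issues
        else:
            # primary is stalled; try to issue from the secondary thread
            issued.append('B' if next(s, None) == 'I' else '-')
    return ''.join(issued)

print(run("II-I--II", "IIIIIIII"))  # → "AABABBAA": every bubble filled by B
print(run("II-I--II", ""))          # → "AA-A--AA": single thread wastes 3 cycles
```

The point the comment is making is about *when* the switch happens: here it happens only on a stall event, whereas true SMT can co-issue from both threads in the same cycle.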

The main problem is that good compiler optimization will reduce the number of bubbles. Thus, the only time the threads switch state is when doing conditional branching; however, this too is something that good compilers try to avoid. This is why HT only gives a 20-30% speed boost, while a 2nd core would give you something closer to a 95% boost (you lose some speed to memory bandwidth).

HT is cheaper, but its main drawback is that you get primary threads and secondary threads. The primary will run at full speed and the secondary will run at about 25% of the primary's speed. Any application that has been optimized for multi-threading will have issues, since it internally breaks the task at hand up into small units and expects each unit to run at the same speed.

Say I need to process 400 transactions and I have 2 HT cores (4 threads), so I create 4 threads and have each thread process 100 transactions. I can't report the task as finished until all 4 threads end; thus, once the threads are started I have to wait for them all to end before continuing. My actual run time would be the time to process 100 records (the 2 primary threads) plus the time to process 75 more records (25% of the records were processed during the 1st time period, and I'm assuming that the secondary threads become primary threads once the 2 primary threads end). On a 4-core machine, the actual run time would be the time to process 100 records.
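That arithmetic can be checked directly. The only inputs are the figures already used in this comment (400 transactions, 4 threads, secondary threads at 25% of primary speed, secondaries promoted when the primaries finish); the 1-time-unit-per-transaction scale is just a normalization.

```python
# Sanity check of the 400-transaction example. One transaction = 1 time unit
# on a primary thread; a secondary HT thread runs at 25% of primary speed
# (the figure used in the comment above).

PRIMARY_SPEED, SECONDARY_SPEED = 1.0, 0.25
per_thread = 400 // 4  # static split: 100 transactions per thread

# Phase 1: the 2 primary threads finish their 100 transactions;
# meanwhile each secondary thread completes only 25% as many.
phase1 = per_thread / PRIMARY_SPEED            # 100 time units
done_by_secondary = phase1 * SECONDARY_SPEED   # 25 transactions each

# Phase 2: secondaries are promoted to primaries and finish the rest.
phase2 = (per_thread - done_by_secondary) / PRIMARY_SPEED  # 75 time units

print(phase1 + phase2)  # 175.0 time units on 2 HT cores
print(per_thread)       # 100 time units on 4 real cores
```

So the static split finishes in 175 units on the HT machine versus 100 on 4 real cores, which matches the 100 + 75 figure in the paragraph above.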

If the software assumes that each virtual thread is a core and schedules tasks accordingly, then HT can slow things down instead of speeding them up.
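One common mitigation (my own sketch, not something from this thread) is to stop assigning a fixed share per thread and instead let workers pull small batches from a shared queue, so a slower secondary thread simply ends up doing less work. Modeled analytically, and ignoring the promotion effect described above:

```python
# Finish-time model: static equal split vs. a shared work queue.
# speeds: relative throughput of each worker thread. With a static split,
# the job is gated by the slowest worker; with a fine-grained shared queue,
# the finish time approaches total_work / total_throughput.

def finish_time_static(work_per_thread, speeds):
    # every thread gets the same amount; the slowest one gates completion
    return max(work_per_thread / s for s in speeds)

def finish_time_queued(total_work, speeds):
    # idealized queue with tiny batches: work flows to whoever is free
    return total_work / sum(speeds)

speeds = [1.0, 1.0, 0.25, 0.25]  # 2 primary + 2 secondary HT threads

print(finish_time_static(100, speeds))  # 400.0: secondaries gate everything
print(finish_time_queued(400, speeds))  # 160.0: queue absorbs the imbalance
```

The queue version is also why well-written thread pools usually do not care whether a "core" is real or virtual: they never assume equal per-thread speed in the first place.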

If the software is single-threaded, then HT will speed things up, because a program getting some time is better than getting no time.

AMD's Bulldozer is HT, but designed more in IBM's style than Intel's. IBM's design is: create 2 instruction decode-and-issue engines and attach them to the same set of execution paths. If each decode engine can only issue 4 RISC-like operations per clock and you have 6 execution paths (2 complex integer, 1 logical integer, 2 simple floating point, 1 complex floating point), why not just have the 2nd decoder issue as many operations into the unused paths as possible (it can issue up to 4, and in most cases it should be able to issue at least 2)? IBM did this with some of the PPC chips, and the secondary thread ran at about 60% of the primary's speed.
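A simplified model of that port-sharing idea, with caveats: this enumeration assumes any operation can go to any of the 6 paths and ignores port-type mismatches and data dependencies, so it overestimates the secondary thread's share compared with the ~60% figure quoted above. The 1-4 per-clock demand range is my own illustrative assumption.

```python
# Toy model of IBM-style sharing: two decoders feed 6 shared execution
# ports. The primary decoder gets its slots first (1-4 micro-ops/clock);
# the secondary decoder fills whatever ports are left, up to its own demand.
from itertools import product

PORTS = 6
p_total = s_total = 0
# Enumerate every combination of per-clock demand for the two threads.
for p_demand, s_demand in product(range(1, 5), repeat=2):
    p = p_demand                   # primary always issues its full demand
    s = min(s_demand, PORTS - p)   # secondary takes the leftover ports
    p_total += p
    s_total += s

print(round(s_total / p_total, 2))  # 0.9 in this idealized model
```

Even this crude model shows the key property: the secondary thread's throughput degrades gracefully with the primary's demand instead of collapsing to a fixed 25%-style penalty.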

I do not think its primary purpose was HW support for scout threads (see Niagara). It was primarily a way not to leave execution units unfed when the current thread of execution does not contain a high degree of instruction-level parallelism (ILP) but other threads have work that can start at that moment and run in parallel with what the CPU is doing (TLP, or Thread Level Parallelism).

Niagara's scout threads:

Edited 2010-08-05 08:34 UTC
