What the Hell is Hyper-Threading?

Eugenia Loli 2002-06-19 Intel 19 Comments

“Announced last autumn, Intel’s Hyper-Threading technology has finally made it to market, courtesy of the latest [Pentium4-based] Xeon processors. Hyper-Threading is a clever way of making a single chip operate like two separate devices without implementing two cores on one die. That, claims Intel, makes for higher performance without having to resort to significantly larger chips or even adding a second processor to the system.” The story is at TheRegUS. Alan Cox says that the technology can bring up to 30% more performance than the same CPU running without Hyper-Threading, but only when certain conditions are met: for example, the applications need to be written as multi-threaded. The right hardware for the right software.

19 Comments

2002-06-19 7:01 pm

Anonymous
Cool, gimme =)

From what I have read of Hyper-Threading, it looks like a really cool technology. Basically it is just a way of scheduling instructions efficiently so that the processor can execute them faster. Big bottleneck reduction. I would like to have one of the new P4 Xeons. Too bad it is way out of my price range.

Skipp
2002-06-19 7:10 pm

Anonymous
Dual Xeons are not that expensive. My husband and I wanted to buy a Dual Xeon 2 GHz just a month ago, and with all the goodies (SCSI stuff, 1 GB of memory, two Xeons at 2 GHz, SuperMicro mobo) it did not cost more than $2000 USD. I mean, 2 grand for that much server power is not much at all.

Heck, even the PowerMac G4 933 MHz costs even more, and it is times slower.
2002-06-19 7:15 pm

Anonymous
This is just some sort of marketing push by Intel. Anybody worth their salary in the IT business will know that you cannot emulate dual cores no matter what you do. The CPU can only perform one instruction at a time. Period.
2002-06-19 7:19 pm

Anonymous
IBM is expecting a 100% speed increase with Power5. The same was predicted for the (now cancelled) Alpha 21464.

AMD’s version should also be interesting…
2002-06-19 7:20 pm

Anonymous
“Equally, there’s a small performance hit when the OS switches from one thread to two, but these events occur infrequently, Intel claims”

That’s from the article. Intel claims that context switches occur “infrequently” and that there is a “small” performance hit? BWAHAHAHAHA. I want some of whatever they’re smoking.
2002-06-19 7:28 pm

Anonymous
Riiiight.

Study the P6 microarchitecture a little bit. You’ll see that there are plenty of bottlenecks in the architecture that limit it to 3 instructions and 3 uops per clock in the very best case, and dependencies between instructions and uops impose extra restrictions. Still, there are 5 execution units, almost all of which can process one uop per clock. Obviously, there’s a big waste there, and the idea of Hyper-Threading is to fill in that waste.

The story on the P4 is slightly different, because the trace cache helps remove some of the bottlenecks, but the higher number of execution units increases the stalls caused by uop interdependencies.

JBQ
2002-06-19 7:46 pm

Anonymous
Also, the P4 isn’t using a P6 core.
2002-06-19 8:29 pm

Anonymous
Here is a link which talks about a discussion at a conference where Intel basically admitted that the pie in the sky Hyper-Threading they promised when this first came out is not all it’s cracked up to be:

http://www.theregus.com/content/3/25284.html
2002-06-19 8:42 pm

Anonymous
Yes, I read that article at RegUS the other day, and I beg to differ (and I know that the author, a friend of mine, will read this… *blush*).

SMT will do you no good if you don’t have threaded applications (see: BeOS). This is a known fact.

And this is why you don’t use a Xeon to run Quake or Solitaire. You use it for special server cases, where the applications are extremely threaded (for example, an IRC server or web server, or Java or .NET apps). THAT is where the machine shines most. This is what it was made for!
2002-06-19 9:17 pm

Anonymous
30%? Bah.

If SMP, real SMP.
2002-06-19 9:51 pm

Anonymous
ruprecht

“This is just some sort of marketing push by Intel. Anybody worth their salary in the IT business will know that you cannot emulate dual cores no matter what you do. The CPU can only perform one instruction at a time. Period.”

You obviously are clueless about how CPUs actually work. What do you think happens on a cache miss? The x86 stalls for a few hundred to a thousand clock cycles doing no useful work. Even if only every 200th instruction misses cache, the throughput falls. The point of Hyper-Threading is to reduce the pain; the price is that you get N CPUs each running at 1/N of true full speed, or some variation of that!!!
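A rough sketch of the arithmetic behind that claim. The one-miss-per-200-instructions figure comes from the paragraph above; the base CPI of 1.0 and the 300-cycle miss penalty are assumed, illustrative numbers, not measurements.

```python
def effective_cpi(base_cpi, instructions_per_miss, miss_penalty_cycles):
    """Average cycles per instruction once cache-miss stalls are included.

    Assumes the core does nothing useful during a miss, which is the
    single-threaded case described above.
    """
    return base_cpi + miss_penalty_cycles / instructions_per_miss


# Assumed numbers: CPI of 1.0 with a perfect cache, 300-cycle miss penalty.
print(effective_cpi(1.0, 200, 300))  # 2.5
```

With these assumed numbers the average cost per instruction goes from 1.0 to 2.5 cycles, so the core spends more time stalled than working.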

Some people must be under the impression that 266MHz DDR RAM allows memory to do random accesses at that speed (fairly close to CPU speed). Well, true random RAS cycles are closer to 10MHz (50ns RAS access time just to start). Only the sequential burst accesses run at the DDR rates: fine if the code wanted a whole cache line and not just 1 byte, or for watching video streams, but lousy for irregular code.
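To put numbers on that gap, here is a small sketch that converts an access time into an equivalent random-access rate and into lost CPU cycles. The 50 ns figure is the one quoted above; the 100 ns full random cycle and the 2 GHz core clock are assumptions for illustration.

```python
def random_access_rate_mhz(access_time_ns):
    """Back-to-back random accesses per microsecond, i.e. a rate in MHz."""
    return 1000.0 / access_time_ns


def stall_cycles(access_time_ns, cpu_clock_mhz):
    """CPU clock cycles that tick by while one memory access completes."""
    return access_time_ns * cpu_clock_mhz / 1000.0


print(random_access_rate_mhz(50.0))   # 20.0  (RAS access time alone)
print(random_access_rate_mhz(100.0))  # 10.0  (assumed full random cycle)
print(stall_cycles(100.0, 2000.0))    # 200.0 (assumed 2 GHz core)
```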

Write some C code that beats the crap out of your cache with millions of random memory references to see this happen (try hashing). This is partly why CPUs often deliver far less than expected, falling down to P100 levels. That’s why Xeons, Alphas, SPARC and Power have much bigger caches, so they can run huge databases etc. The alternatives are Level 3 cache, Hyper-Threading, true SMP or even SRAM main memory, all of which have costs.

Many cpus claim to issue up to 8 ops/cycle, but the results are usually closer to 1.5 ops/cycle actually getting retired.

The article is ok if short, but repeats what was already out there.

Hyper-Threading has been some 20 years in the making. Since doubling or quadrupling the sizes of register files and other components no longer takes a significant part of the die, it is certainly ready for prime time, assuming the OS guys are willing to work with it (MS???). It does slow down the clock a little, since bigger register files are a little slower. It will also make the system look like it has less cache, since the cache is now working for multiple independent threads, but that is less of a problem for Xeon, Alpha and Power.

In its simplest form, it essentially decimates the gross latencies inherent in any CPU design; it doesn’t remove them, it just spreads them out in time. In a rigid scheme that cycles through a few unrelated Processes, the total performance is the sum of the partial CPUs.
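A toy simulation of that rigid scheme, as a sketch only. It models fine-grained round-robin issue (at most one instruction per cycle, taken from the next thread that is not stalled), which is simpler than what Hyper-Threading actually does. The run length of 200 instructions between misses and the 300-cycle stall are assumed numbers.

```python
def simulate(n_threads, run_length, stall_cycles, total_cycles):
    """Return instructions retired per thread on one shared core.

    Each thread issues run_length instructions, then stalls on a miss
    for stall_cycles. Every cycle the core issues one instruction from
    the next ready thread in round-robin order, or idles if none is ready.
    """
    ready_at = [0] * n_threads          # first cycle each thread may issue
    left = [run_length] * n_threads     # instructions until the next miss
    retired = [0] * n_threads
    start = 0                           # round-robin pointer
    for cycle in range(total_cycles):
        for k in range(n_threads):
            t = (start + k) % n_threads
            if ready_at[t] <= cycle:
                retired[t] += 1
                left[t] -= 1
                if left[t] == 0:        # this instruction missed the cache
                    ready_at[t] = cycle + 1 + stall_cycles
                    left[t] = run_length
                start = (t + 1) % n_threads
                break
    return retired


one = simulate(1, 200, 300, 5000)
two = simulate(2, 200, 300, 5000)
print(sum(one), sum(two))   # total work: one thread vs two threads
print(one[0], max(two))     # per-thread work: alone vs sharing the core
```

In this model a single thread keeps the core busy only 40% of the time. Two threads retire more instructions in total because one fills the other's stalls, but each thread on its own makes less progress than it would alone, which is the "N CPUs at a fraction of full speed" trade-off.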

I would be a lot more interested to see widespread adoption of true HW multi-threading with HW message passing and scheduling between related Processes, i.e. the CPU is only trying to be 1 CPU but supports true Par programming. Of course that’s called a Transputer. The world wasn’t ready for it the 1st time around; perhaps it will be the 2nd time around. In this form the number of threads (Processes P) is unlimited, since a low-level HW scheduler swaps the lite Processes and the register files become register caches for memory-mapped register sets.

In addition, the CPU should have fast serial links to glue multiple CPUs together Lego style. Funny thing is, the Hammer, the Alpha, the SHARC and the TI DSPs all have these links, but they still don’t get it (the DSPs are a special case). The Hammer marketing people are even showing pics of Hammers with 2, 3 or 4 links to build hypercubes with the HyperTransport (HT) links. These pictures could have come from INMOS 20 years ago.

Anyway, anybody interested enough and with FPGA tools can build their own CPU design (sub-100MHz) to explore how CPUs could or should be designed. Even without HW tools, you can still design in C and Verilog/VHDL. Such a project is of the same order of effort as an OS.

Happy architecting
2002-06-19 11:06 pm

Anonymous
Well, the title says it all. At work, we’ll upgrade our existing development boxen to Xeon 2.2GHz (from Dell). We already have some here; one of them is a Dual Xeon 1.8GHz box. We run mathematical simulations overnight, and the Dual Xeon box isn’t particularly spectacular compared to classic boxes, but then again, the calculation is single-threaded. There is a lot of disk and network activity during the calculations, but since it is probably synchronous, we don’t get to see the benefits of Hyper-Threading. Once we get the Xeons, I might try to parallelise the calculations (or make the I/O asynchronous) to judge any benefits the Xeons bring.
2002-06-20 12:28 am

Anonymous
http://www.cs.washington.edu/research/smt/
2002-06-20 2:53 am

Anonymous
I mentioned the P6 because it’s a more established architecture, with its bottlenecks extremely well understood and well explained, e.g. here: http://www.agner.org/assem/ and the P6 is enough to make a point against a claim like “The cpu can only perform one instruction at a time. period.” I don’t know about you, but reading the Intel optimization manual http://developer.intel.com/design/pentium4/manuals/24896606.pdf puts me to sleep.

I also wrote a short paragraph on how the p4 is not a P6, how some bottlenecks will not be as bad and how some others will be real problems. Obviously, the trace cache will help remove the bottlenecks of instruction decoding and register allocation. When running legacy code however, instruction interdependency will still be a major cause of stalls because classic x86 doesn’t allow for out-of-order writes. Furthermore, the penalty of a mispredicted branch on the p4 is very high and will stall all the units. This is in my opinion where SMT can shine – as opposed to streaming SSE2 (fewer bottlenecks) or x87 (primarily limited by the x87 execution unit itself).

JBQ
2002-06-20 5:12 pm

Anonymous
Heck, even the PowerMac G4 933 MHz costs even more, and it is times slower.

Bad girl. Macs are the best computers in the industry. Please don’t bash Macs. Bash Linux, it’s way more fun.
2002-06-20 6:42 pm

Anonymous
The gain from “Hyper-Threading” is mostly on cache misses, either data or instruction. When one virtual CPU misses, instead of waiting an eternity for the data to make its way from memory, the core can check whether the other virtual CPU already has its data in the cache and run that one. This makes cache misses cost much less, because the CPU can often do something instead of just waiting.

The downside nobody bothers to tell you about is that these two virtual CPUs now share the caches. This has the effect of essentially halving the cache size. And anybody with a server will tell you that cache size is very, very, very important.

So if the active working set for both threads on the virtual processors is small, then Hyper-Threading is great. However, if each working set is bigger than half the cache size, all you do is increase your cache miss rate and ruin your performance.
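The half-the-cache cliff can be sketched with a toy model. Everything here is an assumption for illustration: a fully associative LRU cache of 1024 lines, and threads that each loop over a private working set with their accesses interleaved. Real caches are set-associative and real access patterns are less regular, so the drop is usually softer than this worst case.

```python
from collections import OrderedDict


class LRUCache:
    """Fully associative cache with least-recently-used replacement."""

    def __init__(self, capacity_lines):
        self.capacity = capacity_lines
        self.lines = OrderedDict()
        self.hits = 0
        self.accesses = 0

    def access(self, tag):
        self.accesses += 1
        if tag in self.lines:
            self.hits += 1
            self.lines.move_to_end(tag)         # mark as most recently used
        else:
            if len(self.lines) >= self.capacity:
                self.lines.popitem(last=False)  # evict least recently used
            self.lines[tag] = True


def shared_hit_rate(cache_lines, working_set_lines, n_threads, passes):
    """Hit rate when n_threads loop over private working sets in one cache."""
    cache = LRUCache(cache_lines)
    for _ in range(passes):
        for line in range(working_set_lines):
            for thread in range(n_threads):     # interleave the threads
                cache.access((thread, line))
    return cache.hits / cache.accesses


print(shared_hit_rate(1024, 600, 1, 10))  # one thread, set fits the cache
print(shared_hit_rate(1024, 400, 2, 10))  # two threads, each under half
print(shared_hit_rate(1024, 600, 2, 10))  # two threads, each over half
```

In this model the first two cases miss only on the first pass, while the third thrashes: the combined working set no longer fits, and a looping access pattern under LRU evicts every line just before it is needed again.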
2002-06-20 8:40 pm

Anonymous
Couldn’t agree more; I said the same thing about cache sharing. This is probably why it won’t take off in PCs if the threads are chosen by the OS without considering this.

One of the ways the server RISCs can alleviate this cache sharing is to include a huge (16M+) off-chip Level 3 cache that is halfway between the CPU (1ns) and SDRAM (100ns) in random access cycle time, i.e. 10ns or so. That is something PCs will not likely see unless an extremely low latency DRAM cache (MoSys, anybody?) can be productised (IIRC ISSCC has seen a 5ns DRAM design). The industry can build affordable <10ns 4MByte SRAM, but the PC industry has costed that one out; even high-end PCs don’t have that option. Of course the really high-end CPUs can even consider SRAM (CRAY) as main memory.

Now in the Transputer model, or any similar environment where the Processes are nested by the developer, the threads (Processes P) are closely related and cooperate through messages, so cache sharing is natural and desirable. The OS could still share the CPU over many apps, but only one app gets to use the Threading (Process Scheduling and Messaging) support. Of course this means the developers will have to learn to use Par wherever they can and to think of SW more like HW. This can be quite natural in areas like DataPumps, Codecs, Math blocks etc., and a little harder in other areas.

Believe me, some SW-HW design really is interchangeable in design style, but the HW (written in Verilog/VHDL) can still be run as SW by translating back to C & linking. HW languages are naturally Parallel & this code would benefit enormously from SMT. The same code written in say OCCAM or HandelC style would be an equivalent representation.
2002-06-20 9:03 pm

Anonymous
I have seen many benchmarks and comparisons of Xeons going head to head with Athlon XPs and MPs. The fact of the matter is that Hyper-Threading works, but not well enough to merit the extra cost of the processors and motherboards. AMD will once again come out on top. Damn, I wish I had one.
2002-06-21 3:25 pm

Anonymous
All of the above.

Makes me think that perhaps it’s time to re-think MP specifications and re-write chapter 7.

A 4-way box will now need an 8-CPU BIOS.

Intel’s evolutionary CPU-design upgrade strategy has been making development of low-level MP software difficult for years.

Maybe it’s time for something new?