Intel’s Hyperthreading Technology is being blamed for server performance problems. With both SQL Server and Citrix Terminal Server installations, HT-enabled motherboards show markedly degraded performance under heavy load. Disabling HT restores expected levels, according to reports from within the IT industry.
I did a lot of testing on HT performance when the first 3.06 GHz HT CPU came out. What I found was that HT was only useful for multitasking, not for heavy CPU loads. In other words, running a bunch of parallel makes while compiling will not reveal any performance advantage, but running several concurrent desktop applications will.
HT is not, and was never meant to be, a substitute for multiple CPUs. Intel’s original stated purpose for it was as a desktop/workstation enhancement, NOT as a way of enhancing server performance.
HT was supposed to help in some scenarios and do nothing in others; in practice, it helps in some and hurts in others.
Agreed. Hyperthreading was only supposed to help transition the market to dual-core (and later multi-core) CPUs.
Why would someone turn HT on in a server, anyway? You need two physical CPUs (or at least a dual-core CPU) if you want legitimate parallel processing.
The way that hyperthreading works is to use the ‘unused’ parts of your CPU. For most tasks, the CPU is doing either integer or floating-point operations, not both at the same time. Thus part of the CPU sits idle while the other part is taxed. Hyperthreading tries to get two different tasks executing at the same time. Server workloads, on the other hand, are all about doing the same task over and over and over again, so the same parts of the CPU are always being stressed, and hyperthreading doesn’t really help much. Either the server is doing all FP (as in a rendering farm for CGI) or all integer (as in web servers). Either way, HT is not going to help: servers need to do one thing really well to excel, not handle a varied load.
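The unit-sharing idea above can be sketched with a toy model (purely illustrative; the `cycles_needed` helper and the one-INT-unit/one-FP-unit core are my assumptions, not a real pipeline):

```python
# Toy model of SMT issue slots (illustrative only, not a real pipeline).
# Assume a hypothetical core that can issue one integer op and one FP op
# per cycle. Two threads whose ops are all the same type fight over one
# unit; a mixed pair of threads fills both units each cycle.

def cycles_needed(thread_a, thread_b):
    """Count cycles to drain two op queues on the assumed 2-unit core."""
    a, b = list(thread_a), list(thread_b)
    cycles = 0
    while a or b:
        units_used = set()
        for queue in (a, b):
            # Issue the head op only if its unit is still free this cycle.
            if queue and queue[0] not in units_used:
                units_used.add(queue.pop(0))
        cycles += 1
    return cycles

# Two integer-heavy threads (server-like): they serialize on the INT unit.
same = cycles_needed(["int"] * 8, ["int"] * 8)
# One integer + one FP thread (desktop-like mix): ops pair up each cycle.
mixed = cycles_needed(["int"] * 8, ["fp"] * 8)
print(same, mixed)  # 16 8
```

The homogeneous pair takes twice as many cycles, which is the comment's point about why uniform server workloads see little benefit from HT.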
Linux 2.6 has scheduling intelligence built into the kernel to ensure that full tasks do not overload the CPU’s virtual processor. HT is much more useful in a deeply multi-threaded app where that same app is doing two things at once, and nearly useless in any other case.
On Linux 2.6, HT can help the kernel’s performance but rarely helps the performance of anything else. Maybe Java. But it also rarely hurts the performance.
There was some debate on the kernel mailing list a while ago (probably around that whole BSD hyperthreading vulnerability) about the actual usefulness of hyperthreading. Linus Torvalds maintained that hyperthreading is pretty useful for reducing latency, but that benefit suits desktop tasks more than the sorts of applications servers run. In server environments there comes a definite point where throughput is what matters. As for hyperthreading itself, I think Intel’s idea, design, and implementation of it is crap, although Windows probably has quite a bit to do with it as well. I’ve not seen a Linux system where hyperthreading has had the negative effect they’re describing here, but I’ve not seen a terribly positive one either.
With multiple cores now the need for hyper threading is actually very questionable. Of course if you want a decent multiple/dual core system these days you go AMD ;-).
We’ve had issues on some of the boxes my department maintains with HT enabled running Linux/MySQL/Java. Linux is certainly not immune to this. The only solution was to disable HT in the BIOS and things ran smoothly again.
“Linux 2.6 has scheduling intelligence built into the kernel to ensure that full tasks do not overload the CPU’s virtual processor. HT is much more useful in a deeply multi-threaded app where that same app is doing two things at once, and nearly useless in any other case.
On Linux 2.6, HT can help the kernel’s performance but rarely helps the performance of anything else. Maybe Java. But it also rarely hurts the performance.”
I am not familiar with the Linux 2.6 HT optimizations, but the NT 5.1 and 5.2 (XP, 2k3) kernels also have scheduling logic for dealing with hyperthreading. Specifically, the kernel will not schedule a thread to run on a CPU with one or more busy logical processors if another physical CPU in the system has no busy logical processors (i.e., both of its logical CPUs are executing idle threads).
The problem described by Slava is one in which the system is heavily loaded: a thread that performs scans against large random sets of data in memory forces cache flushes, impacting a thread executing on the same physical CPU. This is less a result of Intel’s implementation of SMT (except perhaps in how well the caching algorithms work) or of Windows HT scheduling logic, and more a result of how Microsoft SQL Server’s lazy writer works.
I am far more skeptical of the Citrix claims as there is little correlation between how a multiuser / multi-app Windows Terminal Server functions and the structured issue described by Slava. Additionally, most applications, such as Exchange, do see a moderate performance increase or no impact running with HT enabled.
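The HT-aware scheduling rule described above can be sketched roughly like this (a hedged sketch; `pick_logical_cpu` and the busy-flag layout are hypothetical illustrations, not actual NT or Linux scheduler code):

```python
# Sketch of the rule: prefer a logical CPU on a physical package whose
# logical CPUs are all idle, before doubling up on a busy package.
# 'packages' is a list of physical CPUs, each a list of busy flags,
# one per logical CPU (a layout assumed for this example).

def pick_logical_cpu(packages):
    """Return (package, logical) index of the preferred CPU for a new
    thread, or None if every logical CPU is busy."""
    # First pass: a package where *all* logical CPUs are idle.
    for p, logicals in enumerate(packages):
        if not any(logicals):
            return (p, 0)
    # Second pass: settle for any idle logical CPU, busy sibling or not.
    for p, logicals in enumerate(packages):
        for l, busy in enumerate(logicals):
            if not busy:
                return (p, l)
    return None

# Package 0 has a busy sibling; package 1 is fully idle, so it wins.
print(pick_logical_cpu([[True, False], [False, False]]))  # (1, 0)
```

Only when no fully idle package exists does a thread get paired with a busy sibling, which is exactly the situation where the cache-flush interference described above can bite.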
“But it also rarely hurts the performance.”
Doesn’t it halve the cache, allotting one half to each thread which could really hurt performance for some codes? Especially ones like database code that loves cache?
Hyperthreading can be of benefit even in CPU-heavy environments under the proper conditions. I have seen up to a 50% speedup in floating-point computations by turning hyperthreading on, which is about the maximum speedup one would expect. In those cases the floating-point operations were bound by heavy disk I/O. The same floating-point operations, when given a static dataset so there were no waits for disk I/O, showed a negligible speedup. I have never seen a significant slowdown from hyperthreading, but it makes sense under the scenarios they are discussing above.
I mostly use my computer for audio and recording. With HT turned off I get more “raw horsepower.” With HT turned on the system is much more responsive and I’m able to effectively perform other tasks even when the system is overloaded. This is much more important to me in an audio software environment. The benefit of HT to server environments is pretty obvious too.
Pretty much obsoleted by dual-core processors, no?
“Pretty much obsoleted by dual-core processors, no?”
Yeah, basically.
Actually, not really. What SMT (simultaneous multithreading, which is the technical name for Hyperthreading) really does is utilize unused execution hardware to avoid bubbles in the pipeline/instruction stream. This is useful for any processor, multicore or not, since it improves the efficiency of every core.
That it worsens performance under heavy loads seems like a fault in Intel’s implementation (which has been improved continuously since the early, non-enabled versions in the Willamette cores). SMT as such is still useful, but perhaps it is hard to design a proper implementation.
For instance, a future version (now scrapped, obviously) of the Alpha processor, the EV8, was planned to have 4-way SMT.
Maybe Intel’s specific implementation, but not in general. If anything, its popularity should increase. Sun’s Niagara, IIRC, does the same thing as HT on each of its 8 cores and calls it Chip Multithreading (CMT). In theory it’s a great way to cover the performance hit from I/O requests, which are just as slow as ever.
One thing I still can’t understand is why memory is still so slow. Could they manufacture RAM the same way they manufacture CPUs? Sure, you’d need more power and a HSF per bank, but for the performance boost it’d be worth it for many people, and you’d always find a market for it in the OCing circles. The number I heard is that accessing RAM wastes around 100 cycles.
“In theory it’s a great way to cover the performance hit from I/O requests, which are just as slow as ever.”
I/O is interrupt driven, so it’s not like the CPU is ever in a loop polling; however, you may be right in that I could see HT perhaps helping with interrupt handling. Do keep in mind that for a modern CPU, I/O is really just a RAM read, because the data was put into memory by the device directly via DMA.
Making memory on CPU-like dies would be much, much more expensive. If a die the size of a modern P4 were all cache, I would guess it would hold about 3-4 MB. Think how much 256 MB of that would cost.
As to main-memory latency, yes, it’s there; the number of wasted cycles varies for every CPU/bus/RAM combination. Do keep in mind that the whole reason caches, on-board memory controllers, and many modern CPU architectures exist is to make this latency less painful.
“I/O is interrupt driven, so it’s not like the CPU is ever in a loop polling.”
Yes, but this way you can perhaps save yourself a context switch.
“If a die the size of a modern P4 were all cache, I would guess it would hold about 3-4 MB.”
Are you saying 50% of a chip’s die is cache? I believe the next Itanium will have 13 MB of cache, so I doubt that.
“Think how much 256 MB of that would cost.”
Why 256? why not just use it as another layer of cache?
Oh, I agree it would be better spent as cache. You were talking about the possibility of SRAM being used as main memory. As to the context switch: at the kind of IPS modern CPUs put out, it’s not as big a deal as might be thought. Although I agree having an “SMT CPU” take care of it would be a good thing.
DRAM has gained only a 2x latency improvement over 20-odd years while CPU cores have gained maybe 100x, which can only mean cache misses are 50x or more worse when they happen. Typically that can now reach many hundreds of wait cycles for a full L1/L2/TLB miss. This assumes DRAM full random cycles at about 60 ns, although the control path through the bridge increases that significantly.
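The back-of-the-envelope numbers above work out like this (the speedup and latency figures are taken from the comment; the 3 GHz clock is an assumed example value):

```python
# Cores sped up ~100x while DRAM latency improved only ~2x, so a cache
# miss costs roughly 50x more, relative to core speed, than it used to.
core_speedup = 100.0
dram_speedup = 2.0
relative_miss_cost = core_speedup / dram_speedup
print(relative_miss_cost)  # 50.0

# At ~60 ns random-cycle DRAM latency and an assumed 3 GHz clock, a
# full miss stalls for a couple hundred cycles, before any bridge or
# controller overhead is added on top.
clock_ghz = 3.0
dram_ns = 60.0
print(dram_ns * clock_ghz)  # 180.0 cycles
```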
Faster DRAM
For a CPU designed around SMT, a special type of DRAM can be used that effectively makes all DRAM accesses look like 2.5 ns latency, by having 8 threads cover the actual 20 ns true latency. It works because RLDRAM allows up to 8 concurrent threads to be in flight in both the processor AND the memory. When DRAM effectively gives data access in a few cycles, the need for L2 cache goes away; in effect, highly interleaved DRAM with many threads works much better than a single-threaded CPU with an infinite amount of SRAM. Ever larger SRAM caches above 1 MB don’t work too well anyway; the leakage starts to dominate power. DRAM by definition is designed not to leak, and it’s not so difficult to also make it much faster by exploiting the multiple banks with an SRAM pipelined interface. As a nice bonus, RLDRAM is supposed to be about 2x the price of SDRAM for the same size but has 20x the random-access throughput. Right now x86 can’t do anything with it, while true SMT CPUs can.
See Micron RLDRAM website for details.
transputer guy
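The effective-latency figure in the post above is simple amortization (all numbers are the comment’s own):

```python
# With 8 threads in flight, RLDRAM's 20 ns true latency is spread
# across all of them, giving the effective per-thread figure quoted.
true_latency_ns = 20.0
threads_in_flight = 8
effective_ns = true_latency_ns / threads_in_flight
print(effective_ns)  # 2.5
```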
Actually, Intel’s new Paxville processors are dual-core AND hyperthreaded.
And let’s not forget the POWER platform. POWER5 has dual- and quad-core chips that support two threads per core.
Hey, but have you seen how they perform against current Opterons? They get spanked; here are some numbers: http://www.gamepc.com/labs/view_content.asp?id=paxville&page=1
P.S. The IBM POWER5 does seem to have a very solid SMT implementation, though.
Found this out 2 years ago when we deployed a 4-CPU SQL Server box and were seeing a high volume of CPU locking. Disabling HT resolved the CPU locks.
Jim
Problem: HyperThreading isn’t real SMT
Solution: Abandon HyperThreading or switch to a CPU architecture which supports real SMT (POWER/SPARC)
>We’ve had issues on some of the boxes my
>department maintains with HT enabled running
>Linux/MySQL/Java. Linux is certainly not immune to
>this.
But 2.6 or 2.4? 2.4 doesn’t have the HT scheduling.
I had mixed results with HT on WXP Pro. Sometimes I was convinced of getting something out of the virtual processor (usually in video apps), and sometimes I got the opposite. Same with Linux, only in that case I usually had no negative effect that I could quantify (parallel compiles ARE faster with HT, marginally, for example) – but since most Linux software isn’t multi-threaded (such as mplayer/mencoder), I didn’t get much benefit, either.
I have a dual-processor machine with hyperthreading (and Linux 2.6), and I use it to do statistical computations — which consist of lots and lots of floating point calculations with a high locality of reference (and hence relatively little memory usage compared to the CPU usage). With hyper-threading, I can run four processes simultaneously on two processors without any noticeable per-process slowdown. So I’m happy. Though I’m just one data point and I’m sure there are people for whom it doesn’t work.
We have tried to run several production servers (under both Linux and Windows) with hyperthreading, and it has always dropped performance to around 40% of when it is turned off. The first time, it took us hours to figure out hyperthreading was the culprit. Afterwards, when we ran into performance problems, we would immediately turn off hyperthreading and see the speed improvements. We have had such poor experience with hyperthreading (couldn’t figure any scenario in the server room where it actually helped), that now we make sure to turn it off right away.
Really. I was doing somewhat CPU-intensive Berkeley DB recovery of many independent BDB databases (on Linux 2.4.x). On a two-way Xeon HT machine, starting recovery of about 100 databases in 4 (nearly) independent threads reduced overall time to less than 30% of the time needed to recover those databases in a single thread. It was nearly 4 times faster! HT really works! The trick is that switching from a uniprocessor kernel to an SMP one is a _real_ performance hit (sometimes even more than a 50 percent drop); synchronization between contemporary processors is really expensive. So judging from a 2-way machine (or HT) is misleading: you gain two ‘virtual’ processors but lose on synchronization. But on a 4-way, 8-way, or larger machine, one may see real performance gains, granted that the software works in parallel.
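The speedup claimed here checks out arithmetically (the 30% figure is the comment’s own; the single-thread time is normalized to 1 for illustration):

```python
# Recovery in 4 threads took "less than 30%" of the single-threaded
# time, i.e. a speedup of more than 1 / 0.30, which the poster rounds
# to "nearly 4 times faster".
single_thread_time = 1.0
threaded_fraction = 0.30
speedup = single_thread_time / (single_thread_time * threaded_fraction)
print(round(speedup, 2))  # 3.33
```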
“And let’s not forget the POWER platform. POWER5 has dual- and quad-core chips that support two threads per core.”
Let’s not confuse multi-chip modules with quad core.
http://www.research.ibm.com/journal/rd/494/sinharoy.html
>Let’s not confuse multi-chip modules with quad core.
It now comes in a standard POWER5+ CPU package:
http://www.ibm.com/chips/photolibrary/server/1032.jpg
Intel also uses chips with 2 separate dice.
I believe Intel relies too much on people simply adopting whatever new technologies it puts out. Itanium was a good example of this. So even though HT is x86-based, taking advantage of it requires a non-x86 way of distributing processing.
I agree.
What it is.
http://en.wikipedia.org/wiki/Hyperthreading
Read the section on cache conflicts:
http://arstechnica.com/articles/paedia/cpu/hyperthreading.ars/5
So HT sucks on SQL... anyone know about Oracle performance with HT?
Sun is targeting their new Niagara chips (8 cores and 4 threads per core) at server use, are they not?
Does the article imply Niagara will suffer from HT all the same, or does the different architecture ensure it doesn’t?
>So HT sucks on SQL... anyone know about Oracle
> performance with HT?
I don’t know about Oracle in particular, but I do know that multi-threaded DB suites tend to show improvement with HT as long as the machine is dedicated to the DB itself and not running individual apps alongside. Multitasking can kill this performance improvement.
For example, a stand-alone multi-threaded SQL server which serves data to the network will often show considerable improvement with HT. I can’t say that all OS’ are capable of providing that improvement, but that is my experience with Linux 2.6.