Two threads, one core: how simultaneous multithreading works under the hood

Thom Holwerda 2024-07-25 Hardware 9 Comments

Simultaneous multithreading (SMT) is a feature that lets a processor handle instructions from two different threads at the same time. But have you ever wondered how this actually works? How does the processor keep track of two threads and manage its resources between them?
In this article, we’re going to break it all down. Understanding the nuts and bolts of SMT will help you decide if it’s a good fit for your production servers. Sometimes, SMT can turbocharge your system’s performance, but in other cases, it might actually slow things down. Knowing the details will help you make the best choice.
↫ Abhinav Upadhyay

Some light reading for the (almost) weekend.

About The Author

Thom Holwerda

Follow me on Mastodon @[email protected]

9 Comments

2024-07-25 9:57 pm
Xanady Asem
FWIW SMT is a general concept. Some SMT implementations support more than 2 threads.
Also “However, many experts believe that when absolute maximum performance is needed for a program, it is best to disable SMT so that the single thread will have all the resources available to it.” Those experts are “trust me bro,” most literature offers little to no degradation in performance from SMT. It’s a resource utilization multiplier that operates at timings orders of magnitude lower than the cache/predictor misses for example.

2024-07-25 11:28 pm
Alfman verbose=1
Xanady Asem,
Also “However, many experts believe that when absolute maximum performance is needed for a program, it is best to disable SMT so that the single thread will have all the resources available to it.” Those experts are “trust me bro,” most literature offers little to no degradation in performance from SMT.
I’d generally agree that SMT is well optimized for general workloads, especially when pipelines would otherwise stall. But there are benchmark data that actually show a benefit from disabling SMT.
Timespy Ryzen 9 7950x
https://forums.overclockers.co.uk/threads/time-spy-standard-dx-12-bench.18740536/page-137
SMT Enabled:
Score – 32,800
GPU – 38,616
CPU – 17,698
SMT Disabled:
Score = 34,304 (4.6% higher)
GPU – 39,352
CPU – 19,865
3dmark Ryzen 9 5950X
https://www.3dmark.com/spy/31358849
https://www.3dmark.com/spy/27064960
w/SMT enabled = 19,204
w/SMT disabled = 21,509 (12.0% higher)
Different benchmarks may show that SMT enabled is better too. It really depends on the nature of the task. One should perform benchmarks on tasks of interest to be certain. Of course in practice, I don’t expect most users to worry about it.

2024-07-26 4:10 am
The123king
It depends on the workload. Predictable, repetitive tasks like rendering might benefit from SMT being disabled, but general purpose workloads will likely benefit from having it enabled.

2024-07-26 11:13 am
Alfman verbose=1
The123king,
It depends on the workload. Predictable, repetitive tasks like rendering might benefit from SMT being disabled, but general purpose workloads will likely benefit from having it enabled.
(My emphasis.)
I think everyone should be in agreement there. From the author’s own conclusion…
So, should you use SMT? It really depends. If your workloads need the highest performance and lowest latency, turning SMT off might give you that edge. But if you’re running general-purpose tasks that can benefit from more parallelism, keeping SMT on could be a win.
IMHO SMT provides more benefit on lower core systems due to a shortage of cores to execute all the threads concurrently. Today’s processors obviously come with a lot more cores, so many that sometimes the software doesn’t even take advantage of them all. It’s quite common for games to max out a few cores to 100% while leaving remaining cores at 0%. Consequently SMT’s thread concurrency doesn’t help them and resource latency/sharing can cause some performance degradation. As usual, benchmarking is key to finding out which way works best. It can make a measurable difference although unless someone really cares about scores, I wouldn’t bother.

2024-07-26 11:46 am
ndrw
I have disabled SMT on our computer nodes, which was quite painful – going down from 128 threads to 64 per machine. This meant more tasks will spilling over to other machines and introducing network overheads. This wasn’t a decision taken lightly.
SMT was improving *throughput*, especially in memory and IO bound applications, which is most of the time (even compute bound applications have to load data or save results).
SMT did not scale well for single task speed, though, especially that most our task involve FPU operations and there is only one FPU available per core. So, while the machine could have more work done, each task ran slower. Not the outcome we were after, as machines are cheaper than software licenses.
I also encountered problems with memory and IO bandwidth. trying to run more than 64 threads in parallel usually ended up bottlenecking one or the other.

2024-07-26 12:45 pm
Alfman verbose=1
ndrw,
I also encountered problems with memory and IO bandwidth. trying to run more than 64 threads in parallel usually ended up bottlenecking one or the other.
Yeah, there’s definitely diminishing returns at such high core counts. The best case scenario is if you have tasks that can run entirely inside of local cache without having to wait on saturated memory channels, but I suspect many/most HPC tasks need tons of ram. At that point you may be better off with a cluster than processors.with so many cores.
Designing software for GPUs is completely different than for CPUs, but regardless I do think many computational workloads might use such high core CPUs are probably better off using GPUs that offer more efficiency and more parallelism than CPUs do.

2024-07-26 2:15 pm
ndrw
I would like to have many cores and I’m impatiently waiting for new epic servers. The more cores the better – more jobs can run locally. But these should be fully featured cores that don’t oversubscribe resources (too much).
My point was: SMT is good if (1) you need more cores (not that obvious anymore), (2) you care about throughput, not speed and (3) the rest of the system can keep up with the increased throughput.

2024-07-26 3:53 pm
Alfman verbose=1
ndrw,
I would like to have many cores and I’m impatiently waiting for new epic servers. The more cores the better – more jobs can run locally. But these should be fully featured cores that don’t oversubscribe resources (too much).
I’d like more cores as well. But oversubscribed resources can easily become bottlenecks for such core heavy systems. Particularly when the software experiences a lot of cache misses. It was a long time ago, but I tested memory bound applications on an 8 core *consumer* CPU and the memory subsystem was already completely saturated by 2-3 cores. The available memory bandwidth wasn’t even close to keeping up. Incidentally servers have more memory channels, which helps. Although servers are usually associated with slightly slower single threaded performance and there’s not really enough bandwidth to keep 64 cores/128 threads busy in memory insensitive loads. This model works best with CPU bound code.
I had a project programming a 24core server to process a computationally demanding 2GS/s data streams in real time. That worked out well because algorithms were CPU-bound and parallelism was more important than individual thread performance. I’m not trying to dismiss the model, just point out it’s limits. I feel that GPGPU can bring a lot to the table, potentially displacing the need for such high core CPUs.

2024-07-27 8:47 am
Xanady Asem
Yeah, compute kernels that don’t scale at all (in terms of CPUs) will benefit from monopolizing a core. That’s one of the corner use cases for SMT hindering performance. That doesn’t seem a candidate use case for a 64-core SKU with or without SMT FWIW.