Linked by Thom Holwerda on Fri 11th Aug 2017 19:46 UTC
AMD

In this review we've covered several important topics surrounding CPUs with large numbers of cores: power, frequency, and the need to feed the beast. Running a CPU is like the inverse of a diet - you need to put all the data in to get any data out. The more pi that can be fed in, the better the utilization of what you have under the hood.

AMD and Intel take different approaches to this. We have a multi-die solution compared to a monolithic solution. We have core complexes and Infinity Fabric compared to a MoDe-X-based mesh. We have unified memory access compared to non-uniform memory access. Both are pushing hard on frequency, and both are battling power consumption. AMD supports ECC and more PCIe lanes, while Intel provides a more complete chipset and specialist AVX-512 instructions. Both are competing in the high-end prosumer and workstation markets, promoting high-throughput multi-tasking scenarios as the key to unlocking the potential of their processors.

As always, AnandTech's the only review you'll need, but there's also the Ars review and the Tom's Hardware review.

I really want to build a Threadripper machine, even though I just built a very expensive (custom watercooling is pricey) new machine a few months ago, and honestly, I have no need for a processor like this - but the little kid in me loves the idea of two dies fused together, providing all this power. Let's hope this renewed emphasis on high core and thread counts pushes operating system engineers and application developers to make more and better use of all the threads they're given.

Threads
by kwan_e on Sat 12th Aug 2017 02:20 UTC

make more and better use of all the threads they're given


More use? No. Better use? Yes. Programs should definitely be thread agnostic and thus structured (layered) for usage patterns like work-stealing queues etc.


RE: Threads
by dpJudas on Sat 12th Aug 2017 09:57 UTC in reply to "Threads"

More use? No. Better use? Yes. Programs should definitely be thread agnostic and thus structured (layered) for usage patterns like work-stealing queues etc.

Problem is that once NUMA enters the picture it becomes much more difficult to be thread agnostic. A generic threadpool doesn't know what memory accesses each work task is going to do, for example.


RE[2]: Threads
by Alfman on Sat 12th Aug 2017 14:55 UTC in reply to "RE: Threads"

dpJudas,

Problem is that once NUMA enters the picture it becomes much more difficult to be thread agnostic. A generic threadpool doesn't know what memory accesses each work task is going to do, for example.


I agree, multithreaded code can quickly reach a point of diminishing returns (and even negative returns). NUMA overcomes those bottlenecks by isolating the different cores from each other's work, but then obviously not all threads can be equal, and code that assumes they are will be penalized. These are intrinsic limitations that cannot really be fixed in hardware, so personally I think we should be designing operating systems that treat NUMA nodes as clusters instead of as just more threads. And our software should be programmed to scale in clusters rather than merely with threads.

The benefit of the cluster approach is that software can scale with many more NUMA cores than pure multithreaded software. And without the shared memory constraints across the entire set of threads, we can potentially scale the same software with NUMA or additional computers on a network.


RE[3]: Threads
by FortranMan on Sat 12th Aug 2017 17:34 UTC in reply to "RE[2]: Threads"

This is really why I still use MPI for parallel execution even when running on a single node. This approach also has the added benefit of scaling up to small computer clusters without much extra effort.

I mostly write engineering simulation codes though, so I'm pretty sure this does not make sense for entire classes of program.


RE[4]: Threads
by Alfman on Sat 12th Aug 2017 20:40 UTC in reply to "RE[3]: Threads"

FortranMan,

I mostly write engineering simulation codes though, so I'm pretty sure this does not make sense for entire classes of program.



Oh cool, I'd really like to learn more about that. I've played around with inverse kinematic software and written some code to experiment with fluid dynamics, but nothing sophisticated. I've long wanted to try building physical simulations with a GPGPU, though obviously it requires a different approach!


RE[5]: Threads
by FortranMan on Mon 14th Aug 2017 07:00 UTC in reply to "RE[4]: Threads"

Most of my codes are simulations of experiments; tens to hundreds of thousands of simulations are run of the same setup with slight variations in input to help predict the uncertainties expected in the actual experiment. These are usually done in the design phase of an experimental project.

I have also written programs to approximate solutions to partial differential equations on various grids in parallel and serial, using shared memory, distributed memory, and even hybrid architectures. It is interesting stuff, albeit very challenging.


RE[6]: Threads
by Alfman on Mon 14th Aug 2017 13:37 UTC in reply to "RE[5]: Threads"

FortranMan,

Most of my codes are simulations of experiments; tens to hundreds of thousands of simulations are run of the same setup with slight variations in input to help predict the uncertainties expected in the actual experiment. These are usually done in the design phase of an experimental project.


So are the individual simulations multithreaded? Or do you use the threads to run many single-threaded simulations at once?

I have also written programs to approximate solutions to partial differential equations on various grids in parallel and serial, using shared memory, distributed memory, and even hybrid architectures. It is interesting stuff, albeit very challenging.


Yeah, it sure blows my line of work out of the water; the businesses I get work from don't offer intellectually interesting or challenging work. You're lucky!


RE[3]: Threads
by tylerdurden on Sun 13th Aug 2017 00:04 UTC in reply to "RE[2]: Threads"

It seems you want to get the worst of both worlds in order to not get the benefits of NUMA.

You can simply pin threads if you're that concerned with NUMA latencies. Otherwise let the scheduler/mem controller deal with it.


RE[4]: Threads
by Alfman on Sun 13th Aug 2017 00:38 UTC in reply to "RE[3]: Threads"

tylerdurden,

It seems you want to get the worst of both worlds in order to not get the benefits of NUMA.

You can simply pin threads if you're that concerned with NUMA latencies. Otherwise let the scheduler/mem controller deal with it.


That's the problem, the operating system scheduler/mem controller CAN'T deal with it.

If you had 32 cores divided into 8 NUMA clusters, the system would start incurring IPC overhead between NUMA clusters at just 5+ threads (with 4 cores per cluster, a fifth thread already spills onto a second cluster). You can keep adding more threads, but the system will be saturated with inter-NUMA I/O.

To scale well, software must take the NUMA configuration into account. IMHO using separate processes is quite intuitive and allows the OS to effectively manage the NUMA threads. It also gives us the added benefit of distributing the software across computers on a network if we choose to. But obviously you can do it all manually, pinning threads to specific cores and developing your own custom NUMA-aware memory allocators, or you can let the OS distribute them by process; either way achieves a similar result. Personally I'd opt for the multiprocess approach, but you can choose whatever way you want.


RE[5]: Threads
by tylerdurden on Sun 13th Aug 2017 02:28 UTC in reply to "RE[4]: Threads"

Sure. But remember, the whole point of NUMA is not to incur IPC overhead.

I think, if I'm correct, you're viewing threads as basically being at the process level. There, sure message passing makes sense, since you're not dealing with shared address spaces. But that's not what NUMA is trying to deal with.

You only have issues with NUMA when you have a very shitty memory mapping, when every core is referencing contents in another chip, but those pathological cases are rare.


RE[6]: Threads
by Alfman on Sun 13th Aug 2017 05:37 UTC in reply to "RE[5]: Threads"

tylerdurden,

Sure. But remember, the whole point of NUMA is not to incur IPC overhead.


Yes, but only when multithreaded software takes the NUMA topology into account. If you simply increase the number of worker threads to equal the number of cores but don't take the NUMA topology into account, you're going to end up with lots of I/O crossing the NUMA boundaries. It's certainly possible for a developer to build algorithms that avoid this, but it adds a lot of complexity to an already complicated topic. For example, avoid sharing structures between threads in different NUMA regions. Also avoid sharing mutexes/futexes/etc. across NUMA regions, since synchronization primitives based on cache-coherency protocols incur significant overhead when accessed remotely.

A multiprocess design greatly simplifies this and can enforce NUMA boundaries with no additional work. In other words, using a multiprocess design, there's zero implicit IPC between NUMA regions and only explicit IPC calls will trigger IO between NUMA regions.

I think, if I'm correct, you're viewing threads as basically being at the process level. There, sure message passing makes sense, since you're not dealing with shared address spaces. But that's not what NUMA is trying to deal with.


I don't rule out multi threaded entirely, just trying to avoid crossing NUMA boundaries.

So for example, assume we have a dual CPU NUMA configuration where half the RAM is physically connected to one CPU and the other half of RAM is physically connected to the other CPU. Each CPU has four native cores. The pure MT approach would be to run one process with eight threads, but naive MT code will end up with allocations spanning both NUMA regions and resulting in lots of chatter between CPUs having to fulfill each other's remote requests.

It's much more efficient if NUMA CPUs never have to fulfill each other's memory requests unless explicit IPC is taking place. So for this example I would create two processes, each with four multithreaded workers. Since the address space for each of the threads is only ever shared with other workers on the same CPU, there's zero unnecessary NUMA chatter.


You only have issues with NUMA when you have a very shitty memory mapping, when every core is referencing contents in another chip, but those pathological cases are rare.


Not really, you might have 32 threads waiting for a socket operation, but you don't have much control over which thread will actually receive the operation, and so naive code could very easily end up in a pathological case where the data structure it needs is located on the wrong NUMA CPU, which will kill the performance.

In naive MT code, it's very common for all threads to block on shared synchronization primitives, but in a NUMA configuration this produces a pathological case, since all the cores will constantly have to perform I/O across NUMA boundaries to check the mutexes/semaphores.

I think that solving the pathological cases will ultimately require MT algorithms that resemble the multi-process design anyways, so IMHO it makes sense just to start there and not have to reinvent the wheel.


RE[7]: Threads
by tylerdurden on Mon 14th Aug 2017 00:47 UTC in reply to "RE[6]: Threads"

I think the problem is that you're seeing "threads" as full processes, not as the fine-grained streams that NUMA deals with.


RE[5]: Threads
by kwan_e on Sun 13th Aug 2017 04:32 UTC in reply to "RE[4]: Threads"

It also gives us the added benefit of distributing the software across computers on a network if we choose to.


But trying to be too general in your approach will mean getting the worst of both worlds. If the software doesn't require such a thing, it shouldn't pay the cost of the underlying implementation.

That's the problem, the operating system scheduler/mem controller CAN'T deal with it.
...
But obviously you can do it all manually, pinning threads to specific cores and developing your own custom NUMA-aware memory allocators, or you can let the OS distribute them by process; either way achieves a similar result. Personally I'd opt for the multiprocess approach, but you can choose whatever way you want.


To me, that just means the OS should open up a way for a process to say "this bunch of threads/tasks/contexts should be clustered together" and the software can say "these work units are of type X" and the OS can schedule them appropriately. Something like Erlang's lightweight processes?


RE[6]: Threads
by Alfman on Sun 13th Aug 2017 05:51 UTC in reply to "RE[5]: Threads"

kwan_e,

But trying to be too general in your approach will mean getting the worst of both worlds. If the software doesn't require such a thing, they shouldn't pay the cost of the underlying implementation.


I'm not sure what your criticism is specifically, what is it you don't like?


To me, that just means the OS should open up a way for a process to say "this bunch of threads/tasks/contexts should be clustered together" and the software can say "these work units are of type X" and the OS can schedule them appropriately. Something like Erlang's lightweight processes?


Sure, you could bundle some threads together, and then write code such that those threads avoid sharing memory or synchronization primitives with other bundles, and then make sure network sockets are only accessed by threads in the correct bundle associated with the remote client. This is all great, but it should also sound very familiar! We've basically reinvented the "process" ;)


RE[7]: Threads
by kwan_e on Sun 13th Aug 2017 08:55 UTC in reply to "RE[6]: Threads"

I'm not sure what your criticism is specifically, what is it you don't like?


Having programs that can be offloaded onto the network is fine, but it is not necessary. Taking advantage of it would affect a program's design in a way that makes it substandard for its common use case.

"Something like Erlang's lightweight processes?
This is all great, but it should also sound very familiar! We've basically reinvented the "process" ;) "

Pretty sure lightweight processes a la Erlang aren't processes. Context switching between OS processes is much more expensive than switching between those lightweight processes.

And also, why not have multiple levels of automated task management? The top level is the process, but why stop at one level? OS-level processes are there for security purposes, and one could argue that putting other responsibilities onto that one abstraction is inefficient.


My thoughts on this
by ahferroin7 on Mon 14th Aug 2017 12:34 UTC

This push for parallelization over raw speed is great for certain use cases. For an average desktop user, though, you really don't need more than 4 cores with 2 threads apiece. In fact, outside of certain use cases such as simulations, virtualization, and multimedia editing, it's pretty serious overkill. Most desktop apps aren't heavily multi-threaded because it makes no sense to do so. You don't need a Twitter client or a mail reader to use more than a few threads, and even then it's I/O parallelization you need, not computational parallelization, and that doesn't require more cores, just better coding.

For games, yeah it would be nice if they better utilized system resources, but many of them are already pretty heavily parallelized, they just do most of the work on the GPU. For a case like that where you're doing most of your processing on the GPU, it doesn't make sense to use more than a few threads on the CPU. You don't need to parallelize input processing or network access internally, and just like with regular desktop apps, disk access doesn't need more cores to be more parallel.

The reality is that most stuff other than games that benefits from parallelization already does a pretty good job of it, especially on non-Windows systems. Modeling, CAD, and simulation software has been good at it for decades now. Virtualization, emulation, and traditional server software has also been pretty good at this for years. Encoding software (both multimedia and conventional data compressors) could be a bit better, but for stream compression there are practical limits to how much you can parallelize things, and that's about the only thing. In fact, I'd argue that the push for more cores over higher speed is a reflection of the demands of modern software (think machine learning), not the other way around.

Aside from all that, while I would love a Threadripper system from a bragging rights perspective, in all seriousness I don't need one. I upgraded the system I would put this in from a Xeon E3-1275 v3 to a Ryzen 7 1700 about a month after launch, and that alone was enough that I don't need any more processing power. The system in question is running anywhere between 10 and 30 VMs, plus 8 BOINC jobs, Syncthing, GlusterFS (both storage and client), and distcc jobs for a half dozen other systems, and despite all that I still have no issue rebuilding the system (I run Gentoo) in a reasonable amount of time while all that is going on. In fact, the only issues I have that aren't just minor annoyances are all related to AMD not releasing full datasheets for Zen, and thus would be issues with a Threadripper CPU too.
