So has AMD done the unthinkable? Beaten Intel by such a large margin that there is no contest? For now, based on our preliminary testing, that is the case. The launch of AMD’s second generation EPYC processors is nothing short of historic, beating the competition by a large margin in almost every metric: performance, performance per watt and performance per dollar.
Industry analysts have stated that AMD expects to double its share of the server market by Q2 2020, and there is every reason to believe that AMD will succeed. The AMD EPYC is an extremely attractive server platform with an unbeatable performance-per-dollar ratio.
This is one stunning processor family.
I love it, that’s an insane number of cores 🙂
I kind of wish they hadn’t rushed the benchmarks, though; it put Intel at an artificial disadvantage. They should have compared the best tech from each company.
Anyway, I’d love to have one! This is the future of CPU scalability. To support this many cores, they needed to split into more NUMA regions. For data center servers, NUMA gets you much higher parallelism for tasks that don’t need IPC between the regions. This is perfect for VMs and application gateways, where processes execute mostly independently of each other. However, for heavily multithreaded processes, NUMA creates bottlenecks for SMP software. All of the shared data access and synchronization primitives used by multithreaded algorithms can quickly saturate a distributed NUMA memory architecture, and AnandTech’s benchmarks show a fairly severe latency cost for the 128-core 2x AMD EPYC 7742 system. For this reason, I think this type of hardware is best for data center usage, which is good because it costs a lot, haha. While I wouldn’t predict very high scores for gaming, it’d be fun to see those benchmarks nevertheless 🙂
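For anything latency-sensitive, the trick is keeping each worker’s threads and memory on the same node. A minimal sketch of what I mean, assuming Linux with libnuma installed (build with cc -lnuma); the choice of node 0 is purely illustrative:

```c
/* Sketch only: keep a worker's thread and memory on one NUMA node so it
 * never pays the cross-socket latency. Assumes Linux with libnuma
 * installed; node 0 is illustrative. */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }

    /* Run this thread only on node 0's CPUs... */
    if (numa_run_on_node(0) != 0) {
        perror("numa_run_on_node");
        return 1;
    }

    /* ...and allocate its working set from node 0's memory, so nothing
     * crosses the inter-socket link. */
    size_t len = 64UL * 1024 * 1024;
    char *buf = numa_alloc_onnode(len, 0);
    if (buf == NULL) {
        perror("numa_alloc_onnode");
        return 1;
    }
    memset(buf, 0, len);  /* touch the pages so they are actually placed */

    printf("pinned to node 0 of %d nodes\n", numa_max_node() + 1);
    numa_free(buf, len);
    return 0;
}
```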
But it has fewer NUMA regions… That is the thing they fixed. Dual-socket EPYC used to have 8 NUMA domains and 3 different paths with different latencies; now there are only 2 domains/2 paths.
Did I misunderstand something?
Nvm… I’m dumb. I didn’t see the bit about the NPS tuning. Now I get what you were saying. Sorry.
I think it would make sense to treat these kinds of configurations more as a “virtual cluster” than as a high-core-count SMP system. That way, rather than ending up with high-latency SMP, it could be treated as a low-latency cluster! Maybe we could use containers and process pinning (something short of full virtualization) to achieve this? There must be someone doing this already. If it weren’t above my pay grade, I’d love to work on this stuff myself.
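Roughly what I’m picturing, as a sketch only (assumes Linux; the claim that node 0 owns CPUs 0-15 is made up, you’d read the real list from /sys/devices/system/node/node0/cpulist):

```c
/* Hypothetical sketch of the "virtual cluster" idea: confine a worker
 * to one node's CPUs before exec()ing the real workload. Assumes Linux;
 * the mapping "node 0 = CPUs 0-15" is invented for illustration. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);

    for (int cpu = 0; cpu < 16; cpu++)  /* pretend node 0 = CPUs 0-15 */
        CPU_SET(cpu, &set);

    /* pid 0 means "this process"; the mask is inherited across fork()
     * and preserved across exec(), so the whole worker stays put. */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    printf("pid %d confined to node 0's CPUs\n", (int)getpid());
    /* execv("/path/to/worker", ...) would go here in a real setup */
    return 0;
}
```

Containers get you the same effect declaratively through cpuset cgroups; Docker exposes it as --cpuset-cpus and --cpuset-mems.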
tuned and numad are supposed to help with this. They’ll dynamically adjust process placement to get better performance. Of course, building a smarter scheduler into the OS would be a better solution.
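Under the hood, the kind of move numad automates looks something like this sketch (not numad’s actual code; assumes libnuma, and the pid is made up):

```c
/* Rough sketch of the kind of move numad automates (not numad's actual
 * code): push a process's pages from node 1 over to node 0 after
 * deciding the process should live there. Assumes libnuma (-lnuma);
 * the pid 1234 is made up. */
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support\n");
        return 1;
    }

    struct bitmask *from = numa_allocate_nodemask();
    struct bitmask *to   = numa_allocate_nodemask();
    numa_bitmask_setbit(from, 1);  /* pages currently sitting on node 1 */
    numa_bitmask_setbit(to, 0);    /* move them next to node 0's CPUs */

    /* Thin wrapper around the migrate_pages(2) syscall; needs the
     * privileges to touch the target process. */
    if (numa_migrate_pages(1234 /* made-up pid */, from, to) < 0)
        perror("numa_migrate_pages");

    numa_free_nodemask(from);
    numa_free_nodemask(to);
    return 0;
}
```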
Maybe do some asymmetric multiprocessing, something like ARM’s big.LITTLE architecture: each package has a small quad-core cluster for the OS and big cores for jobs. This might only be possible once RAM starts ending up on-die.
Another thought is building a hardware hypervisor into the platform. Old Sun systems could carve a high-core-count SPARC system into multiple domains (I don’t remember the actual name of the technology), and each one would be a bespoke server sharing bare-metal hardware. It was part of the firmware that bootstrapped the system, so the equivalent today would be to build something into UEFI.
So many thoughts, and old examples.
Flatland_Spider,
Indeed, I’m not sure I’ll ever get to work on this, but I’ll bookmark and try to remember it for future reference, thanks!
https://linux.die.net/man/8/numad
This reminds me of the PS3 Cell Processor
https://arcb.csc.ncsu.edu/~mueller/cluster/ps3/doc/CellProgrammingTutorial/BasicsOfCellArchitecture.html
It had a great deal of potential, but developers rarely used it to its full potential. I’m not sure developers will ever warm up to extremely asymmetric hardware designs, even though they can help to address hardware problems.
That would be awesome! It could look just like a real blade server, without requiring all those physical blade modules.
Depends on what system builders do. CPU benchmarks are great, but especially for servers, it really depends on the server configurations that HP, Dell, Supermicro, etc. put together and how they price them. Last time, with the Opterons, the per-CPU pricing was great, but there weren’t a lot of options when it came time to purchase the servers. Also, I’m kind of concerned that for many companies, owning physical servers isn’t something they want. There are still needs for them, but it’s not the default obvious choice anymore to have one’s own hardware.
I love this. In computing terms, I’m of the Athlon XP generation. I saw a market where two companies battled it out for x86, while the G4 chips and SPARC were keeping the market honest.
Now we are seeing that building again, with ARM keeping them on their toes. Again, loving it!