Linked by Thom Holwerda on Fri 13th Apr 2018 22:50 UTC, submitted by Doeke
Hardware, Embedded Systems

Cloudflare, which operates a content delivery network it also uses to provide DDoS protection services for websites, is in the middle of a push to vastly expand its global data center network. CDNs are usually made up of small-footprint nodes, but those nodes need to be in many places around the world.

As it expands, the company is making a big bet on ARM, the emerging alternative to Intel’s x86 processor architecture, which has dominated the data center market for decades.

The money quote from Cloudflare's CEO:

"We think we're now at a point where we can go one hundred percent to ARM. In our analysis, we found that even if Intel gave us the chips for free, it would still make sense to switch to ARM, because the power efficiency is so much better."

Intel and AMD ought to be worried about the future. Very worried. If I were them, I'd start work on serious ARM processors - because they're already missing out on mobile, and they're about to start missing out on desktops and servers, too.

The real problem...
by galvanash on Fri 13th Apr 2018 23:37 UTC
galvanash
Member since:
2006-01-25

It's honestly not an ARM vs x86 thing...

Really. Intel has made x86 cores (atom) that are both performance and power competitive with ARM. They mostly failed in the marketplace historically, but most of that failure was in mobile. The Xeon C3000 series is very price, performance, and power competitive with anything coming to servers using ARM (on a per core basis).

The real problem isn't Intel sticking with x86; it's that they package their cores wrong, or at least haven't figured out how to package them right.

The latest top-end Atom-based Xeon C series has 16 cores/16 threads and runs at 2.1 GHz with a 32W TDP. The ARM-based Cavium ThunderX has 48 cores/48 threads at 2.5 GHz with a 120W TDP. So basically it has 3x the thread count at a little under 4x the TDP, i.e. it's roughly a wash from a power-per-thread point of view.

So why does no one use these chips and instead flock to Cavium?

Density.

You can fit 4 dual-socket ThunderX nodes (96 cores each), each with a terabyte of RAM, into a 2U chassis. That's 384 cores and 4TB of RAM. Intel has nothing remotely this dense. The whole thing is probably sucking down 1000W fully loaded, but that is significantly better than most 4-node Xeon 2U servers, and you get 296 extra cores... Even if you take hyperthreading into account (which doesn't help on all workloads), you still have the ability to run about 200 more threads on a Cavium box.

It's not ARM being more power efficient; it's that Intel isn't servicing the market that Cavium is - the guys who need the maximum number of threads in the minimum amount of space at low power. It doesn't matter much that the Cavium machines are slower on a per-thread basis when you get almost double the number of cores per square inch of rack space (at similar power efficiency).
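The arithmetic behind this argument can be sanity-checked in a few lines (a sketch using the nominal core counts and TDPs quoted in this thread, not measured power draw):

```python
# Rough threads-per-watt and rack-density check, using the figures
# quoted above (nominal TDPs only; real-world draw will differ).
xeon_c3000 = {"threads": 16, "tdp_w": 32}   # Atom-based Xeon C series, 2.1 GHz
thunderx   = {"threads": 48, "tdp_w": 120}  # Cavium ThunderX, 2.5 GHz

def threads_per_watt(chip):
    return chip["threads"] / chip["tdp_w"]

print(threads_per_watt(xeon_c3000))  # 0.5
print(threads_per_watt(thunderx))    # 0.4

# Density: 4 dual-socket ThunderX nodes in a 2U chassis
nodes, sockets_per_node = 4, 2
total_cores = nodes * sockets_per_node * thunderx["threads"]
print(total_cores)  # 384
```

So per thread the two are within ~25% of each other on paper, which is why the argument above hinges on density rather than efficiency.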

From a technical perspective I see no real reason why Intel couldn't build similarly dense Atom-based Xeons (and probably at a lower TDP to boot) - they just don't. I haven't a clue why at this point.

If they can put 24 high-end cores running at 3.4 GHz in a single chip, I don't understand why they can't put at least twice as many Atom cores into one (or even more).

Until they figure out how to do that, they are going to lose customers to ARM, not because of power efficiency, but because of density.

P.S. Cloudflare seems to be going with Qualcomm Centriq-based ARM servers instead of Cavium, but the basic argument is exactly the same (both are 48 cores per CPU).

Edited 2018-04-13 23:49 UTC

Reply Score: 10

RE: The real problem...
by Kochise on Sat 14th Apr 2018 16:26 UTC in reply to "The real problem..."
Kochise Member since:
2006-03-03

Perhaps going the ARM path also ensures better competition, instead of depending on Intel's duopoly (x86 and fabs). I think the x86 legacy cost is holding us back: whatever good Atom/Xeon implementation you may have, you'll still depend on Intel, and perhaps AMD, to deliver performance in a market segment that doesn't need to rely on Windows, because data servers can run on almost anything, provided they follow some standards.

Reply Score: 2

RE: The real problem...
by tidux on Sat 14th Apr 2018 21:36 UTC in reply to "The real problem..."
tidux Member since:
2011-08-13

> From a technical perspective I see no real reason why Intel couldn't build similarly dense atom based Xeons

They can't get SMP scale-out on a single die to work well enough. Even AMD's Ryzen/EPYC line was a game-changer for x86 because of how many threads it puts on one chip. ARM chip vendors don't have nearly 40 years of IBM PC history weighing them down with extra silicon, so they're free to build smaller cores in more novel configurations.

Reply Score: 3

RE[2]: The real problem...
by Treza on Sat 14th Apr 2018 23:00 UTC in reply to "RE: The real problem..."
Treza Member since:
2006-01-11

Except that, as the article indicates:
“Every request that comes in to Cloudflare is independent of every other request, so what we really need is as many cores per Watt as we can possibly get,”

It is not really SMP, or only an easy form of it with very little data sharing between cores. Maintaining coherency between tens or hundreds of cores is power-hungry and inefficient: you need buses carrying lots of coherency traffic and large many-ported caches, and coherency adds latency...

Of course, the arguably simpler ARM architecture compared to x86 and the many cores available (proprietary from Apple, Qualcomm and others or from the ARM catalog) allows lots of flexibility.

Cloudflare may even one day ask for custom CPUs, with more networking interfaces, minimal floating point performance, some special accelerator for their niche...

Reply Score: 4

RE[3]: The real problem...
by Alfman on Sun 15th Apr 2018 03:21 UTC in reply to "RE[2]: The real problem..."
Alfman Member since:
2011-01-28

Treza,

It is not really SMP, or only an easy form of it with very little data sharing between cores. Maintaining coherency between tens or hundreds of cores is power-hungry and inefficient: you need buses carrying lots of coherency traffic and large many-ported caches, and coherency adds latency...



Obviously shared state is a bottleneck. SMP quickly reaches diminishing returns. NUMA is more scalable, but it is harder for software to use NUMA effectively if the software was designed with SMP in mind.

I think CPU architectures with explicit IO rather than implicit coherency could increase the hardware performance, especially with good compiler support, but it would require new software algorithms and break compatibility so it would be unlikely to succeed in the market.

I think the way forward will be hundreds or thousands of independent cores, like you say, functioning more like a cluster of nodes than like SMP cores with shared memory.

I can see such a system benefiting from a very high-speed interconnect, which will serve a similar function to Ethernet but offer much faster and more efficient IO between nodes. Since a fully connected mesh becomes less feasible at high core counts, we'll likely see more software algorithms evolving to support native (high-performance) mesh topologies. Most of these algorithms will be abstracted behind libraries. For example, we'll probably see sharded database servers that expose familiar interfaces but distribute and reconstruct data across the mesh at record speed.
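The sharding idea sketched above is usually built on consistent hashing, so that adding a node only remaps a small fraction of keys. A minimal sketch (node names are purely illustrative):

```python
import hashlib
from bisect import bisect

def _h(s: str) -> int:
    # Stable hash for placing keys and nodes on the ring.
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class ShardMap:
    """Consistent-hash ring mapping keys to independent nodes."""
    def __init__(self, nodes, vnodes=64):
        # Several virtual points per node smooth out the distribution.
        self.ring = sorted((_h(f"{n}#{i}"), n)
                           for n in nodes for i in range(vnodes))
        self.keys = [h for h, _ in self.ring]

    def node_for(self, key: str) -> str:
        # First ring point at or after the key's hash, wrapping around.
        idx = bisect(self.keys, _h(key)) % len(self.keys)
        return self.ring[idx][1]

shards = ShardMap([f"node{i}" for i in range(8)])
print(shards.node_for("user:12345"))  # deterministic node choice
```

A library exposing a familiar key/value interface could route every operation through a map like this, with the mesh interconnect carrying the cross-node traffic.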


I for one am excited by the prospects of such highly scalable servers!

Edited 2018-04-15 03:22 UTC

Reply Score: 3

RE[4]: The real problem...
by Treza on Sun 15th Apr 2018 12:43 UTC in reply to "RE[3]: The real problem..."
Treza Member since:
2006-01-11

And let's call these massively parallel architectures, with huge memory bandwidth, hundreds of cores and multithreading to hide memory latency...

GPGPUs !!!

Reply Score: 3

RE[5]: The real problem...
by Alfman on Sun 15th Apr 2018 14:03 UTC in reply to "RE[4]: The real problem..."
Alfman Member since:
2011-01-28

Treza,

And let's call these massively parallel architectures, with huge memory bandwidth, hundreds of cores and multithreading to hide memory latency...

GPGPUs !!!



Obviously GPGPUs have their uses, but they target different kinds of problems. For Cloudflare's example of web hosting, a massively parallel GPGPU isn't very useful, but a massively parallel cluster is.

In the long term, FPGAs could eventually unify GPUs and CPUs so that we no longer have to consider them different beasts for different workloads. Instead of compiling down to a fixed instruction set architecture, software can be compiled directly into transistor logic.

I'm not happy with the price of GPUs these days, so I think there may be an opportunity for FPGAs to grow out of a niche status to become more of a commodity. However, IMHO, it will be many years before software toolchains are actually ready to target FPGA. What we have is a sort of chicken and egg problem.

Reply Score: 4

RE[6]: The real problem...
by tidux on Mon 16th Apr 2018 04:47 UTC in reply to "RE[5]: The real problem..."
tidux Member since:
2011-08-13

FPGA toolchains are so proprietary they make Microsoft look like Richard Stallman. That has to change before they can get any real use in general computation.

Reply Score: 0

RE[7]: The real problem...
by Alfman on Mon 16th Apr 2018 06:47 UTC in reply to "RE[6]: The real problem..."
Alfman Member since:
2011-01-28

tidux,

FPGA toolchains are so proprietary they make Microsoft look like Richard Stallman. That has to change before they can get any real use in general computation.


Yeah, I'm pretty sure this could be addressed by FOSS projects, but obviously we're not there yet. If the industry wants to fight FOSS, that would be a shame, and it might well hurt access, especially for smaller developers.

Reply Score: 2

RE: The real problem...
by gilboa on Sun 15th Apr 2018 08:51 UTC in reply to "The real problem..."
gilboa Member since:
2005-07-06

An ARM core != a Xeon core, so counting cores as a metric is rather useless. (In my experience, high-end AArch64 cores perform at ~25-30% of a Broadwell/Skylake core, but YMMV.)

Moreover, a Supermicro Big Twin (4 x 2S x Xeon Gold 6152) can pack 160 cores, 320 threads, and 6TB of RAM into 2U (224 cores / 448 threads and 12TB of RAM if you opt for the far more expensive Xeon Platinum 6176/6180Ms), and should be ~2-3x faster (again, YMMV) than a Cavium-based machine.

Now, I've added the YMMV a couple of times, and for a good reason.
ARM has two advantages (and density is *not* one of them).
1. Price per transaction. Intel's Xeon pricing, especially for the high-end parts and the M parts, is unreasonable. AMD might be able to pull another Opteron and force Intel to lower prices, but that remains to be seen.
2. Power per transaction. ARM cores are more efficient. If your application requires a lot of slow threads and you have a limited power budget, ARM is definitely the answer.

- Gilboa

Edited 2018-04-15 08:52 UTC

Reply Score: 3

RE[2]: The real problem...
by viton on Sun 15th Apr 2018 13:36 UTC in reply to "RE: The real problem..."
viton Member since:
2005-08-09

(In my experience, high-end AArch64 cores perform at ~25-30% of a Broadwell/Skylake core, but YMMV.)

So what “high-end” ARM did you test, and how?
Do you have an experience with Centriq or ThunderX2?
ThunderX was really weak.

Reply Score: 1

RE[3]: The real problem...
by gilboa on Mon 16th Apr 2018 06:52 UTC in reply to "RE[2]: The real problem..."
gilboa Member since:
2005-07-06

Tegra TX1, and a very short time with the ThunderX (which, as you point out, has very weak cores).


We plan to test the ThunderX2 when we have some free time (and when it's freely available).

Please note that our proprietary application is heavily CPU/cache/memory-bandwidth limited and has zero acceleration potential, so (even) the ThunderX2's limited inter-core/CPU interconnect bandwidth might be a major performance handicap.

- Gilboa

Edited 2018-04-16 06:53 UTC

Reply Score: 3

RE[4]: The real problem...
by viton on Tue 17th Apr 2018 03:25 UTC in reply to "RE[3]: The real problem..."
viton Member since:
2005-08-09

Please note that our proprietary application is heavily CPU/cache/memory-bandwidth limited and has zero acceleration potential

Centriq 2460 has 60MB of L3 cache and 120GB/s of memory bandwidth.
ThunderX2 has Haswell-level performance, 33MB(?) of L3, and 170GB/s (theoretical), which is higher than any Intel part.

What compiler do you use?

so (even) the ThunderX2's limited inter-core/CPU interconnect bandwidth might be a major performance handicap.

This is definitely a sign of a non-multicore-friendly workload / programming practices.

Edited 2018-04-17 03:25 UTC

Reply Score: 2

RE[2]: The real problem...
by galvanash on Sun 15th Apr 2018 18:37 UTC in reply to "RE: The real problem..."
galvanash Member since:
2006-01-25

An ARM core != a Xeon core, so counting cores as a metric is rather useless. (In my experience, high-end AArch64 cores perform at ~25-30% of a Broadwell/Skylake core, but YMMV.)



Atom cores are more or less in the same ballpark as ARM cores. Broadwell and Skylake are not. They pretty much disqualify themselves from this discussion by using too much power, and they perform much better single-threaded... ARM servers don't really compete with Broadwell/Skylake, and don't even really try to (yet).

More ever, a Supermicro Big Twin (4 x 2S x Xeon Gold 6152) can pack 160 cores and 320 threads and 6TB RAM in 2U. (224 cores / 448 threads and 12TB RAM if opt for the far more expensive Xeon Platinum 6176/6180Ms) and should be ~2-3 faster (again, YMMV) compared to a Cavium based machine.


I was using Atom-based Xeons in my example. Why are you bringing up machines that literally cost 10x-15x as much and use many times as much power? My whole post was about competing with ARM - Atom-based Xeons compete with ARM (or at least try to). High-end Xeons cost way too much, use too much power, etc. - it isn't the same market at all.

So let me clarify... I thought the context was obvious in my post, but maybe not. Intel has nothing remotely as dense as Cavium/Centriq with competitive power/core and cost/core. My argument is simply that they could if they wanted to using atom cores - they don't need to switch to ARM to compete...

Reply Score: 2

RE[3]: The real problem...
by gilboa on Mon 16th Apr 2018 08:18 UTC in reply to "RE[2]: The real problem..."
gilboa Member since:
2005-07-06

Density.

You can fit 4 dual-socket ThunderX nodes (96 cores each), each with a terabyte of RAM, into a 2U chassis. That's 384 cores and 4TB of RAM. Intel has nothing remotely this dense. The whole thing is probably sucking down 1000W fully loaded, but that is significantly better than most 4-node Xeon 2U servers, and you get 296 extra cores... Even if you take hyperthreading into account (which doesn't help on all workloads), you still have the ability to run about 200 more threads on a Cavium box.


You talked about density, which usually translates to MIPS per U.
You claimed that Intel has nothing remotely close (your words, not mine) to ARM's density.
I proved otherwise.

A yet-to-be-released high-end Cavium ThunderX2-based solution can "shove" 2 x 48 x 4 (384 cores) into 2U and requires ~190W per socket.
An already-shipping Intel Xeon Platinum-based solution can pack 224 fast cores (448 threads) into 2U and requires ~165W per socket (205W if you go super-high-end).
An already-shipping AMD EPYC-based solution can pack 256 cores (512 threads) into 2U and requires ~180W per socket.
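Working out threads-per-watt for the three 2U configurations above makes the point concrete (a sketch using the nominal per-socket TDPs quoted here; all three are 4-node, dual-socket boxes, and real draw varies):

```python
# Threads-per-watt for the 2U systems quoted above (nominal TDPs).
systems = {
    "ThunderX2":     {"threads": 384, "sockets": 8, "w_per_socket": 190},
    "Xeon Platinum": {"threads": 448, "sockets": 8, "w_per_socket": 165},
    "EPYC":          {"threads": 512, "sockets": 8, "w_per_socket": 180},
}

for name, s in systems.items():
    total_w = s["sockets"] * s["w_per_socket"]
    print(f"{name}: {total_w} W, {s['threads'] / total_w:.3f} threads/W")
```

On these figures the x86 boxes actually come out slightly ahead per watt, which supports the claim that density is not where ARM wins.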

As this product is still soft-launched, pricing information is not available, but if the ThunderX 1 is any indication, pricing will be ~40-50% of a comparable AMD/Intel-based solution (a far cry from your 10-15x claim).

- Gilboa

Edited 2018-04-16 08:18 UTC

Reply Score: 3

RE[4]: The real problem...
by galvanash on Mon 16th Apr 2018 16:15 UTC in reply to "RE[3]: The real problem..."
galvanash Member since:
2006-01-25

The Xeon 8180 is an $11k chip. The ThunderX2 is (at most) a $2k chip - pricing info is still hard to find, but it is likely about the same as the ThunderX (which was around $800).

https://www.anandtech.com/show/10353/investigating-cavium-thunderx-4...

That's $90k vs $12k just on the CPUs alone. Cavium motherboards will obviously be far cheaper (it's an SoC, so they are far simpler), and cooling/power components will be cheaper as well. The rest of the components are irrelevant, as they are not platform-specific for the most part.

10x-15x could be a bit of an overstatement, but it's still at least 5x-10x cheaper to go with Cavium (with far lower power usage on a per-thread basis), and if they really price them the same as the ThunderX (say $1k), the difference really is 10x-15x...
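The CPU-cost comparison above for an 8-socket (4-node, dual-socket) 2U box can be sketched quickly. Since ThunderX2 pricing wasn't public at the time, a range of per-chip guesses is shown rather than a single figure:

```python
# CPU-only cost of an 8-socket 2U box, using the list prices discussed
# above. ThunderX2 per-chip prices are guesses, not published figures.
sockets = 4 * 2                         # 4 nodes, 2 sockets each

xeon_total = sockets * 11_000           # Xeon 8180 at ~$11k each
print(xeon_total)                       # 88000

for guess in (1_000, 1_500, 2_000):     # plausible ThunderX2 prices
    cavium_total = sockets * guess
    print(guess, cavium_total, round(xeon_total / cavium_total, 1))
```

At a $1k guess the ratio is 11x, at $2k it is 5.5x, which is consistent with the "at least 5x-10x" range claimed above.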

As far as performance goes, I think you're missing the point. If you're running a bunch of redis/memcache instances, you don't want all that performance - it's a waste of silicon. You just want a shit-ton of cores with a bunch of cheap memory hanging off of them that occupy as little rack space as possible and use minimal power... This is exactly the kind of thing ARM/Atom is good for.

Why on earth would anyone buy a Xeon Platinum to do this? I'm not arguing that high-end Xeons are bad (hell, they're awesome!) - I'm arguing that low-end Xeons (Atom-based ones) are bad. They are simply built the wrong way to compete in the market they would actually be competitive in. It's not because they are too slow, and it's not because they are too power hungry - it's because they are not dense enough for the market they should be targeting...

The market Cavium primarily targets doesn't care about MIPS/U; they care about threads/U. Latency is all that matters...

Edited 2018-04-16 16:20 UTC

Reply Score: 3

AMD Opteron A
by zdzichu on Sat 14th Apr 2018 06:27 UTC
zdzichu
Member since:
2006-11-07

Opteron A isn't serious?

Reply Score: 2