Date: 2021-10-25
On the CPU side, doubling up on the performance cores is an evident way to increase performance – the competition also does so with some of their designs. What Apple does differently is that it not only scaled the CPU cores, but everything surrounding them. It’s not just 4 additional performance cores, it’s a whole new performance cluster with its own L2. On the memory side, Apple has scaled its memory subsystem to never before seen dimensions, and this allows the M1 Pro & Max to achieve performance figures that simply weren’t even considered possible in a laptop chip. The chips here aren’t only able to outclass any competitor laptop design, but also compete against the best desktop systems out there. You’d have to bring out server-class hardware to get ahead of the M1 Max – it’s just generally absurd.
On the GPU side of things, Apple’s gains are also straightforward. The M1 Pro is essentially 2x the M1, and the M1 Max is 4x the M1 in terms of performance. Games are still in a very weird place for macOS and the ecosystem. Maybe it’s a chicken-and-egg situation, or maybe gaming is still something of a niche that will take a long time to make use of the GPU performance the new chips are able to provide. What’s clearer is that the new GPU does allow immense leaps in performance for content creation and productivity workloads which rely on GPU acceleration.
These are excellent processors and GPUs, especially when taking their power consumption into account. Sure, a lot of it is optimised only for Apple’s approved frameworks and applications, but if you’re deep into the Apple ecosystem, these are simply no-brainer machines for any creator.
I’m in a very niche segment here, really aiming to get one of these processors. I do a ton of HPC and I have backends for our code using CUDA, OpenCL and Metal. Our GPU code is at least 2 orders of magnitude faster than a top-class Xeon processor. The investment in Metal came precisely when Apple started to announce plans to leave OpenCL behind. At the time, the transition to Apple Silicon wasn’t fully clear, but as Mac systems are important in my work, we did the work to support Metal on AMD GPUs, and the results were quite good. An AMD Pro W6800 using accelerated computing with Metal is neck and neck in performance with an Nvidia RTX A6000 with CUDA. The RTX A6000 edges out the W6800 in the sense that it has 48 GB of RAM vs 32 GB on the W6800. In my work, every GB on the GPU counts dramatically. When the M1 was announced, we were quite excited, as plans started to appear that those systems would eventually move to high-end versions, increasing the memory available to the GPU.
Early benchmarks place the M1 Max in the ballpark of the RTX 3070-3080, which I’d take any day as long as I have more memory, and that is where the 64 GB version comes into play. I’d even take the lower performance of the RTX 20XX family just to have more memory. So I already ordered one system last Monday, seconds after the Apple Store started taking orders. If the M1 Max is truly in the ballpark of the RTX 3070-3080… that is a huge accomplishment for Apple. And I can’t wait to get whatever processor may come down the pipeline for the Mac Pro. Beyond video specialists, who will benefit out of the box, I can see the scientific community losing their minds over the possibilities that this new generation of CPU+GPU may bring. If I were NVidia, I’d be quite nervous that Apple may get into their market of HPC with GPUs. NVidia has pretty much total dominance there, but it relies on insanely expensive and complex technology to interface its GPUs to provide more memory, while still depending on a high-end CPU to sync all the GPU work and memory transfers. Exciting times.
I don’t think NVIDIA is losing any sleep over Apple in regards to HPC.
It’s just not a market Apple is interested in at all. Plus CUDA is too entrenched in HPC and the datacenter.
The amount of power Apple is packing on their SoCs is disruptive, IMO, but mostly for creative applications. For video editors based on ProRes, I read the M1 Max can do several concurrent 8K timelines. Which is crazy on a mobile form factor. Basically you have the editing right next to where the content is created.
I’m interested in seeing where Apple is going with the Mac Pro, as they can still double the die area on the current 5nm process, and can put 2 of those huge dies on the same package using the new TSMC interposer. So you could have a desktop SoC with 40 cores and 64 GPU cores, and 256GB of on-package memory. Which would be nuts.
Most people don’t need more processing or GPU power. You may as well offload all of that to an open standard interface running on an external box, distributed network, or remote provider such as another business with spare capacity or a “cloud” provider.
Of course they won’t do that because then, like washing machines, your laptop/desktop would last 10-20 years or more.
The question is the API.
The GPU is really powerful on paper. And I am sure they will integrate well with Adobe products.
On the other hand, OpenGL and OpenCL were officially deprecated a few years ago: https://appleinsider.com/articles/18/06/04/opengl-opencl-deprecated-in-favor-of-metal-2-in-macos-1014-mojave
And obviously CUDA, DirectX are not even in the picture. And Vulkan only comes via a third party: https://www.phoronix.com/scan.php?page=news_item&px=Apple-Silicon-Vulkan-MoltenVK
And Proton/Wine will only work through emulation: https://www.reddit.com/r/macgaming/comments/lyv35d/future_of_steam_on_the_m1/
That leaves out many opportunities for open programming on that platform. Yes, “Metal” is available, but will people really spend the additional effort to optimize on that platform? Or will they just port the software to bare minimum essentials?
I am once again on the brink of ordering an Apple product, and once again, I decide it is not worth the effort. No proper Linux, no more Windows and no public APIs mean I would be pretty much locked in to whatever Apple decided to give us.
The only hope is that this would be the proper kick for other platforms to actually clean up their acts.
As someone who went down the rabbit hole of Metal for compute kernels… if you are versed in CUDA and OpenCL, the differences in programming are very minimal for regular kernels. Programmatically, all 3 (CUDA, OpenCL and Metal) are just variants on “how to prepare data for the GPU, copy data back and forth to the GPU and run a kernel”. The biggest innovation in Metal on Apple Silicon is the opportunity to completely eliminate the data transfers. It is still to be confirmed whether a Metal compute kernel can directly access memory coming from, for example, a Numpy array; that would be so awesome. There are a couple of examples of how you can interface Swift+Metal with Python, and from there, you have a ton of possibilities. The big missing pieces in Metal are the more advanced libraries, such as the FFT that really shines in CUDA, which not even AMD offers for OpenCL.
There are some important considerations, but if you have worked on OpenCL and CUDA, you will know the devil is in the details. For example, some OpenCL driver implementations can’t address more than 32 bits of memory (4 GB), which is stupid when your card has more than that these days. Please note this 32-bit limitation is only for OpenCL kernels, not for other GPU operations. I have that problem with OpenCL on Macs, which is quite annoying, but that limitation is not present with Metal: you can access all 32 GB of RAM in a GPU fully dedicated to computing in an external enclosure that is not displaying anything.
Metal has a weird limitation: you have up to 32 input buffer arrays, where each can’t be more than 2.5 GB in size (go and ask S Jobs in his grave why that weird limit). Still, 32 x 2.5 GB = 80 GB of buffer memory for GPU calculations, which is more than the M1 Max has in total memory. Metal (as CUDA did back in its early days) has had important upgrades in its first iterations, so I’d not be surprised if these limitations were relaxed as newer generations of Apple Silicon continue to evolve.
One limitation so far is that Apple has not invested much (unlike Nvidia with CUDA) in providing more advanced compute examples with Metal. But I can see the open-source community starting to jump on board now that there is a real motivation to put in the effort. I can see from a mile away that a pyMetal will be developed and sit next to pyOpenCL and pyCuda, and from there, it is going to be serious fun.
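For anyone who wants to check the buffer arithmetic in the comment above, here is a minimal Python sketch. It only restates the figures quoted in this thread (32 buffers, 2.5 GB each, a 64 GB M1 Max, a 32-bit addressing limit); none of them are taken from Apple or driver documentation, decimal gigabytes are an assumption, and the constant and function names are made up for illustration.

```python
# Figures as quoted in the comment above; they are NOT verified against
# Apple's Metal documentation. All names here are hypothetical.
GB = 10**9    # assumption: decimal gigabytes
GIB = 2**30   # binary gibibytes, used for the 32-bit addressing limit

MAX_BUFFERS = 32                   # input buffer slots per kernel, as quoted
MAX_BUFFER_BYTES = 25 * GB // 10   # 2.5 GB per buffer, as quoted
M1_MAX_MEMORY_BYTES = 64 * GB      # largest M1 Max configuration in the thread


def total_buffer_bytes(buffers=MAX_BUFFERS, per_buffer=MAX_BUFFER_BYTES):
    """Upper bound on memory reachable through the input buffers."""
    return buffers * per_buffer


def addressable_bytes(pointer_bits):
    """Bytes reachable with byte-addressed pointers of the given width."""
    return 2 ** pointer_bits


if __name__ == "__main__":
    print(total_buffer_bytes() // GB, "GB reachable through buffers")
    print(addressable_bytes(32) // GIB, "GiB reachable with 32-bit addressing")
    print("exceeds M1 Max memory:", total_buffer_bytes() > M1_MAX_MEMORY_BYTES)
```

If the quoted limits are right, the buffer cap is above the total memory of the machine, so the per-buffer size (not the buffer count) is what a large single array would run into first.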
SamAskani,
I’ve done both CUDA and OpenCL. While they’re variations on the same ideas, they’re not really the same and there’s a lot of nuance. Unless you stick to the high level abstraction libraries (or write your own), I’d agree with sukru that it’s not necessarily a trivial port, especially if you already have a lot of code invested in one platform. Unlike portable C code, you’re now looking at writing, debugging, optimizing, and supporting bug reports for multiple targets. It’s not like a normal software port that can use the same code everywhere. And because these targets support different sets of features, it may not end up being a one to one code mapping, meaning the code may end up being vastly different. I guess you can use the lowest common denominator, but you may be missing out on some of the interesting GPGPU features.
While the M1 Max can offer good specs for a laptop, there’s no upgrade path today if you want to transfer your GPGPU work to a desktop with more cores and more powerful GPUs. In the past Apple users could plug in eGPUs and get more performance that way, but Apple does not support this with any M1 Macs. So for the moment at least it seems like the M1 Max is the highest you can go. With any other platform there are upgrade paths with higher performing GPUs for consumers and even enterprise.
I’m still skeptical because the benchmarks in the article still had a lot of gaps. Beyond the synthetics, only a couple of real world cross platform applications were benchmarked and they didn’t perform anywhere near as well as the specs or synthetic benchmarks suggested. We don’t know whether these are representative of real world performance in general, or whether they could just be anomalies. Time will tell as we get more numbers in.
Still, I agree that we need more competition. It pushes everyone to try harder. I heard that when the RTX 3080 Ti was nearing market, Nvidia was very close to releasing it with 20GB of RAM and had working prototypes and everything, anticipating a breakthrough by AMD that didn’t happen, so they lowered the specs.
https://www.tweaktown.com/news/81517/nvidia-geforce-rtx-3080-ti-with-20gb-is-real-100mh-mining-power/index.html
This is why we need competition.
The easiest way to explain why Apple’s computers (with M1, M1 Pro & M1 Max) are so fast is:
On most computers, every part of the computer (CPU, GPU, etc) is literally a separate piece on the motherboard, with wires connecting them all together and to the RAM in the computer. Anytime the CPU feels that the GPU needs to be used, everything about that file has to move from normal RAM to VRAM (video RAM), and the same is true for Machine Learning and everything else. Everything has its own memory, and moving things around takes a lot of power, which is why computers, for the most part, get so hot. The less efficient the computer, the hotter it gets.
Meanwhile … with the new Apple computers, once something is placed in memory, that thing never moves during the time it is being used. It stays in what is called “Unified Memory”, which just means you don’t need to move files from one type of memory to another. Since the main thing that creates heat in a computer no longer happens, the computer is a lot cooler. It also makes it much faster since you don’t have to wait for part or all of a file, especially gigantic files, to finish moving. The computer just tells the program the address of where the file is and it just goes there and uses it.
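The difference described above can be mimicked in plain Python, as a rough analogy only. No GPU or Metal is involved here: a `bytearray` copy stands in for the RAM to VRAM transfer, a `memoryview` stands in for unified memory, and the function names and the byte-doubling "kernel" are invented for the illustration.

```python
# Analogy only: plain Python, no GPU. Function names are hypothetical.

def run_with_copies(host_data):
    """Separate-memory style: copy in, compute on the copy, copy back.

    Returns the result and the number of bytes that had to be moved.
    """
    device_buffer = bytearray(host_data)       # copy 1: "RAM" -> "VRAM"
    for i in range(len(device_buffer)):
        device_buffer[i] = (device_buffer[i] * 2) % 256   # stand-in kernel
    result = bytes(device_buffer)               # copy 2: "VRAM" -> "RAM"
    return result, 2 * len(host_data)


def run_in_place(shared):
    """Unified-memory style: work through a view of the same bytes.

    Returns the number of bytes moved, which is zero by construction.
    """
    view = memoryview(shared)                   # same memory, no copy made
    for i in range(len(view)):
        view[i] = (view[i] * 2) % 256           # writes land in `shared`
    return 0


if __name__ == "__main__":
    data = bytearray([1, 2, 3, 4])
    result, moved = run_with_copies(data)
    print(list(result), "bytes moved:", moved)

    shared = bytearray([1, 2, 3, 4])
    moved = run_in_place(shared)
    print(list(shared), "bytes moved:", moved)
```

Both paths produce the same numbers; the only difference is how many bytes were shuffled around to get there, which is the point the comment is making.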
Another way of putting it would be like this.
Let’s say you have 7 brothers and sisters. It doesn’t matter how many are which type. Just remember that they all do different things.
Now, if you need all 7 siblings to do something, the non-Apple way would be to grab the boxes that the stuff is in and carry them from room to room. Let’s say there are 80 boxes, which is enough to make this seem very horrible to have to physically do.
As you physically move the 80 boxes from one room to another, you get very tired from exerting a LOT of energy, which makes you all hot and sweaty.
Wouldn’t it just be a LOT smarter to leave the boxes in one place and just have all your brothers and sisters come and go through the boxes and do whatever they need to do without having to move any of them? They can move stuff around in the boxes, they just wouldn’t move the boxes as a whole unless they were moving them to a USB drive or something.
Nobody but nobody would get hot and sweaty in a room with plenty of space and fans blowing at the perfect speed to keep you nice and cool and happy and non-sweaty since you aren’t moving a whole lot of big boxes from room to room to room.