In late 2020, Apple debuted the M1 with Apple’s GPU architecture, AGX, rumoured to be derived from Imagination’s PowerVR series. Since then, we’ve been reverse-engineering AGX and building open source graphics drivers. Last January, I rendered a triangle with my own code, but there has since been a heinous bug lurking:
The driver fails to render large amounts of geometry.
Spinning a cube is fine, low polygon geometry is okay, but detailed models won’t render. Instead, the GPU renders only part of the model and then faults.
It’s a very deep dive into the cause of and fix for this bug, with some sleuthing on top to figure out where it comes from. A very fun and interesting read.
Since the M1 screams in comparison to the Intel CPU/GPU combo, something is obviously working very well. That, plus the low amount of power it consumes and the low amount of heat it generates, is what confuses me. So either something must be VERY right in the way Apple is doing this, OR Apple knows about this and felt “we need to ship this thing and we’ll fix it later”, and the true power of the Apple SoC is about to have a HUGE anchor released from the hardware.
Also, maybe Apple knew about this and their better versions of the chips ran so fast that they were saying, wait a minute, we only have to be 300% faster than Intel to make a big splash. We don’t have to be 3,000% or 30,000% faster than Intel. Maybe Apple is –Purposely– limiting its hardware so that if anyone figures out how to make their system as fast or nearly as fast as Apple’s, they can drop the anchor and say, “well, we found a way of better optimizing…” and here is the FULL power of our SoC.
Also, if there is such a huge bug in Apple’s M1, why hasn’t any other company blown past them? Hmmm. Maybe Apple just didn’t have time to release a better version or they are keeping a much better version held back “just in case”. No matter what, I can’t wait to see what Apple comes up with for the M3 chip in a couple of years.
Sabon,
What data are you basing that on? Relative comparisons are always changing as the vendors put out new products. When the M1 was introduced it beat Intel on single-core performance and lost on multi-core performance. But today it’s the opposite, with Intel CPUs beating Apple on single-core performance by about 6.7%…
cpubenchmark.net/cpu.php?cpu=Intel+Core+i9-12900&id=4729
cpubenchmark.net/cpu.php?cpu=Apple+M1+Ultra+20+Core&id=4782
The multi-core benchmark puts the M1 Ultra at 8% faster than the same i9-12900 CPU, but then they don’t have the same core and thread configuration. Intel also sells much more powerful CPUs for heavy multi-threading that can beat the M1 Ultra’s performance by 50.1%.
cpubenchmark.net/cpu.php?cpu=Intel+Xeon+Platinum+8380+%40+2.30GHz&id=4483
When it comes to multi-core performance I would say AMD comes out ahead of both Apple and Intel by a long shot, with a 163.1% lead over the M1 Ultra…
cpubenchmark.net/cpu.php?cpu=AMD+Ryzen+Threadripper+PRO+5995WX&id=4764
The thing to remember is that when you optimize for one metric, you may have to sacrifice another. You can choose between performance (i.e. single-core or multi-core), efficiency, cost, etc. This is true not only for us as consumers, but for CPU engineers themselves. The point is there is no one-size-fits-all CPU that wins at everything, because there are implicit compromises; asking “which CPU is best” cannot be answered without some application context, and moreover the answer naturally changes with time.
IMHO the M1 achieves a good balance, but objectively we have to admit that it’s not at the top of the performance charts for either single-core or multi-core performance.
Yes, I’d agree that energy efficiency has long been an advantage for ARM as well as the M1 specifically.
You’ll need to supply sources for any of those claims to be taken seriously.
Alfman,
M1 offers a very specific customization. And as you said, there are compromises in every design, and it will fall short in many areas.
However, just by being a known configuration, it can actually have faster software. I have seen this in action in the past. By optimizing for a specific cache configuration, RAM size, and co-processor (GPU/TPU) availability, the same software can easily run 10x faster than its generic version.
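As a rough sketch of what I mean (the BLOCK value below is hypothetical; the point is that on a known chip you can hard-code it to match the real cache instead of guessing for generic hardware):

```c
#include <stddef.h>

/* Naive transpose: one of the two arrays is always walked column-wise,
 * so nearly every access can touch a new cache line. */
void transpose_naive(const float *in, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            out[j * n + i] = in[i * n + j];
}

/* Blocked transpose: BLOCK is chosen so that a BLOCK x BLOCK tile of
 * both arrays fits comfortably in the target core's L1 cache.  The
 * right value is machine-specific -- which is exactly the "known
 * configuration" advantage on a fixed SoC. */
#define BLOCK 64  /* hypothetical tile size, tuned per chip */

void transpose_blocked(const float *in, float *out, size_t n)
{
    for (size_t ii = 0; ii < n; ii += BLOCK)
        for (size_t jj = 0; jj < n; jj += BLOCK)
            for (size_t i = ii; i < ii + BLOCK && i < n; i++)
                for (size_t j = jj; j < jj + BLOCK && j < n; j++)
                    out[j * n + i] = in[i * n + j];
}
```

Both functions compute the same result; the only difference is how well the access pattern matches one particular memory hierarchy.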
I would expect, or rather would be 99% sure, that they would have companies like Adobe optimize for the M1, and those versions will be highly competitive with top-of-the-line Intel or AMD workstation chips.
I remember hearing about M1 versions of that software, but nothing about a Threadripper-optimized Adobe After Effects, for example:
https://www.pugetsystems.com/labs/articles/After-Effects-CPU-performance-AMD-Threadripper-3990X-64-Core-1658/
The Threadripper would probably benefit from splitting those 64 cores into multiple virtual machines and rendering in “distributed” mode, rather than using them all in the same instance, which has a very large overhead.
Okay, I speak too much.
It just comes back to memory access not being O(1) unlike our theoretical models: http://www.ilikebigbits.com/2014_04_21_myth_of_ram_1.html
sukru,
I agree. Some applications do well with massively multicore CPUs, but it turns out that many desktop applications and games don’t benefit from more than a handful of cores. For them, fewer faster cores are better than more slower ones.
This would be a very interesting article to explore on its own. I don’t quite agree with the author’s semantics, which are non-standard and debatable, but regardless of that it certainly is true that local access patterns will improve cache hits, NUMA performance, etc. Things aren’t so simple.
Alfman,
The difference is academic, but there seem to be two approximations for memory access: O(√N) and O(log N). One person here discusses their PhD research: https://news.ycombinator.com/item?id=12383012
(The memory is laid out in a two-dimensional configuration, hence the square root. However, it is also divided into discrete groups like registers, L1/2/3/4 cache, RAM, NUMA nodes, or NVMe, increasing in 2^N sizes, hence log N.)
Memory access is much slower than arithmetic. But at the same time it is parallel, meaning access patterns matter, and affect overall throughput.
Add in the difference between “if()” instructions and arithmetic ones: https://www.youtube.com/watch?v=g-WPhYREFjk. (This talk mentions how doing more work can result in faster execution).
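A rough illustration of the “if()” vs arithmetic point, in the spirit of that talk (my own generic sketch, not code from the talk): counting with a branch invites mispredictions on random data, while folding the comparison into arithmetic does slightly more work per element but gives the branch predictor nothing to get wrong.

```c
#include <stddef.h>
#include <stdint.h>

/* Branchy: the CPU must predict the if() for every element.  On random
 * data the predictor is wrong about half the time, and every miss
 * flushes part of the pipeline. */
size_t count_below_branchy(const int32_t *a, size_t n, int32_t limit)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (a[i] < limit)
            count++;
    }
    return count;
}

/* Branchless: the comparison becomes an ordinary 0/1 value that is
 * simply added, so there is no conditional jump to mispredict --
 * more "work" per element, yet often faster on unpredictable data. */
size_t count_below_branchless(const int32_t *a, size_t n, int32_t limit)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        count += (size_t)(a[i] < limit);
    return count;
}
```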
Not only are “idle” cores a problem; seemingly 100%-utilized cores could just be wasting time waiting for memory or on pipeline flushes.
That is how a “weaker” system can perform better, if there is one person on the team who can do these micro-optimizations.
And I am 99% sure Adobe would have at least one such person optimizing their binaries for the M1.
sukru,
It’s conceivable that early computers and simple microcontrollers actually had constant-time memory access, but it becomes harder to maintain that as memory gets bigger. The DDR RAM commonly found today is not constant access: requests that change DDR banks/rows will be slower, as will random access patterns. The thing about constant time is that we’d have to give up on things like caching as well as data locality (having to activate a new address for every single memory access). Sometimes improving average access times for a given workload is more beneficial than constant access time.
https://www.systemverilog.io/ddr4-basics
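A crude way to actually see this on almost any machine (just a sketch; it needs roughly 768 MB free, and the shuffled pass also streams through the index array, though the random data accesses dominate the difference):

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (64 * 1024 * 1024)  /* 64M ints: far larger than any cache */

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    int *data = malloc(N * sizeof *data);
    size_t *idx = malloc(N * sizeof *idx);
    for (size_t i = 0; i < N; i++) { data[i] = (int)i; idx[i] = i; }

    /* Fisher-Yates shuffle of the index array. */
    srand(1);
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }

    volatile long long sink = 0;
    double t0 = now_sec();
    for (size_t i = 0; i < N; i++) sink += data[i];        /* sequential walk */
    double t1 = now_sec();
    for (size_t i = 0; i < N; i++) sink += data[idx[i]];   /* shuffled walk   */
    double t2 = now_sec();

    printf("sequential: %.3f s, random: %.3f s\n", t1 - t0, t2 - t1);
    free(data);
    free(idx);
    return (int)sink;  /* keep the compiler from discarding the sums */
}
```

Same arithmetic, same bytes read; the entire gap between the two timings comes from the memory system.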
Academically, the author is considering an infinite computer (or at least an arbitrarily large one), where it becomes physically impossible to stuff more bits into a given space without making it physically bigger and also slower. It’s fair to look at the extremes, and doing so clearly puts limits on what can be physically built, which ultimately has real consequences for what is tractable in computer science itself. However, most of us working in practical computer science are dealing with finite problems on finite computers that physically exist, and in that context I would argue the author’s semantics are different from those most people in CS use. And despite his assertions to the contrary, I think these semantic differences account for his disagreements with others more than anyone actually being wrong.
Consider that the simplification isn’t just happening with memory; even arithmetic itself often gets simplified under big-O notation. For example, in a looping algorithm where we perform N loops containing multiplications and divisions, we often chalk it up as O(N); the implied assumption is that those multiplications were all O(1). But a pedantic person might point out that multiplication and division are not truly constant time, and arbitrarily sized inputs will necessarily require more work, because physically the operations depend on the size of the inputs. If we were to allow for arbitrarily sized inputs then we’d have to conclude that our O(N) loop is wrong. The conclusion was always based on the assumption that the hardware’s register sizes were enough for the problem at hand, and this is often true in practice.
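To put that word-size assumption in concrete terms, here is the schoolbook algorithm on numbers stored as arrays of 32-bit limbs (a generic sketch, not tied to any particular library): the work is O(na*nb) limb multiplies, so treating multiplication as O(1) is only valid while the operands fit in a machine register.

```c
#include <stddef.h>
#include <stdint.h>

/* Schoolbook multiplication of two little-endian base-2^32 numbers.
 * a has na limbs, b has nb limbs, out must hold na + nb limbs.
 * The work grows with the operand sizes, unlike a single hardware
 * MUL on word-sized inputs. */
void bigmul(const uint32_t *a, size_t na,
            const uint32_t *b, size_t nb,
            uint32_t *out)
{
    for (size_t i = 0; i < na + nb; i++)
        out[i] = 0;

    for (size_t i = 0; i < na; i++) {
        uint64_t carry = 0;
        for (size_t j = 0; j < nb; j++) {
            uint64_t cur = (uint64_t)a[i] * b[j] + out[i + j] + carry;
            out[i + j] = (uint32_t)cur;
            carry = cur >> 32;
        }
        out[i + nb] += (uint32_t)carry;  /* final carry of this row */
    }
}
```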
Of course if one has a problem at the extremes that doesn’t fit the assumptions then they’d have to revisit any simplifications that were used to check whether they’re still valid or not.
Yes, clearly access patterns make a huge difference, but I would argue that some of those differences are not physically fundamental; they just happen to be byproducts of a specific hardware implementation.
I agree that optimization helps. But in the future I think computers will become better at it than humans.
Alfman,
I think bitmap images are a great example here. Just iterating in row-column versus column-row order will bring out a huge difference. And this would exist for any system with more than one level of memory (i.e. any use of cache at all).
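Something as small as this shows it (a generic sketch, not tied to any particular image format): both functions do identical arithmetic, but one walks memory contiguously while the other jumps a whole row stride on every access.

```c
#include <stddef.h>
#include <stdint.h>

/* Row-major image: width * height pixels, one byte per pixel. */

/* Cache-friendly: consecutive accesses touch consecutive bytes. */
uint64_t sum_row_major(const uint8_t *img, size_t width, size_t height)
{
    uint64_t sum = 0;
    for (size_t y = 0; y < height; y++)
        for (size_t x = 0; x < width; x++)
            sum += img[y * width + x];
    return sum;
}

/* Cache-hostile: every access jumps `width` bytes, so each one may
 * land on a different cache line (and eventually a different page). */
uint64_t sum_col_major(const uint8_t *img, size_t width, size_t height)
{
    uint64_t sum = 0;
    for (size_t x = 0; x < width; x++)
        for (size_t y = 0; y < height; y++)
            sum += img[y * width + x];
    return sum;
}
```

On a large enough image the second version can easily be several times slower, purely from cache and prefetcher behaviour.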
As you said, we cannot go back to “constant time” memory. This is fundamentally dictated by physics: the further you are separated from your data, the slower the access will be. This is also true for very large-scale distributed systems.
(One of the things I had fixed back in the day was a slow moving pipeline. We were moving many TBs of data every day across a continent. The pipeline got really slow over time, and just by rearranging the processing I was able to improve the runtime significantly).
Yes, they are getting better. In the branch-free programming talk, the presenter mentions GCC already having some of these optimizations, with Clang being behind.
The compilers can, and do, rearrange loop orders, for example:
https://en.wikipedia.org/wiki/Loop_interchange
But of course optimizers are limited to simple loops. And when the code has side effects, it requires real human intervention. (Still have some job security).
sukru,
Yep makes sense.
For now. But I think the future of optimization will be won by the machines, just like chess, Go, and Jeopardy have been, and as computer vision is starting to be. We as programmers will be able to specify inefficient (but correct) algorithms, and the optimizer will use that input as a specification to write optimal algorithms on its own. As an example, the source could list an O(n^2) algorithm and the compiler would be able to substitute an O(n log n) one or better, while still adhering to the specified output. On top of this, the optimizers will be able to custom-tailor the output for local system configurations like registers, memory, cache, etc. It could even run its own benchmarks against real hardware, just like a human would, to further optimize its output. The idea is that everything humans do to optimize software can be automated.
I do believe this will all be possible, but it may require a new way to specify software, one that removes the constraints that impede optimization. Today’s languages don’t provide a way to differentiate between required behaviors and mere side effects. For example, even something as trivial as a debug output in our inefficient algorithm would make the optimizer dependent on reusing our inefficient algorithm in order to exactly reproduce its output. Without that debug output, the optimizer would be free to switch algorithms.
With this in mind, I envision a language that is good at specifying correctness without having to specify an implementation, since it will be the compiler’s responsibility to come up with one. This would fundamentally transform the act of “programming” from writing algorithms to writing verification rules and defining priorities. A verification routine might contain an implementation to mimic, but it wouldn’t necessarily have to: “The output set contains every element of the input set and vice versa.” “When iterated, every element of the output set is greater than or equal to the preceding element.”
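To make that last part concrete, those two rules might reduce to something like this (a hypothetical sketch in today’s terms; for now they can only serve as a test oracle, whereas the language I’m imagining would let the compiler synthesize an implementation from them):

```c
#include <stdbool.h>
#include <stddef.h>

/* Rule 2: "every element of the output is >= the preceding element". */
bool is_sorted(const int *out, size_t n)
{
    for (size_t i = 1; i < n; i++)
        if (out[i] < out[i - 1])
            return false;
    return true;
}

/* Rule 1: "the output contains every element of the input and vice
 * versa" -- i.e. the output is a permutation of the input.  An O(n^2)
 * check is fine here, because this is a specification, not the
 * implementation the compiler would be free to synthesize. */
bool is_permutation(const int *in, const int *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        size_t count_in = 0, count_out = 0;
        for (size_t j = 0; j < n; j++) {
            if (in[j] == in[i]) count_in++;
            if (out[j] == in[i]) count_out++;
        }
        if (count_in != count_out)
            return false;
    }
    return true;
}

/* A "program" in this style is just: any f such that, for every input,
 * is_permutation(input, f(input), n) && is_sorted(f(input), n) holds. */
```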