On Friday afternoon, The Wall Street Journal reported Intel had been approached by fellow chip giant Qualcomm about a possible takeover. While any deal is described as “far from certain,” according to the paper’s unnamed sources, it would represent a tremendous fall for a company that had been the most valuable chip company in the world, based largely on its x86 processor technology that for years had triumphed over Qualcomm’s Arm chips outside of the phone space.
↫ Richard Lawler and Sean Hollister at The Verge
Either Qualcomm is only interested in buying certain parts of Intel’s business, or we’re dealing with someone trying to mess with stock prices for personal gain. The idea of Qualcomm acquiring Intel seems entirely outlandish to me, and that’s not even taking into account that regulators will probably have a thing or two to say about this. The one thing such a crazy deal would have going for it is that it would create a pretty strong and powerful all-American chip giant, which is a PR avenue the companies might explore if this is really serious.
One of the most valuable assets Intel has is the x86 architecture and the associated patents and licensing deals, and the immense market power that comes with those. Perhaps Qualcomm is interested in designing x86 chips, or, more likely, perhaps they’re interested in all that sweet, sweet licensing money they could extract by allowing more companies to design and sell x86 processors. The x86 market currently consists almost exclusively of Intel and AMD, a situation which may be leaving a lot of licensing money on the table.
Pondering aside, I highly doubt this is anything other than an overblown, misinterpreted story.
If this happens, Qualcomm will be saddled with copious amounts of leveraged-buyout debt, and we’ll get a crappier Qualcomm and a crappier Intel.
I think you are seeing the value of the x86 license fail Intel right now. Intel’s entire business model is built around making generic one-size-fits-all processors out of its x86 line – commodity hardware – and it’s just not working any more. The world is in love with ARM’s more customizable approach, and the foundries that build them are quite a bit more flexible than Intel’s commodity-at-scale model. On top of that, people are convinced that x86 is the actual problem for power efficiency (it’s not) – and that has a real impact on the marketplace. I don’t know how much of an asset x86 is at this point in time. I sometimes wonder how long it’ll be before AMD and Intel slap an ARM decode front end on their chips.
CaptainN-,
Intel’s strength is high-end processors where power consumption doesn’t matter and/or markets where x86 software compatibility is important. But neither of these holds true on mobile, and neither Intel nor AMD is competitive there. The energy budget is less favorable towards processors with more transistors. Back when Intel had a fab advantage, they had some extra headroom, but not any more. The factor by which they are behind on energy efficiency is greater than the factor by which they are ahead on performance.
It will be interesting to see what the future brings for Intel, especially if they go fabless. AMD and Apple both made this work for them. I have some reservations about what less fab competition will do to the market, though.
Alfman,
Yes, there is a cost to decoding in complex instruction sets. However, it was estimated to be 15-35% seven years ago, and I would assume it is even less today:
https://www.reddit.com/r/hardware/comments/6rtx9u/comment/dl87o94/
Yet the other factor you touched on is important. When your competitors are at 3nm(*) and you are stuck at 7nm, you’re at a massive disadvantage.
Modern processors now have clean instruction set extensions for high performance, like vector operations (AVX), tensors, and specialized accelerators for things like encryption or video processing. They can also have low-power cores with lower decoding overhead at the expense of speed. So they can scale in both directions.
But none of them can save you if your technology is essentially generations behind.
(*) I know these are not real nanometers anymore.
sukru,
I agree with that. With desktop computers, we just throw more transistors at the problem, but when it comes to power efficiency, therein lies the rub. They need to have fewer active transistors, not more! On the one hand we could say it doesn’t matter that much: x86 can just eat the cost as less battery life, or a simplified implementation like Atom that performs worse. And to that I say, yeah, maybe we can, but is that really competitive?
I created a spreadsheet with PassMark ST scores to show generational improvements for Intel’s i9 line and Apple’s M# line.
https://ibb.co/k4PtwLF
Take the TDP numbers with a grain of salt because they’re not well defined, but the thing to note is that the fastest M3 Max CPU is neck and neck on ST performance with Intel’s i9-14900K, and it’s doing it with less power. Intel does have x86 CPUs designed to sip power. Good on them, but just look at how much performance they had to sacrifice to get there… not even in the same ballpark as M# CPUs.
I honestly don’t think Intel will manage to come within 10% of ARM performance at the same power envelope. x86 complexity calls for more transistors, and that’s a disadvantage here.
Intel does far better in high-performance multithreaded desktop CPUs because they’re better able to mitigate x86 shortcomings on the desktop. They have a further advantage of discrete components that can be upgraded separately, and I do feel that Apple has painted themselves into a corner there. The iGPU has to compete with the CPU for resources, and most gaming benchmarks are relatively poor on Apple M# CPUs, which is ironic given that so many games are bounded by ST performance. So Intel might still have some advantages here.
Yeah. Eventually we’ll reach the end of the line with silicon and I guess everyone will be on the same node size then.
That’s true, although you generally need special applications to take advantage of that and I personally feel GPGPU is a better target than CPU for that type of workload.
Your second-to-last point could also lead to another problem. As node costs go up (i.e. transistor costs), I suspect any performance-per-transistor advantage should start to manifest as a total chip cost advantage.
As you say, we will eventually hit an end of the road (either determined by cost or by actual engineering limits). At that point one side is spending some of its final transistor budget on addressing architectural limitations while the other side has the budget to either design cheaper processors or spend those transistors on specialist hardware to eke out the last bits of performance.
We will see if this is an issue for x86.
If Intel and Qualcomm were to merge, they could heavily invest in fabs to take advantage of economies of scale to make chips. That was Intel’s advantage in the past over SPARC, MIPS & Co.
“I sometimes wonder how long it’ll be before AMD and Intel slap an ARM decode front end on their chips.”
Would it not make more sense to do the opposite? Imagine an X Elite like CPU that could decode x64 instructions in hardware. If it could handle both ARM64 and x64, it would be the ultimate laptop CPU.
Lunar Lake proves that Intel can go SoC and get huge power savings. I don’t think it’s over for Intel or x86_64.
There’s a perception problem. Look at Alfman’s insistence that x86 must be less efficient. People really believe that x86 is destined to be less efficient than the alternatives, regardless of the rest of the story.
But I agree, AMD and Intel could still right the ship. Momentum is a real thing; look at how strongly Windows clings to life…
I have to admit, I had not really taken a look at Lunar Lake. If it pans out according to the numbers, Intel may stop the Qualcomm X Elite from building a beach-head. If the next gen also delivers, they could turn it around. A lot depends on how badly the current quality problems hurt them.
tanishaj,
Intel are promising a lot, but as always independent testing is needed. It will be interesting to see where it goes.
https://www.tomshardware.com/pc-components/cpus/intel-launches-lunar-lake-claims-arm-beating-battery-life-worlds-fastest-mobile-cpu-cores
Apple leads the ARM processors, so I’m guessing their absence from Intel’s comparisons must be deliberate.
CaptainN-,
Calling it only a “perception problem” seems to ignore that x86 complexity is a real shortcoming. Even x86 CPUs designed to be efficient are nowhere close to ARM’s performance/power ratio. You can drown out prefetch inefficiency with intensive workloads like SSE that use magnitudes more power, which helps mask the disadvantage of x86 decoders. But many real-world applications do not use much SSE and are long sequences of fairly basic logic instructions. Despite (or perhaps because of) the simplicity of the code, it’s a sore spot for x86 CPUs.
Fixing this requires us to take x86 out of the execution path. Microcode does this, but the limiting factor there is that OS+applications don’t fit within the uop cache. Adding more static cache uses more transistors and therefore more energy. Atom processors reduce the transistors needed, but it comes at the cost of eliminating the performance benefits those transistors were providing.
Personally I think the solution to this might be to translate the x86 instructions into a simpler, more predictable and dense instruction set that requires minimal transistors to fetch, thereby minimizing both the energy and latency costs of fetching x86 instructions.
Incidentally this is how Transmeta Crusoe processors worked…
https://en.wikipedia.org/wiki/Transmeta_Crusoe
It’s also similar to how Apple translates x86 code to run on ARM before executing it. I’m guessing there are going to be some pathological cases where this won’t work well, like startup execution latency/jitter. Also, historically, translation is known to suffer from performance loss (although Apple’s Rosetta is still pretty darn good). I know it would be controversial for Intel to use this technique to execute its own x86 architecture instead of the CPU doing so directly. People might remember when Android switched from Dalvik to ART.
https://www.addictivetips.com/android/art-vs-dalvik-android-runtime-environments-explained-compared/
This transition was a bit painful; APK translations could take a minute or more, but it was usually a one-time process. If Intel did this, though, they would have the advantage of translating x86 to an instruction set specifically designed to have a direct mapping with it. This means that the translation itself should be fast, and I don’t see a reason 100% efficiency of translated code wouldn’t be possible. We can also get rid of the transistors dedicated to overcoming x86 complexity. They go away, and so does their power footprint. This works best with assistance from the OS to save the translated binary code, so that in future executions the x86 code might not have to be loaded at all; execution would start immediately from translated code.
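As a rough sketch of what that OS-assisted caching might look like (purely hypothetical: translate_x86_to_native() is a stand-in for the actual translator, and the cache location is made up):

    import hashlib
    import os

    CACHE_DIR = "/var/cache/x86-xlate"  # hypothetical location for baked binaries

    def translate_x86_to_native(x86_bytes: bytes) -> bytes:
        # Stand-in for the actual x86 -> native-ISA translator.
        raise NotImplementedError

    def load_translated(x86_path: str) -> bytes:
        # Return native code for an x86 binary, translating only on the first run.
        with open(x86_path, "rb") as f:
            x86_bytes = f.read()
        key = hashlib.sha256(x86_bytes).hexdigest()   # cache key = content hash
        cached = os.path.join(CACHE_DIR, key)
        if os.path.exists(cached):                    # later runs skip translation entirely
            with open(cached, "rb") as f:
                return f.read()
        native = translate_x86_to_native(x86_bytes)   # first run pays the cost once
        os.makedirs(CACHE_DIR, exist_ok=True)
        with open(cached, "wb") as f:
            f.write(native)
        return native

On later runs the OS would hand the CPU the already-baked native code directly, which is the whole point.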
I don’t know how well Intel could market this to x86-loyal customers, but I think it does offer a path to solving the complexity issues on the CPU.
I deeply agree with CaptainN-
Power efficiency is a matter of architecture, not instruction set.
The number of transistors needed to decode x64 instructions is very small compared to the cache, execution units, and branch prediction.
The most power-efficient CPUs are usually those with straightforward implementations – but they are not powerful!
To improve IPC, a CPU needs more execution units, a good pipeline, register renaming, out-of-order execution and so on.
When people say “ARM”, they still think of simple ARM cores like in the early Raspberry Pis. But powerful ARM cores have a greater TDP, because they eat about as much power as Intel/AMD.
For instance: while Qualcomm “threatens” to buy Intel (certainly to make the world know that it is more powerful than Intel, and therefore hammering home the idea that computers running on ARM are serious things), Intel and AMD are going head-to-head in low-power SoCs for small computers and handheld gaming machines…
ARM-based computers don’t show a great improvement in battery life, even on pure ARM code. Certainly because of all the security that is built into the OS.
That is the same with RISC-V: you can have a very power-efficient RISC-V MCU, but if you want more computational power, you need a less power-efficient variant.
“it would be controversial for intel to use this technique to execute it’s own x86 architecture instead of the CPU doing so directly”
Intel has been doing this since the Pentium Pro. The x86/x64 instruction set is only “real” outside the CPU. Inside the CPU, the code is broken down into micro-ops.
From a very personal point of view, I really think that Apple did optimize Rosetta to produce sequences of code that are tuned for the M1/M2/M3 branch prediction, to allow some kind of quasi-VLIW (a well-designed sequence of code can feed a maximum of execution units in parallel).
doubleUb,
Please think about it a bit more. ARM CPUs can decode more instructions, and do so faster, than x86 CPUs. One of the reasons more uop cache is needed is to mitigate this. And while there is nothing wrong with this to improve performance, it does cost more energy.
Yes, if you read my comments you can see we agree on this.
We agree that complex pipelines need a lot of power, but again it sounds like you formed this reaction without having read my posts. What you are saying about the pipelines I’ve said as well.
Well, kind of. Intel has Atom CPUs to cut back power, but the performance of their 2024 models remains abysmal compared to Apple’s 2021 M1 CPUs. See this link I posted before.
https://ibb.co/k4PtwLF
It seems like everyone wants to rebut this with “but it’s not x86’s fault, Intel/AMD can do better by doing XYZ”. And sure, there’s room for optimization and we’ll see where that takes us, but the CPUs available today don’t negate what I’m saying.
I’m not sure what you mean?
…this thread has been talking about microcode all along. I think you may be taking this out of context, because Intel CPUs are most certainly not doing what I described. Intel CPUs have to decode and perform OOO code transformations using transistors in real time. This requires lots of energy. If we could move this out of the CPU and compute it ahead of time, it would save all the energy consumed doing it over and over again inside the CPU. This kind of code optimizer module could open up a lot more innovation in CPU efficiency by providing more opportunities to make transistors redundant – not just in the front end, but inside the micro-architecture as well.
It just isn’t true that transistors translate directly to power use – it’s a factor, but it’s not direct. Race to sleep, caching, and a host of other additional features can REDUCE power usage while adding transistors. Transistor count and complexity are not the problem (they do add cost – a business concern). It’s the same with code, BTW – more code does NOT equal slower. That’s just not an adequate measuring stick.
The bigger issue, and I’ll reiterate – is business priorities. Intel’s business model, and the momentum behind it, has been in big, powerful, hot, and power hungry terrestrial contexts, on commodity hardware. The world has moved on to tailored hardware in a way that Intel hasn’t been able to pivot to quite so effectively. AMD has had a better time, because they are fabless, so they contract for specialty designs like in games consoles, but their inability to license out x86 has been a drag on that business model – all business concerns, not technical ones.
ARM meanwhile, has dozens of partners, licensing and contributing back to their platform, almost entirely in spaces which require better energy efficiency. And if we are going to compare Apple’s MX chips, let’s compare them properly – they get absolutely trounced in high performance workloads, and likely will continue to get trounced for the foreseeable future, despite the hype. What advantage Apple has in that space all boils down to the additional coprocessors they would have been unable to add to an x86 licensed commodity part – more business problems, not technical ones.
x86 does NOT have an instruction set problem, despite the perception. They have a business problem.
BTW, one of the cool things about M1, M2, etc. is that they ADDED a lot of transistors to ARM, including some modes which make x86 easier to emulate.
Let’s look at some direct comparisons:
– M1 Max (2020) has about 57 billion.
– The 10th generation Intel i9 processor (2020) has about 4.2 billion.
I’m just not sure what we are talking about here. M1s are more efficient because of business priorities at Apple (and at ARM and their partners). Not because of transistor count.
CaptainN–,
What I’m saying is true, though: every transistor actively doing work adds to the energy footprint. This doesn’t necessarily mean a die with more transistors uses more energy. This is what I meant by “We can’t ignore duty cycles.”! Transistors burn through the most energy when they’re on and doing work.
As an illustrative example: a CPU might have an accelerator for h264 encoding that is much more efficient than using the generic CPU pipelines to do the same work. The x86 software algorithm might use 150W for streaming, whereas the hardware accelerator might use only 30W. Obviously the number of total physical transistors doesn’t change at all, but the difference in transistor duty cycle between these two circuits makes a big difference. The OOO pipeline is very energy intensive; any transistors we can make redundant there are opportunities to save energy.
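A trivial back-of-the-envelope using only the illustrative 150W/30W figures above shows why duty cycle, not transistor count, drives the energy bill:

    # Same die, same transistor count; only which circuits are busy changes.
    software_watts = 150.0     # generic x86 pipelines doing the encode (figure from above)
    accelerator_watts = 30.0   # dedicated accelerator doing the same work (figure from above)
    hours_streaming = 1.0

    print("software encode:", software_watts * hours_streaming, "Wh")
    print("accelerator:    ", accelerator_watts * hours_streaming, "Wh")
    print("energy ratio:   ", software_watts / accelerator_watts)   # 5x for identical output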
I agree with you on the business arguments.
It’s not just “perception” though; the technical tradeoffs are real. I don’t take issue with you making the business case that these tradeoffs are justified; maybe they are. But it’s wrong to suggest the tradeoffs don’t exist; they do, and we’re living with them.
I’m having trouble finding transistor counts for specific models, so I don’t know where you got the Intel figure; I’ll take you at your word though. Note that the M1 Max was not out in 2020…
https://en.wikipedia.org/wiki/Apple_M1
Anyway, it doesn’t matter too much, because Intel was stuck on their 14nm process for a long time versus TSMC’s 5nm process for Apple’s chips. It was a huge disadvantage for Intel!
I didn’t say it was just transistor count though; those transistors have to be active. Duty cycle matters! Hopefully what I said above makes this clearer.
The perception that x86 is less efficient is rooted back in the 80s and 90s.
A hell of a lot of things have happened ever since in terms of microarchitecture.
Most people stuck in that perception are stuck in that time period.
FWIW, the perception in the opposite direction is also a thing. A lot of people were surprised that ARM could scale up to the type of IPC performance that Apple’s Firestorm and Qualcomm’s Oryon ARM cores have been achieving. They were kicking the ass of x86 for a while in that regard.
Power consumption has very little to do with a specific ISA, given that most of the transistor budget in a modern core is not invested in decoding instructions.
When fabbed on the same process node, a comparable ARM or x86 microarchitecture will tend to have a similar power/performance envelope.
Apple’s edge in laptop SoCs for the past few years has had very little to do with ARM. Most of the performance comes from them using a very wide and aggressively speculative out-of-order microarchitecture in their performance cores. Whereas their brilliant power/battery characteristics are more related to Apple having had consistent access to the most advanced TSMC node (for a variety of reasons) as well as more advanced packaging, especially their use of a chip-on-chip PDN (which frees up the metal layers on the main die to be used for clock and signal, giving much better power and performance, as well as higher transistor density). On top of it all, Apple uses a vertically integrated HW/SW stack, so they have fine-tuned the OSX scheduler as well as the sleep behavior to a higher degree than Windows thus far.
Apple could have used x86, if they had had access to the license, and they would have produced a similarly performant SoC.
Xanady Asem,
What you fail to mention is that power is not just a function of the number of transistors dedicated to decoding instructions, but also of their relative duty cycle. Transistors that solve x86 complexity can take up a small area on the die, but it does not follow that they will consume a correspondingly small amount of power. We can’t ignore duty cycles. When those transistors are actively operating at ~5GHz, they’ll consume much more power than unused portions of the core, even if the unused portions are physically much larger.
I agree that microarchitectures and node sizes make a big difference, and these have seen a lot of progress throughout the years, providing greater gains than the ISA could. Despite x86 complexity, Intel has often held a node and micro-architectural advantage, plus a market that valued performance over power helped too. I think Intel will still do well in high-performance applications where the ratio of compute to decode is high, but x86 is still a disadvantage for applications where the compute-to-decode ratio is low.
Once competitors catch up on node/microarchitecture optimizations, front-end complexity becomes a bigger liability for both power consumption and decode latency. The latter can be fixed by throwing more transistors at decode and cache, but these add power rather than reduce it. There’s no doubt Intel could do better with a clean slate, but for better or worse Intel is tied to x86 for business reasons.
Although this is not intrinsic to ARM ISAs, Apple M# processors have disadvantages too. It’s rather fortunate for Intel that Apple locked their ecosystem into shared-memory iGPUs that limit the scalability of both the GPU and the CPU. So I predict current trends are likely to continue: Apple will dominate Intel on efficiency, and Intel will dominate Apple on heavy compute performance for things like gaming.
It would be almost impossible to find such a rare corner case where the decoder is going at full blast and the rest of the microarchitecture is idle.
The complexity of the decoder has not been a significant limiter of performance for eons on an out-of-order architecture. Things like the multiported register file, the ROB, the OOO scheduler, the predictor, etc. take significantly more effort in terms of design and validation, as well as in the resulting power and area budgets.
I have no idea what you mean by “decode to compute” ratio, every instruction in the pipeline has to be decoded. If anything, such a concept would intrinsically favor x86 since it tends to decode into slightly more uOps than ARM on average.
Xanady Asem
The rest of the micro-architecture being completely idle would be a pathological case, but I’m not suggesting things are that bad. Instead, let’s say the rest of the die might be using a magnitude more power to perform basic math & logic; this would leave 9% overhead on the front end that could be optimized.
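For what it’s worth, that figure follows directly from the “magnitude more power” assumption; this is a back-of-the-envelope illustration, not a measurement:

    # If the rest of the core draws an order of magnitude (10x) more power
    # than the front end, the front end's share of total core power is:
    front_end = 1.0
    rest_of_core = 10.0 * front_end
    print(front_end / (front_end + rest_of_core))   # ~0.09, i.e. roughly 9%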
I don’t think this is so unlikely, because a lot of applications don’t utilize the CPU’s heavy number-crunching facilities. A JIT-compiled JavaScript library could easily overflow the u-op cache with basic instructions that don’t take long to execute but keep the decoder continuously occupied.
A 5-10% front-end overhead might be tolerated and “good enough”. But once everyone eventually hits a wall of diminishing returns on node & micro-architecture, this degree of overhead stands out in CPU comparison charts. Obviously Intel can and should focus on their strength, which is high performance. Complexity and higher transistor requirements make x86 less attractive for low-power applications.
Some types of instruction sequences carry a lot more weight in terms of work done. Consider how drastically different the work-to-decode ratio is between a tight loop of SSE instructions versus long chains of basic instructions with lots of application context thrashing.
Every component can be optimized. But that doesn’t mean everything is a limiter. E.g. x86 decode is a few percentage points of overhead. It is not worth throwing out the baby (backwards compatibility) with the bathwater just to get a slight performance/efficiency uptick.
And yes, the data-parallel portions of a kernel are going to have a much lower overall control (not just decode) footprint with respect to the compute being achieved. But the type of SIMD/vector widths required to make a clear difference in that regard (execution vs control) is in very wide AVX-256+ territory, which presents its own issues in terms of limiting power densities elsewhere in the core.
Even in kernels with low IPC, stuff like the branch predictor, the register files, the ROB, the load/store queues, and even the scheduler that is freaking out because it can’t find uOps to send to the FUs is going to have a significant power footprint, because most of them use multiported registers, for example, which burn a boatload of power. Remember, even the process of finding out you can’t schedule something takes some power.
Besides, those pathological use cases are not ISA-dependent. That is, a RISC processor will also be stalled in those scenarios.
FWIW, context switch overheads haven’t been an issue since the 90s, given the granularity between context ticks and the overall increases in clock speed and IPC.
Also, you need to consider that there is a lot in the fetch engine other than the FSMs for a specific ISA. The decoder front end is complemented by a lot of structures that are ISA-agnostic.
This is why, if you scale (on the same node) two uArchs to have comparable issue/retire width, with similar register resources, etc., you end up with cores that are remarkably similar in area, performance, and power consumption regardless of the ISA being used. Because, as I said, the x86 decoder nowadays presents low-single-digit overhead; it becomes almost noise.
Xanady Asem,
That’s fine, people may be too vested in x86 to throw it out. My point is simply that its complexity does lock Intel into some technical disadvantages.
I agree these out-of-order pipelines are very costly. It’s easy to have a simple ALU circuit be more efficient, but keeping that ALU busy is tough. All the complexity of out-of-order pipelines is there to combat stalls. In principle this could be achieved at the compiler level without OOO circuits, but in practice the strict coordination required between the CPU and compilers has rendered this approach non-viable.
It’s a stretch in practical terms, but hypothetically, assuming we had a clean slate and were not tied to earlier architectural conventions, we could actually address all of these things with solutions that are viable and more efficient. Instead of directly executing compiled code, software would be compiled down to an intermediate byte code. This intermediate byte code would then go through a CPU-specific module to “bake” the binaries specifically for the CPU. This would be such a useful thing to have because it means lots of energy-consuming transistors become redundant, with the transformations done ahead of time by the code-optimizing module. I’m including not only the decoder transistors, but those used for out-of-order pipelines as well. Assuming the module could be perfectly matched to the silicon, the issue you mention (“finding out you can’t schedule something takes some power”) is solved, and every time slot can be perfectly accounted for without the need for transistors.
I believe this would be something of a holy grail for fast yet low-energy CPUs, but it does require breaking away from scheduling unoptimized ISA code directly on the CPU. I concede this may be a bridge too far today, but conceivably, once node improvements and microarch optimizations stop producing gains, the appetite for more performance and efficiency may be sufficient to push us over the bridge to the other side: ISA pre-processors, with a great reduction in CPU complexity, leading to low-energy, high-performance designs.
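To make that concrete, here’s a minimal, hypothetical sketch of static list scheduling: the kind of analysis an offline “bake” module could do once, instead of OOO hardware redoing it at run time. The toy instructions, latencies and two-wide issue limit are invented purely for illustration and don’t describe any real design:

    # Toy ahead-of-time scheduler: given a dependency graph and latencies,
    # assign each instruction an issue cycle offline, so no runtime OOO logic
    # is needed. Everything here (ops, latencies, 2-wide issue) is made up.
    program = {
        # instruction: (latency_in_cycles, instructions_it_depends_on)
        "load r1":  (3, []),
        "load r2":  (3, []),
        "add r3":   (1, ["load r1", "load r2"]),
        "mul r4":   (2, ["add r3"]),
        "store r4": (1, ["mul r4"]),
    }
    ISSUE_WIDTH = 2          # at most two instructions start per cycle

    scheduled = {}           # instruction -> cycle on which it is issued
    cycle = 0
    while len(scheduled) < len(program):
        issued = 0
        for insn, (_lat, deps) in program.items():
            if insn in scheduled or issued == ISSUE_WIDTH:
                continue
            # ready once every producer has completed (its issue cycle + its latency)
            if all(d in scheduled and scheduled[d] + program[d][0] <= cycle for d in deps):
                scheduled[insn] = cycle
                issued += 1
        cycle += 1

    for insn in sorted(scheduled, key=scheduled.get):
        print("cycle", scheduled[insn], ":", insn)

A real system would of course also have to handle hazards, register pressure and so on; the point is only that the scheduling work happens once, offline, instead of in power-hungry hardware every time the code runs.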
To be fair, our assessments are in the same ballpark. Our disagreement seems to be whether we should consider x86 competitive on energy despite this overhead. Obviously the efficiency gap between Intel/AMD and ARM M# is pretty substantial today. Time will tell how far they are able to close this gap.
The point is that the efficiency gap between the M-series and Intel/AMD is mainly due to factors other than the ISA. Said gap is more correlated with Apple being ahead in terms of uArch, node, PDN, and packaging.
Lunar Lake, having access to parity in some of those resources, seems to have caught up in regard to efficiency.
FWIW, every time people try to create a new architecture from scratch, they almost invariably end up reinventing the wheel and having to add complexity. Some people tend to develop a certain technological amnesia, where they assume that complexities in the past were bolted on arbitrarily for no reason 😉
ARM is a perfect example. In order to scale it up in performance, by the time Firestorm was released by Apple, it was an ARM core bigger and more complex than its contemporary x86 competitors.
Most of the complexity that x86-land deals with is at the system (platform and software) level, having to support legacy there. x86S will remove the 16-bit and 32-bit system modes. That will finally force legacy BIOS compatibility to be done away with, even though the core itself will still have the 32-bit datapaths and instruction execution support.
Xanady Asem,
Nobody is saying there aren’t other sources for gaps. And it’s all well and good that engineers are working on getting the most out of node sizes and uarch. They’ve done good work and the proof is in the pudding. However, what happens when those other gaps get minimized while the x86 complexity remains? Logically, it becomes a bigger proportion of the remaining gap.
I’m going to say what I always say: wait for independent testing before taking PR material at face value. I was insistent on this when it was Apple, and now I’m insistent on it when it’s Intel. We can assess next-generation CPU rankings when independent testers actually get their hands on the latest Intel and Apple CPUs later this year.
This is kind of why I proposed a different solution. My aim would be to decrease CPU complexity using a special code optimizer. Rather than making the CPU adapt to and analyze the ISA code, much of the decoding & OOO analysis could be done before the CPU. There are some interesting opportunities here: it wouldn’t just benefit x86, other superscalar CPUs could benefit from this too. I acknowledge we may not be ready for this sort of change yet, but once stagnation sets in, I suspect the industry may become more open to it.
I don’t understand the point of proposing solutions to problems that have been tackled for decades.
Modern compilers do a tremendous amount of trace analysis, kernel reordering, etc. They can even issue code paths with speculative scout threads to warm up caches, etc.
uArch and compiler teams tend to have healthy feedback loops within organizations.
FWIW, nowadays the bulk of the complexity of modern high-performance semiconductor designs comes in terms of tweaking the cell libraries, design for manufacturing, validation, etc.
So you can end up with something like an NVIDIA GPU, which is made of much simpler cores and more regular structures than a modern superscalar ARM/x86 core, but the resulting design ends up being more complex to manufacture and make performant.
Xanady Asem,
Current solutions need more transistors to increase performance, which I’ve been saying all along. However these solutions use boatloads of power, and I’ll quote you saying it…
The point being, our current solutions trade off between power and performance. There’s no denying it. Fab advancements have enabled us to cut back power regardless, but at some point physical barriers will prevent us from going further. I don’t know why you are so reluctant to agree that a reduction in complexity and in actively powered transistors will have to play a bigger role in the future if we want to make more efficiency gains.
Yes they do, but in compiling down to x86, there are front-end inefficiencies as well as microcode pipeline inefficiencies being left on the table. Decode inefficiencies could be mitigated with ISA improvements, but the OOO pipeline overhead that you are always so keen to bring up would require a very specialized optimizer in order to optimize away those transistors. From what you’ve said, we both agree that it takes a significant amount of power to run these pipelines. Well, a code optimizer can do this work ahead of time and put less burden on the CPU to do it at run time.
GPUs thrive on parallel algorithms. The sheer number of threads means even much slower cores end up doing lots more work overall using less energy. GPUs unquestionably beat CPUs at appropriate parallel tasks, but obviously a lot of software doesn’t boil down to GPU primitives, so we still need both.
So basically, you’re literally rediscovering Itanium. Great!
Xanady Asem,
Itanium was a flop, but that doesn’t imply that Intel engineers were unqualified hacks or that none of their innovations had merit at all. Intel sought to create a new ISA around the needs of the CPU rather than a CPU around the needs of the ISA (i.e. x86). However, like I said before, “in practice the strict coordination between the CPU and compilers has rendered this approach non-viable.” Devs & customers were absolutely not ready for radically incompatible ISA changes given everyone’s dependency on x86, which ran terribly on Itanium. There are good lessons to be learned for sure. However, it doesn’t imply all the technical solutions they came up with had zero merit. Borrowing your phrase, we shouldn’t throw away the baby with the bathwater.
So with all of this in mind, including your feedback, how can we design a CPU that takes out the complexity and energy consumption of modern CPUs without being so drastically different that we break compatibility?
Well, I still think that having a silicon-specific code translator to finish the code analysis & scheduling where x86 compilers leave off is on sound footing. Some changes would be needed in the OS, but normal x86 applications shouldn’t notice.
Software translation has improved significantly since ~2000, not only evolving academically, but evolving to serve real-world use cases too. Rosetta 2 has impressive results going across architectures. Intel could do better, given that their target architecture would be designed for it, as opposed to translating x86->ARM. A recent OSNews article highlighted the importance of 1:1 code mapping.
https://www.osnews.com/story/140607/what-it-takes-to-run-the-witcher-3-on-risc-v/
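As a toy illustration of why a near 1:1 mapping matters: when every source instruction maps to exactly one target instruction, translation is a single linear table-lookup pass, with no analysis or fix-up code. The “x86-ish” and target mnemonics below are entirely made up.

    # One source op -> one target op: translation is a straight table lookup.
    OPCODE_MAP = {
        "mov": "n.mov",
        "add": "n.add",
        "cmp": "n.cmp",
        "jne": "n.brne",
    }

    def translate(block):
        return [(OPCODE_MAP[op], args) for op, args in block]

    print(translate([("mov", "r1, 0"), ("add", "r1, 4"),
                     ("cmp", "r1, 16"), ("jne", "loop")]))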
I know this is different from how things have been done. Engineers have gotten spoiled by node improvements covering up the power footprint costs of adding more transistors: “let’s not worry about the additional power footprint because silicon node improvements will offset the power anyway”. This is similar to the software industry trope about premature optimization: “let’s not worry about it and use faster CPUs instead”.
We all have some resistance to change, but do you agree that node improvements are reaching the end of the line? This may be the catalyst needed to push the industry onto a more transistor-optimized path. Whether it’s Qualcomm, Apple, Intel, AMD, whoever, they’re all going to encounter the same node limits and they’re all going to need to think outside the box to solve this. I think it’s inevitable that there’s going to be a greater push to justify transistors like the industry hasn’t seen in recent times. If there’s a possibility to make transistors redundant, then it’s an opportunity for power savings.
The performance levels an Apple Avalanche core extracts at 4GHz out of a <2.5W power envelope are remarkable. That level of efficiency is not being reached naively. A lot of the stuff you keep mulling over was solved eons ago, and I don’t think you understand that.
I recommend you apply your advice about overcoming resistance to change; try comprehending the problem first and then providing a solution, this time around.
Xanady Asem,
The status quo has forced us to choose between performance and energy efficiency, but more efficient and performant CPUs are possible. You’ve already admitted that OOO cores require lots of energy, and I agreed with you. But you seem bent on denying this is a problem, probably because you are still looking backwards to a time when node sizes still had tons of headroom and conveniently helped solve this for us. Of course it was easy to ignore then, but forward-looking engineers really won’t have this luxury forever. Again, I’m not saying we’re there yet, but those refusing to eliminate sources of inefficiency will become less competitive. The efficiency winner needs to minimize the ancillary transistors that only exist to implement an architecture while maximizing those directly involved in the final computation. Those looking outside the box, working outside of current comfort zones, are the ones who will solve it.
Xanady Asem
Gosh! I should have read you before answering. Totally agree.
Alfman
“My aim would be to decrease CPU complexity”
Already tried with Transmeta and RISC-V.
Unable to achieve the same performance.
It is very difficult to sell a new product that does not meet the same performance standard…
doubleUb,
I mentioned Transmeta too, but comparing them to a company with Intel’s resources and fab advantage wasn’t exactly a fair fight. I’ve been saying that I don’t expect CPU designers to change their approach until other avenues are exhausted, including node size and micro-architecture. But at some point they’ll be chasing such marginal gains that they’ll be forced to consider outside-the-box solutions if they want to make more progress. Since the only alternative is stagnation, they will look to move incidental work out of the CPU, so that CPUs don’t need to allocate as many transistors to doing that and can focus on doing actual work faster and more efficiently.
“Guess you guys aren’t ready for that yet…but your kids are gonna love it”
https://www.youtube.com/watch?v=ZzAgacFBr48
There are literally thousands of people looking at these issues in detail. People are still getting PhDs exploring these issues. Papers are still being published. And companies are still investing billions in designing CPUs.
We know about those issues.
This is really out of my domain knowledge, but I’m really confused: if x86_64 could be made more efficient for SoCs, then what on earth happened when Intel tried it for mobile devices?
The failure of Intel’s mobile push had more to do with their corporate culture than any inability of x86 to scale down to low power envelopes.
What makes a successful mobile SoC is more about the integration of all the IPs within the die, not just the scalar cores. That is, you also need to have a good GPU, NPU (now), camera/image processing IP, networking, etc. And you need a very well-balanced design for the system controller that manages all the IPs, their intercommunication, and their IO/memory requests from the die. As well as having a good packaging group. And lots and lots of more etcs.
Similarly, the mobile market operates very differently from what Intel has been used to in terms of OEMs, cost structures, customer support, etc.
It’s similar to why Qualcomm, for example, has had such a hard time scaling the opposite way: from mobile to desktop.
Company culture is perhaps a bigger limiter of success than engineering alone. This is why, even though Intel was able to produce a couple of mobile x86 SoCs for Android, which were somewhat competitive in performance, they went nowhere.
There is also the matter of the software catalog. ARM, by then, was way too entrenched in the mobile/embedded spaces, so they had the largest software offering there. x86 Android SoCs were not as attractive when most of the software was already running on ARM.
Similarly, the huge x86 library for desktop/server is the main reason why x86 won’t die, no matter how many times even Intel themselves have tried to kill it off and replace it.
Thanks for the insight. I would assume that a good deal of that, minus the software catalog, is also true now (the GPU, CPU, image processing, networking, etc.) for laptops. I guess they figured that out, if the benchmarks they gave are accurate. Having x86_64 with the battery life claimed would be very, very attractive.
It is very difficult to explain.
I personally think that Intel succeeded in mobile devices!
Their mobile phone Atom was not poor (but it was not compatible with some applications partly written in ARM assembly).
The Atom notebooks weren’t that bad either (I personally still have one as a Nextcloud/Jitsi server).
The Atom Z3xxx and Z8xxx were pretty good for embedded media players and tablets.
The Intel Core M chips were amazing for a 3.5W TDP…
But it would have been better with OS support other than Windows: what kills a limited-performance appliance? Doing a full anti-virus scan of 25,000 JPEGs every week…
Intel showed x86 can be scaled down to mobile. So, in that sense, it was a “success.” But the minuscule penetration they earned in that market sort of makes it a “failure.”
Similarly, ARM has shown it can be scaled up to the data center.
The thing about ISAs is that mostly only people on the net are still going on about them. Unlike academia and industry. 😉
Never going to happen. If the FTC blocked the Nvidia buyout after investigation, then this is a no-go from the start.
Is Mr Robot at it again?
As some have pointed out, Qualcomm is probably pushing the “the writing is on the wall” narrative for x86, perhaps without even knowing what Intel might have in its inventory.
It feels like a corporate game of chicken that would require executives on the Intel side of the table to blink for Qualcomm to get a favourable outcome!
It seems ARM CPUs are the new forever-advanced PS3 Cell CPU tech we’ll be hearing about for the next 10 years, even though the first round of reviews for the new Snapdragon chips noted that the new generation of AMD/Intel chips would be out by their release, and that the numbers Qualcomm was giving at the time would probably not be competitive once those were on the shelves.
The Snapdragon X Elite isn’t bad, but Qualcomm doesn’t have a decent GPU right now. That’s their current roadblock. Partnering with Nvidia could help them, or they can try to buy Arc off Intel. Pat might go for that, even though it would be a terrible mistake.
The real problem with ARM is getting high-frequency, fast desktop/workstation chips. Apple has come the closest but still loses hard to Intel and AMD on most workloads. Content creation isn’t the only workload.
Yeah, Arc is not even a competitor in most cases. Look at Phoronix: Arc gets 8 fps versus the cheapest AMD solution at 280 fps, in both Wayland and Xorg. Arc is currently not an option at all for anything. In Windows 11 it does, however, support nGlide decently well, so you can play Glide games at a decent frame rate. So you can play your 1990s games just fine. But retract that: nGlide does not work properly either. Spyro and Baldur’s Gate do not work properly.
I don’t think NVIDIA would be that interested in a partnership with Qualcomm, since rumor has it that they are going to launch their own WoA SoC.
Not likely to happen.
However, Qualcomm has been trying to diversify aggressively, since they will soon lose the Apple modem business.
Maybe they just want the fabs, something Intel doesn’t seem to want to do anymore.
Intel has already moved the fabs.
where?