What started last year as a handful of reports about instability with Intel’s Raptor Lake desktop chips has, over the last several months, grown into a much larger saga. Facing their biggest client chip instability impediment in decades, Intel has been under increasing pressure to figure out the root cause of the issue and fix it, as claims of damaged chips have stacked up and rumors have swirled amidst the silence from Intel. But, at long last, it looks like Intel’s latest saga is about to reach its end, as today the company has announced that they’ve found the cause of the issue, and will be rolling out a microcode fix next month to resolve it.
↫ Ryan Smith at AnandTech
It turns out the root cause of the problem is “elevated operating voltages”, caused by a buggy algorithm in Intel’s own microcode. As such, it’s at least fixable through a microcode update, which Intel says it will ship sometime in mid-August. AnandTech, my one true source for proper reporting on things like this, is not entirely satisfied, though: it notes that microcode is often used to paper over a root cause that sits much deeper inside the processor, and as such, Intel’s explanation doesn’t actually tell us very much at all.
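If you want to check whether the promised update has actually reached your machine once it ships, the currently loaded microcode revision is exposed by the operating system. Here’s a minimal sketch, assuming Linux and the “microcode” field in /proc/cpuinfo; compare the printed revision against whatever number Intel publishes alongside the fix:

```python
# Minimal sketch: print the CPU model and currently loaded microcode revision.
# Assumes Linux, where /proc/cpuinfo exposes "model name" and "microcode" fields on x86.

def read_cpuinfo_field(field: str) -> str:
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith(field):
                return line.split(":", 1)[1].strip()
    return "unknown"

if __name__ == "__main__":
    print("CPU:               ", read_cpuinfo_field("model name"))
    print("Microcode revision:", read_cpuinfo_field("microcode"))
```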
Quite coincidentally, Intel also experienced a manufacturing flaw with a small batch of very early Raptor Lake processors. An “oxidation manufacturing flaw” found its way into these chips, but the company claims it was caught early and shouldn’t be an issue anymore. Of course, for anyone experiencing issues with their expensive Intel processors, this will linger in the back of their minds, too.
Not exactly a flawless launch for Intel, but it seems its only real competitor, AMD, is also experiencing issues, as the company has delayed the launch of its new Ryzen 9000 chips due to quality issues. I’m not at all qualified to make any relevant statements about this, but with the recent launch of the Snapdragon X Elite and X Plus chips, these issues couldn’t come at a worse time for Intel and AMD.
Bad day to be a faithful Intel customer who purchased their last-gen stuff on launch.
Problem is, Intel is getting way too careless these days. Especially with the competition this strong, they cannot afford a misstep, yet they keep making them, again and again, ever since the FDIV bug.
Gamers Nexus’ stance on the subject ain’t pretty: https://www.youtube.com/watch?v=OVdmK1UGzGs
AMD is doing the right thing by delaying the launch rather than shipping chips that might be failing validation. The lackluster performance of the new ARM chips is really not a concern for AMD/Intel outside of a few laptops, since the Snapdragon doesn’t give Mac users much reason to switch. And since the Snapdragon on Windows can’t really play games (compared to an AMD/Intel chip, performance on Windows-on-ARM is terrible, and the graphics drivers are terrible), the only remaining use cases are work related, and there the benefit is really only battery life.
ARM has a lot of benefits, but its relevance in the Windows ecosystem has yet to be proven. Time and again, Microsoft has pushed Windows to different architectures, and time and again, nothing has stuck. The fundamental problem is that Windows is an x86 platform, and moving away from that means losing, or compromising, almost 40 years of existing software and application compatibility.
If you’re going to lose or compromise performance by moving architectures, you’d better have some good incentives to use the new platform. Windows doesn’t really bring anything to the table except the Win32 API, and that API is only useful if programs are actually ported to the new CPU architecture. Apple got away with it because its CPUs performed worlds better than the hobbled x86 mobile processors in its laptops, so the performance hit of x86 translation was not so noticeable. With Windows, people are moving from (often better thermally architected) x86 PCs to ARM, the performance difference between them is smaller, and the result is poor perceived performance of x86 code on ARM on Windows.
To drag this back on topic, away from low-power devices and back to high-power ones such as the i9… the main issue is power delivery. As with cars, efficiency never leads to performance; it’s one or the other. If you want a fast car, throw boost and displacement at it and take the hit on fuel efficiency. The same is true of CPUs: lots of cores running at high clock speeds is inherently going to suck power, and I think you’ll find that the “fuel lines” here are just not wide enough to deal with the demand (to go back to my car analogy), or in a CPU’s case, the power delivery buses are not wide enough to provide the current needed, and Intel “fixed” this by increasing the core voltage. However, this apparently (and I’m hypothesising here throughout this paragraph) led to degradation of the power buses through overvoltage, possibly through some process analogous to arcing.
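To put a very rough number on that hypothesis: dynamic switching power scales roughly with the square of the core voltage (P ≈ C·V²·f), so even a modest voltage overshoot is expensive in power and heat before you even get to long-term degradation. The figures below are made up purely for illustration, not Intel’s actual numbers:

```python
# Back-of-the-envelope sketch: dynamic CPU power scales roughly as C * V^2 * f.
# The effective capacitance and voltages here are hypothetical, chosen only to
# illustrate how quickly a voltage overshoot compounds into extra power and heat.

def dynamic_power(c_eff: float, voltage: float, freq_hz: float) -> float:
    """Approximate dynamic switching power in watts."""
    return c_eff * voltage ** 2 * freq_hz

nominal = dynamic_power(c_eff=2e-8, voltage=1.30, freq_hz=5.7e9)   # hypothetical baseline
elevated = dynamic_power(c_eff=2e-8, voltage=1.45, freq_hz=5.7e9)  # hypothetical overshoot

print(f"nominal:  {nominal:.0f} W")
print(f"elevated: {elevated:.0f} W (+{100 * (elevated / nominal - 1):.0f}%)")
# An ~11% voltage overshoot costs roughly 24% more switching power, and the
# stress on the power delivery network grows right along with it.
```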
I’ll be very interested to see if these microcode changes lead to more/different instabilities or performance losses. I think the issue here is pretty tightly rooted in the physical design of the chip, and microcode changes are not going to fix the root cause, only patch over it.
The123king,
Agreed. On the one hand, x86 software compatibility has been one of Microsoft’s strongest assets. Customers need to be able to run their (x86) software, and this has kept users coming back to Windows over and over. But at the same time it’s kind of Microsoft’s Achilles’ heel, since Windows is perpetually held back by application compatibility requirements in the cross-architecture realm.
It’s true that you can trade off between performance and efficiency. Node improvements have historically benefited both simultaneously. The issue is that we are reaching physical limits: without node improvements, tiny gains in performance require a lot more energy (and vice versa).
I suspect you may be right.
Which got me curious: are there any other companies that make x86 CPUs available outside China? Who owns VIA these days?
Since x86_64, it’s only been AMD and Intel, as far as I know, for commercial licenses in the West.
There is an AMD collaboration with a Chinese company, but it is a weird setup, because I think the front end is provided by AMD, so it’s a sort of joint venture.
VIA is basically dormant in terms of x86 microarchitectures, and most of the newer x86 stuff is now done by Zhaoxin, which sort of got the old x86 license from VIA and Centaur? But I don’t know if they have access to the latest x86 revisions in terms of licensing.
There were a few academic licenses though. But nobody cares much about them anymore, since RISC-V is easier to deal with.
The Intel stuff is even worse than advertised, it seems. They are pinning it all on this voltage issue, but they have also quietly admitted that an undisclosed number of chips are affected by oxidation issues (mentioned in the article summary) caused by a (now fixed) manufacturing problem.
I’ve been following this somewhat closely on Gamers Nexus and Level1Techs, and this has been a total disaster on Intel’s part. They’re still trying to mitigate the PR damage when they should be releasing a full list of affected part numbers and a full replacement program for any and all affected chips.
The oxidation issue is legit and was only eventually admitted to by Intel in a follow-up Reddit post (of all things).
I think this is a huge stain on Intel and vastly different than AMD delaying their upcoming chips to make sure they get it right.
If it’s a simple case of “we overclocked those things to the point of instability, so we are dialling it back a bit”, that’s good. And yes, Intel does some rather extreme “official overclocking” on its CPUs; the 13900K and 14900K have a TjMax of 100 degrees Celsius, which is insane.
If it’s a trick to push the issue beyond the warranty period, then Intel may be headed for a new “bumpgate” lawsuit.
A Tj of 100 °C is pretty standard nowadays; dGPUs have had a TjMax of 100 °C for ages.
Most SoCs tend to do Tj throttling somewhere in the range of 90 °C to 110 °C.
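For anyone curious how close their own chip runs to a limit like that, the Linux sysfs thermal interface exposes per-zone temperatures. A rough sketch; the 100 °C threshold is just the TjMax figure quoted above, and the real limit varies by part:

```python
# Rough sketch: list Linux thermal zones and the headroom left against a 100 °C TjMax.
# Assumes the sysfs interface at /sys/class/thermal/thermal_zone*; the 100 °C figure
# mirrors the TjMax discussed above, not a value queried from the CPU itself.
from pathlib import Path

TJMAX_C = 100.0

for zone in sorted(Path("/sys/class/thermal").glob("thermal_zone*")):
    try:
        zone_type = (zone / "type").read_text().strip()
        temp_c = int((zone / "temp").read_text()) / 1000.0  # sysfs reports millidegrees
    except (OSError, ValueError):
        continue
    print(f"{zone.name:<15}{zone_type:<18}{temp_c:6.1f} °C  (headroom {TJMAX_C - temp_c:+.1f} °C)")
```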