The Box64 project, which allows you to run Linux x86-64 binaries on non-x86 architectures like ARM and RISC-V, has achieved a major milestone with its RISC-V backend.
It’s been over a year since our last update on the state of the RISC-V backend, and we recently successfully ran The Witcher 3 on a RISC-V PC, which I believe is the first AAA game ever to run on a RISC-V machine. So I thought this would be a perfect time to write an update, and here it comes.
↫ Box86/Box64 blog
Calling this a monumental achievement would be underselling it. In case you don’t appreciate how complex running The Witcher 3 on RISC-V really is: they’re running a Windows x86 game on Linux on RISC-V using Box64, Wine, and DXVK. This was only made possible relatively recently, thanks to more and more x86-friendly instructions making their way into RISC-V, as well as newer RISC-V machines that can accept modern graphics cards.
The Witcher 3 runs at about 15 frames per second in-game, using the 64-core RISC-V processor in the Milk-V Pioneer combined with an AMD Radeon RX 5500 XT GPU. That may not sound like much, but considering the complexity underpinning even running this game at all in this environment, it’s actually kind of amazing. It seems Box64 could become as important to gaming on ARM and RISC-V Linux as Wine and Proton were for gaming on x86 Linux.
There’s still a lot more work to be done, and the linked article details a number of x86 instructions that are particularly important for x86 emulation, but are not available on RISC-V. The end result is that RISC-V has to run multiple instructions to emulate a single x86 instruction (“a whole of 10 instructions for a simple byte add”), which obviously affects performance.
This news is almost magical, and is exactly the kind of thing I love to see here! I wish I could afford to get into RISC-V right now, but life is putting a damper on my hobbies for the moment; I simply don’t have the free time to spend on exploring a new architecture right now. I do feel that the time is right to switch out my postponed Raspberry Pi projects for RISC-V equivalents when I do have the time and money to get back into it. The community support around RISC-V is approaching parity with the Pi community, especially as the latter has dwindled due to the Foundation’s focus shift towards commercial customers and away from hobbyists.
Yeah, this is a great achievement. I wonder if the performance would improve if Wine and DXVK were dropped from the chain; after all, The Witcher 3 has a native Linux version.
While reading this, I came across the following article about a RISC-V Ubuntu tablet:
https://www.omgubuntu.co.uk/2024/08/dc-doma-pad-ii-risc-v-tablet-runs-ubuntu
Why can’t we have such a low-power and low-cost tablet with an x86 CPU? VIA and Transmeta were doing it 15 years ago, so it can be done. Instead, we have to use ARM and RISC-V CPUs despite PC software being generally x86-centric, and then force the software to work via emulation that may or may not work.
I’ve made the same sad observation. They are on the market, but they’re expensive and mostly aimed at industrial use. And I’d love to play old Windows games on a Linux tablet via Wine.
Because x86 just isn’t cut out to do low-power? Not for lack of trying, but Intel failed miserably with the low-power Atoms. No x86 in phones anymore! What VIA and Transmeta did can’t be compared to what modern ARMs do, so yeah, you could do _something_ 15 years ago, but not power a modern low-power tablet. And PC software isn’t just “x86-centric” (bit of a strange term, it’s all just x86/x86-64, not “centric”), but also requires Windows. So emulating an x86 processor is one thing, but you also need (a subset of) Windows and either DirectX or OpenGL drivers, etc.
Intel Atoms for smartphones were doing very well on the AnTuTu benchmark; their failure was due to other reasons, mainly the fact that the modems in Qualcomm’s SoCs were so much better than anything else, and the fact that Android native apps can be assumed to have ARM binaries but not necessarily x86 binaries. None of this is relevant for a tablet meant to run desktop OSes (aka Windows and desktop Linux) and that will only occasionally use cellular data (if at all).
Also, personal pet peeve: Intel Atom is a bad word in the minds of most customers (even worse than Celeron) and Intel should stop using it; they are hurting themselves.
BTW people complain that x86 is a duopoly, but SoCs for high-end phones are a monopoly because Qualcomm’s modems are so much better than anything else and Qualcomm has patented certain aspects of the modems that make them so good, so everyone else is legally barred from competing on the high-end.
Back on topic, I know x86 comes with some overhead, but it’s not as bad as people make it out to be; I’d happily take a small performance hit to have x86 compatibility without emulators. Also, the patents for x86-64 should expire soon; even Intel’s x86-64 (Conroe) launched in July 2006.
The main problem with x86 is that Intel is stuck on obsolete process nodes and AMD doesn’t care about seriously competing in low-power. So, we are all supposed to move to ARM because of corporate incompetence. MacOS is especially funny in this regard, because MacOS users were moved from RISC (PowerPC) to x86 because PowerPC was stuck on ancient process nodes, and now they are being moved from x86 to RISC (ARM) because x86 is stuck on ancient process nodes. But this time the PC is also supposed to be moving to ARM for the same reason. I hate this game. Someone please convince Intel to fab some dies at TSMC while they work out their first-party fab issues.
Not completely true. Qualcomm is entirely willing to license their modem designs to other manufacturers, but it’s a very expensive license. Apple may well be the one to figure that out when its modem is ready.
“x86 just isn’t cut out to do low-power” – that is a completely unsupportable assertion. It would be more accurate to say that Intel (and AMD to a lesser extent) is not cut out to do low-power, but it really has nothing to do with the ISA. That old yarn has been debunked dozens of times. Why? The decode stage of x86 is just a part of the CPU – the execution engine is where all the important stuff happens and where the power is spent. ALL modern architectures – ARM, RISC-V, MIPS, PowerPC, x86 – ALL of them have a decode engine, and all of them have to spend some time translating instructions into something the execution engine actually understands. For Intel, the problem lies in their business model. They sell commodity hardware in a world where bespoke, tailored ARM chips are more fashionable (and more profitable). (Microsoft is facing a similar problem with Windows.) THAT is why Intel hasn’t competed, not for any ISA reasons. AMD, to their credit, is doing more with custom-designed chips (see Valve’s Deck CPU, or the CPUs in other game consoles, for example), but they are doing it through a services model, while ARM lets companies just license the tech and do it themselves.
ARM is outcompeting x86 companies – for business reasons, not for ISA reasons.
Read more: https://chipsandcheese.com/2024/03/27/why-x86-doesnt-need-to-die/
CaptainN-,
There is truth on both sides of the argument, but you can’t write off either side without looking at more specific workloads.
Once everything is in uop cache, the ISA is abstracted away and to this end it would be fair to say the execution cores are NOT ISA specific. However this really is not the same as suggesting that ISA can’t create energy/time bottlenecks in front of the execution cores. That clearly depends on the prefetcher’s duty cycle under a given workload. Some workloads are very tight loops with computationally intensive instructions that consume virtually all the CPU’s time and power.
But sometimes there are workloads that don’t fit so well in the uop cache, consisting of long code paths and lots of context switching between kernel, processes, threads, etc. These workloads will naturally place a heavier load on the prefetch unit. A complicated instruction prefetcher under a heavy duty cycle will affect both instruction latency and power efficiency.
Can an ISA impact the performance of the prefetcher? The answer is yes, it can. Not only is the ISA responsible for things like instruction density, but also for the ease with which instructions can be quickly fetched in parallel. This is why Apple Mx processors are able to prefetch more ARM instructions in parallel than x86 can.
Of course everything in engineering is compromise. x86 engineers can add more cache and transistors to help compensate…and to their credit this works very well. But it also comes at a cost: using more power. These kinds of tradeoffs get spelled out pretty clearly when looking at p-cores versus e-cores. If a cleaner ISA helps eliminate some of these tradeoffs, that’s where it can be advantageous.
That prefetch/parallelization scenario is like the one x86/x64 critique that holds up, but even that can be (and largely has been) mitigated by throwing die space at the problem. Apple, though, doesn’t beat x86 in performance at the top end, and they don’t really claim to. They do claim power efficiency and long-lasting batteries, most of which comes from very aggressive race-to-sleep algorithms and a bunch of specialized co-processors that Intel would never add (because they need a commodity argument, and specialized coprocessors wouldn’t be used by enough customers to justify the transistors – an innovator’s dilemma).
BTW, Qualcomm (and Samsung) can claim efficiency too, mostly because they’d done the work on the Android side and ported that over to their Windows CPUs. However, they are going to run into a tantalizingly similar problem on the Windows side. Microsoft sells commodity software. They likely won’t have a whole lot of interest in supporting Qualcomm’s efforts (through the filter of Samsung). Conversely, how much real staying power does Qualcomm/Samsung have to keep optimizing proprietary Windows for their laptop CPUs? It’s a really interesting conundrum (and maybe that’s why Qualcomm/Samsung are hedging with Linux support). What I could imagine is some kind of ARM-based Steam Deck thing, or a game console (likely running some flavor of Linux). But that’s got its own whole minefield of compatibility and market challenges.
This is why I say it’s more a business problem and less an ISA problem, and while there are some issues with the x86/x64 ISA, which could actually be addressed with some smallish changes, there are fewer than people think.
CaptainN-,
Yes, I agree. Throwing more transistors at the problem helps solve performance bottlenecks, but the tradeoff with that is more energy consumption. If your goal is to reduce energy, then adding transistors is a problem. If a simpler architecture can avoid the need for those transistors, this would be advantageous.
I don’t necessarily think x86 overhead is huge, but reading your comments it does seem like we can agree that it can be improved on.
There was a time when Apple were beating Intel at single-threaded performance AND energy consumption at the same time, although that had a lot to do with the fact that Apple were on TSMC’s fabs. Today they don’t beat Intel on performance any more, but they still beat Intel on energy efficiency by a huge margin. In order for Intel to become competitive on power, and assuming comparable fabs, they would have to decrease the number of actively powered transistors, and in doing so trade away the performance that many of those transistors exist to provide.
Yes, I accept there are a number of complex market factors in play.
For Windows users anyway, the compatibility issue is a huge factor, far more important than the benefits of ARM. As a Linux user, the compatibility issue affects me far less than on Windows because most of my workloads are not dependent on x86. I think ARM would do pretty well with Linux users if there were more commodity systems available and fewer ARM-specific problems. The lack of effective standards around booting generic operating systems is particularly egregious with ARM. Anyway, that’s really a different topic.
Just to reiterate, it’s not impossible to make x86 as efficient as ARM – it’s just that there are only 2 companies that can do it (due to patents, and stupid crap like that). Die size doesn’t translate directly into more power – but it does translate directly into more complexity, and expense. (And again, there are a couple of small changes that could be made to the ISA that would help a lot here.) It’s just a question of business priorities, and so far, the 2 companies that can make x86 chips haven’t chosen to focus on efficiency to the same extent as folks like Apple (there are other business structural advantages – Apple is vertically integrated and all that). I’m just saying that the ISA is not the main reason x86 isn’t as power efficient as ARM (or RISC-V, to get back on topic) – that’s something of a myth.
For Windows users, they do need Windows apps to “just work” – and I’m honestly surprised that Qualcomm and Samsung didn’t try to add something like those special memory modes that Apple’s ARM chips have. I don’t see how you can run x86 on ARM without some specialized hardware to support that. I guess maybe they are counting on the chips just being fast enough, but that’s not a bet I would make.
As far as the standard boot method goes, one of the things that Casey Muratori mentions in that video I linked as a cleanup (or reform) target is real mode – which is used during startup (I don’t know the details, I’m just passing along what he said). That, plus a few specific instructions that get in the way of parallelizing the decode stage, are things he suggests changing in the ISA to simplify decode (but he also says it’s largely taken care of in the algorithms AMD and Intel have implemented in their decode engines).
CaptainN-,
For the reasons we’ve talked about, additional complexity can be compensated for with more transistors. But that doesn’t work when it comes to efficiency; x86 is at a disadvantage there. It’s not a question of if, but how much, and that really depends on the workload and prefetch duty cycle.
I agree idle transistors don’t use as much power as active ones, but that’s kind of the point I was trying to convey. If the prefetcher is mostly idle while the CPU is doing tight loops, and even SSE, then the ratio of prefetch energy versus computation energy will be dominated by the computation. However, if you have a large code base and a lot of thrashing between kernel/processes/threads, the prefetch duty cycle may hang near 100% while the CPU has little to do until the prefetcher fetches more instructions; then the ratios naturally change and x86 can become more of a bottleneck than before. This is bad, and so Intel & AMD throw more transistors at the problem, which works, but doing so uses more power compared to an architecture that’s simpler and faster to decode.
Maybe not the main reason, but one of them. I don’t think the x86 overhead is huge, but I don’t think we should be insinuating there is none either.
I don’t think the majority of x86 software really depends on the x86’s strict memory model. A lot of software is actually quite portable provided the runtime libraries and system calls are there. I don’t know how we would go about finding out, but it would be interesting to have a lot more statistics about Windows software’s specific dependence on the x86 strict memory model.
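To illustrate what “depending on the strict memory model” looks like in practice, here’s a minimal hypothetical sketch (the pattern, not anyone’s real code; the volatiles mirror how a lot of legacy code was actually written and are NOT a portable synchronization primitive):

    /* The classic flag-passing pattern that "just works" on x86 because of
       TSO, but can break on a weakly ordered CPU. */
    volatile int data  = 0;
    volatile int ready = 0;

    void producer(void) {
        data  = 42;   /* x86 never lets other cores see these two stores   */
        ready = 1;    /* out of order; ARM and RISC-V are allowed to.      */
    }

    int consumer(void) {
        while (!ready) { }   /* spin until the flag is set */
        return data;         /* always 42 on x86; may still read 0 on a
                                weakly ordered CPU unless fences are added */
    }

That is exactly the choice an emulator faces: pessimistically insert fences around guest memory accesses (slow), or run on hardware with a TSO mode – Apple’s ARM chips have one, and RISC-V defines one as the Ztso extension.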
Ultimately I am not one to suggest windows users should switch to ARM anyway.
Maybe we can change x86 to make instructions easier to prefetch. The easiest thing to do is just to pad everything out so that instructions are more predictable. While trivial, we may find that we solve some of the complexity, but the trade-off is lower code density and therefore more cache (i.e. more power). Everything is a tradeoff. x86 would not be designed the way it is today; the reason we stick with it is backwards compatibility.
“The end result is that RISC-V has to run multiple instructions to emulate a single x86 instruction” – which… is kinda expected when emulating a CISC processor with a RISC processor? You reduce the types of instructions, but need more of them to do the same as CISC with a single instruction.
RISC doesn’t mean minimal instruction set… it means reduced. Mostly this hinges around not having a plethora of specific memory instruction variants.
Ideally you’d be able to do this in 3 instructions on an optimal RISC… while it’s taking around 7 here because RISC-V cannot natively “do the thing”:
load
do the thing
store
cb88,
I agree. I’m not sure if it will happen, but the article makes a very good case for the missing instructions.
The “ADD AH, BL” example shows how a seemingly trivial thing takes several instructions to emulate. When we use a naive emulator that takes every instruction literally and executes it one step at a time, this can’t be done efficiently unless RISC-V has a suitable equivalent already in silicon.
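To make the cost concrete, here’s a rough C sketch of what a naive emulator has to do for that single instruction (names and layout are mine, purely illustrative; guest registers are assumed to live in plain 64-bit host variables, and EFLAGS handling is left out even though it’s often the expensive part):

    #include <stdint.h>

    /* Naive emulation of x86 "ADD AH, BL": the host has no sub-register
       byte operations, so AH and BL have to be extracted, added, masked
       and merged back by hand. Each line maps to one or more host
       instructions on RISC-V. */
    static void emulate_add_ah_bl(uint64_t *rax, uint64_t rbx)
    {
        uint64_t ah  = (*rax >> 8) & 0xff;              /* extract AH (bits 8..15) */
        uint64_t bl  = rbx & 0xff;                      /* extract BL (bits 0..7)  */
        uint64_t sum = (ah + bl) & 0xff;                /* 8-bit add, drop carry   */
        *rax = (*rax & ~(uint64_t)0xff00) | (sum << 8); /* merge result into AH    */
        /* A real emulator would also have to compute CF/OF/SF/ZF/AF/PF here. */
    }

Every one of those shifts and masks becomes a separate RISC-V instruction, which is roughly where the “10 instructions for a simple byte add” figure comes from; a bitfield extract/insert style instruction would collapse most of it.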
The original programmer writing in a high level language doesn’t care about “ADD AH, BL”. A more sophisticated emulator would be able to de-compile the instructions, follow the code paths to see what this instruction is doing in the broader context of an algorithm, and then recompile it efficiently using RISC-V instructions without the need for “ADD AH, BL” at all. Theoretically this might even yield better performance than the original! But this type of emulator is far more complex to implement.
For kicks I searched my /bin directory for any binaries that had this pattern. This one-liner does the trick:
Edit: Oh no, wordpress is killing my backslashes…
Note the ATT syntax.
It found only two candidates in the /bin directory and unfortunately neither are open source.
I wanted to see what the code would natively compile to on ARM.
Anyway, Nvidia’s code interestingly uses repnz, which is a different twist on the instruction. I believe x86 flags can be more difficult to emulate.
I know this is just one arbitrary case and there are others. Still, if these instructions are rare, inefficient translations might not be such a big deal.
Edit: There’s a bug with the WordPress comment editor and backslashes. I added more backslashes, but I see now that the backslashes got doubled.
Those commands aren’t even what modern CPUs execute – that’s just what is sent to the CPU, which is then decoded into another, entirely different format, which is actually executed within the CPU’s execution engine.
I posted a link to an article which explains all this in another comment. Here’s something similar in video format: https://www.youtube.com/watch?v=xCBrtopAG80
It’s worth noting that RISC means “fewer instructions”, which means you have to double up on the smaller set to do the stuff a CISC CPU can receive in one shot – it’s kind of built in. But it’s also worth noting that modern “RISC” architectures like ARM have been adding quite a lot of instructions.
CaptainN-,
Obviously they get translated to uops, but the problem the article goes over is that since the RISC-V ISA doesn’t have the instructions, the emulator has to spell those operations out inelegantly in RISC-V instructions. Hypothetically, even if the missing x86 instructions were implemented on the uop side, they would still need to be reachable by the emulator to be used.
RISC vs CISC are overloaded terms that mean too many things to different people and for this reason it’s probably better to talk more about specifics than in these general terms.
That reminds me of a recent article where the PS3 emulator author explained why he’s using AVX-512 to emulate some PS3 CPU features, with some low level explanations. I love this stuff, even though my day to day development work doesn’t touch anything remotely low level like this. https://whatcookie.github.io/posts/why-is-avx-512-useful-for-rpcs3/
CaptainN-
Yeah I concur, this is cool stuff 🙂
I do wonder how much AVX-512 will limit the accessibility of this emulator though.
https://www.tomshardware.com/news/intel-bios-update-disables-alder-lake-avx-512
I don’t have much experience with AVX-512, but I do know some Intel CPUs downclock when running AVX-512, which I only learned about when I tried to benchmark it.
The L1 cache on modern CPUs is plenty big enough. So I wonder how the PS3 SPU register speed compares to modern L1 cache? Any idea?
According to memtest
i9-11900k L1 achieves 326GB/s
Ryzen 7 5800x L1 achieves 254GB/s
I don’t have any performance numbers, other than that which whatcookie provides, and I’m aware of the Intel throttling and the fact that Linus Torvalds apparently HATES AVX-512 extensions because of that throttling. I do know, there is a backup code path in case AVX-512 isn’t supported in the PS3 emulator, and he claims it’s about 30% slower (or using AVX-512 on supported CPUs is 30% faster). I mostly just found it interesting – I’m not looking to play PS3 games at 200% of the original speed on an emulator.
Yeah, I think it’s a question of… are these instructions really useful for all RISC-V users, or should they only be there for x86 emulation in an x86 emulation extension to the spec?
The way around the number of instructions is going superscalar… so while a CISC may have a single instruction to do X, it may not execute in a single cycle, and for those that do… a smart RISC may schedule them in parallel if they do not have data races.
A smart x64 scheduler will also schedule things in parallel (and out of order, etc.) – there’s also no guarantee that in RISC a single instruction takes a given number of cycles. Modern RISC architectures end up looking a lot like x86 these days – the only real difference is in the details, and how the decode engine performs (x64 decode engines do need to be a bit larger).
I mean, this stuff is fascinating – we CAN know what is going on when executing a decoded instruction’s micro-ops by looking at this table: https://uops.info/table.html
So coooool! https://youtu.be/xCBrtopAG80?si=M93NmpH6mN5XzRpw&t=2154
CaptainN-,
I remember when it was common to use references like this to write efficient assembly code.
I didn’t know people still used this. The architecture specifics are so complicated now that I’m not sure how useful this approach is anymore… but maybe a really high-end compiler could be fed enough information about a CPU’s internal workings to be able to optimize for it. I know GNU C has CPU switches for this purpose, but I just take it for granted and I’ve never really tested to see how effective it is in practice. Does anyone have more insights here?
Applying some optimizations based on that table to performance sensitive code might be something an LLM could do with reasonable expectations.
I said this elsewhere, but some of these instructions could be bundled as a RISC-V extension targeting emulation. Many (perhaps most) RISC-V use cases would not require this extension, and normal RISC-V code could probably avoid it without issue (to maximize portability). However, CPUs that target use cases where emulation can be expected (e.g. desktop or laptop chips) could implement the extension, and then emulators could take advantage of it.
The emulators could be written such that they check for the extension and use the more efficient instructions if present, and longer sequences of instructions if not. This means that the emulators would be broadly compatible and extension support would just be a performance question (similar to AVX-512 on amd64). Again, this would not complicate the overall RISC-V ecosystem. Chips that do not need these instructions can skip them (saving silicon).
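A rough sketch of what that check-and-dispatch could look like on the emulator side – the extension name (“Zx86emu”) and the probe are entirely made up, since no such extension exists today:

    #include <stdint.h>

    /* Hypothetical: "Zx86emu" is an imagined extension with x86 helper
       instructions; the probe is stubbed out here, but in a real emulator
       it would query the kernel (e.g. a hwprobe-style interface). */
    static int host_has_zx86emu(void) { return 0; /* stub: assume absent */ }

    /* Fallback: the long plain-RV64-style sequence, written in C here. */
    static uint64_t add_ah_bl_generic(uint64_t rax, uint64_t rbx)
    {
        uint64_t sum = (((rax >> 8) & 0xff) + (rbx & 0xff)) & 0xff;
        return (rax & ~(uint64_t)0xff00) | (sum << 8);
    }

    /* Fast path: would be a couple of inline-asm instructions if the
       extension existed; same C body here just to keep the sketch buildable. */
    static uint64_t add_ah_bl_ext(uint64_t rax, uint64_t rbx)
    {
        return add_ah_bl_generic(rax, rbx);
    }

    typedef uint64_t (*add_ah_bl_fn)(uint64_t, uint64_t);

    /* Chosen once at emulator startup; everything else calls through the
       pointer, so extension support is purely a performance question. */
    static add_ah_bl_fn emulate_add_ah_bl;

    void select_backend(void)
    {
        emulate_add_ah_bl = host_has_zx86emu() ? add_ah_bl_ext
                                               : add_ah_bl_generic;
    }

If I understand the Box64 posts correctly, its RISC-V dynarec already probes the host for optional extensions (Zba, Zbb, and some vendor ones) and emits better code when they are present, so an emulation-oriented extension would slot into the same mechanism.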
There are decompilers like Binary Ninja, so I suppose you could write an emulator that decompiles the code into pseudo-C and then uses an optimizing compiler.
This is never going to be as compatible. It also seems very likely that the recompiled code will be less optimized. It will be difficult for information not to be lost.
tanishaj,
I think it would depend on the method used. You could literally take each decompiled opcode and emit output code that corresponds to it. This wouldn’t even be hard to do. Done this naively, the resulting program should technically be robust, but it would also be ridiculously over-specified. This limits optimization opportunities. The output would effectively have to match whatever way the x86 instructions did things. Optimizations that would have had leeway with the original C source wouldn’t necessarily be available.
To fix this and try to interpret the programmer’s intent, heuristics could be used, but heuristics and intent are lossy…
Some of these may be good candidates for RISC-V extensions. If you are making a chip for a platform that is likely to have emulation as a target use case (e.g. a laptop or desktop CPU), these extensions could be implemented. If it is a platform where that is less likely (e.g. a phone, microcontroller, etc.), then these instructions could be skipped (saving silicon).
If I understand correctly, this uses Wine’s x64 binaries. Now that Wine supports WoW64, I wonder how a WoWRV64 would perform. There have been articles here on OSNews about people who emulated dead Windows architectures like MIPS in modern Windows that way, and that is what Microsoft uses to emulate x64 on ARM.
We have also discussed the performance of RISC-V vs ARM here before. I think RISC-V needs to add some instructions to be performant, not just for compatibility with x86.
Likely. I think it would make sense, though, to separate these use cases.