If RISC-V ever manages to take off, this is going to be an important tool in RISC-V users’ toolbox: felix86 is an x86-64 userspace emulator for RISC-V.
felix86 emulates an x86-64 CPU running in userspace, which is to say it is not a virtual machine like VMware; rather, it directly translates the instructions of an application and mostly uses the host Linux kernel to handle syscalls.
Currently, translation happens during execution time, also known as just-in-time (JIT) recompilation. The JIT recompiler in felix86 is focused on fast compilation speed and performs minimal optimizations. It utilizes extensions found on the host system such as the vector extension for SIMD operations, or the B extension for emulating bit manipulation extensions like BMI. The only mandatory extensions for felix86 are G, which every RISC-V general purpose computer should already have, and v1.0 of the standard vector extension.
↫ felix86 website
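For readers unfamiliar with how such a translator is structured, the heart of a JIT recompiler like this is conceptually a simple loop: look up the current guest address in a cache of already-translated blocks, translate on a cache miss, then jump into the generated host code. Here's a minimal sketch in C++ (the types and names are hypothetical illustrations, not felix86's actual internals):

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical shape of a translated block: host code we can jump into.
using HostCode = void (*)();

uint64_t guest_rip;                            // emulated x86-64 program counter
std::unordered_map<uint64_t, HostCode> cache;  // guest address -> translated code

// Assumed to exist: decode x86-64 at 'rip', emit RISC-V, return callable code.
HostCode translate_block(uint64_t rip);

void dispatch_loop() {
    for (;;) {
        auto it = cache.find(guest_rip);
        if (it == cache.end()) {
            // First visit to this block: pay the one-time translation cost.
            it = cache.emplace(guest_rip, translate_block(guest_rip)).first;
        }
        it->second();  // run the block; it updates guest_rip before returning
    }
}
```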
The project is still in early development, but a number of popular games already work, which is quite impressive. The code’s on GitHub under the MIT license.
Cool project. The article doesn’t cover the question that’s probably on everyone’s mind: performance. The usability of code translation depends on both compatibility and performance. The compat page shows that compatibility still needs work, but that can keep improving.
https://felix86.com/compat/
For performance, what’s the overhead for translated code versus native code?
Also, qemu does the exact same thing, so I’d like to hear more about how the goals of this project differ from those of qemu. Is this merely an alternative, or is there something that will differentiate it?
@Alfman
You are a well informed person so perhaps you know more about this than I do. However, I do not think that QEMU does the same thing as Felix86.
QEMU is a VM platform. It is going to emulate everything. Felix86 is going to JIT the application logic from x86-64 to RISC-V, but the Linux system calls are going to be sent directly to the host kernel, where they will be executed natively. While QEMU does employ JIT techniques, I do not believe that QEMU is so “Linux aware”. And, as with a container, there is no “guest operating system” here. Again, things run directly on the host kernel. This is going to lead to a significant difference in both performance and resource use.
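To make that concrete, the “Linux aware” part means the emulator catches the guest’s syscall instruction, remaps the x86-64 syscall number and arguments to the host convention, and invokes the host kernel directly. Something like this sketch (the syscall numbers are real Linux numbers; the surrounding code is hypothetical):

```cpp
#include <cstdint>
#include <sys/syscall.h>
#include <unistd.h>

// Hypothetical guest register state for the bits that carry a syscall.
struct GuestRegs { uint64_t rax, rdi, rsi, rdx; };

// Forward one guest syscall to the host kernel. A real emulator needs a
// full number-mapping table plus translation of any ABI-dependent structs.
long forward_syscall(GuestRegs& g) {
    switch (g.rax) {                 // x86-64 syscall number
    case 1:                          // write(fd, buf, count) is 1 on x86-64...
        // ...but 64 on riscv64; the host's syscall(2) wrapper hides that.
        // Guest and host share one address space, so the pointer works as-is.
        return syscall(SYS_write, static_cast<int>(g.rdi),
                       reinterpret_cast<const void*>(g.rsi),
                       static_cast<size_t>(g.rdx));
    default:
        return -38;                  // -ENOSYS: not yet mapped
    }
}
```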
When you combine this with the fact that, for games, a lot of the heavy lifting is done by the GPU, I think performance could be pretty good. The GPU stuff can be handled by the GPU natively without concern for how the CPU instructions are being handled.
LeFantome,
QEMU supports both VM as well as userspace emulation. I should have made my post clearer because many QEMU users may only be familiar with the full system emulation…
https://www.qemu.org/docs/master/user/main.html
My mind immediately wonders how these emulators compare to each other. Here is an unrelated benchmark comparing several userspace x86 emulators for ARM…
https://box86.org/2022/03/box86-box64-vs-qemu-vs-fex-vs-rosetta2/
In that instance, QEMU (without the benefit of KVM) is left behind by more optimized emulators. A benchmark to compare emulators on RISC-V is warranted; unfortunately, I don’t have any RISC-V computers to test for myself, haha.
That’s true, many games demand more of the GPU than the CPU. Emulation overhead will result in higher CPU load and power consumption, but it may not matter as long as the load stays below 100% CPU. I’m guessing that microstuttering and compilation delays might be more noticeable with a JIT design. Depending on the game, new code might keep getting compiled throughout play, or all the main code paths might be compiled up front without further involvement of the compiler inside the game loop. Somebody’s gotta benchmark this stuff for the sake of science 🙂
@Alfman
Thank you so much for that. I had either forgotten or did not know about QEMU user mode.
Now I want to spend the day exploring and benchmarking as you say. Sadly, I do not have the time. I think Felix is much like Fex, so those benchmarks may give us some hints. From your link, QEMU was by far the worst both in terms of performance and compatibility.
What I am blown away by right now, though, is that QEMU can do usermode emulation of BSD on a Linux kernel (so, running BSD software on Linux without a BSD kernel) and vice versa (Linux software on BSD). I for sure had no idea that such a thing was possible. I really do have to find the time to play sometime.
LeFantome,
Indeed. In the past when I tested QEMU VMs without KVM, the software emulation scored about 20-25% of native (x86 on x86). However, the fact that “QEMU doesn’t integrate a pass-thru mecanism for GL by default” seems like a death sentence for both compatibility and performance.
There are challenges when it comes to emulation across architectures. x86 in particular has stricter memory semantics, which are non-trivial to replicate on architectures with looser memory semantics. QEMU takes a hit for this: “Note that not all targets currently emulate atomic operations correctly. x86 and Arm use a global lock in order to preserve their semantics.” I wonder how much performance QEMU loses because of the global lock. Apparently Apple implemented x86 memory semantics in their ARM CPUs to help with Rosetta 2 emulation. It makes me wonder whether box86, FEX, and felix86 are emulating x86 memory semantics correctly. Most software does not depend on x86 semantics, so not implementing them could improve performance without a negative impact. I can’t say anything definitive without going through the code and testing each emulator’s behavior.
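To illustrate the memory-model point: x86 orders ordinary stores and loads (TSO), while RISC-V’s RVWMO does not, so a strictly faithful translator has to emit fences around plain memory accesses. Here’s a sketch using C++ atomics as a stand-in for the emitted instructions (my own illustration, not any emulator’s actual codegen):

```cpp
#include <atomic>

std::atomic<int> data{0}, flag{0};

// What x86 guest code relies on implicitly: two plain stores are observed
// in program order (TSO). A faithful RISC-V translation must add a fence;
// C++'s release ordering models that extra cost here.
void producer_faithful() {
    data.store(42, std::memory_order_relaxed);
    flag.store(1, std::memory_order_release);   // ~ fence + store on RISC-V
}

// The fast-but-unfaithful translation: plain stores, no fence. Under
// RISC-V's weaker RVWMO model another hart may observe flag == 1 while
// data is still 0 (legal there, impossible on real x86).
void producer_fast() {
    data.store(42, std::memory_order_relaxed);
    flag.store(1, std::memory_order_relaxed);
}
```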
JIT sucks. I am so glad Android moved away from it and went AOT.
JIT for ISAs not meant to be JITed, such as x86, sucks even harder.
I can see Felix is trying to thunk various stuff, which is not done by qemu (e.g. it’s cheaper to load a native RISC-V library and proxy the calls instead of calling into JITed shared libraries such as OpenGL), so it’s going to be a huge benefit when code calls into external libraries. I think the Xbox emulator team used a similar approach when calling the win32 API, and Microsoft probably uses similar techniques to bridge 32-bit and 64-bit stuff. It looks like they just cover the basic stuff so far, however, and more work on the JIT would be beneficial.
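For anyone curious what a thunk looks like in practice: instead of running the x86-64 libGL under the JIT, the emulator intercepts the call, passes the arguments across, and invokes the native RISC-V library via the dynamic linker. A hypothetical sketch (glClear is a real OpenGL entry point; the plumbing around it is invented):

```cpp
#include <cstdint>
#include <dlfcn.h>

// Sketch of a GL thunk: instead of running x86-64 libGL under the JIT,
// resolve the same symbol in the host's native RISC-V libGL and call it.
using GlClearFn = void (*)(uint32_t mask);

void thunk_glClear(uint32_t mask) {
    // Resolved once per process in a real implementation; inline for brevity.
    static void* libgl = dlopen("libGL.so.1", RTLD_NOW);
    static GlClearFn native =
        libgl ? reinterpret_cast<GlClearFn>(dlsym(libgl, "glClear")) : nullptr;
    if (native)
        native(mask);  // plain integer argument crosses the boundary directly
}
```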
It also looks like they switched to a qemu-like approach for translation: instead of complex SSA logic, which is good for compilers, just translate the necessary stuff and let the CPU do the rest.