Releases: FEX-Emu/FEX
FEX Release FEX-2410
Read the blog post at FEX-Emu's Site!
A bit of a slower month for this release as our developers prepare for the X.Org Developer's Conference 2024 and
the GStreamer Conference 2024!
We're going to keep this release short and sweet as GStreamer conference has already started and XDC is kicking off in two days.
ARM64EC changes
This month we had some ARM64EC changes to fix some AVX bugs around state saving and restoring when running under Wine. In addition some changes to dynamically determine Windows syscall numbers under WINE rather than hardcoding it.
Additionally some fixes around the how the call checker works, which fixes issues when applications hook function calls early using the Win32 APIs.
For any user that wants to tinker with these, you can find instructions for how to build & install on our wiki.
JIT optimizations
We had some optimizations land in the JIT this month which speeds up the time it takes the JIT to emit code. In some microbenchmarks this has shown up
to 9% less CPU time spent JIT'ing code. This is very important for reducing stutters when encountering new code inside of applications.
JIT fixes
We fixed multiple bugs in the JIT emulation this month, one of which was fixing the x87 FPREM instruction. This instruction is interesting because x87
actually offers another variant called FPREM1, which FEX did emulate correctly. Once we found the bug in the implementation, it actually managed to
fix a major issue where Steam's login was super flaky and had to attempt it about a dozen times before it worked. This also happened to fix
Touhou Luna Nights which rendered its 2D tiles incorrectly before this change.
We also fixed an issue where Halls of Torment was copying data from a thread-local object using vector instructions and FEX had mistakenly broken
vector memory accesses to TLS objects. Fixing this solves this game problem, and probably many other games that happens to inline a memcpy accelerated
using vector operations!
We also happened to fix a bug with the SSE MAXSS instruction where in the face of NaNs we were returning the incorrect result. Only affected some edge
case behaviour, but nice to see little bug fixes like this.
An aside
Make sure to check out the XDC livestreams over the next few days! You definitely won't want to miss it!
See the 2410 Release Notes or the detailed change log in Github.
Raw Changes
FEX Release FEX-2410
-
AOTIR
-
Move debugdata structure to internal header (304b5de)
-
ARM64EC
-
Dynamically determine syscall numbers under wine. (967c04e)
-
AVX register save/restore support under wine (e9d34c7)
-
Patch the call checker before calling any syscall exports (2672e06)
-
Switch to exception-based native syscall dispatch (07f117b)
-
Arm64
-
Fix warning (ebae1bf)
-
CMake
-
Disable vixl compiling if not enabled (f99900b)
-
Compile with system libraries for xxhash, Catch2, and fmt, if available (a4acd64)
-
CPUBackend
-
Remove unused functions (5f79761)
-
CPUID
-
Update to something a little more modern (5bd0afa)
-
Config
-
Be a bit smarter in ReloadMetaLayer (b5c9eb3)
-
Fixes stack usage in EnvLoader (4f8d7cb)
-
ConstProp
-
speed it up (36e4f9a)
-
FEXConfig
-
Add the ability to watch and configure global rootfs (f49d82d)
-
FEXCore
-
Remove BlockSamplingData (9310ac5)
-
Convert Base tables over to constexpr (20fb0da)
-
Remove Context header include when unneeded (4cb4c0e)
-
Delete ThreadsState struct (219a477)
-
Move
CustomIRResult
to internal header (29f0e9b) -
Move
JITSymbolBuffer
to internal header (b75efc4) -
Context
-
Removes unused features (05be944)
-
FEXLoader
-
Drop the binfmt_misc
I
flag (e190d02) -
FEXRootFSFetcher
-
Fallback to TTY if zenity isn't installed (bebd740)
-
XXHash
-
Fix double fd close (360ea53)
-
FEXServer
-
PipeScanner
-
Stop writing null to incoming pipes (0a35029)
-
JIT
-
always use 64-bit moves for SRA (cd4c224)
-
Linux
-
Optimize CreateNewThread and HandleNewClone stack usage (d547f2b)
-
LinuxEmulation
-
Update syscalls for v6.11 (aaa8eef)
-
Update DRM for v6.11 (c036868)
-
LinuxSyscalls
-
Update max reported kernel version to 6.11 (9716fc7)
-
With poll syscall, ensure fds is writable only if nfds is not zero (f77841d)
-
OpcodeDispatcher
-
Fixes segment prefixing on vector element loadstores (2726f35)
-
Constexpr-ify all the tables possible (9d6865f)
-
Do not forbid INT 2E syscalls on 64-bit Windows (bbef4d7)
-
SignalDelegator
-
Let the frontend inform AVX support (9100235)
-
Thunks
-
Move to the frontend (f640dcc)
-
Vulkan
-
Updates thunks to v1.3.296 (1670c89)
-
Tools
-
Delete imgui FEXConfig (c0b8a0d)
-
VDSO
-
Implements v6.11 vdso getrandom (e902ad5)
-
VDSOEmulation
-
Stop using dlopen for VDSO (5745b41)
-
Windows
-
Switch supported MinGW CRT to UCRT (c2f2c58)
-
Don't assume the log file was successfully opened (f20e626)
-
Misc
-
speed up the JIT (e82d2c7)
-
Fixes fprem (b89f5b8)
-
x87 small cleanups; NFC (5e9c211)
-
Build host tools without jemalloc (49de1fa)
-
Fix FScale'ing zero that should return zero (4d62e75)
-
Stop installing binfmt_misc on x86 (b6d3df0)
-
Fix (v)maxss on non-AFP platforms (9e4de3b)
-
Support for emulated overcommit on Windows (4a2b4be)
-
Bunch of little things reported by coverity (b4c797c)
-
Global flag optimizations (1c59bfe)
-
gitignore
-
ignore build symlink (90dbd47)
-
unittests
-
Be more strict with -- separator (3c205eb)
-
Fixes incorrect argument (c8f3fe3)
FEX-2409
Read the blog post at FEX-Emu's Site!
FEX-2409 is now released... with a big performance boost.
Inline Graph
I'm tired, carry me.
Little differences between x86 and Arm can cause big performance penalties for
emulators. None more so than flags.
What are those?
Flags are bits representing the processor state. For example, if an operation
results in zero, the "zero flag" is set. Both x86 and Arm have flags, so for
emulating x86 on Arm, we map x86 flags to Arm flags to reduce emulation
overhead. That is possible because x86 and Arm have similar flags. By contrast,
architectures like RISC-V lack flags, slowing down x86-on-RISC-V emulators.
Many arithmetic operations set flags. Programs can then conditionally jump
("branch") according to the flags. On x86, the flags are thus the building blocks
of if statements and loops. To check if two variables are equal, x86 code
subtracts them and checks the zero flag. To check if one variable is less than
another, x86 code subtracts and checks the negative flag. This pattern --
subtracting, setting flags, and discarding the actual result -- is so common
that it has a special instruction: CMP ("CoMPare").
If the story ended here, emulation would be easy. Unfortunately, we need to
talk about the carry flag.
After an addition, the carry flag indicates if the result overflowed.
Programs can then check the carry flag to detect overflows. The flag can also
be input to another addition to implement 128-bit additions.
Subtractions are similar. In hardware, subtractions are additions with an
operand negated. Because they are additions in hardware, subtractions set the
carry flag. Precisely how is the carry flag defined for subtraction? There are
two competing conventions.
The first sets the flag when there is a borrow, by mathematical analogy with
addition. x86 uses this "borrow flag" convention, as it seems more natural.
The second option sets the flag when there is not a borrow. Isn't that
backwards? It turns out that adding a (two's complement) negated operand
overflows exactly when the subtraction does not borrow. This "true carry"
convention matches actual hardware behaviour, while the "borrow" x86 convention
requires extra gates to invert carry. Arm uses the "true carry" convention to
save a few gates.
Which convention should FEX use?
We could store the x86 carry flag in the Arm carry flag. Unfortunately, that
requires an extra instruction after each subtraction to invert carry to get the
borrow flag.
The counter-intuitive alternative is storing the opposite of the x86 flag.
That requires an extra invert after every addition, but it eliminates the
invert after subtraction.
Either we pay after additions or after subtractions. Which should we pick?
While addition is common, using the flags from an addition is not. Flags are
typically used with comparisons, which are subtractions. Therefore, the
inverted convention usually wins. This month, Alyssa adjusted FEX to invert
carries, speeding up typical workloads by a few percent.
After tackling the carry flag, Alyssa optimized FEX's translations of address
modes, push/pop, AVX load/stores, and more. Overall, benchmarks are upwards of
10% faster since the last release.
A Qt change
What about more user-visible changes? If you use the FEXConfig tool to
configure the emulator, you're in for a treat. While it works, this ImGui-based
tool isn't exactly known for its convenience. In
between his work optimizing the [redacted] out of FEX's [redacted], Tony
rewrote FEXConfig as a simple Qt application, improving aesthetics, usability,
and accessibility all in one go. Here's a preview:
Besides look and feel, we've polished first-time setup for logging,
library forwarding, and RootFS images. We've also made tweaking various
emulation settings a bit nicer. Users of our Ubuntu PPA can simply
update to unlock these improvements without any further action.
But with so much optimization, who needs speedhacks anymore?
Raw Changes
FEX Release FEX-2409
-
ARM64EC
-
Always use the CPU area context for BeginSimulation (cc589ba)
-
Fix ContextFlag member tests (47b3637)
-
AVX128
-
Fixes 256-bit float compares (2478abb)
-
Fixes incorrect size usage in AVX128_Vector_CVT_Int_To_Float (a4db585)
-
Arm64
-
Implement support for strict in-process split-locks (74e95df)
-
Ensure 256-bit operations always assert without 256-bit SVE (90f7cc9)
-
Fixes SIGBUS handler for FEAT_LRCPC (2357253)
-
Allow directly correlating an ARM register back to an x86 register (92c951c)
-
Handle backpatching in a thread-safe manner (226f5e2)
-
CMake
-
Adds an AArch64 cross-compile toolchain file (06497fd)
-
Don't install binfmts for MinGW builds (f009a00)
-
CPUID
-
add missing Apple core part numbers (8296bfc)
-
ConstProp
-
stop pooling inline constants (5013b8a)
-
DeadStoreElimination
-
handle PF/AF invalidate (03832b2)
-
External
-
Update robin-map from 1.2.1 to 1.3.0 (96055cb)
-
Update fmt from 10.1.1 to 11.0.2 (f1d7879)
-
FEX
-
Moves sigreturn symbols to frontend (4baeffe)
-
FEXCore
-
Dynamically scale TSC (46a2a06)
-
Splits up atomic enablement checks (689b461)
-
Stop installing static library (4abac0c)
-
Disable vixl linking if vixl disasm or simulator is disabled (5e706df)
-
FEXInterpreter
-
Support portable installs (9336e35)
-
FEXLoader
-
don't install FEXUpdateAOTIRCache (1e1bcc4)
-
FEXQConfig
-
Add strict split-lock option (8fe1e95)
-
FEXQonfig
-
Fix minor saving/loading quirks (e234e11)
-
Minor followup changes (c74df6a)
-
FEXRootFSFetcher
-
Allow UI override through options (5b65f30)
-
Frontend
-
short-circuit code generation on invalid instructions with multiblock (92ddc00)
-
HostFeatures
-
Moves MIDR querying to the frontend (fbf62f1)
-
Removes vixl usage (1caa31c)
-
fix clang reformat constantly with missing comment (86e5e1a)
-
IR
-
fix scalar FMA tied sources (a66fac6)
-
InstCountCI
-
include x86 instruction count (50a9cea)
-
InstructionCountCI
-
explicitly enable flagm for multiinst (33558e6)
-
LinuxEmulation
-
Implement support for seccomp (b368223)
-
Moves guest signal frame generation to its own file (d9544e7)
-
LinuxSyscalls
-
Adds missing header (4f40416)
-
Implements less invasive assertion only EFAULT handlers (ce88f5f)
-
Some minor cleanups (27cd399)
-
OpcodeDispatcher
-
Convert more template handlers to Bind handlers (2829ad5)
-
Allow x86 code to read CNTVCT on ARM64EC (e6aa268)
-
fix tso checks (2a170cf)
-
Remove old bad assumption in INC/DEC (c634c53)
-
SignalDelegator
-
Refactor how thre...
FEX-2408
Read the blog post at FEX-Emu's Site!
In the beginning, there were integers.
And robots wanted precise math, and so the x87 floating
point unit was created.
And robots wanted faster math, and so SSE was created to
replace x87, and it was good.
Speeding up x87
Although x87 is slow and deprecated, it hasn't disappeared. 64-bit games will
use SSE or even AVX for floating point math, but older 32-bit binaries --
compiled decades ago -- are filled with x87.
FEX aims to support your entire game catalogue. Last release, we added AVX to
support the newest games. This release, we've circled back to the oldest. Old
games ought to run well on new hardware, but if they use x87, performance can
nosedive. Why? Two x87 quirks: 80-bit precision and the stack.
Floating point numbers are typically 32-bits or 64-bits. 32-bit is
faster, while 64-bit enhances precision for numerical computing. Our
target Arm hardware supports both 32-bit and 64-bit, but
x87 adds an unfortunate third mode: 80-bit. Ostensibly, the extra bits of
precision in intermediate calculations minimizes the accumulated error of
the final result.
Is that necessary? Careful code can mitigate rounding error without the massive
80-bit hammer, thanks to techniques like the Kahan summation
algorithm. New
code doesn't miss the 80-bit hardware.
Sadly, the FEX team can't afford a time machine. They're not cheap anymore.
We'll see what happens with Moore's Law eighteen months ago. So we can't teach
game developers in 2005 how to make do with 64-bit floats. All we can do is
slowly emulate 80-bit floats in software, or substitute 64-bit and the game
won't notice.
The second problem with x87 is more obscure. In a typical instruction set, each
instruction specifies which registers it accesses. By contrast, x87 arranges
registers in a stack. Instead of a destination register, x87 instructions
push to the stack. Instead of source registers, sources are indexed relative
to the top of the stack. Unlike 80-bit floats, stack machines are alive and
well for virtual machines. Like 80-bit floats, they complicate
emulation.
Arm instructions specify their registers directly, but we don't know which
registers an x87 instruction will use without knowing the stack top.
Previously, FEX worked around this mismatch by keeping the emulated x87 stack
in memory instead of registers. Arm can indirectly index memory, so this
works. However, it's slow. Because Arm is a RISC architecture, this approach
requires multiple load/store instructions for every x87 arithmetic operation.
We can do better.
Instead of single instructions, we can translate entire x87 code blocks. That
gives us the full context of each instruction. In "good" conditions, that lets
us statically determine the stack layout, so we can translate stack access to
real Arm floating point registers. The stack loads and stores disappear the
our generated code.
There's another trick we can play. Sometimes games will copy 80-bit floats
without performing any calculations. Translating these copies naïvely is slow
due to 80-bit emulation overhead. However, we can analyze multiple
instructions together to detect 80-bit copies and translate to efficient Arm
code.
These optimizations combine to a surprisingly large speed-up. To illustrate: a
hot block in Psychonauts swizzles a 4x4 matrix. That's light on arithmetic but
heavy on x87 overhead. These optimization reduce the translated code from 2340
to 165 instructions. That's a 93% improvement!
Big thanks to Paulo for taming x87, available in
this month's FEX release.
[redacted]
...
"I'm working on a branch that's 10% faster than upstream. Should I mention that
in the blog?""No, we don't want to ruin the surprise for next month's release."
...
Raw Changes
FEX Release FEX-2408
-
AOTIR
-
Change std::unique_ptr to fextl::unique_ptr (5fe405e)
-
ARM64EC
-
Install a custom call checker to bypass NTDLL function patches (2c3e6cb)
-
Set appropriate AFP and SVE256 state on JIT entry/exit (85d1b57)
-
Introduce FEX-side CRT and Windows API replacements (7b1d954)
-
Handle direct syscall instructions (24ea4b7)
-
Improvements to exception flag handling (cadb0a2)
-
Support the JIT API as is used by Windows (7ffd3e5)
-
AVX128
-
Optimize blends (4882f10)
-
Optimize all cases of vpermq (69ed39d)
-
Implement support for scalar FMA with AFP (9201ac5)
-
Improve VPERMILPS/PD and VPSHUFD (f8c4c54)
-
Fixes vmovq loading too much data (b9a6cae)
-
Extends 32-bit indexes path for 128-bit operations (7ccb252)
-
Optimize the vpgatherdd/vgatherdps cases that would fall back to ASIMD (22b2669)
-
Optimize QPS/QD variant of gather loads! (3627de4)
-
Extend 32-bit address indices when possible (aad7656)
-
Prescale addresses in gathers if possible (47d077f)
-
AllocatorHooks
-
Correct memory API usage on Windows (f98c010)
-
Allocate from the top down on windows (9d0b6ce)
-
ArgumentLoader
-
Removes static fextl::vector usage (f81fc4e)
-
Arm64
-
Implements support for DAZ using AFP.FIZ (f8c6baa)
-
Implement support for SVE bitperm (b282620)
-
Remove one move if possible in FMA operations (3bea08d)
-
Fixes long signed divide (3d65b70)
-
Arm64Emitter
-
Reload STATE before SRA fill on ARM64EC (a742441)
-
CMake
-
Add option to build PDB debug info instead of DWARF (49b8dae)
-
CPUID
-
Adds a few missing CPU names for new CPU cores (c4ae761)
-
CodeEmitter
-
Removes vestigial vixl usage (2da819c)
-
Config
-
Little assume non-null check (f40bc13)
-
Converts two LUT maps over linear scan arrays (c5fe872)
-
Removes a static vector initializer (e0c783d)
-
Search more locations for the config directory on Windows (1f59f0e)
-
EmulatedFiles
-
Adds a few leaf CPUID flags (d385e49)
-
Fix bad formatting (d1249ec)
-
F80
-
Drop dependency on state stored in TLS (c3c2b61)
-
FEX
-
Moves HostFeatures querying to the frontend (434bffa)
-
FEXCore
-
Pass HostFeatures in to CreateNewContext directly (0ecfc65)
-
Drop deferred signal handling on Windows (1007f87)
-
Add a generic spill/fill-all syscall ABI and use for Windows (10ee963)
-
Removes ThreadManager (403fd62)
-
Refactor ExitHandler slightly (380ba0a)
-
Removes CPUBackendFeatures (1fe497d)
-
FixedSizePooledAllocation
-
Fix a race when unclaiming disowned buffers (228009c)
-
HostFeatures
-
Removes feature flags always supported by FEX (633f624a697f...
FEX-2407
Read the blog post at FEX-Emu's Site!
I hope you're ready to game, because your Arm system just got support for AVX.
And AVX2. And FMA3. And F16C. And BMI1. And BMI2. And VAES. And VPCLMULQDQ.
Yeah, we've been busy.
We did a little team-building exercise this month: "bring up AVX on 128-bit
hardware in a week". Now our team is built, and AVX games run on FEX:
AVX on 128-bit Arm
Computers traditionally perform one operation at a time. The hardware decodes
an instruction, evaluates the operation on a single pair of numbers, and
repeats for the next instruction. In mathematical terms, the instructions
operate on scalars.
That design leaves performance on the table.
Many programs repeat one operation many times with different data. Modern
instruction sets exploit that repetition. A single "vector" instruction can
operate on multiple pieces of data at once. Programs will perform the same
amount of arithmetic overall, but there are fewer instructions to decode and
the arithmetic is more predictable. That enables more efficient hardware.
A "scalar" instruction adds a pair of numbers; a "vector" instruction adds
multiple pairs. How many pairs? That is, what length is the vector?
That's a design trade-off. Increasing the vector length decreases the number of
instructions we need to execute while increasing the hardware cost.
Supporting large vectors efficiently requires a large register file and many
arithmetic logic units. Besides, there are diminishing returns past a certain
vector length.
There is no-one-size-fits-all vector length. Different instruction sets make
different choices. In x86, the SSE instruction set uses 128-bit vectors, while
AVX and AVX-512 instructions support 256-bit and 512-bit respectively. For Arm,
the traditional ASIMD (NEON) instructions use 128-bit vectors. Depending on the
specific Arm hardware implementation, the flexible new SVE instructions can use
either 128-bit or 256-bit.
For performance, we try to translate each x86 instruction to an equivalent arm64
instruction. There's no perfect 1:1 correspondence, but we can get close. For
vector instructions, we translate 128-bit SSE instructions into equivalent
128-bit ASIMD instructions.
In theory, we can do the same for AVX, mapping 256-bit AVX instructions to
256-bit SVE instructions. Mai implemented that last year, speculating that their
work would enable AVX on future 256-bit SVE hardware.
...
That hardware never came. Some recent hardware supports SVE but only 128-bit.
Others don't have that, supporting only ASIMD. We had a shiny AVX-256
implementation with no 256-bit hardware to use it with.
Our position remains that efficient AVX emulation requires 256-bit SVE. Unfortunately,
many games today have a hard requirement on AVX. We want to let you play
those games on your Arm devices, so we need to plug our noses and implement
256-bit AVX on 128-bit hardware.
The idea is simple; the implementation is not. To translate a 256-bit
instruction, we decompose it into two 128-bit instructions operating on each
half of the 256-bit vector. In effect, we partially "undo" the vectorization.
This plan has a gaping hole: the register file. Arm has more general purpose
registers than x86, so we statically assign each x86 register to an Arm
register. If we didn't, accessing certain x86 registers would require slow
memory accesses. All efficient x86-on-Arm emulators therefore statically map
registers.
This scheme unfortunately fails with AVX emulation. Our Arm hardware has 128-bit
vectors, but we're emulating 256-bit AVX vectors. We would need twice as many Arm
vector registers as x86 vector registers, and we don't have enough.
Still, running AVX games suboptimally is better than not running at all. Yes,
we're out of registers -- we'll just have to keep some in memory. The assembly
isn't pretty, but it works well enough.
Building on Mai's SVE-256 implementation of AVX, Ryan whipped together a
version supporting both SVE-128 and ASIMD. That means it should work on
all arm64 devices, all the way back to ARMv8.0.
F16C, FMA, and more
AVX isn't the only x86 extension new games require. After beating AVX, x86 has a
post-AVX questline for emulator developers, with extensions like F16C.
Fortunately, nothing could scare Ryan and Mai after AVX, and they tackled the
new extensions without a hitch.
Speeding up translation
Dynamically assigning AVX registers means we can't translate instructions
directly and expect good results. While we don't need a full optimizing
compiler, we do need enough intelligence for basic inter-instruction
optimizations. FEX has had some optimizations since day one, but the priority
has always been on bringing up new games. There's been even less focus on the
translation time -- not how fast the generated arm64 code executes but how
fast we can generate the arm64 code. Translation overhead contributes to slow
loading screens and in-game stutter, so while it flies under the issue radar,
it does matter. So, Alyssa optimized the
FEX optimizer this month by merging compiler passes. That both simplifies the
code and speeds up translation. Since the start of June, we've reduced
translation time 10%... and more optimization is coming.
Raw Changes
FEX Release FEX-2407
-
ARM64
-
Adds new FMA vector instructions (00cf8d5)
-
AVX128
-
Fixes vmovlhps (e2d4010)
-
Minor optimization to 256-bit vpshufb (5821054)
-
Fixes scalar FMA accidentally using vector wide (4626145)
-
Minor optimization to vmov{l,h}{ps,pd} (cf24d3c)
-
Some quick bugfixes (739ac0f)
-
Enable all the things (e519bf5)
-
Fixes wide shifts (a031a49)
-
F16C support (4d821b8)
-
Implement support for gathers (6226c7f)
-
FMA3 (7d939a3)
-
fix VPCLMULQDQl (5da205d)
-
More instructions Part 4 (77aaa9a)
-
Fix vmovntdqa failing to zero upper 128-bits (7ff9622)
-
Fixes SSE4.2 string compare instructions (dce1b24)
-
More instructions Part 3 (8c751d7)
-
More instructions (ddb9f6d)
-
More various instructions (be8ff9c)
-
Some pun pickles, moves and conversions (7bbbd95)
-
Move moves! (f489135)
-
Arm64
-
Minor VBSL optimization with SVE128 (2e6b08c)
-
Remove contiguous masked element optimization (dc44eb4)
-
Implement support for emulated masked vector loadstores (3d26e23)
-
Arm64Emitter
-
drop out of date comment (825d2c9)
-
BMI2
-
Ensure rorx immediate masks by operation size correctly. (54a1f7d)
-
CMake
-
Add a clang version check (5d67223)
-
CPUID
-
Oops, forgot to enable AVX2 (58e949e)
-
Update labeling on some reserved bits (2da1e90)
-
CodeEmitter
-
Fixes vector {ldr,str}{b,h} with reg-reg source (2e84f21)
-
CoreState
-
Move
InlineJITBlockHeader
to ...
FEX-2406
Read the blog post at FEX-Emu's Site!
A little late this month but we have a new FEX release has finally landed. This month we have some good optimization and fixes so let's get right in to it.
A bunch of JIT optimizations
This last month is finally the culmination of preparation work over the past few months of cleanups in the FEX JIT. The new register allocator has
landed in FEX which is significantly better than our previous RA. Our prior implementation was meant to be a temporary solution when FEX initially
started as a project and as with most temporary code, it became permanent. It was excessively slow, best case it ran in quadratic time, worst case it
could take INFINITE time which resulted in significant stutters or hangs. This new implementation by Alyssa now runs in two passes in linear time,
significantly improving performance and also removing a ton of bad design decisions from the first implementation.
In addition to the new RA, we also have a bunch of little optimizations spread around that improves performance all over the place. One of the bigger
performance improvements for people with new hardware is enabling the AFP extension and RPRES if supported. Apple supports these in their latest SoC
and the newer Cortex also supports them. This improves scalar SSE performance by quite a bit. We won't dive in to these too much but the various
optimizations can improve performance from 2% to 12% in testing. We're marching ever closer to running applications at near native speeds now.
Add support for 32-bit OpenGL thunking
This is a big feature! 32-bit thunking has been a long time coming and has crossed some significant hurdles towards actually working! One of the
biggest CPU time sinks with games is the amount of time we need to spend in the video driver when running a game. "Thunking" allows us to remove that
overhead and jump directly in to the AArch64 libGL directly and remove a bunch of emulation overhead. We have done a bunch of testing with this but we
expect there will still be some bugs that need to be worked out. As for fun performance improvements, we have seen one game go from 150FPS up to
270FPS, so it's worth trying in some cases.
As a note though, this is only 32-bit OpenGL thunking. 32-bit Vulkan drivers still need to go down the emulation path, so things like DXVK in older
32-bit games won't get these performance improvements.
Default TSO emulation options changed
Over the course of the past couple months we have been testing the new TSO memory model emulation toggles and during this time we have determined the
cost of accurately emulating Vector and Memory copy memory atomics to be too high for most hardware. The good news is that from all the testing we
have done, this doesn't actually cause any problems in any known games. So from this release onward we are by default disabling TSO emulation on these
operations. We may come back and visit this once hardware ships that has FEAT_LRCPC3 which adds new instructions for Vector TSO loadstores.
Users with an older configuration can go in to FEXConfig to toggle these options off and enjoy the free speed benefits of not doing accurate emulation
today! As a note, Apple Silicon's TSO hardware emulation bit doesn't suffer the same performance degradation so once Asahi Linux supports this for
users then they get accurate emulation and speed!
Unaligned Half-barrier TSO Enabled is still recommended to keep enabled as that can cause significant bugs
Fix fstatat/statx with NOFOLLOW And JIT bugs
During a livestream one of our users encountered a bug in FEX-Emu that was breaking Darwinia.
After diving in to the game to figure out what it was doing, it actually turned out to be three separate bugs that broke the game. The first bug fix
with fstatat and statx syscalls were around edge case behaviour with the NOFOLLOW flag. The game was attempting to find the directory that the
executable was living in and being smart in a way that broke FEX.
The other bugs were behaviour in our optimization passes where we broke x86 SIB addressing in a couple ways. We have since added unittests for
these two bugs but if you would like to read more you can check out ConstProp fixes for Darwinia
With these bugs fixed the game now runs correctly under FEX-Emu without issue!
More ARM64EC improvements
This was some cleanup work for helping more easily integrating with what upstream WINE is doing for ARM64EC support. While still not entirely usable
for end-users yet, it is steadily improving and can run real games if the environment is setup correctly. A lot of good work here and we're hoping
for more testing going forward.
NVIDIA Orin CPU errata!
Over the past month or two we had noticed that the NVIDIA Orin platform with its Cortex-A78AE CPU cores were running games markedly worse than our
Snapdragon 8cx Gen 3 platform with Cortex-A78C cores. While these CPU cores are not identical between platforms, they are both based on the Cortex-A78
CPU core design so they should be relatively close. The NVIDIA Orin runs its cores at a 2.2Ghz clock frequency, while the Snapdragon runs its cores at
2.4Ghz. Nearly a 9% clock speed difference wasn't accounting for the performance delta we were seeing!
The game we were testing was the PC port of Sonic Adventure 2: Battle; On Orin the board could only achieve 18FPS, while on Snapdragon we were easily
hitting 60FPS with headroom to go higher if VSync was disabled. We were stunned by this absolute performance difference and couldn't nail down the
difference being due to different drivers.
Turns out we only needed to look at the Cortex-A78AE Software Developer Errata Notice to find out why.
1951502
Atomic instructions with acquire semantics might not be ordered with respect to older stores with release semanticsUnder certain conditions, atomic instructions with acquire semantics might not be ordered with respect to older instructions with release semantics. The older instruction could either be a store or store atomic.
This erratum can be avoided by inserting a DMB ST before acquire atomic instructions without release semantics. This can be implemented through execution of the following code at EL3 as soon as possible after boot:
This then goes on to talking about some code that programs the CPU so that it injects these DMB ST instructions before atomic acquires automatically!
This is why this platform has been so weird for performance testing for years! This massive hardware errata basically deletes any advantage that the
FEAT_LRCPC extension gives FEX and goes back to emulating atomics using half-barriers similar to how FEX already does it around unaligned
atomics!
We are now looking to move off of this NVIDIA Orin platform as quickly as possible, it was already old and now that we have identified a significant
problem around atomic performance it is higher priority. Luckily over the last few months we have great new hardware announcements. The Snapdragon X
Elite devices are shipping soon, NVIDIA has announced a new Jetson AGX Thor platform, The NVIDIA Grace server platform is starting to become available, and
Apple has some new M4 devices that will be interesting! Ideally we will get a new platform that we can plug a Radeon GPU in since it is a huge boon to
our testing performance, but depending we may not have that luxury. We'll see as we move on to new and better platforms!
Video game showcase
Instead of a video showcase from FEX this month, go checkout Asahi Lina's Youtube Page. She recently did a
couple of live streams fixing issues with the Asahi Linux MicroVM solution for running FEX-Emu on on Apple Silicon! She showcases a bunch of games
while covering some of the more technical problems involved with getting FEX-Emu running on that platform.
Be warned, these are very long streams.
Raw Changes
FEX Release FEX-2406
-
AOTIR
-
Refactor interfaces to clarify ownership flow (3bac767)
-
CMake
-
Remove obsolete Catch2 setting (170204d)
-
CPUID
-
Adds Qualcomm Oryon product name (f7bfecd)
-
Config
-
Change default TSO options (6954ebe)
-
ConstProp
-
remove x86 jit leftover (58614ff)
-
Constprop
-
clean up (14bfe60)
-
FEXCore
-
Fixes the difference between CPL-0 and undefined instructions (efe7c54)
-
Get rid of DeferredSignalFaultAddress and use the InterruptFaultPage (f27f187)
-
ARM64EC x64 entry/exit support (2cae2f2)
-
docs
-
Adds programmer documentation about memory model emulation (a515b70)
-
FEXLoader
...
FEX-2405
Read the blog post at FEX-Emu's Site!
One month older and we have a new release for FEX. This month is a little bit slower for user facing features but behind the scenes we had a large amount of refactoring to facilitate future improvements.
Support OpenGL and Vulkan thunking with forwarding X11
A thorn in FEX's side has been forwarding X11 calls when thunking OpenGL and Vulkan. This has caused us pain since X11's API is a fairly leaky
abstraction that leaks data symbols which FEX's thunking can't accurately thunk. We have now instead redirect the X11 Display object directly in
OpenGL and Vulkan. This not only reduces the amount of code we need to thunk, but also is required for us to eventually thunk 32-bit libraries.
This may change behaviour in some games when thunks are enabled, so it will be useful to do some testing and regression testing.
Enable Enhanced REP MOVS when memcpy TSO is disabled
Alongside last month's optimization when optimizing memcpy instructions, we now enable this CPUID bit when memcpy TSO emulation is disabled. This
means that glibc will take advantage of the optimization when it is doing memcpy and memset operations. This is a minor performance improvement
depending on the application.
Implement support for SMSW instruction
This instruction isn't too notable since all it does on recent x86 CPUs is return the same data no matter what, but legacy CPUs it was useful for
checking if x87 was supported. As this is considered a system level instruction, FEX didn't implement it originally but we finally found a game that
uses it. The original Far Cry game from 2004 uses this instruction for some reason. Now that we have implemented the instruction the game at least
gets to the menus but seems to still stall out when going in-game. Kind of neat!
Fix disabling TSO emulation on some stack accesses
When emulating the x86 memory model, we can get away with not emulating it when a thread accesses its stack. This works since we know that stack
accesses are usually not shared between threads. While we usually disabled the TSO emulation in these cases, we had accidentally missed some cases.
This will mean that there is some performance improvements for "free."
Fix ADC and SBC instructions
Back in FEX-2403 we had landed some optimizations for these instructions. Turns out that we had inadvertently introduced some broken behaviour as an
edge case. Most games didn't hit these edge cases so they were fine but it completely broke rendering in the Steam release of Final Fantasy 7.
With that fixed, we now get correct rendering again!
Option to disable half-barrier TSO emulation on unaligned atomics
On x86 most atomic instructions don't care about alignment, with a feature that Intel calls "Split-locks" to even support unaligned atomics that cross
a cacheline. On ARM we don't have this luxury, where all atomic instructions need to be naturally aligned to their size. Newer ARM CPUs added a
feature that allows unaligned atomics inside of a 16-byte region, but if the data crosses the edge then FEX needs to handle this instruction in a
different way. We backpatch the instruction stream with a "half-barrier" access which still emulates the x86 memory model but is exceedingly heavy.
Now we have an option to convert these "half-barrier" implementations in to non-atomic access. While this doesn't match true x86 behaviour, this can
accelerate some games that heavily abuse unaligned atomic memory accesses. So if a game is running slow, this is an option to try out!
Refactor code using clang-format
As alluded to at the start, the FEX codebase has now been completely refactored using clang-format to ensure a more consistent coding experience. We
provide a helper script and CI enforcement to ensure this stays in place. This took quite a bit of work to ensure the feature was up to everyone's
standards. Major thanks to everyone that worked on this!
Eliminate cross-block liveness for instructions
This is preparation work for future JIT improvements and speedups. Cross-block liveness makes our register allocator more complex and by removing this
from our JIT we can start working on improving the register allocator. Look forward to register allocator improvements in the coming months.
Support arm64ec function calls
This changes how some of FEX's JIT infrastructure works in order to support the ARM64ec windows ABI. This will get used in conjunction with Wine's
ARM64ec support. While still a work in progress, this is starting to become interesting as WINE continues to implement more features to handle this
case.
Video game showcase
Raw Changes
FEX Release FEX-2405
-
Allocator
-
Fixes compiling on Fedora 40 (7b88b0f)
-
AllocatorHooks
-
Mark JIT code memory as EC code on ARM64EC (bb24e14)
-
AppConfig
-
Disable libGL forwarding for steamwebhelper (02ebb6e)
-
Arm64
-
Adds another TSO hack to disable half-barrier TSO (1069cab)
-
CI
-
Drop use of obsolete ENABLE_X86_HOST_DEBUG setting (c126b20)
-
CPUID
-
Removes FEX version string from CPU model name (faa494c)
-
Fix inverted RDTSCP check (9781b95)
-
Enable enhanced rep movs in more situations (68e543c)
-
DebugData
-
Remove header (1616b4e)
-
FEXCore
-
Support x64 -> arm64ec calls (1a8b61b)
-
FHU
-
Switch over win32 file operations to std::filesystem (ae5e388)
-
IR
-
Remove VectorZero (a0f2cae)
-
JIT
-
fix neon vec4 faddv (fe70ec7)
-
fix ShiftFlags shuffles (376936c)
-
Adds support for spilling/Filling GPRPair (eedb120)
-
LookupCache
-
Track ARM64EC page state in the code cache (f0dad86)
-
OpcodeDispatcher
-
Implement support for SMSW (20bd864)
-
Fixes disabling TSO access on RSP SIB stores (1ba678f)
-
Add helper for making segment offset addresses (66cbb66)
-
RCLSE
-
disable store-after-store optimization (5cb11ae)
-
X87
-
Simplify constant loading for FLD family (271700e)
-
Misc
-
Library Forwarding: Support libGL/libvulkan without forwarding libX11 (b33e0e3)
-
clang-format: allow overriding clang-format (3fda47e)
-
Fix 8/16-bit ADC/SBC (9c6f749)
-
Library Forwarding: Fix issues with libGL's fake X11 dependency (81a4206)
-
Factor out SetWriteCursorBefore (a0bf6a4)
-
WOW64 backend code invalidation fixes (81c219c)
-
Clang Format file and script (53bed6d)
-
CI workflow to check clang-format usage on pull requests (f9097c0)
-
Add second reformat to git blame ignore file (55c054b)
-
Reformat until fixed-point (07e58f8)
-
Create .git-blame-ignore-revs with whole-tree reformat sha (7e97db8)
-
Remove trace of clang-tidy experiment from CMakeLists.txt (7614ac9)
-
Whole-tree reformat (1d806ac)
-
Move const to the left in preparation for reformatting (028c220)
-
Enable jemalloc for ARM64EC (a9b7ad8)
-
Validate that we have no crossblock liveness (e91e1d5)
-
Elimina...
FEX-2404
Read the blog post at FEX-Emu's Site!
After last month having an absolute ton of improvements, this month of changes is going to look positively tiny in comparison. We have some good new
options for tinkering with FEX's behaviour and more performance improvements. Let's get in to it!
Implement more memory model emulation toggles
The biggest performance hit with FEX's x86 emulation has always been emulating the memory model of x86. ARM has added various extensions over the years to make this emulation faster but it still isn't enough.
- FEAT_LSE - Adds a bunch of atomic memory instructions
- Original ARMv8.0 doesn't support this. Massive impact on performance.
- FEAT_LSE2 - Adds unaligned atomics (within a 16-byte granule) to improve performance of x86 atomics.
- Doesn't quite cover the full 64-byte cacheline of unaligned atomics that x86 supports
- FEAT_LRCPC - Adds new load instructions which match the x86 memory model
- FEAT_LRCPC2 - Adds even more loadstore instructions which match x86 instructions
- FEAT_LRCPC3 - Adds even more, including vector loadstore instructions
- No hardware today supports this extension
Even with this set of extensions, emulating x86's memory model can have near a 10x performance hit. This performance impact is most felt in games because they use vector instructions very heavily, which is because of the lack of the FEAT_LRCPC3 extension.
With this in mind, we are introducing some sub-options around emulating x86's TSO memory model to try and lessen the impact when we can get away with it. These new options can be found in the FEXConfig Hacks tag.
These two new options are only available for toggling when TSO emulation is enabled. If your CPU supports FEAT_LRCPC and FEAT_LRCPC2 then a recommended configuration is to keep the TSO Enabled option enabled, but disable the Vector and Memcpy options.
While this will incur a performance hit compared to disabling TSO emulation, it is significantly more stable to keep TSO emulation on.
If you still need more performance, then it may be beneficial to turn off TSO emulation entirely. It's unstable though! It's incorrect emulation to gain speed!
Vector TSO enabled
This option enables emulating the memory model around vector loadstore instructions. This has a HUGE performance impact even on latest generation hardware.
Memcpy TSO enabled
This option enables emulating the memory model around x86's REP MOVS and REP STOS instructions. These are used for doing memory copying and memory setting respectively.
The impact of this option depends heavily on the application. This is because most memcpy and memset functions actually use vectors to modify the memory.
JIT core improvements
Once again this month there has been a focus on JIT optimizations. Although this time it might be hard to see what is improving. Overall in benchmarks
there has been roughly a 3% performance improvement. With a mixture of improvements this month being foundational work to lower JIT compile time
overhead in the coming months. As usual there is too much to dive in to each change individually so we'll just have a list.
- Optimize LOOP/N/E
- Negate more inline constants
- Optimize PF calculation using integers rather than vectors
- Optimize CLC
- Optimize cmpxchg
- A bunch of instructions cleaned up and rewritten to remove small amounts of overhead
- Improves 32-bit address mode accesses
- Implements support for prefetch and rdpid instructions
Optimize memcpy and memset IR operations when TSO emulation is disabled
Speaking of the previous optimization. We have now optimized the implementation of the memcpy and memset instructions to be significantly faster. Sometimes a compiler will inline these instructions which was causing upwards of 5% CPU time doing memory copies.
With this optimization in place we have benchmarked and improvement from 2-3GB/s up to 88GB/s! That'll teach that code to be slow.
Fix memory leaks in thread creation
A memory leak that has occured where FEX would leak some thread stacks when they shutdown. This has now been resolved which lowers memory usage for
long running applications that shutdown threads. In particular this makes Steam consume less RAM.
We have more memory leaks to solve as we move forward but they are significantly less severe than this.
A ton of small cleanups in the code
This month has had a lot of code cleanup in FEX but these aren't user facing so it isn't very interesting. Let it be known although that something
like half the commits this month were cleaning up various bits of code or restructuring which isn't getting a focus.
Raw Changes
FEX Release FEX-2404
-
Allocator
-
Cleanup StealMemoryRegions implementation (8a3d08e)
-
ConstProp
-
drop dead code (202a60b)
-
ELFParser
-
Stop using a VLA (32ec4a3)
-
Externals
-
Update Catch2 to v3.5.3 (b892da7)
-
FEXCore
-
Fixes priority of FEX_APP_CONFIG (7786c23)
-
Move nearly all IR definitions to internal (e2a0953)
-
Moves CodeLoader to frontend (3bed305)
-
Moves CPUBackend definition internal (f6639c3)
-
Remove DebugStore map (2ad170b)
-
Adds more TSO control levers (24fd28e)
-
Removes vestigial mman SMC checking (542f454)
-
Fallback to the memcpy slow path for overlaps within 32 bytes (167896d)
-
Add non-atomic Memcpy and Memset IR fast paths (7dcacfe)
-
FEXLoader
-
Add a way to sleep a process on startup (624bc3f)
-
Add some debug-only tracking for FEX owned FDs (79454ed)
-
InstcountCI
-
Adds a block that is causing panic spilling (150af80)
-
IoctlEmulation
-
Add missing nouveau ioctl (6d94d79)
-
JIT
-
Optimize pmovmaskb with a named vector constant (ab8ee64)
-
Linux
-
Expose support for v6.8 (e33a76a)
-
Threads
-
Fixes a stack memory leak for pthreads (3d31291)
-
OpcodeDispatcher
-
clean up shifts (c43af8e)
-
drop ZeroMultipleFlags (b632f72)
-
eliminate xblock liveness for rcl/rcr (cd9ffd2)
-
eliminate branch in cmpxchg pair (aa26b62)
-
Fixes 32-bit mode LOOP RCX register usage (7698347)
-
optimize LOOP/N/E (8852d94)
-
Implement support for the various prefetch instructions (2a9fcc6)
-
Implement rdpid (ba3029b)
-
RA
-
drop dead block interference code (f2d001e)
-
Removes VLA usage (67baff8)
-
Adds RIP when a block panic spills (a8b59c1)
-
RCLSE
-
Optimize store-after-store (ca6b2e4)
-
Telemetry
-
Allow redirecting directory that data is written to (970d5d5)
-
Adds tracker for non-canonical memory access crash (7f90ca5)
-
Rename old file instead of copying (002ca36)
-
Misc
-
Negate more to inline constants (aa8d04c)
-
Minor cleanups around flags (37f2b41)
-
Use scalar integer code to calculate PF (bd0b5ec)
-
Eliminate xblock liveness with rep cmp/lod/scas (e8abc88)
-
rewrite ROL/ROR (29c6281)
-
Fix reference to out of bounds address in offsetof (4214d9b)
-
optimize clc (b1ddd8cd3b53f0866f04668240...
FEX-2403
Read the blog post at FEX-Emu's Site!
Welcome back to another new tagged version of FEX-Emu! This month we have quite a few important bug fixes and optimizations, so let's get right in to
it!
Steam fix
As of Steam's February 27th update there was a fairly major change to how Steam starts its embedded Chromium instance.
With this most recent change it now is run inside of the Steam Linux Runtime environment. In turn Steam has disabled the sandbox feature of the
Chromium instance because it is incompatible. FEX was already disabling this sandbox and forcibly passing in the argument to disable it.
Chromium really didn't like the argument being passed in twice and it was causing it to crash early. We have now removed our application profile and
let Steam configure the arguments as required.
As a side effect of Steam updating their version of Chromium, some users have noted that they are experiencing problems with GPU acceleration on
Raspberry Pi systems. This is seemingly a video driver problem and unrelated to this crash that was fixed. It is currently unknown if we can fix this
problem, as it is working find on Tegra and Snapdragon systems.
Rootfs images updated
FEX's rootfs images have been updated to include the latest versions of Mesa, gfxreconstruct, and Renderdoc. The major change here is having Mesa
updated to 24.0 as the other two packages are mostly for developers.
Fix a potential hang on forking with memory allocations
We have fixed a known hang that occurs when a process is forking while another thread is allocating memory. This tended to occur as a hang when
running Proton applications. While this fixes one hang, we still have another one that sporatically happens that we haven't tracked down. While the
occurence is relatively rare, it's good to watch the process trees if the program is stuck waiting on a futex.
A bunch of CPU optimizations
As per usual, this last month has added a bunch of CPU optimizations! We have noted up to a 14% performance improvement in one benchmark and an
average of around 4% in Geekbench. We need to commend our developers for
hammering out these optimizations, even a small optimization can have big impacts on games that abuse a particular feature.
- Use FlagM SETF8/16 for INC/DEC
- Optimize LOCK DEC
- Optimize ADC/SBC
- Fuse add+cmn in to adds
- Misc other optimizations
- Optimize less than 32-bit add and sub
Small timestamp counter scaling
Recently we have found out that some games rely on an x86 CPU's RDTSC instruction operating at Ghz frequencies. While this is not a good idea to make
assumptions, it is relatively common that x86 CPUs have a really high frequency cycle counter frequency. Most laptop CPUs operating in the 1-2Ghz
range while desktop CPUs can go up to 3Ghz in our testing.
Unreal Engine 5 has a new work graph system that spin-loops CPUs for a fixed number of cycles, expecting to not spin for very long. While this is
relatively okay at 1Ghz, since it is only a few nanoseconds, When an ARM CPU's cycle counter gets added to the mix it starts encountering problems.
The primary problem here is that all 64-bit Snapdragon processors ship with a fixed rate 19.2Mhz cycle counter. This continues all the way to their
latest flagship the Snapdragon 8 Gen 3. Additionally other ARM devices we have tested like the Nvidia Jetson ARM boards and Apple M1 also ship a
similarly low cycle counter. So while Unreal engine will only spin-loop for a measily 1597 cycles, on an ARM board this takes ~51,000 nanoseconds but
on an x86 PC it only takes 591 nanoseconds! This was causing games to burn CPU time unnecessarily and run slower than they should!
To compensate for these slow cycle counters on most ARM devices, we are now scaling the value we return to the applications by multiplying the value
by 128 times! This makes snapdragon cycle counters behave more similarly to a 2457 Mhz cycle counter, but with a 128 cycle granularity. This improves
the FPS in Tekken 8 and will also improve performance in all other Unreal Engine 5 games. There may be other games affected as well!
As a side-note, a 1Ghz cycle counter is now mandated by ARMv8.6 and ARMv9.1 spec! So this problem will soon go away as new SoCs get on to the market.
Introduce ARM64EC static register allocation mapping
As part of the ongoing effort to support WINE's Arm64EC code, we have changed the order in which our registers get allocated to more closely match
what the Arm64EC ABI
wants for register layout. Matching what Arm64EC wants for the regster layout means that when the JIT jumps out to some code, we shuffle less data
around which gives a performance improvement. The Linux side of code doesn't need this, so this only happens when building as a WINE module.
32-bit thunking improvements
This last month has had an exciting milestone for 32-bit thunking! We have landed support for thunking Wayland on 32-bit. Which this means with our
previously implemented Vulkan thunking, we can now run some games using Wayland plus Vulkan and Zink thrown in to the mix! In particular we have been
able to test that Super Meat Boy works in this configuration! We still have more work to do before X11 and GLX works with 32-bit thunking so stay
tuned to the future!
Memory leak fix
FEX had an issue with long running processes leaking memory. This showed up in applications that would start hundreds of threads and tear them down
over and over. Steam is one of these long running processes that would starve the system of memory if left open over night. This is because the
program spins up helper threads fairly aggressively and then shuts them down.
We have fixed one major memory leak but we still have a few more to go before its nailed down!
Syscall passthrough optimization
One important thing to be wary of when running games is syscall overhead. Every time an ioctl or other syscall is made, FEX can incur significant
overhead compared to running the application natively. Additionally if we are passing syscalls through to glibc helpers then this can add more
overhead and sometimes introduce bad behaviour.
This month we spent some time looking at how syscalls are handled when we know that we can pass the data directly to the kernel. This allows us to
more quickly add new system calls when the kernel adds them, and ensures they are as fast as possible. With this optimization in-place FEX now
directly emits small syscall handlers per syscall and jumps directly to the kernel if possible. This lowers CPU overhead for the most common syscalls,
thus removing emulation overhead. While FEX's syscalls were already fairly low overhead, this just improves the situation further!
Raw Changes
FEX Release FEX-2403
-
ASM
-
Arm64
-
Stop moving source in atomic swap (98572b9)
-
Arm64Emitter
-
Introduce ARM64EC SRA mappings (6ec628f)
-
CMake
-
Define _M_ARM_64EC when building for ARM64EC (3e5694b)
-
FEXCore
-
Expose AbsoluteLoopTopAddress to the frontend (d4be2dc)
-
Add a frontend pointer to InternalThreadState (5769ffb)
-
FEXLoader
-
Allocate the second 4GB of virtual memory when executing 32-bit (0b34035)
-
FileFormatCheck
-
Fixes FD leak (0505b30)
-
InstCountCI
-
enable preserve_all (2f9449c)
-
add FMOD block (cc9c80d)
-
add The Witcher 3 block (60e8da0)
-
InstcountCI
-
Add a monster of a game block (f41674b)
-
Adds a vector addition loop from bytemark (44d1502)
-
Adds addressing limitations to instcountci (cf06799)
-
Linux
-
Converts passthrough syscalls to direct passthrough handlers (b74de53)
-
More safe stack cleanup for clone (9687ac5)
-
Make sure to destroy thread object when thread shuts down (a7c7fe4)
-
OpcodeDispatcher
-
Don't use AddShift with no shift (d24446e)
-
RedundantFlagCalculationElimination
-
fix missing NEG case (32a4abb)
-
Syscalls
-
Fix SourcecodeMap generation for GDB JIT integration (7e2f20c)
-
Misc
-
Removes steamwebhelper config (d66a83a)
-
Use SETF8/16 for 8/16-bit INC/D...
FEX-2402
Read the blog post at FEX-Emu's Site!
Welcome back everyone! After last month's cancelled release and this month being a bit late we have a lot of changes that happened.
More JIT performance improvements
A lot of the work these paste two months have been optimizing our JIT more. We have run Geekbench and Bytemark for these which showed a marginal
performance improvement in these benchmarks. Bytemark showing the biggest improvement of 16% in one sub-benchmark. A lot of the performance
improvements are targeting real-world applications rather than benchmarks which shows as those games getting more of an improvement.
As typical, explaining each individual optimization would take too long so we're going to spam out a bunch in a list.
- Removes a vtable indirection for syscalls
- Fix RCL/RCR wraparound behaviour
- Remove process-wide lock in JIT
- Fixes syscall rcx/r11 state
- Optimize SIB address calculation from three instructions to one
- Optimize TST instruction with -1
- Optimize TST more
- Improve XCHG instructions
- Optimize rotates
- Optimize CDQ
- Optimize shifts
- Optimize PTEST, VTESTP, PDEP
- Optimize SHA256 instructions to remove spilling
- Optimize CMPXCHG
- Stop zero extending a bunch of instructions where it doesn't matter
- Optimize ANDN
- Optimize a bunch of instructions using NZCV flags
Fix glibc clone usage of CLONE_CLEAR_SIGHAND
Newer glibc versions starting with 2.38 have started using this new clone flag for executing a program. We also fixed this in 2312.1 but can now make
a note of it.
Fix VDSO symbol fetching on ARM64
This is a fairly minor change but can have a big performance hit. When FEX was querying for VDSO interface functions we were using the wrong names on
ARM64. Since the wrong names were used, this meant we always fell back to the slower glibc implementation of functions. This in particular fixes a
performance hit when games call clock_gettime excessively.
Fix Proton again
Sometime in December there were some changes to Valve's Proton layer which caused us to break it. This has now been fixed.
Expose Linux 6.6
With some relatively minor changes we now support reporting kernel version 6.6 to the guest application. This gives us a range from v5.0 to v6.6 now.
Workaround hang when process is forking
A long-standing bug in FEX is that sometimes a process can hang when it is forking, usually to execute another program. We have now worked around this
issue to an extent that lets the application continue. It's not a full fix because we can still have a crash but that is easier to see instead of a
program hanging forever in the background.
Commonize some WOW64 code to share with ARM64ec
In preparation for sharing some code with FEX built for ARM64ec, this has shared move some Windows code to a common location to be used.
An absolute ton of work went in to thunking
Over the past couple months this has been one of the more active projects within FEX. Today FEX has support for thunking 64-bit x86 libraries across
to ARM64. A significant portion of this work is doing analysis of API interfaces in order to allow thunking 32-bit x86 libraries over to ARM64
libraries with data repacking. This isn't yet complete but since a ton of work has gone in to this, we wanted to call it out.
NOTE: Memory leak on long-running processes like Steam
We have found a memory leak when a process shuts down a thread that has been around for quite a while. We only identified this memory leak this last month which hasn't been fixed.
We are hoping to fix this bug for the next release but be aware that long running processes like Steam has a relatively aggressive memory leak. This is exacerbated by how Steam spins up threads for
doing work which makes this application particularly heavy.
Raw Changes
FEX Release FEX-2402
-
Arm64
-
Removes a vtable indirection in syscalls (743df8d)
-
BranchOps
-
Fix unused-variable warning (f515b1e)
-
ArmEmitter
-
Support single use forward labels (8c31630)
-
CPUID
-
Removes Init and just uses constructor (f5997a0)
-
Config
-
Fixes JSON parsing of "ArgumentHandler" types (bd1e029)
-
Dispatcher
-
Convert GetCompileBlockPtr to using PMF helper (82ce76b)
-
Removes unused asserting CompileBlock function (5b4e9c6)
-
Externals
-
Update xbyak to v7.02 and switch away from fork (9da08b4)
-
FEX
-
Removes legacy kernel 32-bit allocator (de2cd46)
-
FEXConfig
-
Initialize paths before trying to read configuration files (79526b9)
-
FEXCore
-
Fix RCL/RCR shift wraparound behaviour (c0be974)
-
Use TMP1-4 for values that need preserving across spills (0e97f8f)
-
Decompose some std::function usage to regular pointers (615cfe0)
-
Pass thread object to HandleUnalignedAccess (d488592)
-
Removes SRA option, it's now permanently enabled (5467c3e)
-
Removes context wide and map lookup (eea2e7b)
-
Optimize HostFeatures and CPUID feature calculation (0071c1b)
-
Warn if MDWE is set (00669a1)
-
Changes ParentThread ownership from the CTX to the frontend, take 2 (b4b8e81)
-
Describe exit function linking object with a structure (93ec676)
-
Removes stale references to x86 JIT (8665490)
-
Removes old InternalThreadState header (12b72f9)
-
Moves OS thread creation to the frontend (0a4e064)
-
Moves XID check to the frontend (7524029)
-
FEXLinuxTests
-
Fix build warnings (d34302a)
-
FEXLoader
-
Moves thread management to the frontend (5e26b77)
-
Temporarily disable CLONE_CLEAR_SIGHAND (26c9d5d)
-
Fix incorrect format strings (3a5ac39)
-
GdbServer
-
Fixes crash on gdb detach (a1cf14f)
-
HostFeatures
-
Supports runtime disabling of preserve_all (235f32c)
-
InstCountCI
-
Fixes test to not use relative data (3db31a6)
-
JIT
-
Fixes broken register in VTBX1 (9841983)
-
Jitarm64
-
Implements spin-loop futex for JIT blocks (750b0b7)
-
Linux
-
Decouple thread object creation and tracking (5e5984a)
-
Implements a fault safe memcpy routine (8e3d4a3)
-
Adjust when clone allocates stack memory (bd13052)
-
OpcodeDispatcher
-
Fixes syscall rcx/r11 generation (56d8080)
-
Initial support for runtime long-mode switch (4b37921)
-
Fixes flags generation in imul (3d2cbc5)
-
Optimize SIB addr calculation (81c85d7)
-
PassManager
-
Removes unused exit handler (12923ba)
-
Scripts
-
More changes to InstallFEX script (b613576)
-
Updates InstallFEX with supported Ubuntu versions (eb5cf1a)
-
SpinLockWait
-
Fixes unexpected lock success (472a701)
-
SpinWaitLock
-
Removes unused variable in spin-loop fallback (a4a1d60)
-
TestHarnessRunner
-
Move to its own tool folder (https://github.com/FEX-Emu/FEX/commit/...
FEX-2312
Read the blog post at FEX-Emu's Site!
We're back with another month of changes. After last month being a bit slower, we're back in the swing of implementing more optimizations and bug fixes. No dilly dallying, let's get right in to it!
More optimizations this month!
Once again this month has a whole bunch of optimizations that is very exciting! We will lightly go over the changes to talk about what changed.
Keep guest SF/ZF/CF/OF flags resident in host NZCV
This is one of the bigger optimizations this month. A bit of backstory is needed for what this optimization is for. x86 has a flags register called EFLAGS which contains quite a few random bits of information. The subflags we care about here are the SF, ZF, CF, and OF flags inside of it. These are various flags that are set typically from ALU operations for information depending on the result. So something like an integer Add will set ZF if the result was zero, SF if the result has the sign bit set, CF if a carry occured, and OF if the operation overflowed. These are usually quite cheap for the CPU to calculate by itself, but manually calculating the flags usually takes a few additional instructions each.
The original implementation inside of FEX for calculating these flags would spend the additional instructions and calculate each one manually. This
would usually end up with a dozen or so of additional instructions for calculating flags. While FEX would typically optimize out the calculations if
they weren't used, it would still add CPU time when we couldn't.
Luckily ARM also has a flags register called NZCV which maps almost perfectly to x86's EFLAGS. This lets us optimize these instruction implementations
to instead use the ARM flags directly. This has a couple of effects, not only does it remove the instructions from our code generation, it has
knock-on effects that the flags are now stored inside of the NZCV which reduces memory accesses. A multi-hit combo for improving performance.
While not all x86 instructions map their flags registers 1:1, this has a fairly significant performance uplift in most situations!
Dedicate registers for PF/AF
Related to the previous change, x86 has two flags registers stored inside of EFLAGS that doesn't have a direct equivalent on ARM CPUs. These two
flags are fairly uncommon but instructions will still generate them. These flags have the additional problem that they are fairly costly to calculate,
with one of them requiring a GPR population count instruction which ARM doesn't even support until new instruction extensions called CSSC. While in
most cases the result of these flags isn't used, the overhead of calculating them can add up a bit. This is why we are now dedicating two registers to
these flags to reduce their overhead as much as possible!
Misc optimizations
- Optimize BT/BTC/BTS/BTR
- Optimize shifts/rotates
- Optimize selects & branches & more nzcv goodies
- Optimize three sha instructions
- Make "not" not garbage
- Optimize memcpy and memset when direction is compile time constant
With all these optimizations in place this month we have a fairly significant performance uplift!
<-- Geekbench and bytemark graphs -->
While Geekbench is showing a fairly modest 17.6% performance uplift, bytemark is showing up to a 60% performance uplift! Over the course of
the last three months we have had benchmarks that have improved by over 100%! These improvements can be seen in games as well, with some CPU heavy
games have had their FPS improve by over 2x. In a lot of games tested they have changed from being CPU limited to GPU limited on our Lenovo X13s
laptops even! We are looking forward to when these companies release new laptops based on Snapdragon X
Elite in the middle of next year!
Various bug fixes
In addition to performance improvements, we have some bug fixes this month.
- Fixed corruption in the JIT
- Caused corruption with x87 heavy games
- Fixes integer multiply corrupting results
- Corrupted some register state, which was breaking the game Dungeon Defenders
Support extracting erofs images
One of the features that FEXRootFSFetcher was missing was the ability to extract erofs images once downloaded. This was because we didn't know that
erofs-utils provided an application for extracting these images without FUSE. Turns out the developers put an extractor inside of their fsck
application that we had completely missed! Now if a user wants to extract an x86 rootfs image for lower overhead, they can do this directly from our
FEXRootFSFetcher tool.
Preparation for improving gdbserver
GDBServer is a socket interface that GDB supports for remotely debugging applications. One of the harder things about working on FEX-Emu is that the
ability to debug an application is usually quite hard. GDBServer is a way to improve this situation so that GDB can remotely connect to a FEX process.
There's a bunch of work this month towards cleaning up this interface and getting it to work correctly. While it is still not quite usable for
debugging, we are working towards this so applications can actually be debugged!
Improvements to WOW64 compatibility for newer WINE
Newer versions of WINE has changed some behaviour around WOW64 support. So this month we have added support for some of this newer behaviour. Thanks
again to Bylaws for implementing this!
FEX rootfs image updates
This month we are updating our rootfs images to incorporate the latest Mesa 23.3.0 release that
occured a few days ago. We have updated our Ubuntu 22.04, 23.04, 23.10, ArchLinux, and Fedora 38 images with this latest version of mesa. As usual if
there are any issues, let us know so we can sort them out.
Raw Changes
FEX Release FEX-2312
-
Arm64Emitter
-
Dedicate registers for PF/AF (9b64674)
-
Fixes warning (bf147f4)
-
Config
-
Removes Threads option (3c73357)
-
Dispatcher
-
Fixes corruption when spilling SRA registers (f6b1434)
-
EmulatedFiles
-
Stop relying on O_TMPFILE (2e24f34)
-
FEX
-
Only pass CPU tunables to FEXCore and FEXLoader (b4eeb96)
-
FEXCore
-
Removes GetProgramStatus (c8ef77c)
-
Removes InitializeContext API (b35fadf)
-
Fixes passing arguments to ABI helpers (d0f54bc)
-
Removes Get/SetCPUState (7de66ac)
-
Optimize memcpy and memset when direction is compile time constant (85a1c1f)
-
Removes FEX_PACKED from CPUState (8e892ec)
-
Moves debug strings to gdbserver (3f02d7c)
-
Start changing how thread creation works (f328fca)
-
Moves more SignalDelegator functions to the frontend (6e8af29)
-
Removes x86 DebugInfo table (8015ce2)
-
Removes GetExitReason (bdf4089)
-
Disables RPRES until AFP is audited and enabled (b027113)
-
Fixes imul returning garbage data (389c6b1)
-
Work around broken preserve_all support in Windows clang (98f9a65)
-
FEXLoader
-
Wire up gdbserver in the frontend (efc5eb2)
-
FEXRootFSFetcher
-
Supports extracting erofs images (aa1344a)
-
GdbServer
-
Switch over to a unix domain socket (0357bb2)
-
IR
-
Moves remaining NZCV operations to use DestSize (b619f38)
-
IRDumper
-
Fixes missing conditional name (8892580)
-
InstCountCI
-
Moves Sonic Mania code to 32-bit file (470615b)
-
InstructionCountCI
-
Remove Optimal flags (4a31b61)
-
JIT
-
Fixes crash in TestNZ (a8ab8bb)
-
OpcodeDispatcher
-
Optimize three sha instructions (0e1e4c1)
-
Make "not" not garbage (3767f36)
-
ScopedSignalMask
-
Clean up API and use std::unique_lock/shared_lock (8726c8fb7377509bb569826f1...