Improve 32bit ld/st addressing mode propagation #3421

pmatos · 2024-02-12T14:23:04Z

Signed-off-by: Paulo Matos pmatos@igalia.com

pmatos · 2024-02-12T14:23:57Z

Still looking to see if I can improve other types of addressing modes, so this is still marked as draft.

pmatos · 2024-02-12T14:26:22Z

Was going to say there are no asm_tests failures so it looks like the current change is ok, although I am adding access in 32bits to a block that has been 64 bits only. However, I just noticed glibc tests failed so I will look at those.

FEXCore/Source/Interface/IR/Passes/ConstProp.cpp

pmatos · 2024-02-20T17:18:57Z

This is still failing but I am gaining an understanding of the complexities of this optimization. It's currently failing.

My last optimization was a result of fixing this:
lea eax, [0x10000040]
mov eax, [eax-0x40]

which was generating a LoadMem between a register (repr eax) and the two complement of 0x40, but failing to sign extend it.

The current issue is in :

add     dword [eax], edi

Here we are loading eax, adding eax and edi and then storing to eax. But when we create the LoadMem, and simplify it we are sign extending the incorrect base. Unsure if there's a sure-fire way to know which of the registers is the base and which is the offset, or if it makes sense to ask this question even. WIP

Spurred on by FEX-Emu#3421. To ensure that applications don't take advantage of small address wrap around, allocate the first 4GB in the 64-bit space. Some context. Linux always reserves the first 16KB of virtual address space (unless you tinker with some settings which nobody should do). Example of 32-bit code: lea eax, [0xffff_0000] mov ebx, [eax + 0x1_0000] The address calculated by the mov will wrap around to 0x0 which will result in SIGSEGV. If FEX messes up zero extensions then it would try to access 0x1_0000_0000 instead. This could result in a 32-bit application potentially accessing some FEX memory instead of crashing. Add this safety net which will still SIGSEGV and we will be able to see the crash.

Spurred on by FEX-Emu#3421 This does a bunch of GPR and vector loads to showcase addressing limitations between ARM and x86. Tests: - 8/16/32/64-bit GPR loads - Both 32-bit and 64-bit addressing modes - 32/64/128-bit Vector loads - Both 32-bit and 64-bit addressing modes - Duplicate the tests for 32-bit addressing mode with a 32-bit process - Since it should change behaviour. Untested: - 8/16-bit vector loadstores since those don't exist on x86 - 16-bit x87 integer load exists but that doesn't go through a vector load in FEX.

…2-bit Spurred on by FEX-Emu#3421. To ensure that applications don't take advantage of small address wrap around, allocate the second 4GB of virtual memory. Some context. Linux always reserves the first 16KB of virtual address space (unless you tinker with some settings which nobody should do). Example of 32-bit code: lea eax, [0xffff_0000] mov ebx, [eax + 0x1_0000] The address calculated by the mov will wrap around to 0x0 which will result in SIGSEGV. If FEX messes up zero extensions then it would try to access 0x1_0000_0000 instead. This could result in a 32-bit application potentially accessing some FEX memory instead of crashing. Add this safety net which will still SIGSEGV and we will be able to see the crash.

Sonicadvance1 · 2024-02-24T09:27:30Z

This'll need a rebase so it can pick up the new instcountci changes before it can get merged

pmatos · 2024-02-25T10:57:26Z

This'll need a rebase so it can pick up the new instcountci changes before it can get merged

done - thx.

alyssarosenzweig · 2024-02-25T14:03:28Z

IsInlineConstant is your new friend

pmatos · 2024-02-25T18:26:57Z

IsInlineConstant is your new friend

That's awesome - thanks. Friends are never enough. :)

pmatos · 2024-02-25T18:53:45Z

IsInlineConstant is your new friend

That's awesome - thanks. Friends are never enough. :)

Not as straightforward as calling the method. It's in ARM64JITCore which we don't have access to in ConstProp.cpp. It seems to me like this is more of an IR helper function anyway, and indeed it only depends on IR so we probably could put it elsewhere. I will see if I can refactor it.

alyssarosenzweig · 2024-02-25T19:11:30Z

IsInlineConstant is your new friend

That's awesome - thanks. Friends are never enough. :)

Not as straightforward as calling the method. It's in ARM64JITCore which we don't have access to in ConstProp.cpp. It seems to me like this is more of an IR helper function anyway, and indeed it only depends on IR so we probably could put it elsewhere. I will see if I can refactor it.

Oops, my bad! I meant IsValueConstant .. sorry for the mixup

pmatos · 2024-02-25T19:22:59Z

IsInlineConstant is your new friend

That's awesome - thanks. Friends are never enough. :)

Not as straightforward as calling the method. It's in ARM64JITCore which we don't have access to in ConstProp.cpp. It seems to me like this is more of an IR helper function anyway, and indeed it only depends on IR so we probably could put it elsewhere. I will see if I can refactor it.

Oops, my bad! I meant IsValueConstant .. sorry for the mixup

Ah, yes, I was wondering the different between a constant and an inline constant. :)

When the source arguments for LoadMem/StoreMem have bit 31 set then they are incorrectly sign extending in some instances. Detected this when testing FEX-Emu#3421 but I don't have a proper fix for it.

pmatos · 2024-02-27T16:39:03Z

I just made some sort of mistake pushing the last patch - sorry. Fixing.

pmatos · 2024-02-27T16:59:14Z

Thank you git reflog show! :)
Anyway, I have now fixed the issue from #3460.
I have assumed the displacement can be sign extended and that it's always in Arg1. I don't think it's crazy to assume this, since when the LoadSource create the Add for the memory address it always put the displacement in Arg1.

…u#3421 Doesn't quite match the libc code directly because it uses `[gs:eax]` with both having the sign bit set and we can't deal with that with ASM tests. So match the behaviour in a different way.

pmatos · 2024-02-28T13:42:12Z

Rebased new patch on main - which should now include the #3466 test.

This time found in MGRR. It flips the problem space on its head.

Spurred on by FEX-Emu#3421 This does a bunch of GPR and vector loads to showcase addressing limitations between ARM and x86. Tests: - 8/16/32/64-bit GPR loads - Both 32-bit and 64-bit addressing modes - 32/64/128-bit Vector loads - Both 32-bit and 64-bit addressing modes - Duplicate the tests for 32-bit addressing mode with a 32-bit process - Since it should change behaviour. Untested: - 8/16-bit vector loadstores since those don't exist on x86 - 16-bit x87 integer load exists but that doesn't go through a vector load in FEX.

…2-bit Spurred on by FEX-Emu#3421. To ensure that applications don't take advantage of small address wrap around, allocate the second 4GB of virtual memory. Some context. Linux always reserves the first 16KB of virtual address space (unless you tinker with some settings which nobody should do). Example of 32-bit code: lea eax, [0xffff_0000] mov ebx, [eax + 0x1_0000] The address calculated by the mov will wrap around to 0x0 which will result in SIGSEGV. If FEX messes up zero extensions then it would try to access 0x1_0000_0000 instead. This could result in a 32-bit application potentially accessing some FEX memory instead of crashing. Add this safety net which will still SIGSEGV and we will be able to see the crash.

When the source arguments for LoadMem/StoreMem have bit 31 set then they are incorrectly sign extending in some instances. Detected this when testing FEX-Emu#3421 but I don't have a proper fix for it.

Spurred on by FEX-Emu#3421 This does a bunch of GPR and vector loads to showcase addressing limitations between ARM and x86. Tests: - 8/16/32/64-bit GPR loads - Both 32-bit and 64-bit addressing modes - 32/64/128-bit Vector loads - Both 32-bit and 64-bit addressing modes - Duplicate the tests for 32-bit addressing mode with a 32-bit process - Since it should change behaviour. Untested: - 8/16-bit vector loadstores since those don't exist on x86 - 16-bit x87 integer load exists but that doesn't go through a vector load in FEX.

…2-bit Spurred on by FEX-Emu#3421. To ensure that applications don't take advantage of small address wrap around, allocate the second 4GB of virtual memory. Some context. Linux always reserves the first 16KB of virtual address space (unless you tinker with some settings which nobody should do). Example of 32-bit code: lea eax, [0xffff_0000] mov ebx, [eax + 0x1_0000] The address calculated by the mov will wrap around to 0x0 which will result in SIGSEGV. If FEX messes up zero extensions then it would try to access 0x1_0000_0000 instead. This could result in a 32-bit application potentially accessing some FEX memory instead of crashing. Add this safety net which will still SIGSEGV and we will be able to see the crash.

When the source arguments for LoadMem/StoreMem have bit 31 set then they are incorrectly sign extending in some instances. Detected this when testing FEX-Emu#3421 but I don't have a proper fix for it.

…u#3421 Doesn't quite match the libc code directly because it uses `[gs:eax]` with both having the sign bit set and we can't deal with that with ASM tests. So match the behaviour in a different way.

ASM: Another sign extend bug in #3421

pmatos · 2024-03-05T10:58:48Z

@Sonicadvance1 The bug you found in #3471 did feel like a nail in the coffin of this optimization. Problem being that anything that wraps around in 32bits cannot be faithfully represented in a single 64bit ld/st instructions. After a couple of iterations, I thought just reg+reg was problematic, but after your last bug, even reg+const is problematic. If the value in reg is high enough we wrap around and cannot fold the addition into the memory addressing.

In other words, you can't generally simplify:

add wX, wX, const
ld/st wX, [xX]

into ld/st wX, [xX, const].

For the past few days I have been playing with possible solutions including something like mapping the first 4G of pages and the second 4G of pages to the same physical pages. Something like:
https://lo.calho.st/posts/black-magic-buffer/

However, I think I found something that's easier to integrate into FEX. The use of userfaultfd available at least since kernel 4.11.
I have written a prototype that allocated a structure in the first 4G at and then tried to access it +0x1_0000_0000. The region 0x1_0000_0000 - 0x1_FFFF_FFFF is under userfaultfd, so access to that region triggers a handler that copies the pages from the first 4G to the second 4G and proceeds as if nothing happened. See:

https://gist.github.com/pmatos/677b1e3a3390c00f5c046fbc9bca7f68

What do you think about this?
I will try to integrate a mechanism like this locally in FEX and see how it does, but wanted to understand how you feel about having something like this in FEX to enable these kind of optimizations?

alyssarosenzweig · 2024-03-05T11:58:32Z

@Sonicadvance1 The bug you found in #3471 did feel like a nail in the coffin of this optimization. Problem being that anything that wraps around in 32bits cannot be faithfully represented in a single 64bit ld/st instructions. After a couple of iterations, I thought just reg+reg was problematic, but after your last bug, even reg+const is problematic. If the value in reg is high enough we wrap around and cannot fold the addition into the memory addressing.

In other words, you can't generally simplify:
add wX, wX, const
ld/st wX, [xX]
into ld/st wX, [xX, const].

For the past few days I have been playing with possible solutions including something like mapping the first 4G of pages and the second 4G of pages to the same physical pages. Something like: https://lo.calho.st/posts/black-magic-buffer/

However, I think I found something that's easier to integrate into FEX. The use of userfaultfd available at least since kernel 4.11. I have written a prototype that allocated a structure in the first 4G at and then tried to access it +0x1_0000_0000. The region 0x1_0000_0000 - 0x1_FFFF_FFFF is under userfaultfd, so access to that region triggers a handler that copies the pages from the first 4G to the second 4G and proceeds as if nothing happened. See:

https://gist.github.com/pmatos/677b1e3a3390c00f5c046fbc9bca7f68

What do you think about this? I will try to integrate a mechanism like this locally in FEX and see how it does, but wanted to understand how you feel about having something like this in FEX to enable these kind of optimizations?

This is exactly what #3439 is for. In particular the remark:

Linux always reserves the first 16KB of virtual address space (unless you tinker with some settings which nobody should do).

That means that (ignoring sign-extension at least), if the constant is < 16384, the fold is legal. Since either:

There is no overflow, equivalently there is no 32-bit wrap around.
There is 32-bit overflow. Since the address is 32-bit and the constant < 16384, wraps around to something in the first 16KB. This is an illegal memory access and should crash. With 64-bit, this overflows to an address in the first 16KB after the 4GB marker, which FEX has specifically reserved to ensure it will crash in the same way.

Folds reg+const memory address into addressing mode, if the constant is within 16Kb. Update instcountci files. Add test 32Bit_ASM/FEX_bugs/SubAddrBug.asm

pmatos · 2024-03-05T14:05:38Z

@Sonicadvance1 The bug you found in #3471 did feel like a nail in the coffin of this optimization. Problem being that anything that wraps around in 32bits cannot be faithfully represented in a single 64bit ld/st instructions. After a couple of iterations, I thought just reg+reg was problematic, but after your last bug, even reg+const is problematic. If the value in reg is high enough we wrap around and cannot fold the addition into the memory addressing.
In other words, you can't generally simplify:
add wX, wX, const
ld/st wX, [xX]
into ld/st wX, [xX, const].
For the past few days I have been playing with possible solutions including something like mapping the first 4G of pages and the second 4G of pages to the same physical pages. Something like: https://lo.calho.st/posts/black-magic-buffer/
However, I think I found something that's easier to integrate into FEX. The use of userfaultfd available at least since kernel 4.11. I have written a prototype that allocated a structure in the first 4G at and then tried to access it +0x1_0000_0000. The region 0x1_0000_0000 - 0x1_FFFF_FFFF is under userfaultfd, so access to that region triggers a handler that copies the pages from the first 4G to the second 4G and proceeds as if nothing happened. See:
https://gist.github.com/pmatos/677b1e3a3390c00f5c046fbc9bca7f68
What do you think about this? I will try to integrate a mechanism like this locally in FEX and see how it does, but wanted to understand how you feel about having something like this in FEX to enable these kind of optimizations?
This is exactly what #3439 is for. In particular the remark:

Linux always reserves the first 16KB of virtual address space (unless you tinker with some settings which nobody should do).

That means that (ignoring sign-extension at least), if the constant is < 16384, the fold is legal. Since either:

There is no overflow, equivalently there is no 32-bit wrap around.

There is 32-bit overflow. Since the address is 32-bit and the constant < 16384, wraps around to something in the first 16KB. This is an illegal memory access and should crash. With 64-bit, this overflows to an address in the first 16KB after the 4GB marker, which FEX has specifically reserved to ensure it will crash in the same way.

After that explanation the 16Kb comments makes even more sense. Which means that ok - without any other handling - we can indeed deal with 16Kb positive and negative offsets. I have pushed a patch to that effect. I am both upset and happy that I didn't see this coming. On the one hand, I spent a ton of time looking into userfaultfd which was not strictly necessary for this. On the other hand, looking into userfaultfd did show me a couple of cool tricks that could open the door for further optimization.

Sonicadvance1 · 2024-03-05T15:25:37Z

Fantastic research in to finding out the limits of 32-bit addressing. I knew it would have problems but I wasn't sure where the limits were at. Known unknowns and whatnot.

userfaultfd doesn't necessarily work because it will do copying and accesses that split the ranges will get decoupled. Additionally device mapped memory will have problems.

Good to know these problems now and this optimization must be limited to constants <16KB

pmatos · 2024-03-11T13:06:52Z

@Sonicadvance1 ping - what do you think about merging this now?

Sonicadvance1

Nice! Gave this some testing and I don't see any regressions from the more limited scope of this now.
Thanks for pushing this through and finding the issues.

pmatos changed the title ~~AddressingModes32~~ Improve 32bit ld/st addressing mode propagation Feb 12, 2024

Sonicadvance1 reviewed Feb 13, 2024

View reviewed changes

FEXCore/Source/Interface/IR/Passes/ConstProp.cpp Outdated Show resolved Hide resolved

Sonicadvance1 reviewed Feb 14, 2024

View reviewed changes

FEXCore/Source/Interface/IR/Passes/ConstProp.cpp Outdated Show resolved Hide resolved

pmatos force-pushed the AddressingModes32 branch from 151adb7 to ee23b55 Compare February 20, 2024 17:10

pmatos force-pushed the AddressingModes32 branch from 3ae67d0 to 645da81 Compare February 21, 2024 18:05

pmatos marked this pull request as ready for review February 21, 2024 18:05

pmatos requested a review from Sonicadvance1 February 21, 2024 18:05

Sonicadvance1 mentioned this pull request Feb 21, 2024

FEXLoader: Allocate the second 4GB of virtual memory when executing 32-bit #3439

Merged

Sonicadvance1 mentioned this pull request Feb 22, 2024

InstcountCI: Adds addressing limitations to instcountci #3440

Merged

pmatos force-pushed the AddressingModes32 branch from 645da81 to b120df1 Compare February 25, 2024 10:56

pmatos force-pushed the AddressingModes32 branch from b120df1 to 5c476f3 Compare February 25, 2024 19:05

pmatos force-pushed the AddressingModes32 branch from 5c476f3 to 019555a Compare February 25, 2024 19:26

Sonicadvance1 mentioned this pull request Feb 26, 2024

Adds a unittest for a bug from #3421 #3460

Merged

pmatos force-pushed the AddressingModes32 branch from c5e3122 to 5f27180 Compare February 27, 2024 16:29

pmatos force-pushed the AddressingModes32 branch from 5f27180 to 8beeef3 Compare February 27, 2024 16:47

Sonicadvance1 mentioned this pull request Feb 28, 2024

OpcodeDispatcher: Don't use AddShift with no shift #3466

Merged

pmatos force-pushed the AddressingModes32 branch 2 times, most recently from b7f7aa9 to f4b3870 Compare February 28, 2024 13:41

Sonicadvance1 added a commit to Sonicadvance1/FEX that referenced this pull request Feb 29, 2024

ASM: Another sign extend bug in FEX-Emu#3421

b3489d7

This time found in MGRR. It flips the problem space on its head.

alyssarosenzweig added a commit that referenced this pull request Mar 1, 2024

Merge pull request #3471 from Sonicadvance1/another_3421_bug

b7984e8

ASM: Another sign extend bug in #3421

pmatos force-pushed the AddressingModes32 branch 3 times, most recently from ecfd88f to 9c16415 Compare March 5, 2024 13:55

Improve 32bit constant usage in memory addressing

a86f2d3

Folds reg+const memory address into addressing mode, if the constant is within 16Kb. Update instcountci files. Add test 32Bit_ASM/FEX_bugs/SubAddrBug.asm

pmatos force-pushed the AddressingModes32 branch from 9c16415 to a86f2d3 Compare March 5, 2024 14:01

Sonicadvance1 approved these changes Mar 11, 2024

View reviewed changes

Sonicadvance1 merged commit ff0c763 into FEX-Emu:main Mar 11, 2024
10 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve 32bit ld/st addressing mode propagation #3421

Improve 32bit ld/st addressing mode propagation #3421

pmatos commented Feb 12, 2024

pmatos commented Feb 12, 2024

pmatos commented Feb 12, 2024

pmatos commented Feb 20, 2024

Sonicadvance1 commented Feb 24, 2024

pmatos commented Feb 25, 2024

alyssarosenzweig commented Feb 25, 2024

pmatos commented Feb 25, 2024

pmatos commented Feb 25, 2024

alyssarosenzweig commented Feb 25, 2024 •

edited

Loading

pmatos commented Feb 25, 2024

pmatos commented Feb 27, 2024

pmatos commented Feb 27, 2024

pmatos commented Feb 28, 2024

pmatos commented Mar 5, 2024

alyssarosenzweig commented Mar 5, 2024

pmatos commented Mar 5, 2024

Sonicadvance1 commented Mar 5, 2024

pmatos commented Mar 11, 2024

Sonicadvance1 left a comment

Improve 32bit ld/st addressing mode propagation #3421

Improve 32bit ld/st addressing mode propagation #3421

Conversation

pmatos commented Feb 12, 2024

pmatos commented Feb 12, 2024

pmatos commented Feb 12, 2024

pmatos commented Feb 20, 2024

Sonicadvance1 commented Feb 24, 2024

pmatos commented Feb 25, 2024

alyssarosenzweig commented Feb 25, 2024

pmatos commented Feb 25, 2024

pmatos commented Feb 25, 2024

alyssarosenzweig commented Feb 25, 2024 • edited Loading

pmatos commented Feb 25, 2024

pmatos commented Feb 27, 2024

pmatos commented Feb 27, 2024

pmatos commented Feb 28, 2024

pmatos commented Mar 5, 2024

alyssarosenzweig commented Mar 5, 2024

pmatos commented Mar 5, 2024

Sonicadvance1 commented Mar 5, 2024

pmatos commented Mar 11, 2024

Sonicadvance1 left a comment

Choose a reason for hiding this comment

alyssarosenzweig commented Feb 25, 2024 •

edited

Loading