
Introduce the Cranelift IR instruction LoadSplat #2278

Merged
3 commits merged from the load_splat branch into bytecodealliance:main on Oct 28, 2020

Conversation

akirilov-arm (Contributor) opened this pull request:

It corresponds to WebAssembly's load*_splat operations, which were previously represented as a combination of Load and Splat instructions. However, there are architectures such as Armv8-A that have a single machine instruction equivalent to the Wasm operations. In order to generate it, it is necessary to merge the Load and the Splat in the backend, which is not possible because the load may have side effects. The new IR instruction works around this limitation.
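(For illustration, a sketch of the difference at the CLIF level; value numbers and types are illustrative, and memory flags/offsets are omitted.)

```
;; Before: two CLIF instructions that the backend cannot merge,
;; because the load is effectful.
v1 = load.i32 v0
v2 = splat.i32x4 v1

;; After: a single CLIF instruction that the AArch64 backend can
;; lower directly to a load-and-replicate (LD1R) machine instruction.
v3 = load_splat.i32x4 v0
```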

The AArch64 backend leverages the new instruction to improve code generation. I am not really qualified to implement the optimization for x86, so I have added a temporary work-around to avoid any test breakage. It's a bit of a hack, but it should be incorrect only when cross-compiling to x86 on an Arm platform.

Fixes #1175 (but other backends need work too).

cc @abrown @cfallin

abrown (Contributor) left a comment:

Looks good to me (though I am not as familiar with the aarch64 side of things); I'll make a note to remove the `if cfg!` in the code translator and lower things in x64 instead.

cfallin (Member) left a comment:

Thanks for doing work on this on AArch64!

The aarch64-specific lowering logic in cranelift-wasm feels wrong to me, though -- it's too much of a leaky abstraction. E.g., it means that we have some IR that we can only codegen on one architecture; and it means that the lowering result depends on where we've compiled the crate.

So, if we add the CLIF opcode, we should really handle it in all backends. Since the old x86 backend is still alive, unfortunately that means x86-old, x64-new, and aarch64. The first of these can probably use a legalization rule that decomposes back to load and splat ops; and you've done the third; so then we need a lowering for the x64 backend. @abrown, would you be willing to help with this (and then we can land it all at once)?

github-actions bot added the cranelift, cranelift:area:aarch64, cranelift:meta, and cranelift:wasm labels on Oct 7, 2020
abrown (Contributor) commented on Oct 7, 2020:

> @abrown, would you be willing to help with this (and then we can land it all at once)?

Yes, but I might not get around to it until next week? If the delay is not a problem, I don't mind doing that so that the issue is resolved all at once. (I think I can push to this PR, right?)

cfallin (Member) commented on Oct 7, 2020:

That sounds fine to me at least!

akirilov-arm (Contributor, Author) commented:

I don't mind a one-week delay either.

akirilov-arm (Contributor, Author) commented on Oct 9, 2020:

I added the necessary legalization rule to the old x86 backend. I also removed the temporary work-around in cranelift-wasm, so the lowering logic is no longer target-specific.

abrown (Contributor) commented on Oct 9, 2020:

@akirilov-arm, nice! So, just to make sure I'm tracking this correctly: the only thing I need to do next week is lower `LoadSplat` in the x64 backend, right? (Unless you get to it first...).

akirilov-arm (Contributor, Author) commented:

@abrown Yes, that's the only thing remaining, and no, I don't intend to do any other x86 changes (unless I am addressing review feedback, of course).

However, I have noticed that the peephole optimizer tests are failing and according to the logs the reason seems unrelated to my changes...

abrown (Contributor) commented on Oct 13, 2020:

@akirilov-arm, @cfallin: I committed an implementation of `LoadSplat` on x64; please take a look. Unfortunately, I had to thread `SourceLoc`s through more `Inst` formats in c9c8de6, and I feel there may be a better way to do this, so I opened #2290.

github-actions bot added the cranelift:area:x64 label on Oct 13, 2020
abrown (Contributor) commented on Oct 13, 2020:

Looking at the failures in the CI, I think something is behaving weirdly re: Git. The lines it claims need extra parameters should have those added by c9c8de6 and `ci/run-experimental-x64-ci.sh`. Perhaps we should rebase on main and re-push?

akirilov-arm (Contributor, Author) commented on Oct 13, 2020:

@abrown I have just rebased, but the tests are still failing. Your changes look fine to me otherwise, but I am not particularly familiar with the x64 backend.

akirilov-arm (Contributor, Author) commented:

I am not sure why it makes a difference, but I have moved the new IR instruction definition to the end of the function that defines all the operations in `cranelift/codegen/meta/src/shared/instructions.rs`, and now the peephole optimizer tests are no longer failing.

cc @fitzgen

akirilov-arm (Contributor, Author) commented:

@abrown It looks like you missed updating a couple of `Inst::xmm_rm_r` and `Inst::xmm_rm_r_imm` call sites, probably because my branch was outdated when you started working on it. I have just rebased, so would you be able to fix them?

julian-seward1 (Contributor) commented:

I apologise for getting to this so late in the day.

I would like to request that this be reconsidered. I believe the approach discussed so far works against the design goals of the new backend framework, and that an alternative approach would be preferable.

A primary goal of CLIF, in the new backend framework, is to provide a way to conveniently do redundancy elimination. One key aspect is to provide only a single way to represent any computation. This change runs counter to those goals. For example, consider GVN: with this change in place, there is no way to GVN a single LoadSplat CLIF with an equivalent two-CLIF load and splat sequence.
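(A sketch in CLIF of the GVN concern, with illustrative value numbers and types: the two forms below compute the same splatted value from the same address, but GVN only unifies structurally identical instructions, so it has no way to relate them.)

```
v1 = load.i32 v0
v2 = splat.i32x4 v1       ;; two-instruction form
v3 = load_splat.i32x4 v0  ;; single-instruction form
;; v2 and v3 hold the same value, but no GVN rule equates them.
```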

Of course I want this to become a single instruction on AArch64. The instruction selection framework in the new backend has been designed from the start to allow matching multiple CLIF nodes into a single machine instruction, and so we should use that here, and not add the proposed LoadSplat CLIF node.

@akirilov-arm writes:

> In order to generate it, it is necessary to merge the Load and the Splat in the backend, which is not possible because the load may have side effects.

I don't follow .. can you clarify this? What instruction are you referring to?

akirilov-arm (Contributor, Author) commented on Oct 14, 2020:

> What instruction are you referring to?

`Opcode::Load` and `Opcode::Splat`. I actually tried what you are talking about and it didn't work - refer to the discussion in #1175.

julian-seward1 (Contributor) commented on Oct 14, 2020:

Ah, I see. Yes, it's true; the new framework currently doesn't allow what you tried -- it only allows a load to be at the root of a pattern tree. Whereas here, the splat would be at the root.

Having said that, @cfallin was talking recently about lifting that restriction in the framework. This isn't of much significance in the AArch64 case (apart from here, obviously) since AArch64 has very few load-op instructions. But for x64, almost all instructions have a load-op form, and so it's important that the framework can support that idiom effectively. Hence it might be the case that this limitation gets lifted sooner rather than later.
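(A sketch in CLIF of the x64 load-op situation, with illustrative values; v0 is assumed to be an address and v1 another f32x4 value defined earlier. The load feeds the add, so the load is not the root of the pattern, which is exactly the case the current restriction rules out.)

```
;; x64 would ideally fold this load into the consuming instruction
;; (a memory-operand form of the vector add), but the load sits
;; below the root of the pattern tree.
v2 = load.f32x4 v0
v3 = fadd v1, v2
```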

In order to register traps for `load_splat`, several instruction formats need knowledge of `SourceLoc`s; however, since the x64 backend does not correctly and completely register traps for `RegMem::Mem` variants, I opened bytecodealliance#2290 to discuss and resolve this issue. In the meantime, the current behavior (i.e. remaining largely unaware of `SourceLoc`s) is retained.
abrown (Contributor) commented on Oct 14, 2020:

@akirilov-arm: good call, now d990dd4 should have source locations in more places. @julian-seward1: regardless of what we decide to do with this PR, you should also take a look at #2290--we need a good way to trap at all of the places an x64 instruction could load (which could be quite a few) so I suggest a possible alternative in that issue.

akirilov-arm (Contributor, Author) commented:

Thanks, @abrown!

@cfallin @julian-seward1 How are we going to proceed with this? The implementation is now complete (covering all three major backends) and passes all tests. Based on the discussion in #1175, I believed that this would be an acceptable approach, but even if it is not, most of my code is actually independent of whether there is a new CLIF operation or not, and I think that's the case for @abrown's code as well. Are there plans to change the new framework soon, so that it supports the type of pattern matching that would be applicable to this case?

cfallin (Member) commented on Oct 15, 2020:

@akirilov-arm @julian-seward1 @abrown so, I'm off for this week and next (but I'll try to keep up with this when I can). At a high level, I am fine with the pragmatic approach here (and in #1175) of creating these merged opcodes for now, and then eventually removing them when we have the story for load-related code motion sorted out.

In other words, I don't necessarily think that we are so close to having the "right thing" built that we should hold back true codegen improvements in the meantime and create a dependence on a to-be-built part of the framework; I think the cost of removing these opcodes later is minimal, and can be done as part of a PR that introduces the proper pattern-matching. It's not the ideal end-state, but it lets us move forward in the meantime.

@julian-seward1, what do you think?

cfallin (Member) commented on Oct 28, 2020:

So, @julian-seward1 and I discussed this last week; apologies for taking so long to get back to it (I'm back to work again, albeit somewhat busy with onboarding stuff).

In general I agree with @julian-seward1 (and it seems others would be on board as well) that the better approach, eventually, is one that can actually handle pattern-matching load sources into instructions. This has been on my to-do list for a while, but it should in theory be possible if we loosen the load-motion conditions a bit.

However, I think that having the better codegen as a starting point is valuable; if we take this as-is for now, then we have a baseline whose codegen output we should match when we do have the proper pattern-matching in place.

Thus, I'm happy to take this, and I'll commit to removing the LoadSplat as part of my pattern-matching fixup later, as long as @julian-seward1 doesn't strongly object to this transitional state. Thoughts?

akirilov-arm (Contributor, Author) commented:

@cfallin Sounds good to me. It's always better to have a good test case before adding non-trivial functionality and, if I understand correctly, previously there were no actual use cases for the enhanced pattern matching.

julian-seward1 (Contributor) commented:

Well, I can't say I'm enthusiastic; I think it moves the design in the wrong direction. But if it's only temporary, and since it's useful to land the rest of it anyway, I won't stand in the way.

I should add .. Today I implemented the proposed wasm instructions `v128.load32_zero` and `v128.load64_zero`, although the patch isn't on any PR yet. These are almost identical to the load-splat insns, except that all lanes other than lane zero are zeroed out, not made into clones of it. I used a two-CLIF-insn scheme -- a scalar load, and a move-scalar-to-lowest-lane-and-zero-others. Once the isel framework can do load-op merging, I'll add a matching rule so as to only generate a single insn.
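(A sketch of the two-CLIF-instruction scheme described here, assuming the second step is an opcode with `scalar_to_vector`-style semantics, i.e. move the scalar into lane zero and zero the remaining lanes; the exact opcode used in that patch is not shown in this thread.)

```
;; v128.load32_zero as two CLIF instructions (illustrative):
v1 = load.i32 v0                 ;; scalar load from address v0
v2 = scalar_to_vector.i32x4 v1   ;; lane 0 = v1, remaining lanes zeroed
```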

So .. if/when that lands, it'll create further CLIF-level inconsistency, at least in the short term :-/

cfallin (Member) commented on Oct 28, 2020:

@julian-seward1 understood, and yes, not an ideal situation; but I think this should be a useful test case in the sense that if a new matching framework can handle this case (allow LoadSplat to be removed while retaining identical output), it will have succeeded. I definitely want to get to that point as soon as possible :-)

I'll go ahead and approve and merge now; thanks for the discussion, all!

cfallin merged commit c35904a into bytecodealliance:main on Oct 28, 2020
abrown added a commit to abrown/wasmtime that referenced this pull request Oct 28, 2020
The changes in bytecodealliance#2278 added `SourceLoc`s to several x64 `Inst` variants; between when that PR was last run in CI and when it was merged, new instructions were added that require this new parameter. This change adds the parameter in order to fix CI.
jlb6740 added a commit to jlb6740/wasmtime that referenced this pull request Oct 28, 2020
abrown added a commit that referenced this pull request Oct 28, 2020
The changes in #2278 added `SourceLoc`s to several x64 `Inst` variants; between when that PR was last run in CI and when it was merged, new instructions were added that require this new parameter. This change adds the parameter in order to fix CI.
akirilov-arm deleted the load_splat branch on October 28, 2020 at 22:21
cfallin added a commit to cfallin/wasmtime that referenced this pull request Nov 7, 2020
This was added as an incremental step to improve AArch64 code quality in
PR bytecodealliance#2278. At the time, we did not have a way to pattern-match the load +
splat opcode sequence that the relevant Wasm opcodes lowered to.
However, now with PR bytecodealliance#2366, we can merge effectful instructions such as
loads into other ops, and so we can do this pattern matching directly.
The pattern-matching update will come in a subsequent commit.
cfallin pushed a commit that referenced this pull request Nov 30, 2020
The changes in #2278 added `SourceLoc`s to several x64 `Inst` variants; between when that PR was last run in CI and when it was merged, new instructions were added that require this new parameter. This change adds the parameter in order to fix CI.
cfallin added a commit that referenced this pull request Nov 30, 2020
This was added as an incremental step to improve AArch64 code quality in
PR #2278. At the time, we did not have a way to pattern-match the load +
splat opcode sequence that the relevant Wasm opcodes lowered to.
However, now with PR #2366, we can merge effectful instructions such as
loads into other ops, and so we can do this pattern matching directly.
The pattern-matching update will come in a subsequent commit.
Labels
cranelift:area:aarch64 (Issues related to AArch64 backend), cranelift:area:x64 (Issues related to x64 codegen), cranelift:meta (Everything related to the meta-language), cranelift:wasm, cranelift (Issues related to the Cranelift code generator)
Development

Successfully merging this pull request may close these issues.

Replace temporary lowering of Wasm's load_splat with a new Cranelift instruction
4 participants