
Introduce the Cranelift IR instruction LoadSplat #2278

Merged
3 commits merged from the load_splat branch into bytecodealliance:main on Oct 28, 2020

Conversation

akirilov-arm (Contributor) opened this pull request:

It corresponds to WebAssembly's load*_splat operations, which were previously represented as a combination of Load and Splat instructions. However, there are architectures such as Armv8-A that have a single machine instruction equivalent to the Wasm operations. In order to generate it, it is necessary to merge the Load and the Splat in the backend, which is not possible because the load may have side effects. The new IR instruction works around this limitation.
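(For illustration, a sketch of the difference at the CLIF level; value numbers and types are illustrative, and memory flags/offsets are omitted.)

```
;; Before: two CLIF instructions that the backend cannot merge,
;; because the load is effectful.
v1 = load.i32 v0
v2 = splat.i32x4 v1

;; After: a single CLIF instruction that the AArch64 backend can
;; lower directly to a load-and-replicate (LD1R) machine instruction.
v3 = load_splat.i32x4 v0
```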

The AArch64 backend leverages the new instruction to improve code generation. I am not really qualified to implement the optimization for x86, so I have added a temporary work-around to avoid any test breakage. It's a bit of a hack, but it should be incorrect only when cross-compiling to x86 on an Arm platform.

Fixes #1175 (but other backends need work too).

cc @abrown @cfallin

abrown (Contributor) left a comment:

Looks good to me (though I am not as familiar with the aarch64 side of things); I'll make a note to remove the `if cfg!` in the code translator and lower things in x64 instead.

cfallin (Member) left a comment:

Thanks for doing work on this on AArch64!

The aarch64-specific lowering logic in cranelift-wasm feels wrong to me, though -- it's too much of a leaky abstraction. E.g., it means that we have some IR that we can only codegen on one architecture; and it means that the lowering result depends on where we've compiled the crate.

So, if we add the CLIF opcode, we should really handle it in all backends. Since the old x86 backend is still alive, unfortunately that means x86-old, x64-new, and aarch64. The first of these can probably use a legalization rule that decomposes back to load and splat ops; and you've done the third; so then we need a lowering for the x64 backend. @abrown, would you be willing to help with this (and then we can land it all at once)?

github-actions bot added the cranelift, cranelift:area:aarch64, cranelift:meta, and cranelift:wasm labels on Oct 7, 2020
abrown (Contributor) commented on Oct 7, 2020:

> @abrown, would you be willing to help with this (and then we can land it all at once)?

Yes, but I might not get around to it until next week? If the delay is not a problem, I don't mind doing that so that the issue is resolved all at once. (I think I can push to this PR, right?)

cfallin (Member) commented on Oct 7, 2020:

That sounds fine to me at least!

akirilov-arm (Contributor, Author) commented:

I don't mind a one-week delay either.

akirilov-arm (Contributor, Author) commented on Oct 9, 2020:

I added the necessary legalization rule to the old x86 backend. I also removed the temporary work-around in cranelift-wasm, so the lowering logic is no longer target-specific.

abrown (Contributor) commented on Oct 9, 2020:

@akirilov-arm, nice! So, just to make sure I'm tracking this correctly: the only thing I need to do next week is lower `LoadSplat` in the x64 backend, right? (Unless you get to it first...).

akirilov-arm (Contributor, Author) commented:

@abrown Yes, that's the only thing remaining, and no, I don't intend to do any other x86 changes (unless I am addressing review feedback, of course).

However, I have noticed that the peephole optimizer tests are failing and according to the logs the reason seems unrelated to my changes...

abrown (Contributor) commented on Oct 13, 2020:

@akirilov-arm, @cfallin: I committed an implementation of `LoadSplat` on x64; please take a look. Unfortunately, I had to thread `SourceLoc`s through more `Inst` formats in c9c8de6, and I feel there may be a better way to do this, so I opened #2290.

github-actions bot added the cranelift:area:x64 label on Oct 13, 2020
abrown (Contributor) commented on Oct 13, 2020:

Looking at the failures in the CI, I think something is behaving weirdly re: Git. The lines it claims need extra parameters should have those added by c9c8de6 and `ci/run-experimental-x64-ci.sh`. Perhaps we should rebase on main and re-push?

akirilov-arm (Contributor, Author) commented on Oct 13, 2020:

@abrown I have just rebased, but the tests are still failing. Your changes look fine to me otherwise, but I am not particularly familiar with the x64 backend.

akirilov-arm (Contributor, Author) commented:

I am not sure why it makes a difference, but I have moved the new IR instruction definition to the end of the function that defines all the operations in `cranelift/codegen/meta/src/shared/instructions.rs`, and now the peephole optimizer tests are no longer failing.

cc @fitzgen

akirilov-arm (Contributor, Author) commented:

@abrown It looks like you missed updating a couple of `Inst::xmm_rm_r` and `Inst::xmm_rm_r_imm` call sites, probably because my branch was outdated when you started working on it. I have just rebased, so would you be able to fix them?

julian-seward1 (Contributor) commented:

I apologise for getting to this so late in the day.

I would like to request that this be reconsidered. I believe the approach discussed so far works against the design goals of the new backend framework, and that an alternative approach would be preferable.

A primary goal of CLIF, in the new backend framework, is to provide a way to conveniently do redundancy elimination. One key aspect is to provide only a single way to represent any computation. This change runs counter to those goals. For example, consider GVN: with this change in place, there is no way to GVN a single LoadSplat CLIF with an equivalent two-CLIF load and splat sequence.
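(A sketch in CLIF of the GVN concern, with illustrative value numbers and types: the two forms below compute the same splatted value from the same address, but GVN only unifies structurally identical instructions, so it has no way to relate them.)

```
v1 = load.i32 v0
v2 = splat.i32x4 v1       ;; two-instruction form
v3 = load_splat.i32x4 v0  ;; single-instruction form
;; v2 and v3 hold the same value, but no GVN rule equates them.
```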

Of course I want this to become a single instruction on AArch64. The instruction selection framework in the new backend has been designed from the start to allow matching multiple CLIF nodes into a single machine instruction, and so we should use that here, and not add the proposed LoadSplat CLIF node.

@akirilov-arm writes:

> In order to generate it, it is necessary to merge the Load and the Splat in the backend, which is not possible because the load may have side effects.

I don't follow .. can you clarify this? What instruction are you referring to?

akirilov-arm (Contributor, Author) commented on Oct 14, 2020:

> What instruction are you referring to?

`Opcode::Load` and `Opcode::Splat`. I actually tried what you are talking about and it didn't work - refer to the discussion in #1175.

julian-seward1 (Contributor) commented on Oct 14, 2020:

Ah, I see. Yes, it's true; the new framework currently doesn't allow what you tried -- it only allows a load to be at the root of a pattern tree. Whereas here, the splat would be at the root.

Having said that, @cfallin was talking recently about lifting that restriction in the framework. This isn't of much significance in the AArch64 case (apart from here, obviously) since AArch64 has very few load-op instructions. But for x64, almost all instructions have a load-op form, and so it's important that the framework can support that idiom effectively. Hence it might be the case that this limitation gets lifted sooner rather than later.
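(A sketch in CLIF of the x64 load-op situation, with illustrative values; v0 is assumed to be an address and v1 another f32x4 value defined earlier. The load feeds the add, so the load is not the root of the pattern, which is exactly the case the current restriction rules out.)

```
;; x64 would ideally fold this load into the consuming instruction
;; (a memory-operand form of the vector add), but the load sits
;; below the root of the pattern tree.
v2 = load.f32x4 v0
v3 = fadd v1, v2
```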

In order to register traps for `load_splat`, several instruction formats need knowledge of `SourceLoc`s; however, since the x64 backend does not correctly and completely register traps for `RegMem::Mem` variants, I opened bytecodealliance#2290 to discuss and resolve this issue. In the meantime, the current behavior (i.e. remaining largely unaware of `SourceLoc`s) is retained.
abrown (Contributor) commented on Oct 14, 2020:

@akirilov-arm: good call, now d990dd4 should have source locations in more places. @julian-seward1: regardless of what we decide to do with this PR, you should also take a look at #2290--we need a good way to trap at all of the places an x64 instruction could load (which could be quite a few) so I suggest a possible alternative in that issue.

akirilov-arm (Contributor, Author) commented:

Thanks, @abrown!

@cfallin @julian-seward1 How are we going to proceed with this? The implementation is now complete (covering all three major backends) and passes all tests. Based on the discussion in #1175, I believed that this would be an acceptable approach, but even if it is not, most of my code is actually independent of whether there is a new CLIF operation or not, and I think that's the case for @abrown's code as well. Are there plans to change the new framework soon, so that it supports the type of pattern matching that would be applicable to this case?

cfallin (Member) commented on Oct 15, 2020:

@akirilov-arm @julian-seward1 @abrown so, I'm off for this week and next (but I'll try to keep up with this when I can). At a high level, I am fine with the pragmatic approach here (and in #1175) of creating these merged opcodes for now, and then eventually removing them when we have the story for load-related code motion sorted out.

In other words, I don't necessarily think that we are so close to having the "right thing" built that we should hold back true codegen improvements in the meantime and create a dependence on a to-be-built part of the framework; I think the cost of removing these opcodes later is minimal, and can be done as part of a PR that introduces the proper pattern-matching. It's not the ideal end-state, but it lets us move forward in the meantime.

@julian-seward1, what do you think?

cfallin (Member) commented on Oct 28, 2020:

So, @julian-seward1 and I discussed this last week; apologies for taking so long to get back to it (I'm back to work again, albeit somewhat busy with onboarding stuff).

In general I agree with @julian-seward1 (and it seems others would be on board as well) that the better approach, eventually, is one that can actually handle pattern-matching load sources into instructions. This has been on my to-do list for a while, but it should in theory be possible if we loosen the load-motion conditions a bit.

However, I think that having the better codegen as a starting point is valuable; if we take this as-is for now, then we have a baseline whose codegen output we should match when we do have the proper pattern-matching in place.

Thus, I'm happy to take this, and I'll commit to removing the LoadSplat as part of my pattern-matching fixup later, as long as @julian-seward1 doesn't strongly object to this transitional state. Thoughts?

akirilov-arm (Contributor, Author) commented:

@cfallin Sounds good to me. It's always better to have a good test case before adding non-trivial functionality and, if I understand correctly, previously there were no actual use cases for the enhanced pattern matching.

julian-seward1 (Contributor) commented:

Well, I can't say I'm enthusiastic; I think it moves the design in the wrong direction. But if it's only temporary, and since it's useful to land the rest of it anyway, I won't stand in the way.

I should add .. Today I implemented the proposed wasm instructions `v128.load32_zero` and `v128.load64_zero`, although the patch isn't on any PR yet. These are almost identical to the load-splat insns, except that all lanes other than lane zero are zeroed out, not made into clones of it. I used a two-CLIF-insn scheme -- a scalar load, and a move-scalar-to-lowest-lane-and-zero-others. Once the isel framework can do load-op merging, I'll add a matching rule so as to only generate a single insn.
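(A sketch of the two-CLIF-instruction scheme described here, assuming the second step is an opcode with `scalar_to_vector`-style semantics, i.e. move the scalar into lane zero and zero the remaining lanes; the exact opcode used in that patch is not shown in this thread.)

```
;; v128.load32_zero as two CLIF instructions (illustrative):
v1 = load.i32 v0                 ;; scalar load from address v0
v2 = scalar_to_vector.i32x4 v1   ;; lane 0 = v1, remaining lanes zeroed
```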

So .. if/when that lands, it'll create further CLIF-level inconsistency, at least in the short term :-/

cfallin (Member) commented on Oct 28, 2020:

@julian-seward1 understood, and yes, not an ideal situation; but I think this should be a useful test case in the sense that if a new matching framework can handle this case (allow LoadSplat to be removed while retaining identical output), it will have succeeded. I definitely want to get to that point as soon as possible :-)

I'll go ahead and approve and merge now; thanks for the discussion, all!

cfallin merged commit c35904a into bytecodealliance:main on Oct 28, 2020
abrown added a commit to abrown/wasmtime that referenced this pull request Oct 28, 2020
The changes in bytecodealliance#2278 added `SourceLoc`s to several x64 `Inst` variants; between when that PR was last run in CI and when it was merged, new instructions were added that require this new parameter. This change adds the parameter in order to fix CI.
jlb6740 added a commit to jlb6740/wasmtime that referenced this pull request Oct 28, 2020
abrown added a commit that referenced this pull request Oct 28, 2020
The changes in #2278 added `SourceLoc`s to several x64 `Inst` variants; between when that PR was last run in CI and when it was merged, new instructions were added that require this new parameter. This change adds the parameter in order to fix CI.
akirilov-arm deleted the load_splat branch on October 28, 2020 at 22:21
cfallin added a commit to cfallin/wasmtime that referenced this pull request Nov 7, 2020
This was added as an incremental step to improve AArch64 code quality in
PR bytecodealliance#2278. At the time, we did not have a way to pattern-match the load +
splat opcode sequence that the relevant Wasm opcodes lowered to.
However, now with PR bytecodealliance#2366, we can merge effectful instructions such as
loads into other ops, and so we can do this pattern matching directly.
The pattern-matching update will come in a subsequent commit.
cfallin pushed a commit that referenced this pull request Nov 30, 2020
The changes in #2278 added `SourceLoc`s to several x64 `Inst` variants; between when that PR was last run in CI and when it was merged, new instructions were added that require this new parameter. This change adds the parameter in order to fix CI.
cfallin added a commit that referenced this pull request Nov 30, 2020
This was added as an incremental step to improve AArch64 code quality in
PR #2278. At the time, we did not have a way to pattern-match the load +
splat opcode sequence that the relevant Wasm opcodes lowered to.
However, now with PR #2366, we can merge effectful instructions such as
loads into other ops, and so we can do this pattern matching directly.
The pattern-matching update will come in a subsequent commit.
Labels
cranelift:area:aarch64 (Issues related to AArch64 backend), cranelift:area:x64 (Issues related to x64 codegen), cranelift:meta (Everything related to the meta-language), cranelift:wasm, cranelift (Issues related to the Cranelift code generator)
Development

Successfully merging this pull request may close these issues.

Replace temporary lowering of Wasm's load_splat with a new Cranelift instruction
4 participants