Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove optimization for memory.copy(x, x, C) #3073

Merged
merged 1 commit into from
Aug 23, 2020

Conversation

MaxGraey
Copy link
Contributor

It seems we couldn't simply remove memory.copy(x, x, C) even if all arguments are side effect free due to x value could exceeds memory bounds.

See: #3038 (comment)

@MaxGraey
Copy link
Contributor Author

Start fuzzing.

@MaxGraey
Copy link
Contributor Author

MaxGraey commented Aug 23, 2020

We can't just fold everything to nop. Unless you are until make sure:

  1. all arguments are side effect free
  2. dest + size < memory.size. But we can check this only in runtime for general case.

Or I miss something?

@kripken
Copy link
Member

kripken commented Aug 23, 2020

I think this is the implicit trap issue. When the instruction can trap, we can't optimize it out.

What I think we can do here is calculate the side effects of the entire node. Then since we know dst == src we can ignore the read and write of memory. Then we can check if any side effects remain, and only optimize if there are none.

@MaxGraey
Copy link
Contributor Author

MaxGraey commented Aug 23, 2020

What I think we can do here is calculate the side effects of the entire node. Then since we know dst == src we can ignore the read and write of memory. Then we can check if any side effects remain, and only optimize if there are none.

That's sound quite complicated. But I have only one extremely radical solution) Yes, I mean update spec of validation for bulk-memories. If followed by spec there are no check for src == dst with early return. Instead of that it always check:

if (src + size > memory.size || dst + size > memory.size) {
  oob_trap();
}

what if we add one extra check for runtime?

if (src + size > memory.size || dst + size > memory.size) {
  if (src == dst) {
    return nop();
  } else {
    oob_trap();
  }
}

Or perform src == dst check even before (outside) main oop check.

in this case we could safely perform this optimization. But I'm not sure is it make sense) It's pretty radical solution. Especially due to bulk memory already standardized

Copy link
Member

@kripken kripken left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, let's land this for now to unbreak things, and consider other options.

@kripken kripken merged commit 33ccea3 into WebAssembly:master Aug 23, 2020
@MaxGraey MaxGraey deleted the fix-memcopy-2 branch August 23, 2020 19:32
@kripken
Copy link
Member

kripken commented Aug 23, 2020

I don't think it's that complicated myself, but up to you.

A spec change is possible to suggest, I have no idea of the chances there though.

@MaxGraey
Copy link
Contributor Author

MaxGraey commented Aug 23, 2020

I don't think it's that complicated myself, but up to you.

Hmm, I just thought it required precise range check similar to getMaxBits but care about actual value's range. But if you see much more simpler way It will be nice to handle this. Just not sure all of this worth for such infrequent case.

@tlively
Copy link
Member

tlively commented Aug 24, 2020

Sorry for missing this in review. Changing the bulk memory spec would be infeasible at this point because the semantics of memory.copy have been discussed extensively already and the proposal has already moved to phase 4.

@MaxGraey
Copy link
Contributor Author

Changing the bulk memory spec would be infeasible at this point because the semantics of memory.copy have been discussed extensively already and the proposal has already moved to phase 4

Yes, but we already have precedence with even more drastic change in reference-types on phase 4.

memory.copy have been discussed extensively

Also spec have one special case with skipping bounds check when size of copy bytes is zero: https://github.com/WebAssembly/bulk-memory-operations/blob/996ef3f7f9feaa53fd094050e7baeea4c5c0c361/interpreter/exec/eval.ml#L323

@conrad-watt
Copy link

conrad-watt commented Aug 24, 2020

More specifically, we already discussed these different possibilities for bulk memory bounds checking explicitly (see discussions linked from WebAssembly/bulk-memory-operations#126).

The motivation for the current semantics is that it's the simplest possible conditional check (i.e. max_offset+len > memory.size). Having a semantics where there's no trap if the size is 0 would require an additional check in some implementations.

If it's not possible to eliminate the op in this case, could it be rewritten to a bare byte load at the maximum byte index of the copy?
There might be an overflow issue.

EDIT: in the linked implementation, the zero special case is only executed if the bounds check succeeds (i.e. there is no trap; see the shift made here, combined with the when clause of the previous case https://github.com/WebAssembly/bulk-memory-operations/pull/126/files#diff-39718c053592dcf46dfea22c69b94f0eL313).

@MaxGraey
Copy link
Contributor Author

I see and understand the main design goal is reduce runtime checking cost. But if all this special cases will do inside main oop check?

if (src + size > memory.size || dst + size > memory.size) {
  if (src == dst || size == 0) {
    return nop();
  }
  oob_trap();
}

@conrad-watt
Copy link

I see and understand the main design goal is reduce runtime checking cost. But if all this special cases will do inside main oop check?

if (src + size > memory.size || dst + size > memory.size) {
  if (src == dst || size == 0) {
    return nop();
  }
  oob_trap();
}

Maybe I'm not understanding, but in a runtime implementation with explicit bounds checks, I think those two checks would have to be performed independently of each other?

i.e. memory.copy(x, x, C): x+C could be in-bounds, but you would still need it to be a no-op in order to eliminate it.

@MaxGraey
Copy link
Contributor Author

MaxGraey commented Aug 24, 2020

The main idea make memory.copy(x, x, C) -> nop possible even if x + C out of bound. Similar to memory.copy(x, x, 0) which it seems valid if x is out of bound as well. Or I not right understand how handling case with zero size?

Because if memory.copy(x, y, 0) should also trap if x > memory.size or y > memory.size. We should also remove another one optimization rule

Upd: found this discussion:
WebAssembly/bulk-memory-operations#111

@conrad-watt
Copy link

Because if memory.copy(x, y, 0) should also trap if x > memory.size or y > memory.size. We should also remove another one optimization rule

That is currently the case. The check is described in the proposal overview.

Upd: found this discussion:
WebAssembly/bulk-memory-operations#111

To give extra context, the current semantics were decided as a result of discussions in WebAssembly/bulk-memory-operations#124 (towards the end).

@MaxGraey
Copy link
Contributor Author

Hmm, it seems in this case we should remove all reductions to nop in memory.copy optimization.
cc @kripken @tlively

@conrad-watt
Copy link

conrad-watt commented Aug 24, 2020

In the example you linked, would it be possible to preserve the trapping behaviour by replacing the 0-byte copy with a byte load on location max(x,y) + a drop?

@MaxGraey
Copy link
Contributor Author

MaxGraey commented Aug 24, 2020

n the example you linked, would it be possible to preserve the trapping behaviour by replacing the 0-byte copy with a byte load on location max(x,y) + a drop?

It will significantly increase binary size due to we add i32.select, extra 2 local reads, one i32.gt, one load instruction and one drop instruction. So it doesn't make any sense unfortunately

this variant slightly better:

(module
  (memory 0)
  (func (export "a") (param i32 i32)
    local.get 0
    i32.load
    drop
    local.get 1
    i32.load
    drop
  )
)

But also add 1 extra byte compare to

(module
  (memory 0)
  (func (export "a") (param i32 i32)
    local.get 0
    local.get 1
    i32.const 0
    memory.copy
  )
)

@conrad-watt
Copy link

conrad-watt commented Aug 24, 2020

Ah I see. Do constant zero-byte copies appear often in Wasm code passed to binaryen? The C/C++/Rust -> Wasm stage would have much more flexibility in eliminating them (or even the Wasm -> platform stage in engines).

@MaxGraey
Copy link
Contributor Author

MaxGraey commented Aug 24, 2020

Binaryen also used as prime codegenerator without LLVM for couple of languages. But even for LLVM it quite useful. emscripten, wasm-pack (Rust) and other toolkits or compilers use binaryen in final stage. The main goal of Binaryen is code size reduction. And you could always got mem.copy(x, x, s) or mem.copy(x, y, 0) cases after some optimizations passes like DCE, global / local simplification, inlining and etc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants