Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rounding semantics for i32x4.relaxed_dot_i8x16_i7x16_add_s #129

Closed
alexcrichton opened this issue Feb 27, 2023 · 5 comments
Closed

Rounding semantics for i32x4.relaxed_dot_i8x16_i7x16_add_s #129

alexcrichton opened this issue Feb 27, 2023 · 5 comments

Comments

@alexcrichton
Copy link
Contributor

I've stumbled upon an issue where the Overview.md isn't clear enough to me to figure out what to do and the included spec tests in this repository disagree with the implementation of this instruction in both v8 and SpiderMonkey. This instruction in these engines is implemented, on x86_64, with:

  • pmaddubsw - does the 8-to-16 dot product
  • pmaddwd - adds pairwise 16-bit results to 32-bit results
  • paddd - adds the third 32-bit argument

The spec tests currently have a test that looks like:

;; signed * unsigned   : -128 *  129 * 4 = -66,048 (+ 1)
;; signed * signed     : -128 * -127 * 4 =  65,024 (+ 1)
;; unsigned * unsigned :  128 *  129 * 2 =  66,048 (+ 1)
(assert_return (invoke "i32x4.relaxed_dot_i8x16_i7x16_add_s"
                       (v128.const i8x16 -128 -128 -128 -128 0 0 0 0 0 0 0 0 0 0 0 0)
                       (v128.const i8x16 -127 -127 -127 -127 0 0 0 0 0 0 0 0 0 0 0 0)
                       (v128.const i32x4 1 2 3 4))
               (either
                 (v128.const i32x4 -66047 2 3 4)
                 (v128.const i32x4  65025 2 3 4)
                 (v128.const i32x4  66049 2 3 4)))

but the implementation on x64 matches none of these answers. The first row is intended to be selected on x64 but the pmaddubsw does a saturating multiply-and-add so the intermediate result of the pmaddubsw instruction is -32768, not the intended -33024. This means that the final result of this instruction sequence is -65535, not the expected -66047 according to the spec test.

I'm not sure what the intention here and what's right and what's wrong. The Overview.md doesn't mention the saturating semantics so this lowering for x86_64 doesn't match the overview or the in-progress spec PR I think. On the other hand I suspect the intention of this instruction was to use pmaddubsw on x86_64 so it may be a case where the wording for this instruction needs to be updated?

While I'm here on this topic as well, some other side notes I would have are:

  • The spec tests for i16x8.relaxed_dot_i8x16_i7x16_s have a specific test for the saturating behavior, but this saturation I don't see mentioned in the Overview.md or the spec PR (which has an imul of the sign-extended operands which looks like a non-saturating mul?)
  • It seems that v8 and spidermonkey do not execute the spec tests as-written in this repository since in addition to this issue there are some minor mistakes in the i32x4_relaxed_trunc.wast test, for example 2147483647.0 can't be represented as an f32 so the first i32x4.relaxed_trunc_f32x4_s test fails and additionally some expectations for NaN use 0x8000000 as a constant which is one zero off from being INT_MIN (0x80000000). I was planning on sending a PR once I got everything working for these mistakes as they seem benign, but I wanted to additionally ask about the process. Given this evidence it seems that these spec tests aren't run by engines yet, but this seems like a possibility for improvement since presumably each engine has their own set of tests and ideally at least it seems like many of such tests should live in this repository or the official spec somehow.
@ngzhian
Copy link
Member

ngzhian commented Feb 27, 2023

#52 should be clearer on what the intention is.

The first row, -66047 is intended to be VPDPBUSD (AVX2-VNNI or AVX512-VNNI):

Multiply groups of 4 adjacent pairs of unsigned 8-bit integers in a with corresponding signed 8-bit integers in b, producing 4 intermediate signed 16-bit results. Sum these 4 results with the corresponding 32-bit integer in src, and store the packed 32-bit results in dst.

I think the spec test (and overview) is missing the case that you mentioned using pmaddubsw + pmaddwd + paddd.

which should be as you said:

  • -128 * 129 * 2 = -33024 saturated to -32768
  • -32758 + -32768 = -65536
  • -65536 + 1 = -65535

I'll add the missing test and update the overview.

Re: i16x8.relaxed_dot_i8x16_i7x16_s, the saturation should be allowed, #52 mentions it, it is missing from overview and likely spec text as well.

@ngzhian
Copy link
Member

ngzhian commented Feb 27, 2023

re: engine and spec tests, i appreciate all your reports on the bugs in the spec tests and will fix all of them. The spec test should at least cover all the edge cases, if I'm missing any, lmk.

I expect that engines will have their own tests, and then additionally run spec tests. It would be nice if spec tests is a superset of all engine tests, but it will require contribution effort.

I think engines don't run these tests yet, but as this proposal has been making swift progress, I think they are all starting to import and run the tests (and hence discover issues.)

@alexcrichton
Copy link
Contributor Author

This was addressed in #130 so closing, thanks!

Also FWIW I think the testing situation isn't something that's specific to this proposal itself, it's sort of more broad for all proposals. I have found it nontrivial to contribute tests in the past which deters me from proposing more tests, but nevertheless this isn't the place to discuss the nuances of test contributions I mostly wanted to ask and see if there was a shared body of tests elsewhere but sounds like there isn't.

@ngzhian
Copy link
Member

ngzhian commented Feb 27, 2023

This was addressed in #130 so closing, thanks!

Also FWIW I think the testing situation isn't something that's specific to this proposal itself, it's sort of more broad for all proposals. I have found it nontrivial to contribute tests in the past which deters me from proposing more tests, but nevertheless this isn't the place to discuss the nuances of test contributions I mostly wanted to ask and see if there was a shared body of tests elsewhere but sounds like there isn't.

Agree that there is friction to contribute tests to the spec. You have been really helpful with your bug reports. If you find something lacking in the tests, please continue filing bugs, and I can turn it into spec tests and get your reviews on those. Thanks!

@ngzhian
Copy link
Member

ngzhian commented Feb 27, 2023

Updated spec text in #115

I think the imul usage is correct, the imul is used in a helper operation that performs the (lane-wise) multiply (which can be signed * signed or unsigned * unsigned (depending on the host). The overflow happens when adding the intermediate 16-bit, which I was using a iadd, which is wrong.

I fixed that by making the addition a saturating addition.

In the signed * signed case, the addition should not overflow -128 * -128 * 2 = 32768, the saturation does nothing.

In the signed * unsigned case -128 * 129 * 2 = −33024 < -32768 so it saturates it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants