use simd masking for amd64&arm64 #326
e6fb843 to 0caa997
Finally got around to reviewing this. I'm not very familiar with writing assembly of any kind. Why use AVX2 instead of AVX-512?
Also, don't worry about the merge conflicts, I'll fix them myself.
2287b4d to cfca343
For some reason it slows down at the 512 byte benchmark. Not sure what's going on there.
More clearly:
Super weird.
cfca343 to 083d297
Disabling AVX2 seems to have fixed it.
1e8bf28 to 32d0aa1
The amd64 code looks good to me so far, but the arm64 code doesn't seem to produce any speedup, at least through qemu.
In fact, it's slower. Not sure what's going on.
Will test on a proper VM too.
7d0c6f4 to 9f298ec
json.Encoder is 42% faster than json.Marshal thanks to the memory reuse.
goos: linux
goarch: amd64
pkg: nhooyr.io/websocket/wsjson
cpu: 12th Gen Intel(R) Core(TM) i5-1235U
BenchmarkJSON/json.Encoder-12 3517579 340.2 ns/op 24 B/op 1 allocs/op
BenchmarkJSON/json.Marshal-12 2374086 484.3 ns/op 728 B/op 2 allocs/op
Closes coder#409
[qrvnl@dios ~/src/websocket] 130$ go test -bench=. ./wsjson/
goos: linux
goarch: amd64
pkg: nhooyr.io/websocket/wsjson
cpu: 12th Gen Intel(R) Core(TM) i5-1235U
BenchmarkJSON/json.Encoder/8-12 14041426 72.59 ns/op 110.21 MB/s 16 B/op 1 allocs/op
BenchmarkJSON/json.Encoder/16-12 13936426 86.99 ns/op 183.92 MB/s 16 B/op 1 allocs/op
BenchmarkJSON/json.Encoder/32-12 11416401 115.3 ns/op 277.59 MB/s 16 B/op 1 allocs/op
BenchmarkJSON/json.Encoder/128-12 4600574 264.7 ns/op 483.55 MB/s 16 B/op 1 allocs/op
BenchmarkJSON/json.Encoder/256-12 2710398 433.9 ns/op 590.06 MB/s 16 B/op 1 allocs/op
BenchmarkJSON/json.Encoder/512-12 1588930 717.3 ns/op 713.82 MB/s 16 B/op 1 allocs/op
BenchmarkJSON/json.Encoder/1024-12 823138 1484 ns/op 689.80 MB/s 16 B/op 1 allocs/op
BenchmarkJSON/json.Encoder/2048-12 402823 2875 ns/op 712.32 MB/s 16 B/op 1 allocs/op
BenchmarkJSON/json.Encoder/4096-12 213926 5602 ns/op 731.14 MB/s 16 B/op 1 allocs/op
BenchmarkJSON/json.Encoder/8192-12 92864 11281 ns/op 726.19 MB/s 16 B/op 1 allocs/op
BenchmarkJSON/json.Encoder/16384-12 39318 29203 ns/op 561.04 MB/s 19 B/op 1 allocs/op
BenchmarkJSON/json.Marshal/8-12 10768671 114.5 ns/op 69.89 MB/s 48 B/op 2 allocs/op
BenchmarkJSON/json.Marshal/16-12 10140996 113.9 ns/op 140.51 MB/s 64 B/op 2 allocs/op
BenchmarkJSON/json.Marshal/32-12 9211780 121.6 ns/op 263.06 MB/s 64 B/op 2 allocs/op
BenchmarkJSON/json.Marshal/128-12 4632796 264.2 ns/op 484.53 MB/s 224 B/op 2 allocs/op
BenchmarkJSON/json.Marshal/256-12 2441511 473.5 ns/op 540.65 MB/s 432 B/op 2 allocs/op
BenchmarkJSON/json.Marshal/512-12 1298788 896.2 ns/op 571.27 MB/s 912 B/op 2 allocs/op
BenchmarkJSON/json.Marshal/1024-12 602084 1866 ns/op 548.83 MB/s 1808 B/op 2 allocs/op
BenchmarkJSON/json.Marshal/2048-12 341151 3817 ns/op 536.61 MB/s 3474 B/op 2 allocs/op
BenchmarkJSON/json.Marshal/4096-12 175594 7034 ns/op 582.32 MB/s 6548 B/op 2 allocs/op
BenchmarkJSON/json.Marshal/8192-12 83222 15023 ns/op 545.30 MB/s 13591 B/op 2 allocs/op
BenchmarkJSON/json.Marshal/16384-12 33087 39348 ns/op 416.39 MB/s 27304 B/op 2 allocs/op
PASS
ok nhooyr.io/websocket/wsjson 32.934s
8e84a57 to 53c53c2
I guess QEMU's SIMD emulation harms performance.
On an Aliyun (Alibaba Cloud) Yitian 710 (arm64, ARMv8) 2c4g machine:
On an Aliyun (Alibaba Cloud) Ampere Altra (arm64, ARMv8) 2c4g machine:
Right on, thanks for testing @dixyes
AVX-512 is not widely supported, while AVX2 is everywhere.
I'm just not good enough at assembly. I added tests to confirm that @wdvxdr's implementation works correctly and matches the output of the basic masking loop.
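A sketch of what such an equivalence check could look like, under stated assumptions: `maskBasic` is a byte-at-a-time reference loop, and `maskWords` is a portable 8-bytes-at-a-time stand-in for the assembly routine. Both names and signatures are hypothetical and are not the ones used in this PR. The check runs both over many lengths and starting key offsets and requires identical output.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// maskBasic XORs b in place with the 4-byte key, one byte at a time,
// starting at key offset pos. It returns the key offset after the last byte.
func maskBasic(key [4]byte, pos int, b []byte) int {
	for i := range b {
		b[i] ^= key[pos&3]
		pos++
	}
	return pos & 3
}

// maskWords does the same work 8 bytes at a time. It stands in for a
// SIMD routine here; it is not the PR's assembly.
func maskWords(key [4]byte, pos int, b []byte) int {
	// Build an 8-byte key pattern aligned to the current key offset.
	var k [8]byte
	for i := range k {
		k[i] = key[(pos+i)&3]
	}
	k64 := binary.LittleEndian.Uint64(k[:])

	for len(b) >= 8 {
		v := binary.LittleEndian.Uint64(b)
		binary.LittleEndian.PutUint64(b, v^k64)
		b = b[8:]
	}
	// 8 is a multiple of 4, so the key offset is unchanged for the tail.
	return maskBasic(key, pos, b)
}

func main() {
	key := [4]byte{0x12, 0x34, 0x56, 0x78}
	for n := 0; n <= 64; n++ {
		for pos := 0; pos < 4; pos++ {
			want := make([]byte, n)
			for i := range want {
				want[i] = byte(i*7 + 1)
			}
			got := append([]byte(nil), want...)

			wantPos := maskBasic(key, pos, want)
			gotPos := maskWords(key, pos, got)
			if wantPos != gotPos || !bytes.Equal(want, got) {
				panic(fmt.Sprintf("mismatch at n=%d pos=%d", n, pos))
			}
		}
	}
	fmt.Println("ok")
}
```

Sweeping every length up to a few multiples of the word size, with every starting key offset, is what exercises the unaligned head and tail cases where a vectorized routine is most likely to go wrong.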
53c53c2 to c643e71
The standard library does this too. It's unfortunate; I wish they had just exposed it in the standard library. Perhaps we can isolate the specific code we need later.
c643e71 to 17e1b86
Final results:
Thanks again @wdvxdr1123 and sorry for the large delay.
goos: windows
goarch: amd64
pkg: nhooyr.io/websocket
cpu: Intel(R) Core(TM) i5-9300H CPU @ 2.40GHz
Benchmark_mask/2/basic-8 425339004 2.795 ns/op 715.66 MB/s
Benchmark_mask/2/nhooyr-8 379937766 3.186 ns/op 627.78 MB/s
Benchmark_mask/2/gorilla-8 392164167 3.071 ns/op 651.24 MB/s
Benchmark_mask/2/gobwas-8 310037222 3.880 ns/op 515.46 MB/s
Benchmark_mask/3/basic-8 321408024 3.806 ns/op 788.32 MB/s
Benchmark_mask/3/nhooyr-8 350726338 3.478 ns/op 862.58 MB/s
Benchmark_mask/3/gorilla-8 332217727 3.634 ns/op 825.43 MB/s
Benchmark_mask/3/gobwas-8 247376214 4.886 ns/op 614.01 MB/s
Benchmark_mask/4/basic-8 261182472 4.582 ns/op 872.91 MB/s
Benchmark_mask/4/nhooyr-8 381830712 3.262 ns/op 1226.05 MB/s
Benchmark_mask/4/gorilla-8 272616304 4.395 ns/op 910.04 MB/s
Benchmark_mask/4/gobwas-8 204574558 5.855 ns/op 683.19 MB/s
Benchmark_mask/8/basic-8 191330037 6.162 ns/op 1298.24 MB/s
Benchmark_mask/8/nhooyr-8 369694992 3.285 ns/op 2435.65 MB/s
Benchmark_mask/8/gorilla-8 175388466 6.743 ns/op 1186.48 MB/s
Benchmark_mask/8/gobwas-8 241719933 4.886 ns/op 1637.45 MB/s
Benchmark_mask/16/basic-8 100000000 10.92 ns/op 1464.83 MB/s
Benchmark_mask/16/nhooyr-8 272565096 4.436 ns/op 3606.98 MB/s
Benchmark_mask/16/gorilla-8 100000000 11.20 ns/op 1428.53 MB/s
Benchmark_mask/16/gobwas-8 221356798 5.405 ns/op 2960.45 MB/s
Benchmark_mask/32/basic-8 61476984 20.40 ns/op 1568.80 MB/s
Benchmark_mask/32/nhooyr-8 238665572 5.050 ns/op 6337.22 MB/s
Benchmark_mask/32/gorilla-8 100000000 12.09 ns/op 2647.28 MB/s
Benchmark_mask/32/gobwas-8 186077235 6.477 ns/op 4940.36 MB/s
Benchmark_mask/128/basic-8 14629720 80.90 ns/op 1582.19 MB/s
Benchmark_mask/128/nhooyr-8 181241968 6.565 ns/op 19497.98 MB/s
Benchmark_mask/128/gorilla-8 68308342 16.76 ns/op 7639.37 MB/s
Benchmark_mask/128/gobwas-8 94582026 12.97 ns/op 9872.11 MB/s
Benchmark_mask/512/basic-8 3921001 305.6 ns/op 1675.55 MB/s
Benchmark_mask/512/nhooyr-8 123102199 9.721 ns/op 52669.11 MB/s
Benchmark_mask/512/gorilla-8 32355914 38.18 ns/op 13411.43 MB/s
Benchmark_mask/512/gobwas-8 31528501 37.80 ns/op 13544.37 MB/s
Benchmark_mask/4096/basic-8 491804 2381 ns/op 1720.39 MB/s
Benchmark_mask/4096/nhooyr-8 26159691 46.98 ns/op 87187.73 MB/s
Benchmark_mask/4096/gorilla-8 4898440 243.6 ns/op 16817.89 MB/s
Benchmark_mask/4096/gobwas-8 4336398 277.2 ns/op 14776.40 MB/s
Benchmark_mask/16384/basic-8 113842 9623 ns/op 1702.66 MB/s
Benchmark_mask/16384/nhooyr-8 8088847 154.5 ns/op 106058.18 MB/s
Benchmark_mask/16384/gorilla-8 1282993 933.6 ns/op 17549.90 MB/s
Benchmark_mask/16384/gobwas-8 997347 1086 ns/op 15093.49 MB/s
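To read the table, the MB/s column can be recomputed from the payload size and the ns/op column, and the speedup of one implementation over another is the ratio of their ns/op values. The sketch below does this for the 16384-byte rows; `mbPerSec` and `speedup` are hypothetical helper names, and it assumes 1 MB = 10^6 bytes as in Go's benchmark output. Small differences from the reported MB/s are expected because ns/op is printed rounded.

```go
package main

import "fmt"

// mbPerSec converts a payload size in bytes and a time per operation in
// nanoseconds into megabytes per second (1 MB = 1e6 bytes).
func mbPerSec(sizeBytes int, nsPerOp float64) float64 {
	// bytes/ns * 1e9 ns/s / 1e6 bytes/MB = bytes/ns * 1e3
	return float64(sizeBytes) / nsPerOp * 1e3
}

// speedup reports how many times faster newNs is than baseNs.
func speedup(baseNs, newNs float64) float64 {
	return baseNs / newNs
}

func main() {
	// Values copied from the 16384-byte rows above.
	fmt.Printf("nhooyr throughput: %.0f MB/s\n", mbPerSec(16384, 154.5))
	fmt.Printf("nhooyr vs basic:   %.1fx\n", speedup(9623, 154.5))
	fmt.Printf("nhooyr vs gorilla: %.1fx\n", speedup(933.6, 154.5))
}
```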