Faster encoding for lengths in BufferBackend. #51
Conversation
Instead of varlen encoding, which requires up to 8 conditional branches in a loop, we use a simpler encoding which is just 4 instructions without branching.

New encoding:

* length <= 254: `<1-byte length>`
* otherwise: `<0xFF> <usize length> <0xFF>`

Improves throughput by 10-80%, with the biggest improvement in `resolve()`.

    get_or_intern/fill-empty/new/BufferBackend
            time:   [6.4562 ms 6.5068 ms 6.5684 ms]
            thrpt:  [15.224 Melem/s 15.369 Melem/s 15.489 Melem/s]
     change:
            time:   [-8.4289% -7.3761% -6.1844%] (p = 0.00 < 0.05)
            thrpt:  [+6.5920% +7.9635% +9.2048%]
            Performance has improved.
    Found 7 outliers among 100 measurements (7.00%)
      2 (2.00%) high mild
      5 (5.00%) high severe

    get_or_intern/fill-empty/with_capacity/BufferBackend
            time:   [3.8182 ms 3.8467 ms 3.8796 ms]
            thrpt:  [25.776 Melem/s 25.996 Melem/s 26.190 Melem/s]
     change:
            time:   [-18.750% -18.024% -17.246%] (p = 0.00 < 0.05)
            thrpt:  [+20.840% +21.986% +23.077%]
            Performance has improved.
    Found 7 outliers among 100 measurements (7.00%)
      2 (2.00%) high mild
      5 (5.00%) high severe

    get_or_intern/already-filled/BufferBackend
            time:   [2.7638 ms 2.8026 ms 2.8477 ms]
            thrpt:  [35.116 Melem/s 35.681 Melem/s 36.182 Melem/s]
     change:
            time:   [-7.9910% -6.4121% -4.8432%] (p = 0.00 < 0.05)
            thrpt:  [+5.0897% +6.8514% +8.6851%]
            Performance has improved.
    Found 5 outliers among 100 measurements (5.00%)
      1 (1.00%) high mild
      4 (4.00%) high severe

    get_or_intern_static/BufferBackend/get_or_intern
            time:   [2.8467 µs 2.8632 µs 2.8807 µs]
            thrpt:  [18.051 Melem/s 18.162 Melem/s 18.267 Melem/s]
     change:
            time:   [-1.7629% +0.8260% +4.7452%] (p = 0.71 > 0.05)
            thrpt:  [-4.5302% -0.8193% +1.7945%]
            No change in performance detected.
    Found 7 outliers among 100 measurements (7.00%)
      3 (3.00%) high mild
      4 (4.00%) high severe

    get_or_intern_static/BufferBackend/get_or_intern_static
            time:   [2.8259 µs 2.8456 µs 2.8634 µs]
            thrpt:  [18.160 Melem/s 18.274 Melem/s 18.401 Melem/s]
     change:
            time:   [-6.7386% -3.1051% -0.1965%] (p = 0.06 > 0.05)
            thrpt:  [+0.1969% +3.2046% +7.2255%]
            No change in performance detected.
    Found 3 outliers among 100 measurements (3.00%)
      1 (1.00%) high mild
      2 (2.00%) high severe

    resolve/already-filled/BufferBackend
            time:   [239.71 µs 242.17 µs 245.19 µs]
            thrpt:  [407.85 Melem/s 412.93 Melem/s 417.16 Melem/s]
     change:
            time:   [-46.443% -45.470% -44.561%] (p = 0.00 < 0.05)
            thrpt:  [+80.377% +83.386% +86.719%]
            Performance has improved.
    Found 10 outliers among 100 measurements (10.00%)
      4 (4.00%) high mild
      6 (6.00%) high severe

    get/already-filled/BufferBackend
            time:   [2.3035 ms 2.3222 ms 2.3427 ms]
            thrpt:  [42.686 Melem/s 43.062 Melem/s 43.411 Melem/s]
     change:
            time:   [-2.6436% -1.3829% -0.0703%] (p = 0.03 < 0.05)
            thrpt:  [+0.0703% +1.4023% +2.7154%]
            Change within noise threshold.
    Found 2 outliers among 100 measurements (2.00%)
      1 (1.00%) high mild
      1 (1.00%) high severe

    iter/already-filled/BufferBackend
            time:   [278.13 µs 279.55 µs 281.19 µs]
            thrpt:  [355.64 Melem/s 357.72 Melem/s 359.54 Melem/s]
     change:
            time:   [-46.368% -42.429% -39.796%] (p = 0.00 < 0.05)
            thrpt:  [+66.102% +73.699% +86.454%]
            Performance has improved.
    Found 5 outliers among 100 measurements (5.00%)
      2 (2.00%) high mild
      3 (3.00%) high severe
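A minimal sketch of how such a length encoding might look (illustrative names, not the PR's actual code; assumes a 64-bit `usize` and that at least 9 readable bytes always precede a string, e.g. via front padding of the buffer):

```rust
/// Appends the length prefix for a string of `len` bytes.
/// Lengths up to 254 fit in a single byte; longer lengths are written as
/// a full little-endian usize framed by 0xFF sentinel bytes.
fn encode_len(buffer: &mut Vec<u8>, len: usize) {
    if len <= 254 {
        buffer.push(len as u8);
    } else {
        buffer.push(0xFF);
        buffer.extend_from_slice(&len.to_le_bytes());
        buffer.push(0xFF);
    }
}

/// Decodes a length from the fixed-size window of bytes ending right
/// before a string's first byte. The window is read unconditionally, so
/// the selection can compile to a conditional move rather than a branch.
fn decode_len_backwards(window: &[u8; 9]) -> usize {
    let short = window[8] as usize;
    let long = usize::from_le_bytes(window[..8].try_into().unwrap());
    if window[8] == 0xFF { long } else { short }
}
```

The trailing 0xFF is what makes the backwards decode possible; the leading 0xFF presumably plays the same role when scanning forwards (e.g. in `iter`).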
Thanks a lot for the PR. Benchmarks look good. Gonna give it a proper review after the CI has passed.
    // Read the fixed-size lookbehind window that ends at the string's first
    // byte, decode the length from it, then slice out the string bytes.
    let len_bytes =
        unsafe { self.buffer.get_unchecked(index - LEN_LOOKBEHIND..index) };
    let len_bytes = <&[u8; LEN_LOOKBEHIND]>::try_from(len_bytes).unwrap();
    let len = decode_len_backwards(len_bytes);
    let str_bytes = unsafe { self.buffer.get_unchecked(index..index + len) };
Would this fix #49 ?
No; the usize length can be any sequence of 8 bytes, some of which may not be valid UTF-8.
One possibility is a more complex encoding for the usize, with 7 bits per byte and the 8th bit left unset, which would keep it within the ASCII range. The problem is that this encoding/decoding requires a shift and an OR per byte, which would defeat most of the performance gains in this PR.
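For concreteness, a portable sketch of that per-byte shift-and-OR decode (hypothetical helper, not code from this PR):

```rust
/// Unpacks 8 bytes that each carry 7 data bits (top bit clear) into a
/// value of up to 56 bits: one shift and one OR per byte.
fn decode_7bit_scalar(encoded: u64) -> u64 {
    let mut value = 0u64;
    for i in 0..8 {
        let byte = (encoded >> (8 * i)) & 0x7f;
        value |= byte << (7 * i);
    }
    value
}
```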
On x86 machines with fast pext/pdep instructions (Intel Haswell and later; AMD Zen 3 and later) we can use `pext rax, 0x7f7f7f7f7f7f7f7f` for this decoding and `pdep rax, 0x7f7f7f7f7f7f7f7f` for the encoding. Sadly, older machines either don't have these instructions (Intel) or run them very slowly (Zen 1 and Zen 2).
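A sketch of the BMI2 route (hypothetical helpers; `_pdep_u64`/`_pext_u64` are the `std::arch::x86_64` intrinsics behind these instructions):

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::{_pdep_u64, _pext_u64};

const ASCII_MASK: u64 = 0x7f7f_7f7f_7f7f_7f7f;

/// Spreads up to 56 bits across 8 bytes, 7 bits per byte, leaving the
/// top bit of every byte clear (ASCII-safe).
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "bmi2")]
unsafe fn encode_7bit(value: u64) -> u64 {
    _pdep_u64(value, ASCII_MASK)
}

/// Inverse of `encode_7bit`: gathers the 7 data bits back out of each byte.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "bmi2")]
unsafe fn decode_7bit(encoded: u64) -> u64 {
    _pext_u64(encoded, ASCII_MASK)
}
```

A caller would have to gate these behind a runtime check such as `is_x86_feature_detected!("bmi2")` and fall back to the scalar loop above.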
The "bit-table indicating string start" approach would allow a fast (single-instruction) check, but at the cost of 12.5% space overhead. If we required strings to be aligned at 2 bytes, we could have a bit per 2 bytes (6.25% overhead) or a bit per 4 bytes (3.125% overhead), etc.
I'm not sure which approach is best. I'd probably go with a bit-table with 2 byte alignment as the least-worst compromise.
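For illustration, one shape such a start-bit table could take (hypothetical type, not part of this PR), with one bit per 2-byte-aligned buffer position:

```rust
/// Marks which 2-byte-aligned positions in the backing buffer are the
/// first byte of an interned string, so validating a symbol is a single
/// bit test. Overhead: 1 bit per 2 buffer bytes (~6.25%).
struct StartBits {
    words: Vec<u64>,
}

impl StartBits {
    fn new() -> Self {
        Self { words: Vec::new() }
    }

    /// Records that a string starts at `index` (must be 2-byte aligned).
    fn mark(&mut self, index: usize) {
        debug_assert_eq!(index % 2, 0);
        let slot = index / 2;
        let (word, bit) = (slot / 64, slot % 64);
        if word >= self.words.len() {
            self.words.resize(word + 1, 0);
        }
        self.words[word] |= 1 << bit;
    }

    /// Cheap check: does an interned string start at `index`?
    fn is_start(&self, index: usize) -> bool {
        if index % 2 != 0 {
            return false;
        }
        let slot = index / 2;
        let (word, bit) = (slot / 64, slot % 64);
        self.words.get(word).map_or(false, |&w| (w >> bit) & 1 == 1)
    }
}
```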
@reinerp Sorry for taking so long (a whole year ...). I want to release a new version including your improvements soon. Could you resolve the conflicts and rebase onto the current branch?
On my MacBook Pro M2 these are the speed-ups I see:

So no speed-ups for me for …
Closed due to inactivity. @reinerp feel free to re-open if you want to work on this.