vertexcodec: Augment XOR delta encoding with rotation #823
Conversation
This can be useful to align the bits better in certain cases. However, rotates come with a set of questions: what's the granularity? which delta modes do they work in? how much do they cost to decode?

Ideally perhaps we would do fully orthogonal rotates. However, rotates can be done as a filter step; as such, from the decoder perspective it's best to focus on rotate applications that are minimal.

When combined with SUB deltas, rotates must be done after deltas get recombined. This makes them expensive, as this step is ~scalarized. When using XOR deltas, rotates can be done before or after undelta; that makes them cheaper and aligns XOR decoding with other delta forms, as the cost is similar to unzigzag.

Conceptually, there's also some value in this split: SUB deltas assume integer-like values and bit propagation from MSB to LSB, so bit alignment is expected; for arbitrary bitpacked data, SUB deltas end up crossing packing boundaries, so they are not optimal.

With XOR deltas, we could still choose to encode 8-bit or 16-bit rotates. However, since SIMD ISAs commonly lack per-lane rotates, 16-bit rotates require fixing both halves to the same rotation, which is not really better than a 32-bit rotate; 8-bit rotates could be done before transposition, but they require 12 bits (3x4) of extra channel encoding and 4x more time to encode, and are not much stronger than a single 32-bit rotate that neatly fits into the existing channel encoding.

This change only contains the encoding support; it does not add decoding support or actually change the rotate value from the default of zero; that will happen separately. The change also makes encoding substantially slower; future PRs are expected to improve this as well as introduce support for compression levels.

This contribution is sponsored by Valve.
With just 32-bit rotates, this is straightforward: it can be done after byte transpose, as we have fully assembled 32-bit values; doing this before XOR is correct and efficient since it can be done in parallel.
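A minimal sketch of why that ordering works, with illustrative names rather than the actual vertexcodec code (the PR itself does not add decoder support): rotation distributes over XOR, so a per-group rotate can be applied to the transposed 32-bit delta words in a bulk, SIMD-friendly pass before the sequential XOR accumulation.

```cpp
#include <stddef.h>
#include <stdint.h>

// Illustrative helper; the r == 0 guard avoids the undefined shift by 32.
static uint32_t rotl32(uint32_t v, int r)
{
	return r == 0 ? v : (v << r) | (v >> (32 - r));
}

// XOR undelta with a per-group rotation. Because
// rotl32(a, r) ^ rotl32(b, r) == rotl32(a ^ b, r), the rotation commutes
// with XOR accumulation, so it can be folded into a bulk pass over the
// delta words before (or after) the sequential scan.
static void undeltaXor(uint32_t* data, size_t count, uint32_t pred, int rot)
{
	for (size_t i = 0; i < count; ++i)
	{
		pred ^= rotl32(data[i], rot);
		data[i] = pred;
	}
}
```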
We try all 32 possibilities for every group and record the best one, assuming XOR is a good choice for the encoding. We then use this value when benchmarking XOR vs SUB variants. This makes the encoding significantly slower, of course, but we can implement better heuristics separately.
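A rough sketch of that exhaustive search, building on the rotl32 helper from the previous sketch; estimateXorRotSize below is a crude illustrative proxy (nonzero-byte count of the rotated XOR deltas, assuming a zero predictor), not the encoder's actual cost model.

```cpp
// Crude illustrative proxy for the encoded size of a group under XOR deltas
// with a given rotation: count nonzero bytes of the rotated deltas.
static size_t estimateXorRotSize(const uint32_t* data, size_t count, int rot)
{
	size_t size = 0;
	uint32_t prev = 0;

	for (size_t i = 0; i < count; ++i)
	{
		uint32_t delta = rotl32(data[i] ^ prev, rot);
		prev = data[i];

		for (int b = 0; b < 4; ++b)
			size += ((delta >> (b * 8)) & 0xff) != 0;
	}

	return size;
}

// Try every rotation for the group and return the one with the smallest
// estimated size; this value can then be reused when XOR is compared
// against the SUB delta variants.
static int bestXorRotation(const uint32_t* data, size_t count)
{
	size_t bestSize = estimateXorRotSize(data, count, 0);
	int bestRot = 0;

	for (int rot = 1; rot < 32; ++rot)
	{
		size_t size = estimateXorRotSize(data, count, rot);

		if (size < bestSize)
		{
			bestSize = size;
			bestRot = rot;
		}
	}

	return bestRot;
}
```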
When we removed 32-bit integer deltas, the trace output became wrong; we also now need to incorporate the rotate amount into the output.
This tests wider deltas as well as xor+rotates by forcing the pattern to be optimal to encode in a given fashion.
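Purely as an illustration of the kind of pattern that favors xor+rotate, not the actual test data: a narrow counter packed at a non-byte-aligned bit offset. SUB deltas smear across byte lanes whenever the counter wraps, while XOR deltas stay within the field's bits, and a 32-bit rotation moves them into a single byte lane.

```cpp
#include <stddef.h>
#include <stdint.h>

// Hypothetical pattern: an 8-bit counter packed at bit offset 13.
// v[i] ^ v[i-1] only touches bits 13..20 (two byte lanes); rotl32 by 19
// moves those bits into bits 0..7, so the rotated XOR deltas occupy a
// single byte lane, whereas 32-bit SUB deltas spill into the high bytes
// whenever the counter wraps.
static void fillRotFriendly(uint32_t* v, size_t count)
{
	for (size_t i = 0; i < count; ++i)
		v[i] = ((uint32_t)(i * 7) & 0xff) << 13;
}
```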
We no longer need to support the encoder without a working decoder.
The entire 32-bit rotation space is unnecessary for our purpose: rotating by 8 yields the same compression as rotating by 0 (and in general r+8 compresses the same as r), as it just changes the byte order, and bytes are compressed individually. This reduces the channel encoding consumption to 5 bits (2+3) and reduces the iteration during encoding by 4x. We could technically encode this using 4 bits (using 0, 1, 2..9 values...), but this doesn't meaningfully impact tail sizes, so it maybe isn't necessary.
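A small illustration of the byte-permutation argument, reusing the rotl32 helper from the first sketch: rotating by r + 8 produces the same four bytes as rotating by r, just assigned to different lanes, so after byte transposition the same streams get compressed and the total size is unchanged.

```cpp
#include <assert.h>

// Check that rotl32(v, r + 8) is just a byte-lane permutation of
// rotl32(v, r): byte k of the latter ends up as byte (k + 1) % 4.
static void checkRotateByteEquivalence(uint32_t v)
{
	for (int r = 0; r < 8; ++r)
	{
		uint32_t a = rotl32(v, r);
		uint32_t b = rotl32(v, r + 8);

		for (int k = 0; k < 4; ++k)
			assert(((b >> (((k + 1) % 4) * 8)) & 0xff) == ((a >> (k * 8)) & 0xff));
	}
}
```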
Shift right by 32 is technically UB; in practice it should not matter, because the two plausible hardware behaviors either keep v as is or return 0, and both result in the same final output, but we should fix this regardless.
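One standard, branch-free way to keep the rotate well-defined for a zero amount is to mask both shift counts so neither shift reaches 32; this is a sketch of that idiom, not necessarily the exact fix applied here.

```cpp
#include <stdint.h>

// For r == 0 this evaluates (v << 0) | (v >> 0) == v instead of the
// undefined v >> 32; for r in 1..31 it is the usual rotate.
static uint32_t rotl32_safe(uint32_t v, int r)
{
	return (v << (r & 31)) | (v >> ((32 - r) & 31));
}
```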
The XOR path only gets selected for unsigned int, but the code needs to remain warning-free for other instantiations of the template.
- We don't really need to mask off the bit rotation; out-of-range values are invalid and should never be encoded.
- We could replace the switch/case with a macro, similar to how we do the rest of the work for SIMD; this makes it possible to move the computation closer to where it's needed, to avoid relying too much on compiler reordering.
- Validate the channel after estimation against the valid encoding range.
There is some uncertainty about whether rotation will make it into the final v1 format; this depends on faster encoding support and on further analysis of the decoding performance tradeoffs between having rotates and not having rotates, and on using XOR for fast-decode paths. However, in the interest of streamlining development, I'm going to merge this speculatively. If further analysis determines that rotation doesn't sufficiently pull its weight, it will be removed. Some other commits in this PR would be nice to have regardless, so it's probably reasonable to do that removal in a separate commit when v1 gets finalized.