vertexcodec: Augment XOR delta encoding with rotation #823
Conversation
This can be useful to align the bits better in certain cases. However, rotates come with a set of questions: what's the granularity? which delta modes do they work in? how much do they cost to decode?

Ideally perhaps we would do fully orthogonal rotates. However, rotates can be done as a filter step; as such, from the decoder perspective it's best to focus on rotate applications that are minimal.

When combined with SUB deltas, rotates must be done after deltas get recombined. This makes them expensive, as this step is ~scalarized. When using XOR deltas, rotates can be done before or after undelta; that makes them cheaper and aligns XOR decoding with other delta forms, as the cost is similar to unzigzag.

Conceptually, there's also some value in this split: SUB deltas assume integer-like values and bit propagation from MSB to LSB, so bit alignment is expected; for arbitrary bitpacked data, SUB deltas end up crossing packing boundaries, so they are not optimal.

With XOR deltas, we could still choose to encode 8-bit or 16-bit rotates. However, since SIMD ISAs commonly lack per-lane rotates, 16-bit rotates require fixing both halves to the same rotation, which is not really better than a 32-bit rotate; 8-bit rotates could be done before transposition, but they require 12 bits (3x4) of extra channel encoding and 4x more time to encode, and are not much stronger than a single 32-bit rotate that neatly fits into the existing channel encoding.

This change only contains the encoding support; it does not add decoding support or actually change the rotate value from the default of zero; that will happen separately. The change also makes encoding substantially slower; future PRs are expected to improve this as well as introduce support for compression levels.

This contribution is sponsored by Valve.
With just 32-bit rotates, this is straightforward: it can be done after byte transpose, as we have fully assembled 32-bit values; doing this before XOR is correct and efficient since it can be done in parallel.
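A minimal sketch of why that ordering works, with illustrative names rather than the actual vertexcodec code (the PR itself does not add decoder support): rotation distributes over XOR, so a per-group rotate can be applied to the transposed 32-bit delta words in a bulk, SIMD-friendly pass before the sequential XOR accumulation.

```cpp
#include <stddef.h>
#include <stdint.h>

// Illustrative helper; the r == 0 guard avoids the undefined shift by 32.
static uint32_t rotl32(uint32_t v, int r)
{
	return r == 0 ? v : (v << r) | (v >> (32 - r));
}

// XOR undelta with a per-group rotation. Because
// rotl32(a, r) ^ rotl32(b, r) == rotl32(a ^ b, r), the rotation commutes
// with XOR accumulation, so it can be folded into a bulk pass over the
// delta words before (or after) the sequential scan.
static void undeltaXor(uint32_t* data, size_t count, uint32_t pred, int rot)
{
	for (size_t i = 0; i < count; ++i)
	{
		pred ^= rotl32(data[i], rot);
		data[i] = pred;
	}
}
```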
We try all 32 possibilities for every group and record the best one, assuming XOR is a good choice for the encoding. We then use this value when benchmarking XOR vs SUB variants. This makes the encoding significantly slower, of course, but we can implement better heuristics separately.
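A rough sketch of that exhaustive search, building on the rotl32 helper from the previous sketch; estimateXorRotSize below is a crude illustrative proxy (nonzero-byte count of the rotated XOR deltas, assuming a zero predictor), not the encoder's actual cost model.

```cpp
// Crude illustrative proxy for the encoded size of a group under XOR deltas
// with a given rotation: count nonzero bytes of the rotated deltas.
static size_t estimateXorRotSize(const uint32_t* data, size_t count, int rot)
{
	size_t size = 0;
	uint32_t prev = 0;

	for (size_t i = 0; i < count; ++i)
	{
		uint32_t delta = rotl32(data[i] ^ prev, rot);
		prev = data[i];

		for (int b = 0; b < 4; ++b)
			size += ((delta >> (b * 8)) & 0xff) != 0;
	}

	return size;
}

// Try every rotation for the group and return the one with the smallest
// estimated size; this value can then be reused when XOR is compared
// against the SUB delta variants.
static int bestXorRotation(const uint32_t* data, size_t count)
{
	size_t bestSize = estimateXorRotSize(data, count, 0);
	int bestRot = 0;

	for (int rot = 1; rot < 32; ++rot)
	{
		size_t size = estimateXorRotSize(data, count, rot);

		if (size < bestSize)
		{
			bestSize = size;
			bestRot = rot;
		}
	}

	return bestRot;
}
```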
When we removed 32-bit integer deltas, the trace output became wrong; we also now need to incorporate the rotate amount into the output.
This tests wider deltas as well as xor+rotates by forcing the pattern to be optimal to encode in a given fashion.
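Purely as an illustration of the kind of pattern that favors xor+rotate, not the actual test data: a narrow counter packed at a non-byte-aligned bit offset. SUB deltas smear across byte lanes whenever the counter wraps, while XOR deltas stay within the field's bits, and a 32-bit rotation moves them into a single byte lane.

```cpp
#include <stddef.h>
#include <stdint.h>

// Hypothetical pattern: an 8-bit counter packed at bit offset 13.
// v[i] ^ v[i-1] only touches bits 13..20 (two byte lanes); rotl32 by 19
// moves those bits into bits 0..7, so the rotated XOR deltas occupy a
// single byte lane, whereas 32-bit SUB deltas spill into the high bytes
// whenever the counter wraps.
static void fillRotFriendly(uint32_t* v, size_t count)
{
	for (size_t i = 0; i < count; ++i)
		v[i] = ((uint32_t)(i * 7) & 0xff) << 13;
}
```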
We no longer need to support the encoder without a working decoder.
The entire 32-bit rotation space is unnecessary for our purpose: rotating by 8 yields the same compression as rotating by 0 (and in general r+8 compresses the same as r), as it just changes the byte order, and bytes are compressed individually. This reduces the channel encoding consumption to 5 bits (2+3) and reduces the iteration during encoding by 4x. We could technically encode this using 4 bits (using 0, 1, 2..9 values...), but this doesn't meaningfully impact tail sizes, so it maybe isn't necessary.
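A small illustration of the byte-permutation argument, reusing the rotl32 helper from the first sketch: rotating by r + 8 produces the same four bytes as rotating by r, just assigned to different lanes, so after byte transposition the same streams get compressed and the total size is unchanged.

```cpp
#include <assert.h>

// Check that rotl32(v, r + 8) is just a byte-lane permutation of
// rotl32(v, r): byte k of the latter ends up as byte (k + 1) % 4.
static void checkRotateByteEquivalence(uint32_t v)
{
	for (int r = 0; r < 8; ++r)
	{
		uint32_t a = rotl32(v, r);
		uint32_t b = rotl32(v, r + 8);

		for (int k = 0; k < 4; ++k)
			assert(((b >> (((k + 1) % 4) * 8)) & 0xff) == ((a >> (k * 8)) & 0xff));
	}
}
```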
Shift right by 32 is technically UB; in practice it should not matter, because the two plausible hardware behaviors either keep v as is or return 0, and both result in the same final output, but we should fix this regardless.
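One standard, branch-free way to keep the rotate well-defined for a zero amount is to mask both shift counts so neither shift reaches 32; this is a sketch of that idiom, not necessarily the exact fix applied here.

```cpp
#include <stdint.h>

// For r == 0 this evaluates (v << 0) | (v >> 0) == v instead of the
// undefined v >> 32; for r in 1..31 it is the usual rotate.
static uint32_t rotl32_safe(uint32_t v, int r)
{
	return (v << (r & 31)) | (v >> ((32 - r) & 31));
}
```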
The XOR path only gets selected for unsigned int, but the code needs to remain warning-free for other instantiations of the template.
- We don't really need to mask off the bit rotation; out-of-range values are invalid and should never be encoded.
- We could replace the switch/case with a macro, similar to how we do the rest of the work for SIMD; this makes it possible to move the computation closer to where it's needed, to avoid relying too much on compiler reordering.
- Validate the channel after estimation against the valid encoding range.
There is some uncertainty about whether rotation will make it into the final v1 format; this depends on faster encoding support and on further analysis of the decoding performance tradeoffs between having rotates and not having rotates, and on using XOR for fast-decode paths. However, in the interest of streamlining development, I'm going to merge this speculatively. If further analysis determines that rotation doesn't sufficiently pull its weight, it will be removed. Some other commits in this PR would be nice to have regardless, so it's probably reasonable to do that removal in a separate commit when v1 gets finalized.