vertexcodec: Augment XOR delta encoding with rotation #823

Merged: 10 commits, Dec 18, 2024

@zeux (Owner) commented Dec 18, 2024

This can be useful to align the bits better in certain cases. However,
rotates come with a set of questions: what's the granularity? which
delta modes do they work in? how much do they cost to decode?

Ideally perhaps we would do fully orthogonal rotates. However, rotates
can be done as a filter step; as such, from the decoder perspective it's
best to focus on rotate applications that are minimal.

When combined with SUB deltas, rotates must be done after deltas get
recombined. This makes them expensive, as this step is ~scalarized. When
using XOR deltas, rotates can be done before or after undelta; that
makes them cheaper and aligns XOR decoding with other delta forms, as
the cost is similar to unzigzag.
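
To illustrate why XOR deltas allow the rotate on either side of undelta, here is a minimal sketch. It is not the codec's actual code: the function names `encodeXorRotate`, `decodeRotateFirst` and `decodeRotateLast`, the rotation direction, and the scalar layout are all assumptions made for illustration. Because XOR is bitwise, rotating a XOR of two values equals the XOR of the rotated values, so both decode orders reconstruct the same data. Subtraction carries borrows across bit positions, so the same trick does not apply to SUB deltas.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Rotate helpers; the "& 31" keeps the second shift in range when r == 0.
inline uint32_t rotl32(uint32_t v, unsigned r)
{
	assert(r < 32);
	return (v << r) | (v >> ((32 - r) & 31));
}

inline uint32_t rotr32(uint32_t v, unsigned r)
{
	assert(r < 32);
	return (v >> r) | (v << ((32 - r) & 31));
}

// Hypothetical encoder: XOR delta against the previous value, then rotate.
std::vector<uint32_t> encodeXorRotate(const std::vector<uint32_t>& values, unsigned r)
{
	std::vector<uint32_t> deltas;
	uint32_t prev = 0;
	for (uint32_t v : values)
	{
		deltas.push_back(rotl32(v ^ prev, r));
		prev = v;
	}
	return deltas;
}

// Variant A: undo the rotation on each delta, then undelta.
// The rotates are independent of each other, so they could run in parallel.
std::vector<uint32_t> decodeRotateFirst(const std::vector<uint32_t>& deltas, unsigned r)
{
	std::vector<uint32_t> values;
	uint32_t prev = 0;
	for (uint32_t d : deltas)
	{
		prev ^= rotr32(d, r);
		values.push_back(prev);
	}
	return values;
}

// Variant B: undelta in rotated space, then undo the rotation on each result.
std::vector<uint32_t> decodeRotateLast(const std::vector<uint32_t>& deltas, unsigned r)
{
	std::vector<uint32_t> values;
	uint32_t prevRotated = 0;
	for (uint32_t d : deltas)
	{
		prevRotated ^= d;
		values.push_back(rotr32(prevRotated, r));
	}
	return values;
}
```

Both variants return the original values for any rotation amount in 0..31, which is what lets a decoder pick whichever placement is cheaper.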

Conceptually, there's also some value in this split: SUB deltas assume
integer-like values and bit propagation from MSB to LSB, so bit
alignment is expected; for arbitrary bitpacked data, SUB deltas end up
crossing packing thresholds so they are not optimal.

With XOR deltas, we could still choose to encode 8-bit or 16-bit
rotates. However, since SIMD ISAs commonly lack per-lane rotates, 16-bit
rotates require fixing both halves to do the same rotation which is not
really better than a 32-bit rotate; 8-bit rotates could be done before
transposition but require 12 bits (3x4) of extra channel encoding and 4x more
time to encode, and are not much stronger than a single 32-bit rotate that
neatly fits into existing channel encoding.

This change makes encoding substantially slower; however, future PRs are
expected to improve this as well as introduce support for compression
levels.

This contribution is sponsored by Valve.

zeux added 7 commits December 18, 2024 10:45
This change only contains the encoding support, with no decoding support
and without actually changing the rotate value from the default of zero;
that will happen separately.

With just 32-bit rotates, this is straightforward: it can be done after
byte transpose, as we have fully assembled 32-bit values; doing this
before XOR is correct and efficient since it can be done in parallel.

We try all 32 possibilities for every group and record the best one,
assuming XOR is a good choice of encoding. Then we use this value when
benchmarking XOR vs SUB variants.

This makes the encoding significantly slower, of course, but we can
implement better heuristics separately.
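
The exhaustive search can be sketched roughly as follows. This is only an illustration: `estimateCost` is a hypothetical stand-in that counts nonzero bytes in the rotated XOR deltas, whereas the real encoder estimates the actual encoded size, and `findBestRotate` is likewise an assumed name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

inline uint32_t rotl32(uint32_t v, unsigned r)
{
	assert(r < 32);
	return (v << r) | (v >> ((32 - r) & 31));
}

// Hypothetical cost proxy (assumption): number of nonzero bytes across all
// rotated XOR deltas. Fewer nonzero bytes is treated as "compresses better".
size_t estimateCost(const std::vector<uint32_t>& values, unsigned r)
{
	size_t cost = 0;
	uint32_t prev = 0;
	for (uint32_t v : values)
	{
		uint32_t d = rotl32(v ^ prev, r);
		for (int k = 0; k < 4; ++k)
			cost += ((d >> (k * 8)) & 0xff) != 0;
		prev = v;
	}
	return cost;
}

// Try every rotation and keep the first one with the lowest estimated cost.
unsigned findBestRotate(const std::vector<uint32_t>& values)
{
	unsigned best = 0;
	size_t bestCost = estimateCost(values, 0);
	for (unsigned r = 1; r < 32; ++r)
	{
		size_t cost = estimateCost(values, r);
		if (cost < bestCost)
		{
			best = r;
			bestCost = cost;
		}
	}
	return best;
}
```

For example, values whose changing bits sit in bit positions 6..9 straddle a byte boundary; a rotation by 2 moves those bits into a single byte, which this proxy rewards. The cost of the loop (32 full passes per group) is also why the encoder slows down noticeably.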
When we removed 32-bit integer deltas the trace output became wrong;
we also now need to incorporate the rotate amount into the output.

This tests wider deltas as well as XOR+rotates by forcing the pattern to
be optimal to encode in a given fashion.

We no longer need to support an encoder without a working decoder.

The entire 32-bit rotation space is unnecessary for our purpose:
rotating by 8 yields the same compression as rotating by 0, as it just
changes the byte order and bytes are compressed individually.

This reduces the channel encoding consumption to 5 bits (2+3) and
reduces the iteration during encoding 4x. We could technically encode
using 4 bits (using 0, 1, 2..9 values...), but this doesn't meaningfully
impact tail sizes so maybe isn't that necessary.
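
The byte-order argument can be checked with a small sketch. The helper names `sortedBytes` and `sameBytesUpToOrder` are made up for this example. Rotating by an extra 8 bits only moves whole bytes between positions, so the set of byte values is unchanged, and a rotation of r and a rotation of r + 8 produce the same bytes in a different order.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>

inline uint32_t rotl32(uint32_t v, unsigned r)
{
	assert(r < 32);
	return (v << r) | (v >> ((32 - r) & 31));
}

// Split a 32-bit value into its four bytes and sort them, so that two values
// compare equal exactly when they contain the same bytes in any order.
std::array<uint8_t, 4> sortedBytes(uint32_t v)
{
	std::array<uint8_t, 4> b = {uint8_t(v), uint8_t(v >> 8), uint8_t(v >> 16), uint8_t(v >> 24)};
	std::sort(b.begin(), b.end());
	return b;
}

// True if rotating by r and by r + 8 yields the same bytes up to ordering.
bool sameBytesUpToOrder(uint32_t v, unsigned r)
{
	assert(r < 24);
	return sortedBytes(rotl32(v, r)) == sortedBytes(rotl32(v, r + 8));
}
```

Since each byte position is compressed on its own, only the rotation amount modulo 8 matters, which is consistent with the 3 bits mentioned above.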
@zeux changed the title from "vertexcodec: Augment XOR delta encoding with rotation support" to "vertexcodec: Augment XOR delta encoding with rotation" on Dec 18, 2024
zeux added 2 commits December 18, 2024 14:41
Shift right by 32 is technically UB; in practice it should not matter
because two valid HW behaviors would either keep v as is, or return 0,
and both result in the same final output, but we should fix this
regardless.
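
For context, the undefined shift shows up in the textbook rotate when the rotation amount is 0. The sketch below shows one common way to avoid it; it is not necessarily the fix used in this PR, and the name `rotateLeftSafe` is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Naive form, shown only as a comment: for r == 0 the right shift is by 32,
// which is undefined behavior for a 32-bit operand in C++.
//   uint32_t bad = (v << r) | (v >> (32 - r));

// Masking the complementary shift keeps it in 0..31; for r == 0 both shifts
// become 0 and the value is returned unchanged.
inline uint32_t rotateLeftSafe(uint32_t v, unsigned r)
{
	assert(r < 32);
	return (v << r) | (v >> ((32 - r) & 31));
}
```

Compilers typically recognize this pattern and emit a single rotate instruction where one is available.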
The XOR path only gets selected for unsigned int, but the code needs to
remain warning-free for other instantiations of the template.

- We don't really need to mask off the bit rotation; the out of range
  values are invalid and should never be encoded.

- We could replace switch/case using a macro similar to how we do the
  rest of the work for SIMD; this makes it possible to move the
  computation closer to where it's needed to avoid relying too much on
  compiler reordering.

- Validate the channel after estimation against the valid encoding range.
@zeux (Owner, Author) commented Dec 18, 2024

There is some uncertainty about whether rotation will make it into the final v1 format; this depends on faster encoding support and further analysis of decoding performance tradeoffs between having rotates and not having rotates and using XOR for fast-decode paths. However, in the interest of streamlining development I'm going to merge this speculatively.

If further analysis determines that rotation doesn't sufficiently pull its weight, it will be removed. Some other commits in this PR would be nice to have regardless, so it's probably reasonable to do this in a separate commit when v1 gets finalized.

@zeux zeux merged commit ee7fef9 into master Dec 18, 2024
12 checks passed
@zeux zeux deleted the vcone-rotx branch December 18, 2024 23:38