vertexcodec: Initial implementation of experimental decoders #811

Closed · wants to merge 25 commits

Conversation


zeux (Owner) commented Dec 6, 2024

This PR prototypes initial support for the next vertex codec version.

The change introduces version 0xe; this version has zero compatibility guarantees and will be in a state of flux with respect to the bitstream format until it is finalized as version 1. The goals for the next version are to retain decoding performance comparable to v0 -- currently v0 decodes at around 4.5 GB/s on Zen 4 (7950X) and around 3.5 GB/s on M2 (base) -- and to increase the compression ratio, targeting up to ~10% relative improvement on average for a set of geometry data. This will make the decoder more complicated and larger, but hopefully not dramatically so. Some performance loss is probably unavoidable; hopefully it can be kept to ~5% or so.

In this PR the initial infrastructure (version selection, an enhanced codectest harness) is set up, and byte group encoding is expanded to support a palette of bit counts, adding support for 1-bit and 6-bit groups as well as literal byte blocks. This results in the following aggregate improvements on the test set (~half of it is glTF data, ~half is proprietary engine data).

143 files: raw v1/v0 -4.83%, lz4 v1/v0 -3.37%, zstd v1/v0 -0.79%

It's not certain yet that the decoding performance can be maintained with this expansion due to some issues around codegen with compilers, but that will be investigated separately. To get to the target compression ratio improvements, more changes will be necessary but they will be orthogonal to byte group encoding enhancements.

This contribution is sponsored by Valve.

zeux added 25 commits December 3, 2024 11:29
In this mode, we take a set of file paths, load each one and recompress
it. The file names should have the form:

	vN_sK_name_etc

... where N is the number of vertices and K is the vertex stride. Each
file may be either compressed with the vertex codec or uncompressed. The
file will be decompressed, recompressed, and various stats will be
measured.

Right now this returns meaningless (zero) stats since the compression
algorithm is unchanged; it will, however, be useful for developing the
next version of the vertex codec.
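As an illustration, parsing such a file name could look roughly like this (a sketch; the helper name and exact error handling are hypothetical):

```cpp
#include <stdio.h>

// Hypothetical sketch: extract vertex count (N) and stride (K) from a file
// name of the form vN_sK_name_etc; returns false if the prefix doesn't match.
static bool parseTestFileName(const char* name, size_t* vertex_count, size_t* vertex_size)
{
	unsigned long long n = 0, k = 0;
	if (sscanf(name, "v%llu_s%llu_", &n, &k) != 2)
		return false;

	*vertex_count = (size_t)n;
	*vertex_size = (size_t)k;
	return true;
}
```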
This version is *not* for production; we allow setting it via the
official API so that we can test and incrementally develop it, but the
format will not be finalized for a while. When it is, the version number
will be changed to 1, and existing calls requesting 0xe / data encoded
with 0xe will fail to encode/decode.
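For context, opting into the experimental version goes through the existing public API; a minimal sketch (the wrapper function is hypothetical, 0xe is the experimental version from this PR):

```cpp
#include "meshoptimizer.h"

#include <vector>

// Hypothetical wrapper: encode a vertex buffer with the experimental codec
// version. meshopt_encodeVertexVersion selects the bitstream version for
// subsequent meshopt_encodeVertexBuffer calls.
std::vector<unsigned char> encodeExperimental(const void* vertices, size_t vertex_count, size_t vertex_size)
{
	meshopt_encodeVertexVersion(0xe); // experimental; zero compatibility guarantees until finalized as version 1

	std::vector<unsigned char> buffer(meshopt_encodeVertexBufferBound(vertex_count, vertex_size));
	buffer.resize(meshopt_encodeVertexBuffer(buffer.data(), buffer.size(), vertices, vertex_count, vertex_size));
	return buffer;
}
```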
We don't encode these yet but once we do, we'll need to correctly gather
stats for tracing. For simplicity, we gather all bit group sizes but
only display those that will be encoded in v1.
Instead of hardcoding bit counts, we can now encode using an arbitrary
set of four bit counts. When using the original v0 values this produces
the exact same output, but it gives us an opportunity to select a
different set of bit counts inside the same block.

As part of this, bits=1 is changed to bits=0; bits=0 was always the
correct label, as bits=1 properly refers to a 1-bit mask plus sentinels.
We previously only supported power-of-two bit counts, as that allowed
encoding an individual byte in registers. To support other bit counts
like 6, we need a more general encoding that accumulates bits and
flushes bytes when ready.

This change retains the same encoding performance.
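The accumulate-and-flush idea could be sketched roughly like this (hypothetical names; the real encoder is structured around fixed-width paths and also handles sentinel values, which this omits):

```cpp
// Hypothetical sketch: pack 16 values of `bits` width each into `out`,
// accumulating bits and flushing whole bytes as they become available.
// Assumes 1 <= bits <= 8 and that each value already fits in `bits` bits.
static unsigned char* encodeBytesGroupAny(unsigned char* out, const unsigned char values[16], int bits)
{
	unsigned int accum = 0;
	int count = 0; // number of pending bits currently held in accum

	for (int i = 0; i < 16; ++i)
	{
		accum = (accum << bits) | values[i];
		count += bits;

		while (count >= 8)
		{
			count -= 8;
			*out++ = (unsigned char)(accum >> count);
		}
	}

	if (count > 0)
		*out++ = (unsigned char)(accum << (8 - count)); // pad the final partial byte with zero bits

	return out;
}
```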
Also update codectest to use 0xe to trigger the new version encoder.
This makes the code flow easier to follow, and will make adjustments to
delta encoding easier to implement.
In the v0 format, we encode using 0, 2, 4 and 8-bit groups. This provides
reasonable coverage, but misses some other useful bit groups; notably,
1-bit groups (with sentinel encoding) can be used to more efficiently
encode mostly-0 sequences without spending an extra 2 bytes on bit
masks, and 6-bit groups (with sentinel encoding) provide a middle ground
between 4-bit and 8-bit that can be useful for data that is otherwise
difficult to compress.

While we could choose one of 6 bit widths for every 16-byte group instead
of one of 4, this noticeably increases the header overhead. Instead, we
can preselect a few 4-element arrangements of bit counts, and use control
bits to select the right one.

For now we choose a hard-coded palette of 3 entries. The fourth entry may
be useful for literal encoding, to save space on header bits for very
incompressible blocks. The palette itself is subject to change.

To select the palette entry, we allocate 2-bit control metadata for
every byte channel in the vertex block. We will likely need more bits in
the future to accommodate deltas et al, but for now let's start with
this.

For now we pick the palette entry by trying all three variants and
choosing the shortest option. That results in measuring every byte group
12 times (3 variants x 4 bit counts each); it should be possible to do
this faster by measuring every byte group once per distinct bit count (6
times) and aggregating the results.
Instead of using the last slot of the table for a fourth bit arrangement,
we use it to signify that the byte channel should be stored literally.
This removes the overhead of the header blocks for incompressible data,
and in practice results in net gains that outweigh any choice of bit
counts for the extra table slot.
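Putting the last two changes together, the per-channel selection could be sketched as follows (hypothetical names; the bit-count sets shown are the ones discussed later in this PR, and the actual encoder differs in details):

```cpp
#include <stddef.h>

// Hypothetical sketch of the per-channel palette selection; the actual palette
// contents and control-bit layout are subject to change.
static const unsigned char kBitPalette[3][4] = {
    {0, 2, 4, 8}, // matches the v0 bit counts
    {0, 1, 2, 4}, // favors well-compressible data
    {1, 4, 6, 8}, // favors poorly-compressible data
};

// Hypothetical helper: encoded size of one byte channel under a given bit-count set.
size_t encodedChannelSize(const unsigned char* channel, size_t group_count, const unsigned char bits[4]);

// Returns the 2-bit control value for a channel: 0..2 select a palette entry,
// 3 means the channel is stored literally (16 bytes per group, no headers).
static int selectChannelControl(const unsigned char* channel, size_t group_count)
{
	size_t best_size = group_count * 16;
	int best = 3;

	for (int entry = 0; entry < 3; ++entry)
	{
		size_t size = encodedChannelSize(channel, group_count, kBitPalette[entry]);
		if (size < best_size)
		{
			best_size = size;
			best = entry;
		}
	}

	return best;
}
```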
Sometimes a byte group can be encoded with two different bit counts at
the same size; we now break this tie in favor of the bit count used for
the previous group, unless the size is 16 bytes (8 bits per byte).

This maintains the exact same output size, but results in data that is
slightly more compressible by a backend compressor like LZ4.
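The tie-break could be expressed roughly as follows (a hypothetical sketch of the comparison, not the actual encoder code):

```cpp
#include <stddef.h>

// Hypothetical sketch of the tie-break: `size` is the encoded size of the group
// under the candidate bit count `bits`, `last_bits` is the bit count chosen for
// the previous group. Full 16-byte groups (8 bits/byte) are never tie-broken.
static bool preferCandidate(size_t size, int bits, size_t best_size, int last_bits)
{
	if (size < best_size)
		return true;

	return size == best_size && bits == last_bits && size != 16;
}
```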
This is helpful when tuning changes that don't have a large impact.
This table was found by brute-forcing all 455 ways to choose the table
entries out of 15 possible bit combinations; the best result was ~4.90%
reduction, but for now we pick the version with 0/2/4/8 as one of the
entries (which is ~4.83% reduction) as that makes it easier to retain
compatibility with v0. The previous table that this code replaces got
~4.59% reduction.

The new table also makes more logical sense: 0/1/2/4 optimizes for
well-compressible data, and 1/4/6/8 optimizes for poorly-compressible
data. 2/4/6/8 would perhaps make more sense, but 1/4/6/8 performs
better.
We now print ! when validation is off but the decoding result
mismatches, and also run a 100-iteration benchmark collecting
min/avg/stddev for decode timings.
This is very straightforward: we just need to skip the control bits for
v0, and read them to select the bit table for v1. To make decoding a
little simpler, for now we mandate that kBitsV1[0] == kBitsV0; this makes
it possible to index the bit table unconditionally.

Note that this doesn't include the byte group decoding, so we can't
fully parse experimental encoding yet.
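The unconditional indexing trick could be sketched like this (hypothetical names; the literal control value, 3, would be handled before this point):

```cpp
// Hypothetical sketch: since kBitsV1[0] is required to match the v0 bit counts,
// the decoder can index the bit table unconditionally; a v0 stream simply
// behaves as if every channel's control value were 0.
extern const unsigned char kBitsV1[4][4]; // kBitsV1[0] == kBitsV0 by construction

static const unsigned char* channelBits(int version, int control)
{
	// v0 data carries no control bits, so it always uses entry 0
	return kBitsV1[version == 0 ? 0 : control];
}
```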
Instead of reading one byte at a time we now read one or three bytes at
a time. This results in a decoding structure that is still fairly
generic but able to decode 1-bit and 6-bit groups with the same macros.

The preliminary performance analysis suggests that clang generates
roughly as efficient code as it used to, and gcc is a little behind.
Wasm code size increases by ~500 bytes post-gzip (~9%?). All of this will
need to be re-evaluated after we implement SIMD decoding anyway - for
example, decodeBytesGroup could be implemented with more loops to
optimize for code size instead if we target platforms with fallbacks
like Wasm.
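For illustration, a generic scalar decode of one group might look roughly like this (hypothetical names; sentinel handling and the fixed-width macro structure of the real decoder are omitted):

```cpp
// Hypothetical sketch: scalar decoding of one 16-value byte group with an
// arbitrary bit width, reading whole bytes into an accumulator as needed.
static const unsigned char* decodeBytesGroupAny(const unsigned char* data, unsigned char out[16], int bits)
{
	if (bits == 0)
	{
		for (int i = 0; i < 16; ++i)
			out[i] = 0;
		return data;
	}

	unsigned int accum = 0;
	int count = 0; // number of unread bits currently held in accum

	for (int i = 0; i < 16; ++i)
	{
		while (count < bits)
		{
			accum = (accum << 8) | *data++;
			count += 8;
		}

		count -= bits;
		out[i] = (unsigned char)((accum >> count) & ((1u << bits) - 1));
	}

	return data;
}
```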
This change introduces decodeBytesGroupSimdX which can decode all bit
widths we need in the new encoding using SSE. Code for 0/2/4/8 bits is
copied from decodeBytesGroupSimd, and code for 1/6 bits is new.

To decode 1-bit groups, we can mostly just use the existing tables.
Unfortunately, the bits in these tables are reversed compared to what we
need; this does not affect counts, but this does affect shuffle masks.

For now we perform a byte reverse using scalar math for this.
Alternative options exist, including reworking the encoding to flip the
bits, or constructing the tables differently.
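For reference, one well-known scalar sequence for reversing the bits within a byte (via a 64-bit multiply/mask/multiply, from the standard bit-twiddling repertoire); the exact sequence used here may differ:

```cpp
#include <stdint.h>

// Sketch: reverse the bits of a byte using scalar math; not necessarily the
// sequence used in this PR.
static inline uint8_t reverseBits8(uint8_t b)
{
	return (uint8_t)((((uint64_t)b * 0x80200802ULL) & 0x0884422110ULL) * 0x0101010101ULL >> 32);
}
```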

For 6-bit groups, we mostly just need to move the 6-bit values into the
right place after a byte shuffle. We try to take advantage of the bit
arrangement to do this with fewer ands; better implementation schemes
may exist.

It's not clear yet how to build a latency-optimized codepath for either
of these groups in a way that is faster than not having one. This is
perhaps less important for 1-bit groups, as the masks are directly
present in the source data; 6-bit groups present challenges with regard
to value realignment.

Additionally, the switch dispatch structure here runs into more
optimization concerns; to minimize these, for now we split the v1
decoding completely into a separate function with the same structure.
Ideally we should be able to improve the performance of the "X" variant
so that we can merge these back together.
The code closely mirrors the SSE decoding strategy; for 6-bit groups,
we need a 16-byte table lookup -- if possible, we use vqtbl1q_u8, which
requires AArch64; the fallback involves splitting the loads and table
lookups in two.

For 1-bit groups, we do the bit reverse using the same scalar sequence
as on Intel; we should have access to a bit-reverse intrinsic which may
make decoding faster, but since we might end up reversing the encoded
data instead, we can wait to use it.
This is a straightforward port of the SSE version with no surprises.
Blissfully, this appears to run almost as fast as the original version,
at least on node/v8, so this might be a case where it's easy to just
share the code with the existing decoder. Even without that, the size
expansion is tolerable.
Thanks to the power of multishift, we can decode 1-bit groups using the
same code and an adjusted table entry. For 6-bit groups, we run into a
problem where we were assuming 8 bytes are enough to hold all 16 values,
but for 6-bit groups we need 12 bytes. So we now read a full 16 bytes
and shuffle it using an extra per-bit shuffle mask.
To avoid the compiler inserting extra bounds checks for the bits
dispatch, we now tag the default: case as unreachable in the SIMD
implementations.

This is safe because bits[] comes from kBits[] which is statically
declared; attacker controlled values only involve indexing *into* kBits,
not the actual values.

This was less necessary before, because the compiler knew that the
dispatch value (bitslog2) was 2 bits wide, so all cases were handled;
that's now less clear, so it needs to be hinted explicitly.

We could do this for the scalar implementation in the future but there's
not a strong need for that.
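A typical cross-compiler spelling of such an unreachability hint might look like this (the macro name and exact form used in the PR may differ):

```cpp
// Sketch of a common unreachability hint; the macro added by this PR may be
// spelled differently.
#if defined(__GNUC__) || defined(__clang__)
#define SIMD_UNREACHABLE() __builtin_unreachable()
#elif defined(_MSC_VER)
#define SIMD_UNREACHABLE() __assume(false)
#else
#define SIMD_UNREACHABLE() (void)0
#endif

// Hypothetical dispatch shape: bits comes from a statically declared table,
// so the default case is genuinely impossible and the hint lets the compiler
// drop the extra bounds/fallthrough handling.
static void decodeGroupDispatch(int bits)
{
	switch (bits)
	{
	case 0: /* ... */ break;
	case 1: /* ... */ break;
	case 2: /* ... */ break;
	case 4: /* ... */ break;
	case 6: /* ... */ break;
	case 8: /* ... */ break;
	default:
		SIMD_UNREACHABLE();
	}
}
```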
Cast mask0r/mask1r to unsigned char and fix arguments to _mm_setr_epi8
Our previous version encoded zeroes because the structure of the
increments in the data perfectly matched the stride... Instead, switch
to pseudo-random bit-limited inputs, similarly to what codecbench does.
Also expand the attempt count a little to try to stabilize the timings,
and add a -l option to make it run for a long time and produce a list of
timings.
This mostly just affects the vertex codec (although enabling AVX has
positive effects elsewhere). Having both options will let us distinguish
the effects.
vreinterpret casts are no-ops on Apple clang, but meaningful for gcc.

zeux commented Dec 6, 2024

Subsequent commits in this PR introduce a decoding implementation that covers scalar and all SIMD targets; it has had a limited amount of optimization, mostly focused on clang/SSSE3 and clang/NEON.

Some compilers hit significant codegen issues with the new code; e.g. the gcc/Zen4 penalty is inflated because of some inlining issues, I believe. The switch dispatch got more expensive, which is partially mitigated by the new UNREACHABLE macro (previously, compilers deduced unreachability from the limited selector bits).

Unfortunately, the performance penalty remains severe: since the ratio only improves by ~5%, it would be ideal to keep the performance loss to ~5% here, but no compiler on any architecture reaches that. A big contributor to this is the 6-bit decoding mode: it often replaces the 8-bit mode in an attempt to squeeze a few more bytes out of the encoding, but is much, much more expensive to decode (it's the most expensive bit mode now, as 6-bit values can't be cleanly packed into bytes).

This can be partially mitigated by removing sentinel support from bit6, but that reduces efficiency gains. Removing bit6 mode outright reduces efficiency gains even further...

This is a developing space; more commits will hopefully follow as I get new ideas.

| Decode (GB/s)     | v0 clang | v1 clang    | v0 gcc | v1 gcc      | v0 msvc | v1 msvc     |
|-------------------|----------|-------------|--------|-------------|---------|-------------|
| SSSE3 (Zen4)      | 5.51     | 4.63 (-15%) | 5.52   | 3.76 (-32%) | 5.29    | 3.95 (-25%) |
| AVX512 (Zen4)     | 6.45     | 5.65 (-14%) | 6.11   | 4.80 (-21%) | 5.04    | 4.84 (-4%)  |
| SSSE3 (Ice Lake)  | 2.68     | 2.39 (-11%) | 2.36   | 2.17 (-8%)  | -       | -           |
| AVX512 (Ice Lake) | 2.87     | 2.64 (-8%)  | 2.84   | 2.52 (-13%) | -       | -           |
| NEON (M2)         | 3.46     | 3.02 (-13%) | -      | -           | -       | -           |
| NEON (Graviton3)  | 2.62     | 2.14 (-19%) | 2.35   | 2.03 (-14%) | -       | -           |
| WASM (Zen4)       | 2.82     | 2.25 (-20%) | -      | -           | -       | -           |


zeux commented Dec 6, 2024

Actually, to simplify this, I will update the other PR instead and close this one; we don't need two PRs. I was originally hoping I could submit the encoding code separately, but the decoding performance situation makes the final form of the new version uncertain.

@zeux zeux closed this Dec 6, 2024
@zeux zeux deleted the vcone-bitdec branch December 6, 2024 02:09
@zeux zeux restored the vcone-bitdec branch December 6, 2024 22:03
@zeux zeux deleted the vcone-bitdec branch December 6, 2024 22:03