vertexcodec: Initial implementation of experimental decoders #811

Closed · wants to merge 25 commits

Conversation


zeux (Owner) commented Dec 6, 2024

This PR prototypes initial support for the next vertex codec version.

The change introduces version 0xe; this version has zero compatibility guarantees and will be in a state of flux with respect to the bitstream format until it is finalized as version 1. The goals for the next version are to retain decoding performance comparable to v0 -- currently v0 decodes at around 4.5 GB/s on Zen 4 (7950X) and around 3.5 GB/s on M2 (base) -- and to increase the compression ratio, targeting up to ~10% relative improvement on average for a set of geometry data. This will make the decoder more complicated and larger, but hopefully not dramatically so. Some performance loss is probably unavoidable; hopefully it can be kept to ~5% or so.

In this PR the initial infrastructure (version selection, an enhanced codectest harness) is set up, and byte group encoding is expanded to support a palette of bit counts, adding support for 1-bit and 6-bit groups as well as literal byte blocks. This results in the following aggregate improvements on the test set (~half of it is glTF data, ~half is proprietary engine data).

143 files: raw v1/v0 -4.83%, lz4 v1/v0 -3.37%, zstd v1/v0 -0.79%

It's not certain yet that the decoding performance can be maintained with this expansion due to some issues around codegen with compilers, but that will be investigated separately. To get to the target compression ratio improvements, more changes will be necessary but they will be orthogonal to byte group encoding enhancements.

This contribution is sponsored by Valve.

zeux added 25 commits December 3, 2024 11:29
In this mode, we take a set of file paths, load each one and recompress
it. The file names should have the form:

	vN_sK_name_etc

... where N is the number of vertices and K is the vertex stride. Each
file may be either compressed with the vertex codec or uncompressed. The
file will be decompressed, recompressed, and various stats will be
measured.

Right now this returns meaningless (zero) stats since the compression
algorithm is unchanged; it will, however, be useful for developing the
next version of the vertex codec.
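As an illustration, parsing such a file name could look roughly like this (a sketch; the helper name and exact error handling are hypothetical):

```cpp
#include <stdio.h>

// Hypothetical sketch: extract vertex count (N) and stride (K) from a file
// name of the form vN_sK_name_etc; returns false if the prefix doesn't match.
static bool parseTestFileName(const char* name, size_t* vertex_count, size_t* vertex_size)
{
	unsigned long long n = 0, k = 0;
	if (sscanf(name, "v%llu_s%llu_", &n, &k) != 2)
		return false;

	*vertex_count = (size_t)n;
	*vertex_size = (size_t)k;
	return true;
}
```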
This version is *not* for production; we allow setting it via the
official API so that we can test and incrementally develop it, but the
format will not be finalized for a while. When it is, the version number
will be changed to 1, and existing calls requesting 0xe / data encoded
with 0xe will fail to encode/decode.
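For context, opting into the experimental version goes through the existing public API; a minimal sketch (the wrapper function is hypothetical, 0xe is the experimental version from this PR):

```cpp
#include "meshoptimizer.h"

#include <vector>

// Hypothetical wrapper: encode a vertex buffer with the experimental codec
// version. meshopt_encodeVertexVersion selects the bitstream version for
// subsequent meshopt_encodeVertexBuffer calls.
std::vector<unsigned char> encodeExperimental(const void* vertices, size_t vertex_count, size_t vertex_size)
{
	meshopt_encodeVertexVersion(0xe); // experimental; zero compatibility guarantees until finalized as version 1

	std::vector<unsigned char> buffer(meshopt_encodeVertexBufferBound(vertex_count, vertex_size));
	buffer.resize(meshopt_encodeVertexBuffer(buffer.data(), buffer.size(), vertices, vertex_count, vertex_size));
	return buffer;
}
```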
We don't encode these yet but once we do, we'll need to correctly gather
stats for tracing. For simplicity, we gather all bit group sizes but
only display those that will be encoded in v1.
Instead of hardcoding bit counts, we can now encode using an arbitrary
set of four bit counts. When using the original v0 values this produces
the exact same output, but it gives us an opportunity to select a
different set of bit counts inside the same block.

As part of this, bits=1 is changed to bits=0; bits=0 was always the
correct label, as bits=1 properly refers to a 1-bit mask plus sentinels.
We previously only supported power-of-two bit counts, as that allowed
encoding an individual byte in registers. To support other bit counts
like 6, we need a more general encoding that accumulates bits and
flushes bytes when ready.

This change retains the same encoding performance.
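The accumulate-and-flush idea could be sketched roughly like this (hypothetical names; the real encoder is structured around fixed-width paths and also handles sentinel values, which this omits):

```cpp
// Hypothetical sketch: pack 16 values of `bits` width each into `out`,
// accumulating bits and flushing whole bytes as they become available.
// Assumes 1 <= bits <= 8 and that each value already fits in `bits` bits.
static unsigned char* encodeBytesGroupAny(unsigned char* out, const unsigned char values[16], int bits)
{
	unsigned int accum = 0;
	int count = 0; // number of pending bits currently held in accum

	for (int i = 0; i < 16; ++i)
	{
		accum = (accum << bits) | values[i];
		count += bits;

		while (count >= 8)
		{
			count -= 8;
			*out++ = (unsigned char)(accum >> count);
		}
	}

	if (count > 0)
		*out++ = (unsigned char)(accum << (8 - count)); // pad the final partial byte with zero bits

	return out;
}
```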
Also update codectest to use 0xe to trigger the new version encoder.
This makes the code flow easier to follow, and will make adjustments to
delta encoding easier to implement.
In the v0 format, we encode using 0, 2, 4 and 8-bit groups. This provides
reasonable coverage, but misses some other useful bit groups; notably,
1-bit groups (with sentinel encoding) can be used to more efficiently
encode mostly-0 sequences without spending an extra 2 bytes on bit
masks, and 6-bit groups (with sentinel encoding) provide a middle ground
between 4-bit and 8-bit that can be useful for data that is otherwise
difficult to compress.

While we could choose one of 6 bit widths for every 16-byte group instead
of one of 4, this noticeably increases the header overhead. Instead, we
can preselect a few 4-element arrangements of bit counts, and use control
bits to select the right one.

For now we choose a hard-coded palette of 3 entries. The fourth entry may
be useful for literal encoding, to save space on header bits for very
incompressible blocks. The palette itself is subject to change.

To select the palette entry, we allocate 2-bit control metadata for
every byte channel in the vertex block. We will likely need more bits in
the future to accommodate deltas et al, but for now let's start with
this.

For now we pick the palette entry by trying all three variants and
choosing the shortest option. That results in measuring every byte group
12 times (3 variants x 4 bit counts each); it should be possible to do
this faster by measuring every byte group once per distinct bit count (6
times) and aggregating the results.
Instead of using the last slot of the table for a fourth bit arrangement,
we use it to signify that the byte channel should be stored literally.
This removes the overhead of the header blocks for incompressible data,
and in practice results in net gains that outweigh any choice of bit
counts for the extra table slot.
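Putting the last two changes together, the per-channel selection could be sketched as follows (hypothetical names; the bit-count sets shown are the ones discussed later in this PR, and the actual encoder differs in details):

```cpp
#include <stddef.h>

// Hypothetical sketch of the per-channel palette selection; the actual palette
// contents and control-bit layout are subject to change.
static const unsigned char kBitPalette[3][4] = {
    {0, 2, 4, 8}, // matches the v0 bit counts
    {0, 1, 2, 4}, // favors well-compressible data
    {1, 4, 6, 8}, // favors poorly-compressible data
};

// Hypothetical helper: encoded size of one byte channel under a given bit-count set.
size_t encodedChannelSize(const unsigned char* channel, size_t group_count, const unsigned char bits[4]);

// Returns the 2-bit control value for a channel: 0..2 select a palette entry,
// 3 means the channel is stored literally (16 bytes per group, no headers).
static int selectChannelControl(const unsigned char* channel, size_t group_count)
{
	size_t best_size = group_count * 16;
	int best = 3;

	for (int entry = 0; entry < 3; ++entry)
	{
		size_t size = encodedChannelSize(channel, group_count, kBitPalette[entry]);
		if (size < best_size)
		{
			best_size = size;
			best = entry;
		}
	}

	return best;
}
```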
Sometimes a byte group can be encoded with two different bit counts at
the same size; we now break this tie in favor of the bit count used for
the previous group, unless the size is 16 bytes (8 bits per byte).

This maintains the exact same output size, but results in data that is
slightly more compressible by a backend compressor like LZ4.
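The tie-break could be expressed roughly as follows (a hypothetical sketch of the comparison, not the actual encoder code):

```cpp
#include <stddef.h>

// Hypothetical sketch of the tie-break: `size` is the encoded size of the group
// under the candidate bit count `bits`, `last_bits` is the bit count chosen for
// the previous group. Full 16-byte groups (8 bits/byte) are never tie-broken.
static bool preferCandidate(size_t size, int bits, size_t best_size, int last_bits)
{
	if (size < best_size)
		return true;

	return size == best_size && bits == last_bits && size != 16;
}
```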
This is helpful when tuning changes that don't have a large impact.
This table was found by brute-forcing all 455 ways to choose the table
entries out of 15 possible bit combinations; the best result was ~4.90%
reduction, but for now we pick the version with 0/2/4/8 as one of the
entries (which is ~4.83% reduction) as that makes it easier to retain
compatibility with v0. The previous table that this code replaces got
~4.59% reduction.

The new table also makes more logical sense: 0/1/2/4 optimizes for
well-compressible data, and 1/4/6/8 optimizes for poorly-compressible
data. 2/4/6/8 would perhaps make more sense, but 1/4/6/8 performs
better.
We now print ! when validation is off but the decoding result
mismatches, and also run a 100-iteration benchmark collecting
min/avg/stddev for decode timings.
This is very straightforward: we just need to skip the control bits for
v0, and read them to select the bit table for v1. To make decoding a
little simpler, for now we mandate that kBitsV1[0] == kBitsV0; this makes
it possible to index the bit table unconditionally.

Note that this doesn't include the byte group decoding, so we can't
fully parse experimental encoding yet.
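The unconditional indexing trick could be sketched like this (hypothetical names; the literal control value, 3, would be handled before this point):

```cpp
// Hypothetical sketch: since kBitsV1[0] is required to match the v0 bit counts,
// the decoder can index the bit table unconditionally; a v0 stream simply
// behaves as if every channel's control value were 0.
extern const unsigned char kBitsV1[4][4]; // kBitsV1[0] == kBitsV0 by construction

static const unsigned char* channelBits(int version, int control)
{
	// v0 data carries no control bits, so it always uses entry 0
	return kBitsV1[version == 0 ? 0 : control];
}
```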
Instead of reading one byte at a time we now read one or three bytes at
a time. This results in a decoding structure that is still fairly
generic but able to decode 1-bit and 6-bit groups with the same macros.

The preliminary performance analysis suggests that clang generates
roughly as efficient code as it used to, and gcc is a little behind.
Wasm code size increases by ~500 bytes post-gzip (~9%?). All of this will
need to be re-evaluated after we implement SIMD decoding anyway - for
example, decodeBytesGroup could be implemented with more loops to
optimize for code size instead if we target platforms with fallbacks
like Wasm.
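For illustration, a generic scalar decode of one group might look roughly like this (hypothetical names; sentinel handling and the fixed-width macro structure of the real decoder are omitted):

```cpp
// Hypothetical sketch: scalar decoding of one 16-value byte group with an
// arbitrary bit width, reading whole bytes into an accumulator as needed.
static const unsigned char* decodeBytesGroupAny(const unsigned char* data, unsigned char out[16], int bits)
{
	if (bits == 0)
	{
		for (int i = 0; i < 16; ++i)
			out[i] = 0;
		return data;
	}

	unsigned int accum = 0;
	int count = 0; // number of unread bits currently held in accum

	for (int i = 0; i < 16; ++i)
	{
		while (count < bits)
		{
			accum = (accum << 8) | *data++;
			count += 8;
		}

		count -= bits;
		out[i] = (unsigned char)((accum >> count) & ((1u << bits) - 1));
	}

	return data;
}
```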
This change introduces decodeBytesGroupSimdX which can decode all bit
widths we need in the new encoding using SSE. Code for 0/2/4/8 bits is
copied from decodeBytesGroupSimd, and code for 1/6 bits is new.

To decode 1-bit groups, we can mostly just use the existing tables.
Unfortunately, the bits in these tables are reversed compared to what we
need; this does not affect counts, but this does affect shuffle masks.

For now we perform a byte reverse using scalar math for this.
Alternative options exist, including reworking the encoding to flip the
bits, or constructing the tables differently.
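For reference, one well-known scalar sequence for reversing the bits within a byte (via a 64-bit multiply/mask/multiply, from the standard bit-twiddling repertoire); the exact sequence used here may differ:

```cpp
#include <stdint.h>

// Sketch: reverse the bits of a byte using scalar math; not necessarily the
// sequence used in this PR.
static inline uint8_t reverseBits8(uint8_t b)
{
	return (uint8_t)((((uint64_t)b * 0x80200802ULL) & 0x0884422110ULL) * 0x0101010101ULL >> 32);
}
```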

For 6-bit groups, we mostly just need to move the 6-bit values into the
right place after a byte shuffle. We try to take advantage of the bit
arrangement to do this with fewer ands; better implementation schemes
may exist.

It's not clear yet how to build a latency-optimized codepath for either
of these groups in a way that is faster than not having one. This is
perhaps less important for 1-bit groups, as the masks are directly
present in the source data; 6-bit groups present challenges with regard
to value realignment.

Additionally, the switch dispatch structure here runs into more
optimization concerns; to minimize these, for now we split the v1
decoding completely into a separate function with the same structure.
Ideally we should be able to improve the performance of the "X" variant
so that we can merge these back together.
The code closely mirrors the SSE decoding strategy; for 6-bit groups,
we need a 16-byte table lookup -- if possible, we use vqtbl1q_u8, which
requires AArch64; the fallback involves splitting the loads and table
lookups in two.

For 1-bit groups, we do the bit reverse using the same scalar sequence
as on Intel; we should have access to a bit-reverse intrinsic which may
make decoding faster, but since we might end up reversing the encoded
data instead, we can wait to use it.
This is a straightforward port of the SSE version with no surprises.
Blissfully, this appears to run almost as fast as the original version,
at least on node/v8, so this might be a case where it's easy to just
share the code with the existing decoder. Even without that, the size
expansion is tolerable.
Thanks to the power of multishift, we can decode 1-bit groups using the
same code and an adjusted table entry. For 6-bit groups, we run into a
problem where we were assuming 8 bytes are enough to hold all 16 values,
but for 6-bit groups we need 12 bytes. So we now read a full 16 bytes
and shuffle it using an extra per-bit shuffle mask.
To avoid the compiler inserting extra bounds checks for the bits
dispatch, we now tag the default: case as unreachable in the SIMD
implementations.

This is safe because bits[] comes from kBits[] which is statically
declared; attacker controlled values only involve indexing *into* kBits,
not the actual values.

This was less necessary before, because the compiler knew that the
dispatch value (bitslog2) was 2 bits wide, so all cases were handled;
that's now less clear, so it needs to be hinted explicitly.

We could do this for the scalar implementation in the future but there's
not a strong need for that.
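A typical cross-compiler spelling of such an unreachability hint might look like this (the macro name and exact form used in the PR may differ):

```cpp
// Sketch of a common unreachability hint; the macro added by this PR may be
// spelled differently.
#if defined(__GNUC__) || defined(__clang__)
#define SIMD_UNREACHABLE() __builtin_unreachable()
#elif defined(_MSC_VER)
#define SIMD_UNREACHABLE() __assume(false)
#else
#define SIMD_UNREACHABLE() (void)0
#endif

// Hypothetical dispatch shape: bits comes from a statically declared table,
// so the default case is genuinely impossible and the hint lets the compiler
// drop the extra bounds/fallthrough handling.
static void decodeGroupDispatch(int bits)
{
	switch (bits)
	{
	case 0: /* ... */ break;
	case 1: /* ... */ break;
	case 2: /* ... */ break;
	case 4: /* ... */ break;
	case 6: /* ... */ break;
	case 8: /* ... */ break;
	default:
		SIMD_UNREACHABLE();
	}
}
```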
Cast mask0r/mask1r to unsigned char and fix arguments to _mm_setr_epi8
Our previous version encoded zeroes because the structure of the
increments in the data perfectly matched the stride... Instead, switch
to pseudo-random bit-limited inputs, similarly to what codecbench does.
Also expand the attempt count a little to try to stabilize the timings,
and add a -l option to make it run for a long time and produce a list of
timings.
This mostly just affects the vertex codec (although enabling AVX has
positive effects elsewhere). Having both options will let us distinguish
the effects.
vreinterpret casts are no-ops on Apple clang, but meaningful for gcc.

zeux commented Dec 6, 2024

Subsequent commits in this PR introduce a decoding implementation that covers scalar and all SIMD targets; it has had a limited amount of optimization, mostly focused on clang/SSSE3 and clang/NEON.

Some compilers hit significant codegen issues with the new code; e.g. the gcc/Zen4 penalty is inflated because of some inlining issues, I believe. The switch dispatch got more expensive, which is partially mitigated by the new UNREACHABLE macro (previously, compilers deduced unreachability from the limited selector bits).

Unfortunately, the performance penalty remains severe: since the ratio only improves by ~5%, it would be ideal to keep the performance loss to ~5% here, but no compiler on any architecture reaches that. A big contributor to this is the 6-bit decoding mode: it often replaces the 8-bit mode in an attempt to squeeze a few more bytes out of the encoding, but is much, much more expensive to decode (it's the most expensive bit mode now, as 6-bit values can't be cleanly packed into bytes).

This can be partially mitigated by removing sentinel support from bit6, but that reduces efficiency gains. Removing bit6 mode outright reduces efficiency gains even further...

This is a developing space; more commits will hopefully follow as I get new ideas.

| Decode (GB/s)     | v0 clang | v1 clang    | v0 gcc | v1 gcc      | v0 msvc | v1 msvc     |
|-------------------|----------|-------------|--------|-------------|---------|-------------|
| SSSE3 (Zen4)      | 5.51     | 4.63 (-15%) | 5.52   | 3.76 (-32%) | 5.29    | 3.95 (-25%) |
| AVX512 (Zen4)     | 6.45     | 5.65 (-14%) | 6.11   | 4.80 (-21%) | 5.04    | 4.84 (-4%)  |
| SSSE3 (Ice Lake)  | 2.68     | 2.39 (-11%) | 2.36   | 2.17 (-8%)  | -       | -           |
| AVX512 (Ice Lake) | 2.87     | 2.64 (-8%)  | 2.84   | 2.52 (-13%) | -       | -           |
| NEON (M2)         | 3.46     | 3.02 (-13%) | -      | -           | -       | -           |
| NEON (Graviton3)  | 2.62     | 2.14 (-19%) | 2.35   | 2.03 (-14%) | -       | -           |
| WASM (Zen4)       | 2.82     | 2.25 (-20%) | -      | -           | -       | -           |


zeux commented Dec 6, 2024

Actually, to simplify this, I will update the other PR instead and close this one; we don't need two PRs. I was originally hoping I could submit the encoding code separately, but the decoding performance situation makes the final form of the new version uncertain.

@zeux zeux closed this Dec 6, 2024
@zeux zeux deleted the vcone-bitdec branch December 6, 2024 02:09
@zeux zeux restored the vcone-bitdec branch December 6, 2024 22:03
@zeux zeux deleted the vcone-bitdec branch December 6, 2024 22:03