vertexcodec: Implement support for compression levels #824

Merged: 17 commits into master from vcone-level on Dec 20, 2024
Conversation

zeux (Owner) commented on Dec 19, 2024

While individual components of the encoder can always be optimized
further, to ensure that control over encode time/size tradeoff is still
present, meshopt_encodeVertexBufferLevel can now choose compression
level. For v0 all levels are equivalent; for v1, currently:

  • level 0 is closest to v0 and picks a single bitgroup layout (while
    still supporting zero channels as these are fast to reject)
  • level 1 picks the best bitgroup layout but only uses byte deltas
  • level 2 selects 1- or 2-byte deltas per channel, without XOR/rot
  • level 3 selects XOR/rot per channel

These may change as performance characteristics of the encoder change.
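
For illustration only (the flag names below are hypothetical, not meshoptimizer API), the level progression can be thought of as a cumulative feature set:

```cpp
#include <cassert>

// Hypothetical feature flags illustrating what each compression level adds;
// the real encoder makes these choices inline rather than via flags.
enum EncodeFeatures {
    Feature_None          = 0,
    Feature_BestBitgroups = 1 << 0, // level 1+: pick the best bitgroup layout
    Feature_WideDeltas    = 1 << 1, // level 2+: 1- or 2-byte deltas per channel
    Feature_XorRot        = 1 << 2  // level 3: XOR/rotate selection per channel
};

inline int featuresForLevel(int level) {
    int features = Feature_None; // level 0: single bitgroup layout (zero channels still supported)
    if (level >= 1)
        features |= Feature_BestBitgroups;
    if (level >= 2)
        features |= Feature_WideDeltas;
    if (level >= 3)
        features |= Feature_XorRot;
    return features;
}
```

Level 0 keeps only the fast paths; each higher level spends more encode time searching a larger space.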

This PR also significantly optimizes all levels in subsequent commits.
The initial numbers after the first commit were:

v0: 0.585 GB/s
v1 level 0: 0.548 GB/s
v1 level 1: 0.398 GB/s
v1 level 2: 0.217 GB/s
v1 level 3: 0.060 GB/s

The new numbers are:

v0: 0.778 GB/s
v1 level 0: 0.722 GB/s
v1 level 1: 0.591 GB/s
v1 level 2: 0.482 GB/s
v1 level 3: 0.386 GB/s

For now, level 2 is the default. While it is currently ~20% slower to encode compared to what v0 used to encode at, using level 1 as the baseline would disable wider deltas, and realistically almost 500 MB/s of encoding speed is likely sufficient. Applications that need slightly more gains for complex bitpacked data could choose level 3; applications that need streaming encoding can choose level 1 or 0.

This contribution is sponsored by Valve.

zeux added 17 commits on December 19, 2024

While individual components of the encoder are going to be optimized
further, to ensure that control over encode time/size tradeoff is still
present, meshopt_encodeVertexBufferLevel can now choose compression
level. For v0 all levels are equivalent; for v1, currently:

- level 0 is closest to v0 and picks a single bitgroup layout (while
  still supporting zero channels as these are fast to reject)
- level 1 picks the best bitgroup layout but only uses byte deltas
- level 2 selects 1- or 2-byte deltas per channel, without XOR/rot
- level 3 selects XOR/rot per channel

These may change as performance characteristics of the encoder change.

This will allow for easier performance tuning and comparison.
codectest -test -1 will select compression level 1; for now this only
works for file based testing and not for pipe mode which always uses
level 2 (default).

Instead of brute-forcing all rotations and computing the xor-encoded
size, we can estimate the size without actually encoding, and do this
analysis on groups of 16 deltas. For each group, we OR the deltas
together, which yields "1" bits wherever there's any inconsistency
within the group.

This is an estimate that is not always as good as the correct value, but
it's generally pretty close: if there is a significant bias towards
selecting a specific rotation, this algorithm will still find it, and do
so 10x faster.
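
A standalone sketch of the idea, with hypothetical helper names; the real encoder applies this to xor deltas inside the codec's block structure:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Rotate an 8-bit value left by r (0..7); operands are promoted to int, so
// the shifts are well-defined.
inline uint8_t rotl8(uint8_t v, int r) {
    return uint8_t(((v << r) | (v >> ((8 - r) & 7))) & 0xff);
}

// Estimate the bits needed for a group of 16 deltas under rotation r: OR the
// rotated deltas together; any "1" bit marks an inconsistency in the group.
inline int estimateGroupBits(const uint8_t deltas[16], int r) {
    uint8_t m = 0;
    for (size_t i = 0; i < 16; ++i)
        m |= rotl8(deltas[i], r);
    int bits = 0;
    while (m) { ++bits; m >>= 1; } // index of the highest set bit, plus one
    return bits;
}

// Pick the rotation that minimizes the estimated group size.
inline int bestRotation(const uint8_t deltas[16]) {
    int best = 0, bestBits = estimateGroupBits(deltas, 0);
    for (int r = 1; r < 8; ++r) {
        int bits = estimateGroupBits(deltas, r);
        if (bits < bestBits) { bestBits = bits; best = r; }
    }
    return best;
}
```

For example, a group where every delta only has the top bit set is found to compress best under a rotation by 1, without ever encoding a candidate bitstream.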

This centralizes the estimation overhead and makes it easier to profile
and understand the code.

Also, instead of guarding against vertex count in estimate*, move that
to the call site.

We do not need to analyze groups that have all bits in an equally
consistent state; a branch here ends up being slightly beneficial on
average.

Since we can't address this table using implicit address math in either
layout based on hbits, we might as well use a shorter and more readable
layout where shifts and sentinels go together.

Instead of calling a generic encodeBytesMeasure twice, we can note that
it redundantly recomputes almost all of the information: in v1, bits
1/2/4 are available to both control modes, so we just need to compute 0
and 8 sizes - of which, 8 is a constant size. So we can compute the
sizes for both streams in parallel.

Doing this preserves the bitstream exactly, but results in ~20% faster
encoding at level 1.
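
As a rough self-contained illustration of the single-pass idea (the constants and cost model here are hypothetical simplifications, not the actual v1 layout):

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical single-pass measurement: a group encoded at k bits per byte
// costs k bits per value, plus a full extra byte for every value that exceeds
// the largest non-sentinel value (2^k - 2).
struct BitSizes {
    size_t bits1, bits2, bits4; // total bits at 1/2/4 bits per byte
};

inline BitSizes measureWidths(const unsigned char* data, size_t n) {
    BitSizes s = {n * 1, n * 2, n * 4}; // base cost: width bits per byte
    for (size_t i = 0; i < n; ++i) {
        unsigned char v = data[i];
        // sentinel escapes append the raw byte (8 bits) to a side stream
        s.bits1 += (v >= 1) ? 8 : 0;  // 1-bit groups fit only the value 0
        s.bits2 += (v >= 3) ? 8 : 0;  // 2-bit groups fit 0..2
        s.bits4 += (v >= 15) ? 8 : 0; // 4-bit groups fit 0..14
    }
    return s;
}
```

All candidate widths are accumulated in the same loop, so the data is scanned once instead of once per control mode.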

The sentinel branch is difficult to predict; since we have enough space
to encode all bytes as sentinels due to decode limit padding, it is safe
to append every byte unconditionally and move the pointer for out of
range values. This accelerates all levels of encoding further, up to 30%
for levels 0 and 1.

Also rework encodeBytes flow to call the function just once; this mostly
just makes sure the compiler can inline it without issues, as otherwise
the function is too large to be inlined into two separate paths.
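
The general pattern, shown as a self-contained sketch rather than the actual codec code: store every byte unconditionally and fold the condition into the pointer increment, which replaces the unpredictable branch with a data dependency:

```cpp
#include <cassert>
#include <cstddef>

// Branch-free compaction: store every byte unconditionally and advance the
// write pointer only when the byte is kept. `out` must have room for all n
// bytes, mirroring the decode-limit padding that makes this safe in the codec.
inline size_t appendExceptions(unsigned char* out, const unsigned char* in, size_t n, unsigned char threshold) {
    unsigned char* w = out;
    for (size_t i = 0; i < n; ++i) {
        *w = in[i];               // unconditional store
        w += (in[i] > threshold); // advance only for out-of-range values
    }
    return size_t(w - out);
}
```

Discarded bytes are simply overwritten by the next iteration, so no conditional store is needed.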

Deltas requires level 2; while this is currently the default, it's
better to be explicit to avoid losing coverage.

BitXor requires level 3; the previously specified level was incorrect so
the code was not exercised properly.

Also adjust BitXor to select xor for two last channels instead of just
one.

This produces the same encoding in practice, but it is somewhat cleaner
for potential future expansions of channel encoding, as it leaves the
full 4 lower bits to store additional modes.

We test all 4 levels for the new version to check that no level has
encoding issues.

estimateBits takes unsigned char, so we need to explicitly truncate to
silence the conversion warning.

After previous changes, encodeBytesMeasure is no longer used by any
function other than estimateChannel. Inlining it into estimateChannel
allows us to simplify the code and improves optimization, as an explicit
measure is faster than table selection in practice. This also allows us
to drop one of the bit group modes to gain extra performance in the
future.

In addition we also fix last_vertex handling (this was incorrectly using
the first vertex for all blocks instead of last vertex of the previous
block) and reduce memset overhead by limiting it to the last (partial)
block.

Instead of analyzing every block, we can look at a subset of blocks and
assume that the statistics of the data in different blocks are
reasonably close. This is a little brute-force, but it gets almost the
same compression results on a variety of files, so for now we do this
unconditionally at every level, which significantly increases the
encoding throughput of levels 2 and 3.
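
In outline (a hypothetical helper, much simplified from the actual heuristic), the estimator visits every k-th block instead of all of them:

```cpp
#include <cassert>
#include <cstddef>

// Estimate a per-block statistic from a strided subset of blocks, assuming
// block statistics are similar enough that every k-th block is representative.
template <typename Stat>
double sampleBlocks(size_t block_count, size_t stride, Stat stat) {
    if (stride == 0)
        stride = 1; // guard against a degenerate stride
    double total = 0;
    size_t sampled = 0;
    for (size_t i = 0; i < block_count; i += stride) {
        total += stat(i); // stat(i) returns the statistic for block i
        ++sampled;
    }
    return sampled ? total / double(sampled) : 0.0;
}
```

The analysis cost drops roughly by the stride factor while the chosen encoding stays close to what a full scan would pick.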

Make sure the level is not negative and that the usage doesn't contain
mistakes such as passing vertex_size as the level instead. We allow the
0..9 range to leave room for possible future expansion of the current
0..3 range.

Instead of having three versions of zigzag, one per type, we can use a
template, similarly to unzigzag. This produces roughly the same code
with less duplication.
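
A template along these lines might look like the following sketch (the actual implementation may differ in detail):

```cpp
#include <cassert>
#include <cstdint>

// ZigZag-encode a two's-complement delta stored in an unsigned type:
// 0, -1, 1, -2, 2, ... map to 0, 1, 2, 3, 4, ...
template <typename T>
inline T zigzag(T v) {
    return T((0 - (v >> (sizeof(T) * 8 - 1))) ^ (v << 1));
}

// Inverse transform, expanding bit 0 back into the sign.
template <typename T>
inline T unzigzag(T v) {
    return T((0 - (v & 1)) ^ (v >> 1));
}
```

The same two functions instantiate for 8-, 16-, and 32-bit channels, replacing three hand-written copies.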
@zeux zeux merged commit b1d7cf5 into master Dec 20, 2024
12 checks passed
@zeux zeux deleted the vcone-level branch December 20, 2024 17:02