vertexcodec: Implement support for compression levels #824

Merged: 17 commits into master from vcone-level on Dec 20, 2024
Conversation

zeux (Owner) commented on Dec 19, 2024

While individual components of the encoder can always be optimized
further, to ensure that control over encode time/size tradeoff is still
present, meshopt_encodeVertexBufferLevel can now choose compression
level. For v0 all levels are equivalent; for v1, currently:

  • level 0 is closest to v0 and picks a single bitgroup layout (while
    still supporting zero channels as these are fast to reject)
  • level 1 picks the best bitgroup layout but only uses byte deltas
  • level 2 selects 1- or 2-byte deltas per channel, without XOR/rot
  • level 3 selects XOR/rot per channel

These may change as performance characteristics of the encoder change.
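
For illustration only (the flag names below are hypothetical, not meshoptimizer API), the level progression can be thought of as a cumulative feature set:

```cpp
#include <cassert>

// Hypothetical feature flags illustrating what each compression level adds;
// the real encoder makes these choices inline rather than via flags.
enum EncodeFeatures {
    Feature_None          = 0,
    Feature_BestBitgroups = 1 << 0, // level 1+: pick the best bitgroup layout
    Feature_WideDeltas    = 1 << 1, // level 2+: 1- or 2-byte deltas per channel
    Feature_XorRot        = 1 << 2  // level 3: XOR/rotate selection per channel
};

inline int featuresForLevel(int level) {
    int features = Feature_None; // level 0: single bitgroup layout (zero channels still supported)
    if (level >= 1)
        features |= Feature_BestBitgroups;
    if (level >= 2)
        features |= Feature_WideDeltas;
    if (level >= 3)
        features |= Feature_XorRot;
    return features;
}
```

Level 0 keeps only the fast paths; each higher level spends more encode time searching a larger space.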

This PR also significantly optimizes all levels in subsequent commits.
The initial numbers after the first commit were:

v0: 0.585 GB/s
v1 level 0: 0.548 GB/s
v1 level 1: 0.398 GB/s
v1 level 2: 0.217 GB/s
v1 level 3: 0.060 GB/s

The new numbers are:

v0: 0.778 GB/s
v1 level 0: 0.722 GB/s
v1 level 1: 0.591 GB/s
v1 level 2: 0.482 GB/s
v1 level 3: 0.386 GB/s

For now, level 2 is the default. While it is currently ~20% slower to encode compared to what v0 used to encode at, using level 1 as the baseline would disable wider deltas, and realistically almost 500 MB/s of encoding speed is likely sufficient. Applications that need slightly more gains for complex bitpacked data could choose level 3; applications that need streaming encoding can choose level 1 or 0.

This contribution is sponsored by Valve.

zeux added 17 commits on December 19, 2024

While individual components of the encoder are going to be optimized
further, to ensure that control over encode time/size tradeoff is still
present, meshopt_encodeVertexBufferLevel can now choose compression
level. For v0 all levels are equivalent; for v1, currently:

- level 0 is closest to v0 and picks a single bitgroup layout (while
  still supporting zero channels as these are fast to reject)
- level 1 picks the best bitgroup layout but only uses byte deltas
- level 2 selects 1- or 2-byte deltas per channel, without XOR/rot
- level 3 selects XOR/rot per channel

These may change as performance characteristics of the encoder change.

This will allow for easier performance tuning and comparison.
codectest -test -1 will select compression level 1; for now this only
works for file based testing and not for pipe mode which always uses
level 2 (default).

Instead of brute-forcing all rotations and computing the xor-encoded
size, we can estimate the size without actually encoding, and do this
analysis on groups of 16 deltas. For each group, we OR the deltas
together, which yields "1" bits wherever there's any inconsistency
within the group.

This is an estimate that is not always as good as the correct value, but
it's generally pretty close: if there is a significant bias towards
selecting a specific rotation, this algorithm will still find it, and do
so 10x faster.
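
A standalone sketch of the idea, with hypothetical helper names; the real encoder applies this to xor deltas inside the codec's block structure:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Rotate an 8-bit value left by r (0..7); operands are promoted to int, so
// the shifts are well-defined.
inline uint8_t rotl8(uint8_t v, int r) {
    return uint8_t(((v << r) | (v >> ((8 - r) & 7))) & 0xff);
}

// Estimate the bits needed for a group of 16 deltas under rotation r: OR the
// rotated deltas together; any "1" bit marks an inconsistency in the group.
inline int estimateGroupBits(const uint8_t deltas[16], int r) {
    uint8_t m = 0;
    for (size_t i = 0; i < 16; ++i)
        m |= rotl8(deltas[i], r);
    int bits = 0;
    while (m) { ++bits; m >>= 1; } // index of the highest set bit, plus one
    return bits;
}

// Pick the rotation that minimizes the estimated group size.
inline int bestRotation(const uint8_t deltas[16]) {
    int best = 0, bestBits = estimateGroupBits(deltas, 0);
    for (int r = 1; r < 8; ++r) {
        int bits = estimateGroupBits(deltas, r);
        if (bits < bestBits) { bestBits = bits; best = r; }
    }
    return best;
}
```

For example, a group where every delta only has the top bit set is found to compress best under a rotation by 1, without ever encoding a candidate bitstream.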

This centralizes the estimation overhead and makes it easier to profile
and understand the code.

Also, instead of guarding against vertex count in estimate*, move that
to the call site.

We do not need to analyze groups that have all bits in an equally
consistent state; a branch here ends up being slightly beneficial on
average.

Since we can't address this table using implicit address math in either
layout based on hbits, we might as well use a shorter and more readable
layout where shifts and sentinels go together.

Instead of calling a generic encodeBytesMeasure twice, we can note that
it redundantly recomputes almost all of the information: in v1, bits
1/2/4 are available to both control modes, so we just need to compute 0
and 8 sizes - of which, 8 is a constant size. So we can compute the
sizes for both streams in parallel.

Doing this preserves the bitstream exactly, but results in ~20% faster
encoding at level 1.
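
As a rough self-contained illustration of the single-pass idea (the constants and cost model here are hypothetical simplifications, not the actual v1 layout):

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical single-pass measurement: a group encoded at k bits per byte
// costs k bits per value, plus a full extra byte for every value that exceeds
// the largest non-sentinel value (2^k - 2).
struct BitSizes {
    size_t bits1, bits2, bits4; // total bits at 1/2/4 bits per byte
};

inline BitSizes measureWidths(const unsigned char* data, size_t n) {
    BitSizes s = {n * 1, n * 2, n * 4}; // base cost: width bits per byte
    for (size_t i = 0; i < n; ++i) {
        unsigned char v = data[i];
        // sentinel escapes append the raw byte (8 bits) to a side stream
        s.bits1 += (v >= 1) ? 8 : 0;  // 1-bit groups fit only the value 0
        s.bits2 += (v >= 3) ? 8 : 0;  // 2-bit groups fit 0..2
        s.bits4 += (v >= 15) ? 8 : 0; // 4-bit groups fit 0..14
    }
    return s;
}
```

All candidate widths are accumulated in the same loop, so the data is scanned once instead of once per control mode.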

The sentinel branch is difficult to predict; since we have enough space
to encode all bytes as sentinels due to decode limit padding, it is safe
to append every byte unconditionally and move the pointer for out of
range values. This accelerates all levels of encoding further, up to 30%
for levels 0 and 1.

Also rework encodeBytes flow to call the function just once; this mostly
just makes sure the compiler can inline it without issues, as otherwise
the function is too large to be inlined into two separate paths.
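
The general pattern, shown as a self-contained sketch rather than the actual codec code: store every byte unconditionally and fold the condition into the pointer increment, which replaces the unpredictable branch with a data dependency:

```cpp
#include <cassert>
#include <cstddef>

// Branch-free compaction: store every byte unconditionally and advance the
// write pointer only when the byte is kept. `out` must have room for all n
// bytes, mirroring the decode-limit padding that makes this safe in the codec.
inline size_t appendExceptions(unsigned char* out, const unsigned char* in, size_t n, unsigned char threshold) {
    unsigned char* w = out;
    for (size_t i = 0; i < n; ++i) {
        *w = in[i];               // unconditional store
        w += (in[i] > threshold); // advance only for out-of-range values
    }
    return size_t(w - out);
}
```

Discarded bytes are simply overwritten by the next iteration, so no conditional store is needed.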

Deltas requires level 2; while this is currently the default, it's
better to be explicit to avoid losing coverage.

BitXor requires level 3; the previously specified level was incorrect so
the code was not exercised properly.

Also adjust BitXor to select xor for two last channels instead of just
one.

This produces the same encoding in practice, but it is somewhat cleaner
for potential future expansions of channel encoding, as it leaves the
full 4 lower bits to store additional modes.

We test all 4 levels for the new version to check that no level has
encoding issues.

estimateBits takes unsigned char, so we need to explicitly truncate to
silence the conversion warning.

After previous changes, encodeBytesMeasure is no longer used by any
function other than estimateChannel. Inlining it into estimateChannel
allows us to simplify the code and improves optimization, as an explicit
measure is faster than table selection in practice. This also allows us
to drop one of the bit group modes to gain extra performance in the
future.

In addition we also fix last_vertex handling (this was incorrectly using
the first vertex for all blocks instead of last vertex of the previous
block) and reduce memset overhead by limiting it to the last (partial)
block.

Instead of analyzing every block, we can look at a subset of blocks and
assume that the statistics of the data in different blocks are
reasonably close. This is a little brute-force, but it gets almost the
same compression results on a variety of files, so for now we do this
unconditionally at every level, which significantly increases the
encoding throughput of levels 2 and 3.
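
In outline (a hypothetical helper, much simplified from the actual heuristic), the estimator visits every k-th block instead of all of them:

```cpp
#include <cassert>
#include <cstddef>

// Estimate a per-block statistic from a strided subset of blocks, assuming
// block statistics are similar enough that every k-th block is representative.
template <typename Stat>
double sampleBlocks(size_t block_count, size_t stride, Stat stat) {
    if (stride == 0)
        stride = 1; // guard against a degenerate stride
    double total = 0;
    size_t sampled = 0;
    for (size_t i = 0; i < block_count; i += stride) {
        total += stat(i); // stat(i) returns the statistic for block i
        ++sampled;
    }
    return sampled ? total / double(sampled) : 0.0;
}
```

The analysis cost drops roughly by the stride factor while the chosen encoding stays close to what a full scan would pick.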

Make sure the level is not negative and that the usage doesn't contain
mistakes such as passing vertex_size as the level instead. We allow the
0..9 range to leave room for possible future expansion of the current
0..3 range.

Instead of having three versions of zigzag, one per type, we can use a
template, similarly to unzigzag. This produces roughly the same code
with less duplication.
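
A template along these lines might look like the following sketch (the actual implementation may differ in detail):

```cpp
#include <cassert>
#include <cstdint>

// ZigZag-encode a two's-complement delta stored in an unsigned type:
// 0, -1, 1, -2, 2, ... map to 0, 1, 2, 3, 4, ...
template <typename T>
inline T zigzag(T v) {
    return T((0 - (v >> (sizeof(T) * 8 - 1))) ^ (v << 1));
}

// Inverse transform, expanding bit 0 back into the sign.
template <typename T>
inline T unzigzag(T v) {
    return T((0 - (v & 1)) ^ (v >> 1));
}
```

The same two functions instantiate for 8-, 16-, and 32-bit channels, replacing three hand-written copies.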
@zeux zeux merged commit b1d7cf5 into master Dec 20, 2024
12 checks passed
@zeux zeux deleted the vcone-level branch December 20, 2024 17:02