Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More triangles/vertices per meshlet #15023

Merged
merged 9 commits into from
Sep 8, 2024

Conversation

JMS55
Copy link
Contributor

@JMS55 JMS55 commented Sep 3, 2024

Builder changes

  • Increased meshlet max vertices/triangles from 64v/64t to 255v/128t (meshoptimizer won't allow 256v sadly). This gives us a much greater percentage of meshlets with max triangle count (128). Still not perfect, we still end up with some tiny <=10 triangle meshlets that never really get simplified, but it's progress.
  • Removed the error target limit. Now we allow meshoptimizer to simplify as much as possible. No reason to cap this out, as the cluster culling code will choose a good LOD level anyways. Again leads to higher quality LOD trees.
  • After some discussion and consulting the Nanite slides again, changed meshlet group error from adding the max child's error to the group error, to doing group_error = max(group_error, max_child_error). Error is already cumulative between LODs as the edges we're collapsing during simplification get longer each time.
  • Bumped the 65% simplification threshold to allow up to 95% of the original geometry (e.g. accept simplification as valid even if we only simplified 5% of the triangles). This gives us closer to log2(initial_meshlet_count) LOD levels, and fewer meshlet roots in the DAG.

Still more work to be done in the future here. Maybe trying METIS for meshlet building instead of meshoptimizer.

Using ~8 clusters per group instead of ~4 might also make a big difference. The Nanite slides say that they have 8-32 meshlets per group, suggesting some kind of heuristic. Unfortunately meshopt's compute_cluster_bounds won't work with large groups atm (zeux/meshoptimizer#750 (comment)) so hard to test.

Based on discussion from #14998, zeux/meshoptimizer#750, and discord.

Runtime changes

  • cluster:triangle packed IDs are now stored 25:7 instead of 26:6 bits, as max triangles per cluster are now 128 instead of 64
  • Hardware raster now spawns 128 * 3 vertices instead of 64 * 3 vertices to account for the new max triangles limit
    • Hardware raster now outputs NaN triangles (0 / 0) instead of zero-positioned triangles for extra vertex invocations over the cluster triangle count. Shouldn't really be a difference idt, but I did it anyways.
  • Software raster now does 128 threads per workgroup instead of 64 threads. Each thread now loads, projects, and caches a vertex (vertices 0-127), and then if needed does so again (vertices 128-254). Each thread then rasterizes one of 128 triangles.
  • Fixed a bug with needs_dispatch_remap. I had the condition backwards in my last PR, I probably committed it by accident after testing the non-default code path on my GPU.

@JMS55 JMS55 requested a review from atlv24 September 3, 2024 00:34
@JMS55 JMS55 added the A-Rendering Drawing game state to the screen label Sep 3, 2024
@JMS55 JMS55 added this to the 0.15 milestone Sep 3, 2024
@alice-i-cecile alice-i-cecile added C-Performance A change motivated by improving speed, memory usage or compile times C-Bug An unexpected or incorrect behavior labels Sep 3, 2024
@alice-i-cecile alice-i-cecile added the S-Needs-Review Needs reviewer attention (from anyone!) to move forward label Sep 3, 2024
@JMS55
Copy link
Contributor Author

JMS55 commented Sep 3, 2024

@zeux @Scthe not asking for your review, but tagging you as you may be interested in this PR.

@zeux
Copy link
Contributor

zeux commented Sep 3, 2024

255v/256t

nit: 255v/128t (the code is correct, the PR description is not)

@JMS55
Copy link
Contributor Author

JMS55 commented Sep 3, 2024

Thanks for catching that, updated the PR description.

let world_from_local = affine3_to_square(instance_uniform.world_from_local);

// Load and project 1 vertex per thread, and then again if there are more than 128 vertices in the meshlet
for (var i = 0u; i <= 128u; i += 128u) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: up to you but it feels like i < 255u would be more clear; in general, this loop is

for (i = 0; i < max_vertices; i += workgroup_size)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, not quite the way I thought of it. My thought process was "we want to do this twice, with the second time using an offset of 128".

// Simplify the group to ~50% triangle count
// TODO: Simplify using vertex attributes
let mut error = 0.0;
let simplified_group_indices = simplify(
&group_indices,
vertices,
group_indices.len() / 2,
target_error,
f32::MAX,
Copy link

@Scthe Scthe Sep 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After this change, is there a scenario where both of the following happen:

  1. simplification fails to reach target triangle count,
  2. there are still leftover vertices that are not shared with other meshlets?

For Nanite simplification, all vertices mentioned in point 2 could probably be removed. While this might introduce artefacts, the error metric should show more detailed meshlets in such cases.

I suspect that to produce an optimal DAG ("Batched Multi Triangulation" section 3.2) you should always remove as many triangles as possible to reach 256 triangles (2 meshlets). The only constraint is Nanite's immutable shared vertices between meshlets. Any problems caused by simplification are shoved into the error metric.

See #14998 (comment) for context.

@Scthe
Copy link

Scthe commented Sep 4, 2024

I've noticed You have a following line:

if simplified_group_indices.len() as f32 / group_indices.len() as f32 > 0.65 {

Is the 65% a heuristic? You could also try 75%. If you had 4 meshlets, reduction by 25% ends up with 3 meshlets. In the next iteration METIS might "redistribute" these meshlets. In my experience this is a mixed bag.

This change should reduce number of roots at a cost of less uniform meshlet size and a bigger overall DAG.

It's also interesting to see what happens if you remove this code path (or just comment out return None). At some point it's an infinite loop where each iteration has same meshlet count as before. Striving for (at minimum) 25% reduction prevents this scenario. There are some subtleties here to relax this number, but it's probably a problem with simplification if you cannot remove even 25% of tris. When does this happen? E.g. meshoptimizer cannot simplify meshes with flat shading, where all vertices unique (wrt. to position and normal combination).

// Simplify the group to ~50% triangle count
// TODO: Simplify using vertex attributes
let mut error = 0.0;
let simplified_group_indices = simplify(
&group_indices,
vertices,
group_indices.len() / 2,
target_error,
f32::MAX,
SimplifyOptions::LockBorder | SimplifyOptions::Sparse | SimplifyOptions::ErrorAbsolute,
Copy link

@Scthe Scthe Sep 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Context: #14998 (comment) , section "What is a border?".

I think that meshoptimizer's definition of border vertex can be relaxed for Nanite. It includes vertices that are not shared with other meshlets (see image in the post lined above). You could provide locked vertices (const unsigned char* vertex_lock) to e.g. meshopt_simplifyWithAttributes(). This might lead to visual artefacts, but wouldn't that be an issue with an error metric instead?

Copy link
Contributor Author

@JMS55 JMS55 Sep 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what open edge means in this context (google doesn't show anything). But check the comment I left in that thread - I'm not sure that we're on the same page about which vertices need locking.

@JMS55
Copy link
Contributor Author

JMS55 commented Sep 5, 2024

if simplified_group_indices.len() as f32 / group_indices.len() as f32 > 0.65 {

I ran the mesh converter on the "Huge Icelandic Lava Cliff" mesh from quixel megascans, with various percentages. Here's the data: https://pastebin.com/raw/pvQPfbZG. It's structured as key:value pairs where the key is a triangle count [0, 128], and the value is the number of meshlets in the current LOD level that have that many triangles.

I'm not sure what you would say is "best". 50% has less variance, but also less LOD levels compared to higher percentages. I think ideally we want log2(lod_0_meshlet_count) levels, so ~14 total here? So maybe higher percentage is better? Perhaps 99%, or basically as long as we simplified something (maybe one meshlet's worth?).

@JMS55
Copy link
Contributor Author

JMS55 commented Sep 5, 2024

Btw, here's the git patch (on top of the previous commit) that I use for testing https://pastebin.com/raw/mFKrCtYr.

Run it with cargo run --example meshlet --features meshlet,meshlet_processor --release. It'll print the triangle count statistics and save the newly created meshlet mesh, and then exit. If you want to visualize the meshlet mesh, discard the changes to meshlet.rs and then rerun the same command.

@JMS55 JMS55 requested a review from IceSentry September 6, 2024 01:45
@JMS55
Copy link
Contributor Author

JMS55 commented Sep 6, 2024

I think I want to ship these improvements for now, and I'll continue to experiment on my own, and open another PR if I end up finding further improvements.

Copy link
Member

@alice-i-cecile alice-i-cecile left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy with the level of review from the other reviewers, and there's nothing concerning in here.

@alice-i-cecile alice-i-cecile added this pull request to the merge queue Sep 8, 2024
@alice-i-cecile alice-i-cecile added S-Ready-For-Final-Review This PR has been approved by the community. It's ready for a maintainer to consider merging it and removed S-Needs-Review Needs reviewer attention (from anyone!) to move forward labels Sep 8, 2024
Merged via the queue into bevyengine:main with commit a0faf9c Sep 8, 2024
31 checks passed
BD103 pushed a commit to BD103/bevy that referenced this pull request Sep 9, 2024
### Builder changes
- Increased meshlet max vertices/triangles from 64v/64t to 255v/128t
(meshoptimizer won't allow 256v sadly). This gives us a much greater
percentage of meshlets with max triangle count (128). Still not perfect,
we still end up with some tiny <=10 triangle meshlets that never really
get simplified, but it's progress.
- Removed the error target limit. Now we allow meshoptimizer to simplify
as much as possible. No reason to cap this out, as the cluster culling
code will choose a good LOD level anyways. Again leads to higher quality
LOD trees.
- After some discussion and consulting the Nanite slides again, changed
meshlet group error from _adding_ the max child's error to the group
error, to doing `group_error = max(group_error, max_child_error)`. Error
is already cumulative between LODs as the edges we're collapsing during
simplification get longer each time.
- Bumped the 65% simplification threshold to allow up to 95% of the
original geometry (e.g. accept simplification as valid even if we only
simplified 5% of the triangles). This gives us closer to
log2(initial_meshlet_count) LOD levels, and fewer meshlet roots in the
DAG.

Still more work to be done in the future here. Maybe trying METIS for
meshlet building instead of meshoptimizer.

Using ~8 clusters per group instead of ~4 might also make a big
difference. The Nanite slides say that they have 8-32 meshlets per
group, suggesting some kind of heuristic. Unfortunately meshopt's
compute_cluster_bounds won't work with large groups atm
(zeux/meshoptimizer#750 (comment))
so hard to test.

Based on discussion from
bevyengine#14998,
zeux/meshoptimizer#750, and discord.

### Runtime changes
- cluster:triangle packed IDs are now stored 25:7 instead of 26:6 bits,
as max triangles per cluster are now 128 instead of 64
- Hardware raster now spawns 128 * 3 vertices instead of 64 * 3 vertices
to account for the new max triangles limit
- Hardware raster now outputs NaN triangles (0 / 0) instead of
zero-positioned triangles for extra vertex invocations over the cluster
triangle count. Shouldn't really be a difference idt, but I did it
anyways.
- Software raster now does 128 threads per workgroup instead of 64
threads. Each thread now loads, projects, and caches a vertex (vertices
0-127), and then if needed does so again (vertices 128-254). Each thread
then rasterizes one of 128 triangles.
- Fixed a bug with `needs_dispatch_remap`. I had the condition backwards
in my last PR, I probably committed it by accident after testing the
non-default code path on my GPU.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-Rendering Drawing game state to the screen C-Bug An unexpected or incorrect behavior C-Performance A change motivated by improving speed, memory usage or compile times S-Ready-For-Final-Review This PR has been approved by the community. It's ready for a maintainer to consider merging it
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants