More triangles/vertices per meshlet #15023

JMS55 · 2024-09-03T00:34:25Z

Builder changes

Increased meshlet max vertices/triangles from 64v/64t to 255v/128t (meshoptimizer won't allow 256v sadly). This gives us a much greater percentage of meshlets with max triangle count (128). Still not perfect, we still end up with some tiny <=10 triangle meshlets that never really get simplified, but it's progress.
Removed the error target limit. Now we allow meshoptimizer to simplify as much as possible. No reason to cap this out, as the cluster culling code will choose a good LOD level anyways. Again leads to higher quality LOD trees.
After some discussion and consulting the Nanite slides again, changed meshlet group error from adding the max child's error to the group error, to doing group_error = max(group_error, max_child_error). Error is already cumulative between LODs as the edges we're collapsing during simplification get longer each time.
Bumped the 65% simplification threshold to allow up to 95% of the original geometry (e.g. accept simplification as valid even if we only simplified 5% of the triangles). This gives us closer to log2(initial_meshlet_count) LOD levels, and fewer meshlet roots in the DAG.

Still more work to be done in the future here. Maybe trying METIS for meshlet building instead of meshoptimizer.

Using ~8 clusters per group instead of ~4 might also make a big difference. The Nanite slides say that they have 8-32 meshlets per group, suggesting some kind of heuristic. Unfortunately meshopt's compute_cluster_bounds won't work with large groups atm (zeux/meshoptimizer#750 (comment)) so hard to test.

Based on discussion from #14998, zeux/meshoptimizer#750, and discord.

Runtime changes

cluster:triangle packed IDs are now stored 25:7 instead of 26:6 bits, as max triangles per cluster are now 128 instead of 64
Hardware raster now spawns 128 * 3 vertices instead of 64 * 3 vertices to account for the new max triangles limit
- Hardware raster now outputs NaN triangles (0 / 0) instead of zero-positioned triangles for extra vertex invocations over the cluster triangle count. Shouldn't really be a difference idt, but I did it anyways.
Software raster now does 128 threads per workgroup instead of 64 threads. Each thread now loads, projects, and caches a vertex (vertices 0-127), and then if needed does so again (vertices 128-254). Each thread then rasterizes one of 128 triangles.
Fixed a bug with needs_dispatch_remap. I had the condition backwards in my last PR, I probably committed it by accident after testing the non-default code path on my GPU.

JMS55 · 2024-09-03T00:39:06Z

@zeux @Scthe not asking for your review, but tagging you as you may be interested in this PR.

zeux · 2024-09-03T00:58:24Z

255v/256t

nit: 255v/128t (the code is correct, the PR description is not)

JMS55 · 2024-09-03T01:08:13Z

Thanks for catching that, updated the PR description.

zeux · 2024-09-03T02:58:06Z

crates/bevy_pbr/src/meshlet/visibility_buffer_software_raster.wgsl

+ let world_from_local = affine3_to_square(instance_uniform.world_from_local);
+
+ // Load and project 1 vertex per thread, and then again if there are more than 128 vertices in the meshlet
+ for (var i = 0u; i <= 128u; i += 128u) {


nit: up to you but it feels like i < 255u would be more clear; in general, this loop is

for (i = 0; i < max_vertices; i += workgroup_size)

Ah, not quite the way I thought of it. My thought process was "we want to do this twice, with the second time using an offset of 128".

Scthe · 2024-09-04T21:08:44Z

crates/bevy_pbr/src/meshlet/from_mesh.rs

 // Simplify the group to ~50% triangle count
 // TODO: Simplify using vertex attributes
 let mut error = 0.0;
 let simplified_group_indices = simplify(
 &group_indices,
 vertices,
 group_indices.len() / 2,
- target_error,
+ f32::MAX,


After this change, is there a scenario where both of the following happen:

simplification fails to reach target triangle count,

there are still leftover vertices that are not shared with other meshlets?

For Nanite simplification, all vertices mentioned in point 2 could probably be removed. While this might introduce artefacts, the error metric should show more detailed meshlets in such cases.

I suspect that to produce an optimal DAG ("Batched Multi Triangulation" section 3.2) you should always remove as many triangles as possible to reach 256 triangles (2 meshlets). The only constraint is Nanite's immutable shared vertices between meshlets. Any problems caused by simplification are shoved into the error metric.

See #14998 (comment) for context.

Scthe · 2024-09-04T21:12:11Z

I've noticed You have a following line:

if simplified_group_indices.len() as f32 / group_indices.len() as f32 > 0.65 {

Is the 65% a heuristic? You could also try 75%. If you had 4 meshlets, reduction by 25% ends up with 3 meshlets. In the next iteration METIS might "redistribute" these meshlets. In my experience this is a mixed bag.

This change should reduce number of roots at a cost of less uniform meshlet size and a bigger overall DAG.

It's also interesting to see what happens if you remove this code path (or just comment out return None). At some point it's an infinite loop where each iteration has same meshlet count as before. Striving for (at minimum) 25% reduction prevents this scenario. There are some subtleties here to relax this number, but it's probably a problem with simplification if you cannot remove even 25% of tris. When does this happen? E.g. meshoptimizer cannot simplify meshes with flat shading, where all vertices unique (wrt. to position and normal combination).

Scthe · 2024-09-04T21:30:04Z

crates/bevy_pbr/src/meshlet/from_mesh.rs

 // Simplify the group to ~50% triangle count
 // TODO: Simplify using vertex attributes
 let mut error = 0.0;
 let simplified_group_indices = simplify(
 &group_indices,
 vertices,
 group_indices.len() / 2,
- target_error,
+ f32::MAX,
 SimplifyOptions::LockBorder | SimplifyOptions::Sparse | SimplifyOptions::ErrorAbsolute,


Context: #14998 (comment) , section "What is a border?".

I think that meshoptimizer's definition of border vertex can be relaxed for Nanite. It includes vertices that are not shared with other meshlets (see image in the post lined above). You could provide locked vertices (const unsigned char* vertex_lock) to e.g. meshopt_simplifyWithAttributes(). This might lead to visual artefacts, but wouldn't that be an issue with an error metric instead?

I'm not sure what open edge means in this context (google doesn't show anything). But check the comment I left in that thread - I'm not sure that we're on the same page about which vertices need locking.

JMS55 · 2024-09-05T03:56:17Z

if simplified_group_indices.len() as f32 / group_indices.len() as f32 > 0.65 {

I ran the mesh converter on the "Huge Icelandic Lava Cliff" mesh from quixel megascans, with various percentages. Here's the data: https://pastebin.com/raw/pvQPfbZG. It's structured as key:value pairs where the key is a triangle count [0, 128], and the value is the number of meshlets in the current LOD level that have that many triangles.

I'm not sure what you would say is "best". 50% has less variance, but also less LOD levels compared to higher percentages. I think ideally we want log2(lod_0_meshlet_count) levels, so ~14 total here? So maybe higher percentage is better? Perhaps 99%, or basically as long as we simplified something (maybe one meshlet's worth?).

JMS55 · 2024-09-05T04:32:05Z

Btw, here's the git patch (on top of the previous commit) that I use for testing https://pastebin.com/raw/mFKrCtYr.

Run it with cargo run --example meshlet --features meshlet,meshlet_processor --release. It'll print the triangle count statistics and save the newly created meshlet mesh, and then exit. If you want to visualize the meshlet mesh, discard the changes to meshlet.rs and then rerun the same command.

JMS55 · 2024-09-06T01:46:36Z

I think I want to ship these improvements for now, and I'll continue to experiment on my own, and open another PR if I end up finding further improvements.

alice-i-cecile

I'm happy with the level of review from the other reviewers, and there's nothing concerning in here.

### Builder changes - Increased meshlet max vertices/triangles from 64v/64t to 255v/128t (meshoptimizer won't allow 256v sadly). This gives us a much greater percentage of meshlets with max triangle count (128). Still not perfect, we still end up with some tiny <=10 triangle meshlets that never really get simplified, but it's progress. - Removed the error target limit. Now we allow meshoptimizer to simplify as much as possible. No reason to cap this out, as the cluster culling code will choose a good LOD level anyways. Again leads to higher quality LOD trees. - After some discussion and consulting the Nanite slides again, changed meshlet group error from _adding_ the max child's error to the group error, to doing `group_error = max(group_error, max_child_error)`. Error is already cumulative between LODs as the edges we're collapsing during simplification get longer each time. - Bumped the 65% simplification threshold to allow up to 95% of the original geometry (e.g. accept simplification as valid even if we only simplified 5% of the triangles). This gives us closer to log2(initial_meshlet_count) LOD levels, and fewer meshlet roots in the DAG. Still more work to be done in the future here. Maybe trying METIS for meshlet building instead of meshoptimizer. Using ~8 clusters per group instead of ~4 might also make a big difference. The Nanite slides say that they have 8-32 meshlets per group, suggesting some kind of heuristic. Unfortunately meshopt's compute_cluster_bounds won't work with large groups atm (zeux/meshoptimizer#750 (comment)) so hard to test. Based on discussion from bevyengine#14998, zeux/meshoptimizer#750, and discord. ### Runtime changes - cluster:triangle packed IDs are now stored 25:7 instead of 26:6 bits, as max triangles per cluster are now 128 instead of 64 - Hardware raster now spawns 128 * 3 vertices instead of 64 * 3 vertices to account for the new max triangles limit - Hardware raster now outputs NaN triangles (0 / 0) instead of zero-positioned triangles for extra vertex invocations over the cluster triangle count. Shouldn't really be a difference idt, but I did it anyways. - Software raster now does 128 threads per workgroup instead of 64 threads. Each thread now loads, projects, and caches a vertex (vertices 0-127), and then if needed does so again (vertices 128-254). Each thread then rasterizes one of 128 triangles. - Fixed a bug with `needs_dispatch_remap`. I had the condition backwards in my last PR, I probably committed it by accident after testing the non-default code path on my GPU.

JMS55 added 3 commits September 2, 2024 11:55

WIP

f565cd5

Fix bug from previous PR

559535e

Fix broken software raster vertex cache

0866f4a

JMS55 requested a review from atlv24 September 3, 2024 00:34

JMS55 added the A-Rendering Drawing game state to the screen label Sep 3, 2024

JMS55 added this to the 0.15 milestone Sep 3, 2024

alice-i-cecile added C-Performance A change motivated by improving speed, memory usage or compile times C-Bug An unexpected or incorrect behavior labels Sep 3, 2024

Fix comment typo

852923e

alice-i-cecile added the S-Needs-Review Needs reviewer attention (from anyone!) to move forward label Sep 3, 2024

Add note about 255 vertex limit

e0180b9

Uncommit bunny

8bc76c6

zeux reviewed Sep 3, 2024

View reviewed changes

atlv24 approved these changes Sep 4, 2024

View reviewed changes

Scthe reviewed Sep 4, 2024

View reviewed changes

Change 65 -> 95% simplification threshold

ffb3d3a

JMS55 added 2 commits September 5, 2024 15:52

Add TODO

e9f3029

Add another TODO

ab4c870

JMS55 requested a review from IceSentry September 6, 2024 01:45

JMS55 mentioned this pull request Sep 7, 2024

Meshlet screenspace-derived tangents #15084

Merged

alice-i-cecile approved these changes Sep 8, 2024

View reviewed changes

alice-i-cecile added this pull request to the merge queue Sep 8, 2024

alice-i-cecile added S-Ready-For-Final-Review This PR has been approved by the community. It's ready for a maintainer to consider merging it and removed S-Needs-Review Needs reviewer attention (from anyone!) to move forward labels Sep 8, 2024

Merged via the queue into bevyengine:main with commit a0faf9c Sep 8, 2024
31 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

More triangles/vertices per meshlet #15023

More triangles/vertices per meshlet #15023

JMS55 commented Sep 3, 2024 •

edited

Loading

JMS55 commented Sep 3, 2024

zeux commented Sep 3, 2024

JMS55 commented Sep 3, 2024

zeux Sep 3, 2024

JMS55 Sep 5, 2024

Scthe Sep 4, 2024 •

edited

Loading

Scthe commented Sep 4, 2024 •

edited

Loading

Scthe Sep 4, 2024 •

edited

Loading

JMS55 Sep 5, 2024 •

edited

Loading

JMS55 commented Sep 5, 2024

JMS55 commented Sep 5, 2024

JMS55 commented Sep 6, 2024

alice-i-cecile left a comment

More triangles/vertices per meshlet #15023

More triangles/vertices per meshlet #15023

Conversation

JMS55 commented Sep 3, 2024 • edited Loading

Builder changes

Runtime changes

JMS55 commented Sep 3, 2024

zeux commented Sep 3, 2024

JMS55 commented Sep 3, 2024

zeux Sep 3, 2024

Choose a reason for hiding this comment

JMS55 Sep 5, 2024

Choose a reason for hiding this comment

Scthe Sep 4, 2024 • edited Loading

Choose a reason for hiding this comment

Scthe commented Sep 4, 2024 • edited Loading

Scthe Sep 4, 2024 • edited Loading

Choose a reason for hiding this comment

JMS55 Sep 5, 2024 • edited Loading

Choose a reason for hiding this comment

JMS55 commented Sep 5, 2024

JMS55 commented Sep 5, 2024

JMS55 commented Sep 6, 2024

alice-i-cecile left a comment

Choose a reason for hiding this comment

JMS55 commented Sep 3, 2024 •

edited

Loading

Scthe Sep 4, 2024 •

edited

Loading

Scthe commented Sep 4, 2024 •

edited

Loading

Scthe Sep 4, 2024 •

edited

Loading

JMS55 Sep 5, 2024 •

edited

Loading