2020.07.06 Meeting Notes: Performance Call
See discussion at: https://github.com/lanl/parthenon/issues/189
32^3 blocks vs. 256^3 blocks: roughly 3x overhead for zone cycles when switching to a "big" buffer packing kernel.
It's worth noting the "big" kernel doesn't allow for per-variable MPI asynchrony.
Some ideas:
- Potentially communicate "variable packs" instead - gets you some MPI asynchrony
- There may be enough mesh blocks that you don't need per-variable communication
- Could potentially pack variables into a single message to make up for the loss of overlap - see the sketch below
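A minimal sketch of that single-message idea, assuming Kokkos plus MPI; `PackForSend` and `SendPacked` are hypothetical names rather than Parthenon's API, and handing the device buffer straight to MPI assumes a CUDA-aware MPI build:

```cpp
#include <Kokkos_Core.hpp>
#include <mpi.h>
#include <vector>

using View3D = Kokkos::View<double ***>;

// Copy each variable into one flat device buffer (one kernel launch per
// variable here; the "big kernel" idea would fuse these into a single launch).
// Assumes all variables share the same shape.
Kokkos::View<double *> PackForSend(const std::vector<View3D> &vars) {
  const std::size_t cells = vars.empty() ? 0 : vars[0].size();
  Kokkos::View<double *> buf("send_buf", vars.size() * cells);
  for (std::size_t v = 0; v < vars.size(); ++v) {
    auto var = vars[v];
    auto slot =
        Kokkos::subview(buf, Kokkos::make_pair(v * cells, (v + 1) * cells));
    Kokkos::parallel_for(
        "pack_var", static_cast<int>(cells), KOKKOS_LAMBDA(const int idx) {
          const int nj = var.extent_int(1), ni = var.extent_int(2);
          const int k = idx / (nj * ni);
          const int j = (idx / ni) % nj;
          const int i = idx % ni;
          slot(idx) = var(k, j, i);
        });
  }
  Kokkos::fence();
  return buf;
}

// One MPI_Isend for all variables instead of one message per variable.
// Passing buf.data() directly assumes CUDA-aware MPI when buf is device memory.
MPI_Request SendPacked(const Kokkos::View<double *> &buf, int dest, int tag) {
  MPI_Request req;
  MPI_Isend(buf.data(), static_cast<int>(buf.size()), MPI_DOUBLE, dest, tag,
            MPI_COMM_WORLD, &req);
  return req;
}
```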
Low hanging fruit:
- Verify that data access/writes are coalesced in pgrete/pack-in-one (see the micro-benchmark sketch after this list)
- FindMeshBlock performance rears its head when targeting smaller mesh blocks. There's a fix in the development version of Athena already - @pgrete will share a patch file in https://github.com/lanl/parthenon/pull/213
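A minimal micro-benchmark sketch along those lines (illustrative only, not code from pgrete/pack-in-one): it times a packing kernel whose fastest-varying index matches the contiguous dimension of the source view against one that strides across it.

```cpp
#include <Kokkos_Core.hpp>
#include <cstdio>

// Pack var(k,j,i) into a flat buffer. If contiguous_fast is true the flat
// index varies fastest in i, the contiguous dimension of LayoutRight, so
// neighboring work items read neighboring memory; otherwise reads stride
// by n*n doubles.
double TimePack(bool contiguous_fast) {
  const int n = 256;
  Kokkos::View<double ***, Kokkos::LayoutRight> var("var", n, n, n);
  Kokkos::View<double *> buf("buf", n * n * n);
  Kokkos::Timer timer;
  for (int rep = 0; rep < 10; ++rep) {
    Kokkos::parallel_for(
        "pack", n * n * n, KOKKOS_LAMBDA(const int idx) {
          int i, j, k;
          if (contiguous_fast) {
            k = idx / (n * n); j = (idx / n) % n; i = idx % n;
          } else {
            i = idx / (n * n); j = (idx / n) % n; k = idx % n;
          }
          buf(idx) = var(k, j, i);
        });
  }
  Kokkos::fence();
  return timer.seconds();
}

int main(int argc, char *argv[]) {
  Kokkos::initialize(argc, argv);
  {
    printf("contiguous reads: %g s\n", TimePack(true));
    printf("strided reads:    %g s\n", TimePack(false));
  }
  Kokkos::finalize();
  return 0;
}
```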
Considerations:
- Streams, and therefore MeshBlocks, cannot be shared between host threads, since Kokkos has issues with sharing streams across threads (see the sketch below)
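A minimal sketch of the alternative, assuming the CUDA backend: each MeshBlock gets its own CUDA stream wrapped in its own `Kokkos::Cuda` execution space instance, with all launches issued from a single host thread. The function and view names are illustrative.

```cpp
#include <Kokkos_Core.hpp>
#include <cuda_runtime.h>
#include <vector>

// Launch one kernel per "mesh block", each on its own stream, all issued from
// a single host thread; kernels on different streams may overlap on the device.
void LaunchPerBlockWork(int num_blocks, int ncells) {
  std::vector<cudaStream_t> streams(num_blocks);
  std::vector<Kokkos::Cuda> execs;
  std::vector<Kokkos::View<double *>> data;
  for (int b = 0; b < num_blocks; ++b) {
    cudaStreamCreate(&streams[b]);
    execs.emplace_back(streams[b]);          // one instance per block's stream
    data.emplace_back("block_data", ncells);
  }
  for (int b = 0; b < num_blocks; ++b) {
    auto view = data[b];
    Kokkos::parallel_for(
        "per_block_work",
        Kokkos::RangePolicy<Kokkos::Cuda>(execs[b], 0, ncells),
        KOKKOS_LAMBDA(const int i) { view(i) = static_cast<double>(b); });
  }
  Kokkos::fence();  // wait for every stream before, e.g., posting MPI messages
  execs.clear();    // drop the execution space instances before their streams
  for (auto &s : streams) cudaStreamDestroy(s);
}
```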
Jim:
- If he were redoing Athena today, he would pack all variables into a single buffer for MPI rather than how it's done today
Next Steps:
- Optimizing a "big kernel" variable pack for cell centered variables that we can get performance with a uniform mesh on a single grid
Threads to pull on:
- Use Kokkos hierarchical parallelism to implement the "big kernel" for packing variables (see the sketch after this list)
- Coalescing reads/writes for buffer packing routines
- How similar are cell-centered and face-centered variables?
- Adding micro-benchmarks
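A minimal sketch of the hierarchical-parallelism idea, assuming a flattened `(variable, k, j, i)` pack and a `(variable, cell)` buffer rather than Parthenon's actual variable pack and boundary buffer types: one team per variable, with the team sweeping that variable's cells so buffer writes land contiguously.

```cpp
#include <Kokkos_Core.hpp>

// Fused "big kernel": one team per variable packs that variable's cells into
// its row of the buffer. LayoutRight makes the last index contiguous, so
// consecutive work items in a team touch consecutive memory.
void PackAllVariables(
    const Kokkos::View<double ****, Kokkos::LayoutRight> &pack,   // (var, k, j, i)
    const Kokkos::View<double **, Kokkos::LayoutRight> &buffer) { // (var, cell)
  using team_policy = Kokkos::TeamPolicy<>;
  const int nvar = pack.extent_int(0);
  const int nk = pack.extent_int(1);
  const int nj = pack.extent_int(2);
  const int ni = pack.extent_int(3);

  Kokkos::parallel_for(
      "pack_all_variables", team_policy(nvar, Kokkos::AUTO),
      KOKKOS_LAMBDA(const team_policy::member_type &team) {
        const int v = team.league_rank();
        Kokkos::parallel_for(Kokkos::TeamThreadRange(team, nk * nj * ni),
                             [&](const int idx) {
                               const int k = idx / (nj * ni);
                               const int j = (idx / ni) % nj;
                               const int i = idx % ni;
                               buffer(v, idx) = pack(v, k, j, i);
                             });
      });
  Kokkos::fence();
}
```

TeamVectorRange, or a flat MDRangePolicy over (variable, cell), are equally plausible shapes for this kernel; which one wins likely depends on block size and the number of variables, which is where the micro-benchmarks come in.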