2021.05.13 Meeting Notes

Agenda

Been redoing the sparse interface after getting feedback from Jonah
Moved communication into the initialization
Sparse variables will now be the same as cell variables: metadata now exists everywhere, and every block and rank will have the same structure.
Also been working on the GPU Hackathon

GPU Hackathon
Spent time testing performance in Phoebus instead of RIOT, because the sparse PR needs to be merged before it can be used fully in RIOT.
Spent time analyzing why Parthenon actually became slower after implementing a fix. Discovered that the old machinery was being called instead of the newer, faster implementation. Fixed the problem and saw a factor of 10 improvement on 16 cubed cells.
Hackathon changes look to have improved the remeshing algorithm by about 30%

Busy implementing 3 components from the GPU Hackathon.
Worked on the remeshing functionality of Parthenon. Previously a large number of allocations were made for buffers, which was slow. With the new PR, a single large allocation is made instead.
Buffer pack in one - was a drop-in replacement for the communication routines, now also possible with cell-centered variables. There is also a new CMake variable that turns the buffer packing machinery on.
Been looking into the buffer packing function and playing around with different patterns, which perform significantly better or worse depending on whether they are run on the device or the host.
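
A minimal sketch of the allocation change described for the remeshing buffers: many small allocations replaced by one large allocation that is carved into per-buffer views. It is written in Python for brevity and is not Parthenon code. The function names and buffer sizes are hypothetical.

```python
# Illustrative sketch only: names and sizes are hypothetical, not Parthenon's API.

def carve_offsets(sizes):
    """Return (offsets, total) so buffer i occupies [offsets[i], offsets[i] + sizes[i])."""
    offsets = []
    total = 0
    for n in sizes:
        offsets.append(total)
        total += n
    return offsets, total


def allocate_individually(sizes):
    """Old pattern: one allocation per buffer."""
    return [bytearray(n) for n in sizes]


def allocate_pooled(sizes):
    """New pattern: one allocation, with each buffer a zero-copy view into it."""
    offsets, total = carve_offsets(sizes)
    pool = memoryview(bytearray(total))
    views = [pool[off:off + n] for off, n in zip(offsets, sizes)]
    return pool, views


sizes = [64, 128, 32]  # hypothetical per-neighbor buffer sizes in bytes
pool, views = allocate_pooled(sizes)
views[1][0] = 7  # writing through a view modifies the shared pool
print(len(pool), [len(v) for v in views], pool[64])
```

The pooled version makes one allocator call regardless of how many buffers are needed. Each buffer is only an offset and a length into the shared block.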

Has been helping with each of the PRs that were submitted from the Hackathon
Has been working on some performance improvements in the Python regression testing framework to make downstream integration with AthenaPK easier.
Also worked to speed up the Python HDF5 diff script, from 5 minutes to a second on some files.

Around 8 hours were spent searching for a bug that was actually not a bug. While the team was changing from multiple allocations to a single large allocation for the buffers, it was noticed that more memory was being used than expected. Max from NVIDIA discovered that the main cause was page management on NVIDIA devices: because the large buffer allocation was a little over 2 megabytes, it created a new page that was barely utilized. Subsequent allocations were not small enough to backfill the new page, which led to wasted memory. The solution is to use a memory pool. Max suggested waiting - there seems to be movement on NVIDIA's side toward implementing something behind the scenes in Kokkos.
Another result from the Hackathon that bears looking into is the difference in performance between using 1 vector with 10 components and 10 vectors each with 1 component. For some reason the 10 vectors with 1 component performed better; this might have something to do with inner loop access patterns.
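
The page arithmetic behind the extra memory use in the "not a bug" item can be sketched as follows. This is a simplified model, not the device allocator's actual behavior: it assumes every allocation is rounded up to whole pages and that the page size is 2 MiB (taken from the "2 megabytes" in the notes). The request size is hypothetical.

```python
# Simplified model: assumes allocations are rounded up to whole pages.
PAGE_BYTES = 2 * 1024 * 1024  # assumed 2 MiB page granularity


def pages_needed(nbytes, page_bytes=PAGE_BYTES):
    """Number of whole pages required to hold nbytes (ceiling division)."""
    return -(-nbytes // page_bytes)


def wasted_bytes(nbytes, page_bytes=PAGE_BYTES):
    """Bytes reserved but unused when nbytes is rounded up to whole pages."""
    return pages_needed(nbytes, page_bytes) * page_bytes - nbytes


request = PAGE_BYTES + 4096  # hypothetical request slightly over one page
print(pages_needed(request), wasted_bytes(request))
```

Under these assumptions, a request just over one page reserves two pages and leaves almost all of the second page unused unless later allocations are small enough to fit in the remainder.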