MPI Scaling fix #894
I did some additional testing for 14k catchments on 96 cores and it took 570s to run the simulation with cache and remotes disabled.
TLDR - netcdf cache and partition remotes hinder mpi scaling
The netcdf per-feature data provider cache is expensive to create and the performance penalty increases with mpi rank count. Depending on the machine, it can be faster to have it disabled at just 2 ranks.
The mid-simulation mpi communication to accumulate nexus outputs adds a lot of overhead that can be deferred to the end of the simulation for another large speedup.
A simulation that takes 916s serially takes 401s on 96 ranks without these changes and 21s with both changes applied.
With one netcdf file per rank, cache enabled, and remotes disabled, the speedup on 96 cores over serial is now ~80x. Without these changes, 96 cores is only ~1x-2x faster than serial.
There are a few different things included in this PR that probably could and should be implemented better than the way I've done them, but I thought it would be better to share what I've got so far before revising it further.
Major Differences
Netcdf Cache
The reason this is so expensive is that it caches the value for every catchment at a given timestep for a given variable. Serially this seems to make things run 1.3x-1.5x faster than with no cache. When using mpi partitions, each process reads the values for all of the catchments, including catchments not in the current mpi rank. The result is a huge amount of IO that gets worse with the number of mpi ranks and the number of catchments.
Profiling showed ~80% of the simulation runtime was read syscalls at 55 mpi ranks. Reading from a netcdf was taking 180s, while per-catchment csv files took 50s.
I tested duplicating the netcdf for each rank, which was worse (~214s), and then splitting the original input up by rank so that only the relevant data would be read. Splitting up the netcdf reduced the simulation time to 39s; disabling the cache entirely also gave a 39s runtime.
I think the machine I did most of my testing on has a large system-level cache, as serially the ngen cache only gives me a <7% performance boost on inputs under 1 GB.
The best performance seems to be splitting up the netcdf by rank and keeping cache enabled, but it's not that much faster than just disabling the cache.
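For anyone who wants to reproduce the split-by-rank test, it can be sketched roughly like this. The dimension and key names here are assumptions (I'm using `catchment_id` for the forcing dimension and the usual `id`/`cat-ids` partition keys); adjust them to match your forcing file and partition config.

```python
# Minimal sketch: split one forcing netcdf into a file per mpi rank so each
# rank only reads the catchments it owns. Names below are placeholders.
import json
import xarray as xr

def split_forcings_by_rank(forcing_path, partition_path,
                           out_template="forcings_rank_{rank}.nc"):
    with open(partition_path) as f:
        partitions = json.load(f)["partitions"]

    with xr.open_dataset(forcing_path) as ds:
        for part in partitions:
            rank = part["id"]
            cat_ids = part["cat-ids"]
            # keep only the catchments assigned to this rank
            subset = ds.sel({"catchment_id": cat_ids})
            subset.to_netcdf(out_template.format(rank=rank))
```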
Remotes partitioning
This might be a bit specific to how I'm running ngen, but for cfe+noah-owp-modular it seems like I can treat each catchment as an individual cell unaffected by those around it; the only communication needed is to sum up the outputs where more than one catchment feeds into a nexus.
Removing the remotes saves the overhead of the mpi communication and decouples the ranks. If a nexus output is written from multiple ranks it will corrupt the output, so I add the mpi rank number to the output file name and make sure each nexus appears at most once per rank. I do this by sorting the catchment-to-nexus pairs by nexus and then assigning them round robin to the partitions. This breaks if the number of partitions is smaller than the maximum number of catchments flowing into a single nexus in your network.
The additional post-processing to rename the output files and sum the per-rank outputs takes a tiny fraction of the time saved by partitioning this way.
For now this is just a python script, as it was much faster for me to write, and the same goes for the script that generates the round robin partitions.
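The round robin assignment itself is simple; here's a rough sketch of the idea (not the actual script in this PR, and the partition dict keys are only illustrative):

```python
# Sketch of the round robin partitioning described above: sort the
# catchment->nexus pairs by nexus, then deal them out to partitions so that
# no nexus appears more than once in any partition. This only holds while
# num_partitions >= the max number of catchments feeding a single nexus.
def round_robin_partitions(cat_to_nex, num_partitions):
    """cat_to_nex: dict of catchment id -> downstream nexus id."""
    pairs = sorted(cat_to_nex.items(), key=lambda kv: kv[1])  # group by nexus
    parts = [{"id": i, "cat-ids": [], "nex-ids": set(), "remote-connections": []}
             for i in range(num_partitions)]
    for i, (cat, nex) in enumerate(pairs):
        part = parts[i % num_partitions]
        part["cat-ids"].append(cat)
        part["nex-ids"].add(nex)
    for part in parts:
        part["nex-ids"] = sorted(part["nex-ids"])  # json-friendly
    return parts
```

Because the pairs for the same nexus are adjacent after sorting, dealing them out round robin lands each one in a different partition, which is why it breaks once a nexus has more contributing catchments than there are partitions.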
Minor changes
Option to disable catchment outputs
I couldn't figure out if there was another way to disable this, so I added a root-level realization config option "write_catchment_output" that redirects the catchment output stream to /dev/null.
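A sketch of what that looks like at the root of the realization config, assuming a boolean flag (the other sections are elided here and unchanged):

```json
{
    "global": { },
    "time": { },
    "write_catchment_output": false
}
```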
MPI Barrier
I noticed before in #846 that the MPI barriers during t-route execution were slowing troute down a bit because troute tries to parallelize without using MPI. A similar but less noticeable effect happens during the model execution phase too. It's less noticeable because there aren't other processes competing for the cores, but when a rank finishes, the poll rate of the mpi barrier busy-wait is so high that it maxes out that core. On my machine this was causing the overall clock speed to drop and the simulation was taking ~5% longer to execute. This seems like something that should be a configurable mpi build arg or environment variable, but I couldn't find one, so I replaced the barriers with non-blocking Ibarriers that poll every 100ms.
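The change itself is in the C++ driver, but the pattern is essentially the following, illustrated here with mpi4py; the 100ms sleep is what keeps a finished rank from pegging its core.

```python
# Illustration of replacing a blocking MPI_Barrier with a non-blocking
# Ibarrier that is polled every 100ms, so a rank that finishes early sleeps
# between polls instead of busy-waiting at full speed.
import time
from mpi4py import MPI

def low_power_barrier(comm=MPI.COMM_WORLD, poll_interval=0.1):
    request = comm.Ibarrier()      # non-blocking: returns immediately
    while not request.Test():      # True once every rank has reached the barrier
        time.sleep(poll_interval)  # yield the core between checks
```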
{{partition_id}} {{id}} substitution in forcing input path
As part of the testing I did with one netcdf per mpi rank, I added a rank id placeholder to the "path" variable in the forcing provider config. I also added {{id}} to speed up the csv init load time, similar to #843.
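Roughly what the forcing block looks like with the placeholder; the provider name and paths here are only illustrative of how I happen to run it and may need adjusting for your setup (the per-catchment csv case uses {{id}} the same way):

```json
{
    "forcing": {
        "path": "./forcings/by_rank/forcings_{{partition_id}}.nc",
        "provider": "NetCDF"
    }
}
```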
Testing
To make this easier to test I built a multi-platform docker image called joshcu/ngiab_dev from this dockerfile https://github.com/JoshCu/NGIAB-CloudInfra/blob/datastream/docker/Dockerfile and put together some example data for 157 catchments and 84 nexuses for one year.
The partitioning is handled by a bash script that calls a python script inside the docker image, and troute won't work with remote partitioning disabled. These steps can still be run manually with scripts inside the container.
Configuration in realization.json
157 cats, 1 year
All of the x86 tests were on ubuntu 24 machines running docker using -v bind mounts.

x86 96 core dual Xeon Platinum 8160
- serial cache = 916s
- serial no-cache = 924s

x86 56 core dual Xeon E5-2697
- serial cache = 528s
- serial no-cache = 771s

x86 24 core i9-12900
- serial cache = 180s
- serial no-cache = 256s

arm64 mac mini m2

10 ranks, 2 cores, bind mount (disk io seems to be terrible with docker -v bind mounts on mac, and kubernetes was only running with 2 cores)
- serial cache = 301s
- serial no-cache = 375s

10 ranks, 2 cores, in memory (I started up the container then used docker cp to load the input files)
- serial cache = 233s
- serial no-cache = 299s

8 ranks, 8 cores, in memory

55 cats, 31 nexus, 10 years
All done on the 96 core machine.
- serially: ~1700s

~1600 cats, 2 years
Numbers prefixed with ~ were run for 3 months then multiplied by 8.
- serially with cache ~324m
- serially with no-cache ~399m