MPI Scaling fix #894

Open
JoshCu wants to merge 12 commits into master
Conversation

JoshCu commented Oct 28, 2024

TL;DR: the NetCDF cache and partition remotes hinder MPI scaling.

The NetCDF per-feature data provider cache is expensive to create, and the performance penalty grows with MPI rank count. Depending on the machine, it can be faster to have it disabled at just 2 ranks.
The mid-simulation MPI communication to accumulate nexus outputs adds a lot of overhead; deferring that accumulation to the end of the simulation gives another large speedup.
A simulation that takes 916s serially takes 401s on 96 ranks without these changes and 21s with both changes applied.
With one NetCDF file per rank, the cache enabled, and remotes disabled, the speedup on 96 cores over serial is now ~80x. Without these changes, 96 cores is only ~1x-2x faster than serial.

There are a few different things included in this PR that probably could and should be implemented better than the way I've done them, but I thought it would be better to share what I've got so far before revising it further.

Major Differences

NetCDF Cache

The reason this is so expensive is that it caches the value for every catchment at a given timestep for a given variable. Serially this seems to make things run 1.3x-1.5x faster than with no cache. When using MPI partitions, each process reads the values for all of the catchments, including catchments not in the current MPI rank. The result is a huge amount of IO that gets worse with the number of MPI ranks and the number of catchments.

Profiling with 55 MPI ranks showed that ~80% of the simulation runtime was read syscalls. Reading from a single NetCDF was taking 180s, while per-catchment CSV files took 50s.
I tested simply duplicating the NetCDF for each rank, which was worse (~214s), and then splitting the original input up by rank so only the relevant data would be read. Splitting up the NetCDF reduced the simulation time to 39s; disabling the cache entirely also gave a 39s runtime.
I think the machine I did most of my testing on has a large system-level cache, as serially the ngen cache only gives me a <7% performance boost on inputs under 1 GB.

The best performance seems to come from splitting up the NetCDF by rank and keeping the cache enabled, but it's not that much faster than just disabling the cache.
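
For reference, the split-by-rank step can be done roughly like this. It's a minimal sketch assuming xarray is available, the forcing file has a catchment id coordinate named `ids`, and the partition JSON lists each partition's catchments under `cat-ids`; those names are assumptions, so adjust them to match your files.

```python
# Sketch: split one lumped forcing NetCDF into a file per MPI rank so each rank
# only reads the catchments it owns. Coordinate/field names ("ids", "cat-ids",
# "id") are assumptions -- adjust to your forcing file and partition config.
import json
import xarray as xr

def split_forcings_by_rank(forcing_path, partition_path, out_template="forcings_rank{rank}.nc"):
    with open(partition_path) as f:
        partitions = json.load(f)["partitions"]
    with xr.open_dataset(forcing_path) as ds:
        for part in partitions:
            rank = part["id"]
            # keep only this rank's catchments, then write its private copy
            ds.sel(ids=part["cat-ids"]).to_netcdf(out_template.format(rank=rank))

# split_forcings_by_rank("forcings/by_catchment/forcings.nc", "partitions.json")
```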

Remotes partitioning

This might be a bit specific to how I'm running ngen, but for cfe+noah-owp-modular it seems like I can treat each catchment as an individual cell unaffected by those around it, and the only communication needed is to sum up the outputs where more than one catchment feeds into a nexus.
Removing the remotes saves the overhead of the MPI communication and decouples the ranks. If a nexus output is written from multiple ranks it will corrupt the output, so to fix this I add the MPI rank number to the output file name and make sure each nexus appears at most once per rank. I do this by sorting the catchment-to-nexus pairs by nexus and then assigning them round robin to the partitions. This will break if the number of partitions is smaller than the maximum number of catchments flowing to a single nexus in your network.
The additional post-processing to rename the output files and sum up the different ranks takes a tiny fraction of the time saved by partitioning this way.
For now this is just a Python script, as it was much faster for me to write.
So is the script to generate the round-robin partitions.
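
A minimal sketch of that round-robin assignment (the data structures here are simplified; the actual script in the test image is `round_robin_partioning.py`):

```python
# Sort the catchment->nexus pairs by nexus id, then deal them out to partitions
# in turn. Because all catchments feeding one nexus are adjacent after the sort,
# a nexus only repeats within a partition if num_partitions is smaller than the
# largest number of catchments draining to a single nexus.
from collections import defaultdict

def round_robin_partitions(cat_to_nex, num_partitions):
    """cat_to_nex: mapping of catchment id -> downstream nexus id."""
    pairs = sorted(cat_to_nex.items(), key=lambda kv: kv[1])
    partitions = defaultdict(list)
    for i, (cat, nex) in enumerate(pairs):
        partitions[i % num_partitions].append({"cat": cat, "nex": nex})
    return [partitions[r] for r in range(num_partitions)]

# Example: three catchments draining to the same nexus need at least 3 partitions.
# round_robin_partitions({"cat-1": "nex-10", "cat-2": "nex-10", "cat-3": "nex-10"}, 3)
```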

Minor changes

Option to disable catchment outputs

I couldn't figure out whether there was another way to disable this, so I added a root-level realization config option "write_catchment_output" that points the catchment output stream at /dev/null.

MPI Barrier

I noticed before in #846 that the MPI barriers during t-route execution were slowing t-route down a bit, because t-route tries to parallelize without using MPI. A similar but less noticeable effect happens during the model execution phase too. It's less noticeable because there aren't other processes trying to use the cores, but when a rank finishes, the poll rate of the MPI barrier's busy wait is so high that it maxes out that core. On my machine this was causing the overall clock speed to drop and the simulation was taking ~5% longer to execute. This seems like something that should be a configurable MPI build argument or environment variable, but I couldn't find one, so I replaced the barriers with non-blocking Ibarriers that poll every 100ms.
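
The PR changes ngen's C++ MPI calls, but the pattern is the same as this mpi4py sketch: swap the blocking barrier for a non-blocking one and poll it on a 100ms interval so a finished rank sleeps instead of pegging its core.

```python
import time
from mpi4py import MPI

def relaxed_barrier(comm=MPI.COMM_WORLD, poll_interval_s=0.1):
    # Non-blocking barrier: returns a request immediately instead of spinning.
    request = comm.Ibarrier()
    # Test() reports completion once every rank has entered the barrier;
    # sleeping between polls keeps the core mostly idle while waiting.
    while not request.Test():
        time.sleep(poll_interval_s)

# relaxed_barrier()  # drop-in replacement for comm.Barrier()
```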

{{partition_id}} {{id}} substitution in forcing input path

As part of the testing I did with one NetCDF per MPI rank, I added a rank id placeholder to the "path" variable in the forcing provider config. I also added {{id}} to it to speed up the CSV init load time, similar to #843.
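
To illustrate the intent (the template path below is made up, not from the PR): {{partition_id}} expands to the MPI rank and {{id}} to the catchment id before the file is opened, so each rank, or each catchment for CSV inputs, only touches its own file.

```python
# Hypothetical example of how the placeholders expand; the real substitution
# happens inside ngen's forcing provider.
def expand_forcing_path(template, partition_id, catchment_id):
    return (template
            .replace("{{partition_id}}", str(partition_id))
            .replace("{{id}}", str(catchment_id)))

# expand_forcing_path("./forcings/by_rank/forcings_{{partition_id}}.nc", 3, "cat-11223")
# -> "./forcings/by_rank/forcings_3.nc"
```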

Testing

To make this easier to test, I built a multi-platform Docker image called joshcu/ngiab_dev from this Dockerfile: https://github.com/JoshCu/NGIAB-CloudInfra/blob/datastream/docker/Dockerfile
and put together some example data covering 157 catchments and 84 nexuses for one year.
The partitioning is handled by a bash script that calls a Python script in the image, and t-route won't work with remote partitioning disabled; these steps can be run manually, though, with scripts inside the container.

```bash
# Running the docker container
docker run --rm -it -v "/mnt/raid0/ngiab_netcdf/speedup_example:/ngen/ngen/data" joshcu/ngiab_dev /ngen/ngen/data/ auto <num_desired_partitions>

# I had io performance issues on mac related to bind mounts and did the following as a workaround
# Run the container
docker run --rm -it joshcu/ngiab_dev
# make the data destination directory
mkdir /ngen/ngen/
# copy the speedup_example data to /ngen/ngen/data with docker cp

# finally, run the ngen run helper script in the container
/ngen/HelloNGEN.sh /ngen/ngen/data auto <num_desired_partitions>

# Manually running
# Partition generation
python /dmod/utils/partitioning/round_robin_partioning.py -n <num_partitions> <hydrofabric_path> <output_filename>
# Merging the per rank output files if remotes are disabled
python /dmod/utils/data_conversion/no_remote_merging.py /ngen/ngen/data/outputs/ngen/
# Running t-route manually
python -m nwm_routing -V4 -f config/troute.yaml
```
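
For context, the merge step is conceptually just summing each nexus's per-rank files back into one series. A rough sketch, assuming the per-rank files are named like `nex-<id>_output_<rank>.csv` and contain numeric flows indexed by timestamp (the real no_remote_merging.py may differ):

```python
import glob
import re
from collections import defaultdict
import pandas as pd

def merge_nexus_outputs(output_dir):
    # Group the per-rank files by nexus id.
    by_nexus = defaultdict(list)
    for path in glob.glob(f"{output_dir}/nex-*_output_*.csv"):
        nexus_id = re.search(r"(nex-\d+)_output_\d+\.csv", path).group(1)
        by_nexus[nexus_id].append(path)
    # Sum the flows across ranks and write a single file per nexus.
    for nexus_id, paths in by_nexus.items():
        frames = [pd.read_csv(p, index_col=0) for p in paths]
        merged = sum(frames[1:], frames[0])
        merged.to_csv(f"{output_dir}/{nexus_id}_output.csv")

# merge_nexus_outputs("/ngen/ngen/data/outputs/ngen")
```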

Configuration in realization.json

        "forcing": {
            "path": "./forcings/by_catchment/forcings.nc",
            "provider": "NetCDF",
            "enable_cache": false
        }
    },
    "time": {
        "start_time": "2010-01-01 00:00:00",
        "end_time": "2011-01-01 00:00:00",
        "output_interval": 3600
    },
    "write_catchment_output": false,
    "remotes_enabled": false, # routing will not work with this set to false
    "output_root": "/ngen/ngen/data/outputs/ngen"
}

157 cats, 1 year

All of the x86 tests were on Ubuntu 24 machines running Docker with -v bind mounts.

x86 96 core dual Xeon Platinum 8160

serial cache = 916s
serial no-cache = 924s

|          | remotes | no-remotes |
| -------- | ------- | ---------- |
| cache    | 401s    | 386s       |
| no-cache | 76s     | 21s        |

x86 56 core dual Xeon E5-2697

serial cache = 528s
serial no-cache = 771s

|          | remotes | no-remotes |
| -------- | ------- | ---------- |
| cache    | 215s    | 146s       |
| no-cache | 129s    | 35s        |

x86 24 core i9-12900

serial cache = 180s
serial no-cache = 256s

|          | remotes | no-remotes |
| -------- | ------- | ---------- |
| cache    | 146s    | 94s        |
| no-cache | 106s    | 33.5s      |

arm64 mac mini m2

10 ranks 2-cores bind mount

Disk IO seems to be terrible with Docker -v bind mounts on Mac, and Kubernetes was only running with 2 cores.
serial cache = 301s
serial no-cache = 375s

|          | remotes | no-remotes |
| -------- | ------- | ---------- |
| cache    | 358s    | 358s       |
| no-cache | 263s    | 249s       |

10 ranks 2-cores, in memory

I started up the container then used docker cp to load the input files.
serial cache = 233s
serial no-cache = 299s

|          | remotes | no-remotes |
| -------- | ------- | ---------- |
| cache    | 309s    | 287s       |
| no-cache | 207s    | 157s       |

8 ranks 8 cores, in memory

|          | remotes | no-remotes |
| -------- | ------- | ---------- |
| cache    | 190s    | 68s        |
| no-cache | 169s    | 50s        |

55 cats, 31 nexuses, 10 years

All done on the 96-core machine.
serially: ~1700s

| 31 ranks  | remotes | no-remotes |
| --------- | ------- | ---------- |
| cache     | 838s    | 121s       |
| no-cache  | N/A     | 76s        |
| csv files | N/A     | 94s        |

| 55 ranks  | remotes | no-remotes |
| --------- | ------- | ---------- |
| cache     | N/A     | 187s       |
| no-cache  | N/A     | 39s        |
| csv files | N/A     | 50s        |

~1600 cats 2 years

Numbers prefixed with ~ were run for 3 months then multiplied by 8.
serial cache = ~324m
serial no-cache = ~399m

| 96 ranks               | remotes | no-remotes |
| ---------------------- | ------- | ---------- |
| cache                  | ~316m   | ~23.5m     |
| no-cache               | ~11.7m  | 232s       |
| cache + partitioned nc | N/A     | 184s       |

JoshCu (Author) commented Nov 4, 2024

I did some additional testing for 14k catchments on 96 cores, and it took 570s to run the simulation with the cache and remotes disabled.
The original Python script to merge the outputs took too long, so I parallelised it and it now takes ~30s.
I also tried it in Rust for fun here. It's not that useful in this PR, except maybe for demonstrating that the per-rank files + reformatting could be quite fast if integrated into ngen in C++.
