This repo contains instructions to launch multi-node scripts on a PBS cluster, setting all the environment variables needed for a multi-GPU script with PyTorch or PyTorch Lightning.

- test.sh: launches the PBS command and sets up the Python script.
- multinode_scripts/multinode.sh: finds the environment variables to be defined and launches the mpirun command from the master node. Check line 75, which loads the appropriate module (on my cluster), and the -prefix option of the mpirun command at line 71 (you will probably need to change both on a different cluster).
- multinode_scripts/run.sh: takes care of launching the final script on each node, adding the final environment variable.
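
To illustrate the per-node step, here is a minimal sketch of what a wrapper in the spirit of run.sh could look like. It is not the script from this repo. It assumes mpirun is Open MPI and starts exactly one process per node, so that the rank Open MPI exports as OMPI_COMM_WORLD_RANK can be reused as the node rank. The function name set_node_rank and the commented-out train.py are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only, not the repo's run.sh.
# Assumption: Open MPI launched this wrapper, one process per node.

# Copy the MPI rank of this process into NODE_RANK.
# Falls back to 0 when the wrapper is started without mpirun.
set_node_rank() {
    NODE_RANK="${OMPI_COMM_WORLD_RANK:-0}"
    export NODE_RANK
}

set_node_rank
echo "host=$(uname -n) NODE_RANK=$NODE_RANK"

# The real wrapper would now start the training script, for example:
# exec python train.py   # hypothetical script name
```

If your MPI implementation is not Open MPI, the rank is exposed under a different variable name, so the first line of the function has to be adapted.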

The environment variables are:

- MASTER_ADDR: address of the master node.
- MASTER_PORT: free communication port on the master node.
- WORLD_SIZE: total number of processes used (usually num_gpu * num_nodes).
- NODE_RANK: rank of the node, different for each node (the master is usually 0).
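
As a rough illustration of how these values can be obtained on the master node, the sketch below derives MASTER_ADDR, MASTER_PORT and WORLD_SIZE from a PBS node file. It is not the code in multinode_scripts/multinode.sh. It assumes that the first host in the node file is the master, that port 29500 is free unless MASTER_PORT is already set, and that every node has the same number of GPUs. The function name derive_dist_env and the host names gnode01 and gnode02 are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: derive the rendezvous variables from a PBS node file.

# derive_dist_env NODEFILE GPUS_PER_NODE
#   NODEFILE       file with one host name per line (duplicates allowed)
#   GPUS_PER_NODE  number of GPUs used on each node (assumed identical)
derive_dist_env() {
    local nodefile="$1" gpus_per_node="$2" num_nodes
    # Count distinct hosts; PBS may list a host once per requested slot.
    num_nodes="$(sort -u "$nodefile" | wc -l | tr -d ' ')"
    # Assumption: the first listed host acts as the master node.
    MASTER_ADDR="$(head -n 1 "$nodefile")"
    # Assumption: 29500 is free; an existing MASTER_PORT takes precedence.
    MASTER_PORT="${MASTER_PORT:-29500}"
    WORLD_SIZE=$(( gpus_per_node * num_nodes ))
    export MASTER_ADDR MASTER_PORT WORLD_SIZE
}

# Demo with a hypothetical node file: two hosts, two slots each.
demo_nodefile="$(mktemp)"
printf 'gnode01\ngnode01\ngnode02\ngnode02\n' > "$demo_nodefile"
derive_dist_env "$demo_nodefile" 2
echo "MASTER_ADDR=$MASTER_ADDR MASTER_PORT=$MASTER_PORT WORLD_SIZE=$WORLD_SIZE"
rm -f "$demo_nodefile"
```

Inside a PBS job the same function can be called as `derive_dist_env "$PBS_NODEFILE" 4`, since PBS exports the path of the node file in PBS_NODEFILE.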
- https://lambdalabs.com/blog/multi-node-pytorch-distributed-training-guide
- https://pytorch-lightning.readthedocs.io/en/stable/clouds/cluster_intermediate_1.html
- https://pytorch-lightning.readthedocs.io/en/stable/advanced/model_parallel.html

Useful PBS commands:

- qstat -fQ: see the permissions of the queues (e.g. the maximum number of parallel jobs).
- pbsnodes -aSj | grep -F 'gnode' | grep -F 'free': list all free gnode nodes.
- qstat -wan1 -u $user: monitor all jobs launched by $user and the resources they requested.
- qstat -u $user | grep "$user" | cut -d"." -f1 | xargs qdel: kill all jobs of $user.
- qstat -u $user | grep "R" | cut -d"." -f1 | xargs qdel: kill all the running jobs of $user.
- qstat -u $user | grep "Q" | cut -d"." -f1 | xargs qdel: kill all the queued jobs of $user.
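
The grep "R" and grep "Q" filters match those letters anywhere in a line, so a job whose name contains an R or a Q is selected as well. A stricter variant is sketched below. It assumes that the job state sits in the tenth column of the qstat -u listing, which you should verify on your own cluster. The function name job_ids_in_state and the sample listing are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: select job IDs by state column instead of grepping for a letter.

# job_ids_in_state STATE
#   Reads a `qstat -u` listing on stdin and prints the numeric ID of every
#   job whose state column equals STATE (for example R or Q).
#   Assumption: the state is the 10th whitespace-separated field.
job_ids_in_state() {
    awk -v state="$1" '$10 == state { split($1, id, "."); print id[1] }'
}

# Hypothetical listing; note that the running job is named Run_Q.
sample='Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
101.pbs01       alice    gpu      Run_Q      4242     2   8    --  24:00 R 01:15
102.pbs01       alice    gpu      train      --       2   8    --  24:00 Q   --'

printf '%s\n' "$sample" | job_ids_in_state Q
```

On a cluster, the function slots into the pipelines above, for example `qstat -u $user | job_ids_in_state Q | xargs qdel`.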