This GRES plugin for Slurm allows scheduling whole Aurora cards on nodes via GRES; it does not allow sharing a single Aurora card between jobs.
Check https://github.com/SX-Aurora/SX-Aurora-Slurm-Plugin/releases for the latest release. This has been tested with Slurm 20.11.
You need to compile custom code for Slurm (a shell sketch of these steps follows the list):
- Clone this repo to src/plugins/gres/ve in your local copy of Slurm or unpack the release tarball and copy the files to that folder.
- Add ''src/plugins/gres/ve/Makefile'' to the configure.ac file in the slurm root directory.
- Remove the existing configure file.
- Add ''ve'' to the SUBDIRS variable in src/plugins/gres/Makefile.am.
- Run autoreconf.
- Run make && make install if this is a new Slurm source tree, or patch the slurm.spec file.
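The build steps above might look roughly like the following shell sketch; the source-tree path and install prefix are placeholders, and editing configure.ac and Makefile.am remains a manual step:

cd /path/to/slurm-source                  # placeholder: your local Slurm source tree
git clone https://github.com/SX-Aurora/SX-Aurora-Slurm-Plugin src/plugins/gres/ve
# edit configure.ac (add src/plugins/gres/ve/Makefile) and src/plugins/gres/Makefile.am (add ve to SUBDIRS)
rm configure
autoreconf
./configure --prefix=/opt/slurm           # placeholder prefix
make && make install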
It is recommended to build a separate Slurm cluster on the Aurora nodes for testing before moving the setup to production. In that case both slurmctld and slurmdbd are needed because of the GRES usage. Remember to change the ports in slurm.conf if you have other slurmctlds running; otherwise the reconfiguration may crash the other slurmctld.
- The nodes should be configured in your slurm.conf or an appropriate include file. The node definition should look like:
GresTypes=ve
SelectType=select/cons_tres
Nodename=<nodename> Gres=ve:<count>
With multiple VEs per VH you have to create a shared partition like the one below. It is also recommended to define the CPUs as well as the memory of the nodes as shared resources, as in the example below with two A300-8 nodes:
GresTypes=ve
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
NodeName=vh[100-101] CPUs=80 Sockets=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=192078 Gres=ve:8
PartitionName=aurora Shared=Yes Nodes=vh[100-101]
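Once slurmctld has been restarted with this configuration (see further below), you can check that the GRES is visible with standard Slurm commands, for example (node and partition names from the example above):

scontrol show node vh100 | grep -i gres
sinfo -p aurora -o "%N %G"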
- gres.conf should contain at least:
NodeName=<nodename> Name=ve File=/dev/veslot[<ve slot numbers as csv>]
So for the example with the two A300-8 nodes it would be:
NodeName=vh[100-101] Name=ve File=/dev/veslot[0,1,2,3,4,5,6,7]
- cgroup.conf needs to have:
ConstrainDevices=yes
- cgroup_allowed_devices_file.conf must contain at least:
/dev/cpu/*/*
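For reference, a minimal sketch of the two files could look like this; CgroupAutomount, the AllowedDevicesFile path, and the extra /dev entries are assumptions that depend on your installation. First cgroup.conf:

CgroupAutomount=yes
ConstrainDevices=yes
AllowedDevicesFile=/etc/slurm/cgroup_allowed_devices_file.conf

and then cgroup_allowed_devices_file.conf, where /dev/cpu/*/* is the entry required here:

/dev/null
/dev/zero
/dev/urandom
/dev/cpu/*/*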
Once the configuration files are changed, restart slurmctld and the slurmds. Then run a small test like:
srun -n1 --gres=ve:1 -p<yourpartition> env|sort
and check that the SLURM variables are set; e.g. VE_NODE_NUMBER should also be there.
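To check specifically for the VE variable, something like this can be used (the partition name is a placeholder as above):

srun -n1 --gres=ve:1 -p<yourpartition> env | grep VE_NODE_NUMBER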
This GRES module only supports single-node Aurora jobs! The environment variables inside a job are set to support NEC MPI in distributed mode, which means that inside your job script you should be able to simply run:
mpirun <executable>
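A minimal batch script could therefore look like the following sketch; the partition name, VE count, and executable are placeholders:

#!/bin/bash
#SBATCH -p aurora
#SBATCH -N 1
#SBATCH --gres=ve:2
# NEC MPI picks up the VE environment set up by the plugin
mpirun ./a.out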