20.10
NOTE: As a result of CVE-2021-31215, SchedMD has unpublished the version of Slurm used by default in this release. If deploying Slurm, it is recommended to upgrade to the latest release of DeepOps to get a supported Slurm version.
If it isn't possible to update to the latest release of DeepOps immediately, instead update to a supported Slurm version by setting `slurm_version: 20.02.7` or `slurm_version: 20.11.7` in the DeepOps configuration. However, please note that this workflow has not been tested with all past releases.
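For reference, a minimal sketch of applying that override from a DeepOps checkout; the `config/group_vars/slurm-cluster.yml` path is an assumption and may differ in your configuration:

```sh
# Hedged sketch: pin a supported Slurm version before running the Slurm playbooks.
# The config path below is an assumption; use whichever group_vars file holds
# your Slurm settings.
CFG=config/group_vars/slurm-cluster.yml
if grep -q '^slurm_version:' "$CFG"; then
  sed -i 's/^slurm_version:.*/slurm_version: 20.02.7/' "$CFG"
else
  echo 'slurm_version: 20.02.7' >> "$CFG"
fi
grep '^slurm_version:' "$CFG"   # verify the override
```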
DeepOps 20.10 Release Notes
What's New
- Repo reorganization
- Slurm cluster node health check
- HPL burn-in test version 1.0 (adds multi-node test)
- Playbook to disable cloud-init on Ubuntu
- Playbook to install NVIDIA DCGM on non-DGX servers
- GPU feature discovery plugin with MIG support for K8S
Changes
- Slurm 20.02.4, Pyxis v0.8.1, Enroot v3.1.1
- Kubernetes v1.18.9 (Kubespray v2.14.1), Helm 3, GPU Operator v0.6.0
- Kubeflow v1.1.0 w/ MPI Operator (kfctl -> v1.1.0, istio_dex -> v1.0.2, istio -> v1.1.0)
- K8s GPU device plugin v0.7.0 & GPU Feature Discovery v0.2.0 with support for NVIDIA A100 and MIG
- Docker 19.03
- NVIDIA driver role v1.2.2
Bugs/Enhancements
- Add additional Slurm cluster deployment validation tests
- Fix Rook script to properly delete cluster with Helm 3
- Fix default OpenMPI build compatibility with Slurm
- Fix unnecessary rebuild in PMIx
- Update Slurm install to contain SSH sessions by default
- Fix bug with NVIDIA GPU driver role on RHEL/CentOS
- Clean up and consolidate Rook scripts (`poll_ceph.sh` and `rmrook.sh` rolled into `deploy_rook.sh -d` & `-w`; see the usage sketch after this list)
- Additional testing (MPI, Rook, ...)
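A hedged usage sketch for the consolidated Rook script; the location and flag meanings below are inferred from the change above (`-d` covering the old `rmrook.sh` removal path, `-w` the `poll_ceph.sh` wait loop), so check the script's own help text in your checkout:

```sh
# Assumed path and flag semantics; verify against the script before relying on them.
./scripts/k8s/deploy_rook.sh        # deploy Rook/Ceph
./scripts/k8s/deploy_rook.sh -w     # wait/poll until the Ceph cluster is healthy
./scripts/k8s/deploy_rook.sh -d     # remove the Rook deployment (replaces rmrook.sh)
```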
New Directory Structure
```
├── config.example
│   ├── airgap
│   ├── helm
│   └── pxe
│       └── machines
├── docs
│   ├── airgap
│   ├── deepops
│   ├── img
│   ├── k8s-cluster
│   ├── ngc-ready
│   ├── pxe
│   └── slurm-cluster
├── playbooks
│   ├── airgap
│   ├── bootstrap
│   ├── container
│   ├── generic
│   ├── k8s-cluster
│   ├── nvidia-dgx
│   ├── nvidia-egx
│   ├── nvidia-software
│   ├── provisioning
│   ├── slurm-cluster
│   └── utilities
├── roles
│   ├── autofs
│   ├── container-registry
│   ├── dns-config
│   ├── easy-build
│   ├── easy-build-packages
│   ├── facts
│   ├── grafana
│   ├── kerberos-client
│   ├── lmod
│   ├── move-home-dirs
│   ├── netapp-trident
│   ├── nfs
│   ├── nhc
│   ├── nis-client
│   ├── nvidia-cuda
│   ├── nvidia-dcgm
│   ├── nvidia-dcgm-exporter
│   ├── nvidia-dgx
│   ├── nvidia-dgx-firmware
│   ├── nvidia-gpu-operator
│   ├── nvidia-gpu-operator-node-prep
│   ├── nvidia-gpu-tests
│   ├── nvidia-hpc-sdk
│   ├── nvidia-k8s-gpu-device-plugin
│   ├── nvidia-k8s-gpu-feature-discovery
│   ├── nvidia-ml
│   ├── offline-repo-mirrors
│   ├── ood-wrapper
│   ├── openmpi
│   ├── openshift
│   ├── prometheus
│   ├── prometheus-node-exporter
│   ├── prometheus-slurm-exporter
│   ├── pyxis
│   ├── roce_backend
│   ├── slurm
│   └── spack
├── scripts
│   ├── airgap
│   ├── deepops
│   ├── generic
│   ├── k8s
│   └── pxe
├── src
│   ├── containers
│   │   ├── ansible
│   │   ├── dgx-firmware
│   │   ├── dgxie
│   │   ├── kubeflow-jupyter-web-app
│   │   ├── nccl-tests
│   │   ├── ngc
│   │   │   ├── pytorch
│   │   │   ├── rapids
│   │   │   └── tensorflow
│   │   ├── pixiecore
│   │   └── pxe
│   │       └── dhcp
│   ├── dashboards
│   └── repo
├── submodules
│   └── kubespray
├── virtual
│   ├── scripts
└── workloads
    ├── burn-in
    ├── examples
    │   ├── k8s
    │   │   ├── dask-rapids
    │   │   ├── kubeflow-pipeline-deploy
    │   │   ├── services
    │   │   │   └── logging
    │   │   └── users
    │   └── slurm
    │       ├── dask-rapids
    │       └── mpi-hello
    ├── jenkins
    │   └── scripts
    └── services
        └── k8s
            └── dgxie
```
Upgrade Steps
If you are upgrading to this version of DeepOps from a previous release, you will need to follow the upgrade section of the Slurm or Kubernetes Deployment Guides. In addition, the `setup.sh` script must be re-run, and any new variables in the `config.example` files should be added to your existing `config` directory. For a full diff from release 20.08.1, run `git diff 20.08.1 20.10 -- config.example/`.

It is also necessary to upgrade Helm on your provisioner node. This can be done manually, using `./scripts/install_helm.sh` as a reference.
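For reference, a minimal sketch of that upgrade flow, run from the root of an existing DeepOps checkout; the script paths follow this release's layout and may need adjusting for your environment:

```sh
# Check out the new release and re-run setup to refresh dependencies.
git fetch --tags
git checkout 20.10
./scripts/setup.sh

# Review example-config changes since the previous release, then merge any new
# variables from config.example/ into your existing config/ directory by hand.
git diff 20.08.1 20.10 -- config.example/

# Upgrade Helm on the provisioner node, using the install script as a reference.
./scripts/install_helm.sh
```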