20.10
NOTE: As a result of CVE-2021-31215, SchedMD has unpublished the version of Slurm used by default in this release. If deploying Slurm, it is recommended to upgrade to the latest release of DeepOps to get a supported Slurm version.
If it isn't possible to update to the latest release of DeepOps immediately, instead update to a supported Slurm version by setting `slurm_version: 20.02.7` or `slurm_version: 20.11.7` in the DeepOps configuration. However, please note that this workflow has not been tested with all past releases.
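For reference, a minimal sketch of applying that override from a DeepOps checkout; the `config/group_vars/slurm-cluster.yml` path is an assumption and may differ in your configuration:

```sh
# Hedged sketch: pin a supported Slurm version before running the Slurm playbooks.
# The config path below is an assumption; use whichever group_vars file holds
# your Slurm settings.
CFG=config/group_vars/slurm-cluster.yml
if grep -q '^slurm_version:' "$CFG"; then
  sed -i 's/^slurm_version:.*/slurm_version: 20.02.7/' "$CFG"
else
  echo 'slurm_version: 20.02.7' >> "$CFG"
fi
grep '^slurm_version:' "$CFG"   # verify the override
```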
DeepOps 20.10 Release Notes
What's New
- Repo reorganization
- Slurm cluster node health check
- HPL burn-in test version 1.0 (adds multi-node test)
- Playbook to disable cloud-init on Ubuntu
- Playbook to install NVIDIA DCGM on non-DGX servers
- GPU feature discovery plugin with MIG support for K8S
Changes
- Slurm 20.02.4, Pyxis v0.8.1, Enroot v3.1.1
- Kubernetes v1.18.9 (Kubespray v2.14.1), Helm 3, GPU Operator v0.6.0
- Kubeflow v1.1.0 w/ MPI Operator (kfctl -> v1.1.0, istio_dex -> v1.0.2, istio -> v1.1.0)
- K8s GPU device plugin v0.7.0 & GPU Feature Discovery v0.2.0 with support for NVIDIA A100 and MIG
- Docker 19.03
- NVIDIA driver role v1.2.2
Bugs/Enhancements
- Add additional Slurm cluster deployment validation tests
- Fix Rook script to properly delete cluster with Helm 3
- Fix default OpenMPI build compatibility with Slurm
- Fix unnecessary rebuild in PMIx
- Update Slurm install to contain SSH sessions by default
- Fix bug with NVIDIA GPU driver role on RHEL/CentOS
- Clean up and consolidate Rook scripts (`poll_ceph.sh` and `rmrook.sh` rolled into `deploy_rook.sh -d` & `-w`; see the usage sketch after this list)
- Additional testing (MPI, Rook, ...)
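A hedged usage sketch for the consolidated Rook script; the location and flag meanings below are inferred from the change above (`-d` covering the old `rmrook.sh` removal path, `-w` the `poll_ceph.sh` wait loop), so check the script's own help text in your checkout:

```sh
# Assumed path and flag semantics; verify against the script before relying on them.
./scripts/k8s/deploy_rook.sh        # deploy Rook/Ceph
./scripts/k8s/deploy_rook.sh -w     # wait/poll until the Ceph cluster is healthy
./scripts/k8s/deploy_rook.sh -d     # remove the Rook deployment (replaces rmrook.sh)
```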
New Directory Structure
```
├── config.example
│   ├── airgap
│   ├── helm
│   └── pxe
│       └── machines
├── docs
│   ├── airgap
│   ├── deepops
│   ├── img
│   ├── k8s-cluster
│   ├── ngc-ready
│   ├── pxe
│   └── slurm-cluster
├── playbooks
│   ├── airgap
│   ├── bootstrap
│   ├── container
│   ├── generic
│   ├── k8s-cluster
│   ├── nvidia-dgx
│   ├── nvidia-egx
│   ├── nvidia-software
│   ├── provisioning
│   ├── slurm-cluster
│   └── utilities
├── roles
│   ├── autofs
│   ├── container-registry
│   ├── dns-config
│   ├── easy-build
│   ├── easy-build-packages
│   ├── facts
│   ├── grafana
│   ├── kerberos-client
│   ├── lmod
│   ├── move-home-dirs
│   ├── netapp-trident
│   ├── nfs
│   ├── nhc
│   ├── nis-client
│   ├── nvidia-cuda
│   ├── nvidia-dcgm
│   ├── nvidia-dcgm-exporter
│   ├── nvidia-dgx
│   ├── nvidia-dgx-firmware
│   ├── nvidia-gpu-operator
│   ├── nvidia-gpu-operator-node-prep
│   ├── nvidia-gpu-tests
│   ├── nvidia-hpc-sdk
│   ├── nvidia-k8s-gpu-device-plugin
│   ├── nvidia-k8s-gpu-feature-discovery
│   ├── nvidia-ml
│   ├── offline-repo-mirrors
│   ├── ood-wrapper
│   ├── openmpi
│   ├── openshift
│   ├── prometheus
│   ├── prometheus-node-exporter
│   ├── prometheus-slurm-exporter
│   ├── pyxis
│   ├── roce_backend
│   ├── slurm
│   └── spack
├── scripts
│   ├── airgap
│   ├── deepops
│   ├── generic
│   ├── k8s
│   └── pxe
├── src
│   ├── containers
│   │   ├── ansible
│   │   ├── dgx-firmware
│   │   ├── dgxie
│   │   ├── kubeflow-jupyter-web-app
│   │   ├── nccl-tests
│   │   ├── ngc
│   │   │   ├── pytorch
│   │   │   ├── rapids
│   │   │   └── tensorflow
│   │   ├── pixiecore
│   │   └── pxe
│   │       └── dhcp
│   ├── dashboards
│   └── repo
├── submodules
│   └── kubespray
├── virtual
│   ├── scripts
└── workloads
    ├── burn-in
    ├── examples
    │   ├── k8s
    │   │   ├── dask-rapids
    │   │   ├── kubeflow-pipeline-deploy
    │   │   ├── services
    │   │   │   └── logging
    │   │   └── users
    │   └── slurm
    │       ├── dask-rapids
    │       └── mpi-hello
    ├── jenkins
    │   └── scripts
    └── services
        └── k8s
            └── dgxie
```
Upgrade Steps
If you are upgrading to this version of DeepOps from a previous release, you will need to follow the upgrade section of the Slurm or Kubernetes Deployment Guides. In addition, the `setup.sh` script must be re-run, and any new variables in the `config.example` files should be added to your existing `config` directory. For a full diff from release 20.08.1, run `git diff 20.08.1 20.10 -- config.example/`.

It is also necessary to upgrade Helm on your provisioner node. This can be done manually, using `./scripts/install_helm.sh` as a reference.
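For reference, a minimal sketch of that upgrade flow, run from the root of an existing DeepOps checkout; the script paths follow this release's layout and may need adjusting for your environment:

```sh
# Check out the new release and re-run setup to refresh dependencies.
git fetch --tags
git checkout 20.10
./scripts/setup.sh

# Review example-config changes since the previous release, then merge any new
# variables from config.example/ into your existing config/ directory by hand.
git diff 20.08.1 20.10 -- config.example/

# Upgrade Helm on the provisioner node, using the install script as a reference.
./scripts/install_helm.sh
```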