NOTE: As a result of CVE-2021-31215, SchedMD has un-published the version of Slurm used by default in this release. If deploying Slurm, it is recommended to upgrade to the latest release of DeepOps to get a supported Slurm version.

If it isn’t possible to update to the latest release of DeepOps immediately, update instead to a supported Slurm version by setting slurm_version: 20.02.7 or slurm_version: 20.11.7 in the DeepOps configuration. However, please note that this workflow has not been tested with all past releases.
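
As a minimal sketch of that override, the setting could look like the fragment below. The variable name and the two version values come from the note above. Where the line goes is an assumption: a group variables file such as config/group_vars/slurm-cluster.yml is a likely place, but check your own config directory.

```yaml
# Assumed location: your DeepOps config, e.g. config/group_vars/slurm-cluster.yml
# (the path is an assumption; the variable name and values are from the note above).
# Pick the version matching the Slurm release series you already run:
slurm_version: 20.11.7
# slurm_version: 20.02.7
```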

DeepOps 20.12 Release Notes

What's New

Support for DGX OS 5.0
Support for Ubuntu 20.04
Support for CentOS 8
MAAS bare-metal provisioning documentation
Initial support for Slurm high-availability
Caching container registry for Slurm and K8s
Slurm and Open OnDemand usage guide
MIG support in K8s and documentation

Changes

HPC SDK 20.9
Slurm 20.02.4, Pyxis v0.8.1, Enroot v3.1.1
Kubernetes v1.18.10 (Kubespray v2.14.2), Helm 3, GPU Operator v0.6.0
Kubeflow v1.2 w/ MPI Operator (kfctl -> v1.1.0, istio_dex -> v1.0.2, istio -> v1.1.0)
K8s GPU device plugin v0.7.0 & GPU Feature Discovery v0.2.0 with support for NVIDIA A100 and MIG
Docker 19.03
NVIDIA driver role v2.0 (see notes: [1])

Bugs/Enhancements

Fix paths from repo re-org
Update MIG playbook to enable MIG per device rather than all
Rook/Ceph install script improvements
Added Slurm tests to QA
Move Helm charts off deprecated repo
Fix OpenMPI build on CentOS
Fix MAAS repo location
Fix Enroot removing cache during existing jobs
Updates to use Helm 3
Fix all GPUs being visible when connecting over SSH to a Slurm compute node
Fix python bootstrap script to support python3 on CentOS
Allow disabling docker/nvidia-docker install
Update Kubeflow deployment to allow custom configurations/kustomization in the workloads directory, with an example culling configuration
Update Kubeflow default containers to example NGC containers
Update nvidia-dgx-firmware role to work with new update container with more verifications
Use a persistent volume for Prometheus metrics
Limit CPU usage for Prom node exporters
Many more bug fixes

Upgrade steps

If you are upgrading to this version of DeepOps from a previous release, follow the upgrade section of the Slurm or Kubernetes Deployment Guides. In addition, re-run the ./scripts/setup.sh script and add any new variables from the config.example files to your existing config. For a full diff from release 20.10, run git diff 20.10 20.12 -- config.example/. There are many changes in this release; if you encounter problems, please open a GitHub issue. See the update guide for additional guidance.

Notes

[1] On Ubuntu, this update changes the default driver package to nvidia-headless-450-server instead of cuda-drivers. See the release notes for the driver role for more information.
