20.12
NOTE: As a result of CVE-2021-31215, SchedMD has un-published the version of Slurm used by default in this release. If deploying Slurm, it is recommended to upgrade to the latest release of DeepOps to get a supported Slurm version.
If it isn’t possible to update to the latest release of DeepOps immediately, update instead to a supported Slurm version by setting slurm_version: 20.02.7 or slurm_version: 20.11.7 in the DeepOps configuration. However, please note that this workflow has not been tested with all past releases.
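As a minimal sketch of that override, the fragment below pins Slurm to one of the two supported versions named above. The file location in the comment is an assumption about a typical DeepOps layout and is not stated in these notes; place the variable wherever your deployment keeps its Slurm settings.

```yaml
# Assumed location (not stated in these notes): the Slurm variables file in
# your DeepOps config directory, e.g. config/group_vars/slurm-cluster.yml.
# Overrides the default Slurm version that SchedMD has un-published.
slurm_version: 20.11.7   # or 20.02.7 to stay on the 20.02 series
```

After changing the value, re-run your Slurm deployment so the new version is built and installed. As noted above, this workflow has not been tested with all past releases.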
DeepOps 20.12 Release Notes
What's New
- Support for DGX OS 5.0
- Support for Ubuntu 20.04
- Support for CentOS 8
- MAAS bare-metal provisioning documentation
- Initial support for Slurm high-availability
- Caching container registry for Slurm and k8s
- Slurm and Open OnDemand usage guide
- MIG support in K8s and documentation
Changes
- HPC SDK 20.9
- Slurm 20.02.4, Pyxis v0.8.1, Enroot v3.1.1
- Kubernetes v1.18.10 (Kubespray v2.14.2), Helm 3, GPU Operator v0.6.0
- Kubeflow v1.2 w/ MPI Operator (kfctl -> v1.1.0, istio_dex -> v1.0.2, istio -> v1.1.0)
- K8s GPU device plugin v0.7.0 & GPU Feature Discovery v0.2.0 with support for NVIDIA A100 and MIG
- Docker 19.03
- NVIDIA driver role v2.0 (see notes: [1])
Bugs/Enhancements
- Fix paths from repo re-org
- Update MIG playbook to enable MIG per device rather than all
- Rook/Ceph install script improvements
- Added Slurm tests to QA
- Move Helm charts off deprecated repo
- Fix OpenMPI build on CentOS
- Fix MAAS repo location
- Fix Enroot removing cache during existing jobs
- Updates to use Helm 3
- Fix all GPUs being visible when connecting via SSH to a Slurm compute node
- Fix python bootstrap script to support python3 on CentOS
- Allow disabling docker/nvidia-docker install
- Update Kubeflow deployment to allow custom configurations/kustomization in the workloads directory, with an example culling configuration
- Update Kubeflow default containers to example NGC containers
- Update nvidia-dgx-firmware role to work with new update container with more verifications
- Use a persistent volume for Prometheus metrics
- Limit CPU usage for Prom node exporters
- Many more bug fixes
Upgrade steps
If you are upgrading to this version of DeepOps from a previous release, follow the upgrade section of the Slurm or Kubernetes Deployment Guides. In addition, re-run the ./scripts/setup.sh script and add any new variables from the config.example files to your existing config. For a full diff from release 20.10, run git diff 20.10 20.12 -- config.example/. Note that there are many changes in this release; if you encounter problems, please open a GitHub issue. See the update guide for additional guidance.
Notes
[1] On Ubuntu, this update changes the default to the nvidia-headless-450-server package instead of the cuda-drivers package. See the release notes for the driver role for more information.