Released by @dholt on 05 Oct 17:42

NOTE: As a result of CVE-2021-31215, SchedMD has unpublished the version of Slurm used by default in this release. If you are deploying Slurm, it is recommended that you upgrade to the latest release of DeepOps to get a supported Slurm version.

If it isn't possible to update to the latest release of DeepOps immediately, switch to a supported Slurm version instead by setting `slurm_version: 20.02.7` or `slurm_version: 20.11.7` in the DeepOps configuration. Note, however, that this workaround has not been tested against all past releases.
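
As a minimal sketch of that override, assuming your Slurm variables live in a group_vars file copied from config.example (the exact filename may differ in your configuration):

```yaml
# In your DeepOps config, e.g. config/group_vars/slurm-cluster.yml (path is an assumption)
slurm_version: "20.11.7"   # or "20.02.7"; both address CVE-2021-31215
```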

DeepOps 20.10 Release Notes

What's New

  • Repo reorganization
  • Slurm cluster node health check
  • HPL burn-in test version 1.0 (adds multi-node test)
  • Playbook to disable cloud-init on Ubuntu
  • Playbook to install NVIDIA DCGM on non-DGX servers
  • GPU feature discovery plugin with MIG support for K8S

Changes

  • Slurm 20.02.4, Pyxis v0.8.1, Enroot v3.1.1 (version checks are sketched after this list)
  • Kubernetes v1.18.9 (Kubespray v2.14.1), Helm 3, GPU Operator v0.6.0
  • Kubeflow v1.1.0 w/ MPI Operator (kfctl -> v1.1.0, istio_dex -> v1.0.2, istio -> v1.1.0)
  • K8s GPU device plugin v0.7.0 & GPU Feature Discovery v0.2.0 with support for NVIDIA A100 and MIG
  • Docker 19.03
  • NVIDIA driver role v1.2.2
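
For reference, one quick way to confirm these versions on a running cluster (the command set below is a suggestion, not part of the release; run each command on a node where the component is actually installed and skip any you did not deploy):

```sh
sinfo --version                      # Slurm, on a controller or login node
enroot version                       # Enroot, on Slurm compute nodes
srun --help | grep -i container      # Pyxis adds --container-* options to srun
kubectl version --short              # Kubernetes client/server versions
helm version --short                 # Helm 3 on the provisioner node
docker --version                     # Docker on cluster nodes
nvidia-smi --query-gpu=driver_version --format=csv,noheader   # NVIDIA driver
```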

Bugs/Enhancements

  • Add additional Slurm cluster deployment validation tests
  • Fix Rook script to properly delete cluster with Helm 3
  • Fix default OpenMPI build compatibility with Slurm
  • Fix unnecessary rebuild in PMIx
  • Update Slurm install to contain SSH sessions by default
  • Fix bug with NVIDIA GPU driver role on RHEL/CentOS
  • Clean up and consolidate Rook scripts (poll_ceph.sh and rmrook.sh rolled into `deploy_rook.sh -d` and `deploy_rook.sh -w`)
  • Additional testing (MPI, Rook, ...)

New Directory Structure

├── config.example
│   ├── airgap
│   ├── helm
│   └── pxe
│       └── machines
├── docs
│   ├── airgap
│   ├── deepops
│   ├── img
│   ├── k8s-cluster
│   ├── ngc-ready
│   ├── pxe
│   └── slurm-cluster
├── playbooks
│   ├── airgap
│   ├── bootstrap
│   ├── container
│   ├── generic
│   ├── k8s-cluster
│   ├── nvidia-dgx
│   ├── nvidia-egx
│   ├── nvidia-software
│   ├── provisioning
│   ├── slurm-cluster
│   └── utilities
├── roles
│   ├── autofs
│   ├── container-registry
│   ├── dns-config
│   ├── easy-build
│   ├── easy-build-packages
│   ├── facts
│   ├── grafana
│   ├── kerberos-client
│   ├── lmod
│   ├── move-home-dirs
│   ├── netapp-trident
│   ├── nfs
│   ├── nhc
│   ├── nis-client
│   ├── nvidia-cuda
│   ├── nvidia-dcgm
│   ├── nvidia-dcgm-exporter
│   ├── nvidia-dgx
│   ├── nvidia-dgx-firmware
│   ├── nvidia-gpu-operator
│   ├── nvidia-gpu-operator-node-prep
│   ├── nvidia-gpu-tests
│   ├── nvidia-hpc-sdk
│   ├── nvidia-k8s-gpu-device-plugin
│   ├── nvidia-k8s-gpu-feature-discovery
│   ├── nvidia-ml
│   ├── offline-repo-mirrors
│   ├── ood-wrapper
│   ├── openmpi
│   ├── openshift
│   ├── prometheus
│   ├── prometheus-node-exporter
│   ├── prometheus-slurm-exporter
│   ├── pyxis
│   ├── roce_backend
│   ├── slurm
│   └── spack
├── scripts
│   ├── airgap
│   ├── deepops
│   ├── generic
│   ├── k8s
│   └── pxe
├── src
│   ├── containers
│   │   ├── ansible
│   │   ├── dgx-firmware
│   │   ├── dgxie
│   │   ├── kubeflow-jupyter-web-app
│   │   ├── nccl-tests
│   │   ├── ngc
│   │   │   ├── pytorch
│   │   │   ├── rapids
│   │   │   └── tensorflow
│   │   ├── pixiecore
│   │   └── pxe
│   │       └── dhcp
│   ├── dashboards
│   └── repo
├── submodules
│   └── kubespray
├── virtual
│   └── scripts
└── workloads
    ├── burn-in
    ├── examples
    │   ├── k8s
    │   │   ├── dask-rapids
    │   │   ├── kubeflow-pipeline-deploy
    │   │   ├── services
    │   │   │   └── logging
    │   │   └── users
    │   └── slurm
    │       ├── dask-rapids
    │       └── mpi-hello
    ├── jenkins
    │   └── scripts
    └── services
        └── k8s
            └── dgxie

Upgrade Steps

If you are upgrading to this version of DeepOps from a previous release, you will need to follow the upgrade section of the Slurm or Kubernetes Deployment Guides. In addition, the `setup.sh` script must be re-run, and any new variables in the `config.example` files should be added to your existing config. For a full diff of the example config from release 20.08.1, run `git diff 20.08.1 20.10 -- config.example/`.
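
A rough outline of those steps on the provisioner node (script paths are assumptions based on the directory layout above; treat this as a sketch, not a verified procedure):

```sh
# Check out the new release and re-run setup to refresh dependencies and submodules
git fetch --tags
git checkout 20.10
./scripts/setup.sh        # location assumed; run the setup.sh shipped with your checkout

# Review example-config changes since 20.08.1 and merge new variables into your config
git diff 20.08.1 20.10 -- config.example/
```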

It is also necessary to upgrade Helm on your provisioner node. This can be done manually, using `./scripts/install_helm.sh` as a reference.
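
If you need to do this by hand, one possible approach (an illustration only; the repo script remains the reference) is to install a Helm 3 release binary directly:

```sh
# Manual Helm 3 install on the provisioner node (the version below is only an example)
HELM_VERSION=v3.3.4
curl -fsSLO "https://get.helm.sh/helm-${HELM_VERSION}-linux-amd64.tar.gz"
tar -xzf "helm-${HELM_VERSION}-linux-amd64.tar.gz"
sudo mv linux-amd64/helm /usr/local/bin/helm
helm version --short      # confirm Helm 3.x is now on the PATH
```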