From 11bea736185cf4cf6412950cd832e2c653f5a16e Mon Sep 17 00:00:00 2001 From: Nick Johnson Date: Thu, 21 Sep 2023 10:33:43 +0100 Subject: [PATCH 01/91] Fix formatting error in list. --- docs/services/virtualmachines/policies.md | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/docs/services/virtualmachines/policies.md b/docs/services/virtualmachines/policies.md index f5ce003f1..fc13c22b5 100644 --- a/docs/services/virtualmachines/policies.md +++ b/docs/services/virtualmachines/policies.md @@ -29,7 +29,9 @@ When a project is due to end, the PI will receive notification of the closure of ## Backup policies -The content of VM disk images is not backed up -The VM disk images are not backed up +The current policy is: + +- The content of VM disk images is not backed up +- The VM disk images are not backed up We strongly advise that you keep copies of any critical data on on an alternative system that is fully backed up. From ba284f5fd607d2e7cc25666af5f28a4226947cd0 Mon Sep 17 00:00:00 2001 From: Nick Johnson Date: Thu, 21 Sep 2023 10:37:33 +0100 Subject: [PATCH 02/91] Correct syntax errors. --- docs/services/virtualmachines/index.md | 1 - docs/services/virtualmachines/policies.md | 4 ++-- 2 files changed, 2 insertions(+), 3 deletions(-) diff --git a/docs/services/virtualmachines/index.md b/docs/services/virtualmachines/index.md index 1c73ebdfc..8c374f193 100644 --- a/docs/services/virtualmachines/index.md +++ b/docs/services/virtualmachines/index.md @@ -10,7 +10,6 @@ The service currenly has a mixture of hardware node types which host VMs of vari The shapes and sizes of the flavours are based on subdivisions of this hardware, noting that CPUs are 4x oversubscribed for mcomp nodes (general VM flavours). - ## Service Access Users should have an EIDF account - [EIDF Accounts](../../access/project.md). diff --git a/docs/services/virtualmachines/policies.md b/docs/services/virtualmachines/policies.md index fc13c22b5..24ca28047 100644 --- a/docs/services/virtualmachines/policies.md +++ b/docs/services/virtualmachines/policies.md @@ -31,7 +31,7 @@ When a project is due to end, the PI will receive notification of the closure of The current policy is: -- The content of VM disk images is not backed up -- The VM disk images are not backed up +* The content of VM disk images is not backed up +* The VM disk images are not backed up We strongly advise that you keep copies of any critical data on on an alternative system that is fully backed up. From 06c11127654b4f452904da3376220607350ee7f1 Mon Sep 17 00:00:00 2001 From: agrant3 Date: Wed, 4 Oct 2023 15:36:32 +0100 Subject: [PATCH 03/91] AG: added specific FAQ to the GPU Service to avoid mixing a lot of specific service issues into the main FAQ. --- docs/services/gpuservice/faq.md | 31 +++++++++++++++++++++++++++++++ mkdocs.yml | 1 + 2 files changed, 32 insertions(+) create mode 100644 docs/services/gpuservice/faq.md diff --git a/docs/services/gpuservice/faq.md b/docs/services/gpuservice/faq.md new file mode 100644 index 000000000..fe801ebf6 --- /dev/null +++ b/docs/services/gpuservice/faq.md @@ -0,0 +1,31 @@ +# GPU Service FAQ + +## GPU Service Frequently Asked Questions + +### How do I access the GPU Service? + +The default access route to the GPU Service is via an EIDF DSC VM. The DSC VM will have access to all EIDF resources for your project and can be accessed through the VDI (SSH or if enabled RDP) or via the EIDF SSH Gateway. + +### How do I obtain my project kubeconfig file? 
+ +Project Leads and Managers can access the kubeconfig file from the Project page in the Portal. Project Leads and Managers can provide the file on any of the project VMs or give it to individuals within the project. + +### I can't mount my PVC in multiple containers or pods at the same time + +The current PVC provisioner is based on Ceph RBD. The block devices provided by Ceph to the Kubernetes PV/PVC providers cannot be mounted in multiple pods at the same time. They can only be accessed by one pod at a time, once a pod has unmounted the PVC and terminated, the PVC can be reused by another pod. The service development team is working on new PVC provider systems to alleviate this limitation. + +### How many GPUs can I use in a pod? + +The current limit is 8 GPUs per pod. Each underlying host has 8 GPUs. + +### Why did a validation error occur when submitting a pod or job with a valid specification file? + +If an error like the below occurs: + +```bash +error: error validating "myjobfile.yml": error validating data: the server does not allow access to the requested resource; if you choose to ignore these errors, turn validation off with --validate=false +``` + +There may be an issue with the kubectl version that is being run. This can occur if installing in virtual environments or from packages repositories. + +The current version verified to operate with the GPU Service is v1.24.10. kubectl and the Kubernetes API version can suffer from version skew if not with a defined number of releases. More information can be found on this under the [Kubernetes Version Skew Policy](https://kubernetes.io/releases/version-skew-policy/). diff --git a/mkdocs.yml b/mkdocs.yml index 7280f4ea7..06ab5ab38 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -67,6 +67,7 @@ nav: - "Getting Started": services/gpuservice/training/L1_getting_started.md - "Persistent Volumes": services/gpuservice/training/L2_requesting_persistent_volumes.md - "Running a Pytorch Pod": services/gpuservice/training/L3_running_a_pytorch_task.md + - "GPU Service FAQ": services/gpuservice/faq.md - "Data Management Services": - "Data Catalogue": - "Metadata information": services/datacatalogue/metadata.md From b66904cdfa0b6f0b5175e34b205029f5041accda Mon Sep 17 00:00:00 2001 From: yuzhao Date: Thu, 12 Oct 2023 15:44:31 +0100 Subject: [PATCH 04/91] the solution for insufficient shared memory size --- docs/services/gpuservice/faq.md | 20 ++++++++++++++++++++ 1 file changed, 20 insertions(+) diff --git a/docs/services/gpuservice/faq.md b/docs/services/gpuservice/faq.md index fe801ebf6..8169bfa27 100644 --- a/docs/services/gpuservice/faq.md +++ b/docs/services/gpuservice/faq.md @@ -29,3 +29,23 @@ error: error validating "myjobfile.yml": error validating data: the server does There may be an issue with the kubectl version that is being run. This can occur if installing in virtual environments or from packages repositories. The current version verified to operate with the GPU Service is v1.24.10. kubectl and the Kubernetes API version can suffer from version skew if not with a defined number of releases. More information can be found on this under the [Kubernetes Version Skew Policy](https://kubernetes.io/releases/version-skew-policy/). + + +### Insufficient Shared Memory Size + +My SHM is very small, and it causes "OSError: [Errno 28] No space left on device" when I train a model using multi-GPU. How to increase SHM size? + +The default size of SHM is only 64M. 
You can mount an empty dir to /dev/shm to solve this problem: +```yaml + spec: + containers: + - name: [NAME] + image: [IMAGE] + volumeMounts: + - mountPath: /dev/shm + name: dshm + volumes: + - name: dshm + emptyDir: + medium: Memory +``` \ No newline at end of file From 8305734f3922426bebe1f599a4b82d0ece6938c0 Mon Sep 17 00:00:00 2001 From: Joseph Lee Date: Wed, 25 Oct 2023 17:52:26 +0100 Subject: [PATCH 05/91] JL: added graphcore service documentation --- docs/services/graphcore/faq.md | 9 + docs/services/graphcore/index.md | 37 ++++ .../graphcore/training/L1_getting_started.md | 109 ++++++++++ .../graphcore/training/L2_multiple_IPU.md | 7 + .../graphcore/training/L3_profiling.md | 63 ++++++ .../graphcore/training/L4_other_frameworks.md | 190 ++++++++++++++++++ docs/services/index.md | 2 + mkdocs.yml | 8 + 8 files changed, 425 insertions(+) create mode 100644 docs/services/graphcore/faq.md create mode 100644 docs/services/graphcore/index.md create mode 100644 docs/services/graphcore/training/L1_getting_started.md create mode 100644 docs/services/graphcore/training/L2_multiple_IPU.md create mode 100644 docs/services/graphcore/training/L3_profiling.md create mode 100644 docs/services/graphcore/training/L4_other_frameworks.md diff --git a/docs/services/graphcore/faq.md b/docs/services/graphcore/faq.md new file mode 100644 index 000000000..39a0c8fd2 --- /dev/null +++ b/docs/services/graphcore/faq.md @@ -0,0 +1,9 @@ +# Graphcore FAQ + +## Graphcore Questions + +### How do I delete a running/terminated pod? + +`IPUJobs` manages the launcher and worker `pods`, therefore the pods will be deleted when the `IPUJob` is deleted, using `kubectl delete ipujobs `. If only the `pod` is deleted via `kubectl delete pod`, the `IPUJob` may respawn the `pod`. + +To see running or terminated `IPUJobs`, run `kubectl get ipujobs`. diff --git a/docs/services/graphcore/index.md b/docs/services/graphcore/index.md new file mode 100644 index 000000000..77cc1839d --- /dev/null +++ b/docs/services/graphcore/index.md @@ -0,0 +1,37 @@ +# Overview + +EIDF hosts a Graphcore Bow Pod64 system for AI acceleration. + +The specification of the Bow Pod64 is: + +- 16x Bow-2000 machines +- 64x Bow IPUs (4 IPUs per Bow-2000) +- 94,208 IPU cores (1472 cores per IPU) +- 57.6GB of In-Processor-Memory (0.9GB per IPU) + +For more details about the IPU architecture, see [documentation from Graphcore](https://docs.graphcore.ai/projects/ipu-programmers-guide/en/latest/about_ipu.html#). + +The smallest unit of compute resource that can be requested is a single IPU. + +Similarly to the EIDF GPU Service, usage of the graphcore is managed using [Kubernetes](https://kubernetes.io). + +## Service Access + +## Project Quotas + +## Graphcore Tutorial + +The following tutorial teaches users how to submit tasks to the graphcore system. This tutorial assumes basic familiary with submitting jobs via Kubernetes. For a tutorial on using Kubernetes, see the [GPU service tutorial](../gpuservice/training/L1_getting_started.md). For more in-depth lessons about developing applications for graphcore, see [the general documentation](https://docs.graphcore.ai/en/latest/) and [guide for creating IPU jobs via Kubernetes](https://docs.graphcore.ai/projects/kubernetes-user-guide/en/latest/creating-ipujob.html). + +| Lesson | Objective | +|-----------------------------------|-------------------------------------| +| [Getting started with IPU jobs](training/L1_getting_started.md) | a. How to send an IPUJob.
b. Monitoring and Cancelling your IPUJob. | +| [Multi-IPU Jobs](training/L2_multiple_IPU.md) | a. Using multiple IPUs for distributed training. | +| [Profiling with PopVision](training/L3_profiling.md) | a. Enabling profiling in your code.
b. Downloading the profile reports. | +| [Other Frameworks](training/L4_other_frameworks.md) | a. Using Tensorflow and PopART.
b. Writing IPU programs with PopLibs (C++).| + +## Further Reading and Help + +- The [Graphcore documentation](https://docs.graphcore.ai/en/latest/) provides information about using the Graphcore system. + +- The [Graphcore examples repository on github](https://github.com/graphcore/examples/tree/master) provides a catalogue of application examples that have been optimised to run on Graphcore IPUs for both training and inference. It also contains tutorials for using various frameworks. diff --git a/docs/services/graphcore/training/L1_getting_started.md b/docs/services/graphcore/training/L1_getting_started.md new file mode 100644 index 000000000..39f2dcece --- /dev/null +++ b/docs/services/graphcore/training/L1_getting_started.md @@ -0,0 +1,109 @@ +# Getting started with Graphcore IPU Jobs + +This guide assumes basic familiarity with Kubernetes (K8s) and usage of `kubectl`. See [GPU service tutorial](../gpuservice/training/L1_getting_started.md) to get started. + +## Introduction + +Graphcore provides prebuilt docker containers (full lists [here](https://hub.docker.com/u/graphcore)) which contain the required libraries (pytorch, tensorflow, poplar etc.) and can be used directly within the K8s to run on the Graphcore IPUs. + +There are two ways of running an IPU training job: + +1. a single `Worker` Pod +1. multiple `Worker` Pods with a dedicated `Launcher` Pod + +In this tutorial we will cover the first scenario, which is suitable for training with a single IPU. The subsequent tutorial will cover the second scenario, which can be used for distrubed training jobs. + +## Creating your first IPU job + +For our first IPU job, we will be using the Graphcore PyTorch (PopTorch) container image (`graphcore/pytorch:3.3.0`) to run a simple example of training a neural network for classification on the MNIST dataset, which is provided [here](https://github.com/graphcore/examples/tree/master/tutorials/simple_applications/pytorch/mnist). More applications can be found in the repository . + +To get started: + +1. to specify the job - create the file `mnist-training-ipujob.yaml`, then copy and save the following content into the file: + + ``` yaml + apiVersion: graphcore.ai/v1alpha1 + kind: IPUJob + metadata: + name: mnist-training + spec: + # jobInstances defines the number of job instances. + # More than 1 job instance is usually useful for inference jobs only. + jobInstances: 1 + # ipusPerJobInstance refers to the number of IPUs required per job instance. + # A separate IPU partition of this size will be created by the IPU Operator + # for each job instance. + ipusPerJobInstance: "1" + workers: + template: + spec: + containers: + - name: mnist-training + image: graphcore/pytorch:3.3.0 + command: ["bash"] + args: ["-c", "cd && mkdir build && cd build && git clone https://github.com/graphcore/examples.git && cd examples/tutorials/simple_applications/pytorch/mnist && python -m pip install -r requirements.txt && python mnist_poptorch_code_only.py --epochs 1"] + restartPolicy: Never + ``` + +1. to submit the job - run `kubectl create -f mnist-training-ipujob.yaml`, which will give the following output: + + ``` bash + ipujob.graphcore.ai/mnist-training created + ``` + +1. to monitor progress of the job - run `kubectl get pods`, which will give the following output + + ``` bash + NAME READY STATUS RESTARTS AGE + mnist-training-worker-0 0/1 Completed 0 2m56s + ``` + +1. to read the result - run `kubectl logs mnist-training-worker-0`, which will give the following output (or similar) + + ``` bash + ... 
+ Graph compilation: 100%|██████████| 100/100 [00:23<00:00] + Epochs: 100%|██████████| 1/1 [00:34<00:00, 34.18s/it] + ... + Accuracy on test set: 97.08% + ``` + +## Monitoring and Cancelling your IPU job + +An IPU job creates an IPU Operator, which manages the required worker or launcher pods. To see running or complete `IPUjobs`, run `kubectl get ipujobs`, which will show: + +``` bash +NAME STATUS CURRENT DESIRED LASTMESSAGE AGE +mnist-training Completed 0 1 All instances done 10m +``` + +To delete the `IPUjob`, run `kubectl delete ipujobs `, e.g. `kubectl delete ipujobs mnist-training`. This will also delete the associated worker pod `mnist-training-worker-0`. + +Note: simply deleting the pod via `kubectl delete pods mnist-training-worker-0` does not delete the IPU job, which will need to be deleted separately. + +Note: you can list all pods via `kubectl get all` or `kubectl get pods`, but they do not show the ipujobs. These can be obtained using `kubectl get ipujobs`. + +Note: `kubectl describe ` provides verbose description of a specific pod. + +## Description + +The Graphcore IPU Operator (Kubernetes interface) extends the Kubernetes API by introducing a custom resource definition (CRD) named `IPUJob`, which can be seen at the beginning of the included yaml file: + +``` yaml +apiVersion: graphcore.ai/v1alpha1 +kind: IPUJob +``` + +An `IPUJob` allows users to defineworkloads that can use IPUs. There are several fields specific to an `IPUJob`: + +**job instances** : This defines the number of jobs. In the case of training it should be 1. + +**ipusPerJobInstance** : This defines the size of IPU partition that will be created for each job instance. + +**workers** : This defines a Pod specification that will be used for `Worker` Pods, including the container image and commands. + +These fields have been populated in the example .yaml file. For distributed training (with multiple IPUs), additional fields need to be included, which will be described in the [next lesson](./L2_multiple_IPU.md). + +## Additional Information + +It is possible to further specify the restart policy (`Always`/`OnFailure`/`Never`/`ExitCode`) and clean up policy (`Workers`/`All`/`None`); see [here](https://docs.graphcore.ai/projects/kubernetes-user-guide/en/latest/creating-ipujob.html). diff --git a/docs/services/graphcore/training/L2_multiple_IPU.md b/docs/services/graphcore/training/L2_multiple_IPU.md new file mode 100644 index 000000000..7be4a9975 --- /dev/null +++ b/docs/services/graphcore/training/L2_multiple_IPU.md @@ -0,0 +1,7 @@ +# Distributed training on multiple IPUs + +Multiple IPUs (in powers of 2) can be requested to perform distributed training. + +In this case, the `IPUJob` also spawns a `launcher` , which is a Pod that runs an `mpirun` or `poprun` command. These commands start workloads inside `worker` Pods. + +As an example, we will run the same MNIST training tutorial from the previous lesson, but use two IPUs. diff --git a/docs/services/graphcore/training/L3_profiling.md b/docs/services/graphcore/training/L3_profiling.md new file mode 100644 index 000000000..c723eae73 --- /dev/null +++ b/docs/services/graphcore/training/L3_profiling.md @@ -0,0 +1,63 @@ +# Profiling with PopVision + +Graphcore provides various tools for profiling, debugging, and instrumenting programs run on IPUs. In this tutorial we will briefly demonstrate an example using the PopVision Graph Analyser. 
For more information, see [Profiling and Debugging](https://docs.graphcore.ai/en/latest/child-pages/profiling-debugging.html#profiling-debugging) and [PopVision Graph Analyser User Guide](https://docs.graphcore.ai/en/latest/child-pages/profiling-debugging.html#profiling-debugging). + +We will reuse the same PyTorch MNIST example from [lesson 1](./L1_getting_started.md) (from ). + +To enable profiling and [create IPU reports](https://docs.graphcore.ai/projects/graph-analyser-userguide/en/latest/capturing-ipu-reports.html), we need to add the following line to the training script `mnist_poptorch_code_only.py` : + +``` python +training_opts = training_opts.enableProfiling() +``` + +(for details the API, see [API reference](https://docs.graphcore.ai/projects/poptorch-user-guide/en/latest/reference.html#poptorch.Options)) + +Save and run `kubectl create -f ` on the following: + +``` yaml +apiVersion: graphcore.ai/v1alpha1 +kind: IPUJob +metadata: + name: mnist-training-profiling +spec: + # jobInstances defines the number of job instances. + # More than 1 job instance is usually useful for inference jobs only. + jobInstances: 1 + # ipusPerJobInstance refers to the number of IPUs required per job instance. + # A separate IPU partition of this size will be created by the IPU Operator + # for each job instance. + ipusPerJobInstance: "1" + workers: + template: + spec: + containers: + - name: mnist-training-profiling + image: graphcore/pytorch:3.3.0 + command: ["bash"] + args: ["-c", "cd && mkdir build && cd build && git clone https://github.com/graphcore/examples.git && cd examples/tutorials/simple_applications/pytorch/mnist && python -m pip install -r requirements.txt && sed -i '131i training_opts = training_opts.enableProfiling()' mnist_poptorch_code_only.py && python mnist_poptorch_code_only.py --epochs 1 && echo 'RUNNING ls ./training' && ls training"] + restartPolicy: Never +``` + +After completion, using `kubectl logs `, we can see the following result + +``` bash +... +Accuracy on test set: 96.69% +RUNNING ls ./training +archive.a +profile.pop +``` + +We can see that the training has created two Poplar report files: `archive.a` which is an archive of the ELF executable files, one for each tile; and `profile.pop`, the poplar profile, which contains compile-time and execution information about the Poplar graph. + +## Downloading the profile reports + +To download the traing profiles to your local environment, you can use `kubectl cp`. For example, run + +``` bash +kubectl cp :/root/build/examples/tutorials/simple_applications/pytorch/mnist/training . +``` + +Once you have downloaded the profile report files, you can view the contents locally using the PopVision Graph Analyser tool, which is available for download here . + +From the Graph Analyser, you can analyse information including memory usage, execution trace and more. diff --git a/docs/services/graphcore/training/L4_other_frameworks.md b/docs/services/graphcore/training/L4_other_frameworks.md new file mode 100644 index 000000000..11349d645 --- /dev/null +++ b/docs/services/graphcore/training/L4_other_frameworks.md @@ -0,0 +1,190 @@ +# Other Frameworks + +In this tutorial we'll briefly cover running tensorflow and PopART for Machine Learning, and writing IPU programs directly via the PopLibs library in C++. Extra links and resources will be provided for more in-depth information. + +## Terminology + +Within Graphcore, `Poplar` refers to the tools (e.g. 
Poplar `Graph Engine` or Poplar `Graph Compiler`) and libraries (`PopLibs`) for programming on IPUs. + +The `Poplar SDK` is a package of software development tools, including + +- TensorFlow 1 and 2 for the IPU +- PopTorch (Wrapper around PyTorch for running on IPU) +- PopART (Poplar Advanced Run-Time, provides support for importing, creating, and running ONNX graphs on the IPU) +- Poplar and PopLibs +- PopDist (Poplar Distributed Configuration Library) and PopRun (Command line utility to launch distributed applications) +- Device drivers and command line tools for managing the IPU + +For more details see [here](https://docs.graphcore.ai/projects/graphcore-glossary/en/latest/index.html#term-Poplar). + +## Other ML frameworks: Tensorflow and PopART + +Besides being able to run PyTorch code, as demonstrated in the previous lessons, the Poplar SDK also supports running ML learning applications with tensorflow or PopART. + +### Tensorflow + +The Poplar SDK includes implementation of TensorFlow and Keras for the IPU. + +For more information, refer to [Targeting the IPU from TensorFlow 2](https://docs.graphcore.ai/projects/tensorflow-user-guide/en/latest/index.html) and [TensorFlow 2 Quick Start](https://docs.graphcore.ai/projects/tensorflow2-quick-start/en/latest/index.html). + +These are available from the image `graphcore/tensorflow:2`. + +For a quick example, we will run an example script from . To get started, save the following yaml and run `kubectl create -f ` to create the IPUJob: + +``` yaml +apiVersion: graphcore.ai/v1alpha1 +kind: IPUJob +metadata: + name: tensorflow-example +spec: + jobInstances: 1 + ipusPerJobInstance: "1" + workers: + template: + spec: + containers: + - name: tensorflow-example + image: graphcore/tensorflow:2 + command: ["bash"] + args: ["-c", "apt update && apt upgrade -y && apt install git -y && cd && mkdir build && cd build && git clone https://github.com/graphcore/examples.git && cd examples/tutorials/simple_applications/tensorflow2/mnist && python -m pip install -r requirements.txt && python mnist_code_only.py --epochs 1"] + restartPolicy: Never +``` + +Running `kubectl logs ` should show the results similar to the following + +``` bash +... +2023-10-25 13:21:40.263823: I tensorflow/compiler/plugin/poplar/driver/poplar_platform.cc:43] Poplar version: 3.2.0 (1513789a51) Poplar package: b82480c629 +2023-10-25 13:21:42.203515: I tensorflow/compiler/plugin/poplar/driver/poplar_executor.cc:1619] TensorFlow device /device:IPU:0 attached to 1 IPU with Poplar device ID: 0 +Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz +11493376/11490434 [==============================] - 0s 0us/step +11501568/11490434 [==============================] - 0s 0us/step +2023-10-25 13:21:43.789573: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2) +2023-10-25 13:21:44.164207: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:210] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable. +2023-10-25 13:21:57.935339: I tensorflow/compiler/jit/xla_compilation_cache.cc:376] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process. 
+Epoch 1/4 +2000/2000 [==============================] - 17s 8ms/step - loss: 0.6188 +Epoch 2/4 +2000/2000 [==============================] - 1s 427us/step - loss: 0.3330 +Epoch 3/4 +2000/2000 [==============================] - 1s 371us/step - loss: 0.2857 +Epoch 4/4 +2000/2000 [==============================] - 1s 439us/step - loss: 0.2568 +``` + +### PopART + +The Poplar Advanced Run Time (PopART) enables importing and constructing ONNX graphs, and running graphs in inference, evaluation or training modes. PopART provides both a C++ and Python API. + +For more information, see the [PopART User Guide](https://docs.graphcore.ai/projects/popart-user-guide/en/latest/intro.html) + +PopART is available from the image `graphcore/popart`. + +For a quick example, we will run an example script from . To get started, save the following yaml and run `kubectl create -f ` to create the IPUJob: + +``` yaml +apiVersion: graphcore.ai/v1alpha1 +kind: IPUJob +metadata: + name: popart-example +spec: + jobInstances: 1 + ipusPerJobInstance: "1" + workers: + template: + spec: + containers: + - name: popart-example + image: graphcore/popart:3.3.0 + command: ["bash"] + args: ["-c", "cd && mkdir build && cd build && git clone https://github.com/graphcore/tutorials.git && cd tutorials && git checkout sdk-release-3.1 && cd simple_applications/popart/mnist && python3 -m pip install -r requirements.txt && ./get_data.sh && python3 popart_mnist.py --epochs 1"] + restartPolicy: Never +``` + +Running `kubectl logs ` should show the results similar to the following + +``` bash +... +Creating ONNX model. +Compiling the training graph. +Compiling the validation graph. +Running training loop. +Epoch #1 + Loss=16.2605 + Accuracy=88.88% +``` + +## Writing IPU programs directly with PopLibs + +The Poplar libraries are a set of C++ libraries consisting of the Poplar graph library and the open-source PopLibs libraries. + +The Poplar graph library provides direct access to the IPU by code written in C++. You can write complete programs using Poplar, or use it to write functions to be called from your application written in a higher-level framework such as TensorFlow. + +The PopLibs libraries are a set of application libraries that implement operations commonly required by machine learning applications, such as linear algebra operations, element-wise tensor operations, non-linearities and reductions. These provide a fast and easy way to create programs that run efficiently using the parallelism of the IPU. + +For more information, see [Poplar Quick Start](https://docs.graphcore.ai/projects/poplar-quick-start/en/latest/index.html) and [Poplar and PopLibs User Guide](https://docs.graphcore.ai/projects/poplar-user-guide/en/latest/index.html). + +These are available from the image `graphcore/poplar`. + +When using the PopLibs libraries, you will have to include the include files in the `include/popops` directory, e.g. + +``` c++ +#include +``` + +and to link the relevant PopLibs libraries, in addition to the Poplar library, e.g. + +``` bash +g++ -std=c++11 my-program.cpp -lpoplar -lpopops +``` + +For a quick example, we will run an example from . 
To get started, save the following yaml and run `kubectl create -f ` to create the IPUJob: + +``` yaml +apiVersion: graphcore.ai/v1alpha1 +kind: IPUJob +metadata: + name: poplib-example +spec: + jobInstances: 1 + ipusPerJobInstance: "1" + workers: + template: + spec: + containers: + - name: poplib-example + image: graphcore/poplar:3.3.0 + command: ["bash"] + args: ["-c", "cd && mkdir build && cd build && git clone https://github.com/graphcore/examples.git && cd examples/tutorials/simple_applications/poplar/mnist/ && ./get_data.sh && make && ./regression-demo -IPU 1 50"] + restartPolicy: Never +``` + +Running `kubectl logs ` should show the results similar to the following + +``` bash +... +Using the IPU +Trying to attach to IPU +Attached to IPU 0 +Target: + Number of IPUs: 1 + Tiles per IPU: 1,472 + Total Tiles: 1,472 + Memory Per-Tile: 624.0 kB + Total Memory: 897.0 MB + Clock Speed (approx): 1,850.0 MHz + Number of Replicas: 1 + IPUs per Replica: 1 + Tiles per Replica: 1,472 + Memory per Replica: 897.0 MB + +Graph: + Number of vertices: 5,466 + Number of edges: 16,256 + Number of variables: 41,059 + Number of compute sets: 20 + +... + +Epoch 1 (99%), accuracy 76% +``` diff --git a/docs/services/index.md b/docs/services/index.md index d49608479..462342a48 100644 --- a/docs/services/index.md +++ b/docs/services/index.md @@ -12,6 +12,8 @@ [Ultra2](./ultra2/) +[Graphcore Bow Pod64](./graphcore/) + ## Data Management Services [Data Catalogue](./datacatalogue/) diff --git a/mkdocs.yml b/mkdocs.yml index 06ab5ab38..cbfe4d1d2 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -68,6 +68,14 @@ nav: - "Persistent Volumes": services/gpuservice/training/L2_requesting_persistent_volumes.md - "Running a Pytorch Pod": services/gpuservice/training/L3_running_a_pytorch_task.md - "GPU Service FAQ": services/gpuservice/faq.md + - "Graphcore Bow Pod64": + - "Overview": services/graphcore/index.md + - "Tutorial": + - "Getting Started": services/graphcore/training/L1_getting_started.md + - "Multi-IPU Jobs": services/graphcore/training/L2_multiple_IPU.md + - "Profiling": services/graphcore/training/L3_profiling.md + - "Other Frameworks": services/graphcore/training/L4_other_frameworks.md + - "Graphcore FAQ": services/graphcore/faq.md - "Data Management Services": - "Data Catalogue": - "Metadata information": services/datacatalogue/metadata.md From d7ae44dda2b2826510feb333ca969af76ea4b74b Mon Sep 17 00:00:00 2001 From: Ruairidh MacLeod Date: Fri, 27 Oct 2023 10:42:36 +0100 Subject: [PATCH 06/91] fix typo in link --- docs/safe-haven-services/using-the-hpc-cluster.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/safe-haven-services/using-the-hpc-cluster.md b/docs/safe-haven-services/using-the-hpc-cluster.md index 278344007..04ae720da 100644 --- a/docs/safe-haven-services/using-the-hpc-cluster.md +++ b/docs/safe-haven-services/using-the-hpc-cluster.md @@ -20,7 +20,7 @@ Minor software changes will be made as soon as admin effort can be allocated. Ma Login to the HPC system is from the project VM using SSH and is not direct from the VDI. The HPC cluster accounts are the same accounts used on the project VMs, with the same username and password. All project data access on the HPC system is private to the project accounts as it is on the VMs, but it is important to understand that the TRE HPC cluster is shared by projects in other TRE Safe Havens. -To login to the HPC cluster from the project VMs use `ssh shs-sdf01` from an xterm. 
If you wish to avoid entry of the account password for every SSH session or remote command execution you can use SSH key authentication by following the [SSH key configuration instructions here]([https://hpc-wiki.info/hpc/Ssh_keys). SSH key passphrases are not strictly enforced within the Safe Haven but are strongly encouraged. +To login to the HPC cluster from the project VMs use `ssh shs-sdf01` from an xterm. If you wish to avoid entry of the account password for every SSH session or remote command execution you can use SSH key authentication by following the [SSH key configuration instructions here](https://hpc-wiki.info/hpc/Ssh_keys). SSH key passphrases are not strictly enforced within the Safe Haven but are strongly encouraged. ## Running Jobs From e86a14fb4b1a58a18d9fffabe7d9b34876145e1f Mon Sep 17 00:00:00 2001 From: Joseph Lee Date: Thu, 9 Nov 2023 12:45:29 +0000 Subject: [PATCH 07/91] JL: add graphcore multi-IPU & yaml formatting --- docs/services/graphcore/faq.md | 6 + .../graphcore/training/L1_getting_started.md | 32 ++- .../graphcore/training/L2_multiple_IPU.md | 263 +++++++++++++++++- .../graphcore/training/L3_profiling.md | 20 +- .../graphcore/training/L4_other_frameworks.md | 30 +- 5 files changed, 329 insertions(+), 22 deletions(-) diff --git a/docs/services/graphcore/faq.md b/docs/services/graphcore/faq.md index 39a0c8fd2..0e975f6d1 100644 --- a/docs/services/graphcore/faq.md +++ b/docs/services/graphcore/faq.md @@ -7,3 +7,9 @@ `IPUJobs` manages the launcher and worker `pods`, therefore the pods will be deleted when the `IPUJob` is deleted, using `kubectl delete ipujobs `. If only the `pod` is deleted via `kubectl delete pod`, the `IPUJob` may respawn the `pod`. To see running or terminated `IPUJobs`, run `kubectl get ipujobs`. + +### My IPUJob died with a message: `'poptorch_cpp_error': Failed to acquire X IPU(s)`. Why? + +This error may appear when the IPUJob name is too long. + +We have identified that for IPUJobs with `metadata:name` length over 36 characters, this error may appear. A solution is to reduce the name to under 36 characters. diff --git a/docs/services/graphcore/training/L1_getting_started.md b/docs/services/graphcore/training/L1_getting_started.md index 39f2dcece..e68c4b81c 100644 --- a/docs/services/graphcore/training/L1_getting_started.md +++ b/docs/services/graphcore/training/L1_getting_started.md @@ -6,12 +6,7 @@ This guide assumes basic familiarity with Kubernetes (K8s) and usage of `kubectl Graphcore provides prebuilt docker containers (full lists [here](https://hub.docker.com/u/graphcore)) which contain the required libraries (pytorch, tensorflow, poplar etc.) and can be used directly within the K8s to run on the Graphcore IPUs. -There are two ways of running an IPU training job: - -1. a single `Worker` Pod -1. multiple `Worker` Pods with a dedicated `Launcher` Pod - -In this tutorial we will cover the first scenario, which is suitable for training with a single IPU. The subsequent tutorial will cover the second scenario, which can be used for distrubed training jobs. +In this tutorial we will cover running training with a single IPU. The subsequent tutorial will cover using multiple IPUs, which can be used for distrubed training jobs. 
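
Before creating your first IPUJob it can be worth confirming that `kubectl` can reach the cluster with your project kubeconfig. The commands below are a minimal sketch; the kubeconfig path is only an example and will depend on where you saved the file provided for your project.

``` bash
# Point kubectl at the kubeconfig supplied for your project (example path only)
export KUBECONFIG="$HOME/.kube/eidf-project-kubeconfig"

# Both commands should return without authentication errors;
# "No resources found" simply means nothing has been submitted yet
kubectl get pods
kubectl get ipujobs
```
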
## Creating your first IPU job @@ -40,9 +35,30 @@ To get started: containers: - name: mnist-training image: graphcore/pytorch:3.3.0 - command: ["bash"] - args: ["-c", "cd && mkdir build && cd build && git clone https://github.com/graphcore/examples.git && cd examples/tutorials/simple_applications/pytorch/mnist && python -m pip install -r requirements.txt && python mnist_poptorch_code_only.py --epochs 1"] + command: [/bin/bash, -c, --] + args: + - | + cd; + mkdir build; + cd build; + git clone https://github.com/graphcore/examples.git; + cd examples/tutorials/simple_applications/pytorch/mnist; + python -m pip install -r requirements.txt; + python mnist_poptorch_code_only.py --epochs 1 + securityContext: + capabilities: + add: + - IPC_LOCK + volumeMounts: + - mountPath: /dev/shm + name: devshm restartPolicy: Never + hostIPC: true + volumes: + - emptyDir: + medium: Memory + sizeLimit: 10Gi + name: devshm ``` 1. to submit the job - run `kubectl create -f mnist-training-ipujob.yaml`, which will give the following output: diff --git a/docs/services/graphcore/training/L2_multiple_IPU.md b/docs/services/graphcore/training/L2_multiple_IPU.md index 7be4a9975..7bc9a9a76 100644 --- a/docs/services/graphcore/training/L2_multiple_IPU.md +++ b/docs/services/graphcore/training/L2_multiple_IPU.md @@ -1,7 +1,264 @@ # Distributed training on multiple IPUs -Multiple IPUs (in powers of 2) can be requested to perform distributed training. +In this tutorial, we will cover how to run larger models, including examples provided by Graphcore on . These may require distributed training on multiple IPUs. -In this case, the `IPUJob` also spawns a `launcher` , which is a Pod that runs an `mpirun` or `poprun` command. These commands start workloads inside `worker` Pods. +The number of IPUs requested must be in powers of two, i.e. 1, 2, 4, 8, 16, 32, or 64. -As an example, we will run the same MNIST training tutorial from the previous lesson, but use two IPUs. +## First example + +As an example, we will use 4 IPUs to perform the pre-training step of BERT, an NLP transformer model. The code is available from . + +To get started, save and create an IPUJob with the following `.yaml` file: + +``` yaml +apiVersion: graphcore.ai/v1alpha1 +kind: IPUJob +metadata: + name: bert-training-multi-ipu +spec: + jobInstances: 1 + ipusPerJobInstance: "4" + workers: + template: + spec: + containers: + - name: mnist-training-profiling-alex-2 + image: graphcore/pytorch:3.3.0 + command: [/bin/bash, -c, --] + args: + - | + cd ; + mkdir build; + cd build ; + git clone https://github.com/graphcore/examples.git; + cd examples/nlp/bert/pytorch; + apt update ; + apt upgrade -y; + DEBIAN_FRONTEND=noninteractive TZ='Europe/London' apt install $(< required_apt_packages.txt) -y ; + pip3 install -r requirements.txt ; + python3 run_pretraining.py --dataset generated --config pretrain_base_128_pod4 --training-steps 1 + securityContext: + capabilities: + add: + - IPC_LOCK + volumeMounts: + - mountPath: /dev/shm + name: devshm + restartPolicy: Never + hostIPC: true + volumes: + - emptyDir: + medium: Memory + sizeLimit: 10Gi + name: devshm +``` + +Running the above IPUJob and querying the log via `kubectl logs pod/bert-training-multi-ipu-worker-0` should give: + +``` bash +... 
+Data loaded in 8.559805537108332 secs +----------------------------------------------------------- +-------------------- Device Allocation -------------------- +Embedding --> IPU 0 +Encoder 0 --> IPU 1 +Encoder 1 --> IPU 1 +Encoder 2 --> IPU 1 +Encoder 3 --> IPU 1 +Encoder 4 --> IPU 2 +Encoder 5 --> IPU 2 +Encoder 6 --> IPU 2 +Encoder 7 --> IPU 2 +Encoder 8 --> IPU 3 +Encoder 9 --> IPU 3 +Encoder 10 --> IPU 3 +Encoder 11 --> IPU 3 +Pooler --> IPU 0 +Classifier --> IPU 0 +----------------------------------------------------------- +---------- Compilation/Loading from Cache Started --------- + +... + +Graph compilation: 100%|██████████| 100/100 [08:02<00:00] +Compiled/Loaded model in 500.756152929971 secs +----------------------------------------------------------- +--------------------- Training Started -------------------- +Step: 0 / 0 - LR: 0.00e+00 - total loss: 10.817 - mlm_loss: 10.386 - nsp_loss: 0.432 - mlm_acc: 0.000 % - nsp_acc: 1.000 %: 0%| | 0/1 [00:16` should produce: + +``` bash +... + =========================================================================================== +| poprun topology | +|===========================================================================================| +10:10:50.154 1 POPRUN [D] Done polling, final state of p-bert-poprun-64ipus-gc-dev-0: PS_ACTIVE +10:10:50.154 1 POPRUN [D] Target options from environment: {} +| hosts | localhost | +|-----------|-------------------------------------------------------------------------------| +| ILDs | 0 | +|-----------|-------------------------------------------------------------------------------| +| instances | 0 | +|-----------|-------------------------------------------------------------------------------| +| replicas | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | + ------------------------------------------------------------------------------------------- +10:10:50.154 1 POPRUN [D] Target options from V-IPU partition: {"ipuLinkDomainSize":"64","ipuLinkConfiguration":"slidingWindow","ipuLinkTopology":"torus","gatewayMode":"true","instanceSize":"64"} +10:10:50.154 1 POPRUN [D] Using target options: {"ipuLinkDomainSize":"64","ipuLinkConfiguration":"slidingWindow","ipuLinkTopology":"torus","gatewayMode":"true","instanceSize":"64"} +10:10:50.203 1 POPRUN [D] No hosts specified; ignoring host-subnet setting +10:10:50.203 1 POPRUN [D] Default network/RNIC for host communication: None +10:10:50.203 1 POPRUN [I] Running command: /opt/poplar/bin/mpirun '--tag-output' '--bind-to' 'none' '--tag-output' +'--allow-run-as-root' '-np' '1' '-x' 'POPDIST_NUM_TOTAL_REPLICAS=16' '-x' 'POPDIST_NUM_IPUS_PER_REPLICA=4' '-x' +'POPDIST_NUM_LOCAL_REPLICAS=16' '-x' 'POPDIST_UNIFORM_REPLICAS_PER_INSTANCE=1' '-x' 'POPDIST_REPLICA_INDEX_OFFSET=0' '-x' +'POPDIST_LOCAL_INSTANCE_INDEX=0' '-x' 'IPUOF_VIPU_API_HOST=10.21.21.129' '-x' 'IPUOF_VIPU_API_PORT=8090' '-x' +'IPUOF_VIPU_API_PARTITION_ID=p-bert-poprun-64ipus-gc-dev-0' '-x' 'IPUOF_VIPU_API_TIMEOUT=120' '-x' 'IPUOF_VIPU_API_GCD_ID=0' +'-x' 'IPUOF_LOG_LEVEL=WARN' '-x' 'PATH' '-x' 'LD_LIBRARY_PATH' '-x' 'PYTHONPATH' '-x' 'POPLAR_TARGET_OPTIONS= +{"ipuLinkDomainSize":"64","ipuLinkConfiguration":"slidingWindow","ipuLinkTopology":"torus","gatewayMode":"true", +"instanceSize":"64"}' 'python3' 'run_pretraining.py' '--config' 'pretrain_large_128_POD64' '--dataset' 'generated' '--training-steps' '1' +10:10:50.204 1 POPRUN [I] Waiting for mpirun (PID 4346) +[1,0]: Registered metric hook: total_compiling_time with object: +[1,0]:Using config: pretrain_large_128_POD64 
+... +Graph compilation: 100%|██████████| 100/100 [10:11<00:00][1,0]: +[1,0]:Compiled/Loaded model in 683.6591004971415 secs +[1,0]:----------------------------------------------------------- +[1,0]:--------------------- Training Started -------------------- +Step: 0 / 0 - LR: 0.00e+00 - total loss: 11.260 - mlm_loss: 10.397 - nsp_loss: 0.863 - mlm_acc: 0.000 % - nsp_acc: 0.052 %: 0%| | 0/1 [00:03: +[1,0]:----------------------------------------------------------- +[1,0]:-------------------- Training Metrics --------------------- +[1,0]:global_batch_size: 65536 +[1,0]:device_iterations: 1 +[1,0]:training_steps: 1 +[1,0]:Training time: 3.718 secs +[1,0]:----------------------------------------------------------- +``` + +## Notes on using the examples respository + +Graphcore provides examples of a variety of models on Github . When following the instructions, note that since we are using a container within a Kubernetes pod, there is no need to enable the Poplar/PopART SDK, set up a virtual python environment, or install the PopTorch wheel. diff --git a/docs/services/graphcore/training/L3_profiling.md b/docs/services/graphcore/training/L3_profiling.md index c723eae73..e999e9475 100644 --- a/docs/services/graphcore/training/L3_profiling.md +++ b/docs/services/graphcore/training/L3_profiling.md @@ -20,12 +20,7 @@ kind: IPUJob metadata: name: mnist-training-profiling spec: - # jobInstances defines the number of job instances. - # More than 1 job instance is usually useful for inference jobs only. jobInstances: 1 - # ipusPerJobInstance refers to the number of IPUs required per job instance. - # A separate IPU partition of this size will be created by the IPU Operator - # for each job instance. ipusPerJobInstance: "1" workers: template: @@ -33,8 +28,19 @@ spec: containers: - name: mnist-training-profiling image: graphcore/pytorch:3.3.0 - command: ["bash"] - args: ["-c", "cd && mkdir build && cd build && git clone https://github.com/graphcore/examples.git && cd examples/tutorials/simple_applications/pytorch/mnist && python -m pip install -r requirements.txt && sed -i '131i training_opts = training_opts.enableProfiling()' mnist_poptorch_code_only.py && python mnist_poptorch_code_only.py --epochs 1 && echo 'RUNNING ls ./training' && ls training"] + command: [/bin/bash, -c, --] + args: + - | + cd; + mkdir build; + cd build; + git clone https://github.com/graphcore/examples.git; + cd examples/tutorials/simple_applications/pytorch/mnist; + python -m pip install -r requirements.txt; + sed -i '131i training_opts = training_opts.enableProfiling()' mnist_poptorch_code_only.py; + python mnist_poptorch_code_only.py --epochs 1; + echo 'RUNNING ls ./training'; + ls training restartPolicy: Never ``` diff --git a/docs/services/graphcore/training/L4_other_frameworks.md b/docs/services/graphcore/training/L4_other_frameworks.md index 11349d645..b56e96652 100644 --- a/docs/services/graphcore/training/L4_other_frameworks.md +++ b/docs/services/graphcore/training/L4_other_frameworks.md @@ -45,8 +45,19 @@ spec: containers: - name: tensorflow-example image: graphcore/tensorflow:2 - command: ["bash"] - args: ["-c", "apt update && apt upgrade -y && apt install git -y && cd && mkdir build && cd build && git clone https://github.com/graphcore/examples.git && cd examples/tutorials/simple_applications/tensorflow2/mnist && python -m pip install -r requirements.txt && python mnist_code_only.py --epochs 1"] + command: [/bin/bash, -c, --] + args: + - | + apt update; + apt upgrade -y; + apt install git -y; + cd; + mkdir build; + 
cd build; + git clone https://github.com/graphcore/examples.git; + cd examples/tutorials/simple_applications/tensorflow2/mnist; + python -m pip install -r requirements.txt; + python mnist_code_only.py --epochs 1 restartPolicy: Never ``` @@ -96,8 +107,19 @@ spec: containers: - name: popart-example image: graphcore/popart:3.3.0 - command: ["bash"] - args: ["-c", "cd && mkdir build && cd build && git clone https://github.com/graphcore/tutorials.git && cd tutorials && git checkout sdk-release-3.1 && cd simple_applications/popart/mnist && python3 -m pip install -r requirements.txt && ./get_data.sh && python3 popart_mnist.py --epochs 1"] + command: [/bin/bash, -c, --] + args: + - | + cd ; + mkdir build; + cd build ; + git clone https://github.com/graphcore/tutorials.git; + cd tutorials; + git checkout sdk-release-3.1; + cd simple_applications/popart/mnist; + python3 -m pip install -r requirements.txt; + ./get_data.sh; + python3 popart_mnist.py --epochs 1 restartPolicy: Never ``` From 076300c144f5fa5a4c354f3a5f4e0c576a6d1988 Mon Sep 17 00:00:00 2001 From: Joseph Lee Date: Thu, 9 Nov 2023 16:19:04 +0000 Subject: [PATCH 08/91] JL: edit example job name --- docs/services/graphcore/training/L2_multiple_IPU.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/services/graphcore/training/L2_multiple_IPU.md b/docs/services/graphcore/training/L2_multiple_IPU.md index 7bc9a9a76..582a01d65 100644 --- a/docs/services/graphcore/training/L2_multiple_IPU.md +++ b/docs/services/graphcore/training/L2_multiple_IPU.md @@ -22,7 +22,7 @@ spec: template: spec: containers: - - name: mnist-training-profiling-alex-2 + - name: bert-training-multi-ipu image: graphcore/pytorch:3.3.0 command: [/bin/bash, -c, --] args: From 792dfb0589c929a536d0a63d338991e6e7e6cd71 Mon Sep 17 00:00:00 2001 From: awat31 Date: Mon, 13 Nov 2023 16:37:42 +0000 Subject: [PATCH 09/91] AW: Remove Moba from SSH Guide --- Brewfile | 8 ++++++++ docs/access/ssh.md | 38 +++++++++++++++++--------------------- 2 files changed, 25 insertions(+), 21 deletions(-) create mode 100644 Brewfile diff --git a/Brewfile b/Brewfile new file mode 100644 index 000000000..d0e23d7e7 --- /dev/null +++ b/Brewfile @@ -0,0 +1,8 @@ +tap "homebrew/bundle" +tap "homebrew/cask" +tap "homebrew/core" +brew "git" +brew "nmap" +brew "sshuttle" +brew "wimlib" +cask "zenmap" diff --git a/docs/access/ssh.md b/docs/access/ssh.md index 176d9dd73..dd5587341 100644 --- a/docs/access/ssh.md +++ b/docs/access/ssh.md @@ -79,38 +79,34 @@ Windows will require the installation of OpenSSH-Server or MobaXTerm to use SSH. 1. Select the check box next to ‘OpenSSH Client’ and click ‘Install’ 1. Once this is installed, you can reach your VM by opening CMD and running:
```$ ssh -J [username]@eidf-gateway.epcc.ed.ac.uk [username]@[vm_ip]``` -### Installing MobaXTerm - -1. Download [MobaXTerm](https://mobaxterm.mobatek.net/) from [https://mobaxterm.mobatek.net/](https://mobaxterm.mobatek.net/) -1. Once installed click the ‘Session’ button in the top left corner -1. Click ‘SSH’ -1. In the ‘Remote Host’ section, specify the VM IP -1. Click the ‘Network Settings’ Tab -1. Click the ‘SSH Gateway (jump host)’ button in the middle -1. Under Gateway Host, specify: eidf-gateway.epcc.ed.ac.uk -1. Under Username, specify your username -1. Click ‘OK’ -1. Click ‘OK’ to launch the session -1. For the EIDF-Gateway and VM login prompts, use your password +### Accessing via a Terminal +If this is your first time connecting to EIDF, see the 'First Password Setting and Password Resets via the EIDF-Gateway' section below.
-## Accessing From MacOS/Linux - -OpenSSH is installed on Linux and MacOS usually by default, so you can access the gateway natively from the terminal.
-The '-J' flag is use to specify that we will access the second specified host by jumping through the first specified host like the example below. +1. Open either Powershell (the Windows Terminal) or a WSL Linux Terminal +1. Import the SSH Key you generated above: ```$ ssh-add [/path/to/sshkey]``` +1. This should return "Identity added [Path to SSH Key]" if successful. +1. Login by jumping through the gateway. ```bash -ssh -J [username]@jumphost [username]@target +ssh -J [username]@eidf-gateway.epcc.ed.ac.uk [username]@[vm_ip] ``` -To access EIDF Services: +## Accessing From MacOS/Linux +If this is your first time connecting to EIDF, see the 'First Password Setting and Password Resets via the EIDF-Gateway' section below.
+ +OpenSSH is installed on Linux and MacOS usually by default, so you can access the gateway natively from the terminal.
+Ensure you have created and added an ssh key as specified in the 'Generating and Adding an SSH Key' section above, then run the command below.
```bash ssh -J [username]@eidf-gateway.epcc.ed.ac.uk [username]@[vm_ip] ``` +The '-J' flag is use to specify that we will access the second specified host by jumping through the first specified host.
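
Files can be moved through the gateway in the same way. The command below is a sketch only: it assumes a reasonably recent OpenSSH client (roughly 8.0 or newer, where `scp` accepts the `-J` option), and the file name and destination path are placeholders.

```bash
# Copy a local file to your VM by jumping through the gateway (paths are examples)
scp -J [username]@eidf-gateway.epcc.ed.ac.uk local_file.txt [username]@[vm_ip]:/home/[username]/
```
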
+ -## Password Resets via the EIDF-Gateway +## First Password Setting and Password Resets via the EIDF-Gateway -You will have to connect to your VM via SSH before you can login with RDP as your initial password needs to be reset, which can only be done via SSH. You can reset your password through the SSH Gateway by connecting to it directly: +You will have to connect to your VM via SSH before you can login with RDP as your initial password needs to be reset, which can only be done via SSH. You can reset your password through the SSH Gateway by connecting to it directly. +
You will need to pass your accounts SSH key using either the '-i' flag or running 'ssh-add /path/to/key' first. ```bash ssh [username]@eidf-gateway.epcc.ed.ac.uk From 4f725a0c203ef7cdb7a4cabfbf372d9f07fd8dc2 Mon Sep 17 00:00:00 2001 From: awat31 Date: Mon, 13 Nov 2023 16:39:17 +0000 Subject: [PATCH 10/91] AW: Tidy SSH Page --- docs/access/ssh.md | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/docs/access/ssh.md b/docs/access/ssh.md index dd5587341..b6a1faa50 100644 --- a/docs/access/ssh.md +++ b/docs/access/ssh.md @@ -77,9 +77,8 @@ Windows will require the installation of OpenSSH-Server or MobaXTerm to use SSH. 1. If ‘OpenSSH Client’ is not under ‘Installed Features’, click the ‘Add a Feature’ button 1. Search ‘OpenSSH Client’ 1. Select the check box next to ‘OpenSSH Client’ and click ‘Install’ -1. Once this is installed, you can reach your VM by opening CMD and running:
```$ ssh -J [username]@eidf-gateway.epcc.ed.ac.uk [username]@[vm_ip]``` -### Accessing via a Terminal +### Accessing EIDF via a Terminal If this is your first time connecting to EIDF, see the 'First Password Setting and Password Resets via the EIDF-Gateway' section below.
1. Open either Powershell (the Windows Terminal) or a WSL Linux Terminal From ddd76a52805251016361944ec43191b5bca2d802 Mon Sep 17 00:00:00 2001 From: awat31 Date: Tue, 14 Nov 2023 08:53:32 +0000 Subject: [PATCH 11/91] AW Precheck Fixes --- docs/access/ssh.md | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/docs/access/ssh.md b/docs/access/ssh.md index b6a1faa50..cf8b79918 100644 --- a/docs/access/ssh.md +++ b/docs/access/ssh.md @@ -79,6 +79,7 @@ Windows will require the installation of OpenSSH-Server or MobaXTerm to use SSH. 1. Select the check box next to ‘OpenSSH Client’ and click ‘Install’ ### Accessing EIDF via a Terminal + If this is your first time connecting to EIDF, see the 'First Password Setting and Password Resets via the EIDF-Gateway' section below.
1. Open either Powershell (the Windows Terminal) or a WSL Linux Terminal @@ -90,17 +91,18 @@ Windows will require the installation of OpenSSH-Server or MobaXTerm to use SSH. ssh -J [username]@eidf-gateway.epcc.ed.ac.uk [username]@[vm_ip] ``` - ## Accessing From MacOS/Linux + If this is your first time connecting to EIDF, see the 'First Password Setting and Password Resets via the EIDF-Gateway' section below.
OpenSSH is installed on Linux and MacOS usually by default, so you can access the gateway natively from the terminal.
Ensure you have created and added an ssh key as specified in the 'Generating and Adding an SSH Key' section above, then run the command below.
+ ```bash ssh -J [username]@eidf-gateway.epcc.ed.ac.uk [username]@[vm_ip] ``` -The '-J' flag is use to specify that we will access the second specified host by jumping through the first specified host.
+The '-J' flag is used to specify that we will access the second specified host by jumping through the first specified host.
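
If you connect regularly, you can avoid typing the `-J` option each time by recording the jump in your OpenSSH client configuration. The snippet below is a sketch only: the `eidf-vm` alias is an arbitrary name, and `[vm_ip]` and `[username]` are placeholders for your own details, as elsewhere on this page.

```
# ~/.ssh/config (illustrative entry)
Host eidf-vm
    HostName [vm_ip]
    User [username]
    ProxyJump [username]@eidf-gateway.epcc.ed.ac.uk
```

With an entry like this in place, `ssh eidf-vm` connects through the gateway automatically.
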
## First Password Setting and Password Resets via the EIDF-Gateway From 4b6fa1dbff7a044ee10225e03cefe2c03bb61bdb Mon Sep 17 00:00:00 2001 From: Nick Johnson Date: Wed, 15 Nov 2023 13:18:42 +0000 Subject: [PATCH 12/91] Reflect changes to the login process for U2/CS-2 --- docs/services/cs2/run.md | 7 ++-- docs/services/ultra2/run.md | 76 ++++++++++++++++++++++++++++++++++++- 2 files changed, 77 insertions(+), 6 deletions(-) diff --git a/docs/services/cs2/run.md b/docs/services/cs2/run.md index 7102984d2..60e2ea37f 100644 --- a/docs/services/cs2/run.md +++ b/docs/services/cs2/run.md @@ -2,12 +2,11 @@ ## Introduction -The Cerebras CS-2 system is attached to the SDF-CS1 (Ultra2) system which serves as a host, provides access to files, the SLURM batch system etc. +The Cerebras CS-2 system is attached to the Ultra2 system which serves as a host, provides access to files, the SLURM batch system etc. -## Login +## Connecting to the CS-2 -To login to the host system, use the username and password you obtain from [SAFE](https://safe.epcc.ed.ac.uk), along with the SSH Key you registered when creating the account. -You can then login directly to the host via: `ssh @sdf-cs1.epcc.ed.ac.uk` +To gain access to the CS-2 you need to login to the host system, Ultra2 (also called SDF-CS1). See the [documentation for Ultra2](../ultra2/run.md#login). ## Running Jobs diff --git a/docs/services/ultra2/run.md b/docs/services/ultra2/run.md index d1a79e0db..600dc673c 100644 --- a/docs/services/ultra2/run.md +++ b/docs/services/ultra2/run.md @@ -10,8 +10,80 @@ The system is a HPE SuperDome Flex containing 576 individual cores in a SMT-1 ar ## Login -To login to the host system, use the username and password you obtain from [SAFE](https://www.safe.epcc.ed.ac.uk), along with the SSH Key you registered when creating the account. -You can then login directly to the host via: `ssh @sdf-cs1.epcc.ed.ac.uk` +Login is via SSH only via `ssh @sdf-cs1.epcc.ed.ac.uk`. See below for details on the credentials required to access the system. + +### Access credentials + +To access Ultra2, you need to use two credentials: your SSH key pair protected by a passphrase **and** a Time-based one-time password (TOTP). + +### SSH Key Pairs + +You will need to generate an SSH key pair protected by a passphrase to access Ultra2. + +Using a terminal (the command line), set up a key pair that contains your e-mail address and enter a passphrase you will use to unlock the key: + +```bash + $ ssh-keygen -t rsa -C "your@email.com" + ... + -bash-4.1$ ssh-keygen -t rsa -C "your@email.com" + Generating public/private rsa key pair. + Enter file in which to save the key (/Home/user/.ssh/id_rsa): [Enter] + Enter passphrase (empty for no passphrase): [Passphrase] + Enter same passphrase again: [Passphrase] + Your identification has been saved in /Home/user/.ssh/id_rsa. + Your public key has been saved in /Home/user/.ssh/id_rsa.pub. + The key fingerprint is: + 03:d4:c4:6d:58:0a:e2:4a:f8:73:9a:e8:e3:07:16:c8 your@email.com + The key's randomart image is: + +--[ RSA 2048]----+ + | . ...+o++++. | + | . . . =o.. | + |+ . . .......o o | + |oE . . | + |o = . S | + |. +.+ . | + |. oo | + |. . | + | .. | + +-----------------+ +``` + +(remember to replace "" with your e-mail address). + +### Upload public part of key pair to SAFE + +You should now upload the public part of your SSH key pair to the SAFE by following the instructions at: + +[Login to SAFE](https://safe.epcc.ed.ac.uk/). Then: + + 1. 
Go to the Menu *Login accounts* and select the Ultra2 account you want to add the SSH key to
+ 1. On the subsequent Login account details page click the *Add Credential* button
+ 1. Select *SSH public key* as the Credential Type and click *Next*
+ 1. Either copy and paste the public part of your SSH key into the *SSH Public key* box or use the button to select the public key file on your computer.
+ 1. Click *Add* to associate the public SSH key part with your account
+
+Once you have done this, your SSH key will be added to your Ultra2 account.
+
+### Time-based one-time password (TOTP)
+
+Remember, you will need to use both an SSH key and Time-based one-time password to log into Ultra2 so you will also need to [set up your TOTP](https://epcced.github.io/safe-docs/safe-for-users/#how-to-turn-on-mfa-on-your-machine-account) before you can log into Ultra2.
+
+!!! Note
+
+    When you **first** log into Ultra2, you will be prompted to change your initial password. This is a three-step process:
+
+    1. When prompted to enter your *password*: Enter the password which you [retrieve from SAFE](https://epcced.github.io/safe-docs/safe-for-users/#how-can-i-pick-up-my-password-for-the-service-machine)
+
+    2. When prompted to enter your new password: type in a new password
+
+    3. When prompted to re-enter the new password: re-enter the new password
+
+    Your password has now been changed
+ You will **not** use your password when logging on to Ultra2 after the initial logon. + +### SSH Login + +To login to the host system, you will need to use the SSH Key and TOTP token you registered when creating the account [SAFE](https://www.safe.epcc.ed.ac.uk), along with the SSH Key you registered when creating the account. For example, with the appropriate key loaded
`ssh @sdf-cs1.epcc.ed.ac.uk` will then prompt you, once per 24 hours, for your TOTP code. ## Software From 3f6e1c8fb11781ca86e7aafa6749862cd6323c58 Mon Sep 17 00:00:00 2001 From: Nick Johnson Date: Wed, 15 Nov 2023 15:06:34 +0000 Subject: [PATCH 13/91] Remove explicit mention of TOTP window. --- docs/services/ultra2/run.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/services/ultra2/run.md b/docs/services/ultra2/run.md index 600dc673c..18c1b5f98 100644 --- a/docs/services/ultra2/run.md +++ b/docs/services/ultra2/run.md @@ -83,7 +83,7 @@ Remember, you will need to use both an SSH key and Time-based one-time password ### SSH Login -To login to the host system, you will need to use the SSH Key and TOTP token you registered when creating the account [SAFE](https://www.safe.epcc.ed.ac.uk), along with the SSH Key you registered when creating the account. For example, with the appropriate key loaded
`ssh @sdf-cs1.epcc.ed.ac.uk` will then prompt you, once per 24 hours, for your TOTP code.
+To login to the host system, you will need to use the SSH Key and TOTP token you registered when creating the account in [SAFE](https://www.safe.epcc.ed.ac.uk). For example, with the appropriate key loaded
`ssh @sdf-cs1.epcc.ed.ac.uk` will then prompt you, roughly once per day, for your TOTP code. ## Software From 78e0528686aac40ba6e7dc7c63838d87079beaf5 Mon Sep 17 00:00:00 2001 From: Amy Krause Date: Thu, 16 Nov 2023 16:51:54 +0000 Subject: [PATCH 14/91] updates for new password setting form in portal --- docs/access/index.md | 4 +- docs/access/project.md | 4 +- docs/access/ssh.md | 62 +++++++++------------ docs/access/virtualmachines-vdi.md | 8 +-- docs/services/virtualmachines/docs.md | 10 +--- docs/services/virtualmachines/quickstart.md | 32 ++++++++--- 6 files changed, 62 insertions(+), 58 deletions(-) diff --git a/docs/access/index.md b/docs/access/index.md index 3e085e38d..2e9028e24 100644 --- a/docs/access/index.md +++ b/docs/access/index.md @@ -26,8 +26,8 @@ Users with the appropriate permissions can also [use `ssh` to login to Virtual D Includes access to the following services: -* [Cerebras CS-2](../services/cs2/) -* [Ultra2](../services/ultra2/) +* [Cerebras CS-2](../services/cs2/index.md) +* [Ultra2](../services/ultra2/index.md) To login to most command-line services with `ssh` you should use the username and password you obtained from SAFE when you applied for access, along with the SSH Key you diff --git a/docs/access/project.md b/docs/access/project.md index 6e2e7d0e6..87b3946f1 100644 --- a/docs/access/project.md +++ b/docs/access/project.md @@ -76,7 +76,7 @@ and you will be notified of the outcome of your application. ### Approved Project If your application was approved, refer to -[Data Science Virtual Desktops: Quickstart](../../services/virtualmachines/quickstart/) +[Data Science Virtual Desktops: Quickstart](../services/virtualmachines/quickstart.md) how to view your project and to -[Data Science Virtual Desktops: Managing VMs](../../services/virtualmachines/docs/) +[Data Science Virtual Desktops: Managing VMs](../services/virtualmachines/docs.md) how to manage a project and how to create virtual machines and user accounts. diff --git a/docs/access/ssh.md b/docs/access/ssh.md index cf8b79918..6ba772d82 100644 --- a/docs/access/ssh.md +++ b/docs/access/ssh.md @@ -37,8 +37,11 @@ If not, you'll need to generate an SSH-Key, to do this: ### Generate a new SSH Key 1. Open a new window of whatever terminal you will use to SSH to EIDF. -1. Generate a new SSH Key: ```$ ssh-keygen``` -1. Input the directory and filename of they key. It's recommended to make this something like 'eidf-gateway' so it's easier to identify later +1. Generate a new SSH Key: + ``` + ssh-keygen + ``` +1. It is fine to accept the default name and path for the key unless you manage a number of keys. 1. Press enter to finish generating the key ### Adding the new SSH Key to your account via the Portal @@ -49,9 +52,8 @@ If not, you'll need to generate an SSH-Key, to do this: 1. Select your username 1. Select the plus button under 'Credentials' 1. Select 'Choose File' to upload the PUBLIC (.pub) ssh key generated in the last step, or open the .pub file you just created and copy its contents into the text box. -1. Click 'Upload Credential'
It should look something like this: - -![eidf-portal-ssh](/eidf-docs/images/access/eidf-portal-ssh.png){: class="border-img"} +1. Click 'Upload Credential' - it should look something like this: + ![eidf-portal-ssh](../images/access/eidf-portal-ssh.png){: class="border-img"} #### Adding a new SSH Key via SAFE @@ -64,9 +66,24 @@ However, select your '[username]@EIDF' login account, not 'Archer2' as specified 1. From your local terminal, import the SSH Key you generated above: ```$ ssh-add [sshkey]``` 1. This should return "Identity added [Path to SSH Key]" if successful. You can then follow the steps below to access your VM. +## Accessing From MacOS/Linux + +!!! warning + If this is your first time connecting to EIDF using a new account, you have to set a password as described in [Set or change the password for a user account](../services/virtualmachines/quickstart.md#set-or-change-the-password-for-a-user-account). + +OpenSSH is installed on Linux and MacOS usually by default, so you can access the gateway natively from the terminal. + +Ensure you have created and added an ssh key as specified in the 'Generating and Adding an SSH Key' section above, then run the command below. + +```bash +ssh -J [username]@eidf-gateway.epcc.ed.ac.uk [username]@[vm_ip] +``` + +The `-J` flag is use to specify that we will access the second specified host by jumping through the first specified host. + ## Accessing from Windows -Windows will require the installation of OpenSSH-Server or MobaXTerm to use SSH. Putty can also be used but won’t be covered in this tutorial. +Windows will require the installation of OpenSSH-Server to use SSH. Putty or MobaXTerm can also be used but won’t be covered in this tutorial. ### Installing and using OpenSSH @@ -80,7 +97,8 @@ Windows will require the installation of OpenSSH-Server or MobaXTerm to use SSH. ### Accessing EIDF via a Terminal -If this is your first time connecting to EIDF, see the 'First Password Setting and Password Resets via the EIDF-Gateway' section below.
+!!! warning + If this is your first time connecting to EIDF using a new account, you have to set a password as described in [Set or change the password for a user account](../services/virtualmachines/quickstart.md#set-or-change-the-password-for-a-user-account). 1. Open either Powershell (the Windows Terminal) or a WSL Linux Terminal 1. Import the SSH Key you generated above: ```$ ssh-add [/path/to/sshkey]``` @@ -91,32 +109,6 @@ Windows will require the installation of OpenSSH-Server or MobaXTerm to use SSH. ssh -J [username]@eidf-gateway.epcc.ed.ac.uk [username]@[vm_ip] ``` -## Accessing From MacOS/Linux +## First Password Setting and Password Resets -If this is your first time connecting to EIDF, see the 'First Password Setting and Password Resets via the EIDF-Gateway' section below.
- -OpenSSH is installed on Linux and MacOS usually by default, so you can access the gateway natively from the terminal.
-Ensure you have created and added an ssh key as specified in the 'Generating and Adding an SSH Key' section above, then run the command below.
- -```bash -ssh -J [username]@eidf-gateway.epcc.ed.ac.uk [username]@[vm_ip] -``` - -The '-J' flag is use to specify that we will access the second specified host by jumping through the first specified host.
- -## First Password Setting and Password Resets via the EIDF-Gateway - -You will have to connect to your VM via SSH before you can login with RDP as your initial password needs to be reset, which can only be done via SSH. You can reset your password through the SSH Gateway by connecting to it directly. -
You will need to pass your accounts SSH key using either the '-i' flag or running 'ssh-add /path/to/key' first. - -```bash -ssh [username]@eidf-gateway.epcc.ed.ac.uk -``` - -Your first attempt to log in to your account using the SSH Gateway will prompt you for your initial password (provided in the portal) like a normal login. If this is successful you will choose a new password. You will be asked for your initial password again, followed by two entries of your new password. This will reset the password to your account for both the gateway and the VM. Once this reset has been completed, the session will disconnect and you can login via SSH again using the newly set password. - -You will not be able to directly connect to the gateway again, so to connect to your VM, jump through the SSH Gateway: - -```bash -ssh -J [username]@eidf-gateway.epcc.ed.ac.uk [username]@[vm_ip] -``` +Before logging in for the first time you have to reset the password using the web form in the EIDF Portal following the instructions in [Set or change the password for a user account](../services/virtualmachines/quickstart.md#set-or-change-the-password-for-a-user-account). \ No newline at end of file diff --git a/docs/access/virtualmachines-vdi.md b/docs/access/virtualmachines-vdi.md index 3624f4c8a..72b9b7d1d 100644 --- a/docs/access/virtualmachines-vdi.md +++ b/docs/access/virtualmachines-vdi.md @@ -22,7 +22,7 @@ After you have been authenticated through SAFE and logged into the EIDF VDI, if to you that have been associated with your user (typically in the case of research projects), you will be presented with the VDI home screen as shown below: - ![VDI-home-screen](/eidf-docs/images/access/vdi-home-screen.png){: class="border-img"} + ![VDI-home-screen](../images/access/vdi-home-screen.png){: class="border-img"} *VDI home page with list of available VM connections* !!! note "Adding connections" @@ -34,13 +34,13 @@ If you have only one connection associated with your VDI user account (typically automatically connected to the target VM's virtual desktop. Once you are connected to the VM, you will be asked for your username and password as shown below (if you are participating in a workshop, then you may not be asked for credentials) - ![VM-VDI-connection-login](/eidf-docs/images/access/vm-vdi-connection-login.png){: class="border-img"} + ![VM-VDI-connection-login](../images/access/vm-vdi-connection-login.png){: class="border-img"} *VM virtual desktop connection user account login screen* Once your credentials have been accepted, you will be connected to your VM's desktop environment. For instance, the screenshot below shows a resulting connection to a Xubuntu 20.04 VM with the Xfce desktop environment. - ![VM-VDI-connection](/eidf-docs/images/access/vm-vdi-connection.png){: class="border-img"} + ![VM-VDI-connection](../images/access/vm-vdi-connection.png){: class="border-img"} *VM virtual desktop* ## VDI Features for the Virtual Desktop @@ -77,7 +77,7 @@ You can use the standard keyboard shortcuts to copy text from your client PC or then again copy that text from the Guacamole menu clipboard into an application or CLI terminal on the VM's remote desktop. An example of using the copy and paste clipboard is shown in the screenshot below. 
- ![EIDF-VDI-Clipboard](/eidf-docs/images/access/vm-vdi-copy-paste.png){: class="border-img center"} + ![EIDF-VDI-Clipboard](../images/access/vm-vdi-copy-paste.png){: class="border-img center"} *The EIDF VDI Clipboard* ### Keyboard Language and Layout Settings diff --git a/docs/services/virtualmachines/docs.md b/docs/services/virtualmachines/docs.md index bea821603..3f8e3f3a7 100644 --- a/docs/services/virtualmachines/docs.md +++ b/docs/services/virtualmachines/docs.md @@ -72,8 +72,7 @@ however a user may have multiple accounts in a project, for example for differen 1. Select the project member from the 'Account owner' drop-down field 1. Click 'Create' -The new account is allocated a temporary password which the account owner can view -in their account details. +The user can now set the password for their new account on the account details page. ## Adding Access to the VM for a User @@ -107,12 +106,9 @@ Please contact the helpdesk if sudo permission management is required but is not ## First login -A new user account is allocated a temporary password which the user must reset before they -can log in for the first time. -The password reset will not work when logging in via RDP - -they must use a SSH connection, either in the VDI or via an SSH gateway. +A new user account must reset the password before they can log in for the first time. -The user can view the temporary password in their account details page. +The user can reset the password in their account details page. ## Updating an existing machine diff --git a/docs/services/virtualmachines/quickstart.md b/docs/services/virtualmachines/quickstart.md index 1f8aa6e81..07801ee37 100644 --- a/docs/services/virtualmachines/quickstart.md +++ b/docs/services/virtualmachines/quickstart.md @@ -29,24 +29,40 @@ Now you have to wait for your PI or project manager to accept your request to jo ## Accessing a VM -1. View your user accounts on the project page. +1. Select a project and view your user accounts on the project page. 1. Click on an account name to view details of the VMs that are you allowed to access - with this account, and look up the temporary password allocated to the account. + with this account, and to change the password for this account. + +1. Before you log in for the first time with a new user account, you must change your password as described + [below](../../services/virtualmachines/quickstart.md#set-or-change-the-password-for-your-user-account). 1. Follow the link to the Guacamole login or log in directly at [https://eidf-vdi.epcc.ed.ac.uk/vdi/](https://eidf-vdi.epcc.ed.ac.uk/vdi/). - Please see the [VDI](/eidf-docs/access/virtualmachines-vdi/#navigating-the-eidf-vdi) guide for more information. + Please see the [VDI](../../access/virtualmachines-vdi.md#navigating-the-eidf-vdi) guide for more information. -1. Choose the SSH connection to log in for the first time. You will be asked to reset the password. +1. You can also log in via the [EIDF Gateway Jump Host](https://epcced.github.io/eidf-docs/access/ssh/) + if this is available in your project. !!! warning - Do not use RDP to login for the first time as you have to reset your password. - Always use SSH to login to the VM for the first time. - This can be done either via the VDI or the EIDF-Gateway Jump Host as described [here.](https://epcced.github.io/eidf-docs/access/ssh/) + You must set a password for a new account before you log in for the first time. 
+ +## Set or change the password for a user account + +Follow these instructions to set a password for a new account before you log in for the first time. +If you have forgotten your password you may reset the password as described here. + +1. Select a project and click the account name in the project page to view the account details. + +1. In the user account detail page, press the button "Set Password" + and follow the instructions in the form. + +There may be a short delay while the change is implemented before the new password becomes usable. ## Further information [Managing VMs](./docs.md): Project management guide to creating, configuring and removing VMs and managing user accounts in the portal. -[Virtual Desktop Interface](/eidf-docs/access/virtualmachines-vdi/): Working with the VDI interface. +[Virtual Desktop Interface](../../access/virtualmachines-vdi.md): Working with the VDI interface. + +[EIDF Gateway](../../access/ssh.md): SSH access to VMs via the EIDF SSH Gateway jump host. \ No newline at end of file From 6f3b121b8c59cb109c1fc9dfcd485461044c33c9 Mon Sep 17 00:00:00 2001 From: Amy Krause Date: Thu, 16 Nov 2023 17:17:55 +0000 Subject: [PATCH 15/91] fix precommit issues --- docs/access/ssh.md | 6 ++++-- docs/services/gpuservice/faq.md | 4 ++-- docs/services/virtualmachines/quickstart.md | 2 +- 3 files changed, 7 insertions(+), 5 deletions(-) diff --git a/docs/access/ssh.md b/docs/access/ssh.md index 6ba772d82..122600288 100644 --- a/docs/access/ssh.md +++ b/docs/access/ssh.md @@ -38,9 +38,11 @@ If not, you'll need to generate an SSH-Key, to do this: 1. Open a new window of whatever terminal you will use to SSH to EIDF. 1. Generate a new SSH Key: - ``` + + ```bash ssh-keygen ``` + 1. It is fine to accept the default name and path for the key unless you manage a number of keys. 1. Press enter to finish generating the key @@ -111,4 +113,4 @@ ssh -J [username]@eidf-gateway.epcc.ed.ac.uk [username]@[vm_ip] ## First Password Setting and Password Resets -Before logging in for the first time you have to reset the password using the web form in the EIDF Portal following the instructions in [Set or change the password for a user account](../services/virtualmachines/quickstart.md#set-or-change-the-password-for-a-user-account). \ No newline at end of file +Before logging in for the first time you have to reset the password using the web form in the EIDF Portal following the instructions in [Set or change the password for a user account](../services/virtualmachines/quickstart.md#set-or-change-the-password-for-a-user-account). diff --git a/docs/services/gpuservice/faq.md b/docs/services/gpuservice/faq.md index 8169bfa27..7b37742a9 100644 --- a/docs/services/gpuservice/faq.md +++ b/docs/services/gpuservice/faq.md @@ -30,12 +30,12 @@ There may be an issue with the kubectl version that is being run. This can occur The current version verified to operate with the GPU Service is v1.24.10. kubectl and the Kubernetes API version can suffer from version skew if not with a defined number of releases. More information can be found on this under the [Kubernetes Version Skew Policy](https://kubernetes.io/releases/version-skew-policy/). - ### Insufficient Shared Memory Size My SHM is very small, and it causes "OSError: [Errno 28] No space left on device" when I train a model using multi-GPU. How to increase SHM size? The default size of SHM is only 64M. You can mount an empty dir to /dev/shm to solve this problem: + ```yaml spec: containers: @@ -48,4 +48,4 @@ The default size of SHM is only 64M. 
You can mount an empty dir to /dev/shm to s - name: dshm emptyDir: medium: Memory -``` \ No newline at end of file +``` diff --git a/docs/services/virtualmachines/quickstart.md b/docs/services/virtualmachines/quickstart.md index 07801ee37..cbaf78b3d 100644 --- a/docs/services/virtualmachines/quickstart.md +++ b/docs/services/virtualmachines/quickstart.md @@ -65,4 +65,4 @@ There may be a short delay while the change is implemented before the new passwo [Virtual Desktop Interface](../../access/virtualmachines-vdi.md): Working with the VDI interface. -[EIDF Gateway](../../access/ssh.md): SSH access to VMs via the EIDF SSH Gateway jump host. \ No newline at end of file +[EIDF Gateway](../../access/ssh.md): SSH access to VMs via the EIDF SSH Gateway jump host. From 0b330dacba2d35083cacc8525b189ca19fb6e290 Mon Sep 17 00:00:00 2001 From: agrant3 Date: Mon, 20 Nov 2023 12:18:17 +0000 Subject: [PATCH 16/91] AG: updates for draft for access and lack of quota --- docs/services/graphcore/index.md | 12 +++++++++--- .../graphcore/training/L1_getting_started.md | 2 +- 2 files changed, 10 insertions(+), 4 deletions(-) diff --git a/docs/services/graphcore/index.md b/docs/services/graphcore/index.md index 77cc1839d..df9e81d9b 100644 --- a/docs/services/graphcore/index.md +++ b/docs/services/graphcore/index.md @@ -13,15 +13,21 @@ For more details about the IPU architecture, see [documentation from Graphcore]( The smallest unit of compute resource that can be requested is a single IPU. -Similarly to the EIDF GPU Service, usage of the graphcore is managed using [Kubernetes](https://kubernetes.io). +Similarly to the EIDF GPU Service, usage of the Graphcore is managed using [Kubernetes](https://kubernetes.io). ## Service Access +Access to the Graphcore accelerator is provisioning through the EIDF GPU Service. + +Users should apply for access to Graphcore via the [EIDF GPU Service](../gpuservice/index.md). + ## Project Quotas +Currently there is no active quota mechanism on the Graphcore accelerator. IPUJobs should be actively using partitions on the Graphcore. + ## Graphcore Tutorial -The following tutorial teaches users how to submit tasks to the graphcore system. This tutorial assumes basic familiary with submitting jobs via Kubernetes. For a tutorial on using Kubernetes, see the [GPU service tutorial](../gpuservice/training/L1_getting_started.md). For more in-depth lessons about developing applications for graphcore, see [the general documentation](https://docs.graphcore.ai/en/latest/) and [guide for creating IPU jobs via Kubernetes](https://docs.graphcore.ai/projects/kubernetes-user-guide/en/latest/creating-ipujob.html). +The following tutorial teaches users how to submit tasks to the Graphcore system. This tutorial assumes basic familiary with submitting jobs via Kubernetes. For a tutorial on using Kubernetes, see the [GPU service tutorial](../gpuservice/training/L1_getting_started.md). For more in-depth lessons about developing applications for Graphcore, see [the general documentation](https://docs.graphcore.ai/en/latest/) and [guide for creating IPU jobs via Kubernetes](https://docs.graphcore.ai/projects/kubernetes-user-guide/en/latest/creating-ipujob.html). | Lesson | Objective | |-----------------------------------|-------------------------------------| @@ -34,4 +40,4 @@ The following tutorial teaches users how to submit tasks to the graphcore system - The [Graphcore documentation](https://docs.graphcore.ai/en/latest/) provides information about using the Graphcore system. 
-- The [Graphcore examples repository on github](https://github.com/graphcore/examples/tree/master) provides a catalogue of application examples that have been optimised to run on Graphcore IPUs for both training and inference. It also contains tutorials for using various frameworks. +- The [Graphcore examples repository on GitHub](https://github.com/graphcore/examples/tree/master) provides a catalogue of application examples that have been optimised to run on Graphcore IPUs for both training and inference. It also contains tutorials for using various frameworks. diff --git a/docs/services/graphcore/training/L1_getting_started.md b/docs/services/graphcore/training/L1_getting_started.md index e68c4b81c..b6a270e99 100644 --- a/docs/services/graphcore/training/L1_getting_started.md +++ b/docs/services/graphcore/training/L1_getting_started.md @@ -1,6 +1,6 @@ # Getting started with Graphcore IPU Jobs -This guide assumes basic familiarity with Kubernetes (K8s) and usage of `kubectl`. See [GPU service tutorial](../gpuservice/training/L1_getting_started.md) to get started. +This guide assumes basic familiarity with Kubernetes (K8s) and usage of `kubectl`. See [GPU service tutorial](../../gpuservice/training/L1_getting_started.md) to get started. ## Introduction From 4625dc2566b162bf41e55ceefa71ff2e5d23034c Mon Sep 17 00:00:00 2001 From: agrant3 Date: Mon, 27 Nov 2023 14:25:01 +0000 Subject: [PATCH 17/91] AG: added OMP Thread issue on Pytorch --- docs/services/gpuservice/faq.md | 27 +++++++++++++++++++++++++++ 1 file changed, 27 insertions(+) diff --git a/docs/services/gpuservice/faq.md b/docs/services/gpuservice/faq.md index 7b37742a9..1aede79fd 100644 --- a/docs/services/gpuservice/faq.md +++ b/docs/services/gpuservice/faq.md @@ -49,3 +49,30 @@ The default size of SHM is only 64M. You can mount an empty dir to /dev/shm to s emptyDir: medium: Memory ``` + +### Pytorch Slow Performance Issues + +Pytorch on Kubernetes may operate slower than expected - much slower than an equivalent VM setup. + +Pytorch defaults to auto-detecting the number of OMP Threads and it will report an incorrect number of potential threads compared to your requested CPU core count. This is a consequence in operating in a container environment, the CPU information is reported by standard libraries and tools will be the node level information rather than your container. + +To help correct this issue, the environment variable OMP_NUM_THREADS should be set in the job submission file to the number of cores requested or less. + +This has been tested using: + +- OMP_NUM_THREADS=1 +- OMP_NUM_THREADS=(number of requested cores). 
+ +Example fragment for a Bash command start: + +```yaml + containers: + - args: + - > + export OMP_NUM_THREADS=1; + python mypytorchprogram.py; + command: + - /bin/bash + - '-c' + - '--' +``` From 7cec38722410d09cdfaa89c68c1a82be9908884c Mon Sep 17 00:00:00 2001 From: agrant3 Date: Thu, 30 Nov 2023 13:53:44 +0000 Subject: [PATCH 18/91] AG: update to job definitions to avoid name clashes and pod limits --- .../graphcore/training/L1_getting_started.md | 66 ++++++++++--------- .../graphcore/training/L2_multiple_IPU.md | 14 +++- .../graphcore/training/L3_profiling.md | 19 +++++- .../graphcore/training/L4_other_frameworks.md | 57 +++++++++++++++- 4 files changed, 118 insertions(+), 38 deletions(-) diff --git a/docs/services/graphcore/training/L1_getting_started.md b/docs/services/graphcore/training/L1_getting_started.md index b6a270e99..e80e453ae 100644 --- a/docs/services/graphcore/training/L1_getting_started.md +++ b/docs/services/graphcore/training/L1_getting_started.md @@ -20,37 +20,41 @@ To get started: apiVersion: graphcore.ai/v1alpha1 kind: IPUJob metadata: - name: mnist-training + generateName: mnist-training- spec: - # jobInstances defines the number of job instances. - # More than 1 job instance is usually useful for inference jobs only. - jobInstances: 1 - # ipusPerJobInstance refers to the number of IPUs required per job instance. - # A separate IPU partition of this size will be created by the IPU Operator - # for each job instance. - ipusPerJobInstance: "1" - workers: + # jobInstances defines the number of job instances. + # More than 1 job instance is usually useful for inference jobs only. + jobInstances: 1 + # ipusPerJobInstance refers to the number of IPUs required per job instance. + # A separate IPU partition of this size will be created by the IPU Operator + # for each job instance. + ipusPerJobInstance: "1" + workers: template: - spec: + spec: containers: - name: mnist-training - image: graphcore/pytorch:3.3.0 - command: [/bin/bash, -c, --] - args: + image: graphcore/pytorch:3.3.0 + command: [/bin/bash, -c, --] + args: - | - cd; - mkdir build; - cd build; - git clone https://github.com/graphcore/examples.git; - cd examples/tutorials/simple_applications/pytorch/mnist; - python -m pip install -r requirements.txt; - python mnist_poptorch_code_only.py --epochs 1 - securityContext: + cd; + mkdir build; + cd build; + git clone https://github.com/graphcore/examples.git; + cd examples/tutorials/simple_applications/pytorch/mnist; + python -m pip install -r requirements.txt; + python mnist_poptorch_code_only.py --epochs 1 + resources: + limits: + cpu: 32 + memory: 200Gi + securityContext: capabilities: - add: - - IPC_LOCK - volumeMounts: - - mountPath: /dev/shm + add: + - IPC_LOCK + volumeMounts: + - mountPath: /dev/shm name: devshm restartPolicy: Never hostIPC: true @@ -58,23 +62,23 @@ To get started: - emptyDir: medium: Memory sizeLimit: 10Gi - name: devshm + name: devshm ``` 1. to submit the job - run `kubectl create -f mnist-training-ipujob.yaml`, which will give the following output: ``` bash - ipujob.graphcore.ai/mnist-training created + ipujob.graphcore.ai/mnist-training- created ``` 1. to monitor progress of the job - run `kubectl get pods`, which will give the following output ``` bash NAME READY STATUS RESTARTS AGE - mnist-training-worker-0 0/1 Completed 0 2m56s + mnist-training--worker-0 0/1 Completed 0 2m56s ``` -1. to read the result - run `kubectl logs mnist-training-worker-0`, which will give the following output (or similar) +1. 
to read the result - run `kubectl logs mnist-training--worker-0`, which will give the following output (or similar) ``` bash ... @@ -93,9 +97,9 @@ NAME STATUS CURRENT DESIRED LASTMESSAGE AGE mnist-training Completed 0 1 All instances done 10m ``` -To delete the `IPUjob`, run `kubectl delete ipujobs `, e.g. `kubectl delete ipujobs mnist-training`. This will also delete the associated worker pod `mnist-training-worker-0`. +To delete the `IPUjob`, run `kubectl delete ipujobs `, e.g. `kubectl delete ipujobs mnist-training-`. This will also delete the associated worker pod `mnist-training--worker-0`. -Note: simply deleting the pod via `kubectl delete pods mnist-training-worker-0` does not delete the IPU job, which will need to be deleted separately. +Note: simply deleting the pod via `kubectl delete pods mnist-training--worker-0` does not delete the IPU job, which will need to be deleted separately. Note: you can list all pods via `kubectl get all` or `kubectl get pods`, but they do not show the ipujobs. These can be obtained using `kubectl get ipujobs`. diff --git a/docs/services/graphcore/training/L2_multiple_IPU.md b/docs/services/graphcore/training/L2_multiple_IPU.md index 582a01d65..59de4ff3d 100644 --- a/docs/services/graphcore/training/L2_multiple_IPU.md +++ b/docs/services/graphcore/training/L2_multiple_IPU.md @@ -14,7 +14,7 @@ To get started, save and create an IPUJob with the following `.yaml` file: apiVersion: graphcore.ai/v1alpha1 kind: IPUJob metadata: - name: bert-training-multi-ipu + generateName: bert-training-multi-ipu- spec: jobInstances: 1 ipusPerJobInstance: "4" @@ -37,6 +37,10 @@ spec: DEBIAN_FRONTEND=noninteractive TZ='Europe/London' apt install $(< required_apt_packages.txt) -y ; pip3 install -r requirements.txt ; python3 run_pretraining.py --dataset generated --config pretrain_base_128_pod4 --training-steps 1 + resources: + limits: + cpu: 32 + memory: 200Gi securityContext: capabilities: add: @@ -53,7 +57,7 @@ spec: name: devshm ``` -Running the above IPUJob and querying the log via `kubectl logs pod/bert-training-multi-ipu-worker-0` should give: +Running the above IPUJob and querying the log via `kubectl logs pod/bert-training-multi-ipu--worker-0` should give: ``` bash ... 
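Because `generateName` adds a random suffix to every submitted job, the exact worker pod name has to be looked up before its log can be queried. A minimal sketch of that lookup, assuming `kubectl` is already configured with the project kubeconfig (the suffix in the final command is purely illustrative):

```bash
# List IPUJobs and worker pods to find the generated names
kubectl get ipujobs
kubectl get pods

# Stream the training output from the generated worker pod
# (replace 'abc12' with the suffix reported by 'kubectl get pods')
kubectl logs -f bert-training-multi-ipu-abc12-worker-0
```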
@@ -162,7 +166,7 @@ In this case, [Poprun](https://docs.graphcore.ai/projects/poprun-user-guide/en/l apiVersion: graphcore.ai/v1alpha1 kind: IPUJob metadata: - name: bert-poprun-64ipus + generateName: bert-poprun-64ipus- spec: jobInstances: 1 modelReplicasPerWorker: "16" @@ -196,6 +200,10 @@ spec: python3 run_pretraining.py \ --config pretrain_large_128_POD64 \ --dataset generated --training-steps 1 + resources: + limits: + cpu: 32 + memory: 200Gi securityContext: capabilities: add: diff --git a/docs/services/graphcore/training/L3_profiling.md b/docs/services/graphcore/training/L3_profiling.md index e999e9475..63ee20c9a 100644 --- a/docs/services/graphcore/training/L3_profiling.md +++ b/docs/services/graphcore/training/L3_profiling.md @@ -18,7 +18,7 @@ Save and run `kubectl create -f ` on the following: apiVersion: graphcore.ai/v1alpha1 kind: IPUJob metadata: - name: mnist-training-profiling + generateName: mnist-training-profiling- spec: jobInstances: 1 ipusPerJobInstance: "1" @@ -41,7 +41,24 @@ spec: python mnist_poptorch_code_only.py --epochs 1; echo 'RUNNING ls ./training'; ls training + resources: + limits: + cpu: 32 + memory: 200Gi + securityContext: + capabilities: + add: + - IPC_LOCK + volumeMounts: + - mountPath: /dev/shm + name: devshm restartPolicy: Never + hostIPC: true + volumes: + - emptyDir: + medium: Memory + sizeLimit: 10Gi + name: devshm ``` After completion, using `kubectl logs `, we can see the following result diff --git a/docs/services/graphcore/training/L4_other_frameworks.md b/docs/services/graphcore/training/L4_other_frameworks.md index b56e96652..b1a26ffa9 100644 --- a/docs/services/graphcore/training/L4_other_frameworks.md +++ b/docs/services/graphcore/training/L4_other_frameworks.md @@ -35,7 +35,7 @@ For a quick example, we will run an example script from ` should show the results similar to the following @@ -97,7 +114,7 @@ For a quick example, we will run an example script from ` should show the results similar to the following @@ -166,7 +200,7 @@ For a quick example, we will run an example from ` should show the results similar to the following From 9cccd89cf91a6373584a94ae5e9ffcb63f615877 Mon Sep 17 00:00:00 2001 From: agrant3 Date: Thu, 30 Nov 2023 14:29:58 +0000 Subject: [PATCH 19/91] AG: indent fix for initial example jobs --- docs/services/gpuservice/training/L1_getting_started.md | 2 +- docs/services/graphcore/training/L1_getting_started.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/services/gpuservice/training/L1_getting_started.md b/docs/services/gpuservice/training/L1_getting_started.md index 49fe870ef..eef9015c6 100644 --- a/docs/services/gpuservice/training/L1_getting_started.md +++ b/docs/services/gpuservice/training/L1_getting_started.md @@ -67,7 +67,7 @@ Finally, it optional to define GPU resources but only the `limits` tag is used t apiVersion: v1 kind: Pod metadata: -generateName: first-pod- + generateName: first-pod- spec: restartPolicy: OnFailure containers: diff --git a/docs/services/graphcore/training/L1_getting_started.md b/docs/services/graphcore/training/L1_getting_started.md index e80e453ae..7778c465f 100644 --- a/docs/services/graphcore/training/L1_getting_started.md +++ b/docs/services/graphcore/training/L1_getting_started.md @@ -20,7 +20,7 @@ To get started: apiVersion: graphcore.ai/v1alpha1 kind: IPUJob metadata: - generateName: mnist-training- + generateName: mnist-training- spec: # jobInstances defines the number of job instances. # More than 1 job instance is usually useful for inference jobs only. 
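Indentation slips like the ones corrected in these examples typically only show up once the manifest is submitted. A quick client-side dry run can catch malformed YAML earlier; the sketch below assumes `kubectl` is configured for the project and that the manifest was saved as `mnist-training-ipujob.yaml` as in the tutorial:

```bash
# Parse the manifest locally without creating anything on the cluster;
# kubectl reports the offending line if the YAML does not parse
kubectl create --dry-run=client -f mnist-training-ipujob.yaml
```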
From 1f529121b693fa04de5d242c3386445a9f966f35 Mon Sep 17 00:00:00 2001 From: agrant3 Date: Thu, 30 Nov 2023 16:02:07 +0000 Subject: [PATCH 20/91] AG: fix to indent on volume name and correction on input file --- docs/services/gpuservice/faq.md | 4 +- .../graphcore/training/L1_getting_started.md | 94 +++++++++---------- 2 files changed, 49 insertions(+), 49 deletions(-) diff --git a/docs/services/gpuservice/faq.md b/docs/services/gpuservice/faq.md index 1aede79fd..e91502968 100644 --- a/docs/services/gpuservice/faq.md +++ b/docs/services/gpuservice/faq.md @@ -42,8 +42,8 @@ The default size of SHM is only 64M. You can mount an empty dir to /dev/shm to s - name: [NAME] image: [IMAGE] volumeMounts: - - mountPath: /dev/shm - name: dshm + - mountPath: /dev/shm + name: dshm volumes: - name: dshm emptyDir: diff --git a/docs/services/graphcore/training/L1_getting_started.md b/docs/services/graphcore/training/L1_getting_started.md index 7778c465f..e5cbfcf8d 100644 --- a/docs/services/graphcore/training/L1_getting_started.md +++ b/docs/services/graphcore/training/L1_getting_started.md @@ -16,54 +16,54 @@ To get started: 1. to specify the job - create the file `mnist-training-ipujob.yaml`, then copy and save the following content into the file: - ``` yaml - apiVersion: graphcore.ai/v1alpha1 - kind: IPUJob - metadata: - generateName: mnist-training- - spec: - # jobInstances defines the number of job instances. - # More than 1 job instance is usually useful for inference jobs only. - jobInstances: 1 - # ipusPerJobInstance refers to the number of IPUs required per job instance. - # A separate IPU partition of this size will be created by the IPU Operator - # for each job instance. - ipusPerJobInstance: "1" - workers: - template: - spec: - containers: - - name: mnist-training - image: graphcore/pytorch:3.3.0 - command: [/bin/bash, -c, --] - args: - - | - cd; - mkdir build; - cd build; - git clone https://github.com/graphcore/examples.git; - cd examples/tutorials/simple_applications/pytorch/mnist; - python -m pip install -r requirements.txt; - python mnist_poptorch_code_only.py --epochs 1 - resources: - limits: - cpu: 32 - memory: 200Gi - securityContext: - capabilities: - add: - - IPC_LOCK - volumeMounts: - - mountPath: /dev/shm - name: devshm - restartPolicy: Never - hostIPC: true - volumes: - - emptyDir: - medium: Memory - sizeLimit: 10Gi +``` yaml +apiVersion: graphcore.ai/v1alpha1 +kind: IPUJob +metadata: + generateName: mnist-training- +spec: + # jobInstances defines the number of job instances. + # More than 1 job instance is usually useful for inference jobs only. + jobInstances: 1 + # ipusPerJobInstance refers to the number of IPUs required per job instance. + # A separate IPU partition of this size will be created by the IPU Operator + # for each job instance. + ipusPerJobInstance: "1" + workers: + template: + spec: + containers: + - name: mnist-training + image: graphcore/pytorch:3.3.0 + command: [/bin/bash, -c, --] + args: + - | + cd; + mkdir build; + cd build; + git clone https://github.com/graphcore/examples.git; + cd examples/tutorials/simple_applications/pytorch/mnist; + python -m pip install -r requirements.txt; + python mnist_poptorch_code_only.py --epochs 1 + resources: + limits: + cpu: 32 + memory: 200Gi + securityContext: + capabilities: + add: + - IPC_LOCK + volumeMounts: + - mountPath: /dev/shm name: devshm - ``` + restartPolicy: Never + hostIPC: true + volumes: + - emptyDir: + medium: Memory + sizeLimit: 10Gi + name: devshm +``` 1. 
to submit the job - run `kubectl create -f mnist-training-ipujob.yaml`, which will give the following output: From a8499afb40a556d955d89191e54855ff1b0cc1ac Mon Sep 17 00:00:00 2001 From: Nick Johnson Date: Mon, 11 Dec 2023 16:27:50 +0000 Subject: [PATCH 21/91] Update to reflect change to WSC --- docs/services/cs2/run.md | 56 ++++++++++++++++++++++++---------------- 1 file changed, 34 insertions(+), 22 deletions(-) diff --git a/docs/services/cs2/run.md b/docs/services/cs2/run.md index 60e2ea37f..205cb7519 100644 --- a/docs/services/cs2/run.md +++ b/docs/services/cs2/run.md @@ -2,24 +2,19 @@ ## Introduction -The Cerebras CS-2 system is attached to the Ultra2 system which serves as a host, provides access to files, the SLURM batch system etc. +The Cerebras CS-2 Wafer-scale cluster (WSC) uses the Ultra2 system which serves as a host, provides access to files, the SLURM batch system etc. -## Connecting to the CS-2 +## Connecting to the cluster -To gain access to the CS-2 you need to login to the host system, Ultra2 (also called SDF-CS1). See the [documentation for Ultra2](../ultra2/run.md#login). +To gain access to the CS-2 WSC you need to login to the host system, Ultra2 (also called SDF-CS1). See the [documentation for Ultra2](../ultra2/run.md#login). ## Running Jobs -All jobs must be run via SLURM to avoid inconveniencing other users of the system. The `csrun_cpu` and `csrun_wse` scripts themselves contain calls to `srun` to work with the SLURM system, so note the omission of `srun` in the below examples. -Users can either copy these files from `/home/y26/shared/bin` to their own home directory should they wish, or use the centrally supplied version. In either case, ensure they are in your `PATH` before execution, eg: +All jobs must be run via SLURM to avoid inconveniencing other users of the system. An example job is shown below. -```bash -export PATH=$PATH:/home/y26/shared/bin -``` +### SLURM example -### Run on the host - -Jobs can be run on the host system (eg simulations, test scripts) using the `csrun_cpu` wrapper. Here is the example from the Cerebras documentation on PyTorch. Note that this assumes csrun_cpu is in your path. +This is based on the sample job from the Cerebras documentation [Cerebras documentation - Execute your job](https://docs.cerebras.net/en/latest/wsc/getting-started/cs-appliance.html#execute-your-job) ```bash #!/bin/bash @@ -27,23 +22,40 @@ Jobs can be run on the host system (eg simulations, test scripts) using the `csr #SBATCH --cpus-per-task=2 # Request 2 cores #SBATCH --output=example_%j.log # Standard output and error log #SBATCH --time=01:00:00 # Set time limit for this job to 1 hour +#SBATCH --gres=cs:1 # Request CS-2 system -csrun_cpu python-pt run.py --mode train --compile_only --params configs/ +source venv_cerebras_pt/bin/activate +srun python run.py \ + CSX \ + --params params.yaml \ + --num_csx=1 \ + --model_dir model_dir \ + --mode {train,eval,eval_all,train_and_eval} \ + --mount_dirs {paths to modelzoo and to data} \ + --python_paths {paths to modelzoo and other python code if used} ``` -### Run on the CS-2 +## Creating an environment -The following will run the above PyTorch example on the CS-2 - note the `--cs_ip` argument with port number passed in via the command line, and the inclusion of the `--gres` option to request use of the CS-2 via SLURM. +To run a job on the cluster, you must create a Python virtual environment (venv) and install the dependencies. 
The Cerebras documentation contains generic instructions to do this [Cerebras setup environment docs](https://docs.cerebras.net/en/latest/wsc/getting-started/setup-environment.html) however our host system is slightly different so we recommend the following: + +1. Create the venv ```bash -#!/bin/bash -#SBATCH --job-name=Example # Job name -#SBATCH --tasks-per-node=8 # There is only one node on SDF-CS1 -#SBATCH --cpus-per-task=16 # Each cpu is a core -#SBATCH --gres=cs:1 # Request CS-2 system -#SBATCH --output=example_%j.log # Standard output and error log -#SBATCH --time=01:00:00 # Set time limit for this job to 1 hour +/opt/python3.8/bin/python3.8 -m venv venv_cerebras_pt +``` + +1. Install the dependencies +```bash +source venv_cerebras_pt/bin/activate +pip install --upgrade pip +pip install cerebras_pytorch==2.0.2 +``` -csrun_wse python-pt run.py --mode train --cs_ip 172.24.102.121:9000 --params configs/ +1. Validate the setup + +```bash +source venv_cerebras_pt/bin/activate +cerebras_install_check ``` From 81c7fc80bcd4a9f0bd5ddc3eae63d23a8051479e Mon Sep 17 00:00:00 2001 From: Nick Johnson Date: Mon, 11 Dec 2023 17:07:00 +0000 Subject: [PATCH 22/91] Remove anomalous srun. --- docs/services/cs2/run.md | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/docs/services/cs2/run.md b/docs/services/cs2/run.md index 205cb7519..f7c1c7f36 100644 --- a/docs/services/cs2/run.md +++ b/docs/services/cs2/run.md @@ -25,14 +25,14 @@ This is based on the sample job from the Cerebras documentation [Cerebras docume #SBATCH --gres=cs:1 # Request CS-2 system source venv_cerebras_pt/bin/activate -srun python run.py \ - CSX \ - --params params.yaml \ - --num_csx=1 \ - --model_dir model_dir \ - --mode {train,eval,eval_all,train_and_eval} \ - --mount_dirs {paths to modelzoo and to data} \ - --python_paths {paths to modelzoo and other python code if used} +python run.py \ + CSX \ + --params params.yaml \ + --num_csx=1 \ + --model_dir model_dir \ + --mode {train,eval,eval_all,train_and_eval} \ + --mount_dirs {paths to modelzoo and to data} \ + --python_paths {paths to modelzoo and other python code if used} ``` ## Creating an environment From 51c84c21f6653f5c6b982086954cec0de808c954 Mon Sep 17 00:00:00 2001 From: awat31 Date: Fri, 15 Dec 2023 13:50:49 +0000 Subject: [PATCH 23/91] Add troubleshooting step to remove modification check --- docs/services/cs2/run.md | 47 ++++++++++++++++++++++++++++++++++++++++ requirements.txt | 1 - requirements2.txt | 30 +++++++++++++++++++++++++ 3 files changed, 77 insertions(+), 1 deletion(-) create mode 100644 requirements2.txt diff --git a/docs/services/cs2/run.md b/docs/services/cs2/run.md index f7c1c7f36..8630924d6 100644 --- a/docs/services/cs2/run.md +++ b/docs/services/cs2/run.md @@ -59,3 +59,50 @@ pip install cerebras_pytorch==2.0.2 source venv_cerebras_pt/bin/activate cerebras_install_check ``` + + +## Troubleshooting + +### "Failed to transfer X out of 1943 weight tensors" +Sometimes jobs receive an error during the 'Transferring weights to server' like below: +``` +2023-12-14 16:00:19,066 ERROR: Failed to transfer 5 out of 1943 weight tensors. Raising the first error encountered. +2023-12-14 16:00:19,118 ERROR: Initiating shutdown sequence due to error: Attempting to materialize deferred tensor with key “state.optimizer.state.214.beta1_power” from file model_dir/cerebras_logs/device_data_jxsi5hub/initial_state.hdf5, but the file has since been modified. The loaded tensor value may be different from originally loaded tensor. 
Please refrain from modifying the file while the run is in progress. +``` + +If this occurs, follow the below steps to fix it: + +1. From within your python venv, edit the /lib64/python3.8/site-packages/cerebras_pytorch/storage.py file +```bash +vi /lib64/python3.8/site-packages/cerebras_pytorch/storage.py +``` + +1. Navigate to line 672 +```bash +:672 +``` +The section should look like this: +``` +if modified_time > self._last_modified: + raise RuntimeError( + f"Attempting to materialize deferred tensor with key " + f"\"{self._key}\" from file {self._filepath}, but the file has " + f"since been modified. The loaded tensor value may be " + f"different from originally loaded tensor. Please refrain " + f"from modifying the file while the run is in progress." + ) +``` + +1. Comment out the whole section +``` + #if modified_time > self._last_modified: + # raise RuntimeError( + # f"Attempting to materialize deferred tensor with key " + # f"\"{self._key}\" from file {self._filepath}, but the file has " + # f"since been modified. The loaded tensor value may be " + # f"different from originally loaded tensor. Please refrain " + # f"from modifying the file while the run is in progress." + # ) +``` + +1. Save the file \ No newline at end of file diff --git a/requirements.txt b/requirements.txt index a1add227b..420938bf8 100644 --- a/requirements.txt +++ b/requirements.txt @@ -26,5 +26,4 @@ pyyaml-env-tag==0.1 six==1.16.0 toml==0.10.2 virtualenv==20.6.0 -watchdog==2.1.3 zipp==3.5.0 diff --git a/requirements2.txt b/requirements2.txt new file mode 100644 index 000000000..a1add227b --- /dev/null +++ b/requirements2.txt @@ -0,0 +1,30 @@ +backports.entry-points-selectable==1.1.0 +cfgv==3.3.0 +click==8.0.1 +distlib==0.3.2 +filelock==3.0.12 +ghp-import==2.0.1 +identify==2.2.11 +importlib-metadata==4.6.1 +Jinja2==3.0.1 +Markdown==3.3.4 +MarkupSafe==2.0.1 +mergedeep==1.3.4 +mkdocs==1.2.1 +mkdocs-material==7.1.10 +mkdocs-material-extensions==1.0.1 +nodeenv==1.6.0 +packaging==21.0 +platformdirs==2.0.2 +pre-commit==2.13.0 +Pygments==2.9.0 +pymdown-extensions==8.2 +pyparsing==2.4.7 +python-dateutil==2.8.1 +PyYAML==5.4.1 +pyyaml-env-tag==0.1 +six==1.16.0 +toml==0.10.2 +virtualenv==20.6.0 +watchdog==2.1.3 +zipp==3.5.0 From 2774c213065c344d17d4dec4ad0ceb7fbb0f6419 Mon Sep 17 00:00:00 2001 From: awat31 Date: Fri, 15 Dec 2023 13:58:20 +0000 Subject: [PATCH 24/91] Adding re-run job step --- docs/services/cs2/run.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/docs/services/cs2/run.md b/docs/services/cs2/run.md index 8630924d6..3f5e3fc5c 100644 --- a/docs/services/cs2/run.md +++ b/docs/services/cs2/run.md @@ -105,4 +105,6 @@ if modified_time > self._last_modified: # ) ``` -1. Save the file \ No newline at end of file +1. Save the file + +1. 
Re-run the job \ No newline at end of file From 9ff4a7673bca3c4ebd53d34c5a7b002c44fab876 Mon Sep 17 00:00:00 2001 From: awat31 Date: Fri, 15 Dec 2023 14:00:05 +0000 Subject: [PATCH 25/91] Fixing requirements file --- requirements.txt | 1 + requirements2.txt | 30 ------------------------------ 2 files changed, 1 insertion(+), 30 deletions(-) delete mode 100644 requirements2.txt diff --git a/requirements.txt b/requirements.txt index 420938bf8..a1add227b 100644 --- a/requirements.txt +++ b/requirements.txt @@ -26,4 +26,5 @@ pyyaml-env-tag==0.1 six==1.16.0 toml==0.10.2 virtualenv==20.6.0 +watchdog==2.1.3 zipp==3.5.0 diff --git a/requirements2.txt b/requirements2.txt deleted file mode 100644 index a1add227b..000000000 --- a/requirements2.txt +++ /dev/null @@ -1,30 +0,0 @@ -backports.entry-points-selectable==1.1.0 -cfgv==3.3.0 -click==8.0.1 -distlib==0.3.2 -filelock==3.0.12 -ghp-import==2.0.1 -identify==2.2.11 -importlib-metadata==4.6.1 -Jinja2==3.0.1 -Markdown==3.3.4 -MarkupSafe==2.0.1 -mergedeep==1.3.4 -mkdocs==1.2.1 -mkdocs-material==7.1.10 -mkdocs-material-extensions==1.0.1 -nodeenv==1.6.0 -packaging==21.0 -platformdirs==2.0.2 -pre-commit==2.13.0 -Pygments==2.9.0 -pymdown-extensions==8.2 -pyparsing==2.4.7 -python-dateutil==2.8.1 -PyYAML==5.4.1 -pyyaml-env-tag==0.1 -six==1.16.0 -toml==0.10.2 -virtualenv==20.6.0 -watchdog==2.1.3 -zipp==3.5.0 From 599a9956fb00ba5bd98972c0243d63bc66b093a4 Mon Sep 17 00:00:00 2001 From: awat31 Date: Fri, 15 Dec 2023 14:07:39 +0000 Subject: [PATCH 26/91] Rewording of CS2 Instructions --- docs/services/cs2/run.md | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/docs/services/cs2/run.md b/docs/services/cs2/run.md index 3f5e3fc5c..2e3c385db 100644 --- a/docs/services/cs2/run.md +++ b/docs/services/cs2/run.md @@ -34,6 +34,7 @@ python run.py \ --mount_dirs {paths to modelzoo and to data} \ --python_paths {paths to modelzoo and other python code if used} ``` +See the 'Troubleshooting' section below for known issues. ## Creating an environment @@ -63,14 +64,14 @@ cerebras_install_check ## Troubleshooting -### "Failed to transfer X out of 1943 weight tensors" +### "Failed to transfer X out of 1943 weight tensors with modelzoo" Sometimes jobs receive an error during the 'Transferring weights to server' like below: ``` 2023-12-14 16:00:19,066 ERROR: Failed to transfer 5 out of 1943 weight tensors. Raising the first error encountered. 2023-12-14 16:00:19,118 ERROR: Initiating shutdown sequence due to error: Attempting to materialize deferred tensor with key “state.optimizer.state.214.beta1_power” from file model_dir/cerebras_logs/device_data_jxsi5hub/initial_state.hdf5, but the file has since been modified. The loaded tensor value may be different from originally loaded tensor. Please refrain from modifying the file while the run is in progress. ``` -If this occurs, follow the below steps to fix it: +Cerebras are aware of this issue and are working on a fix, however in the mean time follow the below workaround: 1. From within your python venv, edit the /lib64/python3.8/site-packages/cerebras_pytorch/storage.py file ```bash @@ -107,4 +108,4 @@ if modified_time > self._last_modified: 1. Save the file -1. Re-run the job \ No newline at end of file +1. 
Re-run the job From 19287308432edb154bf2e74720faab646e9dbf71 Mon Sep 17 00:00:00 2001 From: Justs Zarins Date: Mon, 18 Dec 2023 16:37:41 +0000 Subject: [PATCH 27/91] Update invocation of python3.8 on SDF --- docs/services/cs2/run.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/services/cs2/run.md b/docs/services/cs2/run.md index 2e3c385db..79461e616 100644 --- a/docs/services/cs2/run.md +++ b/docs/services/cs2/run.md @@ -43,7 +43,7 @@ To run a job on the cluster, you must create a Python virtual environment (venv) 1. Create the venv ```bash -/opt/python3.8/bin/python3.8 -m venv venv_cerebras_pt +python3.8 -m venv venv_cerebras_pt ``` 1. Install the dependencies From 2b8e05e202d9664d807c6a71fbe4448dea8ecf33 Mon Sep 17 00:00:00 2001 From: awat31 Date: Tue, 19 Dec 2023 11:23:13 +0000 Subject: [PATCH 28/91] Updated docs for eidf-gateway when MFA is enabled --- docs/access/ssh.md | 19 ++++++++++++++++++- 1 file changed, 18 insertions(+), 1 deletion(-) diff --git a/docs/access/ssh.md b/docs/access/ssh.md index 122600288..efcdb7649 100644 --- a/docs/access/ssh.md +++ b/docs/access/ssh.md @@ -16,6 +16,13 @@ The EIDF-Gateway is an SSH gateway suitable for accessing EIDF Services via a console or terminal. As the gateway cannot be 'landed' on, a user can only pass through it and so the destination (the VM IP) has to be known for the service to work. Users connect to their VM through the jump host using their given accounts. +You will require three things to use the gateway: + +1. A user within a project allowed to access the gateway and a password set. +1. An SSH-key linked to this account, used to authenticate against the gateway +1. Have MFA setup with your project account via SAFE. + +Steps to meet all of these requirements are explained below. ## Generating and Adding an SSH Key @@ -63,7 +70,13 @@ This should not be necessary for most users, so only follow this process if you If you need to add an SSH Key directly to SAFE, you can follow this [guide.](https://epcced.github.io/safe-docs/safe-for-users/#how-to-add-an-ssh-public-key-to-your-account) However, select your '[username]@EIDF' login account, not 'Archer2' as specified in that guide. -### Using the SSH-Key to access EIDF - Windows and Linux +## Enabling MFA via SAFE + +A multi-factor Time-Based One-Time Password is now required to access the SSH Gateway.
+To enable this for your EIDF account, follow the safe guide: [How to turn on MFA on your machine account](https://epcced.github.io/safe-docs/safe-for-users/#how-to-turn-on-mfa-on-your-machine-account) + + +### Using the SSH-Key and TOTP Code to access EIDF - Windows and Linux 1. From your local terminal, import the SSH Key you generated above: ```$ ssh-add [sshkey]``` 1. This should return "Identity added [Path to SSH Key]" if successful. You can then follow the steps below to access your VM. @@ -83,6 +96,8 @@ ssh -J [username]@eidf-gateway.epcc.ed.ac.uk [username]@[vm_ip] The `-J` flag is use to specify that we will access the second specified host by jumping through the first specified host. +You will be prompted for a 'TOTP' code upon successful public key authentication to the gateway. At the TOTP prompt, enter the code displayed in your MFA Applicaiton. + ## Accessing from Windows Windows will require the installation of OpenSSH-Server to use SSH. Putty or MobaXTerm can also be used but won’t be covered in this tutorial. @@ -111,6 +126,8 @@ Windows will require the installation of OpenSSH-Server to use SSH. Putty or Mob ssh -J [username]@eidf-gateway.epcc.ed.ac.uk [username]@[vm_ip] ``` +You will be prompted for a 'TOTP' code upon successful public key authentication to the gateway. At the TOTP prompt, enter the code displayed in your MFA Applicaiton. + ## First Password Setting and Password Resets Before logging in for the first time you have to reset the password using the web form in the EIDF Portal following the instructions in [Set or change the password for a user account](../services/virtualmachines/quickstart.md#set-or-change-the-password-for-a-user-account). From 8a45e92478c1347e1d51c820ef10873027a4cdde Mon Sep 17 00:00:00 2001 From: awat31 Date: Tue, 19 Dec 2023 11:31:53 +0000 Subject: [PATCH 29/91] Updated docs for eidf-gateway when MFA is enabled --- docs/access/ssh.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/access/ssh.md b/docs/access/ssh.md index efcdb7649..71b04a846 100644 --- a/docs/access/ssh.md +++ b/docs/access/ssh.md @@ -19,7 +19,7 @@ The EIDF-Gateway is an SSH gateway suitable for accessing EIDF Services via a co You will require three things to use the gateway: 1. A user within a project allowed to access the gateway and a password set. -1. An SSH-key linked to this account, used to authenticate against the gateway +1. An SSH-key linked to this account, used to authenticate against the gateway. 1. Have MFA setup with your project account via SAFE. Steps to meet all of these requirements are explained below. From 15d1acfec9c7225e9e5edde5f61d7f651a77e0f0 Mon Sep 17 00:00:00 2001 From: awat31 Date: Tue, 19 Dec 2023 14:33:55 +0000 Subject: [PATCH 30/91] Format changes --- docs/access/ssh.md | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/docs/access/ssh.md b/docs/access/ssh.md index 71b04a846..17f392233 100644 --- a/docs/access/ssh.md +++ b/docs/access/ssh.md @@ -62,6 +62,7 @@ If not, you'll need to generate an SSH-Key, to do this: 1. Select the plus button under 'Credentials' 1. Select 'Choose File' to upload the PUBLIC (.pub) ssh key generated in the last step, or open the .pub file you just created and copy its contents into the text box. 1. 
Click 'Upload Credential' - it should look something like this: + ![eidf-portal-ssh](../images/access/eidf-portal-ssh.png){: class="border-img"} #### Adding a new SSH Key via SAFE @@ -73,12 +74,13 @@ However, select your '[username]@EIDF' login account, not 'Archer2' as specified ## Enabling MFA via SAFE A multi-factor Time-Based One-Time Password is now required to access the SSH Gateway.
-To enable this for your EIDF account, follow the safe guide: [How to turn on MFA on your machine account](https://epcced.github.io/safe-docs/safe-for-users/#how-to-turn-on-mfa-on-your-machine-account) +To enable this for your EIDF account, follow the safe guide: [How to turn on MFA on your machine account](https://epcced.github.io/safe-docs/safe-for-users/#how-to-turn-on-mfa-on-your-machine-account) ### Using the SSH-Key and TOTP Code to access EIDF - Windows and Linux 1. From your local terminal, import the SSH Key you generated above: ```$ ssh-add [sshkey]``` + 1. This should return "Identity added [Path to SSH Key]" if successful. You can then follow the steps below to access your VM. ## Accessing From MacOS/Linux @@ -126,7 +128,7 @@ Windows will require the installation of OpenSSH-Server to use SSH. Putty or Mob ssh -J [username]@eidf-gateway.epcc.ed.ac.uk [username]@[vm_ip] ``` -You will be prompted for a 'TOTP' code upon successful public key authentication to the gateway. At the TOTP prompt, enter the code displayed in your MFA Applicaiton. +You will be prompted for a 'TOTP' code upon successful public key authentication to the gateway. At the TOTP prompt, enter the code displayed in your MFA Application. ## First Password Setting and Password Resets From 383109b9f1f33467d447111407428509f264c19a Mon Sep 17 00:00:00 2001 From: awat31 Date: Tue, 19 Dec 2023 14:37:49 +0000 Subject: [PATCH 31/91] Remove training whitespace --- docs/access/ssh.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/access/ssh.md b/docs/access/ssh.md index 17f392233..f34524c3a 100644 --- a/docs/access/ssh.md +++ b/docs/access/ssh.md @@ -16,11 +16,11 @@ The EIDF-Gateway is an SSH gateway suitable for accessing EIDF Services via a console or terminal. As the gateway cannot be 'landed' on, a user can only pass through it and so the destination (the VM IP) has to be known for the service to work. Users connect to their VM through the jump host using their given accounts. -You will require three things to use the gateway: +You will require three things to use the gateway: -1. A user within a project allowed to access the gateway and a password set. -1. An SSH-key linked to this account, used to authenticate against the gateway. -1. Have MFA setup with your project account via SAFE. +1. A user within a project allowed to access the gateway and a password set. +1. An SSH-key linked to this account, used to authenticate against the gateway. +1. Have MFA setup with your project account via SAFE. Steps to meet all of these requirements are explained below. 
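Before working through these steps it can be worth checking which keys your local SSH agent is actually offering, since the gateway only sees keys that are loaded. A small illustrative check (the key path is an example, not a required location):

```bash
# List the keys currently held by ssh-agent
ssh-add -l

# Load a key if the list is empty or does not include your EIDF key
ssh-add ~/.ssh/id_ed25519
```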
From 9b7a8f7509884c5b02b637877ff76b3663cbc16e Mon Sep 17 00:00:00 2001 From: Julien Sindt <57000579+jsindt@users.noreply.github.com> Date: Wed, 20 Dec 2023 15:18:44 +0000 Subject: [PATCH 32/91] Update L3_running_a_pytorch_task.md Fixing minor typo --- docs/services/gpuservice/training/L3_running_a_pytorch_task.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/services/gpuservice/training/L3_running_a_pytorch_task.md b/docs/services/gpuservice/training/L3_running_a_pytorch_task.md index cb40d7140..b3fad8906 100644 --- a/docs/services/gpuservice/training/L3_running_a_pytorch_task.md +++ b/docs/services/gpuservice/training/L3_running_a_pytorch_task.md @@ -221,5 +221,5 @@ kubectl delete pod pytorch-pod kubectl delete pod pytorch-job -kubectl delete pv pytorch-pvc +kubectl delete pvc pytorch-pvc ``` From 15e6e6aa032c7f0386836b19a6cae98232c958cb Mon Sep 17 00:00:00 2001 From: Nick Johnson Date: Wed, 3 Jan 2024 14:46:30 +0000 Subject: [PATCH 33/91] Minor update to CS2 WSC docs/Troubleshooting. --- docs/services/cs2/run.md | 27 +++++++++++++++++++-------- docs/services/ultra2/run.md | 13 +++++++------ 2 files changed, 26 insertions(+), 14 deletions(-) diff --git a/docs/services/cs2/run.md b/docs/services/cs2/run.md index 79461e616..d5e53c9d3 100644 --- a/docs/services/cs2/run.md +++ b/docs/services/cs2/run.md @@ -34,6 +34,7 @@ python run.py \ --mount_dirs {paths to modelzoo and to data} \ --python_paths {paths to modelzoo and other python code if used} ``` + See the 'Troubleshooting' section below for known issues. ## Creating an environment @@ -61,29 +62,34 @@ source venv_cerebras_pt/bin/activate cerebras_install_check ``` - ## Troubleshooting ### "Failed to transfer X out of 1943 weight tensors with modelzoo" + Sometimes jobs receive an error during the 'Transferring weights to server' like below: -``` + +```bash 2023-12-14 16:00:19,066 ERROR: Failed to transfer 5 out of 1943 weight tensors. Raising the first error encountered. 2023-12-14 16:00:19,118 ERROR: Initiating shutdown sequence due to error: Attempting to materialize deferred tensor with key “state.optimizer.state.214.beta1_power” from file model_dir/cerebras_logs/device_data_jxsi5hub/initial_state.hdf5, but the file has since been modified. The loaded tensor value may be different from originally loaded tensor. Please refrain from modifying the file while the run is in progress. -``` +``` Cerebras are aware of this issue and are working on a fix, however in the mean time follow the below workaround: -1. From within your python venv, edit the /lib64/python3.8/site-packages/cerebras_pytorch/storage.py file +1. From within your python venv, edit the /lib/python3.8/site-packages/cerebras_pytorch/saver/storage.py file + ```bash -vi /lib64/python3.8/site-packages/cerebras_pytorch/storage.py -``` +vi /lib/python3.8/site-packages/cerebras_pytorch/saver/storage.py +``` 1. Navigate to line 672 + ```bash :672 ``` + The section should look like this: -``` + +```python if modified_time > self._last_modified: raise RuntimeError( f"Attempting to materialize deferred tensor with key " @@ -95,7 +101,8 @@ if modified_time > self._last_modified: ``` 1. Comment out the whole section -``` + +```python #if modified_time > self._last_modified: # raise RuntimeError( # f"Attempting to materialize deferred tensor with key " @@ -109,3 +116,7 @@ if modified_time > self._last_modified: 1. Save the file 1. 
Re-run the job
+
+### Paths, PYTHONPATH and mount_dirs
+
+There can be some confusion over the correct use of the parameters supplied to the run.py script. There is a helpful explanation page from Cerebras which explains these parameters and how they should be used. [Python, paths and mount directories.](https://docs.cerebras.net/en/latest/wsc/getting-started/mount_dir.html?highlight=mount#python-paths-and-mount-directories)
diff --git a/docs/services/ultra2/run.md b/docs/services/ultra2/run.md
index 18c1b5f98..6374cdc67 100644
--- a/docs/services/ultra2/run.md
+++ b/docs/services/ultra2/run.md
@@ -68,18 +68,19 @@ Once you have done this, your SSH key will be added to your Ultra2 account.
 Remember, you will need to use both an SSH key and Time-based one-time password to log into Ultra2 so you will also need to [set up your TOTP](https://epcced.github.io/safe-docs/safe-for-users/#how-to-turn-on-mfa-on-your-machine-account) before you can log into Ultra2.
 
-!!! Note
+---
+!!! note "First Login"
     When you **first** log into Ultra2, you will be prompted to change your initial password.
 
     This is a three step process:
 
-    1. When promoted to enter your *password*: Enter the password which you [retrieve from SAFE](https://epcced.github.io/safe-docs/safe-for-users/#how-can-i-pick-up-my-password-for-the-service-machine)
+    1. When prompted to enter your *password*: Enter the password which you [retrieve from SAFE](https://epcced.github.io/safe-docs/safe-for-users/#how-can-i-pick-up-my-password-for-the-service-machine)
+    1. When prompted to enter your new password: type in a new password
+    1. When prompted to re-enter the new password: re-enter the new password
 
-    2. When prompted to enter your new password: type in a new password
+    Your password has now been changed
 
-    3. When prompted to re-enter the new password: re-enter the new password
-
-    Your password has now been changed
You will **not** use your password when logging on to Ultra2 after the initial logon. +--- ### SSH Login From f952ddcc3451842e7e0139cf18d6f6ce8baef65e Mon Sep 17 00:00:00 2001 From: awat31 Date: Fri, 12 Jan 2024 09:52:27 +0000 Subject: [PATCH 34/91] SSH Gateway MFA Changes --- docs/access/ssh.md | 15 ++++++++++++++- 1 file changed, 14 insertions(+), 1 deletion(-) diff --git a/docs/access/ssh.md b/docs/access/ssh.md index f34524c3a..cc9a6cdff 100644 --- a/docs/access/ssh.md +++ b/docs/access/ssh.md @@ -75,7 +75,20 @@ However, select your '[username]@EIDF' login account, not 'Archer2' as specified A multi-factor Time-Based One-Time Password is now required to access the SSH Gateway.
-To enable this for your EIDF account, follow the safe guide: [How to turn on MFA on your machine account](https://epcced.github.io/safe-docs/safe-for-users/#how-to-turn-on-mfa-on-your-machine-account)
+To enable this for your EIDF account:
+
+1. Log in to the [portal](https://portal.eidf.ac.uk).
+1. Select 'Projects' then 'Your Projects'.
+1. Select the project containing the account you'd like to add MFA to.
+1. Under 'Your Accounts', select the account you would like to add MFA to.
+1. Select 'Set MFA Token'.
+1. Within your chosen MFA application, scan the QR Code or enter the key and add the token.
+1. Enter the code displayed in the app into the 'Verification Code' box and select 'Set Token'.
+1. You will be redirected to the User Account page and a green 'Added MFA Token' message will confirm the token has been added successfully.
+
+!!! note
+    TOTP is only required for the SSH Gateway, not for the VMs themselves, and not for access through the VDI.
+ An MFA token will have to be set for each account you'd like to use to access the EIDF SSH Gateway. ### Using the SSH-Key and TOTP Code to access EIDF - Windows and Linux From 8782ba6631f25252b9249459dc8cd81355025d30 Mon Sep 17 00:00:00 2001 From: awat31 Date: Fri, 12 Jan 2024 15:29:06 +0000 Subject: [PATCH 35/91] Updated header --- docs/access/ssh.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/access/ssh.md b/docs/access/ssh.md index cc9a6cdff..3738ccee5 100644 --- a/docs/access/ssh.md +++ b/docs/access/ssh.md @@ -71,7 +71,7 @@ This should not be necessary for most users, so only follow this process if you If you need to add an SSH Key directly to SAFE, you can follow this [guide.](https://epcced.github.io/safe-docs/safe-for-users/#how-to-add-an-ssh-public-key-to-your-account) However, select your '[username]@EIDF' login account, not 'Archer2' as specified in that guide. -## Enabling MFA via SAFE +## Enabling MFA via the Portal A multi-factor Time-Based One-Time Password is now required to access the SSH Gateway.
From 1eace02f14b24b7b52a803a50c46916fa8f8667e Mon Sep 17 00:00:00 2001 From: awat31 Date: Fri, 12 Jan 2024 15:32:07 +0000 Subject: [PATCH 36/91] Spelling Correction --- docs/access/ssh.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/access/ssh.md b/docs/access/ssh.md index 3738ccee5..e6f955e87 100644 --- a/docs/access/ssh.md +++ b/docs/access/ssh.md @@ -111,7 +111,7 @@ ssh -J [username]@eidf-gateway.epcc.ed.ac.uk [username]@[vm_ip] The `-J` flag is use to specify that we will access the second specified host by jumping through the first specified host. -You will be prompted for a 'TOTP' code upon successful public key authentication to the gateway. At the TOTP prompt, enter the code displayed in your MFA Applicaiton. +You will be prompted for a 'TOTP' code upon successful public key authentication to the gateway. At the TOTP prompt, enter the code displayed in your MFA Application. ## Accessing from Windows From 1edbb2e8bf677191e726af37c520d69d3c3b173a Mon Sep 17 00:00:00 2001 From: Nick Johnson Date: Tue, 6 Feb 2024 14:02:34 +0000 Subject: [PATCH 37/91] Post r2.1.1 update --- docs/services/cs2/run.md | 65 +++++++++++++++++++++++++++------------- 1 file changed, 44 insertions(+), 21 deletions(-) diff --git a/docs/services/cs2/run.md b/docs/services/cs2/run.md index d5e53c9d3..e6c00a791 100644 --- a/docs/services/cs2/run.md +++ b/docs/services/cs2/run.md @@ -41,50 +41,41 @@ See the 'Troubleshooting' section below for known issues. To run a job on the cluster, you must create a Python virtual environment (venv) and install the dependencies. The Cerebras documentation contains generic instructions to do this [Cerebras setup environment docs](https://docs.cerebras.net/en/latest/wsc/getting-started/setup-environment.html) however our host system is slightly different so we recommend the following: -1. Create the venv +### Create the venv ```bash python3.8 -m venv venv_cerebras_pt ``` -1. Install the dependencies +### Install the dependencies ```bash source venv_cerebras_pt/bin/activate pip install --upgrade pip -pip install cerebras_pytorch==2.0.2 +pip install cerebras_pytorch==2.1.1 ``` -1. Validate the setup +### Validate the setup ```bash source venv_cerebras_pt/bin/activate cerebras_install_check ``` -## Troubleshooting - -### "Failed to transfer X out of 1943 weight tensors with modelzoo" - -Sometimes jobs receive an error during the 'Transferring weights to server' like below: - -```bash -2023-12-14 16:00:19,066 ERROR: Failed to transfer 5 out of 1943 weight tensors. Raising the first error encountered. -2023-12-14 16:00:19,118 ERROR: Initiating shutdown sequence due to error: Attempting to materialize deferred tensor with key “state.optimizer.state.214.beta1_power” from file model_dir/cerebras_logs/device_data_jxsi5hub/initial_state.hdf5, but the file has since been modified. The loaded tensor value may be different from originally loaded tensor. Please refrain from modifying the file while the run is in progress. -``` +### Modify venv files to remove clock sync check on EPCC system. Cerebras are aware of this issue and are working on a fix, however in the mean time follow the below workaround: -1. From within your python venv, edit the /lib/python3.8/site-packages/cerebras_pytorch/saver/storage.py file +### From within your python venv, edit the /lib/python3.8/site-packages/cerebras_pytorch/saver/storage.py file ```bash vi /lib/python3.8/site-packages/cerebras_pytorch/saver/storage.py ``` -1. 
Navigate to line 672 +### Navigate to line 530 ```bash -:672 +:530 ``` The section should look like this: @@ -100,7 +91,7 @@ if modified_time > self._last_modified: ) ``` -1. Comment out the whole section +### Comment out the whole section ```python #if modified_time > self._last_modified: @@ -113,10 +104,42 @@ if modified_time > self._last_modified: # ) ``` -1. Save the file +### Navigate to line 774 + +```bash +:774 +``` + +The section should look like this: + +```python + if stat.st_mtime_ns > self._stat.st_mtime_ns: + raise RuntimeError( + f"Attempting to {msg} deferred tensor with key " + f"\"{self._key}\" from file {self._filepath}, but the file has " + f"since been modified. The loaded tensor value may be " + f"different from originally loaded tensor. Please refrain " + f"from modifying the file while the run is in progress." + ) +``` + +### Comment out the whole section + +```python + #if stat.st_mtime_ns > self._stat.st_mtime_ns: + # raise RuntimeError( + # f"Attempting to {msg} deferred tensor with key " + # f"\"{self._key}\" from file {self._filepath}, but the file has " + # f"since been modified. The loaded tensor value may be " + # f"different from originally loaded tensor. Please refrain " + # f"from modifying the file while the run is in progress." + # ) +``` + +### Save the file -1. Re-run the job +### Run jobs as per existing documentation. -### Paths, PYTHONPATH and mount_dirs +## Paths, PYTHONPATH and mount_dirs There can be some confusion over the correct use of the parameters supplied to the run.py script. There is a helpful explanation page from Cerebras which explains these parameters and how they should be used. [Python, paths and mount directories.](https://docs.cerebras.net/en/latest/wsc/getting-started/mount_dir.html?highlight=mount#python-paths-and-mount-directories) From 2541900f378e590e57dce4fbdd9d0e064a1fef77 Mon Sep 17 00:00:00 2001 From: agrant3 Date: Wed, 14 Feb 2024 12:48:35 +0000 Subject: [PATCH 38/91] AG: updated style rules to avoid issues with indent blocks. Update GPU service overview. --- .mdl_style.rb | 1 + docs/services/gpuservice/index.md | 55 ++++++++++++++++++++++--------- 2 files changed, 40 insertions(+), 16 deletions(-) diff --git a/.mdl_style.rb b/.mdl_style.rb index e1c0cd8ba..d3b4f8de3 100644 --- a/.mdl_style.rb +++ b/.mdl_style.rb @@ -1,4 +1,5 @@ all exclude_rule 'MD033' +exclude_rule 'MD046' rule 'MD013', :line_length => 500 rule 'MD026', :punctuation => '.,:;' diff --git a/docs/services/gpuservice/index.md b/docs/services/gpuservice/index.md index b44e7b7b4..d96433fb4 100644 --- a/docs/services/gpuservice/index.md +++ b/docs/services/gpuservice/index.md @@ -1,32 +1,50 @@ # Overview -The EIDF GPU Service (EIDFGPUS) uses Nvidia A100 GPUs as accelerators. +The EIDF GPU Service (EIDF GPU Service) provides access to a range of Nvidia GPUs, in both full GPU and MIG variants. The EIDF GPU Service is built upon [Kubernetes](https://kubernetes.io). -Full Nvidia A100 GPUs are connected to 40GB of dynamic memory. +MIG (Multi-instance GPU) allow a single GPU to be split into multiple isolated smaller GPUs. This means that multiple users can access a portion of the GPU without being able to access what others are running on their portion. -Multi-instance usage (MIG) GPUs allow multiple tasks or users to share the same GPU (similar to CPU threading). +The EIDF GPU Service hosts 3G.20GB and 1G.5GB MIG variants which are approximately 1/2 and 1/7 of a full Nvidia A100 40 GB GPU. 
-There are two types of MIG GPUs inside the EIDFGPUS the Nvidia A100 3G.20GB GPUs and the Nvidia A100 1G.5GB GPUs which equate to ~1/2 and ~1/7 of a full Nvidia A100 40 GB GPU. +The service provides access to: -The current specification of the EIDFGPUS is: +- Nvidia A100 40GB +- Nvidia 80GB +- Nvidia MIG A100 1G.5GB +- Nvidia MIG A100 3G.20GB +- Nvidia H100 80GB -- 1856 CPU Cores -- 8.7 TiB Memory -- Local Disk Space (Node Image Cache and Local Workspace) - 21 TiB +The current full specification of the EIDF GPU Service as of 14 February 2024: + +- 4912 CPU Cores (AMD EPYC and Intel Xeon) +- 23 TiB Memory +- Local Disk Space (Node Image Cache and Local Workspace) - 40 TiB - Ceph Persistent Volumes (Long Term Data) - up to 100TiB -- 70 Nvidia A100 40 GB GPUs -- 14 MIG Nvidia A100 40 GB GPUs equating to 28 Nvidia A100 3G.20GB GPUs -- 20 MIG Nvidia A100 40 GB GPU equating to 140 A100 1G.5GB GPUs +- 112 Nvidia A100 40 GB +- 39 Nvidia A100 80 GB +- 16 Nvidia A100 3G.20GB +- 56 Nvidia A100 1G.5GB +- 32 Nvidia H100 80 GB -The EIDFGPUS is managed using [Kubernetes](https://kubernetes.io), with up to 8 GPUs being on a single node. +!!! Quotas + This is the full configuration of the cluster. Each project will have access to a quota across this shared configuration. This quota is agreed with the EIDF Services team. ## Service Access Users should have an EIDF account - [EIDF Accounts](../../access/project.md). -Project Leads will be able to have access to the EIDFGPUS added to their project during the project application process or through a request to the EIDF helpdesk. +Project Leads will be able to request access to the EIDF GPU Service for their project either during the project application process or through a service request to the EIDF helpdesk. + +Each project will be given a namespace to operate in and the ability to add a kubeconfig file to any of their Virtual Machines in their EIDF project - information on access to VMs is [available here](../../access/virtualmachines-vdi.md). + +All EIDF virtual machines can be set up to access the EIDF GPU Service. The Virtual Machine does not require to be GPU-enabled. + +!!! Important + The EIDF GPU Service is a container based service which is accessed from EIDF Virtual Desktop VMs. This allows a project to access multiple GPUs of different types. -Each project will be given a namespace to operate in and a kubeconfig file in a Virtual Machine on the EIDF DSC - information on access to VMs is [available here](../../access/virtualmachines-vdi.md). + An EIDF Virtual Desktop GPU-enabled VM is be limited to a small number (1-2) of GPUs of a single type. + + Projects do not have to apply for a GPU-enabled VM to access the GPU Service. ## Project Quotas @@ -36,7 +54,12 @@ A standard project namespace has the following initial quota (subject to ongoing - Memory: 1TiB - GPU: 12 -Note these quotas are maximum use by a single project, and that during periods of high usage Kubernetes Jobs maybe queued waiting for resource to become available on the cluster. +!!! Important + A project quota is the maximum proportion of the service available for use by that project. + + During periods of high demand, Jobs will queued awaiting resource availability on the Service. + + This means that a project has access up to 12 GPUs but due to demand may only be able to access a smaller number at any given time. ## Additional Service Policy Information @@ -44,7 +67,7 @@ Additional information on service policies can be found [here](policies.md). 
## EIDF GPU Service Tutorial -This tutorial teaches users how to submit tasks to the EIDFGPUS, but it is not a comprehensive overview of Kubernetes. +This tutorial teaches users how to submit tasks to the EIDF GPU Service, but it is not a comprehensive overview of Kubernetes. | Lesson | Objective | |-----------------------------------|-------------------------------------| From d24b745ed9856f7a4ca44d48381f26eb22fe6c85 Mon Sep 17 00:00:00 2001 From: agrant3 Date: Wed, 14 Feb 2024 16:57:48 +0000 Subject: [PATCH 39/91] AG: pre-commit failures corrected in cs2 and ultra2 and gpuservice latest --- docs/services/cs2/run.md | 8 +- docs/services/gpuservice/faq.md | 6 +- docs/services/gpuservice/index.md | 30 +- docs/services/gpuservice/kueue.md | 450 ++++++++++++++++++ docs/services/gpuservice/policies.md | 25 +- .../gpuservice/training/L1_getting_started.md | 266 +++++++---- .../L2_requesting_persistent_volumes.md | 74 +-- .../training/L3_running_a_pytorch_task.md | 214 +++++---- docs/services/ultra2/run.md | 1 - mkdocs.yml | 3 +- 10 files changed, 834 insertions(+), 243 deletions(-) create mode 100644 docs/services/gpuservice/kueue.md diff --git a/docs/services/cs2/run.md b/docs/services/cs2/run.md index e6c00a791..46b9ec3a6 100644 --- a/docs/services/cs2/run.md +++ b/docs/services/cs2/run.md @@ -62,7 +62,7 @@ source venv_cerebras_pt/bin/activate cerebras_install_check ``` -### Modify venv files to remove clock sync check on EPCC system. +### Modify venv files to remove clock sync check on EPCC system Cerebras are aware of this issue and are working on a fix, however in the mean time follow the below workaround: @@ -91,7 +91,7 @@ if modified_time > self._last_modified: ) ``` -### Comment out the whole section +### Comment out the section `if modified_time > self._last_modified` ```python #if modified_time > self._last_modified: @@ -123,7 +123,7 @@ The section should look like this: ) ``` -### Comment out the whole section +### Comment out the section `if stat.st_mtime_ns > self._stat.st_mtime_ns` ```python #if stat.st_mtime_ns > self._stat.st_mtime_ns: @@ -138,7 +138,7 @@ The section should look like this: ### Save the file -### Run jobs as per existing documentation. +### Run jobs as per existing documentation ## Paths, PYTHONPATH and mount_dirs diff --git a/docs/services/gpuservice/faq.md b/docs/services/gpuservice/faq.md index e91502968..456870b7a 100644 --- a/docs/services/gpuservice/faq.md +++ b/docs/services/gpuservice/faq.md @@ -16,7 +16,7 @@ The current PVC provisioner is based on Ceph RBD. The block devices provided by ### How many GPUs can I use in a pod? -The current limit is 8 GPUs per pod. Each underlying host has 8 GPUs. +The current limit is 8 GPUs per pod. Each underlying host node has either 4 or 8 GPUs. If you request 8 GPUs, you will be placed in a queue until a node with 8 GPUs is free or other jobs to run. If you request 4 GPUs this could run on a node with 4 or 8 GPUs. ### Why did a validation error occur when submitting a pod or job with a valid specification file? @@ -76,3 +76,7 @@ Example fragment for a Bash command start: - '-c' - '--' ``` + +### My large number of GPUs Job takes a long time to be scheduled + +When requesting a large number of GPUs for a job, this may require an entire node to be free. This could take some time to become available, the default scheduling algorithm in the queues in place is Best Effort FIFO - this means that large jobs will not block small jobs from running if there is sufficient quota and space available. 
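When a job with a large GPU request is waiting, the Kueue commands documented later in this series (in `docs/services/gpuservice/kueue.md`) can show whether it is pending on quota or on node availability. A short illustrative sketch, using example queue and workload names from those docs:

```bash
# Summary of admitted and pending workloads for the namespace queue
kubectl get queue

# Detailed view of quota and current flavor usage (example queue name)
kubectl describe queue eidf001-user-queue

# List workloads, then inspect one that has not started yet (example name)
kubectl get workloads
kubectl describe workload job-jobtest-366ab
```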
diff --git a/docs/services/gpuservice/index.md b/docs/services/gpuservice/index.md index d96433fb4..7dde82aaf 100644 --- a/docs/services/gpuservice/index.md +++ b/docs/services/gpuservice/index.md @@ -4,7 +4,7 @@ The EIDF GPU Service (EIDF GPU Service) provides access to a range of Nvidia GPU MIG (Multi-instance GPU) allow a single GPU to be split into multiple isolated smaller GPUs. This means that multiple users can access a portion of the GPU without being able to access what others are running on their portion. -The EIDF GPU Service hosts 3G.20GB and 1G.5GB MIG variants which are approximately 1/2 and 1/7 of a full Nvidia A100 40 GB GPU. +The EIDF GPU Service hosts 3G.20GB and 1G.5GB MIG variants which are approximately 1/2 and 1/7 of a full Nvidia A100 40 GB GPU respectively. The service provides access to: @@ -26,23 +26,27 @@ The current full specification of the EIDF GPU Service as of 14 February 2024: - 56 Nvidia A100 1G.5GB - 32 Nvidia H100 80 GB -!!! Quotas - This is the full configuration of the cluster. Each project will have access to a quota across this shared configuration. This quota is agreed with the EIDF Services team. +!!! important "Quotas" + This is the full configuration of the cluster. + + Each project will have access to a quota across this shared configuration. + + Changes to the default quota must be discussed and agreed with the EIDF Services team. ## Service Access -Users should have an EIDF account - [EIDF Accounts](../../access/project.md). +Users should have an [EIDF Account](../../access/project.md). Project Leads will be able to request access to the EIDF GPU Service for their project either during the project application process or through a service request to the EIDF helpdesk. -Each project will be given a namespace to operate in and the ability to add a kubeconfig file to any of their Virtual Machines in their EIDF project - information on access to VMs is [available here](../../access/virtualmachines-vdi.md). +Each project will be given a namespace to operate in and the ability to add a kubeconfig file to any of their Virtual Machines in their EIDF project - information on access to VMs is available [here](../../access/virtualmachines-vdi.md). All EIDF virtual machines can be set up to access the EIDF GPU Service. The Virtual Machine does not require to be GPU-enabled. -!!! Important +!!! important "EIDF GPU Service vs EIDF GPU-Enabled VMs" The EIDF GPU Service is a container based service which is accessed from EIDF Virtual Desktop VMs. This allows a project to access multiple GPUs of different types. - An EIDF Virtual Desktop GPU-enabled VM is be limited to a small number (1-2) of GPUs of a single type. + An EIDF Virtual Desktop GPU-enabled VM is limited to a small number (1-2) of GPUs of a single type. Projects do not have to apply for a GPU-enabled VM to access the GPU Service. @@ -54,13 +58,17 @@ A standard project namespace has the following initial quota (subject to ongoing - Memory: 1TiB - GPU: 12 -!!! Important +!!! important "Quota is a maximum on a Shared Resource" A project quota is the maximum proportion of the service available for use by that project. - During periods of high demand, Jobs will queued awaiting resource availability on the Service. + During periods of high demand, Jobs will be queued awaiting resource availability on the Service. This means that a project has access up to 12 GPUs but due to demand may only be able to access a smaller number at any given time. 
+## Project Queues + +EIDF GPU Service is introducing the Kueue system in February 2024. The use of this is detailed in the [Kueue](kueue.md). + ## Additional Service Policy Information Additional information on service policies can be found [here](policies.md). @@ -79,6 +87,6 @@ This tutorial teaches users how to submit tasks to the EIDF GPU Service, but it - The [Nvidia developers blog](https://developer.nvidia.com/blog/search-posts/?q=Kubernetes) provides several examples of how to run ML tasks on a Kubernetes GPU cluster. -- Kubernetes documentation has a useful [kubectl cheat sheet](https://kubernetes.io/docs/reference/kubectl/cheatsheet/#viewing-and-finding-resources) +- Kubernetes documentation has a useful [kubectl cheat sheet](https://kubernetes.io/docs/reference/kubectl/cheatsheet/#viewing-and-finding-resources). -- More detailed use cases for the `kubectl` can be found in the [Kubernetes documentation](https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#run) +- More detailed use cases for the `kubectl` can be found in the [Kubernetes documentation](https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#run). diff --git a/docs/services/gpuservice/kueue.md b/docs/services/gpuservice/kueue.md new file mode 100644 index 000000000..55a614564 --- /dev/null +++ b/docs/services/gpuservice/kueue.md @@ -0,0 +1,450 @@ +# Kueue + +## Overview + +[Kueue](https://kueue.sigs.k8s.io/docs/overview/) is a native Kubernetes quota and job management system. + +This is the job queue system for the EIDF GPU Service, starting with February 2024. + +All users should submit jobs to their local namespace user queue, this queue will have the name `eidf project namespace`-user-queue. + +### Changes to Job Specs + +Jobs can be submitted as before but will require the addition of a metadata label: + +```yaml + labels: + kueue.x-k8s.io/queue-name: -user-queue +``` + +This is the only change required to make Jobs Kueue functional. A policy will be in place that will stop jobs without this label being accepted. 
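Putting the label into context, a minimal sketch of a complete Job that carries it and can be submitted straight from a shell; the queue name `eidf001ns-user-queue`, the image and the arguments are illustrative placeholders only:

```bash
# Submit a throwaway job to check that the queue label is accepted;
# replace eidf001ns-user-queue with <your project namespace>-user-queue.
kubectl create -f - <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  generateName: queue-label-test-
  labels:
    kueue.x-k8s.io/queue-name: eidf001ns-user-queue
spec:
  completions: 1
  template:
    spec:
      containers:
      - name: hello
        image: busybox
        args: ["echo", "queued via Kueue"]
        resources:
          requests:
            cpu: 1
            memory: '1Gi'
          limits:
            cpu: 1
            memory: '1Gi'
      restartPolicy: Never
EOF
```

If the label is missing, the admission policy described above should reject the job rather than leave it silently pending.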
+ +## Useful commands for looking at your local queue + +### `kubectl get queue` + +This command will output the high level status of your namespace queue with the number of workloads currently running and the number waiting to start: + +```bash +NAME CLUSTERQUEUE PENDING WORKLOADS ADMITTED WORKLOADS +eidf001-user-queue eidf001-project-gpu-cq 0 2 +``` + +### `kubectl describe queue ` + +This command will output more detailed information on the current resource usage in your queue: + +```bash +Name: eidf001-user-queue +Namespace: eidf001 +Labels: +Annotations: +API Version: kueue.x-k8s.io/v1beta1 +Kind: LocalQueue +Metadata: + Creation Timestamp: 2024-02-06T13:06:23Z + Generation: 1 + Managed Fields: + API Version: kueue.x-k8s.io/v1beta1 + Fields Type: FieldsV1 + fieldsV1: + f:spec: + .: + f:clusterQueue: + Manager: kubectl-create + Operation: Update + Time: 2024-02-06T13:06:23Z + API Version: kueue.x-k8s.io/v1beta1 + Fields Type: FieldsV1 + fieldsV1: + f:status: + .: + f:admittedWorkloads: + f:conditions: + .: + k:{"type":"Active"}: + .: + f:lastTransitionTime: + f:message: + f:reason: + f:status: + f:type: + f:flavorUsage: + .: + k:{"name":"default-flavor"}: + .: + f:name: + f:resources: + .: + k:{"name":"cpu"}: + .: + f:name: + f:total: + k:{"name":"memory"}: + .: + f:name: + f:total: + k:{"name":"gpu-a100"}: + .: + f:name: + f:resources: + .: + k:{"name":"nvidia.com/gpu"}: + .: + f:name: + f:total: + k:{"name":"gpu-a100-1g"}: + .: + f:name: + f:resources: + .: + k:{"name":"nvidia.com/gpu"}: + .: + f:name: + f:total: + k:{"name":"gpu-a100-3g"}: + .: + f:name: + f:resources: + .: + k:{"name":"nvidia.com/gpu"}: + .: + f:name: + f:total: + k:{"name":"gpu-a100-80"}: + .: + f:name: + f:resources: + .: + k:{"name":"nvidia.com/gpu"}: + .: + f:name: + f:total: + f:flavorsReservation: + .: + k:{"name":"default-flavor"}: + .: + f:name: + f:resources: + .: + k:{"name":"cpu"}: + .: + f:name: + f:total: + k:{"name":"memory"}: + .: + f:name: + f:total: + k:{"name":"gpu-a100"}: + .: + f:name: + f:resources: + .: + k:{"name":"nvidia.com/gpu"}: + .: + f:name: + f:total: + k:{"name":"gpu-a100-1g"}: + .: + f:name: + f:resources: + .: + k:{"name":"nvidia.com/gpu"}: + .: + f:name: + f:total: + k:{"name":"gpu-a100-3g"}: + .: + f:name: + f:resources: + .: + k:{"name":"nvidia.com/gpu"}: + .: + f:name: + f:total: + k:{"name":"gpu-a100-80"}: + .: + f:name: + f:resources: + .: + k:{"name":"nvidia.com/gpu"}: + .: + f:name: + f:total: + f:pendingWorkloads: + f:reservingWorkloads: + Manager: kueue + Operation: Update + Subresource: status + Time: 2024-02-14T10:54:20Z + Resource Version: 333898946 + UID: bca097e2-6c55-4305-86ac-d1bd3c767751 +Spec: + Cluster Queue: eidf001-project-gpu-cq +Status: + Admitted Workloads: 2 + Conditions: + Last Transition Time: 2024-02-06T13:06:23Z + Message: Can submit new workloads to clusterQueue + Reason: Ready + Status: True + Type: Active + Flavor Usage: + Name: gpu-a100 + Resources: + Name: nvidia.com/gpu + Total: 2 + Name: gpu-a100-3g + Resources: + Name: nvidia.com/gpu + Total: 0 + Name: gpu-a100-1g + Resources: + Name: nvidia.com/gpu + Total: 0 + Name: gpu-a100-80 + Resources: + Name: nvidia.com/gpu + Total: 0 + Name: default-flavor + Resources: + Name: cpu + Total: 16 + Name: memory + Total: 256Gi + Flavors Reservation: + Name: gpu-a100 + Resources: + Name: nvidia.com/gpu + Total: 2 + Name: gpu-a100-3g + Resources: + Name: nvidia.com/gpu + Total: 0 + Name: gpu-a100-1g + Resources: + Name: nvidia.com/gpu + Total: 0 + Name: gpu-a100-80 + Resources: + Name: nvidia.com/gpu + Total: 
0 + Name: default-flavor + Resources: + Name: cpu + Total: 16 + Name: memory + Total: 256Gi + Pending Workloads: 0 + Reserving Workloads: 2 +Events: +``` + +### `kubectl get workloads` + +This command will return the list of workloads in the queue: + +```bash +NAME QUEUE ADMITTED BY AGE +job-jobtest-366ab eidf001-user-queue eidf001-project-gpu-cq 4h45m +job-jobtest-34ba9 eidf001-user-queue eidf001-project-gpu-cq 6h48m +``` + +### `kubectl describe workload ` + +This command will return a detailed summary of the workload including status and resource usage: + +```bash +Name: job-pytorch-job-0b664 +Namespace: t4 +Labels: kueue.x-k8s.io/job-uid=33bc1e48-4dca-4252-9387-bf68b99759dc +Annotations: +API Version: kueue.x-k8s.io/v1beta1 +Kind: Workload +Metadata: + Creation Timestamp: 2024-02-14T15:22:16Z + Generation: 2 + Managed Fields: + API Version: kueue.x-k8s.io/v1beta1 + Fields Type: FieldsV1 + fieldsV1: + f:status: + f:admission: + f:clusterQueue: + f:podSetAssignments: + k:{"name":"main"}: + .: + f:count: + f:flavors: + f:cpu: + f:memory: + f:nvidia.com/gpu: + f:name: + f:resourceUsage: + f:cpu: + f:memory: + f:nvidia.com/gpu: + f:conditions: + k:{"type":"Admitted"}: + .: + f:lastTransitionTime: + f:message: + f:reason: + f:status: + f:type: + k:{"type":"QuotaReserved"}: + .: + f:lastTransitionTime: + f:message: + f:reason: + f:status: + f:type: + Manager: kueue-admission + Operation: Apply + Subresource: status + Time: 2024-02-14T15:22:16Z + API Version: kueue.x-k8s.io/v1beta1 + Fields Type: FieldsV1 + fieldsV1: + f:status: + f:conditions: + k:{"type":"Finished"}: + .: + f:lastTransitionTime: + f:message: + f:reason: + f:status: + f:type: + Manager: kueue-job-controller-Finished + Operation: Apply + Subresource: status + Time: 2024-02-14T15:25:06Z + API Version: kueue.x-k8s.io/v1beta1 + Fields Type: FieldsV1 + fieldsV1: + f:metadata: + f:labels: + .: + f:kueue.x-k8s.io/job-uid: + f:ownerReferences: + .: + k:{"uid":"33bc1e48-4dca-4252-9387-bf68b99759dc"}: + f:spec: + .: + f:podSets: + .: + k:{"name":"main"}: + .: + f:count: + f:name: + f:template: + .: + f:metadata: + .: + f:labels: + .: + f:controller-uid: + f:job-name: + f:name: + f:spec: + .: + f:containers: + f:dnsPolicy: + f:nodeSelector: + f:restartPolicy: + f:schedulerName: + f:securityContext: + f:terminationGracePeriodSeconds: + f:volumes: + f:priority: + f:priorityClassSource: + f:queueName: + Manager: kueue + Operation: Update + Time: 2024-02-14T15:22:16Z + Owner References: + API Version: batch/v1 + Block Owner Deletion: true + Controller: true + Kind: Job + Name: pytorch-job + UID: 33bc1e48-4dca-4252-9387-bf68b99759dc + Resource Version: 270812029 + UID: 8cfa93ba-1142-4728-bc0c-e8de817e8151 +Spec: + Pod Sets: + Count: 1 + Name: main + Template: + Metadata: + Labels: + Controller - UID: 33bc1e48-4dca-4252-9387-bf68b99759dc + Job - Name: pytorch-job + Name: pytorch-pod + Spec: + Containers: + Args: + /mnt/ceph_rbd/example_pytorch_code.py + Command: + python3 + Image: pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel + Image Pull Policy: IfNotPresent + Name: pytorch-con + Resources: + Limits: + Cpu: 4 + Memory: 4Gi + nvidia.com/gpu: 1 + Requests: + Cpu: 2 + Memory: 1Gi + Termination Message Path: /dev/termination-log + Termination Message Policy: File + Volume Mounts: + Mount Path: /mnt/ceph_rbd + Name: volume + Dns Policy: ClusterFirst + Node Selector: + nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB + Restart Policy: Never + Scheduler Name: default-scheduler + Security Context: + Termination Grace Period Seconds: 30 + Volumes: + Name: 
volume + Persistent Volume Claim: + Claim Name: pytorch-pvc + Priority: 0 + Priority Class Source: + Queue Name: t4-user-queue +Status: + Admission: + Cluster Queue: project-cq + Pod Set Assignments: + Count: 1 + Flavors: + Cpu: default-flavor + Memory: default-flavor + nvidia.com/gpu: gpu-a100 + Name: main + Resource Usage: + Cpu: 2 + Memory: 1Gi + nvidia.com/gpu: 1 + Conditions: + Last Transition Time: 2024-02-14T15:22:16Z + Message: Quota reserved in ClusterQueue project-cq + Reason: QuotaReserved + Status: True + Type: QuotaReserved + Last Transition Time: 2024-02-14T15:22:16Z + Message: The workload is admitted + Reason: Admitted + Status: True + Type: Admitted + Last Transition Time: 2024-02-14T15:25:06Z + Message: Job finished successfully + Reason: JobFinished + Status: True + Type: Finished +``` diff --git a/docs/services/gpuservice/policies.md b/docs/services/gpuservice/policies.md index b083965de..5587d223f 100644 --- a/docs/services/gpuservice/policies.md +++ b/docs/services/gpuservice/policies.md @@ -16,12 +16,29 @@ Each project will be assigned a kubeconfig file for access to the service which ## Kubernetes Job Time to Live -All Kubernetes Jobs submitted to the service will have a Time to Live (TTL) applied via "spec.ttlSecondsAfterFinished" automatically. The default TTL for jobs using the service will be 1 week (604800 seconds). A completed job (in success or error state) will be deleted from the service once one week has elapsed after execution has completed. This will reduce excessive object accumulation on the service. +All Kubernetes Jobs submitted to the service will have a Time to Live (TTL) applied via `spec.ttlSecondsAfterFinished`> automatically. The default TTL for jobs using the service will be 1 week (604800 seconds). A completed job (in success or error state) will be deleted from the service once one week has elapsed after execution has completed. This will reduce excessive object accumulation on the service. -Note: This policy is automated and does not require users to change their job specifications. +!!! important + This policy is automated and does not require users to change their job specifications. ## Kubernetes Active Deadline Seconds -All Kubernetes User Pods submitted to the service will have an Active Deadline Seconds (ADS) applied via "spec.spec.activeDeadlineSeconds" automatically. The default ADS for pods using the service will be 5 days (432000 seconds). A pod will be terminated 5 days after execution has begun. This will reduce the number of unused pods remaining on the service. +All Kubernetes User Pods submitted to the service will have an Active Deadline Seconds (ADS) applied via `spec.spec.activeDeadlineSeconds` automatically. The default ADS for pods using the service will be 5 days (432000 seconds). A pod will be terminated 5 days after execution has begun. This will reduce the number of unused pods remaining on the service. -Note: This policy is automated and does not require users to change their job or pod specifications. +!!! important + This policy is automated and does not require users to change their job or pod specifications. + +## Kueue + +All jobs will be managed through the Kueue scheduling system. All pods will be required to be owned by a Kubernetes workload. + +Each project will have a local user queue in their namespace. This will provide access to their cluster queue. 
To enable the use of the queue in your job definitions, the following will need to be added to the job specification file as part of the metadata: + +```yaml + labels: + kueue.x-k8s.io/queue-name: -user-queue +``` + +Jobs without this queue name tag will be rejected. + +Pods bypassing the queue system will be deleted. diff --git a/docs/services/gpuservice/training/L1_getting_started.md b/docs/services/gpuservice/training/L1_getting_started.md index eef9015c6..9ebd1bea7 100644 --- a/docs/services/gpuservice/training/L1_getting_started.md +++ b/docs/services/gpuservice/training/L1_getting_started.md @@ -2,14 +2,14 @@ ## Introduction -Kubernetes (K8s) is a systems administration tool originally developed by Google to orchestrate the deployment, scaling, and management of containerised applications. +Kubernetes (K8s) is a container orchestration system, originally developed by Google, for the deployment, scaling, and management of containerised applications. -Nvidia have created drivers to officially support clusters of Nvidia GPUs managed by K8s. +Nvidia GPUs are supported through K8s native Nvidia GPU Operators. -Using K8s to manage the EIDFGPUS provides two key advantages: +The use of K8s to manage the EIDF GPU Service provides two key advantages: -- native support for containers enabling reproducible analysis whilst minimising demand on system admin. -- automated resource allocation for GPUs and storage volumes that are shared across multiple users. +- support for containers enabling reproducible analysis whilst minimising demand on system admin. +- automated resource allocation management for GPUs and storage volumes that are shared across multiple users. ## Interacting with a K8s cluster @@ -23,97 +23,174 @@ Users define the resource requirements of a pod (i.e. number/type of GPU) and th The pod definition yaml file is sent to the cluster using the K8s API and is assigned to an appropriate node to be ran. -A node is a unit of the cluster, e.g. a group of GPUs or virtual GPUs. +A node is a part of the cluster such as a physical or virtual host which exposes CPU, Memory and GPUs. Multiple pods can be defined and maintained using several different methods depending on purpose: [deployments](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/), [services](https://kubernetes.io/docs/concepts/services-networking/service/) and [jobs](https://kubernetes.io/docs/concepts/workloads/controllers/job/); see the K8s docs for more details. Users interact with the K8s API using the `kubectl` (short for kubernetes control) commands. + Some of the kubectl commands are restricted on the EIDF cluster in order to ensure project details are not shared across namespaces. + Useful commands are: -- `kubectl create -f `: Create a new pod with requested resources. Returns an error if a pod with the same name already exists. -- `kubectl apply -f `: Create a new pod with requested resources. If a pod with the same name already exists it updates that pod with the new resource/container requirements outlined in the yaml. +- `kubectl create -f `: Create a new job with requested resources. Returns an error if a job with the same name already exists. +- `kubectl apply -f `: Create a new job with requested resources. If a job with the same name already exists it updates that job with the new resource/container requirements outlined in the yaml. - `kubectl delete pod `: Delete a pod from the cluster. -- `kubectl get pods`: Summarise all pods the users has active (or queued). 
-- `kubectl describe pods`: Verbose description of all pods the users has active (or queued). +- `kubectl get pods`: Summarise all pods the namespace has active (or pending). +- `kubectl describe pods`: Verbose description of all pods the namespace has active (or pending). +- `kubectl describe pod `: Verbose summary of the specified pod. - `kubectl logs `: Retrieve the log files associated with a running pod. +- `kubectl get jobs`: List all jobs the namespace has active (or pending). +- `kubectl describe job `: Verbose summary of the specified job. +- `kubectl delete job `: Delete a job from the cluster. -## Creating your first pod +## Creating your first job -Nvidia have several prebuilt docker images to perform different tasks on their GPU hardware. +To access the GPUs on the service, it is recommended to start with one of the prebuild container images provided by Nvidia, these images are intended to perform different tasks using Nvidia GPUs. -The list of docker images is available on their [website](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/k8s/containers/cuda-sample/tags). +The list of Nvidia images is available on their [website](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/k8s/containers/cuda-sample/tags). -This example uses their CUDA sample code simulating nbody interactions. +The following example uses their CUDA sample code simulating nbody interactions. 1. Open an editor of your choice and create the file test_NBody.yml -1. Copy the following in to the file: +1. Copy the following in to the file, replacing `namespace-user-queue` with -user-queue, e.g. eidf001ns-user-queue: + + ``` yaml + apiVersion: batch/v1 + kind: Job + metadata: + generateName: jobtest- + labels: + kueue.x-k8s.io/queue-name: namespace-user-queue + spec: + completions: 1 + template: + metadata: + name: job-test + spec: + containers: + - name: cudasample + image: nvcr.io/nvidia/k8s/cuda-sample:nbody-cuda11.7.1 + args: ["-benchmark", "-numbodies=512000", "-fp64", "-fullscreen"] + resources: + requests: + cpu: 2 + memory: '1Gi' + limits: + cpu: 2 + memory: '4Gi' + nvidia.com/gpu: 1 + restartPolicy: Never + ``` -The pod resources are defined with the `requests` and `limits` tags. + The pod resources are defined under the `resources` tags using the `requests` and `limits` tags. -Resources defined in the `requests` tags are the minimum possible resources required for the pod to run. + Resources defined under the `requests` tags are the reserved resources required for the pod to be scheduled. -If a pod is assigned to an unused node then it may use resources beyond those requested. + If a pod is assigned to a node with unused resources then it may burst up to use resources beyond those requested. -This may allow the task within the pod to run faster, but it also runs the risk of unnecessarily blocking off resources for future pod requests. + This may allow the task within the pod to run faster, but it will also throttle back down when further pods are scheduled to the node. -The `limits` tag specifies the maximum resources that can be assigned to a pod. + The `limits` tag specifies the maximum resources that can be assigned to a pod. -The EIDFGPUS cluster requires all pods to have `requests` and `limits` tags for cpu and memory resources in order to be accepted. + The EIDF GPU Service requires all pods have `requests` and `limits` tags for CPU and memory defined in order to be accepted. 
-Finally, it optional to define GPU resources but only the `limits` tag is used to specify the use of a GPU, `limits: nvidia.com/gpu: 1`. + GPU resources requests are optional and only an entry under the `limits` tag is needed to specify the use of a GPU, `nvidia.com/gpu: 1`. Without this no GPU will be available to the pod. -``` yaml -apiVersion: v1 -kind: Pod -metadata: - generateName: first-pod- -spec: - restartPolicy: OnFailure - containers: - - name: cudasample - image: nvcr.io/nvidia/k8s/cuda-sample:nbody-cuda11.7.1 - args: ["-benchmark", "-numbodies=512000", "-fp64", "-fullscreen"] - resources: - requests: - cpu: 2 - memory: "1Gi" - limits: - cpu: 4 - memory: "4Gi" - nvidia.com/gpu: 1 -``` + The label `kueue.x-k8s.io/queue-name` specifies the queue you are submitting your job to. This is part of the Kueue system in operation on the service to allow for improved resource management for users. 1. Save the file and exit the editor -1. Run `kubectl create -f test_NBody.yml' +1. Run `kubectl create -f test_NBody.yml` 1. This will output something like: ``` bash - pod/first-pod-7gdtb created + job.batch/jobtest-b92qg created + ``` + +1. Run `kubectl get jobs` +1. This will output something like: + + ```bash + NAME COMPLETIONS DURATION AGE + jobtest-b92qg 3/3 48s 6m27s + jobtest-d45sr 5/5 15m 22h + jobtest-kwmwk 3/3 48s 29m + jobtest-kw22k 1/1 48s 29m + ``` + + This displays all the jobs in the current namespace, starting with their name, number of completions against required completions, duration and age. + +1. Describe your job using the command `kubectl describe job jobtest-b92-qg`, replacing the job name with your job name. +1. This will output something like: + + ```bash + Name: jobtest-b92qg + Namespace: t4 + Selector: controller-uid=d3233fee-794e-466f-9655-1fe32d1f06d3 + Labels: kueue.x-k8s.io/queue-name=t4-user-queue + Annotations: batch.kubernetes.io/job-tracking: + Parallelism: 1 + Completions: 3 + Completion Mode: NonIndexed + Start Time: Wed, 14 Feb 2024 14:07:44 +0000 + Completed At: Wed, 14 Feb 2024 14:08:32 +0000 + Duration: 48s + Pods Statuses: 0 Active (0 Ready) / 3 Succeeded / 0 Failed + Pod Template: + Labels: controller-uid=d3233fee-794e-466f-9655-1fe32d1f06d3 + job-name=jobtest-b92qg + Containers: + cudasample: + Image: nvcr.io/nvidia/k8s/cuda-sample:nbody-cuda11.7.1 + Port: + Host Port: + Args: + -benchmark + -numbodies=512000 + -fp64 + -fullscreen + Limits: + cpu: 2 + memory: 4Gi + nvidia.com/gpu: 1 + Requests: + cpu: 2 + memory: 1Gi + Environment: + Mounts: + Volumes: + Events: + Type Reason Age From Message + ---- ------ ---- ---- ------- + Normal Suspended 8m1s job-controller Job suspended + Normal CreatedWorkload 8m1s batch/job-kueue-controller Created Workload: t4/job-jobtest-b92qg-3b890 + Normal Started 8m1s batch/job-kueue-controller Admitted by clusterQueue project-cq + Normal SuccessfulCreate 8m job-controller Created pod: jobtest-b92qg-lh64s + Normal Resumed 8m job-controller Job resumed + Normal SuccessfulCreate 7m44s job-controller Created pod: jobtest-b92qg-xhvdm + Normal SuccessfulCreate 7m28s job-controller Created pod: jobtest-b92qg-lvmrf + Normal Completed 7m12s job-controller Job completed ``` 1. Run `kubectl get pods` 1. 
This will output something like: ``` bash - pi-tt9kq 0/1 Completed 0 24h - first-pod-24n7n 0/1 Completed 0 24h - first-pod-2j5tc 0/1 Completed 0 24h - first-pod-2kjbx 0/1 Completed 0 24h - sample-2mnvg 0/1 Completed 0 24h - sample-4sng2 0/1 Completed 0 24h - sample-5h6sr 0/1 Completed 0 24h - sample-6bqql 0/1 Completed 0 24h - first-pod-7gdtb 0/1 Completed 0 39s - sample-8dnht 0/1 Completed 0 24h - sample-8pxz4 0/1 Completed 0 24h - sample-bphjx 0/1 Completed 0 24h - sample-cp97f 0/1 Completed 0 24h - sample-gcbbb 0/1 Completed 0 24h - sample-hdlrr 0/1 Completed 0 24h + NAME READY STATUS RESTARTS AGE + jobtest-b92qg-lh64s 0/1 Completed 0 11m + jobtest-b92qg-lvmrf 0/1 Completed 0 10m + jobtest-b92qg-xhvdm 0/1 Completed 0 10m + jobtest-d45sr-8tf4d 0/1 Completed 0 22h + jobtest-d45sr-jjhgg 0/1 Completed 0 22h + jobtest-d45sr-n5w6c 0/1 Completed 0 22h + jobtest-d45sr-v9p4j 0/1 Completed 0 22h + jobtest-d45sr-xgq5s 0/1 Completed 0 22h + jobtest-kwmwk-cgwmf 0/1 Completed 0 33m + jobtest-kwmwk-mttdw 0/1 Completed 0 33m + jobtest-kwmwk-r2q9h 0/1 Completed 0 33m ``` -1. View the logs of the pod you ran `kubectl logs first-pod-7gdtb` +1. View the logs of a pod from the job you ran `kubectl logs jobtest-b92qg-lh64s` - note that the pods for the job in this case start with the job name. 1. This will output something like: ``` bash @@ -144,65 +221,76 @@ spec: = 7439.679 double-precision GFLOP/s at 30 flops per interaction ``` -1. delete your pod with `kubectl delete pod first-pod-7gdtb` +1. Delete your job with `kubectl delete job jobtest-b92qg` - this will delete the associated pods as well. ## Specifying GPU requirements -If you create multiple pods with the same yaml file and compare their log files you may notice the CUDA device may differ from `Compute 8.0 CUDA device: [NVIDIA A100-SXM4-40GB]`. +If you create multiple jobs with the same definition file and compare their log files you may notice the CUDA device may differ from `Compute 8.0 CUDA device: [NVIDIA A100-SXM4-40GB]`. -This is because K8s is allocating the pod to any free node irrespective of whether that node contains a full 80GB Nvida A100 or a GPU from a MIG Nvida A100. +The GPU Operator on K8s is allocating the pod to the first node with a GPU free that matches the other resource specifications irrespective of whether what GPU type is present on the node. 
-The GPU resource request can be more specific by adding the type of product the pod is requesting to the node selector: +The GPU resource requests can be made more specific by adding the type of GPU product the pod is requesting to the node selector: - `nvidia.com/gpu.product: 'NVIDIA-A100-SXM4-80GB'` - `nvidia.com/gpu.product: 'NVIDIA-A100-SXM4-40GB'` - `nvidia.com/gpu.product: 'NVIDIA-A100-SXM4-40GB-MIG-3g.20gb'` - `nvidia.com/gpu.product: 'NVIDIA-A100-SXM4-40GB-MIG-1g.5gb'` +- `nvidia.com/gpu.product: 'NVIDIA-H100-80GB-HBM3'` ### Example yaml file -``` yaml -apiVersion: v1 -kind: Pod +```yaml + +apiVersion: batch/v1 +kind: Job metadata: - generateName: first-pod- + generateName: jobtest- + labels: + kueue.x-k8s.io/queue-name: namespace-user-queue spec: - restartPolicy: OnFailure - containers: - - name: cudasample - image: nvcr.io/nvidia/k8s/cuda-sample:nbody-cuda11.7.1 - args: ["-benchmark", "-numbodies=512000", "-fp64", "-fullscreen"] - resources: - requests: - cpu: 2 - memory: "1Gi" - limits: - cpu: 4 - memory: "4Gi" - nvidia.com/gpu: 1 - nodeSelector: - nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB-MIG-1g.5gb + completions: 1 + template: + metadata: + name: job-test + spec: + containers: + - name: cudasample + image: nvcr.io/nvidia/k8s/cuda-sample:nbody-cuda11.7.1 + args: ["-benchmark", "-numbodies=512000", "-fp64", "-fullscreen"] + resources: + requests: + cpu: 2 + memory: '1Gi' + limits: + cpu: 2 + memory: '4Gi' + nvidia.com/gpu: 1 + restartPolicy: Never + nodeSelector: + nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB-MIG-1g.5gb ``` ## Running multiple pods with K8s jobs -A typical use case of the EIDFGPUS cluster will not consist of sending pod requests directly to Kubernetes. - -Instead, users will use a job request which wraps around a pod specification and provide several useful attributes. +The recommended use of the EIDF GPU Service is to use a job request which wraps around a pod specification and provide several useful attributes. Firstly, if a pod is assigned to a node that dies then the pod itself will fail and the user has to manually restart it. -Wrapping a pod within a job enables the self-healing mechanism within K8s so that if a node dies with the job's pod on it then the job will find a new node to automatically restart the pod. +Wrapping a pod within a job enables the self-healing mechanism within K8s so that if a node dies with the job's pod on it then the job will find a new node to automatically restart the pod, if the restartPolicy is set. + +Jobs allow users to define multiple pods that can run in parallel or series and will continue to spawn pods until a specific number of pods successfully terminate. -Furthermore, jobs allow users to define multiple pods that can run in parallel or series and will continue to spawn pods until a specific number of pods successfully terminate. +Jobs allow for better scheduling of resources using the Kueue service implemented on the EIDF GPU Service. Pods which attempt to bypass the queue mechanism this provides will affect the experience of other project users. -See below for an example K8s pod that requires three pods to successfully complete the example CUDA code before the job itself ends. +See below for an example K8s job that requires three pods to successfully complete the example CUDA code before the job itself ends. 
``` yaml apiVersion: batch/v1 kind: Job metadata: generateName: jobtest- + labels: + kueue.x-k8s.io/queue-name: namespace-user-queue spec: completions: 3 parallelism: 1 diff --git a/docs/services/gpuservice/training/L2_requesting_persistent_volumes.md b/docs/services/gpuservice/training/L2_requesting_persistent_volumes.md index f99a0527b..cfd546181 100644 --- a/docs/services/gpuservice/training/L2_requesting_persistent_volumes.md +++ b/docs/services/gpuservice/training/L2_requesting_persistent_volumes.md @@ -1,6 +1,6 @@ # Requesting Persistent Volumes With Kubernetes -Pods in the K8s EIDFGPUS are intentionally ephemeral. +Pods in the K8s EIDF GPU Service are intentionally ephemeral. They only last as long as required to complete the task that they were created for. @@ -10,9 +10,9 @@ However, this means the default storage volumes within a pod are temporary. If multiple pods require access to the same large data set or they output large files, then computationally costly file transfers need to be included in every pod instance. -Instead, K8s allows you to request persistent volumes that can be mounted to multiple pods to share files or collate outputs. +K8s allows you to request persistent volumes that can be mounted to multiple pods to share files or collate outputs. -These persistent volumes will remain even if the pods it is mounted to are deleted, are updated or crash. +These persistent volumes will remain even if the pods they are mounted to are deleted, are updated or crash. ## Submitting a Persistent Volume Claim @@ -20,11 +20,11 @@ Before a persistent volume can be mounted to a pod, the required storage resourc A PersistentVolumeClaim (PVC) needs to be submitted to K8s to request the storage resources. -The storage resources are held on a Ceph server which can accept requests up 100 TiB. Currently, each PVC can only be accessed by one pod at a time, this limitation is being addressed in further development of the EIDFGPUS. This means at this stage, pods can mount the same PVC in sequence, but not concurrently. +The storage resources are held on a Ceph server which can accept requests up to 100 TiB. Currently, each PVC can only be accessed by one pod at a time, this limitation is being addressed in further development of the EIDF GPU Service. This means at this stage, pods can mount the same PVC in sequence, but not concurrently. Example PVCs can be seen on the [Kubernetes](https://kubernetes.io/docs/concepts/storage/persistent-volumes/) documentation page. -All PVCs on the EIDFGPUS must use the `csi-rbd-sc` storage class. +All PVCs on the EIDF GPU Service must use the `csi-rbd-sc` storage class. 
### Example PersistentVolumeClaim @@ -42,12 +42,12 @@ spec: storageClassName: csi-rbd-sc ``` -You create a persistent volume by passing the yaml file to kubectl like a pod specification yaml `kubectl create ` +You create a persistent volume by passing the yaml file to kubectl like a pod specification yaml `kubectl create ` Once you have successfully created a persistent volume you can interact with it using the standard kubectl commands: -- `kubectl delete pvc ` -- `kubectl get pvc ` -- `kubectl apply -f ` +- `kubectl delete pvc ` +- `kubectl get pvc ` +- `kubectl apply -f ` ## Mounting a persistent Volume to a Pod @@ -56,29 +56,37 @@ Introducing a persistent volume to a pod requires the addition of a volumeMount ### Example pod specification yaml with mounted persistent volume ``` yaml -apiVersion: v1 -kind: Pod +apiVersion: batch/v1 +kind: Job metadata: - name: test-ceph-pvc-pod + name: test-ceph-pvc-job + labels: + kueue.x-k8s.io/queue-name: -user-queue spec: - containers: - - name: trial - image: busybox - command: ["sleep", "infinity"] - resources: - requests: - cpu: 1 - memory: "1Gi" - limits: - cpu: 1 - memory: "1Gi" - volumeMounts: - - mountPath: /mnt/ceph_rbd - name: volume - volumes: - - name: volume - persistentVolumeClaim: - claimName: test-ceph-pvc + completions: 1 + template: + metadata: + name: test-ceph-pvc-pod + spec: + containers: + - name: cudasample + image: busybox + args: ["sleep", "infinity"] + resources: + requests: + cpu: 2 + memory: '1Gi' + limits: + cpu: 2 + memory: '4Gi' + volumeMounts: + - mountPath: /mnt/ceph_rbd + name: volume + restartPolicy: Never + volumes: + - name: volume + persistentVolumeClaim: + claimName: test-ceph-pvc ``` ## Accessing the persistent volume outside a pod @@ -86,8 +94,8 @@ spec: To move files in/out of the persistent volume from outside a pod you can use the kubectl cp command. ```bash -*** On Login Node *** -kubectl cp /home/data/test_data.csv test-ceph-pvc-pod:/mnt/ceph_rbd +*** On Login Node - replacing pod name with your pod name *** +kubectl cp /home/data/test_data.csv test-ceph-pvc-job-8c9cc:/mnt/ceph_rbd ``` For more complex file transfers and synchronisation, create a low resource pod with the persistent volume mounted. @@ -97,7 +105,7 @@ The bash command rsync can be amended to manage file transfers into the mounted ## Clean up ```bash -kubectl delete pod test-ceph-pvc-pod +kubectl delete job test-ceph-pvc-job kubectl delete pvc test-ceph-pvc ``` diff --git a/docs/services/gpuservice/training/L3_running_a_pytorch_task.md b/docs/services/gpuservice/training/L3_running_a_pytorch_task.md index b3fad8906..33dae5ffb 100644 --- a/docs/services/gpuservice/training/L3_running_a_pytorch_task.md +++ b/docs/services/gpuservice/training/L3_running_a_pytorch_task.md @@ -1,6 +1,6 @@ # Running a PyTorch task -In the following lesson, we'll build a NLP neural network and train it using the EIDFGPUS. +In the following lesson, we'll build a NLP neural network and train it using the EIDF GPU Service. The model was taken from the [PyTorch Tutorials](https://pytorch.org/tutorials/beginner/basics/quickstart_tutorial.html). 
@@ -8,7 +8,7 @@ The lesson will be split into three parts: - Requesting a persistent volume and transferring code/data to it - Creating a pod with a PyTorch container downloaded from DockerHub -- Submitting a job to the EIDFGPUS and retrieving the results +- Submitting a job to the EIDF GPU Service and retrieving the results ## Load training data and ML code into a persistent volume @@ -44,132 +44,147 @@ spec: kubectl get pvc ``` -1. Create a lightweight pod with PV mounted (example pod below) +1. Create a lightweight job with pod with PV mounted (example job below) ``` bash - kubectl create -f lightweight-pod.yaml + kubectl create -f lightweight-pod-job.yaml ``` -1. Download the pytorch code +1. Download the PyTorch code ``` bash wget https://github.com/EPCCed/eidf-docs/raw/main/docs/services/gpuservice/training/resources/example_pytorch_code.py ``` -1. Copy python script into the PV +1. Copy the Python script into the PV ``` bash - kubectl cp example_pytorch_code.py lightweight-pod:/mnt/ceph_rbd/ + kubectl cp example_pytorch_code.py lightweight-job-:/mnt/ceph_rbd/ ``` -1. Check files were transferred successfully +1. Check whether the files were transferred successfully ``` bash - kubectl exec lightweight-pod -- ls /mnt/ceph_rbd + kubectl exec lightweight-job- -- ls /mnt/ceph_rbd ``` -1. Delete lightweight pod +1. Delete the lightweight job ``` bash - kubectl delete pod lightweight-pod + kubectl delete job lightweight-job- ``` -### Example lightweight pod specification +### Example lightweight job specification ``` yaml -apiVersion: v1 -kind: Pod +apiVersion: batch/v1 +kind: Job metadata: - name: lightweight-pod + name: lightweight-job + labels: + kueue.x-k8s.io/queue-name: -user-queue spec: - containers: - - name: data-loader - image: busybox - command: ["sleep", "infinity"] - resources: - requests: - cpu: 1 - memory: "1Gi" - limits: - cpu: 1 - memory: "1Gi" - volumeMounts: - - mountPath: /mnt/ceph_rbd - name: volume - volumes: - - name: volume - persistentVolumeClaim: - claimName: pytorch-pvc + completions: 1 + template: + metadata: + name: lightweight-pod + spec: + containers: + - name: data-loader + image: busybox + args: ["sleep", "infinity"] + resources: + requests: + cpu: 1 + memory: '1Gi' + limits: + cpu: 1 + memory: '1Gi' + volumeMounts: + - mountPath: /mnt/ceph_rbd + name: volume + restartPolicy: Never + volumes: + - name: volume + persistentVolumeClaim: + claimName: pytorch-pvc ``` -## Creating a pod with a PyTorch container +## Creating a Job with a PyTorch container We will use the pre-made PyTorch Docker image available on Docker Hub to run the PyTorch ML model. The PyTorch container will be held within a pod that has the persistent volume mounted and access a MIG GPU. -Submit the specification file to K8s to create the pod. +Submit the specification file below to K8s to create the job, replacing the queue name with your project namespace queue name. 
``` bash -kubectl create -f +kubectl create -f ``` -### Example PyTorch Pod Specification File +### Example PyTorch Job Specification File ``` yaml -apiVersion: v1 -kind: Pod +apiVersion: batch/v1 +kind: Job metadata: - name: pytorch-pod + name: pytorch-job + labels: + kueue.x-k8s.io/queue-name: -user-queue spec: - restartPolicy: Never - containers: - - name: pytorch-con - image: pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel - command: ["python3"] - args: ["/mnt/ceph_rbd/example_pytorch_code.py"] - volumeMounts: - - mountPath: /mnt/ceph_rbd - name: volume - resources: - requests: - cpu: 2 - memory: "1Gi" - limits: - cpu: 4 - memory: "4Gi" - nvidia.com/gpu: 1 - nodeSelector: - nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB-MIG-1g.5gb - volumes: - - name: volume - persistentVolumeClaim: - claimName: pytorch-pvc + completions: 1 + template: + metadata: + name: pytorch-pod + spec: + restartPolicy: Never + containers: + - name: pytorch-con + image: pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel + command: ["python3"] + args: ["/mnt/ceph_rbd/example_pytorch_code.py"] + volumeMounts: + - mountPath: /mnt/ceph_rbd + name: volume + resources: + requests: + cpu: 2 + memory: "1Gi" + limits: + cpu: 4 + memory: "4Gi" + nvidia.com/gpu: 1 + nodeSelector: + nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB-MIG-1g.5gb + volumes: + - name: volume + persistentVolumeClaim: + claimName: pytorch-pvc ``` ## Reviewing the results of the PyTorch model This is not intended to be an introduction to PyTorch, please see the [online tutorial](https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html) for details about the model. -1. Check model ran to completion +1. Check that the model ran to completion ``` bash kubectl logs ``` -1. Spin up lightweight pod to retrieve results +1. Spin up a lightweight pod to retrieve results ``` bash - kubectl create -f lightweight-pod.yaml + kubectl create -f lightweight-pod-job.yaml ``` -1. Copy trained model back to the head node +1. Copy the trained model back to your access VM ``` bash - kubectl cp lightweight-pod:mnt/ceph_rbd/model.pth model.pth + kubectl cp lightweight-job-:mnt/ceph_rbd/model.pth model.pth ``` -## Using a Kubernetes job to train the pytorch model +## Using a Kubernetes job to train the pytorch model multiple times A common ML training workflow may consist of training multiple iterations of a model: such as models with different hyperparameters or models trained on multiple different data sets. 
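Each completion of the job runs in its own pod, so checking on several training runs means checking on several pods. A short sketch of how that might look, assuming the job keeps the `pytorch-job` name used in the example yaml that follows:

```bash
# Show how many of the requested completions have finished
kubectl get job pytorch-job

# The job controller labels every pod it creates with job-name,
# so the individual runs can be listed and their logs collected in turn
kubectl get pods -l job-name=pytorch-job

for pod in $(kubectl get pods -l job-name=pytorch-job -o name); do
    kubectl logs "$pod"
done
```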
@@ -183,42 +198,43 @@ Below is an example job yaml for running the pytorch model which will continue t apiVersion: batch/v1 kind: Job metadata: - name: pytorch-job + name: pytorch-job + labels: + kueue.x-k8s.io/queue-name: -user-queue spec: - completions: 3 - parallelism: 1 - template: - spec: - restartPolicy: Never - containers: - - name: pytorch-con - image: pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel - command: ["python3"] - args: ["/mnt/ceph_rbd/example_pytorch_code.py"] - volumeMounts: - - mountPath: /mnt/ceph_rbd - name: volume - resources: - requests: - cpu: 1 - memory: "4Gi" - limits: - cpu: 1 - memory: "8Gi" - nvidia.com/gpu: 1 - nodeSelector: - nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB-MIG-1g.5gb - volumes: - - name: volume - persistentVolumeClaim: - claimName: pytorch-pvc + completions: 3 + template: + metadata: + name: pytorch-pod + spec: + restartPolicy: Never + containers: + - name: pytorch-con + image: pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel + command: ["python3"] + args: ["/mnt/ceph_rbd/example_pytorch_code.py"] + volumeMounts: + - mountPath: /mnt/ceph_rbd + name: volume + resources: + requests: + cpu: 2 + memory: "1Gi" + limits: + cpu: 4 + memory: "4Gi" + nvidia.com/gpu: 1 + nodeSelector: + nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB-MIG-1g.5gb + volumes: + - name: volume + persistentVolumeClaim: + claimName: pytorch-pvc ``` ## Clean up ``` bash -kubectl delete pod pytorch-pod - kubectl delete pod pytorch-job kubectl delete pvc pytorch-pvc diff --git a/docs/services/ultra2/run.md b/docs/services/ultra2/run.md index 6374cdc67..f181eb08a 100644 --- a/docs/services/ultra2/run.md +++ b/docs/services/ultra2/run.md @@ -70,7 +70,6 @@ Remember, you will need to use both an SSH key and Time-based one-time password --- !!! note "First Login" - When you **first** log into Ultra2, you will be prompted to change your initial password. This is a three step process: 1. When promoted to enter your *password*: Enter the password which you [retrieve from SAFE](https://epcced.github.io/safe-docs/safe-for-users/#how-can-i-pick-up-my-password-for-the-service-machine) diff --git a/mkdocs.yml b/mkdocs.yml index cbfe4d1d2..fb602f696 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -63,7 +63,8 @@ nav: - "GPU Service": - "Overview": services/gpuservice/index.md - "Policies": services/gpuservice/policies.md - - "Tutorial": + - "Kueue": services/gpuservice/kueue.md + - "Tutorials": - "Getting Started": services/gpuservice/training/L1_getting_started.md - "Persistent Volumes": services/gpuservice/training/L2_requesting_persistent_volumes.md - "Running a Pytorch Pod": services/gpuservice/training/L3_running_a_pytorch_task.md From 80b2fe2629c631bd559db68c65e485dd62bb74fb Mon Sep 17 00:00:00 2001 From: Nick Johnson Date: Thu, 7 Mar 2024 10:08:39 +0000 Subject: [PATCH 40/91] Update VM policy statement to mention patching. --- docs/services/virtualmachines/policies.md | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/docs/services/virtualmachines/policies.md b/docs/services/virtualmachines/policies.md index 24ca28047..e965bdcce 100644 --- a/docs/services/virtualmachines/policies.md +++ b/docs/services/virtualmachines/policies.md @@ -35,3 +35,7 @@ The current policy is: * The VM disk images are not backed up We strongly advise that you keep copies of any critical data on on an alternative system that is fully backed up. + +## Patching of User VMs + +The EIDF team updates and patches the hypervisors and the cloud management software as part of the EIDF Maintenance sessions. 
It is the responsibility of project PIs to keep the VMs in their projects up to date. VMs running the Ubuntu operating system automatically install security patches and alert users at log-on (via SSH) to reboot as necessary for the changes to take effect. It also encourages users to update packages. From 38917fe4b3153889961881272ea03f9f729a0436 Mon Sep 17 00:00:00 2001 From: Nick Johnson Date: Mon, 11 Mar 2024 13:15:17 +0000 Subject: [PATCH 41/91] Remove reference to Drag and Drop in Guacamole - this feature is not supported. --- docs/access/virtualmachines-vdi.md | 2 -- 1 file changed, 2 deletions(-) diff --git a/docs/access/virtualmachines-vdi.md b/docs/access/virtualmachines-vdi.md index 72b9b7d1d..abc7a18a2 100644 --- a/docs/access/virtualmachines-vdi.md +++ b/docs/access/virtualmachines-vdi.md @@ -58,7 +58,6 @@ by pressing <Ctrl> + <Alt> + <Shift> on a Windows PC client, o options, including: * [Reading from (and writing to) the clipboard of the remote desktop](https://guacamole.apache.org/doc/gug/using-guacamole.html#copying-pasting-text) -* [Uploading and downloading files](https://guacamole.apache.org/doc/gug/using-guacamole.html#file-transfer) * [Zooming in and out of the remote display](https://guacamole.apache.org/doc/gug/using-guacamole.html#scaling-display) ### Clipboard Copy and Paste Functionality @@ -87,4 +86,3 @@ are transmitted to your VM. Please contact the EIDF helpdesk at [eidf@epcc.ed.ac are experiencing difficulties with your keyboard mapping, and we will help to resolve this by changing some settings in the Guacamole VDI connection configuration. -## Further information From 73483d7f023e77d55586e68f69f87b1507bd038a Mon Sep 17 00:00:00 2001 From: Sam Haynes Date: Wed, 3 Jan 2024 16:20:28 +0000 Subject: [PATCH 42/91] Add first draft of template workflow --- .../training/L4_template_workflow.md | 314 ++++++++++++++++++ mkdocs.yml | 2 + 2 files changed, 316 insertions(+) diff --git a/docs/services/gpuservice/training/L4_template_workflow.md b/docs/services/gpuservice/training/L4_template_workflow.md index 2114bfda7..16e008145 100644 --- a/docs/services/gpuservice/training/L4_template_workflow.md +++ b/docs/services/gpuservice/training/L4_template_workflow.md @@ -1 +1,315 @@ # Template workflow + +An example workflow for code development using K8s is outlined below. + +The workflow requires a GitHub account and GitHub Actions for CI/CD, (this can be adapted for other platforms such as GitLab). + +The workflow is separated into three sections: + +1) Data Loading + +1) Preparing a custom Docker image + +1) Code development with K8s + +## Data loading + +### Create a persistent volume + +Request memory from the Ceph server by submitting a PVC to K8s (example pvc spec yaml below). + +``` bash +kubectl create -f +``` + +##### Example PyTorch PersistentVolumeClaim + +``` yaml +kind: PersistentVolumeClaim +apiVersion: v1 +metadata: + name: template-workflow-pvc +spec: + accessModes: + - ReadWriteOnce + resources: + requests: + storage: 100Gi + storageClassName: csi-rbd-sc +``` + +### Create a lightweight pod to tranfer data to the persistent volume + +1. Check PVC has been created + + ``` bash + kubectl get pvc + ``` + +1. Create a lightweight pod with PV mounted (example pod below) + + ``` bash + kubectl create -f lightweight-pod.yaml + ``` + +1. 
Download data set (If the data set download time is estimated to be hours or days you may want to run this code within a [screen](https://www.gnu.org/software/screen/manual/screen.html) instance on your VM so you can track the progress asynchronously) + + ``` bash + kubectl exec lightweight-pod -- wget /mnt/ceph_rdb/ + ``` + +1. Delete lightweight pod + + ``` bash + kubectl delete pod lightweight-pod + ``` + +##### Example lightweight pod specification + +``` yaml +apiVersion: v1 +kind: Pod +metadata: + name: lightweight-pod +spec: + containers: + - name: data-loader + image: ubuntu-latest + command: ["sleep", "infinity"] + resources: + requests: + cpu: 1 + memory: "1Gi" + limits: + cpu: 1 + memory: "1Gi" + volumeMounts: + - mountPath: /mnt/ceph_rbd + name: volume + volumes: + - name: volume + persistentVolumeClaim: + claimName: template-workflow-pvc +``` + +## Preparing a custom Docker image + +Kubernetes requires Docker images to be pre-built and available for download from a container repository such as Docker Hub. Typical use cases require some custom modifications of a base image. + +1) Select a suitable base image (The [Nvidia container catalog](https://catalog.ngc.nvidia.com/containers) is often a useful starting place for GPU accelerated tasks) + +1) Create a [Dockerfile](https://docs.docker.com/engine/reference/builder/) to add any additional packages required to the base image + + ```txt + FROM nvcr.io/nvidia/rapidsai/base:23.12-cuda12.0-py3.10 + RUN pip install pandas + RUN pip install scikit-learn + ``` + +1) Build the Docker container locally or on a VM (You will need to install [Docker](https://docs.docker.com/)) + + ```bash + docker build + ``` + +1) Push Docker image to Docker Hub (You will need to create and setup an account) + + ```bash + docker push template-docker-image + ``` + +## Code development with K8s + +A rapid development cycle from code writing to testing requires some initial setup within k8s. + +The first step is to automatically pull the latest code version before running any tests in a pod. + +This allows development to be conducted on any device/VM with access to the repo (GitHub/GitLab) and testing to be completed on the cluster with just one `kubectl create` command. + +This allows custom code/models to be prototyped on the cluster, but typically within a standard base image. + +However, if the Docker container also needs to be developed then GitHub actions can be used to automatically build a new image and publish it to Docker Hub if any changes to a Dockerfile is detected. + +A template GitHub repo with sample code, k8s yaml files and github actions is available [here](https://github.com/DimmestP/template-EIDFGPU-workflow). + +### Create a job that downloads and runs the latest code version at runtime + +1) Create a standard job with the required resources and custom docker image (example below) + + ```yaml + apiVersion: batch/v1 + kind: Job + metadata: + name: template-workflow-job + spec: + completions: 1 + parallelism: 1 + template: + spec: + restartPolicy: Never + containers: + - name: template-docker-image + image: /template-docker-image:latest + command: ["sleep", "infinity"] + resources: + requests: + cpu: 10 + memory: "40Gi" + limits: + cpu: 10 + memory: "80Gi" + nvidia.com/gpu: 1 + volumeMounts: + - mountPath: /mnt/ceph_rbd + name: volume + volumes: + - name: volume + persistentVolumeClaim: + claimName: template-workflow-pvc + ``` + +1) Add an initial container that runs before the main container to download the latest version of the code. 
+ + ```yaml + apiVersion: batch/v1 + kind: Job + metadata: + name: template-workflow-job + spec: + completions: 1 + parallelism: 1 + template: + spec: + restartPolicy: Never + containers: + - name: template-docker-image + image: /template-docker-image:latest + command: ["sleep", "infinity"] + resources: + requests: + cpu: 10 + memory: "40Gi" + limits: + cpu: 10 + memory: "80Gi" + nvidia.com/gpu: 1 + volumeMounts: + - mountPath: /mnt/ceph_rbd + name: volume + - mountPath: /code + name: github-code + initContainers: + - name: lightweight-git-container + image: cicirello/alpine-plus-plus + command: ['sh', '-c', "cd /code; git clone "] + resources: + requests: + cpu: 1 + memory: "4Gi" + limits: + cpu: 1 + memory: "8Gi" + volumeMounts: + - mountPath: /code + name: github-code + volumes: + - name: volume + persistentVolumeClaim: + claimName: benchmark-imagenet-pvc + - name: github-code + emptyDir: + sizeLimit: 1Gi + ``` + +1) Change the command argument in the main container to run the code once started. + + ```yaml + apiVersion: batch/v1 + kind: Job + metadata: + name: template-workflow-job + spec: + completions: 1 + parallelism: 1 + template: + spec: + restartPolicy: Never + containers: + - name: template-docker-image + image: /template-docker-image:latest + command: ['sh', '-c', "python3 /code/"] + resources: + requests: + cpu: 10 + memory: "40Gi" + limits: + cpu: 10 + memory: "80Gi" + nvidia.com/gpu: 1 + volumeMounts: + - mountPath: /mnt/ceph_rbd + name: volume + - mountPath: /code + name: github-code + initContainers: + - name: lightweight-git-container + image: cicirello/alpine-plus-plus + command: ['sh', '-c', "cd /code; git clone "] + resources: + requests: + cpu: 1 + memory: "4Gi" + limits: + cpu: 1 + memory: "8Gi" + volumeMounts: + - mountPath: /code + name: github-code + volumes: + - name: volume + persistentVolumeClaim: + claimName: benchmark-imagenet-pvc + - name: github-code + emptyDir: + sizeLimit: 1Gi + ``` + +### Setup GitHub actions to build and publish any changes to a Dockerfile + +1) Create two [GitHub secrets](https://docs.github.com/en/actions/security-guides/using-secrets-in-github-actions) to securely provide your Docker Hub username and access token. + +1) Add the Dockerfile to a code/docker folder within the active GitHub repo + +1) Add the GitHub action yaml file below to the .github/workflow folder to automatically push a new image to Docker Hub if any changes to files in the code/docker folder is detected. 
+ + ```yaml + name: ci + on: + push: + paths: + - 'code/docker/**' + + jobs: + docker: + runs-on: ubuntu-latest + steps: + - + name: Set up QEMU + uses: docker/setup-qemu-action@v3 + - + name: Set up Docker Buildx + uses: docker/setup-buildx-action@v3 + - + name: Login to Docker Hub + uses: docker/login-action@v3 + with: + username: ${{ secrets.DOCKERHUB_USERNAME }} + password: ${{ secrets.DOCKERHUB_TOKEN }} + - + name: Build and push + uses: docker/build-push-action@v5 + with: + context: "{{defaultContext}}:code/docker" + push: true + tags: + ``` diff --git a/mkdocs.yml b/mkdocs.yml index fb602f696..85ba97978 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -68,6 +68,8 @@ nav: - "Getting Started": services/gpuservice/training/L1_getting_started.md - "Persistent Volumes": services/gpuservice/training/L2_requesting_persistent_volumes.md - "Running a Pytorch Pod": services/gpuservice/training/L3_running_a_pytorch_task.md + - "Template K8s Workflow": +services/gpuservice/training/L4_template_workflow.md - "GPU Service FAQ": services/gpuservice/faq.md - "Graphcore Bow Pod64": - "Overview": services/graphcore/index.md From 1cdc440482e11cce4b536e6ebc10c4527fd219d1 Mon Sep 17 00:00:00 2001 From: Samuel Joseph Haynes <37002508+DimmestP@users.noreply.github.com> Date: Thu, 4 Jan 2024 09:24:40 +0000 Subject: [PATCH 43/91] Clarify workflow with git pull --- docs/services/gpuservice/training/L4_template_workflow.md | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/docs/services/gpuservice/training/L4_template_workflow.md b/docs/services/gpuservice/training/L4_template_workflow.md index 16e008145..54189c521 100644 --- a/docs/services/gpuservice/training/L4_template_workflow.md +++ b/docs/services/gpuservice/training/L4_template_workflow.md @@ -134,7 +134,7 @@ A template GitHub repo with sample code, k8s yaml files and github actions is av ### Create a job that downloads and runs the latest code version at runtime -1) Create a standard job with the required resources and custom docker image (example below) +1) Write a standard yaml file for a k8s job with the required resources and custom docker image (example below) ```yaml apiVersion: batch/v1 @@ -274,6 +274,11 @@ A template GitHub repo with sample code, k8s yaml files and github actions is av sizeLimit: 1Gi ``` +1) Submit the yaml file to kubernetes + ```bash + kubectl create -f + ``` + ### Setup GitHub actions to build and publish any changes to a Dockerfile 1) Create two [GitHub secrets](https://docs.github.com/en/actions/security-guides/using-secrets-in-github-actions) to securely provide your Docker Hub username and access token. 
From 5ea7d16831d187a9940d61321d9d55a35a7cd181 Mon Sep 17 00:00:00 2001 From: Sam Haynes Date: Thu, 4 Jan 2024 10:08:32 +0000 Subject: [PATCH 44/91] Fix bug with loading template --- conda-requirements.yaml | 33 ------------------- .../training/L4_template_workflow.md | 32 +++++++++--------- mkdocs.yml | 3 +- 3 files changed, 17 insertions(+), 51 deletions(-) delete mode 100644 conda-requirements.yaml diff --git a/conda-requirements.yaml b/conda-requirements.yaml deleted file mode 100644 index 566edc658..000000000 --- a/conda-requirements.yaml +++ /dev/null @@ -1,33 +0,0 @@ -name: mkdocs -channels: - - conda-forge -dependencies: - - backports=1.1 - - cfgv=3.3.0 - - click=8.0.1 - - distlib=0.3.2 - - filelock=3.0.12 - - ghp-import=2.0.1 - - identify=2.2.11 - - importlib-metadata=4.6.1 - - Jinja2=3.0.1 - - Markdown=3.3.4 - - MarkupSafe=2.0.1 - - mergedeep=1.3.4 - - mkdocs=1.2.1 - - mkdocs-material=7.1.10 - - mkdocs-material-extensions=1.0.1 - - nodeenv=1.6.0 - - packaging=21.0 - - platformdirs=3.2 - - pre-commit=2.13.0 - - Pygments=2.9.0 - - pymdown-extensions=8.2 - - pyparsing=2.4.7 - - python-dateutil=2.8.1 - - PyYAML=5.4.1 - - pyyaml-env-tag=0.1 - - six=1.16.0 - - toml=0.10.2 - - watchdog=2.1.3 - - zipp=3.5.0 diff --git a/docs/services/gpuservice/training/L4_template_workflow.md b/docs/services/gpuservice/training/L4_template_workflow.md index 54189c521..9d00cbdb1 100644 --- a/docs/services/gpuservice/training/L4_template_workflow.md +++ b/docs/services/gpuservice/training/L4_template_workflow.md @@ -6,11 +6,11 @@ The workflow requires a GitHub account and GitHub Actions for CI/CD, (this can b The workflow is separated into three sections: -1) Data Loading +1. Data Loading -1) Preparing a custom Docker image +1. Preparing a custom Docker image -1) Code development with K8s +1. Code development with K8s ## Data loading @@ -22,7 +22,7 @@ Request memory from the Ceph server by submitting a PVC to K8s (example pvc spec kubectl create -f ``` -##### Example PyTorch PersistentVolumeClaim +#### Example PyTorch PersistentVolumeClaim ``` yaml kind: PersistentVolumeClaim @@ -64,7 +64,7 @@ spec: kubectl delete pod lightweight-pod ``` -##### Example lightweight pod specification +#### Example lightweight pod specification ``` yaml apiVersion: v1 @@ -96,9 +96,9 @@ spec: Kubernetes requires Docker images to be pre-built and available for download from a container repository such as Docker Hub. Typical use cases require some custom modifications of a base image. -1) Select a suitable base image (The [Nvidia container catalog](https://catalog.ngc.nvidia.com/containers) is often a useful starting place for GPU accelerated tasks) +1. Select a suitable base image (The [Nvidia container catalog](https://catalog.ngc.nvidia.com/containers) is often a useful starting place for GPU accelerated tasks) -1) Create a [Dockerfile](https://docs.docker.com/engine/reference/builder/) to add any additional packages required to the base image +1. Create a [Dockerfile](https://docs.docker.com/engine/reference/builder/) to add any additional packages required to the base image ```txt FROM nvcr.io/nvidia/rapidsai/base:23.12-cuda12.0-py3.10 @@ -106,13 +106,13 @@ Kubernetes requires Docker images to be pre-built and available for download fro RUN pip install scikit-learn ``` -1) Build the Docker container locally or on a VM (You will need to install [Docker](https://docs.docker.com/)) +1. 
Build the Docker container locally or on a VM (You will need to install [Docker](https://docs.docker.com/)) ```bash docker build ``` -1) Push Docker image to Docker Hub (You will need to create and setup an account) +1. Push Docker image to Docker Hub (You will need to create and setup an account) ```bash docker push template-docker-image @@ -134,7 +134,7 @@ A template GitHub repo with sample code, k8s yaml files and github actions is av ### Create a job that downloads and runs the latest code version at runtime -1) Write a standard yaml file for a k8s job with the required resources and custom docker image (example below) +1. Write a standard yaml file for a k8s job with the required resources and custom docker image (example below) ```yaml apiVersion: batch/v1 @@ -168,7 +168,7 @@ A template GitHub repo with sample code, k8s yaml files and github actions is av claimName: template-workflow-pvc ``` -1) Add an initial container that runs before the main container to download the latest version of the code. +1. Add an initial container that runs before the main container to download the latest version of the code. ```yaml apiVersion: batch/v1 @@ -221,7 +221,7 @@ A template GitHub repo with sample code, k8s yaml files and github actions is av sizeLimit: 1Gi ``` -1) Change the command argument in the main container to run the code once started. +1. Change the command argument in the main container to run the code once started. ```yaml apiVersion: batch/v1 @@ -274,18 +274,18 @@ A template GitHub repo with sample code, k8s yaml files and github actions is av sizeLimit: 1Gi ``` -1) Submit the yaml file to kubernetes +1. Submit the yaml file to kubernetes ```bash kubectl create -f ``` ### Setup GitHub actions to build and publish any changes to a Dockerfile -1) Create two [GitHub secrets](https://docs.github.com/en/actions/security-guides/using-secrets-in-github-actions) to securely provide your Docker Hub username and access token. +1. Create two [GitHub secrets](https://docs.github.com/en/actions/security-guides/using-secrets-in-github-actions) to securely provide your Docker Hub username and access token. -1) Add the Dockerfile to a code/docker folder within the active GitHub repo +1. Add the Dockerfile to a code/docker folder within the active GitHub repo -1) Add the GitHub action yaml file below to the .github/workflow folder to automatically push a new image to Docker Hub if any changes to files in the code/docker folder is detected. +1. Add the GitHub action yaml file below to the .github/workflow folder to automatically push a new image to Docker Hub if any changes to files in the code/docker folder is detected. 
```yaml name: ci diff --git a/mkdocs.yml b/mkdocs.yml index 85ba97978..b2837cbb7 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -68,8 +68,7 @@ nav: - "Getting Started": services/gpuservice/training/L1_getting_started.md - "Persistent Volumes": services/gpuservice/training/L2_requesting_persistent_volumes.md - "Running a Pytorch Pod": services/gpuservice/training/L3_running_a_pytorch_task.md - - "Template K8s Workflow": -services/gpuservice/training/L4_template_workflow.md + - "Template K8s Workflow": services/gpuservice/training/L4_template_workflow.md - "GPU Service FAQ": services/gpuservice/faq.md - "Graphcore Bow Pod64": - "Overview": services/graphcore/index.md From 588cf06275a63228bd46a09615b247498f97e8f7 Mon Sep 17 00:00:00 2001 From: Sam Haynes Date: Thu, 4 Jan 2024 10:25:51 +0000 Subject: [PATCH 45/91] Changed template worflow md in response to pre-commit --- .../training/L4_template_workflow.md | 21 ++++++++++--------- 1 file changed, 11 insertions(+), 10 deletions(-) diff --git a/docs/services/gpuservice/training/L4_template_workflow.md b/docs/services/gpuservice/training/L4_template_workflow.md index 9d00cbdb1..6dc74df01 100644 --- a/docs/services/gpuservice/training/L4_template_workflow.md +++ b/docs/services/gpuservice/training/L4_template_workflow.md @@ -51,13 +51,13 @@ spec: ``` bash kubectl create -f lightweight-pod.yaml ``` - + 1. Download data set (If the data set download time is estimated to be hours or days you may want to run this code within a [screen](https://www.gnu.org/software/screen/manual/screen.html) instance on your VM so you can track the progress asynchronously) ``` bash kubectl exec lightweight-pod -- wget /mnt/ceph_rdb/ ``` - + 1. Delete lightweight pod ``` bash @@ -99,7 +99,7 @@ Kubernetes requires Docker images to be pre-built and available for download fro 1. Select a suitable base image (The [Nvidia container catalog](https://catalog.ngc.nvidia.com/containers) is often a useful starting place for GPU accelerated tasks) 1. Create a [Dockerfile](https://docs.docker.com/engine/reference/builder/) to add any additional packages required to the base image - + ```txt FROM nvcr.io/nvidia/rapidsai/base:23.12-cuda12.0-py3.10 RUN pip install pandas @@ -122,11 +122,11 @@ Kubernetes requires Docker images to be pre-built and available for download fro A rapid development cycle from code writing to testing requires some initial setup within k8s. -The first step is to automatically pull the latest code version before running any tests in a pod. +The first step is to automatically pull the latest code version before running any tests in a pod. This allows development to be conducted on any device/VM with access to the repo (GitHub/GitLab) and testing to be completed on the cluster with just one `kubectl create` command. -This allows custom code/models to be prototyped on the cluster, but typically within a standard base image. +This allows custom code/models to be prototyped on the cluster, but typically within a standard base image. However, if the Docker container also needs to be developed then GitHub actions can be used to automatically build a new image and publish it to Docker Hub if any changes to a Dockerfile is detected. 
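In practice the edit-test loop this enables is short. A rough sketch, assuming the job specification is saved as `template-workflow-job.yaml` and pulls the latest commit in an init container as described elsewhere in this lesson:

```bash
# Push the latest code change to the repository
git add . && git commit -m "tweak model" && git push

# Re-submit the job; the init container clones the new commit before the test runs
kubectl create -f template-workflow-job.yaml

# Follow the output of the pod the job creates
kubectl logs -f job/template-workflow-job
```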
@@ -210,7 +210,7 @@ A template GitHub repo with sample code, k8s yaml files and github actions is av cpu: 1 memory: "8Gi" volumeMounts: - - mountPath: /code + - mountPath: /code name: github-code volumes: - name: volume @@ -263,7 +263,7 @@ A template GitHub repo with sample code, k8s yaml files and github actions is av cpu: 1 memory: "8Gi" volumeMounts: - - mountPath: /code + - mountPath: /code name: github-code volumes: - name: volume @@ -273,13 +273,14 @@ A template GitHub repo with sample code, k8s yaml files and github actions is av emptyDir: sizeLimit: 1Gi ``` - + 1. Submit the yaml file to kubernetes + ```bash kubectl create -f ``` - -### Setup GitHub actions to build and publish any changes to a Dockerfile + +### Setup GitHub actions to build and publish any changes to a Dockerfile 1. Create two [GitHub secrets](https://docs.github.com/en/actions/security-guides/using-secrets-in-github-actions) to securely provide your Docker Hub username and access token. From a83cdad2babbb5018ee6b3a006cc72f41d452d31 Mon Sep 17 00:00:00 2001 From: Samuel Joseph Haynes <37002508+DimmestP@users.noreply.github.com> Date: Thu, 4 Jan 2024 11:02:10 +0000 Subject: [PATCH 46/91] Remove reference to pytorch in L4_template_workflow.md --- docs/services/gpuservice/training/L4_template_workflow.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/services/gpuservice/training/L4_template_workflow.md b/docs/services/gpuservice/training/L4_template_workflow.md index 6dc74df01..360731d61 100644 --- a/docs/services/gpuservice/training/L4_template_workflow.md +++ b/docs/services/gpuservice/training/L4_template_workflow.md @@ -22,7 +22,7 @@ Request memory from the Ceph server by submitting a PVC to K8s (example pvc spec kubectl create -f ``` -#### Example PyTorch PersistentVolumeClaim +#### Example PersistentVolumeClaim ``` yaml kind: PersistentVolumeClaim From 20f75e0550d042d1f999c49bcf6f28c4c62819a8 Mon Sep 17 00:00:00 2001 From: Samuel Joseph Haynes <37002508+DimmestP@users.noreply.github.com> Date: Thu, 4 Jan 2024 12:12:49 +0000 Subject: [PATCH 47/91] Fix bugs with data loading --- docs/services/gpuservice/training/L4_template_workflow.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/services/gpuservice/training/L4_template_workflow.md b/docs/services/gpuservice/training/L4_template_workflow.md index 360731d61..81431fb3a 100644 --- a/docs/services/gpuservice/training/L4_template_workflow.md +++ b/docs/services/gpuservice/training/L4_template_workflow.md @@ -55,7 +55,7 @@ spec: 1. Download data set (If the data set download time is estimated to be hours or days you may want to run this code within a [screen](https://www.gnu.org/software/screen/manual/screen.html) instance on your VM so you can track the progress asynchronously) ``` bash - kubectl exec lightweight-pod -- wget /mnt/ceph_rdb/ + kubectl exec lightweight-pod -- curl /mnt/ceph_rbd/ ``` 1. 
Delete lightweight pod @@ -74,7 +74,7 @@ metadata: spec: containers: - name: data-loader - image: ubuntu-latest + image: alpine/curl:latest command: ["sleep", "infinity"] resources: requests: From ced7729697b74621cb777c0eeda806074ab01578 Mon Sep 17 00:00:00 2001 From: Samuel Joseph Haynes <37002508+DimmestP@users.noreply.github.com> Date: Thu, 4 Jan 2024 12:35:47 +0000 Subject: [PATCH 48/91] Fix indent in basic workflow container --- .../training/L4_template_workflow.md | 42 +++++++++---------- 1 file changed, 21 insertions(+), 21 deletions(-) diff --git a/docs/services/gpuservice/training/L4_template_workflow.md b/docs/services/gpuservice/training/L4_template_workflow.md index 81431fb3a..35647c9d9 100644 --- a/docs/services/gpuservice/training/L4_template_workflow.md +++ b/docs/services/gpuservice/training/L4_template_workflow.md @@ -140,31 +140,31 @@ A template GitHub repo with sample code, k8s yaml files and github actions is av apiVersion: batch/v1 kind: Job metadata: - name: template-workflow-job + name: template-workflow-job spec: - completions: 1 - parallelism: 1 - template: + completions: 1 + parallelism: 1 + template: spec: - restartPolicy: Never - containers: - - name: template-docker-image - image: /template-docker-image:latest - command: ["sleep", "infinity"] - resources: + restartPolicy: Never + containers: + - name: template-docker-image + image: /template-docker-image:latest + command: ["sleep", "infinity"] + resources: requests: - cpu: 10 - memory: "40Gi" + cpu: 10 + memory: "40Gi" limits: - cpu: 10 - memory: "80Gi" - nvidia.com/gpu: 1 - volumeMounts: - - mountPath: /mnt/ceph_rbd - name: volume - volumes: - - name: volume - persistentVolumeClaim: + cpu: 10 + memory: "80Gi" + nvidia.com/gpu: 1 + volumeMounts: + - mountPath: /mnt/ceph_rbd + name: volume + volumes: + - name: volume + persistentVolumeClaim: claimName: template-workflow-pvc ``` From a5322b132a68cd62f5290a5e2d8cf64e698411c7 Mon Sep 17 00:00:00 2001 From: Sam Haynes Date: Wed, 7 Feb 2024 15:14:34 +0000 Subject: [PATCH 49/91] Add notes highlighting the importance of specifying GPU types --- docs/services/gpuservice/index.md | 5 +++++ docs/services/gpuservice/training/L1_getting_started.md | 5 +++++ 2 files changed, 10 insertions(+) diff --git a/docs/services/gpuservice/index.md b/docs/services/gpuservice/index.md index 7dde82aaf..99629d1b0 100644 --- a/docs/services/gpuservice/index.md +++ b/docs/services/gpuservice/index.md @@ -33,6 +33,11 @@ The current full specification of the EIDF GPU Service as of 14 February 2024: Changes to the default quota must be discussed and agreed with the EIDF Services team. +> **NOTE** +> +> If you request a GPU on the EIDF GPU Service you will be assigned one at random unless you specify a GPU type. +> Please see [Getting started with Kubernetes](training/L1_getting_started.md) to learn about specifying GPU resources. + ## Service Access Users should have an [EIDF Account](../../access/project.md). 
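Following the note above, specifying a GPU type in practice means pairing the GPU resource limit with a `nodeSelector` for the product name. A minimal sketch, assuming the MIG 1g.5gb product used in the tutorial examples:

```yaml
resources:
  limits:
    nvidia.com/gpu: 1
nodeSelector:
  nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB-MIG-1g.5gb
```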
diff --git a/docs/services/gpuservice/training/L1_getting_started.md b/docs/services/gpuservice/training/L1_getting_started.md index 9ebd1bea7..71fa10d1b 100644 --- a/docs/services/gpuservice/training/L1_getting_started.md +++ b/docs/services/gpuservice/training/L1_getting_started.md @@ -237,6 +237,11 @@ The GPU resource requests can be made more specific by adding the type of GPU pr - `nvidia.com/gpu.product: 'NVIDIA-A100-SXM4-40GB-MIG-1g.5gb'` - `nvidia.com/gpu.product: 'NVIDIA-H100-80GB-HBM3'` +> **WARNING** +> +> If you request a GPU but do not specify a GPU type you will be assigned one at random. +> Please check you are requesting a GPU with the correct memory and double check spelling. + ### Example yaml file ```yaml From ba2546df4600fab52723a3cd58dc6ce7006a6514 Mon Sep 17 00:00:00 2001 From: Sam Haynes Date: Tue, 13 Feb 2024 17:19:35 +0000 Subject: [PATCH 50/91] Add options for all three stages of the workflow --- .../training/L4_template_workflow.md | 379 ++++++++++++------ 1 file changed, 265 insertions(+), 114 deletions(-) diff --git a/docs/services/gpuservice/training/L4_template_workflow.md b/docs/services/gpuservice/training/L4_template_workflow.md index 35647c9d9..0b7768954 100644 --- a/docs/services/gpuservice/training/L4_template_workflow.md +++ b/docs/services/gpuservice/training/L4_template_workflow.md @@ -1,136 +1,327 @@ # Template workflow +## Overview + An example workflow for code development using K8s is outlined below. -The workflow requires a GitHub account and GitHub Actions for CI/CD, (this can be adapted for other platforms such as GitLab). +In theory, users can create docker images with all the code, software and data included to complete their analysis. -The workflow is separated into three sections: +In practice, docker images with the required software alone can be several gigabytes in size and can be lead to unacceptable download times when ~100GB of data and code is included. -1. Data Loading +Therefore, it is recommended to separate code, software and data preparation into distinct steps: -1. Preparing a custom Docker image +1. Data Loading: Loading large data sets asynchronously. -1. Code development with K8s +1. Developing a Docker environment: Manually or automatically building Docker images. -## Data loading +1. Code development with K8s: Iteratively changing and testing code in a job. + +The workflow describes different strategies to tackle the three common stages in code development and analysis using the EIDF GPU Service. -### Create a persistent volume +The three stages are interchangeable and may not be relevant to every project. + +Some strategies in the workflow require a [GitHub](https://github.com) account and [Docker Hub](https://hub.docker.com/) account for automatic building (this can be adapted for other platforms such as GitLab). + +## Data loading -Request memory from the Ceph server by submitting a PVC to K8s (example pvc spec yaml below). +The EIDF GPU service contains GPUs with 40Gb/80Gb of on board memory and it is expected that data sets of > 100 Gb will be loaded onto the service to utilise this hardware. -``` bash -kubectl create -f -``` +Ensure persistent volume claims are of sufficient size to hold the input data, any expected output data and a small amount of additional empty space to facilitate IO. -#### Example PersistentVolumeClaim +Read the [requesting persistent volumes with Kubernetes](L2_requesting_persistent_volumes.md) lesson to learn how to request and mount persistent volumes to pods. 
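Before starting a long download it is also worth confirming that the claim really was provisioned at the size requested. A quick check, assuming the claim is named `template-workflow-pvc` as in the examples below:

```bash
# CAPACITY should match the storage request made in the claim
kubectl get pvc template-workflow-pvc

# Once the volume is mounted in a pod, df shows how much space remains on it
kubectl exec <pod name> -- df -h /mnt/ceph_rbd
```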
-``` yaml -kind: PersistentVolumeClaim -apiVersion: v1 -metadata: - name: template-workflow-pvc -spec: - accessModes: - - ReadWriteOnce - resources: - requests: - storage: 100Gi - storageClassName: csi-rbd-sc -``` +Downloading data sets of 1/2 TB or more to a persistent volume often takes several hours or days and needs to be completed asynchronously. -### Create a lightweight pod to tranfer data to the persistent volume +### Asynchronous data downloading with a lightweight job -1. Check PVC has been created +1. Check a PVC has been created. ``` bash - kubectl get pvc + kubectl get pvc template-workflow-pvc + ``` + +1. Write a job yaml with PV mounted and a command to download the data. Change the curl URL to your data set of interest. + + ``` yaml + apiVersion: batch/v1 + kind: Job + metadata: + name: lightweight-job + labels: + kueue.x-k8s.io/queue-name: + spec: + completions: 1 + parallelism: 1 + template: + metadata: + name: lightweight-job + spec: + restartPolicy: Never + containers: + - name: data-loader + image: alpine/curl:latest + command: ['sh', '-c', "cd /mnt/ceph_rbd; curl https://archive.ics.uci.edu/static/public/53/iris.zip -o iris.zip"] + resources: + requests: + cpu: 1 + memory: "1Gi" + limits: + cpu: 1 + memory: "1Gi" + volumeMounts: + - mountPath: /mnt/ceph_rbd + name: volume + volumes: + - name: volume + persistentVolumeClaim: + claimName: template-workflow-pvc ``` -1. Create a lightweight pod with PV mounted (example pod below) +1. Run the data download job. ``` bash kubectl create -f lightweight-pod.yaml ``` -1. Download data set (If the data set download time is estimated to be hours or days you may want to run this code within a [screen](https://www.gnu.org/software/screen/manual/screen.html) instance on your VM so you can track the progress asynchronously) +1. Check if the download has completed. ``` bash - kubectl exec lightweight-pod -- curl /mnt/ceph_rbd/ + kubectl get jobs ``` -1. Delete lightweight pod +1. Delete lightweight job once completed. ``` bash - kubectl delete pod lightweight-pod + kubectl delete job lightweight-job + ``` + +### Asynchronous data downloading within a screen session + +[Screen](https://www.gnu.org/software/screen/manual/screen.html#Overview) is a window manager available in Linux that allows you to create multiple interactive shells and swap between then. + +Screen has the added benefit that if your remote session is interrupted the screen session persists and can be reattached when you manage to reconnect. + +This allows you to start a task, such as downloading a data set, and check in on it asynchronously. + +Once you have started a screen session, you can create a new window with `ctrl-a c`, swap between windows with `ctrl-a 0-9` and exit screen (but keep any task running) with `ctrl-a d`. + +Using screen rather than a single download job can be helpful if downloading multiple data sets or if you intend to do some simple QC or tidying up before/after downloading. + +1. Start a screen session. + + ```bash + screen ``` -#### Example lightweight pod specification - -``` yaml -apiVersion: v1 -kind: Pod -metadata: - name: lightweight-pod -spec: - containers: - - name: data-loader - image: alpine/curl:latest - command: ["sleep", "infinity"] - resources: - requests: - cpu: 1 - memory: "1Gi" - limits: - cpu: 1 - memory: "1Gi" - volumeMounts: - - mountPath: /mnt/ceph_rbd - name: volume - volumes: - - name: volume - persistentVolumeClaim: - claimName: template-workflow-pvc -``` +1. Create an interactive lightweight job session. 
+ + ``` yaml + apiVersion: batch/v1 + kind: Job + metadata: + name: lightweight-job + labels: + kueue.x-k8s.io/queue-name: + spec: + completions: 1 + parallelism: 1 + template: + metadata: + name: lightweight-pod + spec: + restartPolicy: Never + containers: + - name: data-loader + image: alpine/curl:latest + command: ['sleep','infinity'] + resources: + requests: + cpu: 1 + memory: "1Gi" + limits: + cpu: 1 + memory: "1Gi" + volumeMounts: + - mountPath: /mnt/ceph_rbd + name: volume + volumes: + - name: volume + persistentVolumeClaim: + claimName: template-workflow-pvc + ``` + +1. Download data set. Change the curl URL to your data set of interest. + + ``` bash + kubectl exec -- curl https://archive.ics.uci.edu/static/public/53/iris.zip -o /mnt/ceph_rbd/iris.zip + ``` + +1. Exit the remote session by either ending the session or `ctrl-a d`. + +1. Reconnect at a later time and reattach the screen window. + + ```bash + screen -list + + screen -r + ``` + +1. Check the download was successful and delete the job. + + ```bash + kubectl exec -- ls /mnt/ceph_rbd/ + + kubectl delete job lightweight-job + ``` + +1. Exit the screen session. + + ```bash + exit + ``` ## Preparing a custom Docker image -Kubernetes requires Docker images to be pre-built and available for download from a container repository such as Docker Hub. Typical use cases require some custom modifications of a base image. +Kubernetes requires Docker images to be pre-built and available for download from a container repository such as Docker Hub. -1. Select a suitable base image (The [Nvidia container catalog](https://catalog.ngc.nvidia.com/containers) is often a useful starting place for GPU accelerated tasks) +It does not provide functionality to build images and create pods from docker files. -1. Create a [Dockerfile](https://docs.docker.com/engine/reference/builder/) to add any additional packages required to the base image +However, use cases may require some custom modifications of a base image, such as adding a python library. + +These custom images need to be built locally (using docker) or online (using a GitHub/GitLab worker) and pushed to a repository such as Docker Hub. + +This is not an introduction to building docker images, please see the [Docker tutorial](https://docs.docker.com/get-started/) for a general overview. + +### Manually building a Docker image locally + +1. Select a suitable base image (The [Nvidia container catalog](https://catalog.ngc.nvidia.com/containers) is often a useful starting place for GPU accelerated tasks). We'll use to base [RAPIDS image](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/rapidsai/containers/base). + +1. Create a [Dockerfile](https://docs.docker.com/engine/reference/builder/) to add any additional packages required to the base image. ```txt FROM nvcr.io/nvidia/rapidsai/base:23.12-cuda12.0-py3.10 RUN pip install pandas - RUN pip install scikit-learn + RUN pip install plotly ``` -1. Build the Docker container locally or on a VM (You will need to install [Docker](https://docs.docker.com/)) +1. Build the Docker container locally (You will need to install [Docker](https://docs.docker.com/)) ```bash - docker build + cd + + docker build . -t /template-docker-image:latest ``` + + > **NOTE** + > + > Be aware that docker images built for Apple ARM64 architectures will not function optimally on the EIDFGPU Service's AMD64 based architecture. 
+ > If building docker images locally on an Apple device you must tell the docker daemon to use AMD64 based images by passing the `--platform linux/amd64` flag to the build function. + +1. Create a repository to hold the image on [Docker Hub](https://hub.docker.com) (You will need to create and setup an account). -1. Push Docker image to Docker Hub (You will need to create and setup an account) +1. Push the Docker image to the repository. ```bash - docker push template-docker-image + docker push /template-docker-image:latest ``` + +1. Finally, specify your Docker image in the `image:` tag of the job specification yaml file. + + ```yaml + apiVersion: batch/v1 + kind: Job + metadata: + name: template-workflow-job + labels: + kueue.x-k8s.io/queue-name: + spec: + completions: 1 + parallelism: 1 + template: + spec: + restartPolicy: Never + containers: + - name: template-docker-image + image: /template-docker-image:latest + command: ["sleep", "infinity"] + resources: + requests: + cpu: 1 + memory: "4Gi" + limits: + cpu: 1 + memory: "8Gi" + ``` + +### Automatically building docker images using GitHub Actions + +In cases where the Docker image needs to be built and tested iteratively (i.e. to check for comparability issues), git version control and [GitHub Actions](https://github.com/features/actions) can simplify the build process. + +A GitHub action can build and push a Docker image to Docker Hub whenever it detects a git push that changes the dockerfile in a git repo. + +This process requires you to already have a [GitHub](https://github.com) and [Docker Hub](https://hub.docker.com) account. + +1. Create an [access token](https://docs.docker.com/security/for-developers/access-tokens/) on your Docker Hub account to allow GitHub to push changes to the Docker Hub image repo. + +1. Create two [GitHub secrets](https://docs.github.com/en/actions/security-guides/using-secrets-in-github-actions) to securely provide your Docker Hub username and access token. + +1. Add the dockerfile to a code/docker folder within an active GitHub repo. + +1. Add the GitHub action yaml file below to the .github/workflow folder to automatically push a new image to Docker Hub if any changes to files in the code/docker folder is detected. + + ```yaml + name: ci + on: + push: + paths: + - 'code/docker/**' + + jobs: + docker: + runs-on: ubuntu-latest + steps: + - + name: Set up QEMU + uses: docker/setup-qemu-action@v3 + - + name: Set up Docker Buildx + uses: docker/setup-buildx-action@v3 + - + name: Login to Docker Hub + uses: docker/login-action@v3 + with: + username: ${{ secrets.DOCKERHUB_USERNAME }} + password: ${{ secrets.DOCKERHUB_TOKEN }} + - + name: Build and push + uses: docker/build-push-action@v5 + with: + context: "{{defaultContext}}:code/docker" + push: true + tags: + ``` + +1. Push a change to the dockerfile and check the Docker Hub image is updated. ## Code development with K8s -A rapid development cycle from code writing to testing requires some initial setup within k8s. +Production code can be included within a Docker image to aid reproducibility as the specific software versions required to run the code are packaged together. -The first step is to automatically pull the latest code version before running any tests in a pod. +However, binding the code to the docker image during development can delay the testing cycle as re-downloading all of the software for every change in a code block can take time. 
-This allows development to be conducted on any device/VM with access to the repo (GitHub/GitLab) and testing to be completed on the cluster with just one `kubectl create` command. +If the docker image is consistent across tests, then it can be cached locally on the EIDFGPU Service instead of being re-downloaded (this occurs automatically although the cache is node specific and is not shared across nodes). -This allows custom code/models to be prototyped on the cluster, but typically within a standard base image. +A pod yaml file can be defined to automatically pull the latest code version before running any tests. -However, if the Docker container also needs to be developed then GitHub actions can be used to automatically build a new image and publish it to Docker Hub if any changes to a Dockerfile is detected. +Reducing the download time to fractions of a second allows rapid testing to be completed on the cluster with just the `kubectl create` command. -A template GitHub repo with sample code, k8s yaml files and github actions is available [here](https://github.com/DimmestP/template-EIDFGPU-workflow). +You must already have a [GitHub](https://github.com) account to follow this process. + +This process allows code development to be conducted on any device/VM with access to the repo (GitHub/GitLab). + +An alternative method for remote code development using the DevSpace toolkit is described is the next lesson, [Getting started with DevSpace](L5_devspace.md). + +A template GitHub repo with sample code, k8s yaml files and a Docker build Github Action is available [here](https://github.com/DimmestP/template-EIDFGPU-workflow). ### Create a job that downloads and runs the latest code version at runtime @@ -279,43 +470,3 @@ A template GitHub repo with sample code, k8s yaml files and github actions is av ```bash kubectl create -f ``` - -### Setup GitHub actions to build and publish any changes to a Dockerfile - -1. Create two [GitHub secrets](https://docs.github.com/en/actions/security-guides/using-secrets-in-github-actions) to securely provide your Docker Hub username and access token. - -1. Add the Dockerfile to a code/docker folder within the active GitHub repo - -1. Add the GitHub action yaml file below to the .github/workflow folder to automatically push a new image to Docker Hub if any changes to files in the code/docker folder is detected. 
- - ```yaml - name: ci - on: - push: - paths: - - 'code/docker/**' - - jobs: - docker: - runs-on: ubuntu-latest - steps: - - - name: Set up QEMU - uses: docker/setup-qemu-action@v3 - - - name: Set up Docker Buildx - uses: docker/setup-buildx-action@v3 - - - name: Login to Docker Hub - uses: docker/login-action@v3 - with: - username: ${{ secrets.DOCKERHUB_USERNAME }} - password: ${{ secrets.DOCKERHUB_TOKEN }} - - - name: Build and push - uses: docker/build-push-action@v5 - with: - context: "{{defaultContext}}:code/docker" - push: true - tags: - ``` From c361982b569772dab8a3156b13f3356d26725ebc Mon Sep 17 00:00:00 2001 From: Sam Haynes Date: Wed, 14 Feb 2024 16:45:14 +0000 Subject: [PATCH 51/91] Test all example code --- .../training/L4_template_workflow.md | 33 +++++++++++-------- 1 file changed, 19 insertions(+), 14 deletions(-) diff --git a/docs/services/gpuservice/training/L4_template_workflow.md b/docs/services/gpuservice/training/L4_template_workflow.md index 0b7768954..3064bd6d6 100644 --- a/docs/services/gpuservice/training/L4_template_workflow.md +++ b/docs/services/gpuservice/training/L4_template_workflow.md @@ -6,7 +6,7 @@ An example workflow for code development using K8s is outlined below. In theory, users can create docker images with all the code, software and data included to complete their analysis. -In practice, docker images with the required software alone can be several gigabytes in size and can be lead to unacceptable download times when ~100GB of data and code is included. +In practice, docker images with the required software alone can be several gigabytes in size which can lead to unacceptable download times when ~100GB of data and code is included. Therefore, it is recommended to separate code, software and data preparation into distinct steps: @@ -332,6 +332,8 @@ A template GitHub repo with sample code, k8s yaml files and a Docker build Githu kind: Job metadata: name: template-workflow-job + labels: + kueue.x-k8s.io/queue-name: spec: completions: 1 parallelism: 1 @@ -344,12 +346,11 @@ A template GitHub repo with sample code, k8s yaml files and a Docker build Githu command: ["sleep", "infinity"] resources: requests: - cpu: 10 - memory: "40Gi" + cpu: 1 + memory: "4Gi" limits: - cpu: 10 - memory: "80Gi" - nvidia.com/gpu: 1 + cpu: 1 + memory: "8Gi" volumeMounts: - mountPath: /mnt/ceph_rbd name: volume @@ -366,6 +367,8 @@ A template GitHub repo with sample code, k8s yaml files and a Docker build Githu kind: Job metadata: name: template-workflow-job + labels: + kueue.x-k8s.io/queue-name: spec: completions: 1 parallelism: 1 @@ -378,12 +381,11 @@ A template GitHub repo with sample code, k8s yaml files and a Docker build Githu command: ["sleep", "infinity"] resources: requests: - cpu: 10 - memory: "40Gi" + cpu: 1 + memory: "4Gi" limits: - cpu: 10 - memory: "80Gi" - nvidia.com/gpu: 1 + cpu: 1 + memory: "8Gi" volumeMounts: - mountPath: /mnt/ceph_rbd name: volume @@ -406,19 +408,22 @@ A template GitHub repo with sample code, k8s yaml files and a Docker build Githu volumes: - name: volume persistentVolumeClaim: - claimName: benchmark-imagenet-pvc + claimName: template-workflow-pvc - name: github-code emptyDir: sizeLimit: 1Gi ``` -1. Change the command argument in the main container to run the code once started. +1. Change the command argument in the main container to run the code once started. +Add the URL of the GitHub repo of interest to the `initContainers: command:` tag. 
```yaml apiVersion: batch/v1 kind: Job metadata: name: template-workflow-job + labels: + kueue.x-k8s.io/queue-name: spec: completions: 1 parallelism: 1 @@ -459,7 +464,7 @@ A template GitHub repo with sample code, k8s yaml files and a Docker build Githu volumes: - name: volume persistentVolumeClaim: - claimName: benchmark-imagenet-pvc + claimName: template-workflow-pvc - name: github-code emptyDir: sizeLimit: 1Gi From b76e683a4b9e8e8624fe0f6bd44c38c7b7770828 Mon Sep 17 00:00:00 2001 From: Sam Haynes Date: Thu, 21 Mar 2024 13:36:30 +0000 Subject: [PATCH 52/91] Add -n to kubectl usage --- .../training/L4_template_workflow.md | 26 ++++++++++--------- 1 file changed, 14 insertions(+), 12 deletions(-) diff --git a/docs/services/gpuservice/training/L4_template_workflow.md b/docs/services/gpuservice/training/L4_template_workflow.md index 3064bd6d6..67fc33c87 100644 --- a/docs/services/gpuservice/training/L4_template_workflow.md +++ b/docs/services/gpuservice/training/L4_template_workflow.md @@ -6,9 +6,9 @@ An example workflow for code development using K8s is outlined below. In theory, users can create docker images with all the code, software and data included to complete their analysis. -In practice, docker images with the required software alone can be several gigabytes in size which can lead to unacceptable download times when ~100GB of data and code is included. +In practice, docker images with the required software can be several gigabytes in size which can lead to unacceptable download times when ~100GB of data and code is then added. -Therefore, it is recommended to separate code, software and data preparation into distinct steps: +Therefore, it is recommended to separate code, software, and data preparation into distinct steps: 1. Data Loading: Loading large data sets asynchronously. @@ -26,18 +26,20 @@ Some strategies in the workflow require a [GitHub](https://github.com) account a The EIDF GPU service contains GPUs with 40Gb/80Gb of on board memory and it is expected that data sets of > 100 Gb will be loaded onto the service to utilise this hardware. -Ensure persistent volume claims are of sufficient size to hold the input data, any expected output data and a small amount of additional empty space to facilitate IO. +Persistent volume claims need to be of sufficient size to hold the input data, any expected output data and a small amount of additional empty space to facilitate IO. Read the [requesting persistent volumes with Kubernetes](L2_requesting_persistent_volumes.md) lesson to learn how to request and mount persistent volumes to pods. -Downloading data sets of 1/2 TB or more to a persistent volume often takes several hours or days and needs to be completed asynchronously. +It often takes several hours or days to download data sets of 1/2 TB or more to a persistent volume. + +Therefore, the data download step needs to be completed asynchronously as maintaining a contention to the server for long periods of time can be unreliable. ### Asynchronous data downloading with a lightweight job 1. Check a PVC has been created. ``` bash - kubectl get pvc template-workflow-pvc + kubectl -n get pvc template-workflow-pvc ``` 1. Write a job yaml with PV mounted and a command to download the data. Change the curl URL to your data set of interest. @@ -80,19 +82,19 @@ Downloading data sets of 1/2 TB or more to a persistent volume often takes sever 1. Run the data download job. ``` bash - kubectl create -f lightweight-pod.yaml + kubectl -n create -f lightweight-pod.yaml ``` 1. 
Check if the download has completed. ``` bash - kubectl get jobs + kubectl -n get jobs ``` 1. Delete lightweight job once completed. ``` bash - kubectl delete job lightweight-job + kubectl -n delete job lightweight-job ``` ### Asynchronous data downloading within a screen session @@ -153,7 +155,7 @@ Using screen rather than a single download job can be helpful if downloading mul 1. Download data set. Change the curl URL to your data set of interest. ``` bash - kubectl exec -- curl https://archive.ics.uci.edu/static/public/53/iris.zip -o /mnt/ceph_rbd/iris.zip + kubectl -n exec -- curl https://archive.ics.uci.edu/static/public/53/iris.zip -o /mnt/ceph_rbd/iris.zip ``` 1. Exit the remote session by either ending the session or `ctrl-a d`. @@ -169,9 +171,9 @@ Using screen rather than a single download job can be helpful if downloading mul 1. Check the download was successful and delete the job. ```bash - kubectl exec -- ls /mnt/ceph_rbd/ + kubectl -n exec -- ls /mnt/ceph_rbd/ - kubectl delete job lightweight-job + kubectl -n delete job lightweight-job ``` 1. Exit the screen session. @@ -473,5 +475,5 @@ Add the URL of the GitHub repo of interest to the `initContainers: command:` tag 1. Submit the yaml file to kubernetes ```bash - kubectl create -f + kubectl -n create -f ``` From 9df91f9c9340fc6a4b3d0597c2307b6639b92b84 Mon Sep 17 00:00:00 2001 From: Sam Haynes Date: Mon, 25 Mar 2024 10:10:17 +0000 Subject: [PATCH 53/91] Simplify to project namespace --- .../gpuservice/training/L4_template_workflow.md | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/docs/services/gpuservice/training/L4_template_workflow.md b/docs/services/gpuservice/training/L4_template_workflow.md index 67fc33c87..bd5e069ee 100644 --- a/docs/services/gpuservice/training/L4_template_workflow.md +++ b/docs/services/gpuservice/training/L4_template_workflow.md @@ -39,7 +39,7 @@ Therefore, the data download step needs to be completed asynchronously as mainta 1. Check a PVC has been created. ``` bash - kubectl -n get pvc template-workflow-pvc + kubectl -n get pvc template-workflow-pvc ``` 1. Write a job yaml with PV mounted and a command to download the data. Change the curl URL to your data set of interest. @@ -82,19 +82,19 @@ Therefore, the data download step needs to be completed asynchronously as mainta 1. Run the data download job. ``` bash - kubectl -n create -f lightweight-pod.yaml + kubectl -n create -f lightweight-pod.yaml ``` 1. Check if the download has completed. ``` bash - kubectl -n get jobs + kubectl -n get jobs ``` 1. Delete lightweight job once completed. ``` bash - kubectl -n delete job lightweight-job + kubectl -n delete job lightweight-job ``` ### Asynchronous data downloading within a screen session @@ -155,7 +155,7 @@ Using screen rather than a single download job can be helpful if downloading mul 1. Download data set. Change the curl URL to your data set of interest. ``` bash - kubectl -n exec -- curl https://archive.ics.uci.edu/static/public/53/iris.zip -o /mnt/ceph_rbd/iris.zip + kubectl -n exec -- curl https://archive.ics.uci.edu/static/public/53/iris.zip -o /mnt/ceph_rbd/iris.zip ``` 1. Exit the remote session by either ending the session or `ctrl-a d`. @@ -171,9 +171,9 @@ Using screen rather than a single download job can be helpful if downloading mul 1. Check the download was successful and delete the job. 
```bash - kubectl -n exec -- ls /mnt/ceph_rbd/ + kubectl -n exec -- ls /mnt/ceph_rbd/ - kubectl -n delete job lightweight-job + kubectl -n delete job lightweight-job ``` 1. Exit the screen session. @@ -475,5 +475,5 @@ Add the URL of the GitHub repo of interest to the `initContainers: command:` tag 1. Submit the yaml file to kubernetes ```bash - kubectl -n create -f + kubectl -n create -f ``` From e38d073f75e6c189811f093cb6b163653eb4ba07 Mon Sep 17 00:00:00 2001 From: awat31 Date: Mon, 25 Mar 2024 14:28:48 +0000 Subject: [PATCH 54/91] Update to make SSH docs clearer --- docs/access/ssh.md | 28 ++++++++++++++++++++++++---- 1 file changed, 24 insertions(+), 4 deletions(-) diff --git a/docs/access/ssh.md b/docs/access/ssh.md index e6f955e87..cca056b27 100644 --- a/docs/access/ssh.md +++ b/docs/access/ssh.md @@ -92,7 +92,7 @@ To enable this for your EIDF account: ### Using the SSH-Key and TOTP Code to access EIDF - Windows and Linux -1. From your local terminal, import the SSH Key you generated above: ```$ ssh-add [sshkey]``` +1. From your local terminal, import the SSH Key you generated above: ```ssh-add /path/to/ssh-key``` 1. This should return "Identity added [Path to SSH Key]" if successful. You can then follow the steps below to access your VM. @@ -103,10 +103,26 @@ To enable this for your EIDF account: OpenSSH is installed on Linux and MacOS usually by default, so you can access the gateway natively from the terminal. -Ensure you have created and added an ssh key as specified in the 'Generating and Adding an SSH Key' section above, then run the command below. +Ensure you have created and added an ssh key as specified in the 'Generate a new SSH Key' section above, then follow the below steps: + +1. Add your SSH-Key to the SSH-Agent + +``` +ssh-add /path/to/ssh-key +``` + +!!! info + If the above command fails saying the SSH Agent is not running, run the below command:
+ ``` eval `ssh-agent` ```
+ Then re-run the ssh-add command above + +Now you can ssh to your VM using the below command ```bash -ssh -J [username]@eidf-gateway.epcc.ed.ac.uk [username]@[vm_ip] +ssh -J [EIDF username]@eidf-gateway.epcc.ed.ac.uk [EIDF username]@[vm_ip] + +For example: +ssh -J alice@eidf-gateway.epcc.ed.ac.uk alice@10.24.1.1 ``` The `-J` flag is use to specify that we will access the second specified host by jumping through the first specified host. @@ -137,8 +153,12 @@ Windows will require the installation of OpenSSH-Server to use SSH. Putty or Mob 1. This should return "Identity added [Path to SSH Key]" if successful. 1. Login by jumping through the gateway. + ```bash -ssh -J [username]@eidf-gateway.epcc.ed.ac.uk [username]@[vm_ip] +ssh -J [EIDF username]@eidf-gateway.epcc.ed.ac.uk [EIDF username]@[vm_ip] + +For example: +ssh -J alice@eidf-gateway.epcc.ed.ac.uk alice@10.24.1.1 ``` You will be prompted for a 'TOTP' code upon successful public key authentication to the gateway. At the TOTP prompt, enter the code displayed in your MFA Application. From 9d9afd9cc734b710e2ce93a48e08586bd40775b8 Mon Sep 17 00:00:00 2001 From: Sam Haynes Date: Wed, 27 Mar 2024 17:04:38 +0000 Subject: [PATCH 55/91] Restore yaml file --- conda-requirements.yaml | 33 +++++++++++++++++++++++++++++++++ 1 file changed, 33 insertions(+) create mode 100644 conda-requirements.yaml diff --git a/conda-requirements.yaml b/conda-requirements.yaml new file mode 100644 index 000000000..566edc658 --- /dev/null +++ b/conda-requirements.yaml @@ -0,0 +1,33 @@ +name: mkdocs +channels: + - conda-forge +dependencies: + - backports=1.1 + - cfgv=3.3.0 + - click=8.0.1 + - distlib=0.3.2 + - filelock=3.0.12 + - ghp-import=2.0.1 + - identify=2.2.11 + - importlib-metadata=4.6.1 + - Jinja2=3.0.1 + - Markdown=3.3.4 + - MarkupSafe=2.0.1 + - mergedeep=1.3.4 + - mkdocs=1.2.1 + - mkdocs-material=7.1.10 + - mkdocs-material-extensions=1.0.1 + - nodeenv=1.6.0 + - packaging=21.0 + - platformdirs=3.2 + - pre-commit=2.13.0 + - Pygments=2.9.0 + - pymdown-extensions=8.2 + - pyparsing=2.4.7 + - python-dateutil=2.8.1 + - PyYAML=5.4.1 + - pyyaml-env-tag=0.1 + - six=1.16.0 + - toml=0.10.2 + - watchdog=2.1.3 + - zipp=3.5.0 From f82513b1697661e6a7332b797fb394dd7e13387d Mon Sep 17 00:00:00 2001 From: Sam Haynes Date: Wed, 27 Mar 2024 17:22:04 +0000 Subject: [PATCH 56/91] Restore L1 to previous version --- docs/services/gpuservice/training/L1_getting_started.md | 5 ----- 1 file changed, 5 deletions(-) diff --git a/docs/services/gpuservice/training/L1_getting_started.md b/docs/services/gpuservice/training/L1_getting_started.md index 71fa10d1b..9ebd1bea7 100644 --- a/docs/services/gpuservice/training/L1_getting_started.md +++ b/docs/services/gpuservice/training/L1_getting_started.md @@ -237,11 +237,6 @@ The GPU resource requests can be made more specific by adding the type of GPU pr - `nvidia.com/gpu.product: 'NVIDIA-A100-SXM4-40GB-MIG-1g.5gb'` - `nvidia.com/gpu.product: 'NVIDIA-H100-80GB-HBM3'` -> **WARNING** -> -> If you request a GPU but do not specify a GPU type you will be assigned one at random. -> Please check you are requesting a GPU with the correct memory and double check spelling. 
- ### Example yaml file ```yaml From cf0ca63c16d1e8f2d78bfb60fb73039792347fd0 Mon Sep 17 00:00:00 2001 From: Sam Haynes Date: Wed, 27 Mar 2024 17:35:20 +0000 Subject: [PATCH 57/91] Respond to Alistair comments --- .../training/L4_template_workflow.md | 42 ++++++++++--------- 1 file changed, 23 insertions(+), 19 deletions(-) diff --git a/docs/services/gpuservice/training/L4_template_workflow.md b/docs/services/gpuservice/training/L4_template_workflow.md index bd5e069ee..b5a70eede 100644 --- a/docs/services/gpuservice/training/L4_template_workflow.md +++ b/docs/services/gpuservice/training/L4_template_workflow.md @@ -1,5 +1,9 @@ # Template workflow +## Requirements + + It is recommended that users complete [Getting started with Kubernetes](../L1_getting_started/#requirements) and [Requesting persistent volumes With Kubernetes](../L2_requesting_persistent_volumes/#requirements) before proceeding with this tutorial. + ## Overview An example workflow for code development using K8s is outlined below. @@ -50,7 +54,7 @@ Therefore, the data download step needs to be completed asynchronously as mainta metadata: name: lightweight-job labels: - kueue.x-k8s.io/queue-name: + kueue.x-k8s.io/queue-name: -user-queue spec: completions: 1 parallelism: 1 @@ -123,7 +127,7 @@ Using screen rather than a single download job can be helpful if downloading mul metadata: name: lightweight-job labels: - kueue.x-k8s.io/queue-name: + kueue.x-k8s.io/queue-name: -user-queue spec: completions: 1 parallelism: 1 @@ -155,7 +159,7 @@ Using screen rather than a single download job can be helpful if downloading mul 1. Download data set. Change the curl URL to your data set of interest. ``` bash - kubectl -n exec -- curl https://archive.ics.uci.edu/static/public/53/iris.zip -o /mnt/ceph_rbd/iris.zip + kubectl -n exec -- curl https://archive.ics.uci.edu/static/public/53/iris.zip -o /mnt/ceph_rbd/iris.zip ``` 1. Exit the remote session by either ending the session or `ctrl-a d`. @@ -171,7 +175,7 @@ Using screen rather than a single download job can be helpful if downloading mul 1. Check the download was successful and delete the job. ```bash - kubectl -n exec -- ls /mnt/ceph_rbd/ + kubectl -n exec -- ls /mnt/ceph_rbd/ kubectl -n delete job lightweight-job ``` @@ -209,22 +213,22 @@ This is not an introduction to building docker images, please see the [Docker tu 1. Build the Docker container locally (You will need to install [Docker](https://docs.docker.com/)) ```bash - cd + cd - docker build . -t /template-docker-image:latest + docker build . -t /template-docker-image:latest ``` - > **NOTE** - > - > Be aware that docker images built for Apple ARM64 architectures will not function optimally on the EIDFGPU Service's AMD64 based architecture. - > If building docker images locally on an Apple device you must tell the docker daemon to use AMD64 based images by passing the `--platform linux/amd64` flag to the build function. +!!! important "Building images for different CPU architectures" + Be aware that docker images built for Apple ARM64 architectures will not function optimally on the EIDFGPU Service's AMD64 based architecture. + + If building docker images locally on an Apple device you must tell the docker daemon to use AMD64 based images by passing the `--platform linux/amd64` flag to the build function. 1. Create a repository to hold the image on [Docker Hub](https://hub.docker.com) (You will need to create and setup an account). 1. Push the Docker image to the repository. 
```bash - docker push /template-docker-image:latest + docker push /template-docker-image:latest ``` 1. Finally, specify your Docker image in the `image:` tag of the job specification yaml file. @@ -235,7 +239,7 @@ This is not an introduction to building docker images, please see the [Docker tu metadata: name: template-workflow-job labels: - kueue.x-k8s.io/queue-name: + kueue.x-k8s.io/queue-name: -user-queue spec: completions: 1 parallelism: 1 @@ -244,7 +248,7 @@ This is not an introduction to building docker images, please see the [Docker tu restartPolicy: Never containers: - name: template-docker-image - image: /template-docker-image:latest + image: /template-docker-image:latest command: ["sleep", "infinity"] resources: requests: @@ -335,7 +339,7 @@ A template GitHub repo with sample code, k8s yaml files and a Docker build Githu metadata: name: template-workflow-job labels: - kueue.x-k8s.io/queue-name: + kueue.x-k8s.io/queue-name: -user-queue spec: completions: 1 parallelism: 1 @@ -344,7 +348,7 @@ A template GitHub repo with sample code, k8s yaml files and a Docker build Githu restartPolicy: Never containers: - name: template-docker-image - image: /template-docker-image:latest + image: /template-docker-image:latest command: ["sleep", "infinity"] resources: requests: @@ -370,7 +374,7 @@ A template GitHub repo with sample code, k8s yaml files and a Docker build Githu metadata: name: template-workflow-job labels: - kueue.x-k8s.io/queue-name: + kueue.x-k8s.io/queue-name: -user-queue spec: completions: 1 parallelism: 1 @@ -379,7 +383,7 @@ A template GitHub repo with sample code, k8s yaml files and a Docker build Githu restartPolicy: Never containers: - name: template-docker-image - image: /template-docker-image:latest + image: /template-docker-image:latest command: ["sleep", "infinity"] resources: requests: @@ -425,7 +429,7 @@ Add the URL of the GitHub repo of interest to the `initContainers: command:` tag metadata: name: template-workflow-job labels: - kueue.x-k8s.io/queue-name: + kueue.x-k8s.io/queue-name: -user-queue spec: completions: 1 parallelism: 1 @@ -434,7 +438,7 @@ Add the URL of the GitHub repo of interest to the `initContainers: command:` tag restartPolicy: Never containers: - name: template-docker-image - image: /template-docker-image:latest + image: /template-docker-image:latest command: ['sh', '-c', "python3 /code/"] resources: requests: From c6ae85863108310af144dd4a03d17f68a3a28de2 Mon Sep 17 00:00:00 2001 From: Sam Haynes Date: Wed, 27 Mar 2024 17:40:16 +0000 Subject: [PATCH 58/91] Add workflow lesson to overview table --- docs/services/gpuservice/index.md | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/services/gpuservice/index.md b/docs/services/gpuservice/index.md index 99629d1b0..53bbc6949 100644 --- a/docs/services/gpuservice/index.md +++ b/docs/services/gpuservice/index.md @@ -87,6 +87,7 @@ This tutorial teaches users how to submit tasks to the EIDF GPU Service, but it | [Getting started with Kubernetes](training/L1_getting_started.md) | a. What is Kubernetes?
b. How to send a task to a GPU node.
c. How to define the GPU resources needed. | | [Requesting persistent volumes with Kubernetes](training/L2_requesting_persistent_volumes.md) | a. What is a persistent volume?
b. How to request a PV resource. | | [Running a PyTorch task](training/L3_running_a_pytorch_task.md) | a. Accessing a Pytorch container.
b. Submitting a PyTorch task to the cluster.
c. Inspecting the results. | +| [Template workflow](training/L4_template workflow.md) | a. Loading large data sets asynchronously.
b. Manually or automatically building Docker images.
c. Iteratively changing and testing code in a job. | ## Further Reading and Help From 2da5e49d49070ea5bab0b2e804bfb7a8e0f29a90 Mon Sep 17 00:00:00 2001 From: Sam Haynes Date: Wed, 27 Mar 2024 17:45:38 +0000 Subject: [PATCH 59/91] Fix typos --- docs/services/gpuservice/training/L4_template_workflow.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/services/gpuservice/training/L4_template_workflow.md b/docs/services/gpuservice/training/L4_template_workflow.md index b5a70eede..6362e1704 100644 --- a/docs/services/gpuservice/training/L4_template_workflow.md +++ b/docs/services/gpuservice/training/L4_template_workflow.md @@ -95,7 +95,7 @@ Therefore, the data download step needs to be completed asynchronously as mainta kubectl -n get jobs ``` -1. Delete lightweight job once completed. +1. Delete the lightweight job once completed. ``` bash kubectl -n delete job lightweight-job @@ -200,7 +200,7 @@ This is not an introduction to building docker images, please see the [Docker tu ### Manually building a Docker image locally -1. Select a suitable base image (The [Nvidia container catalog](https://catalog.ngc.nvidia.com/containers) is often a useful starting place for GPU accelerated tasks). We'll use to base [RAPIDS image](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/rapidsai/containers/base). +1. Select a suitable base image (The [Nvidia container catalog](https://catalog.ngc.nvidia.com/containers) is often a useful starting place for GPU accelerated tasks). We'll use the base [RAPIDS image](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/rapidsai/containers/base). 1. Create a [Dockerfile](https://docs.docker.com/engine/reference/builder/) to add any additional packages required to the base image. @@ -263,7 +263,7 @@ This is not an introduction to building docker images, please see the [Docker tu In cases where the Docker image needs to be built and tested iteratively (i.e. to check for comparability issues), git version control and [GitHub Actions](https://github.com/features/actions) can simplify the build process. -A GitHub action can build and push a Docker image to Docker Hub whenever it detects a git push that changes the dockerfile in a git repo. +A GitHub action can build and push a Docker image to Docker Hub whenever it detects a git push that changes the docker file in a git repo. This process requires you to already have a [GitHub](https://github.com) and [Docker Hub](https://hub.docker.com) account. From a87f10c01f54d2db1adf732cc35320167d1a57bb Mon Sep 17 00:00:00 2001 From: awat31 Date: Fri, 5 Apr 2024 15:45:04 +0100 Subject: [PATCH 60/91] Fix EIDF Docs and add SSH Alias Section --- docs/access/ssh.md | 103 ++++++++++++++++++++++++++++++++++----------- 1 file changed, 79 insertions(+), 24 deletions(-) diff --git a/docs/access/ssh.md b/docs/access/ssh.md index cca056b27..84e780582 100644 --- a/docs/access/ssh.md +++ b/docs/access/ssh.md @@ -103,25 +103,14 @@ To enable this for your EIDF account: OpenSSH is installed on Linux and MacOS usually by default, so you can access the gateway natively from the terminal. -Ensure you have created and added an ssh key as specified in the 'Generate a new SSH Key' section above, then follow the below steps: +Ensure you have created and added an ssh key as specified in the 'Generating and Adding an SSH Key' section above, then run the command below. -1. Add your SSH-Key to the SSH-Agent - -``` +```bash ssh-add /path/to/ssh-key +ssh -J [username]@eidf-gateway.epcc.ed.ac.uk [username]@[vm_ip] ``` - -!!! 
info - If the above command fails saying the SSH Agent is not running, run the below command:
- ``` eval `ssh-agent` ```
- Then re-run the ssh-add command above - -Now you can ssh to your VM using the below command - -```bash -ssh -J [EIDF username]@eidf-gateway.epcc.ed.ac.uk [EIDF username]@[vm_ip] - For example: +``` ssh -J alice@eidf-gateway.epcc.ed.ac.uk alice@10.24.1.1 ``` @@ -137,9 +126,9 @@ Windows will require the installation of OpenSSH-Server to use SSH. Putty or Mob 1. Click the ‘Start’ button at the bottom of the screen 1. Click the ‘Settings’ cog icon -1. Search in the top bar ‘Add or Remove Programs’ and select the entry -1. Select the ‘Optional Features’ blue text link -1. If ‘OpenSSH Client’ is not under ‘Installed Features’, click the ‘Add a Feature’ button +1. Select 'System' +1. Select the ‘Optional Features’ option at the bottom of the list +1. If ‘OpenSSH Client’ is not under ‘Installed Features’, click the ‘View Features’ button 1. Search ‘OpenSSH Client’ 1. Select the check box next to ‘OpenSSH Client’ and click ‘Install’ @@ -148,21 +137,87 @@ Windows will require the installation of OpenSSH-Server to use SSH. Putty or Mob !!! warning If this is your first time connecting to EIDF using a new account, you have to set a password as described in [Set or change the password for a user account](../services/virtualmachines/quickstart.md#set-or-change-the-password-for-a-user-account). -1. Open either Powershell (the Windows Terminal) or a WSL Linux Terminal -1. Import the SSH Key you generated above: ```$ ssh-add [/path/to/sshkey]``` -1. This should return "Identity added [Path to SSH Key]" if successful. -1. Login by jumping through the gateway. - +1. Open either Powershell or the Windows Terminal +1. Import the SSH Key you generated above: ```$ ssh-add \path\to\sshkey``` +1. This should return "Identity added [Path to SSH Key]" if successful. If it doesn't, run the following in Powershell: +```powershell +Get-Service -Name ssh-agent | Set-Service -StartupType Manual +Start-Service ssh-agent +ssh-add \path\to\sshkey +``` +1. Login by jumping through the gateway. ```bash ssh -J [EIDF username]@eidf-gateway.epcc.ed.ac.uk [EIDF username]@[vm_ip] - +``` For example: +``` ssh -J alice@eidf-gateway.epcc.ed.ac.uk alice@10.24.1.1 ``` You will be prompted for a 'TOTP' code upon successful public key authentication to the gateway. At the TOTP prompt, enter the code displayed in your MFA Application. +## SSH Aliases + +You can use SSH Aliases to access your VMs with a single word. + +1. Create a new entry for the EIDF-Gateway in your ~/.ssh/config file. In the text editor of your choice (vi used as an example) +``` +vi ~/.ssh/config +``` + +1. Insert the following lines: +``` +Host eidf-gateway + Hostname eidf-gateway.epcc.ed.ac.uk + User + IdentityFile /path/to/ssh/key +``` +For example: +``` +Host eidf-gateway + Hostname eidf-gateway.epcc.ed.ac.uk + User alice + IdentityFile ~/.ssh/id_ed25519 +``` + +1. Save and quit the file. + +1. Now you can ssh to your VM using the below command: +```bash +ssh -J eidf-gateway [EIDF username]@[vm_ip] -i /path/to/ssh/key +``` +For example: +``` +ssh -J eidf-gateway alice@10.24.1.1 -i ~/.ssh/id_ed25519 +``` + +1. You can add further alias options to make accessing your VM quicker. 
For example, if you use the below template to create an entry below the EIDF-Gateway entry in ~/.ssh/config, you can use the alias name to automatically jump through the EIDF-Gateway and onto your VM: + ``` + Host + HostName 10.24.VM.IP + User + IdentityFile /path/to/ssh/key + ProxyCommand ssh eidf-gateway -W %h:%p + ``` + For Example: + ``` + Host demo + HostName 10.24.1.1 + User alice + IdentityFile ~/.ssh/id_ed25519 + ProxyCommand ssh eidf-gateway -W %h:%p + ``` +1. Now, by running `ssh demo` your ssh agent will automatically follow the 'ProxyCommand' section in the 'demo' alias and jump through the gateway before following its own instructions to reach your VM. +

Note for this setup, if your key is RSA, you will need to add the line ```HostKeyAlgorithms +ssh-rsa``` to the bottom of the 'demo' alias. + +!!! info + This has added an 'Alias' entry to your ssh config, so whenever you ssh to 'eidf-gateway' your ssh agent will automatically fill the hostname, your username and ssh key. + This method allows for a much less complicated ssh command to reach your VMs.
+ You can replace the alias name with whatever you like, just change the 'Host' line from saying 'eidf-gateway' to the alias you would like.
+ The `-J` flag is use to specify that we will access the second specified host by jumping through the first specified host. + + ## First Password Setting and Password Resets Before logging in for the first time you have to reset the password using the web form in the EIDF Portal following the instructions in [Set or change the password for a user account](../services/virtualmachines/quickstart.md#set-or-change-the-password-for-a-user-account). From 524e7977eeaf17786489a8c3711e213266e7d46b Mon Sep 17 00:00:00 2001 From: awat31 Date: Fri, 5 Apr 2024 15:48:59 +0100 Subject: [PATCH 61/91] Corrections --- docs/access/ssh.md | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/docs/access/ssh.md b/docs/access/ssh.md index 84e780582..985397dd3 100644 --- a/docs/access/ssh.md +++ b/docs/access/ssh.md @@ -111,6 +111,7 @@ ssh -J [username]@eidf-gateway.epcc.ed.ac.uk [username]@[vm_ip] ``` For example: ``` +ssh-add ~/.ssh/keys/id_ed25519 ssh -J alice@eidf-gateway.epcc.ed.ac.uk alice@10.24.1.1 ``` @@ -138,7 +139,14 @@ Windows will require the installation of OpenSSH-Server to use SSH. Putty or Mob If this is your first time connecting to EIDF using a new account, you have to set a password as described in [Set or change the password for a user account](../services/virtualmachines/quickstart.md#set-or-change-the-password-for-a-user-account). 1. Open either Powershell or the Windows Terminal -1. Import the SSH Key you generated above: ```$ ssh-add \path\to\sshkey``` +1. Import the SSH Key you generated above: + ``` + ssh-add \path\to\sshkey + ``` + For Example + ``` + ssh-add .\.ssh\id_ed25519 + ``` 1. This should return "Identity added [Path to SSH Key]" if successful. If it doesn't, run the following in Powershell: ```powershell Get-Service -Name ssh-agent | Set-Service -StartupType Manual From 9f63b7766bc7ecce9d570d9ab3b98faa197fa41e Mon Sep 17 00:00:00 2001 From: awat31 Date: Fri, 5 Apr 2024 15:51:11 +0100 Subject: [PATCH 62/91] Add eval fix for ssh-agent not running --- docs/access/ssh.md | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/docs/access/ssh.md b/docs/access/ssh.md index 985397dd3..e9d02f0cb 100644 --- a/docs/access/ssh.md +++ b/docs/access/ssh.md @@ -115,6 +115,11 @@ ssh-add ~/.ssh/keys/id_ed25519 ssh -J alice@eidf-gateway.epcc.ed.ac.uk alice@10.24.1.1 ``` +!!! info + If the ```ssh-add``` command fails saying the SSH Agent is not running, run the below command:
+ ``` eval `ssh-agent` ```
+ Then re-run the ssh-add command above + The `-J` flag is use to specify that we will access the second specified host by jumping through the first specified host. You will be prompted for a 'TOTP' code upon successful public key authentication to the gateway. At the TOTP prompt, enter the code displayed in your MFA Application. From 2f9deb961860587bdaeb0b0642b6a338afc8eb98 Mon Sep 17 00:00:00 2001 From: awat31 Date: Fri, 5 Apr 2024 16:00:39 +0100 Subject: [PATCH 63/91] Removed some unnecessary indents --- docs/access/ssh.md | 59 ++++++++++++++++++++++++---------------------- 1 file changed, 31 insertions(+), 28 deletions(-) diff --git a/docs/access/ssh.md b/docs/access/ssh.md index e9d02f0cb..d89d101ad 100644 --- a/docs/access/ssh.md +++ b/docs/access/ssh.md @@ -108,8 +108,10 @@ Ensure you have created and added an ssh key as specified in the 'Generating and ```bash ssh-add /path/to/ssh-key ssh -J [username]@eidf-gateway.epcc.ed.ac.uk [username]@[vm_ip] + ``` For example: + ``` ssh-add ~/.ssh/keys/id_ed25519 ssh -J alice@eidf-gateway.epcc.ed.ac.uk alice@10.24.1.1 @@ -145,13 +147,14 @@ Windows will require the installation of OpenSSH-Server to use SSH. Putty or Mob 1. Open either Powershell or the Windows Terminal 1. Import the SSH Key you generated above: - ``` - ssh-add \path\to\sshkey - ``` - For Example - ``` - ssh-add .\.ssh\id_ed25519 - ``` +``` +ssh-add \path\to\sshkey +``` +For Example +``` +ssh-add .\.ssh\id_ed25519 +``` + 1. This should return "Identity added [Path to SSH Key]" if successful. If it doesn't, run the following in Powershell: ```powershell Get-Service -Name ssh-agent | Set-Service -StartupType Manual @@ -182,16 +185,16 @@ vi ~/.ssh/config 1. Insert the following lines: ``` Host eidf-gateway - Hostname eidf-gateway.epcc.ed.ac.uk - User - IdentityFile /path/to/ssh/key + Hostname eidf-gateway.epcc.ed.ac.uk + User + IdentityFile /path/to/ssh/key ``` For example: ``` Host eidf-gateway - Hostname eidf-gateway.epcc.ed.ac.uk - User alice - IdentityFile ~/.ssh/id_ed25519 + Hostname eidf-gateway.epcc.ed.ac.uk + User alice + IdentityFile ~/.ssh/id_ed25519 ``` 1. Save and quit the file. @@ -206,21 +209,21 @@ ssh -J eidf-gateway alice@10.24.1.1 -i ~/.ssh/id_ed25519 ``` 1. You can add further alias options to make accessing your VM quicker. For example, if you use the below template to create an entry below the EIDF-Gateway entry in ~/.ssh/config, you can use the alias name to automatically jump through the EIDF-Gateway and onto your VM: - ``` - Host - HostName 10.24.VM.IP - User - IdentityFile /path/to/ssh/key - ProxyCommand ssh eidf-gateway -W %h:%p - ``` - For Example: - ``` - Host demo - HostName 10.24.1.1 - User alice - IdentityFile ~/.ssh/id_ed25519 - ProxyCommand ssh eidf-gateway -W %h:%p - ``` +``` +Host + HostName 10.24.VM.IP + User + IdentityFile /path/to/ssh/key + ProxyCommand ssh eidf-gateway -W %h:%p +``` +For Example: +``` +Host demo + HostName 10.24.1.1 + User alice + IdentityFile ~/.ssh/id_ed25519 + ProxyCommand ssh eidf-gateway -W %h:%p +``` 1. Now, by running `ssh demo` your ssh agent will automatically follow the 'ProxyCommand' section in the 'demo' alias and jump through the gateway before following its own instructions to reach your VM.

Note for this setup, if your key is RSA, you will need to add the line ```HostKeyAlgorithms +ssh-rsa``` to the bottom of the 'demo' alias. From d2b13d473c375f2a08cf2319cc50b7d50aee2ac3 Mon Sep 17 00:00:00 2001 From: awat31 Date: Fri, 5 Apr 2024 16:16:01 +0100 Subject: [PATCH 64/91] Fixing codeblocks --- docs/access/ssh.md | 34 ++++++++++++++++------------------ 1 file changed, 16 insertions(+), 18 deletions(-) diff --git a/docs/access/ssh.md b/docs/access/ssh.md index d89d101ad..281d00773 100644 --- a/docs/access/ssh.md +++ b/docs/access/ssh.md @@ -45,11 +45,9 @@ If not, you'll need to generate an SSH-Key, to do this: 1. Open a new window of whatever terminal you will use to SSH to EIDF. 1. Generate a new SSH Key: - - ```bash - ssh-keygen - ``` - +```bash +ssh-keygen +``` 1. It is fine to accept the default name and path for the key unless you manage a number of keys. 1. Press enter to finish generating the key @@ -92,7 +90,7 @@ To enable this for your EIDF account: ### Using the SSH-Key and TOTP Code to access EIDF - Windows and Linux -1. From your local terminal, import the SSH Key you generated above: ```ssh-add /path/to/ssh-key``` +1. From your local terminal, import the SSH Key you generated above:
`ssh-add /path/to/ssh-key` 1. This should return "Identity added [Path to SSH Key]" if successful. You can then follow the steps below to access your VM. @@ -112,13 +110,13 @@ ssh -J [username]@eidf-gateway.epcc.ed.ac.uk [username]@[vm_ip] ``` For example: -``` +```bash ssh-add ~/.ssh/keys/id_ed25519 ssh -J alice@eidf-gateway.epcc.ed.ac.uk alice@10.24.1.1 ``` !!! info - If the ```ssh-add``` command fails saying the SSH Agent is not running, run the below command:
+ If the `ssh-add` command fails saying the SSH Agent is not running, run the below command:
``` eval `ssh-agent` ```
Then re-run the ssh-add command above @@ -147,11 +145,11 @@ Windows will require the installation of OpenSSH-Server to use SSH. Putty or Mob 1. Open either Powershell or the Windows Terminal 1. Import the SSH Key you generated above: -``` +```powershell ssh-add \path\to\sshkey ``` For Example -``` +```powershell ssh-add .\.ssh\id_ed25519 ``` @@ -167,7 +165,7 @@ ssh-add \path\to\sshkey ssh -J [EIDF username]@eidf-gateway.epcc.ed.ac.uk [EIDF username]@[vm_ip] ``` For example: -``` +```bash ssh -J alice@eidf-gateway.epcc.ed.ac.uk alice@10.24.1.1 ``` @@ -178,19 +176,19 @@ You will be prompted for a 'TOTP' code upon successful public key authentication You can use SSH Aliases to access your VMs with a single word. 1. Create a new entry for the EIDF-Gateway in your ~/.ssh/config file. In the text editor of your choice (vi used as an example) -``` +```bash vi ~/.ssh/config ``` 1. Insert the following lines: -``` +```bash Host eidf-gateway Hostname eidf-gateway.epcc.ed.ac.uk User IdentityFile /path/to/ssh/key ``` For example: -``` +```bash Host eidf-gateway Hostname eidf-gateway.epcc.ed.ac.uk User alice @@ -204,7 +202,7 @@ Host eidf-gateway ssh -J eidf-gateway [EIDF username]@[vm_ip] -i /path/to/ssh/key ``` For example: -``` +```bash ssh -J eidf-gateway alice@10.24.1.1 -i ~/.ssh/id_ed25519 ``` @@ -225,7 +223,8 @@ Host demo ProxyCommand ssh eidf-gateway -W %h:%p ``` 1. Now, by running `ssh demo` your ssh agent will automatically follow the 'ProxyCommand' section in the 'demo' alias and jump through the gateway before following its own instructions to reach your VM. -

Note for this setup, if your key is RSA, you will need to add the line ```HostKeyAlgorithms +ssh-rsa``` to the bottom of the 'demo' alias. +

Note for this setup, if your key is RSA, you will need to add the following line to the bottom of the 'demo' alias: +`HostKeyAlgorithms +ssh-rsa` !!! info This has added an 'Alias' entry to your ssh config, so whenever you ssh to 'eidf-gateway' your ssh agent will automatically fill the hostname, your username and ssh key. @@ -233,7 +232,6 @@ Host demo You can replace the alias name with whatever you like, just change the 'Host' line from saying 'eidf-gateway' to the alias you would like.
The `-J` flag is use to specify that we will access the second specified host by jumping through the first specified host. - ## First Password Setting and Password Resets -Before logging in for the first time you have to reset the password using the web form in the EIDF Portal following the instructions in [Set or change the password for a user account](../services/virtualmachines/quickstart.md#set-or-change-the-password-for-a-user-account). +Before logging in for the first time you have to reset the password using the web form in the EIDF Portal following the instructions in [Set or change the password for a user account](../services/virtualmachines/quickstart.md#set-or-change-the-password-for-a-user-account). \ No newline at end of file From 77f7cd5cc0c9101f0574996f9c3f43db51ed2ee3 Mon Sep 17 00:00:00 2001 From: awat31 Date: Fri, 5 Apr 2024 16:34:30 +0100 Subject: [PATCH 65/91] Fixing Indentation --- docs/access/ssh.md | 146 ++++++++++++++++++++++++++------------------- 1 file changed, 84 insertions(+), 62 deletions(-) diff --git a/docs/access/ssh.md b/docs/access/ssh.md index 281d00773..bff45f632 100644 --- a/docs/access/ssh.md +++ b/docs/access/ssh.md @@ -45,9 +45,12 @@ If not, you'll need to generate an SSH-Key, to do this: 1. Open a new window of whatever terminal you will use to SSH to EIDF. 1. Generate a new SSH Key: -```bash -ssh-keygen -``` + + + ```bash + ssh-keygen + ``` + 1. It is fine to accept the default name and path for the key unless you manage a number of keys. 1. Press enter to finish generating the key @@ -101,13 +104,13 @@ To enable this for your EIDF account: OpenSSH is installed on Linux and MacOS usually by default, so you can access the gateway natively from the terminal. -Ensure you have created and added an ssh key as specified in the 'Generating and Adding an SSH Key' section above, then run the command below. +Ensure you have created and added an ssh key as specified in the 'Generating and Adding an SSH Key' section above, then run the commands below: ```bash ssh-add /path/to/ssh-key ssh -J [username]@eidf-gateway.epcc.ed.ac.uk [username]@[vm_ip] - ``` + For example: ```bash @@ -145,29 +148,37 @@ Windows will require the installation of OpenSSH-Server to use SSH. Putty or Mob 1. Open either Powershell or the Windows Terminal 1. Import the SSH Key you generated above: -```powershell -ssh-add \path\to\sshkey -``` -For Example -```powershell -ssh-add .\.ssh\id_ed25519 -``` + + ```powershell + ssh-add \path\to\sshkey + ``` + + For Example + + ```powershell + ssh-add .\.ssh\id_ed25519 + ``` 1. This should return "Identity added [Path to SSH Key]" if successful. If it doesn't, run the following in Powershell: -```powershell -Get-Service -Name ssh-agent | Set-Service -StartupType Manual -Start-Service ssh-agent -ssh-add \path\to\sshkey -``` + + ```powershell + Get-Service -Name ssh-agent | Set-Service -StartupType Manual + Start-Service ssh-agent + ssh-add \path\to\sshkey + ``` 1. Login by jumping through the gateway. -```bash -ssh -J [EIDF username]@eidf-gateway.epcc.ed.ac.uk [EIDF username]@[vm_ip] -``` -For example: -```bash -ssh -J alice@eidf-gateway.epcc.ed.ac.uk alice@10.24.1.1 -``` + + + ```bash + ssh -J [EIDF username]@eidf-gateway.epcc.ed.ac.uk [EIDF username]@[vm_ip] + ``` + + For example: + + ```bash + ssh -J alice@eidf-gateway.epcc.ed.ac.uk alice@10.24.1.1 + ``` You will be prompted for a 'TOTP' code upon successful public key authentication to the gateway. At the TOTP prompt, enter the code displayed in your MFA Application. 
@@ -176,52 +187,63 @@ You will be prompted for a 'TOTP' code upon successful public key authentication You can use SSH Aliases to access your VMs with a single word. 1. Create a new entry for the EIDF-Gateway in your ~/.ssh/config file. In the text editor of your choice (vi used as an example) -```bash -vi ~/.ssh/config -``` + + ```bash + vi ~/.ssh/config + ``` 1. Insert the following lines: -```bash -Host eidf-gateway - Hostname eidf-gateway.epcc.ed.ac.uk - User - IdentityFile /path/to/ssh/key -``` -For example: -```bash -Host eidf-gateway - Hostname eidf-gateway.epcc.ed.ac.uk - User alice - IdentityFile ~/.ssh/id_ed25519 -``` + + ```bash + Host eidf-gateway + Hostname eidf-gateway.epcc.ed.ac.uk + User + IdentityFile /path/to/ssh/key + ``` + + For example: + + ```bash + Host eidf-gateway + Hostname eidf-gateway.epcc.ed.ac.uk + User alice + IdentityFile ~/.ssh/id_ed25519 + ``` 1. Save and quit the file. 1. Now you can ssh to your VM using the below command: -```bash -ssh -J eidf-gateway [EIDF username]@[vm_ip] -i /path/to/ssh/key -``` -For example: -```bash -ssh -J eidf-gateway alice@10.24.1.1 -i ~/.ssh/id_ed25519 -``` + + ```bash + ssh -J eidf-gateway [EIDF username]@[vm_ip] -i /path/to/ssh/key + ``` + + For example: + + ```bash + ssh -J eidf-gateway alice@10.24.1.1 -i ~/.ssh/id_ed25519 + ``` 1. You can add further alias options to make accessing your VM quicker. For example, if you use the below template to create an entry below the EIDF-Gateway entry in ~/.ssh/config, you can use the alias name to automatically jump through the EIDF-Gateway and onto your VM: -``` -Host - HostName 10.24.VM.IP - User - IdentityFile /path/to/ssh/key - ProxyCommand ssh eidf-gateway -W %h:%p -``` -For Example: -``` -Host demo - HostName 10.24.1.1 - User alice - IdentityFile ~/.ssh/id_ed25519 - ProxyCommand ssh eidf-gateway -W %h:%p -``` + + ``` + Host + HostName 10.24.VM.IP + User + IdentityFile /path/to/ssh/key + ProxyCommand ssh eidf-gateway -W %h:%p + ``` + + For Example: + + ``` + Host demo + HostName 10.24.1.1 + User alice + IdentityFile ~/.ssh/id_ed25519 + ProxyCommand ssh eidf-gateway -W %h:%p + ``` + 1. Now, by running `ssh demo` your ssh agent will automatically follow the 'ProxyCommand' section in the 'demo' alias and jump through the gateway before following its own instructions to reach your VM.

Note for this setup, if your key is RSA, you will need to add the following line to the bottom of the 'demo' alias: `HostKeyAlgorithms +ssh-rsa` From a9084aab9fe7187ec0498fa519a38c3a12ab8664 Mon Sep 17 00:00:00 2001 From: awat31 Date: Fri, 5 Apr 2024 16:42:31 +0100 Subject: [PATCH 66/91] Fixing Indentation and tabs --- docs/access/ssh.md | 20 +++++++++----------- 1 file changed, 9 insertions(+), 11 deletions(-) diff --git a/docs/access/ssh.md b/docs/access/ssh.md index bff45f632..b5d0c003f 100644 --- a/docs/access/ssh.md +++ b/docs/access/ssh.md @@ -46,7 +46,6 @@ If not, you'll need to generate an SSH-Key, to do this: 1. Open a new window of whatever terminal you will use to SSH to EIDF. 1. Generate a new SSH Key: - ```bash ssh-keygen ``` @@ -169,11 +168,10 @@ Windows will require the installation of OpenSSH-Server to use SSH. Putty or Mob 1. Login by jumping through the gateway. - ```bash ssh -J [EIDF username]@eidf-gateway.epcc.ed.ac.uk [EIDF username]@[vm_ip] ``` - + For example: ```bash @@ -187,22 +185,22 @@ You will be prompted for a 'TOTP' code upon successful public key authentication You can use SSH Aliases to access your VMs with a single word. 1. Create a new entry for the EIDF-Gateway in your ~/.ssh/config file. In the text editor of your choice (vi used as an example) - + ```bash vi ~/.ssh/config ``` 1. Insert the following lines: - + ```bash Host eidf-gateway Hostname eidf-gateway.epcc.ed.ac.uk User IdentityFile /path/to/ssh/key ``` - + For example: - + ```bash Host eidf-gateway Hostname eidf-gateway.epcc.ed.ac.uk @@ -213,11 +211,11 @@ You can use SSH Aliases to access your VMs with a single word. 1. Save and quit the file. 1. Now you can ssh to your VM using the below command: - + ```bash ssh -J eidf-gateway [EIDF username]@[vm_ip] -i /path/to/ssh/key ``` - + For example: ```bash @@ -225,7 +223,7 @@ You can use SSH Aliases to access your VMs with a single word. ``` 1. You can add further alias options to make accessing your VM quicker. For example, if you use the below template to create an entry below the EIDF-Gateway entry in ~/.ssh/config, you can use the alias name to automatically jump through the EIDF-Gateway and onto your VM: - + ``` Host HostName 10.24.VM.IP @@ -235,7 +233,7 @@ You can use SSH Aliases to access your VMs with a single word. ``` For Example: - + ``` Host demo HostName 10.24.1.1 From 3914eef3b5539e1d3ad5e1674e8d0fde85215a1c Mon Sep 17 00:00:00 2001 From: awat31 Date: Fri, 5 Apr 2024 16:48:46 +0100 Subject: [PATCH 67/91] Fixing Indentation and tabs --- docs/access/ssh.md | 35 +++++++++++++++++------------------ 1 file changed, 17 insertions(+), 18 deletions(-) diff --git a/docs/access/ssh.md b/docs/access/ssh.md index b5d0c003f..c0320ad88 100644 --- a/docs/access/ssh.md +++ b/docs/access/ssh.md @@ -119,8 +119,10 @@ ssh -J alice@eidf-gateway.epcc.ed.ac.uk alice@10.24.1.1 !!! info If the `ssh-add` command fails saying the SSH Agent is not running, run the below command:
- ``` eval `ssh-agent` ```
- Then re-run the ssh-add command above + + ``` eval `ssh-agent` ``` + + Then re-run the ssh-add command above. The `-J` flag is use to specify that we will access the second specified host by jumping through the first specified host. @@ -150,11 +152,8 @@ Windows will require the installation of OpenSSH-Server to use SSH. Putty or Mob ```powershell ssh-add \path\to\sshkey - ``` - - For Example - - ```powershell + + For Example: ssh-add .\.ssh\id_ed25519 ``` @@ -170,11 +169,8 @@ Windows will require the installation of OpenSSH-Server to use SSH. Putty or Mob ```bash ssh -J [EIDF username]@eidf-gateway.epcc.ed.ac.uk [EIDF username]@[vm_ip] - ``` - For example: - - ```bash + For Example: ssh -J alice@eidf-gateway.epcc.ed.ac.uk alice@10.24.1.1 ``` @@ -187,25 +183,31 @@ You can use SSH Aliases to access your VMs with a single word. 1. Create a new entry for the EIDF-Gateway in your ~/.ssh/config file. In the text editor of your choice (vi used as an example) ```bash + vi ~/.ssh/config + ``` 1. Insert the following lines: ```bash + Host eidf-gateway Hostname eidf-gateway.epcc.ed.ac.uk User IdentityFile /path/to/ssh/key + ``` For example: ```bash + Host eidf-gateway Hostname eidf-gateway.epcc.ed.ac.uk User alice IdentityFile ~/.ssh/id_ed25519 + ``` 1. Save and quit the file. @@ -213,28 +215,25 @@ You can use SSH Aliases to access your VMs with a single word. 1. Now you can ssh to your VM using the below command: ```bash - ssh -J eidf-gateway [EIDF username]@[vm_ip] -i /path/to/ssh/key - ``` - For example: + ssh -J eidf-gateway [EIDF username]@[vm_ip] -i /path/to/ssh/key - ```bash + For Example: ssh -J eidf-gateway alice@10.24.1.1 -i ~/.ssh/id_ed25519 ``` 1. You can add further alias options to make accessing your VM quicker. For example, if you use the below template to create an entry below the EIDF-Gateway entry in ~/.ssh/config, you can use the alias name to automatically jump through the EIDF-Gateway and onto your VM: ``` + Host HostName 10.24.VM.IP User IdentityFile /path/to/ssh/key ProxyCommand ssh eidf-gateway -W %h:%p - ``` For Example: - - ``` + Host demo HostName 10.24.1.1 User alice From 887b934ed2693487dff6c2f732ec94d51bc29a64 Mon Sep 17 00:00:00 2001 From: awat31 Date: Fri, 5 Apr 2024 16:54:04 +0100 Subject: [PATCH 68/91] Fixing Indentation and tabs --- docs/access/ssh.md | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/docs/access/ssh.md b/docs/access/ssh.md index c0320ad88..0e480dd1d 100644 --- a/docs/access/ssh.md +++ b/docs/access/ssh.md @@ -151,27 +151,33 @@ Windows will require the installation of OpenSSH-Server to use SSH. Putty or Mob 1. Import the SSH Key you generated above: ```powershell + ssh-add \path\to\sshkey For Example: ssh-add .\.ssh\id_ed25519 + ``` 1. This should return "Identity added [Path to SSH Key]" if successful. If it doesn't, run the following in Powershell: ```powershell + Get-Service -Name ssh-agent | Set-Service -StartupType Manual Start-Service ssh-agent ssh-add \path\to\sshkey + ``` 1. Login by jumping through the gateway. ```bash + ssh -J [EIDF username]@eidf-gateway.epcc.ed.ac.uk [EIDF username]@[vm_ip] For Example: ssh -J alice@eidf-gateway.epcc.ed.ac.uk alice@10.24.1.1 + ``` You will be prompted for a 'TOTP' code upon successful public key authentication to the gateway. At the TOTP prompt, enter the code displayed in your MFA Application. @@ -220,6 +226,7 @@ You can use SSH Aliases to access your VMs with a single word. For Example: ssh -J eidf-gateway alice@10.24.1.1 -i ~/.ssh/id_ed25519 + ``` 1. 
You can add further alias options to make accessing your VM quicker. For example, if you use the below template to create an entry below the EIDF-Gateway entry in ~/.ssh/config, you can use the alias name to automatically jump through the EIDF-Gateway and onto your VM: @@ -233,12 +240,13 @@ You can use SSH Aliases to access your VMs with a single word. ProxyCommand ssh eidf-gateway -W %h:%p For Example: - + Host demo HostName 10.24.1.1 User alice IdentityFile ~/.ssh/id_ed25519 ProxyCommand ssh eidf-gateway -W %h:%p + ``` 1. Now, by running `ssh demo` your ssh agent will automatically follow the 'ProxyCommand' section in the 'demo' alias and jump through the gateway before following its own instructions to reach your VM. From 083a11e8d65b7ec150c6a2e9a3c5e516925e8896 Mon Sep 17 00:00:00 2001 From: Amy Krause Date: Fri, 5 Apr 2024 16:56:31 +0100 Subject: [PATCH 69/91] trim whitespaces --- docs/access/ssh.md | 14 +++++++------- docs/access/virtualmachines-vdi.md | 1 - 2 files changed, 7 insertions(+), 8 deletions(-) diff --git a/docs/access/ssh.md b/docs/access/ssh.md index 0e480dd1d..499202b74 100644 --- a/docs/access/ssh.md +++ b/docs/access/ssh.md @@ -119,9 +119,9 @@ ssh -J alice@eidf-gateway.epcc.ed.ac.uk alice@10.24.1.1 !!! info If the `ssh-add` command fails saying the SSH Agent is not running, run the below command:
- + ``` eval `ssh-agent` ``` - + Then re-run the ssh-add command above. The `-J` flag is use to specify that we will access the second specified host by jumping through the first specified host. @@ -148,12 +148,12 @@ Windows will require the installation of OpenSSH-Server to use SSH. Putty or Mob If this is your first time connecting to EIDF using a new account, you have to set a password as described in [Set or change the password for a user account](../services/virtualmachines/quickstart.md#set-or-change-the-password-for-a-user-account). 1. Open either Powershell or the Windows Terminal -1. Import the SSH Key you generated above: +1. Import the SSH Key you generated above: ```powershell ssh-add \path\to\sshkey - + For Example: ssh-add .\.ssh\id_ed25519 @@ -246,12 +246,12 @@ You can use SSH Aliases to access your VMs with a single word. User alice IdentityFile ~/.ssh/id_ed25519 ProxyCommand ssh eidf-gateway -W %h:%p - + ``` 1. Now, by running `ssh demo` your ssh agent will automatically follow the 'ProxyCommand' section in the 'demo' alias and jump through the gateway before following its own instructions to reach your VM.

Note for this setup, if your key is RSA, you will need to add the following line to the bottom of the 'demo' alias: -`HostKeyAlgorithms +ssh-rsa` +`HostKeyAlgorithms +ssh-rsa` !!! info This has added an 'Alias' entry to your ssh config, so whenever you ssh to 'eidf-gateway' your ssh agent will automatically fill the hostname, your username and ssh key. @@ -261,4 +261,4 @@ You can use SSH Aliases to access your VMs with a single word. ## First Password Setting and Password Resets -Before logging in for the first time you have to reset the password using the web form in the EIDF Portal following the instructions in [Set or change the password for a user account](../services/virtualmachines/quickstart.md#set-or-change-the-password-for-a-user-account). \ No newline at end of file +Before logging in for the first time you have to reset the password using the web form in the EIDF Portal following the instructions in [Set or change the password for a user account](../services/virtualmachines/quickstart.md#set-or-change-the-password-for-a-user-account). diff --git a/docs/access/virtualmachines-vdi.md b/docs/access/virtualmachines-vdi.md index abc7a18a2..390e4007b 100644 --- a/docs/access/virtualmachines-vdi.md +++ b/docs/access/virtualmachines-vdi.md @@ -85,4 +85,3 @@ For users who do not have standard `English (UK)` keyboard layouts, key presses are transmitted to your VM. Please contact the EIDF helpdesk at [eidf@epcc.ed.ac.uk](mailto:eidf@epcc.ed.ac.uk) if you are experiencing difficulties with your keyboard mapping, and we will help to resolve this by changing some settings in the Guacamole VDI connection configuration. - From 3bfb5286703f76c2c5275f88c98c4f315eb1b84c Mon Sep 17 00:00:00 2001 From: awat31 Date: Fri, 5 Apr 2024 16:58:47 +0100 Subject: [PATCH 70/91] Final Aesthetic fixes --- docs/access/ssh.md | 13 +++++++++++-- 1 file changed, 11 insertions(+), 2 deletions(-) diff --git a/docs/access/ssh.md b/docs/access/ssh.md index 0e480dd1d..890ca7d40 100644 --- a/docs/access/ssh.md +++ b/docs/access/ssh.md @@ -186,7 +186,7 @@ You will be prompted for a 'TOTP' code upon successful public key authentication You can use SSH Aliases to access your VMs with a single word. -1. Create a new entry for the EIDF-Gateway in your ~/.ssh/config file. In the text editor of your choice (vi used as an example) +1. Create a new entry for the EIDF-Gateway in your ~/.ssh/config file. Using the text editor of your choice (vi used as an example), edit the .ssh/config file: ```bash @@ -224,7 +224,12 @@ You can use SSH Aliases to access your VMs with a single word. ssh -J eidf-gateway [EIDF username]@[vm_ip] -i /path/to/ssh/key + ``` + For Example: + + ``` + ssh -J eidf-gateway alice@10.24.1.1 -i ~/.ssh/id_ed25519 ``` @@ -239,14 +244,18 @@ You can use SSH Aliases to access your VMs with a single word. IdentityFile /path/to/ssh/key ProxyCommand ssh eidf-gateway -W %h:%p + ``` + For Example: + ``` + Host demo HostName 10.24.1.1 User alice IdentityFile ~/.ssh/id_ed25519 ProxyCommand ssh eidf-gateway -W %h:%p - + ``` 1. Now, by running `ssh demo` your ssh agent will automatically follow the 'ProxyCommand' section in the 'demo' alias and jump through the gateway before following its own instructions to reach your VM. 
From d3522d8445b9baa96dac424482d7ffee346f52bb Mon Sep 17 00:00:00 2001 From: awat31 Date: Fri, 5 Apr 2024 17:01:11 +0100 Subject: [PATCH 71/91] Remove whitespace --- docs/access/ssh.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/access/ssh.md b/docs/access/ssh.md index 1bb8b769f..c62bd7b46 100644 --- a/docs/access/ssh.md +++ b/docs/access/ssh.md @@ -270,4 +270,4 @@ You can use SSH Aliases to access your VMs with a single word. ## First Password Setting and Password Resets -Before logging in for the first time you have to reset the password using the web form in the EIDF Portal following the instructions in [Set or change the password for a user account](../services/virtualmachines/quickstart.md#set-or-change-the-password-for-a-user-account). +Before logging in for the first time you have to reset the password using the web form in the EIDF Portal following the instructions in [Set or change the password for a user account](../services/virtualmachines/quickstart.md#set-or-change-the-password-for-a-user-account). \ No newline at end of file From 2be81d075dc126b34073939af2af68297033ce5a Mon Sep 17 00:00:00 2001 From: awat31 Date: Fri, 5 Apr 2024 17:02:19 +0100 Subject: [PATCH 72/91] Remove whitespace --- docs/access/ssh.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/access/ssh.md b/docs/access/ssh.md index c62bd7b46..788db1c67 100644 --- a/docs/access/ssh.md +++ b/docs/access/ssh.md @@ -249,7 +249,7 @@ You can use SSH Aliases to access your VMs with a single word. For Example: ``` - + Host demo HostName 10.24.1.1 User alice From cf81e5cff62a79fc42ef86d6442842ce416d519c Mon Sep 17 00:00:00 2001 From: Amy Krause Date: Fri, 5 Apr 2024 17:03:41 +0100 Subject: [PATCH 73/91] fix whitespace --- docs/access/ssh.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/access/ssh.md b/docs/access/ssh.md index 788db1c67..cd00a4666 100644 --- a/docs/access/ssh.md +++ b/docs/access/ssh.md @@ -245,7 +245,7 @@ You can use SSH Aliases to access your VMs with a single word. ProxyCommand ssh eidf-gateway -W %h:%p ``` - + For Example: ``` @@ -270,4 +270,4 @@ You can use SSH Aliases to access your VMs with a single word. ## First Password Setting and Password Resets -Before logging in for the first time you have to reset the password using the web form in the EIDF Portal following the instructions in [Set or change the password for a user account](../services/virtualmachines/quickstart.md#set-or-change-the-password-for-a-user-account). \ No newline at end of file +Before logging in for the first time you have to reset the password using the web form in the EIDF Portal following the instructions in [Set or change the password for a user account](../services/virtualmachines/quickstart.md#set-or-change-the-password-for-a-user-account). From e5d7cfc481cc94322a835a495ec5d6aeab89f4b8 Mon Sep 17 00:00:00 2001 From: Samuel Joseph Haynes <37002508+DimmestP@users.noreply.github.com> Date: Tue, 9 Apr 2024 15:24:08 +0100 Subject: [PATCH 74/91] Removed reference to devspace and new line in bullet point. 
--- docs/services/gpuservice/training/L4_template_workflow.md | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/docs/services/gpuservice/training/L4_template_workflow.md b/docs/services/gpuservice/training/L4_template_workflow.md index 6362e1704..73348097f 100644 --- a/docs/services/gpuservice/training/L4_template_workflow.md +++ b/docs/services/gpuservice/training/L4_template_workflow.md @@ -325,8 +325,6 @@ You must already have a [GitHub](https://github.com) account to follow this proc This process allows code development to be conducted on any device/VM with access to the repo (GitHub/GitLab). -An alternative method for remote code development using the DevSpace toolkit is described is the next lesson, [Getting started with DevSpace](L5_devspace.md). - A template GitHub repo with sample code, k8s yaml files and a Docker build Github Action is available [here](https://github.com/DimmestP/template-EIDFGPU-workflow). ### Create a job that downloads and runs the latest code version at runtime @@ -420,8 +418,7 @@ A template GitHub repo with sample code, k8s yaml files and a Docker build Githu sizeLimit: 1Gi ``` -1. Change the command argument in the main container to run the code once started. -Add the URL of the GitHub repo of interest to the `initContainers: command:` tag. +1. Change the command argument in the main container to run the code once started. Add the URL of the GitHub repo of interest to the `initContainers: command:` tag. ```yaml apiVersion: batch/v1 From d9a80c84f0452dd43cb87e6ffdcfa508b6e30f5e Mon Sep 17 00:00:00 2001 From: Sam Haynes Date: Tue, 9 Apr 2024 15:42:59 +0100 Subject: [PATCH 75/91] Fixed whitespaces and incorrect link to lesson 4 in summary table --- docs/access/virtualmachines-vdi.md | 1 - docs/services/gpuservice/index.md | 2 +- .../training/L4_template_workflow.md | 36 +++++++++---------- 3 files changed, 19 insertions(+), 20 deletions(-) diff --git a/docs/access/virtualmachines-vdi.md b/docs/access/virtualmachines-vdi.md index abc7a18a2..390e4007b 100644 --- a/docs/access/virtualmachines-vdi.md +++ b/docs/access/virtualmachines-vdi.md @@ -85,4 +85,3 @@ For users who do not have standard `English (UK)` keyboard layouts, key presses are transmitted to your VM. Please contact the EIDF helpdesk at [eidf@epcc.ed.ac.uk](mailto:eidf@epcc.ed.ac.uk) if you are experiencing difficulties with your keyboard mapping, and we will help to resolve this by changing some settings in the Guacamole VDI connection configuration. - diff --git a/docs/services/gpuservice/index.md b/docs/services/gpuservice/index.md index 53bbc6949..bca3f0dea 100644 --- a/docs/services/gpuservice/index.md +++ b/docs/services/gpuservice/index.md @@ -87,7 +87,7 @@ This tutorial teaches users how to submit tasks to the EIDF GPU Service, but it | [Getting started with Kubernetes](training/L1_getting_started.md) | a. What is Kubernetes?
b. How to send a task to a GPU node.
c. How to define the GPU resources needed. | | [Requesting persistent volumes with Kubernetes](training/L2_requesting_persistent_volumes.md) | a. What is a persistent volume?
b. How to request a PV resource. | | [Running a PyTorch task](training/L3_running_a_pytorch_task.md) | a. Accessing a Pytorch container.
b. Submitting a PyTorch task to the cluster.
c. Inspecting the results. | -| [Template workflow](training/L4_template workflow.md) | a. Loading large data sets asynchronously.
b. Manually or automatically building Docker images.
c. Iteratively changing and testing code in a job. | +| [Template workflow](training/L4_template_workflow.md) | a. Loading large data sets asynchronously.
b. Manually or automatically building Docker images.
c. Iteratively changing and testing code in a job. | ## Further Reading and Help diff --git a/docs/services/gpuservice/training/L4_template_workflow.md b/docs/services/gpuservice/training/L4_template_workflow.md index 73348097f..8c410c839 100644 --- a/docs/services/gpuservice/training/L4_template_workflow.md +++ b/docs/services/gpuservice/training/L4_template_workflow.md @@ -3,7 +3,7 @@ ## Requirements It is recommended that users complete [Getting started with Kubernetes](../L1_getting_started/#requirements) and [Requesting persistent volumes With Kubernetes](../L2_requesting_persistent_volumes/#requirements) before proceeding with this tutorial. - + ## Overview An example workflow for code development using K8s is outlined below. @@ -20,7 +20,7 @@ Therefore, it is recommended to separate code, software, and data preparation in 1. Code development with K8s: Iteratively changing and testing code in a job. -The workflow describes different strategies to tackle the three common stages in code development and analysis using the EIDF GPU Service. +The workflow describes different strategies to tackle the three common stages in code development and analysis using the EIDF GPU Service. The three stages are interchangeable and may not be relevant to every project. @@ -45,7 +45,7 @@ Therefore, the data download step needs to be completed asynchronously as mainta ``` bash kubectl -n get pvc template-workflow-pvc ``` - + 1. Write a job yaml with PV mounted and a command to download the data. Change the curl URL to your data set of interest. ``` yaml @@ -55,7 +55,7 @@ Therefore, the data download step needs to be completed asynchronously as mainta name: lightweight-job labels: kueue.x-k8s.io/queue-name: -user-queue - spec: + spec: completions: 1 parallelism: 1 template: @@ -105,7 +105,7 @@ Therefore, the data download step needs to be completed asynchronously as mainta [Screen](https://www.gnu.org/software/screen/manual/screen.html#Overview) is a window manager available in Linux that allows you to create multiple interactive shells and swap between then. -Screen has the added benefit that if your remote session is interrupted the screen session persists and can be reattached when you manage to reconnect. +Screen has the added benefit that if your remote session is interrupted the screen session persists and can be reattached when you manage to reconnect. This allows you to start a task, such as downloading a data set, and check in on it asynchronously. @@ -128,7 +128,7 @@ Using screen rather than a single download job can be helpful if downloading mul name: lightweight-job labels: kueue.x-k8s.io/queue-name: -user-queue - spec: + spec: completions: 1 parallelism: 1 template: @@ -162,13 +162,13 @@ Using screen rather than a single download job can be helpful if downloading mul kubectl -n exec -- curl https://archive.ics.uci.edu/static/public/53/iris.zip -o /mnt/ceph_rbd/iris.zip ``` -1. Exit the remote session by either ending the session or `ctrl-a d`. +1. Exit the remote session by either ending the session or `ctrl-a d`. 1. Reconnect at a later time and reattach the screen window. 
- + ```bash screen -list - + screen -r ``` @@ -176,7 +176,7 @@ Using screen rather than a single download job can be helpful if downloading mul ```bash kubectl -n exec -- ls /mnt/ceph_rbd/ - + kubectl -n delete job lightweight-job ``` @@ -188,7 +188,7 @@ Using screen rather than a single download job can be helpful if downloading mul ## Preparing a custom Docker image -Kubernetes requires Docker images to be pre-built and available for download from a container repository such as Docker Hub. +Kubernetes requires Docker images to be pre-built and available for download from a container repository such as Docker Hub. It does not provide functionality to build images and create pods from docker files. @@ -214,15 +214,15 @@ This is not an introduction to building docker images, please see the [Docker tu ```bash cd - + docker build . -t /template-docker-image:latest ``` - + !!! important "Building images for different CPU architectures" Be aware that docker images built for Apple ARM64 architectures will not function optimally on the EIDFGPU Service's AMD64 based architecture. - If building docker images locally on an Apple device you must tell the docker daemon to use AMD64 based images by passing the `--platform linux/amd64` flag to the build function. - + If building docker images locally on an Apple device you must tell the docker daemon to use AMD64 based images by passing the `--platform linux/amd64` flag to the build function. + 1. Create a repository to hold the image on [Docker Hub](https://hub.docker.com) (You will need to create and setup an account). 1. Push the Docker image to the repository. @@ -230,7 +230,7 @@ This is not an introduction to building docker images, please see the [Docker tu ```bash docker push /template-docker-image:latest ``` - + 1. Finally, specify your Docker image in the `image:` tag of the job specification yaml file. ```yaml @@ -258,7 +258,7 @@ This is not an introduction to building docker images, please see the [Docker tu cpu: 1 memory: "8Gi" ``` - + ### Automatically building docker images using GitHub Actions In cases where the Docker image needs to be built and tested iteratively (i.e. to check for comparability issues), git version control and [GitHub Actions](https://github.com/features/actions) can simplify the build process. @@ -267,7 +267,7 @@ A GitHub action can build and push a Docker image to Docker Hub whenever it dete This process requires you to already have a [GitHub](https://github.com) and [Docker Hub](https://hub.docker.com) account. -1. Create an [access token](https://docs.docker.com/security/for-developers/access-tokens/) on your Docker Hub account to allow GitHub to push changes to the Docker Hub image repo. +1. Create an [access token](https://docs.docker.com/security/for-developers/access-tokens/) on your Docker Hub account to allow GitHub to push changes to the Docker Hub image repo. 1. Create two [GitHub secrets](https://docs.github.com/en/actions/security-guides/using-secrets-in-github-actions) to securely provide your Docker Hub username and access token. 
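For illustration, a GitHub Actions workflow along these lines might look like the sketch below. This is not the workflow shipped in the template repository: the secret names (`DOCKERHUB_USERNAME`, `DOCKERHUB_TOKEN`), the image name and the action versions are assumptions to adapt to your own repository.

```yaml
# .github/workflows/docker-build.yml -- illustrative sketch only
name: Build and push Docker image

on:
  push:
    branches: [main]

jobs:
  build-and-push:
    runs-on: ubuntu-latest
    steps:
      # Check out the repository containing the Dockerfile
      - uses: actions/checkout@v4

      # Log in to Docker Hub using the two repository secrets
      - uses: docker/login-action@v3
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}

      # Enable Buildx so the image can be built for a named platform
      - uses: docker/setup-buildx-action@v3

      # Build for the service's AMD64 nodes and push to Docker Hub
      - uses: docker/build-push-action@v5
        with:
          context: .
          platforms: linux/amd64
          push: true
          tags: ${{ secrets.DOCKERHUB_USERNAME }}/template-docker-image:latest
```

Pinning `platforms: linux/amd64` mirrors the earlier note about building images for the service's AMD64-based nodes rather than Apple ARM64 hosts.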
From a1db9537ba2fba38a2c69bbfdf3ee6bcb352322a Mon Sep 17 00:00:00 2001 From: Sam Haynes Date: Thu, 21 Mar 2024 12:30:31 +0000 Subject: [PATCH 76/91] Clarify namespace flag in kubectl usage --- docs/services/gpuservice/faq.md | 6 +++++ docs/services/gpuservice/index.md | 45 ++++++++++++++++++++++++------- 2 files changed, 41 insertions(+), 10 deletions(-) diff --git a/docs/services/gpuservice/faq.md b/docs/services/gpuservice/faq.md index 456870b7a..ccec549ab 100644 --- a/docs/services/gpuservice/faq.md +++ b/docs/services/gpuservice/faq.md @@ -10,6 +10,12 @@ The default access route to the GPU Service is via an EIDF DSC VM. The DSC VM wi Project Leads and Managers can access the kubeconfig file from the Project page in the Portal. Project Leads and Managers can provide the file on any of the project VMs or give it to individuals within the project. +### Access to GPU Service resources in default namespace is 'Forbidden' + +```Error from server (Forbidden): error when creating : jobs is forbidden: User cannot create resource "jobs" in API group "" in the namespace "default"``` + +Some version of the above error is common when submitting jobs/pods to the GPU cluster using the kubectl command. This arises when you forgot to specify you are submitting job/pods to your project namespace, not the "default" namespace which you do not have permissions to use. Resubmitting the job/pod with `kubectl -n create ` should solve the issue. + ### I can't mount my PVC in multiple containers or pods at the same time The current PVC provisioner is based on Ceph RBD. The block devices provided by Ceph to the Kubernetes PV/PVC providers cannot be mounted in multiple pods at the same time. They can only be accessed by one pod at a time, once a pod has unmounted the PVC and terminated, the PVC can be reused by another pod. The service development team is working on new PVC provider systems to alleviate this limitation. diff --git a/docs/services/gpuservice/index.md b/docs/services/gpuservice/index.md index bca3f0dea..6ac8f5b6c 100644 --- a/docs/services/gpuservice/index.md +++ b/docs/services/gpuservice/index.md @@ -9,7 +9,7 @@ The EIDF GPU Service hosts 3G.20GB and 1G.5GB MIG variants which are approximate The service provides access to: - Nvidia A100 40GB -- Nvidia 80GB +- Nvidia A100 80GB - Nvidia MIG A100 1G.5GB - Nvidia MIG A100 3G.20GB - Nvidia H100 80GB @@ -27,6 +27,7 @@ The current full specification of the EIDF GPU Service as of 14 February 2024: - 32 Nvidia H100 80 GB !!! important "Quotas" + This is the full configuration of the cluster. Each project will have access to a quota across this shared configuration. @@ -40,16 +41,29 @@ The current full specification of the EIDF GPU Service as of 14 February 2024: ## Service Access -Users should have an [EIDF Account](../../access/project.md). +Users should have an [EIDF Account](../../access/project.md) as access to the EIDF GPU Service can only be obtained through an EIDF virtual machine. + +Project Leads can request access to the EIDF GPU Service from VMs in an existing project through a service request to the EIDF helpdesk. + +Otherwise, Project Leads need to apply for a new EIDF project and specify access to the EIDF GPU service. + +Each project will be given a namespace within the EIDF GPU service to operate in. 
+ +Once access to the EIDF GPU service has been confirmed, Project Leads will be give the ability to add a kubeconfig file to any of the VMs in their EIDF project - information on access to VMs is available [here](../../access/virtualmachines-vdi.md). + +All EIDF VMs with the project kubeconfig file downloaded can access the EIDF GPU Service using the kubectl API. -Project Leads will be able to request access to the EIDF GPU Service for their project either during the project application process or through a service request to the EIDF helpdesk. +The VM does not require to be GPU-enabled. -Each project will be given a namespace to operate in and the ability to add a kubeconfig file to any of their Virtual Machines in their EIDF project - information on access to VMs is available [here](../../access/virtualmachines-vdi.md). +A quick check to see if a VM has access to the EIDF GPU service can be completed by typing `kubectl -n get jobs` in to the command line. -All EIDF virtual machines can be set up to access the EIDF GPU Service. The Virtual Machine does not require to be GPU-enabled. +If this is first time you have connected to the GPU service the response should be `No resources found in namespace`. !!! important "EIDF GPU Service vs EIDF GPU-Enabled VMs" - The EIDF GPU Service is a container based service which is accessed from EIDF Virtual Desktop VMs. This allows a project to access multiple GPUs of different types. + + The EIDF GPU Service is a container based service which is accessed from EIDF Virtual Desktop VMs. + + This allows a project to access multiple GPUs of different types. An EIDF Virtual Desktop GPU-enabled VM is limited to a small number (1-2) of GPUs of a single type. @@ -64,16 +78,27 @@ A standard project namespace has the following initial quota (subject to ongoing - GPU: 12 !!! important "Quota is a maximum on a Shared Resource" + A project quota is the maximum proportion of the service available for use by that project. - - During periods of high demand, Jobs will be queued awaiting resource availability on the Service. - - This means that a project has access up to 12 GPUs but due to demand may only be able to access a smaller number at any given time. + + This is a sum of all requested resources across all submitted jobs/pods/deployments within a project. + + Any submitted resource requests that would exceed the total project quota will be rejected. ## Project Queues EIDF GPU Service is introducing the Kueue system in February 2024. The use of this is detailed in the [Kueue](kueue.md). +!!! important "Job Queuing" + + During periods of high demand, jobs will be queued awaiting resource availability on the Service. + + As a general rule, the higher the GPU/CPU/Memory resource request of a single job the longer it will wait in the queue before enough resources are free on a single node for it be allocated. + + GPUs in high demand, such as Nvidia H100s, typically have longer wait times. + + Furthermore, a project may have a quota of up to 12 GPUs but due to demand may only be able to access a smaller number at any given time. + ## Additional Service Policy Information Additional information on service policies can be found [here](policies.md). 
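As a concrete illustration of the access check and queue behaviour described above, the commands below assume a project namespace of `eidf989ns` and a job named `jobtest-b92qg`; both are placeholders to replace with your own values.

```bash
# Confirm the kubeconfig and namespace are working. On a new project this
# should report that no resources were found, rather than a permissions error.
kubectl -n eidf989ns get jobs

# If a submitted job appears to be waiting, its events show whether it is
# still queued for resources or has been admitted and scheduled.
kubectl -n eidf989ns describe job jobtest-b92qg
```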
From 1de7fc678964919f26b52dbcc67cf3ffdb678e2e Mon Sep 17 00:00:00 2001 From: Sam Haynes Date: Thu, 21 Mar 2024 12:42:43 +0000 Subject: [PATCH 77/91] Pre commit checks --- docs/services/gpuservice/index.md | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/docs/services/gpuservice/index.md b/docs/services/gpuservice/index.md index 6ac8f5b6c..115fd2f76 100644 --- a/docs/services/gpuservice/index.md +++ b/docs/services/gpuservice/index.md @@ -61,8 +61,8 @@ If this is first time you have connected to the GPU service the response should !!! important "EIDF GPU Service vs EIDF GPU-Enabled VMs" - The EIDF GPU Service is a container based service which is accessed from EIDF Virtual Desktop VMs. - + The EIDF GPU Service is a container based service which is accessed from EIDF Virtual Desktop VMs. + This allows a project to access multiple GPUs of different types. An EIDF Virtual Desktop GPU-enabled VM is limited to a small number (1-2) of GPUs of a single type. @@ -78,11 +78,11 @@ A standard project namespace has the following initial quota (subject to ongoing - GPU: 12 !!! important "Quota is a maximum on a Shared Resource" - + A project quota is the maximum proportion of the service available for use by that project. - + This is a sum of all requested resources across all submitted jobs/pods/deployments within a project. - + Any submitted resource requests that would exceed the total project quota will be rejected. ## Project Queues @@ -92,9 +92,9 @@ EIDF GPU Service is introducing the Kueue system in February 2024. The use of th !!! important "Job Queuing" During periods of high demand, jobs will be queued awaiting resource availability on the Service. - - As a general rule, the higher the GPU/CPU/Memory resource request of a single job the longer it will wait in the queue before enough resources are free on a single node for it be allocated. - + + As a general rule, the higher the GPU/CPU/Memory resource request of a single job the longer it will wait in the queue before enough resources are free on a single node for it be allocated. + GPUs in high demand, such as Nvidia H100s, typically have longer wait times. Furthermore, a project may have a quota of up to 12 GPUs but due to demand may only be able to access a smaller number at any given time. From 8b048c670ab109bf5060824d8dac9db441447370 Mon Sep 17 00:00:00 2001 From: Sam Haynes Date: Thu, 21 Mar 2024 13:15:43 +0000 Subject: [PATCH 78/91] Adds typical namespace example --- docs/services/gpuservice/index.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/docs/services/gpuservice/index.md b/docs/services/gpuservice/index.md index 115fd2f76..52523dc65 100644 --- a/docs/services/gpuservice/index.md +++ b/docs/services/gpuservice/index.md @@ -49,6 +49,8 @@ Otherwise, Project Leads need to apply for a new EIDF project and specify access Each project will be given a namespace within the EIDF GPU service to operate in. +Typically, the namespace is the same as the EIDF project code but with 'ns' appended, i.e. `eidf989ns` for a project with code 'eidf989'. + Once access to the EIDF GPU service has been confirmed, Project Leads will be give the ability to add a kubeconfig file to any of the VMs in their EIDF project - information on access to VMs is available [here](../../access/virtualmachines-vdi.md). All EIDF VMs with the project kubeconfig file downloaded can access the EIDF GPU Service using the kubectl API. 
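The queue name used in job manifests follows the same convention, with `-user-queue` appended to the namespace. A minimal sketch of the metadata block for the example project code above (assuming a project code of `eidf989`):

```yaml
metadata:
  generateName: jobtest-
  labels:
    # The local queue name is the project namespace with '-user-queue' appended
    kueue.x-k8s.io/queue-name: eidf989ns-user-queue
```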
From cfd144c31fb2d476cf15e10010479a1fab480434 Mon Sep 17 00:00:00 2001 From: Sam Haynes Date: Mon, 25 Mar 2024 09:49:36 +0000 Subject: [PATCH 79/91] Swap manifest-filename to more specific myjobyaml --- docs/services/gpuservice/faq.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/services/gpuservice/faq.md b/docs/services/gpuservice/faq.md index ccec549ab..4a26e42ed 100644 --- a/docs/services/gpuservice/faq.md +++ b/docs/services/gpuservice/faq.md @@ -12,9 +12,9 @@ Project Leads and Managers can access the kubeconfig file from the Project page ### Access to GPU Service resources in default namespace is 'Forbidden' -```Error from server (Forbidden): error when creating : jobs is forbidden: User cannot create resource "jobs" in API group "" in the namespace "default"``` +```Error from server (Forbidden): error when creating "myjobfile.yml": jobs is forbidden: User cannot create resource "jobs" in API group "" in the namespace "default"``` -Some version of the above error is common when submitting jobs/pods to the GPU cluster using the kubectl command. This arises when you forgot to specify you are submitting job/pods to your project namespace, not the "default" namespace which you do not have permissions to use. Resubmitting the job/pod with `kubectl -n create ` should solve the issue. +Some version of the above error is common when submitting jobs/pods to the GPU cluster using the kubectl command. This arises when you forgot to specify you are submitting job/pods to your project namespace, not the "default" namespace which you do not have permissions to use. Resubmitting the job/pod with `kubectl -n create "myjobfile.yml"` should solve the issue. ### I can't mount my PVC in multiple containers or pods at the same time From f3f73871025b602b7ac8dc94568fb66d071ae8fc Mon Sep 17 00:00:00 2001 From: Sam Haynes Date: Mon, 25 Mar 2024 10:08:43 +0000 Subject: [PATCH 80/91] Simplify to project namespace --- docs/services/gpuservice/faq.md | 2 +- docs/services/gpuservice/index.md | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/services/gpuservice/faq.md b/docs/services/gpuservice/faq.md index 4a26e42ed..40692bb65 100644 --- a/docs/services/gpuservice/faq.md +++ b/docs/services/gpuservice/faq.md @@ -14,7 +14,7 @@ Project Leads and Managers can access the kubeconfig file from the Project page ```Error from server (Forbidden): error when creating "myjobfile.yml": jobs is forbidden: User cannot create resource "jobs" in API group "" in the namespace "default"``` -Some version of the above error is common when submitting jobs/pods to the GPU cluster using the kubectl command. This arises when you forgot to specify you are submitting job/pods to your project namespace, not the "default" namespace which you do not have permissions to use. Resubmitting the job/pod with `kubectl -n create "myjobfile.yml"` should solve the issue. +Some version of the above error is common when submitting jobs/pods to the GPU cluster using the kubectl command. This arises when you forgot to specify you are submitting job/pods to your project namespace, not the "default" namespace which you do not have permissions to use. Resubmitting the job/pod with `kubectl -n create "myjobfile.yml"` should solve the issue. 
### I can't mount my PVC in multiple containers or pods at the same time diff --git a/docs/services/gpuservice/index.md b/docs/services/gpuservice/index.md index 52523dc65..cc7836ff0 100644 --- a/docs/services/gpuservice/index.md +++ b/docs/services/gpuservice/index.md @@ -57,9 +57,9 @@ All EIDF VMs with the project kubeconfig file downloaded can access the EIDF GPU The VM does not require to be GPU-enabled. -A quick check to see if a VM has access to the EIDF GPU service can be completed by typing `kubectl -n get jobs` in to the command line. +A quick check to see if a VM has access to the EIDF GPU service can be completed by typing `kubectl -n get jobs` in to the command line. -If this is first time you have connected to the GPU service the response should be `No resources found in namespace`. +If this is first time you have connected to the GPU service the response should be `No resources found in namespace`. !!! important "EIDF GPU Service vs EIDF GPU-Enabled VMs" From 7f23046bfee15b1a684bd8ba5aa760844c514621 Mon Sep 17 00:00:00 2001 From: Sam Haynes Date: Wed, 27 Mar 2024 16:40:42 +0000 Subject: [PATCH 81/91] Respond to alistair comments --- docs/services/gpuservice/index.md | 14 ++++++-------- 1 file changed, 6 insertions(+), 8 deletions(-) diff --git a/docs/services/gpuservice/index.md b/docs/services/gpuservice/index.md index cc7836ff0..8e9992334 100644 --- a/docs/services/gpuservice/index.md +++ b/docs/services/gpuservice/index.md @@ -41,19 +41,19 @@ The current full specification of the EIDF GPU Service as of 14 February 2024: ## Service Access -Users should have an [EIDF Account](../../access/project.md) as access to the EIDF GPU Service can only be obtained through an EIDF virtual machine. +Users should have an [EIDF Account](../../access/project.md) as the EIDF GPU Service is only accessible through EIDF Virtual Machines. -Project Leads can request access to the EIDF GPU Service from VMs in an existing project through a service request to the EIDF helpdesk. +Existing projects can request access to the EIDF GPU Service through a service request to the [EIDF helpdesk](https://portal.eidf.ac.uk/queries/submit) or emailing eidf@epcc.ed.ac.uk . -Otherwise, Project Leads need to apply for a new EIDF project and specify access to the EIDF GPU service. +New projects wanting to using the GPU Service should include this in their EIDF Project Application. Each project will be given a namespace within the EIDF GPU service to operate in. -Typically, the namespace is the same as the EIDF project code but with 'ns' appended, i.e. `eidf989ns` for a project with code 'eidf989'. +This namespace will normally be the EIDF Project code appended with ’ns’, i.e. `eidf989ns` for a project with code 'eidf989'. Once access to the EIDF GPU service has been confirmed, Project Leads will be give the ability to add a kubeconfig file to any of the VMs in their EIDF project - information on access to VMs is available [here](../../access/virtualmachines-vdi.md). -All EIDF VMs with the project kubeconfig file downloaded can access the EIDF GPU Service using the kubectl API. +All EIDF VMs with the project kubeconfig file downloaded can access the EIDF GPU Service using the kubectl command line tool. The VM does not require to be GPU-enabled. @@ -83,9 +83,7 @@ A standard project namespace has the following initial quota (subject to ongoing A project quota is the maximum proportion of the service available for use by that project. 
- This is a sum of all requested resources across all submitted jobs/pods/deployments within a project. - - Any submitted resource requests that would exceed the total project quota will be rejected. + Any submitted job requests that would exceed the total project quota will be queued. ## Project Queues From c3f8020b2e907b8271ba53d3c6a04bf56c54cf7e Mon Sep 17 00:00:00 2001 From: Sam Haynes Date: Tue, 9 Apr 2024 16:13:33 +0100 Subject: [PATCH 82/91] Place error example within triangular brackets --- docs/services/gpuservice/faq.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/services/gpuservice/faq.md b/docs/services/gpuservice/faq.md index 40692bb65..1d67da17f 100644 --- a/docs/services/gpuservice/faq.md +++ b/docs/services/gpuservice/faq.md @@ -12,7 +12,7 @@ Project Leads and Managers can access the kubeconfig file from the Project page ### Access to GPU Service resources in default namespace is 'Forbidden' -```Error from server (Forbidden): error when creating "myjobfile.yml": jobs is forbidden: User cannot create resource "jobs" in API group "" in the namespace "default"``` +``` cannot create resource "jobs" in API group "" in the namespace "default">``` Some version of the above error is common when submitting jobs/pods to the GPU cluster using the kubectl command. This arises when you forgot to specify you are submitting job/pods to your project namespace, not the "default" namespace which you do not have permissions to use. Resubmitting the job/pod with `kubectl -n create "myjobfile.yml"` should solve the issue. From 6b0753ae263725623c16f64c57ffa2f328a79996 Mon Sep 17 00:00:00 2001 From: Sam Haynes Date: Tue, 9 Apr 2024 16:22:05 +0100 Subject: [PATCH 83/91] Fixed code block formatting --- docs/services/gpuservice/faq.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/docs/services/gpuservice/faq.md b/docs/services/gpuservice/faq.md index 1d67da17f..c859e0fb9 100644 --- a/docs/services/gpuservice/faq.md +++ b/docs/services/gpuservice/faq.md @@ -12,7 +12,9 @@ Project Leads and Managers can access the kubeconfig file from the Project page ### Access to GPU Service resources in default namespace is 'Forbidden' -``` cannot create resource "jobs" in API group "" in the namespace "default">``` +```bash +Error from server (Forbidden): error when creating "myjobfile.yml": jobs is forbidden: User cannot create resource "jobs" in API group "" in the namespace "default" +``` Some version of the above error is common when submitting jobs/pods to the GPU cluster using the kubectl command. This arises when you forgot to specify you are submitting job/pods to your project namespace, not the "default" namespace which you do not have permissions to use. Resubmitting the job/pod with `kubectl -n create "myjobfile.yml"` should solve the issue. 
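To make the fix concrete, the pair of commands below assumes a project namespace of `eidf989ns` and the `myjobfile.yml` manifest from the error message; only the namespace flag differs.

```bash
# Fails: without -n, kubectl falls back to the "default" namespace,
# which project users are not permitted to create resources in.
kubectl create -f myjobfile.yml

# Works: the job is created in the project's own namespace.
kubectl -n eidf989ns create -f myjobfile.yml
```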
From def79a81a1182ce8bb6d2e54b3b1517fc5964b7e Mon Sep 17 00:00:00 2001 From: Sam Haynes Date: Thu, 21 Mar 2024 18:14:03 +0000 Subject: [PATCH 84/91] Add additional references to kubeconfig, namespaces and jobs --- .../gpuservice/training/L1_getting_started.md | 264 ++++++++++++------ 1 file changed, 176 insertions(+), 88 deletions(-) diff --git a/docs/services/gpuservice/training/L1_getting_started.md b/docs/services/gpuservice/training/L1_getting_started.md index 9ebd1bea7..a4f3aa9a2 100644 --- a/docs/services/gpuservice/training/L1_getting_started.md +++ b/docs/services/gpuservice/training/L1_getting_started.md @@ -1,5 +1,22 @@ # Getting started with Kubernetes +## Requirements + +In order to follow this tutorial on the EIDF GPU Cluster you will need to have: + +- An account on the EIDF Portal. + +- An active EIDF Project on the Portal with access to the EIDF GPU Service. + +- The EIDF GPU Service kubernetes namespace associated with the project, e.g. eidf001ns. + +- The EIDF GPU Service queue name associated with the project, e.g. eidf001ns-user-queue. + +- Downloaded the kubeconfig file to a VM with the project along with the kubectl API. + +!!! Important "Downloading the kubeconfig file and kubectl API" + Project Leads should use the 'Download kubeconfig' button on the EIDF Portal to complete this step to ensure the correct kubeconfig file and kubectl version is installed. + ## Introduction Kubernetes (K8s) is a container orchestration system, originally developed by Google, for the deployment, scaling, and management of containerised applications. @@ -17,110 +34,126 @@ An overview of the key components of a K8s container can be seen on the [Kuberne The primary component of a K8s cluster is a pod. -A pod is a set of one or more containers (and their storage volumes) that share resources. +A pod is a set of one or more docker containers (and their storage volumes) that share resources. -Users define the resource requirements of a pod (i.e. number/type of GPU) and the containers to be ran in the pod by writing a yaml file. +It is the EIDF GPU Cluster policy that all pods should be wrapped within a K8s [job](https://kubernetes.io/docs/concepts/workloads/controllers/job/). -The pod definition yaml file is sent to the cluster using the K8s API and is assigned to an appropriate node to be ran. +This allows GPU/CPU/Memory resource requests to be managed by the cluster queue management system, kueue. -A node is a part of the cluster such as a physical or virtual host which exposes CPU, Memory and GPUs. +Pods which attempt to bypass the queue mechanism will affect the experience of other project users. + +Any pods not associated with a job (or other K8s object) are at risk of being deleted without notice. + +K8s jobs also provide additional functionality such as parallelism (described later in this tutorial). -Multiple pods can be defined and maintained using several different methods depending on purpose: [deployments](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/), [services](https://kubernetes.io/docs/concepts/services-networking/service/) and [jobs](https://kubernetes.io/docs/concepts/workloads/controllers/job/); see the K8s docs for more details. +Users define the resource requirements of a pod (i.e. number/type of GPU) and the containers/code to be ran in the pod by defining a template within a job manifest file written in yaml. + +The job yaml file is sent to the cluster using the K8s API and is assigned to an appropriate node to be ran. 
+ +A node is a part of the cluster such as a physical or virtual host which exposes CPU, Memory and GPUs. Users interact with the K8s API using the `kubectl` (short for kubernetes control) commands. Some of the kubectl commands are restricted on the EIDF cluster in order to ensure project details are not shared across namespaces. +!!! important "Ensure kubectl is interacting with your project namespace." + + You will need to pass the name of your project namespace to `kubectl` in order for it to have permission to interact with the cluster. + + `kubectl` will attempt to interact with the `default` namespace which will return a permissions error if it is not told otherwise. + + `kubectl -n ` will tell kubectl to pass the commands to the correct namespace. + Useful commands are: -- `kubectl create -f `: Create a new job with requested resources. Returns an error if a job with the same name already exists. -- `kubectl apply -f `: Create a new job with requested resources. If a job with the same name already exists it updates that job with the new resource/container requirements outlined in the yaml. -- `kubectl delete pod `: Delete a pod from the cluster. -- `kubectl get pods`: Summarise all pods the namespace has active (or pending). -- `kubectl describe pods`: Verbose description of all pods the namespace has active (or pending). -- `kubectl describe pod `: Verbose summary of the specified pod. -- `kubectl logs `: Retrieve the log files associated with a running pod. -- `kubectl get jobs`: List all jobs the namespace has active (or pending). -- `kubectl describe job `: Verbose summary of the specified job. -- `kubectl delete job `: Delete a job from the cluster. +- `kubectl -n create -f `: Create a new job with requested resources. Returns an error if a job with the same name already exists. +- `kubectl -n apply -f `: Create a new job with requested resources. If a job with the same name already exists it updates that job with the new resource/container requirements outlined in the yaml. +- `kubectl -n delete pod `: Delete a pod from the cluster. +- `kubectl -n get pods`: Summarise all pods the namespace has active (or pending). +- `kubectl -n describe pods`: Verbose description of all pods the namespace has active (or pending). +- `kubectl -n describe pod `: Verbose summary of the specified pod. +- `kubectl -n logs `: Retrieve the log files associated with a running pod. +- `kubectl -n get jobs`: List all jobs the namespace has active (or pending). +- `kubectl -n describe job `: Verbose summary of the specified job. +- `kubectl -n delete job `: Delete a job from the cluster. -## Creating your first job +## Creating your first pod template within a job yaml file -To access the GPUs on the service, it is recommended to start with one of the prebuild container images provided by Nvidia, these images are intended to perform different tasks using Nvidia GPUs. +To access the GPUs on the service, it is recommended to start with one of the prebuilt container images provided by Nvidia, these images are intended to perform different tasks using Nvidia GPUs. The list of Nvidia images is available on their [website](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/k8s/containers/cuda-sample/tags). The following example uses their CUDA sample code simulating nbody interactions. -1. Open an editor of your choice and create the file test_NBody.yml -1. Copy the following in to the file, replacing `namespace-user-queue` with -user-queue, e.g. 
eidf001ns-user-queue: - - ``` yaml - apiVersion: batch/v1 - kind: Job - metadata: - generateName: jobtest- - labels: - kueue.x-k8s.io/queue-name: namespace-user-queue - spec: - completions: 1 - template: - metadata: - name: job-test - spec: - containers: - - name: cudasample - image: nvcr.io/nvidia/k8s/cuda-sample:nbody-cuda11.7.1 - args: ["-benchmark", "-numbodies=512000", "-fp64", "-fullscreen"] - resources: - requests: - cpu: 2 - memory: '1Gi' - limits: - cpu: 2 - memory: '4Gi' - nvidia.com/gpu: 1 - restartPolicy: Never - ``` +``` yaml +apiVersion: batch/v1 +kind: Job +metadata: + generateName: jobtest- + labels: + kueue.x-k8s.io/queue-name: -user-queue +spec: + completions: 1 + template: + metadata: + name: job-test + spec: + containers: + - name: cudasample + image: nvcr.io/nvidia/k8s/cuda-sample:nbody-cuda11.7.1 + args: ["-benchmark", "-numbodies=512000", "-fp64", "-fullscreen"] + resources: + requests: + cpu: 2 + memory: '1Gi' + limits: + cpu: 2 + memory: '4Gi' + nvidia.com/gpu: 1 + restartPolicy: Never +``` + +The pod resources are defined under the `resources` tags using the `requests` and `limits` tags. - The pod resources are defined under the `resources` tags using the `requests` and `limits` tags. +Resources defined under the `requests` tags are the reserved resources required for the pod to be scheduled. - Resources defined under the `requests` tags are the reserved resources required for the pod to be scheduled. +If a pod is assigned to a node with unused resources then it may burst up to use resources beyond those requested. - If a pod is assigned to a node with unused resources then it may burst up to use resources beyond those requested. +This may allow the task within the pod to run faster, but it will also throttle back down when further pods are scheduled to the node. - This may allow the task within the pod to run faster, but it will also throttle back down when further pods are scheduled to the node. +The `limits` tag specifies the maximum resources that can be assigned to a pod. - The `limits` tag specifies the maximum resources that can be assigned to a pod. +The EIDF GPU Service requires all pods have `requests` and `limits` tags for CPU and memory defined in order to be accepted. - The EIDF GPU Service requires all pods have `requests` and `limits` tags for CPU and memory defined in order to be accepted. +GPU resources requests are optional and only an entry under the `limits` tag is needed to specify the use of a GPU, `nvidia.com/gpu: 1`. Without this no GPU will be available to the pod. - GPU resources requests are optional and only an entry under the `limits` tag is needed to specify the use of a GPU, `nvidia.com/gpu: 1`. Without this no GPU will be available to the pod. +The label `kueue.x-k8s.io/queue-name` specifies the queue you are submitting your job to. This is part of the Kueue system in operation on the service to allow for improved resource management for users. - The label `kueue.x-k8s.io/queue-name` specifies the queue you are submitting your job to. This is part of the Kueue system in operation on the service to allow for improved resource management for users. +## Submitting your first job +1. Open an editor of your choice and create the file test_NBody.yml +1. Copy the above job yaml in to the file, filling in `-user-queue`, e.g. eidf001ns-user-queue: 1. Save the file and exit the editor -1. Run `kubectl create -f test_NBody.yml` +1. Run `kubectl -n create -f test_NBody.yml` 1. 
This will output something like: ``` bash job.batch/jobtest-b92qg created ``` -1. Run `kubectl get jobs` + The five character code appended to the job name, i.e. `b92qg`, is randomly generated and will differ from your run. + +1. Run `kubectl -n get jobs` 1. This will output something like: ```bash NAME COMPLETIONS DURATION AGE - jobtest-b92qg 3/3 48s 6m27s - jobtest-d45sr 5/5 15m 22h - jobtest-kwmwk 3/3 48s 29m - jobtest-kw22k 1/1 48s 29m + jobtest-b92qg 1/1 48s 29m ``` - This displays all the jobs in the current namespace, starting with their name, number of completions against required completions, duration and age. + There may be more than one entry as this displays all the jobs in the current namespace, starting with their name, number of completions against required completions, duration and age. -1. Describe your job using the command `kubectl describe job jobtest-b92-qg`, replacing the job name with your job name. +1. Inspect your job further using the command `kubectl -n describe job jobtest-b92qg`, updating the job name with your five character code. 1. This will output something like: ```bash @@ -172,25 +205,18 @@ The following example uses their CUDA sample code simulating nbody interactions. Normal Completed 7m12s job-controller Job completed ``` -1. Run `kubectl get pods` +1. Run `kubectl -n get pods` 1. This will output something like: ``` bash NAME READY STATUS RESTARTS AGE jobtest-b92qg-lh64s 0/1 Completed 0 11m - jobtest-b92qg-lvmrf 0/1 Completed 0 10m - jobtest-b92qg-xhvdm 0/1 Completed 0 10m - jobtest-d45sr-8tf4d 0/1 Completed 0 22h - jobtest-d45sr-jjhgg 0/1 Completed 0 22h - jobtest-d45sr-n5w6c 0/1 Completed 0 22h - jobtest-d45sr-v9p4j 0/1 Completed 0 22h - jobtest-d45sr-xgq5s 0/1 Completed 0 22h - jobtest-kwmwk-cgwmf 0/1 Completed 0 33m - jobtest-kwmwk-mttdw 0/1 Completed 0 33m - jobtest-kwmwk-r2q9h 0/1 Completed 0 33m ``` -1. View the logs of a pod from the job you ran `kubectl logs jobtest-b92qg-lh64s` - note that the pods for the job in this case start with the job name. + Again, there may be more than one entry as this displays all the jobs in the current namespace. + Also, each pod within a job is given another unique 5 character code appended to the job name. + +1. View the logs of a pod from the job you ran `kubectl -n logs jobtest-b92qg-lh64s` - again update with you run's pod and job five letter code. 1. This will output something like: ``` bash @@ -221,15 +247,15 @@ The following example uses their CUDA sample code simulating nbody interactions. = 7439.679 double-precision GFLOP/s at 30 flops per interaction ``` -1. Delete your job with `kubectl delete job jobtest-b92qg` - this will delete the associated pods as well. +1. Delete your job with `kubectl -n delete job jobtest-b92qg` - this will delete the associated pods as well. ## Specifying GPU requirements If you create multiple jobs with the same definition file and compare their log files you may notice the CUDA device may differ from `Compute 8.0 CUDA device: [NVIDIA A100-SXM4-40GB]`. -The GPU Operator on K8s is allocating the pod to the first node with a GPU free that matches the other resource specifications irrespective of whether what GPU type is present on the node. +The GPU Operator on K8s is allocating the pod to the first node with a GPU free that matches the other resource specifications irrespective of the type of GPU present on the node. 
-The GPU resource requests can be made more specific by adding the type of GPU product the pod is requesting to the node selector: +The GPU resource requests can be made more specific by adding the type of GPU product the pod template is requesting to the node selector: - `nvidia.com/gpu.product: 'NVIDIA-A100-SXM4-80GB'` - `nvidia.com/gpu.product: 'NVIDIA-A100-SXM4-40GB'` @@ -237,7 +263,15 @@ The GPU resource requests can be made more specific by adding the type of GPU pr - `nvidia.com/gpu.product: 'NVIDIA-A100-SXM4-40GB-MIG-1g.5gb'` - `nvidia.com/gpu.product: 'NVIDIA-H100-80GB-HBM3'` -### Example yaml file +### Example yaml file with GPU type specified + +The `nodeSelector:` key at the bottom of the pod template states the pod should be ran on a node with a 1g.5gb MIG GPU. + +!!! important "Exact GPU product names only" + + K8s will fail to assign the pod if you misspell the GPU type. + + Be especially careful when requesting a full 80Gb or 40Gb A100 GPU as attempting to load GPUs with more data than its memory can handle can have unexpected consequences. ```yaml @@ -246,7 +280,7 @@ kind: Job metadata: generateName: jobtest- labels: - kueue.x-k8s.io/queue-name: namespace-user-queue + kueue.x-k8s.io/queue-name: -user-queue spec: completions: 1 template: @@ -272,15 +306,11 @@ spec: ## Running multiple pods with K8s jobs -The recommended use of the EIDF GPU Service is to use a job request which wraps around a pod specification and provide several useful attributes. - -Firstly, if a pod is assigned to a node that dies then the pod itself will fail and the user has to manually restart it. - -Wrapping a pod within a job enables the self-healing mechanism within K8s so that if a node dies with the job's pod on it then the job will find a new node to automatically restart the pod, if the restartPolicy is set. +Wrapping a pod within a job provides additional functionality on top of accessing the queuing system. -Jobs allow users to define multiple pods that can run in parallel or series and will continue to spawn pods until a specific number of pods successfully terminate. +Firstly, the restartPolicy within a job enables the self-healing mechanism within K8s so that if a node dies with the job's pod on it then the job will find a new node to automatically restart the pod. -Jobs allow for better scheduling of resources using the Kueue service implemented on the EIDF GPU Service. Pods which attempt to bypass the queue mechanism this provides will affect the experience of other project users. +Jobs also allow users to define multiple pods that can run in parallel or series and will continue to spawn pods until a specific number of pods successfully terminate. See below for an example K8s job that requires three pods to successfully complete the example CUDA code before the job itself ends. @@ -290,7 +320,7 @@ kind: Job metadata: generateName: jobtest- labels: - kueue.x-k8s.io/queue-name: namespace-user-queue + kueue.x-k8s.io/queue-name: -user-queue spec: completions: 3 parallelism: 1 @@ -312,3 +342,61 @@ spec: nvidia.com/gpu: 1 restartPolicy: Never ``` + +## Change the default kubectl namespace in the project kubeconfig file + + Passing the `-n ` flag every time you want to interact with the cluster can be cumbersome. + + You can alter the kubeconfig on your VM to send commands to your project namespace by default. + +1. Open the command line on your EIDF VM with access to the EIDF GPU Service + +1. Create a folder to hold your adapted kubeconfig file + + ```bash + mkdir ~/.kube + ``` + +1. 
Create a copy of shared kubeconfig file to your home directory. + + ```bash + cp /kubenetes/config ~/.kube/_config + ``` + +1. Add the namespace line with your project's kubernetes namespace to the "eidf-general-prod" context entry in your copy of the config file + + ```bash + nano ~/.kube/_config + ``` + + ```txt + *** MORE CONFIG *** + + contexts: + - name: "eidf-general-prod" + context: + user: "eidf-general-prod" + namespace: "" # INSERT LINE + cluster: "eidf-general-prod" + + *** MORE CONFIG *** + ``` + +1. Changing the KUBECONFIG bash variable to point to the amended file in your home directory by adding the following lines to the end of the .bashrc file. + + ```bash + nano ~/.bashrc + ``` + + ```txt + *** END OF .bashrc *** + + # Add correct K8s config credentials + export KUBECONFIG=_config_file> + ``` + +1. Check kubectl connects to the cluster. If this does not work you can return to the original kubeconfig file by removing the above export line from the .bashrc file and restarting the terminal. + + ```bash + kubectl get pods + ``` From b7bac45a5f48bb47b01a9d5779e023db6d3e5724 Mon Sep 17 00:00:00 2001 From: Sam Haynes Date: Mon, 25 Mar 2024 09:42:30 +0000 Subject: [PATCH 85/91] Change download sentence for clarity --- docs/services/gpuservice/training/L1_getting_started.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/docs/services/gpuservice/training/L1_getting_started.md b/docs/services/gpuservice/training/L1_getting_started.md index a4f3aa9a2..bdd8e6c34 100644 --- a/docs/services/gpuservice/training/L1_getting_started.md +++ b/docs/services/gpuservice/training/L1_getting_started.md @@ -12,9 +12,10 @@ In order to follow this tutorial on the EIDF GPU Cluster you will need to have: - The EIDF GPU Service queue name associated with the project, e.g. eidf001ns-user-queue. -- Downloaded the kubeconfig file to a VM with the project along with the kubectl API. +- Downloaded the kubeconfig file to a Project VM along with the kubectl API. !!! Important "Downloading the kubeconfig file and kubectl API" + Project Leads should use the 'Download kubeconfig' button on the EIDF Portal to complete this step to ensure the correct kubeconfig file and kubectl version is installed. ## Introduction From 00894491a082ab480f3b7d98f6dc5f640f45b871 Mon Sep 17 00:00:00 2001 From: Sam Haynes Date: Mon, 25 Mar 2024 10:06:34 +0000 Subject: [PATCH 86/91] Simplify to project namespace --- .../gpuservice/training/L1_getting_started.md | 46 +++++++++---------- 1 file changed, 23 insertions(+), 23 deletions(-) diff --git a/docs/services/gpuservice/training/L1_getting_started.md b/docs/services/gpuservice/training/L1_getting_started.md index bdd8e6c34..e57087711 100644 --- a/docs/services/gpuservice/training/L1_getting_started.md +++ b/docs/services/gpuservice/training/L1_getting_started.md @@ -63,20 +63,20 @@ Some of the kubectl commands are restricted on the EIDF cluster in order to ensu `kubectl` will attempt to interact with the `default` namespace which will return a permissions error if it is not told otherwise. - `kubectl -n ` will tell kubectl to pass the commands to the correct namespace. + `kubectl -n ` will tell kubectl to pass the commands to the correct namespace. Useful commands are: -- `kubectl -n create -f `: Create a new job with requested resources. Returns an error if a job with the same name already exists. -- `kubectl -n apply -f `: Create a new job with requested resources. 
If a job with the same name already exists it updates that job with the new resource/container requirements outlined in the yaml. -- `kubectl -n delete pod `: Delete a pod from the cluster. -- `kubectl -n get pods`: Summarise all pods the namespace has active (or pending). -- `kubectl -n describe pods`: Verbose description of all pods the namespace has active (or pending). -- `kubectl -n describe pod `: Verbose summary of the specified pod. -- `kubectl -n logs `: Retrieve the log files associated with a running pod. -- `kubectl -n get jobs`: List all jobs the namespace has active (or pending). -- `kubectl -n describe job `: Verbose summary of the specified job. -- `kubectl -n delete job `: Delete a job from the cluster. +- `kubectl -n create -f `: Create a new job with requested resources. Returns an error if a job with the same name already exists. +- `kubectl -n apply -f `: Create a new job with requested resources. If a job with the same name already exists it updates that job with the new resource/container requirements outlined in the yaml. +- `kubectl -n delete pod `: Delete a pod from the cluster. +- `kubectl -n get pods`: Summarise all pods the namespace has active (or pending). +- `kubectl -n describe pods`: Verbose description of all pods the namespace has active (or pending). +- `kubectl -n describe pod `: Verbose summary of the specified pod. +- `kubectl -n logs `: Retrieve the log files associated with a running pod. +- `kubectl -n get jobs`: List all jobs the namespace has active (or pending). +- `kubectl -n describe job `: Verbose summary of the specified job. +- `kubectl -n delete job `: Delete a job from the cluster. ## Creating your first pod template within a job yaml file @@ -92,7 +92,7 @@ kind: Job metadata: generateName: jobtest- labels: - kueue.x-k8s.io/queue-name: -user-queue + kueue.x-k8s.io/queue-name: -user-queue spec: completions: 1 template: @@ -133,9 +133,9 @@ The label `kueue.x-k8s.io/queue-name` specifies the queue you are submitting you ## Submitting your first job 1. Open an editor of your choice and create the file test_NBody.yml -1. Copy the above job yaml in to the file, filling in `-user-queue`, e.g. eidf001ns-user-queue: +1. Copy the above job yaml in to the file, filling in `-user-queue`, e.g. eidf001ns-user-queue: 1. Save the file and exit the editor -1. Run `kubectl -n create -f test_NBody.yml` +1. Run `kubectl -n create -f test_NBody.yml` 1. This will output something like: ``` bash @@ -144,7 +144,7 @@ The label `kueue.x-k8s.io/queue-name` specifies the queue you are submitting you The five character code appended to the job name, i.e. `b92qg`, is randomly generated and will differ from your run. -1. Run `kubectl -n get jobs` +1. Run `kubectl -n get jobs` 1. This will output something like: ```bash @@ -154,7 +154,7 @@ The label `kueue.x-k8s.io/queue-name` specifies the queue you are submitting you There may be more than one entry as this displays all the jobs in the current namespace, starting with their name, number of completions against required completions, duration and age. -1. Inspect your job further using the command `kubectl -n describe job jobtest-b92qg`, updating the job name with your five character code. +1. Inspect your job further using the command `kubectl -n describe job jobtest-b92qg`, updating the job name with your five character code. 1. 
This will output something like:

    ```bash
@@ -206,7 +206,7 @@ The label `kueue.x-k8s.io/queue-name` specifies the queue you are submitting you
      Normal  Completed  7m12s  job-controller  Job completed
    ```

-1. Run `kubectl -n get pods`
+1. Run `kubectl -n get pods`

1. This will output something like:

    ``` bash
@@ -217,7 +217,7 @@ The label `kueue.x-k8s.io/queue-name` specifies the queue you are submitting you

    Again, there may be more than one entry as this displays all the jobs in the current namespace. Also, each pod within a job is given another unique 5 character code appended to the job name.

-1. View the logs of a pod from the job you ran `kubectl -n logs jobtest-b92qg-lh64s` - again update with your run's pod and job five letter code.
+1. View the logs of a pod from the job you ran `kubectl -n logs jobtest-b92qg-lh64s` - again update with your run's pod and job five letter code.

1. This will output something like:

    ``` bash
@@ -248,7 +248,7 @@ The label `kueue.x-k8s.io/queue-name` specifies the queue you are submitting you
    = 7439.679 double-precision GFLOP/s at 30 flops per interaction
    ```

-1. Delete your job with `kubectl -n delete job jobtest-b92qg` - this will delete the associated pods as well.
+1. Delete your job with `kubectl -n delete job jobtest-b92qg` - this will delete the associated pods as well.

## Specifying GPU requirements
@@ -281,7 +281,7 @@ kind: Job
metadata:
  generateName: jobtest-
  labels:
-    kueue.x-k8s.io/queue-name: -user-queue
+    kueue.x-k8s.io/queue-name: -user-queue
spec:
  completions: 1
  template:
@@ -321,7 +321,7 @@ kind: Job
metadata:
  generateName: jobtest-
  labels:
-    kueue.x-k8s.io/queue-name: -user-queue
+    kueue.x-k8s.io/queue-name: -user-queue
spec:
  completions: 3
  parallelism: 1
@@ -346,7 +346,7 @@ spec:

## Change the default kubectl namespace in the project kubeconfig file

-    Passing the `-n ` flag every time you want to interact with the cluster can be cumbersome.
+    Passing the `-n ` flag every time you want to interact with the cluster can be cumbersome.

    You can alter the kubeconfig on your VM to send commands to your project namespace by default.
@@ -377,7 +377,7 @@ spec:
- name: "eidf-general-prod"
  context:
    user: "eidf-general-prod"
-      namespace: "" # INSERT LINE
+      namespace: "" # INSERT LINE
      cluster: "eidf-general-prod"

*** MORE CONFIG ***

From 617f5037976182ffa6366317911f6bb82af8cd87 Mon Sep 17 00:00:00 2001
From: Sam Haynes
Date: Wed, 27 Mar 2024 13:47:52 +0000
Subject: [PATCH 87/91] Remove incorrect reference to kubectl API

---
 docs/services/gpuservice/training/L1_getting_started.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/services/gpuservice/training/L1_getting_started.md b/docs/services/gpuservice/training/L1_getting_started.md
index e57087711..ba0444048 100644
--- a/docs/services/gpuservice/training/L1_getting_started.md
+++ b/docs/services/gpuservice/training/L1_getting_started.md
@@ -12,9 +12,9 @@ In order to follow this tutorial on the EIDF GPU Cluster you will need to have:

- The EIDF GPU Service queue name associated with the project, e.g. eidf001ns-user-queue.

-- Downloaded the kubeconfig file to a Project VM along with the kubectl API.
+- Downloaded the kubeconfig file to a Project VM along with the kubectl command line tool to interact with the K8s API.

-!!! Important "Downloading the kubeconfig file and kubectl API"
+!!! Important "Downloading the kubeconfig file and kubectl"

    Project Leads should use the 'Download kubeconfig' button on the EIDF Portal to complete this step to ensure the correct kubeconfig file and kubectl version is installed.

From 8213f5e4ddb27087fafb0daf892a0898a62e30ca Mon Sep 17 00:00:00 2001
From: Sam Haynes
Date: Wed, 27 Mar 2024 14:20:01 +0000
Subject: [PATCH 88/91] Change local config to root config

---
 .../gpuservice/training/L1_getting_started.md | 35 ++++---------------
 1 file changed, 7 insertions(+), 28 deletions(-)

diff --git a/docs/services/gpuservice/training/L1_getting_started.md b/docs/services/gpuservice/training/L1_getting_started.md
index ba0444048..fc260aed8 100644
--- a/docs/services/gpuservice/training/L1_getting_started.md
+++ b/docs/services/gpuservice/training/L1_getting_started.md
@@ -349,26 +349,18 @@ spec:
    Passing the `-n ` flag every time you want to interact with the cluster can be cumbersome.

    You can alter the kubeconfig on your VM to send commands to your project namespace by default.
+
+    Only users with sudo privileges can change the root kubectl config file.

-1. Open the command line on your EIDF VM with access to the EIDF GPU Service
+1. Open the command line on your EIDF VM with access to the EIDF GPU Service.

-1. Create a folder to hold your adapted kubeconfig file
+1. Open the root kubeconfig file with sudo privileges.

    ```bash
-   mkdir ~/.kube
+   sudo nano /kubernetes/config
    ```

-1. Create a copy of shared kubeconfig file to your home directory.
-
-    ```bash
-    cp /kubenetes/config ~/.kube/_config
-    ```
-
-1. Add the namespace line with your project's kubernetes namespace to the "eidf-general-prod" context entry in your copy of the config file
-
-    ```bash
-    nano ~/.kube/_config
-    ```
+1. Add the namespace line with your project's kubernetes namespace to the "eidf-general-prod" context entry in your copy of the config file.

    ```txt
    *** MORE CONFIG ***
@@ -383,20 +375,7 @@ spec:
    *** MORE CONFIG ***
    ```

-1. Changing the KUBECONFIG bash variable to point to the amended file in your home directory by adding the following lines to the end of the .bashrc file.
-
-    ```bash
-    nano ~/.bashrc
-    ```
-
-    ```txt
-    *** END OF .bashrc ***
-
-    # Add correct K8s config credentials
-    export KUBECONFIG=_config_file>
-    ```
-
-1. Check kubectl connects to the cluster. If this does not work you can return to the original kubeconfig file by removing the above export line from the .bashrc file and restarting the terminal.
+1. Check kubectl connects to the cluster. If this does not work, delete and re-download the kubeconfig file using the button on the project page of the EIDF portal.

    ```bash
    kubectl get pods

From a39a975b3a60bec445dd650c7db4cfb911dd961d Mon Sep 17 00:00:00 2001
From: Sam Haynes
Date: Tue, 9 Apr 2024 16:30:32 +0100
Subject: [PATCH 89/91] Trimmed trailing whitespace

---
 docs/services/gpuservice/training/L1_getting_started.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/services/gpuservice/training/L1_getting_started.md b/docs/services/gpuservice/training/L1_getting_started.md
index fc260aed8..c287cfb8a 100644
--- a/docs/services/gpuservice/training/L1_getting_started.md
+++ b/docs/services/gpuservice/training/L1_getting_started.md
@@ -349,7 +349,7 @@ spec:
    Passing the `-n ` flag every time you want to interact with the cluster can be cumbersome.

    You can alter the kubeconfig on your VM to send commands to your project namespace by default.
-
+
    Only users with sudo privileges can change the root kubectl config file.

1. Open the command line on your EIDF VM with access to the EIDF GPU Service.

From 85c34af91248422c249c3f8718df7c466f1e155d Mon Sep 17 00:00:00 2001
From: Sam Haynes
Date: Mon, 25 Mar 2024 10:22:22 +0000
Subject: [PATCH 90/91] Simplify to project namespace

---
 .../L2_requesting_persistent_volumes.md | 22 ++++++++-----
 .../training/L3_running_a_pytorch_task.md | 32 +++++++++++--------
 2 files changed, 33 insertions(+), 21 deletions(-)

diff --git a/docs/services/gpuservice/training/L2_requesting_persistent_volumes.md b/docs/services/gpuservice/training/L2_requesting_persistent_volumes.md
index cfd546181..d676da1ea 100644
--- a/docs/services/gpuservice/training/L2_requesting_persistent_volumes.md
+++ b/docs/services/gpuservice/training/L2_requesting_persistent_volumes.md
@@ -1,4 +1,10 @@
-# Requesting Persistent Volumes With Kubernetes
+# Requesting persistent volumes With Kubernetes
+
+## Requirements
+
+It is recommended that users complete [Getting started with Kubernetes](L1_getting_started.md) before proceeding with this tutorial.
+
+## Overview

Pods in the K8s EIDF GPU Service are intentionally ephemeral.
@@ -42,12 +48,12 @@ spec:
  storageClassName: csi-rbd-sc
```

-You create a persistent volume by passing the yaml file to kubectl like a pod specification yaml `kubectl create `
+You create a persistent volume by passing the yaml file to kubectl like a pod specification yaml `kubectl -n create `

Once you have successfully created a persistent volume you can interact with it using the standard kubectl commands:

-- `kubectl delete pvc `
-- `kubectl get pvc `
-- `kubectl apply -f `
+- `kubectl -n delete pvc `
+- `kubectl -n get pvc `
+- `kubectl -n apply -f `

## Mounting a persistent Volume to a Pod
@@ -95,7 +101,7 @@ To move files in/out of the persistent volume from outside a pod you can use the

```bash
*** On Login Node - replacing pod name with your pod name ***
-kubectl cp /home/data/test_data.csv test-ceph-pvc-job-8c9cc:/mnt/ceph_rbd
+kubectl -n cp /home/data/test_data.csv test-ceph-pvc-job-8c9cc:/mnt/ceph_rbd
```

For more complex file transfers and synchronisation, create a low resource pod with the persistent volume mounted.
@@ -105,7 +111,7 @@ The bash command rsync can be amended to manage file transfers into the mounted
## Clean up

```bash
-kubectl delete job test-ceph-pvc-job
+kubectl -n delete job test-ceph-pvc-job

-kubectl delete pvc test-ceph-pvc
+kubectl -n delete pvc test-ceph-pvc
```
diff --git a/docs/services/gpuservice/training/L3_running_a_pytorch_task.md b/docs/services/gpuservice/training/L3_running_a_pytorch_task.md
index 33dae5ffb..752ea058a 100644
--- a/docs/services/gpuservice/training/L3_running_a_pytorch_task.md
+++ b/docs/services/gpuservice/training/L3_running_a_pytorch_task.md
@@ -1,6 +1,12 @@
# Running a PyTorch task

-In the following lesson, we'll build a NLP neural network and train it using the EIDF GPU Service.
+## Requirements
+
+It is recommended that users complete [Getting started with Kubernetes](L1_getting_started.md) and [Requesting persistent volumes With Kubernetes](L3_running_a_pytorch_task.md) before proceeding with this tutorial.
+
+## Overview
+
+In the following lesson, we'll build a CNN neural network and train it using the EIDF GPU Service.

The model was taken from the [PyTorch Tutorials](https://pytorch.org/tutorials/beginner/basics/quickstart_tutorial.html).
@@ -17,7 +23,7 @@ The lesson will be split into three parts:

Request memory from the Ceph server by submitting a PVC to K8s (example pvc spec yaml below).

    ``` bash
-kubectl create -f
+kubectl -n create -f
    ```

### Example PyTorch PersistentVolumeClaim
@@ -41,13 +47,13 @@ spec:
1. Check PVC has been created

    ``` bash
-    kubectl get pvc
+    kubectl -n get pvc
    ```

1. Create a lightweight job with pod with PV mounted (example job below)

    ``` bash
-    kubectl create -f lightweight-pod-job.yaml
+    kubectl -n create -f lightweight-pod-job.yaml
    ```

1. Download the PyTorch code
@@ -59,19 +65,19 @@ spec:
1. Copy the Python script into the PV

    ``` bash
-    kubectl cp example_pytorch_code.py lightweight-job-:/mnt/ceph_rbd/
+    kubectl -n cp example_pytorch_code.py lightweight-job-:/mnt/ceph_rbd/
    ```

1. Check whether the files were transferred successfully

    ``` bash
-    kubectl exec lightweight-job- -- ls /mnt/ceph_rbd
+    kubectl -n exec lightweight-job- -- ls /mnt/ceph_rbd
    ```

1. Delete the lightweight job

    ``` bash
-    kubectl delete job lightweight-job-
+    kubectl -n delete job lightweight-job-
    ```

### Example lightweight job specification
@@ -119,7 +125,7 @@ The PyTorch container will be held within a pod that has the persistent volume m
Submit the specification file below to K8s to create the job, replacing the queue name with your project namespace queue name.

    ``` bash
-kubectl create -f
+kubectl -n create -f
    ```

### Example PyTorch Job Specification File
@@ -169,19 +175,19 @@ This is not intended to be an introduction to PyTorch, please see the [online tu
1. Check that the model ran to completion

    ``` bash
-    kubectl logs
+    kubectl -n logs
    ```

1. Spin up a lightweight pod to retrieve results

    ``` bash
-    kubectl create -f lightweight-pod-job.yaml
+    kubectl -n create -f lightweight-pod-job.yaml
    ```

1. Copy the trained model back to your access VM

    ``` bash
-    kubectl cp lightweight-job-:mnt/ceph_rbd/model.pth model.pth
+    kubectl -n cp lightweight-job-:mnt/ceph_rbd/model.pth model.pth
    ```

## Using a Kubernetes job to train the pytorch model multiple times
@@ -235,7 +241,7 @@ spec:
## Clean up

``` bash
-kubectl delete pod pytorch-job
+kubectl -n delete pod pytorch-job

-kubectl delete pvc pytorch-pvc
+kubectl -n delete pvc pytorch-pvc
```

From adc80ca12c006c4c09709884a6769af6a73a9f0f Mon Sep 17 00:00:00 2001
From: Sam Haynes
Date: Wed, 27 Mar 2024 13:31:02 +0000
Subject: [PATCH 91/91] Adds anchoring to requirements section

---
 .../gpuservice/training/L2_requesting_persistent_volumes.md | 2 +-
 docs/services/gpuservice/training/L3_running_a_pytorch_task.md | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/services/gpuservice/training/L2_requesting_persistent_volumes.md b/docs/services/gpuservice/training/L2_requesting_persistent_volumes.md
index d676da1ea..3ee9837f3 100644
--- a/docs/services/gpuservice/training/L2_requesting_persistent_volumes.md
+++ b/docs/services/gpuservice/training/L2_requesting_persistent_volumes.md
@@ -2,7 +2,7 @@

## Requirements

-It is recommended that users complete [Getting started with Kubernetes](L1_getting_started.md) before proceeding with this tutorial.
+It is recommended that users complete [Getting started with Kubernetes](../L1_getting_started/#requirements) before proceeding with this tutorial.
## Overview

diff --git a/docs/services/gpuservice/training/L3_running_a_pytorch_task.md b/docs/services/gpuservice/training/L3_running_a_pytorch_task.md
index 752ea058a..d8f211c86 100644
--- a/docs/services/gpuservice/training/L3_running_a_pytorch_task.md
+++ b/docs/services/gpuservice/training/L3_running_a_pytorch_task.md
@@ -2,7 +2,7 @@

## Requirements

-It is recommended that users complete [Getting started with Kubernetes](L1_getting_started.md) and [Requesting persistent volumes With Kubernetes](L3_running_a_pytorch_task.md) before proceeding with this tutorial.
+It is recommended that users complete [Getting started with Kubernetes](../L1_getting_started/#requirements) and [Requesting persistent volumes With Kubernetes](../L2_requesting_persistent_volumes/#requirements) before proceeding with this tutorial.

## Overview
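A minimal sketch of the namespace defaulting that the patches above document, assuming kubectl v1.24+, write access to the active kubeconfig (sudo where the shared config is root-owned), and a placeholder project namespace `eidf001ns` (substitute your own); the same default can usually be set from the command line instead of editing the file by hand:

```bash
# Sketch only: set a default namespace on the current context so the
# -n flag can be omitted. "eidf001ns" is a placeholder project namespace.
kubectl config set-context --current --namespace=eidf001ns

# Confirm the namespace now recorded for the active context.
kubectl config view --minify | grep namespace:

# Commands now default to the project namespace.
kubectl get pods
```

This writes the same `namespace:` entry into the kubeconfig context that the patches add manually, so either route should leave the context configured the same way.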