From 2541900f378e590e57dce4fbdd9d0e064a1fef77 Mon Sep 17 00:00:00 2001 From: agrant3 Date: Wed, 14 Feb 2024 12:48:35 +0000 Subject: [PATCH 1/2] AG: updated style rules to avoid issues with indent blocks. Update GPU service overview. --- .mdl_style.rb | 1 + docs/services/gpuservice/index.md | 55 ++++++++++++++++++++++--------- 2 files changed, 40 insertions(+), 16 deletions(-) diff --git a/.mdl_style.rb b/.mdl_style.rb index e1c0cd8ba..d3b4f8de3 100644 --- a/.mdl_style.rb +++ b/.mdl_style.rb @@ -1,4 +1,5 @@ all exclude_rule 'MD033' +exclude_rule 'MD046' rule 'MD013', :line_length => 500 rule 'MD026', :punctuation => '.,:;' diff --git a/docs/services/gpuservice/index.md b/docs/services/gpuservice/index.md index b44e7b7b4..d96433fb4 100644 --- a/docs/services/gpuservice/index.md +++ b/docs/services/gpuservice/index.md @@ -1,32 +1,50 @@ # Overview -The EIDF GPU Service (EIDFGPUS) uses Nvidia A100 GPUs as accelerators. +The EIDF GPU Service (EIDF GPU Service) provides access to a range of Nvidia GPUs, in both full GPU and MIG variants. The EIDF GPU Service is built upon [Kubernetes](https://kubernetes.io). -Full Nvidia A100 GPUs are connected to 40GB of dynamic memory. +MIG (Multi-instance GPU) allow a single GPU to be split into multiple isolated smaller GPUs. This means that multiple users can access a portion of the GPU without being able to access what others are running on their portion. -Multi-instance usage (MIG) GPUs allow multiple tasks or users to share the same GPU (similar to CPU threading). +The EIDF GPU Service hosts 3G.20GB and 1G.5GB MIG variants which are approximately 1/2 and 1/7 of a full Nvidia A100 40 GB GPU. -There are two types of MIG GPUs inside the EIDFGPUS the Nvidia A100 3G.20GB GPUs and the Nvidia A100 1G.5GB GPUs which equate to ~1/2 and ~1/7 of a full Nvidia A100 40 GB GPU. 
+The service provides access to: -The current specification of the EIDFGPUS is: +- Nvidia A100 40GB +- Nvidia 80GB +- Nvidia MIG A100 1G.5GB +- Nvidia MIG A100 3G.20GB +- Nvidia H100 80GB -- 1856 CPU Cores -- 8.7 TiB Memory -- Local Disk Space (Node Image Cache and Local Workspace) - 21 TiB +The current full specification of the EIDF GPU Service as of 14 February 2024: + +- 4912 CPU Cores (AMD EPYC and Intel Xeon) +- 23 TiB Memory +- Local Disk Space (Node Image Cache and Local Workspace) - 40 TiB - Ceph Persistent Volumes (Long Term Data) - up to 100TiB -- 70 Nvidia A100 40 GB GPUs -- 14 MIG Nvidia A100 40 GB GPUs equating to 28 Nvidia A100 3G.20GB GPUs -- 20 MIG Nvidia A100 40 GB GPU equating to 140 A100 1G.5GB GPUs +- 112 Nvidia A100 40 GB +- 39 Nvidia A100 80 GB +- 16 Nvidia A100 3G.20GB +- 56 Nvidia A100 1G.5GB +- 32 Nvidia H100 80 GB -The EIDFGPUS is managed using [Kubernetes](https://kubernetes.io), with up to 8 GPUs being on a single node. +!!! Quotas + This is the full configuration of the cluster. Each project will have access to a quota across this shared configuration. This quota is agreed with the EIDF Services team. ## Service Access Users should have an EIDF account - [EIDF Accounts](../../access/project.md). -Project Leads will be able to have access to the EIDFGPUS added to their project during the project application process or through a request to the EIDF helpdesk. +Project Leads will be able to request access to the EIDF GPU Service for their project either during the project application process or through a service request to the EIDF helpdesk. + +Each project will be given a namespace to operate in and the ability to add a kubeconfig file to any of their Virtual Machines in their EIDF project - information on access to VMs is [available here](../../access/virtualmachines-vdi.md). + +All EIDF virtual machines can be set up to access the EIDF GPU Service. The Virtual Machine does not require to be GPU-enabled. + +!!! 
Important + The EIDF GPU Service is a container based service which is accessed from EIDF Virtual Desktop VMs. This allows a project to access multiple GPUs of different types. -Each project will be given a namespace to operate in and a kubeconfig file in a Virtual Machine on the EIDF DSC - information on access to VMs is [available here](../../access/virtualmachines-vdi.md). + An EIDF Virtual Desktop GPU-enabled VM is be limited to a small number (1-2) of GPUs of a single type. + + Projects do not have to apply for a GPU-enabled VM to access the GPU Service. ## Project Quotas @@ -36,7 +54,12 @@ A standard project namespace has the following initial quota (subject to ongoing - Memory: 1TiB - GPU: 12 -Note these quotas are maximum use by a single project, and that during periods of high usage Kubernetes Jobs maybe queued waiting for resource to become available on the cluster. +!!! Important + A project quota is the maximum proportion of the service available for use by that project. + + During periods of high demand, Jobs will queued awaiting resource availability on the Service. + + This means that a project has access up to 12 GPUs but due to demand may only be able to access a smaller number at any given time. ## Additional Service Policy Information @@ -44,7 +67,7 @@ Additional information on service policies can be found [here](policies.md). ## EIDF GPU Service Tutorial -This tutorial teaches users how to submit tasks to the EIDFGPUS, but it is not a comprehensive overview of Kubernetes. +This tutorial teaches users how to submit tasks to the EIDF GPU Service, but it is not a comprehensive overview of Kubernetes. 
| Lesson | Objective | |-----------------------------------|-------------------------------------| From d24b745ed9856f7a4ca44d48381f26eb22fe6c85 Mon Sep 17 00:00:00 2001 From: agrant3 Date: Wed, 14 Feb 2024 16:57:48 +0000 Subject: [PATCH 2/2] AG: pre-commit failures corrected in cs2 and ultra2 and gpuservice latest --- docs/services/cs2/run.md | 8 +- docs/services/gpuservice/faq.md | 6 +- docs/services/gpuservice/index.md | 30 +- docs/services/gpuservice/kueue.md | 450 ++++++++++++++++++ docs/services/gpuservice/policies.md | 25 +- .../gpuservice/training/L1_getting_started.md | 266 +++++++---- .../L2_requesting_persistent_volumes.md | 74 +-- .../training/L3_running_a_pytorch_task.md | 214 +++++---- docs/services/ultra2/run.md | 1 - mkdocs.yml | 3 +- 10 files changed, 834 insertions(+), 243 deletions(-) create mode 100644 docs/services/gpuservice/kueue.md diff --git a/docs/services/cs2/run.md b/docs/services/cs2/run.md index e6c00a791..46b9ec3a6 100644 --- a/docs/services/cs2/run.md +++ b/docs/services/cs2/run.md @@ -62,7 +62,7 @@ source venv_cerebras_pt/bin/activate cerebras_install_check ``` -### Modify venv files to remove clock sync check on EPCC system. +### Modify venv files to remove clock sync check on EPCC system Cerebras are aware of this issue and are working on a fix, however in the mean time follow the below workaround: @@ -91,7 +91,7 @@ if modified_time > self._last_modified: ) ``` -### Comment out the whole section +### Comment out the section `if modified_time > self._last_modified` ```python #if modified_time > self._last_modified: @@ -123,7 +123,7 @@ The section should look like this: ) ``` -### Comment out the whole section +### Comment out the section `if stat.st_mtime_ns > self._stat.st_mtime_ns` ```python #if stat.st_mtime_ns > self._stat.st_mtime_ns: @@ -138,7 +138,7 @@ The section should look like this: ### Save the file -### Run jobs as per existing documentation. 
+### Run jobs as per existing documentation ## Paths, PYTHONPATH and mount_dirs diff --git a/docs/services/gpuservice/faq.md b/docs/services/gpuservice/faq.md index e91502968..456870b7a 100644 --- a/docs/services/gpuservice/faq.md +++ b/docs/services/gpuservice/faq.md @@ -16,7 +16,7 @@ The current PVC provisioner is based on Ceph RBD. The block devices provided by ### How many GPUs can I use in a pod? -The current limit is 8 GPUs per pod. Each underlying host has 8 GPUs. +The current limit is 8 GPUs per pod. Each underlying host node has either 4 or 8 GPUs. If you request 8 GPUs, your job will be queued until a node with 8 free GPUs is available, which may mean waiting for other jobs to finish. If you request 4 GPUs, your job could run on a node with either 4 or 8 GPUs. ### Why did a validation error occur when submitting a pod or job with a valid specification file? @@ -76,3 +76,7 @@ Example fragment for a Bash command start: - '-c' - '--' ``` + +### My Job requesting a large number of GPUs takes a long time to be scheduled + +A Job that requests a large number of GPUs may require an entire node to be free, and this could take some time to become available. The default scheduling algorithm in the queues is Best Effort FIFO, which means that large jobs will not block small jobs from running if there is sufficient quota and space available. diff --git a/docs/services/gpuservice/index.md b/docs/services/gpuservice/index.md index d96433fb4..7dde82aaf 100644 --- a/docs/services/gpuservice/index.md +++ b/docs/services/gpuservice/index.md @@ -4,7 +4,7 @@ The EIDF GPU Service (EIDF GPU Service) provides access to a range of Nvidia GPU MIG (Multi-instance GPU) allow a single GPU to be split into multiple isolated smaller GPUs. This means that multiple users can access a portion of the GPU without being able to access what others are running on their portion. -The EIDF GPU Service hosts 3G.20GB and 1G.5GB MIG variants which are approximately 1/2 and 1/7 of a full Nvidia A100 40 GB GPU.
+The EIDF GPU Service hosts 3G.20GB and 1G.5GB MIG variants which are approximately 1/2 and 1/7 of a full Nvidia A100 40 GB GPU respectively. The service provides access to: @@ -26,23 +26,27 @@ The current full specification of the EIDF GPU Service as of 14 February 2024: - 56 Nvidia A100 1G.5GB - 32 Nvidia H100 80 GB -!!! Quotas - This is the full configuration of the cluster. Each project will have access to a quota across this shared configuration. This quota is agreed with the EIDF Services team. +!!! important "Quotas" + This is the full configuration of the cluster. + + Each project will have access to a quota across this shared configuration. + + Changes to the default quota must be discussed and agreed with the EIDF Services team. ## Service Access -Users should have an EIDF account - [EIDF Accounts](../../access/project.md). +Users should have an [EIDF Account](../../access/project.md). Project Leads will be able to request access to the EIDF GPU Service for their project either during the project application process or through a service request to the EIDF helpdesk. -Each project will be given a namespace to operate in and the ability to add a kubeconfig file to any of their Virtual Machines in their EIDF project - information on access to VMs is [available here](../../access/virtualmachines-vdi.md). +Each project will be given a namespace to operate in and the ability to add a kubeconfig file to any of their Virtual Machines in their EIDF project - information on access to VMs is available [here](../../access/virtualmachines-vdi.md). All EIDF virtual machines can be set up to access the EIDF GPU Service. The Virtual Machine does not require to be GPU-enabled. -!!! Important +!!! important "EIDF GPU Service vs EIDF GPU-Enabled VMs" The EIDF GPU Service is a container based service which is accessed from EIDF Virtual Desktop VMs. This allows a project to access multiple GPUs of different types. 
- An EIDF Virtual Desktop GPU-enabled VM is be limited to a small number (1-2) of GPUs of a single type. + An EIDF Virtual Desktop GPU-enabled VM is limited to a small number (1-2) of GPUs of a single type. Projects do not have to apply for a GPU-enabled VM to access the GPU Service. @@ -54,13 +58,17 @@ A standard project namespace has the following initial quota (subject to ongoing - Memory: 1TiB - GPU: 12 -!!! Important +!!! important "Quota is a maximum on a Shared Resource" A project quota is the maximum proportion of the service available for use by that project. - During periods of high demand, Jobs will queued awaiting resource availability on the Service. + During periods of high demand, Jobs will be queued awaiting resource availability on the Service. This means that a project has access up to 12 GPUs but due to demand may only be able to access a smaller number at any given time. +## Project Queues + +EIDF GPU Service is introducing the Kueue system in February 2024. The use of this is detailed in the [Kueue](kueue.md). + ## Additional Service Policy Information Additional information on service policies can be found [here](policies.md). @@ -79,6 +87,6 @@ This tutorial teaches users how to submit tasks to the EIDF GPU Service, but it - The [Nvidia developers blog](https://developer.nvidia.com/blog/search-posts/?q=Kubernetes) provides several examples of how to run ML tasks on a Kubernetes GPU cluster. -- Kubernetes documentation has a useful [kubectl cheat sheet](https://kubernetes.io/docs/reference/kubectl/cheatsheet/#viewing-and-finding-resources) +- Kubernetes documentation has a useful [kubectl cheat sheet](https://kubernetes.io/docs/reference/kubectl/cheatsheet/#viewing-and-finding-resources). 
-- More detailed use cases for the `kubectl` can be found in the [Kubernetes documentation](https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#run) +- More detailed use cases for `kubectl` can be found in the [Kubernetes documentation](https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#run). diff --git a/docs/services/gpuservice/kueue.md b/docs/services/gpuservice/kueue.md new file mode 100644 index 000000000..55a614564 --- /dev/null +++ b/docs/services/gpuservice/kueue.md @@ -0,0 +1,450 @@ +# Kueue + +## Overview + +[Kueue](https://kueue.sigs.k8s.io/docs/overview/) is a native Kubernetes quota and job management system. + +This is the job queue system for the EIDF GPU Service from February 2024. + +All users should submit jobs to their local namespace user queue. This queue will have the name `eidf project namespace`-user-queue. + +### Changes to Job Specs + +Jobs can be submitted as before but will require the addition of a metadata label: + +```yaml + labels: + kueue.x-k8s.io/queue-name: -user-queue +``` + +This is the only change required to make Jobs work with Kueue. A policy will be in place that rejects Jobs without this label.
+ +## Useful commands for looking at your local queue + +### `kubectl get queue` + +This command will output the high level status of your namespace queue with the number of workloads currently running and the number waiting to start: + +```bash +NAME CLUSTERQUEUE PENDING WORKLOADS ADMITTED WORKLOADS +eidf001-user-queue eidf001-project-gpu-cq 0 2 +``` + +### `kubectl describe queue ` + +This command will output more detailed information on the current resource usage in your queue: + +```bash +Name: eidf001-user-queue +Namespace: eidf001 +Labels: +Annotations: +API Version: kueue.x-k8s.io/v1beta1 +Kind: LocalQueue +Metadata: + Creation Timestamp: 2024-02-06T13:06:23Z + Generation: 1 + Managed Fields: + API Version: kueue.x-k8s.io/v1beta1 + Fields Type: FieldsV1 + fieldsV1: + f:spec: + .: + f:clusterQueue: + Manager: kubectl-create + Operation: Update + Time: 2024-02-06T13:06:23Z + API Version: kueue.x-k8s.io/v1beta1 + Fields Type: FieldsV1 + fieldsV1: + f:status: + .: + f:admittedWorkloads: + f:conditions: + .: + k:{"type":"Active"}: + .: + f:lastTransitionTime: + f:message: + f:reason: + f:status: + f:type: + f:flavorUsage: + .: + k:{"name":"default-flavor"}: + .: + f:name: + f:resources: + .: + k:{"name":"cpu"}: + .: + f:name: + f:total: + k:{"name":"memory"}: + .: + f:name: + f:total: + k:{"name":"gpu-a100"}: + .: + f:name: + f:resources: + .: + k:{"name":"nvidia.com/gpu"}: + .: + f:name: + f:total: + k:{"name":"gpu-a100-1g"}: + .: + f:name: + f:resources: + .: + k:{"name":"nvidia.com/gpu"}: + .: + f:name: + f:total: + k:{"name":"gpu-a100-3g"}: + .: + f:name: + f:resources: + .: + k:{"name":"nvidia.com/gpu"}: + .: + f:name: + f:total: + k:{"name":"gpu-a100-80"}: + .: + f:name: + f:resources: + .: + k:{"name":"nvidia.com/gpu"}: + .: + f:name: + f:total: + f:flavorsReservation: + .: + k:{"name":"default-flavor"}: + .: + f:name: + f:resources: + .: + k:{"name":"cpu"}: + .: + f:name: + f:total: + k:{"name":"memory"}: + .: + f:name: + f:total: + k:{"name":"gpu-a100"}: 
+ .: + f:name: + f:resources: + .: + k:{"name":"nvidia.com/gpu"}: + .: + f:name: + f:total: + k:{"name":"gpu-a100-1g"}: + .: + f:name: + f:resources: + .: + k:{"name":"nvidia.com/gpu"}: + .: + f:name: + f:total: + k:{"name":"gpu-a100-3g"}: + .: + f:name: + f:resources: + .: + k:{"name":"nvidia.com/gpu"}: + .: + f:name: + f:total: + k:{"name":"gpu-a100-80"}: + .: + f:name: + f:resources: + .: + k:{"name":"nvidia.com/gpu"}: + .: + f:name: + f:total: + f:pendingWorkloads: + f:reservingWorkloads: + Manager: kueue + Operation: Update + Subresource: status + Time: 2024-02-14T10:54:20Z + Resource Version: 333898946 + UID: bca097e2-6c55-4305-86ac-d1bd3c767751 +Spec: + Cluster Queue: eidf001-project-gpu-cq +Status: + Admitted Workloads: 2 + Conditions: + Last Transition Time: 2024-02-06T13:06:23Z + Message: Can submit new workloads to clusterQueue + Reason: Ready + Status: True + Type: Active + Flavor Usage: + Name: gpu-a100 + Resources: + Name: nvidia.com/gpu + Total: 2 + Name: gpu-a100-3g + Resources: + Name: nvidia.com/gpu + Total: 0 + Name: gpu-a100-1g + Resources: + Name: nvidia.com/gpu + Total: 0 + Name: gpu-a100-80 + Resources: + Name: nvidia.com/gpu + Total: 0 + Name: default-flavor + Resources: + Name: cpu + Total: 16 + Name: memory + Total: 256Gi + Flavors Reservation: + Name: gpu-a100 + Resources: + Name: nvidia.com/gpu + Total: 2 + Name: gpu-a100-3g + Resources: + Name: nvidia.com/gpu + Total: 0 + Name: gpu-a100-1g + Resources: + Name: nvidia.com/gpu + Total: 0 + Name: gpu-a100-80 + Resources: + Name: nvidia.com/gpu + Total: 0 + Name: default-flavor + Resources: + Name: cpu + Total: 16 + Name: memory + Total: 256Gi + Pending Workloads: 0 + Reserving Workloads: 2 +Events: +``` + +### `kubectl get workloads` + +This command will return the list of workloads in the queue: + +```bash +NAME QUEUE ADMITTED BY AGE +job-jobtest-366ab eidf001-user-queue eidf001-project-gpu-cq 4h45m +job-jobtest-34ba9 eidf001-user-queue eidf001-project-gpu-cq 6h48m +``` + +### `kubectl 
describe workload ` + +This command will return a detailed summary of the workload including status and resource usage: + +```bash +Name: job-pytorch-job-0b664 +Namespace: t4 +Labels: kueue.x-k8s.io/job-uid=33bc1e48-4dca-4252-9387-bf68b99759dc +Annotations: +API Version: kueue.x-k8s.io/v1beta1 +Kind: Workload +Metadata: + Creation Timestamp: 2024-02-14T15:22:16Z + Generation: 2 + Managed Fields: + API Version: kueue.x-k8s.io/v1beta1 + Fields Type: FieldsV1 + fieldsV1: + f:status: + f:admission: + f:clusterQueue: + f:podSetAssignments: + k:{"name":"main"}: + .: + f:count: + f:flavors: + f:cpu: + f:memory: + f:nvidia.com/gpu: + f:name: + f:resourceUsage: + f:cpu: + f:memory: + f:nvidia.com/gpu: + f:conditions: + k:{"type":"Admitted"}: + .: + f:lastTransitionTime: + f:message: + f:reason: + f:status: + f:type: + k:{"type":"QuotaReserved"}: + .: + f:lastTransitionTime: + f:message: + f:reason: + f:status: + f:type: + Manager: kueue-admission + Operation: Apply + Subresource: status + Time: 2024-02-14T15:22:16Z + API Version: kueue.x-k8s.io/v1beta1 + Fields Type: FieldsV1 + fieldsV1: + f:status: + f:conditions: + k:{"type":"Finished"}: + .: + f:lastTransitionTime: + f:message: + f:reason: + f:status: + f:type: + Manager: kueue-job-controller-Finished + Operation: Apply + Subresource: status + Time: 2024-02-14T15:25:06Z + API Version: kueue.x-k8s.io/v1beta1 + Fields Type: FieldsV1 + fieldsV1: + f:metadata: + f:labels: + .: + f:kueue.x-k8s.io/job-uid: + f:ownerReferences: + .: + k:{"uid":"33bc1e48-4dca-4252-9387-bf68b99759dc"}: + f:spec: + .: + f:podSets: + .: + k:{"name":"main"}: + .: + f:count: + f:name: + f:template: + .: + f:metadata: + .: + f:labels: + .: + f:controller-uid: + f:job-name: + f:name: + f:spec: + .: + f:containers: + f:dnsPolicy: + f:nodeSelector: + f:restartPolicy: + f:schedulerName: + f:securityContext: + f:terminationGracePeriodSeconds: + f:volumes: + f:priority: + f:priorityClassSource: + f:queueName: + Manager: kueue + Operation: Update + Time: 
2024-02-14T15:22:16Z + Owner References: + API Version: batch/v1 + Block Owner Deletion: true + Controller: true + Kind: Job + Name: pytorch-job + UID: 33bc1e48-4dca-4252-9387-bf68b99759dc + Resource Version: 270812029 + UID: 8cfa93ba-1142-4728-bc0c-e8de817e8151 +Spec: + Pod Sets: + Count: 1 + Name: main + Template: + Metadata: + Labels: + Controller - UID: 33bc1e48-4dca-4252-9387-bf68b99759dc + Job - Name: pytorch-job + Name: pytorch-pod + Spec: + Containers: + Args: + /mnt/ceph_rbd/example_pytorch_code.py + Command: + python3 + Image: pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel + Image Pull Policy: IfNotPresent + Name: pytorch-con + Resources: + Limits: + Cpu: 4 + Memory: 4Gi + nvidia.com/gpu: 1 + Requests: + Cpu: 2 + Memory: 1Gi + Termination Message Path: /dev/termination-log + Termination Message Policy: File + Volume Mounts: + Mount Path: /mnt/ceph_rbd + Name: volume + Dns Policy: ClusterFirst + Node Selector: + nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB + Restart Policy: Never + Scheduler Name: default-scheduler + Security Context: + Termination Grace Period Seconds: 30 + Volumes: + Name: volume + Persistent Volume Claim: + Claim Name: pytorch-pvc + Priority: 0 + Priority Class Source: + Queue Name: t4-user-queue +Status: + Admission: + Cluster Queue: project-cq + Pod Set Assignments: + Count: 1 + Flavors: + Cpu: default-flavor + Memory: default-flavor + nvidia.com/gpu: gpu-a100 + Name: main + Resource Usage: + Cpu: 2 + Memory: 1Gi + nvidia.com/gpu: 1 + Conditions: + Last Transition Time: 2024-02-14T15:22:16Z + Message: Quota reserved in ClusterQueue project-cq + Reason: QuotaReserved + Status: True + Type: QuotaReserved + Last Transition Time: 2024-02-14T15:22:16Z + Message: The workload is admitted + Reason: Admitted + Status: True + Type: Admitted + Last Transition Time: 2024-02-14T15:25:06Z + Message: Job finished successfully + Reason: JobFinished + Status: True + Type: Finished +``` diff --git a/docs/services/gpuservice/policies.md 
b/docs/services/gpuservice/policies.md index b083965de..5587d223f 100644 --- a/docs/services/gpuservice/policies.md +++ b/docs/services/gpuservice/policies.md @@ -16,12 +16,29 @@ Each project will be assigned a kubeconfig file for access to the service which ## Kubernetes Job Time to Live -All Kubernetes Jobs submitted to the service will have a Time to Live (TTL) applied via "spec.ttlSecondsAfterFinished" automatically. The default TTL for jobs using the service will be 1 week (604800 seconds). A completed job (in success or error state) will be deleted from the service once one week has elapsed after execution has completed. This will reduce excessive object accumulation on the service. +All Kubernetes Jobs submitted to the service will have a Time to Live (TTL) applied via `spec.ttlSecondsAfterFinished` automatically. The default TTL for jobs using the service will be 1 week (604800 seconds). A completed job (in success or error state) will be deleted from the service once one week has elapsed after execution has completed. This will reduce excessive object accumulation on the service. -Note: This policy is automated and does not require users to change their job specifications. +!!! important + This policy is automated and does not require users to change their job specifications. ## Kubernetes Active Deadline Seconds -All Kubernetes User Pods submitted to the service will have an Active Deadline Seconds (ADS) applied via "spec.spec.activeDeadlineSeconds" automatically. The default ADS for pods using the service will be 5 days (432000 seconds). A pod will be terminated 5 days after execution has begun. This will reduce the number of unused pods remaining on the service. +All Kubernetes User Pods submitted to the service will have an Active Deadline Seconds (ADS) applied via `spec.spec.activeDeadlineSeconds` automatically. The default ADS for pods using the service will be 5 days (432000 seconds). A pod will be terminated 5 days after execution has begun.
This will reduce the number of unused pods remaining on the service. -Note: This policy is automated and does not require users to change their job or pod specifications. +!!! important + This policy is automated and does not require users to change their job or pod specifications. + +## Kueue + +All jobs will be managed through the Kueue scheduling system. All pods will be required to be owned by a Kubernetes workload. + +Each project will have a local user queue in their namespace. This will provide access to their cluster queue. To enable the use of the queue in your job definitions, the following will need to be added to the job specification file as part of the metadata: + +```yaml + labels: + kueue.x-k8s.io/queue-name: -user-queue +``` + +Jobs without this queue name tag will be rejected. + +Pods bypassing the queue system will be deleted. diff --git a/docs/services/gpuservice/training/L1_getting_started.md b/docs/services/gpuservice/training/L1_getting_started.md index eef9015c6..9ebd1bea7 100644 --- a/docs/services/gpuservice/training/L1_getting_started.md +++ b/docs/services/gpuservice/training/L1_getting_started.md @@ -2,14 +2,14 @@ ## Introduction -Kubernetes (K8s) is a systems administration tool originally developed by Google to orchestrate the deployment, scaling, and management of containerised applications. +Kubernetes (K8s) is a container orchestration system, originally developed by Google, for the deployment, scaling, and management of containerised applications. -Nvidia have created drivers to officially support clusters of Nvidia GPUs managed by K8s. +Nvidia GPUs are supported through K8s native Nvidia GPU Operators. -Using K8s to manage the EIDFGPUS provides two key advantages: +The use of K8s to manage the EIDF GPU Service provides two key advantages: -- native support for containers enabling reproducible analysis whilst minimising demand on system admin. 
-- automated resource allocation for GPUs and storage volumes that are shared across multiple users. +- support for containers enabling reproducible analysis whilst minimising demand on system admin. +- automated resource allocation management for GPUs and storage volumes that are shared across multiple users. ## Interacting with a K8s cluster @@ -23,97 +23,174 @@ Users define the resource requirements of a pod (i.e. number/type of GPU) and th The pod definition yaml file is sent to the cluster using the K8s API and is assigned to an appropriate node to be ran. -A node is a unit of the cluster, e.g. a group of GPUs or virtual GPUs. +A node is a part of the cluster such as a physical or virtual host which exposes CPU, Memory and GPUs. Multiple pods can be defined and maintained using several different methods depending on purpose: [deployments](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/), [services](https://kubernetes.io/docs/concepts/services-networking/service/) and [jobs](https://kubernetes.io/docs/concepts/workloads/controllers/job/); see the K8s docs for more details. Users interact with the K8s API using the `kubectl` (short for kubernetes control) commands. + Some of the kubectl commands are restricted on the EIDF cluster in order to ensure project details are not shared across namespaces. + Useful commands are: -- `kubectl create -f `: Create a new pod with requested resources. Returns an error if a pod with the same name already exists. -- `kubectl apply -f `: Create a new pod with requested resources. If a pod with the same name already exists it updates that pod with the new resource/container requirements outlined in the yaml. +- `kubectl create -f `: Create a new job with requested resources. Returns an error if a job with the same name already exists. +- `kubectl apply -f `: Create a new job with requested resources. 
If a job with the same name already exists it updates that job with the new resource/container requirements outlined in the yaml. - `kubectl delete pod `: Delete a pod from the cluster. -- `kubectl get pods`: Summarise all pods the users has active (or queued). -- `kubectl describe pods`: Verbose description of all pods the users has active (or queued). +- `kubectl get pods`: Summarise all pods the namespace has active (or pending). +- `kubectl describe pods`: Verbose description of all pods the namespace has active (or pending). +- `kubectl describe pod `: Verbose summary of the specified pod. - `kubectl logs `: Retrieve the log files associated with a running pod. +- `kubectl get jobs`: List all jobs the namespace has active (or pending). +- `kubectl describe job `: Verbose summary of the specified job. +- `kubectl delete job `: Delete a job from the cluster. -## Creating your first pod +## Creating your first job -Nvidia have several prebuilt docker images to perform different tasks on their GPU hardware. +To access the GPUs on the service, it is recommended to start with one of the prebuilt container images provided by Nvidia. These images are intended to perform different tasks using Nvidia GPUs. -The list of docker images is available on their [website](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/k8s/containers/cuda-sample/tags). +The list of Nvidia images is available on their [website](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/k8s/containers/cuda-sample/tags). -This example uses their CUDA sample code simulating nbody interactions. +The following example uses their CUDA sample code simulating nbody interactions. 1. Open an editor of your choice and create the file test_NBody.yml -1. Copy the following in to the file: +1. Copy the following into the file, replacing `namespace-user-queue` with -user-queue, e.g.
eidf001ns-user-queue: + + ``` yaml + apiVersion: batch/v1 + kind: Job + metadata: + generateName: jobtest- + labels: + kueue.x-k8s.io/queue-name: namespace-user-queue + spec: + completions: 1 + template: + metadata: + name: job-test + spec: + containers: + - name: cudasample + image: nvcr.io/nvidia/k8s/cuda-sample:nbody-cuda11.7.1 + args: ["-benchmark", "-numbodies=512000", "-fp64", "-fullscreen"] + resources: + requests: + cpu: 2 + memory: '1Gi' + limits: + cpu: 2 + memory: '4Gi' + nvidia.com/gpu: 1 + restartPolicy: Never + ``` -The pod resources are defined with the `requests` and `limits` tags. + The pod resources are defined under the `resources` tags using the `requests` and `limits` tags. -Resources defined in the `requests` tags are the minimum possible resources required for the pod to run. + Resources defined under the `requests` tags are the reserved resources required for the pod to be scheduled. -If a pod is assigned to an unused node then it may use resources beyond those requested. + If a pod is assigned to a node with unused resources then it may burst up to use resources beyond those requested. -This may allow the task within the pod to run faster, but it also runs the risk of unnecessarily blocking off resources for future pod requests. + This may allow the task within the pod to run faster, but it will also throttle back down when further pods are scheduled to the node. -The `limits` tag specifies the maximum resources that can be assigned to a pod. + The `limits` tag specifies the maximum resources that can be assigned to a pod. -The EIDFGPUS cluster requires all pods to have `requests` and `limits` tags for cpu and memory resources in order to be accepted. + The EIDF GPU Service requires all pods have `requests` and `limits` tags for CPU and memory defined in order to be accepted. -Finally, it optional to define GPU resources but only the `limits` tag is used to specify the use of a GPU, `limits: nvidia.com/gpu: 1`. 
+ GPU resource requests are optional and only an entry under the `limits` tag is needed to specify the use of a GPU, `nvidia.com/gpu: 1`. Without this no GPU will be available to the pod. -``` yaml -apiVersion: v1 -kind: Pod -metadata: - generateName: first-pod- -spec: - restartPolicy: OnFailure - containers: - - name: cudasample - image: nvcr.io/nvidia/k8s/cuda-sample:nbody-cuda11.7.1 - args: ["-benchmark", "-numbodies=512000", "-fp64", "-fullscreen"] - resources: - requests: - cpu: 2 - memory: "1Gi" - limits: - cpu: 4 - memory: "4Gi" - nvidia.com/gpu: 1 -``` + The label `kueue.x-k8s.io/queue-name` specifies the queue you are submitting your job to. This is part of the Kueue system in operation on the service to allow for improved resource management for users. 1. Save the file and exit the editor -1. Run `kubectl create -f test_NBody.yml' +1. Run `kubectl create -f test_NBody.yml` 1. This will output something like: ``` bash - pod/first-pod-7gdtb created + job.batch/jobtest-b92qg created + ``` + +1. Run `kubectl get jobs` +1. This will output something like: + + ```bash + NAME COMPLETIONS DURATION AGE + jobtest-b92qg 3/3 48s 6m27s + jobtest-d45sr 5/5 15m 22h + jobtest-kwmwk 3/3 48s 29m + jobtest-kw22k 1/1 48s 29m + ``` + + This displays all the jobs in the current namespace, starting with their name, number of completions against required completions, duration and age. + +1. Describe your job using the command `kubectl describe job jobtest-b92qg`, replacing the job name with your job name. +1.
This will output something like: + + ```bash + Name: jobtest-b92qg + Namespace: t4 + Selector: controller-uid=d3233fee-794e-466f-9655-1fe32d1f06d3 + Labels: kueue.x-k8s.io/queue-name=t4-user-queue + Annotations: batch.kubernetes.io/job-tracking: + Parallelism: 1 + Completions: 3 + Completion Mode: NonIndexed + Start Time: Wed, 14 Feb 2024 14:07:44 +0000 + Completed At: Wed, 14 Feb 2024 14:08:32 +0000 + Duration: 48s + Pods Statuses: 0 Active (0 Ready) / 3 Succeeded / 0 Failed + Pod Template: + Labels: controller-uid=d3233fee-794e-466f-9655-1fe32d1f06d3 + job-name=jobtest-b92qg + Containers: + cudasample: + Image: nvcr.io/nvidia/k8s/cuda-sample:nbody-cuda11.7.1 + Port: + Host Port: + Args: + -benchmark + -numbodies=512000 + -fp64 + -fullscreen + Limits: + cpu: 2 + memory: 4Gi + nvidia.com/gpu: 1 + Requests: + cpu: 2 + memory: 1Gi + Environment: + Mounts: + Volumes: + Events: + Type Reason Age From Message + ---- ------ ---- ---- ------- + Normal Suspended 8m1s job-controller Job suspended + Normal CreatedWorkload 8m1s batch/job-kueue-controller Created Workload: t4/job-jobtest-b92qg-3b890 + Normal Started 8m1s batch/job-kueue-controller Admitted by clusterQueue project-cq + Normal SuccessfulCreate 8m job-controller Created pod: jobtest-b92qg-lh64s + Normal Resumed 8m job-controller Job resumed + Normal SuccessfulCreate 7m44s job-controller Created pod: jobtest-b92qg-xhvdm + Normal SuccessfulCreate 7m28s job-controller Created pod: jobtest-b92qg-lvmrf + Normal Completed 7m12s job-controller Job completed ``` 1. Run `kubectl get pods` 1. 
This will output something like: ``` bash - pi-tt9kq 0/1 Completed 0 24h - first-pod-24n7n 0/1 Completed 0 24h - first-pod-2j5tc 0/1 Completed 0 24h - first-pod-2kjbx 0/1 Completed 0 24h - sample-2mnvg 0/1 Completed 0 24h - sample-4sng2 0/1 Completed 0 24h - sample-5h6sr 0/1 Completed 0 24h - sample-6bqql 0/1 Completed 0 24h - first-pod-7gdtb 0/1 Completed 0 39s - sample-8dnht 0/1 Completed 0 24h - sample-8pxz4 0/1 Completed 0 24h - sample-bphjx 0/1 Completed 0 24h - sample-cp97f 0/1 Completed 0 24h - sample-gcbbb 0/1 Completed 0 24h - sample-hdlrr 0/1 Completed 0 24h + NAME READY STATUS RESTARTS AGE + jobtest-b92qg-lh64s 0/1 Completed 0 11m + jobtest-b92qg-lvmrf 0/1 Completed 0 10m + jobtest-b92qg-xhvdm 0/1 Completed 0 10m + jobtest-d45sr-8tf4d 0/1 Completed 0 22h + jobtest-d45sr-jjhgg 0/1 Completed 0 22h + jobtest-d45sr-n5w6c 0/1 Completed 0 22h + jobtest-d45sr-v9p4j 0/1 Completed 0 22h + jobtest-d45sr-xgq5s 0/1 Completed 0 22h + jobtest-kwmwk-cgwmf 0/1 Completed 0 33m + jobtest-kwmwk-mttdw 0/1 Completed 0 33m + jobtest-kwmwk-r2q9h 0/1 Completed 0 33m ``` -1. View the logs of the pod you ran `kubectl logs first-pod-7gdtb` +1. View the logs of a pod from the job you ran `kubectl logs jobtest-b92qg-lh64s` - note that the pods for the job in this case start with the job name. 1. This will output something like: ``` bash @@ -144,65 +221,76 @@ spec: = 7439.679 double-precision GFLOP/s at 30 flops per interaction ``` -1. delete your pod with `kubectl delete pod first-pod-7gdtb` +1. Delete your job with `kubectl delete job jobtest-b92qg` - this will delete the associated pods as well. ## Specifying GPU requirements -If you create multiple pods with the same yaml file and compare their log files you may notice the CUDA device may differ from `Compute 8.0 CUDA device: [NVIDIA A100-SXM4-40GB]`. 
+If you create multiple jobs with the same definition file and compare their log files you may notice the CUDA device may differ from `Compute 8.0 CUDA device: [NVIDIA A100-SXM4-40GB]`. -This is because K8s is allocating the pod to any free node irrespective of whether that node contains a full 80GB Nvida A100 or a GPU from a MIG Nvida A100. +The GPU Operator on K8s is allocating the pod to the first node with a GPU free that matches the other resource specifications irrespective of what GPU type is present on the node. -The GPU resource request can be more specific by adding the type of product the pod is requesting to the node selector: +The GPU resource requests can be made more specific by adding the type of GPU product the pod is requesting to the node selector: - `nvidia.com/gpu.product: 'NVIDIA-A100-SXM4-80GB'` - `nvidia.com/gpu.product: 'NVIDIA-A100-SXM4-40GB'` - `nvidia.com/gpu.product: 'NVIDIA-A100-SXM4-40GB-MIG-3g.20gb'` - `nvidia.com/gpu.product: 'NVIDIA-A100-SXM4-40GB-MIG-1g.5gb'` +- `nvidia.com/gpu.product: 'NVIDIA-H100-80GB-HBM3'` ### Example yaml file -``` yaml -apiVersion: v1 -kind: Pod +```yaml + +apiVersion: batch/v1 +kind: Job metadata: - generateName: first-pod- + generateName: jobtest- + labels: + kueue.x-k8s.io/queue-name: namespace-user-queue spec: - restartPolicy: OnFailure - containers: - - name: cudasample - image: nvcr.io/nvidia/k8s/cuda-sample:nbody-cuda11.7.1 - args: ["-benchmark", "-numbodies=512000", "-fp64", "-fullscreen"] - resources: - requests: - cpu: 2 - memory: "1Gi" - limits: - cpu: 4 - memory: "4Gi" - nvidia.com/gpu: 1 - nodeSelector: - nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB-MIG-1g.5gb + completions: 1 + template: + metadata: + name: job-test + spec: + containers: + - name: cudasample + image: nvcr.io/nvidia/k8s/cuda-sample:nbody-cuda11.7.1 + args: ["-benchmark", "-numbodies=512000", "-fp64", "-fullscreen"] + resources: + requests: + cpu: 2 + memory: '1Gi' + limits: + cpu: 2 + memory: '4Gi' + nvidia.com/gpu: 1
+ restartPolicy: Never + nodeSelector: + nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB-MIG-1g.5gb ``` ## Running multiple pods with K8s jobs -A typical use case of the EIDFGPUS cluster will not consist of sending pod requests directly to Kubernetes. - -Instead, users will use a job request which wraps around a pod specification and provide several useful attributes. +The recommended use of the EIDF GPU Service is to use a job request which wraps around a pod specification and provides several useful attributes. Firstly, if a pod is assigned to a node that dies then the pod itself will fail and the user has to manually restart it. -Wrapping a pod within a job enables the self-healing mechanism within K8s so that if a node dies with the job's pod on it then the job will find a new node to automatically restart the pod. +Wrapping a pod within a job enables the self-healing mechanism within K8s so that if a node dies with the job's pod on it then the job will find a new node to automatically restart the pod, if the restartPolicy is set. + +Jobs allow users to define multiple pods that can run in parallel or series and will continue to spawn pods until a specific number of pods successfully terminate. -Furthermore, jobs allow users to define multiple pods that can run in parallel or series and will continue to spawn pods until a specific number of pods successfully terminate. +Jobs allow for better scheduling of resources using the Kueue service implemented on the EIDF GPU Service. Pods which attempt to bypass the queue mechanism this provides will affect the experience of other project users. -See below for an example K8s pod that requires three pods to successfully complete the example CUDA code before the job itself ends. +See below for an example K8s job that requires three pods to successfully complete the example CUDA code before the job itself ends.
``` yaml apiVersion: batch/v1 kind: Job metadata: generateName: jobtest- + labels: + kueue.x-k8s.io/queue-name: namespace-user-queue spec: completions: 3 parallelism: 1 diff --git a/docs/services/gpuservice/training/L2_requesting_persistent_volumes.md b/docs/services/gpuservice/training/L2_requesting_persistent_volumes.md index f99a0527b..cfd546181 100644 --- a/docs/services/gpuservice/training/L2_requesting_persistent_volumes.md +++ b/docs/services/gpuservice/training/L2_requesting_persistent_volumes.md @@ -1,6 +1,6 @@ # Requesting Persistent Volumes With Kubernetes -Pods in the K8s EIDFGPUS are intentionally ephemeral. +Pods in the K8s EIDF GPU Service are intentionally ephemeral. They only last as long as required to complete the task that they were created for. @@ -10,9 +10,9 @@ However, this means the default storage volumes within a pod are temporary. If multiple pods require access to the same large data set or they output large files, then computationally costly file transfers need to be included in every pod instance. -Instead, K8s allows you to request persistent volumes that can be mounted to multiple pods to share files or collate outputs. +K8s allows you to request persistent volumes that can be mounted to multiple pods to share files or collate outputs. -These persistent volumes will remain even if the pods it is mounted to are deleted, are updated or crash. +These persistent volumes will remain even if the pods they are mounted to are deleted, are updated or crash. ## Submitting a Persistent Volume Claim @@ -20,11 +20,11 @@ Before a persistent volume can be mounted to a pod, the required storage resourc A PersistentVolumeClaim (PVC) needs to be submitted to K8s to request the storage resources. -The storage resources are held on a Ceph server which can accept requests up 100 TiB. Currently, each PVC can only be accessed by one pod at a time, this limitation is being addressed in further development of the EIDFGPUS. 
This means at this stage, pods can mount the same PVC in sequence, but not concurrently. +The storage resources are held on a Ceph server which can accept requests up to 100 TiB. Currently, each PVC can only be accessed by one pod at a time, this limitation is being addressed in further development of the EIDF GPU Service. This means at this stage, pods can mount the same PVC in sequence, but not concurrently. Example PVCs can be seen on the [Kubernetes](https://kubernetes.io/docs/concepts/storage/persistent-volumes/) documentation page. -All PVCs on the EIDFGPUS must use the `csi-rbd-sc` storage class. +All PVCs on the EIDF GPU Service must use the `csi-rbd-sc` storage class. ### Example PersistentVolumeClaim @@ -42,12 +42,12 @@ spec: storageClassName: csi-rbd-sc ``` -You create a persistent volume by passing the yaml file to kubectl like a pod specification yaml `kubectl create ` +You create a persistent volume by passing the yaml file to kubectl like a pod specification yaml `kubectl create ` Once you have successfully created a persistent volume you can interact with it using the standard kubectl commands: -- `kubectl delete pvc ` -- `kubectl get pvc ` -- `kubectl apply -f ` +- `kubectl delete pvc ` +- `kubectl get pvc ` +- `kubectl apply -f ` ## Mounting a persistent Volume to a Pod @@ -56,29 +56,37 @@ Introducing a persistent volume to a pod requires the addition of a volumeMount ### Example pod specification yaml with mounted persistent volume ``` yaml -apiVersion: v1 -kind: Pod +apiVersion: batch/v1 +kind: Job metadata: - name: test-ceph-pvc-pod + name: test-ceph-pvc-job + labels: + kueue.x-k8s.io/queue-name: -user-queue spec: - containers: - - name: trial - image: busybox - command: ["sleep", "infinity"] - resources: - requests: - cpu: 1 - memory: "1Gi" - limits: - cpu: 1 - memory: "1Gi" - volumeMounts: - - mountPath: /mnt/ceph_rbd - name: volume - volumes: - - name: volume - persistentVolumeClaim: - claimName: test-ceph-pvc + completions: 1 + template: + 
metadata: + name: test-ceph-pvc-pod + spec: + containers: + - name: cudasample + image: busybox + args: ["sleep", "infinity"] + resources: + requests: + cpu: 2 + memory: '1Gi' + limits: + cpu: 2 + memory: '4Gi' + volumeMounts: + - mountPath: /mnt/ceph_rbd + name: volume + restartPolicy: Never + volumes: + - name: volume + persistentVolumeClaim: + claimName: test-ceph-pvc ``` ## Accessing the persistent volume outside a pod @@ -86,8 +94,8 @@ spec: To move files in/out of the persistent volume from outside a pod you can use the kubectl cp command. ```bash -*** On Login Node *** -kubectl cp /home/data/test_data.csv test-ceph-pvc-pod:/mnt/ceph_rbd +*** On Login Node - replacing pod name with your pod name *** +kubectl cp /home/data/test_data.csv test-ceph-pvc-job-8c9cc:/mnt/ceph_rbd ``` For more complex file transfers and synchronisation, create a low resource pod with the persistent volume mounted. @@ -97,7 +105,7 @@ The bash command rsync can be amended to manage file transfers into the mounted ## Clean up ```bash -kubectl delete pod test-ceph-pvc-pod +kubectl delete job test-ceph-pvc-job kubectl delete pvc test-ceph-pvc ``` diff --git a/docs/services/gpuservice/training/L3_running_a_pytorch_task.md b/docs/services/gpuservice/training/L3_running_a_pytorch_task.md index b3fad8906..33dae5ffb 100644 --- a/docs/services/gpuservice/training/L3_running_a_pytorch_task.md +++ b/docs/services/gpuservice/training/L3_running_a_pytorch_task.md @@ -1,6 +1,6 @@ # Running a PyTorch task -In the following lesson, we'll build a NLP neural network and train it using the EIDFGPUS. +In the following lesson, we'll build a NLP neural network and train it using the EIDF GPU Service. The model was taken from the [PyTorch Tutorials](https://pytorch.org/tutorials/beginner/basics/quickstart_tutorial.html). 
@@ -8,7 +8,7 @@ The lesson will be split into three parts: - Requesting a persistent volume and transferring code/data to it - Creating a pod with a PyTorch container downloaded from DockerHub -- Submitting a job to the EIDFGPUS and retrieving the results +- Submitting a job to the EIDF GPU Service and retrieving the results ## Load training data and ML code into a persistent volume @@ -44,132 +44,147 @@ spec: kubectl get pvc ``` -1. Create a lightweight pod with PV mounted (example pod below) +1. Create a lightweight job whose pod has the PV mounted (example job below) ``` bash - kubectl create -f lightweight-pod.yaml + kubectl create -f lightweight-pod-job.yaml ``` -1. Download the pytorch code +1. Download the PyTorch code ``` bash wget https://github.com/EPCCed/eidf-docs/raw/main/docs/services/gpuservice/training/resources/example_pytorch_code.py ``` -1. Copy python script into the PV +1. Copy the Python script into the PV ``` bash - kubectl cp example_pytorch_code.py lightweight-pod:/mnt/ceph_rbd/ + kubectl cp example_pytorch_code.py lightweight-job-:/mnt/ceph_rbd/ ``` -1. Check files were transferred successfully +1. Check whether the files were transferred successfully ``` bash - kubectl exec lightweight-pod -- ls /mnt/ceph_rbd + kubectl exec lightweight-job- -- ls /mnt/ceph_rbd ``` -1. Delete lightweight pod +1.
Delete the lightweight job ``` bash - kubectl delete pod lightweight-pod + kubectl delete job lightweight-job- ``` -### Example lightweight pod specification +### Example lightweight job specification ``` yaml -apiVersion: v1 -kind: Pod +apiVersion: batch/v1 +kind: Job metadata: - name: lightweight-pod + name: lightweight-job + labels: + kueue.x-k8s.io/queue-name: -user-queue spec: - containers: - - name: data-loader - image: busybox - command: ["sleep", "infinity"] - resources: - requests: - cpu: 1 - memory: "1Gi" - limits: - cpu: 1 - memory: "1Gi" - volumeMounts: - - mountPath: /mnt/ceph_rbd - name: volume - volumes: - - name: volume - persistentVolumeClaim: - claimName: pytorch-pvc + completions: 1 + template: + metadata: + name: lightweight-pod + spec: + containers: + - name: data-loader + image: busybox + args: ["sleep", "infinity"] + resources: + requests: + cpu: 1 + memory: '1Gi' + limits: + cpu: 1 + memory: '1Gi' + volumeMounts: + - mountPath: /mnt/ceph_rbd + name: volume + restartPolicy: Never + volumes: + - name: volume + persistentVolumeClaim: + claimName: pytorch-pvc ``` -## Creating a pod with a PyTorch container +## Creating a Job with a PyTorch container We will use the pre-made PyTorch Docker image available on Docker Hub to run the PyTorch ML model. The PyTorch container will be held within a pod that has the persistent volume mounted and access a MIG GPU. -Submit the specification file to K8s to create the pod. +Submit the specification file below to K8s to create the job, replacing the queue name with your project namespace queue name. 
``` bash -kubectl create -f +kubectl create -f ``` -### Example PyTorch Pod Specification File +### Example PyTorch Job Specification File ``` yaml -apiVersion: v1 -kind: Pod +apiVersion: batch/v1 +kind: Job metadata: - name: pytorch-pod + name: pytorch-job + labels: + kueue.x-k8s.io/queue-name: -user-queue spec: - restartPolicy: Never - containers: - - name: pytorch-con - image: pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel - command: ["python3"] - args: ["/mnt/ceph_rbd/example_pytorch_code.py"] - volumeMounts: - - mountPath: /mnt/ceph_rbd - name: volume - resources: - requests: - cpu: 2 - memory: "1Gi" - limits: - cpu: 4 - memory: "4Gi" - nvidia.com/gpu: 1 - nodeSelector: - nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB-MIG-1g.5gb - volumes: - - name: volume - persistentVolumeClaim: - claimName: pytorch-pvc + completions: 1 + template: + metadata: + name: pytorch-pod + spec: + restartPolicy: Never + containers: + - name: pytorch-con + image: pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel + command: ["python3"] + args: ["/mnt/ceph_rbd/example_pytorch_code.py"] + volumeMounts: + - mountPath: /mnt/ceph_rbd + name: volume + resources: + requests: + cpu: 2 + memory: "1Gi" + limits: + cpu: 4 + memory: "4Gi" + nvidia.com/gpu: 1 + nodeSelector: + nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB-MIG-1g.5gb + volumes: + - name: volume + persistentVolumeClaim: + claimName: pytorch-pvc ``` ## Reviewing the results of the PyTorch model This is not intended to be an introduction to PyTorch, please see the [online tutorial](https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html) for details about the model. -1. Check model ran to completion +1. Check that the model ran to completion ``` bash kubectl logs ``` -1. Spin up lightweight pod to retrieve results +1. Spin up a lightweight pod to retrieve results ``` bash - kubectl create -f lightweight-pod.yaml + kubectl create -f lightweight-pod-job.yaml ``` -1. Copy trained model back to the head node +1. 
Copy the trained model back to your access VM ``` bash - kubectl cp lightweight-pod:mnt/ceph_rbd/model.pth model.pth + kubectl cp lightweight-job-:mnt/ceph_rbd/model.pth model.pth ``` -## Using a Kubernetes job to train the pytorch model +## Using a Kubernetes job to train the pytorch model multiple times A common ML training workflow may consist of training multiple iterations of a model: such as models with different hyperparameters or models trained on multiple different data sets. @@ -183,42 +198,43 @@ Below is an example job yaml for running the pytorch model which will continue t apiVersion: batch/v1 kind: Job metadata: - name: pytorch-job + name: pytorch-job + labels: + kueue.x-k8s.io/queue-name: -user-queue spec: - completions: 3 - parallelism: 1 - template: - spec: - restartPolicy: Never - containers: - - name: pytorch-con - image: pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel - command: ["python3"] - args: ["/mnt/ceph_rbd/example_pytorch_code.py"] - volumeMounts: - - mountPath: /mnt/ceph_rbd - name: volume - resources: - requests: - cpu: 1 - memory: "4Gi" - limits: - cpu: 1 - memory: "8Gi" - nvidia.com/gpu: 1 - nodeSelector: - nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB-MIG-1g.5gb - volumes: - - name: volume - persistentVolumeClaim: - claimName: pytorch-pvc + completions: 3 + template: + metadata: + name: pytorch-pod + spec: + restartPolicy: Never + containers: + - name: pytorch-con + image: pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel + command: ["python3"] + args: ["/mnt/ceph_rbd/example_pytorch_code.py"] + volumeMounts: + - mountPath: /mnt/ceph_rbd + name: volume + resources: + requests: + cpu: 2 + memory: "1Gi" + limits: + cpu: 4 + memory: "4Gi" + nvidia.com/gpu: 1 + nodeSelector: + nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB-MIG-1g.5gb + volumes: + - name: volume + persistentVolumeClaim: + claimName: pytorch-pvc ``` ## Clean up ``` bash -kubectl delete pod pytorch-pod - kubectl delete pod pytorch-job kubectl delete pvc pytorch-pvc diff --git 
a/docs/services/ultra2/run.md b/docs/services/ultra2/run.md index 6374cdc67..f181eb08a 100644 --- a/docs/services/ultra2/run.md +++ b/docs/services/ultra2/run.md @@ -70,7 +70,6 @@ Remember, you will need to use both an SSH key and Time-based one-time password --- !!! note "First Login" - When you **first** log into Ultra2, you will be prompted to change your initial password. This is a three step process: 1. When promoted to enter your *password*: Enter the password which you [retrieve from SAFE](https://epcced.github.io/safe-docs/safe-for-users/#how-can-i-pick-up-my-password-for-the-service-machine) diff --git a/mkdocs.yml b/mkdocs.yml index cbfe4d1d2..fb602f696 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -63,7 +63,8 @@ nav: - "GPU Service": - "Overview": services/gpuservice/index.md - "Policies": services/gpuservice/policies.md - - "Tutorial": + - "Kueue": services/gpuservice/kueue.md + - "Tutorials": - "Getting Started": services/gpuservice/training/L1_getting_started.md - "Persistent Volumes": services/gpuservice/training/L2_requesting_persistent_volumes.md - "Running a Pytorch Pod": services/gpuservice/training/L3_running_a_pytorch_task.md