Merge pull request #135 from EPCCed/kueue_update
Kueue update changes
agngrant authored Feb 15, 2024
2 parents ef3889f + d24b745 commit 5cc5e65
Showing 11 changed files with 866 additions and 251 deletions.
1 change: 1 addition & 0 deletions .mdl_style.rb
@@ -1,4 +1,5 @@
all
exclude_rule 'MD033'
+exclude_rule 'MD046'
rule 'MD013', :line_length => 500
rule 'MD026', :punctuation => '.,:;'
8 changes: 4 additions & 4 deletions docs/services/cs2/run.md
@@ -62,7 +62,7 @@ source venv_cerebras_pt/bin/activate
cerebras_install_check
```
-### Modify venv files to remove clock sync check on EPCC system.
+### Modify venv files to remove clock sync check on EPCC system
Cerebras are aware of this issue and are working on a fix; in the meantime, follow the workaround below:
@@ -91,7 +91,7 @@ if modified_time > self._last_modified:
)
```
-### Comment out the whole section
+### Comment out the section `if modified_time > self._last_modified`
```python
#if modified_time > self._last_modified:
@@ -123,7 +123,7 @@ The section should look like this:
)
```
-### Comment out the whole section
+### Comment out the section `if stat.st_mtime_ns > self._stat.st_mtime_ns`
```python
#if stat.st_mtime_ns > self._stat.st_mtime_ns:
@@ -138,7 +138,7 @@ The section should look like this:
### Save the file
-### Run jobs as per existing documentation.
+### Run jobs as per existing documentation
## Paths, PYTHONPATH and mount_dirs
6 changes: 5 additions & 1 deletion docs/services/gpuservice/faq.md
@@ -16,7 +16,7 @@ The current PVC provisioner is based on Ceph RBD. The block devices provided by

### How many GPUs can I use in a pod?

-The current limit is 8 GPUs per pod. Each underlying host has 8 GPUs.
+The current limit is 8 GPUs per pod. Each underlying host node has either 4 or 8 GPUs. If you request 8 GPUs, your job will be queued until a node with all 8 GPUs free becomes available. If you request 4 GPUs, the job can run on a node with either 4 or 8 GPUs.
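For illustration, a minimal sketch of how whole GPUs are requested in a pod specification, using the standard Nvidia device plugin resource `nvidia.com/gpu` (the pod name and image are placeholders, not taken from the service documentation):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-limit-example  # placeholder name
spec:
  restartPolicy: Never
  containers:
    - name: cuda-check
      image: nvidia/cuda:12.2.0-base-ubuntu22.04  # placeholder image
      command: ["nvidia-smi"]  # prints the GPUs visible to the container
      resources:
        limits:
          nvidia.com/gpu: 4  # whole GPUs; the service caps this at 8 per pod
```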

### Why did a validation error occur when submitting a pod or job with a valid specification file?

@@ -76,3 +76,7 @@ Example fragment for a Bash command start:
- '-c'
- '--'
```
+### My job requesting a large number of GPUs takes a long time to be scheduled
+
+When a job requests a large number of GPUs, it may need an entire node to be free, which can take some time. The default scheduling algorithm in the queues is Best Effort FIFO; this means that large jobs will not block small jobs from running if there is sufficient quota and space available.
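As a hedged sketch of how such a large job might be submitted once queuing is in place, a Kubernetes Job can carry the Kueue queue label `kueue.x-k8s.io/queue-name` (the queue name below is hypothetical; the service's own Kueue documentation gives the real values):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: large-gpu-job  # placeholder name
  labels:
    kueue.x-k8s.io/queue-name: project-queue  # hypothetical LocalQueue name
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: workload
          image: nvidia/cuda:12.2.0-base-ubuntu22.04  # placeholder image
          command: ["nvidia-smi"]
          resources:
            limits:
              nvidia.com/gpu: 8  # full-node request; may queue until a whole node is free
```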
69 changes: 50 additions & 19 deletions docs/services/gpuservice/index.md
@@ -1,32 +1,54 @@
# Overview

-The EIDF GPU Service (EIDFGPUS) uses Nvidia A100 GPUs as accelerators.
+The EIDF GPU Service provides access to a range of Nvidia GPUs, in both full GPU and MIG variants. The EIDF GPU Service is built upon [Kubernetes](https://kubernetes.io).

-Full Nvidia A100 GPUs are connected to 40GB of dynamic memory.
+MIG (Multi-Instance GPU) allows a single GPU to be split into multiple isolated smaller GPUs. This means that multiple users can access a portion of the GPU without being able to access what others are running on their portion.

-Multi-instance usage (MIG) GPUs allow multiple tasks or users to share the same GPU (similar to CPU threading).
+The EIDF GPU Service hosts 3G.20GB and 1G.5GB MIG variants, which are approximately 1/2 and 1/7 of a full Nvidia A100 40GB GPU respectively.
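A minimal sketch of how a MIG slice might be requested in a container specification, assuming the Nvidia device plugin's "mixed" MIG strategy (the exact resource names depend on the cluster configuration):

```yaml
# Container `resources` fragment requesting a single 1G.5GB MIG slice;
# the resource name assumes the Nvidia device plugin's "mixed" MIG strategy.
resources:
  limits:
    nvidia.com/mig-1g.5gb: 1
```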

-There are two types of MIG GPUs inside the EIDFGPUS the Nvidia A100 3G.20GB GPUs and the Nvidia A100 1G.5GB GPUs which equate to ~1/2 and ~1/7 of a full Nvidia A100 40 GB GPU.
+The service provides access to:

-The current specification of the EIDFGPUS is:
+- Nvidia A100 40GB
+- Nvidia A100 80GB
+- Nvidia MIG A100 1G.5GB
+- Nvidia MIG A100 3G.20GB
+- Nvidia H100 80GB

-- 1856 CPU Cores
-- 8.7 TiB Memory
-- Local Disk Space (Node Image Cache and Local Workspace) - 21 TiB
+The current full specification of the EIDF GPU Service as of 14 February 2024:

+- 4912 CPU Cores (AMD EPYC and Intel Xeon)
+- 23 TiB Memory
+- Local Disk Space (Node Image Cache and Local Workspace) - 40 TiB
- Ceph Persistent Volumes (Long Term Data) - up to 100TiB
-- 70 Nvidia A100 40 GB GPUs
-- 14 MIG Nvidia A100 40 GB GPUs equating to 28 Nvidia A100 3G.20GB GPUs
-- 20 MIG Nvidia A100 40 GB GPU equating to 140 A100 1G.5GB GPUs
+- 112 Nvidia A100 40 GB
+- 39 Nvidia A100 80 GB
+- 16 Nvidia A100 3G.20GB
+- 56 Nvidia A100 1G.5GB
+- 32 Nvidia H100 80 GB

!!! important "Quotas"
This is the full configuration of the cluster.

Each project will have access to a quota across this shared configuration.

-The EIDFGPUS is managed using [Kubernetes](https://kubernetes.io), with up to 8 GPUs being on a single node.
+    Changes to the default quota must be discussed and agreed with the EIDF Services team.

## Service Access

-Users should have an EIDF account - [EIDF Accounts](../../access/project.md).
+Users should have an [EIDF Account](../../access/project.md).

+Project Leads will be able to request access to the EIDF GPU Service for their project either during the project application process or through a service request to the EIDF helpdesk.

+Each project will be given a namespace to operate in and the ability to add a kubeconfig file to any of their Virtual Machines in their EIDF project - information on access to VMs is available [here](../../access/virtualmachines-vdi.md).

+All EIDF virtual machines can be set up to access the EIDF GPU Service. The Virtual Machine does not need to be GPU-enabled.

-Project Leads will be able to have access to the EIDFGPUS added to their project during the project application process or through a request to the EIDF helpdesk.
+!!! important "EIDF GPU Service vs EIDF GPU-Enabled VMs"
+    The EIDF GPU Service is a container-based service which is accessed from EIDF Virtual Desktop VMs. This allows a project to access multiple GPUs of different types.

-Each project will be given a namespace to operate in and a kubeconfig file in a Virtual Machine on the EIDF DSC - information on access to VMs is [available here](../../access/virtualmachines-vdi.md).
+    An EIDF Virtual Desktop GPU-enabled VM is limited to a small number (1-2) of GPUs of a single type.

+Projects do not have to apply for a GPU-enabled VM to access the GPU Service.

## Project Quotas

@@ -36,15 +58,24 @@ A standard project namespace has the following initial quota (subject to ongoing
- Memory: 1TiB
- GPU: 12

-Note these quotas are maximum use by a single project, and that during periods of high usage Kubernetes Jobs maybe queued waiting for resource to become available on the cluster.
+!!! important "Quota is a maximum on a Shared Resource"
+    A project quota is the maximum proportion of the service available for use by that project.
+
+    During periods of high demand, Jobs will be queued awaiting resource availability on the Service.
+
+    This means that a project has access to up to 12 GPUs, but due to demand may only be able to access a smaller number at any given time.
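In plain Kubernetes terms, a per-namespace cap of this shape could be expressed as a ResourceQuota; this is a hedged sketch with hypothetical names, not necessarily the mechanism the service itself uses (the next section points to Kueue):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: project-quota           # hypothetical name
  namespace: project-namespace  # hypothetical namespace
spec:
  hard:
    memory: 1Ti                    # matches the 1TiB memory quota above
    requests.nvidia.com/gpu: "12"  # extended resources are quota'd via the requests. prefix
```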

+## Project Queues
+
+The EIDF GPU Service is introducing the Kueue system in February 2024. Its use is detailed in the [Kueue](kueue.md) documentation.
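For orientation, Kueue routes workloads through a per-namespace LocalQueue backed by a ClusterQueue that holds the actual quota; a sketch of such a LocalQueue, with all names hypothetical (the Kueue page describes the service's actual setup):

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  namespace: project-namespace  # hypothetical project namespace
  name: project-queue           # hypothetical queue targeted by the queue-name label
spec:
  clusterQueue: gpu-cluster-queue  # hypothetical ClusterQueue holding the shared quota
```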

## Additional Service Policy Information

Additional information on service policies can be found [here](policies.md).

## EIDF GPU Service Tutorial

-This tutorial teaches users how to submit tasks to the EIDFGPUS, but it is not a comprehensive overview of Kubernetes.
+This tutorial teaches users how to submit tasks to the EIDF GPU Service, but it is not a comprehensive overview of Kubernetes.

| Lesson | Objective |
|-----------------------------------|-------------------------------------|
@@ -56,6 +87,6 @@ This tutorial teaches users how to submit tasks to the EIDFGPUS, but it is not a

- The [Nvidia developers blog](https://developer.nvidia.com/blog/search-posts/?q=Kubernetes) provides several examples of how to run ML tasks on a Kubernetes GPU cluster.

-- Kubernetes documentation has a useful [kubectl cheat sheet](https://kubernetes.io/docs/reference/kubectl/cheatsheet/#viewing-and-finding-resources)
+- Kubernetes documentation has a useful [kubectl cheat sheet](https://kubernetes.io/docs/reference/kubectl/cheatsheet/#viewing-and-finding-resources).

-- More detailed use cases for the `kubectl` can be found in the [Kubernetes documentation](https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#run)
+- More detailed use cases for the `kubectl` can be found in the [Kubernetes documentation](https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#run).