AG: pre-commit failures corrected in cs2 and ultra2 and gpuservice latest
EPCCed · Feb 14, 2024 · d24b745
1 parent 2541900
commit d24b745

Showing 10 changed files with 834 additions and 243 deletions.
diff --git a/docs/services/cs2/run.md b/docs/services/cs2/run.md
@@ -62,7 +62,7 @@ source venv_cerebras_pt/bin/activate
 cerebras_install_check
 ```
 
-### Modify venv files to remove clock sync check on EPCC system.
+### Modify venv files to remove clock sync check on EPCC system
 
 Cerebras are aware of this issue and are working on a fix, however in the mean time follow the below workaround:
 
@@ -91,7 +91,7 @@ if modified_time > self._last_modified:
     )
 ```
 
-### Comment out the whole section
+### Comment out the section `if modified_time > self._last_modified`
 
 ```python
  #if modified_time > self._last_modified:
@@ -123,7 +123,7 @@ The section should look like this:
        )
 ```
 
-### Comment out the whole section
+### Comment out the section `if stat.st_mtime_ns > self._stat.st_mtime_ns`
 
 ```python
    #if stat.st_mtime_ns > self._stat.st_mtime_ns:
@@ -138,7 +138,7 @@ The section should look like this:
 
 ### Save the file
 
-### Run jobs as per existing documentation.
+### Run jobs as per existing documentation
 
 ## Paths, PYTHONPATH and mount_dirs
 

diff --git a/docs/services/gpuservice/faq.md b/docs/services/gpuservice/faq.md
@@ -16,7 +16,7 @@ The current PVC provisioner is based on Ceph RBD. The block devices provided by
 
 ### How many GPUs can I use in a pod?
 
-The current limit is 8 GPUs per pod. Each underlying host has 8 GPUs.
+The current limit is 8 GPUs per pod. Each underlying host node has either 4 or 8 GPUs. If you request 8 GPUs, your job will be queued until a node with 8 GPUs is free of other jobs. A request for 4 GPUs can run on a node with either 4 or 8 GPUs.
 
 ### Why did a validation error occur when submitting a pod or job with a valid specification file?
 
@@ -76,3 +76,7 @@ Example fragment for a Bash command start:
         - '-c'
         - '--'
 ```
+
+### My job requesting a large number of GPUs takes a long time to be scheduled
+
+A job that requests a large number of GPUs may require an entire node to be free, which can take some time to become available. The default scheduling algorithm for the queues is Best Effort FIFO, which means that large jobs will not block small jobs from running if there is sufficient quota and space available.
diff --git a/docs/services/gpuservice/index.md b/docs/services/gpuservice/index.md
@@ -4,7 +4,7 @@ The EIDF GPU Service (EIDF GPU Service) provides access to a range of Nvidia GPU
 
 MIG (Multi-instance GPU) allows a single GPU to be split into multiple isolated smaller GPUs. This means that multiple users can access a portion of the GPU without being able to access what others are running on their portion.
 
-The EIDF GPU Service hosts 3G.20GB and 1G.5GB MIG variants which are approximately 1/2 and 1/7 of a full Nvidia A100 40 GB GPU.
+The EIDF GPU Service hosts 3G.20GB and 1G.5GB MIG variants which are approximately 1/2 and 1/7 of a full Nvidia A100 40 GB GPU respectively.
 
 The service provides access to:
 
@@ -26,23 +26,27 @@ The current full specification of the EIDF GPU Service as of 14 February 2024:
 - 56 Nvidia A100 1G.5GB
 - 32 Nvidia H100 80 GB
 
-!!! Quotas
-    This is the full configuration of the cluster. Each project will have access to a quota across this shared configuration. This quota is agreed with the EIDF Services team.
+!!! important "Quotas"
+    This is the full configuration of the cluster.
+
+    Each project will have access to a quota across this shared configuration.
+
+    Changes to the default quota must be discussed and agreed with the EIDF Services team.
 
 ## Service Access
 
-Users should have an EIDF account - [EIDF Accounts](../../access/project.md).
+Users should have an [EIDF Account](../../access/project.md).
 
 Project Leads will be able to request access to the EIDF GPU Service for their project either during the project application process or through a service request to the EIDF helpdesk.
 
-Each project will be given a namespace to operate in and the ability to add a kubeconfig file to any of their Virtual Machines in their EIDF project - information on access to VMs is [available here](../../access/virtualmachines-vdi.md).
+Each project will be given a namespace to operate in and the ability to add a kubeconfig file to any of their Virtual Machines in their EIDF project - information on access to VMs is available [here](../../access/virtualmachines-vdi.md).
 
 All EIDF virtual machines can be set up to access the EIDF GPU Service. The Virtual Machine does not need to be GPU-enabled.
 
-!!! Important
+!!! important "EIDF GPU Service vs EIDF GPU-Enabled VMs"
     The EIDF GPU Service is a container based service which is accessed from EIDF Virtual Desktop VMs. This allows a project to access multiple GPUs of different types.
 
-    An EIDF Virtual Desktop GPU-enabled VM is be limited to a small number (1-2) of GPUs of a single type.
+    An EIDF Virtual Desktop GPU-enabled VM is limited to a small number (1-2) of GPUs of a single type.
 
     Projects do not have to apply for a GPU-enabled VM to access the GPU Service.
 
@@ -54,13 +58,17 @@ A standard project namespace has the following initial quota (subject to ongoing
 - Memory: 1TiB
 - GPU: 12
 
-!!! Important
+!!! important "Quota is a maximum on a Shared Resource"
     A project quota is the maximum proportion of the service available for use by that project.
 
-    During periods of high demand, Jobs will queued awaiting resource availability on the Service.
+    During periods of high demand, Jobs will be queued awaiting resource availability on the Service.
 
     This means that a project has access up to 12 GPUs but due to demand may only be able to access a smaller number at any given time.
 
+## Project Queues
+
+The EIDF GPU Service is introducing the Kueue system in February 2024. Its use is detailed in the [Kueue](kueue.md) documentation.
+
 ## Additional Service Policy Information
 
 Additional information on service policies can be found [here](policies.md).
@@ -79,6 +87,6 @@ This tutorial teaches users how to submit tasks to the EIDF GPU Service, but it
 
 - The [Nvidia developers blog](https://developer.nvidia.com/blog/search-posts/?q=Kubernetes) provides several examples of how to run ML tasks on a Kubernetes GPU cluster.
 
-- Kubernetes documentation has a useful [kubectl cheat sheet](https://kubernetes.io/docs/reference/kubectl/cheatsheet/#viewing-and-finding-resources)
+- Kubernetes documentation has a useful [kubectl cheat sheet](https://kubernetes.io/docs/reference/kubectl/cheatsheet/#viewing-and-finding-resources).
 
-- More detailed use cases for the `kubectl` can be found in the [Kubernetes documentation](https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#run)
+- More detailed use cases for the `kubectl` can be found in the [Kubernetes documentation](https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#run).
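
The placement rule added to `docs/services/gpuservice/faq.md` above (nodes have either 4 or 8 GPUs, with a limit of 8 GPUs per pod) can be illustrated with a minimal Python sketch. The function name `eligible_node_sizes` and both constants are hypothetical and exist only for illustration. This is not the scheduler's actual logic, which also depends on quota and on what else is running.

```python
# Illustrative only: values taken from the FAQ text in this diff.
NODE_GPU_COUNTS = (4, 8)   # each host node has either 4 or 8 GPUs
MAX_GPUS_PER_POD = 8       # current per-pod limit


def eligible_node_sizes(requested_gpus: int) -> list[int]:
    """Return the node sizes (GPUs per node) that could host a single pod.

    Hypothetical helper: it only checks whether a node has enough GPUs
    in total, not whether those GPUs are currently free.
    """
    if not 1 <= requested_gpus <= MAX_GPUS_PER_POD:
        raise ValueError(
            f"a pod may request between 1 and {MAX_GPUS_PER_POD} GPUs"
        )
    return [size for size in NODE_GPU_COUNTS if size >= requested_gpus]


if __name__ == "__main__":
    for request in (1, 4, 8):
        print(request, eligible_node_sizes(request))
```

Under these assumptions, a request for 8 GPUs can only be placed on an 8-GPU node, which is why such jobs may wait longer in the queue than requests for 4 GPUs or fewer.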