Merge pull request #135 from EPCCed/kueue_update
Kueue update changes
agngrant authored Feb 15, 2024
2 parents ef3889f + d24b745 commit 5cc5e65
Showing 11 changed files with 866 additions and 251 deletions.
1 change: 1 addition & 0 deletions .mdl_style.rb
@@ -1,4 +1,5 @@
all
exclude_rule 'MD033'
+exclude_rule 'MD046'
rule 'MD013', :line_length => 500
rule 'MD026', :punctuation => '.,:;'
8 changes: 4 additions & 4 deletions docs/services/cs2/run.md
@@ -62,7 +62,7 @@ source venv_cerebras_pt/bin/activate
cerebras_install_check
```
-### Modify venv files to remove clock sync check on EPCC system.
+### Modify venv files to remove clock sync check on EPCC system
Cerebras are aware of this issue and are working on a fix; in the meantime, follow the workaround below:
@@ -91,7 +91,7 @@ if modified_time > self._last_modified:
)
```
-### Comment out the whole section
+### Comment out the section `if modified_time > self._last_modified`
```python
#if modified_time > self._last_modified:
@@ -123,7 +123,7 @@ The section should look like this:
)
```
-### Comment out the whole section
+### Comment out the section `if stat.st_mtime_ns > self._stat.st_mtime_ns`
```python
#if stat.st_mtime_ns > self._stat.st_mtime_ns:
@@ -138,7 +138,7 @@ The section should look like this:
### Save the file
-### Run jobs as per existing documentation.
+### Run jobs as per existing documentation
## Paths, PYTHONPATH and mount_dirs
6 changes: 5 additions & 1 deletion docs/services/gpuservice/faq.md
@@ -16,7 +16,7 @@ The current PVC provisioner is based on Ceph RBD. The block devices provided by

### How many GPUs can I use in a pod?

-The current limit is 8 GPUs per pod. Each underlying host has 8 GPUs.
+The current limit is 8 GPUs per pod. Each underlying host node has either 4 or 8 GPUs. If you request 8 GPUs, your job will be queued until a node with all 8 GPUs free becomes available. If you request 4 GPUs, the job can run on a node with either 4 or 8 GPUs.
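For illustration, a minimal sketch of how whole GPUs are requested in a pod specification, using the standard Nvidia device plugin resource `nvidia.com/gpu` (the pod name and image are placeholders, not taken from the service documentation):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-limit-example  # placeholder name
spec:
  restartPolicy: Never
  containers:
    - name: cuda-check
      image: nvidia/cuda:12.2.0-base-ubuntu22.04  # placeholder image
      command: ["nvidia-smi"]  # prints the GPUs visible to the container
      resources:
        limits:
          nvidia.com/gpu: 4  # whole GPUs; the service caps this at 8 per pod
```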

### Why did a validation error occur when submitting a pod or job with a valid specification file?

@@ -76,3 +76,7 @@ Example fragment for a Bash command start:
- '-c'
- '--'
```
+### My job requesting a large number of GPUs takes a long time to be scheduled
+
+When a job requests a large number of GPUs, it may need an entire node to be free, which can take some time. The default scheduling algorithm in the queues is Best Effort FIFO; this means that large jobs will not block small jobs from running if there is sufficient quota and space available.
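As a hedged sketch of how such a large job might be submitted once queuing is in place, a Kubernetes Job can carry the Kueue queue label `kueue.x-k8s.io/queue-name` (the queue name below is hypothetical; the service's own Kueue documentation gives the real values):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: large-gpu-job  # placeholder name
  labels:
    kueue.x-k8s.io/queue-name: project-queue  # hypothetical LocalQueue name
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: workload
          image: nvidia/cuda:12.2.0-base-ubuntu22.04  # placeholder image
          command: ["nvidia-smi"]
          resources:
            limits:
              nvidia.com/gpu: 8  # full-node request; may queue until a whole node is free
```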
69 changes: 50 additions & 19 deletions docs/services/gpuservice/index.md
@@ -1,32 +1,54 @@
# Overview

-The EIDF GPU Service (EIDFGPUS) uses Nvidia A100 GPUs as accelerators.
+The EIDF GPU Service provides access to a range of Nvidia GPUs, in both full GPU and MIG variants. The EIDF GPU Service is built upon [Kubernetes](https://kubernetes.io).

-Full Nvidia A100 GPUs are connected to 40GB of dynamic memory.
+MIG (Multi-Instance GPU) allows a single GPU to be split into multiple isolated smaller GPUs. This means that multiple users can access a portion of the GPU without being able to access what others are running on their portion.

-Multi-instance usage (MIG) GPUs allow multiple tasks or users to share the same GPU (similar to CPU threading).
+The EIDF GPU Service hosts 3G.20GB and 1G.5GB MIG variants, which are approximately 1/2 and 1/7 of a full Nvidia A100 40GB GPU respectively.
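A minimal sketch of how a MIG slice might be requested in a container specification, assuming the Nvidia device plugin's "mixed" MIG strategy (the exact resource names depend on the cluster configuration):

```yaml
# Container `resources` fragment requesting a single 1G.5GB MIG slice;
# the resource name assumes the Nvidia device plugin's "mixed" MIG strategy.
resources:
  limits:
    nvidia.com/mig-1g.5gb: 1
```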

-There are two types of MIG GPUs inside the EIDFGPUS the Nvidia A100 3G.20GB GPUs and the Nvidia A100 1G.5GB GPUs which equate to ~1/2 and ~1/7 of a full Nvidia A100 40 GB GPU.
+The service provides access to:

-The current specification of the EIDFGPUS is:
+- Nvidia A100 40GB
+- Nvidia A100 80GB
+- Nvidia MIG A100 1G.5GB
+- Nvidia MIG A100 3G.20GB
+- Nvidia H100 80GB

-- 1856 CPU Cores
-- 8.7 TiB Memory
-- Local Disk Space (Node Image Cache and Local Workspace) - 21 TiB
+The current full specification of the EIDF GPU Service as of 14 February 2024:

+- 4912 CPU Cores (AMD EPYC and Intel Xeon)
+- 23 TiB Memory
+- Local Disk Space (Node Image Cache and Local Workspace) - 40 TiB
- Ceph Persistent Volumes (Long Term Data) - up to 100TiB
-- 70 Nvidia A100 40 GB GPUs
-- 14 MIG Nvidia A100 40 GB GPUs equating to 28 Nvidia A100 3G.20GB GPUs
-- 20 MIG Nvidia A100 40 GB GPU equating to 140 A100 1G.5GB GPUs
+- 112 Nvidia A100 40 GB
+- 39 Nvidia A100 80 GB
+- 16 Nvidia A100 3G.20GB
+- 56 Nvidia A100 1G.5GB
+- 32 Nvidia H100 80 GB

!!! important "Quotas"
This is the full configuration of the cluster.

Each project will have access to a quota across this shared configuration.

-The EIDFGPUS is managed using [Kubernetes](https://kubernetes.io), with up to 8 GPUs being on a single node.
+    Changes to the default quota must be discussed and agreed with the EIDF Services team.

## Service Access

-Users should have an EIDF account - [EIDF Accounts](../../access/project.md).
+Users should have an [EIDF Account](../../access/project.md).

+Project Leads will be able to request access to the EIDF GPU Service for their project either during the project application process or through a service request to the EIDF helpdesk.

+Each project will be given a namespace to operate in and the ability to add a kubeconfig file to any of their Virtual Machines in their EIDF project - information on access to VMs is available [here](../../access/virtualmachines-vdi.md).

+All EIDF virtual machines can be set up to access the EIDF GPU Service. The Virtual Machine does not need to be GPU-enabled.

-Project Leads will be able to have access to the EIDFGPUS added to their project during the project application process or through a request to the EIDF helpdesk.
+!!! important "EIDF GPU Service vs EIDF GPU-Enabled VMs"
+    The EIDF GPU Service is a container-based service which is accessed from EIDF Virtual Desktop VMs. This allows a project to access multiple GPUs of different types.

-Each project will be given a namespace to operate in and a kubeconfig file in a Virtual Machine on the EIDF DSC - information on access to VMs is [available here](../../access/virtualmachines-vdi.md).
+    An EIDF Virtual Desktop GPU-enabled VM is limited to a small number (1-2) of GPUs of a single type.

+Projects do not have to apply for a GPU-enabled VM to access the GPU Service.

## Project Quotas

@@ -36,15 +58,24 @@ A standard project namespace has the following initial quota (subject to ongoing
- Memory: 1TiB
- GPU: 12

-Note these quotas are maximum use by a single project, and that during periods of high usage Kubernetes Jobs maybe queued waiting for resource to become available on the cluster.
+!!! important "Quota is a maximum on a Shared Resource"
+    A project quota is the maximum proportion of the service available for use by that project.
+
+    During periods of high demand, Jobs will be queued awaiting resource availability on the Service.
+
+    This means that a project has access to up to 12 GPUs, but due to demand may only be able to access a smaller number at any given time.
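In plain Kubernetes terms, a per-namespace cap of this shape could be expressed as a ResourceQuota; this is a hedged sketch with hypothetical names, not necessarily the mechanism the service itself uses (the next section points to Kueue):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: project-quota           # hypothetical name
  namespace: project-namespace  # hypothetical namespace
spec:
  hard:
    memory: 1Ti                    # matches the 1TiB memory quota above
    requests.nvidia.com/gpu: "12"  # extended resources are quota'd via the requests. prefix
```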

+## Project Queues
+
+The EIDF GPU Service is introducing the Kueue system in February 2024. Its use is detailed in the [Kueue](kueue.md) documentation.
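For orientation, Kueue routes workloads through a per-namespace LocalQueue backed by a ClusterQueue that holds the actual quota; a sketch of such a LocalQueue, with all names hypothetical (the Kueue page describes the service's actual setup):

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  namespace: project-namespace  # hypothetical project namespace
  name: project-queue           # hypothetical queue targeted by the queue-name label
spec:
  clusterQueue: gpu-cluster-queue  # hypothetical ClusterQueue holding the shared quota
```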

## Additional Service Policy Information

Additional information on service policies can be found [here](policies.md).

## EIDF GPU Service Tutorial

-This tutorial teaches users how to submit tasks to the EIDFGPUS, but it is not a comprehensive overview of Kubernetes.
+This tutorial teaches users how to submit tasks to the EIDF GPU Service, but it is not a comprehensive overview of Kubernetes.

| Lesson | Objective |
|-----------------------------------|-------------------------------------|
@@ -56,6 +87,6 @@ This tutorial teaches users how to submit tasks to the EIDFGPUS, but it is not a

- The [Nvidia developers blog](https://developer.nvidia.com/blog/search-posts/?q=Kubernetes) provides several examples of how to run ML tasks on a Kubernetes GPU cluster.

-- Kubernetes documentation has a useful [kubectl cheat sheet](https://kubernetes.io/docs/reference/kubectl/cheatsheet/#viewing-and-finding-resources)
+- Kubernetes documentation has a useful [kubectl cheat sheet](https://kubernetes.io/docs/reference/kubectl/cheatsheet/#viewing-and-finding-resources).

-- More detailed use cases for the `kubectl` can be found in the [Kubernetes documentation](https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#run)
+- More detailed use cases for the `kubectl` can be found in the [Kubernetes documentation](https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#run).