Commit

AG: pre-commit failures corrected in cs2 and ultra2 and gpuservice latest
agrant3 committed Feb 14, 2024
1 parent 2541900 commit d24b745
Showing 10 changed files with 834 additions and 243 deletions.
8 changes: 4 additions & 4 deletions docs/services/cs2/run.md
@@ -62,7 +62,7 @@ source venv_cerebras_pt/bin/activate
cerebras_install_check
```
### Modify venv files to remove clock sync check on EPCC system.
### Modify venv files to remove clock sync check on EPCC system
Cerebras are aware of this issue and are working on a fix. In the meantime, follow the workaround below:
@@ -91,7 +91,7 @@ if modified_time > self._last_modified:
)
```
### Comment out the whole section
### Comment out the section `if modified_time > self._last_modified`
```python
#if modified_time > self._last_modified:
@@ -123,7 +123,7 @@ The section should look like this:
)
```
### Comment out the whole section
### Comment out the section `if stat.st_mtime_ns > self._stat.st_mtime_ns`
```python
#if stat.st_mtime_ns > self._stat.st_mtime_ns:
@@ -138,7 +138,7 @@ The section should look like this:
### Save the file
### Run jobs as per existing documentation.
### Run jobs as per existing documentation
## Paths, PYTHONPATH and mount_dirs
6 changes: 5 additions & 1 deletion docs/services/gpuservice/faq.md
@@ -16,7 +16,7 @@ The current PVC provisioner is based on Ceph RBD. The block devices provided by

### How many GPUs can I use in a pod?

The current limit is 8 GPUs per pod. Each underlying host has 8 GPUs.
The current limit is 8 GPUs per pod. Each underlying host node has either 4 or 8 GPUs. If you request 8 GPUs, you will be placed in a queue until a node with 8 GPUs has all of them free. If you request 4 GPUs, the pod could run on a node with either 4 or 8 GPUs.
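
The placement rule in this answer can be sketched in Python. This is an illustration only, not the scheduler's actual logic: the node names, the layout of the `nodes` dictionary and the `eligible_nodes` helper are all hypothetical.

```python
def eligible_nodes(requested_gpus, nodes):
    """Return the names of nodes with enough free GPUs for one pod.

    `nodes` maps a (hypothetical) node name to a dict with the node's
    total GPU count and the number currently free.
    """
    return [
        name
        for name, info in nodes.items()
        if info["free"] >= requested_gpus
    ]


# Hypothetical cluster state: one 4-GPU node and two 8-GPU nodes.
nodes = {
    "node-a": {"total": 4, "free": 4},
    "node-b": {"total": 8, "free": 3},
    "node-c": {"total": 8, "free": 8},
}

print(eligible_nodes(8, nodes))  # only a completely free 8-GPU node qualifies
print(eligible_nodes(4, nodes))  # a 4-GPU or an 8-GPU node can qualify
```

An 8-GPU request can only land on an 8-GPU node with every GPU free, which is why such requests tend to wait longer than 4-GPU requests.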

### Why did a validation error occur when submitting a pod or job with a valid specification file?

@@ -76,3 +76,7 @@ Example fragment for a Bash command start:
- '-c'
- '--'
```
### My job requesting a large number of GPUs takes a long time to be scheduled
When a job requests a large number of GPUs, it may require an entire node to be free, which could take some time to become available. The default scheduling algorithm for the queues in place is Best Effort FIFO. This means that large jobs will not block small jobs from running if there is sufficient quota and space available.
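
The difference between a strict FIFO queue and a best-effort FIFO queue can be illustrated with a toy model. The sketch below is not the service's implementation: it ignores quota, node layout and priorities, and the `Job`, `strict_fifo` and `best_effort_fifo` names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    gpus: int


def strict_fifo(queue, free_gpus):
    """Admit jobs in order and stop at the first job that does not fit."""
    admitted = []
    for job in queue:
        if job.gpus > free_gpus:
            break  # the head of the queue blocks everything behind it
        admitted.append(job.name)
        free_gpus -= job.gpus
    return admitted


def best_effort_fifo(queue, free_gpus):
    """Admit jobs in order, skipping over any job that does not fit yet."""
    admitted = []
    for job in queue:
        if job.gpus <= free_gpus:
            admitted.append(job.name)
            free_gpus -= job.gpus
    return admitted


# A large job at the head of the queue, followed by two small jobs.
queue = [Job("large", 8), Job("small-a", 1), Job("small-b", 2)]

print(strict_fifo(queue, free_gpus=4))       # []
print(best_effort_fifo(queue, free_gpus=4))  # ['small-a', 'small-b']
```

In this model the large job keeps its place at the head of the queue and is admitted once enough GPUs are free, while the small jobs behind it are allowed to start in the meantime.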
30 changes: 19 additions & 11 deletions docs/services/gpuservice/index.md
@@ -4,7 +4,7 @@ The EIDF GPU Service (EIDF GPU Service) provides access to a range of Nvidia GPU

MIG (Multi-instance GPU) allows a single GPU to be split into multiple isolated smaller GPUs. This means that multiple users can access a portion of the GPU without being able to access what others are running on their portion.

The EIDF GPU Service hosts 3G.20GB and 1G.5GB MIG variants which are approximately 1/2 and 1/7 of a full Nvidia A100 40 GB GPU.
The EIDF GPU Service hosts 3G.20GB and 1G.5GB MIG variants, which are approximately 1/2 and 1/7 of a full Nvidia A100 40 GB GPU, respectively.
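
The approximate fractions can be read off the profile names. The sketch below assumes, beyond what this page states, that an A100 40 GB exposes 7 compute slices and that a profile name has the form `<compute slices>G.<memory in GB>GB`. The `mig_fraction` helper is hypothetical.

```python
import re

# Assumptions for an A100 40 GB card (not stated on this page).
A100_COMPUTE_SLICES = 7
A100_MEMORY_GB = 40


def mig_fraction(profile):
    """Return (compute fraction, memory fraction) for a MIG profile name."""
    match = re.fullmatch(r"(\d+)g\.(\d+)gb", profile.lower())
    if match is None:
        raise ValueError(f"unrecognised MIG profile name: {profile!r}")
    compute_slices = int(match.group(1))
    memory_gb = int(match.group(2))
    return compute_slices / A100_COMPUTE_SLICES, memory_gb / A100_MEMORY_GB


for profile in ("3G.20GB", "1G.5GB"):
    compute, memory = mig_fraction(profile)
    print(f"{profile}: {compute:.2f} of compute, {memory:.3f} of memory")
```

Under these assumptions, 3G.20GB has half of the memory and 3/7 of the compute, and 1G.5GB has 1/7 of the compute and 1/8 of the memory, which is why the fractions are described as approximate.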

The service provides access to:

@@ -26,23 +26,27 @@ The current full specification of the EIDF GPU Service as of 14 February 2024:
- 56 Nvidia A100 1G.5GB
- 32 Nvidia H100 80 GB

!!! Quotas
This is the full configuration of the cluster. Each project will have access to a quota across this shared configuration. This quota is agreed with the EIDF Services team.
!!! important "Quotas"
This is the full configuration of the cluster.

Each project will have access to a quota across this shared configuration.

Changes to the default quota must be discussed and agreed with the EIDF Services team.

## Service Access

Users should have an EIDF account - [EIDF Accounts](../../access/project.md).
Users should have an [EIDF Account](../../access/project.md).

Project Leads will be able to request access to the EIDF GPU Service for their project either during the project application process or through a service request to the EIDF helpdesk.

Each project will be given a namespace to operate in and the ability to add a kubeconfig file to any of their Virtual Machines in their EIDF project - information on access to VMs is [available here](../../access/virtualmachines-vdi.md).
Each project will be given a namespace to operate in and the ability to add a kubeconfig file to any of their Virtual Machines in their EIDF project - information on access to VMs is available [here](../../access/virtualmachines-vdi.md).

All EIDF virtual machines can be set up to access the EIDF GPU Service. The Virtual Machine does not need to be GPU-enabled.

!!! Important
!!! important "EIDF GPU Service vs EIDF GPU-Enabled VMs"
The EIDF GPU Service is a container-based service which is accessed from EIDF Virtual Desktop VMs. This allows a project to access multiple GPUs of different types.

An EIDF Virtual Desktop GPU-enabled VM is be limited to a small number (1-2) of GPUs of a single type.
An EIDF Virtual Desktop GPU-enabled VM is limited to a small number (1-2) of GPUs of a single type.

Projects do not have to apply for a GPU-enabled VM to access the GPU Service.

@@ -54,13 +58,17 @@ A standard project namespace has the following initial quota (subject to ongoing
- Memory: 1TiB
- GPU: 12

!!! Important
!!! important "Quota is a maximum on a Shared Resource"
A project quota is the maximum proportion of the service available for use by that project.

During periods of high demand, Jobs will queued awaiting resource availability on the Service.
During periods of high demand, Jobs will be queued awaiting resource availability on the Service.

This means that a project has access to up to 12 GPUs but, due to demand, may only be able to access a smaller number at any given time.
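
As a rough illustration of quota acting as a ceiling rather than a guarantee, the following sketch computes how many additional GPUs a project could start using at a given moment. The function and its parameters are hypothetical and do not correspond to any service API.

```python
def gpus_available_now(project_quota, project_in_use, cluster_free):
    """Upper bound on the extra GPUs a project could start using right now."""
    remaining_quota = max(project_quota - project_in_use, 0)
    return min(remaining_quota, max(cluster_free, 0))


print(gpus_available_now(12, 0, 40))   # 12: limited by the project quota
print(gpus_available_now(12, 0, 5))    # 5: limited by demand on the cluster
print(gpus_available_now(12, 10, 40))  # 2: most of the quota is already in use
```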

## Project Queues

The EIDF GPU Service is introducing the Kueue system in February 2024. Its use is detailed in the [Kueue](kueue.md) documentation.

## Additional Service Policy Information

Additional information on service policies can be found [here](policies.md).
Expand All @@ -79,6 +87,6 @@ This tutorial teaches users how to submit tasks to the EIDF GPU Service, but it

- The [Nvidia developers blog](https://developer.nvidia.com/blog/search-posts/?q=Kubernetes) provides several examples of how to run ML tasks on a Kubernetes GPU cluster.

- Kubernetes documentation has a useful [kubectl cheat sheet](https://kubernetes.io/docs/reference/kubectl/cheatsheet/#viewing-and-finding-resources)
- Kubernetes documentation has a useful [kubectl cheat sheet](https://kubernetes.io/docs/reference/kubectl/cheatsheet/#viewing-and-finding-resources).

- More detailed use cases for the `kubectl` can be found in the [Kubernetes documentation](https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#run)
- More detailed use cases for `kubectl` can be found in the [Kubernetes documentation](https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#run).