From 0264608104235ee58d7e7e80b76a0dd7f362ac1a Mon Sep 17 00:00:00 2001 From: <> Date: Tue, 9 Apr 2024 15:34:19 +0000 Subject: [PATCH] Deployed f2ab068 with MkDocs version: 1.5.3 --- search/search_index.json | 2 +- services/gpuservice/faq/index.html | 22 ++++++++++++++++++++++ services/gpuservice/index.html | 29 +++++++++++++++++++++-------- sitemap.xml.gz | Bin 844 -> 844 bytes 4 files changed, 44 insertions(+), 9 deletions(-) diff --git a/search/search_index.json b/search/search_index.json index e1ae1edcf..f3540448a 100644 --- a/search/search_index.json +++ b/search/search_index.json @@ -1 +1 @@ -{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"EIDF User Documentation","text":"

The Edinburgh International Data Facility (EIDF) is built and operated by EPCC at the University of Edinburgh. EIDF is a place to store, find and work with data of all kinds. You can find more information on the service and the research it supports on the EIDF website.

For more information or for support with our services, please email eidf@epcc.ed.ac.uk in the first instance.

"},{"location":"#what-the-documentation-covers","title":"What the documentation covers","text":"

This documentation gives more in-depth coverage of current EIDF services. It is aimed primarily at developers or power users.

"},{"location":"#contributing-to-the-documentation","title":"Contributing to the documentation","text":"

The source for this documentation is publicly available in the EIDF documentation GitHub repository so that anyone can contribute to improving it. Contributions can take the form of improvements or additions to the content, or of Issues suggesting how it can be improved.

Full details of how to contribute can be found in the README.md file of the repository.

This documentation set is a work in progress.

"},{"location":"#credits","title":"Credits","text":"

This documentation draws on the ARCHER2 National Supercomputing Service documentation.

"},{"location":"access/","title":"Accessing EIDF","text":"

Some EIDF services are accessed via a Web browser and some by \"traditional\" command-line ssh.

All EIDF services use the EPCC SAFE service management back end, to ensure compatibility with other EPCC high-performance computing services.

"},{"location":"access/#web-access-to-virtual-machines","title":"Web Access to Virtual Machines","text":"

The Virtual Desktop VM service is browser-based, providing a virtual desktop interface (Apache Guacamole) for \"desktop-in-a-browser\" access. Applications to use the VM service are made through the EIDF Portal.

EIDF Portal: how to ask to join an existing EIDF project and how to apply for a new project

VDI access to virtual machines: how to connect to the virtual desktop interface.

"},{"location":"access/#ssh-access-to-virtual-machines","title":"SSH Access to Virtual Machines","text":"

Users with the appropriate permissions can also use ssh to log in to Virtual Desktop VMs.

"},{"location":"access/#ssh-access-to-computing-services","title":"SSH Access to Computing Services","text":"

Includes access to the following services:

To log in to most command-line services with ssh, use the username and password you obtained from SAFE when you applied for access, along with the SSH Key you registered when creating the account. You can then log in to the host following the appropriately linked instructions above.

"},{"location":"access/project/","title":"EIDF Portal","text":"

Projects using the Virtual Desktop cloud service are accessed via the EIDF Portal.

The EIDF Portal uses EPCC's SAFE service management software to manage user accounts across all EPCC services. To log in to the Portal you will first be redirected to the SAFE log on page. If you do not have a SAFE account, follow the instructions in the SAFE documentation on how to register and receive your password.

"},{"location":"access/project/#how-to-request-to-join-a-project","title":"How to request to join a project","text":"

Log in to the EIDF Portal and navigate to \"Projects\" and choose \"Request access\". Select the project that you want to join in the \"Project\" dropdown list - you can search for the project name or the project code, e.g. \"eidf0123\".

Now you have to wait for your PI or project manager to accept your request to register.

"},{"location":"access/project/#how-to-apply-for-a-project-as-a-principal-investigator","title":"How to apply for a project as a Principal Investigator","text":""},{"location":"access/project/#create-a-new-project-application","title":"Create a new project application","text":"

Navigate to the EIDF Portal and log in via SAFE if necessary (see above).

Once you have logged in click on \"Applications\" in the menu and choose \"New Application\".

  1. Fill in the Application Title - this will be the name of the project once it is approved.
  2. Choose a start date and an end date for your project.
  3. Click \"Create\" to create your project application.

Once the application has been created you will see an overview of the form you are required to fill in. You can revisit the application at any time by clicking on \"Applications\" and choosing \"Your applications\" to display all your current and past applications and their status, or by following the link https://portal.eidf.ac.uk/proposal/.

"},{"location":"access/project/#populate-a-project-application","title":"Populate a project application","text":"

Fill in each section of the application as required:

You can edit and save each section separately and revisit the application at a later time.

"},{"location":"access/project/#datasets","title":"Datasets","text":"

You are required to fill in a \"Dataset\" form for each dataset that you are planning to store and process as part of your project.

We are required to ensure that projects involving \"sensitive\" data have the necessary permissions in place. The answers to these questions will enable us to decide what additional documentation we may need, and whether your project may need to be set up in an independently governed Safe Haven. There may be some projects we are simply unable to host for data protection reasons.

"},{"location":"access/project/#resource-requirements","title":"Resource Requirements","text":"

Add an estimate for each size and type of VM that is required.

"},{"location":"access/project/#submission","title":"Submission","text":"

When you are happy with your application, click \"Submit\". If any required fields are missing, they are highlighted and the submission will fail.

Once your submission is successful, the application status is marked as \"Submitted\" and you have to wait while the EIDF approval team considers your application. You may be contacted if there are any questions regarding your application or if further information is required, and you will be notified of the outcome of your application.

"},{"location":"access/project/#approved-project","title":"Approved Project","text":"

If your application is approved, refer to Data Science Virtual Desktops: Quickstart for how to view your project, and to Data Science Virtual Desktops: Managing VMs for how to manage a project and create virtual machines and user accounts.

"},{"location":"access/ssh/","title":"SSH Access to Virtual Machines using the EIDF-Gateway Jump Host","text":"

The EIDF-Gateway is an SSH gateway suitable for accessing EIDF Services via a console or terminal. As the gateway cannot be 'landed' on, a user can only pass through it and so the destination (the VM IP) has to be known for the service to work. Users connect to their VM through the jump host using their given accounts. You will require three things to use the gateway:

  1. A user within a project allowed to access the gateway and a password set.
  2. An SSH-key linked to this account, used to authenticate against the gateway.
  3. MFA set up for your project account via SAFE.

Steps to meet all of these requirements are explained below.

"},{"location":"access/ssh/#generating-and-adding-an-ssh-key","title":"Generating and Adding an SSH Key","text":"

In order to make use of the EIDF-Gateway, your EIDF account needs an SSH-Key associated with it. If you added one while creating your EIDF account, you can skip this step.

"},{"location":"access/ssh/#check-for-an-existing-ssh-key","title":"Check for an existing SSH Key","text":"

To check if you have an SSH Key associated with your account:

  1. Log in to the Portal
  2. Select 'Your Projects'
  3. Select your project name
  4. Select your username

If there is an entry under 'Credentials', then you're all set up. If not, you'll need to generate an SSH Key as follows:

"},{"location":"access/ssh/#generate-a-new-ssh-key","title":"Generate a new SSH Key","text":"
  1. Open a new window of whatever terminal you will use to SSH to EIDF.
  2. Generate a new SSH Key:

    ssh-keygen\n
  3. It is fine to accept the default name and path for the key unless you manage a number of keys.

  4. Press enter to finish generating the key
"},{"location":"access/ssh/#adding-the-new-ssh-key-to-your-account-via-the-portal","title":"Adding the new SSH Key to your account via the Portal","text":"
  1. Log in to the Portal
  2. Select 'Your Projects'
  3. Select the relevant project
  4. Select your username
  5. Select the plus button under 'Credentials'
  6. Select 'Choose File' to upload the PUBLIC (.pub) ssh key generated in the last step, or open the .pub file you just created and copy its contents into the text box.
  7. Click 'Upload Credential'.
"},{"location":"access/ssh/#adding-a-new-ssh-key-via-safe","title":"Adding a new SSH Key via SAFE","text":"

    This should not be necessary for most users, so only follow this process if you have an issue or have been told to by the EPCC Helpdesk. If you need to add an SSH Key directly to SAFE, you can follow this guide. However, select your '[username]@EIDF' login account, not 'Archer2' as specified in that guide.

    "},{"location":"access/ssh/#enabling-mfa-via-the-portal","title":"Enabling MFA via the Portal","text":"

    A multi-factor Time-Based One-Time Password is now required to access the SSH Gateway.

    To enable this for your EIDF account:

    1. Log in to the Portal.
    2. Select 'Projects' then 'Your Projects'
    3. Select the project containing the account you'd like to add MFA to.
    4. Under 'Your Accounts', select the account you would like to add MFA to.
    5. Select 'Set MFA Token'
    6. Within your chosen MFA application, scan the QR Code or enter the key and add the token.
    7. Enter the code displayed in the app into the 'Verification Code' box and select 'Set Token'
    8. You will be redirected to the User Account page and a green 'Added MFA Token' message will confirm the token has been added successfully.

    Note

    TOTP is required only for the SSH Gateway, not for the VMs themselves and not for access through the VDI. An MFA token must be set for each account you'd like to use to access the EIDF SSH Gateway.
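
For readers curious about what the MFA application computes, the sketch below derives a time-based one-time password along the lines of RFC 6238. It assumes the common defaults (HMAC-SHA-1, a 30-second time step, 6 digits); the settings actually used by the EIDF gateway are not stated in this guide, and the function name totp and its parameters are illustrative only.

```python
import hashlib
import hmac
import struct


def totp(secret: bytes, unix_time: int, step: int = 30, digits: int = 6) -> str:
    # Assumed defaults: HMAC-SHA-1, 30-second step, 6 digits (not confirmed for EIDF).
    counter = unix_time // step
    # The counter is hashed as an 8-byte big-endian integer keyed by the shared secret.
    digest = hmac.new(secret, struct.pack('>Q', counter), hashlib.sha1).digest()
    # Dynamic truncation: the low 4 bits of the last byte select a 4-byte window.
    offset = digest[-1] & 0x0F
    value = struct.unpack('>I', digest[offset:offset + 4])[0] & 0x7FFFFFFF
    # Keep the last `digits` decimal digits, left-padded with zeros.
    return str(value % 10 ** digits).zfill(digits)
```

In practice the secret would be the key shown alongside the QR code (typically Base32 text, decoded to bytes) and unix_time the current time in seconds. The code changes every time step, which is why a stale code is rejected at the TOTP prompt.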

    "},{"location":"access/ssh/#using-the-ssh-key-and-totp-code-to-access-eidf-windows-and-linux","title":"Using the SSH-Key and TOTP Code to access EIDF - Windows and Linux","text":"
    1. From your local terminal, import the SSH Key you generated above: ssh-add /path/to/ssh-key

    2. This should return \"Identity added [Path to SSH Key]\" if successful. You can then follow the steps below to access your VM.

    "},{"location":"access/ssh/#accessing-from-macoslinux","title":"Accessing From MacOS/Linux","text":"

    Warning

    If this is your first time connecting to EIDF using a new account, you have to set a password as described in Set or change the password for a user account.

    OpenSSH is usually installed by default on Linux and macOS, so you can access the gateway natively from the terminal.

    Ensure you have created and added an ssh key as specified in the 'Generating and Adding an SSH Key' section above, then run the commands below:

    ssh-add /path/to/ssh-key\nssh -J [username]@eidf-gateway.epcc.ed.ac.uk [username]@[vm_ip]\n

    For example:

    ssh-add ~/.ssh/keys/id_ed25519\nssh -J alice@eidf-gateway.epcc.ed.ac.uk alice@10.24.1.1\n

    Info

    If the ssh-add command fails saying the SSH Agent is not running, run the command below:

    eval `ssh-agent`

    Then re-run the ssh-add command above.

    The -J flag is used to specify that we will access the second specified host by jumping through the first specified host.

    You will be prompted for a 'TOTP' code upon successful public key authentication to the gateway. At the TOTP prompt, enter the code displayed in your MFA Application.
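
If you want to script the connection, the jump-host command above can be assembled programmatically. The following is a minimal sketch, not an official EIDF tool: the helper name jump_command is hypothetical, and the username and VM IP are the example values used in this guide.

```python
import shlex

GATEWAY = 'eidf-gateway.epcc.ed.ac.uk'  # gateway host named in this guide


def jump_command(username, vm_ip, gateway=GATEWAY):
    # Build the argument list for: ssh -J <user>@<gateway> <user>@<vm_ip>
    return ['ssh', '-J', f'{username}@{gateway}', f'{username}@{vm_ip}']


if __name__ == '__main__':
    # Print the command rather than running it, so nothing connects by accident.
    print(shlex.join(jump_command('alice', '10.24.1.1')))
```

Passing the returned list to subprocess.run would start the session; the TOTP prompt still appears interactively, so this does not bypass MFA.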

    "},{"location":"access/ssh/#accessing-from-windows","title":"Accessing from Windows","text":"

    Windows requires the OpenSSH Client to be installed in order to use SSH. PuTTY or MobaXterm can also be used but won\u2019t be covered in this tutorial.

    "},{"location":"access/ssh/#installing-and-using-openssh","title":"Installing and using OpenSSH","text":"
    1. Click the \u2018Start\u2019 button at the bottom of the screen
    2. Click the \u2018Settings\u2019 cog icon
    3. Select 'System'
    4. Select the \u2018Optional Features\u2019 option at the bottom of the list
    5. If \u2018OpenSSH Client\u2019 is not under \u2018Installed Features\u2019, click the \u2018View Features\u2019 button
    6. Search \u2018OpenSSH Client\u2019
    7. Select the check box next to \u2018OpenSSH Client\u2019 and click \u2018Install\u2019
    "},{"location":"access/ssh/#accessing-eidf-via-a-terminal","title":"Accessing EIDF via a Terminal","text":"

    Warning

    If this is your first time connecting to EIDF using a new account, you have to set a password as described in Set or change the password for a user account.

    1. Open either PowerShell or the Windows Terminal
    2. Import the SSH Key you generated above:

      ssh-add \\path\\to\\sshkey\n\nFor Example:\nssh-add .\\.ssh\\id_ed25519\n
    3. This should return \"Identity added [Path to SSH Key]\" if successful. If it doesn't, run the following in PowerShell:

      Get-Service -Name ssh-agent | Set-Service -StartupType Manual\nStart-Service ssh-agent\nssh-add \\path\\to\\sshkey\n
    4. Login by jumping through the gateway.

      ssh -J [EIDF username]@eidf-gateway.epcc.ed.ac.uk [EIDF username]@[vm_ip]\n\nFor Example:\nssh -J alice@eidf-gateway.epcc.ed.ac.uk alice@10.24.1.1\n

    You will be prompted for a 'TOTP' code upon successful public key authentication to the gateway. At the TOTP prompt, enter the code displayed in your MFA Application.

    "},{"location":"access/ssh/#ssh-aliases","title":"SSH Aliases","text":"

    You can use SSH Aliases to access your VMs with a single word.

    1. Create a new entry for the EIDF-Gateway in your ~/.ssh/config file. Using the text editor of your choice (vi used as an example), edit the .ssh/config file:

      vi ~/.ssh/config\n
    2. Insert the following lines:

      Host eidf-gateway\n  Hostname eidf-gateway.epcc.ed.ac.uk\n  User <eidf project username>\n  IdentityFile /path/to/ssh/key\n

      For example:

      Host eidf-gateway\n  Hostname eidf-gateway.epcc.ed.ac.uk\n  User alice\n  IdentityFile ~/.ssh/id_ed25519\n
    3. Save and quit the file.

    4. Now you can ssh to your VM using the command below:

      ssh -J eidf-gateway [EIDF username]@[vm_ip] -i /path/to/ssh/key\n

      For Example:

      ssh -J eidf-gateway alice@10.24.1.1 -i ~/.ssh/id_ed25519\n
    5. You can add further alias options to make accessing your VM quicker. For example, if you use the template below to create an entry beneath the EIDF-Gateway entry in ~/.ssh/config, you can use the alias name to jump automatically through the EIDF-Gateway and onto your VM:

      Host <vm name/alias>\n  HostName 10.24.VM.IP\n  User <vm username>\n  IdentityFile /path/to/ssh/key\n  ProxyCommand ssh eidf-gateway -W %h:%p\n

      For Example:

      Host demo\n  HostName 10.24.1.1\n  User alice\n  IdentityFile ~/.ssh/id_ed25519\n  ProxyCommand ssh eidf-gateway -W %h:%p\n
    6. Now, when you run ssh demo, your ssh client will follow the 'ProxyCommand' setting in the 'demo' alias and jump through the gateway before connecting to your VM. Note that for this setup, if your key is RSA, you will need to add the following line to the bottom of the 'demo' alias: HostKeyAlgorithms +ssh-rsa

    Info

    This has added an 'Alias' entry to your ssh config, so whenever you ssh to 'eidf-gateway' your ssh client will automatically fill in the hostname, your username and ssh key. This allows a much simpler ssh command to reach your VMs. You can replace the alias name with whatever you like: just change the 'Host' line from 'eidf-gateway' to the alias you would like. The -J flag is used to specify that we will access the second specified host by jumping through the first specified host.
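
If you manage several VMs, the per-VM alias entries shown above can be generated rather than typed by hand. The sketch below fills in the same template; the names VM_STANZA and vm_stanza are hypothetical, and it assumes the gateway alias is 'eidf-gateway' as configured earlier in this section.

```python
VM_STANZA = '''Host {alias}
  HostName {vm_ip}
  User {user}
  IdentityFile {key_path}
  ProxyCommand ssh {gateway_alias} -W %h:%p
'''


def vm_stanza(alias, vm_ip, user, key_path, gateway_alias='eidf-gateway'):
    # Fill in the template; %h and %p are left untouched for ssh itself to expand.
    return VM_STANZA.format(alias=alias, vm_ip=vm_ip, user=user,
                            key_path=key_path, gateway_alias=gateway_alias)


if __name__ == '__main__':
    # Reproduces the 'demo' example entry from this section.
    print(vm_stanza('demo', '10.24.1.1', 'alice', '~/.ssh/id_ed25519'))
```

The printed text can be pasted beneath the EIDF-Gateway entry in ~/.ssh/config. Review the result before saving, since a mistyped HostName or User only shows up when the connection fails.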

    "},{"location":"access/ssh/#first-password-setting-and-password-resets","title":"First Password Setting and Password Resets","text":"

    Before logging in for the first time you have to set the password using the web form in the EIDF Portal, following the instructions in Set or change the password for a user account.

    "},{"location":"access/virtualmachines-vdi/","title":"Virtual Machines (VMs) and the EIDF Virtual Desktop Interface (VDI)","text":"

    Using the EIDF VDI, members of EIDF projects can connect to VMs that they have been granted access to. The EIDF VDI is a web portal that displays the connections to VMs a user has available to them, and then those connections can be easily initiated by clicking on them in the user interface. Once connected to the target VM, all interactions are mediated through the user's web browser by the EIDF VDI.

    "},{"location":"access/virtualmachines-vdi/#login-to-the-eidf-vdi","title":"Login to the EIDF VDI","text":"

    Once your membership request to join the appropriate EIDF project has been approved, you will be able to login to the EIDF VDI at https://eidf-vdi.epcc.ed.ac.uk/vdi.

    Authentication to the VDI is provided by SAFE, so if you do not have an active web browser session in SAFE, you will be redirected to the SAFE log on page. If you do not have a SAFE account, follow the instructions in the SAFE documentation on how to register and receive your password.

    "},{"location":"access/virtualmachines-vdi/#navigating-the-eidf-vdi","title":"Navigating the EIDF VDI","text":"

    After you have been authenticated through SAFE and logged into the EIDF VDI, if you have multiple connections available to you that have been associated with your user (typically in the case of research projects), you will be presented with the VDI home screen as shown below:

    VDI home page with list of available VM connections

    Adding connections

    Note that if a project manager has added a new connection for you it may not appear in the list of connections immediately. You must log out and log in again to refresh your connections list.

    "},{"location":"access/virtualmachines-vdi/#connecting-to-a-vm","title":"Connecting to a VM","text":"

    If you have only one connection associated with your VDI user account (typically in the case of workshops), you will be automatically connected to the target VM's virtual desktop. Once you are connected to the VM, you will be asked for your username and password as shown below (if you are participating in a workshop, you may not be asked for credentials).

    VM virtual desktop connection user account login screen

    Once your credentials have been accepted, you will be connected to your VM's desktop environment. For instance, the screenshot below shows a resulting connection to a Xubuntu 20.04 VM with the Xfce desktop environment.

    VM virtual desktop

    "},{"location":"access/virtualmachines-vdi/#vdi-features-for-the-virtual-desktop","title":"VDI Features for the Virtual Desktop","text":"

    The EIDF VDI is an instance of the Apache Guacamole clientless remote desktop gateway. Since the connection to your VM virtual desktop is entirely managed through Guacamole in the web browser, there are some additional features to be aware of that may assist you when using the VDI.

    "},{"location":"access/virtualmachines-vdi/#the-vdi-menu","title":"The VDI Menu","text":"

    The Guacamole menu is a sidebar which is hidden until explicitly shown. On a desktop or other device which has a hardware keyboard, you can show this menu by pressing <Ctrl> + <Alt> + <Shift> on a Windows PC client, or <Ctrl> + <Command> + <Shift> on a Mac client. To hide the menu, you press the same key combination once again. The menu provides various options, including:

    "},{"location":"access/virtualmachines-vdi/#clipboard-copy-and-paste-functionality","title":"Clipboard Copy and Paste Functionality","text":"

    After you have activated the Guacamole menu using the key combination above, at the top of the menu is a text area labeled \u201cclipboard\u201d along with some basic instructions:

    Text copied/cut within Guacamole will appear here. Changes to the text below will affect the remote clipboard.

    The text area functions as an interface between the remote clipboard and the local clipboard. Text from the local clipboard can be pasted into the text area, causing that text to be sent to the clipboard of the remote desktop. Similarly, if you copy or cut text within the remote desktop, you will see that text within the text area, and can manually copy it into the local clipboard if desired.

    You can use the standard keyboard shortcuts to copy text from your client PC or Mac to the Guacamole menu clipboard, then again copy that text from the Guacamole menu clipboard into an application or CLI terminal on the VM's remote desktop. An example of using the copy and paste clipboard is shown in the screenshot below.

    The EIDF VDI Clipboard

    "},{"location":"access/virtualmachines-vdi/#keyboard-language-and-layout-settings","title":"Keyboard Language and Layout Settings","text":"

    For users who do not have standard English (UK) keyboard layouts, key presses can have unexpected translations as they are transmitted to your VM. Please contact the EIDF helpdesk at eidf@epcc.ed.ac.uk if you are experiencing difficulties with your keyboard mapping, and we will help to resolve this by changing some settings in the Guacamole VDI connection configuration.

    "},{"location":"bespoke/","title":"Bespoke Services","text":"

    Ed-DaSH

    "},{"location":"bespoke/eddash/","title":"EIDFWorkshops","text":"

    Ed-DaSH Notebook Service

    Ed-DaSH Virtual Machines

    JupyterHub Notebook Service Access

    "},{"location":"bespoke/eddash/jhub-git/","title":"EIDF JupyterHub Notebook Service Access","text":"

    Using the EIDF JupyterHub, users can access a range of services including standard interactive Python notebooks as well as RStudio Server.

    "},{"location":"bespoke/eddash/jhub-git/#ed-dash-workshops","title":"Ed-DaSH Workshops","text":""},{"location":"bespoke/eddash/jhub-git/#accessing","title":"Accessing","text":"

    Access to the EIDF JupyterHub is authenticated through GitHub, so you must have an account on https://github.com and that account must be a member of the appropriate GitHub organization. Please ask your project admin or workshop instructor for the workshop GitHub organization details, then follow the relevant steps listed below to prepare.

    1. If you do not have a GitHub account associated with the email you registered for the workshop with, follow the steps described in Step 1: Creating a GitHub Account
    2. If you do already have a GitHub account associated with the email address you registered for the workshop with, follow the steps described in Step 2: Registering with the Workshop GitHub Organization
    "},{"location":"bespoke/eddash/jhub-git/#step-1-creating-a-github-account","title":"Step 1: Creating a GitHub Account","text":"
    1. Visit https://github.com/signup in your browser
    2. Enter the email address that you used to register for the workshop
    3. Complete the remaining steps of the GitHub registration process
    4. Send an email to ed-dash-support@mlist.is.ed.ac.uk from your GitHub registered email address, including your GitHub username, and ask for an invitation to the workshop GitHub organization
    5. Wait for an email from GitHub inviting you to join the organization, then follow the steps in Step 2: Registering with the Workshop GitHub Organization
    "},{"location":"bespoke/eddash/jhub-git/#step-2-registering-with-the-workshop-github-organization","title":"Step 2: Registering With the Workshop GitHub Organization","text":"
    1. If you already have a GitHub account associated with the email address that you registered for the workshop with, you should have received an email inviting you to join the relevant GitHub organization. If you have not, email ed-dash-support@mlist.is.ed.ac.uk from your GitHub registered email address, including your GitHub username, and ask for an invitation to the workshop GitHub organization
    2. Once you have been invited to the GitHub organization, you will receive an email with the invitation; click on the button in the email (screenshot: Invitation to join the workshop GitHub organization)
    3. Clicking on the button in the email will open a new web page with another form (screenshot: Form to accept the invitation to join the GitHub organization)
    4. Again, click on the button to confirm, then the Ed-DaSH-Training GitHub organization page will open
    "},{"location":"bespoke/eddash/safe-registration/","title":"Accessing","text":"

    In order to access the EIDF VDI and connect to EIDF data science cloud VMs, you need to have an active SAFE account. If you already have a SAFE account, you can skip ahead to the Request Project Membership instructions. Otherwise, follow the Register Account in EPCC SAFE instructions immediately below to create the account.

    Info

    Please also see Register and Join a project in the SAFE documentation for more information.

    "},{"location":"bespoke/eddash/safe-registration/#step-1-register-account-in-epcc-safe","title":"Step 1: Register Account in EPCC SAFE","text":"
    1. Go to SAFE signup and complete the registration form
      1. Mandatory fields are: Email, Nationality, First name, Last name, Institution for reporting, Department, and Gender
      2. Your Email should be the one you used to register for the EIDF service (or Ed-DaSH workshop)
      3. If you are unsure, enter 'University of Edinburgh' for Institution for reporting and 'EIDF' for Department (screenshot: SAFE registration form)
    2. Submit the form, then accept the SAFE Acceptable Use policy on the next page (screenshot: SAFE User Access Agreement)
    3. After you have completed the registration form and accepted the policy, you will receive an email from support@archer2.ac.uk with a password reset URL
    4. Visit the link in the email and generate a new password, then submit the form
    5. You will now be logged into your new account in SAFE
    "},{"location":"bespoke/eddash/safe-registration/#step-2-request-project-membership","title":"Step 2: Request Project Membership","text":"
    1. While logged into SAFE, select the \u2018Request Access\u2019 menu item from the 'Projects' menu in the top menu bar
    2. This will open the 'Apply for project membership' page
    3. Enter the appropriate project ID into the \u2018Project\u2019 field and click the \u2018Next\u2019 button (screenshot: Apply for project membership in SAFE)
    4. In the 'Access route' drop down field that appears, select 'Request membership' (not 'Request machine account') (screenshot: Request project membership in SAFE)
    5. The project owner will then receive notification of the application and accept your request
    "},{"location":"bespoke/eddash/workshops/","title":"Workshop Setup","text":"

    Please follow the instructions in JupyterHub Notebook Service Access to arrange access to the EIDF Notebook service before continuing. The table below provides the login URL and the relevant GitHub organization to register with.

    | Workshop | Login URL | GitHub Organization |
    | --- | --- | --- |
    | Ed-DaSH Introduction to Statistics | https://secure.epcc.ed.ac.uk/ed-dash-hub | Ed-DaSH-Training |
    | Ed-DaSH High-Dimensional Statistics | https://secure.epcc.ed.ac.uk/ed-dash-hub | Ed-DaSH-Training |
    | Ed-DaSH Introduction to Machine Learning with Python | https://secure.epcc.ed.ac.uk/ed-dash-hub | Ed-DaSH-Training |
    | N8 CIR Introduction to Artificial Neural Networks in Python | https://secure.epcc.ed.ac.uk/ed-dash-hub | Ed-DaSH-Training |

    Please follow the sequence of instructions described in the sections below to get ready for the workshop:

    1. Step 1: Accessing the EIDF Notebook Service for the First Time
    2. Step 2: Login to EIDF JupyterHub
    3. Step 3: Creating a New R Script
    "},{"location":"bespoke/eddash/workshops/#step-1-accessing-the-eidf-notebook-service-for-the-first-time","title":"Step 1: Accessing the EIDF Notebook Service for the First Time","text":"

    We will be using the Notebook service provided by the Edinburgh International Data Facility (EIDF). Follow the steps listed below to gain access.

    Warning

    If you are receiving an error response such as '403: Forbidden' when you try to access https://secure.epcc.ed.ac.uk/ed-dash-hub, please send an email to ed-dash-support@mlist.is.ed.ac.uk to request access and also include your IP address which you can find by visiting https://whatismyipaddress.com/ in your browser. Please be aware that if you are accessing the service from outside of the UK, your access might be blocked until you have emailed us with your IP address.

    1. Click on the button
    2. You will be asked to sign in to GitHub (screenshot: GitHub sign in form for access to EIDF Notebook Service)
    3. Enter your GitHub credentials, or click on the \u2018Create an account\u2019 link if you do not already have one, and follow the prerequisite instructions to register with GitHub and join the workshop organization
    4. Click on the \u2018Sign in\u2019 button
    5. On the next page, you will be asked whether to authorize the workshop organization to access your GitHub account (screenshot: GitHub form requesting authorization for the workshop organization)
    6. Click on the button
    7. At this point, you will receive an email to the email address that you registered with in GitHub, stating that \u201cA third-party OAuth application has been added to your account\u201d for the workshop
    8. If you receive a \u2018403 : Forbidden\u2019 error message on the next screen and have not already requested an invitation (as in step 4 of the prerequisites section), send an email to ed-dash-support@mlist.is.ed.ac.uk from your GitHub registered email address, including your GitHub username, and ask for an invitation to the workshop organization. Otherwise, skip to the next step. N.B. If you are accessing the service from outside of the UK, you may also see this error; if so, please contact ed-dash-support@mlist.is.ed.ac.uk to enable access
    9. If you receive a \u2018400 : Bad Request\u2019 error message, you need to accept the invitation that has been emailed to you to join the workshop organization as in the prerequisite instructions
    "},{"location":"bespoke/eddash/workshops/#step-2-login-to-the-eidf-notebook-service","title":"Step 2: Login to the EIDF Notebook Service","text":"

    Now that you have completed registration with the workshop GitHub organization, you can access the workshop RStudio Server in EIDF.

    1. Return to https://secure.epcc.ed.ac.uk/ed-dash-hub
    2. You will be presented with a choice of server as a list of radio buttons. Select the appropriate one as labelled for your workshop and press the orange 'Start' button
    3. You will now be redirected to the hub spawn pending page for your individual server instance
    4. You will see a message stating that your server is launching. If the page has not updated after 10 seconds, simply refresh the page with the <CTRL> + R or <F5> keys in Windows, or <CMD> + R in macOS
    5. Finally, you will be redirected to either the RStudio Server if it's a statistics workshop, or the Jupyter Lab dashboard otherwise, as shown in the screenshots below The RStudio Server UI The Jupyter Lab Dashboard
    "},{"location":"bespoke/eddash/workshops/#step-3-creating-a-new-r-script","title":"Step 3: Creating a New R Script","text":"

    Follow these quickstart instructions to create your first R script in RStudio Server!

    "},{"location":"faq/","title":"FAQ","text":""},{"location":"faq/#eidf-frequently-asked-questions","title":"EIDF Frequently Asked Questions","text":""},{"location":"faq/#how-do-i-contact-the-eidf-helpdesk","title":"How do I contact the EIDF Helpdesk?","text":"

    Submit a query in the EIDF Portal by selecting \"Submit a Support Request\" in the \"Help and Support\" menu and filling in this form.

    You can also email us at eidf@epcc.ed.ac.uk.

    "},{"location":"faq/#how-do-i-request-more-resources-for-my-project-can-i-extend-my-project","title":"How do I request more resources for my project? Can I extend my project?","text":"

    Submit a support request: In the form select the project that your request relates to and select \"EIDF Project extension: duration and quota\" from the dropdown list of categories. Then enter the new quota or extension date in the description text box below and submit the request.

    The EIDF approval team will consider the extension and you will be notified of the outcome.

    "},{"location":"faq/#new-vms-and-vdi-connections","title":"New VMs and VDI connections","text":"

    My project manager gave me access to a VM but the connection doesn't show up in the VDI connections list?

    This may happen when a machine/VM was added to your connections list while you were logged in to the VDI. Please refresh the connections list by logging out and logging in again, and the new connections should appear.

    "},{"location":"faq/#non-default-ssh-keys","title":"Non-default SSH Keys","text":"

    I have different SSH keys for the SSH gateway and my VM, or I use a key which does not have the default name (~/.ssh/id_rsa) and I cannot login.

    The command syntax shown in our SSH documentation (using the -J <username>@eidf-gateway stanza) makes assumptions about SSH keys and their naming. You should try the full version of the command:

    ssh -o ProxyCommand=\"ssh -i ~/.ssh/<gateway_private_key> -W %h:%p <gateway_username>@eidf-gateway.epcc.ed.ac.uk\" -i ~/.ssh/<vm_private_key> <vm_username>@<vm_ip>\n

    Note that for the majority of users, gateway_username and vm_username are the same, as are gateway_private_key and vm_private_key.
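As a rough sketch of how the pieces of the full command fit together, the helper below prints (but does not run) the SSH command for a given set of usernames and keys. The function name build_eidf_ssh_cmd and the example values (alice, the key paths, the address 10.0.0.5) are hypothetical placeholders, not part of the EIDF service.

```shell
#!/bin/bash
# Hypothetical helper: print the full SSH command for separate gateway and VM keys.
# Nothing is executed; copy the printed command once it looks right.
build_eidf_ssh_cmd() {
    local gw_user="$1" gw_key="$2" vm_user="$3" vm_key="$4" vm_ip="$5"
    # The inner ssh runs on the gateway hop and forwards to the VM (%h:%p).
    local proxy="ssh -i ${gw_key} -W %h:%p ${gw_user}@eidf-gateway.epcc.ed.ac.uk"
    printf 'ssh -o ProxyCommand="%s" -i %s %s@%s\n' "$proxy" "$vm_key" "$vm_user" "$vm_ip"
}

# Example with placeholder values (same user on both hops, different keys).
build_eidf_ssh_cmd alice "$HOME/.ssh/gw_key" alice "$HOME/.ssh/vm_key" 10.0.0.5
```

Printing the command first makes it easy to see which key is used on which hop before attempting a connection.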

    "},{"location":"faq/#username-policy","title":"Username Policy","text":"

    I already have an EIDF username for project Y, can I use this for project X?

    We mandate that every username must be unique across our estate. EPCC machines including EIDF services such as the SDF and DSC VMs, and HPC services such as Cirrus require you to create a new machine account with a unique username for each project you work on. Usernames cannot be used on multiple projects, even if the previous project has finished. However, some projects span multiple machines so you may be able to login to multiple machines with the same username.

    "},{"location":"known-issues/","title":"Known Issues","text":""},{"location":"known-issues/#virtual-desktops","title":"Virtual desktops","text":"

    No known issues.

    "},{"location":"overview/","title":"A Unique Service for Academia and Industry","text":"

    The Edinburgh International Data Facility (EIDF) is a growing set of data and compute services developed to support the Data Driven Innovation Programme at the University of Edinburgh.

    Our goal is to support learners, researchers and innovators across the spectrum, with services from data discovery through simple learn-as-you-play-with-data notebooks to GPU-enabled machine-learning platforms for driving AI application development.

    "},{"location":"overview/#eidf-and-the-data-driven-innovation-initiative","title":"EIDF and the Data-Driven Innovation Initiative","text":"

    Launched at the end of 2018, the Data-Driven Innovation (DDI) programme is one of six funded within the Edinburgh & South-East Scotland City Region Deal. The DDI programme aims to make Edinburgh the \u201cData Capital of Europe\u201d, with ambitious targets to support, enhance and improve talent, research, commercial adoption and entrepreneurship across the region through better use of data.

    The programme targets ten industry sectors, with interactions managed through five DDI Hubs: the Bayes Centre, the Usher Institute, Edinburgh Futures Institute, the National Robotarium, and Easter Bush. The activities of these Hubs are underpinned by EIDF.

    "},{"location":"overview/acknowledgements/","title":"Acknowledging EIDF","text":"

    If you make use of EIDF services in your work, we encourage you to acknowledge us in any publications.

    Acknowledgement of using the facility in publications can be used as an identifiable metric to evaluate the scientific support provided, and helps promote the impact of the wider DDI Programme.

    We encourage our users to ensure that an acknowledgement of EIDF is included in the relevant section of their manuscript. We would suggest:

    This work was supported by the Edinburgh International Data Facility (EIDF) and the Data-Driven Innovation Programme at the University of Edinburgh.

    "},{"location":"overview/contacts/","title":"Contact","text":"

    The Edinburgh International Data Facility is located at the Advanced Computing Facility of EPCC, the supercomputing centre based at the University of Edinburgh.

    "},{"location":"overview/contacts/#email-us","title":"Email us","text":"

    Email EIDF: eidf@epcc.ed.ac.uk

    "},{"location":"overview/contacts/#sign-up","title":"Sign up","text":"

    Join our mailing list to receive updates about EIDF.

    "},{"location":"safe-haven-services/network-access-controls/","title":"Safe Haven Network Access Controls","text":"

    The TRE Safe Haven services are protected against open, global access by IPv4 source address filtering. These network access controls ensure that connections are permitted only from Safe Haven controller partner networks and collaborating research institutions.

    The network access controls are managed by the Safe Haven service controllers who instruct EPCC to add and remove the IPv4 addresses allowed to connect to the service gateway. Researchers must connect to the Safe Haven service by first connecting to their institution or corporate VPN and then connecting to the Safe Haven.

    The Safe Haven IG controller and research project co-ordination teams must submit and confirm IPv4 address filter changes to their service help desk via email.

    "},{"location":"safe-haven-services/overview/","title":"Safe Haven Services","text":"

    The EIDF Trusted Research Environment (TRE) hosts several Safe Haven services that enable researchers to work with sensitive data in a secure environment. These services are operated by EPCC in partnership with Safe Haven controllers who manage the Information Governance (IG) appropriate for the research activities and the data access of their Safe Haven service.

    It is the responsibility of EPCC as the Safe Haven operator to design, implement and administer the technical controls required to deliver the Safe Haven security regime demanded by the Safe Haven controller.

    The role of the Safe Haven controller is to satisfy the needs of the researchers and the data suppliers. The controller is responsible for guaranteeing the confidentiality needs of the data suppliers and matching these with the availability needs of the researchers.

    The service offers secure data sharing and analysis environments allowing researchers access to sensitive data under the terms and conditions prescribed by the data providers. The service prioritises the requirements of the data provider over the demands of the researcher and is an academic TRE operating under the guidance of the Five Safes framework.

    The TRE has dedicated, private cloud infrastructure at EPCC's Advanced Computing Facility (ACF) data centre and has its own HPC cluster and high-performance file systems. When a new Safe Haven service is commissioned in the TRE it is created in a new virtual private cloud providing the Safe Haven service controller with an independent IG domain separate from other Safe Havens in the TRE. All TRE service infrastructure and all TRE project data are hosted at ACF.

    If you have any questions about the EIDF TRE or about Safe Haven services, please contact us.

    "},{"location":"safe-haven-services/safe-haven-access/","title":"Safe Haven Service Access","text":"

    Safe Haven services are accessed from a registered network connection address using a browser. The service URL will be \"https://shs.epcc.ed.ac.uk/<service>\" where <service> is the Safe Haven service name.

    The Safe Haven access process is in three stages from multi-factor authentication to project desktop login.

    Researchers who are active in many research projects and in more than one Safe Haven will need to pay attention to the service they connect to, the project desktop they login to, and the accounts and identities they are using.

    "},{"location":"safe-haven-services/safe-haven-access/#safe-haven-login","title":"Safe Haven Login","text":"

    The first step in the process prompts the user for a Safe Haven username and then for a session PIN code sent via SMS text to the mobile number registered for the username.

    Valid PIN code entry allows the user access to all of the Safe Haven service remote desktop gateways for up to 24 hours without entry of a new PIN code. A user who has successfully entered a PIN code once can access shs.epcc.ed.ac.uk/haven1 and shs.epcc.ed.ac.uk/haven2 without repeating PIN code identity verification.

    When a valid PIN code is accepted, the user is prompted to accept the service use terms and conditions.

    Registration of the user mobile phone number is managed by the Safe Haven IG controller and research project co-ordination teams by submitting and confirming user account changes through the dedicated service help desk via email.

    "},{"location":"safe-haven-services/safe-haven-access/#remote-desktop-gateway-login","title":"Remote Desktop Gateway Login","text":"

    The second step in the access process is for the user to login to the Safe Haven service remote desktop gateway so that a project desktop connection can be chosen. The user is prompted for a Safe Haven service account identity.

    VDI Safe Haven Service Login Page

    Safe Haven accounts are managed by the Safe Haven IG controller and research project co-ordination teams by submitting and confirming user account changes through the dedicated service help desk via email.

    "},{"location":"safe-haven-services/safe-haven-access/#project-desktop-connection","title":"Project Desktop Connection","text":"

    The third stage in the process is to select the virtual connection from those available on the account's home page. An example home page is shown below offering two connection options to the same virtual machine. Remote desktop connections will have an _rdp suffix and SSH terminal connections have an _ssh suffix. The most recently used connections are shown as screen thumbnails at the top of the page and all the connections available to the user are shown in a tree list below this.

    VM connections available home page

    The remote desktop gateway software used in the Safe Haven services in the TRE is the Apache Guacamole web application. Users new to this application can find the user manual here. It is recommended that users read this short guide, but note that the data sharing features such as copy and paste, connection sharing, and file transfers are disabled on all connections in the TRE Safe Havens.

    A remote desktop or SSH connection is used to access data provided for a specific research project. A researcher working on multiple projects within a Safe Haven can only log in to one project at a time. Some connections allow the user to log in to any project, while others only allow login to one specific project. This depends on the project IG restrictions specified by the Safe Haven and project controllers.

    Project desktop accounts are managed by the Safe Haven IG controller and research project co-ordination teams by submitting and confirming user account changes through the dedicated service help desk via email.

    "},{"location":"safe-haven-services/using-the-hpc-cluster/","title":"Using the TRE HPC Cluster","text":""},{"location":"safe-haven-services/using-the-hpc-cluster/#introduction","title":"Introduction","text":"

    The TRE HPC system, also called the SuperDome Flex, is a single node, large memory HPC system. It is provided for compute and data intensive workloads that require more CPU, memory, and better IO performance than can be provided by the project VMs, which have the performance equivalent of small rack mount servers.

    "},{"location":"safe-haven-services/using-the-hpc-cluster/#specifications","title":"Specifications","text":"

    The system is an HPE SuperDome Flex configured with 1152 hyper-threaded cores (576 physical cores) and 18TB of memory, of which 17TB is available to users. User home and project data directories are on network mounted storage pods running the BeeGFS parallel filesystem. This storage is built in blocks of 768TB per pod. Multiple pods are available in the TRE for use by the HPC system and the total storage available will vary depending on the project configuration.

    The HPC system runs Red Hat Enterprise Linux, which is not the same flavour of Linux as the Ubuntu distribution running on the desktop VMs. However, most jobs in the TRE run Python and R, and there are few issues moving between the two versions of Linux. Use of virtual environments is strongly encouraged to ensure there are no differences between the desktop and HPC runtimes.

    "},{"location":"safe-haven-services/using-the-hpc-cluster/#software-management","title":"Software Management","text":"

    All system level software installed and configured on the TRE HPC system is managed by the TRE admin team. Software installation requests may be made by the Safe Haven IG controllers, research project co-ordinators, and researchers by submitting change requests through the dedicated service help desk via email.

    Minor software changes will be made as soon as admin effort can be allocated. Major changes are likely to be scheduled for the TRE monthly maintenance session on the first Thursday of each month.

    "},{"location":"safe-haven-services/using-the-hpc-cluster/#hpc-login","title":"HPC Login","text":"

    Login to the HPC system is from the project VM using SSH and is not direct from the VDI. The HPC cluster accounts are the same accounts used on the project VMs, with the same username and password. All project data access on the HPC system is private to the project accounts as it is on the VMs, but it is important to understand that the TRE HPC cluster is shared by projects in other TRE Safe Havens.

    To login to the HPC cluster from the project VMs use ssh shs-sdf01 from an xterm. If you wish to avoid entry of the account password for every SSH session or remote command execution you can use SSH key authentication by following the SSH key configuration instructions here. SSH key passphrases are not strictly enforced within the Safe Haven but are strongly encouraged.

    "},{"location":"safe-haven-services/using-the-hpc-cluster/#running-jobs","title":"Running Jobs","text":"

    To use the HPC system fully and fairly, all jobs must be run using the SLURM job manager. More information about SLURM, running batch jobs and running interactive jobs can be found here. Please read this carefully before using the cluster if you have not used SLURM before. The SLURM site also has a set of useful tutorials on HPC clusters and job scheduling.

    All analysis and processing jobs must be run via SLURM. SLURM manages access to all the cores on the system beyond the first 32. If SLURM is not used and programs are run directly from the command line, then there are only 32 cores available, and these are shared by the other users. Normal code development, short test runs, and debugging can be done from the command line without using SLURM.

    There is only one node

    The HPC system is a single node with all cores sharing all the available memory. SLURM jobs should always specify '#SBATCH --nodes=1' to run correctly.

    SLURM email alerts for job status change events are not supported in the TRE.
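The two rules above (always request one node; email alerts are unsupported) can be checked before submission. The following is a minimal sketch only: the function name check_sdf_job_script is hypothetical, and it assumes the directives are written as '#SBATCH' lines at the start of a line, as in the examples in this guide.

```shell
#!/bin/bash
# Hypothetical pre-submission check for a batch script on the single-node system.
# Returns 0 if the script looks fine, 1 otherwise, and prints what to fix.
check_sdf_job_script() {
    local script="$1" status=0
    # Rule 1: the script should request exactly one node.
    if ! grep -Eq '^#SBATCH[[:space:]]+--nodes[=[:space:]]+1([[:space:]]|$)' "$script"; then
        echo "missing '#SBATCH --nodes=1'"
        status=1
    fi
    # Rule 2: email alerts are not supported, so mail directives have no effect.
    if grep -Eq '^#SBATCH[[:space:]]+--mail-(type|user)' "$script"; then
        echo "remove '--mail-type' / '--mail-user' directives"
        status=1
    fi
    return "$status"
}

# Usage: check_sdf_job_script my_job.sh && sbatch my_job.sh
```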

    "},{"location":"safe-haven-services/using-the-hpc-cluster/#resource-limits","title":"Resource Limits","text":"

    There are no resource constraints imposed on the default SLURM partition at present. There are user limits (see the output of ulimit -a). If a project has a requirement for more than 200 cores, more than 4TB of memory, or an elapsed runtime of more than 96 hours, a resource reservation request should be made by the researchers through email to the service help desk.
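As an illustration of the thresholds above, the sketch below reports whether a planned job would need a resource reservation request. The function name needs_reservation is hypothetical, and treating 4TB as 4096 GB is an assumption made for this example.

```shell
#!/bin/bash
# Hypothetical check of a planned job against the reservation thresholds:
# more than 200 cores, more than 4TB of memory, or more than 96 hours.
needs_reservation() {
    local cores="$1" mem_gb="$2" hours="$3"
    # Assumption: 4TB is taken as 4096 GB here.
    if [ "$cores" -gt 200 ] || [ "$mem_gb" -gt 4096 ] || [ "$hours" -gt 96 ]; then
        echo "reservation required"
    else
        echo "within default limits"
    fi
}

needs_reservation 64 512 2     # a modest job
needs_reservation 256 512 2    # exceeds the core threshold
```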

    There are no storage quotas enforced in the HPC cluster storage at present. The project storage requirements are negotiated, and space allocated before the project accounts are released. Storage use is monitored, and guidance will be issued before quotas are imposed on projects.

    The HPC system is managed in the spirit of utilising it as fully as possible and as fairly as possible. This approach works best when researchers are aware of their project workload demands and cooperate rather than compete for cluster resources.

    "},{"location":"safe-haven-services/using-the-hpc-cluster/#python-jobs","title":"Python Jobs","text":"

    A basic script to run a Python job in a virtual environment is shown below.

    #!/bin/bash\n#\n#SBATCH --export=ALL                  # Job inherits all env vars\n#SBATCH --job-name=my_job_name        # Job name\n#SBATCH --mem=512G                    # Job memory request\n#SBATCH --output=job-%j.out           # Standard output file\n#SBATCH --error=job-%j.err            # Standard error file\n#SBATCH --nodes=1                     # Run on a single node\n#SBATCH --ntasks=1                    # Run a single task\n#SBATCH --time=02:00:00               # Time limit hrs:min:sec\n#SBATCH --partition standard          # Run on partition (queue)\n\npwd\nhostname\ndate \"+DATE: %d/%m/%Y TIME: %H:%M:%S\"\necho \"Running job on a single CPU core\"\n\n# Activate the job\u2019s virtual environment\nsource ${HOME}/my_venv/bin/activate\n\n# Run the job code\npython3 ${HOME}/my_job.py\n\ndate \"+DATE: %d/%m/%Y TIME: %H:%M:%S\"\n
    "},{"location":"safe-haven-services/using-the-hpc-cluster/#mpi-jobs","title":"MPI Jobs","text":"

    An example script for a multi-process MPI job is shown below. The system currently supports MPICH MPI.

    #!/bin/bash\n#\n#SBATCH --export=ALL\n#SBATCH --job-name=mpi_test\n#SBATCH --output=job-%j.out\n#SBATCH --error=job-%j.err\n#SBATCH --nodes=1\n#SBATCH --ntasks-per-node=5\n#SBATCH --cpus-per-task=1\n#SBATCH --time=05:00\n#SBATCH --partition standard\n\necho \"Submitted MPICH job\"\necho \"Running on host ${HOSTNAME}\"\necho \"Using ${SLURM_NTASKS_PER_NODE} tasks per node\"\necho \"Using ${SLURM_CPUS_PER_TASK} cpus per task\"\nlet mpi_threads=${SLURM_NTASKS_PER_NODE}*${SLURM_CPUS_PER_TASK}\necho \"Using ${mpi_threads} MPI threads\"\n\n# load MPICH module\nmodule purge\nmodule load mpi/mpich-x86_64\n\n# run mpi program\nmpirun ${HOME}/test_mpi\n
    "},{"location":"safe-haven-services/using-the-hpc-cluster/#managing-files-and-data","title":"Managing Files and Data","text":"

    There are three file systems to manage in the VM and HPC environment.

    1. The desktop VM /home file system. This can only be used when you login to the VM remote desktop. This file system is local to the VM and is not backed up.
    2. The HPC system /home file system. This can only be used when you login to the HPC system using SSH from the desktop VM. This file system is local to the HPC cluster and is not backed up.
    3. The project file and data space in the /safe_data file system. This file system can only be used when you login to a VM remote desktop session. This file system is backed up.

    The /safe_data file system with the project data cannot be used by the HPC system. The /safe_data file system has restricted access and a relatively slow IO performance compared to the parallel BeeGFS file system storage on the HPC system.

    The process to use the TRE HPC service is to copy and synchronise the project code and data files on the /safe_data file system with the HPC /home file system before and after login sessions and job runs on the HPC cluster. Assuming all the code and data required for the job is in a directory 'current_wip' on the project VM, the workflow is as follows:

    1. Copy project code and data to the HPC cluster (from the desktop VM) rsync -avPz -e ssh /safe_data/my_project/current_wip shs-sdf01:
    2. Run jobs/tests/analysis ssh shs-sdf01, cd current_wip, sbatch/srun my_job
    3. Copy any changed project code and data back to /safe_data (from the desktop VM) rsync -avPz -e ssh shs-sdf01:current_wip /safe_data/my_project
    4. Optionally delete the code and data from the HPC cluster working directory.
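The workflow above can be collected into a small script. This is a sketch under stated assumptions: the run, sync_to_hpc, submit_job and sync_from_hpc names, the DRY_RUN switch and the my_job.sh file name are all hypothetical, and the paths are the placeholders used in the steps above. With DRY_RUN=1 (the default here) the commands are only printed, so the script can be inspected safely before use.

```shell
#!/bin/bash
# Sketch of the copy / run / copy-back workflow, run from the desktop VM.
# DRY_RUN=1 (default in this sketch) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"
PROJECT_DIR="/safe_data/my_project"   # placeholder project directory
WORK_DIR="current_wip"                # working directory to synchronise
HPC_HOST="shs-sdf01"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Step 1: copy project code and data to the HPC home directory.
sync_to_hpc()   { run rsync -avPz -e ssh "${PROJECT_DIR}/${WORK_DIR}" "${HPC_HOST}:"; }
# Step 2: submit the batch job from the working directory (my_job.sh is a placeholder).
submit_job()    { run ssh "${HPC_HOST}" "cd ${WORK_DIR} && sbatch my_job.sh"; }
# Step 3: copy changed files back; only do this once the job has finished.
sync_from_hpc() { run rsync -avPz -e ssh "${HPC_HOST}:${WORK_DIR}" "${PROJECT_DIR}"; }

sync_to_hpc
submit_job
sync_from_hpc
```

Because sbatch returns as soon as the job is queued, a real run would wait for the job to complete (for example by checking squeue) before calling sync_from_hpc.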
    "},{"location":"safe-haven-services/virtual-desktop-connections/","title":"Virtual Machine Connections","text":"

    Sessions on project VMs may be either remote desktop (RDP) logins or SSH terminal logins. Most users will prefer to use the remote desktop connections, but the SSH terminal connection is useful when remote network performance is poor and it must be used for account password changes.

    "},{"location":"safe-haven-services/virtual-desktop-connections/#first-time-login-and-account-password-changes","title":"First Time Login and Account Password Changes","text":"

    Account Password Changes

    Note that first time account login cannot be through RDP as a password change is required. Password reset logins must be SSH terminal sessions as password changes can only be made through SSH connections.

    "},{"location":"safe-haven-services/virtual-desktop-connections/#connecting-to-a-remote-ssh-session","title":"Connecting to a Remote SSH Session","text":"

    When a VM SSH connection is selected the browser screen becomes a text terminal and the user is prompted to \"Login as: \" with a project account name, and then prompted for the account password. This connection type is equivalent to a standard xterm SSH session.

    "},{"location":"safe-haven-services/virtual-desktop-connections/#connecting-to-a-remote-desktop-session","title":"Connecting to a Remote Desktop Session","text":"

    Remote desktop connections work best by first placing the browser in Full Screen mode and leaving it in this mode for the entire duration of the Safe Haven session.

    When a VM RDP connection is selected the browser screen becomes a remote desktop presenting the login screen shown below.

    VM virtual desktop connection user account login screen

    Once the project account credentials have been accepted, a remote desktop similar to the one shown below is presented. The default VM environment in the TRE is Ubuntu 22.04 with the Xfce desktop.

    VM virtual desktop

    "},{"location":"safe-haven-services/superdome-flex-tutorial/L1_Accessing_the_SDF_Inside_the_EPCC_TRE/","title":"Accessing the Superdome Flex inside the EPCC Trusted Research Environment","text":""},{"location":"safe-haven-services/superdome-flex-tutorial/L1_Accessing_the_SDF_Inside_the_EPCC_TRE/#what-is-the-superdome-flex","title":"What is the Superdome Flex?","text":"

    The Superdome Flex (SDF) is a high-performance computing cluster manufactured by Hewlett Packard Enterprise. It has been designed to handle multi-core, high-memory tasks in environments where security is paramount. The hardware specifications of the SDF within the Trusted Research Environment (TRE) are as follows:

    The software specifications of the SDF are:

    "},{"location":"safe-haven-services/superdome-flex-tutorial/L1_Accessing_the_SDF_Inside_the_EPCC_TRE/#key-point","title":"Key Point","text":"

    The SDF is within the TRE. Therefore, the same restrictions apply, i.e. the SDF is isolated from the internet (no downloading code from public GitHub repos) and copying/recording/extracting anything on the SDF outside of the TRE is strictly prohibited unless through approved processes.

    "},{"location":"safe-haven-services/superdome-flex-tutorial/L1_Accessing_the_SDF_Inside_the_EPCC_TRE/#accessing-the-sdf","title":"Accessing the SDF","text":"

    Users can only access the SDF by ssh-ing into it via their VM desktop.

    "},{"location":"safe-haven-services/superdome-flex-tutorial/L1_Accessing_the_SDF_Inside_the_EPCC_TRE/#hello-world","title":"Hello world","text":"
    **** On the VM desktop terminal ****\n\nssh shs-sdf01\n<Enter VM password>\n\necho \"Hello World\"\n\nexit\n
    "},{"location":"safe-haven-services/superdome-flex-tutorial/L1_Accessing_the_SDF_Inside_the_EPCC_TRE/#sdf-vs-vm-file-systems","title":"SDF vs VM file systems","text":"

    The SDF file system is separate from the VM file system, which is again separate from the project data space. Files need to be transferred between the three systems for any analysis to be completed within the SDF.

    "},{"location":"safe-haven-services/superdome-flex-tutorial/L1_Accessing_the_SDF_Inside_the_EPCC_TRE/#example-showing-separate-sdf-and-vm-file-systems","title":"Example showing separate SDF and VM file systems","text":"
    **** On the VM desktop terminal ****\n\ncd ~\ntouch test.txt\nls\n\nssh shs-sdf01\n<Enter VM password>\n\nls # test.txt is not here\nexit\n\nscp test.txt shs-sdf01:/home/<USERNAME>/\n\nssh shs-sdf01\n<Enter VM password>\n\nls # test.txt is here\n
    "},{"location":"safe-haven-services/superdome-flex-tutorial/L1_Accessing_the_SDF_Inside_the_EPCC_TRE/#example-copying-data-between-project-data-space-and-sdf","title":"Example copying data between project data space and SDF","text":"

    Transferring and synchronising data sets between the project data space and the SDF is easier with the rsync command than with manually checking and copying files and folders using scp. rsync only transfers files that differ between the source and the destination; see its manual for more details.

    **** On the VM desktop terminal ****\n\nman rsync # check instructions for using rsync\n\nrsync -avPz -e ssh /safe_data/my_project/ shs-sdf01:/home/<USERNAME>/my_project/ # sync project folder and SDF home folder\n\nssh shs-sdf01\n<Enter VM password>\n\n*** Conduct analysis on SDF ***\n\nexit\n\nrsync -avPz -e ssh shs-sdf01:/home/<USERNAME>/my_project/ /safe_data/my_project/ # re-synchronise: copy changed files back from the SDF home folder to the project folder\n\n*** Optionally remove the project folder on SDF ***\n
    "},{"location":"safe-haven-services/superdome-flex-tutorial/L2_running_R_Python_analysis_scripts/","title":"Running R/Python Scripts","text":"

    Running analysis scripts on the SDF is slightly different to running scripts on the Desktop VMs. The Linux distribution differs between the two with the SDF using Red Hat Enterprise Linux (RHEL) and the Desktop VMs using Ubuntu. Therefore, it is highly advisable to use virtual environments (e.g. conda environments) to complete any analysis and aid the transition between the two distributions. Conda should run out of the box on the Desktop VMs, but some configuration is required on the SDF.

    "},{"location":"safe-haven-services/superdome-flex-tutorial/L2_running_R_Python_analysis_scripts/#setting-up-conda-environments-on-you-first-connection-to-the-sdf","title":"Setting up conda environments on you first connection to the SDF","text":"
    *** SDF Terminal ***\n\nconda activate base # Test conda environment\n\n# Conda command will not be found. There is no need to install!\n\neval \"$(/opt/anaconda3/bin/conda shell.bash hook)\" # Tells your terminal where conda is\n\nconda init # changes your .bashrc file so conda is automatically available in the future\n\nconda config --set auto_activate_base false # stop conda base from being activated on startup\n\npython # note python version\n\nexit()\n

    The base conda environment is now available, but note that the Python interpreter and gcc compiler are not the latest versions (Python 3.9.7 and gcc 7.5.0).

    "},{"location":"safe-haven-services/superdome-flex-tutorial/L2_running_R_Python_analysis_scripts/#getting-an-up-to-date-python-version","title":"Getting an up-to-date python version","text":"

    In order to get an up-to-date python version we first need to use an updated gcc version. Fortunately, conda has an updated gcc toolset that can be installed.

    *** SDF Terminal ***\n\nconda activate base # If conda isn't already active\n\nconda create -n python-v3.11 gcc_linux-64=11.2.0 python=3.11.3\n\nconda activate python-v3.11\n\npython\n\nexit()\n
    "},{"location":"safe-haven-services/superdome-flex-tutorial/L2_running_R_Python_analysis_scripts/#running-r-scripts-on-the-sdf","title":"Running R scripts on the SDF","text":"

    The default version of R available on the SDF is v4.1.2. Alternative R versions can be installed using conda similar to the python conda environment above.

    conda create -n r-v4.3 gcc_linux-64=11.2.0 r-base=4.3\n\nconda activate r-v4.3\n\nR\n\nq()\n
    "},{"location":"safe-haven-services/superdome-flex-tutorial/L2_running_R_Python_analysis_scripts/#final-points","title":"Final points","text":""},{"location":"safe-haven-services/superdome-flex-tutorial/L3_submitting_scripts_to_slurm/","title":"Submitting Scripts to Slurm","text":""},{"location":"safe-haven-services/superdome-flex-tutorial/L3_submitting_scripts_to_slurm/#what-is-slurm","title":"What is Slurm?","text":"

    Slurm is a workload manager that schedules jobs submitted to a shared resource. Slurm is a well-developed tool that can manage large computing clusters, such as ARCHER2, with thousands of users each with different priorities and allocated computing hours. Inside the TRE, Slurm is used to help ensure all users of the SDF get equitable access. Therefore, users who are submitting jobs with high resource requirements (>80 cores, >1TB of memory) may have to wait longer for resource allocation to enable users with lower resource demands to continue their work.

    Slurm is currently set up so all users have equal priority and there is no limit to the total number of CPU hours allocated to a user per month. However, there are limits to the maximum amount of resources that can be allocated to an individual job. Jobs that require more than 200 cores, more than 4TB of memory, or an elapsed runtime of more than 96 hours will be rejected. If users need to submit jobs with large resource demand, they need to submit a resource reservation request by emailing their project's service desk.

    "},{"location":"safe-haven-services/superdome-flex-tutorial/L3_submitting_scripts_to_slurm/#why-do-you-need-to-use-slurm","title":"Why do you need to use Slurm?","text":"

    The SDF is a resource shared across all projects within the TRE and all users should have equal opportunity to use the SDF to complete resource-intense tasks appropriate to their projects. Users of the SDF are required to consider the needs of the wider community by:

    Users can develop code, complete test runs, and debug from the SDF command line without using Slurm. However, only 32 of the 512 cores are accessible without submitting a job request to Slurm. These cores are accessible to all users simultaneously.

    "},{"location":"safe-haven-services/superdome-flex-tutorial/L3_submitting_scripts_to_slurm/#slurm-basics","title":"Slurm basics","text":"

    Slurm revolves around four main entities: nodes, partitions, jobs and job steps. Nodes and partitions are relevant for more complex distributed computing clusters so Slurm can allocate appropriate resources to jobs across multiple pieces of hardware. Jobs are requests for resources and job steps are what need to be completed once the resources have been allocated (completed in sequence or parallel). Job steps can be further broken down into tasks.

    There are four key commands for Slurm users:

    More details on these commands (and several not mentioned here) can be found on the Slurm website.

    "},{"location":"safe-haven-services/superdome-flex-tutorial/L3_submitting_scripts_to_slurm/#submitting-a-simple-job","title":"Submitting a simple job","text":"
    *** SDF Terminal ***\n\nsqueue -u $USER # Check if there are jobs already queued or running for you\n\nsrun --job-name=my_first_slurm_job --nodes 1 --ntasks 10 --cpus-per-task 2 echo 'Hello World'\n\nsqueue -u $USER --state=CD # List all completed jobs\n

    In this instance, the srun command completes two steps: job submission and job step execution. First, it submits a job request to be allocated 20 CPUs (2 CPUs for each of the 10 tasks). Once the resources are available, it executes the job step consisting of 10 tasks, each running the echo 'Hello World' command.

    srun accepts a wide variety of options to specify the resources required to complete its job step. Within the SDF, you must always request 1 node (as there is only one node) and never use the --exclusive option (as no one will have exclusive access to this shared resource). Notice that running srun blocks your terminal from accepting any more commands and that the output from each task in the job step, i.e. Hello World in the above example, is printed to your terminal. We will compare this to running an sbatch command.

    "},{"location":"safe-haven-services/superdome-flex-tutorial/L3_submitting_scripts_to_slurm/#submitting-a-batch-job","title":"Submitting a batch job","text":"

    Batch jobs are incredibly useful because they run in the background without blocking your terminal. Batch jobs also output the results to a log file rather than straight to your terminal. This allows you to check a job was completed successfully at a later time so you can move on to other things whilst waiting for a job to complete.

    A batch job can be submitted to Slurm by passing a job script to the sbatch command. The first few lines of a job script outline the resources to be requested as part of the job. The remainder of a job script consists of one or more srun commands outlining the job steps that need to be completed (in sequence or parallel) once the resources are available. There are numerous options for defining the resource requirements of a job including:

    More information on the various options is in the sbatch documentation.

    "},{"location":"safe-haven-services/superdome-flex-tutorial/L3_submitting_scripts_to_slurm/#example-job-script","title":"Example Job Script","text":"
    #!/usr/bin/env bash\n#SBATCH -J HelloWorld\n#SBATCH --nodes=1\n#SBATCH --ntasks-per-node=10\n#SBATCH --cpus-per-task=2\n#SBATCH --output=example_job_script.log\n\n# Run echo tasks in sequence\n\nsrun --ntasks 5 --cpus-per-task 2 echo \"Series Task A. Time: \" $(date +\"%H:%M:%S\")\n\nsrun --ntasks 5 --cpus-per-task 2 echo \"Series Task B. Time: \" $(date +\"%H:%M:%S\")\n\n# Run echo tasks in parallel with the ampersand character\n\nsrun --exclusive --ntasks 5 --cpus-per-task 2 echo \"Parallel Task A. Time: \" $(date +\"%H:%M:%S\") &\n\nsrun --exclusive --ntasks 5 --cpus-per-task 2 echo \"Parallel Task B. Time: \" $(date +\"%H:%M:%S\")\n\n# Wait for the backgrounded job step to finish before the script exits\nwait\n
    "},{"location":"safe-haven-services/superdome-flex-tutorial/L3_submitting_scripts_to_slurm/#example-job-script-submission","title":"Example job script submission","text":"
    *** SDF Terminal ***\n\nnano example_job_script.sh\n\n*** Copy example job script above ***\n\nsbatch example_job_script.sh\n\nsqueue -u $USER -i 5 # Refresh the queue listing every 5 seconds (Ctrl+C to stop)\n\n*** Wait for the batch job to be completed ***\n\ncat example_job_script.log # The series tasks should be grouped together and the parallel tasks interspersed.\n

    The example batch job is intended to show two things: 1) the usefulness of the sbatch command and 2) the versatility of a job script. As the sbatch command allows you to submit scripts and check their outcome at your own discretion, it is the most common way of interacting with Slurm. Meanwhile, the job script allows you to specify one global resource request and break it up into multiple job steps with different resource demands that can be completed in parallel or in sequence.
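
The relationship between the global resource request and the job steps in the example job script can be worked through as simple arithmetic. This is a sketch only, assuming the values in the example script (10 tasks with 2 CPUs each for the job, 5 tasks with 2 CPUs each per step); the helper name cpus is hypothetical.

```python
def cpus(ntasks, cpus_per_task):
    # Total CPUs needed is the number of tasks times the CPUs given to each task.
    return ntasks * cpus_per_task

# Global request from the #SBATCH lines of the example job script.
job_cpus = cpus(ntasks=10, cpus_per_task=2)

# Resources used by each individual srun job step.
step_cpus = cpus(ntasks=5, cpus_per_task=2)

# How many such steps fit inside the job allocation at the same time.
max_parallel_steps = job_cpus // step_cpus

print(job_cpus, step_cpus, max_parallel_steps)
```

Each job step uses half of the allocation, which is why two steps can run side by side in the parallel part of the script.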

    "},{"location":"safe-haven-services/superdome-flex-tutorial/L3_submitting_scripts_to_slurm/#submitting-pythonr-code-to-slurm","title":"Submitting python/R code to Slurm","text":"

    Although submitting job steps containing python/R analysis scripts can be done with srun directly, as below, it is more common to submit bash scripts that call the analysis scripts after setting up the environment (i.e. after calling conda activate).

    **** Python code job submission ****\n\nsrun --job-name=my_first_python_job --nodes 1 --ntasks 10 --cpus-per-task 2 --mem 10G python3 example_script.py\n\n**** R code job submission ****\n\nsrun --job-name=my_first_r_job --nodes 1 --ntasks 10 --cpus-per-task 2 --mem 10G Rscript example_script.R\n
    "},{"location":"safe-haven-services/superdome-flex-tutorial/L3_submitting_scripts_to_slurm/#signposting","title":"Signposting","text":"

    Useful websites for learning more about Slurm:

    "},{"location":"safe-haven-services/superdome-flex-tutorial/L4_parallelised_python_analysis/","title":"Parallelised Python analysis with Dask","text":"

    This lesson is adapted from a workshop introducing users to running python scripts on ARCHER2 as developed by Adrian Jackson.

    "},{"location":"safe-haven-services/superdome-flex-tutorial/L4_parallelised_python_analysis/#introduction","title":"Introduction","text":"

    The standard Python interpreter has limited support for thread-based parallelism. It contains a Global Interpreter Lock (GIL), which means the interpreter only allows one thread to execute Python code at a time. The advantage of the GIL is that C libraries can be easily integrated into Python scripts without checking if they are thread-safe. However, this means that most common python modules cannot be easily parallelised. Fortunately, there are now several re-implementations of common python modules that work around the GIL and are therefore parallelisable. Dask is a python module that contains a parallelised version of the pandas data frame as well as a general format for parallelising any python code.

    "},{"location":"safe-haven-services/superdome-flex-tutorial/L4_parallelised_python_analysis/#dask","title":"Dask","text":"

    Dask enables thread-safe parallelised python execution by creating task graphs (a graph of the dependencies of the inputs and outputs of each function) and then deducing which ones can be run separately. This lesson introduces some general concepts required for programming using Dask. There are also some exercises with example answers to help you write your first parallelised python scripts.
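
The idea of a task graph can be illustrated without Dask. The sketch below uses the graphlib module from the Python standard library (Python 3.9 or later) rather than Dask itself, so it shows the concept and not Dask's own scheduler. The task names inc, double and add are illustrative placeholders.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks whose outputs it needs as inputs.
graph = {
    'inc': set(),
    'double': set(),
    'add': {'inc', 'double'},
}

sorter = TopologicalSorter(graph)
sorter.prepare()

# Collect the tasks in batches; tasks in the same batch do not depend
# on each other, so a scheduler would be free to run them in parallel.
batches = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())
    batches.append(ready)
    sorter.done(*ready)

print(batches)
```

Here inc and double land in the first batch because neither needs the other, while add has to wait for both.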

    "},{"location":"safe-haven-services/superdome-flex-tutorial/L4_parallelised_python_analysis/#arrays-data-frames-and-bags","title":"Arrays, data frames and bags","text":"

    Dask contains three data objects to enable parallelised analysis of large data sets in a way familiar to most python programmers. If the same operations are being applied to a large data set then Dask can split up the data set and apply the operations in parallel. The three data objects that Dask can easily split up are:

    "},{"location":"safe-haven-services/superdome-flex-tutorial/L4_parallelised_python_analysis/#example-dask-array","title":"Example Dask array","text":"

    You may need to install dask or create a new conda environment that includes it.

    conda create -n dask-env gcc_linux-64=11.2.0 python=3.11.3 dask\n\nconda activate dask-env\n

    Try running the following Python using dask:

    import dask.array as da\n\nx = da.random.random((10000, 10000), chunks=(1000, 1000))\n\nprint(x)\n\nprint(x.compute())\n\nprint(x.sum())\n\nprint(x.sum().compute())\n

    This should demonstrate that dask makes simple parallelism straightforward to implement, and that it is lazy: it does not compute anything until you force it to with the .compute() function.

    You can also try out dask DataFrames, using the following code:

    import dask.dataframe as dd\n\ndf = dd.read_csv('surveys.csv')\n\ndf.head()\ndf.tail()\n\ndf.weight.max().compute()\n

    You can try using different blocksizes when reading in the csv file, and then undertaking an operation on the data, as follows. Be aware that making your block size too small is likely to cause poor performance (the blocksize sets the number of bytes read in at each operation).

    df = dd.read_csv('surveys.csv', blocksize=\"10000\")\ndf.weight.max().compute()\n

    You can also experiment with Dask Bags to see how that functionality works:

    import dask.bag as db\nfrom operator import add\nb = db.from_sequence([1, 2, 3, 4, 5], npartitions=2)\nprint(b.compute())\n
    "},{"location":"safe-haven-services/superdome-flex-tutorial/L4_parallelised_python_analysis/#dask-delayed","title":"Dask Delayed","text":"

    Dask delayed lets you construct your own task graphs/parallelism from Python functions. You can find out more about dask delayed from the dask documentation. Try parallelising the code below using the dask.delayed function or the @delayed decorator; an example answer can be found here.

    def inc(x):\n    return x + 1\n\ndef double(x):\n    return x * 2\n\ndef add(x, y):\n    return x + y\n\ndata = [1, 2, 3, 4, 5]\n\noutput = []\nfor x in data:\n    a = inc(x)\n    b = double(x)\n    c = add(a, b)\n    output.append(c)\n\ntotal = sum(output)\n\nprint(total)\n
    "},{"location":"safe-haven-services/superdome-flex-tutorial/L4_parallelised_python_analysis/#mandelbrot-exercise","title":"Mandelbrot Exercise","text":"

    The code below calculates the members of a Mandelbrot set using Python functions:

    import sys\nimport time\nimport numpy as np\nimport matplotlib.pyplot as plt\n\ndef mandelbrot(h, w, maxit=20, r=2):\n    \"\"\"Returns an image of the Mandelbrot fractal of size (h,w).\"\"\"\n    start = time.time()\n\n    x = np.linspace(-2.5, 1.5, 4*h+1)\n\n    y = np.linspace(-1.5, 1.5, 3*w+1)\n\n    A, B = np.meshgrid(x, y)\n\n    C = A + B*1j\n\n    z = np.zeros_like(C)\n\n    divtime = maxit + np.zeros(z.shape, dtype=int)\n\n    for i in range(maxit):\n        z = z**2 + C\n        diverge = abs(z) > r # who is diverging\n        div_now = diverge & (divtime == maxit) # who is diverging now\n        divtime[div_now] = i # note when\n        z[diverge] = r # avoid diverging too much\n\n    end = time.time()\n\n    return divtime, end-start\n\nh = 2000\nw = 2000\n\nmandelbrot_space, time = mandelbrot(h, w)\n\nplt.imshow(mandelbrot_space)\n\nprint(time)\n

    Your task is to parallelise this code using Dask Array functionality. Using the base python code above, extend it with Dask Array for the main arrays in the computation. Remember you need to specify a chunk size with Dask Arrays, and you will also need to call compute at some point to force Dask to actually undertake the computation. Note, depending on where you run this you may not see any actual speed up of the computation. You need access to extra resources (compute cores) for the calculation to go faster. If in doubt, submit a python script of your solution to the SDF compute nodes to see if you see speed up there. If you are struggling with this parallelisation exercise, there is a solution available for you here.

    "},{"location":"safe-haven-services/superdome-flex-tutorial/L4_parallelised_python_analysis/#pi-exercise","title":"Pi Exercise","text":"

    The code below calculates Pi using a function that can split it up into chunks and calculate each chunk separately. Currently it uses a single chunk to produce the final value of Pi, but that can be changed by calling pi_chunk multiple times with different inputs. This is not necessarily the most efficient method for calculating Pi in serial, but it does enable parallelisation of the calculation of Pi using multiple copies of pi_chunk called simultaneously.

    import time\nimport sys\n\n# Calculate pi in chunks\n\n# n     - total number of steps to be undertaken across all chunks\n# lower - the lowest number of this chunk\n# upper - the upper limit of this chunk such that i < upper\n\ndef pi_chunk(n, lower, upper):\n    step = 1.0 / n\n    p = step * sum(4.0/(1.0 + ((i + 0.5) * (i + 0.5) * step * step)) for i in range(lower, upper))\n    return p\n\n# Number of slices\n\nnum_steps = 10000000\n\nprint(\"Calculating PI using:\\n \" + str(num_steps) + \" slices\")\n\nstart = time.time()\n\n# Calculate using a single chunk containing all steps (i = 0 to num_steps - 1)\n\np = pi_chunk(num_steps, 0, num_steps)\n\nstop = time.time()\n\nprint(\"Obtained value of Pi: \" + str(p))\n\nprint(\"Time taken: \" + str(stop - start) + \" seconds\")\n

    For this exercise, your task is to implement the above code on the SDF, and then parallelise it using Dask. There are a number of different ways you could parallelise this using Dask, but we suggest using the Futures map functionality to run the pi_chunk function on a range of different inputs. Futures map has the following definition:

    Client.map(func, *iterables[, key, workers, ...])\n

    Where func is the function you want to run, and then the subsequent arguments are inputs to that function. To utilise this for the Pi calculation, you will first need to set up and configure a Dask Client to use, and also create and populate lists or vectors of inputs to be passed to the pi_chunk function for each function run that Dask launches.
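
One way to build the lists of inputs is sketched below. To keep the example self-contained it uses the built-in map as a serial stand-in for Client.map, so it shows only the shape of the inputs, not any parallel speed-up. The helper chunk_bounds is a hypothetical name introduced for illustration, and pi_chunk is restated in a compact form.

```python
import math

def pi_chunk(n, lower, upper):
    # Midpoint-rule contribution of steps lower..upper-1 out of n steps in total.
    step = 1.0 / n
    return step * sum(4.0 / (1.0 + ((i + 0.5) * step) ** 2) for i in range(lower, upper))

def chunk_bounds(n, num_chunks):
    # Split range(0, n) into contiguous [lower, upper) pieces of near-equal size.
    size = -(-n // num_chunks)  # ceiling division
    lowers = list(range(0, n, size))
    uppers = [min(lower + size, n) for lower in lowers]
    return lowers, uppers

num_steps = 200000
lowers, uppers = chunk_bounds(num_steps, 8)
totals = [num_steps] * len(lowers)

# Built-in map stands in for Client.map: one pi_chunk call per set of inputs.
partials = list(map(pi_chunk, totals, lowers, uppers))
pi_estimate = sum(partials)

print(pi_estimate, abs(pi_estimate - math.pi))
```

With a Dask Client, the same three lists would be passed to Client.map in place of the built-in map, and the resulting futures collected (for example with Client.gather) before summing.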

    If you run Dask with processes then it is possible that you will get errors about forking processes, such as these:

        An attempt has been made to start a new process before the current process has finished its bootstrapping phase.\n    This probably means that you are not using fork to start your child processes and you have forgotten to use the proper idiom in the main module:\n

    In that case you need to encapsulate your code within a main function, using something like this:

    if __name__ == \"__main__\":\n

    If you are struggling with this exercise then there is a solution available for you here.

    "},{"location":"safe-haven-services/superdome-flex-tutorial/L4_parallelised_python_analysis/#signposting","title":"Signposting","text":""},{"location":"safe-haven-services/superdome-flex-tutorial/L5_parallelised_r_analysis/","title":"Parallelised R Analysis","text":"

    This lesson is adapted from a workshop introducing users to running R scripts on ARCHER2 as developed by Adrian Jackson.

    "},{"location":"safe-haven-services/superdome-flex-tutorial/L5_parallelised_r_analysis/#introduction","title":"Introduction","text":"

    In this exercise we are going to try different methods of parallelising R on the SDF. This will include single node parallelisation functionality (e.g. using threads or processes to use cores within a single node), and distributed memory functionality that enables the parallelisation of R programs across multiple nodes.

    "},{"location":"safe-haven-services/superdome-flex-tutorial/L5_parallelised_r_analysis/#example-parallelised-r-code","title":"Example parallelised R code","text":"

    You may need to activate an R conda environment.

    conda activate r-v4.3\n

    Try running the following R script using R on the SDF login node:

    n <- 8*2048\nA <- matrix( rnorm(n*n), ncol=n, nrow=n )\nB <- matrix( rnorm(n*n), ncol=n, nrow=n )\nC <- A %*% B\n

    You can run this as follows on the SDF (assuming you have saved the above code into a file named matrix.R):

    Rscript ./matrix.R\n

    You can check the resources used by R when running on the login node using this command:

    top -u $USER\n

    If you run the R script in the background using &, as follows, you can then monitor your run using the top command. You may notice when you run your program that at points R uses many more resources than a single core can provide, as demonstrated below:

        PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND\n    178357 adrianj 20 0 15.542 0.014t 13064 R 10862 2.773 9:01.66 R\n

    In the example above it can be seen that 10862% of a single core (the equivalent of more than 108 cores) is being used by R. This is an example of R using automatic parallelisation. You can experiment with controlling the automatic parallelisation using the OMP_NUM_THREADS variable to restrict the number of cores available to R. Try using the following values:

    export OMP_NUM_THREADS=8\n\nexport OMP_NUM_THREADS=4\n\nexport OMP_NUM_THREADS=2\n

    You may also notice that not all of the R script is parallelised. Only the actual matrix multiplication is undertaken in parallel; the initialisation/creation of the matrices is done in serial.

    "},{"location":"safe-haven-services/superdome-flex-tutorial/L5_parallelised_r_analysis/#parallelisation-with-datatables","title":"Parallelisation with data.tables","text":"

    We can also experiment with the implicit parallelism in other libraries, such as data.table. You will first need to install this library on the SDF. To do this you can simply run the following command from within R:

    install.packages(\"data.table\")\n

    Once you have installed data.table you can experiment with the following code:

    library(data.table)\nvenue_data <- data.table( ID = 1:50000000,\nCapacity = sample(100:1000, size = 50000000, replace = TRUE), Code = sample(LETTERS, 50000000, replace = TRUE),\nCountry = rep(c(\"England\",\"Scotland\",\"Wales\",\"NorthernIreland\"), 12500000))\nsystem.time(venue_data[, mean(Capacity), by = Country])\n

    This creates some random data in a large data table and then performs a calculation on it. Try running R with varying numbers of threads to see what impact that has on performance. Remember, you can vary the number of threads R uses by setting OMP_NUM_THREADS= before you run R. If you want to try easily varying the number of threads you can save the above code into a script and run it using Rscript, changing OMP_NUM_THREADS each time you run it, e.g.:

    export OMP_NUM_THREADS=1\n\nRscript ./data_table_test.R\n\nexport OMP_NUM_THREADS=2\n\nRscript ./data_table_test.R\n

    The elapsed time that is printed out when the calculation is run represents how long the script/program took to run. It\u2019s important to bear in mind that, as with the matrix multiplication exercise, not everything will be parallelised. Creating the data table is done in serial so does not benefit from the addition of more threads.

    "},{"location":"safe-haven-services/superdome-flex-tutorial/L5_parallelised_r_analysis/#loop-and-function-parallelism","title":"Loop and function parallelism","text":"

    R provides a number of different functions to run loops or functions in parallel. Among the most commonly used are the {X}apply functions:

    For example:

    res <- lapply(1:3, function(i) {\n    sqrt(i)*sqrt(i*2)\n    })\n

    The {X}apply functionality supports iteration over a dataset without requiring a loop to be constructed. However, the functions outlined above do not exploit parallelism, even though there is potential for parallelisation in many operations that utilise them.

    There are a number of mechanisms that can be used to implement parallelism using the {X}apply functions. One of the simplest is using the parallel library, and the mclapply function:

    library(parallel)\nres <- mclapply(1:3, function(i) {\n    sqrt(i)\n})\n

    Try experimenting with the above functions on large numbers of iterations, both with lapply and mclapply. Can you achieve better performance using the MC_CORES environment variable to specify how many parallel processes R uses to complete these calculations? The default on the SDF is 2 cores, but you can increase this in the same way we did for OMP_NUM_THREADS, e.g.:

    export MC_CORES=16\n

    Try different numbers of iterations of the functions (e.g. change 1:3 in the code to something much larger), and different numbers of parallel processes, e.g.:

    export MC_CORES=2\n\nexport MC_CORES=8\n\nexport MC_CORES=16\n

    If you have separate functions then the above approach will provide a simple method for parallelising using the resources within a single node. However, if your functionality is more loop-based, then you may not wish to have to package this up into separate functions to parallelise.

    "},{"location":"safe-haven-services/superdome-flex-tutorial/L5_parallelised_r_analysis/#parallelisation-with-foreach","title":"Parallelisation with foreach","text":"

    The foreach package can be used to parallelise loops as well as functions. Consider a loop of the following form:

    main_list <- c()\nfor (i in 1:3) {\n    main_list <- c(main_list, sqrt(i))\n}\n

    This can be converted to foreach functionality as follows:

    main_list <- c()\nlibrary(foreach)\nforeach(i=1:3) %do% {\n    main_list <- c(main_list, sqrt(i))\n}\n

    Whilst this approach does not significantly change the performance or functionality of the code, it does let us then exploit parallel functionality in foreach. The %do% can be replaced with a %dopar% which will execute the code in parallel.

    To test this out we\u2019re going to try an example using the randomForest library. We can now run the following code in R:

    library(foreach)\nlibrary(randomForest)\nx <- matrix(runif(50000), 1000)\ny <- gl(2, 500)\nrf <- foreach(ntree=rep(250, 4), .combine=combine) %do%\nrandomForest(x, y, ntree=ntree)\nprint(rf)\n

    Implement the above code and run with a system.time to see how long it takes. Once you have done this you can change the %do% to a %dopar% and re-run. Does this provide any performance benefits?

    "},{"location":"safe-haven-services/superdome-flex-tutorial/L5_parallelised_r_analysis/#parallelisation-with-doparallel","title":"Parallelisation with doParallel","text":"

    To exploit the parallelism with %dopar% we need to provide parallel execution functionality and configure it to use extra cores on the system. One method to do this is using the doParallel package.

    library(doParallel)\nregisterDoParallel(8)\n

    Does this now improve performance when running the randomForest example? Experiment with different numbers of workers by changing the number set in registerDoParallel(8) to see what kind of performance you can get. Note, you may also need to change the number of tasks in the foreach, i.e. the 4 in the rep(250, 4) part of the code, to enable more than 4 different sets to be run at once if using more than 4 workers. The number of parallel workers you can use depends on the hardware you have access to, the number of workers you specify when you set up your parallel backend, and the number of chunks of work you have to distribute with your foreach configuration.

    "},{"location":"safe-haven-services/superdome-flex-tutorial/L5_parallelised_r_analysis/#cluster-parallelism","title":"Cluster parallelism","text":"

    It is possible to use different parallel backends for foreach. The one we have used in the example above creates new worker processes to provide the parallelism, but you can also use larger numbers of workers through a parallel cluster, e.g.:

    my.cluster <- parallel::makeCluster(8)\nregisterDoParallel(cl = my.cluster)\n

    By default makeCluster creates a socket cluster, where each worker is a new independent process. This can enable running the same R program across a range of systems, as it works on Linux and Windows (and other clients). However, you can also fork the existing R process to create your new workers, e.g.:

    cl <-makeCluster(5, type=\"FORK\")\n

    This saves you from having to create the variables or objects that were set up in the R program/script prior to the creation of the cluster, as they are automatically copied to the workers when using this forking mode. However, it is limited to Linux-style systems and cannot scale beyond a single node.

    Once you have finished using a parallel cluster you should shut it down to free up computational resources, using stopCluster, e.g.:

    stopCluster(cl)\n

    When using clusters without the forking approach, you need to distribute objects and variables from the main process to the workers using the clusterExport function, e.g.:

    library(parallel)\nvariableA <- 10\nvariableB <- 20\nmySum <- function(x) variableA + variableB + x\ncl <- makeCluster(4)\nres <- try(parSapply(cl=cl, 1:40, mySum))\n

    The program above will fail because variableA and variableB are not present on the cluster workers. Try the above on the SDF and see what result you get.

    To fix this issue you can modify the program using clusterExport to send variableA and variableB to the workers, prior to running the parSapply e.g.:

    clusterExport(cl=cl, c('variableA', 'variableB'))\n
    "},{"location":"services/","title":"EIDF Services","text":""},{"location":"services/#computing-services","title":"Computing Services","text":"

    Data Science Virtual Desktops

    Managed File Transfer

    Managed JupyterHub

    Cerebras CS-2

    Ultra2

    Graphcore Bow Pod64

    "},{"location":"services/#data-management-services","title":"Data Management Services","text":"

    Data Catalogue

    "},{"location":"services/cs2/","title":"Cerebras CS-2","text":"

    Get Access

    Running codes

    "},{"location":"services/cs2/access/","title":"Cerebras CS-2","text":""},{"location":"services/cs2/access/#getting-access","title":"Getting Access","text":"

    Access to the Cerebras CS-2 system is currently by arrangement with EPCC. Please email eidf@epcc.ed.ac.uk with a short description of the work you would like to perform.

    "},{"location":"services/cs2/run/","title":"Cerebras CS-2","text":""},{"location":"services/cs2/run/#introduction","title":"Introduction","text":"

    The Cerebras CS-2 Wafer-scale cluster (WSC) uses the Ultra2 system as its host, which provides access to files, the SLURM batch system etc.

    "},{"location":"services/cs2/run/#connecting-to-the-cluster","title":"Connecting to the cluster","text":"

    To gain access to the CS-2 WSC you need to log in to the host system, Ultra2 (also called SDF-CS1). See the documentation for Ultra2.

    "},{"location":"services/cs2/run/#running-jobs","title":"Running Jobs","text":"

    All jobs must be run via SLURM to avoid inconveniencing other users of the system. An example job is shown below.

    "},{"location":"services/cs2/run/#slurm-example","title":"SLURM example","text":"

    This is based on the sample job from the Cerebras documentation: Execute your job.

    #!/bin/bash\n#SBATCH --job-name=Example        # Job name\n#SBATCH --cpus-per-task=2         # Request 2 cores\n#SBATCH --output=example_%j.log   # Standard output and error log\n#SBATCH --time=01:00:00           # Set time limit for this job to 1 hour\n#SBATCH --gres=cs:1               # Request CS-2 system\n\nsource venv_cerebras_pt/bin/activate\npython run.py \\\n       CSX \\\n       --params params.yaml \\\n       --num_csx=1 \\\n       --model_dir model_dir \\\n       --mode {train,eval,eval_all,train_and_eval} \\\n       --mount_dirs {paths to modelzoo and to data} \\\n       --python_paths {paths to modelzoo and other python code if used}\n

    See the 'Troubleshooting' section below for known issues.

    "},{"location":"services/cs2/run/#creating-an-environment","title":"Creating an environment","text":"

    To run a job on the cluster, you must create a Python virtual environment (venv) and install the dependencies. The Cerebras documentation contains generic instructions to do this (Cerebras setup environment docs). However, our host system is slightly different, so we recommend the following:

    "},{"location":"services/cs2/run/#create-the-venv","title":"Create the venv","text":"
    python3.8 -m venv venv_cerebras_pt\n
    "},{"location":"services/cs2/run/#install-the-dependencies","title":"Install the dependencies","text":"
    source venv_cerebras_pt/bin/activate\npip install --upgrade pip\npip install cerebras_pytorch==2.1.1\n
    "},{"location":"services/cs2/run/#validate-the-setup","title":"Validate the setup","text":"
    source venv_cerebras_pt/bin/activate\ncerebras_install_check\n
    "},{"location":"services/cs2/run/#modify-venv-files-to-remove-clock-sync-check-on-epcc-system","title":"Modify venv files to remove clock sync check on EPCC system","text":"

    Cerebras are aware of this issue and are working on a fix. In the meantime, follow the workaround below:

    "},{"location":"services/cs2/run/#from-within-your-python-venv-edit-the-libpython38site-packagescerebras_pytorchsaverstoragepy-file","title":"From within your python venv, edit the /lib/python3.8/site-packages/cerebras_pytorch/saver/storage.py file
    vi <venv>/lib/python3.8/site-packages/cerebras_pytorch/saver/storage.py\n
    ","text":""},{"location":"services/cs2/run/#navigate-to-line-530","title":"Navigate to line 530
    :530\n

    The section should look like this:

    if modified_time > self._last_modified:\n    raise RuntimeError(\n        f\"Attempting to materialize deferred tensor with key \"\n        f\"\\\"{self._key}\\\" from file {self._filepath}, but the file has \"\n        f\"since been modified. The loaded tensor value may be \"\n        f\"different from originally loaded tensor. Please refrain \"\n        f\"from modifying the file while the run is in progress.\"\n    )\n
    ","text":""},{"location":"services/cs2/run/#comment-out-the-section-if-modified_time-self_last_modified","title":"Comment out the section if modified_time > self._last_modified
     #if modified_time > self._last_modified:\n #    raise RuntimeError(\n #        f\"Attempting to materialize deferred tensor with key \"\n #       f\"\\\"{self._key}\\\" from file {self._filepath}, but the file has \"\n #        f\"since been modified. The loaded tensor value may be \"\n #        f\"different from originally loaded tensor. Please refrain \"\n #        f\"from modifying the file while the run is in progress.\"\n        #    )\n
    ","text":""},{"location":"services/cs2/run/#navigate-to-line-774","title":"Navigate to line 774
    :774\n

    The section should look like this:

       if stat.st_mtime_ns > self._stat.st_mtime_ns:\n        raise RuntimeError(\n            f\"Attempting to {msg} deferred tensor with key \"\n            f\"\\\"{self._key}\\\" from file {self._filepath}, but the file has \"\n            f\"since been modified. The loaded tensor value may be \"\n            f\"different from originally loaded tensor. Please refrain \"\n            f\"from modifying the file while the run is in progress.\"\n       )\n
    ","text":""},{"location":"services/cs2/run/#comment-out-the-section-if-statst_mtime_ns-self_statst_mtime_ns","title":"Comment out the section if stat.st_mtime_ns > self._stat.st_mtime_ns
       #if stat.st_mtime_ns > self._stat.st_mtime_ns:\n   #     raise RuntimeError(\n   #         f\"Attempting to {msg} deferred tensor with key \"\n   #         f\"\\\"{self._key}\\\" from file {self._filepath}, but the file has \"\n   #         f\"since been modified. The loaded tensor value may be \"\n   #         f\"different from originally loaded tensor. Please refrain \"\n   #         f\"from modifying the file while the run is in progress.\"\n   #    )\n
    ","text":""},{"location":"services/cs2/run/#save-the-file","title":"Save the file","text":""},{"location":"services/cs2/run/#run-jobs-as-per-existing-documentation","title":"Run jobs as per existing documentation","text":""},{"location":"services/cs2/run/#paths-pythonpath-and-mount_dirs","title":"Paths, PYTHONPATH and mount_dirs","text":"

    The parameters supplied to the run.py script are a common source of confusion. Cerebras provides a helpful page explaining these parameters and how they should be used: Python, paths and mount directories.

    "},{"location":"services/datacatalogue/","title":"EIDF Data Catalogue Information","text":"

    QuickStart

    Tutorial

    Documentation

    Metadata information

    "},{"location":"services/datacatalogue/docs/","title":"Service Documentation","text":""},{"location":"services/datacatalogue/docs/#metadata","title":"Metadata","text":"

    For more information on metadata, please read the following: Metadata

    "},{"location":"services/datacatalogue/docs/#online-support","title":"Online support","text":""},{"location":"services/datacatalogue/metadata/","title":"EIDF Metadata Information","text":""},{"location":"services/datacatalogue/metadata/#what-is-fair","title":"What is FAIR?","text":"

    FAIR stands for Findable, Accessible, Interoperable, and Reusable, and helps emphasise the best practices with publishing and sharing data (more details: FAIR Principles)

    "},{"location":"services/datacatalogue/metadata/#what-is-metadata","title":"What is metadata?","text":"

    Metadata is data about data, to help describe the dataset. Common metadata fields are things like the title of the dataset, who produced it, where it was generated (if relevant), when it was generated, and some key words describing it

    "},{"location":"services/datacatalogue/metadata/#what-is-ckan","title":"What is CKAN?","text":"

    CKAN is a metadata catalogue - i.e. it is a database for metadata rather than data. This will help with all aspects of FAIR:

    "},{"location":"services/datacatalogue/metadata/#what-metadata-will-we-need-to-provide","title":"What metadata will we need to provide?","text":""},{"location":"services/datacatalogue/metadata/#why-do-i-need-to-use-a-controlled-vocabulary","title":"Why do I need to use a controlled vocabulary?","text":"

    Using a standard vocabulary (such as the FAST Vocabulary) has many benefits:

    All of these advantages mean that we, as a project, don't need to think about this - there is no need to reinvent the wheel when other institutes (e.g. National Libraries) have already created one. You might recognise WorldCat - it is an organisation which manages a global catalogue of ~18000 libraries world-wide, so they are in a good position to generate a comprehensive vocabulary of academic topics!

    "},{"location":"services/datacatalogue/metadata/#what-about-licensing-what-does-cc-by-sa-40-mean","title":"What about licensing? (What does CC-BY-SA 4.0 mean?)","text":"

    The R in FAIR stands for reusable - more specifically it includes this subphrase: \"(Meta)data are released with a clear and accessible data usage license\". This means that we have to tell anyone else who uses the data what they're allowed to do with it - and, under the FAIR philosophy, more freedom is better.

    CC-BY-SA 4.0 allows anyone to remix, adapt, and build upon your work (even for commercial purposes), as long as they credit you and license their new creations under the identical terms. It also explicitly includes Sui Generis Database Rights, giving rights to the curation of a database even if you don't have the rights to the items in a database (e.g. a Spotify playlist, even though you don't own the rights to each track).

    Human readable summary: Creative Commons 4.0 Human Readable. Full legal code: Creative Commons 4.0 Legal Code.

    "},{"location":"services/datacatalogue/metadata/#im-stuck-how-do-i-get-help","title":"I'm stuck! How do I get help?","text":"

    Contact the EIDF Service Team via eidf@epcc.ed.ac.uk

    "},{"location":"services/datacatalogue/quickstart/","title":"Quickstart","text":""},{"location":"services/datacatalogue/quickstart/#accessing","title":"Accessing","text":""},{"location":"services/datacatalogue/quickstart/#first-task","title":"First Task","text":""},{"location":"services/datacatalogue/quickstart/#further-information","title":"Further information","text":""},{"location":"services/datacatalogue/tutorial/","title":"Tutorial","text":""},{"location":"services/datacatalogue/tutorial/#first-query","title":"First Query","text":""},{"location":"services/gpuservice/","title":"Overview","text":"

    The EIDF GPU Service provides access to a range of Nvidia GPUs, in both full GPU and MIG variants. The EIDF GPU Service is built upon Kubernetes.

    MIG (Multi-Instance GPU) allows a single GPU to be split into multiple isolated smaller GPUs. This means that multiple users can each access a portion of the GPU without being able to access what others are running on their portions.

    The EIDF GPU Service hosts 3G.20GB and 1G.5GB MIG variants which are approximately 1/2 and 1/7 of a full Nvidia A100 40 GB GPU respectively.
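The quoted fractions can be checked with a little arithmetic. The sketch below assumes the standard A100 MIG layout of 7 compute slices and a 40 GB card; the helper name `mig_share` is hypothetical and is not part of any EIDF or Nvidia tool.

```python
from fractions import Fraction

# Assumptions: a full A100 exposes 7 MIG compute slices, and the card has 40 GB.
A100_COMPUTE_SLICES = 7
A100_MEMORY_GB = 40


def mig_share(profile):
    """Return (compute share, memory share) for a profile name such as '3G.20GB'."""
    compute, memory = profile.lower().split(".")
    compute_slices = int(compute.rstrip("g"))   # '3g'   -> 3
    memory_gb = int(memory.rstrip("gb"))        # '20gb' -> 20
    return (
        Fraction(compute_slices, A100_COMPUTE_SLICES),
        Fraction(memory_gb, A100_MEMORY_GB),
    )
```

Under these assumptions a 3G.20GB instance holds 3/7 of the compute and 1/2 of the memory, while a 1G.5GB instance holds 1/7 of the compute and 1/8 of the memory, which is why the fractions above are described as approximate.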

    The service provides access to:

    The current full specification of the EIDF GPU Service as of 14 February 2024:

    Quotas

    This is the full configuration of the cluster.

    Each project will have access to a quota across this shared configuration.

    Changes to the default quota must be discussed and agreed with the EIDF Services team.

    NOTE

    If you request a GPU on the EIDF GPU Service you will be assigned one at random unless you specify a GPU type. Please see Getting started with Kubernetes to learn about specifying GPU resources.

    "},{"location":"services/gpuservice/#service-access","title":"Service Access","text":"

    Users should have an EIDF Account.

    Project Leads will be able to request access to the EIDF GPU Service for their project either during the project application process or through a service request to the EIDF helpdesk.

    Each project will be given a namespace to operate in and the ability to add a kubeconfig file to any of their Virtual Machines in their EIDF project - information on access to VMs is available here.

    All EIDF virtual machines can be set up to access the EIDF GPU Service. The Virtual Machine does not need to be GPU-enabled.

    EIDF GPU Service vs EIDF GPU-Enabled VMs

    The EIDF GPU Service is a container-based service which is accessed from EIDF Virtual Desktop VMs. This allows a project to access multiple GPUs of different types.

    An EIDF Virtual Desktop GPU-enabled VM is limited to a small number (1-2) of GPUs of a single type.

    Projects do not have to apply for a GPU-enabled VM to access the GPU Service.

    "},{"location":"services/gpuservice/#project-quotas","title":"Project Quotas","text":"

    A standard project namespace has the following initial quota (subject to ongoing review):

    Quota is a maximum on a Shared Resource

    A project quota is the maximum proportion of the service available for use by that project.

    During periods of high demand, Jobs will be queued awaiting resource availability on the Service.

    This means that a project has access to up to 12 GPUs but, due to demand, may only be able to access a smaller number at any given time.

    "},{"location":"services/gpuservice/#project-queues","title":"Project Queues","text":"

    EIDF GPU Service is introducing the Kueue system in February 2024. Its use is detailed on the Kueue page.

    "},{"location":"services/gpuservice/#additional-service-policy-information","title":"Additional Service Policy Information","text":"

    Additional information on service policies can be found here.

    "},{"location":"services/gpuservice/#eidf-gpu-service-tutorial","title":"EIDF GPU Service Tutorial","text":"

    This tutorial teaches users how to submit tasks to the EIDF GPU Service, but it is not a comprehensive overview of Kubernetes.

    Lesson: Getting started with Kubernetes. Objectives: a. What is Kubernetes? b. How to send a task to a GPU node. c. How to define the GPU resources needed.
    Lesson: Requesting persistent volumes with Kubernetes. Objectives: a. What is a persistent volume? b. How to request a PV resource.
    Lesson: Running a PyTorch task. Objectives: a. Accessing a PyTorch container. b. Submitting a PyTorch task to the cluster. c. Inspecting the results.
    Lesson: Template workflow. Objectives: a. Loading large data sets asynchronously. b. Manually or automatically building Docker images. c. Iteratively changing and testing code in a job."},{"location":"services/gpuservice/#further-reading-and-help","title":"Further Reading and Help","text":""},{"location":"services/gpuservice/faq/","title":"GPU Service FAQ","text":""},{"location":"services/gpuservice/faq/#gpu-service-frequently-asked-questions","title":"GPU Service Frequently Asked Questions","text":""},{"location":"services/gpuservice/faq/#how-do-i-access-the-gpu-service","title":"How do I access the GPU Service?","text":"

    The default access route to the GPU Service is via an EIDF DSC VM. The DSC VM will have access to all EIDF resources for your project and can be accessed through the VDI (SSH or if enabled RDP) or via the EIDF SSH Gateway.

    "},{"location":"services/gpuservice/faq/#how-do-i-obtain-my-project-kubeconfig-file","title":"How do I obtain my project kubeconfig file?","text":"

    Project Leads and Managers can access the kubeconfig file from the Project page in the Portal. Project Leads and Managers can provide the file on any of the project VMs or give it to individuals within the project.

    "},{"location":"services/gpuservice/faq/#i-cant-mount-my-pvc-in-multiple-containers-or-pods-at-the-same-time","title":"I can't mount my PVC in multiple containers or pods at the same time","text":"

    The current PVC provisioner is based on Ceph RBD. The block devices provided by Ceph to the Kubernetes PV/PVC providers cannot be mounted in multiple pods at the same time; they can only be accessed by one pod at a time. Once a pod has unmounted the PVC and terminated, the PVC can be reused by another pod. The service development team is working on new PVC provider systems to alleviate this limitation.

    "},{"location":"services/gpuservice/faq/#how-many-gpus-can-i-use-in-a-pod","title":"How many GPUs can I use in a pod?","text":"

    The current limit is 8 GPUs per pod. Each underlying host node has either 4 or 8 GPUs. If you request 8 GPUs, your job will be queued until a node with all 8 GPUs free becomes available. A request for 4 GPUs can run on a node with either 4 or 8 GPUs.
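As a rough illustration of the placement rule above, the following sketch lists which node sizes could host a given request. It only encodes the figures quoted in this answer (nodes with 4 or 8 GPUs, at most 8 GPUs per pod); the function name is hypothetical and the real scheduler also considers quota and current usage.

```python
# Figures taken from the answer above; everything else is illustrative.
NODE_GPU_COUNTS = (4, 8)
MAX_GPUS_PER_POD = 8


def eligible_node_sizes(requested_gpus):
    """Return the node sizes (in GPUs) large enough to host the request."""
    if not 1 <= requested_gpus <= MAX_GPUS_PER_POD:
        raise ValueError(
            f"a pod may request between 1 and {MAX_GPUS_PER_POD} GPUs"
        )
    return [size for size in NODE_GPU_COUNTS if size >= requested_gpus]
```

A request for 8 GPUs therefore has only one kind of node to wait for, which is why such jobs tend to queue for longer.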

    "},{"location":"services/gpuservice/faq/#why-did-a-validation-error-occur-when-submitting-a-pod-or-job-with-a-valid-specification-file","title":"Why did a validation error occur when submitting a pod or job with a valid specification file?","text":"

    If an error like the below occurs:

    error: error validating \"myjobfile.yml\": error validating data: the server does not allow access to the requested resource; if you choose to ignore these errors, turn validation off with --validate=false\n

    There may be an issue with the kubectl version that is being run. This can occur if kubectl was installed in a virtual environment or from a package repository.

    The current version verified to operate with the GPU Service is v1.24.10. kubectl and the Kubernetes API server can suffer from version skew if they are not within a defined number of releases of each other. More information can be found in the Kubernetes Version Skew Policy.
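A minimal sketch of a skew check follows. It assumes the rule from the Kubernetes Version Skew Policy that kubectl is supported within one minor version of the API server; the function names and the example server versions in the usage note are hypothetical.

```python
def major_minor(version):
    """Parse a version string such as 'v1.24.10' into (major, minor)."""
    major, minor, *_ = version.lstrip("v").split(".")
    return int(major), int(minor)


def within_supported_skew(client, server, max_minor_skew=1):
    """True if the client is within max_minor_skew minor releases of the server."""
    client_major, client_minor = major_minor(client)
    server_major, server_minor = major_minor(server)
    return (
        client_major == server_major
        and abs(client_minor - server_minor) <= max_minor_skew
    )
```

For example, `within_supported_skew("v1.24.10", "v1.25.3")` is true, whereas a v1.28 client against a v1.24 server is outside the supported range. The client and server versions in use can be read from the output of `kubectl version`.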

    "},{"location":"services/gpuservice/faq/#insufficient-shared-memory-size","title":"Insufficient Shared Memory Size","text":"

    My SHM is very small, and it causes \"OSError: [Errno 28] No space left on device\" when I train a model using multiple GPUs. How do I increase the SHM size?

    The default size of SHM is only 64M. You can mount an empty dir to /dev/shm to solve this problem:

       spec:\n     containers:\n       - name: [NAME]\n         image: [IMAGE]\n         volumeMounts:\n         - mountPath: /dev/shm\n           name: dshm\n     volumes:\n       - name: dshm\n         emptyDir:\n            medium: Memory\n
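To confirm that the mount has taken effect, the size of /dev/shm can be read from inside the container. This is a small sketch using only the Python standard library; the function name is hypothetical.

```python
import shutil


def shm_total_mib(path="/dev/shm"):
    """Return the total size of the filesystem at `path` in MiB."""
    return shutil.disk_usage(path).total // (1024 * 1024)


# Usage inside the running container:
#   print(shm_total_mib())
```

With the default configuration this should report a value close to the 64M mentioned above, and a larger value once the memory-backed emptyDir is mounted.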
    "},{"location":"services/gpuservice/faq/#pytorch-slow-performance-issues","title":"Pytorch Slow Performance Issues","text":"

    PyTorch on Kubernetes may run slower than expected - much slower than an equivalent VM setup.

    PyTorch defaults to auto-detecting the number of OMP threads, and it will detect an incorrect number of potential threads compared to your requested CPU core count. This is a consequence of operating in a container environment: the CPU information reported by standard libraries and tools is the node-level information rather than that of your container.

    To help correct this issue, the environment variable OMP_NUM_THREADS should be set in the job submission file to the number of cores requested or less.

    This has been tested using:

    Example fragment for a Bash command start:

      containers:\n    - args:\n        - >\n          export OMP_NUM_THREADS=1;\n          python mypytorchprogram.py;\n      command:\n        - /bin/bash\n        - '-c'\n        - '--'\n
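The same setting can be applied from Python instead of the shell. The sketch below is an assumption-laden alternative, not a tested EIDF recipe: the function name is hypothetical, and the variable should be set before the OpenMP runtime starts, so it is safest to call it before importing torch.

```python
import os


def set_omp_threads(requested_cores):
    """Set OMP_NUM_THREADS to at most the number of cores requested for the pod."""
    # os.cpu_count() reports node-level CPUs inside a container, so it is only
    # used here as an upper bound, never as the thread count itself.
    visible = os.cpu_count() or requested_cores
    threads = max(1, min(requested_cores, visible))
    os.environ["OMP_NUM_THREADS"] = str(threads)
    return threads


# Usage: call this first, then import torch.
#   set_omp_threads(2)
#   import torch
```

The value passed in should match the CPU request in the job specification, or be lower.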
    "},{"location":"services/gpuservice/faq/#my-large-number-of-gpus-job-takes-a-long-time-to-be-scheduled","title":"My large number of GPUs Job takes a long time to be scheduled","text":"

    A job requesting a large number of GPUs may require an entire node to be free, which can take some time to become available. The default scheduling algorithm in the queues is Best Effort FIFO, which means that large jobs will not block small jobs from running if there is sufficient quota and space available.

    "},{"location":"services/gpuservice/kueue/","title":"Kueue","text":""},{"location":"services/gpuservice/kueue/#overview","title":"Overview","text":"

    Kueue is a native Kubernetes quota and job management system.

    This has been the job queue system for the EIDF GPU Service since February 2024.

    All users should submit jobs to their local namespace user queue. This queue is named after the EIDF project namespace, followed by -user-queue.

    "},{"location":"services/gpuservice/kueue/#changes-to-job-specs","title":"Changes to Job Specs","text":"

    Jobs can be submitted as before but will require the addition of a metadata label:

       labels:\n      kueue.x-k8s.io/queue-name:  <project namespace>-user-queue\n

    This is the only change required to make Jobs work with Kueue. A policy will be in place that rejects jobs without this label.
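For job specifications that are generated programmatically, the label can be added to the parsed specification before it is submitted. This is a minimal sketch operating on a plain dictionary; the function name is hypothetical and the namespace in the usage note is only an example.

```python
import copy

QUEUE_LABEL = "kueue.x-k8s.io/queue-name"


def with_queue_label(job, namespace):
    """Return a copy of a Job spec dict with the Kueue queue label added."""
    labelled = copy.deepcopy(job)  # leave the caller's dict untouched
    labels = labelled.setdefault("metadata", {}).setdefault("labels", {})
    labels[QUEUE_LABEL] = f"{namespace}-user-queue"
    return labelled
```

Calling `with_queue_label(job, "eidf001")` sets the label to `eidf001-user-queue`, matching the queue name shown in the command output later on this page.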

    "},{"location":"services/gpuservice/kueue/#useful-commands-for-looking-at-your-local-queue","title":"Useful commands for looking at your local queue","text":""},{"location":"services/gpuservice/kueue/#kubectl-get-queue","title":"kubectl get queue","text":"

    This command will output the high level status of your namespace queue with the number of workloads currently running and the number waiting to start:

    NAME               CLUSTERQUEUE             PENDING WORKLOADS   ADMITTED WORKLOADS\neidf001-user-queue eidf001-project-gpu-cq   0                   2\n
    "},{"location":"services/gpuservice/kueue/#kubectl-describe-queue-queue","title":"kubectl describe queue <queue>","text":"

    This command will output more detailed information on the current resource usage in your queue:

    Name:         eidf001-user-queue\nNamespace:    eidf001\nLabels:       <none>\nAnnotations:  <none>\nAPI Version:  kueue.x-k8s.io/v1beta1\nKind:         LocalQueue\nMetadata:\n  Creation Timestamp:  2024-02-06T13:06:23Z\n  Generation:          1\n  Managed Fields:\n    API Version:  kueue.x-k8s.io/v1beta1\n    Fields Type:  FieldsV1\n    fieldsV1:\n      f:spec:\n        .:\n        f:clusterQueue:\n    Manager:      kubectl-create\n    Operation:    Update\n    Time:         2024-02-06T13:06:23Z\n    API Version:  kueue.x-k8s.io/v1beta1\n    Fields Type:  FieldsV1\n    fieldsV1:\n      f:status:\n        .:\n        f:admittedWorkloads:\n        f:conditions:\n          .:\n          k:{\"type\":\"Active\"}:\n            .:\n            f:lastTransitionTime:\n            f:message:\n            f:reason:\n            f:status:\n            f:type:\n        f:flavorUsage:\n          .:\n          k:{\"name\":\"default-flavor\"}:\n            .:\n            f:name:\n            f:resources:\n              .:\n              k:{\"name\":\"cpu\"}:\n                .:\n                f:name:\n                f:total:\n              k:{\"name\":\"memory\"}:\n                .:\n                f:name:\n                f:total:\n          k:{\"name\":\"gpu-a100\"}:\n            .:\n            f:name:\n            f:resources:\n              .:\n              k:{\"name\":\"nvidia.com/gpu\"}:\n                .:\n                f:name:\n                f:total:\n          k:{\"name\":\"gpu-a100-1g\"}:\n            .:\n            f:name:\n            f:resources:\n              .:\n              k:{\"name\":\"nvidia.com/gpu\"}:\n                .:\n                f:name:\n                f:total:\n          k:{\"name\":\"gpu-a100-3g\"}:\n            .:\n            f:name:\n            f:resources:\n              .:\n              k:{\"name\":\"nvidia.com/gpu\"}:\n                .:\n                f:name:\n                f:total:\n          
k:{\"name\":\"gpu-a100-80\"}:\n            .:\n            f:name:\n            f:resources:\n              .:\n              k:{\"name\":\"nvidia.com/gpu\"}:\n                .:\n                f:name:\n                f:total:\n        f:flavorsReservation:\n          .:\n          k:{\"name\":\"default-flavor\"}:\n            .:\n            f:name:\n            f:resources:\n              .:\n              k:{\"name\":\"cpu\"}:\n                .:\n                f:name:\n                f:total:\n              k:{\"name\":\"memory\"}:\n                .:\n                f:name:\n                f:total:\n          k:{\"name\":\"gpu-a100\"}:\n            .:\n            f:name:\n            f:resources:\n              .:\n              k:{\"name\":\"nvidia.com/gpu\"}:\n                .:\n                f:name:\n                f:total:\n          k:{\"name\":\"gpu-a100-1g\"}:\n            .:\n            f:name:\n            f:resources:\n              .:\n              k:{\"name\":\"nvidia.com/gpu\"}:\n                .:\n                f:name:\n                f:total:\n          k:{\"name\":\"gpu-a100-3g\"}:\n            .:\n            f:name:\n            f:resources:\n              .:\n              k:{\"name\":\"nvidia.com/gpu\"}:\n                .:\n                f:name:\n                f:total:\n          k:{\"name\":\"gpu-a100-80\"}:\n            .:\n            f:name:\n            f:resources:\n              .:\n              k:{\"name\":\"nvidia.com/gpu\"}:\n                .:\n                f:name:\n                f:total:\n        f:pendingWorkloads:\n        f:reservingWorkloads:\n    Manager:         kueue\n    Operation:       Update\n    Subresource:     status\n    Time:            2024-02-14T10:54:20Z\n  Resource Version:  333898946\n  UID:               bca097e2-6c55-4305-86ac-d1bd3c767751\nSpec:\n  Cluster Queue:  eidf001-project-gpu-cq\nStatus:\n  Admitted Workloads:  2\n  Conditions:\n    Last Transition Time:  
2024-02-06T13:06:23Z\n    Message:               Can submit new workloads to clusterQueue\n    Reason:                Ready\n    Status:                True\n    Type:                  Active\n  Flavor Usage:\n    Name:  gpu-a100\n    Resources:\n      Name:   nvidia.com/gpu\n      Total:  2\n    Name:     gpu-a100-3g\n    Resources:\n      Name:   nvidia.com/gpu\n      Total:  0\n    Name:     gpu-a100-1g\n    Resources:\n      Name:   nvidia.com/gpu\n      Total:  0\n    Name:     gpu-a100-80\n    Resources:\n      Name:   nvidia.com/gpu\n      Total:  0\n    Name:     default-flavor\n    Resources:\n      Name:   cpu\n      Total:  16\n      Name:   memory\n      Total:  256Gi\n  Flavors Reservation:\n    Name:  gpu-a100\n    Resources:\n      Name:   nvidia.com/gpu\n      Total:  2\n    Name:     gpu-a100-3g\n    Resources:\n      Name:   nvidia.com/gpu\n      Total:  0\n    Name:     gpu-a100-1g\n    Resources:\n      Name:   nvidia.com/gpu\n      Total:  0\n    Name:     gpu-a100-80\n    Resources:\n      Name:   nvidia.com/gpu\n      Total:  0\n    Name:     default-flavor\n    Resources:\n      Name:             cpu\n      Total:            16\n      Name:             memory\n      Total:            256Gi\n  Pending Workloads:    0\n  Reserving Workloads:  2\nEvents:                 <none>\n
    "},{"location":"services/gpuservice/kueue/#kubectl-get-workloads","title":"kubectl get workloads","text":"

    This command will return the list of workloads in the queue:

    NAME                QUEUE                ADMITTED BY              AGE\njob-jobtest-366ab   eidf001-user-queue   eidf001-project-gpu-cq   4h45m\njob-jobtest-34ba9   eidf001-user-queue   eidf001-project-gpu-cq   6h48m\n
    "},{"location":"services/gpuservice/kueue/#kubectl-describe-workload-workload","title":"kubectl describe workload <workload>","text":"

    This command will return a detailed summary of the workload including status and resource usage:

    Name:         job-pytorch-job-0b664\nNamespace:    t4\nLabels:       kueue.x-k8s.io/job-uid=33bc1e48-4dca-4252-9387-bf68b99759dc\nAnnotations:  <none>\nAPI Version:  kueue.x-k8s.io/v1beta1\nKind:         Workload\nMetadata:\n  Creation Timestamp:  2024-02-14T15:22:16Z\n  Generation:          2\n  Managed Fields:\n    API Version:  kueue.x-k8s.io/v1beta1\n    Fields Type:  FieldsV1\n    fieldsV1:\n      f:status:\n        f:admission:\n          f:clusterQueue:\n          f:podSetAssignments:\n            k:{\"name\":\"main\"}:\n              .:\n              f:count:\n              f:flavors:\n                f:cpu:\n                f:memory:\n                f:nvidia.com/gpu:\n              f:name:\n              f:resourceUsage:\n                f:cpu:\n                f:memory:\n                f:nvidia.com/gpu:\n        f:conditions:\n          k:{\"type\":\"Admitted\"}:\n            .:\n            f:lastTransitionTime:\n            f:message:\n            f:reason:\n            f:status:\n            f:type:\n          k:{\"type\":\"QuotaReserved\"}:\n            .:\n            f:lastTransitionTime:\n            f:message:\n            f:reason:\n            f:status:\n            f:type:\n    Manager:      kueue-admission\n    Operation:    Apply\n    Subresource:  status\n    Time:         2024-02-14T15:22:16Z\n    API Version:  kueue.x-k8s.io/v1beta1\n    Fields Type:  FieldsV1\n    fieldsV1:\n      f:status:\n        f:conditions:\n          k:{\"type\":\"Finished\"}:\n            .:\n            f:lastTransitionTime:\n            f:message:\n            f:reason:\n            f:status:\n            f:type:\n    Manager:      kueue-job-controller-Finished\n    Operation:    Apply\n    Subresource:  status\n    Time:         2024-02-14T15:25:06Z\n    API Version:  kueue.x-k8s.io/v1beta1\n    Fields Type:  FieldsV1\n    fieldsV1:\n      f:metadata:\n        f:labels:\n          .:\n          f:kueue.x-k8s.io/job-uid:\n        f:ownerReferences:\n      
    .:\n          k:{\"uid\":\"33bc1e48-4dca-4252-9387-bf68b99759dc\"}:\n      f:spec:\n        .:\n        f:podSets:\n          .:\n          k:{\"name\":\"main\"}:\n            .:\n            f:count:\n            f:name:\n            f:template:\n              .:\n              f:metadata:\n                .:\n                f:labels:\n                  .:\n                  f:controller-uid:\n                  f:job-name:\n                f:name:\n              f:spec:\n                .:\n                f:containers:\n                f:dnsPolicy:\n                f:nodeSelector:\n                f:restartPolicy:\n                f:schedulerName:\n                f:securityContext:\n                f:terminationGracePeriodSeconds:\n                f:volumes:\n        f:priority:\n        f:priorityClassSource:\n        f:queueName:\n    Manager:    kueue\n    Operation:  Update\n    Time:       2024-02-14T15:22:16Z\n  Owner References:\n    API Version:           batch/v1\n    Block Owner Deletion:  true\n    Controller:            true\n    Kind:                  Job\n    Name:                  pytorch-job\n    UID:                   33bc1e48-4dca-4252-9387-bf68b99759dc\n  Resource Version:        270812029\n  UID:                     8cfa93ba-1142-4728-bc0c-e8de817e8151\nSpec:\n  Pod Sets:\n    Count:  1\n    Name:   main\n    Template:\n      Metadata:\n        Labels:\n          Controller - UID:  33bc1e48-4dca-4252-9387-bf68b99759dc\n          Job - Name:        pytorch-job\n        Name:                pytorch-pod\n      Spec:\n        Containers:\n          Args:\n            /mnt/ceph_rbd/example_pytorch_code.py\n          Command:\n            python3\n          Image:              pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel\n          Image Pull Policy:  IfNotPresent\n          Name:               pytorch-con\n          Resources:\n            Limits:\n              Cpu:             4\n              Memory:          4Gi\n              
nvidia.com/gpu:  1\n            Requests:\n              Cpu:                     2\n              Memory:                  1Gi\n          Termination Message Path:    /dev/termination-log\n          Termination Message Policy:  File\n          Volume Mounts:\n            Mount Path:  /mnt/ceph_rbd\n            Name:        volume\n        Dns Policy:      ClusterFirst\n        Node Selector:\n          nvidia.com/gpu.product:  NVIDIA-A100-SXM4-40GB\n        Restart Policy:            Never\n        Scheduler Name:            default-scheduler\n        Security Context:\n        Termination Grace Period Seconds:  30\n        Volumes:\n          Name:  volume\n          Persistent Volume Claim:\n            Claim Name:   pytorch-pvc\n  Priority:               0\n  Priority Class Source:\n  Queue Name:             t4-user-queue\nStatus:\n  Admission:\n    Cluster Queue:  project-cq\n    Pod Set Assignments:\n      Count:  1\n      Flavors:\n        Cpu:             default-flavor\n        Memory:          default-flavor\n        nvidia.com/gpu:  gpu-a100\n      Name:              main\n      Resource Usage:\n        Cpu:             2\n        Memory:          1Gi\n        nvidia.com/gpu:  1\n  Conditions:\n    Last Transition Time:  2024-02-14T15:22:16Z\n    Message:               Quota reserved in ClusterQueue project-cq\n    Reason:                QuotaReserved\n    Status:                True\n    Type:                  QuotaReserved\n    Last Transition Time:  2024-02-14T15:22:16Z\n    Message:               The workload is admitted\n    Reason:                Admitted\n    Status:                True\n    Type:                  Admitted\n    Last Transition Time:  2024-02-14T15:25:06Z\n    Message:               Job finished successfully\n    Reason:                JobFinished\n    Status:                True\n    Type:                  Finished\n
    "},{"location":"services/gpuservice/policies/","title":"GPU Service Policies","text":""},{"location":"services/gpuservice/policies/#namespaces","title":"Namespaces","text":"

    Each project will be given a namespace which will have an applied quota.

    Default Quota:

    "},{"location":"services/gpuservice/policies/#kubeconfig","title":"Kubeconfig","text":"

    Each project will be assigned a kubeconfig file for access to the service which will allow operation in the assigned namespace and access to exposed service operators, for example the GPU and CephRBD operators.

    "},{"location":"services/gpuservice/policies/#kubernetes-job-time-to-live","title":"Kubernetes Job Time to Live","text":"

    All Kubernetes Jobs submitted to the service will have a Time to Live (TTL) applied automatically via spec.ttlSecondsAfterFinished. The default TTL for jobs using the service will be 1 week (604800 seconds). A completed job (in success or error state) will be deleted from the service once one week has elapsed after execution has completed. This will reduce excessive object accumulation on the service.

    Important

    This policy is automated and does not require users to change their job specifications.

    "},{"location":"services/gpuservice/policies/#kubernetes-active-deadline-seconds","title":"Kubernetes Active Deadline Seconds","text":"

    All Kubernetes User Pods submitted to the service will have an Active Deadline Seconds (ADS) applied automatically via spec.activeDeadlineSeconds. The default ADS for pods using the service will be 5 days (432000 seconds). A pod will be terminated 5 days after execution has begun. This will reduce the number of unused pods remaining on the service.

    Important

    This policy is automated and does not require users to change their job or pod specifications.
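The two time limits on this page can be turned into concrete timestamps when planning long runs. The sketch below simply restates the quoted defaults (1 week TTL for finished jobs, 5 days active deadline for pods); the constant and function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

JOB_TTL = timedelta(weeks=1)             # 604800 seconds
POD_ACTIVE_DEADLINE = timedelta(days=5)  # 432000 seconds


def job_deletion_time(finished_at):
    """When a finished job becomes eligible for automatic deletion."""
    return finished_at + JOB_TTL


def pod_termination_time(started_at):
    """When a running pod will be terminated by the active deadline."""
    return started_at + POD_ACTIVE_DEADLINE
```

For example, a job that finished at 2024-02-14T15:25:06Z would be removed on or after 2024-02-21T15:25:06Z, so any results should be copied to persistent storage before then.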

    "},{"location":"services/gpuservice/policies/#kueue","title":"Kueue","text":"

    All jobs will be managed through the Kueue scheduling system. All pods will be required to be owned by a Kubernetes workload.

    Each project will have a local user queue in their namespace. This will provide access to their cluster queue. To enable the use of the queue in your job definitions, the following will need to be added to the job specification file as part of the metadata:

       labels:\n      kueue.x-k8s.io/queue-name:  <project namespace>-user-queue\n

    Jobs without this queue name tag will be rejected.

    Pods bypassing the queue system will be deleted.

    "},{"location":"services/gpuservice/training/L1_getting_started/","title":"Getting started with Kubernetes","text":""},{"location":"services/gpuservice/training/L1_getting_started/#introduction","title":"Introduction","text":"

    Kubernetes (K8s) is a container orchestration system, originally developed by Google, for the deployment, scaling, and management of containerised applications.

    Nvidia GPUs are supported through K8s native Nvidia GPU Operators.

    The use of K8s to manage the EIDF GPU Service provides two key advantages:

    "},{"location":"services/gpuservice/training/L1_getting_started/#interacting-with-a-k8s-cluster","title":"Interacting with a K8s cluster","text":"

    An overview of the key components of a K8s cluster can be seen on the Kubernetes docs website.

    The primary component of a K8s cluster is a pod.

    A pod is a set of one or more containers (and their storage volumes) that share resources.

    Users define the resource requirements of a pod (e.g. number/type of GPU) and the containers to be run in the pod by writing a yaml file.

    The pod definition yaml file is sent to the cluster using the K8s API and is assigned to an appropriate node to be run.

    A node is a part of the cluster such as a physical or virtual host which exposes CPU, Memory and GPUs.

    Multiple pods can be defined and maintained using several different methods depending on purpose: deployments, services and jobs; see the K8s docs for more details.

    Users interact with the K8s API using the kubectl (short for kubernetes control) commands.

    Some of the kubectl commands are restricted on the EIDF cluster in order to ensure project details are not shared across namespaces.

    Useful commands are:

    "},{"location":"services/gpuservice/training/L1_getting_started/#creating-your-first-job","title":"Creating your first job","text":"

    To access the GPUs on the service, it is recommended to start with one of the prebuilt container images provided by Nvidia. These images are intended to perform different tasks using Nvidia GPUs.

    The list of Nvidia images is available on their website.

    The following example uses their CUDA sample code simulating nbody interactions.

    1. Open an editor of your choice and create the file test_NBody.yml
    2. Copy the following into the file, replacing namespace-user-queue with <project namespace>-user-queue, e.g. eidf001ns-user-queue:

      apiVersion: batch/v1\nkind: Job\nmetadata:\n    generateName: jobtest-\n    labels:\n        kueue.x-k8s.io/queue-name:  namespace-user-queue\nspec:\n    completions: 1\n    template:\n        metadata:\n            name: job-test\n        spec:\n            containers:\n            - name: cudasample\n              image: nvcr.io/nvidia/k8s/cuda-sample:nbody-cuda11.7.1\n              args: [\"-benchmark\", \"-numbodies=512000\", \"-fp64\", \"-fullscreen\"]\n              resources:\n                    requests:\n                        cpu: 2\n                        memory: '1Gi'\n                    limits:\n                        cpu: 2\n                        memory: '4Gi'\n                        nvidia.com/gpu: 1\n            restartPolicy: Never\n

      The pod resources are defined under the resources tags using the requests and limits tags.

      Resources defined under the requests tags are the reserved resources required for the pod to be scheduled.

      If a pod is assigned to a node with unused resources then it may burst up to use resources beyond those requested.

      This may allow the task within the pod to run faster, but it will also throttle back down when further pods are scheduled to the node.

      The limits tag specifies the maximum resources that can be assigned to a pod.

      The EIDF GPU Service requires all pods have requests and limits tags for CPU and memory defined in order to be accepted.

      GPU resource requests are optional; only an entry under the limits tag, nvidia.com/gpu: 1, is needed to specify the use of a GPU. Without this no GPU will be available to the pod.

      The label kueue.x-k8s.io/queue-name specifies the queue you are submitting your job to. This is part of the Kueue system in operation on the service to allow for improved resource management for users.

    3. Save the file and exit the editor

    4. Run kubectl create -f test_NBody.yml
    5. This will output something like:

      job.batch/jobtest-b92qg created\n
    6. Run kubectl get jobs

    7. This will output something like:

      NAME            COMPLETIONS   DURATION   AGE\njobtest-b92qg   3/3           48s        6m27s\njobtest-d45sr   5/5           15m        22h\njobtest-kwmwk   3/3           48s        29m\njobtest-kw22k   1/1           48s        29m\n

      This displays all the jobs in the current namespace, starting with their name, number of completions against required completions, duration and age.

    8. Describe your job using the command kubectl describe job jobtest-b92qg, replacing the job name with your job name.

    9. This will output something like:

      Name:             jobtest-b92qg\nNamespace:        t4\nSelector:         controller-uid=d3233fee-794e-466f-9655-1fe32d1f06d3\nLabels:           kueue.x-k8s.io/queue-name=t4-user-queue\nAnnotations:      batch.kubernetes.io/job-tracking:\nParallelism:      1\nCompletions:      3\nCompletion Mode:  NonIndexed\nStart Time:       Wed, 14 Feb 2024 14:07:44 +0000\nCompleted At:     Wed, 14 Feb 2024 14:08:32 +0000\nDuration:         48s\nPods Statuses:    0 Active (0 Ready) / 3 Succeeded / 0 Failed\nPod Template:\n    Labels:  controller-uid=d3233fee-794e-466f-9655-1fe32d1f06d3\n            job-name=jobtest-b92qg\n    Containers:\n        cudasample:\n            Image:      nvcr.io/nvidia/k8s/cuda-sample:nbody-cuda11.7.1\n            Port:       <none>\n            Host Port:  <none>\n            Args:\n                -benchmark\n                -numbodies=512000\n                -fp64\n                -fullscreen\n            Limits:\n                cpu:             2\n                memory:          4Gi\n                nvidia.com/gpu:  1\n            Requests:\n                cpu:        2\n                memory:     1Gi\n            Environment:  <none>\n            Mounts:       <none>\n    Volumes:        <none>\nEvents:\nType    Reason            Age    From                        Message\n----    ------            ----   ----                        -------\nNormal  Suspended         8m1s   job-controller              Job suspended\nNormal  CreatedWorkload   8m1s   batch/job-kueue-controller  Created Workload: t4/job-jobtest-b92qg-3b890\nNormal  Started           8m1s   batch/job-kueue-controller  Admitted by clusterQueue project-cq\nNormal  SuccessfulCreate  8m     job-controller              Created pod: jobtest-b92qg-lh64s\nNormal  Resumed           8m     job-controller              Job resumed\nNormal  SuccessfulCreate  7m44s  job-controller              Created pod: jobtest-b92qg-xhvdm\nNormal  SuccessfulCreate  7m28s  job-controller              
Created pod: jobtest-b92qg-lvmrf\nNormal  Completed         7m12s  job-controller              Job completed\n
    10. Run kubectl get pods

    11. This will output something like:

      NAME                  READY   STATUS      RESTARTS   AGE\njobtest-b92qg-lh64s   0/1     Completed   0          11m\njobtest-b92qg-lvmrf   0/1     Completed   0          10m\njobtest-b92qg-xhvdm   0/1     Completed   0          10m\njobtest-d45sr-8tf4d   0/1     Completed   0          22h\njobtest-d45sr-jjhgg   0/1     Completed   0          22h\njobtest-d45sr-n5w6c   0/1     Completed   0          22h\njobtest-d45sr-v9p4j   0/1     Completed   0          22h\njobtest-d45sr-xgq5s   0/1     Completed   0          22h\njobtest-kwmwk-cgwmf   0/1     Completed   0          33m\njobtest-kwmwk-mttdw   0/1     Completed   0          33m\njobtest-kwmwk-r2q9h   0/1     Completed   0          33m\n
    12. View the logs of a pod from the job you ran with kubectl logs jobtest-b92qg-lh64s. Note that the names of the pods for the job start with the job name.

    13. This will output something like:

      Run \"nbody -benchmark [-numbodies=<numBodies>]\" to measure performance.\n    -fullscreen       (run n-body simulation in fullscreen mode)\n    -fp64             (use double precision floating point values for simulation)\n    -hostmem          (stores simulation data in host memory)\n    -benchmark        (run benchmark to measure performance)\n    -numbodies=<N>    (number of bodies (>= 1) to run in simulation)\n    -device=<d>       (where d=0,1,2.... for the CUDA device to use)\n    -numdevices=<i>   (where i=(number of CUDA devices > 0) to use for simulation)\n    -compare          (compares simulation results running once on the default GPU and once on the CPU)\n    -cpu              (run n-body simulation on the CPU)\n    -tipsy=<file.bin> (load a tipsy model file for simulation)\n\nNOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.\n\n> Fullscreen mode\n> Simulation data stored in video memory\n> Double precision floating point simulation\n> 1 Devices used for simulation\nGPU Device 0: \"Ampere\" with compute capability 8.0\n\n> Compute 8.0 CUDA device: [NVIDIA A100-SXM4-40GB]\nnumber of bodies = 512000\n512000 bodies, total time for 10 iterations: 10570.778 ms\n= 247.989 billion interactions per second\n= 7439.679 double-precision GFLOP/s at 30 flops per interaction\n
    14. Delete your job with kubectl delete job jobtest-b92qg - this will delete the associated pods as well.
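The requirement described in step 2, that every pod defines requests and limits for both CPU and memory, can be checked before a job is submitted. The following is a minimal Python sketch under stated assumptions: the container is represented as a plain dict mirroring the yaml above, and the helper name missing_resources is hypothetical, not part of kubectl or the service.

```python
# Entries the text above says must be present under both requests and limits.
REQUIRED = ('cpu', 'memory')


def missing_resources(container):
    '''List the requests/limits entries a container spec (dict) is missing.'''
    resources = container.get('resources', {})
    missing = []
    for section in ('requests', 'limits'):
        for key in REQUIRED:
            if key not in resources.get(section, {}):
                missing.append(section + '.' + key)
    return missing


# Mirrors the cudasample container from test_NBody.yml.
container = {
    'name': 'cudasample',
    'resources': {
        'requests': {'cpu': 2, 'memory': '1Gi'},
        'limits': {'cpu': 2, 'memory': '4Gi', 'nvidia.com/gpu': 1},
    },
}
print(missing_resources(container))
# A container that only sets a CPU limit is reported as incomplete.
print(missing_resources({'name': 'bare', 'resources': {'limits': {'cpu': 1}}}))
```

An empty list means the container carries all four entries; the GPU entry is deliberately not checked because it is only needed when a GPU is wanted.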

    15. "},{"location":"services/gpuservice/training/L1_getting_started/#specifying-gpu-requirements","title":"Specifying GPU requirements","text":"

      If you create multiple jobs with the same definition file and compare their log files, you may notice that the CUDA device differs from Compute 8.0 CUDA device: [NVIDIA A100-SXM4-40GB].

      The GPU Operator on K8s allocates the pod to the first node with a free GPU that matches the other resource specifications, irrespective of the GPU type present on the node.

      The GPU resource requests can be made more specific by adding the type of GPU product the pod is requesting to the node selector:

      "},{"location":"services/gpuservice/training/L1_getting_started/#example-yaml-file","title":"Example yaml file","text":"
      apiVersion: batch/v1\nkind: Job\nmetadata:\n    generateName: jobtest-\n    labels:\n        kueue.x-k8s.io/queue-name:  namespace-user-queue\nspec:\n    completions: 1\n    template:\n        metadata:\n            name: job-test\n        spec:\n            containers:\n            - name: cudasample\n              image: nvcr.io/nvidia/k8s/cuda-sample:nbody-cuda11.7.1\n              args: [\"-benchmark\", \"-numbodies=512000\", \"-fp64\", \"-fullscreen\"]\n              resources:\n                    requests:\n                        cpu: 2\n                        memory: '1Gi'\n                    limits:\n                        cpu: 2\n                        memory: '4Gi'\n                        nvidia.com/gpu: 1\n            restartPolicy: Never\n            nodeSelector:\n                nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB-MIG-1g.5gb\n
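The effect of the node selector can be illustrated with a small Python sketch. It is a toy model of the first-fit behaviour described above, not the real K8s scheduler: the node names, the free GPU counts, the pick_node helper and the NVIDIA-A100-SXM4-40GB label for a full GPU are all assumptions made for illustration.

```python
# Toy model of first-fit placement; this is NOT the real K8s scheduler.
# Node names and free GPU counts are made up for illustration.
NODES = [
    {'name': 'node-a', 'free_gpus': 1, 'gpu_product': 'NVIDIA-A100-SXM4-40GB'},
    {'name': 'node-b', 'free_gpus': 2, 'gpu_product': 'NVIDIA-A100-SXM4-40GB-MIG-1g.5gb'},
]


def pick_node(nodes, gpus_wanted, gpu_product=None):
    '''Return the name of the first node that satisfies the request, or None.'''
    for node in nodes:
        if node['free_gpus'] < gpus_wanted:
            continue  # not enough free GPUs on this node
        if gpu_product is not None and node['gpu_product'] != gpu_product:
            continue  # selector set and the product label does not match
        return node['name']
    return None


# Without a selector the first node with a free GPU wins, whatever its type.
print(pick_node(NODES, 1))
# With a selector only nodes carrying the matching product label are eligible.
print(pick_node(NODES, 1, 'NVIDIA-A100-SXM4-40GB-MIG-1g.5gb'))
```

The second call skips node-a even though it has a free GPU, which is the behaviour the nodeSelector entry in the yaml file requests.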
      "},{"location":"services/gpuservice/training/L1_getting_started/#running-multiple-pods-with-k8s-jobs","title":"Running multiple pods with K8s jobs","text":"

      The recommended use of the EIDF GPU Service is to use a job request, which wraps around a pod specification and provides several useful attributes.

      Firstly, if a pod is assigned to a node that dies then the pod itself will fail and the user has to manually restart it.

      Wrapping a pod within a job enables the self-healing mechanism within K8s so that if a node dies with the job's pod on it then the job will find a new node to automatically restart the pod, if the restartPolicy is set.

      Jobs allow users to define multiple pods that can run in parallel or series and will continue to spawn pods until a specific number of pods successfully terminate.

      Jobs allow for better scheduling of resources using the Kueue service implemented on the EIDF GPU Service. Pods which attempt to bypass the queue mechanism this provides will affect the experience of other project users.

      See below for an example K8s job that requires three pods to successfully complete the example CUDA code before the job itself ends.

      apiVersion: batch/v1\nkind: Job\nmetadata:\n generateName: jobtest-\n labels:\n    kueue.x-k8s.io/queue-name:  namespace-user-queue\nspec:\n completions: 3\n parallelism: 1\n template:\n  metadata:\n   name: job-test\n  spec:\n   containers:\n   - name: cudasample\n     image: nvcr.io/nvidia/k8s/cuda-sample:nbody-cuda11.7.1\n     args: [\"-benchmark\", \"-numbodies=512000\", \"-fp64\", \"-fullscreen\"]\n     resources:\n      requests:\n       cpu: 2\n       memory: '1Gi'\n      limits:\n       cpu: 2\n       memory: '4Gi'\n       nvidia.com/gpu: 1\n   restartPolicy: Never\n
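How completions and parallelism interact can be sketched in a few lines of Python. This is a simplified simulation, assuming parallelism of 1 and a known sequence of pod results; the run_job helper is hypothetical and it ignores real job controller details such as the backoff limit on failed pods.

```python
def run_job(pod_outcomes, completions):
    '''Simulate a job with parallelism 1 over a given sequence of pod results.

    pod_outcomes is an iterable of booleans (True = pod succeeded).
    Returns (pods_started, succeeded); pods are started one at a time
    until the required number of successes is reached or outcomes run out.
    '''
    pods_started = 0
    succeeded = 0
    for ok in pod_outcomes:
        if succeeded >= completions:
            break  # the job is complete, no further pods are created
        pods_started += 1
        if ok:
            succeeded += 1
    return pods_started, succeeded


# Three completions are required; the second pod fails, so a fourth is started.
print(run_job([True, False, True, True, True], completions=3))
```

The fifth outcome is never consumed because the job stops creating pods once three have succeeded.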
      "},{"location":"services/gpuservice/training/L2_requesting_persistent_volumes/","title":"Requesting Persistent Volumes With Kubernetes","text":"

      Pods in the K8s EIDF GPU Service are intentionally ephemeral.

      They only last as long as required to complete the task that they were created for.

      Keeping pods ephemeral ensures the cluster resources are released for other users to request.

      However, this means the default storage volumes within a pod are temporary.

      If multiple pods require access to the same large data set or they output large files, then computationally costly file transfers need to be included in every pod instance.

      K8s allows you to request persistent volumes that can be mounted to multiple pods to share files or collate outputs.

      These persistent volumes will remain even if the pods they are mounted to are deleted, are updated or crash.

      "},{"location":"services/gpuservice/training/L2_requesting_persistent_volumes/#submitting-a-persistent-volume-claim","title":"Submitting a Persistent Volume Claim","text":"

      Before a persistent volume can be mounted to a pod, the required storage resources need to be requested and reserved to your namespace.

      A PersistentVolumeClaim (PVC) needs to be submitted to K8s to request the storage resources.

      The storage resources are held on a Ceph server which can accept requests up to 100 TiB. Currently, each PVC can only be accessed by one pod at a time; this limitation is being addressed in further development of the EIDF GPU Service. At this stage, pods can therefore mount the same PVC in sequence, but not concurrently.

      Example PVCs can be seen on the Kubernetes documentation page.

      All PVCs on the EIDF GPU Service must use the csi-rbd-sc storage class.

      "},{"location":"services/gpuservice/training/L2_requesting_persistent_volumes/#example-persistentvolumeclaim","title":"Example PersistentVolumeClaim","text":"
      kind: PersistentVolumeClaim\napiVersion: v1\nmetadata:\n name: test-ceph-pvc\nspec:\n accessModes:\n  - ReadWriteOnce\n resources:\n  requests:\n   storage: 2Gi\n storageClassName: csi-rbd-sc\n
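The two constraints stated above (the csi-rbd-sc storage class and the 100 TiB ceiling) can be checked with a short Python sketch before a claim is submitted. It assumes the PVC is held as a plain dict mirroring the yaml and that sizes use the binary suffixes Ki, Mi, Gi or Ti; the helper names to_bytes and check_pvc are hypothetical.

```python
# Binary suffixes used in K8s storage quantities such as 2Gi.
UNITS = {'Ki': 2**10, 'Mi': 2**20, 'Gi': 2**30, 'Ti': 2**40}
MAX_BYTES = 100 * UNITS['Ti']       # 100 TiB ceiling quoted above
REQUIRED_CLASS = 'csi-rbd-sc'


def to_bytes(quantity):
    '''Convert a quantity such as 2Gi to bytes (binary suffixes only).'''
    number, suffix = quantity[:-2], quantity[-2:]
    return int(number) * UNITS[suffix]


def check_pvc(pvc):
    '''Return a list of problems found in a PVC manifest given as a dict.'''
    problems = []
    spec = pvc['spec']
    if spec.get('storageClassName') != REQUIRED_CLASS:
        problems.append('storageClassName must be ' + REQUIRED_CLASS)
    if to_bytes(spec['resources']['requests']['storage']) > MAX_BYTES:
        problems.append('storage request exceeds 100 TiB')
    return problems


# Mirrors the example PersistentVolumeClaim above.
pvc = {
    'kind': 'PersistentVolumeClaim',
    'apiVersion': 'v1',
    'metadata': {'name': 'test-ceph-pvc'},
    'spec': {
        'accessModes': ['ReadWriteOnce'],
        'resources': {'requests': {'storage': '2Gi'}},
        'storageClassName': 'csi-rbd-sc',
    },
}
print(check_pvc(pvc))
```

An empty list indicates that the claim passes both checks; the cluster may still reject it for other reasons, such as project quota.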

      You create a persistent volume by passing the yaml file to kubectl, as with a pod specification yaml: kubectl create -f <PVC specification yaml>. Once you have successfully created a persistent volume you can interact with it using the standard kubectl commands:

      "},{"location":"services/gpuservice/training/L2_requesting_persistent_volumes/#mounting-a-persistent-volume-to-a-pod","title":"Mounting a persistent Volume to a Pod","text":"

      Introducing a persistent volume to a pod requires the addition of a volumeMounts option to the container and a volumes option linking to the PVC in the pod specification yaml.

      "},{"location":"services/gpuservice/training/L2_requesting_persistent_volumes/#example-pod-specification-yaml-with-mounted-persistent-volume","title":"Example pod specification yaml with mounted persistent volume","text":"
      apiVersion: batch/v1\nkind: Job\nmetadata:\n    name: test-ceph-pvc-job\n    labels:\n        kueue.x-k8s.io/queue-name:  <project namespace>-user-queue\nspec:\n    completions: 1\n    template:\n        metadata:\n            name: test-ceph-pvc-pod\n        spec:\n            containers:\n            - name: cudasample\n              image: busybox\n              args: [\"sleep\", \"infinity\"]\n              resources:\n                    requests:\n                        cpu: 2\n                        memory: '1Gi'\n                    limits:\n                        cpu: 2\n                        memory: '4Gi'\n              volumeMounts:\n                    - mountPath: /mnt/ceph_rbd\n                      name: volume\n            restartPolicy: Never\n            volumes:\n                - name: volume\n                  persistentVolumeClaim:\n                    claimName: test-ceph-pvc\n
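The two additions are linked by the volume name: the name under volumeMounts must match the name under volumes. The Python sketch below shows the same edit applied to a job manifest held as a dict; the mount_pvc helper and the trimmed base_job are hypothetical and only illustrate the structure.

```python
import copy


def mount_pvc(job, claim_name, mount_path, volume_name='volume'):
    '''Return a copy of a job manifest (dict) with a PVC mounted in every container.'''
    job = copy.deepcopy(job)            # leave the caller's manifest untouched
    pod_spec = job['spec']['template']['spec']
    for container in pod_spec['containers']:
        # Container side: where the volume appears inside the container.
        container.setdefault('volumeMounts', []).append(
            {'mountPath': mount_path, 'name': volume_name}
        )
    # Pod side: which PVC backs the volume of that name.
    pod_spec.setdefault('volumes', []).append(
        {'name': volume_name, 'persistentVolumeClaim': {'claimName': claim_name}}
    )
    return job


# Trimmed version of the job above, without resources or queue label.
base_job = {
    'apiVersion': 'batch/v1',
    'kind': 'Job',
    'metadata': {'name': 'test-ceph-pvc-job'},
    'spec': {'template': {'spec': {
        'containers': [{'name': 'cudasample', 'image': 'busybox'}],
        'restartPolicy': 'Never',
    }}},
}
job = mount_pvc(base_job, 'test-ceph-pvc', '/mnt/ceph_rbd')
print(job['spec']['template']['spec']['volumes'])
```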
      "},{"location":"services/gpuservice/training/L2_requesting_persistent_volumes/#accessing-the-persistent-volume-outside-a-pod","title":"Accessing the persistent volume outside a pod","text":"

      To move files in/out of the persistent volume from outside a pod you can use the kubectl cp command.

      *** On Login Node - replacing pod name with your pod name ***\nkubectl cp /home/data/test_data.csv test-ceph-pvc-job-8c9cc:/mnt/ceph_rbd\n

      For more complex file transfers and synchronisation, create a low resource pod with the persistent volume mounted.

      The bash command rsync can be amended to manage file transfers into the mounted PV following this GitHub repo.

      "},{"location":"services/gpuservice/training/L2_requesting_persistent_volumes/#clean-up","title":"Clean up","text":"
      kubectl delete job test-ceph-pvc-job\n\nkubectl delete pvc test-ceph-pvc\n
      "},{"location":"services/gpuservice/training/L3_running_a_pytorch_task/","title":"Running a PyTorch task","text":"

      In the following lesson, we'll build an NLP neural network and train it using the EIDF GPU Service.

      The model was taken from the PyTorch Tutorials.

      The lesson will be split into three parts:

      "},{"location":"services/gpuservice/training/L3_running_a_pytorch_task/#load-training-data-and-ml-code-into-a-persistent-volume","title":"Load training data and ML code into a persistent volume","text":""},{"location":"services/gpuservice/training/L3_running_a_pytorch_task/#create-a-persistent-volume","title":"Create a persistent volume","text":"

      Request storage from the Ceph server by submitting a PVC to K8s (example pvc spec yaml below).

      kubectl create -f <pvc-spec-yaml>\n
      "},{"location":"services/gpuservice/training/L3_running_a_pytorch_task/#example-pytorch-persistentvolumeclaim","title":"Example PyTorch PersistentVolumeClaim","text":"
      kind: PersistentVolumeClaim\napiVersion: v1\nmetadata:\n name: pytorch-pvc\nspec:\n accessModes:\n  - ReadWriteOnce\n resources:\n  requests:\n   storage: 2Gi\n storageClassName: csi-rbd-sc\n
      "},{"location":"services/gpuservice/training/L3_running_a_pytorch_task/#transfer-codedata-to-persistent-volume","title":"Transfer code/data to persistent volume","text":"
      1. Check PVC has been created

        kubectl get pvc <pvc-name>\n
      2. Create a lightweight job whose pod has the PV mounted (example job below)

        kubectl create -f lightweight-pod-job.yaml\n
      3. Download the PyTorch code

        wget https://github.com/EPCCed/eidf-docs/raw/main/docs/services/gpuservice/training/resources/example_pytorch_code.py\n
      4. Copy the Python script into the PV

        kubectl cp example_pytorch_code.py lightweight-job-<identifier>:/mnt/ceph_rbd/\n
      5. Check whether the files were transferred successfully

        kubectl exec lightweight-job-<identifier> -- ls /mnt/ceph_rbd\n
      6. Delete the lightweight job

        kubectl delete job lightweight-job-<identifier>\n
      "},{"location":"services/gpuservice/training/L3_running_a_pytorch_task/#example-lightweight-job-specification","title":"Example lightweight job specification","text":"
      apiVersion: batch/v1\nkind: Job\nmetadata:\n    name: lightweight-job\n    labels:\n        kueue.x-k8s.io/queue-name:  <project namespace>-user-queue\nspec:\n    completions: 1\n    template:\n        metadata:\n            name: lightweight-pod\n        spec:\n            containers:\n            - name: data-loader\n              image: busybox\n              args: [\"sleep\", \"infinity\"]\n              resources:\n                    requests:\n                        cpu: 1\n                        memory: '1Gi'\n                    limits:\n                        cpu: 1\n                        memory: '1Gi'\n              volumeMounts:\n                    - mountPath: /mnt/ceph_rbd\n                      name: volume\n            restartPolicy: Never\n            volumes:\n                - name: volume\n                  persistentVolumeClaim:\n                    claimName: pytorch-pvc\n
      "},{"location":"services/gpuservice/training/L3_running_a_pytorch_task/#creating-a-job-with-a-pytorch-container","title":"Creating a Job with a PyTorch container","text":"

      We will use the pre-made PyTorch Docker image available on Docker Hub to run the PyTorch ML model.

      The PyTorch container will be held within a pod that has the persistent volume mounted and has access to a MIG GPU.

      Submit the specification file below to K8s to create the job, replacing the queue name with your project namespace queue name.

      kubectl create -f <pytorch-job-yaml>\n
      "},{"location":"services/gpuservice/training/L3_running_a_pytorch_task/#example-pytorch-job-specification-file","title":"Example PyTorch Job Specification File","text":"
      apiVersion: batch/v1\nkind: Job\nmetadata:\n    name: pytorch-job\n    labels:\n        kueue.x-k8s.io/queue-name:  <project namespace>-user-queue\nspec:\n    completions: 1\n    template:\n        metadata:\n            name: pytorch-pod\n        spec:\n            restartPolicy: Never\n            containers:\n            - name: pytorch-con\n              image: pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel\n              command: [\"python3\"]\n              args: [\"/mnt/ceph_rbd/example_pytorch_code.py\"]\n              volumeMounts:\n                - mountPath: /mnt/ceph_rbd\n                  name: volume\n              resources:\n                requests:\n                  cpu: 2\n                  memory: \"1Gi\"\n                limits:\n                  cpu: 4\n                  memory: \"4Gi\"\n                  nvidia.com/gpu: 1\n            nodeSelector:\n                nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB-MIG-1g.5gb\n            volumes:\n                - name: volume\n                  persistentVolumeClaim:\n                    claimName: pytorch-pvc\n
      "},{"location":"services/gpuservice/training/L3_running_a_pytorch_task/#reviewing-the-results-of-the-pytorch-model","title":"Reviewing the results of the PyTorch model","text":"

      This is not intended to be an introduction to PyTorch; please see the online tutorial for details about the model.

      1. Check that the model ran to completion

        kubectl logs <pytorch-pod-name>\n
      2. Spin up a lightweight pod to retrieve results

        kubectl create -f lightweight-pod-job.yaml\n
      3. Copy the trained model back to your access VM

        kubectl cp lightweight-job-<identifier>:/mnt/ceph_rbd/model.pth model.pth\n
      "},{"location":"services/gpuservice/training/L3_running_a_pytorch_task/#using-a-kubernetes-job-to-train-the-pytorch-model-multiple-times","title":"Using a Kubernetes job to train the pytorch model multiple times","text":"

      A common ML training workflow may consist of training multiple versions of a model, such as models with different hyperparameters or models trained on different data sets.

      A Kubernetes job can create and manage multiple pods with identical or different initial parameters.

      NVIDIA provide a detailed tutorial on how to conduct an ML hyperparameter search with a Kubernetes job.

      Below is an example job yaml for running the pytorch model which will continue to create pods until three have successfully completed the task of training the model.

      apiVersion: batch/v1\nkind: Job\nmetadata:\n    name: pytorch-job\n    labels:\n        kueue.x-k8s.io/queue-name:  <project namespace>-user-queue\nspec:\n    completions: 3\n    template:\n        metadata:\n            name: pytorch-pod\n        spec:\n            restartPolicy: Never\n            containers:\n            - name: pytorch-con\n              image: pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel\n              command: [\"python3\"]\n              args: [\"/mnt/ceph_rbd/example_pytorch_code.py\"]\n              volumeMounts:\n                - mountPath: /mnt/ceph_rbd\n                  name: volume\n              resources:\n                requests:\n                  cpu: 2\n                  memory: \"1Gi\"\n                limits:\n                  cpu: 4\n                  memory: \"4Gi\"\n                  nvidia.com/gpu: 1\n            nodeSelector:\n                nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB-MIG-1g.5gb\n            volumes:\n                - name: volume\n                  persistentVolumeClaim:\n                    claimName: pytorch-pvc\n
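The job above runs the same script three times with identical parameters. One way to give each pod different parameters is sketched below in Python, assuming the job spec is extended with completionMode: Indexed (the example above does not set it), in which case K8s exposes the pod's index in the JOB_COMPLETION_INDEX environment variable. The LEARNING_RATES grid and the pick_learning_rate helper are hypothetical and are not part of example_pytorch_code.py.

```python
import os

# Hypothetical hyperparameter grid; the values are for illustration only.
LEARNING_RATES = [0.1, 0.01, 0.001]


def pick_learning_rate(environ=os.environ):
    '''Choose a learning rate from the pod's completion index.

    K8s sets JOB_COMPLETION_INDEX in each pod when the job uses
    completionMode: Indexed; outside such a job we fall back to index 0.
    '''
    index = int(environ.get('JOB_COMPLETION_INDEX', '0'))
    return LEARNING_RATES[index % len(LEARNING_RATES)]


# Simulate the environment of the third pod (index 2) of an indexed job.
print(pick_learning_rate({'JOB_COMPLETION_INDEX': '2'}))
```

A training script could call such a helper at start-up so that the three completions cover three different settings rather than repeating one.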
      "},{"location":"services/gpuservice/training/L3_running_a_pytorch_task/#clean-up","title":"Clean up","text":"
      kubectl delete job pytorch-job\n\nkubectl delete pvc pytorch-pvc\n
      "},{"location":"services/gpuservice/training/L4_template_workflow/","title":"Template workflow","text":""},{"location":"services/gpuservice/training/L4_template_workflow/#requirements","title":"Requirements","text":"

      It is recommended that users complete Getting started with Kubernetes and Requesting persistent volumes With Kubernetes before proceeding with this tutorial.

      "},{"location":"services/gpuservice/training/L4_template_workflow/#overview","title":"Overview","text":"

      An example workflow for code development using K8s is outlined below.

      In theory, users can create docker images with all the code, software and data included to complete their analysis.

      In practice, docker images with the required software can be several gigabytes in size which can lead to unacceptable download times when ~100GB of data and code is then added.

      Therefore, it is recommended to separate code, software, and data preparation into distinct steps:

      1. Data Loading: Loading large data sets asynchronously.

      2. Developing a Docker environment: Manually or automatically building Docker images.

      3. Code development with K8s: Iteratively changing and testing code in a job.

      The workflow describes different strategies to tackle the three common stages in code development and analysis using the EIDF GPU Service.

      The three stages are interchangeable and may not be relevant to every project.

      Some strategies in the workflow require a GitHub account and Docker Hub account for automatic building (this can be adapted for other platforms such as GitLab).

      "},{"location":"services/gpuservice/training/L4_template_workflow/#data-loading","title":"Data loading","text":"

      The EIDF GPU Service contains GPUs with 40 GB/80 GB of on-board memory and it is expected that data sets of > 100 GB will be loaded onto the service to utilise this hardware.

      Persistent volume claims need to be of sufficient size to hold the input data, any expected output data and a small amount of additional empty space to facilitate IO.
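The sizing rule can be written as simple arithmetic. The Python sketch below is illustrative only: the pvc_size_gib helper is hypothetical and the 10% default headroom is an assumption, since the text above only asks for a small amount of additional space.

```python
def pvc_size_gib(input_gib, output_gib, headroom_percent=10):
    '''Suggest a PVC size in whole GiB: input plus output plus spare space.

    Sizes are whole numbers of GiB. The 10% default headroom is an
    assumption, not a service recommendation.
    '''
    total = input_gib + output_gib
    # Integer ceiling division avoids floating point rounding surprises.
    return -(-total * (100 + headroom_percent) // 100)


# 500 GiB of input, 50 GiB of expected output, 10% spare space.
print(str(pvc_size_gib(500, 50)) + 'Gi')
```

The printed value can be used directly as the storage request in a PVC specification.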

      Read the requesting persistent volumes with Kubernetes lesson to learn how to request and mount persistent volumes to pods.

      It often takes several hours or days to download data sets of 1/2 TB or more to a persistent volume.

      Therefore, the data download step needs to be completed asynchronously, as maintaining a connection to the server for long periods of time can be unreliable.

      "},{"location":"services/gpuservice/training/L4_template_workflow/#asynchronous-data-downloading-with-a-lightweight-job","title":"Asynchronous data downloading with a lightweight job","text":"
      1. Check a PVC has been created.

        kubectl -n <project-namespace> get pvc template-workflow-pvc\n
      2. Write a job yaml with PV mounted and a command to download the data. Change the curl URL to your data set of interest.

        apiVersion: batch/v1\nkind: Job\nmetadata:\n name: lightweight-job\n labels:\n  kueue.x-k8s.io/queue-name: <project-namespace>-user-queue\nspec:\n completions: 1\n parallelism: 1\n template:\n  metadata:\n   name: lightweight-job\n  spec:\n   restartPolicy: Never\n   containers:\n   - name: data-loader\n     image: alpine/curl:latest\n     command: ['sh', '-c', \"cd /mnt/ceph_rbd; curl https://archive.ics.uci.edu/static/public/53/iris.zip -o iris.zip\"]\n     resources:\n      requests:\n       cpu: 1\n       memory: \"1Gi\"\n      limits:\n       cpu: 1\n       memory: \"1Gi\"\n     volumeMounts:\n     - mountPath: /mnt/ceph_rbd\n       name: volume\n   volumes:\n   - name: volume\n     persistentVolumeClaim:\n      claimName: template-workflow-pvc\n
      3. Run the data download job.

        kubectl -n <project-namespace> create -f lightweight-pod.yaml\n
      4. Check if the download has completed.

        kubectl -n <project-namespace> get jobs\n
      5. Delete the lightweight job once completed.

        kubectl -n <project-namespace> delete job lightweight-job\n
      "},{"location":"services/gpuservice/training/L4_template_workflow/#asynchronous-data-downloading-within-a-screen-session","title":"Asynchronous data downloading within a screen session","text":"

      Screen is a window manager available in Linux that allows you to create multiple interactive shells and swap between them.

      Screen has the added benefit that if your remote session is interrupted the screen session persists and can be reattached when you manage to reconnect.

      This allows you to start a task, such as downloading a data set, and check in on it asynchronously.

      Once you have started a screen session, you can create a new window with ctrl-a c, swap between windows with ctrl-a 0-9 and exit screen (but keep any task running) with ctrl-a d.

      Using screen rather than a single download job can be helpful if downloading multiple data sets or if you intend to do some simple QC or tidying up before/after downloading.

      1. Start a screen session.

        screen\n
      2. Create an interactive lightweight job session.

        apiVersion: batch/v1\nkind: Job\nmetadata:\n name: lightweight-job\n labels:\n  kueue.x-k8s.io/queue-name: <project-namespace>-user-queue\nspec:\n completions: 1\n parallelism: 1\n template:\n  metadata:\n   name: lightweight-pod\n  spec:\n   restartPolicy: Never\n   containers:\n   - name: data-loader\n     image: alpine/curl:latest\n     command: ['sleep','infinity']\n     resources:\n      requests:\n       cpu: 1\n       memory: \"1Gi\"\n      limits:\n       cpu: 1\n       memory: \"1Gi\"\n     volumeMounts:\n     - mountPath: /mnt/ceph_rbd\n       name: volume\n   volumes:\n   - name: volume\n     persistentVolumeClaim:\n      claimName: template-workflow-pvc\n
      3. Download data set. Change the curl URL to your data set of interest.

        kubectl -n <project-namespace> exec <lightweight-pod-name> -- curl https://archive.ics.uci.edu/static/public/53/iris.zip -o /mnt/ceph_rbd/iris.zip\n
      4. Exit the remote session by either ending the session or ctrl-a d.

      5. Reconnect at a later time and reattach the screen window.

        screen -list\n\nscreen -r <session-name>\n
      6. Check the download was successful and delete the job.

        kubectl -n <project-namespace> exec <lightweight-pod-name> -- ls /mnt/ceph_rbd/\n\nkubectl -n <project-namespace> delete job lightweight-job\n
      7. Exit the screen session.

        exit\n
      "},{"location":"services/gpuservice/training/L4_template_workflow/#preparing-a-custom-docker-image","title":"Preparing a custom Docker image","text":"

      Kubernetes requires Docker images to be pre-built and available for download from a container repository such as Docker Hub.

      It does not provide functionality to build images and create pods from Dockerfiles.

      However, use cases may require some custom modifications of a base image, such as adding a Python library.

      These custom images need to be built locally (using docker) or online (using a GitHub/GitLab worker) and pushed to a repository such as Docker Hub.

      This is not an introduction to building docker images; please see the Docker tutorial for a general overview.

      "},{"location":"services/gpuservice/training/L4_template_workflow/#manually-building-a-docker-image-locally","title":"Manually building a Docker image locally","text":"
      1. Select a suitable base image (The Nvidia container catalog is often a useful starting place for GPU accelerated tasks). We'll use the base RAPIDS image.

      2. Create a Dockerfile to add any additional packages required to the base image.

        FROM nvcr.io/nvidia/rapidsai/base:23.12-cuda12.0-py3.10\nRUN pip install pandas\nRUN pip install plotly\n
      3. Build the Docker image locally (You will need to install Docker)

        cd <dockerfile-folder>\n\ndocker build . -t <docker-hub-username>/template-docker-image:latest\n

      Building images for different CPU architectures

      Be aware that docker images built for Apple ARM64 architectures will not function optimally on the EIDF GPU Service's AMD64 based architecture.

      If building docker images locally on an Apple device you must tell the docker daemon to use AMD64 based images by passing the --platform linux/amd64 flag to the build command.

      4. Create a repository to hold the image on Docker Hub (You will need to create and set up an account).

      5. Push the Docker image to the repository.

        docker push <docker-hub-username>/template-docker-image:latest\n
      6. Finally, specify your Docker image in the image: tag of the job specification yaml file.

        apiVersion: batch/v1\nkind: Job\nmetadata:\n name: template-workflow-job\n labels:\n  kueue.x-k8s.io/queue-name: <project-namespace>-user-queue\nspec:\n completions: 1\n parallelism: 1\n template:\n  spec:\n   restartPolicy: Never\n   containers:\n   - name: template-docker-image\n     image: <docker-hub-username>/template-docker-image:latest\n     command: [\"sleep\", \"infinity\"]\n     resources:\n      requests:\n       cpu: 1\n       memory: \"4Gi\"\n      limits:\n       cpu: 1\n       memory: \"8Gi\"\n
      "},{"location":"services/gpuservice/training/L4_template_workflow/#automatically-building-docker-images-using-github-actions","title":"Automatically building docker images using GitHub Actions","text":"

      In cases where the Docker image needs to be built and tested iteratively (e.g. to check for compatibility issues), git version control and GitHub Actions can simplify the build process.

      A GitHub action can build and push a Docker image to Docker Hub whenever it detects a git push that changes the Dockerfile in a git repo.

      This process requires you to already have a GitHub and Docker Hub account.

      1. Create an access token on your Docker Hub account to allow GitHub to push changes to the Docker Hub image repo.

      2. Create two GitHub secrets to securely provide your Docker Hub username and access token.

      3. Add the Dockerfile to a code/docker folder within an active GitHub repo.

      4. Add the GitHub action yaml file below to the .github/workflows folder to automatically push a new image to Docker Hub if any changes to files in the code/docker folder are detected.

        name: ci\non:\n  push:\n    paths:\n      - 'code/docker/**'\n\njobs:\n  docker:\n    runs-on: ubuntu-latest\n    steps:\n      -\n        name: Set up QEMU\n        uses: docker/setup-qemu-action@v3\n      -\n        name: Set up Docker Buildx\n        uses: docker/setup-buildx-action@v3\n      -\n        name: Login to Docker Hub\n        uses: docker/login-action@v3\n        with:\n          username: ${{ secrets.DOCKERHUB_USERNAME }}\n          password: ${{ secrets.DOCKERHUB_TOKEN }}\n      -\n        name: Build and push\n        uses: docker/build-push-action@v5\n        with:\n          context: \"{{defaultContext}}:code/docker\"\n          push: true\n          tags: <target-dockerhub-image-name>\n
      5. Push a change to the Dockerfile and check that the Docker Hub image is updated.

      "},{"location":"services/gpuservice/training/L4_template_workflow/#code-development-with-k8s","title":"Code development with K8s","text":"

      Production code can be included within a Docker image to aid reproducibility as the specific software versions required to run the code are packaged together.

      However, binding the code to the Docker image during development can delay the testing cycle, as re-downloading all of the software for every change in a code block can take time.

      If the Docker image is consistent across tests, then it can be cached locally on the EIDF GPU Service instead of being re-downloaded (this occurs automatically, although the cache is node specific and is not shared across nodes).

      A pod yaml file can be defined to automatically pull the latest code version before running any tests.

      Reducing the download time to fractions of a second allows rapid testing to be completed on the cluster with just the kubectl create command.

      You must already have a GitHub account to follow this process.

      This process allows code development to be conducted on any device/VM with access to the repo (GitHub/GitLab).

      A template GitHub repo with sample code, k8s yaml files and a Docker build GitHub Action is available here.

      "},{"location":"services/gpuservice/training/L4_template_workflow/#create-a-job-that-downloads-and-runs-the-latest-code-version-at-runtime","title":"Create a job that downloads and runs the latest code version at runtime","text":"
      1. Write a standard yaml file for a k8s job with the required resources and custom docker image (example below)

        apiVersion: batch/v1\nkind: Job\nmetadata:\n name: template-workflow-job\n labels:\n  kueue.x-k8s.io/queue-name: <project-namespace>-user-queue\nspec:\n completions: 1\n parallelism: 1\n template:\n  spec:\n   restartPolicy: Never\n   containers:\n   - name: template-docker-image\n     image: <docker-hub-username>/template-docker-image:latest\n     command: [\"sleep\", \"infinity\"]\n     resources:\n      requests:\n       cpu: 1\n       memory: \"4Gi\"\n      limits:\n       cpu: 1\n       memory: \"8Gi\"\n     volumeMounts:\n     - mountPath: /mnt/ceph_rbd\n       name: volume\n   volumes:\n   - name: volume\n     persistentVolumeClaim:\n      claimName: template-workflow-pvc\n
      2. Add an initial container that runs before the main container to download the latest version of the code.

        apiVersion: batch/v1\nkind: Job\nmetadata:\n name: template-workflow-job\n labels:\n  kueue.x-k8s.io/queue-name: <project-namespace>-user-queue\nspec:\n completions: 1\n parallelism: 1\n template:\n  spec:\n   restartPolicy: Never\n   containers:\n   - name: template-docker-image\n     image: <docker-hub-username>/template-docker-image:latest\n     command: [\"sleep\", \"infinity\"]\n     resources:\n      requests:\n       cpu: 1\n       memory: \"4Gi\"\n      limits:\n       cpu: 1\n       memory: \"8Gi\"\n     volumeMounts:\n     - mountPath: /mnt/ceph_rbd\n       name: volume\n     - mountPath: /code\n       name: github-code\n   initContainers:\n   - name: lightweight-git-container\n     image: cicirello/alpine-plus-plus\n     command: ['sh', '-c', \"cd /code; git clone <target-repo>\"]\n     resources:\n      requests:\n       cpu: 1\n       memory: \"4Gi\"\n      limits:\n       cpu: 1\n       memory: \"8Gi\"\n     volumeMounts:\n     - mountPath: /code\n       name: github-code\n   volumes:\n   - name: volume\n     persistentVolumeClaim:\n      claimName: template-workflow-pvc\n   - name: github-code\n     emptyDir:\n      sizeLimit: 1Gi\n
      3. Change the command argument in the main container to run the code once started. Add the URL of the GitHub repo of interest to the initContainers: command: tag.

        apiVersion: batch/v1\nkind: Job\nmetadata:\n name: template-workflow-job\n labels:\n  kueue.x-k8s.io/queue-name: <project-namespace>-user-queue\nspec:\n completions: 1\n parallelism: 1\n template:\n  spec:\n   restartPolicy: Never\n   containers:\n   - name: template-docker-image\n     image: <docker-hub-username>/template-docker-image:latest\n     command: ['sh', '-c', \"python3 /code/<python-script>\"]\n     resources:\n      requests:\n       cpu: 10\n       memory: \"40Gi\"\n      limits:\n       cpu: 10\n       memory: \"80Gi\"\n       nvidia.com/gpu: 1\n     volumeMounts:\n     - mountPath: /mnt/ceph_rbd\n       name: volume\n     - mountPath: /code\n       name: github-code\n   initContainers:\n   - name: lightweight-git-container\n     image: cicirello/alpine-plus-plus\n     command: ['sh', '-c', \"cd /code; git clone <target-repo>\"]\n     resources:\n      requests:\n       cpu: 1\n       memory: \"4Gi\"\n      limits:\n       cpu: 1\n       memory: \"8Gi\"\n     volumeMounts:\n     - mountPath: /code\n       name: github-code\n   volumes:\n   - name: volume\n     persistentVolumeClaim:\n      claimName: template-workflow-pvc\n   - name: github-code\n     emptyDir:\n      sizeLimit: 1Gi\n
      4. Submit the yaml file to Kubernetes.

        kubectl -n <project-namespace> create -f <job-yaml-file>\n
      "},{"location":"services/graphcore/","title":"Overview","text":"

      EIDF hosts a Graphcore Bow Pod64 system for AI acceleration.

      The specification of the Bow Pod64 is:

      For more details about the IPU architecture, see documentation from Graphcore.

      The smallest unit of compute resource that can be requested is a single IPU.

      Similarly to the EIDF GPU Service, usage of the Graphcore is managed using Kubernetes.

      "},{"location":"services/graphcore/#service-access","title":"Service Access","text":"

      Access to the Graphcore accelerator is provisioned through the EIDF GPU Service.

      Users should apply for access to Graphcore via the EIDF GPU Service.

      "},{"location":"services/graphcore/#project-quotas","title":"Project Quotas","text":"

      Currently there is no active quota mechanism on the Graphcore accelerator. IPUJobs should be actively using partitions on the Graphcore.

      "},{"location":"services/graphcore/#graphcore-tutorial","title":"Graphcore Tutorial","text":"

      The following tutorial teaches users how to submit tasks to the Graphcore system. This tutorial assumes basic familiarity with submitting jobs via Kubernetes. For a tutorial on using Kubernetes, see the GPU service tutorial. For more in-depth lessons about developing applications for Graphcore, see the general documentation and guide for creating IPU jobs via Kubernetes.

      | Lesson | Objective |
      |---|---|
      | Getting started with IPU jobs | a. How to send an IPUJob. b. Monitoring and cancelling your IPUJob. |
      | Multi-IPU Jobs | a. Using multiple IPUs for distributed training. |
      | Profiling with PopVision | a. Enabling profiling in your code. b. Downloading the profile reports. |
      | Other Frameworks | a. Using TensorFlow and PopART. b. Writing IPU programs with PopLibs (C++). |"},{"location":"services/graphcore/#further-reading-and-help","title":"Further Reading and Help","text":""},{"location":"services/graphcore/faq/","title":"Graphcore FAQ","text":""},{"location":"services/graphcore/faq/#graphcore-questions","title":"Graphcore Questions","text":""},{"location":"services/graphcore/faq/#how-do-i-delete-a-runningterminated-pod","title":"How do I delete a running/terminated pod?","text":"

      An IPUJob manages its launcher and worker pods, so the pods are deleted when the IPUJob is deleted using kubectl delete ipujobs <IPUJob-name>. If only the pod is deleted via kubectl delete pod, the IPUJob may respawn the pod.

      To see running or terminated IPUJobs, run kubectl get ipujobs.

      "},{"location":"services/graphcore/faq/#my-ipujob-died-with-a-message-poptorch_cpp_error-failed-to-acquire-x-ipus-why","title":"My IPUJob died with a message: 'poptorch_cpp_error': Failed to acquire X IPU(s). Why?","text":"

      This error may appear when the IPUJob name is too long.

      We have identified that for IPUJobs with metadata:name length over 36 characters, this error may appear. A solution is to reduce the name to under 36 characters.
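The length check described above can be sketched in Python before submitting a job. This is a minimal illustration, not part of the EIDF tooling: the helper names are hypothetical, and the 5-character suffix that Kubernetes appends when generateName is used is an assumption that should be checked against your cluster.

```python
# Hypothetical pre-submission check for IPUJob names.
# NAME_LIMIT comes from the FAQ entry above; the suffix length is an assumption.
NAME_LIMIT = 36
ASSUMED_SUFFIX_LENGTH = 5


def effective_name_length(name=None, generate_name=None):
    '''Return the expected length of metadata.name for an IPUJob.'''
    if name is not None:
        return len(name)
    if generate_name is not None:
        # generateName is a prefix; a random suffix is appended on creation.
        return len(generate_name) + ASSUMED_SUFFIX_LENGTH
    raise ValueError('either name or generate_name is required')


def name_is_safe(name=None, generate_name=None, limit=NAME_LIMIT):
    '''True when the expected name length stays under the limit.'''
    return effective_name_length(name, generate_name) < limit


print(name_is_safe(generate_name='mnist-training-'))
```

The check is deliberately strict (under 36 characters), following the suggested fix rather than the exact threshold at which the error starts to appear.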

      "},{"location":"services/graphcore/training/L1_getting_started/","title":"Getting started with Graphcore IPU Jobs","text":"

      This guide assumes basic familiarity with Kubernetes (K8s) and usage of kubectl. See the GPU service tutorial to get started.

      "},{"location":"services/graphcore/training/L1_getting_started/#introduction","title":"Introduction","text":"

      Graphcore provides prebuilt Docker containers (full lists here) which contain the required libraries (PyTorch, TensorFlow, Poplar etc.) and can be used directly within K8s to run on the Graphcore IPUs.

      In this tutorial we will cover running training with a single IPU. The subsequent tutorial will cover using multiple IPUs, which can be used for distributed training jobs.

      "},{"location":"services/graphcore/training/L1_getting_started/#creating-your-first-ipu-job","title":"Creating your first IPU job","text":"

      For our first IPU job, we will be using the Graphcore PyTorch (PopTorch) container image (graphcore/pytorch:3.3.0) to run a simple example of training a neural network for classification on the MNIST dataset, which is provided here. More applications can be found in the repository https://github.com/graphcore/examples.

      To get started:

      1. to specify the job - create the file mnist-training-ipujob.yaml, then copy and save the following content into the file:
      apiVersion: graphcore.ai/v1alpha1\nkind: IPUJob\nmetadata:\n  generateName: mnist-training-\nspec:\n  # jobInstances defines the number of job instances.\n  # More than 1 job instance is usually useful for inference jobs only.\n  jobInstances: 1\n  # ipusPerJobInstance refers to the number of IPUs required per job instance.\n  # A separate IPU partition of this size will be created by the IPU Operator\n  # for each job instance.\n  ipusPerJobInstance: \"1\"\n  workers:\n    template:\n      spec:\n        containers:\n        - name: mnist-training\n          image: graphcore/pytorch:3.3.0\n          command: [/bin/bash, -c, --]\n          args:\n            - |\n              cd;\n              mkdir build;\n              cd build;\n              git clone https://github.com/graphcore/examples.git;\n              cd examples/tutorials/simple_applications/pytorch/mnist;\n              python -m pip install -r requirements.txt;\n              python mnist_poptorch_code_only.py --epochs 1\n          resources:\n            limits:\n              cpu: 32\n              memory: 200Gi\n          securityContext:\n            capabilities:\n              add:\n              - IPC_LOCK\n          volumeMounts:\n          - mountPath: /dev/shm\n            name: devshm\n        restartPolicy: Never\n        hostIPC: true\n        volumes:\n        - emptyDir:\n            medium: Memory\n            sizeLimit: 10Gi\n          name: devshm\n
      2. to submit the job - run kubectl create -f mnist-training-ipujob.yaml, which will give the following output:

        ipujob.graphcore.ai/mnist-training-<random string> created\n
      3. to monitor progress of the job - run kubectl get pods, which will give the following output:

        NAME                      READY   STATUS      RESTARTS   AGE\nmnist-training-<random string>-worker-0   0/1     Completed   0          2m56s\n
      4. to read the result - run kubectl logs mnist-training-<random string>-worker-0, which will give the following output (or similar):

      ...\nGraph compilation: 100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 100/100 [00:23<00:00]\nEpochs: 100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 1/1 [00:34<00:00, 34.18s/it]\n...\nAccuracy on test set: 97.08%\n
      "},{"location":"services/graphcore/training/L1_getting_started/#monitoring-and-cancelling-your-ipu-job","title":"Monitoring and Cancelling your IPU job","text":"

      An IPU job creates an IPU Operator, which manages the required worker or launcher pods. To see running or completed IPUJobs, run kubectl get ipujobs, which will show:

      NAME             STATUS      CURRENT   DESIRED   LASTMESSAGE          AGE\nmnist-training   Completed   0         1         All instances done   10m\n

      To delete the IPUJob, run kubectl delete ipujobs <job-name>, e.g. kubectl delete ipujobs mnist-training-<random string>. This will also delete the associated worker pod mnist-training-<random string>-worker-0.

      Note: simply deleting the pod via kubectl delete pods mnist-training-<random-string>-worker-0 does not delete the IPU job, which will need to be deleted separately.

      Note: you can list all pods via kubectl get all or kubectl get pods, but they do not show the ipujobs. These can be obtained using kubectl get ipujobs.

      Note: kubectl describe pod <pod-name> provides a verbose description of a specific pod.

      "},{"location":"services/graphcore/training/L1_getting_started/#description","title":"Description","text":"

      The Graphcore IPU Operator (Kubernetes interface) extends the Kubernetes API by introducing a custom resource definition (CRD) named IPUJob, which can be seen at the beginning of the included yaml file:

      apiVersion: graphcore.ai/v1alpha1\nkind: IPUJob\n

      An IPUJob allows users to define workloads that can use IPUs. There are several fields specific to an IPUJob:

      jobInstances : This defines the number of job instances. In the case of training it should be 1.

      ipusPerJobInstance : This defines the size of IPU partition that will be created for each job instance.

      workers : This defines a Pod specification that will be used for Worker Pods, including the container image and commands.

      These fields have been populated in the example .yaml file. For distributed training (with multiple IPUs), additional fields need to be included, which will be described in the next lesson.
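To make the structure of these fields concrete, the sketch below builds a reduced IPUJob manifest as a Python dictionary and prints it as JSON, a format kubectl also accepts. The function name and its arguments are hypothetical, and the manifest deliberately omits the resources, securityContext and /dev/shm volume settings used in the full example, so treat it as an illustration of the layout rather than a ready-to-submit job.

```python
import json


def make_ipujob(name_prefix, image, command, ipus=1, job_instances=1):
    '''Build a reduced IPUJob manifest as a plain dictionary (hypothetical helper).'''
    container = {
        'name': name_prefix.rstrip('-'),
        'image': image,
        'command': command,
    }
    return {
        'apiVersion': 'graphcore.ai/v1alpha1',
        'kind': 'IPUJob',
        'metadata': {'generateName': name_prefix},
        'spec': {
            # Number of job instances; 1 for training.
            'jobInstances': job_instances,
            # Size of the IPU partition per instance, written as a string
            # to match the example yaml file.
            'ipusPerJobInstance': str(ipus),
            # Pod specification used for the worker pods.
            'workers': {
                'template': {
                    'spec': {
                        'containers': [container],
                        'restartPolicy': 'Never',
                    }
                }
            },
        },
    }


manifest = make_ipujob(
    'mnist-training-',
    'graphcore/pytorch:3.3.0',
    ['/bin/bash', '-c', 'echo hello'],
)
print(json.dumps(manifest, indent=2))
```

Generating the manifest from code can help when the same job is submitted repeatedly with different IPU counts or commands.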

      "},{"location":"services/graphcore/training/L1_getting_started/#additional-information","title":"Additional Information","text":"

      It is possible to further specify the restart policy (Always/OnFailure/Never/ExitCode) and clean up policy (Workers/All/None); see here.

      "},{"location":"services/graphcore/training/L2_multiple_IPU/","title":"Distributed training on multiple IPUs","text":"

      In this tutorial, we will cover how to run larger models, including examples provided by Graphcore on https://github.com/graphcore/examples. These may require distributed training on multiple IPUs.

      The number of IPUs requested must be a power of two, i.e. 1, 2, 4, 8, 16, 32, or 64.
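As a small illustration of this rule, the following Python sketch checks whether a requested IPU count is acceptable. The function name is hypothetical, and the upper bound of 64 is taken from the list of values above.

```python
MAX_IPUS = 64  # largest value in the list of permitted IPU counts


def is_valid_ipu_count(n):
    '''True when n is a power of two between 1 and MAX_IPUS.'''
    # A positive power of two has exactly one bit set, so n & (n - 1) is zero.
    return isinstance(n, int) and 1 <= n <= MAX_IPUS and (n & (n - 1)) == 0


valid_counts = [n for n in range(1, MAX_IPUS + 1) if is_valid_ipu_count(n)]
print(valid_counts)
```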

      "},{"location":"services/graphcore/training/L2_multiple_IPU/#first-example","title":"First example","text":"

      As an example, we will use 4 IPUs to perform the pre-training step of BERT, an NLP transformer model. The code is available from https://github.com/graphcore/examples/tree/master/nlp/bert/pytorch.

      To get started, save and create an IPUJob with the following .yaml file:

      apiVersion: graphcore.ai/v1alpha1\nkind: IPUJob\nmetadata:\n  generateName: bert-training-multi-ipu-\nspec:\n  jobInstances: 1\n  ipusPerJobInstance: \"4\"\n  workers:\n    template:\n      spec:\n        containers:\n        - name: bert-training-multi-ipu\n          image: graphcore/pytorch:3.3.0\n          command: [/bin/bash, -c, --]\n          args:\n            - |\n              cd ;\n              mkdir build;\n              cd build ;\n              git clone https://github.com/graphcore/examples.git;\n              cd examples/nlp/bert/pytorch;\n              apt update ;\n              apt upgrade -y;\n              DEBIAN_FRONTEND=noninteractive TZ='Europe/London' apt install $(< required_apt_packages.txt) -y ;\n              pip3 install -r requirements.txt ;\n              python3 run_pretraining.py --dataset generated --config pretrain_base_128_pod4 --training-steps 1\n          resources:\n            limits:\n              cpu: 32\n              memory: 200Gi\n          securityContext:\n            capabilities:\n              add:\n              - IPC_LOCK\n          volumeMounts:\n          - mountPath: /dev/shm\n            name: devshm\n        restartPolicy: Never\n        hostIPC: true\n        volumes:\n        - emptyDir:\n            medium: Memory\n            sizeLimit: 10Gi\n          name: devshm\n

      Running the above IPUJob and querying the log via kubectl logs pod/bert-training-multi-ipu-<random string>-worker-0 should give:

      ...\nData loaded in 8.559805537108332 secs\n-----------------------------------------------------------\n-------------------- Device Allocation --------------------\nEmbedding  --> IPU 0\nEncoder 0  --> IPU 1\nEncoder 1  --> IPU 1\nEncoder 2  --> IPU 1\nEncoder 3  --> IPU 1\nEncoder 4  --> IPU 2\nEncoder 5  --> IPU 2\nEncoder 6  --> IPU 2\nEncoder 7  --> IPU 2\nEncoder 8  --> IPU 3\nEncoder 9  --> IPU 3\nEncoder 10 --> IPU 3\nEncoder 11 --> IPU 3\nPooler     --> IPU 0\nClassifier --> IPU 0\n-----------------------------------------------------------\n---------- Compilation/Loading from Cache Started ---------\n\n...\n\nGraph compilation: 100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 100/100 [08:02<00:00]\nCompiled/Loaded model in 500.756152929971 secs\n-----------------------------------------------------------\n--------------------- Training Started --------------------\nStep: 0 / 0 - LR: 0.00e+00 - total loss: 10.817 - mlm_loss: 10.386 - nsp_loss: 0.432 - mlm_acc: 0.000 % - nsp_acc: 1.000 %:   0%|          | 0/1 [00:16<?, ?it/s, throughput: 4035.0 samples/sec]\n-----------------------------------------------------------\n-------------------- Training Metrics ---------------------\nglobal_batch_size: 65536\ndevice_iterations: 1\ntraining_steps: 1\nTraining time: 16.245 secs\n-----------------------------------------------------------\n
      "},{"location":"services/graphcore/training/L2_multiple_IPU/#details","title":"Details","text":"

      In this example, we have requested 4 IPUs:

      ipusPerJobInstance: \"4\"\n

      The python flag --config pretrain_base_128_pod4 uses one of the preset configurations for this model with 4 IPUs. Here we also use the --dataset generated flag to generate data rather than download the required dataset.

      To provide sufficient shm for the IPU pod, it may be necessary to mount /dev/shm as follows:

                volumeMounts:\n          - mountPath: /dev/shm\n            name: devshm\n        volumes:\n        - emptyDir:\n            medium: Memory\n            sizeLimit: 10Gi\n          name: devshm\n

      It is also required to set spec.hostIPC to true:

        hostIPC: true\n

      and add a securityContext to the container definition that enables the IPC_LOCK capability:

          securityContext:\n      capabilities:\n        add:\n        - IPC_LOCK\n

      Note: IPC_LOCK allows the RDMA software stack to use pinned memory, which is particularly useful for PyTorch dataloaders, which can be very memory hungry. This matters because all data going to the IPUs travels over the network interfaces (100 Gbps Ethernet).

      "},{"location":"services/graphcore/training/L2_multiple_IPU/#memory-usage","title":"Memory usage","text":"

      In general, the graph compilation phase of running large models can require significant memory, while the execution phase requires far less.

      In the example above, it is possible to explicitly request the memory via:

                resources:\n            limits:\n              memory: \"128Gi\"\n            requests:\n              memory: \"128Gi\"\n

      which will succeed. (The graph compilation fails if only 32Gi is requested.)

      As a general guideline, 128 GB of memory should be enough for the majority of tasks, and requirements rarely exceed 200 GB even for jobs with a high IPU count. In the example .yaml script, we do not specifically request the memory.

      "},{"location":"services/graphcore/training/L2_multiple_IPU/#scaling-up-ipu-count-and-using-poprun","title":"Scaling up IPU count and using Poprun","text":"

      In the example above, python is launched directly in the pod. When scaling up the number of IPUs (e.g. above 8 IPUs), the job may run into a CPU bottleneck. This may be observed when the throughput scales sub-linearly with the number of data-parallel replicas (i.e. when doubling the IPU count, the performance does not double). This can also be verified by profiling the application and observing a significant proportion of runtime spent on host CPU workload.
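One way to quantify sub-linear scaling is to compare the measured speedup with the ideal speedup. The Python sketch below does this for two runs; the function name is hypothetical and the throughput figures are invented for illustration, not measurements from the EIDF system.

```python
def scaling_efficiency(base_ipus, base_throughput, scaled_ipus, scaled_throughput):
    '''Ratio of measured speedup to ideal (linear) speedup; 1.0 means linear scaling.'''
    measured_speedup = scaled_throughput / base_throughput
    ideal_speedup = scaled_ipus / base_ipus
    return measured_speedup / ideal_speedup


# Hypothetical throughputs in samples/sec for the same model on 8 and 16 IPUs.
efficiency = scaling_efficiency(8, 1000.0, 16, 1500.0)
print(f'{efficiency:.2f}')
```

An efficiency well below 1.0 when doubling the IPU count suggests that something other than the IPUs, such as host CPU work, is limiting throughput. The comparison is only meaningful between runs of the same model and configuration.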

      In this case, Poprun can be used to launch multiple instances. As an example, we will save the following .yaml configuration and run:

      apiVersion: graphcore.ai/v1alpha1\nkind: IPUJob\nmetadata:\n  generateName: bert-poprun-64ipus-\nspec:\n  jobInstances: 1\n  modelReplicasPerWorker: \"16\"\n  ipusPerJobInstance: \"64\"\n  workers:\n    template:\n      spec:\n        containers:\n        - name: bert-poprun-64ipus\n          image: graphcore/pytorch:3.3.0\n          command: [/bin/bash, -c, --]\n          args:\n            - |\n              cd ;\n              mkdir build;\n              cd build ;\n              git clone https://github.com/graphcore/examples.git;\n              cd examples/nlp/bert/pytorch;\n              apt update ;\n              apt upgrade -y;\n              DEBIAN_FRONTEND=noninteractive TZ='Europe/London' apt install $(< required_apt_packages.txt) -y ;\n              pip3 install -r requirements.txt ;\n              OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1 OMPI_ALLOW_RUN_AS_ROOT=1 \\\n              poprun \\\n              --allow-run-as-root 1 \\\n              --vv \\\n              --num-instances 1 \\\n              --num-replicas 16 \\\n               --mpi-global-args=\"--tag-output\" \\\n              --ipus-per-replica 4 \\\n              python3 run_pretraining.py \\\n              --config pretrain_large_128_POD64 \\\n              --dataset generated --training-steps 1\n          resources:\n            limits:\n              cpu: 32\n              memory: 200Gi\n          securityContext:\n            capabilities:\n              add:\n              - IPC_LOCK\n          volumeMounts:\n          - mountPath: /dev/shm\n            name: devshm\n        restartPolicy: Never\n        hostIPC: true\n        volumes:\n        - emptyDir:\n            medium: Memory\n            sizeLimit: 10Gi\n          name: devshm\n

      Inspecting the log via kubectl logs <pod-name> should produce:

      ...\n ===========================================================================================\n|                                      poprun topology                                      |\n|===========================================================================================|\n10:10:50.154 1 POPRUN [D] Done polling, final state of p-bert-poprun-64ipus-gc-dev-0: PS_ACTIVE\n10:10:50.154 1 POPRUN [D] Target options from environment: {}\n| hosts     |                                   localhost                                   |\n|-----------|-------------------------------------------------------------------------------|\n| ILDs      |                                       0                                       |\n|-----------|-------------------------------------------------------------------------------|\n| instances |                                       0                                       |\n|-----------|-------------------------------------------------------------------------------|\n| replicas  | 0  | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10 | 11 | 12 | 13 | 14 | 15 |\n -------------------------------------------------------------------------------------------\n10:10:50.154 1 POPRUN [D] Target options from V-IPU partition: {\"ipuLinkDomainSize\":\"64\",\"ipuLinkConfiguration\":\"slidingWindow\",\"ipuLinkTopology\":\"torus\",\"gatewayMode\":\"true\",\"instanceSize\":\"64\"}\n10:10:50.154 1 POPRUN [D] Using target options: {\"ipuLinkDomainSize\":\"64\",\"ipuLinkConfiguration\":\"slidingWindow\",\"ipuLinkTopology\":\"torus\",\"gatewayMode\":\"true\",\"instanceSize\":\"64\"}\n10:10:50.203 1 POPRUN [D] No hosts specified; ignoring host-subnet setting\n10:10:50.203 1 POPRUN [D] Default network/RNIC for host communication: None\n10:10:50.203 1 POPRUN [I] Running command: /opt/poplar/bin/mpirun '--tag-output' '--bind-to' 'none' '--tag-output'\n'--allow-run-as-root' '-np' '1' '-x' 'POPDIST_NUM_TOTAL_REPLICAS=16' '-x' 
'POPDIST_NUM_IPUS_PER_REPLICA=4' '-x'\n'POPDIST_NUM_LOCAL_REPLICAS=16' '-x' 'POPDIST_UNIFORM_REPLICAS_PER_INSTANCE=1' '-x' 'POPDIST_REPLICA_INDEX_OFFSET=0' '-x'\n'POPDIST_LOCAL_INSTANCE_INDEX=0' '-x' 'IPUOF_VIPU_API_HOST=10.21.21.129' '-x' 'IPUOF_VIPU_API_PORT=8090' '-x'\n'IPUOF_VIPU_API_PARTITION_ID=p-bert-poprun-64ipus-gc-dev-0' '-x' 'IPUOF_VIPU_API_TIMEOUT=120' '-x' 'IPUOF_VIPU_API_GCD_ID=0'\n'-x' 'IPUOF_LOG_LEVEL=WARN' '-x' 'PATH' '-x' 'LD_LIBRARY_PATH' '-x' 'PYTHONPATH' '-x' 'POPLAR_TARGET_OPTIONS=\n{\"ipuLinkDomainSize\":\"64\",\"ipuLinkConfiguration\":\"slidingWindow\",\"ipuLinkTopology\":\"torus\",\"gatewayMode\":\"true\",\n\"instanceSize\":\"64\"}' 'python3' 'run_pretraining.py' '--config' 'pretrain_large_128_POD64' '--dataset' 'generated' '--training-steps' '1'\n10:10:50.204 1 POPRUN [I] Waiting for mpirun (PID 4346)\n[1,0]<stderr>:    Registered metric hook: total_compiling_time with object: <function get_results_for_compile_time at 0x7fe0a6e8af70>\n[1,0]<stderr>:Using config: pretrain_large_128_POD64\n...\nGraph compilation: 100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 100/100 [10:11<00:00][1,0]<stderr>:\n[1,0]<stderr>:Compiled/Loaded model in 683.6591004971415 secs\n[1,0]<stderr>:-----------------------------------------------------------\n[1,0]<stderr>:--------------------- Training Started --------------------\nStep: 0 / 0 - LR: 0.00e+00 - total loss: 11.260 - mlm_loss: 10.397 - nsp_loss: 0.863 - mlm_acc: 0.000 % - nsp_acc: 0.052 %:   0%|          | 0/1 [00:03<?, ?itStep: 0 / 0 - LR: 0.00e+00 - total loss: 11.260 - mlm_loss: 10.397 - nsp_loss: 0.863 - mlm_acc: 0.000 % - nsp_acc: 0.052 %:   0%|          | 0/1 [00:03<?, ?itStep: 0 / 0 - LR: 0.00e+00 - total loss: 11.260 - mlm_loss: 10.397 - nsp_loss: 0.863 - mlm_acc: 0.000 % - nsp_acc: 0.052 %:   0%|          | 0/1 [00:03<?, ?it/s, throughput: 17692.1 
samples/sec][1,0]<stderr>:\n[1,0]<stderr>:-----------------------------------------------------------\n[1,0]<stderr>:-------------------- Training Metrics ---------------------\n[1,0]<stderr>:global_batch_size: 65536\n[1,0]<stderr>:device_iterations: 1\n[1,0]<stderr>:training_steps: 1\n[1,0]<stderr>:Training time: 3.718 secs\n[1,0]<stderr>:-----------------------------------------------------------\n
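In the example above, the Poprun settings appear to be chosen so that the number of replicas multiplied by the IPUs per replica equals the IPU count requested for the job (16 x 4 = 64). The Python sketch below checks that relationship before a job is submitted. The function name is hypothetical, and treating this product as a hard requirement is an assumption drawn from the example rather than from the Poprun documentation.

```python
def check_poprun_layout(total_ipus, num_replicas, ipus_per_replica):
    '''Return a list of problems found in the replica layout (empty when consistent).'''
    problems = []
    used = num_replicas * ipus_per_replica
    if used != total_ipus:
        problems.append(
            f'{num_replicas} replicas x {ipus_per_replica} IPUs = {used}, '
            f'but {total_ipus} IPUs were requested'
        )
    return problems


# Values from the example: ipusPerJobInstance 64, --num-replicas 16, --ipus-per-replica 4.
print(check_poprun_layout(64, 16, 4))
```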
      "},{"location":"services/graphcore/training/L2_multiple_IPU/#notes-on-using-the-examples-respository","title":"Notes on using the examples repository","text":"

      Graphcore provides examples of a variety of models on GitHub at https://github.com/graphcore/examples. When following the instructions there, note that since we are using a container within a Kubernetes pod, there is no need to enable the Poplar/PopART SDK, set up a virtual Python environment, or install the PopTorch wheel.

      "},{"location":"services/graphcore/training/L3_profiling/","title":"Profiling with PopVision","text":"

      Graphcore provides various tools for profiling, debugging, and instrumenting programs run on IPUs. In this tutorial we will briefly demonstrate an example using the PopVision Graph Analyser. For more information, see Profiling and Debugging and PopVision Graph Analyser User Guide.

      We will reuse the same PyTorch MNIST example from lesson 1 (from https://github.com/graphcore/examples/tree/master/tutorials/simple_applications/pytorch/mnist).

      To enable profiling and create IPU reports, we need to add the following line to the training script mnist_poptorch_code_only.py :

      training_opts = training_opts.enableProfiling()\n

      (for details of the API, see the API reference)

      Save and run kubectl create -f <yaml-file> on the following:

      apiVersion: graphcore.ai/v1alpha1\nkind: IPUJob\nmetadata:\n  generateName: mnist-training-profiling-\nspec:\n  jobInstances: 1\n  ipusPerJobInstance: \"1\"\n  workers:\n    template:\n      spec:\n        containers:\n        - name: mnist-training-profiling\n          image: graphcore/pytorch:3.3.0\n          command: [/bin/bash, -c, --]\n          args:\n            - |\n              cd;\n              mkdir build;\n              cd build;\n              git clone https://github.com/graphcore/examples.git;\n              cd examples/tutorials/simple_applications/pytorch/mnist;\n              python -m pip install -r requirements.txt;\n              sed -i '131i training_opts = training_opts.enableProfiling()' mnist_poptorch_code_only.py;\n              python mnist_poptorch_code_only.py --epochs 1;\n              echo 'RUNNING ls ./training';\n              ls training\n          resources:\n            limits:\n              cpu: 32\n              memory: 200Gi\n          securityContext:\n            capabilities:\n              add:\n              - IPC_LOCK\n          volumeMounts:\n          - mountPath: /dev/shm\n            name: devshm\n        restartPolicy: Never\n        hostIPC: true\n        volumes:\n        - emptyDir:\n            medium: Memory\n            sizeLimit: 10Gi\n          name: devshm\n

      After completion, using kubectl logs <pod-name>, we can see the following result

      ...\nAccuracy on test set: 96.69%\nRUNNING ls ./training\narchive.a\nprofile.pop\n

      We can see that the training has created two Poplar report files: archive.a, which is an archive of the ELF executable files, one for each tile; and profile.pop, the Poplar profile, which contains compile-time and execution information about the Poplar graph.

      "},{"location":"services/graphcore/training/L3_profiling/#downloading-the-profile-reports","title":"Downloading the profile reports","text":"

      To download the training profiles to your local environment, you can use kubectl cp. For example, run

      kubectl cp <pod-name>:/root/build/examples/tutorials/simple_applications/pytorch/mnist/training .\n

      Once you have downloaded the profile report files, you can view the contents locally using the PopVision Graph Analyser tool, which is available for download here https://www.graphcore.ai/developer/popvision-tools.

      From the Graph Analyser, you can analyse information including memory usage, execution trace and more.

      "},{"location":"services/graphcore/training/L4_other_frameworks/","title":"Other Frameworks","text":"

      In this tutorial we'll briefly cover running TensorFlow and PopART for Machine Learning, and writing IPU programs directly via the PopLibs library in C++. Extra links and resources will be provided for more in-depth information.

      "},{"location":"services/graphcore/training/L4_other_frameworks/#terminology","title":"Terminology","text":"

      Within Graphcore, Poplar refers to the tools (e.g. Poplar Graph Engine or Poplar Graph Compiler) and libraries (PopLibs) for programming on IPUs.

      The Poplar SDK is a package of software development tools, including

      For more details see here.

      "},{"location":"services/graphcore/training/L4_other_frameworks/#other-ml-frameworks-tensorflow-and-popart","title":"Other ML frameworks: Tensorflow and PopART","text":"

      Besides being able to run PyTorch code, as demonstrated in the previous lessons, the Poplar SDK also supports running ML applications with TensorFlow or PopART.

      "},{"location":"services/graphcore/training/L4_other_frameworks/#tensorflow","title":"Tensorflow","text":"

      The Poplar SDK includes implementations of TensorFlow and Keras for the IPU.

      For more information, refer to Targeting the IPU from TensorFlow 2 and TensorFlow 2 Quick Start.

      These are available from the image graphcore/tensorflow:2.

      For a quick example, we will run an example script from https://github.com/graphcore/examples/tree/master/tutorials/simple_applications/tensorflow2/mnist. To get started, save the following yaml and run kubectl create -f <yaml-file> to create the IPUJob:

      apiVersion: graphcore.ai/v1alpha1\nkind: IPUJob\nmetadata:\n  generateName: tensorflow-example-\nspec:\n  jobInstances: 1\n  ipusPerJobInstance: \"1\"\n  workers:\n    template:\n      spec:\n        containers:\n        - name: tensorflow-example\n          image: graphcore/tensorflow:2\n          command: [/bin/bash, -c, --]\n          args:\n            - |\n              apt update;\n              apt upgrade -y;\n              apt install git -y;\n              cd;\n              mkdir build;\n              cd build;\n              git clone https://github.com/graphcore/examples.git;\n              cd examples/tutorials/simple_applications/tensorflow2/mnist;\n              python -m pip install -r requirements.txt;\n              python mnist_code_only.py --epochs 1\n          resources:\n            limits:\n              cpu: 32\n              memory: 200Gi\n          securityContext:\n            capabilities:\n              add:\n              - IPC_LOCK\n          volumeMounts:\n          - mountPath: /dev/shm\n            name: devshm\n        restartPolicy: Never\n        hostIPC: true\n        volumes:\n        - emptyDir:\n            medium: Memory\n            sizeLimit: 10Gi\n          name: devshm\n

      Running kubectl logs <pod> should show results similar to the following

      ...\n2023-10-25 13:21:40.263823: I tensorflow/compiler/plugin/poplar/driver/poplar_platform.cc:43] Poplar version: 3.2.0 (1513789a51) Poplar package: b82480c629\n2023-10-25 13:21:42.203515: I tensorflow/compiler/plugin/poplar/driver/poplar_executor.cc:1619] TensorFlow device /device:IPU:0 attached to 1 IPU with Poplar device ID: 0\nDownloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz\n11493376/11490434 [==============================] - 0s 0us/step\n11501568/11490434 [==============================] - 0s 0us/step\n2023-10-25 13:21:43.789573: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)\n2023-10-25 13:21:44.164207: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:210] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.\n2023-10-25 13:21:57.935339: I tensorflow/compiler/jit/xla_compilation_cache.cc:376] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.\nEpoch 1/4\n2000/2000 [==============================] - 17s 8ms/step - loss: 0.6188\nEpoch 2/4\n2000/2000 [==============================] - 1s 427us/step - loss: 0.3330\nEpoch 3/4\n2000/2000 [==============================] - 1s 371us/step - loss: 0.2857\nEpoch 4/4\n2000/2000 [==============================] - 1s 439us/step - loss: 0.2568\n
      "},{"location":"services/graphcore/training/L4_other_frameworks/#popart","title":"PopART","text":"

      The Poplar Advanced Run Time (PopART) enables importing and constructing ONNX graphs, and running graphs in inference, evaluation or training modes. PopART provides both a C++ and Python API.

      For more information, see the PopART User Guide

      PopART is available from the image graphcore/popart.

      For a quick example, we will run an example script from https://github.com/graphcore/tutorials/tree/sdk-release-3.1/simple_applications/popart/mnist. To get started, save the following yaml and run kubectl create -f <yaml-file> to create the IPUJob:

      apiVersion: graphcore.ai/v1alpha1\nkind: IPUJob\nmetadata:\n  generateName: popart-example-\nspec:\n  jobInstances: 1\n  ipusPerJobInstance: \"1\"\n  workers:\n    template:\n      spec:\n        containers:\n        - name: popart-example\n          image: graphcore/popart:3.3.0\n          command: [/bin/bash, -c, --]\n          args:\n            - |\n              cd ;\n              mkdir build;\n              cd build ;\n              git clone https://github.com/graphcore/tutorials.git;\n              cd tutorials;\n              git checkout sdk-release-3.1;\n              cd simple_applications/popart/mnist;\n              python3 -m pip install -r requirements.txt;\n              ./get_data.sh;\n              python3 popart_mnist.py --epochs 1\n          resources:\n            limits:\n              cpu: 32\n              memory: 200Gi\n          securityContext:\n            capabilities:\n              add:\n              - IPC_LOCK\n          volumeMounts:\n          - mountPath: /dev/shm\n            name: devshm\n        restartPolicy: Never\n        hostIPC: true\n        volumes:\n        - emptyDir:\n            medium: Memory\n            sizeLimit: 10Gi\n          name: devshm\n

      Running kubectl logs <pod> should show results similar to the following

      ...\nCreating ONNX model.\nCompiling the training graph.\nCompiling the validation graph.\nRunning training loop.\nEpoch #1\n   Loss=16.2605\n   Accuracy=88.88%\n
      "},{"location":"services/graphcore/training/L4_other_frameworks/#writing-ipu-programs-directly-with-poplibs","title":"Writing IPU programs directly with PopLibs","text":"

      The Poplar libraries are a set of C++ libraries consisting of the Poplar graph library and the open-source PopLibs libraries.

      The Poplar graph library provides direct access to the IPU by code written in C++. You can write complete programs using Poplar, or use it to write functions to be called from your application written in a higher-level framework such as TensorFlow.

      The PopLibs libraries are a set of application libraries that implement operations commonly required by machine learning applications, such as linear algebra operations, element-wise tensor operations, non-linearities and reductions. These provide a fast and easy way to create programs that run efficiently using the parallelism of the IPU.

      For more information, see Poplar Quick Start and Poplar and PopLibs User Guide.

      These are available from the image graphcore/poplar.

      When using the PopLibs libraries, you will have to include the include files in the include/popops directory, e.g.

      #include <popops/ElementWise.hpp>\n

      and to link the relevant PopLibs libraries, in addition to the Poplar library, e.g.

      g++ -std=c++11 my-program.cpp -lpoplar -lpopops\n

      For a quick example, we will run an example from https://github.com/graphcore/examples/tree/master/tutorials/simple_applications/poplar/mnist. To get started, save the following yaml and run kubectl create -f <yaml-file> to create the IPUJob:

      apiVersion: graphcore.ai/v1alpha1\nkind: IPUJob\nmetadata:\n  generateName: poplib-example-\nspec:\n  jobInstances: 1\n  ipusPerJobInstance: \"1\"\n  workers:\n    template:\n      spec:\n        containers:\n        - name: poplib-example\n          image: graphcore/poplar:3.3.0\n          command: [\"bash\"]\n          args: [\"-c\", \"cd && mkdir build && cd build && git clone https://github.com/graphcore/examples.git && cd examples/tutorials/simple_applications/poplar/mnist/ && ./get_data.sh && make &&  ./regression-demo -IPU 1 50\"]\n          resources:\n            limits:\n              cpu: 32\n              memory: 200Gi\n          securityContext:\n            capabilities:\n              add:\n              - IPC_LOCK\n          volumeMounts:\n          - mountPath: /dev/shm\n            name: devshm\n        restartPolicy: Never\n        hostIPC: true\n        volumes:\n        - emptyDir:\n            medium: Memory\n            sizeLimit: 10Gi\n          name: devshm\n

      Running kubectl logs <pod> should show results similar to the following

      ...\nUsing the IPU\nTrying to attach to IPU\nAttached to IPU 0\nTarget:\n  Number of IPUs:         1\n  Tiles per IPU:          1,472\n  Total Tiles:            1,472\n  Memory Per-Tile:        624.0 kB\n  Total Memory:           897.0 MB\n  Clock Speed (approx):   1,850.0 MHz\n  Number of Replicas:     1\n  IPUs per Replica:       1\n  Tiles per Replica:      1,472\n  Memory per Replica:     897.0 MB\n\nGraph:\n  Number of vertices:            5,466\n  Number of edges:              16,256\n  Number of variables:          41,059\n  Number of compute sets:           20\n\n...\n\nEpoch 1 (99%), accuracy 76%\n
      "},{"location":"services/jhub/","title":"EIDF Jupyterhub","text":"

      QuickStart

      Tutorial

      Documentation

      "},{"location":"services/jhub/docs/","title":"Service Documentation","text":""},{"location":"services/jhub/docs/#online-support","title":"Online support","text":""},{"location":"services/jhub/quickstart/","title":"Quickstart","text":""},{"location":"services/jhub/quickstart/#accessing","title":"Accessing","text":""},{"location":"services/jhub/quickstart/#first-task","title":"First Task","text":""},{"location":"services/jhub/quickstart/#further-information","title":"Further information","text":""},{"location":"services/jhub/tutorial/","title":"Tutorial","text":""},{"location":"services/jhub/tutorial/#first-notebook","title":"First notebook","text":""},{"location":"services/mft/","title":"MFT","text":""},{"location":"services/mft/quickstart/","title":"Managed File Transfer","text":""},{"location":"services/mft/quickstart/#getting-to-the-mft","title":"Getting to the MFT","text":"

      The EIDF MFT can be accessed at https://eidf-mft.epcc.ed.ac.uk

      "},{"location":"services/mft/quickstart/#how-it-works","title":"How it works","text":"

      The MFT provides a 'drop' zone for the project. All users in a given project will have access to the same shared transfer area. They will have the ability to upload, download, and delete files from the project's transfer area. This area is linked to a directory within the project's space on the shared backend storage.

      Files which are uploaded are owned by the Linux user 'nobody' and the group ID of whatever project the file is being uploaded to. They have the permissions: Owner = rw Group = r Others = r
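The permissions listed above correspond to the octal mode 644. As a minimal sketch (the helper names describe_mode and has_upload_permissions are hypothetical and not part of the MFT), the following Python shows how that mode maps to the familiar ls -l notation and how a file on the VM could be checked against it:

```python
import os
import stat

# Owner = rw, Group = r, Others = r, written as an octal mode.
MFT_UPLOAD_MODE = 0o644


def describe_mode(mode):
    # Render permission bits in ls -l style for a regular file.
    return stat.filemode(stat.S_IFREG | mode)


def has_upload_permissions(path):
    # Compare only the permission bits of the file with the expected mode.
    return stat.S_IMODE(os.stat(path).st_mode) == MFT_UPLOAD_MODE


if __name__ == '__main__':
    print(describe_mode(MFT_UPLOAD_MODE))
```

This only inspects permission bits; it does not check the owner ('nobody') or the project group.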

      Once the file is opened on the VM, the user that opened it will become the owner and they can make further changes.

      "},{"location":"services/mft/quickstart/#gaining-access-to-the-mft","title":"Gaining access to the MFT","text":"

      By default a project does not have access to the MFT; this has to be enabled. Currently this can be done by the PI sending a request to the EIDF Helpdesk. Once the project is enabled within the MFT, every user in the project will be able to log into the MFT using their usual EIDF credentials.

      "},{"location":"services/mft/sftp/","title":"SFTP","text":"

      Coming Soon

      "},{"location":"services/mft/using-the-mft/","title":"Using the MFT Web Portal","text":""},{"location":"services/mft/using-the-mft/#logging-in","title":"Logging in","text":"

      When you reach the MFT home page you can log in using your usual VM project credentials.

      You will then be asked what type of session you would like to start. Select New Web Client or Web Client and continue.

      "},{"location":"services/mft/using-the-mft/#file-ingress","title":"File Ingress","text":"

      Once logged in, all files currently in the project's transfer directory will be displayed. Click the 'Upload' button under the 'Home' title to open the dialogue for file upload. You can then drag and drop files in, or click 'Browse' to find them locally.

      Once uploaded, the file will be immediately accessible from the project area, and can be used within any EIDF service which has the filesystem mounted.

      "},{"location":"services/mft/using-the-mft/#file-egress","title":"File Egress","text":"

      File egress can be done in the reverse way. By placing the file into the project transfer directory, it will become available in the MFT portal.

      "},{"location":"services/mft/using-the-mft/#file-management","title":"File Management","text":"

      Directories can be created within the project transfer directory, for example 'Import' and 'Export', to allow for better file management. Deleting a file from either the MFT portal or the VM itself removes it from the other, as both locations point at the same file. The file is only stored in one place, so modifications made from either place change the same file.

      "},{"location":"services/rstudioserver/","title":"EIDF R Studio Server","text":"

      QuickStart

      Tutorial

      Documentation

      "},{"location":"services/rstudioserver/docs/","title":"Service Documentation","text":""},{"location":"services/rstudioserver/docs/#online-support","title":"Online support","text":""},{"location":"services/rstudioserver/quickstart/","title":"Quickstart","text":""},{"location":"services/rstudioserver/quickstart/#accessing","title":"Accessing","text":""},{"location":"services/rstudioserver/quickstart/#first-task","title":"First Task","text":""},{"location":"services/rstudioserver/quickstart/#creating-a-new-r-script","title":"Creating a New R Script","text":"

      Your RStudio Server session has been initialised now. If you are participating in a workshop, then all the packages and data required for the workshop have been loaded into the workspace. All that remains is to create a new R script to contain your code!

      1. In the RStudio Server UI, open the File menu item at the far left of the main menu bar at the top of the page
      2. Hover over the \u2018New File\u2019 sub-menu item, then select \u2018R Script\u2019 from the expanded menu
      3. A new window pane will appear in the UI as shown below, and you are now ready to start adding the R code to your script! RStudio Server UI screen with new script
      "},{"location":"services/rstudioserver/quickstart/#further-information","title":"Further information","text":""},{"location":"services/rstudioserver/tutorial/","title":"Tutorial","text":""},{"location":"services/rstudioserver/tutorial/#first-notebook","title":"First notebook","text":""},{"location":"services/ultra2/","title":"Ultra2 Large Memory System","text":"

      Get Access

      Running codes

      "},{"location":"services/ultra2/access/","title":"Ultra2 Large Memory System","text":""},{"location":"services/ultra2/access/#getting-access","title":"Getting Access","text":"

      Access to the Ultra2 system (also referred to as the SDF-CS1 system) is currently by arrangement with EPCC. Please email eidf@epcc.ed.ac.uk with a short description of the work you would like to perform.

      "},{"location":"services/ultra2/run/","title":"Ultra2 High Memory System","text":""},{"location":"services/ultra2/run/#introduction","title":"Introduction","text":"

      The Ultra2 system (also called the SDF-CS1 system) is a single logical CPU system based at EPCC. It is suitable for running jobs which require large volumes of non-distributed memory (as opposed to a cluster).

      "},{"location":"services/ultra2/run/#specifications","title":"Specifications","text":"

      The system is an HPE SuperDome Flex containing 576 individual cores in an SMT-1 arrangement (1 thread per core). The system has 18TB of memory available to users. Home directories are network mounted from the EIDF e1000 Lustre filesystem, although some local NVMe storage is available for temporary file storage during runs.

      "},{"location":"services/ultra2/run/#login","title":"Login","text":"

      Login is by SSH only, using ssh <username>@sdf-cs1.epcc.ed.ac.uk. See below for details on the credentials required to access the system.

      "},{"location":"services/ultra2/run/#access-credentials","title":"Access credentials","text":"

      To access Ultra2, you need to use two credentials: your SSH key pair protected by a passphrase and a Time-based one-time password (TOTP).
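For readers unfamiliar with TOTP, the sketch below shows how an authenticator app typically derives a code from a shared secret and the current time, following RFC 6238 on top of the RFC 4226 HOTP algorithm. The parameters used here (HMAC-SHA1, 30-second time step, 6 digits) are common defaults and are an assumption; the source does not state which parameters Ultra2 uses, and in practice you would use an authenticator app rather than this code.

```python
import hashlib
import hmac
import struct
import time


def hotp(secret, counter, digits=6):
    # HMAC-SHA1 over the 8-byte big-endian counter (RFC 4226).
    digest = hmac.new(secret, struct.pack('>Q', counter), hashlib.sha1).digest()
    # Dynamic truncation: the low 4 bits of the last byte select an offset.
    offset = digest[-1] & 0x0F
    code = struct.unpack('>I', digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def totp(secret, now=None, step=30, digits=6):
    # The counter is the number of whole time steps since the Unix epoch.
    if now is None:
        now = time.time()
    return hotp(secret, int(now) // step, digits)


if __name__ == '__main__':
    # Secret taken from the RFC 6238 test vectors, not a real credential.
    print(totp(b'12345678901234567890'))
```

Because the code depends only on the shared secret and the clock, the server can compute the same value independently, which is why the clocks on both sides need to be roughly in sync.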

      "},{"location":"services/ultra2/run/#ssh-key-pairs","title":"SSH Key Pairs","text":"

      You will need to generate an SSH key pair protected by a passphrase to access Ultra2.

      Using a terminal (the command line), set up a key pair that contains your e-mail address and enter a passphrase you will use to unlock the key:

          $ ssh-keygen -t rsa -C \"your@email.com\"\n    ...\n    -bash-4.1$ ssh-keygen -t rsa -C \"your@email.com\"\n    Generating public/private rsa key pair.\n    Enter file in which to save the key (/Home/user/.ssh/id_rsa): [Enter]\n    Enter passphrase (empty for no passphrase): [Passphrase]\n    Enter same passphrase again: [Passphrase]\n    Your identification has been saved in /Home/user/.ssh/id_rsa.\n    Your public key has been saved in /Home/user/.ssh/id_rsa.pub.\n    The key fingerprint is:\n    03:d4:c4:6d:58:0a:e2:4a:f8:73:9a:e8:e3:07:16:c8 your@email.com\n    The key's randomart image is:\n    +--[ RSA 2048]----+\n    |    . ...+o++++. |\n    | . . . =o..      |\n    |+ . . .......o o |\n    |oE .   .         |\n    |o =     .   S    |\n    |.    +.+     .   |\n    |.  oo            |\n    |.  .             |\n    | ..              |\n    +-----------------+\n

      (remember to replace \"your@email.com\" with your e-mail address).

      "},{"location":"services/ultra2/run/#upload-public-part-of-key-pair-to-safe","title":"Upload public part of key pair to SAFE","text":"

      You should now upload the public part of your SSH key pair to SAFE.

      Log in to SAFE. Then:

      1. Go to the Login accounts menu and select the Ultra2 account you want to add the SSH key to
      2. On the subsequent Login account details page click the Add Credential button
      3. Select SSH public key as the Credential Type and click Next
      4. Either copy and paste the public part of your SSH key into the SSH Public key box or use the button to select the public key file on your computer.
      5. Click Add to associate the public SSH key part with your account

      Once you have done this, your SSH key will be added to your Ultra2 account.

      "},{"location":"services/ultra2/run/#time-based-one-time-password-totp","title":"Time-based one-time password (TOTP)","text":"

      Remember, you will need to use both an SSH key and Time-based one-time password to log into Ultra2 so you will also need to set up your TOTP before you can log into Ultra2.

      First Login

      When you first log into Ultra2, you will be prompted to change your initial password. This is a three-step process:

      1. When prompted to enter your password: enter the password which you retrieved from SAFE
      2. When prompted to enter your new password: type in a new password
      3. When prompted to re-enter the new password: re-enter the new password

      Your password has now been changed

      You will not use your password when logging on to Ultra2 after the initial logon.

      "},{"location":"services/ultra2/run/#ssh-login","title":"SSH Login","text":"

      To log in to the host system, you will need to use the SSH key and TOTP token you registered when creating the account in SAFE. For example, with the appropriate key loaded, ssh <username>@sdf-cs1.epcc.ed.ac.uk will prompt you, roughly once per day, for your TOTP code.

      "},{"location":"services/ultra2/run/#software","title":"Software","text":"

      The primary software provided is Intel's oneAPI suite containing MPI compilers and runtimes, debuggers and the VTune performance analyser. Standard GNU compilers are also available. The oneAPI suite can be loaded by sourcing the shell script:

      source  /opt/intel/oneapi/setvars.sh\n
      "},{"location":"services/ultra2/run/#running-jobs","title":"Running Jobs","text":"

      All jobs must be run via SLURM to avoid inconveniencing other users of the system. Users should not run jobs directly. Note that the system has one logical processor with a large number of threads and thus appears to SLURM as a single node. This is intentional.

      "},{"location":"services/ultra2/run/#queue-limits","title":"Queue limits","text":"

      We kindly request that users limit their maximum total running job size to 288 cores and 4TB of memory, whether that is in a single job or divided across a number of jobs. This may be enforced via SLURM in the future.
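As a rough illustration of how the requested limits add up across jobs, the sketch below sums the cores and memory of a set of running jobs and compares the totals with 288 cores and 4TB. The Job class and within_requested_limits function are hypothetical helpers for planning purposes only; they do not query SLURM.

```python
from dataclasses import dataclass

# Requested limits from the text: 288 cores and 4 TB of memory in total.
MAX_CORES = 288
MAX_MEMORY_TB = 4.0


@dataclass
class Job:
    cores: int
    memory_tb: float


def within_requested_limits(running_jobs):
    # The limits apply to the sum over all of a user's running jobs.
    total_cores = sum(job.cores for job in running_jobs)
    total_memory = sum(job.memory_tb for job in running_jobs)
    return total_cores <= MAX_CORES and total_memory <= MAX_MEMORY_TB


if __name__ == '__main__':
    jobs = [Job(cores=144, memory_tb=2.0), Job(cores=144, memory_tb=2.0)]
    print(within_requested_limits(jobs))
```

For example, two concurrent jobs of 144 cores and 2TB each sit exactly at the limit, while adding any further job would exceed it.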

      "},{"location":"services/ultra2/run/#mpi-jobs","title":"MPI jobs","text":"

      An example script to run a multi-process MPI \"Hello world\" example is shown.

      #!/usr/bin/env bash\n#SBATCH -J HelloWorld\n#SBATCH --nodes=1\n#SBATCH --tasks-per-node=4\n# SLURM_CPUS_PER_TASK is only set when --cpus-per-task is given\n#SBATCH --cpus-per-task=1\n#SBATCH --nodelist=sdf-cs1\n#SBATCH --partition=standard\n##SBATCH --exclusive\n\n\necho \"Running on host ${HOSTNAME}\"\necho \"Using ${SLURM_NTASKS_PER_NODE} tasks per node\"\necho \"Using ${SLURM_CPUS_PER_TASK} cpus per task\"\nlet mpi_threads=${SLURM_NTASKS_PER_NODE}*${SLURM_CPUS_PER_TASK}\necho \"Using ${mpi_threads} MPI threads\"\n\n# Source oneAPI to ensure mpirun available\nif [[ -z \"${SETVARS_COMPLETED}\" ]]; then\nsource /opt/intel/oneapi/setvars.sh\nfi\n\n# mpirun invocation for Intel suite.\nmpirun -n ${mpi_threads} ./helloworld.exe\n
      "},{"location":"services/virtualmachines/","title":"Overview","text":"

      The EIDF Virtual Machine (VM) Service is the underlying infrastructure upon which the EIDF Data Science Cloud (DSC) is built.

      The service currently has a mixture of hardware node types which host VMs of various flavours:

      The shapes and sizes of the flavours are based on subdivisions of this hardware, noting that CPUs are 4x oversubscribed for mcomp nodes (general VM flavours).

      "},{"location":"services/virtualmachines/#service-access","title":"Service Access","text":"

      Users should have an EIDF account - EIDF Accounts.

      Project Leads will be able to have access to the DSC added to their project during the project application process or through a request to the EIDF helpdesk.

      "},{"location":"services/virtualmachines/#additional-service-policy-information","title":"Additional Service Policy Information","text":"

      Additional information on service policies can be found here.

      "},{"location":"services/virtualmachines/docs/","title":"Service Documentation","text":""},{"location":"services/virtualmachines/docs/#project-management-guide","title":"Project Management Guide","text":""},{"location":"services/virtualmachines/docs/#required-member-permissions","title":"Required Member Permissions","text":"

      VMs and user accounts can only be managed by project members with Cloud Admin permissions. This includes the principal investigator (PI) of the project and all project managers (PM). Through SAFE the PI can designate project managers and the PI and PMs can grant a project member the Cloud Admin role:

      1. Click \"Manage Project in SAFE\" at the bottom of the project page (opens a new tab)
      2. On the project management page in SAFE, scroll down to \"Manage Members\"
      3. Click Add project manager or Set member permissions

      For details please refer to the SAFE documentation: How can I designate a user as a project manager?

      "},{"location":"services/virtualmachines/docs/#create-a-vm","title":"Create a VM","text":"

      To create a new VM:

      1. Select the project from the list of your projects, e.g. eidfxxx
      2. Click on the 'New Machine' button
      3. Complete the 'Create Machine' form as follows:

        1. Provide an appropriate name, e.g. dev-01. The project code will be prepended automatically to your VM name, in this case your VM would be named eidfxxx-dev-01.
        2. Select a suitable operating system
        3. Select a machine specification that is suitable
        4. Choose the required disk size (in GB) or leave blank for the default
        5. Tick the checkbox \"Configure RDP access\" if you would like to install RDP and configure VDI connections via RDP for your VM.
        6. Select the package installations from the software catalogue drop-down list, or \"None\" if you don't require any pre-installed packages
      4. Click on 'Create'

      5. You should see the new VM listed under the 'Machines' table on the project page and the status as 'Creating'
      6. Wait while the job to launch the VM completes. This may take up to 10 minutes, depending on the configuration you requested. You have to reload the page to see updates.
      7. Once the job has completed successfully the status shows as 'Active' in the list of machines.

      You may wish to ensure that the machine size selected (number of CPUs and RAM) does not exceed your remaining quota before you press Create, otherwise the request will fail.

      In the list of 'Machines' on the project page in the portal, click on the name of the new VM to see the configuration and properties, including the machine specification, its 10.24.*.* IP address and any configured VDI connections.

      "},{"location":"services/virtualmachines/docs/#quota-and-usage","title":"Quota and Usage","text":"

      Each project has a quota for the number of instances, total number of vCPUs, total RAM and storage. You will not be able to create a VM if it exceeds the quota.

      You can view and refresh the project usage compared to the quota in a table near the bottom of the project page. This table will be updated automatically when VMs are created or removed, and you can refresh it manually by pressing the \"Refresh\" button at the top of the table.

      Please contact the helpdesk if your quota requirements have changed.

      "},{"location":"services/virtualmachines/docs/#add-a-user-account","title":"Add a user account","text":"

      User accounts allow project members to log in to the VMs in a project. The Project PI and project managers manage user accounts for each member of the project. Users usually use one account (username and password) to log in to all the VMs in the same project that they can access, however a user may have multiple accounts in a project, for example for different roles.

      1. From the project page in the portal click on the 'Create account' button under the 'Project Accounts' table at the bottom
      2. Complete the 'Create User Account' form as follows:

        1. Choose 'Account user name': this could be something sensible like the first and last names concatenated (or initials) together with the project name. The username is unique across all EPCC systems so the user will not be able to reuse this name in another project once it has been assigned.
        2. Select the project member from the 'Account owner' drop-down field
        3. Click 'Create'

      The user can now set the password for their new account on the account details page.

      "},{"location":"services/virtualmachines/docs/#adding-access-to-the-vm-for-a-user","title":"Adding Access to the VM for a User","text":"

      User accounts can be granted or denied access to existing VMs.

      1. Click 'Manage' next to an existing user account in the 'Project Accounts' table on the project page, or click on the account name and then 'Manage' on the account details page
      2. Select the checkboxes in the column \"Access\" for the VMs to which this account should have access or uncheck the ones without access
      3. Click the 'Update' button
      4. After a few minutes, the job to give them access to the selected VMs will complete and the account status will show as \"Active\".

      If a user is logged in already to the VDI at https://eidf-vdi.epcc.ed.ac.uk/vdi newly added connections may not appear in their connections list immediately. They must log out and log in again to refresh the connection information, or wait until the login token expires and is refreshed automatically - this might take a while.

      If a user only has one connection available in the VDI they will be automatically directed to the VM with the default connection.

      "},{"location":"services/virtualmachines/docs/#sudo-permissions","title":"Sudo permissions","text":"

      A project manager or PI may also grant sudo permissions to users on selected VMs. Management of sudo permissions must be requested in the project application - if it was not requested or the request was denied the functionality described below is not available.

      1. Click 'Manage' next to an existing user account in the 'Project Accounts' table on the project page
      2. Select the checkboxes in the column \"Sudo\" for the VMs on which this account is granted sudo permissions or uncheck to remove permissions
      3. Make sure \"Access\" is also selected for the sudo VMs to allow login
      4. Click the 'Update' button

      After a few minutes, the job to give the user account sudo permissions on the selected VMs will complete. On the account detail page a \"sudo\" badge will appear next to the selected VMs.

      Please contact the helpdesk if sudo permission management is required but is not available in your project.

      "},{"location":"services/virtualmachines/docs/#first-login","title":"First login","text":"

      A new user account must reset the password before they can log in for the first time.

      The user can reset the password in their account details page.

      "},{"location":"services/virtualmachines/docs/#updating-an-existing-machine","title":"Updating an existing machine","text":""},{"location":"services/virtualmachines/docs/#adding-rdp-access","title":"Adding RDP Access","text":"

      If you did not select RDP access when you created the VM you can add it later:

      1. Open the VM details page by selecting the name on the project page
      2. Click on 'Configure RDP'
      3. The configuration job runs for a few minutes.

      Once the RDP job is completed, all users that are allowed to access the VM will also be permitted to use the RDP connection.

      "},{"location":"services/virtualmachines/docs/#software-catalogue","title":"Software catalogue","text":"

      You can install packages from the software catalogue at a later time, even if you didn't select a package when first creating the machine.

      1. Open the VM details page by selecting the name on the project page
      2. Click on 'Software Catalogue'
      3. Select the configuration you wish to install and press 'Submit'
      4. The configuration job runs for a few minutes.
      "},{"location":"services/virtualmachines/flavours/","title":"Flavours","text":"

      These are the current Virtual Machine (VM) flavours (configurations) available on the Virtual Desktop cloud service. Note that all VMs are built and configured using the EIDF Portal by PIs/Cloud Admins of projects, except GPU flavours, which must be requested via the helpdesk or the support request form.

      Flavour Name          vCPUs  DRAM in GB  Pinned Cores  GPU
      general.v2.tiny       1      2           No            No
      general.v2.small      2      4           No            No
      general.v2.medium     4      8           No            No
      general.v2.large      8      16          No            No
      general.v2.xlarge     16     32          No            No
      capability.v2.8cpu    8      112         Yes           No
      capability.v2.16cpu   16     224         Yes           No
      capability.v2.32cpu   32     448         Yes           No
      capability.v2.48cpu   48     672         Yes           No
      capability.v2.64cpu   64     896         Yes           No
      gpu.v1.8cpu           8      128         Yes           Yes
      gpu.v1.16cpu          16     256         Yes           Yes
      gpu.v1.32cpu          32     512         Yes           Yes
      gpu.v1.48cpu          48     768         Yes           Yes"},{"location":"services/virtualmachines/policies/","title":"EIDF Data Science Cloud Policies","text":""},{"location":"services/virtualmachines/policies/#end-of-life-policy-for-user-accounts-and-projects","title":"End of Life Policy for User Accounts and Projects","text":""},{"location":"services/virtualmachines/policies/#what-happens-when-an-account-or-project-is-no-longer-required-or-a-user-leaves-a-project","title":"What happens when an account or project is no longer required, or a user leaves a project","text":"

      These situations are most likely to come about during one of the following scenarios:

      1. The retirement of a project (usually one month after project end)
      2. A Principal Investigator (PI) tidying up a project requesting the removal of user(s) no longer working on the project
      3. A user wishing their own account to be removed
      4. A failure by a user to respond to the annual request to verify their email address held in the SAFE

      For each user account involved, assuming the relevant consent is given, the next step can be summarised as one of the following actions:

      It will be possible to have the account re-activated up until resources are removed (as outlined above); after this time it will be necessary to re-apply.

      A user's right to use EIDF is granted by a project. Our policy is to treat the account and associated data as the property of the PI as the owner of the project and its resources. It is the user's responsibility to ensure that any data they store on the EIDF DSC is handled appropriately and to copy off anything that they wish to keep to an appropriate location.

      A project manager or the PI can revoke a user's access to accounts within their project at any time, by locking, removing or re-owning the account as appropriate.

      A user may give up access to an account and return it to the control of the project at any time.

      When a project is due to end, the PI will receive notification of the closure of the project and its accounts one month before all project accounts and DSC resources (VMs, data volumes) are closed and cleaned or removed.

      "},{"location":"services/virtualmachines/policies/#backup-policies","title":"Backup policies","text":"

      The current policy is:

      We strongly advise that you keep copies of any critical data on an alternative system that is fully backed up.

      "},{"location":"services/virtualmachines/policies/#patching-of-user-vms","title":"Patching of User VMs","text":"

      The EIDF team updates and patches the hypervisors and the cloud management software as part of the EIDF Maintenance sessions. It is the responsibility of project PIs to keep the VMs in their projects up to date. VMs running the Ubuntu operating system automatically install security patches and alert users at log-on (via SSH) to reboot as necessary for the changes to take effect. The log-on message also encourages users to update packages.

      "},{"location":"services/virtualmachines/quickstart/","title":"Quickstart","text":"

      Projects using the Virtual Desktop cloud service are accessed via the EIDF Portal.

      Authentication is provided by SAFE, so if you do not have an active web browser session in SAFE, you will be redirected to the SAFE log on page. If you do not have a SAFE account, follow the instructions in the SAFE documentation on how to register and receive your password.

      "},{"location":"services/virtualmachines/quickstart/#accessing-your-projects","title":"Accessing your projects","text":"
      1. Log into the portal at https://portal.eidf.ac.uk/. The login will redirect you to the SAFE.

      2. View the projects that you have access to at https://portal.eidf.ac.uk/project/

      "},{"location":"services/virtualmachines/quickstart/#joining-a-project","title":"Joining a project","text":"
      1. Navigate to https://portal.eidf.ac.uk/project/ and click the link to \"Request access\", or choose \"Request Access\" in the \"Project\" menu.

      2. Select the project that you want to join in the \"Project\" dropdown list - you can search for the project name or the project code, e.g. \"eidf0123\".

      Now you have to wait for your PI or project manager to accept your request to join.

      "},{"location":"services/virtualmachines/quickstart/#accessing-a-vm","title":"Accessing a VM","text":"
      1. Select a project and view your user accounts on the project page.

      2. Click on an account name to view details of the VMs that you are allowed to access with this account, and to change the password for this account.

      3. Before you log in for the first time with a new user account, you must change your password as described below.

      4. Follow the link to the Guacamole login or log in directly at https://eidf-vdi.epcc.ed.ac.uk/vdi/. Please see the VDI guide for more information.

      5. You can also log in via the EIDF Gateway Jump Host if this is available in your project.

      Warning

      You must set a password for a new account before you log in for the first time.

      "},{"location":"services/virtualmachines/quickstart/#set-or-change-the-password-for-a-user-account","title":"Set or change the password for a user account","text":"

      Follow these instructions to set a password for a new account before you log in for the first time. If you have forgotten your password you may reset the password as described here.

      1. Select a project and click the account name in the project page to view the account details.

      2. In the user account detail page, press the button \"Set Password\" and follow the instructions in the form.

      There may be a short delay while the change is implemented before the new password becomes usable.

      "},{"location":"services/virtualmachines/quickstart/#further-information","title":"Further information","text":"

      Managing VMs: Project management guide to creating, configuring and removing VMs and managing user accounts in the portal.

      Virtual Desktop Interface: Working with the VDI interface.

      EIDF Gateway: SSH access to VMs via the EIDF SSH Gateway jump host.

      "},{"location":"status/","title":"EIDF Service Status","text":"

      The table below represents the broad status of each EIDF service.

      Service Status EIDF Portal VM SSH Gateway VM VDI Gateway Virtual Desktops Cerebras CS-2 SuperDome Flex (SDF-CS1 / Ultra2)"},{"location":"status/#maintenance-sessions","title":"Maintenance Sessions","text":"

      There will be a service outage on the 3rd Thursday of every month from 9am to 5pm. We keep maintenance downtime to a minimum on the service but do occasionally need to perform essential work on the system. Maintenance sessions are used to ensure that:

      The service will be returned to service ahead of 5pm if all the work is completed early.

      "}]} \ No newline at end of file +{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"EIDF User Documentation","text":"

      The Edinburgh International Data Facility (EIDF) is built and operated by EPCC at the University of Edinburgh. EIDF is a place to store, find and work with data of all kinds. You can find more information on the service and the research it supports on the EIDF website.

      For more information or for support with our services, please email eidf@epcc.ed.ac.uk in the first instance.

      "},{"location":"#what-the-documentation-covers","title":"What the documentation covers","text":"

      This documentation gives more in-depth coverage of current EIDF services. It is aimed primarily at developers or power users.

      "},{"location":"#contributing-to-the-documentation","title":"Contributing to the documentation","text":"

      The source for this documentation is publicly available in the EIDF documentation Github repository so that anyone can contribute to improve the documentation for the service. Contributions can be in the form of improvements or additions to the content and/or addition of Issues providing suggestions for how it can be improved.

      Full details of how to contribute can be found in the README.md file of the repository.

      This documentation set is a work in progress.

      "},{"location":"#credits","title":"Credits","text":"

      This documentation draws on the ARCHER2 National Supercomputing Service documentation.

      "},{"location":"access/","title":"Accessing EIDF","text":"

      Some EIDF services are accessed via a Web browser and some by \"traditional\" command-line ssh.

      All EIDF services use the EPCC SAFE service management back end, to ensure compatibility with other EPCC high-performance computing services.

      "},{"location":"access/#web-access-to-virtual-machines","title":"Web Access to Virtual Machines","text":"

      The Virtual Desktop VM service is browser-based, providing a virtual desktop interface (Apache Guacamole) for \"desktop-in-a-browser\" access. Applications to use the VM service are made through the EIDF Portal.

      EIDF Portal: how to ask to join an existing EIDF project and how to apply for a new project

      VDI access to virtual machines: how to connect to the virtual desktop interface.

      "},{"location":"access/#ssh-access-to-virtual-machines","title":"SSH Access to Virtual Machines","text":"

      Users with the appropriate permissions can also use ssh to log in to Virtual Desktop VMs.

      "},{"location":"access/#ssh-access-to-computing-services","title":"SSH Access to Computing Services","text":"

      Includes access to the following services:

      To log in to most command-line services with ssh you should use the username and password you obtained from SAFE when you applied for access, along with the SSH Key you registered when creating the account. You can then log in to the host following the appropriately linked instructions above.

      "},{"location":"access/project/","title":"EIDF Portal","text":"

      Projects using the Virtual Desktop cloud service are accessed via the EIDF Portal.

      The EIDF Portal uses EPCC's SAFE service management software to manage user accounts across all EPCC services. To log in to the Portal you will first be redirected to the SAFE log on page. If you do not have a SAFE account, follow the instructions in the SAFE documentation on how to register and receive your password.

      "},{"location":"access/project/#how-to-request-to-join-a-project","title":"How to request to join a project","text":"

      Log in to the EIDF Portal and navigate to \"Projects\" and choose \"Request access\". Select the project that you want to join in the \"Project\" dropdown list - you can search for the project name or the project code, e.g. \"eidf0123\".

      Now you have to wait for your PI or project manager to accept your request to register.

      "},{"location":"access/project/#how-to-apply-for-a-project-as-a-principal-investigator","title":"How to apply for a project as a Principal Investigator","text":""},{"location":"access/project/#create-a-new-project-application","title":"Create a new project application","text":"

      Navigate to the EIDF Portal and log in via SAFE if necessary (see above).

      Once you have logged in click on \"Applications\" in the menu and choose \"New Application\".

      1. Fill in the Application Title - this will be the name of the project once it is approved.
      2. Choose a start date and an end date for your project.
      3. Click \"Create\" to create your project application.

      Once the application has been created you see an overview of the form you are required to fill in. You can revisit the application at any time by clicking on \"Applications\" and choosing \"Your applications\" to display all your current and past applications and their status, or follow the link https://portal.eidf.ac.uk/proposal/.

      "},{"location":"access/project/#populate-a-project-application","title":"Populate a project application","text":"

      Fill in each section of the application as required:

      You can edit and save each section separately and revisit the application at a later time.

      "},{"location":"access/project/#datasets","title":"Datasets","text":"

      You are required to fill in a \"Dataset\" form for each dataset that you are planning to store and process as part of your project.

      We are required to ensure that projects involving \"sensitive\" data have the necessary permissions in place. The answers to these questions will enable us to decide what additional documentation we may need, and whether your project may need to be set up in an independently governed Safe Haven. There may be some projects we are simply unable to host for data protection reasons.

      "},{"location":"access/project/#resource-requirements","title":"Resource Requirements","text":"

      Add an estimate for each size and type of VM that is required.

      "},{"location":"access/project/#submission","title":"Submission","text":"

      When you are happy with your application, click \"Submit\". If any required fields are missing, they are highlighted and your submission will fail.

      Once your submission is successful, the application status is marked as \"Submitted\". You then have to wait while the EIDF approval team considers your application. You may be contacted if there are any questions regarding your application or if further information is required, and you will be notified of the outcome of your application.

      "},{"location":"access/project/#approved-project","title":"Approved Project","text":"

      If your application was approved, refer to Data Science Virtual Desktops: Quickstart for how to view your project and to Data Science Virtual Desktops: Managing VMs for how to manage a project and how to create virtual machines and user accounts.

      "},{"location":"access/ssh/","title":"SSH Access to Virtual Machines using the EIDF-Gateway Jump Host","text":"

      The EIDF-Gateway is an SSH gateway suitable for accessing EIDF Services via a console or terminal. As the gateway cannot be 'landed' on, a user can only pass through it and so the destination (the VM IP) has to be known for the service to work. Users connect to their VM through the jump host using their given accounts. You will require three things to use the gateway:

      1. A user within a project allowed to access the gateway and a password set.
      2. An SSH-key linked to this account, used to authenticate against the gateway.
      3. MFA set up for your project account via SAFE.

      Steps to meet all of these requirements are explained below.

      "},{"location":"access/ssh/#generating-and-adding-an-ssh-key","title":"Generating and Adding an SSH Key","text":"

      In order to make use of the EIDF-Gateway, your EIDF account needs an SSH-Key associated with it. If you added one while creating your EIDF account, you can skip this step.

      "},{"location":"access/ssh/#check-for-an-existing-ssh-key","title":"Check for an existing SSH Key","text":"

      To check if you have an SSH Key associated with your account:

      1. Login to the Portal
      2. Select 'Your Projects'
      3. Select your project name
      4. Select your username

      If there is an entry under 'Credentials', then you're all set up. If not, you'll need to generate an SSH Key as follows:

      "},{"location":"access/ssh/#generate-a-new-ssh-key","title":"Generate a new SSH Key","text":"
      1. Open a new window of whatever terminal you will use to SSH to EIDF.
      2. Generate a new SSH Key:

        ssh-keygen\n
      3. It is fine to accept the default name and path for the key unless you manage a number of keys.

      4. Press enter to finish generating the key
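
      As a minimal sketch of the non-interactive form of the command above, the following generates a throwaway Ed25519 key pair in a temporary directory. The key type, the comment text and the empty passphrase (`-N ""`) are assumptions made only so that the example runs unattended; they are not EIDF requirements. For a real key, omit `-N` so that `ssh-keygen` prompts for a passphrase, and choose your own path.

      ```shell
      # Demo only: create a throwaway key pair in a temporary directory.
      demo_dir="$(mktemp -d)"
      key_path="$demo_dir/id_ed25519"

      # -t selects the key type, -f the output file, -C a free-text comment.
      # -N "" sets an empty passphrase so the demo does not prompt; -q suppresses progress output.
      ssh-keygen -q -t ed25519 -f "$key_path" -C "eidf-demo-key" -N ""

      # The private key is $key_path; the public key to upload is $key_path.pub.
      ls "$demo_dir"
      ```

      The `.pub` file is the one to upload in the Portal; the private key should never leave your machine.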
      "},{"location":"access/ssh/#adding-the-new-ssh-key-to-your-account-via-the-portal","title":"Adding the new SSH Key to your account via the Portal","text":"
      1. Log in to the Portal
      2. Select 'Your Projects'
      3. Select the relevant project
      4. Select your username
      5. Select the plus button under 'Credentials'
      6. Select 'Choose File' to upload the PUBLIC (.pub) ssh key generated in the last step, or open the .pub file you just created and copy its contents into the text box.
      7. Click 'Upload Credential' - it should look something like this:
      "},{"location":"access/ssh/#adding-a-new-ssh-key-via-safe","title":"Adding a new SSH Key via SAFE","text":"

        This should not be necessary for most users, so only follow this process if you have an issue or have been told to by the EPCC Helpdesk. If you need to add an SSH Key directly to SAFE, you can follow this guide. However, select your '[username]@EIDF' login account, not 'Archer2' as specified in that guide.

        "},{"location":"access/ssh/#enabling-mfa-via-the-portal","title":"Enabling MFA via the Portal","text":"

        A multi-factor Time-Based One-Time Password is now required to access the SSH Gateway.

        To enable this for your EIDF account:

        1. Login to the portal.
        2. Select 'Projects' then 'Your Projects'
        3. Select the project containing the account you'd like to add MFA to.
        4. Under 'Your Accounts', select the account you would like to add MFA to.
        5. Select 'Set MFA Token'
        6. Within your chosen MFA application, scan the QR Code or enter the key and add the token.
        7. Enter the code displayed in the app into the 'Verification Code' box and select 'Set Token'
        8. You will be redirected to the User Account page and a green 'Added MFA Token' message will confirm the token has been added successfully.

        Note

        TOTP is only required for the SSH Gateway, not to the VMs themselves, and not through the VDI. An MFA token will have to be set for each account you'd like to use to access the EIDF SSH Gateway.

        "},{"location":"access/ssh/#using-the-ssh-key-and-totp-code-to-access-eidf-windows-and-linux","title":"Using the SSH-Key and TOTP Code to access EIDF - Windows and Linux","text":"
        1. From your local terminal, import the SSH Key you generated above: ssh-add /path/to/ssh-key

        2. This should return \"Identity added [Path to SSH Key]\" if successful. You can then follow the steps below to access your VM.

        "},{"location":"access/ssh/#accessing-from-macoslinux","title":"Accessing From MacOS/Linux","text":"

        Warning

        If this is your first time connecting to EIDF using a new account, you have to set a password as described in Set or change the password for a user account.

        OpenSSH is usually installed by default on Linux and macOS, so you can access the gateway natively from the terminal.

        Ensure you have created and added an ssh key as specified in the 'Generating and Adding an SSH Key' section above, then run the commands below:

        ssh-add /path/to/ssh-key\nssh -J [username]@eidf-gateway.epcc.ed.ac.uk [username]@[vm_ip]\n

        For example:

        ssh-add ~/.ssh/keys/id_ed25519\nssh -J alice@eidf-gateway.epcc.ed.ac.uk alice@10.24.1.1\n

        Info

        If the ssh-add command fails saying the SSH Agent is not running, run the below command:

        eval `ssh-agent`

        Then re-run the ssh-add command above.

        The -J flag is used to specify that we will access the second specified host by jumping through the first specified host.

        You will be prompted for a 'TOTP' code upon successful public key authentication to the gateway. At the TOTP prompt, enter the code displayed in your MFA Application.
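
        The jump command above always has the same shape, so it can be assembled from the account name and the VM address. The sketch below is illustrative only: `eidf_jump_command` is a hypothetical helper name, not part of any EIDF tooling, and it only prints the command rather than running it.

        ```shell
        # Hypothetical helper: print the jump-host command for a given account and VM.
        eidf_jump_command() {
            username="$1"   # EIDF project account name
            vm_ip="$2"      # IP address of the target VM
            gateway="eidf-gateway.epcc.ed.ac.uk"
            printf 'ssh -J %s@%s %s@%s\n' "$username" "$gateway" "$username" "$vm_ip"
        }

        # Prints: ssh -J alice@eidf-gateway.epcc.ed.ac.uk alice@10.24.1.1
        eidf_jump_command alice 10.24.1.1
        ```

        Printing the command first makes it easy to check the username and IP address before connecting.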

        "},{"location":"access/ssh/#accessing-from-windows","title":"Accessing from Windows","text":"

        Windows requires the OpenSSH Client to be installed in order to use SSH. Putty or MobaXTerm can also be used but won\u2019t be covered in this tutorial.

        "},{"location":"access/ssh/#installing-and-using-openssh","title":"Installing and using OpenSSH","text":"
        1. Click the \u2018Start\u2019 button at the bottom of the screen
        2. Click the \u2018Settings\u2019 cog icon
        3. Select 'System'
        4. Select the \u2018Optional Features\u2019 option at the bottom of the list
        5. If \u2018OpenSSH Client\u2019 is not under \u2018Installed Features\u2019, click the \u2018View Features\u2019 button
        6. Search \u2018OpenSSH Client\u2019
        7. Select the check box next to \u2018OpenSSH Client\u2019 and click \u2018Install\u2019
        "},{"location":"access/ssh/#accessing-eidf-via-a-terminal","title":"Accessing EIDF via a Terminal","text":"

        Warning

        If this is your first time connecting to EIDF using a new account, you have to set a password as described in Set or change the password for a user account.

        1. Open either Powershell or the Windows Terminal
        2. Import the SSH Key you generated above:

          ssh-add \\path\\to\\sshkey\n\nFor Example:\nssh-add .\\.ssh\\id_ed25519\n
        3. This should return \"Identity added [Path to SSH Key]\" if successful. If it doesn't, run the following in Powershell:

          Get-Service -Name ssh-agent | Set-Service -StartupType Manual\nStart-Service ssh-agent\nssh-add \\path\\to\\sshkey\n
        4. Login by jumping through the gateway.

          ssh -J [EIDF username]@eidf-gateway.epcc.ed.ac.uk [EIDF username]@[vm_ip]\n\nFor Example:\nssh -J alice@eidf-gateway.epcc.ed.ac.uk alice@10.24.1.1\n

        You will be prompted for a 'TOTP' code upon successful public key authentication to the gateway. At the TOTP prompt, enter the code displayed in your MFA Application.

        "},{"location":"access/ssh/#ssh-aliases","title":"SSH Aliases","text":"

        You can use SSH Aliases to access your VMs with a single word.

        1. Create a new entry for the EIDF-Gateway in your ~/.ssh/config file. Using the text editor of your choice (vi used as an example), edit the .ssh/config file:

          vi ~/.ssh/config\n
        2. Insert the following lines:

          Host eidf-gateway\n  Hostname eidf-gateway.epcc.ed.ac.uk\n  User <eidf project username>\n  IdentityFile /path/to/ssh/key\n

          For example:

          Host eidf-gateway\n  Hostname eidf-gateway.epcc.ed.ac.uk\n  User alice\n  IdentityFile ~/.ssh/id_ed25519\n
        3. Save and quit the file.

        4. Now you can ssh to your VM using the below command:

          ssh -J eidf-gateway [EIDF username]@[vm_ip] -i /path/to/ssh/key\n

          For Example:

          ssh -J eidf-gateway alice@10.24.1.1 -i ~/.ssh/id_ed25519\n
        5. You can add further alias options to make accessing your VM quicker. For example, if you use the below template to create an entry below the EIDF-Gateway entry in ~/.ssh/config, you can use the alias name to automatically jump through the EIDF-Gateway and onto your VM:

          Host <vm name/alias>\n  HostName 10.24.VM.IP\n  User <vm username>\n  IdentityFile /path/to/ssh/key\n  ProxyCommand ssh eidf-gateway -W %h:%p\n

          For Example:

          Host demo\n  HostName 10.24.1.1\n  User alice\n  IdentityFile ~/.ssh/id_ed25519\n  ProxyCommand ssh eidf-gateway -W %h:%p\n
        6. Now, by running ssh demo your SSH client will automatically follow the 'ProxyCommand' line in the 'demo' alias and jump through the gateway before following its own instructions to reach your VM. Note that for this setup, if your key is RSA, you will need to add the following line to the bottom of the 'demo' alias: HostKeyAlgorithms +ssh-rsa

        Info

        This adds an alias entry to your ssh config, so whenever you ssh to 'eidf-gateway' your SSH client will automatically fill in the hostname, your username and your SSH key. This allows a much simpler ssh command to reach your VMs. You can replace the alias name with whatever you like: just change the 'Host' line from 'eidf-gateway' to the alias you prefer. The -J flag is used to specify that we will access the second specified host by jumping through the first specified host.
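
        If you manage several VMs, the per-VM alias entries described in the steps above can be produced from a template. The following is a sketch under stated assumptions: `eidf_vm_alias` is a hypothetical helper name, and it assumes that a 'Host eidf-gateway' entry like the one in step 2 already exists in your ssh config. It only prints the entry, so you can review it before adding it to ~/.ssh/config.

        ```shell
        # Hypothetical helper: print a ~/.ssh/config entry for a VM behind the gateway.
        # Assumes a 'Host eidf-gateway' entry already exists in the same config file.
        eidf_vm_alias() {
            alias_name="$1"; vm_ip="$2"; vm_user="$3"; key_file="$4"
            printf 'Host %s\n' "$alias_name"
            printf '  HostName %s\n' "$vm_ip"
            printf '  User %s\n' "$vm_user"
            printf '  IdentityFile %s\n' "$key_file"
            # %% is a literal percent sign in printf, so ssh still sees %h:%p.
            printf '  ProxyCommand ssh eidf-gateway -W %%h:%%p\n'
        }

        # Review the output, then add it to ~/.ssh/config yourself if it looks right.
        eidf_vm_alias demo 10.24.1.1 alice '~/.ssh/id_ed25519'
        ```

        After adding an entry, running `ssh -G demo` on a reasonably recent OpenSSH client should print the options that would be used for that alias without opening a connection, which is a convenient way to check the result.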

        "},{"location":"access/ssh/#first-password-setting-and-password-resets","title":"First Password Setting and Password Resets","text":"

        Before logging in for the first time you have to reset the password using the web form in the EIDF Portal following the instructions in Set or change the password for a user account.

        "},{"location":"access/virtualmachines-vdi/","title":"Virtual Machines (VMs) and the EIDF Virtual Desktop Interface (VDI)","text":"

        Using the EIDF VDI, members of EIDF projects can connect to VMs that they have been granted access to. The EIDF VDI is a web portal that displays the connections to VMs a user has available to them, and then those connections can be easily initiated by clicking on them in the user interface. Once connected to the target VM, all interactions are mediated through the user's web browser by the EIDF VDI.

        "},{"location":"access/virtualmachines-vdi/#login-to-the-eidf-vdi","title":"Login to the EIDF VDI","text":"

        Once your membership request to join the appropriate EIDF project has been approved, you will be able to login to the EIDF VDI at https://eidf-vdi.epcc.ed.ac.uk/vdi.

        Authentication to the VDI is provided by SAFE, so if you do not have an active web browser session in SAFE, you will be redirected to the SAFE log on page. If you do not have a SAFE account, follow the instructions in the SAFE documentation on how to register and receive your password.

        "},{"location":"access/virtualmachines-vdi/#navigating-the-eidf-vdi","title":"Navigating the EIDF VDI","text":"

        After you have been authenticated through SAFE and logged into the EIDF VDI, if you have multiple connections available to you that have been associated with your user (typically in the case of research projects), you will be presented with the VDI home screen as shown below:

        VDI home page with list of available VM connections

        Adding connections

        Note that if a project manager has added a new connection for you it may not appear in the list of connections immediately. You must log out and log in again to refresh your connections list.

        "},{"location":"access/virtualmachines-vdi/#connecting-to-a-vm","title":"Connecting to a VM","text":"

        If you have only one connection associated with your VDI user account (typically in the case of workshops), you will be automatically connected to the target VM's virtual desktop. Once you are connected to the VM, you will be asked for your username and password as shown below (if you are participating in a workshop, you may not be asked for credentials).

        VM virtual desktop connection user account login screen

        Once your credentials have been accepted, you will be connected to your VM's desktop environment. For instance, the screenshot below shows a resulting connection to a Xubuntu 20.04 VM with the Xfce desktop environment.

        VM virtual desktop

        "},{"location":"access/virtualmachines-vdi/#vdi-features-for-the-virtual-desktop","title":"VDI Features for the Virtual Desktop","text":"

        The EIDF VDI is an instance of the Apache Guacamole clientless remote desktop gateway. Since the connection to your VM virtual desktop is entirely managed through Guacamole in the web browser, there are some additional features to be aware of that may assist you when using the VDI.

        "},{"location":"access/virtualmachines-vdi/#the-vdi-menu","title":"The VDI Menu","text":"

        The Guacamole menu is a sidebar which is hidden until explicitly shown. On a desktop or other device which has a hardware keyboard, you can show this menu by pressing <Ctrl> + <Alt> + <Shift> on a Windows PC client, or <Ctrl> + <Command> + <Shift> on a Mac client. To hide the menu, you press the same key combination once again. The menu provides various options, including:

        • Reading from (and writing to) the clipboard of the remote desktop
        • Zooming in and out of the remote display
        "},{"location":"access/virtualmachines-vdi/#clipboard-copy-and-paste-functionality","title":"Clipboard Copy and Paste Functionality","text":"

        After you have activated the Guacamole menu using the key combination above, at the top of the menu is a text area labeled \u201cclipboard\u201d along with some basic instructions:

        Text copied/cut within Guacamole will appear here. Changes to the text below will affect the remote clipboard.

        The text area functions as an interface between the remote clipboard and the local clipboard. Text from the local clipboard can be pasted into the text area, causing that text to be sent to the clipboard of the remote desktop. Similarly, if you copy or cut text within the remote desktop, you will see that text within the text area, and can manually copy it into the local clipboard if desired.

        You can use the standard keyboard shortcuts to copy text from your client PC or Mac to the Guacamole menu clipboard, then again copy that text from the Guacamole menu clipboard into an application or CLI terminal on the VM's remote desktop. An example of using the copy and paste clipboard is shown in the screenshot below.

        The EIDF VDI Clipboard

        "},{"location":"access/virtualmachines-vdi/#keyboard-language-and-layout-settings","title":"Keyboard Language and Layout Settings","text":"

        For users who do not have standard English (UK) keyboard layouts, key presses can have unexpected translations as they are transmitted to your VM. Please contact the EIDF helpdesk at eidf@epcc.ed.ac.uk if you are experiencing difficulties with your keyboard mapping, and we will help to resolve this by changing some settings in the Guacamole VDI connection configuration.

        "},{"location":"bespoke/","title":"Bespoke Services","text":"

        Ed-DaSH

        "},{"location":"bespoke/eddash/","title":"EIDFWorkshops","text":"

        Ed-DaSH Notebook Service

        Ed-DaSH Virtual Machines

        JupyterHub Notebook Service Access

        "},{"location":"bespoke/eddash/jhub-git/","title":"EIDF JupyterHub Notebook Service Access","text":"

        Using the EIDF JupyterHub, users can access a range of services including standard interactive Python notebooks as well as RStudio Server.

        "},{"location":"bespoke/eddash/jhub-git/#ed-dash-workshops","title":"Ed-DaSH Workshops","text":""},{"location":"bespoke/eddash/jhub-git/#accessing","title":"Accessing","text":"

        Access to the EIDF JupyterHub is authenticated through GitHub, so you must have an account on https://github.com and that account must be a member of the appropriate GitHub organization. Please ask your project admin or workshop instructor for the workshop GitHub organization details. Please follow the relevant steps listed below to prepare.

        1. If you do not have a GitHub account associated with the email you registered for the workshop with, follow the steps described in Step 1: Creating a GitHub Account
        2. If you do already have a GitHub account associated with the email address you registered for the workshop with, follow the steps described in Step 2: Registering with the Workshop GitHub Organization
        "},{"location":"bespoke/eddash/jhub-git/#step-1-creating-a-github-account","title":"Step 1: Creating a GitHub Account","text":"
        1. Visit https://github.com/signup in your browser
        2. Enter the email address that you used to register for the workshop
        3. Complete the remaining steps of the GitHub registration process
        4. Send an email to ed-dash-support@mlist.is.ed.ac.uk from your GitHub registered email address, including your GitHub username, and ask for an invitation to the workshop GitHub organization
        5. Wait for an email from GitHub inviting you to join the organization, then follow the steps in Step 2: Registering with the Workshop GitHub Organization
        "},{"location":"bespoke/eddash/jhub-git/#step-2-registering-with-the-workshop-github-organization","title":"Step 2: Registering With the Workshop GitHub Organization","text":"
        1. If you already have a GitHub account associated with the email address that you registered for the workshop with, you should have received an email inviting you to join the relevant GitHub organization. If you have not, email ed-dash-support@mlist.is.ed.ac.uk from your GitHub registered email address, including your GitHub username, and ask for an invitation to the workshop GitHub organization
        2. Once you have been invited to the GitHub organization, you will receive an email with the invitation; click on the button as shown Invitation to join the workshop GitHub organization
        3. Clicking on the button in the email will open a new web page with another form as shown below Form to accept the invitation to join the GitHub organization
        4. Again, click on the button to confirm, then the Ed-DaSH-Training GitHub organization page will open
        "},{"location":"bespoke/eddash/safe-registration/","title":"Accessing","text":"

        In order to access the EIDF VDI and connect to EIDF data science cloud VMs, you need to have an active SAFE account. If you already have a SAFE account, you can skip ahead to the Request Project Membership instructions. Otherwise, follow the Register Account in EPCC SAFE instructions immediately below to create the account.

        Info

        Please also see Register and Join a project in the SAFE documentation for more information.

        "},{"location":"bespoke/eddash/safe-registration/#step-1-register-account-in-epcc-safe","title":"Step 1: Register Account in EPCC SAFE","text":"
        1. Go to SAFE signup and complete the registration form
          1. Mandatory fields are: Email, Nationality, First name, Last name, Institution for reporting, Department, and Gender
          2. Your Email should be the one you used to register for the EIDF service (or Ed-DaSH workshop)
          3. If you are unsure, enter 'University of Edinburgh' for Institution for reporting and 'EIDF' for Department SAFE registration form
        2. Submit the form, then accept the SAFE Acceptable Use policy on the next page SAFE User Access Agreement
        3. After you have completed the registration form and accepted the policy, you will receive an email from support@archer2.ac.uk with a password reset URL
        4. Visit the link in the email and generate a new password, then submit the form
        5. You will now be logged into your new account in SAFE
        "},{"location":"bespoke/eddash/safe-registration/#step-2-request-project-membership","title":"Step 2: Request Project Membership","text":"
        1. While logged into SAFE, select the \u2018Request Access\u2019 menu item from the 'Projects' menu in the top menu bar
        2. This will open the 'Apply for project membership' page
        3. Enter the appropriate project ID into the \u2018Project\u2019 field and click the \u2018Next\u2019 button Apply for project membership in SAFE
        4. In the 'Access route' drop down field that appears, select 'Request membership' (not 'Request machine account') Request project membership in SAFE
        5. The project owner will then receive notification of the application and accept your request
        "},{"location":"bespoke/eddash/workshops/","title":"Workshop Setup","text":"

        Please follow the instructions in JupyterHub Notebook Service Access to arrange access to the EIDF Notebook service before continuing. The table below provides the login URL and the relevant GitHub organization to register with.

        Workshop | Login URL | GitHub Organization
        Ed-DaSH Introduction to Statistics | https://secure.epcc.ed.ac.uk/ed-dash-hub | Ed-DaSH-Training
        Ed-DaSH High-Dimensional Statistics | https://secure.epcc.ed.ac.uk/ed-dash-hub | Ed-DaSH-Training
        Ed-DaSH Introduction to Machine Learning with Python | https://secure.epcc.ed.ac.uk/ed-dash-hub | Ed-DaSH-Training
        N8 CIR Introduction to Artificial Neural Networks in Python | https://secure.epcc.ed.ac.uk/ed-dash-hub | Ed-DaSH-Training

        Please follow the sequence of instructions described in the sections below to get ready for the workshop:

        1. Step 1: Accessing the EIDF Notebook Service for the First Time
        2. Step 2: Login to the EIDF Notebook Service
        3. Step 3: Creating a New R Script
        "},{"location":"bespoke/eddash/workshops/#step-1-accessing-the-eidf-notebook-service-for-the-first-time","title":"Step 1: Accessing the EIDF Notebook Service for the First Time","text":"

        We will be using the Notebook service provided by the Edinburgh International Data Facility (EIDF). Follow the steps listed below to gain access.

        • Visit https://secure.epcc.ed.ac.uk/ed-dash-hub in your browser

        Warning

        If you are receiving an error response such as '403: Forbidden' when you try to access https://secure.epcc.ed.ac.uk/ed-dash-hub, please send an email to ed-dash-support@mlist.is.ed.ac.uk to request access and also include your IP address which you can find by visiting https://whatismyipaddress.com/ in your browser. Please be aware that if you are accessing the service from outside of the UK, your access might be blocked until you have emailed us with your IP address.

        1. Click on the button
        2. You will be asked to sign in to GitHub, as shown in the form below GitHub sign in form for access to EIDF Notebook Service
        3. Enter your GitHub credentials, or click on the \u2018Create an account\u2019 link if you do not already have one, and follow the prerequisite instructions to register with GitHub and join the workshop organization
        4. Click on the \u2018Sign in\u2019 button
        5. On the next page, you will be asked whether to authorize the workshop organization to access your GitHub account as shown below GitHub form requesting authorization for the workshop organization
        6. Click on the button
        7. At this point, you will receive an email to the email address that you registered with in GitHub, stating that \u201cA third-party OAuth application has been added to your account\u201d for the workshop
        8. If you receive a \u2018403 : Forbidden\u2019 error message on the next screen and have not already requested an invitation (step 4 of the prerequisites section), send an email to ed-dash-support@mlist.is.ed.ac.uk from your GitHub registered email address, including your GitHub username, and ask for an invitation to the workshop organization. Otherwise, skip to the next step. N.B. If you are accessing the service from outside of the UK, you may see this error; if so, please contact ed-dash-support@mlist.is.ed.ac.uk to enable access
        9. If you receive a \u2018400 : Bad Request\u2019 error message, you need to accept the invitation that has been emailed to you to join the workshop organization as in the prerequisite instructions
        "},{"location":"bespoke/eddash/workshops/#step-2-login-to-the-eidf-notebook-service","title":"Step 2: Login to the EIDF Notebook Service","text":"

        Now that you have completed registration with the workshop GitHub organization, you can access the workshop RStudio Server in EIDF.

        1. Return to https://secure.epcc.ed.ac.uk/ed-dash-hub
        2. You will be presented with a choice of server as a list of radio buttons. Select the appropriate one as labelled for your workshop and press the orange 'Start' button
        3. You will now be redirected to the hub spawn pending page for your individual server instance
        4. You will see a message stating that your server is launching. If the page has not updated after 10 seconds, simply refresh the page with the <CTRL> + R or <F5> keys in Windows, or <CMD> + R in macOS
        5. Finally, you will be redirected to either the RStudio Server if it's a statistics workshop, or the Jupyter Lab dashboard otherwise, as shown in the screenshots below The RStudio Server UI The Jupyter Lab Dashboard
        "},{"location":"bespoke/eddash/workshops/#step-3-creating-a-new-r-script","title":"Step 3: Creating a New R Script","text":"

        Follow these quickstart instructions to create your first R script in RStudio Server!

        "},{"location":"faq/","title":"FAQ","text":""},{"location":"faq/#eidf-frequently-asked-questions","title":"EIDF Frequently Asked Questions","text":""},{"location":"faq/#how-do-i-contact-the-eidf-helpdesk","title":"How do I contact the EIDF Helpdesk?","text":"

        Submit a query in the EIDF Portal by selecting \"Submit a Support Request\" in the \"Help and Support\" menu and filling in this form.

        You can also email us at eidf@epcc.ed.ac.uk.

        "},{"location":"faq/#how-do-i-request-more-resources-for-my-project-can-i-extend-my-project","title":"How do I request more resources for my project? Can I extend my project?","text":"

        Submit a support request: In the form select the project that your request relates to and select \"EIDF Project extension: duration and quota\" from the dropdown list of categories. Then enter the new quota or extension date in the description text box below and submit the request.

        The EIDF approval team will consider the extension and you will be notified of the outcome.

        "},{"location":"faq/#new-vms-and-vdi-connections","title":"New VMs and VDI connections","text":"

        My project manager gave me access to a VM but the connection doesn't show up in the VDI connections list?

        This may happen when a machine/VM was added to your connections list while you were logged in to the VDI. Please refresh the connections list by logging out and logging in again, and the new connections should appear.

        "},{"location":"faq/#non-default-ssh-keys","title":"Non-default SSH Keys","text":"

        I have different SSH keys for the SSH gateway and my VM, or I use a key which does not have the default name (~/.ssh/id_rsa), and I cannot log in.

        The command syntax shown in our SSH documentation (using the -J <username>@eidf-gateway stanza) makes assumptions about SSH keys and their naming. You should try the full version of the command:

        ssh -o ProxyCommand=\"ssh -i ~/.ssh/<gateway_private_key> -W %h:%p <gateway_username>@eidf-gateway.epcc.ed.ac.uk\" -i ~/.ssh/<vm_private_key> <vm_username>@<vm_ip>\n

        Note that for the majority of users, gateway_username and vm_username are the same, as are gateway_private_key and vm_private_key.
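As a rough sketch, the full command can be assembled from its variable parts by a small shell function. The helper name eidf_ssh_command and the example values are hypothetical and not part of any EIDF tooling; the function only prints the command so that it can be checked before use.

```shell
#!/bin/bash
# Hypothetical helper (not an EIDF tool): prints the full ssh command
# for review; it does not open a connection itself.
eidf_ssh_command() {
    local gateway_user=$1 gateway_key=$2 vm_user=$3 vm_key=$4 vm_ip=$5
    # Inner command that tunnels the connection through the EIDF gateway
    local proxy="ssh -i ${gateway_key} -W %h:%p ${gateway_user}@eidf-gateway.epcc.ed.ac.uk"
    printf 'ssh -o ProxyCommand="%s" -i %s %s@%s\n' \
        "${proxy}" "${vm_key}" "${vm_user}" "${vm_ip}"
}

# Placeholder values for illustration only; substitute your own
eidf_ssh_command myuser ~/.ssh/gateway_key myuser ~/.ssh/vm_key 10.0.0.1
```

Printing the command rather than running it keeps the sketch safe to try: the output can be inspected, then pasted into a terminal once the usernames, key paths and VM address are correct.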

        "},{"location":"faq/#username-policy","title":"Username Policy","text":"

        I already have an EIDF username for project Y, can I use this for project X?

        We mandate that every username must be unique across our estate. EPCC machines, including EIDF services such as the SDF and DSC VMs and HPC services such as Cirrus, require you to create a new machine account with a unique username for each project you work on. Usernames cannot be used on multiple projects, even if the previous project has finished. However, some projects span multiple machines, so you may be able to log in to multiple machines with the same username.

        "},{"location":"known-issues/","title":"Known Issues","text":""},{"location":"known-issues/#virtual-desktops","title":"Virtual desktops","text":"

        No known issues.

        "},{"location":"overview/","title":"A Unique Service for Academia and Industry","text":"

        The Edinburgh International Data Facility (EIDF) is a growing set of data and compute services developed to support the Data Driven Innovation Programme at the University of Edinburgh.

        Our goal is to support learners, researchers and innovators across the spectrum, with services from data discovery through simple learn-as-you-play-with-data notebooks to GPU-enabled machine-learning platforms for driving AI application development.

        "},{"location":"overview/#eidf-and-the-data-driven-innovation-initiative","title":"EIDF and the Data-Driven Innovation Initiative","text":"

        Launched at the end of 2018, the Data-Driven Innovation (DDI) programme is one of six funded within the Edinburgh & South-East Scotland City Region Deal. The DDI programme aims to make Edinburgh the \u201cData Capital of Europe\u201d, with ambitious targets to support, enhance and improve talent, research, commercial adoption and entrepreneurship across the region through better use of data.

        The programme targets ten industry sectors, with interactions managed through five DDI Hubs: the Bayes Centre, the Usher Institute, Edinburgh Futures Institute, the National Robotarium, and Easter Bush. The activities of these Hubs are underpinned by EIDF.

        "},{"location":"overview/acknowledgements/","title":"Acknowledging EIDF","text":"

        If you make use of EIDF services in your work, we encourage you to acknowledge us in any publications.

        Acknowledgement of using the facility in publications can be used as an identifiable metric to evaluate the scientific support provided, and helps promote the impact of the wider DDI Programme.

        We encourage our users to ensure that an acknowledgement of EIDF is included in the relevant section of their manuscript. We would suggest:

        This work was supported by the Edinburgh International Data Facility (EIDF) and the Data-Driven Innovation Programme at the University of Edinburgh.

        "},{"location":"overview/contacts/","title":"Contact","text":"

        The Edinburgh International Data Facility is located at the Advanced Computing Facility of EPCC, the supercomputing centre based at the University of Edinburgh.

        "},{"location":"overview/contacts/#email-us","title":"Email us","text":"

        Email EIDF: eidf@epcc.ed.ac.uk

        "},{"location":"overview/contacts/#sign-up","title":"Sign up","text":"

        Join our mailing list to receive updates about EIDF.

        "},{"location":"safe-haven-services/network-access-controls/","title":"Safe Haven Network Access Controls","text":"

        The TRE Safe Haven services are protected against open, global access by IPv4 source address filtering. These network access controls ensure that connections are permitted only from Safe Haven controller partner networks and collaborating research institutions.

        The network access controls are managed by the Safe Haven service controllers who instruct EPCC to add and remove the IPv4 addresses allowed to connect to the service gateway. Researchers must connect to the Safe Haven service by first connecting to their institution or corporate VPN and then connecting to the Safe Haven.

        The Safe Haven IG controller and research project co-ordination teams must submit and confirm IPv4 address filter changes to their service help desk via email.

        "},{"location":"safe-haven-services/overview/","title":"Safe Haven Services","text":"

        The EIDF Trusted Research Environment (TRE) hosts several Safe Haven services that enable researchers to work with sensitive data in a secure environment. These services are operated by EPCC in partnership with Safe Haven controllers who manage the Information Governance (IG) appropriate for the research activities and the data access of their Safe Haven service.

        It is the responsibility of EPCC as the Safe Haven operator to design, implement and administer the technical controls required to deliver the Safe Haven security regime demanded by the Safe Haven controller.

        The role of the Safe Haven controller is to satisfy the needs of the researchers and the data suppliers. The controller is responsible for guaranteeing the confidentiality needs of the data suppliers and matching these with the availability needs of the researchers.

        The service offers secure data sharing and analysis environments allowing researchers access to sensitive data under the terms and conditions prescribed by the data providers. The service prioritises the requirements of the data provider over the demands of the researcher and is an academic TRE operating under the guidance of the Five Safes framework.

        The TRE has dedicated, private cloud infrastructure at EPCC's Advanced Computing Facility (ACF) data centre and has its own HPC cluster and high-performance file systems. When a new Safe Haven service is commissioned in the TRE it is created in a new virtual private cloud providing the Safe Haven service controller with an independent IG domain separate from other Safe Havens in the TRE. All TRE service infrastructure and all TRE project data are hosted at ACF.

        If you have any questions about the EIDF TRE or about Safe Haven services, please contact us.

        "},{"location":"safe-haven-services/safe-haven-access/","title":"Safe Haven Service Access","text":"

        Safe Haven services are accessed from a registered network connection address using a browser. The service URL will be \"https://shs.epcc.ed.ac.uk/<service>\" where <service> is the Safe Haven service name.

        The Safe Haven access process is in three stages from multi-factor authentication to project desktop login.

        Researchers who are active in many research projects and in more than one Safe Haven will need to pay attention to the service they connect to, the project desktop they log in to, and the accounts and identities they are using.

        "},{"location":"safe-haven-services/safe-haven-access/#safe-haven-login","title":"Safe Haven Login","text":"

        The first step in the process prompts the user for a Safe Haven username and then for a session PIN code sent via SMS text to the mobile number registered for the username.

        Valid PIN code entry allows the user access to all of the Safe Haven service remote desktop gateways for up to 24 hours without entry of a new PIN code. A user who has successfully entered a PIN code once can access shs.epcc.ed.ac.uk/haven1 and shs.epcc.ed.ac.uk/haven2 without repeating PIN code identity verification.

        When a valid PIN code is accepted, the user is prompted to accept the service use terms and conditions.

        Registration of the user mobile phone number is managed by the Safe Haven IG controller and research project co-ordination teams by submitting and confirming user account changes through the dedicated service help desk via email.

        "},{"location":"safe-haven-services/safe-haven-access/#remote-desktop-gateway-login","title":"Remote Desktop Gateway Login","text":"

        The second step in the access process is for the user to log in to the Safe Haven service remote desktop gateway so that a project desktop connection can be chosen. The user is prompted for a Safe Haven service account identity.

        VDI Safe Haven Service Login Page

        Safe Haven accounts are managed by the Safe Haven IG controller and research project co-ordination teams by submitting and confirming user account changes through the dedicated service help desk via email.

        "},{"location":"safe-haven-services/safe-haven-access/#project-desktop-connection","title":"Project Desktop Connection","text":"

        The third stage in the process is to select the virtual connection from those available on the account's home page. An example home page is shown below offering two connection options to the same virtual machine. Remote desktop connections will have an _rdp suffix and SSH terminal connections have an _ssh suffix. The most recently used connections are shown as screen thumbnails at the top of the page and all the connections available to the user are shown in a tree list below this.

        VM connections available home page

        The remote desktop gateway software used in the Safe Haven services in the TRE is the Apache Guacamole web application. Users new to this application can find the user manual here. It is recommended that users read this short guide, but note that the data sharing features such as copy and paste, connection sharing, and file transfers are disabled on all connections in the TRE Safe Havens.

        A remote desktop or SSH connection is used to access data provided for a specific research project. If a researcher is working on multiple projects within a Safe Haven, they can only log in to one project at a time. Some connections may allow the user to log in to any project and some connections will only allow the user to log in to one specific project. This depends on project IG restrictions specified by the Safe Haven and project controllers.

        Project desktop accounts are managed by the Safe Haven IG controller and research project co-ordination teams by submitting and confirming user account changes through the dedicated service help desk via email.

        "},{"location":"safe-haven-services/using-the-hpc-cluster/","title":"Using the TRE HPC Cluster","text":""},{"location":"safe-haven-services/using-the-hpc-cluster/#introduction","title":"Introduction","text":"

        The TRE HPC system, also called the SuperDome Flex, is a single node, large memory HPC system. It is provided for compute and data intensive workloads that require more CPU, memory, and better IO performance than can be provided by the project VMs, which have the performance equivalent of small rack mount servers.

        "},{"location":"safe-haven-services/using-the-hpc-cluster/#specifications","title":"Specifications","text":"

        The system is an HPE SuperDome Flex configured with 1152 hyper-threaded cores (576 physical cores) and 18TB of memory, of which 17TB is available to users. User home and project data directories are on network mounted storage pods running the BeeGFS parallel filesystem. This storage is built in blocks of 768TB per pod. Multiple pods are available in the TRE for use by the HPC system and the total storage available will vary depending on the project configuration.

        The HPC system runs Red Hat Enterprise Linux, which is not the same flavour of Linux as the Ubuntu distribution running on the desktop VMs. However, most jobs in the TRE run Python and R, and there are few issues moving between the two versions of Linux. Use of virtual environments is strongly encouraged to ensure there are no differences between the desktop and HPC runtimes.

        "},{"location":"safe-haven-services/using-the-hpc-cluster/#software-management","title":"Software Management","text":"

        All system level software installed and configured on the TRE HPC system is managed by the TRE admin team. Software installation requests may be made by the Safe Haven IG controllers, research project co-ordinators, and researchers by submitting change requests through the dedicated service help desk via email.

        Minor software changes will be made as soon as admin effort can be allocated. Major changes are likely to be scheduled for the TRE monthly maintenance session on the first Thursday of each month.

        "},{"location":"safe-haven-services/using-the-hpc-cluster/#hpc-login","title":"HPC Login","text":"

        Login to the HPC system is from the project VM using SSH and is not direct from the VDI. The HPC cluster accounts are the same accounts used on the project VMs, with the same username and password. All project data access on the HPC system is private to the project accounts as it is on the VMs, but it is important to understand that the TRE HPC cluster is shared by projects in other TRE Safe Havens.

        To log in to the HPC cluster from the project VMs, use ssh shs-sdf01 from an xterm. If you wish to avoid entering the account password for every SSH session or remote command execution, you can use SSH key authentication by following the SSH key configuration instructions here. SSH key passphrases are not strictly enforced within the Safe Haven but are strongly encouraged.

        "},{"location":"safe-haven-services/using-the-hpc-cluster/#running-jobs","title":"Running Jobs","text":"

        To use the HPC system fully and fairly, all jobs must be run using the SLURM job manager. More information about SLURM, running batch jobs and running interactive jobs can be found here. Please read this carefully before using the cluster if you have not used SLURM before. The SLURM site also has a set of useful tutorials on HPC clusters and job scheduling.

        All analysis and processing jobs must be run via SLURM. SLURM manages access to all the cores on the system beyond the first 32. If SLURM is not used and programs are run directly from the command line, then there are only 32 cores available, and these are shared by the other users. Normal code development, short test runs, and debugging can be done from the command line without using SLURM.

        There is only one node

        The HPC system is a single node with all cores sharing all the available memory. SLURM jobs should always specify '#SBATCH --nodes=1' to run correctly.
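A minimal sketch of a check that could be run on a job script before submitting it is shown below. The function name check_single_node is hypothetical and is not a SLURM or TRE tool; it only looks for a '#SBATCH' line that requests exactly one node.

```shell
#!/bin/bash
# Hypothetical pre-submission check (not a SLURM or TRE tool): reports
# whether a job script requests exactly one node.
check_single_node() {
    local script=$1
    # Accept --nodes=1, --nodes 1, -N1 and -N 1, optionally followed by a comment
    if grep -Eq '^#SBATCH[[:space:]]+(--nodes[=[:space:]]+1|-N[[:space:]]*1)([[:space:]]|$)' "${script}"; then
        echo "ok: single node requested"
    else
        echo "missing: add '#SBATCH --nodes=1'"
    fi
}
```

For example, running check_single_node my_job.sh before sbatch my_job.sh (my_job.sh being a placeholder file name) would flag a script that omits the directive.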

        SLURM email alerts for job status change events are not supported in the TRE.

        "},{"location":"safe-haven-services/using-the-hpc-cluster/#resource-limits","title":"Resource Limits","text":"

        There are no resource constraints imposed on the default SLURM partition at present. There are user limits (see the output of ulimit -a). If a project has a requirement for more than 200 cores, more than 4TB of memory, or an elapsed runtime of more than 96 hours, a resource reservation request should be made by the researchers through email to the service help desk.
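The thresholds above can be written down as a small shell check. This is only an illustrative sketch: the function name reservation_advice is hypothetical, and it assumes that 4TB is treated as 4096GB.

```shell
#!/bin/bash
# Hypothetical helper: compares a planned job with the thresholds quoted
# above (200 cores, 4TB of memory, 96 hours elapsed runtime).
# Assumption: 4TB is treated as 4096GB.
reservation_advice() {
    local cores=$1 mem_gb=$2 hours=$3
    if [ "${cores}" -gt 200 ] || [ "${mem_gb}" -gt 4096 ] || [ "${hours}" -gt 96 ]; then
        echo "reservation request needed"
    else
        echo "no reservation needed"
    fi
}

reservation_advice 1 512 2       # one core, 512GB, 2 hours
reservation_advice 256 1024 24   # more than 200 cores
```

Arguments are whole numbers of cores, gigabytes and hours; a job that sits exactly on a threshold is treated as not needing a reservation, matching the "more than" wording.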

        There are no storage quotas enforced in the HPC cluster storage at present. The project storage requirements are negotiated, and space allocated before the project accounts are released. Storage use is monitored, and guidance will be issued before quotas are imposed on projects.

        The HPC system is managed in the spirit of utilising it as fully as possible and as fairly as possible. This approach works best when researchers are aware of their project workload demands and cooperate rather than compete for cluster resources.

        "},{"location":"safe-haven-services/using-the-hpc-cluster/#python-jobs","title":"Python Jobs","text":"

        A basic script to run a Python job in a virtual environment is shown below.

        #!/bin/bash\n#\n#SBATCH --export=ALL                  # Job inherits all env vars\n#SBATCH --job-name=my_job_name        # Job name\n#SBATCH --mem=512G                    # Job memory request\n#SBATCH --output=job-%j.out           # Standard output file\n#SBATCH --error=job-%j.err            # Standard error file\n#SBATCH --nodes=1                     # Run on a single node\n#SBATCH --ntasks=1                    # Run a single task\n#SBATCH --time=02:00:00               # Time limit hrs:min:sec\n#SBATCH --partition standard          # Run on partition (queue)\n\npwd\nhostname\ndate \"+DATE: %d/%m/%Y TIME: %H:%M:%S\"\necho \"Running job on a single CPU core\"\n\n# Activate the job\u2019s virtual environment\nsource ${HOME}/my_venv/bin/activate\n\n# Run the job code\npython3 ${HOME}/my_job.py\n\ndate \"+DATE: %d/%m/%Y TIME: %H:%M:%S\"\n
        "},{"location":"safe-haven-services/using-the-hpc-cluster/#mpi-jobs","title":"MPI Jobs","text":"

        An example script for a multi-process MPI job is shown below. The system currently supports MPICH MPI.

        #!/bin/bash\n#\n#SBATCH --export=ALL\n#SBATCH --job-name=mpi_test\n#SBATCH --output=job-%j.out\n#SBATCH --error=job-%j.err\n#SBATCH --nodes=1\n#SBATCH --ntasks-per-node=5\n#SBATCH --cpus-per-task=1             # Sets SLURM_CPUS_PER_TASK, used below\n#SBATCH --time=05:00\n#SBATCH --partition standard\n\necho \"Submitted MPICH job\"\necho \"Running on host ${HOSTNAME}\"\necho \"Using ${SLURM_NTASKS_PER_NODE} tasks per node\"\necho \"Using ${SLURM_CPUS_PER_TASK} cpus per task\"\nmpi_threads=$(( SLURM_NTASKS_PER_NODE * SLURM_CPUS_PER_TASK ))\necho \"Using ${mpi_threads} MPI threads\"\n\n# Load the MPICH module\nmodule purge\nmodule load mpi/mpich-x86_64\n\n# Run the MPI program\nmpirun ${HOME}/test_mpi\n
        "},{"location":"safe-haven-services/using-the-hpc-cluster/#managing-files-and-data","title":"Managing Files and Data","text":"

        There are three file systems to manage in the VM and HPC environment.

        1. The desktop VM /home file system. This can only be used when you log in to the VM remote desktop. This file system is local to the VM and is not backed up.
        2. The HPC system /home file system. This can only be used when you log in to the HPC system using SSH from the desktop VM. This file system is local to the HPC cluster and is not backed up.
        3. The project file and data space in the /safe_data file system. This file system can only be used when you log in to a VM remote desktop session. This file system is backed up.

        The /safe_data file system with the project data cannot be used by the HPC system. The /safe_data file system has restricted access and a relatively slow IO performance compared to the parallel BeeGFS file system storage on the HPC system.

        The process to use the TRE HPC service is to copy and synchronise the project code and data files on the /safe_data file system with the HPC /home file system before and after login sessions and job runs on the HPC cluster. Assuming all the code and data required for the job is in a directory 'current_wip' on the project VM, the workflow is as follows:

        1. Copy project code and data to the HPC cluster (from the desktop VM): rsync -avPz -e ssh /safe_data/my_project/current_wip shs-sdf01:
        2. Run jobs/tests/analysis: ssh shs-sdf01, cd current_wip, sbatch/srun my_job
        3. Copy any changed project code and data back to /safe_data (from the desktop VM): rsync -avPz -e ssh shs-sdf01:current_wip /safe_data/my_project
        4. Optionally delete the code and data from the HPC cluster working directory.
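The copy-out and copy-back steps of the workflow above could be wrapped in a short script run on the desktop VM. The following is a sketch only: the variable names, the function names and the DRY_RUN switch are illustrative assumptions, while the paths, host name and rsync options are those used in the steps above. By default it only prints the commands.

```shell
#!/bin/bash
# Sketch of the copy-out / copy-back steps. The variable and function names
# are illustrative; the paths and host name are the ones used in the text.
PROJECT_DIR=/safe_data/my_project
WORK_DIR=current_wip
HPC_HOST=shs-sdf01
DRY_RUN=${DRY_RUN:-1}    # 1 = only print the commands (default), 0 = run them

# Print the command in dry-run mode, otherwise execute it
run() {
    if [ "${DRY_RUN}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Step 1: copy code and data to the HPC /home file system
push_to_hpc() {
    run rsync -avPz -e ssh "${PROJECT_DIR}/${WORK_DIR}" "${HPC_HOST}:"
}

# Step 3: copy changed files back to /safe_data
pull_from_hpc() {
    run rsync -avPz -e ssh "${HPC_HOST}:${WORK_DIR}" "${PROJECT_DIR}"
}

push_to_hpc
pull_from_hpc
```

Setting DRY_RUN=0 in the environment makes the same functions run rsync for real; step 2 (running the jobs on the HPC cluster) still happens in a separate SSH session between the two copies.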
        "},{"location":"safe-haven-services/virtual-desktop-connections/","title":"Virtual Machine Connections","text":"

        Sessions on project VMs may be either remote desktop (RDP) logins or SSH terminal logins. Most users will prefer to use the remote desktop connections, but the SSH terminal connection is useful when remote network performance is poor and it must be used for account password changes.

        "},{"location":"safe-haven-services/virtual-desktop-connections/#first-time-login-and-account-password-changes","title":"First Time Login and Account Password Changes","text":"

        Account Password Changes

        Note that first time account login cannot be through RDP as a password change is required. Password reset logins must be SSH terminal sessions as password changes can only be made through SSH connections.

        "},{"location":"safe-haven-services/virtual-desktop-connections/#connecting-to-a-remote-ssh-session","title":"Connecting to a Remote SSH Session","text":"

        When a VM SSH connection is selected the browser screen becomes a text terminal and the user is prompted to \"Login as: \" with a project account name, and then prompted for the account password. This connection type is equivalent to a standard xterm SSH session.

        "},{"location":"safe-haven-services/virtual-desktop-connections/#connecting-to-a-remote-desktop-session","title":"Connecting to a Remote Desktop Session","text":"

        Remote desktop connections work best by first placing the browser in Full Screen mode and leaving it in this mode for the entire duration of the Safe Haven session.

        When a VM RDP connection is selected the browser screen becomes a remote desktop presenting the login screen shown below.

        VM virtual desktop connection user account login screen

        Once the project account credentials have been accepted, a remote desktop similar to the one shown below is presented. The default VM environment in the TRE is Ubuntu 22.04 with the Xfce desktop.

        VM virtual desktop

        "},{"location":"safe-haven-services/superdome-flex-tutorial/L1_Accessing_the_SDF_Inside_the_EPCC_TRE/","title":"Accessing the Superdome Flex inside the EPCC Trusted Research Environment","text":""},{"location":"safe-haven-services/superdome-flex-tutorial/L1_Accessing_the_SDF_Inside_the_EPCC_TRE/#what-is-the-superdome-flex","title":"What is the Superdome Flex?","text":"

        The Superdome Flex (SDF) is a high-performance computing cluster manufactured by Hewlett Packard Enterprise. It has been designed to handle multi-core, high-memory tasks in environments where security is paramount. The hardware specifications of the SDF within the Trusted Research Environment (TRE) are as follows:

        • 576 physical cores (1152 hyper-threaded cores)
        • 18TB of dynamic memory (17 TB available to users)
        • 768TB or more of permanent memory

        The software specifications of the SDF are:

        • Red Hat Enterprise Linux (v8.7 as of 27/03/23)
        • Slurm job manager
        • Access to local copies of R (CRAN) and python (conda) repositories
        • Singularity container platform
        "},{"location":"safe-haven-services/superdome-flex-tutorial/L1_Accessing_the_SDF_Inside_the_EPCC_TRE/#key-point","title":"Key Point","text":"

        The SDF is within the TRE. Therefore, the same restrictions apply, i.e. the SDF is isolated from the internet (no downloading code from public GitHub repos) and copying/recording/extracting anything on the SDF outside of the TRE is strictly prohibited unless through approved processes.

        "},{"location":"safe-haven-services/superdome-flex-tutorial/L1_Accessing_the_SDF_Inside_the_EPCC_TRE/#accessing-the-sdf","title":"Accessing the SDF","text":"

        Users can only access the SDF by ssh-ing into it via their VM desktop.

        "},{"location":"safe-haven-services/superdome-flex-tutorial/L1_Accessing_the_SDF_Inside_the_EPCC_TRE/#hello-world","title":"Hello world","text":"
        **** On the VM desktop terminal ****\n\nssh shs-sdf01\n<Enter VM password>\n\necho \"Hello World\"\n\nexit\n
        "},{"location":"safe-haven-services/superdome-flex-tutorial/L1_Accessing_the_SDF_Inside_the_EPCC_TRE/#sdf-vs-vm-file-systems","title":"SDF vs VM file systems","text":"

        The SDF file system is separate from the VM file system, which is again separate from the project data space. Files need to be transferred between the three systems for any analysis to be completed within the SDF.

        "},{"location":"safe-haven-services/superdome-flex-tutorial/L1_Accessing_the_SDF_Inside_the_EPCC_TRE/#example-showing-separate-sdf-and-vm-file-systems","title":"Example showing separate SDF and VM file systems","text":"
        **** On the VM desktop terminal ****\n\ncd ~\ntouch test.txt\nls\n\nssh shs-sdf01\n<Enter VM password>\n\nls # test.txt is not here\nexit\n\nscp test.txt shs-sdf01:/home/<USERNAME>/\n\nssh shs-sdf01\n<Enter VM password>\n\nls # test.txt is here\n
        "},{"location":"safe-haven-services/superdome-flex-tutorial/L1_Accessing_the_SDF_Inside_the_EPCC_TRE/#example-copying-data-between-project-data-space-and-sdf","title":"Example copying data between project data space and SDF","text":"

        Transferring and synchronising data sets between the project data space and the SDF is easier with the rsync command (rather than manually checking and copying files/folders with scp). rsync only transfers files that differ between the two targets; more details are in its manual.

        **** On the VM desktop terminal ****\n\nman rsync # check instructions for using rsync\n\nrsync -avPz -e ssh /safe_data/my_project/ shs-sdf01:/home/<USERNAME>/my_project/ # sync project folder and SDF home folder\n\nssh shs-sdf01\n<Enter VM password>\n\n*** Conduct analysis on SDF ***\n\nexit\n\nrsync -avPz -e ssh shs-sdf01:/home/<USERNAME>/my_project/ /safe_data/my_project/ # re-synchronise the SDF home folder back to the project folder\n\n*** Optionally remove the project folder on SDF ***\n
        "},{"location":"safe-haven-services/superdome-flex-tutorial/L2_running_R_Python_analysis_scripts/","title":"Running R/Python Scripts","text":"

        Running analysis scripts on the SDF is slightly different to running scripts on the Desktop VMs. The Linux distribution differs between the two with the SDF using Red Hat Enterprise Linux (RHEL) and the Desktop VMs using Ubuntu. Therefore, it is highly advisable to use virtual environments (e.g. conda environments) to complete any analysis and aid the transition between the two distributions. Conda should run out of the box on the Desktop VMs, but some configuration is required on the SDF.

        "},{"location":"safe-haven-services/superdome-flex-tutorial/L2_running_R_Python_analysis_scripts/#setting-up-conda-environments-on-you-first-connection-to-the-sdf","title":"Setting up conda environments on your first connection to the SDF","text":"
        *** SDF Terminal ***\n\nconda activate base # Test conda environment\n\n# Conda command will not be found. There is no need to install!\n\neval \"$(/opt/anaconda3/bin/conda shell.bash hook)\" # Tells your terminal where conda is\n\nconda init # changes your .bashrc file so conda is automatically available in the future\n\nconda config --set auto_activate_base false # stop conda base from being activated on startup\n\npython # note python version\n\nexit()\n

        The base conda environment is now available, but note that the Python interpreter and gcc compiler are not the latest versions (Python 3.9.7 and gcc 7.5.0).

        "},{"location":"safe-haven-services/superdome-flex-tutorial/L2_running_R_Python_analysis_scripts/#getting-an-up-to-date-python-version","title":"Getting an up-to-date python version","text":"

        In order to get an up-to-date python version we first need to use an updated gcc version. Fortunately, conda has an updated gcc toolset that can be installed.

        *** SDF Terminal ***\n\nconda activate base # If conda isn't already active\n\nconda create -n python-v3.11 gcc_linux-64=11.2.0 python=3.11.3\n\nconda activate python-v3.11\n\npython\n\nexit()\n
        "},{"location":"safe-haven-services/superdome-flex-tutorial/L2_running_R_Python_analysis_scripts/#running-r-scripts-on-the-sdf","title":"Running R scripts on the SDF","text":"

        The default version of R available on the SDF is v4.1.2. Alternative R versions can be installed using conda similar to the python conda environment above.

        conda create -n r-v4.3 gcc_linux-64=11.2.0 r-base=4.3\n\nconda activate r-v4.3\n\nR\n\nq()\n
        "},{"location":"safe-haven-services/superdome-flex-tutorial/L2_running_R_Python_analysis_scripts/#final-points","title":"Final points","text":"
        • The SDF, like the rest of the SHS, is separated from the internet. The installation of python/R libraries to your environment is from a local copy of the respective conda/CRAN library repositories. Therefore, not all packages may be available and not all package versions may be available.

        • It is discouraged to run extensive python/R analyses without submitting them as job requests using Slurm.

        "},{"location":"safe-haven-services/superdome-flex-tutorial/L3_submitting_scripts_to_slurm/","title":"Submitting Scripts to Slurm","text":""},{"location":"safe-haven-services/superdome-flex-tutorial/L3_submitting_scripts_to_slurm/#what-is-slurm","title":"What is Slurm?","text":"

        Slurm is a workload manager that schedules jobs submitted to a shared resource. Slurm is a well-developed tool that can manage large computing clusters, such as ARCHER2, with thousands of users each with different priorities and allocated computing hours. Inside the TRE, Slurm is used to help ensure all users of the SDF get equitable access. Therefore, users who are submitting jobs with high resource requirements (>80 cores, >1TB of memory) may have to wait longer for resource allocation to enable users with lower resource demands to continue their work.

        Slurm is currently set up so all users have equal priority and there is no limit to the total number of CPU hours allocated to a user per month. However, there are limits to the maximum amount of resources that can be allocated to an individual job. Jobs that require more than 200 cores, more than 4TB of memory, or an elapsed runtime of more than 96 hours will be rejected. If users need to submit jobs with large resource demand, they need to submit a resource reservation request by emailing their project's service desk.
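
The per-job limits above can be checked before a job is submitted. The following is a minimal sketch and is not part of Slurm or the SDF tooling: the function name check_request and its parameters are hypothetical, the thresholds are simply the figures quoted in this section, and the total core count is assumed to be the number of tasks multiplied by the CPUs per task.

```python
# Hypothetical pre-submission check; the limits are the values quoted above.
MAX_CORES = 200
MAX_MEMORY_TB = 4
MAX_HOURS = 96


def check_request(ntasks, cpus_per_task, memory_tb, hours):
    '''Return a list of reasons why a request would exceed the per-job limits.'''
    problems = []
    cores = ntasks * cpus_per_task  # assumed total cores for the job
    if cores > MAX_CORES:
        problems.append(f'{cores} cores requested, limit is {MAX_CORES}')
    if memory_tb > MAX_MEMORY_TB:
        problems.append(f'{memory_tb} TB requested, limit is {MAX_MEMORY_TB} TB')
    if hours > MAX_HOURS:
        problems.append(f'{hours} hours requested, limit is {MAX_HOURS}')
    return problems


# A small request passes; a larger one reports each limit it breaks.
print(check_request(ntasks=10, cpus_per_task=2, memory_tb=0.01, hours=1))
print(check_request(ntasks=64, cpus_per_task=4, memory_tb=1, hours=120))
```

An empty list means the request is within the quoted limits; anything else would need a resource reservation request.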

        "},{"location":"safe-haven-services/superdome-flex-tutorial/L3_submitting_scripts_to_slurm/#why-do-you-need-to-use-slurm","title":"Why do you need to use Slurm?","text":"

        The SDF is a resource shared across all projects within the TRE and all users should have equal opportunity to use the SDF to complete resource-intense tasks appropriate to their projects. Users of the SDF are required to consider the needs of the wider community by:

        • requesting resources appropriate to their intended task and timeline.

        • submitting resource requests via Slurm to enable automatic scheduling and fair allocation alongside other user requests.

        Users can develop code, complete test runs, and debug from the SDF command line without using Slurm. However, only 32 of the 512 cores are accessible without submitting a job request to Slurm. These cores are accessible to all users simultaneously.

        "},{"location":"safe-haven-services/superdome-flex-tutorial/L3_submitting_scripts_to_slurm/#slurm-basics","title":"Slurm basics","text":"

        Slurm revolves around four main entities: nodes, partitions, jobs and job steps. Nodes and partitions are relevant for more complex distributed computing clusters so Slurm can allocate appropriate resources to jobs across multiple pieces of hardware. Jobs are requests for resources and job steps are what need to be completed once the resources have been allocated (completed in sequence or parallel). Job steps can be further broken down into tasks.

        There are four key commands for Slurm users:

        • squeue: get details on a job or job step, i.e. has a job been allocated resources or is it still pending?

        • srun: initiate a job step or execute a job. A versatile command that can initiate job steps as part of a larger batch job or submit a job itself to get resources and run a job step. This is useful for testing job steps, experimenting with different resource allocations or running job steps that require large resources but are relatively easy to define in a line or two (e.g. running a sequence alignment).

        • scancel: stop a job from continuing.

        • sbatch: submit a job script containing multiple steps (e.g. srun commands) to be completed with the defined resources. This is the typical command for submitting jobs to Slurm.

        More details on these functions (and several not mentioned here) can be seen on the Slurm website.

        "},{"location":"safe-haven-services/superdome-flex-tutorial/L3_submitting_scripts_to_slurm/#submitting-a-simple-job","title":"Submitting a simple job","text":"
        *** SDF Terminal ***\n\nsqueue -u $USER # Check if there are jobs already queued or running for you\n\nsrun --job-name=my_first_slurm_job --nodes 1 --ntasks 10 --cpus-per-task 2 echo 'Hello World'\n\nsqueue -u $USER --state=CD # List all completed jobs\n

        In this instance, the srun command completes two steps: job submission and job step execution. First, it submits a job request to be allocated 20 CPUs (2 CPUs for each of the 10 tasks). Once the resources are available, it executes the job step consisting of 10 tasks each running the echo 'Hello World' command.

        srun accepts a wide variety of options to specify the resources required to complete its job step. Within the SDF, you must always request 1 node (as there is only one node) and never use the --exclusive option (as no one will have exclusive access to this shared resource). Notice that running srun blocks your terminal from accepting any more commands and that the output from each task in the job step, i.e. Hello World in the above example, is printed to your terminal. We will compare this to running an sbatch command.

        "},{"location":"safe-haven-services/superdome-flex-tutorial/L3_submitting_scripts_to_slurm/#submitting-a-batch-job","title":"Submitting a batch job","text":"

        Batch jobs are incredibly useful because they run in the background without blocking your terminal. Batch jobs also output the results to a log file rather than straight to your terminal. This allows you to check a job was completed successfully at a later time so you can move on to other things whilst waiting for a job to complete.

        A batch job can be submitted to Slurm by passing a job script to the sbatch command. The first few lines of a job script outline the resources to be requested as part of the job. The remainder of a job script consists of one or more srun commands outlining the job steps that need to be completed (in sequence or parallel) once the resources are available. There are numerous options for defining the resource requirements of a job including:

        • The number of CPUs available for each active task: --cpus-per-task
        • The amount of memory available per CPU (in MB by default but can be in GB if G is appended to the number): --mem-per-cpu
        • The total amount of memory (in MB by default but can be in GB if G is appended to the number): --mem
        • The maximum number of tasks invoked at one time: --ntasks
        • The number of nodes (Always 1 when using SDF): --nodes
        • If the job requires exclusive access to all the resources of a node (never part of job request, but required for parallel job steps within batch scripts): --exclusive

        More information on the various options is in the sbatch documentation.

        "},{"location":"safe-haven-services/superdome-flex-tutorial/L3_submitting_scripts_to_slurm/#example-job-script","title":"Example Job Script","text":"
        #!/usr/bin/env bash\n#SBATCH -J HelloWorld\n#SBATCH --output=example_job_script.log\n#SBATCH --nodes=1\n#SBATCH --ntasks-per-node=10\n#SBATCH --cpus-per-task=2\n\n# Run echo task in sequence\n\nsrun --ntasks 5 --cpus-per-task 2 echo \"Series Task A. Time: \" $(date +\"%H:%M:%S\")\n\nsrun --ntasks 5 --cpus-per-task 2 echo \"Series Task B. Time: \" $(date +\"%H:%M:%S\")\n\n# Run echo task in parallel with the ampersand character\n\nsrun --exclusive --ntasks 5 --cpus-per-task 2 echo \"Parallel Task A. Time: \" $(date +\"%H:%M:%S\") &\n\nsrun --exclusive --ntasks 5 --cpus-per-task 2 echo \"Parallel Task B. Time: \" $(date +\"%H:%M:%S\")\n
        "},{"location":"safe-haven-services/superdome-flex-tutorial/L3_submitting_scripts_to_slurm/#example-job-script-submission","title":"Example job script submission","text":"
        *** SDF Terminal ***\n\nnano example_job_script.sh\n\n*** Copy example job script above ***\n\nsbatch example_job_script.sh\n\nsqueue -u $USER -i 5 # Refresh the queue listing every 5 seconds (Ctrl+C to stop)\n\n*** Wait for the batch job to be completed ***\n\ncat example_job_script.log # The series tasks should be grouped together and the parallel tasks interspersed.\n

        The example batch job is intended to show two things: 1) the usefulness of the sbatch command and 2) the versatility of a job script. As the sbatch command allows you to submit scripts and check their outcome at your own discretion, it is the most common way of interacting with Slurm. Meanwhile, the job script allows you to specify one global resource request and break it up into multiple job steps with different resource demands that can be completed in parallel or in sequence.

        "},{"location":"safe-haven-services/superdome-flex-tutorial/L3_submitting_scripts_to_slurm/#submitting-pythonr-code-to-slurm","title":"Submitting python/R code to Slurm","text":"

        Although submitting job steps containing python/R analysis scripts can be done with srun directly, as below, it is more common to submit bash scripts that call the analysis scripts after setting up the environment (i.e. after calling conda activate).

        **** Python code job submission ****\n\nsrun --job-name=my_first_python_job --nodes 1 --ntasks 10 --cpus-per-task 2 --mem 10G python3 example_script.py\n\n**** R code job submission ****\n\nsrun --job-name=my_first_r_job --nodes 1 --ntasks 10 --cpus-per-task 2 --mem 10G Rscript example_script.R\n
        "},{"location":"safe-haven-services/superdome-flex-tutorial/L3_submitting_scripts_to_slurm/#signposting","title":"Signposting","text":"

        Useful websites for learning more about Slurm:

        • The Slurm documentation website

        • The Introduction to HPC carpentries lesson on Slurm

        "},{"location":"safe-haven-services/superdome-flex-tutorial/L4_parallelised_python_analysis/","title":"Parallelised Python analysis with Dask","text":"

        This lesson is adapted from a workshop introducing users to running python scripts on ARCHER2 as developed by Adrian Jackson.

        "},{"location":"safe-haven-services/superdome-flex-tutorial/L4_parallelised_python_analysis/#introduction","title":"Introduction","text":"

        Python does not have native support for parallelisation. Python contains a Global Interpreter Lock (GIL) which means the python interpreter only allows one thread to execute at a time. The advantage of the GIL is that C libraries can be easily integrated into Python scripts without checking if they are thread-safe. However, this means that most common python modules cannot be easily parallelised. Fortunately, there are now several re-implementations of common python modules that work around the GIL and are therefore parallelisable. Dask is a python module that contains a parallelised version of the pandas data frame as well as a general format for parallelising any python code.

        "},{"location":"safe-haven-services/superdome-flex-tutorial/L4_parallelised_python_analysis/#dask","title":"Dask","text":"

        Dask enables thread-safe parallelised python execution by creating task graphs (a graph of the dependencies of the inputs and outputs of each function) and then deducing which ones can be run separately. This lesson introduces some general concepts required for programming using Dask. There are also some exercises with example answers to help you write your first parallelised python scripts.
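
To make the idea of a task graph concrete, the sketch below uses only the Python standard library (graphlib, Python 3.9+) rather than Dask itself. The graph, the task names and the helper independent_waves are illustrative assumptions; Dask builds and schedules its own graphs internally, and this only shows how dependencies determine which tasks could run at the same time.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each key depends on the names in its set:
# a and b are both computed from x, and c combines a and b.
graph = {'a': {'x'}, 'b': {'x'}, 'c': {'a', 'b'}}


def independent_waves(dependencies):
    '''Group tasks into waves; tasks in one wave do not depend on each other.'''
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks whose inputs are all complete
        waves.append(ready)
        sorter.done(*ready)  # mark the wave finished so dependants become ready
    return waves


print(independent_waves(graph))
```

Here a and b land in the same wave because neither needs the other's result, so a scheduler would be free to run them in parallel.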

        "},{"location":"safe-haven-services/superdome-flex-tutorial/L4_parallelised_python_analysis/#arrays-data-frames-and-bags","title":"Arrays, data frames and bags","text":"

        Dask contains three data objects to enable parallelised analysis of large data sets in a way familiar to most python programmers. If the same operations are being applied to a large data set then Dask can split up the data set and apply the operations in parallel. The three data objects that Dask can easily split up are:

        • Arrays: Contains large numbers of elements in multiple dimensions, but each element must be of the same type. Each element has a unique index that allows users to specify changes to individual elements.

        • Data frames: Contains large numbers of elements which are typically highly structured with multiple object types allowed together. Each element has a unique index that allows users to specify changes to individual elements.

        • Bags: Contains large numbers of elements which are semi/un-structured. Elements are immutable once inside the bag. Bags are useful for conducting initial analysis/wrangling of raw data before more complex analysis is performed.

        "},{"location":"safe-haven-services/superdome-flex-tutorial/L4_parallelised_python_analysis/#example-dask-array","title":"Example Dask array","text":"

        You may need to install dask or create a new conda environment with it in.

        conda create -n dask-env gcc_linux-64=11.2.0 python=3.11.3 dask\n\nconda activate dask-env\n

        Try running the following Python using dask:

        import dask.array as da\n\nx = da.random.random((10000, 10000), chunks=(1000, 1000))\n\nprint(x)\n\nprint(x.compute())\n\nprint(x.sum())\n\nprint(x.sum().compute())\n

        This should demonstrate that dask makes simple parallelism straightforward to implement, but is also lazy: it does not compute anything until you force it to with the .compute() function.

        You can also try out dask DataFrames, using the following code:

        import dask.dataframe as dd\n\ndf = dd.read_csv('surveys.csv')\n\ndf.head()\ndf.tail()\n\ndf.weight.max().compute()\n

        You can try using different blocksizes when reading in the csv file and then undertaking an operation on the data, as shown below. Be aware that making the blocksize too small is likely to cause poor performance (the blocksize sets the number of bytes read in at each operation).

        df = dd.read_csv('surveys.csv', blocksize=\"10000\")\ndf.weight.max().compute()\n
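
As a rough guide to why a very small blocksize hurts, the number of pieces a file is split into grows as the blocksize shrinks, and each piece carries some scheduling overhead. The sketch below is plain arithmetic, not a Dask call: the helper estimated_partitions and the 50 MB file size are made-up values for illustration, and the real partition count chosen by Dask may differ.

```python
import math


def estimated_partitions(file_size_bytes, blocksize_bytes):
    '''Approximate number of blocks needed to cover a file of the given size.'''
    return math.ceil(file_size_bytes / blocksize_bytes)


file_size = 50_000_000  # hypothetical 50 MB csv file

for blocksize in (10_000, 1_000_000, 64_000_000):
    print(blocksize, estimated_partitions(file_size, blocksize))
```

With the smallest blocksize the hypothetical file is cut into thousands of pieces, which is where the overhead starts to dominate.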

        You can also experiment with Dask Bags to see how that functionality works:

        import dask.bag as db\nfrom operator import add\nb = db.from_sequence([1, 2, 3, 4, 5], npartitions=2)\nprint(b.compute())\n
        "},{"location":"safe-haven-services/superdome-flex-tutorial/L4_parallelised_python_analysis/#dask-delayed","title":"Dask Delayed","text":"

        Dask delayed lets you construct your own task graphs/parallelism from Python functions. You can find out more about dask delayed from the dask documentation. Try parallelising the code below using the dask.delayed function or the @delayed decorator; an example answer can be found here.

        def inc(x):\n    return x + 1\n\ndef double(x):\n    return x * 2\n\ndef add(x, y):\n    return x + y\n\ndata = [1, 2, 3, 4, 5]\n\noutput = []\nfor x in data:\n    a = inc(x)\n    b = double(x)\n    c = add(a, b)\n    output.append(c)\n\ntotal = sum(output)\n\nprint(total)\n
        "},{"location":"safe-haven-services/superdome-flex-tutorial/L4_parallelised_python_analysis/#mandelbrot-exercise","title":"Mandelbrot Exercise","text":"

        The code below calculates the members of a Mandelbrot set using Python functions:

        import sys\nimport time\nimport numpy as np\nimport matplotlib.pyplot as plt\n\ndef mandelbrot(h, w, maxit=20, r=2):\n    \"\"\"Returns an image of the Mandelbrot fractal on a (3*w+1) x (4*h+1) grid and the time taken.\"\"\"\n    start = time.time()\n\n    x = np.linspace(-2.5, 1.5, 4*h+1)\n\n    y = np.linspace(-1.5, 1.5, 3*w+1)\n\n    A, B = np.meshgrid(x, y)\n\n    C = A + B*1j\n\n    z = np.zeros_like(C)\n\n    divtime = maxit + np.zeros(z.shape, dtype=int)\n\n    for i in range(maxit):\n        z = z**2 + C\n        diverge = abs(z) > r # who is diverging\n        div_now = diverge & (divtime == maxit) # who is diverging now\n        divtime[div_now] = i # note when\n        z[diverge] = r # avoid diverging too much\n\n    end = time.time()\n\n    return divtime, end-start\n\nh = 2000\nw = 2000\n\n# Use a separate name for the elapsed time so the time module is not overwritten\nmandelbrot_space, elapsed = mandelbrot(h, w)\n\nplt.imshow(mandelbrot_space)\n\nprint(elapsed)\n

        Your task is to parallelise this code using Dask Array functionality. Using the base python code above, extend it with Dask Array for the main arrays in the computation. Remember you need to specify a chunk size with Dask Arrays, and you will also need to call compute at some point to force Dask to actually undertake the computation. Note, depending on where you run this you may not see any actual speed up of the computation. You need access to extra resources (compute cores) for the calculation to go faster. If in doubt, submit a python script of your solution to the SDF compute nodes to see if you see speed up there. If you are struggling with this parallelisation exercise, there is a solution available for you here.

        "},{"location":"safe-haven-services/superdome-flex-tutorial/L4_parallelised_python_analysis/#pi-exercise","title":"Pi Exercise","text":"

        The code below calculates Pi using a function that can split it up into chunks and calculate each chunk separately. Currently it uses a single chunk to produce the final value of Pi, but that can be changed by calling pi_chunk multiple times with different inputs. This is not necessarily the most efficient method for calculating Pi in serial, but it does enable parallelisation of the calculation of Pi using multiple copies of pi_chunk called simultaneously.

        import time\nimport sys\n\n# Calculate pi in chunks\n\n# n     - total number of steps to be undertaken across all chunks\n# lower - the lowest number of this chunk\n# upper - the upper limit of this chunk such that i < upper\n\ndef pi_chunk(n, lower, upper):\n    step = 1.0 / n\n    p = step * sum(4.0/(1.0 + ((i + 0.5) * (i + 0.5) * step * step)) for i in range(lower, upper))\n    return p\n\n# Number of slices\n\nnum_steps = 10000000\n\nprint(\"Calculating PI using:\\n \" + str(num_steps) + \" slices\")\n\nstart = time.time()\n\n# Calculate using a single chunk containing all steps (i = 0 .. num_steps - 1)\n\np = pi_chunk(num_steps, 0, num_steps)\n\nstop = time.time()\n\nprint(\"Obtained value of Pi: \" + str(p))\n\nprint(\"Time taken: \" + str(stop - start) + \" seconds\")\n

        For this exercise, your task is to implement the above code on the SDF, and then parallelise it using Dask. There are a number of different ways you could parallelise this using Dask, but we suggest using the Futures map functionality to run the pi_chunk function on a range of different inputs. Futures map has the following definition:

        Client.map(func, *iterables[, key, workers, ...])\n

        Where func is the function you want to run, and then the subsequent arguments are inputs to that function. To utilise this for the Pi calculation, you will first need to setup and configure a Dask Client to use, and also create and populate lists or vectors of inputs to be passed to the pi_chunk function for each function run that Dask launches.
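
One way to lay out the input lists is sketched below. To keep the example self-contained it uses concurrent.futures from the standard library as a stand-in for a Dask Client, so it only shows how the per-chunk inputs are built and mapped; because of the GIL, the thread pool used here gives no real speed-up. The helper chunk_bounds and the choice of four chunks are assumptions for illustration, and the chunks are taken to cover i = 0 up to n - 1.

```python
from concurrent.futures import ThreadPoolExecutor


def pi_chunk(n, lower, upper):
    '''Midpoint-rule contribution to pi from steps lower <= i < upper.'''
    step = 1.0 / n
    return step * sum(4.0 / (1.0 + ((i + 0.5) * step) ** 2) for i in range(lower, upper))


def chunk_bounds(n, chunks):
    '''Split range(0, n) into contiguous chunks; returns lists of lowers and uppers.'''
    edges = [n * k // chunks for k in range(chunks + 1)]
    return edges[:-1], edges[1:]


num_steps = 100000
num_chunks = 4
lowers, uppers = chunk_bounds(num_steps, num_chunks)

# One list per pi_chunk argument, with one entry per function call.
with ThreadPoolExecutor(max_workers=num_chunks) as pool:
    parts = list(pool.map(pi_chunk, [num_steps] * num_chunks, lowers, uppers))

print(sum(parts))
```

The same three lists match the func, *iterables shape shown in the definition above; with Dask the mapped call returns futures whose results then need to be gathered and summed.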

        If you run Dask with processes then it is possible that you will get errors about forking processes, such as these:

            An attempt has been made to start a new process before the current process has finished its bootstrapping phase.\n    This probably means that you are not using fork to start your child processes and you have forgotten to use the proper idiom in the main module:\n

        In that case you need to encapsulate your code within a main function, using something like this:

        if __name__ == \"__main__\":\n

        If you are struggling with this exercise then there is a solution available for you here.

        "},{"location":"safe-haven-services/superdome-flex-tutorial/L4_parallelised_python_analysis/#signposting","title":"Signposting","text":"
        • More information on parallelised python code can be found in the carpentries lesson

        • Dask itself has several detailed tutorials

        "},{"location":"safe-haven-services/superdome-flex-tutorial/L5_parallelised_r_analysis/","title":"Parallelised R Analysis","text":"

        This lesson is adapted from a workshop introducing users to running R scripts on ARCHER2 as developed by Adrian Jackson.

        "},{"location":"safe-haven-services/superdome-flex-tutorial/L5_parallelised_r_analysis/#introduction","title":"Introduction","text":"

        In this exercise we are going to try different methods of parallelising R on the SDF. This will include single node parallelisation functionality (e.g. using threads or processes to use cores within a single node), and distributed memory functionality that enables the parallelisation of R programs across multiple nodes.

        "},{"location":"safe-haven-services/superdome-flex-tutorial/L5_parallelised_r_analysis/#example-parallelised-r-code","title":"Example parallelised R code","text":"

        You may need to activate an R conda environment.

        conda activate r-v4.3\n

        Try running the following R script using R on the SDF login node:

        n <- 8*2048\nA <- matrix( rnorm(n*n), ncol=n, nrow=n )\nB <- matrix( rnorm(n*n), ncol=n, nrow=n )\nC <- A %*% B\n

        You can run this as follows on the SDF (assuming you have saved the above code into a file named matrix.R):

        Rscript ./matrix.R\n

        You can check the resources used by R when running on the login node using this command:

        top -u $USER\n

        If you run the R script in the background using &, as follows, you can then monitor your run using the top command. You may notice when you run your program that at points R uses many more resources than a single core can provide, as demonstrated below:

            PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND\n    178357 adrianj 20 0 15.542 0.014t 13064 R 10862 2.773 9:01.66 R\n

        In the example above it can be seen that 10862% of a single core is being used by R, i.e. the equivalent of roughly 108 cores. This is an example of R using automatic parallelisation. You can experiment with controlling the automatic parallelisation using the OMP_NUM_THREADS variable to restrict the number of cores available to R. Try using the following values:

        export OMP_NUM_THREADS=8\n\nexport OMP_NUM_THREADS=4\n\nexport OMP_NUM_THREADS=2\n

        You may also notice that not all the R script is parallelised. Only the actual matrix multiplication is undertaken in parallel, the initialisation/creation of the matrices is done in serial.

        "},{"location":"safe-haven-services/superdome-flex-tutorial/L5_parallelised_r_analysis/#parallelisation-with-datatables","title":"Parallelisation with data.tables","text":"

        We can also experiment with the implicit parallelism in other libraries, such as data.table. You will first need to install this library on the SDF. To do this you can simply run the following command:

        install.packages(\"data.table\")\n

        Once you have installed data.table you can experiment with the following code:

        library(data.table)\nvenue_data <- data.table( ID = 1:50000000,\nCapacity = sample(100:1000, size = 50000000, replace = T), Code = sample(LETTERS, 50000000, replace = T),\nCountry = rep(c(\"England\",\"Scotland\",\"Wales\",\"NorthernIreland\"), 50000000))\nsystem.time(venue_data[, mean(Capacity), by = Country])\n

        This creates some random data in a large data table and then performs a calculation on it. Try running R with varying numbers of threads to see what impact that has on performance. Remember, you can vary the number of threads R uses by setting OMP_NUM_THREADS= before you run R. If you want to try easily varying the number of threads you can save the above code into a script and run it using Rscript, changing OMP_NUM_THREADS each time you run it, e.g.:

        export OMP_NUM_THREADS=1\n\nRscript ./data_table_test.R\n\nexport OMP_NUM_THREADS=2\n\nRscript ./data_table_test.R\n

        The elapsed time that is printed out when the calculation is run represents how long the script/program took to run. It\u2019s important to bear in mind that, as with the matrix multiplication exercise, not everything will be parallelised. Creating the data table is done in serial so does not benefit from the addition of more threads.

        "},{"location":"safe-haven-services/superdome-flex-tutorial/L5_parallelised_r_analysis/#loop-and-function-parallelism","title":"Loop and function parallelism","text":"

        R provides a number of different functions for running loops or functions, some of which can run in parallel. Among the most commonly used are the {X}apply functions:

        • apply Apply a function over a matrix or data frame

        • lapply Apply a function over a list, vector, or data frame

        • sapply Same as lapply but returns a vector

        • vapply Same as sapply but with a specified return type that improves safety and can improve speed

        For example:

        res <- lapply(1:3, function(i) {\n    sqrt(i)*sqrt(i*2)\n    })\n

        The {X}apply functionality supports iteration over a dataset without requiring a loop to be constructed. However, the functions outlined above do not exploit parallelism, even though there is potential for parallelisation in many operations that utilise them.

        There are a number of mechanisms that can be used to implement parallelism using the {X}apply functions. One of the simplest is using the parallel library, and the mclapply function:

        library(parallel)\nres <- mclapply(1:3, function(i) {\n    sqrt(i)\n})\n

        Try experimenting with the above functions on large numbers of iterations, both with lapply and mclapply. Can you achieve better performance using the MC_CORES environment variable to specify how many parallel processes R uses to complete these calculations? The default on the SDF is 2 cores, but you can increase this in the same way we did for OMP_NUM_THREADS, e.g.:

        export MC_CORES=16\n

        Try different numbers of iterations of the functions (e.g. change 1:3 in the code to something much larger), and different numbers of parallel processes, e.g.:

        export MC_CORES=2\n\nexport MC_CORES=8\n\nexport MC_CORES=16\n

        If you have separate functions then the above approach will provide a simple method for parallelising using the resources within a single node. However, if your functionality is more loop-based, then you may not wish to have to package this up into separate functions to parallelise.

        "},{"location":"safe-haven-services/superdome-flex-tutorial/L5_parallelised_r_analysis/#parallelisation-with-foreach","title":"Parallelisation with foreach","text":"

        The foreach package can be used to parallelise loops as well as functions. Consider a loop of the following form:

        main_list <- c()\nfor (i in 1:3) {\n    main_list <- c(main_list, sqrt(i))\n}\n

        This can be converted to foreach functionality as follows:

        main_list <- c()\nlibrary(foreach)\nforeach(i=1:3) %do% {\n    main_list <- c(main_list, sqrt(i))\n}\n

        Whilst this approach does not significantly change the performance or functionality of the code, it does let us then exploit parallel functionality in foreach. The %do% can be replaced with a %dopar% which will execute the code in parallel.

        To test this out we\u2019re going to try an example using the randomForest library. We can now run the following code in R:

        library(foreach)\nlibrary(randomForest)\nx <- matrix(runif(50000), 1000)\ny <- gl(2, 500)\nrf <- foreach(ntree=rep(250, 4), .combine=combine) %do%\nrandomForest(x, y, ntree=ntree)\nprint(rf)\n

        Implement the above code and run with a system.time to see how long it takes. Once you have done this you can change the %do% to a %dopar% and re-run. Does this provide any performance benefits?

        "},{"location":"safe-haven-services/superdome-flex-tutorial/L5_parallelised_r_analysis/#parallelisation-with-doparallel","title":"Parallelisation with doParallel","text":"

        To exploit the parallelism with %dopar% we need to provide parallel execution functionality and configure it to use extra cores on the system. One method to do this is using the doParallel package:

        library(doParallel)\nregisterDoParallel(8)\n

        Does this now improve performance when running the randomForest example? Experiment with different numbers of workers by changing the number set in registerDoParallel(8) to see what kind of performance you can get. Note that you may also need to change the number of chunks of work created by the foreach, i.e. what is specified in the rep(250, 4) part of the code, to enable more than 4 different sets to be run at once if using more than 4 workers. The number of parallel workers you can use depends on the hardware you have access to, the number of workers you specify when you set up your parallel backend, and the number of chunks of work you have to distribute with your foreach configuration.
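The three limits described above can be written down as a small calculation. This is a minimal Python sketch, not part of the R exercise; the function names and the 64-core figure in the usage note are assumptions for illustration.

```python
import math


# Upper bound on how many tasks can run at once: limited by the hardware,
# the number of registered workers and the number of chunks of work.
def effective_parallelism(cores: int, workers: int, chunks: int) -> int:
    if min(cores, workers, chunks) < 1:
        raise ValueError('cores, workers and chunks must all be at least 1')
    return min(cores, workers, chunks)


# Number of rounds needed to work through every chunk at that level
# of parallelism.
def rounds_needed(cores: int, workers: int, chunks: int) -> int:
    return math.ceil(chunks / effective_parallelism(cores, workers, chunks))
```

For example, with registerDoParallel(8) and rep(250, 4) on a hypothetical 64-core machine, effective_parallelism(64, 8, 4) is 4: half of the workers have nothing to do until the number of chunks is increased.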

        "},{"location":"safe-haven-services/superdome-flex-tutorial/L5_parallelised_r_analysis/#cluster-parallelism","title":"Cluster parallelism","text":"

        It is possible to use different parallel backends for foreach. The one we have used in the example above creates new worker processes to provide the parallelism, but you can also use larger numbers of workers through a parallel cluster, e.g.:

        my.cluster <- parallel::makeCluster(8)\nregisterDoParallel(cl = my.cluster)\n

        By default makeCluster creates a socket cluster, where each worker is a new independent process. This can enable running the same R program across a range of systems, as it works on Linux and Windows (and other clients). However, you can also fork the existing R process to create your new workers, e.g.:

        cl <- makeCluster(5, type = \"FORK\")\n

        This saves you from having to recreate the variables or objects that were set up in the R program/script prior to the creation of the cluster, as they are automatically copied to the workers when using this forking mode. However, it is limited to Linux-style systems and cannot scale beyond a single node.

        Once you have finished using a parallel cluster you should shut it down to free up computational resources, using stopCluster, e.g.:

        stopCluster(cl)\n

        When using clusters without the forking approach, you need to distribute objects and variables from the main process to the workers using the clusterExport function, e.g.:

        library(parallel)\nvariableA <- 10\nvariableB <- 20\nmySum <- function(x) variableA + variableB + x\ncl <- makeCluster(4)\nres <- try(parSapply(cl=cl, 1:40, mySum))\n

        The program above will fail because variableA and variableB are not present on the cluster workers. Try the above on the SDF and see what result you get.

        To fix this issue you can modify the program using clusterExport to send variableA and variableB to the workers, prior to running the parSapply e.g.:

        clusterExport(cl=cl, c('variableA', 'variableB'))\n
        "},{"location":"services/","title":"EIDF Services","text":""},{"location":"services/#computing-services","title":"Computing Services","text":"

        Data Science Virtual Desktops

        Managed File Transfer

        Managed JupyterHub

        Cerebras CS-2

        Ultra2

        Graphcore Bow Pod64

        "},{"location":"services/#data-management-services","title":"Data Management Services","text":"

        Data Catalogue

        "},{"location":"services/cs2/","title":"Cerebras CS-2","text":"

        Get Access

        Running codes

        "},{"location":"services/cs2/access/","title":"Cerebras CS-2","text":""},{"location":"services/cs2/access/#getting-access","title":"Getting Access","text":"

        Access to the Cerebras CS-2 system is currently by arrangement with EPCC. Please email eidf@epcc.ed.ac.uk with a short description of the work you would like to perform.

        "},{"location":"services/cs2/run/","title":"Cerebras CS-2","text":""},{"location":"services/cs2/run/#introduction","title":"Introduction","text":"

        The Cerebras CS-2 Wafer-scale cluster (WSC) uses the Ultra2 system as a host, which provides access to files, the SLURM batch system, etc.

        "},{"location":"services/cs2/run/#connecting-to-the-cluster","title":"Connecting to the cluster","text":"

        To gain access to the CS-2 WSC you need to log in to the host system, Ultra2 (also called SDF-CS1). See the documentation for Ultra2.

        "},{"location":"services/cs2/run/#running-jobs","title":"Running Jobs","text":"

        All jobs must be run via SLURM to avoid inconveniencing other users of the system. An example job is shown below.

        "},{"location":"services/cs2/run/#slurm-example","title":"SLURM example","text":"

        This is based on the sample job from the Cerebras documentation: Execute your job.

        #!/bin/bash\n#SBATCH --job-name=Example        # Job name\n#SBATCH --cpus-per-task=2         # Request 2 cores\n#SBATCH --output=example_%j.log   # Standard output and error log\n#SBATCH --time=01:00:00           # Set time limit for this job to 1 hour\n#SBATCH --gres=cs:1               # Request CS-2 system\n\nsource venv_cerebras_pt/bin/activate\npython run.py \\\n       CSX \\\n       --params params.yaml \\\n       --num_csx=1 \\\n       --model_dir model_dir \\\n       --mode {train,eval,eval_all,train_and_eval} \\\n       --mount_dirs {paths to modelzoo and to data} \\\n       --python_paths {paths to modelzoo and other python code if used}\n

        See the 'Troubleshooting' section below for known issues.

        "},{"location":"services/cs2/run/#creating-an-environment","title":"Creating an environment","text":"

        To run a job on the cluster, you must create a Python virtual environment (venv) and install the dependencies. The Cerebras documentation contains generic instructions for this (Cerebras setup environment docs); however, our host system is slightly different, so we recommend the following:

        "},{"location":"services/cs2/run/#create-the-venv","title":"Create the venv","text":"
        python3.8 -m venv venv_cerebras_pt\n
        "},{"location":"services/cs2/run/#install-the-dependencies","title":"Install the dependencies","text":"
        source venv_cerebras_pt/bin/activate\npip install --upgrade pip\npip install cerebras_pytorch==2.1.1\n
        "},{"location":"services/cs2/run/#validate-the-setup","title":"Validate the setup","text":"
        source venv_cerebras_pt/bin/activate\ncerebras_install_check\n
        "},{"location":"services/cs2/run/#modify-venv-files-to-remove-clock-sync-check-on-epcc-system","title":"Modify venv files to remove clock sync check on EPCC system","text":"

        Cerebras are aware of this issue and are working on a fix. In the meantime, follow the workaround below:

        "},{"location":"services/cs2/run/#from-within-your-python-venv-edit-the-libpython38site-packagescerebras_pytorchsaverstoragepy-file","title":"From within your python venv, edit the /lib/python3.8/site-packages/cerebras_pytorch/saver/storage.py file
        vi <venv>/lib/python3.8/site-packages/cerebras_pytorch/saver/storage.py\n
        ","text":""},{"location":"services/cs2/run/#navigate-to-line-530","title":"Navigate to line 530
        :530\n

        The section should look like this:

        if modified_time > self._last_modified:\n    raise RuntimeError(\n        f\"Attempting to materialize deferred tensor with key \"\n        f\"\\\"{self._key}\\\" from file {self._filepath}, but the file has \"\n        f\"since been modified. The loaded tensor value may be \"\n        f\"different from originally loaded tensor. Please refrain \"\n        f\"from modifying the file while the run is in progress.\"\n    )\n
        ","text":""},{"location":"services/cs2/run/#comment-out-the-section-if-modified_time-self_last_modified","title":"Comment out the section if modified_time > self._last_modified
         #if modified_time > self._last_modified:\n #    raise RuntimeError(\n #        f\"Attempting to materialize deferred tensor with key \"\n #        f\"\\\"{self._key}\\\" from file {self._filepath}, but the file has \"\n #        f\"since been modified. The loaded tensor value may be \"\n #        f\"different from originally loaded tensor. Please refrain \"\n #        f\"from modifying the file while the run is in progress.\"\n #    )\n
        ","text":""},{"location":"services/cs2/run/#navigate-to-line-774","title":"Navigate to line 774
        :774\n

        The section should look like this:

           if stat.st_mtime_ns > self._stat.st_mtime_ns:\n        raise RuntimeError(\n            f\"Attempting to {msg} deferred tensor with key \"\n            f\"\\\"{self._key}\\\" from file {self._filepath}, but the file has \"\n            f\"since been modified. The loaded tensor value may be \"\n            f\"different from originally loaded tensor. Please refrain \"\n            f\"from modifying the file while the run is in progress.\"\n       )\n
        ","text":""},{"location":"services/cs2/run/#comment-out-the-section-if-statst_mtime_ns-self_statst_mtime_ns","title":"Comment out the section if stat.st_mtime_ns > self._stat.st_mtime_ns
           #if stat.st_mtime_ns > self._stat.st_mtime_ns:\n   #     raise RuntimeError(\n   #         f\"Attempting to {msg} deferred tensor with key \"\n   #         f\"\\\"{self._key}\\\" from file {self._filepath}, but the file has \"\n   #         f\"since been modified. The loaded tensor value may be \"\n   #         f\"different from originally loaded tensor. Please refrain \"\n   #         f\"from modifying the file while the run is in progress.\"\n   #    )\n
        ","text":""},{"location":"services/cs2/run/#save-the-file","title":"Save the file","text":""},{"location":"services/cs2/run/#run-jobs-as-per-existing-documentation","title":"Run jobs as per existing documentation","text":""},{"location":"services/cs2/run/#paths-pythonpath-and-mount_dirs","title":"Paths, PYTHONPATH and mount_dirs","text":"

        There can be some confusion over the correct use of the parameters supplied to the run.py script. Cerebras provide a helpful page which explains these parameters and how they should be used: Python, paths and mount directories.

        "},{"location":"services/datacatalogue/","title":"EIDF Data Catalogue Information","text":"

        QuickStart

        Tutorial

        Documentation

        Metadata information

        "},{"location":"services/datacatalogue/docs/","title":"Service Documentation","text":""},{"location":"services/datacatalogue/docs/#metadata","title":"Metadata","text":"

        For more information on metadata, please read the following: Metadata

        "},{"location":"services/datacatalogue/docs/#online-support","title":"Online support","text":""},{"location":"services/datacatalogue/metadata/","title":"EIDF Metadata Information","text":""},{"location":"services/datacatalogue/metadata/#what-is-fair","title":"What is FAIR?","text":"

        FAIR stands for Findable, Accessible, Interoperable, and Reusable, and emphasises best practices for publishing and sharing data (more details: FAIR Principles).

        "},{"location":"services/datacatalogue/metadata/#what-is-metadata","title":"What is metadata?","text":"

        Metadata is data about data, which helps to describe the dataset. Common metadata fields are things like the title of the dataset, who produced it, where it was generated (if relevant), when it was generated, and some key words describing it.

        "},{"location":"services/datacatalogue/metadata/#what-is-ckan","title":"What is CKAN?","text":"

        CKAN is a metadata catalogue - i.e. it is a database for metadata rather than data. This will help with all aspects of FAIR:

        • it will be a signposting portal for where the data actually resides
        • it will ensure that at least metadata (even if not the data) is in a format which is easily retrievable via an identifier
        • the metadata (and hopefully data) use terms from vocabularies that are widely recognised in the relevant field
        • the metadata has lots of attributes to help others use it, and there are clear licence conditions where necessary
        "},{"location":"services/datacatalogue/metadata/#what-metadata-will-we-need-to-provide","title":"What metadata will we need to provide?","text":"
        • the title of the dataset; if a short title is not particularly descriptive, then please add a longer, separate, description too.
        • the name of the person who created the dataset
        • if it has spatial relevance, the latitude and longitude of the location where the dataset was generated, if possible (e.g. if a sensor has collected data, then it should be straightforward to know the lat and long)
        • the temporal period that the dataset covers
        • it is important to standardise the licensing for all data and we will use Creative Commons 4.0 by default. If you want a different licence, please come and talk to us.
        • If the dataset is from a third party, you must tell us the licence of that dataset
        • As well as the theme you've picked for your WP directory, you can add other themes in the metadata file. For example, you might have decided your WP theme is geophysics, but a dataset is also related to geodesy. Again, please check that this term is in the FAST vocabulary.
        • the formats in which the data could be made available, if there is likely to be more than one (e.g. netCDF and csv)
        "},{"location":"services/datacatalogue/metadata/#why-do-i-need-to-use-a-controlled-vocabulary","title":"Why do I need to use a controlled vocabulary?","text":"

        Using a standard vocabulary (such as the FAST Vocabulary) has many benefits:

        • the terms are managed by an external body
        • the hierarchy has been agreed (e.g. you will see that for geophysics, it has \"skos broader\" topics of \"physics\" and \"earth sciences\", which I hope you agree with! Don't worry what \"skos\" means)
        • using controlled vocabularies means that everybody who uses it knows they are using the same definitions as everybody else using it
        • the vocabulary is updated at given intervals

        All of these advantages mean that we, as a project, don't need to think about this - there is no need to reinvent the wheel when other institutes (e.g. National Libraries) have already created one. You might recognise WorldCat - it is an organisation which manages a global catalogue of ~18000 libraries world-wide, so they are in a good position to generate a comprehensive vocabulary of academic topics!

        "},{"location":"services/datacatalogue/metadata/#what-about-licensing-what-does-cc-by-sa-40-mean","title":"What about licensing? (What does CC-BY-SA 4.0 mean?)","text":"

        The R in FAIR stands for reusable - more specifically it includes this subphrase: \"(Meta)data are released with a clear and accessible data usage license\". This means that we have to tell anyone else who uses the data what they're allowed to do with it - and, under the FAIR philosophy, more freedom is better.

        CC-BY-SA 4.0 allows anyone to remix, adapt, and build upon your work (even for commercial purposes), as long as they credit you and license their new creations under the identical terms. It also explicitly includes Sui Generis Database Rights, giving rights to the curation of a database even if you don't have the rights to the items in a database (e.g. a Spotify playlist, even though you don't own the rights to each track).

        Human readable summary: Creative Commons 4.0 Human Readable. Full legal code: Creative Commons 4.0 Legal Code.

        "},{"location":"services/datacatalogue/metadata/#im-stuck-how-do-i-get-help","title":"I'm stuck! How do I get help?","text":"

        Contact the EIDF Service Team via eidf@epcc.ed.ac.uk

        "},{"location":"services/datacatalogue/quickstart/","title":"Quickstart","text":""},{"location":"services/datacatalogue/quickstart/#accessing","title":"Accessing","text":""},{"location":"services/datacatalogue/quickstart/#first-task","title":"First Task","text":""},{"location":"services/datacatalogue/quickstart/#further-information","title":"Further information","text":""},{"location":"services/datacatalogue/tutorial/","title":"Tutorial","text":""},{"location":"services/datacatalogue/tutorial/#first-query","title":"First Query","text":""},{"location":"services/gpuservice/","title":"Overview","text":"

        The EIDF GPU Service provides access to a range of Nvidia GPUs, in both full GPU and MIG variants. The EIDF GPU Service is built upon Kubernetes.

        MIG (Multi-Instance GPU) allows a single GPU to be split into multiple isolated smaller GPUs. This means that multiple users can access a portion of the GPU without being able to access what others are running on their portion.

        The EIDF GPU Service hosts 3G.20GB and 1G.5GB MIG variants which are approximately 1/2 and 1/7 of a full Nvidia A100 40 GB GPU respectively.
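The fractions quoted above can be derived from the MIG profile names. The Python sketch below assumes profile names of the form <compute slices>G.<memory>GB and uses the A100 40 GB figures of 7 compute slices and 40 GB of memory; the function and constant names are hypothetical.

```python
from fractions import Fraction

# A full Nvidia A100 40 GB as exposed to MIG: 7 compute slices, 40 GB memory.
A100_COMPUTE_SLICES = 7
A100_MEMORY_GB = 40


def mig_share(profile: str) -> tuple:
    # A profile name such as '3G.20GB' means 3 compute slices and 20 GB memory.
    compute_part, memory_part = profile.upper().split('.')
    compute_slices = int(compute_part.rstrip('G'))
    memory_gb = int(memory_part[:-2])  # strip the trailing 'GB'
    return (Fraction(compute_slices, A100_COMPUTE_SLICES),
            Fraction(memory_gb, A100_MEMORY_GB))
```

Under these assumptions mig_share('3G.20GB') gives 3/7 of the compute and 1/2 of the memory, and mig_share('1G.5GB') gives 1/7 of the compute and 1/8 of the memory, which is why the sizes are described as approximate.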

        The service provides access to:

        • Nvidia A100 40GB
        • Nvidia A100 80GB
        • Nvidia MIG A100 1G.5GB
        • Nvidia MIG A100 3G.20GB
        • Nvidia H100 80GB

        The current full specification of the EIDF GPU Service as of 14 February 2024:

        • 4912 CPU Cores (AMD EPYC and Intel Xeon)
        • 23 TiB Memory
        • Local Disk Space (Node Image Cache and Local Workspace) - 40 TiB
        • Ceph Persistent Volumes (Long Term Data) - up to 100TiB
        • 112 Nvidia A100 40 GB
        • 39 Nvidia A100 80 GB
        • 16 Nvidia A100 3G.20GB
        • 56 Nvidia A100 1G.5GB
        • 32 Nvidia H100 80 GB

        Quotas

        This is the full configuration of the cluster.

        Each project will have access to a quota across this shared configuration.

        Changes to the default quota must be discussed and agreed with the EIDF Services team.

        NOTE

        If you request a GPU on the EIDF GPU Service you will be assigned one at random unless you specify a GPU type. Please see Getting started with Kubernetes to learn about specifying GPU resources.

        "},{"location":"services/gpuservice/#service-access","title":"Service Access","text":"

        Users should have an EIDF Account as the EIDF GPU Service is only accessible through EIDF Virtual Machines.

        Existing projects can request access to the EIDF GPU Service through a service request to the EIDF helpdesk or by emailing eidf@epcc.ed.ac.uk.

        New projects wanting to use the GPU Service should include this in their EIDF Project Application.

        Each project will be given a namespace within the EIDF GPU service to operate in.

        This namespace will normally be the EIDF Project code with 'ns' appended, e.g. eidf989ns for a project with code 'eidf989'.

        Once access to the EIDF GPU service has been confirmed, Project Leads will be given the ability to add a kubeconfig file to any of the VMs in their EIDF project - information on access to VMs is available here.

        All EIDF VMs with the project kubeconfig file downloaded can access the EIDF GPU Service using the kubectl command line tool.

        The VM does not need to be GPU-enabled.

        A quick check to see if a VM has access to the EIDF GPU service can be completed by typing kubectl -n <project-namespace> get jobs into the command line.

        If this is the first time you have connected to the GPU service the response should be No resources found in <project-namespace> namespace.
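The namespace convention and the access check can be combined in a short script. This is a minimal Python sketch with hypothetical function names; it assumes the project follows the usual namespace convention and that kubectl is on the PATH with the project kubeconfig in place.

```python
import subprocess


def project_namespace(project_code: str) -> str:
    # Usual convention: the project code followed by 'ns'.
    return project_code + 'ns'


def access_check_command(project_code: str) -> list:
    # The same command as typed by hand: kubectl -n <namespace> get jobs
    return ['kubectl', '-n', project_namespace(project_code), 'get', 'jobs']


def has_gpu_service_access(project_code: str) -> bool:
    # Runs the check; False if kubectl is missing, times out or reports an error.
    try:
        result = subprocess.run(access_check_command(project_code),
                                capture_output=True, text=True, timeout=30)
    except (FileNotFoundError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0
```

For the example project code 'eidf989', access_check_command builds the command kubectl -n eidf989ns get jobs.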

        EIDF GPU Service vs EIDF GPU-Enabled VMs

        The EIDF GPU Service is a container based service which is accessed from EIDF Virtual Desktop VMs.

        This allows a project to access multiple GPUs of different types.

        An EIDF Virtual Desktop GPU-enabled VM is limited to a small number (1-2) of GPUs of a single type.

        Projects do not have to apply for a GPU-enabled VM to access the GPU Service.

        "},{"location":"services/gpuservice/#project-quotas","title":"Project Quotas","text":"

        A standard project namespace has the following initial quota (subject to ongoing review):

        • CPU: 100 Cores
        • Memory: 1TiB
        • GPU: 12

        Quota is a maximum on a Shared Resource

        A project quota is the maximum proportion of the service available for use by that project.

        Any submitted job requests that would exceed the total project quota will be queued.
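The quota rule can be illustrated with a small model. This Python sketch is a simplification and not how the service itself enforces quotas; the class and function names are hypothetical, and the default figures are the initial quota listed above, with 1 TiB written as 1024 GiB.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    cpu_cores: int
    memory_gib: int
    gpus: int

    def __add__(self, other):
        return Resources(self.cpu_cores + other.cpu_cores,
                         self.memory_gib + other.memory_gib,
                         self.gpus + other.gpus)

    def fits_within(self, limit):
        return (self.cpu_cores <= limit.cpu_cores
                and self.memory_gib <= limit.memory_gib
                and self.gpus <= limit.gpus)


# Initial project quota: 100 cores, 1 TiB of memory, 12 GPUs.
DEFAULT_QUOTA = Resources(cpu_cores=100, memory_gib=1024, gpus=12)


def would_be_queued(in_use, request, quota=DEFAULT_QUOTA):
    # A job is queued if admitting it would push any resource over the quota.
    return not (in_use + request).fits_within(quota)
```

For example, a project already using 10 GPUs can start a 2-GPU job straight away (if the service has capacity), but a 4-GPU job would be queued until enough of the project's running jobs finish.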

        "},{"location":"services/gpuservice/#project-queues","title":"Project Queues","text":"

        The EIDF GPU Service is introducing the Kueue system in February 2024. Its use is detailed on the Kueue page.

        Job Queuing

        During periods of high demand, jobs will be queued awaiting resource availability on the Service.

        As a general rule, the higher the GPU/CPU/Memory resource request of a single job, the longer it will wait in the queue before enough resources are free on a single node for it to be allocated.

        GPUs in high demand, such as Nvidia H100s, typically have longer wait times.

        Furthermore, a project may have a quota of up to 12 GPUs but due to demand may only be able to access a smaller number at any given time.

        "},{"location":"services/gpuservice/#additional-service-policy-information","title":"Additional Service Policy Information","text":"

        Additional information on service policies can be found here.

        "},{"location":"services/gpuservice/#eidf-gpu-service-tutorial","title":"EIDF GPU Service Tutorial","text":"

        This tutorial teaches users how to submit tasks to the EIDF GPU Service, but it is not a comprehensive overview of Kubernetes.

        Lessons and objectives:

        • Getting started with Kubernetes: a. What is Kubernetes? b. How to send a task to a GPU node. c. How to define the GPU resources needed.
        • Requesting persistent volumes with Kubernetes: a. What is a persistent volume? b. How to request a PV resource.
        • Running a PyTorch task: a. Accessing a PyTorch container. b. Submitting a PyTorch task to the cluster. c. Inspecting the results.
        • Template workflow: a. Loading large data sets asynchronously. b. Manually or automatically building Docker images. c. Iteratively changing and testing code in a job.
        "},{"location":"services/gpuservice/#further-reading-and-help","title":"Further Reading and Help","text":"
        • The Nvidia developers blog provides several examples of how to run ML tasks on a Kubernetes GPU cluster.

        • Kubernetes documentation has a useful kubectl cheat sheet.

        • More detailed use cases for kubectl can be found in the Kubernetes documentation.

        "},{"location":"services/gpuservice/faq/","title":"GPU Service FAQ","text":""},{"location":"services/gpuservice/faq/#gpu-service-frequently-asked-questions","title":"GPU Service Frequently Asked Questions","text":""},{"location":"services/gpuservice/faq/#how-do-i-access-the-gpu-service","title":"How do I access the GPU Service?","text":"

        The default access route to the GPU Service is via an EIDF DSC VM. The DSC VM will have access to all EIDF resources for your project and can be accessed through the VDI (SSH or if enabled RDP) or via the EIDF SSH Gateway.

        "},{"location":"services/gpuservice/faq/#how-do-i-obtain-my-project-kubeconfig-file","title":"How do I obtain my project kubeconfig file?","text":"

        Project Leads and Managers can access the kubeconfig file from the Project page in the Portal. Project Leads and Managers can provide the file on any of the project VMs or give it to individuals within the project.

        "},{"location":"services/gpuservice/faq/#access-to-gpu-service-resources-in-default-namespace-is-forbidden","title":"Access to GPU Service resources in default namespace is 'Forbidden'","text":"
        Error from server (Forbidden): error when creating \"myjobfile.yml\": jobs is forbidden: User <user> cannot create resource \"jobs\" in API group \"\" in the namespace \"default\"\n

        Some version of the above error is common when submitting jobs/pods to the GPU cluster using the kubectl command. It arises when you forget to specify that you are submitting the job/pod to your project namespace rather than the \"default\" namespace, which you do not have permission to use. Resubmitting the job/pod with kubectl -n <project-namespace> create -f myjobfile.yml should solve the issue.

        "},{"location":"services/gpuservice/faq/#i-cant-mount-my-pvc-in-multiple-containers-or-pods-at-the-same-time","title":"I can't mount my PVC in multiple containers or pods at the same time","text":"

        The current PVC provisioner is based on Ceph RBD. The block devices provided by Ceph to the Kubernetes PV/PVC providers cannot be mounted in multiple pods at the same time; they can only be accessed by one pod at a time. Once a pod has unmounted the PVC and terminated, the PVC can be reused by another pod. The service development team is working on new PVC provider systems to alleviate this limitation.

        "},{"location":"services/gpuservice/faq/#how-many-gpus-can-i-use-in-a-pod","title":"How many GPUs can I use in a pod?","text":"

        The current limit is 8 GPUs per pod. Each underlying host node has either 4 or 8 GPUs. If you request 8 GPUs, you will be placed in a queue until a node with 8 GPUs is free of other jobs. If you request 4 GPUs, the pod could run on a node with either 4 or 8 GPUs.
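The placement rule above can be expressed as a short function. This is a minimal Python sketch with hypothetical names; it only models the node sizes and the per-pod limit stated in this answer, not the scheduler itself.

```python
# Host nodes have either 4 or 8 GPUs, and a pod is limited to 8 GPUs.
NODE_GPU_COUNTS = (4, 8)
MAX_GPUS_PER_POD = 8


def eligible_node_sizes(requested_gpus: int) -> list:
    if not 1 <= requested_gpus <= MAX_GPUS_PER_POD:
        raise ValueError('a pod can request between 1 and 8 GPUs')
    # A pod must fit on one node, so only nodes with enough GPUs qualify.
    return [size for size in NODE_GPU_COUNTS if size >= requested_gpus]
```

A request for 4 GPUs can be placed on either node size, whereas a request for 5 to 8 GPUs can only be placed on an 8-GPU node, which is why larger requests tend to wait longer.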

        "},{"location":"services/gpuservice/faq/#why-did-a-validation-error-occur-when-submitting-a-pod-or-job-with-a-valid-specification-file","title":"Why did a validation error occur when submitting a pod or job with a valid specification file?","text":"

        If an error like the below occurs:

        error: error validating \"myjobfile.yml\": error validating data: the server does not allow access to the requested resource; if you choose to ignore these errors, turn validation off with --validate=false\n

        There may be an issue with the kubectl version that is being run. This can occur if installing in virtual environments or from package repositories.

        The current version verified to operate with the GPU Service is v1.24.10. kubectl and the Kubernetes API server can suffer from version skew if their versions are not within a defined number of releases of each other. More information can be found under the Kubernetes Version Skew Policy.
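A version check along these lines can be sketched in Python. The function names are hypothetical, and the default of one minor version either side reflects the general Kubernetes skew policy for kubectl rather than anything specific to the EIDF GPU Service; the server versions in the usage note are made up for illustration.

```python
def minor_version(version: str) -> tuple:
    # 'v1.24.10' -> (1, 24)
    major, minor = version.lstrip('v').split('.')[:2]
    return int(major), int(minor)


def within_supported_skew(client: str, server: str,
                          max_minor_skew: int = 1) -> bool:
    # Supported when the major versions match and the minor versions
    # differ by no more than max_minor_skew.
    client_major, client_minor = minor_version(client)
    server_major, server_minor = minor_version(server)
    return (client_major == server_major
            and abs(client_minor - server_minor) <= max_minor_skew)
```

For example, within_supported_skew('v1.24.10', 'v1.25.3') is True, while within_supported_skew('v1.24.10', 'v1.27.0') is False. The client and server versions actually in use can be read from the output of kubectl version.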

        "},{"location":"services/gpuservice/faq/#insufficient-shared-memory-size","title":"Insufficient Shared Memory Size","text":"

        My SHM is very small, and it causes \"OSError: [Errno 28] No space left on device\" when I train a model using multi-GPU. How to increase SHM size?

        The default size of SHM is only 64M. You can mount an empty dir to /dev/shm to solve this problem:

           spec:\n     containers:\n       - name: [NAME]\n         image: [IMAGE]\n         volumeMounts:\n         - mountPath: /dev/shm\n           name: dshm\n     volumes:\n       - name: dshm\n         emptyDir:\n            medium: Memory\n
        "},{"location":"services/gpuservice/faq/#pytorch-slow-performance-issues","title":"Pytorch Slow Performance Issues","text":"

        PyTorch on Kubernetes may operate slower than expected - much slower than an equivalent VM setup.

        PyTorch defaults to auto-detecting the number of OMP threads, and the number it detects will not match your requested CPU core count. This is a consequence of operating in a container environment: the CPU information reported by standard libraries and tools is the node-level information rather than that of your container.

        To help correct this issue, the environment variable OMP_NUM_THREADS should be set in the job submission file to the number of cores requested or fewer.

        This has been tested using:

        • OMP_NUM_THREADS=1
        • OMP_NUM_THREADS=(number of requested cores).

        Example fragment for a Bash command start:

          containers:\n    - args:\n        - >\n          export OMP_NUM_THREADS=1;\n          python mypytorchprogram.py;\n      command:\n        - /bin/bash\n        - '-c'\n        - '--'\n
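
        An alternative sketch, assuming the container does not need a shell wrapper, sets the variable through the standard env field of the container specification instead of an export in the command. The container name, program name and resource figures are illustrative; the value 4 is chosen only because it matches the CPU request in this example.

```yaml
containers:
  - name: pytorch-con
    image: pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel
    command: ['python3', 'mypytorchprogram.py']
    env:
      - name: OMP_NUM_THREADS
        value: '4'                   # env values must be strings; equal to the CPU request below
    resources:
      requests:
        cpu: 4
        memory: '4Gi'
      limits:
        cpu: 4
        memory: '8Gi'
        nvidia.com/gpu: 1
```

        Keeping the variable next to the resources block makes it easier to update both together when the CPU request changes.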
        "},{"location":"services/gpuservice/faq/#my-large-number-of-gpus-job-takes-a-long-time-to-be-scheduled","title":"My large number of GPUs Job takes a long time to be scheduled","text":"

        Requesting a large number of GPUs for a job may require an entire node to be free, which can take some time. The default scheduling algorithm for the queues is Best Effort FIFO, which means that large jobs will not block small jobs from running if there is sufficient quota and space available.

        "},{"location":"services/gpuservice/kueue/","title":"Kueue","text":""},{"location":"services/gpuservice/kueue/#overview","title":"Overview","text":"

        Kueue is a native Kubernetes quota and job management system.

        This has been the job queue system for the EIDF GPU Service since February 2024.

        All users should submit jobs to their local namespace user queue, which is named <project namespace>-user-queue (for example, eidf001-user-queue).

        "},{"location":"services/gpuservice/kueue/#changes-to-job-specs","title":"Changes to Job Specs","text":"

        Jobs can be submitted as before but will require the addition of a metadata label:

           labels:\n      kueue.x-k8s.io/queue-name:  <project namespace>-user-queue\n

        This is the only change required to make Jobs work with Kueue. A policy will be in place that stops jobs without this label from being accepted.
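
        To show where the label sits, here is a minimal sketch of a complete Job specification. It reuses the CUDA sample image from the training material in this documentation; the queue name eidf001-user-queue is an example and should be replaced with the queue for your own project namespace.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  generateName: jobtest-
  labels:
    # The label belongs in the Job metadata, not in the pod template metadata.
    kueue.x-k8s.io/queue-name: eidf001-user-queue
spec:
  completions: 1
  template:
    spec:
      containers:
        - name: cudasample
          image: nvcr.io/nvidia/k8s/cuda-sample:nbody-cuda11.7.1
          args: ['-benchmark', '-numbodies=512000']
          resources:
            requests:
              cpu: 2
              memory: '1Gi'
            limits:
              cpu: 2
              memory: '4Gi'
              nvidia.com/gpu: 1
      restartPolicy: Never
```

        Because the specification uses generateName rather than a fixed name, it is submitted with kubectl create -f rather than kubectl apply -f.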

        "},{"location":"services/gpuservice/kueue/#useful-commands-for-looking-at-your-local-queue","title":"Useful commands for looking at your local queue","text":""},{"location":"services/gpuservice/kueue/#kubectl-get-queue","title":"kubectl get queue","text":"

        This command will output the high level status of your namespace queue with the number of workloads currently running and the number waiting to start:

        NAME               CLUSTERQUEUE             PENDING WORKLOADS   ADMITTED WORKLOADS\neidf001-user-queue eidf001-project-gpu-cq   0                   2\n
        "},{"location":"services/gpuservice/kueue/#kubectl-describe-queue-queue","title":"kubectl describe queue <queue>","text":"

        This command will output more detailed information on the current resource usage in your queue:

        Name:         eidf001-user-queue\nNamespace:    eidf001\nLabels:       <none>\nAnnotations:  <none>\nAPI Version:  kueue.x-k8s.io/v1beta1\nKind:         LocalQueue\nMetadata:\n  Creation Timestamp:  2024-02-06T13:06:23Z\n  Generation:          1\n  Managed Fields:\n    API Version:  kueue.x-k8s.io/v1beta1\n    Fields Type:  FieldsV1\n    fieldsV1:\n      f:spec:\n        .:\n        f:clusterQueue:\n    Manager:      kubectl-create\n    Operation:    Update\n    Time:         2024-02-06T13:06:23Z\n    API Version:  kueue.x-k8s.io/v1beta1\n    Fields Type:  FieldsV1\n    fieldsV1:\n      f:status:\n        .:\n        f:admittedWorkloads:\n        f:conditions:\n          .:\n          k:{\"type\":\"Active\"}:\n            .:\n            f:lastTransitionTime:\n            f:message:\n            f:reason:\n            f:status:\n            f:type:\n        f:flavorUsage:\n          .:\n          k:{\"name\":\"default-flavor\"}:\n            .:\n            f:name:\n            f:resources:\n              .:\n              k:{\"name\":\"cpu\"}:\n                .:\n                f:name:\n                f:total:\n              k:{\"name\":\"memory\"}:\n                .:\n                f:name:\n                f:total:\n          k:{\"name\":\"gpu-a100\"}:\n            .:\n            f:name:\n            f:resources:\n              .:\n              k:{\"name\":\"nvidia.com/gpu\"}:\n                .:\n                f:name:\n                f:total:\n          k:{\"name\":\"gpu-a100-1g\"}:\n            .:\n            f:name:\n            f:resources:\n              .:\n              k:{\"name\":\"nvidia.com/gpu\"}:\n                .:\n                f:name:\n                f:total:\n          k:{\"name\":\"gpu-a100-3g\"}:\n            .:\n            f:name:\n            f:resources:\n              .:\n              k:{\"name\":\"nvidia.com/gpu\"}:\n                .:\n                f:name:\n                f:total:\n          
k:{\"name\":\"gpu-a100-80\"}:\n            .:\n            f:name:\n            f:resources:\n              .:\n              k:{\"name\":\"nvidia.com/gpu\"}:\n                .:\n                f:name:\n                f:total:\n        f:flavorsReservation:\n          .:\n          k:{\"name\":\"default-flavor\"}:\n            .:\n            f:name:\n            f:resources:\n              .:\n              k:{\"name\":\"cpu\"}:\n                .:\n                f:name:\n                f:total:\n              k:{\"name\":\"memory\"}:\n                .:\n                f:name:\n                f:total:\n          k:{\"name\":\"gpu-a100\"}:\n            .:\n            f:name:\n            f:resources:\n              .:\n              k:{\"name\":\"nvidia.com/gpu\"}:\n                .:\n                f:name:\n                f:total:\n          k:{\"name\":\"gpu-a100-1g\"}:\n            .:\n            f:name:\n            f:resources:\n              .:\n              k:{\"name\":\"nvidia.com/gpu\"}:\n                .:\n                f:name:\n                f:total:\n          k:{\"name\":\"gpu-a100-3g\"}:\n            .:\n            f:name:\n            f:resources:\n              .:\n              k:{\"name\":\"nvidia.com/gpu\"}:\n                .:\n                f:name:\n                f:total:\n          k:{\"name\":\"gpu-a100-80\"}:\n            .:\n            f:name:\n            f:resources:\n              .:\n              k:{\"name\":\"nvidia.com/gpu\"}:\n                .:\n                f:name:\n                f:total:\n        f:pendingWorkloads:\n        f:reservingWorkloads:\n    Manager:         kueue\n    Operation:       Update\n    Subresource:     status\n    Time:            2024-02-14T10:54:20Z\n  Resource Version:  333898946\n  UID:               bca097e2-6c55-4305-86ac-d1bd3c767751\nSpec:\n  Cluster Queue:  eidf001-project-gpu-cq\nStatus:\n  Admitted Workloads:  2\n  Conditions:\n    Last Transition Time:  
2024-02-06T13:06:23Z\n    Message:               Can submit new workloads to clusterQueue\n    Reason:                Ready\n    Status:                True\n    Type:                  Active\n  Flavor Usage:\n    Name:  gpu-a100\n    Resources:\n      Name:   nvidia.com/gpu\n      Total:  2\n    Name:     gpu-a100-3g\n    Resources:\n      Name:   nvidia.com/gpu\n      Total:  0\n    Name:     gpu-a100-1g\n    Resources:\n      Name:   nvidia.com/gpu\n      Total:  0\n    Name:     gpu-a100-80\n    Resources:\n      Name:   nvidia.com/gpu\n      Total:  0\n    Name:     default-flavor\n    Resources:\n      Name:   cpu\n      Total:  16\n      Name:   memory\n      Total:  256Gi\n  Flavors Reservation:\n    Name:  gpu-a100\n    Resources:\n      Name:   nvidia.com/gpu\n      Total:  2\n    Name:     gpu-a100-3g\n    Resources:\n      Name:   nvidia.com/gpu\n      Total:  0\n    Name:     gpu-a100-1g\n    Resources:\n      Name:   nvidia.com/gpu\n      Total:  0\n    Name:     gpu-a100-80\n    Resources:\n      Name:   nvidia.com/gpu\n      Total:  0\n    Name:     default-flavor\n    Resources:\n      Name:             cpu\n      Total:            16\n      Name:             memory\n      Total:            256Gi\n  Pending Workloads:    0\n  Reserving Workloads:  2\nEvents:                 <none>\n
        "},{"location":"services/gpuservice/kueue/#kubectl-get-workloads","title":"kubectl get workloads","text":"

        This command will return the list of workloads in the queue:

        NAME                QUEUE                ADMITTED BY              AGE\njob-jobtest-366ab   eidf001-user-queue   eidf001-project-gpu-cq   4h45m\njob-jobtest-34ba9   eidf001-user-queue   eidf001-project-gpu-cq   6h48m\n
        "},{"location":"services/gpuservice/kueue/#kubectl-describe-workload-workload","title":"kubectl describe workload <workload>","text":"

        This command will return a detailed summary of the workload including status and resource usage:

        Name:         job-pytorch-job-0b664\nNamespace:    t4\nLabels:       kueue.x-k8s.io/job-uid=33bc1e48-4dca-4252-9387-bf68b99759dc\nAnnotations:  <none>\nAPI Version:  kueue.x-k8s.io/v1beta1\nKind:         Workload\nMetadata:\n  Creation Timestamp:  2024-02-14T15:22:16Z\n  Generation:          2\n  Managed Fields:\n    API Version:  kueue.x-k8s.io/v1beta1\n    Fields Type:  FieldsV1\n    fieldsV1:\n      f:status:\n        f:admission:\n          f:clusterQueue:\n          f:podSetAssignments:\n            k:{\"name\":\"main\"}:\n              .:\n              f:count:\n              f:flavors:\n                f:cpu:\n                f:memory:\n                f:nvidia.com/gpu:\n              f:name:\n              f:resourceUsage:\n                f:cpu:\n                f:memory:\n                f:nvidia.com/gpu:\n        f:conditions:\n          k:{\"type\":\"Admitted\"}:\n            .:\n            f:lastTransitionTime:\n            f:message:\n            f:reason:\n            f:status:\n            f:type:\n          k:{\"type\":\"QuotaReserved\"}:\n            .:\n            f:lastTransitionTime:\n            f:message:\n            f:reason:\n            f:status:\n            f:type:\n    Manager:      kueue-admission\n    Operation:    Apply\n    Subresource:  status\n    Time:         2024-02-14T15:22:16Z\n    API Version:  kueue.x-k8s.io/v1beta1\n    Fields Type:  FieldsV1\n    fieldsV1:\n      f:status:\n        f:conditions:\n          k:{\"type\":\"Finished\"}:\n            .:\n            f:lastTransitionTime:\n            f:message:\n            f:reason:\n            f:status:\n            f:type:\n    Manager:      kueue-job-controller-Finished\n    Operation:    Apply\n    Subresource:  status\n    Time:         2024-02-14T15:25:06Z\n    API Version:  kueue.x-k8s.io/v1beta1\n    Fields Type:  FieldsV1\n    fieldsV1:\n      f:metadata:\n        f:labels:\n          .:\n          f:kueue.x-k8s.io/job-uid:\n        f:ownerReferences:\n  
        .:\n          k:{\"uid\":\"33bc1e48-4dca-4252-9387-bf68b99759dc\"}:\n      f:spec:\n        .:\n        f:podSets:\n          .:\n          k:{\"name\":\"main\"}:\n            .:\n            f:count:\n            f:name:\n            f:template:\n              .:\n              f:metadata:\n                .:\n                f:labels:\n                  .:\n                  f:controller-uid:\n                  f:job-name:\n                f:name:\n              f:spec:\n                .:\n                f:containers:\n                f:dnsPolicy:\n                f:nodeSelector:\n                f:restartPolicy:\n                f:schedulerName:\n                f:securityContext:\n                f:terminationGracePeriodSeconds:\n                f:volumes:\n        f:priority:\n        f:priorityClassSource:\n        f:queueName:\n    Manager:    kueue\n    Operation:  Update\n    Time:       2024-02-14T15:22:16Z\n  Owner References:\n    API Version:           batch/v1\n    Block Owner Deletion:  true\n    Controller:            true\n    Kind:                  Job\n    Name:                  pytorch-job\n    UID:                   33bc1e48-4dca-4252-9387-bf68b99759dc\n  Resource Version:        270812029\n  UID:                     8cfa93ba-1142-4728-bc0c-e8de817e8151\nSpec:\n  Pod Sets:\n    Count:  1\n    Name:   main\n    Template:\n      Metadata:\n        Labels:\n          Controller - UID:  33bc1e48-4dca-4252-9387-bf68b99759dc\n          Job - Name:        pytorch-job\n        Name:                pytorch-pod\n      Spec:\n        Containers:\n          Args:\n            /mnt/ceph_rbd/example_pytorch_code.py\n          Command:\n            python3\n          Image:              pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel\n          Image Pull Policy:  IfNotPresent\n          Name:               pytorch-con\n          Resources:\n            Limits:\n              Cpu:             4\n              Memory:          4Gi\n              
nvidia.com/gpu:  1\n            Requests:\n              Cpu:                     2\n              Memory:                  1Gi\n          Termination Message Path:    /dev/termination-log\n          Termination Message Policy:  File\n          Volume Mounts:\n            Mount Path:  /mnt/ceph_rbd\n            Name:        volume\n        Dns Policy:      ClusterFirst\n        Node Selector:\n          nvidia.com/gpu.product:  NVIDIA-A100-SXM4-40GB\n        Restart Policy:            Never\n        Scheduler Name:            default-scheduler\n        Security Context:\n        Termination Grace Period Seconds:  30\n        Volumes:\n          Name:  volume\n          Persistent Volume Claim:\n            Claim Name:   pytorch-pvc\n  Priority:               0\n  Priority Class Source:\n  Queue Name:             t4-user-queue\nStatus:\n  Admission:\n    Cluster Queue:  project-cq\n    Pod Set Assignments:\n      Count:  1\n      Flavors:\n        Cpu:             default-flavor\n        Memory:          default-flavor\n        nvidia.com/gpu:  gpu-a100\n      Name:              main\n      Resource Usage:\n        Cpu:             2\n        Memory:          1Gi\n        nvidia.com/gpu:  1\n  Conditions:\n    Last Transition Time:  2024-02-14T15:22:16Z\n    Message:               Quota reserved in ClusterQueue project-cq\n    Reason:                QuotaReserved\n    Status:                True\n    Type:                  QuotaReserved\n    Last Transition Time:  2024-02-14T15:22:16Z\n    Message:               The workload is admitted\n    Reason:                Admitted\n    Status:                True\n    Type:                  Admitted\n    Last Transition Time:  2024-02-14T15:25:06Z\n    Message:               Job finished successfully\n    Reason:                JobFinished\n    Status:                True\n    Type:                  Finished\n
        "},{"location":"services/gpuservice/policies/","title":"GPU Service Policies","text":""},{"location":"services/gpuservice/policies/#namespaces","title":"Namespaces","text":"

        Each project will be given a namespace which will have an applied quota.

        Default Quota:

        • CPU: 100 Cores
        • Memory: 1TiB
        • GPU: 12
        "},{"location":"services/gpuservice/policies/#kubeconfig","title":"Kubeconfig","text":"

        Each project will be assigned a kubeconfig file for access to the service which will allow operation in the assigned namespace and access to exposed service operators, for example the GPU and CephRBD operators.

        "},{"location":"services/gpuservice/policies/#kubernetes-job-time-to-live","title":"Kubernetes Job Time to Live","text":"

        All Kubernetes Jobs submitted to the service will have a Time to Live (TTL) applied via spec.ttlSecondsAfterFinished automatically. The default TTL for jobs using the service will be 1 week (604800 seconds). A completed job (in success or error state) will be deleted from the service once one week has elapsed after execution has completed. This will reduce excessive object accumulation on the service.

        Important

        This policy is automated and does not require users to change their job specifications.
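
        For reference only, the fragment below sketches what the equivalent setting looks like when written out by hand in a Job specification. Users do not need to add it, and how the service treats a value supplied by the user is not covered here.

```yaml
spec:
  ttlSecondsAfterFinished: 604800    # 7 days * 24 hours * 3600 seconds = 1 week
  completions: 1
  # template: (pod template unchanged)
```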

        "},{"location":"services/gpuservice/policies/#kubernetes-active-deadline-seconds","title":"Kubernetes Active Deadline Seconds","text":"

        All Kubernetes User Pods submitted to the service will have an Active Deadline Seconds (ADS) applied via spec.spec.activeDeadlineSeconds automatically. The default ADS for pods using the service will be 5 days (432000 seconds). A pod will be terminated 5 days after execution has begun. This will reduce the number of unused pods remaining on the service.

        Important

        This policy is automated and does not require users to change their job or pod specifications.
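
        For reference only, the fragment below sketches the equivalent setting written out by hand. It assumes the pod is created from a Job, in which case the standard Kubernetes location of the field is the pod template (spec.template.spec.activeDeadlineSeconds). Users do not need to add it.

```yaml
spec:
  template:
    spec:
      activeDeadlineSeconds: 432000  # 5 days * 24 hours * 3600 seconds
      restartPolicy: Never
      # containers: (unchanged)
```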

        "},{"location":"services/gpuservice/policies/#kueue","title":"Kueue","text":"

        All jobs will be managed through the Kueue scheduling system. All pods will be required to be owned by a Kubernetes workload.

        Each project will have a local user queue in their namespace. This will provide access to their cluster queue. To enable the use of the queue in your job definitions, the following will need to be added to the job specification file as part of the metadata:

           labels:\n      kueue.x-k8s.io/queue-name:  <project namespace>-user-queue\n

        Jobs without this queue name label will be rejected.

        Pods bypassing the queue system will be deleted.

        "},{"location":"services/gpuservice/training/L1_getting_started/","title":"Getting started with Kubernetes","text":""},{"location":"services/gpuservice/training/L1_getting_started/#introduction","title":"Introduction","text":"

        Kubernetes (K8s) is a container orchestration system, originally developed by Google, for the deployment, scaling, and management of containerised applications.

        Nvidia GPUs are supported through K8s native Nvidia GPU Operators.

        The use of K8s to manage the EIDF GPU Service provides two key advantages:

        • support for containers enabling reproducible analysis whilst minimising demand on system admin.
        • automated resource allocation management for GPUs and storage volumes that are shared across multiple users.
        "},{"location":"services/gpuservice/training/L1_getting_started/#interacting-with-a-k8s-cluster","title":"Interacting with a K8s cluster","text":"

        An overview of the key components of a K8s cluster can be seen on the Kubernetes docs website.

        The primary component of a K8s cluster is a pod.

        A pod is a set of one or more containers (and their storage volumes) that share resources.

        Users define the resource requirements of a pod (e.g. number/type of GPU) and the containers to be run in the pod by writing a YAML file.

        The pod definition YAML file is sent to the cluster using the K8s API and is assigned to an appropriate node to be run.

        A node is a part of the cluster such as a physical or virtual host which exposes CPU, Memory and GPUs.

        Multiple pods can be defined and maintained using several different methods depending on purpose: deployments, services and jobs; see the K8s docs for more details.

        Users interact with the K8s API using kubectl (short for Kubernetes control) commands.

        Some of the kubectl commands are restricted on the EIDF cluster in order to ensure project details are not shared across namespaces.

        Useful commands are:

        • kubectl create -f <job definition yaml>: Create a new job with requested resources. Returns an error if a job with the same name already exists.
        • kubectl apply -f <job definition yaml>: Create a new job with requested resources. If a job with the same name already exists it updates that job with the new resource/container requirements outlined in the yaml.
        • kubectl delete pod <pod name>: Delete a pod from the cluster.
        • kubectl get pods: Summarise all pods the namespace has active (or pending).
        • kubectl describe pods: Verbose description of all pods the namespace has active (or pending).
        • kubectl describe pod <pod name>: Verbose summary of the specified pod.
        • kubectl logs <pod name>: Retrieve the log files associated with a running pod.
        • kubectl get jobs: List all jobs the namespace has active (or pending).
        • kubectl describe job <job name>: Verbose summary of the specified job.
        • kubectl delete job <job name>: Delete a job from the cluster.
        "},{"location":"services/gpuservice/training/L1_getting_started/#creating-your-first-job","title":"Creating your first job","text":"

        To access the GPUs on the service, it is recommended to start with one of the prebuilt container images provided by Nvidia. These images are intended to perform different tasks using Nvidia GPUs.

        The list of Nvidia images is available on their website.

        The following example uses their CUDA sample code simulating n-body interactions.

        1. Open an editor of your choice and create the file test_NBody.yml
        2. Copy the following into the file, replacing namespace-user-queue with <your project namespace>-user-queue, e.g. eidf001ns-user-queue:

          apiVersion: batch/v1\nkind: Job\nmetadata:\n    generateName: jobtest-\n    labels:\n        kueue.x-k8s.io/queue-name:  namespace-user-queue\nspec:\n    completions: 1\n    template:\n        metadata:\n            name: job-test\n        spec:\n            containers:\n            - name: cudasample\n              image: nvcr.io/nvidia/k8s/cuda-sample:nbody-cuda11.7.1\n              args: [\"-benchmark\", \"-numbodies=512000\", \"-fp64\", \"-fullscreen\"]\n              resources:\n                    requests:\n                        cpu: 2\n                        memory: '1Gi'\n                    limits:\n                        cpu: 2\n                        memory: '4Gi'\n                        nvidia.com/gpu: 1\n            restartPolicy: Never\n

          The pod resources are defined under the resources tags using the requests and limits tags.

          Resources defined under the requests tags are the reserved resources required for the pod to be scheduled.

          If a pod is assigned to a node with unused resources then it may burst up to use resources beyond those requested.

          This may allow the task within the pod to run faster, but it will also throttle back down when further pods are scheduled to the node.

          The limits tag specifies the maximum resources that can be assigned to a pod.

          The EIDF GPU Service requires that all pods have requests and limits tags for CPU and memory defined in order to be accepted.

          GPU resource requests are optional: only an entry under the limits tag, such as nvidia.com/gpu: 1, is needed to specify the use of a GPU. Without this, no GPU will be available to the pod.

          The label kueue.x-k8s.io/queue-name specifies the queue you are submitting your job to. This is part of the Kueue system in operation on the service to allow for improved resource management for users.

        3. Save the file and exit the editor

        4. Run kubectl create -f test_NBody.yml
        5. This will output something like:

          job.batch/jobtest-b92qg created\n
        6. Run kubectl get jobs

        7. This will output something like:

          NAME            COMPLETIONS   DURATION   AGE\njobtest-b92qg   3/3           48s        6m27s\njobtest-d45sr   5/5           15m        22h\njobtest-kwmwk   3/3           48s        29m\njobtest-kw22k   1/1           48s        29m\n

          This displays all the jobs in the current namespace, starting with their name, number of completions against required completions, duration and age.

        8. Describe your job using the command kubectl describe job jobtest-b92qg, replacing the job name with your job name.

        9. This will output something like:

          Name:             jobtest-b92qg\nNamespace:        t4\nSelector:         controller-uid=d3233fee-794e-466f-9655-1fe32d1f06d3\nLabels:           kueue.x-k8s.io/queue-name=t4-user-queue\nAnnotations:      batch.kubernetes.io/job-tracking:\nParallelism:      1\nCompletions:      3\nCompletion Mode:  NonIndexed\nStart Time:       Wed, 14 Feb 2024 14:07:44 +0000\nCompleted At:     Wed, 14 Feb 2024 14:08:32 +0000\nDuration:         48s\nPods Statuses:    0 Active (0 Ready) / 3 Succeeded / 0 Failed\nPod Template:\n    Labels:  controller-uid=d3233fee-794e-466f-9655-1fe32d1f06d3\n            job-name=jobtest-b92qg\n    Containers:\n        cudasample:\n            Image:      nvcr.io/nvidia/k8s/cuda-sample:nbody-cuda11.7.1\n            Port:       <none>\n            Host Port:  <none>\n            Args:\n                -benchmark\n                -numbodies=512000\n                -fp64\n                -fullscreen\n            Limits:\n                cpu:             2\n                memory:          4Gi\n                nvidia.com/gpu:  1\n            Requests:\n                cpu:        2\n                memory:     1Gi\n            Environment:  <none>\n            Mounts:       <none>\n    Volumes:        <none>\nEvents:\nType    Reason            Age    From                        Message\n----    ------            ----   ----                        -------\nNormal  Suspended         8m1s   job-controller              Job suspended\nNormal  CreatedWorkload   8m1s   batch/job-kueue-controller  Created Workload: t4/job-jobtest-b92qg-3b890\nNormal  Started           8m1s   batch/job-kueue-controller  Admitted by clusterQueue project-cq\nNormal  SuccessfulCreate  8m     job-controller              Created pod: jobtest-b92qg-lh64s\nNormal  Resumed           8m     job-controller              Job resumed\nNormal  SuccessfulCreate  7m44s  job-controller              Created pod: jobtest-b92qg-xhvdm\nNormal  SuccessfulCreate  7m28s  job-controller           
   Created pod: jobtest-b92qg-lvmrf\nNormal  Completed         7m12s  job-controller              Job completed\n
        10. Run kubectl get pods

        11. This will output something like:

          NAME                  READY   STATUS      RESTARTS   AGE\njobtest-b92qg-lh64s   0/1     Completed   0          11m\njobtest-b92qg-lvmrf   0/1     Completed   0          10m\njobtest-b92qg-xhvdm   0/1     Completed   0          10m\njobtest-d45sr-8tf4d   0/1     Completed   0          22h\njobtest-d45sr-jjhgg   0/1     Completed   0          22h\njobtest-d45sr-n5w6c   0/1     Completed   0          22h\njobtest-d45sr-v9p4j   0/1     Completed   0          22h\njobtest-d45sr-xgq5s   0/1     Completed   0          22h\njobtest-kwmwk-cgwmf   0/1     Completed   0          33m\njobtest-kwmwk-mttdw   0/1     Completed   0          33m\njobtest-kwmwk-r2q9h   0/1     Completed   0          33m\n
        12. View the logs of a pod from the job you ran with kubectl logs jobtest-b92qg-lh64s. Note that the names of the pods for the job in this case start with the job name.

        13. This will output something like:

          Run \"nbody -benchmark [-numbodies=<numBodies>]\" to measure performance.\n    -fullscreen       (run n-body simulation in fullscreen mode)\n    -fp64             (use double precision floating point values for simulation)\n    -hostmem          (stores simulation data in host memory)\n    -benchmark        (run benchmark to measure performance)\n    -numbodies=<N>    (number of bodies (>= 1) to run in simulation)\n    -device=<d>       (where d=0,1,2.... for the CUDA device to use)\n    -numdevices=<i>   (where i=(number of CUDA devices > 0) to use for simulation)\n    -compare          (compares simulation results running once on the default GPU and once on the CPU)\n    -cpu              (run n-body simulation on the CPU)\n    -tipsy=<file.bin> (load a tipsy model file for simulation)\n\nNOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.\n\n> Fullscreen mode\n> Simulation data stored in video memory\n> Double precision floating point simulation\n> 1 Devices used for simulation\nGPU Device 0: \"Ampere\" with compute capability 8.0\n\n> Compute 8.0 CUDA device: [NVIDIA A100-SXM4-40GB]\nnumber of bodies = 512000\n512000 bodies, total time for 10 iterations: 10570.778 ms\n= 247.989 billion interactions per second\n= 7439.679 double-precision GFLOP/s at 30 flops per interaction\n
        14. Delete your job with kubectl delete job jobtest-b92qg - this will delete the associated pods as well.

        "},{"location":"services/gpuservice/training/L1_getting_started/#specifying-gpu-requirements","title":"Specifying GPU requirements","text":"

          If you create multiple jobs with the same definition file and compare their log files you may notice the CUDA device may differ from Compute 8.0 CUDA device: [NVIDIA A100-SXM4-40GB].

          The GPU Operator on K8s allocates the pod to the first node with a free GPU that matches the other resource specifications, irrespective of the GPU type present on the node.

          The GPU resource requests can be made more specific by adding the type of GPU product the pod is requesting to the node selector:

          • nvidia.com/gpu.product: 'NVIDIA-A100-SXM4-80GB'
          • nvidia.com/gpu.product: 'NVIDIA-A100-SXM4-40GB'
          • nvidia.com/gpu.product: 'NVIDIA-A100-SXM4-40GB-MIG-3g.20gb'
          • nvidia.com/gpu.product: 'NVIDIA-A100-SXM4-40GB-MIG-1g.5gb'
          • nvidia.com/gpu.product: 'NVIDIA-H100-80GB-HBM3'
          "},{"location":"services/gpuservice/training/L1_getting_started/#example-yaml-file","title":"Example yaml file","text":"
          apiVersion: batch/v1\nkind: Job\nmetadata:\n    generateName: jobtest-\n    labels:\n        kueue.x-k8s.io/queue-name:  namespace-user-queue\nspec:\n    completions: 1\n    template:\n        metadata:\n            name: job-test\n        spec:\n            containers:\n            - name: cudasample\n              image: nvcr.io/nvidia/k8s/cuda-sample:nbody-cuda11.7.1\n              args: [\"-benchmark\", \"-numbodies=512000\", \"-fp64\", \"-fullscreen\"]\n              resources:\n                    requests:\n                        cpu: 2\n                        memory: '1Gi'\n                    limits:\n                        cpu: 2\n                        memory: '4Gi'\n                        nvidia.com/gpu: 1\n            restartPolicy: Never\n            nodeSelector:\n                nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB-MIG-1g.5gb\n
          "},{"location":"services/gpuservice/training/L1_getting_started/#running-multiple-pods-with-k8s-jobs","title":"Running multiple pods with K8s jobs","text":"

          The recommended way to use the EIDF GPU Service is with a job request, which wraps around a pod specification and provides several useful attributes.

          Firstly, if a pod is assigned to a node that dies then the pod itself will fail and the user has to manually restart it.

          Wrapping a pod within a job enables the self-healing mechanism within K8s so that if a node dies with the job's pod on it then the job will find a new node to automatically restart the pod, if the restartPolicy is set.

          Jobs allow users to define multiple pods that can run in parallel or series and will continue to spawn pods until a specific number of pods successfully terminate.

          Jobs allow for better scheduling of resources using the Kueue service implemented on the EIDF GPU Service. Pods which attempt to bypass the queue mechanism this provides will affect the experience of other project users.

          See below for an example K8s job that requires three pods to successfully complete the example CUDA code before the job itself ends.

          apiVersion: batch/v1\nkind: Job\nmetadata:\n generateName: jobtest-\n labels:\n    kueue.x-k8s.io/queue-name:  namespace-user-queue\nspec:\n completions: 3\n parallelism: 1\n template:\n  metadata:\n   name: job-test\n  spec:\n   containers:\n   - name: cudasample\n     image: nvcr.io/nvidia/k8s/cuda-sample:nbody-cuda11.7.1\n     args: [\"-benchmark\", \"-numbodies=512000\", \"-fp64\", \"-fullscreen\"]\n     resources:\n      requests:\n       cpu: 2\n       memory: '1Gi'\n      limits:\n       cpu: 2\n       memory: '4Gi'\n       nvidia.com/gpu: 1\n   restartPolicy: Never\n
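The interaction between completions and parallelism in the example above can be illustrated with a toy model. The sketch below is not how Kubernetes is implemented; it only counts how many pods would be launched, assuming a hypothetical list of pod outcomes and ignoring the Job's backoffLimit and any queueing delay.

```python
from typing import Iterable


def pods_needed(completions: int, parallelism: int, outcomes: Iterable[bool]) -> int:
    """Count pods launched before `completions` pods have succeeded.

    `outcomes` yields True (pod succeeded) or False (pod failed) in launch
    order. This toy model ignores backoffLimit and scheduling delays.
    """
    results = iter(outcomes)
    succeeded = 0
    launched = 0
    while succeeded < completions:
        # Never run more pods at once than are still needed.
        batch = min(parallelism, completions - succeeded)
        for _ in range(batch):
            launched += 1
            if next(results):
                succeeded += 1
    return launched


# Three completions, one pod at a time, with the second pod failing.
print(pods_needed(3, 1, [True, False, True, True]))
```

With one failed pod the job launches four pods in total, because the failed pod does not count towards the three required completions.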
          "},{"location":"services/gpuservice/training/L2_requesting_persistent_volumes/","title":"Requesting Persistent Volumes With Kubernetes","text":"

          Pods in the K8s EIDF GPU Service are intentionally ephemeral.

          They only last as long as required to complete the task that they were created for.

          Keeping pods ephemeral ensures the cluster resources are released for other users to request.

          However, this means the default storage volumes within a pod are temporary.

          If multiple pods require access to the same large data set or they output large files, then computationally costly file transfers need to be included in every pod instance.

          K8s allows you to request persistent volumes that can be mounted to multiple pods to share files or collate outputs.

          These persistent volumes will remain even if the pods they are mounted to are deleted, are updated or crash.

          "},{"location":"services/gpuservice/training/L2_requesting_persistent_volumes/#submitting-a-persistent-volume-claim","title":"Submitting a Persistent Volume Claim","text":"

          Before a persistent volume can be mounted to a pod, the required storage resources need to be requested and reserved to your namespace.

          A PersistentVolumeClaim (PVC) needs to be submitted to K8s to request the storage resources.

          The storage resources are held on a Ceph server which can accept requests up to 100 TiB. Currently, each PVC can only be accessed by one pod at a time; this limitation is being addressed in further development of the EIDF GPU Service. At this stage, pods can therefore mount the same PVC in sequence, but not concurrently.

          Example PVCs can be seen on the Kubernetes documentation page.

          All PVCs on the EIDF GPU Service must use the csi-rbd-sc storage class.

          "},{"location":"services/gpuservice/training/L2_requesting_persistent_volumes/#example-persistentvolumeclaim","title":"Example PersistentVolumeClaim","text":"
          kind: PersistentVolumeClaim\napiVersion: v1\nmetadata:\n name: test-ceph-pvc\nspec:\n accessModes:\n  - ReadWriteOnce\n resources:\n  requests:\n   storage: 2Gi\n storageClassName: csi-rbd-sc\n

          You create a persistent volume by passing the yaml file to kubectl, as you would a pod specification yaml: kubectl create -f <PVC specification yaml>. Once you have successfully created a persistent volume, you can interact with it using the standard kubectl commands:

          • kubectl delete pvc <PVC name>
          • kubectl get pvc <PVC name>
          • kubectl apply -f <PVC specification yaml>
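Storage requests such as 2Gi use binary suffixes, which are easy to confuse with decimal ones. The following is a minimal sketch of a converter for whole-number quantities; the function name and the set of supported suffixes are choices made for this illustration, and it does not cover every form Kubernetes accepts.

```python
import re

# Multipliers for common suffixes on Kubernetes storage quantities.
_SUFFIXES = {
    "": 1,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
}


def quantity_to_bytes(quantity: str) -> int:
    """Convert a whole-number storage quantity such as '2Gi' to bytes."""
    match = re.fullmatch(r"(\d+)([A-Za-z]*)", quantity.strip())
    if match is None or match.group(2) not in _SUFFIXES:
        raise ValueError(f"unsupported quantity: {quantity!r}")
    return int(match.group(1)) * _SUFFIXES[match.group(2)]


print(quantity_to_bytes("2Gi"))    # the size requested in the example PVC
print(quantity_to_bytes("2G"))     # the decimal equivalent is smaller
```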
          "},{"location":"services/gpuservice/training/L2_requesting_persistent_volumes/#mounting-a-persistent-volume-to-a-pod","title":"Mounting a persistent Volume to a Pod","text":"

          Introducing a persistent volume to a pod requires the addition of a volumeMount option to the container and a volume option linking to the PVC in the pod specification yaml.

          "},{"location":"services/gpuservice/training/L2_requesting_persistent_volumes/#example-pod-specification-yaml-with-mounted-persistent-volume","title":"Example pod specification yaml with mounted persistent volume","text":"
          apiVersion: batch/v1\nkind: Job\nmetadata:\n    name: test-ceph-pvc-job\n    labels:\n        kueue.x-k8s.io/queue-name:  <project namespace>-user-queue\nspec:\n    completions: 1\n    template:\n        metadata:\n            name: test-ceph-pvc-pod\n        spec:\n            containers:\n            - name: cudasample\n              image: busybox\n              args: [\"sleep\", \"infinity\"]\n              resources:\n                    requests:\n                        cpu: 2\n                        memory: '1Gi'\n                    limits:\n                        cpu: 2\n                        memory: '4Gi'\n              volumeMounts:\n                    - mountPath: /mnt/ceph_rbd\n                      name: volume\n            restartPolicy: Never\n            volumes:\n                - name: volume\n                  persistentVolumeClaim:\n                    claimName: test-ceph-pvc\n
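A frequent mistake when adding a persistent volume is a volumeMount whose name does not match any entry under volumes. The sketch below checks for this on a pod specification held as a Python dictionary; the function name is hypothetical and the dictionary is the example above reduced to the relevant fields.

```python
def unresolved_mounts(pod_spec: dict) -> list[str]:
    """Return volumeMount names that have no matching entry in `volumes`."""
    defined = {volume["name"] for volume in pod_spec.get("volumes", [])}
    missing = []
    for container in pod_spec.get("containers", []):
        for mount in container.get("volumeMounts", []):
            if mount["name"] not in defined:
                missing.append(mount["name"])
    return missing


# The pod template from the example, reduced to the relevant fields.
pod_spec = {
    "containers": [
        {
            "name": "cudasample",
            "volumeMounts": [{"mountPath": "/mnt/ceph_rbd", "name": "volume"}],
        }
    ],
    "volumes": [
        {"name": "volume", "persistentVolumeClaim": {"claimName": "test-ceph-pvc"}}
    ],
}
print(unresolved_mounts(pod_spec))
```

An empty list means every mount refers to a defined volume.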
          "},{"location":"services/gpuservice/training/L2_requesting_persistent_volumes/#accessing-the-persistent-volume-outside-a-pod","title":"Accessing the persistent volume outside a pod","text":"

          To move files in/out of the persistent volume from outside a pod you can use the kubectl cp command.

          *** On Login Node - replacing pod name with your pod name ***\nkubectl cp /home/data/test_data.csv test-ceph-pvc-job-8c9cc:/mnt/ceph_rbd\n

          For more complex file transfers and synchronisation, create a low resource pod with the persistent volume mounted.

          The bash command rsync can be amended to manage file transfers into the mounted PV following this GitHub repo.

          "},{"location":"services/gpuservice/training/L2_requesting_persistent_volumes/#clean-up","title":"Clean up","text":"
          kubectl delete job test-ceph-pvc-job\n\nkubectl delete pvc test-ceph-pvc\n
          "},{"location":"services/gpuservice/training/L3_running_a_pytorch_task/","title":"Running a PyTorch task","text":"

          In the following lesson, we'll build an NLP neural network and train it using the EIDF GPU Service.

          The model was taken from the PyTorch Tutorials.

          The lesson will be split into three parts:

          • Requesting a persistent volume and transferring code/data to it
          • Creating a pod with a PyTorch container downloaded from DockerHub
          • Submitting a job to the EIDF GPU Service and retrieving the results
          "},{"location":"services/gpuservice/training/L3_running_a_pytorch_task/#load-training-data-and-ml-code-into-a-persistent-volume","title":"Load training data and ML code into a persistent volume","text":""},{"location":"services/gpuservice/training/L3_running_a_pytorch_task/#create-a-persistent-volume","title":"Create a persistent volume","text":"

          Request storage from the Ceph server by submitting a PVC to K8s (example pvc spec yaml below).

          kubectl create -f <pvc-spec-yaml>\n
          "},{"location":"services/gpuservice/training/L3_running_a_pytorch_task/#example-pytorch-persistentvolumeclaim","title":"Example PyTorch PersistentVolumeClaim","text":"
          kind: PersistentVolumeClaim\napiVersion: v1\nmetadata:\n name: pytorch-pvc\nspec:\n accessModes:\n  - ReadWriteOnce\n resources:\n  requests:\n   storage: 2Gi\n storageClassName: csi-rbd-sc\n
          "},{"location":"services/gpuservice/training/L3_running_a_pytorch_task/#transfer-codedata-to-persistent-volume","title":"Transfer code/data to persistent volume","text":"
          1. Check PVC has been created

            kubectl get pvc <pvc-name>\n
          2. Create a lightweight job whose pod has the PV mounted (example job below)

            kubectl create -f lightweight-pod-job.yaml\n
          3. Download the PyTorch code

            wget https://github.com/EPCCed/eidf-docs/raw/main/docs/services/gpuservice/training/resources/example_pytorch_code.py\n
          4. Copy the Python script into the PV

            kubectl cp example_pytorch_code.py lightweight-job-<identifier>:/mnt/ceph_rbd/\n
          5. Check whether the files were transferred successfully

            kubectl exec lightweight-job-<identifier> -- ls /mnt/ceph_rbd\n
          6. Delete the lightweight job

            kubectl delete job lightweight-job\n
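The steps above need the generated pod name, which differs from the job name. As a rough sketch, the helpers below assemble the kubectl commands for looking up the pod through the job-name label that Kubernetes sets on pods created by a Job, and for copying a file into it. The helper names are hypothetical, and the commands are printed rather than executed.

```python
import shlex


def pod_lookup_argv(job_name: str) -> list[str]:
    """kubectl arguments that print the name of the first pod of a job."""
    return [
        "kubectl", "get", "pods",
        f"--selector=job-name={job_name}",
        "--output=jsonpath={.items[0].metadata.name}",
    ]


def copy_to_pod_argv(local_path: str, pod_name: str, remote_dir: str) -> list[str]:
    """kubectl arguments that copy a local file into a directory in a pod."""
    return ["kubectl", "cp", local_path, f"{pod_name}:{remote_dir}"]


# Print the commands rather than running them, so the sketch is safe to try.
print(shlex.join(pod_lookup_argv("lightweight-job")))
print(shlex.join(copy_to_pod_argv("example_pytorch_code.py", "<pod-name>", "/mnt/ceph_rbd/")))
```

To run a command for real, the argument list could be passed to subprocess.run on a machine where kubectl is configured for the cluster.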
          "},{"location":"services/gpuservice/training/L3_running_a_pytorch_task/#example-lightweight-job-specification","title":"Example lightweight job specification","text":"
          apiVersion: batch/v1\nkind: Job\nmetadata:\n    name: lightweight-job\n    labels:\n        kueue.x-k8s.io/queue-name:  <project namespace>-user-queue\nspec:\n    completions: 1\n    template:\n        metadata:\n            name: lightweight-pod\n        spec:\n            containers:\n            - name: data-loader\n              image: busybox\n              args: [\"sleep\", \"infinity\"]\n              resources:\n                    requests:\n                        cpu: 1\n                        memory: '1Gi'\n                    limits:\n                        cpu: 1\n                        memory: '1Gi'\n              volumeMounts:\n                    - mountPath: /mnt/ceph_rbd\n                      name: volume\n            restartPolicy: Never\n            volumes:\n                - name: volume\n                  persistentVolumeClaim:\n                    claimName: pytorch-pvc\n
          "},{"location":"services/gpuservice/training/L3_running_a_pytorch_task/#creating-a-job-with-a-pytorch-container","title":"Creating a Job with a PyTorch container","text":"

          We will use the pre-made PyTorch Docker image available on Docker Hub to run the PyTorch ML model.

          The PyTorch container will be held within a pod that has the persistent volume mounted and has access to a MIG GPU.

          Submit the specification file below to K8s to create the job, replacing the queue name with your project namespace queue name.

          kubectl create -f <pytorch-job-yaml>\n
          "},{"location":"services/gpuservice/training/L3_running_a_pytorch_task/#example-pytorch-job-specification-file","title":"Example PyTorch Job Specification File","text":"
          apiVersion: batch/v1\nkind: Job\nmetadata:\n    name: pytorch-job\n    labels:\n        kueue.x-k8s.io/queue-name:  <project namespace>-user-queue\nspec:\n    completions: 1\n    template:\n        metadata:\n            name: pytorch-pod\n        spec:\n            restartPolicy: Never\n            containers:\n            - name: pytorch-con\n              image: pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel\n              command: [\"python3\"]\n              args: [\"/mnt/ceph_rbd/example_pytorch_code.py\"]\n              volumeMounts:\n                - mountPath: /mnt/ceph_rbd\n                  name: volume\n              resources:\n                requests:\n                  cpu: 2\n                  memory: \"1Gi\"\n                limits:\n                  cpu: 4\n                  memory: \"4Gi\"\n                  nvidia.com/gpu: 1\n            nodeSelector:\n                nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB-MIG-1g.5gb\n            volumes:\n                - name: volume\n                  persistentVolumeClaim:\n                    claimName: pytorch-pvc\n
          "},{"location":"services/gpuservice/training/L3_running_a_pytorch_task/#reviewing-the-results-of-the-pytorch-model","title":"Reviewing the results of the PyTorch model","text":"

          This is not intended to be an introduction to PyTorch; please see the online tutorial for details about the model.

          1. Check that the model ran to completion

            kubectl logs <pytorch-pod-name>\n
          2. Spin up a lightweight pod to retrieve results

            kubectl create -f lightweight-pod-job.yaml\n
          3. Copy the trained model back to your access VM

            kubectl cp lightweight-job-<identifier>:/mnt/ceph_rbd/model.pth model.pth\n
          "},{"location":"services/gpuservice/training/L3_running_a_pytorch_task/#using-a-kubernetes-job-to-train-the-pytorch-model-multiple-times","title":"Using a Kubernetes job to train the pytorch model multiple times","text":"

          A common ML training workflow may consist of training multiple iterations of a model, such as models with different hyperparameters or models trained on different data sets.

          A Kubernetes job can create and manage multiple pods with identical or different initial parameters.

          NVIDIA provide a detailed tutorial on how to conduct an ML hyperparameter search with a Kubernetes job.

          Below is an example job yaml for running the pytorch model which will continue to create pods until three have successfully completed the task of training the model.

          apiVersion: batch/v1\nkind: Job\nmetadata:\n    name: pytorch-job\n    labels:\n        kueue.x-k8s.io/queue-name:  <project namespace>-user-queue\nspec:\n    completions: 3\n    template:\n        metadata:\n            name: pytorch-pod\n        spec:\n            restartPolicy: Never\n            containers:\n            - name: pytorch-con\n              image: pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel\n              command: [\"python3\"]\n              args: [\"/mnt/ceph_rbd/example_pytorch_code.py\"]\n              volumeMounts:\n                - mountPath: /mnt/ceph_rbd\n                  name: volume\n              resources:\n                requests:\n                  cpu: 2\n                  memory: \"1Gi\"\n                limits:\n                  cpu: 4\n                  memory: \"4Gi\"\n                  nvidia.com/gpu: 1\n            nodeSelector:\n                nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB-MIG-1g.5gb\n            volumes:\n                - name: volume\n                  persistentVolumeClaim:\n                    claimName: pytorch-pvc\n
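The job above repeats the same training run three times. One simple way to vary the hyperparameters instead is to generate a separate job per combination, as in the sketch below. This is only an illustration: the LEARNING_RATE and BATCH_SIZE environment variables are hypothetical and the training script would have to read them itself, and the queue label, resources, nodeSelector and volume from the example are omitted for brevity and would need to be added back before submission.

```python
import itertools
import json


def sweep_jobs(learning_rates, batch_sizes):
    """Build one minimal Job manifest per hyperparameter combination."""
    jobs = []
    for index, (lr, batch) in enumerate(itertools.product(learning_rates, batch_sizes)):
        jobs.append({
            "apiVersion": "batch/v1",
            "kind": "Job",
            "metadata": {"name": f"pytorch-job-{index}"},
            "spec": {
                "completions": 1,
                "template": {"spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "pytorch-con",
                        "image": "pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel",
                        "command": ["python3"],
                        "args": ["/mnt/ceph_rbd/example_pytorch_code.py"],
                        # Hypothetical variables: the script must read them itself.
                        "env": [
                            {"name": "LEARNING_RATE", "value": str(lr)},
                            {"name": "BATCH_SIZE", "value": str(batch)},
                        ],
                    }],
                }},
            },
        })
    return jobs


jobs = sweep_jobs([0.1, 0.01], [32, 64])
print(len(jobs))
print(json.dumps(jobs[0]["metadata"]))
```

Each manifest can be written to a file with json.dump and submitted with kubectl create -f, since kubectl accepts JSON as well as yaml.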
          "},{"location":"services/gpuservice/training/L3_running_a_pytorch_task/#clean-up","title":"Clean up","text":"
          kubectl delete job pytorch-job\n\nkubectl delete pvc pytorch-pvc\n
          "},{"location":"services/gpuservice/training/L4_template_workflow/","title":"Template workflow","text":""},{"location":"services/gpuservice/training/L4_template_workflow/#requirements","title":"Requirements","text":"

          It is recommended that users complete Getting started with Kubernetes and Requesting persistent volumes With Kubernetes before proceeding with this tutorial.

          "},{"location":"services/gpuservice/training/L4_template_workflow/#overview","title":"Overview","text":"

          An example workflow for code development using K8s is outlined below.

          In theory, users can create docker images with all the code, software and data included to complete their analysis.

          In practice, docker images with the required software can be several gigabytes in size which can lead to unacceptable download times when ~100GB of data and code is then added.

          Therefore, it is recommended to separate code, software, and data preparation into distinct steps:

          1. Data Loading: Loading large data sets asynchronously.

          2. Developing a Docker environment: Manually or automatically building Docker images.

          3. Code development with K8s: Iteratively changing and testing code in a job.

          The workflow describes different strategies to tackle the three common stages in code development and analysis using the EIDF GPU Service.

          The three stages are interchangeable and may not be relevant to every project.

          Some strategies in the workflow require a GitHub account and Docker Hub account for automatic building (this can be adapted for other platforms such as GitLab).

          "},{"location":"services/gpuservice/training/L4_template_workflow/#data-loading","title":"Data loading","text":"

          The EIDF GPU Service contains GPUs with 40 GB/80 GB of on-board memory and it is expected that data sets of > 100 GB will be loaded onto the service to utilise this hardware.

          Persistent volume claims need to be of sufficient size to hold the input data, any expected output data and a small amount of additional empty space to facilitate IO.
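As a worked example of this sizing rule, the sketch below adds the input and output sizes and rounds up after applying some headroom. The 10% default is an assumption made for illustration, not a figure recommended by the service, and the function name is hypothetical.

```python
def pvc_request_gib(input_gib: int, output_gib: int, headroom_percent: int = 10) -> int:
    """Size a PVC for input data, expected output and spare space, in GiB."""
    needed = input_gib + output_gib
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-needed * (100 + headroom_percent) // 100)


# 100 GiB of input, 20 GiB of expected output, 10% headroom.
print(f"storage: {pvc_request_gib(100, 20)}Gi")
```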

          Read the requesting persistent volumes with Kubernetes lesson to learn how to request and mount persistent volumes to pods.

          It often takes several hours or days to download data sets of 1/2 TB or more to a persistent volume.

          Therefore, the data download step needs to be completed asynchronously as maintaining a connection to the server for long periods of time can be unreliable.
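A rough estimate shows why large downloads take this long. The sketch below assumes a sustained transfer rate, which is an illustrative figure and not a measurement of the EIDF network, and uses decimal units (1 GB = 8000 megabits).

```python
def download_hours(size_gb: float, megabits_per_second: float) -> float:
    """Estimate transfer time in hours for `size_gb` decimal gigabytes."""
    megabits = size_gb * 8 * 1000  # 1 GB = 8000 megabits
    return megabits / megabits_per_second / 3600


# A 1 TB data set at an assumed sustained 100 Mbit/s.
print(f"{download_hours(1000, 100):.1f} hours")
```

At that assumed rate a 1 TB data set takes most of a day, so the download should not depend on an interactive session staying open.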

          "},{"location":"services/gpuservice/training/L4_template_workflow/#asynchronous-data-downloading-with-a-lightweight-job","title":"Asynchronous data downloading with a lightweight job","text":"
          1. Check a PVC has been created.

            kubectl -n <project-namespace> get pvc template-workflow-pvc\n
          2. Write a job yaml with PV mounted and a command to download the data. Change the curl URL to your data set of interest.

            apiVersion: batch/v1\nkind: Job\nmetadata:\n name: lightweight-job\n labels:\n  kueue.x-k8s.io/queue-name: <project-namespace>-user-queue\nspec:\n completions: 1\n parallelism: 1\n template:\n  metadata:\n   name: lightweight-job\n  spec:\n   restartPolicy: Never\n   containers:\n   - name: data-loader\n     image: alpine/curl:latest\n     command: ['sh', '-c', \"cd /mnt/ceph_rbd; curl https://archive.ics.uci.edu/static/public/53/iris.zip -o iris.zip\"]\n     resources:\n      requests:\n       cpu: 1\n       memory: \"1Gi\"\n      limits:\n       cpu: 1\n       memory: \"1Gi\"\n     volumeMounts:\n     - mountPath: /mnt/ceph_rbd\n       name: volume\n   volumes:\n   - name: volume\n     persistentVolumeClaim:\n      claimName: template-workflow-pvc\n
          3. Run the data download job.

            kubectl -n <project-namespace> create -f lightweight-pod.yaml\n
          4. Check if the download has completed.

            kubectl -n <project-namespace> get jobs\n
          5. Delete the lightweight job once completed.

            kubectl -n <project-namespace> delete job lightweight-job\n
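Checking completion can also be scripted. A Job object reports its state through status conditions of type Complete or Failed, so the output of kubectl get job lightweight-job -o json can be classified as in the sketch below. The sample document is a hand-written, heavily trimmed illustration rather than real cluster output, and the function name is hypothetical.

```python
import json


def job_state(job: dict) -> str:
    """Classify a Job object as 'complete', 'failed' or 'unfinished'."""
    for condition in job.get("status", {}).get("conditions", []):
        if condition.get("status") != "True":
            continue
        if condition.get("type") == "Complete":
            return "complete"
        if condition.get("type") == "Failed":
            return "failed"
    # No terminal condition yet: the job is queued, suspended or still running.
    return "unfinished"


# Hand-written, trimmed stand-in for `kubectl get job lightweight-job -o json`.
sample = json.loads(
    '{"status": {"succeeded": 1, "conditions": [{"type": "Complete", "status": "True"}]}}'
)
print(job_state(sample))
```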
          "},{"location":"services/gpuservice/training/L4_template_workflow/#asynchronous-data-downloading-within-a-screen-session","title":"Asynchronous data downloading within a screen session","text":"

          Screen is a window manager available in Linux that allows you to create multiple interactive shells and swap between them.

          Screen has the added benefit that if your remote session is interrupted the screen session persists and can be reattached when you manage to reconnect.

          This allows you to start a task, such as downloading a data set, and check in on it asynchronously.

          Once you have started a screen session, you can create a new window with ctrl-a c, swap between windows with ctrl-a 0-9 and exit screen (but keep any task running) with ctrl-a d.

          Using screen rather than a single download job can be helpful if downloading multiple data sets or if you intend to do some simple QC or tidying up before/after downloading.

          1. Start a screen session.

            screen\n
          2. Create an interactive lightweight job session.

            apiVersion: batch/v1\nkind: Job\nmetadata:\n name: lightweight-job\n labels:\n  kueue.x-k8s.io/queue-name: <project-namespace>-user-queue\nspec:\n completions: 1\n parallelism: 1\n template:\n  metadata:\n   name: lightweight-pod\n  spec:\n   restartPolicy: Never\n   containers:\n   - name: data-loader\n     image: alpine/curl:latest\n     command: ['sleep','infinity']\n     resources:\n      requests:\n       cpu: 1\n       memory: \"1Gi\"\n      limits:\n       cpu: 1\n       memory: \"1Gi\"\n     volumeMounts:\n     - mountPath: /mnt/ceph_rbd\n       name: volume\n   volumes:\n   - name: volume\n     persistentVolumeClaim:\n      claimName: template-workflow-pvc\n
          3. Download data set. Change the curl URL to your data set of interest.

            kubectl -n <project-namespace> exec <lightweight-pod-name> -- curl https://archive.ics.uci.edu/static/public/53/iris.zip -o /mnt/ceph_rbd/iris.zip\n
          4. Exit the remote session by either ending the session or ctrl-a d.

          5. Reconnect at a later time and reattach the screen window.

            screen -list\n\nscreen -r <session-name>\n
          6. Check the download was successful and delete the job.

            kubectl -n <project-namespace> exec <lightweight-pod-name> -- ls /mnt/ceph_rbd/\n\nkubectl -n <project-namespace> delete job lightweight-job\n
          7. Exit the screen session.

            exit\n
          "},{"location":"services/gpuservice/training/L4_template_workflow/#preparing-a-custom-docker-image","title":"Preparing a custom Docker image","text":"

          Kubernetes requires Docker images to be pre-built and available for download from a container repository such as Docker Hub.

          It does not provide functionality to build images and create pods from docker files.

          However, use cases may require some custom modifications of a base image, such as adding a python library.

          These custom images need to be built locally (using docker) or online (using a GitHub/GitLab worker) and pushed to a repository such as Docker Hub.

          This is not an introduction to building docker images; please see the Docker tutorial for a general overview.

          "},{"location":"services/gpuservice/training/L4_template_workflow/#manually-building-a-docker-image-locally","title":"Manually building a Docker image locally","text":"
          1. Select a suitable base image (The Nvidia container catalog is often a useful starting place for GPU accelerated tasks). We'll use the base RAPIDS image.

          2. Create a Dockerfile to add any additional packages required to the base image.

            FROM nvcr.io/nvidia/rapidsai/base:23.12-cuda12.0-py3.10\nRUN pip install pandas\nRUN pip install plotly\n
          3. Build the Docker container locally (You will need to install Docker)

            cd <dockerfile-folder>\n\ndocker build . -t <docker-hub-username>/template-docker-image:latest\n

          Building images for different CPU architectures

          Be aware that docker images built for Apple ARM64 architectures will not function optimally on the EIDF GPU Service's AMD64 based architecture.

          If building docker images locally on an Apple device you must tell the docker daemon to use AMD64 based images by passing the --platform linux/amd64 flag to the docker build command.

          4. Create a repository to hold the image on Docker Hub (You will need to create and set up an account).

          5. Push the Docker image to the repository.

            docker push <docker-hub-username>/template-docker-image:latest\n
          6. Finally, specify your Docker image in the image: tag of the job specification yaml file.

            apiVersion: batch/v1\nkind: Job\nmetadata:\n name: template-workflow-job\n labels:\n  kueue.x-k8s.io/queue-name: <project-namespace>-user-queue\nspec:\n completions: 1\n parallelism: 1\n template:\n  spec:\n   restartPolicy: Never\n   containers:\n   - name: template-docker-image\n     image: <docker-hub-username>/template-docker-image:latest\n     command: [\"sleep\", \"infinity\"]\n     resources:\n      requests:\n       cpu: 1\n       memory: \"4Gi\"\n      limits:\n       cpu: 1\n       memory: \"8Gi\"\n
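The --platform flag discussed in the build steps above can be added automatically when building on a non-AMD64 machine. The sketch below is one possible approach, assuming the machine name reported by Python's platform.machine() is a reliable indicator; the helper name is hypothetical and the command is printed rather than run.

```python
import platform
import shlex


def docker_build_argv(tag: str, machine: str) -> list[str]:
    """Build a `docker build` command, forcing AMD64 on non-AMD64 hosts."""
    argv = ["docker", "build", ".", "-t", tag]
    if machine.lower() not in ("x86_64", "amd64"):
        argv += ["--platform", "linux/amd64"]
    return argv


tag = "<docker-hub-username>/template-docker-image:latest"
print(shlex.join(docker_build_argv(tag, platform.machine())))
```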
          "},{"location":"services/gpuservice/training/L4_template_workflow/#automatically-building-docker-images-using-github-actions","title":"Automatically building docker images using GitHub Actions","text":"

          In cases where the Docker image needs to be built and tested iteratively (e.g. to check for compatibility issues), git version control and GitHub Actions can simplify the build process.

          A GitHub action can build and push a Docker image to Docker Hub whenever it detects a git push that changes the docker file in a git repo.

          This process requires you to already have a GitHub and Docker Hub account.

          1. Create an access token on your Docker Hub account to allow GitHub to push changes to the Docker Hub image repo.

          2. Create two GitHub secrets to securely provide your Docker Hub username and access token.

          3. Add the dockerfile to a code/docker folder within an active GitHub repo.

          4. Add the GitHub action yaml file below to the .github/workflows folder to automatically push a new image to Docker Hub if any changes to files in the code/docker folder are detected.

            name: ci\non:\n  push:\n    paths:\n      - 'code/docker/**'\n\njobs:\n  docker:\n    runs-on: ubuntu-latest\n    steps:\n      -\n        name: Set up QEMU\n        uses: docker/setup-qemu-action@v3\n      -\n        name: Set up Docker Buildx\n        uses: docker/setup-buildx-action@v3\n      -\n        name: Login to Docker Hub\n        uses: docker/login-action@v3\n        with:\n          username: ${{ secrets.DOCKERHUB_USERNAME }}\n          password: ${{ secrets.DOCKERHUB_TOKEN }}\n      -\n        name: Build and push\n        uses: docker/build-push-action@v5\n        with:\n          context: \"{{defaultContext}}:code/docker\"\n          push: true\n          tags: <target-dockerhub-image-name>\n
          5. Push a change to the dockerfile and check the Docker Hub image is updated.

          "},{"location":"services/gpuservice/training/L4_template_workflow/#code-development-with-k8s","title":"Code development with K8s","text":"

          Production code can be included within a Docker image to aid reproducibility as the specific software versions required to run the code are packaged together.

          However, binding the code to the docker image during development can delay the testing cycle as re-downloading all of the software for every change in a code block can take time.

          If the docker image is consistent across tests, then it can be cached locally on the EIDF GPU Service instead of being re-downloaded (this occurs automatically, although the cache is node specific and is not shared across nodes).

          A pod yaml file can be defined to automatically pull the latest code version before running any tests.

          Reducing the download time to fractions of a second allows rapid testing to be completed on the cluster with just the kubectl create command.

          You must already have a GitHub account to follow this process.

          This process allows code development to be conducted on any device/VM with access to the repo (GitHub/GitLab).

          A template GitHub repo with sample code, k8s yaml files and a Docker build GitHub Action is available here.

          "},{"location":"services/gpuservice/training/L4_template_workflow/#create-a-job-that-downloads-and-runs-the-latest-code-version-at-runtime","title":"Create a job that downloads and runs the latest code version at runtime","text":"
          1. Write a standard yaml file for a k8s job with the required resources and custom docker image (example below)

            apiVersion: batch/v1\nkind: Job\nmetadata:\n name: template-workflow-job\n labels:\n  kueue.x-k8s.io/queue-name: <project-namespace>-user-queue\nspec:\n completions: 1\n parallelism: 1\n template:\n  spec:\n   restartPolicy: Never\n   containers:\n   - name: template-docker-image\n     image: <docker-hub-username>/template-docker-image:latest\n     command: [\"sleep\", \"infinity\"]\n     resources:\n      requests:\n       cpu: 1\n       memory: \"4Gi\"\n      limits:\n       cpu: 1\n       memory: \"8Gi\"\n     volumeMounts:\n     - mountPath: /mnt/ceph_rbd\n       name: volume\n   volumes:\n   - name: volume\n     persistentVolumeClaim:\n      claimName: template-workflow-pvc\n
          2. Add an initial container that runs before the main container to download the latest version of the code.

            apiVersion: batch/v1\nkind: Job\nmetadata:\n name: template-workflow-job\n labels:\n  kueue.x-k8s.io/queue-name: <project-namespace>-user-queue\nspec:\n completions: 1\n parallelism: 1\n template:\n  spec:\n   restartPolicy: Never\n   containers:\n   - name: template-docker-image\n     image: <docker-hub-username>/template-docker-image:latest\n     command: [\"sleep\", \"infinity\"]\n     resources:\n      requests:\n       cpu: 1\n       memory: \"4Gi\"\n      limits:\n       cpu: 1\n       memory: \"8Gi\"\n     volumeMounts:\n     - mountPath: /mnt/ceph_rbd\n       name: volume\n     - mountPath: /code\n       name: github-code\n   initContainers:\n   - name: lightweight-git-container\n     image: cicirello/alpine-plus-plus\n     command: ['sh', '-c', \"cd /code; git clone <target-repo>\"]\n     resources:\n      requests:\n       cpu: 1\n       memory: \"4Gi\"\n      limits:\n       cpu: 1\n       memory: \"8Gi\"\n     volumeMounts:\n     - mountPath: /code\n       name: github-code\n   volumes:\n   - name: volume\n     persistentVolumeClaim:\n      claimName: template-workflow-pvc\n   - name: github-code\n     emptyDir:\n      sizeLimit: 1Gi\n
          3. Change the command argument in the main container to run the code once started. Add the URL of the GitHub repo of interest to the initContainers: command: tag.

            apiVersion: batch/v1\nkind: Job\nmetadata:\n name: template-workflow-job\n labels:\n  kueue.x-k8s.io/queue-name: <project-namespace>-user-queue\nspec:\n completions: 1\n parallelism: 1\n template:\n  spec:\n   restartPolicy: Never\n   containers:\n   - name: template-docker-image\n     image: <docker-hub-username>/template-docker-image:latest\n     command: ['sh', '-c', \"python3 /code/<python-script>\"]\n     resources:\n      requests:\n       cpu: 10\n       memory: \"40Gi\"\n      limits:\n       cpu: 10\n       memory: \"80Gi\"\n       nvidia.com/gpu: 1\n     volumeMounts:\n     - mountPath: /mnt/ceph_rbd\n       name: volume\n     - mountPath: /code\n       name: github-code\n   initContainers:\n   - name: lightweight-git-container\n     image: cicirello/alpine-plus-plus\n     command: ['sh', '-c', \"cd /code; git clone <target-repo>\"]\n     resources:\n      requests:\n       cpu: 1\n       memory: \"4Gi\"\n      limits:\n       cpu: 1\n       memory: \"8Gi\"\n     volumeMounts:\n     - mountPath: /code\n       name: github-code\n   volumes:\n   - name: volume\n     persistentVolumeClaim:\n      claimName: template-workflow-pvc\n   - name: github-code\n     emptyDir:\n      sizeLimit: 1Gi\n
          4. Submit the yaml file to kubernetes

            kubectl -n <project-namespace> create -f <job-yaml-file>\n
          "},{"location":"services/graphcore/","title":"Overview","text":"

          EIDF hosts a Graphcore Bow Pod64 system for AI acceleration.

          The specification of the Bow Pod64 is:

          • 16x Bow-2000 machines
          • 64x Bow IPUs (4 IPUs per Bow-2000)
          • 94,208 IPU cores (1472 cores per IPU)
          • 57.6GB of In-Processor-Memory (0.9GB per IPU)
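The totals in this specification follow directly from the per-machine and per-IPU figures, as the short check below shows. It uses only the numbers listed above.

```python
machines = 16
ipus_per_machine = 4
cores_per_ipu = 1472
memory_per_ipu_gb = 0.9

ipus = machines * ipus_per_machine
print(ipus)                                # total IPUs
print(ipus * cores_per_ipu)                # total IPU cores
print(round(ipus * memory_per_ipu_gb, 1))  # total In-Processor-Memory in GB
```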

          For more details about the IPU architecture, see documentation from Graphcore.

          The smallest unit of compute resource that can be requested is a single IPU.

          Similarly to the EIDF GPU Service, usage of the Graphcore is managed using Kubernetes.

          "},{"location":"services/graphcore/#service-access","title":"Service Access","text":"

          Access to the Graphcore accelerator is provisioned through the EIDF GPU Service.

          Users should apply for access to Graphcore via the EIDF GPU Service.

          "},{"location":"services/graphcore/#project-quotas","title":"Project Quotas","text":"

          Currently there is no active quota mechanism on the Graphcore accelerator. IPUJobs should be actively using partitions on the Graphcore.

          "},{"location":"services/graphcore/#graphcore-tutorial","title":"Graphcore Tutorial","text":"

          The following tutorial teaches users how to submit tasks to the Graphcore system. This tutorial assumes basic familiarity with submitting jobs via Kubernetes. For a tutorial on using Kubernetes, see the GPU service tutorial. For more in-depth lessons about developing applications for Graphcore, see the general documentation and guide for creating IPU jobs via Kubernetes.

          Lessons and objectives:

          • Getting started with IPU jobs: a. How to send an IPUJob. b. Monitoring and cancelling your IPUJob.
          • Multi-IPU Jobs: a. Using multiple IPUs for distributed training.
          • Profiling with PopVision: a. Enabling profiling in your code. b. Downloading the profile reports.
          • Other Frameworks: a. Using TensorFlow and PopART. b. Writing IPU programs with PopLibs (C++).

          "},{"location":"services/graphcore/#further-reading-and-help","title":"Further Reading and Help","text":"
          • The Graphcore documentation provides information about using the Graphcore system.

          • The Graphcore examples repository on GitHub provides a catalogue of application examples that have been optimised to run on Graphcore IPUs for both training and inference. It also contains tutorials for using various frameworks.

          "},{"location":"services/graphcore/faq/","title":"Graphcore FAQ","text":""},{"location":"services/graphcore/faq/#graphcore-questions","title":"Graphcore Questions","text":""},{"location":"services/graphcore/faq/#how-do-i-delete-a-runningterminated-pod","title":"How do I delete a running/terminated pod?","text":"

          An IPUJob manages its launcher and worker pods, so the pods are deleted when the IPUJob itself is deleted with kubectl delete ipujobs <IPUJob-name>. If only the pod is deleted via kubectl delete pod, the IPUJob may respawn the pod.

          To see running or terminated IPUJobs, run kubectl get ipujobs.

          "},{"location":"services/graphcore/faq/#my-ipujob-died-with-a-message-poptorch_cpp_error-failed-to-acquire-x-ipus-why","title":"My IPUJob died with a message: 'poptorch_cpp_error': Failed to acquire X IPU(s). Why?","text":"

          This error may appear when the IPUJob name is too long.

          We have identified that for IPUJobs with metadata:name length over 36 characters, this error may appear. A solution is to reduce the name to under 36 characters.
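
          A quick check before submission can catch over-long names. The sketch below is illustrative only: the helper names are hypothetical, the limit is the 36 characters quoted above, and the 5-character suffix is an assumption about how Kubernetes expands metadata.generateName.

```python
# Hypothetical helpers: not part of kubectl or the IPU Operator.
NAME_LIMIT = 36           # threshold quoted in this FAQ entry
GENERATED_SUFFIX_LEN = 5  # assumed length of the suffix added to generateName


def final_name_length(name_or_prefix, uses_generate_name=False):
    '''Return the length the IPUJob name is expected to have once created.'''
    extra = GENERATED_SUFFIX_LEN if uses_generate_name else 0
    return len(name_or_prefix) + extra


def name_is_safe(name_or_prefix, uses_generate_name=False):
    '''True when the expected name is under the 36-character threshold.'''
    return final_name_length(name_or_prefix, uses_generate_name) < NAME_LIMIT


print(name_is_safe('mnist-training-', uses_generate_name=True))
print(name_is_safe('bert-pretraining-large-128-pod64-run-', uses_generate_name=True))
```

          The first prefix stays well below the threshold; the second is already 37 characters long before any suffix is added, so it would be flagged.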

          "},{"location":"services/graphcore/training/L1_getting_started/","title":"Getting started with Graphcore IPU Jobs","text":"

          This guide assumes basic familiarity with Kubernetes (K8s) and usage of kubectl. See GPU service tutorial to get started.

          "},{"location":"services/graphcore/training/L1_getting_started/#introduction","title":"Introduction","text":"

          Graphcore provides prebuilt docker containers (full list here) which contain the required libraries (pytorch, tensorflow, poplar etc.) and can be used directly within K8s to run on the Graphcore IPUs.

          In this tutorial we will cover running training with a single IPU. The subsequent tutorial will cover using multiple IPUs, which can be used for distributed training jobs.

          "},{"location":"services/graphcore/training/L1_getting_started/#creating-your-first-ipu-job","title":"Creating your first IPU job","text":"

          For our first IPU job, we will be using the Graphcore PyTorch (PopTorch) container image (graphcore/pytorch:3.3.0) to run a simple example of training a neural network for classification on the MNIST dataset, which is provided here. More applications can be found in the repository https://github.com/graphcore/examples.

          To get started:

          1. to specify the job - create the file mnist-training-ipujob.yaml, then copy and save the following content into the file:
          apiVersion: graphcore.ai/v1alpha1\nkind: IPUJob\nmetadata:\n  generateName: mnist-training-\nspec:\n  # jobInstances defines the number of job instances.\n  # More than 1 job instance is usually useful for inference jobs only.\n  jobInstances: 1\n  # ipusPerJobInstance refers to the number of IPUs required per job instance.\n  # A separate IPU partition of this size will be created by the IPU Operator\n  # for each job instance.\n  ipusPerJobInstance: \"1\"\n  workers:\n    template:\n      spec:\n        containers:\n        - name: mnist-training\n          image: graphcore/pytorch:3.3.0\n          command: [/bin/bash, -c, --]\n          args:\n            - |\n              cd;\n              mkdir build;\n              cd build;\n              git clone https://github.com/graphcore/examples.git;\n              cd examples/tutorials/simple_applications/pytorch/mnist;\n              python -m pip install -r requirements.txt;\n              python mnist_poptorch_code_only.py --epochs 1\n          resources:\n            limits:\n              cpu: 32\n              memory: 200Gi\n          securityContext:\n            capabilities:\n              add:\n              - IPC_LOCK\n          volumeMounts:\n          - mountPath: /dev/shm\n            name: devshm\n        restartPolicy: Never\n        hostIPC: true\n        volumes:\n        - emptyDir:\n            medium: Memory\n            sizeLimit: 10Gi\n          name: devshm\n
          2. to submit the job - run kubectl create -f mnist-training-ipujob.yaml, which will give the following output:

            ipujob.graphcore.ai/mnist-training-<random string> created\n
          3. to monitor progress of the job - run kubectl get pods, which will give the following output:

            NAME                      READY   STATUS      RESTARTS   AGE\nmnist-training-<random string>-worker-0   0/1     Completed   0          2m56s\n
          4. to read the result - run kubectl logs mnist-training-<random string>-worker-0, which will give the following output (or similar):

          ...\nGraph compilation: 100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 100/100 [00:23<00:00]\nEpochs: 100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 1/1 [00:34<00:00, 34.18s/it]\n...\nAccuracy on test set: 97.08%\n
          "},{"location":"services/graphcore/training/L1_getting_started/#monitoring-and-cancelling-your-ipu-job","title":"Monitoring and Cancelling your IPU job","text":"

          An IPU job creates an IPU Operator, which manages the required worker or launcher pods. To see running or completed IPUJobs, run kubectl get ipujobs, which will show:

          NAME             STATUS      CURRENT   DESIRED   LASTMESSAGE          AGE\nmnist-training   Completed   0         1         All instances done   10m\n

          To delete the IPUJob, run kubectl delete ipujobs <job-name>, e.g. kubectl delete ipujobs mnist-training-<random string>. This will also delete the associated worker pod mnist-training-<random string>-worker-0.

          Note: simply deleting the pod via kubectl delete pods mnist-training-<random-string>-worker-0 does not delete the IPU job, which will need to be deleted separately.

          Note: you can list all pods via kubectl get all or kubectl get pods, but they do not show the ipujobs. These can be obtained using kubectl get ipujobs.

          Note: kubectl describe pod <pod-name> provides a verbose description of a specific pod.

          "},{"location":"services/graphcore/training/L1_getting_started/#description","title":"Description","text":"

          The Graphcore IPU Operator (Kubernetes interface) extends the Kubernetes API by introducing a custom resource definition (CRD) named IPUJob, which can be seen at the beginning of the included yaml file:

          apiVersion: graphcore.ai/v1alpha1\nkind: IPUJob\n

          An IPUJob allows users to define workloads that can use IPUs. There are several fields specific to an IPUJob:

          jobInstances : This defines the number of job instances. In the case of training it should be 1.

          ipusPerJobInstance : This defines the size of IPU partition that will be created for each job instance.

          workers : This defines a Pod specification that will be used for Worker Pods, including the container image and commands.

          These fields have been populated in the example .yaml file. For distributed training (with multiple IPUs), additional fields need to be included, which will be described in the next lesson.
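
          The same fields can also be assembled programmatically. A minimal sketch, assuming you prefer to generate the manifest from Python: it builds only the fields described above and prints JSON, which kubectl create -f accepts in place of YAML. The function name and its parameters are hypothetical, and the resource limits, securityContext, hostIPC and /dev/shm volume from the full example are left out for brevity, so they would need adding back before real use.

```python
import json


def build_ipujob(prefix, image, shell_command, ipus=1):
    '''Return a dict with the IPUJob fields described in this lesson.'''
    return {
        'apiVersion': 'graphcore.ai/v1alpha1',
        'kind': 'IPUJob',
        'metadata': {'generateName': prefix},
        'spec': {
            'jobInstances': 1,
            # The example manifest passes the IPU count as a string.
            'ipusPerJobInstance': str(ipus),
            'workers': {
                'template': {
                    'spec': {
                        'containers': [{
                            'name': prefix.rstrip('-'),
                            'image': image,
                            'command': ['/bin/bash', '-c', '--'],
                            'args': [shell_command],
                        }],
                        'restartPolicy': 'Never',
                    }
                }
            },
        },
    }


job = build_ipujob('mnist-training-', 'graphcore/pytorch:3.3.0', 'echo hello')
print(json.dumps(job, indent=2))
```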

          "},{"location":"services/graphcore/training/L1_getting_started/#additional-information","title":"Additional Information","text":"

          It is possible to further specify the restart policy (Always/OnFailure/Never/ExitCode) and clean up policy (Workers/All/None); see here.

          "},{"location":"services/graphcore/training/L2_multiple_IPU/","title":"Distributed training on multiple IPUs","text":"

          In this tutorial, we will cover how to run larger models, including examples provided by Graphcore on https://github.com/graphcore/examples. These may require distributed training on multiple IPUs.

          The number of IPUs requested must be in powers of two, i.e. 1, 2, 4, 8, 16, 32, or 64.
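
          The power-of-two rule is easy to check before writing ipusPerJobInstance. A small sketch (the helper names are hypothetical; the upper bound of 64 is the size of the Bow Pod64):

```python
MAX_IPUS = 64  # size of the Bow Pod64


def is_valid_ipu_count(n):
    '''True for 1, 2, 4, 8, 16, 32 or 64.'''
    # A positive power of two has exactly one bit set, so n & (n - 1) is 0.
    return 1 <= n <= MAX_IPUS and (n & (n - 1)) == 0


def next_valid_ipu_count(n):
    '''Smallest valid request that provides at least n IPUs.'''
    if not 1 <= n <= MAX_IPUS:
        raise ValueError('n must be between 1 and 64')
    count = 1
    while count < n:
        count *= 2
    return count


print([n for n in range(1, MAX_IPUS + 1) if is_valid_ipu_count(n)])
print(next_valid_ipu_count(6))
```

          For instance, a model that needs 6 IPUs would have to request 8.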

          "},{"location":"services/graphcore/training/L2_multiple_IPU/#first-example","title":"First example","text":"

          As an example, we will use 4 IPUs to perform the pre-training step of BERT, an NLP transformer model. The code is available from https://github.com/graphcore/examples/tree/master/nlp/bert/pytorch.

          To get started, save and create an IPUJob with the following .yaml file:

          apiVersion: graphcore.ai/v1alpha1\nkind: IPUJob\nmetadata:\n  generateName: bert-training-multi-ipu-\nspec:\n  jobInstances: 1\n  ipusPerJobInstance: \"4\"\n  workers:\n    template:\n      spec:\n        containers:\n        - name: bert-training-multi-ipu\n          image: graphcore/pytorch:3.3.0\n          command: [/bin/bash, -c, --]\n          args:\n            - |\n              cd ;\n              mkdir build;\n              cd build ;\n              git clone https://github.com/graphcore/examples.git;\n              cd examples/nlp/bert/pytorch;\n              apt update ;\n              apt upgrade -y;\n              DEBIAN_FRONTEND=noninteractive TZ='Europe/London' apt install $(< required_apt_packages.txt) -y ;\n              pip3 install -r requirements.txt ;\n              python3 run_pretraining.py --dataset generated --config pretrain_base_128_pod4 --training-steps 1\n          resources:\n            limits:\n              cpu: 32\n              memory: 200Gi\n          securityContext:\n            capabilities:\n              add:\n              - IPC_LOCK\n          volumeMounts:\n          - mountPath: /dev/shm\n            name: devshm\n        restartPolicy: Never\n        hostIPC: true\n        volumes:\n        - emptyDir:\n            medium: Memory\n            sizeLimit: 10Gi\n          name: devshm\n

          Running the above IPUJob and querying the log via kubectl logs pod/bert-training-multi-ipu-<random string>-worker-0 should give:

          ...\nData loaded in 8.559805537108332 secs\n-----------------------------------------------------------\n-------------------- Device Allocation --------------------\nEmbedding  --> IPU 0\nEncoder 0  --> IPU 1\nEncoder 1  --> IPU 1\nEncoder 2  --> IPU 1\nEncoder 3  --> IPU 1\nEncoder 4  --> IPU 2\nEncoder 5  --> IPU 2\nEncoder 6  --> IPU 2\nEncoder 7  --> IPU 2\nEncoder 8  --> IPU 3\nEncoder 9  --> IPU 3\nEncoder 10 --> IPU 3\nEncoder 11 --> IPU 3\nPooler     --> IPU 0\nClassifier --> IPU 0\n-----------------------------------------------------------\n---------- Compilation/Loading from Cache Started ---------\n\n...\n\nGraph compilation: 100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 100/100 [08:02<00:00]\nCompiled/Loaded model in 500.756152929971 secs\n-----------------------------------------------------------\n--------------------- Training Started --------------------\nStep: 0 / 0 - LR: 0.00e+00 - total loss: 10.817 - mlm_loss: 10.386 - nsp_loss: 0.432 - mlm_acc: 0.000 % - nsp_acc: 1.000 %:   0%|          | 0/1 [00:16<?, ?it/s, throughput: 4035.0 samples/sec]\n-----------------------------------------------------------\n-------------------- Training Metrics ---------------------\nglobal_batch_size: 65536\ndevice_iterations: 1\ntraining_steps: 1\nTraining time: 16.245 secs\n-----------------------------------------------------------\n
          "},{"location":"services/graphcore/training/L2_multiple_IPU/#details","title":"Details","text":"

          In this example, we have requested 4 IPUs:

          ipusPerJobInstance: \"4\"\n

          The python flag --config pretrain_base_128_pod4 uses one of the preset configurations for this model with 4 IPUs. Here we also use the --dataset generated flag to generate data rather than download the required dataset.

          To provide sufficient shared memory (shm) for the IPU pod, it may be necessary to mount /dev/shm as follows:

                    volumeMounts:\n          - mountPath: /dev/shm\n            name: devshm\n        volumes:\n        - emptyDir:\n            medium: Memory\n            sizeLimit: 10Gi\n          name: devshm\n

          It is also required to set spec.hostIPC to true:

            hostIPC: true\n

          and add a securityContext to the container definition that enables the IPC_LOCK capability:

              securityContext:\n      capabilities:\n        add:\n        - IPC_LOCK\n

          Note: IPC_LOCK allows the RDMA software stack to use pinned memory, which is particularly useful for PyTorch dataloaders, as these can be very memory hungry. This matters because all data going to the IPUs passes through the network interfaces (100Gbps Ethernet).

          "},{"location":"services/graphcore/training/L2_multiple_IPU/#memory-usage","title":"Memory usage","text":"

          In general, the graph compilation phase of running large models can require significant memory; the execution phase requires far less.

          In the example above, it is possible to explicitly request the memory via:

                    resources:\n            limits:\n              memory: \"128Gi\"\n            requests:\n              memory: \"128Gi\"\n

          which will succeed. (The graph compilation fails if only 32Gi is requested.)

          As a general guideline, 128GB of memory should be enough for the majority of tasks, and requirements rarely exceed 200GB even for jobs with a high IPU count. In the example .yaml script, we do not specifically request memory.
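
          Note that Kubernetes quantities such as 128Gi use binary units, whereas GB conventionally denotes decimal units, so the two differ by roughly 7%. A small sketch of the conversion, assuming only the Ki/Mi/Gi/Ti suffixes need handling (the helper name is hypothetical):

```python
BINARY_SUFFIXES = {'Ki': 2**10, 'Mi': 2**20, 'Gi': 2**30, 'Ti': 2**40}


def quantity_to_bytes(quantity):
    '''Convert a binary-suffixed quantity such as 128Gi to bytes.'''
    for suffix, factor in BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[:-len(suffix)]) * factor
    raise ValueError('unsupported quantity: ' + quantity)


# Print each quantity in bytes and in decimal gigabytes.
for q in ('32Gi', '128Gi', '200Gi'):
    size = quantity_to_bytes(q)
    print(q, size, round(size / 1e9, 1))
```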

          "},{"location":"services/graphcore/training/L2_multiple_IPU/#scaling-up-ipu-count-and-using-poprun","title":"Scaling up IPU count and using Poprun","text":"

          In the example above, python is launched directly in the pod. When scaling up the number of IPUs (e.g. above 8 IPUs), you may run into a CPU bottleneck. This shows up as throughput scaling sub-linearly with the number of data-parallel replicas (i.e. doubling the IPU count does not double the performance). It can also be verified by profiling the application and observing that a significant proportion of the runtime is spent on host CPU workload.

          In this case, Poprun can be used to launch multiple instances. As an example, save the following .yaml configuration and run it:
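
          One way to quantify sub-linear scaling is to compare the achieved speed-up with the ideal, linear one. A minimal sketch; the function is hypothetical and the throughput figures are made-up values for illustration, not measurements from this system:

```python
def scaling_efficiency(base_throughput, base_replicas, new_throughput, new_replicas):
    '''Ratio of achieved speed-up to ideal (linear) speed-up.'''
    achieved = new_throughput / base_throughput
    ideal = new_replicas / base_replicas
    return achieved / ideal


# Hypothetical measurements in samples/sec, for illustration only.
eff = scaling_efficiency(1000.0, 4, 1500.0, 8)
print(round(eff, 2))
```

          An efficiency close to 1 indicates near-linear scaling; a value well below 1 suggests a bottleneck worth profiling.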

          apiVersion: graphcore.ai/v1alpha1\nkind: IPUJob\nmetadata:\n  generateName: bert-poprun-64ipus-\nspec:\n  jobInstances: 1\n  modelReplicasPerWorker: \"16\"\n  ipusPerJobInstance: \"64\"\n  workers:\n    template:\n      spec:\n        containers:\n        - name: bert-poprun-64ipus\n          image: graphcore/pytorch:3.3.0\n          command: [/bin/bash, -c, --]\n          args:\n            - |\n              cd ;\n              mkdir build;\n              cd build ;\n              git clone https://github.com/graphcore/examples.git;\n              cd examples/nlp/bert/pytorch;\n              apt update ;\n              apt upgrade -y;\n              DEBIAN_FRONTEND=noninteractive TZ='Europe/London' apt install $(< required_apt_packages.txt) -y ;\n              pip3 install -r requirements.txt ;\n              OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1 OMPI_ALLOW_RUN_AS_ROOT=1 \\\n              poprun \\\n              --allow-run-as-root 1 \\\n              --vv \\\n              --num-instances 1 \\\n              --num-replicas 16 \\\n               --mpi-global-args=\"--tag-output\" \\\n              --ipus-per-replica 4 \\\n              python3 run_pretraining.py \\\n              --config pretrain_large_128_POD64 \\\n              --dataset generated --training-steps 1\n          resources:\n            limits:\n              cpu: 32\n              memory: 200Gi\n          securityContext:\n            capabilities:\n              add:\n              - IPC_LOCK\n          volumeMounts:\n          - mountPath: /dev/shm\n            name: devshm\n        restartPolicy: Never\n        hostIPC: true\n        volumes:\n        - emptyDir:\n            medium: Memory\n            sizeLimit: 10Gi\n          name: devshm\n
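
          In this example the poprun flags and the IPUJob fields are consistent with each other: 16 replicas of 4 IPUs each account for the 64 IPUs requested. A short sketch of that check (the function is hypothetical and simply restates the arithmetic):

```python
def ipus_required(num_replicas, ipus_per_replica):
    '''Total IPUs a poprun launch will try to use.'''
    return num_replicas * ipus_per_replica


ipus_per_job_instance = 64   # ipusPerJobInstance in the manifest
num_replicas = 16            # --num-replicas
ipus_per_replica = 4         # --ipus-per-replica

# Fail early if the flags and the requested partition size disagree.
assert ipus_required(num_replicas, ipus_per_replica) == ipus_per_job_instance
print('poprun request matches the IPU partition size')
```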

          Inspecting the log via kubectl logs <pod-name> should produce:

          ...\n ===========================================================================================\n|                                      poprun topology                                      |\n|===========================================================================================|\n10:10:50.154 1 POPRUN [D] Done polling, final state of p-bert-poprun-64ipus-gc-dev-0: PS_ACTIVE\n10:10:50.154 1 POPRUN [D] Target options from environment: {}\n| hosts     |                                   localhost                                   |\n|-----------|-------------------------------------------------------------------------------|\n| ILDs      |                                       0                                       |\n|-----------|-------------------------------------------------------------------------------|\n| instances |                                       0                                       |\n|-----------|-------------------------------------------------------------------------------|\n| replicas  | 0  | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10 | 11 | 12 | 13 | 14 | 15 |\n -------------------------------------------------------------------------------------------\n10:10:50.154 1 POPRUN [D] Target options from V-IPU partition: {\"ipuLinkDomainSize\":\"64\",\"ipuLinkConfiguration\":\"slidingWindow\",\"ipuLinkTopology\":\"torus\",\"gatewayMode\":\"true\",\"instanceSize\":\"64\"}\n10:10:50.154 1 POPRUN [D] Using target options: {\"ipuLinkDomainSize\":\"64\",\"ipuLinkConfiguration\":\"slidingWindow\",\"ipuLinkTopology\":\"torus\",\"gatewayMode\":\"true\",\"instanceSize\":\"64\"}\n10:10:50.203 1 POPRUN [D] No hosts specified; ignoring host-subnet setting\n10:10:50.203 1 POPRUN [D] Default network/RNIC for host communication: None\n10:10:50.203 1 POPRUN [I] Running command: /opt/poplar/bin/mpirun '--tag-output' '--bind-to' 'none' '--tag-output'\n'--allow-run-as-root' '-np' '1' '-x' 'POPDIST_NUM_TOTAL_REPLICAS=16' '-x' 
'POPDIST_NUM_IPUS_PER_REPLICA=4' '-x'\n'POPDIST_NUM_LOCAL_REPLICAS=16' '-x' 'POPDIST_UNIFORM_REPLICAS_PER_INSTANCE=1' '-x' 'POPDIST_REPLICA_INDEX_OFFSET=0' '-x'\n'POPDIST_LOCAL_INSTANCE_INDEX=0' '-x' 'IPUOF_VIPU_API_HOST=10.21.21.129' '-x' 'IPUOF_VIPU_API_PORT=8090' '-x'\n'IPUOF_VIPU_API_PARTITION_ID=p-bert-poprun-64ipus-gc-dev-0' '-x' 'IPUOF_VIPU_API_TIMEOUT=120' '-x' 'IPUOF_VIPU_API_GCD_ID=0'\n'-x' 'IPUOF_LOG_LEVEL=WARN' '-x' 'PATH' '-x' 'LD_LIBRARY_PATH' '-x' 'PYTHONPATH' '-x' 'POPLAR_TARGET_OPTIONS=\n{\"ipuLinkDomainSize\":\"64\",\"ipuLinkConfiguration\":\"slidingWindow\",\"ipuLinkTopology\":\"torus\",\"gatewayMode\":\"true\",\n\"instanceSize\":\"64\"}' 'python3' 'run_pretraining.py' '--config' 'pretrain_large_128_POD64' '--dataset' 'generated' '--training-steps' '1'\n10:10:50.204 1 POPRUN [I] Waiting for mpirun (PID 4346)\n[1,0]<stderr>:    Registered metric hook: total_compiling_time with object: <function get_results_for_compile_time at 0x7fe0a6e8af70>\n[1,0]<stderr>:Using config: pretrain_large_128_POD64\n...\nGraph compilation: 100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 100/100 [10:11<00:00][1,0]<stderr>:\n[1,0]<stderr>:Compiled/Loaded model in 683.6591004971415 secs\n[1,0]<stderr>:-----------------------------------------------------------\n[1,0]<stderr>:--------------------- Training Started --------------------\nStep: 0 / 0 - LR: 0.00e+00 - total loss: 11.260 - mlm_loss: 10.397 - nsp_loss: 0.863 - mlm_acc: 0.000 % - nsp_acc: 0.052 %:   0%|          | 0/1 [00:03<?, ?itStep: 0 / 0 - LR: 0.00e+00 - total loss: 11.260 - mlm_loss: 10.397 - nsp_loss: 0.863 - mlm_acc: 0.000 % - nsp_acc: 0.052 %:   0%|          | 0/1 [00:03<?, ?itStep: 0 / 0 - LR: 0.00e+00 - total loss: 11.260 - mlm_loss: 10.397 - nsp_loss: 0.863 - mlm_acc: 0.000 % - nsp_acc: 0.052 %:   0%|          | 0/1 [00:03<?, ?it/s, throughput: 17692.1 
samples/sec][1,0]<stderr>:\n[1,0]<stderr>:-----------------------------------------------------------\n[1,0]<stderr>:-------------------- Training Metrics ---------------------\n[1,0]<stderr>:global_batch_size: 65536\n[1,0]<stderr>:device_iterations: 1\n[1,0]<stderr>:training_steps: 1\n[1,0]<stderr>:Training time: 3.718 secs\n[1,0]<stderr>:-----------------------------------------------------------\n
          "},{"location":"services/graphcore/training/L2_multiple_IPU/#notes-on-using-the-examples-respository","title":"Notes on using the examples respository","text":"

          Graphcore provides examples of a variety of models on GitHub at https://github.com/graphcore/examples. When following the instructions there, note that since we are using a container within a Kubernetes pod, there is no need to enable the Poplar/PopART SDK, set up a virtual python environment, or install the PopTorch wheel.

          "},{"location":"services/graphcore/training/L3_profiling/","title":"Profiling with PopVision","text":"

          Graphcore provides various tools for profiling, debugging, and instrumenting programs run on IPUs. In this tutorial we will briefly demonstrate an example using the PopVision Graph Analyser. For more information, see Profiling and Debugging and PopVision Graph Analyser User Guide.

          We will reuse the same PyTorch MNIST example from lesson 1 (from https://github.com/graphcore/examples/tree/master/tutorials/simple_applications/pytorch/mnist).

          To enable profiling and create IPU reports, we need to add the following line to the training script mnist_poptorch_code_only.py :

          training_opts = training_opts.enableProfiling()\n

          (For details of the API, see the API reference.)

          Save the following as a yaml file and run kubectl create -f <yaml-file> on it:

          apiVersion: graphcore.ai/v1alpha1\nkind: IPUJob\nmetadata:\n  generateName: mnist-training-profiling-\nspec:\n  jobInstances: 1\n  ipusPerJobInstance: \"1\"\n  workers:\n    template:\n      spec:\n        containers:\n        - name: mnist-training-profiling\n          image: graphcore/pytorch:3.3.0\n          command: [/bin/bash, -c, --]\n          args:\n            - |\n              cd;\n              mkdir build;\n              cd build;\n              git clone https://github.com/graphcore/examples.git;\n              cd examples/tutorials/simple_applications/pytorch/mnist;\n              python -m pip install -r requirements.txt;\n              sed -i '131i training_opts = training_opts.enableProfiling()' mnist_poptorch_code_only.py;\n              python mnist_poptorch_code_only.py --epochs 1;\n              echo 'RUNNING ls ./training';\n              ls training\n          resources:\n            limits:\n              cpu: 32\n              memory: 200Gi\n          securityContext:\n            capabilities:\n              add:\n              - IPC_LOCK\n          volumeMounts:\n          - mountPath: /dev/shm\n            name: devshm\n        restartPolicy: Never\n        hostIPC: true\n        volumes:\n        - emptyDir:\n            medium: Memory\n            sizeLimit: 10Gi\n          name: devshm\n

          After completion, using kubectl logs <pod-name>, we can see the following result

          ...\nAccuracy on test set: 96.69%\nRUNNING ls ./training\narchive.a\nprofile.pop\n

          We can see that the training has created two Poplar report files: archive.a which is an archive of the ELF executable files, one for each tile; and profile.pop, the poplar profile, which contains compile-time and execution information about the Poplar graph.

          "},{"location":"services/graphcore/training/L3_profiling/#downloading-the-profile-reports","title":"Downloading the profile reports","text":"

          To download the training profiles to your local environment, you can use kubectl cp. For example, run

          kubectl cp <pod-name>:/root/build/examples/tutorials/simple_applications/pytorch/mnist/training .\n

          Once you have downloaded the profile report files, you can view the contents locally using the PopVision Graph Analyser tool, which is available for download here https://www.graphcore.ai/developer/popvision-tools.

          From the Graph Analyser, you can analyse information including memory usage, execution trace and more.

          "},{"location":"services/graphcore/training/L4_other_frameworks/","title":"Other Frameworks","text":"

          In this tutorial we'll briefly cover running TensorFlow and PopART for Machine Learning, and writing IPU programs directly via the PopLibs library in C++. Extra links and resources will be provided for more in-depth information.

          "},{"location":"services/graphcore/training/L4_other_frameworks/#terminology","title":"Terminology","text":"

          Within Graphcore, Poplar refers to the tools (e.g. Poplar Graph Engine or Poplar Graph Compiler) and libraries (PopLibs) for programming on IPUs.

          The Poplar SDK is a package of software development tools, including

          • TensorFlow 1 and 2 for the IPU
          • PopTorch (Wrapper around PyTorch for running on IPU)
          • PopART (Poplar Advanced Run-Time, provides support for importing, creating, and running ONNX graphs on the IPU)
          • Poplar and PopLibs
          • PopDist (Poplar Distributed Configuration Library) and PopRun (Command line utility to launch distributed applications)
          • Device drivers and command line tools for managing the IPU

          For more details see here.

          "},{"location":"services/graphcore/training/L4_other_frameworks/#other-ml-frameworks-tensorflow-and-popart","title":"Other ML frameworks: Tensorflow and PopART","text":"

          Besides being able to run PyTorch code, as demonstrated in the previous lessons, the Poplar SDK also supports running machine learning applications with TensorFlow or PopART.

          "},{"location":"services/graphcore/training/L4_other_frameworks/#tensorflow","title":"Tensorflow","text":"

          The Poplar SDK includes implementations of TensorFlow and Keras for the IPU.

          For more information, refer to Targeting the IPU from TensorFlow 2 and TensorFlow 2 Quick Start.

          These are available from the image graphcore/tensorflow:2.

          For a quick example, we will run an example script from https://github.com/graphcore/examples/tree/master/tutorials/simple_applications/tensorflow2/mnist. To get started, save the following yaml and run kubectl create -f <yaml-file> to create the IPUJob:

          apiVersion: graphcore.ai/v1alpha1\nkind: IPUJob\nmetadata:\n  generateName: tensorflow-example-\nspec:\n  jobInstances: 1\n  ipusPerJobInstance: \"1\"\n  workers:\n    template:\n      spec:\n        containers:\n        - name: tensorflow-example\n          image: graphcore/tensorflow:2\n          command: [/bin/bash, -c, --]\n          args:\n            - |\n              apt update;\n              apt upgrade -y;\n              apt install git -y;\n              cd;\n              mkdir build;\n              cd build;\n              git clone https://github.com/graphcore/examples.git;\n              cd examples/tutorials/simple_applications/tensorflow2/mnist;\n              python -m pip install -r requirements.txt;\n              python mnist_code_only.py --epochs 1\n          resources:\n            limits:\n              cpu: 32\n              memory: 200Gi\n          securityContext:\n            capabilities:\n              add:\n              - IPC_LOCK\n          volumeMounts:\n          - mountPath: /dev/shm\n            name: devshm\n        restartPolicy: Never\n        hostIPC: true\n        volumes:\n        - emptyDir:\n            medium: Memory\n            sizeLimit: 10Gi\n          name: devshm\n

          Running kubectl logs <pod> should show results similar to the following:

          ...\n2023-10-25 13:21:40.263823: I tensorflow/compiler/plugin/poplar/driver/poplar_platform.cc:43] Poplar version: 3.2.0 (1513789a51) Poplar package: b82480c629\n2023-10-25 13:21:42.203515: I tensorflow/compiler/plugin/poplar/driver/poplar_executor.cc:1619] TensorFlow device /device:IPU:0 attached to 1 IPU with Poplar device ID: 0\nDownloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz\n11493376/11490434 [==============================] - 0s 0us/step\n11501568/11490434 [==============================] - 0s 0us/step\n2023-10-25 13:21:43.789573: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)\n2023-10-25 13:21:44.164207: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:210] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.\n2023-10-25 13:21:57.935339: I tensorflow/compiler/jit/xla_compilation_cache.cc:376] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.\nEpoch 1/4\n2000/2000 [==============================] - 17s 8ms/step - loss: 0.6188\nEpoch 2/4\n2000/2000 [==============================] - 1s 427us/step - loss: 0.3330\nEpoch 3/4\n2000/2000 [==============================] - 1s 371us/step - loss: 0.2857\nEpoch 4/4\n2000/2000 [==============================] - 1s 439us/step - loss: 0.2568\n
          "},{"location":"services/graphcore/training/L4_other_frameworks/#popart","title":"PopART","text":"

          The Poplar Advanced Run Time (PopART) enables importing and constructing ONNX graphs, and running graphs in inference, evaluation or training modes. PopART provides both a C++ and Python API.

          For more information, see the PopART User Guide.

          PopART is available from the image graphcore/popart.

          For a quick example, we will run an example script from https://github.com/graphcore/tutorials/tree/sdk-release-3.1/simple_applications/popart/mnist. To get started, save the following yaml and run kubectl create -f <yaml-file> to create the IPUJob:

          apiVersion: graphcore.ai/v1alpha1\nkind: IPUJob\nmetadata:\n  generateName: popart-example-\nspec:\n  jobInstances: 1\n  ipusPerJobInstance: \"1\"\n  workers:\n    template:\n      spec:\n        containers:\n        - name: popart-example\n          image: graphcore/popart:3.3.0\n          command: [/bin/bash, -c, --]\n          args:\n            - |\n              cd ;\n              mkdir build;\n              cd build ;\n              git clone https://github.com/graphcore/tutorials.git;\n              cd tutorials;\n              git checkout sdk-release-3.1;\n              cd simple_applications/popart/mnist;\n              python3 -m pip install -r requirements.txt;\n              ./get_data.sh;\n              python3 popart_mnist.py --epochs 1\n          resources:\n            limits:\n              cpu: 32\n              memory: 200Gi\n          securityContext:\n            capabilities:\n              add:\n              - IPC_LOCK\n          volumeMounts:\n          - mountPath: /dev/shm\n            name: devshm\n        restartPolicy: Never\n        hostIPC: true\n        volumes:\n        - emptyDir:\n            medium: Memory\n            sizeLimit: 10Gi\n          name: devshm\n

          Running kubectl logs <pod> should show results similar to the following:

          ...\nCreating ONNX model.\nCompiling the training graph.\nCompiling the validation graph.\nRunning training loop.\nEpoch #1\n   Loss=16.2605\n   Accuracy=88.88%\n
          "},{"location":"services/graphcore/training/L4_other_frameworks/#writing-ipu-programs-directly-with-poplibs","title":"Writing IPU programs directly with PopLibs","text":"

          The Poplar libraries are a set of C++ libraries consisting of the Poplar graph library and the open-source PopLibs libraries.

          The Poplar graph library provides direct access to the IPU by code written in C++. You can write complete programs using Poplar, or use it to write functions to be called from your application written in a higher-level framework such as TensorFlow.

          The PopLibs libraries are a set of application libraries that implement operations commonly required by machine learning applications, such as linear algebra operations, element-wise tensor operations, non-linearities and reductions. These provide a fast and easy way to create programs that run efficiently using the parallelism of the IPU.

          For more information, see Poplar Quick Start and Poplar and PopLibs User Guide.

          These are available from the image graphcore/poplar.

          When using the PopLibs libraries, you need to include the relevant header files from the include/popops directory, e.g.

          #include <popops/ElementWise.hpp>\n

          and link against the relevant PopLibs libraries, in addition to the Poplar library, e.g.

          g++ -std=c++11 my-program.cpp -lpoplar -lpopops\n

          For a quick example, we will run an example from https://github.com/graphcore/examples/tree/master/tutorials/simple_applications/poplar/mnist. To get started, save the following yaml and run kubectl create -f <yaml-file> to create the IPUJob:

          apiVersion: graphcore.ai/v1alpha1\nkind: IPUJob\nmetadata:\n  generateName: poplib-example-\nspec:\n  jobInstances: 1\n  ipusPerJobInstance: \"1\"\n  workers:\n    template:\n      spec:\n        containers:\n        - name: poplib-example\n          image: graphcore/poplar:3.3.0\n          command: [\"bash\"]\n          args: [\"-c\", \"cd && mkdir build && cd build && git clone https://github.com/graphcore/examples.git && cd examples/tutorials/simple_applications/poplar/mnist/ && ./get_data.sh && make &&  ./regression-demo -IPU 1 50\"]\n          resources:\n            limits:\n              cpu: 32\n              memory: 200Gi\n          securityContext:\n            capabilities:\n              add:\n              - IPC_LOCK\n          volumeMounts:\n          - mountPath: /dev/shm\n            name: devshm\n        restartPolicy: Never\n        hostIPC: true\n        volumes:\n        - emptyDir:\n            medium: Memory\n            sizeLimit: 10Gi\n          name: devshm\n

          Running kubectl logs <pod> should show results similar to the following:

          ...\nUsing the IPU\nTrying to attach to IPU\nAttached to IPU 0\nTarget:\n  Number of IPUs:         1\n  Tiles per IPU:          1,472\n  Total Tiles:            1,472\n  Memory Per-Tile:        624.0 kB\n  Total Memory:           897.0 MB\n  Clock Speed (approx):   1,850.0 MHz\n  Number of Replicas:     1\n  IPUs per Replica:       1\n  Tiles per Replica:      1,472\n  Memory per Replica:     897.0 MB\n\nGraph:\n  Number of vertices:            5,466\n  Number of edges:              16,256\n  Number of variables:          41,059\n  Number of compute sets:           20\n\n...\n\nEpoch 1 (99%), accuracy 76%\n
          "},{"location":"services/jhub/","title":"EIDF Jupyterhub","text":"

          QuickStart

          Tutorial

          Documentation

          "},{"location":"services/jhub/docs/","title":"Service Documentation","text":""},{"location":"services/jhub/docs/#online-support","title":"Online support","text":""},{"location":"services/jhub/quickstart/","title":"Quickstart","text":""},{"location":"services/jhub/quickstart/#accessing","title":"Accessing","text":""},{"location":"services/jhub/quickstart/#first-task","title":"First Task","text":""},{"location":"services/jhub/quickstart/#further-information","title":"Further information","text":""},{"location":"services/jhub/tutorial/","title":"Tutorial","text":""},{"location":"services/jhub/tutorial/#first-notebook","title":"First notebook","text":""},{"location":"services/mft/","title":"MFT","text":""},{"location":"services/mft/quickstart/","title":"Managed File Transfer","text":""},{"location":"services/mft/quickstart/#getting-to-the-mft","title":"Getting to the MFT","text":"

          The EIDF MFT can be accessed at https://eidf-mft.epcc.ed.ac.uk

          "},{"location":"services/mft/quickstart/#how-it-works","title":"How it works","text":"

          The MFT provides a 'drop' zone for the project. All users in a given project have access to the same shared transfer area, where they can upload, download, and delete files. This area is linked to a directory within the project's space on the shared backend storage.

          Files which are uploaded are owned by the Linux user 'nobody' and the group ID of whatever project the file is being uploaded to. They have the permissions: Owner = rw Group = r Others = r

          Once the file is opened on the VM, the user that opened it will become the owner and they can make further changes.
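The permissions listed above (Owner = rw, Group = r, Others = r) correspond to the familiar octal mode 644. The following is a minimal sketch using Python's standard stat module to show that mapping; the names MFT_UPLOAD_MODE and describe_mode are illustrative and not part of any EIDF tooling.

```python
import stat

# Owner = rw, Group = r, Others = r, as described for files uploaded via the MFT.
MFT_UPLOAD_MODE = stat.S_IRUSR | stat.S_IWUSR | stat.S_IRGRP | stat.S_IROTH


def describe_mode(mode: int) -> str:
    # Return the ls-style permission string for a regular file with this mode.
    return stat.filemode(stat.S_IFREG | mode)


print(oct(MFT_UPLOAD_MODE))            # octal form, as used with chmod
print(describe_mode(MFT_UPLOAD_MODE))  # the string shown by ls -l
```

On the VM, ls -l on a freshly uploaded file should therefore show the same permission string, with the owner reported as nobody until the file is opened.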

          "},{"location":"services/mft/quickstart/#gaining-access-to-the-mft","title":"Gaining access to the MFT","text":"

          By default a project does not have access to the MFT; access has to be enabled. Currently this is done by the PI sending a request to the EIDF Helpdesk. Once the project is enabled within the MFT, every user in the project will be able to log into the MFT using their usual EIDF credentials.

          "},{"location":"services/mft/sftp/","title":"SFTP","text":"

          Coming Soon

          "},{"location":"services/mft/using-the-mft/","title":"Using the MFT Web Portal","text":""},{"location":"services/mft/using-the-mft/#logging-in","title":"Logging in","text":"

          When you reach the MFT home page you can log in using your usual VM project credentials.

          You will then be asked what type of session you would like to start. Select New Web Client or Web Client and continue.

          "},{"location":"services/mft/using-the-mft/#file-ingress","title":"File Ingress","text":"

          Once logged in, all files currently in the project's transfer directory will be displayed. Click the 'Upload' button under the 'Home' title to open the dialogue for file upload. You can then drag and drop files in, or click 'Browse' to find them locally.

          Once uploaded, the file will be immediately accessible from the project area, and can be used within any EIDF service which has the filesystem mounted.

          "},{"location":"services/mft/using-the-mft/#file-egress","title":"File Egress","text":"

          File egress works in the reverse way: place the file in the project transfer directory and it will become available in the MFT portal.

          "},{"location":"services/mft/using-the-mft/#file-management","title":"File Management","text":"

          Directories can be created within the project transfer directory, for example 'Import' and 'Export', to allow for better file management. Deleting a file from either the MFT portal or the VM removes it from the other, as both locations point at the same file. The file is stored in only one place, so a change or deletion made from either location applies to that single copy.

          "},{"location":"services/rstudioserver/","title":"EIDF R Studio Server","text":"

          QuickStart

          Tutorial

          Documentation

          "},{"location":"services/rstudioserver/docs/","title":"Service Documentation","text":""},{"location":"services/rstudioserver/docs/#online-support","title":"Online support","text":""},{"location":"services/rstudioserver/quickstart/","title":"Quickstart","text":""},{"location":"services/rstudioserver/quickstart/#accessing","title":"Accessing","text":""},{"location":"services/rstudioserver/quickstart/#first-task","title":"First Task","text":""},{"location":"services/rstudioserver/quickstart/#creating-a-new-r-script","title":"Creating a New R Script","text":"

          Your RStudio Server session has been initialised now. If you are participating in a workshop, then all the packages and data required for the workshop have been loaded into the workspace. All that remains is to create a new R script to contain your code!

          1. In the RStudio Server UI, open the File menu item at the far left of the main menu bar at the top of the page
          2. Hover over the \u2018New File\u2019 sub-menu item, then select \u2018R Script\u2019 from the expanded menu
          3. A new window pane will appear in the UI as shown below, and you are now ready to start adding the R code to your script! RStudio Server UI screen with new script
          "},{"location":"services/rstudioserver/quickstart/#further-information","title":"Further information","text":""},{"location":"services/rstudioserver/tutorial/","title":"Tutorial","text":""},{"location":"services/rstudioserver/tutorial/#first-notebook","title":"First notebook","text":""},{"location":"services/ultra2/","title":"Ultra2 Large Memory System","text":"

          Get Access

          Running codes

          "},{"location":"services/ultra2/access/","title":"Ultra2 Large Memory System","text":""},{"location":"services/ultra2/access/#getting-access","title":"Getting Access","text":"

          Access to the Ultra2 system (also referred to as the SDF-CS1 system) is currently by arrangement with EPCC. Please email eidf@epcc.ed.ac.uk with a short description of the work you would like to perform.

          "},{"location":"services/ultra2/run/","title":"Ultra2 High Memory System","text":""},{"location":"services/ultra2/run/#introduction","title":"Introduction","text":"

          The Ultra2 system (also called the SDF-CS1 system) is a single logical CPU system based at EPCC. It is suitable for running jobs which require large volumes of non-distributed memory (as opposed to a cluster).

          "},{"location":"services/ultra2/run/#specifications","title":"Specifications","text":"

          The system is an HPE SuperDome Flex containing 576 individual cores in an SMT-1 arrangement (1 thread per core). The system has 18TB of memory available to users. Home directories are network mounted from the EIDF e1000 Lustre filesystem, although some local NVMe storage is available for temporary file storage during runs.

          "},{"location":"services/ultra2/run/#login","title":"Login","text":"

          Login is by SSH only, using ssh <username>@sdf-cs1.epcc.ed.ac.uk. See below for details on the credentials required to access the system.

          "},{"location":"services/ultra2/run/#access-credentials","title":"Access credentials","text":"

          To access Ultra2, you need to use two credentials: your SSH key pair protected by a passphrase and a Time-based one-time password (TOTP).

          "},{"location":"services/ultra2/run/#ssh-key-pairs","title":"SSH Key Pairs","text":"

          You will need to generate an SSH key pair protected by a passphrase to access Ultra2.

          Using a terminal (the command line), set up a key pair that contains your e-mail address and enter a passphrase you will use to unlock the key:

              $ ssh-keygen -t rsa -C \"your@email.com\"\n    ...\n    -bash-4.1$ ssh-keygen -t rsa -C \"your@email.com\"\n    Generating public/private rsa key pair.\n    Enter file in which to save the key (/Home/user/.ssh/id_rsa): [Enter]\n    Enter passphrase (empty for no passphrase): [Passphrase]\n    Enter same passphrase again: [Passphrase]\n    Your identification has been saved in /Home/user/.ssh/id_rsa.\n    Your public key has been saved in /Home/user/.ssh/id_rsa.pub.\n    The key fingerprint is:\n    03:d4:c4:6d:58:0a:e2:4a:f8:73:9a:e8:e3:07:16:c8 your@email.com\n    The key's randomart image is:\n    +--[ RSA 2048]----+\n    |    . ...+o++++. |\n    | . . . =o..      |\n    |+ . . .......o o |\n    |oE .   .         |\n    |o =     .   S    |\n    |.    +.+     .   |\n    |.  oo            |\n    |.  .             |\n    | ..              |\n    +-----------------+\n

          (remember to replace \"your@email.com\" with your e-mail address).

          "},{"location":"services/ultra2/run/#upload-public-part-of-key-pair-to-safe","title":"Upload public part of key pair to SAFE","text":"

          You should now upload the public part of your SSH key pair to the SAFE by following the instructions at:

          Login to SAFE. Then:

          1. Go to the Menu Login accounts and select the Ultra2 account you want to add the SSH key to
          2. On the subsequent Login account details page click the Add Credential button
          3. Select SSH public key as the Credential Type and click Next
          4. Either copy and paste the public part of your SSH key into the SSH Public key box or use the button to select the public key file on your computer.
          5. Click Add to associate the public SSH key part with your account

          Once you have done this, your SSH key will be added to your Ultra2 account.

          "},{"location":"services/ultra2/run/#time-based-one-time-password-totp","title":"Time-based one-time password (TOTP)","text":"

          Remember, you will need to use both an SSH key and Time-based one-time password to log into Ultra2 so you will also need to set up your TOTP before you can log into Ultra2.

          First Login

          When you first log into Ultra2, you will be prompted to change your initial password. This is a three step process:

          1. When prompted to enter your password: enter the password which you retrieve from SAFE
          2. When prompted to enter your new password: type in a new password
          3. When prompted to re-enter the new password: re-enter the new password

          Your password has now been changed.

          You will not use your password when logging on to Ultra2 after the initial logon.

          "},{"location":"services/ultra2/run/#ssh-login","title":"SSH Login","text":"

          To log in to the host system, you will need the SSH key and TOTP token you registered when creating the account in SAFE. For example, with the appropriate key loaded, ssh <username>@sdf-cs1.epcc.ed.ac.uk will prompt you, roughly once per day, for your TOTP code.

          "},{"location":"services/ultra2/run/#software","title":"Software","text":"

          The primary software provided is Intel's oneAPI suite, containing MPI compilers and runtimes, debuggers and the VTune performance analyser. Standard GNU compilers are also available. The oneAPI suite can be loaded by sourcing the shell script:

          source  /opt/intel/oneapi/setvars.sh\n
          "},{"location":"services/ultra2/run/#running-jobs","title":"Running Jobs","text":"

          All jobs must be run via SLURM to avoid inconveniencing other users of the system. Users should not run jobs directly. Note that the system has one logical processor with a large number of threads and thus appears to SLURM as a single node. This is intentional.

          "},{"location":"services/ultra2/run/#queue-limits","title":"Queue limits","text":"

          We kindly request that users limit their maximum total running job size to 288 cores and 4TB of memory, whether that is used by a single job or divided across a number of jobs. This may be enforced via SLURM in the future.
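As a rough illustration of how the requested limit applies to the sum of all running jobs rather than to each job individually, the sketch below totals cores and memory over a list of jobs. The function name and the (cores, memory in TB) tuple layout are assumptions made for this example; they are not part of SLURM or of any EIDF tool.

```python
from typing import Iterable, Tuple

MAX_CORES = 288        # requested limit on total running cores per user
MAX_MEMORY_TB = 4.0    # requested limit on total running memory per user


def within_requested_limits(jobs: Iterable[Tuple[int, float]]) -> bool:
    # Each job is (cores, memory_tb); the limits apply to the totals over all running jobs.
    jobs = list(jobs)
    total_cores = sum(cores for cores, _ in jobs)
    total_memory = sum(memory for _, memory in jobs)
    return total_cores <= MAX_CORES and total_memory <= MAX_MEMORY_TB


# Two jobs that together sit exactly at both limits.
print(within_requested_limits([(144, 2.0), (144, 2.0)]))
# A second small job pushes the core total over the limit.
print(within_requested_limits([(288, 1.0), (16, 0.5)]))
```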

          "},{"location":"services/ultra2/run/#mpi-jobs","title":"MPI jobs","text":"

          An example script to run a multi-process MPI \"Hello world\" example is shown.

          #!/usr/bin/env bash\n#SBATCH -J HelloWorld\n#SBATCH --nodes=1\n#SBATCH --tasks-per-node=4\n# Set explicitly: SLURM_CPUS_PER_TASK is only defined when --cpus-per-task is given\n#SBATCH --cpus-per-task=1\n#SBATCH --nodelist=sdf-cs1\n#SBATCH --partition=standard\n##SBATCH --exclusive\n\n\necho \"Running on host ${HOSTNAME}\"\necho \"Using ${SLURM_NTASKS_PER_NODE} tasks per node\"\necho \"Using ${SLURM_CPUS_PER_TASK} cpus per task\"\n# Total number of MPI processes to launch\nlet mpi_threads=${SLURM_NTASKS_PER_NODE}*${SLURM_CPUS_PER_TASK}\necho \"Using ${mpi_threads} MPI threads\"\n\n# Source oneAPI to ensure mpirun available\nif [[ -z \"${SETVARS_COMPLETED}\" ]]; then\nsource /opt/intel/oneapi/setvars.sh\nfi\n\n# mpirun invocation for Intel suite.\nmpirun -n ${mpi_threads} ./helloworld.exe\n
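The process count passed to mpirun in the script above is the product of two SLURM environment variables. The sketch below reproduces that arithmetic in Python as an illustration only; mpi_rank_count is a hypothetical helper, and the default of 1 reflects the fact that SLURM sets SLURM_CPUS_PER_TASK only when --cpus-per-task is requested.

```python
import os


def mpi_rank_count(env=os.environ) -> int:
    # SLURM_NTASKS_PER_NODE is set when --tasks-per-node (or --ntasks-per-node) is given.
    tasks = int(env['SLURM_NTASKS_PER_NODE'])
    # SLURM_CPUS_PER_TASK is absent unless --cpus-per-task was requested, so default to 1.
    cpus = int(env.get('SLURM_CPUS_PER_TASK', '1'))
    return tasks * cpus


# Simulated environments rather than a real SLURM allocation.
print(mpi_rank_count({'SLURM_NTASKS_PER_NODE': '4'}))
print(mpi_rank_count({'SLURM_NTASKS_PER_NODE': '4', 'SLURM_CPUS_PER_TASK': '2'}))
```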
          "},{"location":"services/virtualmachines/","title":"Overview","text":"

          The EIDF Virtual Machine (VM) Service is the underlying infrastructure upon which the EIDF Data Science Cloud (DSC) is built.

          The service currently has a mixture of hardware node types which host VMs of various flavours:

          • The mcomp nodes which host general flavour VMs are based upon AMD EPYC 7702 CPUs (128 Cores) with 1TB of DRAM
          • The hcomp nodes which host capability flavour VMs are based upon 4x Intel Xeon Platinum 8280L CPUs (224 Threads, 112 cores with HT) with 3TB of DRAM
          • The GPU nodes which host GPU flavour VMs are based upon 2x Intel Xeon Platinum 8260 CPUs (96 Cores) with 4x Nvidia Tesla V100S 32GB and 1.5TB of DRAM

          The shapes and sizes of the flavours are based on subdivisions of this hardware, noting that CPUs are 4x oversubscribed for mcomp nodes (general VM flavours).
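To make the oversubscription figure concrete, the following sketch works through the arithmetic for an mcomp node. It considers CPU only and ignores memory and any capacity reserved for the hypervisor, which is an assumption of this example; the 16 vCPU figure is the size of the general.v2.xlarge flavour.

```python
def schedulable_vcpus(physical_cores: int, oversubscription: int) -> int:
    # With N-times oversubscription, each physical core backs N vCPUs.
    return physical_cores * oversubscription


MCOMP_CORES = 128             # cores per mcomp node
MCOMP_OVERSUBSCRIPTION = 4    # general flavours are 4x oversubscribed
XLARGE_VCPUS = 16             # vCPUs in a general.v2.xlarge VM

vcpus = schedulable_vcpus(MCOMP_CORES, MCOMP_OVERSUBSCRIPTION)
print(vcpus)                  # vCPUs that can be allocated on one node
print(vcpus // XLARGE_VCPUS)  # upper bound on xlarge VMs per node, by CPU alone
```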

          "},{"location":"services/virtualmachines/#service-access","title":"Service Access","text":"

          Users should have an EIDF account - EIDF Accounts.

          Project Leads will be able to have access to the DSC added to their project during the project application process or through a request to the EIDF helpdesk.

          "},{"location":"services/virtualmachines/#additional-service-policy-information","title":"Additional Service Policy Information","text":"

          Additional information on service policies can be found here.

          "},{"location":"services/virtualmachines/docs/","title":"Service Documentation","text":""},{"location":"services/virtualmachines/docs/#project-management-guide","title":"Project Management Guide","text":""},{"location":"services/virtualmachines/docs/#required-member-permissions","title":"Required Member Permissions","text":"

          VMs and user accounts can only be managed by project members with Cloud Admin permissions. This includes the principal investigator (PI) of the project and all project managers (PM). Through SAFE the PI can designate project managers and the PI and PMs can grant a project member the Cloud Admin role:

          1. Click \"Manage Project in SAFE\" at the bottom of the project page (opens a new tab)
          2. On the project management page in SAFE, scroll down to \"Manage Members\"
          3. Click Add project manager or Set member permissions

          For details please refer to the SAFE documentation: How can I designate a user as a project manager?

          "},{"location":"services/virtualmachines/docs/#create-a-vm","title":"Create a VM","text":"

          To create a new VM:

          1. Select the project from the list of your projects, e.g. eidfxxx
          2. Click on the 'New Machine' button
          3. Complete the 'Create Machine' form as follows:

            1. Provide an appropriate name, e.g. dev-01. The project code will be prepended automatically to your VM name, in this case your VM would be named eidfxxx-dev-01.
            2. Select a suitable operating system
            3. Select a machine specification that is suitable
            4. Choose the required disk size (in GB) or leave blank for the default
            5. Tick the checkbox \"Configure RDP access\" if you would like to install RDP and configure VDI connections via RDP for your VM.
            6. Select the package installations from the software catalogue drop-down list, or \"None\" if you don't require any pre-installed packages
          4. Click on 'Create'

          5. You should see the new VM listed under the 'Machines' table on the project page and the status as 'Creating'
          6. Wait while the job to launch the VM completes. This may take up to 10 minutes, depending on the configuration you requested. You have to reload the page to see updates.
          7. Once the job has completed successfully the status shows as 'Active' in the list of machines.

          You may wish to ensure that the machine size selected (number of CPUs and RAM) does not exceed your remaining quota before you press Create, otherwise the request will fail.

          In the list of 'Machines' in the project page in the portal, click on the name of the new VM to see the configuration and properties, including the machine specification, its 10.24.*.* IP address and any configured VDI connections.

          "},{"location":"services/virtualmachines/docs/#quota-and-usage","title":"Quota and Usage","text":"

          Each project has a quota for the number of instances, total number of vCPUs, total RAM and storage. You will not be able to create a VM if it exceeds the quota.

          You can view and refresh the project usage compared to the quota in a table near the bottom of the project page. This table will be updated automatically when VMs are created or removed, and you can refresh it manually by pressing the \"Refresh\" button at the top of the table.

          Please contact the helpdesk if your quota requirements have changed.
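The quota check described above can be pictured as a comparison on each of the four resources. The sketch below is illustrative only: the Resources class, the fits_quota function and all quota, usage and disk-size numbers are hypothetical, and the portal's actual check may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    instances: int
    vcpus: int
    ram_gb: int
    storage_gb: int


def fits_quota(quota: Resources, used: Resources, request: Resources) -> bool:
    # A request fits only if every resource stays within the project quota.
    return (
        used.instances + request.instances <= quota.instances
        and used.vcpus + request.vcpus <= quota.vcpus
        and used.ram_gb + request.ram_gb <= quota.ram_gb
        and used.storage_gb + request.storage_gb <= quota.storage_gb
    )


# Hypothetical project quota and current usage.
quota = Resources(instances=4, vcpus=32, ram_gb=64, storage_gb=1000)
used = Resources(instances=3, vcpus=24, ram_gb=48, storage_gb=600)

# Requests sized like general.v2.large and general.v2.xlarge, with an assumed 100 GB disk.
large = Resources(instances=1, vcpus=8, ram_gb=16, storage_gb=100)
xlarge = Resources(instances=1, vcpus=16, ram_gb=32, storage_gb=100)

print(fits_quota(quota, used, large))
print(fits_quota(quota, used, xlarge))
```

In this made-up case the large request exactly uses up the remaining vCPUs and RAM, while the xlarge request would exceed both and so would be rejected.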

          "},{"location":"services/virtualmachines/docs/#add-a-user-account","title":"Add a user account","text":"

          User accounts allow project members to log in to the VMs in a project. The Project PI and project managers manage user accounts for each member of the project. Users usually use one account (username and password) to log in to all the VMs in the same project that they can access; however, a user may have multiple accounts in a project, for example for different roles.

          1. From the project page in the portal click on the 'Create account' button under the 'Project Accounts' table at the bottom
          2. Complete the 'Create User Account' form as follows:

            1. Choose 'Account user name': this could be something sensible like the first and last names concatenated (or initials) together with the project name. The username is unique across all EPCC systems so the user will not be able to reuse this name in another project once it has been assigned.
            2. Select the project member from the 'Account owner' drop-down field
            3. Click 'Create'

          The user can now set the password for their new account on the account details page.

          "},{"location":"services/virtualmachines/docs/#adding-access-to-the-vm-for-a-user","title":"Adding Access to the VM for a User","text":"

          User accounts can be granted or denied access to existing VMs.

          1. Click 'Manage' next to an existing user account in the 'Project Accounts' table on the project page, or click on the account name and then 'Manage' on the account details page
          2. Select the checkboxes in the column \"Access\" for the VMs to which this account should have access or uncheck the ones without access
          3. Click the 'Update' button
          4. After a few minutes, the job to give them access to the selected VMs will complete and the account status will show as \"Active\".

          If a user is logged in already to the VDI at https://eidf-vdi.epcc.ed.ac.uk/vdi newly added connections may not appear in their connections list immediately. They must log out and log in again to refresh the connection information, or wait until the login token expires and is refreshed automatically - this might take a while.

          If a user only has one connection available in the VDI they will be automatically directed to the VM with the default connection.

          "},{"location":"services/virtualmachines/docs/#sudo-permissions","title":"Sudo permissions","text":"

          A project manager or PI may also grant sudo permissions to users on selected VMs. Management of sudo permissions must be requested in the project application - if it was not requested or the request was denied the functionality described below is not available.

          1. Click 'Manage' next to an existing user account in the 'Project Accounts' table on the project page
          2. Select the checkboxes in the column \"Sudo\" for the VMs on which this account is granted sudo permissions or uncheck to remove permissions
          3. Make sure \"Access\" is also selected for the sudo VMs to allow login
          4. Click the 'Update' button

          After a few minutes, the job to give the user account sudo permissions on the selected VMs will complete. On the account detail page a \"sudo\" badge will appear next to the selected VMs.

          Please contact the helpdesk if sudo permission management is required but is not available in your project.

          "},{"location":"services/virtualmachines/docs/#first-login","title":"First login","text":"

          A new user account must reset the password before they can log in for the first time.

          The user can reset the password in their account details page.

          "},{"location":"services/virtualmachines/docs/#updating-an-existing-machine","title":"Updating an existing machine","text":""},{"location":"services/virtualmachines/docs/#adding-rdp-access","title":"Adding RDP Access","text":"

          If you did not select RDP access when you created the VM you can add it later:

          1. Open the VM details page by selecting the name on the project page
          2. Click on 'Configure RDP'
          3. The configuration job runs for a few minutes.

          Once the RDP job is completed, all users that are allowed to access the VM will also be permitted to use the RDP connection.

          "},{"location":"services/virtualmachines/docs/#software-catalogue","title":"Software catalogue","text":"

          You can install packages from the software catalogue at a later time, even if you didn't select a package when first creating the machine.

          1. Open the VM details page by selecting the name on the project page
          2. Click on 'Software Catalogue'
          3. Select the configuration you wish to install and press 'Submit'
          4. The configuration job runs for a few minutes.
          "},{"location":"services/virtualmachines/flavours/","title":"Flavours","text":"

          These are the current Virtual Machine (VM) flavours (configurations) available on the Virtual Desktop cloud service. Note that all VMs are built and configured using the EIDF Portal by PIs/Cloud Admins of projects, except GPU flavours which must be requested via the helpdesk or the support request form.

          | Flavour Name | vCPUs | DRAM in GB | Pinned Cores | GPU |
          | --- | --- | --- | --- | --- |
          | general.v2.tiny | 1 | 2 | No | No |
          | general.v2.small | 2 | 4 | No | No |
          | general.v2.medium | 4 | 8 | No | No |
          | general.v2.large | 8 | 16 | No | No |
          | general.v2.xlarge | 16 | 32 | No | No |
          | capability.v2.8cpu | 8 | 112 | Yes | No |
          | capability.v2.16cpu | 16 | 224 | Yes | No |
          | capability.v2.32cpu | 32 | 448 | Yes | No |
          | capability.v2.48cpu | 48 | 672 | Yes | No |
          | capability.v2.64cpu | 64 | 896 | Yes | No |
          | gpu.v1.8cpu | 8 | 128 | Yes | Yes |
          | gpu.v1.16cpu | 16 | 256 | Yes | Yes |
          | gpu.v1.32cpu | 32 | 512 | Yes | Yes |
          | gpu.v1.48cpu | 48 | 768 | Yes | Yes |
          "},{"location":"services/virtualmachines/policies/","title":"EIDF Data Science Cloud Policies","text":""},{"location":"services/virtualmachines/policies/#end-of-life-policy-for-user-accounts-and-projects","title":"End of Life Policy for User Accounts and Projects","text":""},{"location":"services/virtualmachines/policies/#what-happens-when-an-account-or-project-is-no-longer-required-or-a-user-leaves-a-project","title":"What happens when an account or project is no longer required, or a user leaves a project","text":"

          These situations are most likely to come about during one of the following scenarios:

          1. The retirement of a project (usually one month after project end)
          2. A Principal Investigator (PI) tidying up a project requesting the removal of user(s) no longer working on the project
          3. A user wishing their own account to be removed
          4. A failure by a user to respond to the annual request to verify their email address held in the SAFE

          For each user account involved, assuming the relevant consent is given, the next step can be summarised as one of the following actions:

          • Removal of the EIDF account
          • The re-owning of the EIDF account within an EIDF project (typically to PI)
          • In addition, the corresponding SAFE account may be retired under scenario 4

          It will be possible to have the account re-activated up until resources are removed (as outlined above); after this time it will be necessary to re-apply.

          A user's right to use EIDF is granted by a project. Our policy is to treat the account and associated data as the property of the PI as the owner of the project and its resources. It is the user's responsibility to ensure that any data they store on the EIDF DSC is handled appropriately and to copy off anything that they wish to keep to an appropriate location.

          A project manager or the PI can revoke a user's access accounts within their project at any time, by locking, removing or re-owning the account as appropriate.

          A user may give up access to an account and return it to the control of the project at any time.

          When a project is due to end, the PI will receive notification of the closure of the project and its accounts one month before all project accounts and DSC resources (VMs, data volumes) are closed and cleaned or removed.

          "},{"location":"services/virtualmachines/policies/#backup-policies","title":"Backup policies","text":"

          The current policy is:

          • The content of VM disk images is not backed up
          • The VM disk images are not backed up

          We strongly advise that you keep copies of any critical data on an alternative system that is fully backed up.

          "},{"location":"services/virtualmachines/policies/#patching-of-user-vms","title":"Patching of User VMs","text":"

          The EIDF team updates and patches the hypervisors and the cloud management software as part of the EIDF Maintenance sessions. It is the responsibility of project PIs to keep the VMs in their projects up to date. VMs running the Ubuntu operating system automatically install security patches and alert users at log-on (via SSH) to reboot as necessary for the changes to take effect. The log-on message also encourages users to update packages.

          "},{"location":"services/virtualmachines/quickstart/","title":"Quickstart","text":"

          Projects using the Virtual Desktop cloud service are accessed via the EIDF Portal.

          Authentication is provided by SAFE, so if you do not have an active web browser session in SAFE, you will be redirected to the SAFE log on page. If you do not have a SAFE account, follow the instructions in the SAFE documentation on how to register and receive your password.

          "},{"location":"services/virtualmachines/quickstart/#accessing-your-projects","title":"Accessing your projects","text":"
          1. Log into the portal at https://portal.eidf.ac.uk/. The login will redirect you to the SAFE.

          2. View the projects that you have access to at https://portal.eidf.ac.uk/project/

          "},{"location":"services/virtualmachines/quickstart/#joining-a-project","title":"Joining a project","text":"
          1. Navigate to https://portal.eidf.ac.uk/project/ and click the link to \"Request access\", or choose \"Request Access\" in the \"Project\" menu.

          2. Select the project that you want to join in the \"Project\" dropdown list - you can search for the project name or the project code, e.g. \"eidf0123\".

          Now you have to wait for your PI or project manager to accept your request to join.

          "},{"location":"services/virtualmachines/quickstart/#accessing-a-vm","title":"Accessing a VM","text":"
          1. Select a project and view your user accounts on the project page.

          2. Click on an account name to view details of the VMs that you are allowed to access with this account, and to change the password for this account.

          3. Before you log in for the first time with a new user account, you must change your password as described below.

          4. Follow the link to the Guacamole login or log in directly at https://eidf-vdi.epcc.ed.ac.uk/vdi/. Please see the VDI guide for more information.

          5. You can also log in via the EIDF Gateway Jump Host if this is available in your project.

          Warning

          You must set a password for a new account before you log in for the first time.

          "},{"location":"services/virtualmachines/quickstart/#set-or-change-the-password-for-a-user-account","title":"Set or change the password for a user account","text":"

          Follow these instructions to set a password for a new account before you log in for the first time. If you have forgotten your password you may reset the password as described here.

          1. Select a project and click the account name in the project page to view the account details.

          2. In the user account detail page, press the button \"Set Password\" and follow the instructions in the form.

          There may be a short delay while the change is implemented before the new password becomes usable.

          "},{"location":"services/virtualmachines/quickstart/#further-information","title":"Further information","text":"

          Managing VMs: Project management guide to creating, configuring and removing VMs and managing user accounts in the portal.

          Virtual Desktop Interface: Working with the VDI interface.

          EIDF Gateway: SSH access to VMs via the EIDF SSH Gateway jump host.

          "},{"location":"status/","title":"EIDF Service Status","text":"

          The table below represents the broad status of each EIDF service.

          Service Status

          • EIDF Portal
          • VM SSH Gateway
          • VM VDI Gateway
          • Virtual Desktops
          • Cerebras CS-2
          • SuperDome Flex (SDF-CS1 / Ultra2)

          "},{"location":"status/#maintenance-sessions","title":"Maintenance Sessions","text":"

          There will be a service outage on the 3rd Thursday of every month from 9am to 5pm. We keep maintenance downtime to a minimum on the service but do occasionally need to perform essential work on the system. Maintenance sessions are used to ensure that:

          • software versions are kept up to date;
          • firmware levels on the underlying hardware are kept up to date;
          • essential security patches are applied;
          • failed/suspect hardware can be replaced;
          • new software can be installed;
          • periodic essential maintenance on electrical and mechanical support equipment (cooling systems and power distribution units) can be undertaken safely.

          The service will be restored ahead of 5pm if all the work is completed early.
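
The maintenance schedule above (the 3rd Thursday of every month) can be computed ahead of time when planning long-running work. The following is a minimal sketch using only the Python standard library; the function name `third_thursday` is hypothetical and is not part of any EIDF tooling.

```python
import calendar
from datetime import date


def third_thursday(year: int, month: int) -> date:
    """Return the date of the 3rd Thursday of the given month."""
    # monthcalendar() returns one list per week, Monday first, with 0 for
    # days that fall outside the month, so index calendar.THURSDAY picks
    # out the Thursday of each week.
    thursdays = [
        week[calendar.THURSDAY]
        for week in calendar.monthcalendar(year, month)
        if week[calendar.THURSDAY] != 0
    ]
    return date(year, month, thursdays[2])


print(third_thursday(2024, 4))  # prints 2024-04-18
```

Always confirm the actual session dates against the service status page, since the schedule may change.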

          "}]} \ No newline at end of file diff --git a/services/gpuservice/faq/index.html b/services/gpuservice/faq/index.html index 4ad388fca..23a686d50 100644 --- a/services/gpuservice/faq/index.html +++ b/services/gpuservice/faq/index.html @@ -1278,6 +1278,15 @@ + + +
        16. + + + Access to GPU Service resources in default namespace is 'Forbidden' + + +
        17. @@ -2216,6 +2225,15 @@ +
        18. + +
        19. + + + Access to GPU Service resources in default namespace is 'Forbidden' + + +
        20. @@ -2301,6 +2319,10 @@

          How do I access the GPU Service?

          The default access route to the GPU Service is via an EIDF DSC VM. The DSC VM will have access to all EIDF resources for your project and can be accessed through the VDI (SSH or if enabled RDP) or via the EIDF SSH Gateway.

          How do I obtain my project kubeconfig file?

          Project Leads and Managers can access the kubeconfig file from the Project page in the Portal. Project Leads and Managers can provide the file on any of the project VMs or give it to individuals within the project.

          +

          Access to GPU Service resources in default namespace is 'Forbidden'

          +
          Error from server (Forbidden): error when creating "myjobfile.yml": jobs is forbidden: User <user> cannot create resource "jobs" in API group "" in the namespace "default"
          +
          +

          Some version of the above error is common when submitting jobs/pods to the GPU cluster using the kubectl command. It arises when you forget to specify that you are submitting the job/pod to your project namespace rather than the "default" namespace, which you do not have permission to use. Resubmitting the job/pod with kubectl -n <project-namespace> create -f myjobfile.yml should solve the issue.
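
A wrapper script around kubectl could detect this situation and suggest the fix. The sketch below is illustrative only: the function names are hypothetical, and the pattern assumes the error text has the same shape as the message quoted above.

```python
import re
from typing import Optional

# Matches the namespace named at the end of a Kubernetes "Forbidden" message.
_FORBIDDEN_NS = re.compile(r'\(Forbidden\).*in the namespace "([^"]+)"')


def forbidden_namespace(stderr_text: str) -> Optional[str]:
    """Return the namespace named in a Forbidden error, or None if absent."""
    match = _FORBIDDEN_NS.search(stderr_text)
    return match.group(1) if match else None


def namespace_hint(stderr_text: str, project_namespace: str) -> Optional[str]:
    """Suggest resubmitting with -n if the job was sent to 'default'."""
    if forbidden_namespace(stderr_text) == "default":
        return (
            "The request went to the 'default' namespace; "
            f"resubmit with -n {project_namespace}"
        )
    return None


sample = (
    'Error from server (Forbidden): error when creating "myjobfile.yml": '
    'jobs is forbidden: User <user> cannot create resource "jobs" '
    'in API group "" in the namespace "default"'
)
print(forbidden_namespace(sample))  # prints default
```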

          I can't mount my PVC in multiple containers or pods at the same time

          The current PVC provisioner is based on Ceph RBD. The block devices provided by Ceph to the Kubernetes PV/PVC providers cannot be mounted in multiple pods at the same time. They can only be accessed by one pod at a time; once a pod has unmounted the PVC and terminated, the PVC can be reused by another pod. The service development team is working on new PVC provider systems to alleviate this limitation.

          How many GPUs can I use in a pod?

          diff --git a/services/gpuservice/index.html b/services/gpuservice/index.html index feea75356..cfc49d2b9 100644 --- a/services/gpuservice/index.html +++ b/services/gpuservice/index.html @@ -2236,7 +2236,7 @@

          Overview

          The service provides access to:

          • Nvidia A100 40GB
          • -
          • Nvidia 80GB
          • +
          • Nvidia A100 80GB
          • Nvidia MIG A100 1G.5GB
          • Nvidia MIG A100 3G.20GB
          • Nvidia H100 80GB
          • @@ -2265,13 +2265,20 @@

            Overview

            Please see Getting started with Kubernetes to learn about specifying GPU resources.

            Service Access

            -

            Users should have an EIDF Account.

            -

            Project Leads will be able to request access to the EIDF GPU Service for their project either during the project application process or through a service request to the EIDF helpdesk.

            -

            Each project will be given a namespace to operate in and the ability to add a kubeconfig file to any of their Virtual Machines in their EIDF project - information on access to VMs is available here.

            -

            All EIDF virtual machines can be set up to access the EIDF GPU Service. The Virtual Machine does not require to be GPU-enabled.

            +

            Users should have an EIDF Account as the EIDF GPU Service is only accessible through EIDF Virtual Machines.

            +

            Existing projects can request access to the EIDF GPU Service through a service request to the EIDF helpdesk or by emailing eidf@epcc.ed.ac.uk.

            +

            New projects wanting to use the GPU Service should include this in their EIDF Project Application.

            +

            Each project will be given a namespace within the EIDF GPU service to operate in.

            +

            This namespace will normally be the EIDF Project code with 'ns' appended, e.g. eidf989ns for a project with code 'eidf989'.
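
The naming convention above can be expressed as a small helper for use in scripts. This is a sketch under the assumption that the convention holds for your project; the text only says it "normally" does, so the namespace recorded in your project kubeconfig remains authoritative. The function name is hypothetical.

```python
def project_namespace(project_code: str) -> str:
    """Derive the usual GPU Service namespace from an EIDF project code."""
    # Kubernetes namespace names must be lowercase, so normalise the input.
    code = project_code.strip().lower()
    if not code:
        raise ValueError("project code must not be empty")
    return f"{code}ns"


print(project_namespace("eidf989"))  # prints eidf989ns
```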

            +

            Once access to the EIDF GPU service has been confirmed, Project Leads will be given the ability to add a kubeconfig file to any of the VMs in their EIDF project - information on access to VMs is available here.

            +

            All EIDF VMs with the project kubeconfig file downloaded can access the EIDF GPU Service using the kubectl command line tool.

            +

            The VM does not need to be GPU-enabled.

            +

            A quick check to see if a VM has access to the EIDF GPU service can be completed by typing kubectl -n <project-namespace> get jobs into the command line.

            +

            If this is the first time you have connected to the GPU service, the response should be No resources found in <project-namespace> namespace.
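
The quick check above can also be scripted, for example as part of VM setup. The following is a minimal sketch, assuming kubectl is on the PATH and the project kubeconfig is already in place; the function names are hypothetical, and a zero exit status is used here as a rough proxy for "access works".

```python
import subprocess
from typing import List


def check_command(namespace: str) -> List[str]:
    """Build the argument list for the quick access check."""
    return ["kubectl", "-n", namespace, "get", "jobs"]


def has_gpu_service_access(namespace: str) -> bool:
    """Run the check and report whether kubectl exited successfully."""
    try:
        result = subprocess.run(
            check_command(namespace), capture_output=True, text=True
        )
    except FileNotFoundError:
        # kubectl is not installed or not on the PATH.
        return False
    return result.returncode == 0
```

Calling `has_gpu_service_access("eidf989ns")` on a correctly configured VM would be expected to return `True`, whether or not any jobs exist in the namespace yet.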

            EIDF GPU Service vs EIDF GPU-Enabled VMs

            -

            The EIDF GPU Service is a container based service which is accessed from EIDF Virtual Desktop VMs. This allows a project to access multiple GPUs of different types.

            +

            The EIDF GPU Service is a container-based service which is accessed from EIDF Virtual Desktop VMs.

            +

            This allows a project to access multiple GPUs of different types.

            An EIDF Virtual Desktop GPU-enabled VM is limited to a small number (1-2) of GPUs of a single type.

            Projects do not have to apply for a GPU-enabled VM to access the GPU Service.

            @@ -2285,11 +2292,17 @@

            Project Quotas

            Quota is a maximum on a Shared Resource

            A project quota is the maximum proportion of the service available for use by that project.

            -

            During periods of high demand, Jobs will be queued awaiting resource availability on the Service.

            -

            This means that a project has access up to 12 GPUs but due to demand may only be able to access a smaller number at any given time.

            +

            Any submitted job requests that would exceed the total project quota will be queued.

            Project Queues

            EIDF GPU Service is introducing the Kueue system in February 2024. The use of this is detailed in the Kueue.

            +
            +

            Job Queuing

            +

            During periods of high demand, jobs will be queued awaiting resource availability on the Service.

            +

            As a general rule, the higher the GPU/CPU/Memory resource request of a single job, the longer it will wait in the queue before enough resources are free on a single node for it to be allocated.

            +

            GPUs in high demand, such as Nvidia H100s, typically have longer wait times.

            +

            Furthermore, a project may have a quota of up to 12 GPUs but due to demand may only be able to access a smaller number at any given time.

            +

            Additional Service Policy Information

            Additional information on service policies can be found here.

            EIDF GPU Service Tutorial

            diff --git a/sitemap.xml.gz b/sitemap.xml.gz index 5833b4e0bd30d631e5076cd29de9e62bccd3f6a8..35528423f73182fbb56189ccdac3932e4932156e 100644 GIT binary patch delta 15 WcmX@Zc7}~jzMF&NSNujcH)a4R>IA+3 delta 15 WcmX@Zc7}~jzMF&Ncj!hoH)a4Rz67-Z