Merge pull request #2138 from cgoveas/main
Updating 1.4.3 changes
sujit-jadhav authored Aug 3, 2023
2 parents 7504bf9 + 3b74206 commit af7685f
Showing 27 changed files with 364 additions and 104 deletions.
@@ -1,5 +1,5 @@
Install oneAPI for MPI jobs
___________________________
Install oneAPI for MPI jobs on Intel processors
________________________________________________

**Pre-requisites**

73 changes: 73 additions & 0 deletions docs/source/InstallationGuides/Benchmarks/OpenMPI_AOCC.rst
@@ -0,0 +1,73 @@
Open MPI AOCC HPL benchmark for AMD processors
----------------------------------------------

**Prerequisites**

* Provision the cluster and install Slurm on all cluster nodes.
* OpenMPI should be compiled with Slurm support and installed on all cluster nodes, or be available on the NFS share. (A quick verification sketch follows this list.)
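
One way to confirm that the Open MPI build on the NFS share includes Slurm support is to query ``ompi_info`` for Slurm-aware components. This is a minimal check, assuming the install path ``/home/omnia-share/openmpi`` used in the examples below: ::

# Expect entries such as "MCA plm: slurm" and "MCA ras: slurm" in the output
/home/omnia-share/openmpi/bin/ompi_info | grep -i slurm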


**To execute multi-node jobs**

1. Update the following parameters in ``/etc/slurm/slurm.conf``: ::

SelectType=select/cons_tres
SelectTypeParameters=CR_Core
TaskPlugin=task/affinity,task/cgroup

2. Restart ``slurmd.service`` on all compute nodes. ::

systemctl stop slurmd
systemctl start slurmd

3. Once the service restarts on the compute nodes, restart ``slurmctld.service`` on the manager node. ::

systemctl stop slurmctld.service
systemctl start slurmctld.service

4. Job execution can now be initiated. Provide the host list using ``srun`` or ``sbatch``. For example:

To run a job on multiple nodes (``omnianode00001.omnia.test``, ``omnianode00006.omnia.test``, and ``omnianode00005.omnia.test``) where OpenMPI is compiled and installed on the NFS share (``/home/omnia-share/openmpi/bin/mpirun``), the job can be initiated as below: ::


srun -N 3 --partition=mpiexectrial /home/omnia-share/openmpi/bin/mpirun -host omnianode00001.omnia.test,omnianode00006.omnia.test,omnianode00005.omnia.test ./amd-zen-hpl-2023_07_18/xhpl

For a batch job using the same parameters, the submission script would be: ::



#!/bin/bash
#SBATCH --job-name=test
#SBATCH --output=test.log
#SBATCH --partition=normal
#SBATCH -N 3
#SBATCH --time=10:00
#SBATCH --ntasks=2
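# Load the AOCC compiler environment and the Open MPI build from the NFS share (paths as used in this example)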
source /home/omnia-share/setenv_AOCC.sh
export PATH=$PATH:/home/omnia-share/openmpi/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/omnia-share/openmpi/lib

mpirun -host omnianode00001.omnia.test,omnianode00005.omnia.test ./amd-zen-hpl-2023_07_18/xhpl
srun sleep 30
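
Before submitting the script, ``sinfo`` can be used to confirm that the compute nodes rejoined the cluster after the service restarts. A minimal submission sketch, assuming the script above is saved as ``hpl_batch.sh`` (a hypothetical file name): ::

# Confirm the compute nodes are available (for example, in the idle state)
sinfo

# Submit the batch script; Slurm prints the assigned job ID
sbatch hpl_batch.sh

# Monitor the queue; per the --output directive above, output is written to test.log
squeue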





6 changes: 6 additions & 0 deletions docs/source/InstallationGuides/Benchmarks/index.rst
@@ -0,0 +1,6 @@
Running HPC benchmarks on Omnia clusters
=========================================

.. toctree::
OneAPI
OpenMPI_AOCC
@@ -196,9 +196,9 @@ Once user accounts are created, admins can enable passwordless SSH for users to
* If ``enable_omnia_nfs`` is true in ``input/omnia_config.yml``, follow the below steps to configure an NFS share on your LDAP server:
- From the manager node:
1. Add the LDAP server IP address to ``/etc/exports``.
2. Run ``exports -ra`` to enable the NFS configuration.
2. Run ``exportfs -ra`` to enable the NFS configuration.
- From the LDAP server:
1. Add the required fstab entries in ``/etc/fstab``.
1. Add the required fstab entries in ``/etc/fstab``. (The corresponding entry is available on the compute nodes in ``/etc/fstab``.)
2. Mount the NFS share using ``mount manager_ip:/home/omnia-share /home/omnia-share``. (A combined sketch follows this list.)
* If ``enable_omnia_nfs`` is false in ``input/omnia_config.yml``, ensure the user-configured NFS share is mounted on the LDAP server.
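
A minimal sketch of the NFS wiring described above, assuming a hypothetical manager node IP of 10.5.0.101 and LDAP server IP of 10.5.0.105 (the export options shown are common defaults, not values mandated by Omnia): ::

# On the manager node: export the share to the LDAP server and re-export
echo "/home/omnia-share 10.5.0.105(rw,sync,no_root_squash)" >> /etc/exports
exportfs -ra

# On the LDAP server: add a persistent fstab entry, then mount the share
echo "10.5.0.101:/home/omnia-share /home/omnia-share nfs defaults 0 0" >> /etc/fstab
mount 10.5.0.101:/home/omnia-share /home/omnia-share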

21 changes: 20 additions & 1 deletion docs/source/InstallationGuides/BuildingClusters/index.rst
@@ -1,13 +1,32 @@
Configuring the cluster
=======================

**Features enabled by omnia.yml**

* Centralized authentication: Once all the required parameters in `security_config.yml <schedulerinputparams.html>`_ are filled in, ``omnia.yml`` can be used to set up FreeIPA/LDAP.

* Slurm: Once all the required parameters in `omnia_config.yml <schedulerinputparams.html>`_ are filled in, ``omnia.yml`` can be used to set up Slurm.

* Login node (additionally, a secure login node)

* Kubernetes: Once all the required parameters in `omnia_config.yml <schedulerinputparams.html>`_ are filled in, ``omnia.yml`` can be used to set up Kubernetes.

* BeeGFS bolt-on installation: Once all the required parameters in `storage_config.yml <schedulerinputparams.html>`_ are filled in, ``omnia.yml`` can be used to set up BeeGFS.

* NFS bolt-on support: Once all the required parameters in `storage_config.yml <schedulerinputparams.html>`_ are filled in, ``omnia.yml`` can be used to set up NFS.



.. toctree::
schedulerinputparams
schedulerprereqs
installscheduler
Authentication
OneAPI
BeeGFS
NFS
OpenMPI_AOCC
KernelUpdateRHEL



@@ -1,20 +1,27 @@
Building clusters
------------------

1. In the ``input/omnia_config.yml`` file, provide the `required details <schedulerinputparams.html>`_.
1. In the ``input/omnia_config.yml``, ``input/security_config.yml`` and [optional] ``input/storage_config.yml`` files, provide the `required details <schedulerinputparams.html>`_.

.. note::
* Use the parameter ``scheduler_type`` in ``input/omnia_config.yml`` to customize what schedulers are installed in the cluster.
* Without the login node, Slurm jobs can be scheduled only through the manager node.

2. Create an inventory file in the *omnia* folder. Add login node IP address under the manager node IP address under the *[manager]* group, compute node IP addresses under the *[compute]* group, and Login node IP under the *[login]* group,. Check out the `sample inventory for more information <../samplefiles.html>`_.
2. Create an inventory file in the *omnia* folder. Add the manager node IP address under the *[manager]* group, compute node IP addresses under the *[compute]* group, and the login node IP address under the *[login]* group. Check out the `sample inventory for more information <../../samplefiles.html>`_. (A minimal inventory sketch also follows step 3 below.)

.. note::
* RedHat nodes that are not configured by Omnia need to have a valid subscription. To set up a subscription, `click here <https://omnia-doc.readthedocs.io/en/latest/Roles/Utils/rhsm_subscription.html>`_.
* Omnia creates a log file which is available at: ``/var/log/omnia.log``.
* If only Slurm is being installed on the cluster, docker credentials are not required.

3. To run ``omnia.yml``: ::

3. ``omnia.yml`` is a wrapper playbook comprising:

i. ``security.yml``: This playbook sets up centralized authentication (LDAP/FreeIPA) on the cluster. For more information, `click here. <Authentication.html>`_
ii. ``scheduler.yml``: This playbook sets up job schedulers (Slurm or Kubernetes) on the cluster.
iii. ``storage.yml``: This playbook sets up storage tools like `BeeGFS <BeeGFS.html>`_ and `NFS <NFS.html>`_.

To run ``omnia.yml``: ::

ansible-playbook omnia.yml -i inventory
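
A minimal inventory sketch following the group layout described in step 2 (the IP addresses below are placeholders, not values taken from this guide): ::

[manager]
10.5.0.101

[compute]
10.5.0.102
10.5.0.103

[login]
10.5.0.104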

