Updating 1.4.3 changes #2138

Merged on Aug 3, 2023 (28 commits)

Commits
9e6919f - Updating known issues (cgoveas, Jul 31, 2023)
26738e8 - Updating known issues (cgoveas, Jul 31, 2023)
96a1c80 - Updating 1.4.3 doc fixes (cgoveas, Aug 1, 2023)
505de7b - Updating 1.4.3 doc fixes (cgoveas, Aug 1, 2023)
1b85d9d - Updating 1.4.3 doc fixes (cgoveas, Aug 1, 2023)
7d021fb - Updating 1.4.3 doc fixes (cgoveas, Aug 1, 2023)
02203ab - Updating 1.4.3 doc fixes (cgoveas, Aug 1, 2023)
2c8bb8a - Updating 1.4.3 doc fixes (cgoveas, Aug 1, 2023)
cf8621f - Updating 1.4.3 doc fixes (cgoveas, Aug 1, 2023)
382b5eb - Updating 1.4.3 doc fixes (cgoveas, Aug 1, 2023)
5bfdf74 - Updating 1.4.3 doc fixes (cgoveas, Aug 1, 2023)
8775526 - Updating 1.4.3 doc fixes (cgoveas, Aug 2, 2023)
5883323 - Updating 1.4.3 doc fixes (cgoveas, Aug 2, 2023)
dfba8da - Updating 1.4.3 doc fixes (cgoveas, Aug 2, 2023)
06062a8 - Updating 1.4.3 doc fixes (cgoveas, Aug 2, 2023)
8d08576 - Updating 1.4.3 doc fixes (cgoveas, Aug 2, 2023)
315ff77 - Updating 1.4.3 doc fixes (cgoveas, Aug 2, 2023)
89ab829 - Updating 1.4.3 doc fixes (cgoveas, Aug 2, 2023)
2d60814 - Updating 1.4.3 doc fixes (cgoveas, Aug 2, 2023)
8ee538d - Updating 1.4.3 doc fixes (cgoveas, Aug 2, 2023)
a6c077f - Updating 1.4.3 doc fixes (cgoveas, Aug 2, 2023)
c466588 - Updating 1.4.3 doc fixes (cgoveas, Aug 2, 2023)
c0bbc0b - Updating 1.4.3 doc fixes (cgoveas, Aug 2, 2023)
7f4ccea - Updating 1.4.3 doc fixes (cgoveas, Aug 2, 2023)
143e6c4 - Merge branch 'dellhpc:main' into main (cgoveas, Aug 3, 2023)
62d3939 - Updating HPCApptainer (cgoveas, Aug 3, 2023)
657dc98 - Updating OpenMPI (cgoveas, Aug 3, 2023)
3b74206 - Merge remote-tracking branch 'origin/main' (cgoveas, Aug 3, 2023)
@@ -1,5 +1,5 @@
Install oneAPI for MPI jobs
___________________________
Install oneAPI for MPI jobs on Intel processors
________________________________________________

**Pre-requisites**

73 changes: 73 additions & 0 deletions docs/source/InstallationGuides/Benchmarks/OpenMPI_AOCC.rst
@@ -0,0 +1,73 @@
Open MPI AOCC HPL benchmark for AMD processors
----------------------------------------------

**Prerequisites**

* Provision the cluster and install Slurm on all cluster nodes.
* OpenMPI should be compiled with Slurm support and installed on all cluster nodes, or be available on the NFS share (see the build sketch below).
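
The following is a minimal, illustrative sketch of building OpenMPI with Slurm/PMI support into the NFS share; the version, download URL, install path, and configure flags shown are assumptions and may need to be adapted to the target environment: ::

# Download and unpack OpenMPI (version and URL are illustrative)
wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.5.tar.gz
tar -xzf openmpi-4.1.5.tar.gz && cd openmpi-4.1.5

# Configure with Slurm support and install into the NFS share
./configure --prefix=/home/omnia-share/openmpi --with-slurm --with-pmi
make -j "$(nproc)" && make install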


**To execute multi-node jobs**

1. Update the following parameters in ``/etc/slurm/slurm.conf``: ::

SelectType=select/cons_tres
SelectTypeParameters=CR_Core
TaskPlugin=task/affinity,task/cgroup

2. Restart ``slurmd.service`` on all compute nodes. ::

systemctl stop slurmd
systemctl start slurmd

3. Once the service restarts on the compute nodes, restart ``slurmctld.service`` on the manager node. ::

systemctl stop slurmctld.service
systemctl start slurmctld.service

4. Job execution can now be initiated. Provide the host list using ``srun`` and ``sbatch``. For example:

For a job that runs on multiple nodes (``omnianode00001.omnia.test``, ``omnianode00006.omnia.test``, and ``omnianode00005.omnia.test``), with OpenMPI compiled and installed on the NFS share (``/home/omnia-share/openmpi/bin/mpirun``), the job can be initiated as below: ::


srun -N 3 --partition=mpiexectrial /home/omnia-share/openmpi/bin/mpirun -host omnianode00001.omnia.test,omnianode00006.omnia.test,omnianode00005.omnia.test ./amd-zen-hpl-2023_07_18/xhpl

For a batch job using similar parameters, the submission script would be: ::

#!/bin/bash
#SBATCH --job-name=test
#SBATCH --output=test.log
#SBATCH --partition=normal
#SBATCH -N 3
#SBATCH --time=10:00
#SBATCH --ntasks=2

source /home/omnia-share/setenv_AOCC.sh
export PATH=$PATH:/home/omnia-share/openmpi/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/omnia-share/openmpi/lib

mpirun -host omnianode00001.omnia.test,omnianode00005.omnia.test ./amd-zen-hpl-2023_07_18/xhpl
srun sleep 30
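
Once the script above is saved (for example as ``hpl_job.sh``, a hypothetical file name), it can be submitted and monitored with standard Slurm commands; this is an optional sketch rather than a required step: ::

sbatch hpl_job.sh                                  # submit the batch script; prints the job ID
squeue -u "$USER"                                  # confirm the job is pending or running
sacct -j <job_id> --format=JobID,State,Elapsed     # review job state after completion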





6 changes: 6 additions & 0 deletions docs/source/InstallationGuides/Benchmarks/index.rst
@@ -0,0 +1,6 @@
Running HPC benchmarks on Omnia clusters
=========================================

.. toctree::
OneAPI
OpenMPI_AOCC
@@ -196,9 +196,9 @@ Once user accounts are created, admins can enable passwordless SSH for users to
* If ``enable_omnia_nfs`` is true in ``input/omnia_config.yml``, follow the below steps to configure an NFS share on your LDAP server:
- From the manager node:
1. Add the LDAP server IP address to ``/etc/exports``.
2. Run ``exports -ra`` to enable the NFS configuration.
2. Run ``exportfs -ra`` to enable the NFS configuration.
- From the LDAP server:
1. Add the required fstab entries in ``/etc/fstab``.
1. Add the required fstab entries in ``/etc/fstab``. (The corresponding entry will be available on the compute nodes in ``/etc/fstab``)
2. Mount the NFS share using ``mount manager_ip:/home/omnia-share /home/omnia-share``. (A combined sketch of these export, fstab, and mount entries follows this list.)
* If ``enable_omnia_nfs`` is false in ``input/omnia_config.yml``, ensure the user-configured NFS share is mounted on the LDAP server.
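
A combined sketch of the export, fstab, and mount entries described above, assuming the manager node IP is ``10.5.0.101`` and the LDAP server IP is ``10.5.0.105`` (both addresses and the export options are illustrative): ::

# On the manager node: allow the LDAP server to mount the share
echo "/home/omnia-share 10.5.0.105(rw,sync,no_root_squash)" >> /etc/exports
exportfs -ra

# On the LDAP server: add the fstab entry and mount the share
echo "10.5.0.101:/home/omnia-share /home/omnia-share nfs defaults 0 0" >> /etc/fstab
mount 10.5.0.101:/home/omnia-share /home/omnia-share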

21 changes: 20 additions & 1 deletion docs/source/InstallationGuides/BuildingClusters/index.rst
@@ -1,13 +1,32 @@
Configuring the cluster
=======================

**Features enabled by omnia.yml**

* Centralized authentication: Once all the required parameters in `security_config.yml <schedulerinputparams.html>`_ are filled in, ``omnia.yml`` can be used to set up FreeIPA/LDAP.

* Slurm: Once all the required parameters in `omnia_config.yml <schedulerinputparams.html>`_ are filled in, ``omnia.yml`` can be used to set up Slurm (see the configuration fragment below).

* Login node (additionally, a secure login node)

* Kubernetes: Once all the required parameters in `omnia_config.yml <schedulerinputparams.html>`_ are filled in, ``omnia.yml`` can be used to set up Kubernetes.

* BeeGFS bolt-on installation: Once all the required parameters in `storage_config.yml <schedulerinputparams.html>`_ are filled in, ``omnia.yml`` can be used to set up BeeGFS.

* NFS bolt-on support: Once all the required parameters in `storage_config.yml <schedulerinputparams.html>`_ are filled in, ``omnia.yml`` can be used to set up NFS.
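
As a sketch of the kind of input involved, ``input/omnia_config.yml`` exposes a ``scheduler_type`` parameter; the value shown here is an assumption, not a confirmed default: ::

# input/omnia_config.yml (fragment; value illustrative)
scheduler_type: "slurm"        # scheduler(s) to install, e.g. "slurm" or "k8s"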



.. toctree::
schedulerinputparams
schedulerprereqs
installscheduler
Authentication
OneAPI
BeeGFS
NFS
OneAPI
OpenMPI_AOCC
KernelUpdateRHEL



@@ -1,20 +1,27 @@
Building clusters
------------------

1. In the ``input/omnia_config.yml`` file, provide the `required details <schedulerinputparams.html>`_.
1. In the ``input/omnia_config.yml``, ``input/security_config.yml`` and [optional] ``input/storage_config.yml`` files, provide the `required details <schedulerinputparams.html>`_.

.. note::
* Use the parameter ``scheduler_type`` in ``input/omnia_config.yml`` to customize what schedulers are installed in the cluster.
* Without the login node, Slurm jobs can be scheduled only through the manager node.

2. Create an inventory file in the *omnia* folder. Add the manager node IP address under the *[manager]* group, compute node IP addresses under the *[compute]* group, and the login node IP address under the *[login]* group (a minimal inventory sketch follows the note below). Check out the `sample inventory for more information <../samplefiles.html>`_.
2. Create an inventory file in the *omnia* folder. Add the manager node IP address under the *[manager]* group, compute node IP addresses under the *[compute]* group, and the login node IP address under the *[login]* group (a minimal inventory sketch follows the note below). Check out the `sample inventory for more information <../../samplefiles.html>`_.

.. note::
* RedHat nodes that are not configured by Omnia need to have a valid subscription. To set up a subscription, `click here <https://omnia-doc.readthedocs.io/en/latest/Roles/Utils/rhsm_subscription.html>`_.
* Omnia creates a log file which is available at: ``/var/log/omnia.log``.
* If only Slurm is being installed on the cluster, docker credentials are not required.
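
A minimal inventory sketch following this layout (all IP addresses are illustrative): ::

[manager]
10.5.0.101

[compute]
10.5.0.102
10.5.0.103

[login]
10.5.0.104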

3. To run ``omnia.yml``: ::

3. ``omnia.yml`` is a wrapper playbook comprising:

i. ``security.yml``: This playbook sets up centralized authentication (LDAP/FreeIPA) on the cluster. For more information, `click here <Authentication.html>`_.
ii. ``scheduler.yml``: This playbook sets up job schedulers (Slurm or Kubernetes) on the cluster.
iii. ``storage.yml``: This playbook sets up storage tools like `BeeGFS <BeeGFS.html>`_ and `NFS <NFS.html>`_.

To run ``omnia.yml``: ::

ansible-playbook omnia.yml -i inventory
