SynapseAi 1.16.1 release
 * Update dockerfiles with 1.16.1 content
omrialmog committed Jun 19, 2024
1 parent 223a927 commit fe01c25
Showing 25 changed files with 309 additions and 297 deletions.
6 changes: 3 additions & 3 deletions README.md
@@ -1,12 +1,12 @@
# Gaudi Setup and Installation
# Intel® Gaudi® Accelerator Setup and Installation

<br />

---

<br />

By installing, copying, accessing, or using the software, you agree to be legally bound by the terms and conditions of the Habana software license agreement [defined here](https://habana.ai/habana-outbound-software-license-agreement/).
By installing, copying, accessing, or using the software, you agree to be legally bound by the terms and conditions of the Intel Gaudi software license agreement [defined here](https://habana.ai/habana-outbound-software-license-agreement/).

<br />

@@ -18,7 +18,7 @@ By installing, copying, accessing, or using the software, you agree to be legall

Welcome to Setup and Installation GitHub Repository!

The full installation documentation has been consolidated into the Installation Guide in our Habana Documentation. Please reference our [Habana docs](https://docs.habana.ai/en/latest/Installation_Guide/GAUDI_Installation_Guide.html) for the full installation guide.
The full installation documentation has been consolidated into the Installation Guide in our Intel Gaudi Documentation. Please reference our [Intel Gaudi docs](https://docs.habana.ai/en/latest/Installation_Guide/GAUDI_Installation_Guide.html) for the full installation guide.

This repository contains the following references:
- dockerfiles -- Reference dockerfiles and build script to build Gaudi Docker images
6 changes: 3 additions & 3 deletions dockerfiles/base/Dockerfile.rhel8.6
@@ -18,13 +18,13 @@ RUN dnf install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.n

RUN echo "[appstream]" > /etc/yum.repos.d/CentOS-Linux-AppStream.repo && \
echo "name=CentOS Linux 8 - AppStream" >> /etc/yum.repos.d/CentOS-Linux-AppStream.repo && \
echo "mirrorlist=http://mirrorlist.centos.org/?release=\$releasever-stream&arch=\$basearch&repo=AppStream&infra=\$infra" >> /etc/yum.repos.d/CentOS-Linux-AppStream.repo && \
echo "baseurl=https://vault.centos.org/8-stream/AppStream/x86_64/os" >> /etc/yum.repos.d/CentOS-Linux-AppStream.repo && \
echo "gpgcheck=0" >> /etc/yum.repos.d/CentOS-Linux-AppStream.repo


RUN echo "[BaseOS]" > /etc/yum.repos.d/CentOS-Linux-BaseOS.repo && \
echo "name=CentOS Linux 8 - BaseOS" >> /etc/yum.repos.d/CentOS-Linux-BaseOS.repo && \
echo "mirrorlist=http://mirrorlist.centos.org/?release=\$releasever-stream&arch=\$basearch&repo=BaseOS&infra=\$infra" >> /etc/yum.repos.d/CentOS-Linux-BaseOS.repo && \
echo "baseurl=https://vault.centos.org/8-stream/BaseOS/x86_64/os" >> /etc/yum.repos.d/CentOS-Linux-BaseOS.repo && \
echo "gpgcheck=0" >> /etc/yum.repos.d/CentOS-Linux-BaseOS.repo

RUN dnf install -y \
@@ -77,7 +77,7 @@ RUN echo "[habanalabs]" > /etc/yum.repos.d/habanalabs.repo && \

RUN echo "[powertools]" > /etc/yum.repos.d/powertools.repo && \
echo "name=powertools" >> /etc/yum.repos.d/powertools.repo && \
echo "baseurl=http://mirror.centos.org/centos/8-stream/PowerTools/x86_64/os/" >> /etc/yum.repos.d/powertools.repo && \
echo "baseurl=https://vault.centos.org/8-stream/PowerTools/x86_64/os/" >> /etc/yum.repos.d/powertools.repo && \
echo "gpgcheck=0" >> /etc/yum.repos.d/powertools.repo

RUN dnf install -y habanalabs-rdma-core-"$VERSION"-"$REVISION".el8 \
4 changes: 2 additions & 2 deletions dockerfiles/common.mk
@@ -6,8 +6,8 @@ BUILD_DIR ?= $(CURDIR)/dockerbuild

REPO_SERVER ?= vault.habana.ai
PT_VERSION ?= 2.2.2
RELEASE_VERSION ?= 1.16.0
RELEASE_BUILD_ID ?= 526
RELEASE_VERSION ?= 1.16.1
RELEASE_BUILD_ID ?= 7

BASE_IMAGE_URL ?= base-installer-$(BUILD_OS)
IMAGE_URL = $(IMAGE_NAME):$(RELEASE_VERSION)-$(RELEASE_BUILD_ID)
60 changes: 30 additions & 30 deletions utils/README.md
@@ -1,6 +1,6 @@
# Gaudi Utils

By installing, copying, accessing, or using the software, you agree to be legally bound by the terms and conditions of the Habana software license agreement [defined here](https://habana.ai/habana-outbound-software-license-agreement/).
By installing, copying, accessing, or using the software, you agree to be legally bound by the terms and conditions of the Intel Gaudi software license agreement [defined here](https://habana.ai/habana-outbound-software-license-agreement/).

## Table of Contents

@@ -14,100 +14,100 @@ By installing, copying, accessing, or using the software, you agree to be legall
- [Status](#status)
- [Set IP](#set-ip)
- [Unset IP](#unset-ip)
- [check\_habana\_framework\_env](#check_habana_framework_env)
- [Habana Health Screen (HHS)](#habana-health-screen-hhs)
- [check\_framework\_env](#check_framework_env)
- [Intel Gaudi Health Screen (IGHS)](#intel-gaudi-health-screen-ighs)

## Overview

Welcome to Gaudi's Util Scripts!
Welcome to Intel Gaudi's Util Scripts!

This folder contains some Gaudi utility scripts that users can access as reference.
This folder contains some Intel Gaudi utility scripts that users can access as reference.

## manage_network_ifs

Moved to habanalabs-qual. Example: /opt/habanalabs/qual/gaudi2/bin/manage_network_ifs.sh.

This script can be used as reference to bring up, take down, set IPs, unset IPs and check for status of the Gaudi network interfaces.
This script can be used as reference to bring up, take down, set IPs, unset IPs and check for status of the Intel Gaudi network interfaces.

The following is the usage of the script:

```
usage: ./manage_network_ifs.sh [options]
options:
--up toggle up all Habana network interfaces
--down toggle down all Habana network interfaces
--status print status of all Habana network interfaces
--set-ip set IP for all internal Habana network interfaces
--unset-ip unset IP from all internal Habana network interfaces
--up toggle up all Intel Gaudi network interfaces
--down toggle down all Intel Gaudi network interfaces
--status print status of all Intel Gaudi network interfaces
--set-ip set IP for all internal Intel Gaudi network interfaces
--unset-ip unset IP from all internal Intel Gaudi network interfaces
-v, --verbose print more logs
-h, --help print this help
Note: Please run this script with one operation at a time
```
## Operations

Before executing any operation, this script finds all the Habana network interfaces available on the system and stores the Habana interface information into a list.
The list will be used for the operations. If no Habana network interface is found, the script will exit.
Before executing any operation, this script finds all the Intel Gaudi network interfaces available on the system and stores the Intel Gaudi interface information into a list.
The list will be used for the operations. If no Intel Gaudi network interface is found, the script will exit.

### Up

Use the following command to bring all Habana network interfaces online:
Use the following command to bring all Intel Gaudi network interfaces online:
```
sudo manage_network_ifs.sh --up
```
Once all the Habana interfaces are toggled up, IPs will be set by default. Please refer [Set Ip](#set-ip) for more detail. To unset IPs, run this script with '--unset-ip'
Once all the Intel Gaudi interfaces are toggled up, IPs will be set by default. Please refer to [Set IP](#set-ip) for more detail. To unset IPs, run this script with '--unset-ip'
### Down

Use the following command to bring all Habana network interfaces offline:
Use the following command to bring all Intel Gaudi network interfaces offline:
```
sudo manage_network_ifs.sh --down
```
### Status

Print the current operational state of all Habana network interfaces such as how many ports are up/down:
Print the current operational state of all Intel Gaudi network interfaces such as how many ports are up/down:
```
sudo manage_network_ifs.sh --status
```
### Set IP

Use the following command to assign a default IP for all Habana network interfaces:
Use the following command to assign a default IP for all Intel Gaudi network interfaces:
```
sudo manage_network_ifs.sh --set-ip
```
Note: Default IPs are 192.168.100.1, 192.168.100.2, 192.168.100.3 and so on
### Unset IP

Remove IP from all available Habana network interfaces by the following command:
Remove IP from all available Intel Gaudi network interfaces by the following command:
```
sudo manage_network_ifs.sh --unset-ip
```

## check_habana_framework_env
## check_framework_env

This script can be used as reference to check the environment for running PyTorch on Habana.
This script can be used as reference to check the environment for running PyTorch on Intel Gaudi.

The following is the usage of the script:

```
usage: check_habana_framework_env.py [-h] [--cards CARDS]
usage: check_framework_env.py [-h] [--cards CARDS]
Check health of HPUs for PyTorch
Check health of Intel Gaudi for PyTorch
optional arguments:
-h, --help show this help message and exit
--cards CARDS Set number of cards to test (default: 1)
```
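
At its core, the script runs a tiny PyTorch program on each selected card. A minimal sketch of that single-card sanity test, assuming the Intel Gaudi PyTorch packages (`torch` plus `habana_frameworks.torch.core`) are installed, might look like this:

``` python
# Minimal sketch of a single-card sanity check (not the full check_framework_env.py script).
import os

def hpu_sanity_check(device_id=0):
    os.environ["ID"] = str(device_id)        # select the card to test

    import torch
    import habana_frameworks.torch.core      # registers the 'hpu' device

    x = torch.tensor([2]).to("hpu")          # move a small tensor to the card
    y = x + x                                # execute one op on the device
    assert y.item() == 4, "Sanity check failed: wrong add output"
    assert "hpu" in y.device.type.lower(), "Sanity check failed: op did not run on the card"

if __name__ == "__main__":
    hpu_sanity_check(0)
    print("Card 0 PASSED")
```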

## Habana Health Screen (HHS)
## Intel Gaudi Health Screen (IGHS)

**Habana Health Screen** (HHS) tool has been developed to verify the cluster network health through a suite of diagnostic tests. The test
**Intel Gaudi Health Screen** (IGHS) tool has been developed to verify the cluster network health through a suite of diagnostic tests. The test
includes checking Gaudi port status, running small workloads, and running standard collective operations across multiple systems.

``` bash
usage: screen.py [-h] [--initialize] [--screen] [--target-nodes TARGET_NODES]
[--job-id JOB_ID] [--round ROUND] [--config CONFIG]
[--hhs-check [{node,hccl-demo,none}]] [--node-write-report]
[--ighs-check [{node,hccl-demo,none}]] [--node-write-report]
[--node-name NODE_NAME] [--logs-dir LOGS_DIR]

optional arguments:
@@ -119,18 +119,18 @@ optional arguments:
--job-id JOB_ID Needed to identify hccl-demo running log
--round ROUND Needed to identify hccl-demo running round log
--config CONFIG Configuration file for Health Screener
--hhs-check [{node,hccl-demo,none}]
Check HHS Status for Node (Ports status, Device Acquire Fail) or all_reduce
--ighs-check [{node,hccl-demo,none}]
Check IGHS Status for Node (Ports status, Device Acquire Fail, Device Temperature) or all_reduce
(HCCL_DEMO between pairs of nodes)
--node-write-report Write Individual Node Health Report
--node-name NODE_NAME Name of Node
--logs-dir LOGS_DIR Output directory of health screen results
```
To run a full HHS test, run the below command:
To run a full IGHS test, run the below command:
``` bash
# Creates HHS Report and screens clusters for any infected nodes.
# Creates IGHS Report and screens clusters for any infected nodes.
# Will check Level 1 and 2 by default
python screen.py --initialize --screen
```
14 changes: 7 additions & 7 deletions utils/check_habana_framework_env.py → utils/check_framework_env.py
100755 → 100644
@@ -15,7 +15,7 @@
import concurrent.futures

def parse_arguments():
parser = argparse.ArgumentParser(description="Check health of HPUs for PyTorch")
parser = argparse.ArgumentParser(description="Check health of Intel Gaudi for PyTorch")

parser.add_argument("--cards",
default=1,
@@ -29,11 +29,11 @@ def parse_arguments():
return args

def pytorch_test(device_id=0):
""" Checks health of HPU through running a basic
PyTorch example on HPU
""" Checks health of Intel Gaudi through running a basic
PyTorch example on Intel Gaudi
Args:
device_id (int, optional): ID of HPU. Defaults to 0.
device_id (int, optional): ID of Intel Gaudi. Defaults to 0.
"""

os.environ["ID"] = str(device_id)
@@ -42,15 +42,15 @@ def parse_arguments():
import torch
import habana_frameworks.torch.core
except Exception as e:
print(f"Card {device_id} Failed to initialize Habana PyTorch: {str(e)}")
print(f"Card {device_id} Failed to initialize Intel Gaudi PyTorch: {str(e)}")
raise

try:
x = torch.tensor([2]).to('hpu')
y = x + x

assert y == 4, 'Sanity check failed: Wrong Add output'
assert 'hpu' in y.device.type.lower(), 'Sanity check failed: Operation not executed on Habana Device'
assert 'hpu' in y.device.type.lower(), 'Sanity check failed: Operation not executed on Intel Gaudi Card'
except (RuntimeError, AssertionError) as e:
print(f"Card {device_id} Failure: {e}")
raise
@@ -64,7 +64,7 @@ def pytorch_test(device_id=0):
for device_id, res in zip(range(args.cards), executor.map(pytorch_test, range(args.cards))):
print(f"Card {device_id} PASSED")
except Exception as e:
print(f"Failed to initialize Habana, error: {str(e)}")
print(f"Failed to initialize on Intel Gaudi, error: {str(e)}")
print(f"Check FAILED")
exit(1)
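
The main block shown above maps the per-card test across all requested cards. A hedged sketch of that fan-out pattern with `concurrent.futures`, not the script's exact code, with `test_fn` standing in for a function like `pytorch_test(device_id)`:

``` python
# Sketch of the fan-out pattern: run a per-card test across all cards in parallel.
import concurrent.futures
import sys

def run_on_all_cards(test_fn, cards):
    try:
        with concurrent.futures.ThreadPoolExecutor(max_workers=cards) as executor:
            for device_id, _ in zip(range(cards), executor.map(test_fn, range(cards))):
                print(f"Card {device_id} PASSED")
    except Exception as e:
        print(f"Failed to initialize on Intel Gaudi, error: {e}")
        print("Check FAILED")
        sys.exit(1)
```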

1 change: 0 additions & 1 deletion utils/habana_health_screen/version.txt

This file was deleted.

File renamed without changes.
@@ -18,12 +18,12 @@

import logging

_logger = logging.getLogger("habana_health_screener")
_logger = logging.getLogger("health_screener")

class HabanaHealthReport():
class HealthReport():

def __init__(self, f_dir="tmp", report_name="health_report.csv"):
""" Initialize Habana Health Report Class
""" Initialize Health Report Class
Args:
f_dir (str, optional): File Directory to store Health Report logs and results. Defaults to "tmp".
@@ -83,8 +83,8 @@ def write_rows(self, cards=list(), node_id="", data=list(), level=1):
""" Write health check results to Health Report CSV. Can write multiple rows at once
Args:
cards ([HCard], optional): Level 1 HCards to report about. Defaults to list().
node_id (str, optional): Node ID of HCards. Defaults to "".
cards ([IGCard], optional): Level 1 IGCards to report about. Defaults to list().
node_id (str, optional): Node ID of IGCards. Defaults to "".
data (_type_, optional): Health Report CSV Row data. Defaults to list().
level (int, optional): Health Screen Level. Defaults to 1.
"""
@@ -118,12 +118,12 @@ def update_health_report(self, detected_nodes, infected_nodes, missing_nodes):
infected_nodes (list[str]): List of infected node_ids
missing_nodes (list[str]): List of missing node_ids
"""
tempfile = NamedTemporaryFile(mode='w', delete=False)
temp_file = NamedTemporaryFile(mode='w', delete=False)
detected_nodes_cp = detected_nodes.copy()

with open(self.f_path, 'r', newline='') as csvfile, tempfile:
reader = csv.DictReader(csvfile)
writer = csv.DictWriter(tempfile, fieldnames=self.header)
with open(self.f_path, 'r', newline='') as csv_file, temp_file:
reader = csv.DictReader(csv_file)
writer = csv.DictWriter(temp_file, fieldnames=self.header)

writer.writeheader()
for row in reader:
@@ -148,22 +148,22 @@ def update_health_report(self, detected_nodes, infected_nodes, missing_nodes):
for n in missing_nodes:
writer.writerow({"node_id": n, "multi_node_fail": True, "missing": True})

shutil.move(tempfile.name, self.f_path)
shutil.move(temp_file.name, self.f_path)
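
`update_health_report` rewrites the report CSV in place by streaming rows through a temporary file and moving it over the original at the end. A stripped-down sketch of that pattern, reusing the `node_id` / `multi_node_fail` columns seen above but otherwise illustrative, is:

``` python
# Sketch: rewrite a CSV in place via a temporary file, as the update_*_report methods do.
# Column names are taken from the rows written above; the helper itself is illustrative.
import csv
import shutil
from tempfile import NamedTemporaryFile

def mark_infected(report_path, infected_ids, header):
    temp_file = NamedTemporaryFile(mode="w", delete=False)

    with open(report_path, "r", newline="") as csv_file, temp_file:
        reader = csv.DictReader(csv_file)
        writer = csv.DictWriter(temp_file, fieldnames=header, extrasaction="ignore")

        writer.writeheader()
        for row in reader:
            if row["node_id"] in infected_ids:
                row["multi_node_fail"] = True     # flag rows for infected nodes
            writer.writerow(row)

    # Swap the updated copy over the original report
    shutil.move(temp_file.name, report_path)
```

Writing to a separate file and moving it over the original keeps the existing report intact if the update is interrupted partway through.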

def update_hccl_demo_health_report(self, round, all_node_pairs, multi_node_fail, qpc_fail, missing_nodes):
""" Update health_report with hccl_demo results, based on infected_nodes.
Args:
all_node_pairs (list[str]): List of all node pairs reported by Level 2 round
all_node_pairs (list[str]): List of all Node Pairs reported by Level 2 round
multi_node_fail (list[str]): List of Node Pairs that failed HCCL_Demo Test
qpc_fail (list[str]): List of Node Pairs that failed HCCL_Demo Test due to QPC error
missing_nodes (list[str]): List of Node Pairs that couldn't run HCCL_Demo
"""
tempfile = NamedTemporaryFile(mode='w', delete=False)
temp_file = NamedTemporaryFile(mode='w', delete=False)

with open(self.f_path_hccl_demo, 'r', newline='') as csvfile, tempfile:
reader = csv.DictReader(csvfile)
writer = csv.DictWriter(tempfile, fieldnames=self.header_hccl_demo, extrasaction='ignore')
with open(self.f_path_hccl_demo, 'r', newline='') as csv_file, temp_file:
reader = csv.DictReader(csv_file)
writer = csv.DictWriter(temp_file, fieldnames=self.header_hccl_demo, extrasaction='ignore')

writer.writeheader()
for row in reader:
@@ -181,7 +181,7 @@ def update_hccl_demo_health_report(self, round, all_node_pairs, multi_node_fail,
if len(all_node_pairs):
writer.writerows(list(all_node_pairs.values()))

shutil.move(tempfile.name, self.f_path_hccl_demo)
shutil.move(temp_file.name, self.f_path_hccl_demo)

def check_screen_complete(self, num_nodes, hccl_demo=False, round=0):
""" Check on status of Health Screen Check.
@@ -306,11 +306,11 @@ def gather_health_report(self, level, remote_path, hosts):
""" Gathers Health Report from all hosts
Args:
level (str): HHS Level
remote_path (str): Remote Destination of HHS Report
hosts (list, optional): List of IP Addresses to gather HHS Reports
level (str): IGHS Level
remote_path (str): Remote Destination of IGHS Report
hosts (list, optional): List of IP Addresses to gather IGHS Reports
"""
copy_files(src=f"{remote_path}/habana_health_screen/{self.f_dir}/L{level}",
copy_files(src=f"{remote_path}/intel_gaudi_health_screen/{self.f_dir}/L{level}",
dst=f"{self.f_dir}",
hosts=hosts,
to_remote=False)
@@ -319,16 +319,16 @@ def consolidate_health_report(self, level, report_dir):
""" Consolidates the health_report_*.csv from worker pods into a single master csv file
Args:
level (str): HHS Level
level (str): IGHS Level
report_dir (str): Directory of CSV files to merge
"""
data = list()
path = f"{report_dir}/L{level}/health_report_*.csv"
csv_files = glob.glob(path)

for f in csv_files:
with open(f, 'r', newline='') as csvfile:
reader = csv.DictReader(csvfile)
with open(f, 'r', newline='') as csv_file:
reader = csv.DictReader(csv_file)
for row in reader:
data.append(row)
