Habana Health Screen Tool Added
 * Add Habana Health Screen tool
omrialmog committed Jun 5, 2024
1 parent 2f992e3 commit 223a927
Showing 21 changed files with 2,438 additions and 16 deletions.
8 changes: 4 additions & 4 deletions legal-disclaimer.md
@@ -1,19 +1,19 @@
## Legal Notice and Disclaimer

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Habana Labs disclaims all warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

All information provided here is subject to change without notice. Habana Labs may make changes to its test conditions and internal reliability goals at any time. Contact your Habana Labs representative to obtain the latest Habana Labs product specifications and roadmaps. Your costs and results may vary.

The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Software and workloads used in performance tests may have been optimized for performance only on Habana Labs hardware. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

No product or component can be absolutely secure.

Habana Labs, Gaudi and SynapseAI are trademarks of Habana Labs in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.

© 2021 Habana Labs
70 changes: 58 additions & 12 deletions utils/README.md
@@ -4,8 +4,18 @@ By installing, copying, accessing, or using the software, you agree to be legall

## Table of Contents

- [Gaudi Utils](#gaudi-utils)
- [Table of Contents](#table-of-contents)
- [Overview](#overview)
- [manage_network_ifs.sh](#manage_network_ifs)
- [manage\_network\_ifs](#manage_network_ifs)
- [Operations](#operations)
- [Up](#up)
- [Down](#down)
- [Status](#status)
- [Set IP](#set-ip)
- [Unset IP](#unset-ip)
- [check\_habana\_framework\_env](#check_habana_framework_env)
- [Habana Health Screen (HHS)](#habana-health-screen-hhs)

## Overview

@@ -22,23 +32,23 @@ This script can be used as reference to bring up, take down, set IPs, unset IPs
The following is the usage of the script:

```
usage: ./manage_network_ifs.sh [options]
options:
--up toggle up all Habana network interfaces
--down toggle down all Habana network interfaces
--status print status of all Habana network interfaces
--set-ip set IP for all internal Habana network interfaces
--unset-ip unset IP from all internal Habana network interfaces
-v, --verbose print more logs
-h, --help print this help
Note: Please run this script with one operation at a time
```
## Operations

Before executing any operation, this script finds all the Habana network interfaces available on the system and stores the Habana interface information into a list.
The list will be used for the operations. If no Habana network interface is found, the script will exit.
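
The discovery step described above can be sketched in Python. This is a minimal illustration, not the logic of `manage_network_ifs.sh`: it assumes Habana interfaces can be recognised by a kernel driver name under sysfs, and the function name `find_habana_interfaces`, the `habanalabs` driver match, and the fake directory tree used for the demo are all assumptions.

```python
import os
import sys
import tempfile


def find_habana_interfaces(sysfs_net="/sys/class/net", driver_name="habanalabs"):
    """Return interface names whose device/driver link mentions driver_name.

    Both the sysfs layout and the driver name are assumptions for this sketch.
    """
    interfaces = []
    if not os.path.isdir(sysfs_net):
        return interfaces
    for ifname in sorted(os.listdir(sysfs_net)):
        driver_link = os.path.join(sysfs_net, ifname, "device", "driver")
        if not os.path.islink(driver_link):
            continue  # e.g. virtual interfaces without a backing device
        if driver_name in os.path.basename(os.readlink(driver_link)):
            interfaces.append(ifname)
    return interfaces


# Demo against a fake tree so the sketch runs without Gaudi hardware.
with tempfile.TemporaryDirectory() as root:
    for ifname, driver in [("eth0", "igb"), ("eth1", "habanalabs"), ("eth2", "habanalabs")]:
        device_dir = os.path.join(root, ifname, "device")
        os.makedirs(device_dir)
        # Dangling symlink is fine here: only the link target's name is read.
        os.symlink(os.path.join("/drivers", driver), os.path.join(device_dir, "driver"))
    os.makedirs(os.path.join(root, "lo"))  # no device/driver link
    demo_interfaces = find_habana_interfaces(sysfs_net=root)

# Mirror the documented behaviour: stop when nothing is found.
if not demo_interfaces:
    sys.exit("No Habana network interface found")
print(demo_interfaces)
```

Building the list once and passing it to each operation keeps up, down, status, and IP handling working on the same set of interfaces.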

### Up

@@ -87,4 +97,40 @@ Check health of HPUs for PyTorch
optional arguments:
-h, --help show this help message and exit
--cards CARDS Set number of cards to test (default: 1)
```

## Habana Health Screen (HHS)

The **Habana Health Screen** (HHS) tool verifies cluster network health through a suite of diagnostic tests. The tests
include checking Gaudi port status, running small workloads, and running standard collective operations across multiple systems.

``` bash
usage: screen.py [-h] [--initialize] [--screen] [--target-nodes TARGET_NODES]
[--job-id JOB_ID] [--round ROUND] [--config CONFIG]
[--hhs-check [{node,hccl-demo,none}]] [--node-write-report]
[--node-name NODE_NAME] [--logs-dir LOGS_DIR]

optional arguments:
-h, --help show this help message and exit
--initialize Downloads Necessary Repos and Creates Report Template
--screen Starts Health Screen for Cluster
--target-nodes TARGET_NODES
List of target nodes
--job-id JOB_ID Needed to identify hccl-demo running log
--round ROUND Needed to identify hccl-demo running round log
--config CONFIG Configuration file for Health Screener
--hhs-check [{node,hccl-demo,none}]
Check HHS Status for Node (Ports status, Device Acquire Fail) or all_reduce
(HCCL_DEMO between pairs of nodes)
--node-write-report Write Individual Node Health Report
--node-name NODE_NAME Name of Node
--logs-dir LOGS_DIR Output directory of health screen results
```
To run a full HHS test, run the following command:
``` bash
# Creates HHS Report and screens clusters for any infected nodes.
# Will check Level 1 and 2 by default
python screen.py --initialize --screen
```
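
When the screen is launched from another Python program, the command line can be assembled from the flags listed in the usage text. The sketch below is illustrative only: `build_screen_command` is a hypothetical helper, not part of HHS, and it covers just four of the documented flags.

```python
import shlex
import sys


def build_screen_command(initialize=False, screen=False, config=None,
                         logs_dir=None, python=sys.executable):
    """Assemble an argv list for screen.py using flags from its usage text."""
    cmd = [python, "screen.py"]
    if initialize:
        cmd.append("--initialize")
    if screen:
        cmd.append("--screen")
    if config is not None:
        cmd += ["--config", config]
    if logs_dir is not None:
        cmd += ["--logs-dir", logs_dir]
    return cmd


# Same invocation as the full HHS test shown above.
full_test_cmd = build_screen_command(initialize=True, screen=True, python="python")
print(shlex.join(full_test_cmd))
```

Passing the resulting list to `subprocess.run(full_test_cmd, check=True)` from the directory that contains `screen.py` would start the run; keeping the arguments as a list avoids shell quoting issues with paths.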
5 changes: 5 additions & 0 deletions utils/habana_health_screen/.gitignore
@@ -0,0 +1,5 @@
tmp/*
build/*
logs/*
.graph_dump/*
__pycache__*
221 changes: 221 additions & 0 deletions utils/habana_health_screen/HNodes.py
@@ -0,0 +1,221 @@
# Copyright (c) 2024 Habana Labs, Ltd. an Intel Company.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os, time, yaml, csv
import logging
from multiprocessing.pool import Pool

from HabanaHealthReport import HabanaHealthReport
from utilities import run_cmd, create_logger
from hccl_demo_helper import find_groups

_logger = logging.getLogger("habana_health_screener")


class HNodes():

    def __init__(self, health_report=HabanaHealthReport()):
        """ Keeps Track of Nodes and their current states

        Args:
            health_report (HabanaHealthReport, optional): HHS Health Report. Defaults to creating a new HabanaHealthReport().
        """
        self.all_nodes = list()
        self.launcher_nodes = list()
        self.worker_nodes = list()
        self.healthy_nodes = list()
        self.infected_nodes = list()

        self.groups_tracker = list()

        self.health_report = health_report
        self.log_dir = health_report.f_dir



class HNode():

    def __init__(self, name="", health_report=HabanaHealthReport(), num_checks_link_state=10, log_level=logging.INFO):
        self.name = name
        if name == "" and "MY_NODE_NAME" in os.environ:
            self.name = os.environ["MY_NODE_NAME"]

        self.cards = dict()
        self.num_checks_link_state = num_checks_link_state

        self.health_report = health_report
        if not self.health_report.exist():
            self.health_report.create()

        self.logger, _ = create_logger(logger_name=self.name, logger_file_name=self.name, f_path=f"{health_report.f_dir}/L1", level=log_level)


    def scan_cards(self):
        self.logger.info(f"Scanning cards info on Node: {self.name}")

        cmd = "hl-smi -Q index,module_id,bus_id,memory.used,temperature.aip -f csv,noheader"
        output = run_cmd(cmd)

        reader = csv.reader(output.split('\n'), delimiter=',')
        for row in reader:
            if len(row) == 0:
                continue

            i = row[0]
            module_id = row[1].strip()
            pci_address = row[2]
            memory_used = int(row[3].split()[0])
            temperature_C = int(row[4].split()[0])

            card = HCard(index=i, module_id=module_id, pci_address=pci_address, memory_used=memory_used, temperature=temperature_C, logger=self.logger)
            self.cards[i] = card

        self.cards = dict(sorted(self.cards.items()))

    def health_check(self, target_cards=[], write_report=False):
        checked_cards = list()

        if len(target_cards) == 0:
            target_cards = self.cards.keys()

        for i in target_cards:
            card = self.cards[str(i)]
            card.check_health(num_checks_link_state=self.num_checks_link_state)

            checked_cards.append(card)
            self.logger.info(card)

        if write_report:
            self.health_report.write_rows(node_id=self.name, cards=checked_cards)



class HCard():

    def __init__(self, index=-1, module_id=-1, pci_address="", memory_used=-1, framework="pytorch", temperature=-1, logger=None):
        self.logger = logger
        self.index = index
        self.module_id = module_id
        self.pci_address = pci_address
        self.memory_used = memory_used
        self.temperature_C = temperature
        self.temperature_state_C = ""

        self.framework = framework
        self.down_links = list()
        self.device_acquire_fail = False
        self.multi_node_fail = False

        self.external_ports = [1, 8, 9]
        self.incorrect_ports_direction = list()

    def check_health(self, num_checks_link_state=10):
        self.check_link_state(attempts=num_checks_link_state, sleep_sec=0.2)
        self.check_device_acquire_fail()
        self.check_temperature_state()

    def check_link_state(self, attempts=10, sleep_sec=0.5):
        self.logger.debug(f"Checking {self.pci_address} Link State. Will check {attempts} times")
        cmd = f"hl-smi -n link -i {self.pci_address}"
        down_links = set()

        for a in range(attempts):
            output = run_cmd(cmd)
            links_state = output.strip().split("\n")

            for i, status in enumerate(links_state):
                if "DOWN" in status:
                    down_links.add(i)
                    self.logger.debug(f"Attempt: {a} Port: {i} DOWN")

            time.sleep(sleep_sec)

        self.down_links = list(down_links)

        return self.down_links


    def check_port_direction(self):
        self.logger.debug(f"Checking {self.pci_address} Port Directions")

        incorrect_ports_direction = list()
        cmd = f"hl-smi -n ports -i {self.pci_address}"
        output = run_cmd(cmd)

        ports_direction = output.strip().split("\n")
        if ports_direction[-1] == "":
            ports_direction.pop()

        for i, direction in enumerate(ports_direction):
            if i in self.external_ports:
                if "internal" in direction:
                    incorrect_ports_direction.append(i)
            else:
                if "external" in direction:
                    incorrect_ports_direction.append(i)

        self.incorrect_ports_direction = incorrect_ports_direction

        return incorrect_ports_direction

    def check_device_acquire_fail(self):
        self.logger.debug(f"Checking {self.pci_address} for Device Acquire Issues")

        from build.Setup_and_Install.utils import check_habana_framework_env

        self.device_acquire_fail = False
        fw_test = check_habana_framework_env.pytorch_test
        if self.framework == "tensorflow":
            fw_test = check_habana_framework_env.tensorflow_test

        try:
            with Pool() as pool:
                # args must be a tuple; without the trailing comma a string
                # module_id would be unpacked character by character.
                result = pool.apply(fw_test, args=(self.module_id,))

        except (RuntimeError, AssertionError, Exception) as e:
            self.device_acquire_fail = True
            self.logger.warning(f"{self.pci_address} Device Acquire Failure")

        return self.device_acquire_fail

    def check_temperature_state(self):
        max_good_temperature = 83
        base_temperature = 25
        max_delta = 25

        if self.temperature_C >= max_good_temperature:
            self.temperature_state_C = "CRITICAL"
        elif self.temperature_C - base_temperature >= max_delta:
            self.temperature_state_C = "WARN"

    def __str__(self):
        report_str = f""" Index: {self.index}
            Module Id: {self.module_id}
            PCI Address: {self.pci_address}
            Temperature: {self.temperature_C} C
            Temperature State: {self.temperature_state_C}
            Down Links: {self.down_links}
            Device Acquire Fail: {self.device_acquire_fail}"""

        return report_str
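
The row parsing in `scan_cards` and the thresholds in `check_temperature_state` can be exercised without Gaudi hardware. The following is a minimal sketch and not part of the commit: `parse_card_rows`, `classify_temperature`, and the sample text are hypothetical, and the sample only assumes that each row carries the five queried fields with a unit suffix on the last two. The `"OK"` label is also an assumption, since the committed code leaves the state empty below the warning threshold.

```python
import csv

# Thresholds copied from check_temperature_state in the commit.
MAX_GOOD_TEMPERATURE_C = 83
BASE_TEMPERATURE_C = 25
MAX_DELTA_C = 25


def parse_card_rows(output):
    """Parse CSV rows shaped like the hl-smi query issued by scan_cards."""
    cards = {}
    for row in csv.reader(output.splitlines(), delimiter=","):
        if len(row) < 5:
            continue  # skip blank or malformed rows
        index = row[0].strip()
        cards[index] = {
            "module_id": row[1].strip(),
            "pci_address": row[2].strip(),
            "memory_used": int(row[3].split()[0]),    # drop the unit suffix
            "temperature_C": int(row[4].split()[0]),  # drop the unit suffix
        }
    return dict(sorted(cards.items()))


def classify_temperature(temperature_c):
    """Apply the same comparisons as check_temperature_state."""
    if temperature_c >= MAX_GOOD_TEMPERATURE_C:
        return "CRITICAL"
    if temperature_c - BASE_TEMPERATURE_C >= MAX_DELTA_C:
        return "WARN"
    return "OK"  # assumed label; the commit leaves the state as ""


# Hypothetical sample text, not captured from a real device.
SAMPLE_OUTPUT = (
    "0, 2, 0000:19:00.0, 768 MiB, 41 C\n"
    "1, 6, 0000:1a:00.0, 768 MiB, 55 C\n"
    "\n"
    "2, 0, 0000:b3:00.0, 768 MiB, 90 C\n"
)

sample_cards = parse_card_rows(SAMPLE_OUTPUT)
sample_states = {i: classify_temperature(c["temperature_C"]) for i, c in sample_cards.items()}
print(sample_states)
```

Because the card index is kept as a string, the sort is lexicographic, so an index of `10` would order before `2`; the committed `scan_cards` sorts its keys the same way.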
