edf-hpc/collectd-opaerror-plugin

Collectd OPA error counters Python plugin

Overview

This project contains a Python script, opa_error.py, that collects all OPA error counters on all OPA ports of the fabric.

The script is designed to be used as a collectd Python plugin, and it can also be run directly from the CLI (mostly for testing purposes).

The script relies on opaextracterror, a Bash script that calls the opaerror command.
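As a rough illustration of that dependency, the sketch below runs an external command and parses delimited output into one dictionary per port. The semicolon delimiter, the header row, and the column names in the sample are assumptions made for illustration, not the documented output of opaextracterror. `parse_error_report` and `run_extract` are hypothetical helper names, not functions taken from opa_error.py.

```python
import csv
import io
import subprocess


def parse_error_report(text, delimiter=";"):
    """Parse delimited text with a header row into a list of dicts.

    The delimiter and the header row are assumptions about the report format.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    rows = []
    for row in reader:
        parsed = {}
        for key, value in row.items():
            if key is None:
                continue  # ignore fields beyond the header width
            value = (value or "").strip()
            # Numeric columns become integers; anything else stays text.
            parsed[key.strip()] = int(value) if value.isdigit() else value
        rows.append(parsed)
    return rows


def run_extract(command=("opaextracterror",)):
    """Run the extraction command and parse its standard output."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return parse_error_report(result.stdout)


# Hypothetical sample input, for illustration only.
sample = "NodeDesc;PortNum;LinkDowned;RcvErrors\nnode01 hfi1_0;1;0;3\n"
ports = parse_error_report(sample)
```

Keeping the parsing step separate from the subprocess call makes it possible to test the parser on a node that does not have the OPA tools installed.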

Usage

CLI

Once the script is deployed on a node that has the Intel OPA CLI commands installed, just run the Python script:

python opa_error.py

Collectd plugin

Deploy the script on one of your Slurm cluster nodes (e.g. the batch controller) in the directory of your choice (e.g. /usr/share/python/collectd/plugins/). Then add the module to your collectd.conf:

    LoadPlugin python

    <Plugin python>
       ModulePath "/usr/share/python/collectd/plugins"
       Import "opa_error"
    </Plugin>

Make sure to adjust the ModulePath value to the directory where you deployed the script.
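For orientation, a module imported by the collectd Python plugin typically registers a read callback that dispatches values. The sketch below shows that general shape and is not the code of opa_error.py. `collect_counters` is a hypothetical placeholder for the real collection step, and the `opa_error` plugin name and `counter` type are illustrative choices. The `collectd` module is only importable inside the collectd daemon, so the sketch falls back to printing when run from the CLI.

```python
try:
    import collectd  # only available when loaded by the collectd daemon
except ImportError:
    collectd = None


def collect_counters():
    """Hypothetical placeholder: return {port_name: {counter_name: value}}."""
    return {"node01_port1": {"LinkDowned": 0, "RcvErrors": 3}}


def build_samples(counters):
    """Flatten nested counters into sorted (port, counter, value) tuples."""
    return sorted(
        (port, name, value)
        for port, values in counters.items()
        for name, value in values.items()
    )


def read_callback():
    """Dispatch one value per port and counter to collectd."""
    for port, name, value in build_samples(collect_counters()):
        metric = collectd.Values(plugin="opa_error", type="counter")
        metric.plugin_instance = port
        metric.type_instance = name
        metric.dispatch(values=[value])


if collectd is not None:
    # Plugin mode: collectd calls read_callback at every interval.
    collectd.register_read(read_callback)
elif __name__ == "__main__":
    # CLI mode: print the samples instead of dispatching them.
    for port, name, value in build_samples(collect_counters()):
        print(port, name, value)
```

The same file can then serve both uses described in the Overview: collectd imports it as a module, while running it with python prints the samples for a quick check.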

Description

The counters collected by this plugin are listed below, grouped by error category. Each entry gives the report label followed by the counter name:

    Errors: Signal Integrity
      Link Qual Indicator     LinkQualityIndicator
      Uncorrectable Errors    UncorrectableErrors
      Link Downed             LinkDowned
      Rcv Errors              RcvErrors
      Exc. Buffer Overrun     ExcessiveBufferOverruns
      FM Config Errors        FMConfigErrors
      Local Link Integ Err    LocalLinkIntegrityErrors
      Link Error Recovery     LinkErrorRecovery
      Rcv Rmt Phys Err        RcvRemotePhysicalErrors

    Errors: Security
      Xmit Constraint         XmitConstraintErrors
      Rcv Constraint          RcvConstraintErrors

    Errors: Routing and Other Errors
      Rcv Sw Relay Err        RcvSwitchRelayErrors
      Xmit Discards           XmitDiscards
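When post-processing the collected data, the label and counter name pairs listed above can be kept in a lookup table. This is a minimal sketch. `COUNTER_NAMES` and `counter_name` are hypothetical names introduced here and are not part of the plugin.

```python
# Report label -> counter name, copied from the list in this section.
COUNTER_NAMES = {
    "Link Qual Indicator": "LinkQualityIndicator",
    "Uncorrectable Errors": "UncorrectableErrors",
    "Link Downed": "LinkDowned",
    "Rcv Errors": "RcvErrors",
    "Exc. Buffer Overrun": "ExcessiveBufferOverruns",
    "FM Config Errors": "FMConfigErrors",
    "Local Link Integ Err": "LocalLinkIntegrityErrors",
    "Link Error Recovery": "LinkErrorRecovery",
    "Rcv Rmt Phys Err": "RcvRemotePhysicalErrors",
    "Xmit Constraint": "XmitConstraintErrors",
    "Rcv Constraint": "RcvConstraintErrors",
    "Rcv Sw Relay Err": "RcvSwitchRelayErrors",
    "Xmit Discards": "XmitDiscards",
}


def counter_name(label):
    """Return the counter name for a label, or the label itself if unknown."""
    label = label.strip()
    return COUNTER_NAMES.get(label, label)
```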

Licensing

This script is distributed under the terms of the GNU General Public License version 3, or any later version.
