feat: expose function for listening to policy violations on a specific GPU group #73

Merged: 1 commit, Sep 10, 2024

sanjams2 (Contributor) commented on Sep 9, 2024

== Motivation ==

Enable finer-grained GPU policy violation tracking.

== Details ==

The go-dcgm library currently exposes a way to listen for policy violations across all GPUs. While this is useful, it does not identify which GPUs are experiencing issues. Ideally, each policy violation would carry identifying GPU information, but today it does not appear to (struct definitions). Instead, users could listen for policy violations on groups created for specific GPUs, which would tell them when a specific GPU is experiencing issues.

This change exposes a new function, ListenForPolicyViolationsForGroup, which takes a user-supplied GroupHandle and listens for policy violations on that group. It also modifies ListenForPolicyViolations to call the new function with the group for all GPUs, so there is no net change in behavior.

Signed-off-by: sanjams2 <sanjams2@users.noreply.github.com>

…c GPU group

== Motivation ==

Enable finer-grained GPU policy violation tracking

== Details ==

The current go-dcgm library exposes a way to listen for policy violations across
all GPUs. While this is useful, it does not let users determine exactly which
GPUs are experiencing issues. Ideally, users would also be able to listen for
policy violations on specific groups, which could be created on a per-GPU basis.
This would let users know when specific GPUs were experiencing issues.

This change exposes a new function, ListenForPolicyViolationsForGroup, which takes a
GroupHandle passed by the user and listens for policy violations on that group. It
also modifies ListenForPolicyViolations to use this new function, specifying the
group for all GPUs, so there is no net change in behavior.

nvvfedorov (Collaborator) commented

@sanjams2, thank you for the PR. Please sign your PR: https://github.com/NVIDIA/go-dcgm/blob/main/CONTRIBUTING.md.

nvvfedorov self-assigned this on Sep 9, 2024

nvvfedorov (Collaborator) left a comment

LGTM

nvvfedorov merged commit 85ceb31 into NVIDIA:main on Sep 10, 2024
1 check passed