Slurm Metrics - sacct command to get job historic data #856

Open
abujeda opened this issue Oct 22, 2024 · 3 comments · May be fixed by #857

Comments

@abujeda

abujeda commented Oct 22, 2024

At Harvard IQSS, we are working on a metrics widget for Open OnDemand.
After evaluating different data sources for the metrics, we have settled on getting them directly from Slurm.

As a proof of concept, we created a script that uses the sacct command to extract the relevant historic data from Slurm and build the metrics required for our new dashboard widget. The historic data will cover three periods: the last 30 days, the last 7 days, and the last 24 hours.
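A minimal sketch of what such a script could look like, assuming Python. This is not the actual IQSS script: the function names, the period keys, and the choice of fields are illustrative assumptions. The flags and field names used are documented sacct options.

```python
from datetime import datetime, timedelta

# Assumed field selection for illustration; each is a documented sacct --format field.
FIELDS = ["JobID", "State", "AllocCPUS", "ElapsedRaw", "TimelimitRaw", "Submit", "Start"]

# The three reporting periods described above (the key names are hypothetical).
PERIODS = {
    "30d": timedelta(days=30),
    "7d": timedelta(days=7),
    "24h": timedelta(hours=24),
}


def build_sacct_command(period, now=None):
    """Return the sacct argument list for one reporting period."""
    now = now or datetime.now()
    start = now - PERIODS[period]
    return [
        "sacct",
        "--allocations",   # one row per job, no per-step rows
        "--parsable2",     # pipe-delimited output without a trailing delimiter
        "--noheader",
        "--starttime", start.strftime("%Y-%m-%dT%H:%M:%S"),
        "--endtime", now.strftime("%Y-%m-%dT%H:%M:%S"),
        "--format", ",".join(FIELDS),
    ]


def parse_sacct_output(text):
    """Turn pipe-delimited sacct output into a list of dicts keyed by field name."""
    records = []
    for line in text.splitlines():
        values = line.split("|")
        if len(values) != len(FIELDS):
            continue  # skip blank or malformed lines
        records.append(dict(zip(FIELDS, values)))
    return records
```

On a host with Slurm installed, the command could be run with `subprocess.run(build_sacct_command("7d"), capture_output=True, text=True, check=True)` and its `stdout` passed to `parse_sacct_output`.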

A summary of our metrics requirements:

  • Number of CPU/GPU jobs by State: Completed, Timeout, Canceled, Out of Memory, Failed
  • CPU Metrics: Average Used, Average Efficiency, Average Allocated, Total Walltime
  • GPU Metrics: Average Allocated, Total Usage
  • Memory Metrics: Average Used, Average Efficiency, Average Allocated, Total Used
  • Time Metrics: Average Used, Average Efficiency, Average Allocated, Average Waiting Time
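To illustrate how a few of these metrics could be derived from parsed sacct records, here is a rough sketch in Python. It covers only job counts by state, average waiting time, and average time efficiency. The record layout, the sample data, and the `summarize` function are assumptions made for illustration, not the widget's actual code. It relies on `ElapsedRaw` being reported in seconds and `TimelimitRaw` in minutes.

```python
from collections import Counter
from datetime import datetime

# Made-up sample records in the shape a sacct parser might produce.
SAMPLE = [
    {"State": "COMPLETED", "ElapsedRaw": "3600", "TimelimitRaw": "120",
     "Submit": "2024-10-21T10:00:00", "Start": "2024-10-21T10:05:00"},
    {"State": "CANCELLED by 1000", "ElapsedRaw": "0", "TimelimitRaw": "UNLIMITED",
     "Submit": "2024-10-21T11:00:00", "Start": "Unknown"},
    {"State": "TIMEOUT", "ElapsedRaw": "7200", "TimelimitRaw": "120",
     "Submit": "2024-10-21T12:00:00", "Start": "2024-10-21T12:15:00"},
]


def _mean(values):
    return sum(values) / len(values) if values else None


def summarize(records):
    """Count jobs by state and average the waiting time and time efficiency."""
    states = Counter()
    waits, efficiencies = [], []
    for rec in records:
        # "CANCELLED by <uid>" is normalized to "CANCELLED".
        states[rec["State"].split()[0]] += 1
        try:
            submit = datetime.fromisoformat(rec["Submit"])
            start = datetime.fromisoformat(rec["Start"])
            waits.append((start - submit).total_seconds())
        except ValueError:
            pass  # the job never started, so Start is not a timestamp
        try:
            limit_seconds = int(rec["TimelimitRaw"]) * 60
            if limit_seconds > 0:
                efficiencies.append(int(rec["ElapsedRaw"]) / limit_seconds)
        except ValueError:
            pass  # non-numeric time limit such as "UNLIMITED"
    return {
        "jobs_by_state": dict(states),
        "avg_wait_seconds": _mean(waits),
        "avg_time_efficiency": _mean(efficiencies),
    }
```

Jobs without a usable start time or time limit are left out of the corresponding average rather than counted as zero, which is one possible design choice among several.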

As an OnDemand prototype, we have created a Slurm adapter extension, applied via a monkey patch, that executes the relevant sacct command, as well as a widget that calculates the metrics and displays the results.

We will be creating a PR for the Slurm adapter changes to support this sacct command as we think it would be useful for other institutions.

@abujeda
Author

abujeda commented Oct 22, 2024

This is the OnDemand prototype that we created:
[Image: Screenshot 2024-10-22 at 12 37 59]

@johrstrom
Contributor

> We will be creating a PR for the Slurm adapter changes to support this sacct command as we think it would be useful for other institutions.

Historic info is something I've been thinking about too. I think it should be a new API like historic_info to separate the two.

@abujeda
Author

abujeda commented Oct 23, 2024

Created a PR with our IQSS-specific implementation: #857
Struggling to make it Slurm-agnostic.
