Generalize System Status app for use at other sites #92

Open · 3 tasks
msquee (Contributor) opened this issue Sep 25, 2020 · 10 comments

More sites have expressed interest in the System Status app [1], but there is still OSC-specific code in the latest version. Let's generalize the status app so it can be dropped in and deployed at other sites.

The ideal scenario would be to support all of the adapters that Open OnDemand supports, but I think focusing on SLURM clusters is a good place to start.

Todo (WIP):

  • Merge gpu_cluster_status.rb into moab_showq_client.rb and rename to torque_moab_client.rb (omitting any OSC-specific items)
  • Add a Torque-only adapter: several interested sites have Torque but might not have Moab. We could get the aggregate job information by just parsing info_all (which is slower, but not that much slower) or by using qstat directly with arguments to display server status; from the qstat man page (a rough Ruby sketch follows this list):
    Displaying Server Status
    
    If batch server status is being displayed and the -f option is not specified, the following items are
    displayed on a single line, in the specified order, separated by white space:
    
         -      the server name
    
         -      the maximum number of jobs that the server may run concurrently
    
         -      the total number of jobs currently managed by the server
    
         -      the status of the server
    
         -      for each job state, the name of the state and the number of jobs in the server in that state
    
  • Support specifying partitions to display in system status. Perhaps the cluster config could have a custom: systemstatus: partitions: [ serial, parallel ] entry; if it exists, we create a separate graph for each partition and constrain the sinfo/squeue calls to those partitions.
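As a rough illustration of the Torque-only adapter idea above (not actual adapter code), aggregating job counts by state from plain qstat output could look like the sketch below. The "Job ID / Name / User / Time Use / S / Queue" column layout is an assumption and may need adjusting per site.

```ruby
# Rough sketch only: count Torque jobs by state by parsing plain `qstat` output.
require "open3"

def torque_job_counts
  out, err, status = Open3.capture3("qstat")
  raise "qstat failed: #{err}" unless status.success?

  counts = Hash.new(0)
  out.each_line do |line|
    fields = line.split
    # Skip header/separator lines; data lines have at least 6 fields and a
    # single-letter job state (e.g. R, Q, H) in the fifth column.
    next unless fields.size >= 6 && fields[4] =~ /\A[A-Z]\z/
    counts[fields[4]] += 1
  end
  counts
end

# Example result: { "R" => 120, "Q" => 35, "H" => 2 }
```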

[1] https://discourse.osc.edu/t/system-status-app/1129

@msquee msquee self-assigned this Sep 25, 2020
msquee commented Sep 25, 2020

Would it be best to remove Ganglia and focus on supporting Grafana?

achalker commented Sep 25, 2020 via email

@ericfranz (Contributor)

> Make https://github.com/OSC/osc-systemstatus/blob/master/views/layout.erb#L164 configurable via ENV.

Just remove that.

@ericfranz (Contributor)

> Support setting custom colors on graphs

What did you have in mind here? I think the MVP might not require this.

msquee commented Sep 25, 2020

@ericfranz It's not required. Since there's support for customization on the Dashboard, we could bring that functionality here eventually.

mcuma commented Sep 25, 2020

We do have Ganglia, but I know next to nothing about it, though we could probably plug it in. I would vote for making it optional, if possible.

What I was mostly after is output from sinfo to see what resources are available at a given time, so that people could decide what cluster and partition to use to get their job running ASAP. E.g. for our simplest cluster, sinfo gives this:
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
lonepeak* up 3-00:00:00 1 drain* lp203
lonepeak* up 3-00:00:00 210 alloc lp[001-089,091-112,133-202,204-232]
lonepeak* up 3-00:00:00 1 idle lp090
lonepeak-shared up 3-00:00:00 1 drain* lp203
lonepeak-shared up 3-00:00:00 210 alloc lp[001-089,091-112,133-202,204-232]
lonepeak-shared up 3-00:00:00 1 idle lp090
lonepeak-guest up 3-00:00:00 21 alloc lp[113-132,233]
lonepeak-shared-guest up 3-00:00:00 21 alloc lp[113-132,233]
liu-lp up 14-00:00:0 20 alloc lp[113-132]
liu-shared-lp up 14-00:00:0 20 alloc lp[113-132]
fischer-lp up 14-00:00:0 1 alloc lp233
fischer-shared-lp up 14-00:00:0 1 alloc lp233

We have two main partition groups: "lonepeak", with the synonym "lonepeak-shared" for jobs that can share a node, and the "owner" partitions, which consist of the liu-lp and fischer-lp nodes and their shared synonyms, plus guest access to the owner nodes (lonepeak-guest) and its shared synonym.

So, in the simplest case, we could report the status of the "lonepeak" and "lonepeak-guest" partitions (alloc, idle, drain, mix = partially occupied), and potentially how busy each owner partition is, since guests sometimes target specific owner nodes for a smaller chance of preemption (owner jobs preempt guest jobs).
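For example, a minimal sketch (assuming sinfo is on the PATH; the partition names are just our local examples) of the kind of per-partition summary I mean:

```ruby
# Minimal sketch: summarize node states for a single Slurm partition so a
# separate graph could be drawn per partition (e.g. lonepeak vs lonepeak-guest).
require "open3"

def partition_node_states(partition)
  out, err, status = Open3.capture3(
    "sinfo", "--noheader", "-p", partition, "-o", "%t %D"
  )
  raise "sinfo failed: #{err}" unless status.success?

  out.each_line.each_with_object(Hash.new(0)) do |line, counts|
    state, nodes = line.split
    next unless state && nodes
    # Drop the "*" that sinfo appends to states of non-responding nodes ("drain*").
    counts[state.delete("*")] += nodes.to_i
  end
end

# partition_node_states("lonepeak")
# # => {"alloc"=>210, "idle"=>1, "drain"=>1} for the sinfo output above
```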

I hope this helps with the generalization strategy, or possibly with making some plug-ins for site-specific stuff like ours. And feel free to let me know if I can help with anything.

@ericfranz (Contributor)

@mcuma we made a few quick changes to the app, which now at least runs at CHPC. Here is a screenshot:

[screenshot: System Status graphs per cluster, taken 2020-09-25 at 12:36 PM]

If you just get the latest code from the master branch and touch tmp/restart.txt, it should run. If you are updating a previously cloned version, you will need to rm -rf .bundle and rm -rf vendor/bundle.

Now, that said, as you can see from the screenshot, it just builds these graphs for each cluster, not for partitions of a specific cluster. It seems like what you are looking for might be best served by a custom widget, once we are able to easily support that type of thing in OnDemand.

@ericfranz (Contributor)

Or maybe we are talking about the same graphs as above, but with the ability to make graphs per partition instead of per cluster, or to pick the cluster and partitions to graph?

mcuma commented Sep 25, 2020

Great, let me try that and I'll let you know how it goes. From the screenshot it looks good enough for now. I may hack on it to get the two partitions (lonepeak, lonepeak-guest) separated if I get a chance; I should be able to do it from skimming the code.

mcuma commented Sep 25, 2020

I can confirm that System Status works on both our test and production servers. Thanks for getting this fixed so quickly.
