What's still needed for running Flux as the system RM in production #5165

ryanday36 · 2023-05-09T23:58:19Z

ryanday36
May 9, 2023

Features / requirements for the Flux system instance that we want / need before we run it as the system level resource manager on production systems. This list is intended to include only feature requests, so it doesn't include issues like the recently discovered ability of users to hang up the KVS. Items with a '*' have relatively straightforward workarounds or are otherwise lower priority.

cluster management features

Restart without killing running jobs:

Related is the ability to change more configuration options without restarting, but being able to restart without killing jobs would make that less important.

Contain user jobs:

~~Limit users to X% of memory to avoid OOMs. Constrain jobs to assigned resources on clusters where users can share nodes.~~

Cluster usage:

Track overall cluster usage as a function of time for reporting (%down, %idle, %running, %reserved)

*Backup / archive job records:

Preferably in some sort of plain text. Note: I can do this outside of Flux.

*Flux administrators:

Give specified users the ability to run system level commands, see others' jobs, etc. Note: it might make more sense to just give specified users 'sudo flux ...'

user management features

Preemptable jobs:

AKA 'standby' qos / queue. Allow users to submit jobs that can be killed automatically by the system instance if another job needs the resources.

Exempt jobs:

Allow jobs to be modified to exceed ~~system and~~ accounting limits.

Exempt / expedite users:

Give specified user+banks the ability to submit jobs that are preemptable or exceed limits or have 'expedite' priority.

Advanced reservations (DATs):

Specify a subset of a cluster's resources that can only be accessed by a specified user list in a specified time range.

Enforce more limits:

Enforce Flux accounting limits that aren't yet enforced (e.g. total nodes across all of a user+bank's jobs).

*Projects (wckeys):

A means of tracking usage that is orthogonal to banks. Users are assigned projects that they may tag their jobs with, one of which is a default project that is assigned to jobs where the user doesn't specify a project. Note: I can do this outside of Flux.

*Usage reporting tools:

Collated usage by user+bank and user+project per day to allow fast usage reporting. Note: I can also do this outside of Flux.
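As a rough illustration of the kind of collation described for usage reporting, here is a minimal sketch that sums node-seconds per day and user+bank. The record fields (`user`, `bank`, `nnodes`, `t_start`, `t_end`) and the function name are assumptions made for this example, not a Flux accounting schema.

```python
from collections import defaultdict
from datetime import datetime, timezone


def collate_daily_usage(jobs):
    """Sum node-seconds per (day, user, bank).

    Simplification: a job is attributed entirely to the UTC day it started,
    even if it ran past midnight.
    """
    totals = defaultdict(float)
    for job in jobs:
        day = datetime.fromtimestamp(job["t_start"], tz=timezone.utc).date().isoformat()
        node_seconds = job["nnodes"] * (job["t_end"] - job["t_start"])
        totals[(day, job["user"], job["bank"])] += node_seconds
    return dict(totals)


# Hypothetical job records, both starting on 2023-05-10 (UTC).
jobs = [
    {"user": "alice", "bank": "bankA", "nnodes": 2, "t_start": 1683676800, "t_end": 1683680400},
    {"user": "alice", "bank": "bankA", "nnodes": 1, "t_start": 1683680400, "t_end": 1683682200},
]
usage = collate_daily_usage(jobs)
print(usage)
```

The same grouping would work for user+project by swapping the `bank` key for a project tag.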

job management options

Modify jobs:

Allow users to change ~~time limits~~, dependencies, job names, etc. on jobs after submission.

*Topology aware scheduling and binding:

Especially for CORAL 2, if a user asks for 2 cpus and 1 gpu, ensure that they're in the same NUMA domain. Knowing where NICs are could also be important.
Possibly related, on CORAL 2, users have asked for the ability to oversubscribe GPUs.
Note: mpibind does a lot of this for us at LLNL.

Signal jobs:

~~Allow users to send their jobs a signal at a specified time relative to the end time of their job for purposes of triggering restart dumps, etc.~~

Job resiliency:

Allow jobs to continue running if a node fails (if requested by the user).

Bank info:

~~Give users access to information about banks, usage, priority, etc. Note: this might be done. I need to play with the latest Flux accounting.~~

Email notification:

Allow users to request email notification for job events (start, end, fail, etc.).

python bindings in TCE:

~~I'm not sure how much this is on the Flux team and how much it's on DEG, but it would be good to get the /usr/tce python to be able to 'import flux' easily.~~

vsoch · 2023-05-10T00:31:34Z

vsoch
May 10, 2023
Maintainer

I think we are good for Python bindings! https://pypi.org/project/flux-python/ I've tested these on corona with a few pythons (including the one in the tce location I think?) so I think for a final step, I'd like to ask for a list of Python locations / versions / Flux versions that should be supported, and then I can write up really specific instructions for each, so it's a no-brainer.
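One way to check whether a particular interpreter can find the bindings is sketched below. The helper name `module_importable` is hypothetical, and the check only confirms that a top-level `flux` module is discoverable, not that it matches the installed flux-core.

```python
import importlib.util
import sys


def module_importable(name: str) -> bool:
    """Return True if this interpreter can locate a top-level module called `name`."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # "flux" is the module name used in this thread ('import flux').
    found = module_importable("flux")
    print(f"{sys.executable}: flux importable = {found}")
```

Running this script with each candidate interpreter gives a quick yes/no per Python installation.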

10 replies

lee218llnl May 10, 2023

[lee218@tioga11:~]$ ls -1 /usr/tce/packages/python/
python
python-2.7.18
python-3.10.8
python-3.9.12

You probably don't need to support python 2.7.18. The non-versioned python is just a pointer to 3.9.12, so in short, please test just the 3.9.12 and 3.10.8 versions. These tce pythons are the same python installations that are on non-Cray TOSS 4 systems.

vsoch May 10, 2023
Maintainer

@trws I like the idea of wheels, but it assumes that flux is in a hard coded (predetermined) path - right? Is that something we've standardized to allow us to do that even?

trws May 10, 2023
Maintainer

I don't think it has to. I need to do some work to make sure, but at a minimum we compile it without rpaths so that it will find whatever is in either the default library path or LD_LIBRARY_PATH, and then we might be able to configure it to set it to something specific on load.

garlick May 10, 2023
Maintainer

> You probably don't need to support python 2.7.18

Good! FYI flux-core requires python >= 3.6.
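To make the version constraint concrete, here is a small sketch that filters the directory listing quoted earlier in this thread against the stated minimum of Python 3.6. The function names are made up for illustration, and the unversioned `python` entry is skipped because it is only a pointer to another version.

```python
import re

MIN_VERSION = (3, 6)  # flux-core's minimum Python, as stated in this thread


def parse_version(dirname: str):
    """Extract (major, minor, patch) from names like 'python-3.9.12'; None if unversioned."""
    match = re.fullmatch(r"python-(\d+)\.(\d+)\.(\d+)", dirname)
    return tuple(int(part) for part in match.groups()) if match else None


def supported(dirnames, minimum=MIN_VERSION):
    """Return the versioned names whose (major, minor) is >= minimum, oldest first."""
    versions = [(parse_version(d), d) for d in dirnames]
    keep = [(v, d) for v, d in versions if v is not None and v[:2] >= minimum]
    return [d for v, d in sorted(keep)]


listing = ["python", "python-2.7.18", "python-3.10.8", "python-3.9.12"]
print(supported(listing))  # ['python-3.9.12', 'python-3.10.8']
```

Versions are compared as integer tuples rather than strings so that 3.10 sorts after 3.9.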

vsoch May 10, 2023
Maintainer

Hello thread! I'm done with the final builds / testing / documentation for the systems above, and this is probably ready for a DEG person to test:

https://github.com/flux-framework/flux-python/blob/main/LLNL.md

Please open any issues on the board there (and this is the same for users) and I will help promptly!

These are source builds (.tar.gz archives) but I'd like to start the wheels promptly. What I think we want to do is have several fluxenv images with different python versions, and per discussion about these base images, @grondo can you tell me:

Where they are currently being built
Where we can build them on GitHub for ghcr.io

And I'll start on that. The other option, if we don't want to refactor that fully, is to have the base images built alongside Flux python, and what I'll do is start with a modified fluxenv and install different pythons to it. This might actually be the better option to keep the system libraries the same on, for example, a common focal base, but then to vary the python version flux is installed alongside.

Let me know your thoughts / preferences!

grondo · 2023-05-10T17:32:37Z

grondo
May 10, 2023
Maintainer

@ryanday36. On Job Resiliency - just so we capture the requirement - is the option for a job to continue after a node failure required just for batch jobs (we have support for that almost 100% complete) or for normal flux run, flux submit jobs as well?

3 replies

ryanday36 May 10, 2023
Author

I think just batch/alloc jobs. That would support the use cases that I'm aware of, so I'd definitely call this done once we support it for batch jobs and wait for someone to make a case for it before trying to argue for anything with run/submit jobs.

garlick May 10, 2023
Maintainer

To be clear, support for resilient batch/alloc jobs means the flux subinstance can continue to function after losing a non-critical node. It doesn't necessarily mean that a "full size" MPI job running in the instance has any support for resiliency - the default behavior within the subinstance is the same as the system instance: if a rank terminates early, the job is (eventually) terminated.

Do we need something like slurm's --kill-on-bad-exit=0 to complete the picture? Or is the SCR case covered by the above and so we're good as is?

ryanday36 May 10, 2023
Author

I don't want to say don't do --kill-on-bad-exit=0, because there's bound to be someone who really wants it, but as far as I know it's not a high priority.

chu11 · 2023-05-10T17:57:56Z

chu11
May 10, 2023
Maintainer

@ryanday36 On Backup / archive job records, "Preferably in some sort of plain text." Do you mean just not in a binary blob, so it's easier for others to read directly from the database? Or do you mean easily accessible info for users from a database, e.g. a giant json blob isn't good, but "nnodes" and "nodelist" in separate database columns is ok?

In #4336 (WIP) the database columns I did were:


const char *sql_create_table = "CREATE TABLE if not exists jobs("
                               "  id CHAR(16) PRIMARY KEY,"
                               "  t_inactive REAL,"
                               "  jobdata JSON,"
                               "  eventlog TEXT,"
                               "  jobspec JSON,"
                               "  R JSON"
                               ");";
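To illustrate the trade-off in the question above, here is a minimal sketch using Python's standard `sqlite3` module with the same column layout. The sample record and the keys inside `jobdata` are invented for this example. It shows that, with this schema, a value like "nnodes" is only reachable by parsing the JSON column.

```python
import json
import sqlite3

# Same columns as the C string above; SQLite accepts JSON as a declared type name.
CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS jobs(
  id CHAR(16) PRIMARY KEY,
  t_inactive REAL,
  jobdata JSON,
  eventlog TEXT,
  jobspec JSON,
  R JSON
);
"""

conn = sqlite3.connect(":memory:")
conn.execute(CREATE_TABLE)

# Hypothetical record: the keys inside jobdata are made up for illustration.
jobdata = {"nnodes": 2, "nodelist": "node[1-2]"}
conn.execute(
    "INSERT INTO jobs VALUES (?, ?, ?, ?, ?, ?)",
    ("f1234", 1683676800.0, json.dumps(jobdata), "", "{}", "{}"),
)

# nnodes is not its own column here, so a reader has to parse the JSON to get it.
row = conn.execute("SELECT jobdata FROM jobs WHERE id = ?", ("f1234",)).fetchone()
nnodes = json.loads(row[0])["nnodes"]
print(nnodes)
```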

2 replies

ryanday36 May 10, 2023
Author

The use case for this one is for five or ten years from now when someone wants to know how job sizes have changed as we've moved through hardware generations or similar. I recently got a bit burned by all of our Slurm db archives being binary blobs and not being able to read back some of them in with the current combination of Slurm and mariadb versions in TOSS. I probably could've gotten them read back in by getting someone to spin me up a VM with older software versions, but fortunately no one wanted the data that badly. I now just have cron jobs that write job data out to text files every month, which makes me feel better. I've got the '*' on this one because I'll probably do the same thing with Flux in addition to Flux's backups / archives.

chu11 May 10, 2023
Maintainer

Understood. We've had some discussions (#4914 for starters) about how to do long term job storage. I'm reluctant to jump on a DB right now for this exact reason: we don't want to jump on a solution and change it later.

Since we do store almost everything in json, wondering if there's some "long term archiving" kinda format we could do that is just text files. I'm sure there's some common format / lib out there somewhere.
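One candidate for a text-only archive is JSON Lines (one JSON object per line), which can be written and read back with nothing but a JSON parser. The sketch below is an illustration under that assumption, not a format the Flux team has chosen. The record fields, function names, and file name are hypothetical.

```python
import json
import os
import tempfile


def archive_jobs(records, path):
    """Write one JSON object per line so the file stays readable as plain text."""
    with open(path, "w", encoding="utf-8") as out:
        for record in records:
            out.write(json.dumps(record, sort_keys=True) + "\n")


def load_archive(path):
    """Read the archive back, skipping any blank lines."""
    with open(path, encoding="utf-8") as src:
        return [json.loads(line) for line in src if line.strip()]


# Hypothetical job records for one month.
records = [{"id": "f1", "nnodes": 4}, {"id": "f2", "nnodes": 16}]
path = os.path.join(tempfile.mkdtemp(), "jobs-2023-05.jsonl")
archive_jobs(records, path)
restored = load_archive(path)
print(restored == records)
```

Because each line is independent, monthly files can be concatenated or searched with ordinary text tools without depending on a particular database version.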

ryanday36 · 2024-01-29T17:40:11Z

ryanday36
Jan 29, 2024
Author

I updated this to strike through things that I believe are done. Let me know if there's anything done that I missed. Hopefully we can talk about it in the team meeting this week.

1 reply

grondo Jan 31, 2024
Maintainer

Other items that could be considered complete:

Signal jobs (See --signal=SIG@TIME documentation)
Job resiliency: batch/alloc jobs will automatically continue when losing a non-critical broker. A job resistant to any node failure except rank 0 can be launched by choosing a flat tree topology, e.g. kary:N where N is at least the number of brokers.
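On the application side, the "Signal jobs" item only helps if the job reacts to the signal. Below is a minimal sketch of a job script that writes a restart dump when it receives a warning signal. It assumes the job was submitted so that SIGUSR1 arrives before the time limit. The step loop and the simulated delivery via `os.kill` are illustrative stand-ins, not Flux behavior.

```python
import os
import signal

checkpoint_requested = False


def request_checkpoint(signum, frame):
    """Signal handler: only set a flag; the main loop does the actual work."""
    global checkpoint_requested
    checkpoint_requested = True


signal.signal(signal.SIGUSR1, request_checkpoint)


def run_steps(n_steps):
    """Run a fake compute loop and record the steps at which a dump was written."""
    global checkpoint_requested
    dumps = []
    for step in range(n_steps):
        if step == 0:
            # Stand-in for the resource manager delivering the warning signal.
            os.kill(os.getpid(), signal.SIGUSR1)
        if checkpoint_requested:
            dumps.append(step)  # a real job would write its restart dump here
            checkpoint_requested = False
    return dumps


dumps = run_steps(5)
print(f"restart dumps written: {len(dumps)}")
```

Keeping the handler to a single flag assignment avoids doing I/O inside the signal handler itself.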
