-
I think we are good for Python bindings! https://pypi.org/project/flux-python/ I've tested these on corona with a few Pythons (including the one in the TCE location, I think), so as a final step I'd like to ask for a list of the Python locations / versions / Flux versions that should be supported. Then I can write up really, really specific instructions for each, so it's a no-brainer.
-
@ryanday36 On Job Resiliency - just so we capture the requirement - is the option for a job to continue after a node failure required just for batch jobs (we have support for that almost 100% complete) or for normal
-
@ryanday36 On Backup / archive job records, "Preferably in some sort of plain text." Do you mean just not in a binary blob, so it's easier for others to read directly from the database? Or do you mean easily accessible info for users from a database, e.g. a giant JSON blob isn't good, but "nnodes" and "nodelist" in separate database columns is OK? In #4336 (WIP) the database columns I did were:
-
I updated this to strike through things that I believe are done. Let me know if there's anything done that I missed. Hopefully we can talk about it in the team meeting this week.
-
Features / requirements for the Flux system instance that we want / need before we run it as the system level resource manager on production systems. This list is intended to include only feature requests, so it doesn't include issues like the recently discovered ability of users to hang up the KVS. Items with a '*' have relatively straightforward workarounds or are otherwise lower priority.
cluster management features
Restart without killing running jobs:
Related is the ability to change more configuration options without restarting, but being able to restart without killing jobs would make that less important.
Contain user jobs:
Limit users to X% of memory to avoid OOMs. Constrain jobs to assigned resources on clusters where users can share nodes.
Cluster usage:
Track overall cluster usage as a function of time for reporting (%down, %idle, %running, %reserved)
*Backup / archive job records:
Preferably in some sort of plain text. Note: I can do this outside of Flux.
*Flux administrators:
Give specified users the ability to run system level commands, see others' jobs, etc. Note: it might make more sense to just give specified users 'sudo flux ...'
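As a rough illustration of the cluster usage item above, the sketch below turns one sample of node states into the four percentages listed there. The state names come from the list; the sample format and the `usage_percentages` helper are hypothetical and are not part of Flux.

```python
from collections import Counter

# The four states named in the cluster usage item.
STATES = ("down", "idle", "running", "reserved")


def usage_percentages(node_states):
    """Return the percentage of nodes in each state for one sample.

    node_states maps a node name to one of STATES (hypothetical format).
    """
    counts = Counter(node_states.values())
    unknown = set(counts) - set(STATES)
    if unknown:
        raise ValueError(f"unknown node states: {sorted(unknown)}")
    total = len(node_states)
    if total == 0:
        return {state: 0.0 for state in STATES}
    return {state: 100.0 * counts[state] / total for state in STATES}


# One sample taken at some timestamp; a time series is a list of these.
sample = {"node1": "running", "node2": "running", "node3": "idle", "node4": "down"}
print(usage_percentages(sample))
```

Recording one such sample per interval, keyed by timestamp, would give usage as a function of time for reporting.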
user management features
Preemptable jobs:
AKA 'standby' qos / queue. Allow users to submit jobs that can be killed automatically by the system instance if another job needs the resources.
Exempt jobs:
Allow jobs to be modified to exceed system and accounting limits.
Exempt / expedite users:
Give specified user+banks the ability to submit jobs that are preemptable or exceed limits or have 'expedite' priority.
Advanced reservations (DATs):
Specify a subset of a cluster's resources that can only be accessed by a specified user list in a specified time range.
Enforce more limits:
Enforce Flux accounting limits that aren't yet enforced (e.g. total nodes across all of a user+bank's jobs)
*Projects (wckeys):
A means of tracking usage that is orthogonal to banks. Users are assigned projects that they may tag their jobs with, one of which is a default project that is assigned to jobs where the user doesn't specify a project. Note: I can do this outside of Flux.
*Usage reporting tools:
Collated usage by user+bank and user+project per day to allow fast usage reporting. Note: I can also do this outside of Flux.
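For the usage reporting item above, here is a minimal sketch of per-day collation by user+bank. The record fields (`user`, `bank`, `nnodes`, `t_start`, `t_end`) are assumptions made for illustration, not a Flux schema, and each job is attributed entirely to the UTC day it started, which is a simplification.

```python
from collections import defaultdict
from datetime import datetime, timezone


def collate_daily_usage(job_records):
    """Sum node-seconds per (day, user, bank).

    Each record is a dict with hypothetical keys: user, bank, nnodes,
    t_start and t_end (both in seconds since the epoch).
    """
    totals = defaultdict(float)
    for rec in job_records:
        # Attribute the whole job to the UTC day on which it started.
        day = datetime.fromtimestamp(rec["t_start"], tz=timezone.utc).date().isoformat()
        node_seconds = rec["nnodes"] * (rec["t_end"] - rec["t_start"])
        totals[(day, rec["user"], rec["bank"])] += node_seconds
    return dict(totals)


records = [
    {"user": "alice", "bank": "bankA", "nnodes": 2, "t_start": 0, "t_end": 3600},
    {"user": "alice", "bank": "bankA", "nnodes": 1, "t_start": 7200, "t_end": 9000},
    {"user": "bob", "bank": "bankB", "nnodes": 4, "t_start": 86400, "t_end": 86460},
]
for key, node_seconds in sorted(collate_daily_usage(records).items()):
    print(key, node_seconds)
```

Collating by user+project would work the same way with a project field in place of the bank.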
job management options
Modify jobs:
Allow users to change time limits, dependencies, job names, etc. on jobs after submission.
*Topology aware scheduling and binding:
Especially for CORAL 2, if a user asks for 2 cpus and 1 gpu, ensure that they're in the same NUMA domain. Knowing where NICs are could also be important.
Possibly related, on CORAL 2, users have asked for the ability to oversubscribe GPUs.
Note: mpibind does a lot of this for us at LLNL.
Signal jobs:
Allow users to send their jobs a signal at a specified time relative to the end time of their job for purposes of triggering restart dumps, etc.
Job resiliency:
Allow jobs to continue running if a node fails (if requested by the user).
Bank info:
Give users access to information about banks, usage, priority, etc. Note: this might be done. I need to play with the latest Flux accounting.
Email notification:
Allow users to request email notification for job events (start, end, fail, etc.).
python bindings in TCE:
I'm not sure how much this is on the Flux team and how much it's on DEG, but it would be good to get the /usr/tce python to be able to 'import flux' easily.
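One quick way to check the last item for a given interpreter is to ask it whether a top-level `flux` module is importable. This is only a sketch using the standard library; the `flux_bindings_available` helper is hypothetical, and it cannot tell the Flux bindings apart from some unrelated package that happens to be named `flux`.

```python
import importlib.util
import sys


def flux_bindings_available(module_name="flux"):
    """Return True if module_name can be located by this interpreter."""
    return importlib.util.find_spec(module_name) is not None


# Run this with the interpreter in question, e.g. the /usr/tce python.
print(sys.executable, flux_bindings_available())
```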