Additional manual jobs cause REMOTE_ERROR state for jobflow-remote jobs when submission maximum is reached
#135
Comments
I have no example at hand now but the error I retrieved via
Hi, I am not sure if this will solve your issue, but it is possible to set a maximum number of jobs submitted to a worker:
Hi! I have already set this maximum number of jobs, because I have far more jobs than the limit. Unfortunately it doesn't help, so I would really appreciate a way to keep track of all the jobs in the cluster (and in a certain queue).
I see. I suspected this might be the case.
Oh, I see. That does not sound easy to implement.
Hi 😀
We have a submission limit of 40 or 20 jobs (depending on the queue of our HPC cluster). When I manually start additional VASP jobs and that limit is reached, the submission of the next jobs from the jobflow-remote queue fails and they go into the REMOTE_ERROR state. I then have to retry those jobs once I'm below the limit again, which means I cannot let the workflow run overnight or over the weekend and have to watch it constantly. I'm using the interactive branch.
Do you have an idea how to solve this problem? I temporarily solved it with a bash line that retries all jobs in the REMOTE_ERROR state every few hours, but that is not really a desirable solution. Did you ever face a similar issue?
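The periodic-retry workaround mentioned above could be made a little less blind by checking the scheduler first and retrying only when there is room below the submission limit. The following is a minimal sketch, not part of jobflow-remote. It assumes a SLURM cluster (`squeue`), and it assumes that `jf job retry -s REMOTE_ERROR` is the retry command available in your installed version; the function names `count_scheduler_jobs`, `retry_if_room`, and `watch` are made up for illustration.

```python
import subprocess
import time
from typing import Callable, Sequence

# A runner takes a command as a list of strings and returns its stdout.
Runner = Callable[[Sequence[str]], str]


def run(cmd: Sequence[str]) -> str:
    """Run a command and return its stdout; raises if the command fails."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


def count_scheduler_jobs(user: str, runner: Runner = run) -> int:
    """Count all of the user's jobs known to the scheduler.

    Assumption: SLURM. `squeue -h -u USER` prints one line per job
    without a header, so manually submitted jobs are counted as well.
    """
    out = runner(["squeue", "-h", "-u", user])
    return sum(1 for line in out.splitlines() if line.strip())


def retry_if_room(user: str, limit: int, runner: Runner = run) -> bool:
    """Retry REMOTE_ERROR jobs only if the user is below the limit.

    Returns True if a retry was issued, False otherwise.
    """
    if count_scheduler_jobs(user, runner) >= limit:
        return False
    # Assumption: this CLI call exists in the installed jobflow-remote version.
    runner(["jf", "job", "retry", "-s", "REMOTE_ERROR"])
    return True


def watch(user: str, limit: int, interval_s: int = 600) -> None:
    """Check periodically; stop with Ctrl+C."""
    while True:
        retry_if_room(user, limit)
        time.sleep(interval_s)
```

Calling, for example, `watch(getpass.getuser(), limit=40)` would check every ten minutes. Retried jobs can still fail if a manual submission takes the free slot between the check and the retry, so this only reduces how often jobs end up in REMOTE_ERROR; it does not replace limit tracking inside jobflow-remote itself.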