
Better error handling for pre_run #220

Open
ml-evs opened this issue Nov 27, 2024 · 2 comments


@ml-evs (Member) commented Nov 27, 2024

In #160 I repeatedly ran into a hard-to-debug issue where SGE was defaulting to the wrong shell, and thus pre_run was failing. There was no useful user feedback for this: jobs would simply fail without producing any outputs. This should be checked explicitly in the code, with a better error message.
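For context, the usual workaround on SGE is to request the shell explicitly in the script header (e.g. `#$ -S /bin/bash`). As a rough illustration of what an explicit check could look like, the generated submission script could wrap the user's pre_run so that a failure aborts immediately with a recognisable message and exit code. The sketch below is hypothetical and does not reflect jobflow-remote's actual script template:

```python
# Hypothetical sketch only: jobflow-remote's real script template and option
# names may differ. The idea is to wrap the user's pre_run in the generated
# submission script so that a failure aborts immediately with an explicit
# message and a distinct exit code, instead of silently producing no outputs.

def wrap_pre_run(pre_run: str) -> str:
    """Return a bash fragment that runs ``pre_run`` and fails loudly on error."""
    return "\n".join(
        [
            "# --- pre_run (wrapped for explicit error reporting) ---",
            "if ! {",
            pre_run,
            "}; then",
            '    echo "ERROR: pre_run failed -- check the shell and environment setup" >&2',
            "    exit 101  # distinct exit code so pre_run failures can be told apart",
            "fi",
        ]
    )


if __name__ == "__main__":
    print(wrap_pre_run("source ~/.bashrc\nconda activate my_env"))
```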

@gpetretto (Contributor) commented

Thanks for reporting this issue. The problem here is that the pre_run is executed inside the submission script, and at the moment jobflow-remote only determines that the job did not complete successfully. I am not sure how it could tell whether the error happened during the pre_run or during the execution of the Job. I suppose in your case there were errors in both parts: an error while trying to activate the environment, and then an error for the unknown jf command, so it would still be hard to tell the cause.
In a normal execution the job would have gone into the REMOTE_ERROR state, and the user can probably figure out the origin of the problem relatively easily by inspecting the queue.out and queue.err files (there is also the jf job queue-out command for this). Maybe there should be something specific for the tests, where additional information is extracted from the container before it is switched off?
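One way the test suite could grab that extra information would be to copy queue.out/queue.err out of the scheduler container before tearing it down. The snippet below is only a sketch under assumed names: the container name, run directory and the `docker cp` approach are assumptions, not necessarily how the tests are set up.

```python
# Hypothetical sketch for the test suite: before the scheduler container is
# torn down, copy the queue output files out so a failed test still shows why
# the job never produced any outputs. Container name and paths are assumed.
import subprocess
from pathlib import Path


def dump_queue_files(container: str, remote_run_dir: str, dest: Path) -> None:
    """Copy queue.out/queue.err from the container and print them locally."""
    dest.mkdir(parents=True, exist_ok=True)
    for name in ("queue.out", "queue.err"):
        subprocess.run(
            ["docker", "cp", f"{container}:{remote_run_dir}/{name}", str(dest / name)],
            check=False,  # the file may not exist if the job never started
        )
        out_file = dest / name
        if out_file.exists():
            print(f"--- {name} ---")
            print(out_file.read_text())
```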

@ml-evs (Member, Author) commented Nov 27, 2024


Yeah, I was imagining some kind of preflight check, but that won't work well across different compute/login node environments. Perhaps we could just automatically do the equivalent of jf job queue-out when storing the error in the local database?
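Something along these lines, where the runner reads the tails of queue.out/queue.err while handling the failure and stores them next to the error. The host API, method name and error-document fields below are assumptions, not the real jobflow-remote internals:

```python
# Hypothetical sketch only: the host API, method names and error-document
# fields are assumptions. The idea is that, when a job is marked as failed,
# the runner also reads the tails of queue.out / queue.err from the remote
# run directory and stores them with the error, so the scheduler output is
# available locally without a separate jf job queue-out step.
def attach_queue_output(host, remote_dir: str, error_doc: dict, max_chars: int = 5000) -> dict:
    """Embed the tails of queue.out/queue.err into the stored error document."""
    for name in ("queue.out", "queue.err"):
        try:
            text = host.read_text(f"{remote_dir}/{name}")  # assumed host method
        except Exception:  # the file may be missing if the job never started
            text = ""
        error_doc[name.replace(".", "_")] = text[-max_chars:]
    return error_doc
```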
