
Better error handling for pre_run #220

Open
ml-evs opened this issue Nov 27, 2024 · 2 comments


@ml-evs (Member) commented Nov 27, 2024

In #160 I repeatedly ran into a hard-to-debug issue where SGE was defaulting to the wrong shell, and thus pre_run was failing. There was no useful user feedback for this: jobs would simply fail without producing any outputs. This should be checked explicitly in the code, with a better error message.
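For context, the usual workaround on SGE is to request the shell explicitly in the script header (e.g. `#$ -S /bin/bash`). As a rough illustration of what an explicit check could look like, the generated submission script could wrap the user's pre_run so that a failure aborts immediately with a recognisable message and exit code. The sketch below is hypothetical and does not reflect jobflow-remote's actual script template:

```python
# Hypothetical sketch only: jobflow-remote's real script template and option
# names may differ. The idea is to wrap the user's pre_run in the generated
# submission script so that a failure aborts immediately with an explicit
# message and a distinct exit code, instead of silently producing no outputs.

def wrap_pre_run(pre_run: str) -> str:
    """Return a bash fragment that runs ``pre_run`` and fails loudly on error."""
    return "\n".join(
        [
            "# --- pre_run (wrapped for explicit error reporting) ---",
            "if ! {",
            pre_run,
            "}; then",
            '    echo "ERROR: pre_run failed -- check the shell and environment setup" >&2',
            "    exit 101  # distinct exit code so pre_run failures can be told apart",
            "fi",
        ]
    )


if __name__ == "__main__":
    print(wrap_pre_run("source ~/.bashrc\nconda activate my_env"))
```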

@gpetretto (Contributor) commented

Thanks for reporting this issue. The problem here is that the pre_run is executed inside the submission script, and at the moment jobflow-remote only determines that the job did not complete successfully. I am not sure how it could tell whether the error happened during the pre_run or during the execution of the Job. I suppose in your case there were errors in both parts: an error while trying to activate the environment, and then an error for the unknown jf command, so it would still be hard to tell the cause.
In a normal execution the job would have gone into the REMOTE_ERROR state, and the user can probably figure out the origin of the problem relatively easily by inspecting the queue.out and queue.err files (there is also the jf job queue-out command for this). Maybe there should be something specific for the tests, where additional information is extracted from the container before it is switched off?
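One way the test suite could grab that extra information would be to copy queue.out/queue.err out of the scheduler container before tearing it down. The snippet below is only a sketch under assumed names: the container name, run directory and the `docker cp` approach are assumptions, not necessarily how the tests are set up.

```python
# Hypothetical sketch for the test suite: before the scheduler container is
# torn down, copy the queue output files out so a failed test still shows why
# the job never produced any outputs. Container name and paths are assumed.
import subprocess
from pathlib import Path


def dump_queue_files(container: str, remote_run_dir: str, dest: Path) -> None:
    """Copy queue.out/queue.err from the container and print them locally."""
    dest.mkdir(parents=True, exist_ok=True)
    for name in ("queue.out", "queue.err"):
        subprocess.run(
            ["docker", "cp", f"{container}:{remote_run_dir}/{name}", str(dest / name)],
            check=False,  # the file may not exist if the job never started
        )
        out_file = dest / name
        if out_file.exists():
            print(f"--- {name} ---")
            print(out_file.read_text())
```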

@ml-evs (Member, Author) commented Nov 27, 2024


Yeah, I was imagining some kind of preflight check, but that won't work well across different compute/login node environments. Perhaps we could just automatically do the equivalent of jf job queue-out when storing the error in the local database?
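Something along these lines, where the runner reads the tails of queue.out/queue.err while handling the failure and stores them next to the error. The host API, method name and error-document fields below are assumptions, not the real jobflow-remote internals:

```python
# Hypothetical sketch only: the host API, method names and error-document
# fields are assumptions. The idea is that, when a job is marked as failed,
# the runner also reads the tails of queue.out / queue.err from the remote
# run directory and stores them with the error, so the scheduler output is
# available locally without a separate jf job queue-out step.
def attach_queue_output(host, remote_dir: str, error_doc: dict, max_chars: int = 5000) -> dict:
    """Embed the tails of queue.out/queue.err into the stored error document."""
    for name in ("queue.out", "queue.err"):
        try:
            text = host.read_text(f"{remote_dir}/{name}")  # assumed host method
        except Exception:  # the file may be missing if the job never started
            text = ""
        error_doc[name.replace(".", "_")] = text[-max_chars:]
    return error_doc
```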
