Build fail on node quits badly. #763

Open
dmageeLANL opened this issue Apr 12, 2024 · 2 comments
@dmageeLANL
Collaborator

When a test builds on a node and the build fails, the status file looks like this:

1712858170.457788 STATUS_CREATED Created status file.
1712858170.458920 CREATED Test directory and status file created.
1712858170.637588 BUILD_CREATED Builder created.
1712858170.642490 CREATED Test directory setup complete.
1712858171.290539 SCHEDULED Test kicked off (individually (flex)) under slurm scheduler.
1712858553.553697 BUILD_CREATED Builder created.
1712858904.348879 BUILD_CREATED Builder created.
1712858905.195227 BUILD_WAIT Waiting on lock for build 20796ce600e76a40.
1712858905.195858 BUILDING Starting build 20796ce600e76a40.
1712858905.196529 BUILDING Copying source directory /usr/projects/hpcml/test/dmagee/venado-acceptance/test_src/parthenon for build /usr/projects/hpcml/test/dmagee/venado-acceptance/../working_dir/builds/20796ce600e76a40 as the build directory.
1712858909.576306 BUILDING Generating dynamically created files.
1712858924.474080 BUILD_FAILED Build returned a non-zero result.
1712858925.552891 PREPPING_RUN Converting run template into run script.
1712858925.553972 RUNNING Starting the run script.
1712858925.557292 RUNNING Currently running.
1712858928.135093 RESULTS Parsing 7 result types.
1712858928.138710 RESULTS Performing 2 result evaluations.
1712858928.140112 RESULTS_ERROR Unexpected error parsing results: 'unsupported operand type(s) for /: 'NoneType' and 'int''... (This is a bug, you should report it.) See 'pav log kickoff 34' for the full error.

And then the build directory never gets moved to the 'failed' build directory and there's no record of the build anywhere. So if you go into test_runs/{number}/build/build you can't use ccmake to examine the artifacts. The build.log exists though.

Also, it doesn't stop the other builds. Perhaps there should be a 'fail_all' option for the build (and run) key: if it's set, one build failure causes all runs to fail; if it isn't, they all act independently, so build failures don't stop the series.

I think the solution would be to have it do what it seems to intend, namely:

  • Move the failed build to the failed build directory.
  • Notify the user of the path to the failed build directory.
  • Have the cancel event actually call scancel and kill the other jobs.

But the last item is debatable. I think the default should be not to kill other jobs if one fails, with killing them available as an option.
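To make the proposal concrete, here is a rough sketch of how a 'fail_all' switch could decide which other jobs to cancel. This is not Pavilion code: the function names, the `fail_all` parameter, and the test-name-to-job-id mapping are all hypothetical. The only real piece is that Slurm's scancel accepts one or more job ids on the command line.

```python
from typing import Dict, List


def other_jobs_to_cancel(failed_test: str,
                         job_ids: Dict[str, str],
                         fail_all: bool = False) -> List[str]:
    """Return the Slurm job ids of the *other* tests that should be cancelled.

    failed_test: name of the test whose build failed (hypothetical key).
    job_ids:     maps test name -> Slurm job id (hypothetical structure).
    fail_all:    hypothetical option; off by default so tests stay independent.
    """
    if not fail_all:
        # Default behavior: one build failure does not touch the other jobs.
        return []
    return sorted(jid for name, jid in job_ids.items() if name != failed_test)


def scancel_command(ids: List[str]) -> List[str]:
    """Build (but do not run) the scancel command line for the given job ids."""
    return ["scancel", *ids]


if __name__ == "__main__":
    jobs = {"a": "101", "b": "102", "c": "103"}
    to_kill = other_jobs_to_cancel("b", jobs, fail_all=True)
    if to_kill:
        print(" ".join(scancel_command(to_kill)))
```

The command is only built here, not executed; actually running it would be a matter of passing the list to something like `subprocess.run`.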

Also, we can see here that the run fails (of course, there's no binary), but it still continues and only throws an error at the results stage. That's dubious.
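For anyone who wants to check their own status files for this, here is a small sketch that lists every state recorded after BUILD_FAILED. It assumes only the line layout shown above (timestamp, state, note, separated by spaces); the function names are mine, and the sample is a handful of lines copied from the status file in this issue.

```python
from typing import List, Tuple

# A few lines copied from the status file above.
SAMPLE = """\
1712858909.576306 BUILDING Generating dynamically created files.
1712858924.474080 BUILD_FAILED Build returned a non-zero result.
1712858925.552891 PREPPING_RUN Converting run template into run script.
1712858925.553972 RUNNING Starting the run script.
1712858928.135093 RESULTS Parsing 7 result types.
"""


def parse_status(text: str) -> List[Tuple[float, str, str]]:
    """Split each status line into (timestamp, state, note)."""
    entries = []
    for line in text.splitlines():
        parts = line.strip().split(" ", 2)
        if len(parts) < 2:
            continue  # blank or malformed line
        try:
            stamp = float(parts[0])
        except ValueError:
            continue  # first field is not a timestamp
        note = parts[2] if len(parts) > 2 else ""
        entries.append((stamp, parts[1], note))
    return entries


def states_after_build_failure(entries: List[Tuple[float, str, str]]) -> List[str]:
    """Return the states logged after the first BUILD_FAILED, in order."""
    states = [state for _, state, _ in entries]
    if "BUILD_FAILED" not in states:
        return []
    return states[states.index("BUILD_FAILED") + 1:]


if __name__ == "__main__":
    print(states_after_build_failure(parse_status(SAMPLE)))
```

A non-empty result means the test kept moving through run and results stages after its build had already failed, which is the behavior described above.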

@dmageeLANL dmageeLANL self-assigned this Apr 12, 2024
@dmageeLANL
Collaborator Author

Ok Ok, I get it now. You aren't just moving the build dir to the fail dir; you're moving the build dir to the fail dir and then to the test directory/build. Ok, but why even move it to the fail dir then? If the build fails, just move it to test_dir/build. It's annoying that ccmake doesn't work, because that's a convenient way to look at the config. But the raw CMakeCache.txt is just as good (I forgot they were the same thing).

Still, the fact that it keeps going after the build fails is bad and doesn't give the user the correct info. And it's weird that it writes BUILD_CREATED to the status file several times after RESULTS_ERROR. Not a fixed number of times, but somewhere between 2 and 6.

@dmageeLANL
Collaborator Author

Wait, no. It adds a BUILD_CREATED line each time you run 'pav cat <test> status'. I just got like 13:

1712973075.890722 BUILD_CREATED Builder created.
1712973079.961097 BUILD_CREATED Builder created.
1712973083.182661 BUILD_CREATED Builder created.
1712973100.436680 BUILD_CREATED Builder created.
1712973104.653990 BUILD_CREATED Builder created.
1712973107.683803 BUILD_CREATED Builder created.
1712973110.989965 BUILD_CREATED Builder created.
1712973116.880021 BUILD_CREATED Builder created.
1712973125.673175 BUILD_CREATED Builder created.
1712973134.203118 BUILD_CREATED Builder created.
1712973137.567348 BUILD_CREATED Builder created.
1712973140.363867 BUILD_CREATED Builder created.
1712973144.783790 BUILD_CREATED Builder created.
1712973150.634879 BUILD_CREATED Builder created.
1712973155.402243 BUILD_CREATED Builder created.

What? And then I wrote this, went back and catted it again, and yep, there was only one more line despite a minute passing.
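A quick way to confirm this is to count the BUILD_CREATED lines in the status file before and after a cat and compare. The helper below is just a sketch with a name I made up; it assumes the same space-separated layout as the lines above and reads the text only, so it can't add entries itself.

```python
def count_state(status_text: str, state: str) -> int:
    """Count status lines whose second field equals the given state."""
    count = 0
    for line in status_text.splitlines():
        fields = line.split(None, 2)
        if len(fields) >= 2 and fields[1] == state:
            count += 1
    return count


if __name__ == "__main__":
    # Two lines copied from the output above.
    text = (
        "1712973075.890722 BUILD_CREATED Builder created.\n"
        "1712973079.961097 BUILD_CREATED Builder created.\n"
    )
    print(count_state(text, "BUILD_CREATED"))
```

If reading the status had no side effects, the count taken from the file before and after a cat would be identical.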
