Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check for corrupted / zombie files in ldmx-sw? #1406

Closed
tvami opened this issue Aug 24, 2024 · 5 comments · Fixed by #1443
Closed

Check for corrupted / zombie files in ldmx-sw? #1406

tvami opened this issue Aug 24, 2024 · 5 comments · Fixed by #1443
Assignees
Labels
framework core processing module Framework

Comments

@tvami
Copy link
Member

tvami commented Aug 24, 2024

Is your feature request related to a problem? Please describe.

This LDMX-Software/LDCS#23 made me think if we should have a built-in check for input files being corrupted / zombie.

Describe the solution you'd like

I'd think here
https://github.com/LDMX-Software/ldmx-sw/blob/trunk/Framework/src/Framework/Process.cxx#L279
is where this should go, but I dont know enough about the EventFile class, and a quick look almost suggests that it's like a TFile wrapper. So this makes me think if we could do

if (!inFile || inFile->IsZombie()) continue;
@tomeichlersmith
Copy link
Member

EventFile does not inherit from TFile but implementing a quick isCorrupted() function for it would make sense.

We actually already check for existence of both the file and the event tree in the EventFile constructor.

// open file with only reading enabled
file_ = new TFile(fileName_.c_str());
// double check that file is open
if (!file_->IsOpen()) {
EXCEPTION_RAISE("FileError", "Input file '" + fileName_ +
"' is not readable or does not exist.");
}
// Get the tree name from the configuration
auto tree_name{params.getParameter<std::string>("tree_name")};
tree_ = static_cast<TTree *>(file_->Get(tree_name.c_str()));
if (!tree_) {
EXCEPTION_RAISE("FileError", "File '" + fileName_ +
"' does not have a TTree named '" +
tree_name + "' in it.");
}
entries_ = tree_->GetEntriesFast();

Perhaps we include a check for IsZombie in there as well?

Personally, I really don't like the idea of continuing smoothly past a corrupted file. I think the Framework should end the program if a corrupted file is encountered in order to inform the user of this corruption. Maybe we can get around the failing-on-the-last-file annoyance by making sure to Write the output file at the end of every input file?

@tomeichlersmith tomeichlersmith added the framework core processing module Framework label Aug 28, 2024
@tvami
Copy link
Member Author

tvami commented Aug 29, 2024

I really don't like the idea of continuing smoothly past a corrupted file.

Well for the use case I ran into this, this would be not great. I had to ask LK to delete the input file and until they did it, I couldn't move on. (My use case was making a BDT skim, where I have 50 input files per output file)

So continuing on the bad file, and reporting that it was bad in the end of the workflow would be my preferred solution.

@tomeichlersmith
Copy link
Member

This is showing my own ignorance around Rucio and LDCS - is it possible to mark files as bad and re-run a workflow so that bad files are skipped?

I'm just speaking from my own (non-Rucio, non-LDCS) experience where I make a file list and just remove the bad files from the list as I encounter them (if I can't personally delete them).

In any case, I'm comfortable with incorporating this skipping behavior, but I would like it to be designed with new users in mind as well. i.e. just making sure the error message is very clear and we don't accidentally silence the other error messages (no file or no tree). This might be a case where we add another flag to the constructor about "strictness" 🤔

@tvami
Copy link
Member Author

tvami commented Aug 29, 2024

Rucio and LDCS - is it possible to mark files as bad and re-run a workflow so that bad files are skipped?

My understanding from @bryngemark is that this cannot be done.

This might be a case where we add another flag to the constructor about "strictness"

Yes another flag could work, so in default we'd exit if the file is problematic, but if the person doing this knows about the flag they can let the code skip files (the pool for people who use LDCS is anyway pretty limited)

@bryngemark
Copy link
Contributor

That's right, there is currently no smooth way out of this with LDCS. What we do is specify an input data set, and a number of files per job, and then aCT queries the rucio database for actual file names and paths and chooses which files a given job will run on. Resubmission of a failed job (yet to be fully implemented in a completely useful way) would repeat the same job with the same input files. aCT does not know about the reason for the job failure.

A flag sounds like a sensible solution to me, and would be hugely useful when we run harsh skims. For book-keeping, though, it would be great if the number of successfully opened files could be reported somewhere the metadata harvesting for rucio could find it. Otherwise, we'll soon be lost when it comes to EOT calculations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
framework core processing module Framework
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants