Check for corrupted / zombie files in ldmx-sw? #1406

tvami · 2024-08-24T22:51:10Z

Is your feature request related to a problem? Please describe.

This LDMX-Software/LDCS#23 made me think if we should have a built-in check for input files being corrupted / zombie.

Describe the solution you'd like

I'd think here
https://github.com/LDMX-Software/ldmx-sw/blob/trunk/Framework/src/Framework/Process.cxx#L279
is where this should go, but I dont know enough about the EventFile class, and a quick look almost suggests that it's like a TFile wrapper. So this makes me think if we could do

if (!inFile || inFile->IsZombie()) continue;

The text was updated successfully, but these errors were encountered:

tomeichlersmith · 2024-08-28T20:24:27Z

EventFile does not inherit from TFile but implementing a quick isCorrupted() function for it would make sense.

We actually already check for existence of both the file and the event tree in the EventFile constructor.

ldmx-sw/Framework/src/Framework/EventFile.cxx

Lines 53 to 69 in d74f8cc

    
           // open file with only reading enabled 
        
           file_ = new TFile(fileName_.c_str()); 
        
           // double check that file is open 
        
           if (!file_->IsOpen()) { 
        
             EXCEPTION_RAISE("FileError", "Input file '" + fileName_ + 
        
                                              "' is not readable or does not exist."); 
        
           } 
        
           // Get the tree name from the configuration 
        
           auto tree_name{params.getParameter<std::string>("tree_name")}; 
        
           tree_ = static_cast<TTree *>(file_->Get(tree_name.c_str())); 
        
           if (!tree_) { 
        
             EXCEPTION_RAISE("FileError", "File '" + fileName_ + 
        
                                              "' does not have a TTree named '" + 
        
                                              tree_name + "' in it."); 
        
           } 
        
           entries_ = tree_->GetEntriesFast();

Perhaps we include a check for IsZombie in there as well?

Personally, I really don't like the idea of continuing smoothly past a corrupted file. I think the Framework should end the program if a corrupted file is encountered in order to inform the user of this corruption. Maybe we can get around the failing-on-the-last-file annoyance by making sure to Write the output file at the end of every input file?

tvami · 2024-08-29T01:30:22Z

I really don't like the idea of continuing smoothly past a corrupted file.

Well for the use case I ran into this, this would be not great. I had to ask LK to delete the input file and until they did it, I couldn't move on. (My use case was making a BDT skim, where I have 50 input files per output file)

So continuing on the bad file, and reporting that it was bad in the end of the workflow would be my preferred solution.

tomeichlersmith · 2024-08-29T13:31:37Z

This is showing my own ignorance around Rucio and LDCS - is it possible to mark files as bad and re-run a workflow so that bad files are skipped?

I'm just speaking from my own (non-Rucio, non-LDCS) experience where I make a file list and just remove the bad files from the list as I encounter them (if I can't personally delete them).

In any case, I'm comfortable with incorporating this skipping behavior, but I would like it to be designed with new users in mind as well. i.e. just making sure the error message is very clear and we don't accidentally silence the other error messages (no file or no tree). This might be a case where we add another flag to the constructor about "strictness" 🤔

tvami · 2024-08-29T13:51:18Z

Rucio and LDCS - is it possible to mark files as bad and re-run a workflow so that bad files are skipped?

My understanding from @bryngemark is that this cannot be done.

This might be a case where we add another flag to the constructor about "strictness"

Yes another flag could work, so in default we'd exit if the file is problematic, but if the person doing this knows about the flag they can let the code skip files (the pool for people who use LDCS is anyway pretty limited)

bryngemark · 2024-09-02T13:46:19Z

That's right, there is currently no smooth way out of this with LDCS. What we do is specify an input data set, and a number of files per job, and then aCT queries the rucio database for actual file names and paths and chooses which files a given job will run on. Resubmission of a failed job (yet to be fully implemented in a completely useful way) would repeat the same job with the same input files. aCT does not know about the reason for the job failure.

A flag sounds like a sensible solution to me, and would be hugely useful when we run harsh skims. For book-keeping, though, it would be great if the number of successfully opened files could be reported somewhere the metadata harvesting for rucio could find it. Otherwise, we'll soon be lost when it comes to EOT calculations.

tomeichlersmith added the framework core processing module Framework label Aug 28, 2024

tvami mentioned this issue Sep 6, 2024

Add more features to ECAL DQM & validation module #1440

Merged

2 tasks

tomeichlersmith linked a pull request Sep 6, 2024 that will close this issue

check for corrupted zombie files in ldmx sw #1443

Merged

2 tasks

tomeichlersmith mentioned this issue Sep 6, 2024

check for corrupted zombie files in ldmx sw #1443

Merged

2 tasks

tomeichlersmith closed this as completed in #1443 Sep 6, 2024

tomeichlersmith closed this as completed in a113e1b Sep 6, 2024

tvami assigned tomeichlersmith Sep 6, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Check for corrupted / zombie files in ldmx-sw? #1406

Check for corrupted / zombie files in ldmx-sw? #1406

tvami commented Aug 24, 2024

tomeichlersmith commented Aug 28, 2024

tvami commented Aug 29, 2024

tomeichlersmith commented Aug 29, 2024

tvami commented Aug 29, 2024

bryngemark commented Sep 2, 2024

Check for corrupted / zombie files in ldmx-sw? #1406

Check for corrupted / zombie files in ldmx-sw? #1406

Comments

tvami commented Aug 24, 2024

tomeichlersmith commented Aug 28, 2024

tvami commented Aug 29, 2024

tomeichlersmith commented Aug 29, 2024

tvami commented Aug 29, 2024

bryngemark commented Sep 2, 2024