Replies: 2 comments 2 replies
-
As you point out, this is an analysis preference and in particular depends on the ntuple format being used, so assuming it within coffea is not the right place for it. Beyond that, there is another problem: the summary results (for negative event weights and such) in the Runs tree of NanoAOD are accumulated over all events in the file, while the processing jobs run over subsets (chunks) of those events, and an individual chunk may fail. I forget whether the per-event +1/-1 weight is available in NanoAOD itself; if it is, that makes things much easier and more obvious. The solution may simply be to alter NanoAOD so that the event weight lives in a place more suitable for dask-based processing.
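For reference, a minimal sketch (assuming plain uproot access and hypothetical file names) of how that per-file summary is read from the Runs tree; it illustrates why the information is tied to whole files rather than to chunks:

```python
import uproot

# Hypothetical file list for one MC process; the summary branch
# genEventSumw lives in the per-file "Runs" tree of NanoAOD, not in Events.
files = ["nano_1.root", "nano_2.root"]

total_sumw = 0.0
for fname in files:
    with uproot.open(fname) as f:
        # Sum over all Runs-tree entries in this file (usually one per run).
        total_sumw += f["Runs"]["genEventSumw"].array(library="np").sum()

print("sum of genWeight over the whole sample:", total_sumw)
```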
-
Provided the input file is not already filtered (skimmed), one can always re-sum the genWeight from the Events tree itself.
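A minimal sketch of what that re-summing could look like, assuming the classic coffea processor.ProcessorABC interface, an unskimmed NanoAOD input, and a hypothetical processor name:

```python
import awkward as ak
from coffea import processor

class SumwProcessor(processor.ProcessorABC):
    """Sketch: accumulate the per-dataset sum of genWeight alongside
    other outputs, so no separate pre- or post-processing step is needed."""

    def process(self, events):
        dataset = events.metadata["dataset"]
        return {
            dataset: {
                "sumw": ak.sum(events.genWeight),
                "nevents": len(events),
            }
        }

    def postprocess(self, accumulator):
        return accumulator
```

Note this only gives the correct normalization if every chunk of the unskimmed sample is processed successfully, which ties back to the chunk-failure concern above.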
-
This is a NanoAOD topic; I'm interested in how others have solved it.
The issue is that you want to scale your MC according to some luminosity and cross section. The formula for the final weight w of an event is w = genWeight * L * 𝜎 / sum(genWeight of this process), where L is the luminosity and 𝜎 is the cross section of the process. However, this requires sum(genWeight of this process). NanoAOD provides `genEventSumw` as a single number per file for this purpose. Now one can either get the sum after the Coffea processor has run over all events and then apply it in a manual postprocessing step, or do a manual preprocessing step that collects the sum. Neither seems very advantageous. So is there a proper way of doing this in Coffea, or am I missing something? The nicest way I could think of is having this done already in Coffea's preprocessing, where it also reads out the number of events per file.