Efficient production of template histograms for statistical analysis #469
Replies: 9 comments 28 replies
-
Sticking to trying to answer questions and continue the discussion a bit. Here are my 2 cents: Q1: We did some quick checks of these, nothing in depth. We found: Unless the histograms have a very large number of bins this is largely a matter of what a user perceives as friendly. The answer was a bit different with the coffea.hist rather than hist/boost.histogram, the former could be very slow in a few cases that would suggest splitting the processors. Of course the second option is always computationally limited by having to loop over the data again, which is a fundamental scaling property... Unless you have some high-level optimization routines to squish that pattern into something efficient. Systematic variations are fairly efficiently handled with categorical binning, and is always significantly faster than instantiating a whole new histogram to keep track of. I think really high-dim histograms with categorical bins for variations is a better design pattern, but different people do book keeping in different ways, and there's not really be a comprehensive study of what common ways people actually like doing this. I suspect a variety of ways to doing this could be presented all with the same efficient back end. You could do the object-oriented SOA trick for histograms to make hist_collection["somename"] give you a "histogram" while still largely being an array indexing problem in the back. Q2: Largely coffea (tries) to provide good tools for analyzers to do things efficiently and while making very few suggestions about how to do it, no one likes being given an opinion without realizing it. So this level of tool is something we should think where it actually needs to live, but during development can be a coffea subpackage if we want that. What you're asking for here it developing a thing which creates and manipulates histograms according to a specific rule set with good justification that rule set is efficient. I suspect the hard part is convincing people they wanna do it that way, rather than actually writing code! Such a thing is always potentially useful, but we should be concerned about it being used instead. The latter requires lots of pondering about design, presentation, and use cases. Q3: Answering this sort of has flavors of what I was talking about in Q1 and Q2. A nice example of doing good bookkeeping by hand is in https://github.com/nsmith-/boostedhiggs/blob/cc/boostedhiggs/hbbprocessor.py . I think this could be extracted and thought about in the context of automation. What this really suggests to me is that there is a separate package that we want to build that just handles the mux/demux of systematics from objects to stacks of variated objects (as awkward arrays) with categories as axes. I don't think it even needs to be too-heavy of a thing (but you may always pile on syntactic sugar). from coffea import nanoevents
import awkward as ak
import numpy as np
fin = './nano106Xv8_on_mini106X_2017_mc_NANO_py_NANO_46.root'
factory = nanoevents.NanoEventsFactory.from_root(fin)
events = factory.events()
def make_pt_variants(muons):
systematic = np.array([1.0, 1.05, 0.95]) # this could be some function per muon too
return muons.pt * systematic[None, None, :]
muons = events.Muon
mask = make_pt_variants(muons) > 6.5
print(mask)
centrals = muons[mask[:,:,0]]
ups = muons[mask[:,:,1]]
downs = muons[mask[:,:,-1]] @simonepigazzini I think you may be interested in this conversation. |
Beta Was this translation helpful? Give feedback.
-
Moving some discussion on slack to here. Some further thinking:
some listing of core functionalities:
use:
accessibility / sharing:
With the hope that something like that would produce a focused but well rounded systematics calculation package that allows you to build further tools atop it. Some possibilities for what's needed in a prototype:
|
Beta Was this translation helpful? Give feedback.
-
Lindsey wrote a nice reply already, that mostly agrees with my view. I have a few things to add though. Regarding Q1, I think the main factor in preferring one processor (or at least one iteration of opening a given file and analyzing some chunk of it in a work element) is how expensive it is to get the data into memory. We spend a lot of time opening, reading, and decompressing, so we should try to do a lot with the data once it is uncompressed in memory and ready to use. We have yet to see a case where the compute needs outweigh the data locality in deciding which dimension to parallelize, but that may be possible. The other point I wanted to make was that I picked the |
Beta Was this translation helpful? Give feedback.
-
In some discussion with @sam-may one possible good piece of interface would be to introduce something like: @awkward.mixin_class(behavior)
class HasSystematic:
def systematics(self):
if 'systematics' in ak.fields(self)
return self['systematics']
self = ak.with_field(self, 'systematics', {})
return self["systematics"]
def add_systematic(self, name, what, varying_function):
# same thing except adding a systematic variation to self |
Beta Was this translation helpful? Give feedback.
-
FYI for people following there will be further in-person discussion here (12 May 2021): https://indico.cern.ch/event/1033225/ along with a presentation from people on MINERvA on how they handle systematics. |
Beta Was this translation helpful? Give feedback.
-
Hi all, I have a couple points to add to this discussion and a couple questions. First, @alexander-held on your question of performance of processing together vs. multiple processors, we had a similar question in developing a columnar framework for H->gg. Our photons, for example, have many different systematic variations which result in new collections of photons, e.g.
and then performing the selection in a fully columnar way across all branches. vs.
I used this script to compare the speed of the two methods as a function of the number of systematics. As the attached plot shows, for a modest number of systematics (5-10) we could expect a factor of 2-3 improvement in speed from the fully columnar way, and this increases to a factor of 7-8 for a very large number of systematics. Given this nearly order of magnitude increase in speed, I am very motivated to pursue this in the context of the Hgg framework. My main concern with the fully columnar method is the bookkeeping (which is obvious in the "loop" method): how do we organize and keep track of all of these syst variations? I think this is achievable, but had a couple (possibly very naive) questions:
how could I rename my
?
Thanks in advance & please let me know if anything in my post is unclear. |
Beta Was this translation helpful? Give feedback.
-
After today's discussion in the Analysis Systems meeting, I've put together a notebook with the core functionality that I was walking through. The binder setup works now, binder defaults to pypy. Have a look. Anyway, you can get it all from https://github.com/CoffeaTeam/coffea/blob/systematics_work/binder/systematics_wip.ipynb. What I'd really like to try next is understanding:
I'll keep y'all updated. |
Beta Was this translation helpful? Give feedback.
-
More thoughts on interfaces to statistical tools. I'd like to be able to build elements in an ensemble by doing something like this: def get_universe(*args): # assuming up/down 1sigma systematics only, but this easily translates
syst_table = build_table(*args) # take input event, physics objects, etc, make a table
universe = np.clip(
np.random.normal(size=len(syst_table),
-1.5,
1.5) \
.astype(np.int32)
return *syst_table[:, universe] # explode table back into parts promoting chosen variations as returned objects This way we just have to agree on where things are put and that build_table can rearrange those things into a table indexed by systematics first. Event in an eventwise loop. |
Beta Was this translation helpful? Give feedback.
-
Hi, sorry the discussion at vCHEP had to be cut off -- we have been discussing this topic a bit for RDataFrame but I still have to really put my head down and think things through properly (and get feedback from people that know the physics usecases better than me, etc.). As RDF builds a computation graph, we can walk that graph before starting the event loop to figure out how many and which combinations of variations we need to calculate. Users "just" have to tell RDF which quantities vary and how (providing a callback). Back in December I had this short deck of slides to support that discussion: https://eguiraud.web.cern.ch/eguiraud/decks/20201210_systematics_ppp , not sure how relevant it is for coffea/awkward (e.g. due to C++/Python and event-wise/column-wise differences). |
Beta Was this translation helpful? Give feedback.
-
Hi, I am interested in using
coffea
to produce template histograms for statistical analysis. I am curious to hear how others approach this, in particular regarding two aspects:I believe that in particular this second aspect may be interesting to a wide range of
coffea
users. I describe my use case below, alongside some considerations regarding acoffea
implementation and a few questions.Template histograms for statistical analysis
I want to use
coffea
to create all histograms needed for statistical analysis. This means processing many datasets/samples1 in multiple channels and including systematics. Samples and systematics may be channel-dependent (but typically not strongly so), while usually at least some systematics are sample-dependent. A subset of the histograms defined by channels ⊗ samples ⊗ systematics is needed in practice.Current setup
At the moment I have a configuration file to define implicitly what histograms are needed (including information about them, like cuts needed), and code that turns this into a list of instructions to build every histogram. I can take this list of instructions and ship them to a simple function using uproot to process these instructions separately from each other. The instructions are sufficient to create each histogram (observable, binning, cuts, weights). This means that the function processing the instructions can be very generic2.
coffea
versionThe most naive way to port this to
coffea
seems to be a processor that receives the (observable, binning, cuts, weights) instructions to create one histogram at a time. I would call something likerun_uproot_job
once per histogram. I expect this approach to be inefficient, especially for weight-based systematics. It should be faster to build a nominal histogram plus another one where just the weight changes than building both histograms with two different processors (and reading the same columns / applying the same cuts).In the examples I’ve seen with
coffea
, multi-dimensional histogram are commonly used (with e.g. dataset, channel, systematic as axes, and the observable I’m interested in fitting as another axis). This can avoid this inefficiency from e.g. duplicated column reading. All logic to determine which histograms need to be produced and how to produce them (e.g. which systematics are needed per dataset, which cuts to apply per channel) is implemented in the processor, possibly based on metadata provided to the processor. One could callrun_uproot_job
once and rely oncoffea
to efficiently construct thousands of histograms.Efficient analysis-dependent processors?
I would like to avoid duplicating information related to the statistical model to construct. This information includes for example which samples to consider per channel, and which systematic uncertainties to apply to them. Writing a complex processor that efficiently produces many template histograms in parallel seems possible, but requires some information related to the statistical model. Changing the model may mean changing the processor where information is hardcoded. The same information may also need to be changed in parallel in the code that constructs the model from the template histograms.
A better approach?
With the model specified in a central place, that information could be used to both steer
coffea
and to build the statistical model from the template histograms. A fairly genericcoffea
processor could be used to build template histograms from instructions like (observable, binning, cuts, weights). The challenge in this approach is efficiency: how can template histograms be grouped to be efficiently built together, minimizing duplication of computation? The following workflow seems possible in principle:Writing a good tool for the second step might be a challenge. In practice I am guessing that performance improvements of factors 10-100 are possible with good grouping, compared to a naive approach where each template histogram is built independently from others.
When thinking about histograms as objects building from associated instructions, it may also be possible to create a hash from those instructions. If the statistical model changes slightly, the new instruction could be hashed and compared to the hash previously used. When they match, there is no need to re-construct a particular histogram. In principle this could be done before grouping histograms together, and lead to different grouping depending on which histograms need to be built. This approach may also work with complex processors that build multi-dimensional histograms for many channels/systematics at once, but it seems a lot less straightforward to implement to me.
Questions
Q1: Are you aware of performance comparisons for the two approaches:
Q2: Does some code already exists that could group histogram building instructions together in efficient ways? Would this potentially be useful to other users?
Q3: How do other users handle the logic for systematics? Object-based systematics that affect all samples in the same way seem straightforward, e.g. in the processor I say that instead of just using the nominal branch to build the nominal histogram, it should also use the jet_energy_scale_up branch to build the histogram for the corresponding systematic variation. Sample- (or channel-) specific systematics seem less straightforward to implement. Two-point modeling systematics (e.g. using a different MC generator to build the histogram, therefore using a different input file) also do not fit well into this. One could define additional samples for such systematics and skip the creation of systematics histograms (for e.g. object-based variations) for them. Do people usually just put all this logic into their processors?
Footnotes
1 A "sample" may refer to a single process and be built from a single or many files, it may be restricted to a subset of events from a process, or contain multiple processes.
2 At the moment I assume that the observables needed for the histograms and for the cuts are either already present in the input or easy enough to calculate, and that I do not need something complicated like a kinematic reconstruction to build each histogram. If such complicated calculations are required, they should probably be implemented directly in a
coffea
processor instead of being handed to a generic function processing instructions.Beta Was this translation helpful? Give feedback.
All reactions