Efficient production of template histograms for statistical analysis #469

alexander-held · 2021-03-19T15:07:13Z

alexander-held
Mar 19, 2021
Maintainer

Hi, I am interested in using coffea to produce template histograms for statistical analysis. I am curious to hear how others approach this, in particular regarding two aspects:

hardcoding a lot of analysis logic in a processor vs a more generic processor structure,
grouping of histograms to avoid duplicated computations.

I believe that in particular this second aspect may be interesting to a wide range of coffea users. I describe my use case below, alongside some considerations regarding a coffea implementation and a few questions.

Template histograms for statistical analysis

I want to use coffea to create all histograms needed for statistical analysis. This means processing many datasets/samples¹ in multiple channels and including systematics. Samples and systematics may be channel-dependent (but typically not strongly so), while usually at least some systematics are sample-dependent. A subset of the histograms defined by channels ⊗ samples ⊗ systematics is needed in practice.

Current setup

At the moment I have a configuration file to define implicitly what histograms are needed (including information about them, like cuts needed), and code that turns this into a list of instructions to build every histogram. I can take this list of instructions and ship them to a simple function using uproot to process these instructions separately from each other. The instructions are sufficient to create each histogram (observable, binning, cuts, weights). This means that the function processing the instructions can be very generic².

`coffea` version

The most naive way to port this to coffea seems to be a processor that receives the (observable, binning, cuts, weights) instructions to create one histogram at a time. I would call something like run_uproot_job once per histogram. I expect this approach to be inefficient, especially for weight-based systematics. It should be faster to build a nominal histogram plus another one where just the weight changes than building both histograms with two different processors (and reading the same columns / applying the same cuts).
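The efficiency argument for weight-based systematics can be made concrete with a minimal NumPy sketch. The column names, the cut and the weight values below are made-up toy inputs, not taken from a real analysis; the only point is that the selection and the observable are evaluated once and then reused for every weight variation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy columns standing in for data read from a file (hypothetical names).
n_events = 10_000
observable = rng.uniform(0.0, 200.0, n_events)
n_jets = rng.integers(0, 8, n_events)
weights = {
    "nominal": np.ones(n_events),
    "pileup_up": rng.normal(1.05, 0.01, n_events),
    "pileup_down": rng.normal(0.95, 0.01, n_events),
}

bins = np.linspace(0.0, 200.0, 11)

# Apply the selection once ...
selected = n_jets >= 4
obs_selected = observable[selected]

# ... then fill one histogram per weight variation from the same columns.
templates = {
    name: np.histogram(obs_selected, bins=bins, weights=w[selected])[0]
    for name, w in weights.items()
}
```

With one processor per histogram, the column reads and the `n_jets >= 4` cut would be repeated three times here.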

In the examples I’ve seen with coffea, multi-dimensional histograms are commonly used (with e.g. dataset, channel, systematic as axes, and the observable I’m interested in fitting as another axis). This can avoid the inefficiency from e.g. duplicated column reading. All logic to determine which histograms need to be produced and how to produce them (e.g. which systematics are needed per dataset, which cuts to apply per channel) is implemented in the processor, possibly based on metadata provided to the processor. One could call run_uproot_job once and rely on coffea to efficiently construct thousands of histograms.

Efficient analysis-dependent processors?

I would like to avoid duplicating information related to the statistical model to construct. This information includes for example which samples to consider per channel, and which systematic uncertainties to apply to them. Writing a complex processor that efficiently produces many template histograms in parallel seems possible, but requires some information related to the statistical model. Changing the model may mean changing the processor where information is hardcoded. The same information may also need to be changed in parallel in the code that constructs the model from the template histograms.

A better approach?

With the model specified in a central place, that information could be used to both steer coffea and to build the statistical model from the template histograms. A fairly generic coffea processor could be used to build template histograms from instructions like (observable, binning, cuts, weights). The challenge in this approach is efficiency: how can template histograms be grouped to be efficiently built together, minimizing duplication of computation? The following workflow seems possible in principle:

generate list of instructions for building every template histogram needed, one instruction per histogram
analyze instructions to group them together
send groups of instructions off to processor

Writing a good tool for the second step might be a challenge. In practice I am guessing that performance improvements of factors 10-100 are possible with good grouping, compared to a naive approach where each template histogram is built independently from others.
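A very simple version of the grouping step might look like the sketch below. The `Instruction` fields mirror the (observable, binning, cuts, weights) tuple described above plus a sample name; the class, the function and the grouping key (sample and cuts) are illustrative assumptions, and a real tool would likely need a smarter notion of which work can be shared.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Instruction:
    """One template histogram: everything needed to build it."""
    sample: str
    observable: str
    binning: tuple
    cuts: str
    weight: str


def group_instructions(instructions):
    """Group instructions that share the input sample and the cuts."""
    groups = defaultdict(list)
    for instruction in instructions:
        groups[(instruction.sample, instruction.cuts)].append(instruction)
    return dict(groups)


instructions = [
    Instruction("ttbar", "jet_pt", (0, 100, 200), "n_jets >= 4", "w_nominal"),
    Instruction("ttbar", "jet_pt", (0, 100, 200), "n_jets >= 4", "w_pileup_up"),
    Instruction("ttbar", "jet_pt", (0, 100, 200), "n_jets >= 4", "w_pileup_down"),
    Instruction("ttbar", "jet_pt", (0, 100, 200), "n_jets == 3", "w_nominal"),
    Instruction("wjets", "jet_pt", (0, 100, 200), "n_jets >= 4", "w_nominal"),
]

groups = group_instructions(instructions)
for (sample, cuts), members in groups.items():
    print(f"{sample} | {cuts}: {len(members)} histogram(s) in one pass")
```

Each group could then be handed to one processor call, so that five instructions turn into three passes over the data in this toy case.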

When thinking about histograms as objects built from associated instructions, it may also be possible to create a hash from those instructions. If the statistical model changes slightly, the new instruction could be hashed and compared to the hash previously used. When they match, there is no need to re-construct a particular histogram. In principle this could be done before grouping histograms together, and lead to different grouping depending on which histograms need to be built. This approach may also work with complex processors that build multi-dimensional histograms for many channels/systematics at once, but it seems a lot less straightforward to implement to me.
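The hashing idea could be sketched as follows, assuming instructions are plain JSON-serializable dictionaries. The function names and the in-memory `cache` set are hypothetical; in practice the hashes would have to be stored alongside the histograms on disk.

```python
import hashlib
import json


def instruction_hash(instruction):
    """Stable hash of a histogram-building instruction (a plain dict)."""
    canonical = json.dumps(instruction, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def needs_rebuild(instructions, cache):
    """Return the instructions whose hash is not among the built histograms."""
    return [i for i in instructions if instruction_hash(i) not in cache]


old = {"sample": "ttbar", "observable": "jet_pt", "binning": [0, 100, 200],
       "cuts": "n_jets >= 4", "weight": "w_nominal"}
cache = {instruction_hash(old)}

# Same content, different key order: still the same hash.
same = dict(reversed(list(old.items())))
# Changed binning: a new histogram has to be built.
changed = {**old, "binning": [0, 50, 100, 200]}

to_build = needs_rebuild([same, changed], cache)
```

Only `changed` ends up in `to_build`, so the grouping step would only see the histograms that actually need to be produced.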

Questions

Q1: Are you aware of performance comparisons for the two approaches:

processing as much as possible together, and creating multiple histograms at once where only small things change (e.g. weight or observable),
processing histograms independently of each other in separate processors?

Q2: Does some code already exist that could group histogram building instructions together in efficient ways? Would this potentially be useful to other users?

Q3: How do other users handle the logic for systematics? Object-based systematics that affect all samples in the same way seem straightforward, e.g. in the processor I say that instead of just using the nominal branch to build the nominal histogram, it should also use the jet_energy_scale_up branch to build the histogram for the corresponding systematic variation. Sample- (or channel-) specific systematics seem less straightforward to implement. Two-point modeling systematics (e.g. using a different MC generator to build the histogram, therefore using a different input file) also do not fit well into this. One could define additional samples for such systematics and skip the creation of systematics histograms (for e.g. object-based variations) for them. Do people usually just put all this logic into their processors?

Footnotes

¹ A "sample" may refer to a single process and be built from a single or many files, it may be restricted to a subset of events from a process, or contain multiple processes.
² At the moment I assume that the observables needed for the histograms and for the cuts are either already present in the input or easy enough to calculate, and that I do not need something complicated like a kinematic reconstruction to build each histogram. If such complicated calculations are required, they should probably be implemented directly in a coffea processor instead of being handed to a generic function processing instructions.

lgray · 2021-04-13T18:19:34Z

lgray
Apr 13, 2021
Maintainer

Sticking to trying to answer questions and continue the discussion a bit. Here are my 2 cents:

Q1: We did some quick checks of these, nothing in depth. We found that unless the histograms have a very large number of bins, this is largely a matter of what a user perceives as friendly. The answer was a bit different with coffea.hist than with hist/boost.histogram: the former could be very slow in a few cases, which would suggest splitting the processors.

Of course the second option is always computationally limited by having to loop over the data again, which is a fundamental scaling property... Unless you have some high-level optimization routines to squish that pattern into something efficient.

Systematic variations are fairly efficiently handled with categorical binning, which is always significantly faster than instantiating a whole new histogram to keep track of. I think really high-dim histograms with categorical bins for variations are a better design pattern, but different people do bookkeeping in different ways, and there has not really been a comprehensive study of the common ways people actually like doing this. I suspect a variety of ways of doing this could be presented, all with the same efficient back end.

You could do the object-oriented SOA trick for histograms to make hist_collection["somename"] give you a "histogram" while still largely being an array indexing problem in the back.
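As a rough illustration of that SOA idea, the toy class below keeps all bin contents in one 2D NumPy array (one row per named histogram) and returns a view into it on lookup. `HistCollection` and its methods are invented for this sketch and are not part of coffea, hist or boost-histogram.

```python
import numpy as np


class HistCollection:
    """Toy collection: one 2D array of bin contents, one row per named histogram."""

    def __init__(self, names, bins):
        self.bins = np.asarray(bins, dtype=float)
        self._index = {name: i for i, name in enumerate(names)}
        self._counts = np.zeros((len(names), len(self.bins) - 1))

    def fill(self, name, values, weight=None):
        counts, _ = np.histogram(values, bins=self.bins, weights=weight)
        self._counts[self._index[name]] += counts

    def __getitem__(self, name):
        # Returns a view into the backing array, not a copy.
        return self._counts[self._index[name]]


hist_collection = HistCollection(["nominal", "jes_up", "jes_down"], bins=[0, 50, 100, 150])
pt = np.array([20.0, 48.0, 70.0, 95.0, 120.0])
hist_collection.fill("nominal", pt)
hist_collection.fill("jes_up", pt * 1.1)
hist_collection.fill("jes_down", pt * 0.9)
print(hist_collection["jes_up"])
```

The user sees named "histograms", while operations across all variations (sums, ratios to nominal) stay simple array operations on the backing store.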

Q2: Largely, coffea tries to provide good tools for analyzers to do things efficiently while making very few suggestions about how to do it; no one likes being given an opinion without realizing it. So we should think about where this level of tool actually needs to live, but during development it can be a coffea subpackage if we want that.

What you're asking for here is developing a thing which creates and manipulates histograms according to a specific rule set, with good justification that the rule set is efficient. I suspect the hard part is convincing people they wanna do it that way, rather than actually writing code!

Such a thing is always potentially useful, but we should be concerned about whether it actually gets used instead. The latter requires lots of pondering about design, presentation, and use cases.

Q3: Answering this sort of has flavors of what I was talking about in Q1 and Q2. A nice example of doing good bookkeeping by hand is in https://github.com/nsmith-/boostedhiggs/blob/cc/boostedhiggs/hbbprocessor.py . I think this could be extracted and thought about in the context of automation.

What this really suggests to me is that there is a separate package that we want to build that just handles the mux/demux of systematics from objects to stacks of variated objects (as awkward arrays) with categories as axes.

I don't think it even needs to be too-heavy of a thing (but you may always pile on syntactic sugar).
As a stream-of-consciousness-ish example:

from coffea import nanoevents
import awkward as ak
import numpy as np

fin = './nano106Xv8_on_mini106X_2017_mc_NANO_py_NANO_46.root'
factory = nanoevents.NanoEventsFactory.from_root(fin)
events = factory.events()

def make_pt_variants(muons):
    systematic = np.array([1.0, 1.05, 0.95])  # this could be some function per muon too
    return muons.pt * systematic[None, None, :]
    
muons = events.Muon
mask = make_pt_variants(muons) > 6.5
print(mask)

centrals = muons[mask[:,:,0]]
ups = muons[mask[:,:,1]]
downs = muons[mask[:,:,-1]]

@simonepigazzini I think you may be interested in this conversation.

9 replies

lgray Apr 14, 2021
Maintainer

Updated the above, I really wanted to underline how easy it was to get expressive semantics and multiplexing of cuts with awkward array.

simonepigazzini Apr 15, 2021

Hi folks,

thanks for sharing this, I must admit I have to dig in a bit more and make myself more familiar with coffea, but I have a trivial question concerning this (very interesting) way of implementing syst variations. Imagine this setup:

you have Muons and Jets
syst variations affecting Muons do not affect any Jet observable
if I understand correctly the syst array above for Muons would look something like systematics = np.array([1, 1.05, 0.95, 1, 1]) while the one for Jets would be like systematics = np.array([1, 1, 1, 1.2, 0.8]). This assuming in position 1,2 we have the muon syst variation while in position 3,4 the jet's ones, does this match what you are sketching out in the example?
Would this imply unnecessary memory consumption to some extent (the repeated ones in the syst array causing duplicates when one wants to access the varied values), or can this somehow be handled efficiently by awkward? (Of course just duplicating the data for every syst variation has the same problem and this approach is much clearer, so not really an issue but more a curiosity.)

Thanks a lot

simone

lgray Apr 15, 2021
Maintainer

Hi Simone:

Here's a quick example

from coffea import nanoevents
import awkward as ak
import numpy as np

fin = './nano106Xv8_on_mini106X_2017_mc_NANO_py_NANO_46.root'
factory = nanoevents.NanoEventsFactory.from_root(fin)
events = factory.events()

def muon_pt_variants(muons):
    systematic = np.array([1.0, 1.05, 0.95])  # this could be some function per muon too
    return muons.pt * systematic[None, None, :]

def jet_pt_variants(jets):
    systematic = np.array([1.0, 1.2, 0.8])  # this could be some function per jet too
    return jets.pt * systematic[None, None, :]

muons = events.Muon
jets = events.Jet
muon_mask = muon_pt_variants(muons) > 6.5
jet_mask = jet_pt_variants(jets) > 30

muon_jets = [[ak.cartesian([muons[muon_mask[:,:,i]], jets[jet_mask[:,:,j]]], axis=1) for j in range(-1, 2)] for i in range(-1, 2)]
# modulo some code golf and being more clever with awkward.

And with this we have a table of muon + jet pairs multiplexed over systematics for muons and jets individually.

Unless there's a bug in awkward or I missed a rendered view here, it is all achieved through re-indexing the original arrays and is pretty memory efficient.

simonepigazzini Apr 15, 2021

Hi @lgray. This is very clear, thank you.

simone

lgray Apr 15, 2021
Maintainer

You can even do more user-friendly stuff like replace the muon / jet pt with the corresponding variant.

from coffea import nanoevents
import awkward as ak
import numpy as np

fin = './nano106Xv8_on_mini106X_2017_mc_NANO_py_NANO_46.root'
factory = nanoevents.NanoEventsFactory.from_root(fin)
events = factory.events()

def muon_pt_variants(muons):
    systematic = np.array([1.0, 1.05, 0.95])  # this could be some function per muon too
    return muons.pt * systematic[None, None, :]

def jet_pt_variants(jets):
    systematic = np.array([1.0, 1.2, 0.8])  # this could be some function per jet too
    return jets.pt * systematic[None, None, :]

muons = events.Muon
jets = events.Jet
varied_muon_pts = muon_pt_variants(muons) 
varied_jet_pts = jet_pt_variants(jets)
muon_mask = varied_muon_pts > 6.5
jet_mask = varied_jet_pts > 30

muon_jets = [[
    ak.cartesian([
            ak.with_field(muons, varied_muon_pts[:,:,i], 'pt')[muon_mask[:,:,i]],
            ak.with_field(jets, varied_jet_pts[:,:,j], 'pt')[jet_mask[:,:,j]]
         ],
         axis=1
    )
    for j in range(-1, 2)] for i in range(-1, 2)
]
# modulo some code golf and being more clever with awkward.

I believe a large degree, and probably all, of this can be well and generally automated, with some significant thought.

lgray · 2021-04-14T17:57:34Z

lgray
Apr 14, 2021
Maintainer

Moving some discussion on slack to here.
Here's some notes from @alexander-held working on a prototype thinking about what is wanted:
https://github.com/alexander-held/coffea-to-cabinetry/blob/main/coffea-to-cabinetry.ipynb

Some further thinking:

One challenge here is that it is not yet clear to me how to best integrate analysis code into the whole picture. I have envisioned so far that users write processors that implement the analysis and calculate all columns, and then a yml like the one in that notebook defines the fit model. The code reading the yml would use the provided processor to fill histograms. This part is definitely not fully thought out.
For experimental systematics, there is indeed not much flexibility and it seems possible to figure out from the analysis what is needed - e.g. I use a given muon isolation working point in my analysis code, so that directly determines that I need the associated systematic uncertainty in my fit model.
For theory systematics that is a lot harder, and I would argue at least in some cases not practical. For a statistics-limited BSM search, it could be fine to use whatever default recommended ttbar model + the recommended systematic uncertainties, pulled from some external place that lists those. For a ttbar mass measurement, you’ll need all kinds of more custom treatment. Additionally, while defining the fit model it’s quite common to try out different setups. The kind of structure where you define systematics like in that example makes those tests straightforward in my experience.

some listing of core functionalities:
implementation:

the implementation per analysis will determine what and how physics objects (including generator quantities) are being used and calculate systematics for them
the implementation per physics object will determine what systematics to calculate for a given physics object observable/property based on input analysis
the implementation per physics object property aims to efficiently implement the calculation of all variations of a given systematic for a given particle
implementation of systematic variations should be interoperable with awkward array
implementation of systematics should encourage efficient memory and processor use to minimize impact on analysis speed

use:

I would like, by default, that my code is analyzed and a list of systematics to apply is generated and implemented
I should be able to blacklist systematics that I do not want applied
I should be able to whitelist systematics that I do want applied
I should be able to apply systematics by hand / ad-hoc if needed (but with strong encouragement to build it into the system)

accessibility / sharing:

I would like to be able to know easily if someone has calculated this systematic before
I would like to be able to define my own systematics and share them
Using others' systematics implementations should be like importing packages
Applying systematics should be efficient
Applying systematics by hand should be syntactically friendly
The data table made by applying systematics at any scale should make sense and be approachable to a human
Minimal dependencies

With the hope that something like that would produce a focused but well rounded systematics calculation package that allows you to build further tools atop it.

Some possibilities for what's needed in a prototype:

that you can efficiently implement the systematics in a friendly way
that you can take some physics object and statically analyze it for systematics
that you can make a user-friendly interface for building and applying systematics


nsmith- · 2021-04-14T21:34:06Z

nsmith-
Apr 14, 2021
Maintainer

Lindsey wrote a nice reply already, that mostly agrees with my view. I have a few things to add though.

Regarding Q1, I think the main factor in preferring one processor (or at least one iteration of opening a given file and analyzing some chunk of it in a work element) is how expensive it is to get the data into memory. We spend a lot of time opening, reading, and decompressing, so we should try to do a lot with the data once it is uncompressed in memory and ready to use. We have yet to see a case where the compute needs outweigh the data locality in deciding which dimension to parallelize, but that may be possible.
There is some additional consideration, once the data is in memory, how we approach filling histograms for several systematics. In the code Lindsey linked above (no pressure) I adopted a split style: for event weight systematics I developed the Weights utility to be able to modify the global event weight and retain a record of the systematic shifts associated with each type of correction, so that at the end of the analysis I can loop over each shift and fill a histogram category with the respective weight modification. Meanwhile for shift systematics like JEC, I decided to just wrap the entire processing function with an outer loop that swaps out the NanoEvents Jet and MET collections with their shifted counterparts, leaving the rest the same. This is not optimal, because I waste time recomputing intermediate information that may not depend on the systematic (e.g. lepton counts for cleaning). Ideally, objects that have associated shift systematics would propagate those automatically through arithmetic. This is a tractable problem and would be a good use of awkward array mixin and virtual array functionality, that I hope to investigate soon™️. Hmm, I guess that segued into Q3.
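The bookkeeping idea behind such a weights utility can be sketched in a few lines. The class below is a toy written for this illustration, not the coffea implementation, and its method names are assumptions: it multiplies corrections into a global event weight, remembers the ratio of each shifted correction to its nominal value, and lets the histogram-filling loop ask for the weight under each shift.

```python
import numpy as np


class ToyWeights:
    """Toy stand-in for a weight accumulator; not the coffea implementation."""

    def __init__(self, size):
        self._nominal = np.ones(size)
        self._shifts = {}  # shift name -> ratio of shifted to nominal correction

    def add(self, name, weight, weight_up=None, weight_down=None):
        self._nominal = self._nominal * weight
        if weight_up is not None:
            self._shifts[name + "Up"] = weight_up / weight
        if weight_down is not None:
            self._shifts[name + "Down"] = weight_down / weight

    @property
    def variations(self):
        return list(self._shifts)

    def weight(self, modifier=None):
        if modifier is None:
            return self._nominal
        return self._nominal * self._shifts[modifier]


weights = ToyWeights(3)
weights.add("pileup", np.array([1.0, 0.9, 1.1]),
            weight_up=np.array([1.1, 1.0, 1.2]),
            weight_down=np.array([0.9, 0.8, 1.0]))
weights.add("trigger", np.array([0.5, 0.5, 0.5]))

observable = np.array([10.0, 20.0, 30.0])
bins = [0, 15, 40]
templates = {"nominal": np.histogram(observable, bins, weights=weights.weight())[0]}
for shift in weights.variations:
    templates[shift] = np.histogram(observable, bins, weights=weights.weight(shift))[0]
```

The selection and the observable are computed once; only the final weight changes per shift.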

The other point I wanted to make was that I picked the ProcessorABC idiom up from ROOT MakeClass/MakeSelector. I figured, if these encapsulations of user code were good enough for decades, they are a safe starting point for us. I don't think it's the ultimate idiom for encapsulating user code, but did not want to hold coffea back by trying to find it. Certainly RDataFrame is an interesting foray into this, teaching users to explicitly build the computation graph. We could do the same, though I'm still a bit of a fan of implicit building through delayed/lazy operations. It depends on the resolution: you don't want to force everyone to make a node for each 4-vector addition (too much busywork) but at the same time want to discourage putting all the work into one super-node that can't be optimized without advanced introspection (MakeSelector/Processor). There's a happy medium somewhere in there.


lgray · 2021-04-24T03:25:08Z

lgray
Apr 24, 2021
Maintainer

In some discussion with @sam-may one possible good piece of interface would be to introduce something like:

import awkward as ak

behavior = {}  # behavior dict in which the mixin gets registered

@ak.mixin_class(behavior)
class HasSystematic:

    def systematics(self):
        if 'systematics' in ak.fields(self):
            return self['systematics']
        self = ak.with_field(self, 'systematics', {})
        return self["systematics"]

    def add_systematic(self, name, what, varying_function):
        # same thing except adding a systematic variation to self
        raise NotImplementedError  # left open in this sketch

3 replies

lgray Apr 26, 2021
Maintainer

Thinking on this more you'd want this to know about ValueSystematic vs. WeightSystematic for objects since they combine together differently, but both still follow the "swiss-cross" bookkeeping rule (you almost never want to know what happens when you put two JES's up since we assume they're largely uncorrelated). Since doing more than +1/-1 sigma or full cartesian products is the rare case it would be easy to write the exception for it when exploding all the histograms.
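To see why the "swiss-cross" rule matters for bookkeeping, the sketch below counts the variations for three sources under both schemes. The source names are placeholders chosen for the example.

```python
from itertools import product

sources = ["jes", "jer", "muon_scale"]

# "Swiss-cross": nominal plus one source at a time shifted up or down.
swiss_cross = [{}]
for source in sources:
    swiss_cross.append({source: +1})
    swiss_cross.append({source: -1})

# Full Cartesian product: every source at -1, 0 or +1 sigma simultaneously.
full_product = [
    dict(zip(sources, shifts)) for shifts in product((-1, 0, +1), repeat=len(sources))
]

print(len(swiss_cross), len(full_product))
```

The swiss-cross grows as 1 + 2N with the number of sources N, while the full product grows as 3^N, which is why the latter is better treated as the exception.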

alexander-held Apr 26, 2021
Maintainer Author

Many systematics in practice follow this quite restricted pattern of varying a single source up/down. HistFactory in ATLAS also relies on this pattern. Some things, like NNPDF uncertainties, do not fit into this pattern very well.

It would be great to have an interface that is not restricted to this subset of possible ways to provide systematic uncertainties, but allows the handling of more general template-based cases. In the context of cabinetry, some thoughts about describing more general cases can be found in this issue and this gist. The cases distinguished there are:

single source varied up / down by 1 sigma
single source varied up / down by arbitrary values (e.g. [-1.5σ, -0.5σ, +0.5σ, +1.5σ])
direct product between multiple sources, e.g. source 1 varied by [-1σ, 1σ] and source 2 by [-0.5σ, +0.5σ] resulting in [(-1σ, -0.5σ), (-1σ, +0.5σ), (+1σ, -0.5σ), (+1σ, +0.5σ)]
meshgrid of arbitrary number of sources evaluated at arbitrary values
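One possible way to describe all of these cases with a single interface is a mapping from source name to the list of sigma values at which it is evaluated. The function below is a hypothetical sketch of that idea, not the cabinetry design.

```python
from itertools import product


def variation_points(sources):
    """Direct product of the sigma values listed per source.

    `sources` maps a source name to the sigma values at which it is evaluated.
    """
    names = list(sources)
    return [dict(zip(names, point)) for point in product(*sources.values())]


# Single source, up / down by 1 sigma.
single = variation_points({"jes": [-1.0, 1.0]})
# Single source at arbitrary values.
arbitrary = variation_points({"jes": [-1.5, -0.5, 0.5, 1.5]})
# Direct product between two sources.
two_sources = variation_points({"source_1": [-1.0, 1.0], "source_2": [-0.5, 0.5]})
```

A meshgrid over an arbitrary number of sources is the same call with more entries in the mapping.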

lgray Apr 26, 2021
Maintainer

I will certainly take a look at what you have!

For sure! I had a typo above, I meant to say that the up/down practice would be the default, and then it's pretty straightforward to allow it to be flexible with this mixing (varying_function can return anything FWIW). One thing to be careful with is how we label or index things to indicate what they are; dealing with non-unit-sigma variations could be a pain. I suppose you could return two tensors, one with the values and the second with the number of sigmas at each index. That could put you in a corner for memory use though.

lgray · 2021-04-26T18:52:51Z

lgray
Apr 26, 2021
Maintainer

FYI for people following there will be further in-person discussion here (12 May 2021): https://indico.cern.ch/event/1033225/ along with a presentation from people on MINERvA on how they handle systematics.


sam-may · 2021-04-29T16:47:28Z

sam-may
Apr 29, 2021

Hi all,

I have a couple points to add to this discussion and a couple questions.

First, @alexander-held on your question of performance of processing together vs. multiple processors, we had a similar question in developing a columnar framework for H->gg. Our photons, for example, have many different systematic variations which result in new collections of photons, e.g. pt varied up/down. We were curious about the performance difference between

Dealing with this in the most columnar way possible (as already suggested by @lgray and @nsmith- with the * systematic[None, None, :] trick), having branches like:

events.Photon.pt.nominal
events.Photon.pt.syst1_up
...

and then performing the selection in a fully columnar way across all branches.

vs.

Dealing with this in a more intuitive (subjective :) ) way: creating multiple sets of events objects and looping through these, i.e. we might loop through a dictionary like this

"nominal_events" : events, # Photon.pt points to nominal pt
"photons_syst1_up_events" : events_photons_syst1_up # Photon.pt points to pt varied up by syst1
...

I used this script to compare the speed of the two methods as a function of the number of systematics. As the attached plot shows, for a modest number of systematics (5-10) we could expect a factor of 2-3 improvement in speed from the fully columnar way, and this increases to a factor of 7-8 for a very large number of systematics.

syst_comparison.pdf

Given this nearly order of magnitude increase in speed, I am very motivated to pursue this in the context of the Hgg framework.
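The two styles being compared can be written side by side on toy data. This is not the benchmark script referred to above; it uses a regular NumPy array with exactly two photons per event (real collections are jagged), and it only demonstrates that both styles give identical selections, not how their speed compares.

```python
import numpy as np

rng = np.random.default_rng(1)
pt = rng.exponential(40.0, size=(1000, 2))  # toy: exactly two photons per event
scales = np.array([1.0, 1.02, 0.98, 1.05, 0.95])  # nominal + 4 variations
threshold = 25.0

# "Loop" style: one pass over the selection per variation.
n_pass_loop = []
for scale in scales:
    varied = pt * scale
    n_pass_loop.append(np.sum(np.all(varied > threshold, axis=1)))

# "Columnar" style: all variations at once along a trailing axis.
varied_all = pt[:, :, None] * scales[None, None, :]  # (events, photons, variations)
n_pass_columnar = np.sum(np.all(varied_all > threshold, axis=1), axis=0)
```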

My main concern with the fully columnar method is the bookkeeping (which is obvious in the "loop" method): how do we organize and keep track of all of these syst variations? I think this is achievable, but had a couple (possibly very naive) questions:

In the example of constructing some dummy systematics as:

events.Muon.pt * np.array([1.0, 1.05, 0.95])[None, None, :]

how could I rename my events.Muon.pt object to have something like:

events.Muon.pt # nominal pt
events.Muon.systematics.pt.nominal # nominal pt
events.Muon.systematics.pt.up # up varied pt
...

?

How could I implement some custom function for calculating the systematics? In other words suppose I want to do something like (pseudocode):

events.Muon.pt * np.array([1.0, arbitrary function/lookup table, arbitrary function/lookup table])[None, None, :]
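Both questions can be illustrated with a small NumPy sketch. The lookup table, its values and the function name are hypothetical, and the example works on a flat array of pt values rather than on a jagged `events.Muon.pt`; it shows a per-object uncertainty looked up by pt bin, stacked along a trailing axis, and then exposed under names instead of positions.

```python
import numpy as np

# Hypothetical lookup table: relative pt uncertainty per pt bin.
pt_edges = np.array([0.0, 20.0, 50.0, np.inf])
rel_unc = np.array([0.05, 0.03, 0.02])


def pt_variations(pt):
    """Stack nominal / up / down along a new trailing axis."""
    unc = rel_unc[np.digitize(pt, pt_edges) - 1]  # per-object lookup
    return np.stack([pt, pt * (1.0 + unc), pt * (1.0 - unc)], axis=-1)


names = ("nominal", "up", "down")
pt = np.array([10.0, 30.0, 100.0])  # flat toy array of muon pt values
stacked = pt_variations(pt)

# Name the columns so that downstream code does not rely on positions.
pt_systematics = {name: stacked[..., i] for i, name in enumerate(names)}
```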

Thanks in advance & please let me know if anything in my post is unclear.

7 replies

lgray Apr 30, 2021
Maintainer

Honestly, I think this idea of many collections of photons is a wonky way to think about it. It's a nice way to work if you're constrained by needing to put stuff in CMSSW collections, but there are much more natural, expressive, and efficient ways to organize the data when you're just considering a columnar store.

lgray May 3, 2021
Maintainer

So I've been thinking a bit more about this and I think the way we wanna do this at the end of the day is evaluating the systematics in the [:, :, [systematics list]] form and then making that form accessible as objects with the appropriate variation.

This should let us keep the speed while also allowing downstream code to be written easily; otherwise you have to keep putting systematics loops in everything and that's annoying. Having some function(s) that is (are) your analysis payload that just need to be run in a systematics loop with different objects passed in seems more natural and efficient to me.

sam-may May 3, 2021

Hi @lgray, all,

I've been playing with this quite a bit for the past week and some challenges have arisen that I didn't initially expect. I summarized them in this thread, for anyone who is curious.

Basically, it boils down to:

"Exploding" the systematics properly is non-trivial when dealing with multiple sets of objects, each with multiple fields which have their own variations. In particular, when we want to make combinations of these objects (e.g. calculating mass of electron/photon pairs).
The initial improvement I quoted from doing things in a fully vectorized way, nearly an order of magnitude, was a bit of an overestimation. When we consider this in the context of: there are other cuts which do not need to be re-calculated, there are other aspects to the script (reading/writing files, actually calculating the systematics variations from json/function, etc), my back-of-the-envelope estimate is that in a realistic scenario (realistic for H->gg, at least), the improvement is closer to a factor of ~2.

Dealing with systematics in a fully vectorized way is more efficient in terms of computation, that is very clear. And an analysis can certainly be implemented with systematics dealt with in a fully vectorized way. But, it has been quite difficult for me to get things implemented properly. Maybe there is some obvious way that I am missing for the troublesome cases I describe, please let me know if you think that's the case.

lgray May 3, 2021
Maintainer

That's fine - then we can focus on user interface and expression rather than speed.... But why not have both?

So I think I was wrong in my suggestion on the other discussion that this is from python loops; it is likely more about being very careful about when to render columns and copy them. I'd suggest giving a bit of time to being careful with how you do that in your script: trigger as few copies as possible in your python for loop. A python loop of length 20 should be pretty fast; you worry more when this number is more like ~1000 or has pyroot in the loop.

My 2 cents. I think most of this stuff is not as mutually exclusive as you are writing. You can do this with the speed of the vectorized method and with a reasonable/clear user interface if you use ak.virtuals to index into whatever column of the systematic you want by a getattr or something else, and then cache the tuple of systematic variations for later re-indexing if you want a different one.
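The access pattern described here can be mimicked without awkward at all. The sketch below is a plain NumPy stand-in, not an implementation with awkward virtual arrays: a hypothetical `VariedColumn` computes the stacked variations on first attribute access, caches them, and re-indexes the cached result for every later request.

```python
import numpy as np


class VariedColumn:
    """Toy wrapper: compute all variations once, expose each by attribute name."""

    def __init__(self, nominal, scales):
        self._nominal = np.asarray(nominal)
        self._scales = dict(scales)  # variation name -> multiplicative factor
        self._stacked = None  # filled lazily on first access
        self.n_evaluations = 0

    def _evaluate(self):
        if self._stacked is None:
            factors = np.array(list(self._scales.values()))
            self._stacked = self._nominal[..., None] * factors
            self.n_evaluations += 1
        return self._stacked

    def __getattr__(self, name):
        # Only called for attributes not found normally, i.e. variation names.
        scales = self.__dict__.get("_scales", {})
        if name not in scales:
            raise AttributeError(name)
        index = list(scales).index(name)
        return self._evaluate()[..., index]


muon_pt = VariedColumn([10.0, 20.0], {"nominal": 1.0, "up": 1.05, "down": 0.95})
up = muon_pt.up      # triggers the evaluation of all variations
down = muon_pt.down  # re-indexes the cached result
```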

Similarly, with awkward behaviors you can control how each systematic type is exploded (i.e. make a new subclass that defines def explode() for a specific systematics record array type), and then build up the list of systematic variations based on how your objects are decorated. Then analyzers just focus on writing analysis functions to do what they want, and so long as they annotate their objects correctly a rules-based system can take care of the rest.

Neither of these are really terrible patterns to code and the mechanics are largely covered by a base class so you can keep the threshold for users making contributions pretty low.
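The rules-based part of this could look roughly like the following, written with plain Python classes instead of awkward behaviors. All class and function names are invented for the sketch; the point is only that each systematic type carries its own `explode()` rule and a generic function collects the variations from whatever decorations are present.

```python
class Systematic:
    """Base class: subclasses define how their variations are exploded."""

    def __init__(self, name):
        self.name = name

    def explode(self):
        raise NotImplementedError


class UpDownSystematic(Systematic):
    def explode(self):
        return [f"{self.name}_up", f"{self.name}_down"]


class EnsembleSystematic(Systematic):
    """E.g. a set of replicas, each of which is its own variation."""

    def __init__(self, name, n_members):
        super().__init__(name)
        self.n_members = n_members

    def explode(self):
        return [f"{self.name}_{i}" for i in range(self.n_members)]


def all_variations(decorations):
    """Collect variations from however the objects were decorated."""
    variations = ["nominal"]
    for systematic in decorations:
        variations.extend(systematic.explode())
    return variations


variations = all_variations([UpDownSystematic("jes"), EnsembleSystematic("pdf", 3)])
```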

lgray May 3, 2021
Maintainer

Moreover - if you can define the exploding rules per systematic type you can micro-optimize for further vectorization where possible, like when filling histograms.

lgray · 2021-05-12T23:18:57Z

lgray
May 12, 2021
Maintainer

After today's discussion in the Analysis Systems meeting, I've put together a notebook with the core functionality that I was walking through. The binder setup works now, binder defaults to pypy. Have a look.

Anyway, you can get it all from https://github.com/CoffeaTeam/coffea/blob/systematics_work/binder/systematics_wip.ipynb.
This is really a work in progress and is not even close to a final product, but starts to define a shape.

What I'd really like to try next is understanding:

how exploding the systematics works in practice
moving systematics handling deeper into nanoevents and making it more possible/clear how to specify various kinds of systematics.

I'll keep y'all updated.

4 replies

lgray May 13, 2021
Maintainer

@sam-may @simonepigazzini you may wanna have a look, feedback would be nice. It's early stages but it gives a feeling.

lgray May 13, 2021
Maintainer

OK - now I've updated this to be within nanoevents and all the basic physics objects inherit from it. I may move it deeper since it's fairly lightweight.

Presently working on adding weight systematics to the Up/Down systematic class.

@alexander-held Perhaps you could try implementing some of the more exotic systematics you mentioned. I can try to go off of what you talked about above as well.

lgray May 13, 2021
Maintainer

Just to have something that's somewhat guided, I've made a draft PR here: #529 .

It is more or less an extension of this discussion.

lgray May 16, 2021
Maintainer

Updated - binder works now.

lgray · 2021-05-19T17:23:12Z

lgray
May 19, 2021
Maintainer

More thoughts on interfaces to statistical tools. I'd like to be able to build elements in an ensemble by doing something like this:

import numpy as np

def get_universe(*args):  # assuming up/down 1sigma systematics only, but this easily translates
    syst_table = build_table(*args)  # take input event, physics objects, etc., make a table (build_table is not defined here)
    # one draw per systematic; clipping then truncating toward zero gives -1, 0 or +1
    universe = np.clip(
        np.random.normal(size=len(syst_table)),
        -1.5,
        1.5,
    ).astype(np.int32)
    # explode table back into parts, promoting chosen variations as returned objects
    return tuple(syst_table[:, universe])

This way we just have to agree on where things are put and that build_table can rearrange those things into a table indexed by systematics first. Even in an eventwise loop.
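A toy version of the universe sampling can be run with a small NumPy table. This is only one reading of the idea: the layout (systematics on the first axis, then down/nominal/up, then events) and the per-systematic indexing are assumptions chosen for this sketch, not the layout used in the notebook or in #529.

```python
import numpy as np


def sample_universe(syst_table, rng):
    """Pick one variation per systematic.

    syst_table: shape (n_syst, 3, n_events); axis 1 is (down, nominal, up).
    Returns the chosen shifts (-1, 0 or +1) and the selected rows.
    """
    n_syst = syst_table.shape[0]
    # Clip a unit normal to [-1.5, 1.5]; truncation toward zero gives -1, 0, +1.
    shift = np.clip(rng.normal(size=n_syst), -1.5, 1.5).astype(np.int32)
    # shift + 1 maps (-1, 0, +1) onto the (down, nominal, up) axis.
    chosen = syst_table[np.arange(n_syst), shift + 1]
    return shift, chosen


rng = np.random.default_rng(0)
table = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)  # made-up values
shift, chosen = sample_universe(table, rng)
```

With this mapping a systematic stays at nominal whenever the draw lies strictly between -1 and +1.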

0 replies

eguiraud · 2021-05-19T17:37:03Z

eguiraud
May 19, 2021

Hi, sorry the discussion at vCHEP had to be cut off -- we have been discussing this topic a bit for RDataFrame but I still have to really put my head down and think things through properly (and get feedback from people that know the physics use cases better than me, etc.).

As RDF builds a computation graph, we can walk that graph before starting the event loop to figure out how many and which combinations of variations we need to calculate. Users "just" have to tell RDF which quantities vary and how (providing a callback). Back in December I had this short deck of slides to support that discussion: https://eguiraud.web.cern.ch/eguiraud/decks/20201210_systematics_ppp , not sure how relevant it is for coffea/awkward (e.g. due to C++/Python and event-wise/column-wise differences).
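The graph-walking step can be sketched in a few lines of plain Python. This is not RDataFrame code or its API; the dictionaries graph and varied, the function variations_needed, and the node names are all hypothetical and only show how declared variations can be collected per result before any event is processed.

```python
def variations_needed(graph, varied, target):
    """Collect the names of all variations that can affect `target`.

    graph:  {node: [nodes it depends on]}
    varied: {node: [variation names declared on that node]}
    """
    seen, stack, names = set(), [target], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        names.update(varied.get(node, []))
        stack.extend(graph.get(node, []))
    return sorted(names)


# Toy graph: two histograms, only one of which depends on varied inputs.
graph = {
    "hist_mjj": ["mjj", "weight"],
    "mjj": ["jet_pt"],
    "weight": ["btag_sf"],
    "hist_nmu": ["n_muons"],
}
varied = {
    "jet_pt": ["jesUp", "jesDown"],
    "btag_sf": ["btagUp", "btagDown"],
}

mjj_variations = variations_needed(graph, varied, "hist_mjj")
nmu_variations = variations_needed(graph, varied, "hist_nmu")
```

Here hist_nmu needs no varied copies at all, so only the nominal histogram would be booked for it.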

5 replies

lgray May 19, 2021
Maintainer

This discussion has boiled down into asking questions about what the interface should look like.
I don't think there's any reason we should have the boundary between data reduction / histograming and high-level statistical analysis code depend on what software framework is being used.

The user interface you're indicating for systematics in RDataFrame is more or less the event-loop analogue of #529, except far less integrated into the interpretation layer of the data format.

eguiraud May 19, 2021

I think what can/should be common between the frameworks is what the data reduction looks like (dictionaries/lists of histograms vs many-dim histograms with categorical axes vs ...). And of course also a high-level description of the list of systematics (e.g. in YAML format like in cabinetry) would be framework-agnostic, and different frameworks/tools could all speak systematics-in-yaml.

more or less the event-loop analogue of #529

one difference is that in RDF users would say (if the design I have in mind concretizes) "vary x like this" but then can still refer to x as if it was just the nominal quantity, not a triplet -- the "duplication" of calculations happens internally.
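A small sketch of what "refer to x as if it was just the nominal quantity" could look like in Python: a wrapper that carries its variations through arithmetic. The class Varied and the example numbers are hypothetical; this is neither the RDataFrame design nor the coffea one, just an illustration of the interface idea.

```python
class Varied:
    """A quantity that carries its variations but is used like a number."""

    def __init__(self, nominal, **variations):
        self.nominal = nominal
        self.variations = dict(variations)

    def _get(self, name):
        # Fall back to nominal when this quantity has no such variation.
        return self.variations.get(name, self.nominal)

    def _combine(self, other, op):
        if not isinstance(other, Varied):
            other = Varied(other)
        names = set(self.variations) | set(other.variations)
        return Varied(
            op(self.nominal, other.nominal),
            **{n: op(self._get(n), other._get(n)) for n in names},
        )

    def __add__(self, other):
        return self._combine(other, lambda a, b: a + b)

    def __mul__(self, other):
        return self._combine(other, lambda a, b: a * b)

    # Addition and multiplication commute, so the reflected forms can reuse them.
    __radd__ = __add__
    __rmul__ = __mul__


pt = Varied(50.0, jesUp=55.0, jesDown=45.0)
sf = Varied(1.0, btagUp=1.5)
result = 2 * pt * sf + 1  # written as if pt and sf were plain numbers
```

The duplication of the calculation per variation happens inside _combine, so the analysis expression itself never mentions a triplet.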

lgray May 19, 2021
Maintainer

one difference is that in RDF users would say (if the design I have in mind concretizes) "vary x like this" but then can still refer to x as if it was just the nominal quantity, not a triplet -- the "duplication" of calculations happens internally.

Yep, we aim to have that too, it's a pleasant interface. Requires knowing if a given quantity has variations that point at it. Understanding what the bookkeeping of the systematics needs to look like, how you build up the list, etc, etc. needs to come first though. Stacking variations for better memory contiguity and tweaking the interface to act "natural" can come later, it's fairly easy to construct views of arrays with different indexing. I think there are enough examples in this discussion to show it can be done in general.

I'm first tackling the problem of applying the systematics in an easily tracked, lazily evaluated way. The equivalent of your compute graph tracing, just done with virtual arrays and some indexing. So far that's been pretty straightforward to get a working example. Using yaml or whatever to guide that is a thin shim.

Ditto for filling histograms with some enormous record containing systematic variations. Though there's a simplification here: if our data has some naming convention / schema then we don't even need to know how anything was made, just read and fill into variations.
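The naming-convention shortcut can be sketched as follows. The convention itself ("observable_variation", with unsuffixed columns treated as nominal) and the function group_by_variation are assumptions made for this example; a real schema would define its own rules.

```python
def group_by_variation(column_names, sep="_"):
    """Group columns like 'pt_jesUp' into {variation: {observable: column}}.

    Assumes observable names do not contain `sep`; columns without a
    suffix are treated as nominal.
    """
    groups = {}
    for col in column_names:
        observable, _, variation = col.partition(sep)
        groups.setdefault(variation or "nominal", {})[observable] = col
    return groups


columns = ["pt", "pt_jesUp", "pt_jesDown", "mass", "mass_jesUp"]
groups = group_by_variation(columns)
```

A fill loop could then iterate over groups and fill one histogram category per variation without knowing how any column was produced.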

pieterdavid May 20, 2021

@eguiraud I've been using a lazy automatic version of your (2-3) slide - it makes the graphs grow fast, but works well otherwise.
What's not obvious to me in your proposal is if there is a name (or similar) for each variation. There are cases where a variation does not come from just one place, e.g. a correction that needs to be applied to two leptons, or a jet energy correction variation that also affects a b-tagging scalefactor (so when using the jesTotalup jet pt variation, the corresponding variation of one of the weight factors should be used) - there may be other ways to do this, but a name for each variation to do this grouping, which can then also be used as dictionary key in ROOT::RDF::VariationsFor or equivalent, seems to cover all cases I've run into so far.
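The role of the variation name as a grouping key can be shown with a small dictionary-based sketch. The function resolve, the column names, and all numbers are made up for illustration; this is not how bamboo or RDataFrame store variations.

```python
def resolve(columns, variations, name):
    """Return the full set of inputs to use for variation `name`.

    columns:    {column: nominal values}
    variations: {variation name: {column: replacement values}}
    Every column not replaced by the named variation stays nominal.
    """
    inputs = dict(columns)
    inputs.update(variations.get(name, {}))
    return inputs


nominal = {
    "jet_pt": [40.0, 60.0],
    "btag_sf": [1.0, 0.9],
    "mu_sf": [0.98, 0.99],
}
variations = {
    # One name moves both the jet pt and the b-tagging scale factor.
    "jesTotalup": {"jet_pt": [42.0, 63.0], "btag_sf": [1.02, 0.91]},
    "muSfUp": {"mu_sf": [0.99, 1.0]},
}

jes_inputs = resolve(nominal, variations, "jesTotalup")
```

Because the name is the key, the jet pt shift and the matching scale factor shift are always picked up together, and the same key can label the resulting histogram.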

eguiraud May 20, 2021

it makes the graphs grow fast

I think in many cases we could have each node of the graph do more rather than having more nodes in the graph.

Applying variations to multiple variables simultaneously: I had thought of building a "variation group" object in that case, with constructors that let you specify how a bunch of variables need to be modified. Then you pass the variation group to the framework and that provides the computation graph with the semantic information about what needs to vary how.
Mentioning the idea here in case it's something interesting for coffea as well -- but I'll definitely look into how bamboo does things and probably ping you, thanks @pieterdavid
