Problems that are challenging with columnar approach #830

kmohrman · 2023-06-06T00:03:27Z

This post is meant to start a discussion about problems that are challenging to solve in columnar ways, as discussed at the 6/5/23 coffea users' meeting.

To start things off, here are two examples of cases that have been challenging for me during various projects over the past couple years:

1. Cases that seem to call for nested ak.where statements. This came up for me a few times when I was working on a project that involved trying to identify hadronic tops. An example line is here (conceptually, this line is just trying to find the mass of the two jets in the triplet that are not the b jet). For me, nested ak.where statements are difficult because there is too much happening on a single line (it's hard to write, hard to read, and hard to debug). It might have been better to find the index of the b in each triplet and construct a mask based on the indices, but then we run into the challenge described in 2.
2. Cases with many different sets of indices (e.g. from functions whose names involve arg) to keep track of. An example came up when I was trying to implement some WWZ 4l selection. I was trying to get a hold of the leptons that are consistent with coming from the Z and get a hold of the leptons that are consistent with coming from the Ws. This might be just me, but for me columnar stuff is much easier for cases where you want to ask something like "is there a same flavor opposite sign pair close to the Z" than it is for cases where you actually want to get a hold of those leptons. For the former case, you can use things like max or min or any. But for the latter case, you have to get a hold of the indices, and to me that makes the code both harder to think about and harder to read. E.g. for the WWZ stuff, this usage of ak.argcombinations and ak.local_index is not very intuitive or readable for me (even though I implemented it just a few weeks ago, at this point I'd have to put in a bunch of print statements to remind myself which indices correspond to what). But again, I might have been missing a more straightforward solution to this problem.

Looking forward to hearing if anyone has any thoughts on these challenges and to hearing about other problems that others find to be difficult to solve with the columnar approach.

lgray · 2023-06-06T00:58:30Z

lgray
Jun 6, 2023
Maintainer

Reproducing the lines of code here for less clicking, etc.

Here's the nested ak.where stuff from above:

jj_mass = ak.where(
    jjb_triplets.i0.btagDeepFlavB > btagwpl,
    (jjb_triplets.i1 + jjb_triplets.i2).mass,
    (
        ak.where(
            jjb_triplets.i1.btagDeepFlavB > btagwpl,
            (jjb_triplets.i0 + jjb_triplets.i2).mass,
            (jjb_triplets.i0 + jjb_triplets.i1).mass,
        )
    ),
)
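
For comparison, here is a minimal plain-Python sketch of what the nested ak.where computes for a single triplet. It is not the analysis code: the dict-based jets, the `btag` key, and the `invariant_mass` / `jj_mass_reference` helpers are stand-ins made up for illustration, and it inherits the same assumption that exactly one jet in the triplet passes the working point.

```python
import math


def invariant_mass(*jets):
    """Invariant mass of the summed four-vectors (E, px, py, pz)."""
    e = sum(j["E"] for j in jets)
    px = sum(j["px"] for j in jets)
    py = sum(j["py"] for j in jets)
    pz = sum(j["pz"] for j in jets)
    return math.sqrt(max(e * e - px * px - py * py - pz * pz, 0.0))


def jj_mass_reference(triplet, btag_wp):
    """Mass of the two non-b jets, checked in the same order as the nested ak.where."""
    i0, i1, i2 = triplet
    if i0["btag"] > btag_wp:
        return invariant_mass(i1, i2)
    if i1["btag"] > btag_wp:
        return invariant_mass(i0, i2)
    # Neither i0 nor i1 is tagged, so i2 is assumed to be the b jet.
    return invariant_mass(i0, i1)


# Toy triplet: two back-to-back massless light jets and one b-tagged jet.
light_a = {"E": 5.0, "px": 3.0, "py": 4.0, "pz": 0.0, "btag": 0.01}
light_b = {"E": 5.0, "px": -3.0, "py": -4.0, "pz": 0.0, "btag": 0.02}
b_jet = {"E": 50.0, "px": 0.0, "py": 0.0, "pz": 50.0, "btag": 0.90}

# The b jet sits in the middle slot, so the second branch is taken.
mass = jj_mass_reference((light_a, b_jet, light_b), btag_wp=0.5)
```

Written this way, the three branches are easy to read, which is roughly what gets lost when they are folded into one nested expression.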

And here's the ak.argcombinations / ak.local_index code:

# Attach the local index to the lepton objects
lep_collection["idx"] = ak.local_index(lep_collection, axis=1)

# Make all pairs of leptons
ll_pairs = ak.combinations(lep_collection, 2, fields=["l0", "l1"])
ll_pairs_idx = ak.argcombinations(lep_collection, 2, fields=["l0", "l1"])

# Check each pair to see how far it is from the Z
dist_from_z_all_pairs = abs((ll_pairs.l0 + ll_pairs.l1).mass - 91.2)

# Mask out the pairs that are not SFOS (so that we don't include them when finding the one that's closest to Z)
# And then of the SFOS pairs, get the index of the one that's closest to the Z
sfos_mask = ll_pairs.l0.pdgId == -ll_pairs.l1.pdgId
dist_from_z_sfos_pairs = ak.mask(dist_from_z_all_pairs, sfos_mask)
sfos_pair_closest_to_z_idx = ak.argmin(dist_from_z_sfos_pairs, axis=-1, keepdims=True)

# Construct a mask (of the shape of the original lep array) corresponding to the leps that are part of the Z candidate
mask = lep_collection.idx == ak.flatten(ll_pairs_idx.l0[sfos_pair_closest_to_z_idx])
mask = mask | (
    lep_collection.idx == ak.flatten(ll_pairs_idx.l1[sfos_pair_closest_to_z_idx])
)
mask = ak.fill_none(mask, False)
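
As a reference for what the index bookkeeping above is meant to produce, here is a small per-event sketch in plain Python. It only mirrors the logic (best SFOS pair by distance from 91.2, then a per-lepton mask); the dict-based leptons and the `lep`, `pair_mass`, and `z_candidate_mask` names are assumptions made for this example, not part of the original code.

```python
import math
from itertools import combinations

Z_MASS = 91.2


def pair_mass(a, b):
    """Invariant mass of two leptons given as dicts with E, px, py, pz."""
    e = a["E"] + b["E"]
    px = a["px"] + b["px"]
    py = a["py"] + b["py"]
    pz = a["pz"] + b["pz"]
    return math.sqrt(max(e * e - px * px - py * py - pz * pz, 0.0))


def z_candidate_mask(leptons):
    """Per-lepton booleans: True for the two leptons of the SFOS pair closest to the Z."""
    best = None  # (distance from Z, index of first lepton, index of second lepton)
    for i, j in combinations(range(len(leptons)), 2):
        if leptons[i]["pdgId"] != -leptons[j]["pdgId"]:
            continue  # not same-flavor opposite-sign
        dist = abs(pair_mass(leptons[i], leptons[j]) - Z_MASS)
        if best is None or dist < best[0]:
            best = (dist, i, j)
    if best is None:
        # No SFOS pair at all: nothing is selected (the fill_none(False) case).
        return [False] * len(leptons)
    return [k in best[1:] for k in range(len(leptons))]


def lep(pdg_id, e, px, py):
    """Toy massless lepton in the transverse plane."""
    return {"pdgId": pdg_id, "E": e, "px": px, "py": py, "pz": 0.0}


event = [
    lep(13, 20.0, 0.0, 20.0),
    lep(11, 45.6, 45.6, 0.0),
    lep(-13, 20.0, 0.0, -20.0),
    lep(-11, 45.6, -45.6, 0.0),
]
mask = z_candidate_mask(event)  # [False, True, False, True]
```

The loop version keeps the pair indices in one tuple, whereas the columnar version has to carry them in a parallel argcombinations array.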

2 replies

lgray Jun 6, 2023
Maintainer

@nsmith- @jpivarski

lgray Jun 6, 2023
Maintainer

The second one I can read through pretty straightforwardly, but I'll chalk that up to rather long-term experience. I think the messy part is needing to manipulate the argcombinations list in parallel, which can easily be confusing. Maybe there's a nicer way to do that?

The first one might be more expressive either in numba or using indices + awkward expressions. Since you're selecting the triplets such that there's only one b-tag you already have knowledge from the mask you apply as to which jets you should be summing for the invariant mass. You should be able to profit from that and keep a more coherent picture in your workflow, as opposed to throwing it away and using ak.where.

I'll find some time to whittle at alternative solutions, but those are my initial thoughts.

nsmith- · 2023-06-08T20:28:31Z

nsmith-
Jun 8, 2023
Maintainer

For case 1, this reminds me also of the FuncADL Q6 problem:

For events with at least three jets, plot the p_T of the trijet four-momentum that has the invariant mass closest to 172.5 GeV in each event and plot the maximum b-tagging discriminant value among the jets in this trijet.

The latter plot is implemented with

maxBtag = np.maximum(
    trijet.j1.btag,
    np.maximum(trijet.j2.btag, trijet.j3.btag),
)

Much like the ak.where case, you have to reduce among permutations of the combination. Dealing with the indices in record form is usually more difficult, so a long time ago I proposed adding ak.stack (scikit-hep/awkward#200) to pivot tuple records to a new axis. Then, for example, your code could be written:

jjb_triplets = ak.stack(ak.combinations(jet_collection, 3))
# remove any triplets with more or less than 1 b
jjb_triplets = jjb_triplets[ak.sum(jjb_triplets.btagDeepFlavB > btagwpl, axis=2) == 1]
# select b jet (flatten the singletons array after masking)
is_b_candidate = jjb_triplets.btagDeepFlavB > btagwpl
b_cand = ak.flatten(jjb_triplets[is_b_candidate], axis=2)
# compute mass of remaining two jets (sum() is a coffea vector mixin method)
jj_mass = jjb_triplets[~is_b_candidate].sum().mass

An implementation for ak.stack that would work in this case is

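tmp = ak.combinations(jet_collection, 3, fields=["i0", "i1", "i2"])
# give each slot a new length-1 innermost axis, so that concatenating stacks the slots instead of appending triplets
slots = [t[:, :, np.newaxis] for t in (tmp.i0, tmp.i1, tmp.i2)]
jjb_triplets = ak.concatenate(slots, axis=-1)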
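
To show why pivoting the three slots onto an axis helps, here is a sketch of the same "exactly one b, sum the other two" logic in plain NumPy on a toy rectangular array (triplets x 3 jets). This is only an analogy under simplifying assumptions: real jets are jagged per event, and the `btag` / `p4` arrays below are made-up inputs rather than anything from the analysis.

```python
import numpy as np

btag_wp = 0.5

# Toy input: 3 triplets x 3 jets; p4 holds (E, px, py, pz) for each jet.
btag = np.array([
    [0.9, 0.1, 0.2],  # exactly one b (slot 0)
    [0.1, 0.8, 0.7],  # two b's, so this triplet is rejected
    [0.2, 0.1, 0.6],  # exactly one b (slot 2)
])
p4 = np.array([
    [[50.0, 0.0, 0.0, 50.0], [5.0, 3.0, 4.0, 0.0], [5.0, -3.0, -4.0, 0.0]],
    [[10.0, 10.0, 0.0, 0.0], [10.0, 0.0, 10.0, 0.0], [10.0, 0.0, 0.0, 10.0]],
    [[13.0, 5.0, 12.0, 0.0], [13.0, -5.0, -12.0, 0.0], [50.0, 0.0, 0.0, 50.0]],
])

# Keep only triplets with exactly one b-tagged jet.
is_b = btag > btag_wp
one_b = is_b.sum(axis=1) == 1
is_b_kept, p4_kept = is_b[one_b], p4[one_b]

# Zero out the b jet and sum the remaining two four-vectors along the jet axis.
jj = np.where(is_b_kept[:, :, np.newaxis], 0.0, p4_kept).sum(axis=1)
jj_mass = np.sqrt(np.maximum(jj[:, 0] ** 2 - (jj[:, 1:] ** 2).sum(axis=1), 0.0))
```

With the slots on an axis, "which one is the b" becomes a boolean mask and a reduction rather than a chain of conditionals.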

1 reply

nsmith- Jun 8, 2023
Maintainer

Case 2 reminds me also of an ADL benchmark,

For events with at least three light leptons and a same-flavor opposite-charge light lepton pair, find such a pair that has the invariant mass closest to 91.2 GeV in each event and plot the transverse mass of the system consisting of the missing transverse momentum and the highest-p_T light lepton not in this pair.

implemented in coffea with basically the same set of primitives (ak.argcombinations, ak.local_index) as in your example. I'm not sure what extra awkward primitives would help here; perhaps ak.permutations could be useful? One would still need to create an array of the dilepton combination along with the remaining leptons after removing the dilepton, and then filter permutations of those.

In the coffea docs there is one example of giving up on array index shenanigans and using a numba function to select all valid permutations for subsequent filtering to select the best candidate permutation:

@numba.njit
def find_4lep(events_leptons, builder):
    """Search for valid 4-lepton combinations from an array of events * leptons {charge, ...}

    A valid candidate has two pairs of leptons that each have balanced charge
    Outputs an array of events * candidates {indices 0..3} corresponding to all valid
    permutations of all valid combinations of unique leptons in each event
    (omitting permutations of the pairs)
    """
    for leptons in events_leptons:
        builder.begin_list()
        nlep = len(leptons)
        for i0 in range(nlep):
            for i1 in range(i0 + 1, nlep):
                if leptons[i0].charge + leptons[i1].charge != 0:
                    continue
                for i2 in range(nlep):
                    for i3 in range(i2 + 1, nlep):
                        if len({i0, i1, i2, i3}) < 4:
                            continue
                        if leptons[i2].charge + leptons[i3].charge != 0:
                            continue
                        builder.begin_tuple(4)
                        builder.index(0).integer(i0)
                        builder.index(1).integer(i1)
                        builder.index(2).integer(i2)
                        builder.index(3).integer(i3)
                        builder.end_tuple()
        builder.end_list()

    return builder
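
To make the output of that loop nest concrete without numba or an ArrayBuilder, here is the same search written as an ordinary Python function for one event. `find_4lep_indices` and the bare list of charges are illustrative stand-ins, not the coffea docs example itself.

```python
def find_4lep_indices(charges):
    """Plain-Python version of the loop nest above, for a single event."""
    candidates = []
    nlep = len(charges)
    for i0 in range(nlep):
        for i1 in range(i0 + 1, nlep):
            if charges[i0] + charges[i1] != 0:
                continue  # first pair is not charge-balanced
            for i2 in range(nlep):
                for i3 in range(i2 + 1, nlep):
                    if len({i0, i1, i2, i3}) < 4:
                        continue  # a lepton would be used twice
                    if charges[i2] + charges[i3] != 0:
                        continue  # second pair is not charge-balanced
                    candidates.append((i0, i1, i2, i3))
    return candidates


candidates = find_4lep_indices([+1, -1, +1, -1])
# [(0, 1, 2, 3), (0, 3, 1, 2), (1, 2, 0, 3), (2, 3, 0, 1)]
```

Note that each pairing shows up twice, once per ordering of the two pairs (e.g. `(0, 1, 2, 3)` and `(2, 3, 0, 1)`), so the subsequent filtering step is what decides which pair plays which role.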

jpivarski · 2023-06-08T20:35:34Z

jpivarski
Jun 8, 2023
Maintainer

There's a general class of "iterate until converged" problems that are all hard for array-oriented programming. Here is my write-up of it and a set of exercises for working through it, all in pure NumPy/no Awkward.
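
As a tiny illustration of that class of problem (not taken from the linked write-up or exercises), here is a hedged NumPy sketch of Newton's method for square roots in which each element converges after a different number of steps. The awkward part is the bookkeeping: an `active` set has to be carried along so that converged elements stop updating. The function name, tolerance, and inputs are all assumptions for the example; it expects a 1-D array of positive numbers.

```python
import numpy as np


def newton_sqrt(values, tol=1e-12, max_iter=100):
    """Element-wise sqrt by Newton iteration; returns (roots, iterations used per element)."""
    target = np.asarray(values, dtype=float)
    x = target.copy()                       # initial guess: the value itself
    active = np.ones(x.shape, dtype=bool)   # elements that have not converged yet
    n_iter = np.zeros(x.shape, dtype=int)

    for _ in range(max_iter):
        idx = np.flatnonzero(active)
        if idx.size == 0:
            break                           # everything has converged
        old = x[idx]
        new = 0.5 * (old + target[idx] / old)
        x[idx] = new
        n_iter[idx] += 1
        converged = np.abs(new - old) <= tol * np.abs(new)
        active[idx[converged]] = False      # retire converged elements only
    return x, n_iter


roots, n_iter = newton_sqrt([1.0, 1.0e6, 2.0])
```

In an imperative loop each element would simply `break` when done; in the array version the same idea turns into mask maintenance on every pass.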

0 replies

lgray · 2023-06-15T11:40:55Z

lgray
Jun 15, 2023
Maintainer

Just to bring this out of slack:

I was thinking about combinatorics embedded DSLs and it occurred to me that something that might make sense is an extension to (or something heavily inspired by) the einsum language.
Something like: (i,j),(m,n,k)->k != i,j, m,n==i,j would be "make (i,j) combinations, make (m,n,k) combinations, yield (m,n,k) combinations such that k is not i or j". Then, if you want to introduce physics knowledge, you can have some kwargs for functions that eat combinatorics output,
e.g. (we can figure out the name of the operation later):

ak.combexpr(
    leptons,
    "f((i,j)),g((m,n,k))->k != i,j, m,n == i,j",
    f=select_dileptons,
    g=select_third_lepton,
)

could easily be the majority of code in ADL benchmark #8.

I think this could be expanded to multiple inputs: ak.combexpr(leptons, jets, "leptons(i,j)*jets(l,m)") for something that means "make the outer product of dileptons with dijets", for instance. Perhaps there could be some shorthands in the product: if an arg's name is not given in a product, it's assumed to be positional, and if there's only one arg (like when you're making 4-lepton pairs with a dilepton selection on the way), then it just reuses the first arg.
Anyway - just writing down a set of thoughts that finally congealed into something meaningful and concise. Let me know what you think!
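
One possible reading of the first expression, spelled out as a plain-Python generator over indices for a single event. This is only a guess at the intended semantics: ak.combexpr does not exist, and `pair_plus_third`, `select_pair`, and `select_third` are hypothetical names standing in for the `f` and `g` kwargs.

```python
from itertools import combinations


def pair_plus_third(items, select_pair=None, select_third=None):
    """Yield (i, j, k) index triples: an (i, j) combination plus a k that is neither i nor j."""
    n = len(items)
    for i, j in combinations(range(n), 2):
        if select_pair is not None and not select_pair(items[i], items[j]):
            continue
        for k in range(n):
            if k == i or k == j:
                continue
            if select_third is not None and not select_third(items[k]):
                continue
            yield (i, j, k)


# Toy "leptons" described only by their charge.
charges = [+1, -1, +1]

all_triples = list(pair_plus_third(charges))
# [(0, 1, 2), (0, 2, 1), (1, 2, 0)]

opposite_sign = list(pair_plus_third(charges, select_pair=lambda a, b: a + b == 0))
# [(0, 1, 2), (1, 2, 0)]
```

A string-based expression would presumably compile down to something like this loop nest, with the selection functions applied as early as possible.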

5 replies

lgray Jun 15, 2023
Maintainer

Though - when considering functions to apply maybe something a bit cleaner would be allowing syntax like:

ak.combexpr(
    leptons,
    "(i,j,k)-> f((i,j)) & g((k)) & h((i,j,k))",
    f=select_dileptons,
    g=select_third_lepton,
    h=select_trilepton_system,
)

This may lead to a nice factorization of physics code as well?

lgray Jun 15, 2023
Maintainer

An expr like f((i,j,k)) means "make N-choose-three combinations and pass the objects to f"; f(i,j,k) could mean "pass the indices to f"? A whittle-y detail at this level of theorizing, though.

jpivarski Jun 15, 2023
Maintainer

Naturally, we should mention PartiQL and AwkwardQL, which were early attempts at this. Also the talk, Pattern matching for decay trees.

I think it would be better to try to do this without indexes (i, j, l, m), because a reasonably complicated decay would introduce lots of indexes and it would be hard to keep them all straight. A solution that scales well should be able to make a long-chain $B^0$ decay tractable (maybe 3 or 4 steps in the decay). Even better if the desired decay tree structure can be represented as nested objects or lists in the source code. @agoose77 and I have been talking about reviving that, making use of Numba iteration over Awkward Arrays and production of new arrays with the LayoutBuilder that is done or nearly done now (scikit-hep/awkward#2408). Another thing we've been talking about is dropping the specialized language in strings and making an embedded language out of Python objects. That would give us a place to insert constraints (e.g. "only consider combinations whose mass is within this window" as a Python lambda that gets included in the Numba compilation).

We can have that discussion more in the open, though perhaps we should get dask-awkward into the quiescent stage, first.

lgray Jun 20, 2023
Maintainer

Fair enough re: indices!

Going the directions you're talking about probably leads to more robust solutions.

There is a certain charm in being able to crack out a quick einsum-style one-liner to solve a nasty problem with four indices or fewer (that probably covers the common case!).

jpivarski Jun 20, 2023
Maintainer

I tried to use einsum in a survey of programming paradigms, but there was something wrong with the example I used. (Now I'm having a hard time remembering what was wrong with it—an expert in the audience pointed it out to me.)

ekauffma · 2023-06-15T14:26:30Z

Bringing another problem here from Slack: say you want to select two jets in each event with different requirements. For example,

Leading jet (highest pT) with $|\eta|<2.1$
Subleading jet (second highest pT) with $|\eta|<2.4$

The catch is that instead of throwing out events that don't have such qualifying jets in the zeroth and first indices, you would move on to the next indices to check the requirements there.

@nsmith- helped solve this one...

3 replies

nsmith- Jun 15, 2023
Maintainer

Assume (or ensure via ak.sort) that, per event, jets are ordered descending by $p_T$.

The imperative solution would be,

def imperative_select(event):
    for j1 in event.Jet:
        if abs(j1.eta) >= 2.1:
            continue
        for j2 in event.Jet:
            if abs(j2.eta) >= 2.4 or j2.pt >= j1.pt:
                continue
            return (j1, j2)


def imperative_select_loop(events):
    return [
        imperative_select(event)
        for event in events
    ]
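
To see the "move on to the next indices" behaviour concretely, here is the same loop run on two hand-made toy events. The `SimpleNamespace` events and the `make_event` helper are stand-ins invented for this example (not NanoAOD), and the explicit `return None` just spells out what the function above does implicitly when nothing qualifies.

```python
from types import SimpleNamespace


def imperative_select(event):
    for j1 in event.Jet:
        if abs(j1.eta) >= 2.1:
            continue
        for j2 in event.Jet:
            if abs(j2.eta) >= 2.4 or j2.pt >= j1.pt:
                continue
            return (j1, j2)
    return None  # no qualifying pair in this event


def make_event(pts, etas):
    """Toy event whose jets are already sorted by descending pt."""
    jets = [SimpleNamespace(pt=pt, eta=eta) for pt, eta in zip(pts, etas)]
    return SimpleNamespace(Jet=jets)


events = [
    # The hardest jet fails the leading cut, so the 80 GeV jet becomes "leading".
    make_event([100.0, 80.0, 60.0], [-2.3, 1.0, 2.2]),
    # No jet passes the leading cut, so nothing is selected.
    make_event([90.0, 40.0], [2.2, 2.3]),
]
selected = [imperative_select(event) for event in events]
```

In the first event the 100 GeV jet would pass the subleading cut, but it is skipped there as well because it is harder than the chosen leading jet.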

I tried for a while to do this without ak.cartesian, but I think it's probably a given that any nested for-loop in an imperative implementation implies a cartesian call in awkward. One could use combinations as well. Here are those two approaches:

def cartesian_select(events):
    jets = events.Jet
    leading_jet_cands = jets[abs(jets.eta) < 2.1]
    subleading_jet_cands = jets[abs(jets.eta) < 2.4]
    pairs = ak.cartesian({"j1": leading_jet_cands, "j2": subleading_jet_cands})
    return ak.firsts(pairs[pairs.j1.pt > pairs.j2.pt])

def combinations_select(events):
    jets = events.Jet
    pairs = ak.combinations(jets, 2, fields=["j1", "j2"])
    return ak.firsts(
        pairs[
            (abs(pairs.j1.eta) < 2.1)
            & (abs(pairs.j2.eta) < 2.4)
            & (pairs.j1.pt > pairs.j2.pt)
        ]
    )

For fun, I compared the timing on a random ~6k event NanoAOD file I had:

Imperative: 10.1 s ± 275 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Cartesian: 25.7 ms ± 735 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Combinations: 24.2 ms ± 1.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

@agoose77 suggested a speed-up by pre-caching abs(eta):

def cartesian_select_faster(events):
    jets = events.Jet
    jets["abseta"] = abs(jets.eta)
    leading_jet_cands = jets[jets.abseta < 2.1]
    subleading_jet_cands = jets[jets.abseta < 2.4]
    pairs = ak.cartesian({"j1": leading_jet_cands, "j2": subleading_jet_cands})
    return ak.firsts(pairs[pairs.j1.pt > pairs.j2.pt])

but for me it did not make much difference: 24.3 ms ± 208 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

agoose77 Jun 15, 2023

@nsmith- I think the main perf benefit is slicing the more permissive cut (subleading) to obtain the less permissive cut (leading) - this should involve a loop over fewer items.

nsmith- Jun 16, 2023
Maintainer

Ah, so like

def cartesian_select_faster(events):
    jets = events.Jet
    jets["abseta"] = abs(jets.eta)
    subleading_jet_cands = jets[jets.abseta < 2.4]
    leading_jet_cands = subleading_jet_cands[subleading_jet_cands.abseta < 2.1]
    pairs = ak.cartesian({"j1": leading_jet_cands, "j2": subleading_jet_cands})
    return ak.firsts(pairs[pairs.j1.pt > pairs.j2.pt])

I don't notice much difference but also I suspect 6k events might not be enough to get rid of overhead.
