fix: Daskify Elementlinks in PHYSLITE schema #872

nikoladze · 2023-07-28T23:59:20Z

Thanks to help from @lgray, I managed to daskify part of the cross-referencing functionality in the PHYSLITE schema. For example, this now works:

import os
from coffea.nanoevents import NanoEventsFactory, PHYSLITESchema

def _events():
    path = os.path.abspath("tests/samples/DAOD_PHYSLITE_21.2.108.0.art.pool.root")
    factory = NanoEventsFactory.from_root(
        {path: "CollectionTree"},
        schemaclass=PHYSLITESchema,
        permit_dask=True,
    )
    return factory.events()

events = _events()

>>> events.Electrons.trackParticles.pt.compute()
<Array [[], [], ..., [[2.78e+04, 1.87e+04]]] type='40 * var * var * ?float32'>

👍

However, there are a couple of issues ...

The first one is what the currently failing test test_load_single_field_of_linked shows: in some constellations, the information about which column is needed to load the offsets for calculating the global index gets lost. For example, if I run the _get_global_index function manually:

import dask_awkward as dak
from coffea.nanoevents.methods.physlite import _get_global_index

gi = _get_global_index(
    events.CaloCalTopoClusters,
    events.Electrons._eventindex,
    events.Electrons.caloClusterLinks.m_persIndex,
)

Here we see it knows it has to load the rawE column, because the _get_global_index function picks the first column of the target collection to get the offsets:

>>> dak.necessary_columns(gi)
{'from-uproot-c0bb25dde97be6a3f9868eea400e2eb2': ['CaloCalTopoClusters.rawE', 'Electrons.caloClusterLinks.m_persIndex', 'Electrons._eventindex']}

However, in the full picture, when I use this to get the linked caloClusters, the column gets lost once I select a single field of the caloClusters:

>>> dak.necessary_columns(events.Electrons.caloClusters.calE)
{'from-uproot-d7496f2416955eff723b64b6e72e9c5b': ['Electrons.caloClusterLinks.m_persKey',
  'Electrons._eventindex',
  'CaloCalTopoClusters.calE',
  'Electrons.caloClusterLinks.m_persIndex']}

... no more rawE in there, and consequently it fails when I do .compute(), complaining that it can't find rawE.

I'm not quite sure yet where the information gets lost. Let me know if you have any ideas.
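To make the role of the offsets column concrete, here is a minimal plain-Python sketch. It is not the actual _get_global_index implementation: the names target_offsets and global_index and the toy numbers are made up for illustration, and it assumes the link indices are local to each event, so the per-event lengths of the target collection (taken from any one of its columns, e.g. rawE) are needed to shift them into a flat, global index.

```python
from itertools import accumulate


def target_offsets(counts):
    """Start position of each event's entries in the flattened target collection."""
    return [0, *accumulate(counts)]


def global_index(offsets, local_index):
    """Shift per-event local link indices by the offset of their event."""
    return [
        [offsets[event] + i for i in links]
        for event, links in enumerate(local_index)
    ]


# Toy data: three events with 2, 0 and 3 clusters in the target collection.
offsets = target_offsets([2, 0, 3])
# Local link indices per event (hypothetical values).
links = [[1, 0], [], [2]]
flat_links = global_index(offsets, links)
```

The point of the sketch is that the offsets come from the target collection itself, so some column of that collection has to be read even when only a different field (such as calE) is requested afterwards.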

The second issue: linking into a location based on what the m_persKey field in the ElementLink tells us does not play well with dask, since we then don't know beforehand which columns to load and, in the worst case, not even the type of the resulting array. So for now, linking into multiple collections that we don't know beforehand doesn't work. This is probably not a huge issue, because I think we only have this for the truth collections. For those it could be possible to work around it by loading all corresponding truth columns and building a homogeneous type with only the common fields. But for now I left this broken ...
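The "homogeneous type with only common fields" idea could look roughly like the following sketch. It only covers the step of finding the shared fields; the helper name common_fields and the collection and field names in the example are hypothetical and not taken from the schema.

```python
def common_fields(collections):
    """Fields present in every collection, in the order of the first one."""
    field_lists = list(collections.values())
    if not field_lists:
        return []
    shared = set(field_lists[0]).intersection(*field_lists[1:])
    return [name for name in field_lists[0] if name in shared]


# Hypothetical truth collections and fields, for illustration only.
truth = {
    "TruthElectrons": ["px", "py", "pz", "e", "pdgId", "status"],
    "TruthMuons": ["px", "py", "pz", "e", "pdgId"],
    "TruthPhotons": ["px", "py", "pz", "e", "pdgId", "classifierParticleOrigin"],
}
columns = common_fields(truth)
```

With a fixed list like this, both the columns to load and the resulting record type would be known before any data is read, which is what dask needs.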

nikoladze · 2023-07-30T11:20:32Z

Another thing I'm not quite sure about: to implement the .trackParticle (without plural) property of the Electron (which just picks the first trackParticle), I had to put in a switch and use _dask_array_ instead of self when it is passed, because self turns out to be a typetracer in this case. I don't quite understand why that is ... Is it because I'm using another behavior method inside?

corresponding code:

https://github.com/CoffeaTeam/coffea/blob/06a1627c12c8867363696d1ab3b3da5bbb8b8066/src/coffea/nanoevents/methods/physlite.py#L205-L212

lgray · 2023-07-30T11:39:53Z

nesting behavior methods is full of nastiness in dask awkward - for your own sanity - you probably need to figure out how to flatten your calling structure a bit.

lgray · 2023-07-30T11:43:05Z

or I should say - rather - instead of using the behavior directly - you need to call the thing that is making the trackParticles (with an s) property and call that directly using _dask_array_ so it has the right context and is building a task graph.

right - I figured this out before.

Anyway - keep in mind when writing this stuff that:

- when you call a behavior you're calling a behavior of the metadata for the dask awkward array (i.e. a typetracer) and that will not build a task graph
- therefore - you must figure out how to snake the dask array down
- nested behaviors are immediately a no go, so call the functions to produce the things you need (since you're building a task graph then!)
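A toy model of the difference, in plain Python. This is not dask-awkward or coffea code: Lazy and Meta are made-up stand-ins for a dask collection and its typetracer metadata, and the function names are illustrative only. It just shows why composing plain functions on the lazy object keeps extending the graph, while calling one behavior from inside another on the metadata records nothing.

```python
class Lazy:
    """Toy stand-in for a dask collection: records operations, computes nothing."""

    def __init__(self, graph):
        self.graph = tuple(graph)

    def apply(self, name):
        # Every operation returns a new Lazy with one more task recorded.
        return Lazy(self.graph + (name,))


class Meta:
    """Toy stand-in for typetracer metadata: knows the type, records no tasks."""

    def track_particles(self):
        return Meta()

    def track_particle(self):
        # Nested behavior call on the metadata: no task is ever recorded.
        return self.track_particles()


def track_particles(arr):
    return arr.apply("track_particles")


def first(arr):
    return arr.apply("first")


def track_particle(arr):
    # Flat calling structure: plain functions applied to the lazy array.
    return first(track_particles(arr))


lazy_result = track_particle(Lazy(["read"]))
meta_result = Meta().track_particle()
```

In the flat version every step sees the lazy object, so the recorded graph ends up as read, track_particles, first; the Meta path returns something that never carried a graph at all.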

lgray · 2023-07-31T13:27:24Z

I had some further thoughts on this, that are related to #822, where we can have a hook in the nanoevents schema that opens one input partition and determines a bit of metadata about how things link up using a small slice of raw data.

This would have a small use penalty but nothing major since the processing is light.

If it makes the initial data exploration loop too slow one can always keep things local with a .compute() or cache with a .persist().

Would this be helpful?

nikoladze · 2023-08-22T12:39:27Z

finally coming back to this

> or I should say - rather - instead of using the behavior directly - you need to call the thing that is making the trackParticles (with an s) property and call that directly using _dask_array_ so it has the right context and is building a task graph.

OK, I modified it now to do that.

> I had some further thoughts on this, that are related to #822, where we can have a hook in the nanoevents schema that opens one input partition and determines a bit of metadata about how things link up using a small slice of raw data.
>
> This would have a small use penalty but nothing major since the processing is light.
>
> If it makes the initial data exploration loop too slow one can always keep things local with a .compute() or cache with a .persist().
>
> Would this be helpful?

I was thinking of maybe having this as an explicit step, where I pass one file and it looks through all of its ElementLink branches and spits out where they point to. I'm not sure I want to do this every time on the first partition. Hopefully this also doesn't change very often.
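Such an explicit step might boil down to something like the sketch below. Everything in it is an assumption for illustration: the helper name link_targets, the idea of a lookup table from key values to collection names, and all branch names, keys and collection names in the toy data. Reading the m_persKey values from a file is left out entirely.

```python
def link_targets(sample, key_to_collection):
    """Map each ElementLink branch to the sorted names of the collections it points to."""
    targets = {}
    for branch, keys in sample.items():
        names = {key_to_collection.get(k, f"<unknown key {k}>") for k in keys}
        targets[branch] = sorted(names)
    return targets


# Hypothetical m_persKey values seen in a small slice of one file.
sample = {
    "Electrons.caloClusterLinks": [101, 101, 101],
    "Electrons.trackParticleLinks": [202, 202],
    "Electrons.truthParticleLink": [303, 404, 303],
}
# Hypothetical lookup table from key value to collection name.
key_to_collection = {
    101: "CaloCalTopoClusters",
    202: "GSFTrackParticles",
    303: "TruthElectrons",
    404: "TruthPhotons",
}

targets = link_targets(sample, key_to_collection)
# Branches that point into more than one collection need special handling.
ambiguous = [branch for branch, names in targets.items() if len(names) > 1]
```

The output of such a step could be stored once per sample type and passed to the schema, instead of being recomputed on the first partition each time.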

With the changes I currently have in this PR, there is still a bug somewhere for events.Electrons.caloClusters.calE. The issue is that when I run

events.Electrons.caloClusters.calE.compute()

it complains that it doesn't have the column rawE (which is used to determine the offsets). Playing with it a bit, I found that it works when I pass optimize_graph=False. Looking at dask.visualize(events.Electrons.caloClusters.calE, optimize_graph=True), the result looks a bit odd.

Maybe something is wrong with the graph optimization? And, as mentioned before, dak.necessary_columns also doesn't list rawE ...

lgray · 2023-08-22T16:46:46Z

something screwed up when you merged master! can you rebase this PR to current coffea master?


nikoladze · 2023-08-30T13:45:07Z

It seems the issue I had with necessary columns being missed is fixed if I explicitly touch (in the typetracer case) the column that I use for calculating the offsets in the link target:

https://github.com/CoffeaTeam/coffea/blob/8ec38cfda7c374568ab06025c423aef0272caa9d/src/coffea/nanoevents/methods/physlite.py#L105-L108

This uses awkward.typetracer.touch_data, which is currently only available on the awkward main branch. Looking at the implementation of awkward.typetracer.length_zero_if_typetracer (which I also use in that function), it seems this also touches data, but I feed in load_column.layout.offsets.data instead of the whole load_column. For some reason this was not sufficient to let dask_awkward know we need this column.
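As a rough mental model of why the explicit touch matters, here is a toy tracer in plain Python. It is not how awkward or dask_awkward implement this: TracedColumns and linked_field are made up, and the sketch simply assumes that the set of columns to read is whatever got touched while tracing the computation without data.

```python
class TracedColumns:
    """Toy stand-in for data-less arrays: remembers which columns were touched."""

    def __init__(self, names):
        self.names = set(names)
        self.touched = set()

    def touch(self, name):
        if name not in self.names:
            raise KeyError(name)
        self.touched.add(name)


def linked_field(columns, field, offsets_column, touch_offsets):
    """Trace which columns are needed to read `field` of a linked collection."""
    columns.touch("Electrons.caloClusterLinks.m_persIndex")
    columns.touch(field)
    if touch_offsets:
        # Explicitly mark the column that provides the target offsets.
        columns.touch(offsets_column)
    return set(columns.touched)


names = [
    "Electrons.caloClusterLinks.m_persIndex",
    "CaloCalTopoClusters.calE",
    "CaloCalTopoClusters.rawE",
]
without_touch = linked_field(
    TracedColumns(names), "CaloCalTopoClusters.calE", "CaloCalTopoClusters.rawE", False
)
with_touch = linked_field(
    TracedColumns(names), "CaloCalTopoClusters.calE", "CaloCalTopoClusters.rawE", True
)
```

In this model the traced run without the explicit touch never records rawE, which mirrors the symptom above where rawE was missing from the necessary columns.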


lgray · 2023-09-07T12:17:46Z

@nikoladze does this PR or #888 need any further addition? or should they be merged?

nikoladze · 2023-09-08T06:39:23Z

They can both be merged - I'll open new PRs when I work on the remaining issues. Thanks!

nikoladze changed the title ~~Daskify Elementlinks in PHYSLITE schema~~ fix: Daskify Elementlinks in PHYSLITE schema Jul 29, 2023

nikoladze mentioned this pull request Jul 29, 2023

Tracking issue for PHYSLITE schema #540

Closed

9 tasks

nikoladze and others added 9 commits August 23, 2023 09:49

global index fetching working

9e832e0

track particles working

29c359e

trackParticle

109e73a

cleanup and add caloclusters

2e16a9e

comment about multiple elementlinks

f4d6668

cleanup tests and add test for single field of linked collection

4da309f

[pre-commit.ci] auto fixes from pre-commit.com hooks

dbccef0

for more information, see https://pre-commit.ci

pylint

dbfadd8

flat calling structure for trackParticle(s) behavior methods

ab5164b

nikoladze force-pushed the dev-dak-elementlinks branch from 6f32947 to ab5164b Compare August 23, 2023 07:55

fix column touching for _get_target_offsets

8ec38cf

nikoladze added 3 commits August 31, 2023 15:24

make test actually fail

e6127d5

use layout._touch_data since public touch_data not yet available in ak 2.3.3

c4385b1

try to avoid loading double-jagged columns for getting offsets in elementlink calculation

e2dd3f0

nikoladze mentioned this pull request Aug 31, 2023

fix: allow for collections that contain non-jagged arrays in PHYSLITE schema #888

Merged

lgray and others added 3 commits September 6, 2023 16:20

Merge branch 'master' into dev-dak-elementlinks

ca5794d

Merge branch 'master' into dev-dak-elementlinks

36f9c06

go back to using public touch_data since we have ak 2.4.2 now

6abc42c

lgray merged commit 47a9304 into scikit-hep:master Sep 12, 2023
13 checks passed
