Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Daskify Elementlinks in PHYSLITE schema #872

Merged
merged 16 commits into from
Sep 12, 2023

Conversation

nikoladze
Copy link
Contributor

@nikoladze nikoladze commented Jul 28, 2023

Thanks to the help from @lgray i managed to daskify part of the cross referencing functionality in the PHYSLITE schema. So, for example, this now works:

import os
from coffea.nanoevents import NanoEventsFactory, PHYSLITESchema

def _events():
    path = os.path.abspath("tests/samples/DAOD_PHYSLITE_21.2.108.0.art.pool.root")
    factory = NanoEventsFactory.from_root(
        {path: "CollectionTree"},
        schemaclass=PHYSLITESchema,
        permit_dask=True,
    )
    return factory.events()

events = _events()
>>> events.Electrons.trackParticles.pt.compute()
<Array [[], [], ..., [[2.78e+04, 1.87e+04]]] type='40 * var * var * ?float32'>

👍

However, there are a couple of issues ...

  1. (this is what the currently failing test test_load_single_field_of_linked shows) In some constellations the information on the necessary column for loading the offsets to calculate the global index is being lost. For example if i manually run the _get_global_index function:
from coffea.nanoevents.methods.physlite import _get_global_index
gi = _get_global_index(
    events.CaloCalTopoClusters,
    events.Electrons._eventindex,
    events.Electrons.caloClusterLinks.m_persIndex,
)
import dask_awkward as dak

we see it knows it has to load the rawE column because the _get_global_index function picks the first column to get the offsets

>>> dak.necessary_columns(gi)
{'from-uproot-c0bb25dde97be6a3f9868eea400e2eb2': ['CaloCalTopoClusters.rawE', 'Electrons.caloClusterLinks.m_persIndex', 'Electrons._eventindex']}

However in the full picture when i use this to get the linked caloClusters this gets lost once i then select a single field of the caloClusters:

>>> dak.necessary_columns(events.Electrons.caloClusters.calE)
{'from-uproot-d7496f2416955eff723b64b6e72e9c5b': ['Electrons.caloClusterLinks.m_persKey',
  'Electrons._eventindex',
  'CaloCalTopoClusters.calE',
  'Electrons.caloClusterLinks.m_persIndex']}

... no more rawE in there and consequently it will fail when i do .compute(), complaining it can't find rawE

Not quite sure yet where the information gets lost - let me know if you have any ideas

  1. The concept of linking into a location based on what the m_persKey field in the ElementLink tells us does not play well with dask since we don't know then which columns to load beforehand and worst case not even the type of the resulting array. So for now the linking into multiple collections which we don't know beforehand doesn't work. This is probably not a huge issue because i think we only have this for the Truth Collection. And for that one it could be possible to work around it by just loading all corresponding truth columns and make a homogeneous type with only common fields. But for now i left this broken ...

@nikoladze nikoladze changed the title Daskify Elementlinks in PHYSLITE schema fix: Daskify Elementlinks in PHYSLITE schema Jul 29, 2023
@nikoladze nikoladze mentioned this pull request Jul 29, 2023
9 tasks
@nikoladze
Copy link
Contributor Author

nikoladze commented Jul 30, 2023

another thing i'm not quite sure about - to implement the .trackParticle (without plural) property of the Electron (which just picks the first trackParticle) i had to put in a switch and use _dask_array_ instead of self in case it is passed because self is passed to be a typetracer in this case. I don't quite understand why that is ... Is it because i'm using another behavior method inside?

corresponding code:

https://github.com/CoffeaTeam/coffea/blob/06a1627c12c8867363696d1ab3b3da5bbb8b8066/src/coffea/nanoevents/methods/physlite.py#L205-L212

@lgray
Copy link
Collaborator

lgray commented Jul 30, 2023

nesting behavior methods is full of nastiness in dask awkward - for your own sanity - you probably need to figure out how to flatten your calling structure a bit.

@lgray
Copy link
Collaborator

lgray commented Jul 30, 2023

or I should say - rather - instead of using the behavior directly - you need to call the thing that is making the trackingParticles (with an s) property and call that directly using _dask_array_ so it has the right context and is building a task graph.

right - I figured this out before.

Anyway - keep in mind when writing this stuff that

  1. when you call a behavior you're calling a behavior of the metadata for the dask awkward array (i.e. a typetracer) and that will not build a task graph
  2. therefore - you must figure out how to snake the dask array down
  3. nested behaviors are immediately a no go, so call the functions to produce the things you need (since you're building a task graph then!)

@lgray
Copy link
Collaborator

lgray commented Jul 31, 2023

I had some further thoughts on this, that are related to #822, where we can have a hook in the nanoevents schema that opens one input partition and determines a bit of metadata about how things link up using a small slice of raw data.

This would have a small use penalty but nothing major since the processing is light.

If it makes the initial data exploration loop too slow one can always keep things local with a .compute() or cache with a .persist().

Would this be helpful?

@nikoladze
Copy link
Contributor Author

finally coming back to this

or I should say - rather - instead of using the behavior directly - you need to call the thing that is making the trackingParticles (with an s) property and call that directly using _dask_array_ so it has the right context and is building a task graph.

ok, i modified it now to do that

I had some further thoughts on this, that are related to #822, where we can have a hook in the nanoevents schema that opens one input partition and determines a bit of metadata about how things link up using a small slice of raw data.

This would have a small use penalty but nothing major since the processing is light.

If it makes the initial data exploration loop too slow one can always keep things local with a .compute() or cache with a .persist().

Would this be helpful?

i was thinking maybe to have this as an explicit step where i pass one file and then it looks through all of its ElementLink branches and spits out where they point to. Not sure if i want to do this everytime on the first partition. Hopefully this also doesn't change so often.

For the changes i currently have in this PR there is still a bug somewhere for events.Electrons.caloClusters.calE. The issue is when i run

events.Electrons.caloClusters.calE.compute()

It complains it doesn't have the column rawE (which is used to determine the offsets). Playing a bit with it i found it works when i pass optimize_graph=False. Looking at dask.visualize(events.Electrons.caloClusters.calE, optimize_graph=True) the result looks a bit odd:

image

maybe sth wrong with the graph optimization? And, as mentioned before, dak.necessary_columns also doesn't list the rawE ...

@lgray
Copy link
Collaborator

lgray commented Aug 22, 2023

something screwed up when you merged master! can you rebase this PR to current coffea master?

@nikoladze nikoladze force-pushed the dev-dak-elementlinks branch from 6f32947 to ab5164b Compare August 23, 2023 07:55
@nikoladze
Copy link
Contributor Author

nikoladze commented Aug 30, 2023

Seems the issue i had with necessary columns being missed is fixed if i explicitly (in case of typetracer) touch the column that i use for calculating the offsets in the link target

https://github.com/CoffeaTeam/coffea/blob/8ec38cfda7c374568ab06025c423aef0272caa9d/src/coffea/nanoevents/methods/physlite.py#L105-L108

This uses awkward.typetracer.touch_data which is currently only available on the awkward main branch. Looking at the implementation of awkward.typetracer.length_zero_if_typetracer (which i also use in that function) it seems this also touches data, but i feed in load_column.layout.offsets.data instead of the whole load_column. For some reason this seemed not to be sufficient to let dask_awkward know we need this column.

@lgray
Copy link
Collaborator

lgray commented Sep 7, 2023

@nikoladze does this PR or #888 need any further addition? or should they be merged?

@nikoladze
Copy link
Contributor Author

They can be both merged - i'll open new PRs when i work on the remaining issues. Thanks!

@lgray lgray merged commit 47a9304 into scikit-hep:master Sep 12, 2023
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants