fix: Daskify Elementlinks in PHYSLITE schema #872
another thing i'm not quite sure about - to implement the corresponding code:
nesting behavior methods is full of nastiness in dask-awkward - for your own sanity - you probably need to figure out how to flatten your calling structure a bit.
or I should say - rather - instead of using the behavior directly - you need to call the thing that is making the right - I figured this out before. Anyway - keep in mind when writing this stuff that
I had some further thoughts on this that are related to #822, where we can have a hook in the nanoevents schema that opens one input partition and determines a bit of metadata about how things link up, using a small slice of raw data. This would have a small usage penalty but nothing major, since the processing is light. If it makes the initial data exploration loop too slow one can always keep things local with a … Would this be helpful?
finally coming back to this
ok, i modified it now to do that
i was thinking maybe to have this as an explicit step where i pass one file and then it looks through all of its ElementLink branches and spits out where they point to. Not sure if i want to do this every time on the first partition. Hopefully this also doesn't change so often. For the changes i currently have in this PR there is still a bug somewhere for
It complains it doesn't have the column. Maybe something is wrong with the graph optimization? And, as mentioned before,
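The explicit "pass one file and report where the ElementLinks point to" step described above could look roughly like the sketch below. It is only an illustration in plain Python: the function name `summarize_link_targets`, the branch names, the key values and the key-to-collection map are all hypothetical, and reading the keys from an actual file is left out.

```python
def summarize_link_targets(link_keys, key_to_collection):
    """For each ElementLink branch, list the collections its keys point to.

    link_keys: dict mapping a branch name to the flat list of key values
               found in a small slice of one file (hypothetical input).
    key_to_collection: dict mapping a key value to a collection name
               (hypothetical lookup table).
    """
    targets = {}
    for branch, keys in link_keys.items():
        names = {key_to_collection.get(k, f"<unknown key {k}>") for k in keys}
        targets[branch] = sorted(names)
    return targets


# Made-up example data, not taken from any real file.
link_keys = {
    "cluster_links": [11, 11, 11],
    "truth_links": [21, 22, 21],
}
key_to_collection = {11: "clusters", 21: "truth_a", 22: "truth_b"}

summary = summarize_link_targets(link_keys, key_to_collection)
for branch, names in summary.items():
    print(branch, "->", names)
```

A branch that resolves to exactly one collection can be wired up ahead of time; a branch that resolves to several is the case where the target is only known from the data.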
something screwed up when you merged master! can you rebase this PR to current coffea master?
Seems the issue i had with necessary columns being missed is fixed if i explicitly (in case of typetracer) touch the column that i use for calculating the offsets in the link target. This uses
@nikoladze does this PR or #888 need any further addition? or should they be merged?
They can both be merged - i'll open new PRs when i work on the remaining issues. Thanks!
Thanks to the help from @lgray i managed to daskify part of the cross referencing functionality in the PHYSLITE schema. So, for example, this now works:
However, there are a couple of issues ...
(as `test_load_single_field_of_linked` shows) In some constellations the information on the necessary column for loading the offsets to calculate the global index is being lost. For example, if i manually run the `_get_global_index` function:
we see it knows it has to load the `rawE` column because the `_get_global_index` function picks the first column to get the offsets.
However, in the full picture, when i use this to get the linked `caloClusters`, this gets lost once i then select a single field of the `caloClusters`:
... no more `rawE` in there and consequently it will fail when i do `.compute()`, complaining it can't find `rawE`.
Not quite sure yet where the information gets lost - let me know if you have any ideas
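To make clear why a column of the link target has to be loaded at all, here is a minimal conceptual sketch of turning per-event link indices into global indices. It is plain Python with made-up toy data, not coffea's `_get_global_index`; the helper names `counts_to_offsets` and `global_index` are hypothetical.

```python
from itertools import accumulate


def counts_to_offsets(counts):
    """Turn per-event object counts into start offsets, one extra entry at the end."""
    return [0, *accumulate(counts)]


def global_index(target_counts, local_index):
    """Map per-event local indices to positions in the flattened target collection.

    target_counts[i] is the number of target objects in event i.
    local_index[i] lists the in-event indices that event i links to.
    """
    offsets = counts_to_offsets(target_counts)
    return [
        [offsets[event] + idx for idx in links]
        for event, links in enumerate(local_index)
    ]


# Toy data: 3 events holding 2, 0 and 3 target objects.
target_counts = [2, 0, 3]
local_index = [[1, 0], [], [2]]
result = global_index(target_counts, local_index)
print(result)  # [[1, 0], [], [4]]
```

The per-event counts can only come from the list structure of some column of the target collection. So even if only one field of the linked objects is requested, one more column may be needed just for its offsets.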
Following whatever target collection the `m_persKey` field in the ElementLink tells us does not play well with dask, since we then don't know beforehand which columns to load and, in the worst case, not even the type of the resulting array. So for now, linking into multiple collections which we don't know beforehand doesn't work. This is probably not a huge issue because i think we only have this for the Truth Collection. And for that one it could be possible to work around it by just loading all corresponding truth columns and making a homogeneous type with only common fields. But for now i left this broken ...
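The "homogeneous type with only common fields" workaround could be sketched as follows. This is an untested idea in plain Python, not something implemented in this PR; the collection names, field names and helper functions are hypothetical placeholders.

```python
def common_fields(collections):
    """Return the field names shared by all collections, in the order of the first one.

    collections: dict mapping a collection name to its list of field names.
    """
    field_lists = list(collections.values())
    if not field_lists:
        return []
    shared = set(field_lists[0]).intersection(*field_lists[1:])
    return [name for name in field_lists[0] if name in shared]


def project(record, fields):
    """Keep only the given fields of a record (a plain dict here)."""
    return {name: record[name] for name in fields}


# Hypothetical truth collections with partly different fields.
collections = {
    "truth_a": ["px", "py", "pdgId", "charge"],
    "truth_b": ["pdgId", "px", "py", "status"],
}
fields = common_fields(collections)
print(fields)  # ['px', 'py', 'pdgId']

record_b = {"pdgId": 13, "px": 1.0, "py": 2.0, "status": 1}
print(project(record_b, fields))  # {'px': 1.0, 'py': 2.0, 'pdgId': 13}
```

With every candidate collection reduced to the same fields, the result of following a link has one known type regardless of which collection the key selects, at the cost of loading the columns of all candidates.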