Skimming NanoAOD (updated for coffea-202x) #1100
yimuchen started this conversation in Show and tell
Replies: 1 comment 2 replies
-
Currently, the repartition step introduces a large memory overhead. An issue has been raised in the dask-awkward repository: dask-contrib/dask-awkward#509
-
Since the old recipe was written for coffea-0.7.x, I thought I would share an implementation for coffea-2024 for future reference. The new `uproot.dask_write` makes this process very simple to write, though one might need to play around with the knobs to make sure nothing exceeds the available memory. The `dak.Array.repartition` method seems to work well in this case (it requires the `n_to_one` option introduced in this fix: "allow for Nones in repartition n_to_one", dask-contrib/dask-awkward#518).

Additional items to look out for:

- `dak.Array.repartition` must be placed as the last step before `uproot.dask_write`.
- The size of each output file scales roughly as `preprocess:step_size` x selection efficiency x `repartition:n_to_one`, so to keep the input and output file sizes roughly similar, you would ideally want `n_to_one` to be the inverse of the selection efficiency. You might want to try a small sample set first to make sure the files are not too large or too small.
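The sizing rule above can be sketched as plain arithmetic. This is a minimal illustration and not part of coffea, uproot, or dask-awkward: the helper names `suggest_n_to_one` and `expected_events_per_output` are hypothetical, and the event counts are assumed to come from a small test run of your own selection.

```python
def suggest_n_to_one(n_selected: int, n_total: int) -> int:
    """Hypothetical helper: pick n_to_one as the inverse of the selection efficiency.

    n_selected / n_total is the selection efficiency measured on a small
    test sample, so its inverse is n_total / n_selected.
    """
    if n_total <= 0 or n_selected <= 0:
        raise ValueError("need positive event counts to estimate the efficiency")
    # Never merge fewer than one input partition per output partition.
    return max(1, round(n_total / n_selected))


def expected_events_per_output(
    step_size: int, n_selected: int, n_total: int, n_to_one: int
) -> float:
    """Hypothetical helper: step_size x selection efficiency x n_to_one."""
    return step_size * n_to_one * n_selected / n_total


if __name__ == "__main__":
    # Assumed numbers for illustration: 50 of 1000 test events pass the skim,
    # and preprocessing used a step_size of 100_000 events per chunk.
    n_to_one = suggest_n_to_one(n_selected=50, n_total=1000)
    print("n_to_one:", n_to_one)
    print(
        "events per output partition:",
        expected_events_per_output(100_000, 50, 1000, n_to_one),
    )
```

The resulting integer is what one would pass as `n_to_one` to `dak.Array.repartition`. Treat it as a starting point: the estimate assumes the efficiency is uniform across chunks, so it is still worth checking the output file sizes on a small sample.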