Skimming NanoAOD (updated for coffea-202x) #1100

yimuchen · 2024-05-31T13:00:17Z

Since the old recipe was written for coffea-0.7.x, I thought I would share an implementation for coffea-2024 for future reference. The new uproot.dask_write makes this process very simple to write, though one might need to play around with the knobs to make sure nothing exceeds the available memory.

[?] Avoid fragmented output when skimming with very low efficiency. The dak.Array.repartition method seems to work well in this case (requires the n_to_one option introduced in dask-contrib/dask-awkward#518, "fix: allow for Nones in repartition n_to_one").
Add custom fields (for caching if needed). This should be as simple as adding a new field to the skimmed events?
Added checks to ensure fields are uproot-writable (singly jagged arrays), modified from the original solution: Skimming NanoEvents #735
[?] Drop unused fields. The current solution isn't quite as clear in meaning as "ak.without_field", which is currently being implemented upstream: https://github.com/dask-contrib/dask-awkward/pull/508/files

from coffea.nanoevents import NanoAODSchema
from coffea.dataset_tools import preprocess, apply_to_fileset
import awkward as ak
import dask_awkward as dak 
import dask
import uproot

def is_rootcompat(a):
    """Is it a flat or 1-d jagged array?"""
    t = dak.type(a)
    if isinstance(t, ak.types.NumpyType):
        return True
    if isinstance(t, ak.types.ListType) and isinstance(t.content, ak.types.NumpyType):
        return True

    return False


def uproot_writeable(events):
    """Restrict to columns that uproot can write compactly"""
    out_event = events[list(x for x in events.fields if not events[x].fields)]
    for bname in events.fields:
        if events[bname].fields:
            out_event[bname] = ak.zip(
                {
                    n: ak.without_parameters(events[bname][n])
                    for n in events[bname].fields
                    if is_rootcompat(events[bname][n])
                }
            )
    return out_event

def make_skimmed_events(events):
    # Place your selection logic here
    skimmed = events[<Your Skimming selection here>]
    # Add your custom fields here
    skimmed["my_new_field"] = 137*9.8
    
    # ak.without_field is not yet implemented in dask
    # skimmed = ak.without_field(skimmed, ["DropField1", "DropField2"]) https://github.com/dask-contrib/dask-awkward/pull/508/files
    # Keep every field except the dropped ones, preserving the original field order
    skimmed_dropped = skimmed[
        [x for x in skimmed.fields if x not in ["DropField1", "DropField2"]]
    ]

    # Returning the skimmed events
    return skimmed_dropped


print("Running preprocessing")  # To obtain file splitting
dataset_runnable, _ = preprocess(
    {
        "dataset1": {
            "files": {
                f"path_to_file_{i}.root": "Events" for i in range(0, 100)  # However many files you have
            }
        },
        "dataset2": {
            "files": {
                f"path_to_file_{i}.root": "Events" for i in range(0, 200)  # Another dataset
            }
        }
    },
    align_clusters=False,
    step_size=100_000,  # You may want to set this to something slightly smaller to avoid loading too much in memory
    files_per_batch=1,
    skip_bad_files=True,
    save_form=False,
)
print("Computing dask task graph")
skimmed_dict = apply_to_fileset(
    make_skimmed_events, dataset_runnable, schemaclass=NanoAODSchema
)


print("Executing task graph and saving")
for dataset, skimmed in skimmed_dict.items():
    skimmed = uproot_writeable(skimmed)
    skimmed = skimmed.repartition(
        n_to_one=1_000
    )  # Repartitioning so that each output file contains ~100_000 events
    uproot.dask_write(
        skimmed,
        destination="skimtest/",
        prefix=f"{dataset}/skimmed",
    )

Additional items to look out for:

The dak.Array.repartition call must be placed as the last step before uproot.dask_write.
The number of events that ends up in each output file is roughly preprocess:step_size x selection efficiency x repartition:n_to_one. To keep the input and output file sizes roughly similar, you would ideally set n_to_one to the inverse of the selection efficiency. You might want to try a small sample set first to make sure the files are neither too large nor too small.
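
To make the estimate above concrete, here is a minimal sketch of the arithmetic in plain Python. The helper names events_per_output_file and suggest_n_to_one are hypothetical (they are not part of coffea, uproot, or dask-awkward), and the 0.1% efficiency is only an assumed example value; measure the real efficiency on a small sample first.

```python
import math


def events_per_output_file(step_size, efficiency, n_to_one):
    """Rough estimate: step_size x selection efficiency x n_to_one."""
    return step_size * efficiency * n_to_one


def suggest_n_to_one(efficiency):
    """Hypothetical helper: inverse of the selection efficiency, at least 1."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the interval (0, 1]")
    return max(1, round(1 / efficiency))


step_size = 100_000  # same value as passed to preprocess in the recipe
efficiency = 0.001  # assumed example: 0.1% of events pass the selection
n_to_one = suggest_n_to_one(efficiency)
estimate = events_per_output_file(step_size, efficiency, n_to_one)
print(f"n_to_one={n_to_one}, about {estimate:.0f} events per output file")
```

With these assumed numbers the suggestion matches the n_to_one=1_000 used in the recipe. This is only a starting point, since the efficiency can vary a lot between datasets and between partitions.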

yimuchen · 2024-06-03T09:06:54Z

Currently, the repartition step introduces a large memory overhead. An issue has been raised in the dask-awkward repository: dask-contrib/dask-awkward#509

2 replies

yimuchen Jun 6, 2024
Author

Additional update: making sure to repartition only after the array has been converted seems to solve the memory issue? Still investigating...

lgray Jun 6, 2024
Maintainer

Thanks for continuing to look into it. I've pinged Martin in your issue to get him involved. I agree something seems wrong at this point.

Would follow up in dask-awkward issue with a demonstration to the point.
