feat!: switch over to dask-based processing idioms, improve dataset handling #882

Merged
119 commits merged into master from local_executors_to_dask on Dec 12, 2023

Conversation

@lgray (Collaborator) commented Aug 22, 2023

This will:

  • add a tool for dask-based pre-processing of "filesets" (the uproot file-dictionary spec plus metadata support)
    • Note that this produces JSON-serializable output, so we can now easily store the results and run preprocessing a minimal number of times instead of on every run.
  • add support for dask-driven execution of processors (see the sketch after this list)
  • provide convenience tools for users who no longer wish to use processors
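
A minimal sketch of what the dask-driven execution could look like, assuming an apply_to_fileset helper in the new dataset_tools module; the name, argument order, and return shape here are assumptions, not a definitive reference:

# Sketch only: assumes coffea.dataset_tools.apply_to_fileset with this shape.
import awkward as ak
import dask
from coffea.dataset_tools import apply_to_fileset

def num_events(events):
    # Any callable mapping a lazy events array to dask collections will do;
    # a processor's process() method could slot in here as well.
    return {"nevents": ak.num(events, axis=0)}

fileset = {
    "my_dataset": {"files": {"path/to/file.root": "Events"}},  # hypothetical names
}

out = apply_to_fileset(num_events, fileset)
(computed,) = dask.compute(out)  # one result dict per dataset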

@lgray (Collaborator, author) commented Aug 22, 2023

@valsdav Here's the PR with the fileset pre-processor. The query tool you're working on with the rucio / servicex backend should produce JSON that's compatible with this spec (and we can go so far as to define a schema if you like).

Essentially, this nests the spec mentioned here in uproot, so that you've got an object that looks like:

dataset = {
    "dataset_name": {"files": <the-uproot-spec>, "metadata": {...}, ...},
    "dataset2_name": ...,
    ...
}

The preprocessing step, as it stands, extracts the files from the fileset and runs over each one to determine the desired chunking. It then returns two filesets: the "available" fileset, containing only the files it could open and chunk, and the "complete" fileset, containing all input files, with reachable files updated and unreachable files left as given.
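
A minimal sketch of that flow, assuming a preprocess helper in dataset_tools; the function name, the step_size keyword, and the two-fileset return are assumptions based on the description above:

# Sketch only: names and signature are assumptions, not a definitive API.
import gzip
import json

from coffea.dataset_tools import preprocess

fileset = {
    "my_dataset": {
        "files": {"path/to/file.root": "Events"},  # hypothetical file/tree names
        "metadata": {"is_mc": True},
    },
}

# "available": only the files that could be opened and chunked;
# "complete": every input file, with reachable ones updated.
available, complete = preprocess(fileset, step_size=100_000)

# The output is JSON-serializable, so preprocessing can run once and be cached.
with gzip.open("preprocessed_fileset.json.gz", "wt") as f:
    json.dump(available, f)

Since the available fileset is itself a valid fileset, it can be fed straight back into the execution tools on later runs.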

I'm going to add some modifiers on top of this to recover capabilities like max_chunks, etc.
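
One way such a modifier could look, as a hypothetical sketch; the per-file "steps" key holding entry ranges is an assumption about the preprocessed output layout:

# Hypothetical sketch of a max_chunks-style modifier; the fileset layout
# (per-file "steps" lists of entry ranges) is assumed, not confirmed here.
def max_chunks(fileset, maxchunks):
    """Keep at most `maxchunks` entry steps per file in a preprocessed fileset."""
    out = {}
    for name, dataset in fileset.items():
        out[name] = {
            **dataset,
            "files": {
                fname: {**finfo, "steps": finfo["steps"][:maxchunks]}
                for fname, finfo in dataset["files"].items()
            },
        }
    return out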

@valsdav (Contributor) commented Aug 23, 2023

Hi @lgray thanks! Starting to work on the query side :)

valsdav and others added 23 commits August 23, 2023 11:41
feat: Dataset querying features using rucio
@ikrommyd (Collaborator) commented:

Pre-processing fails on a distributed Client with

TypeError: ('Could not serialize object of type HighLevelGraph', '<ToPickle: HighLevelGraph with 1 layers.\n<dask.highlevelgraph.HighLevelGraph object at 0x7f525471cfd0>\n 0. -get-steps-6f666e09968af4341f760475fe63382d\n>')

during all_processed_files = dask.compute(files_to_preprocess)[0], with

AssertionError: bug in Awkward Array: attempt to get length of a TypeTracerArray

See if this has been reported at https://github.com/scikit-hep/awkward/issues

and

The above exception was the direct cause of the following exception:

appearing throughout the error message. It works fine with local threads or processes.
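
For context, a runnable sketch of the scheduler switch being described; the dummy task merely stands in for the actual preprocessing graph, which is what failed to serialize:

# Dummy task standing in for coffea's preprocessing graph; the real graph is
# what hit the serialization TypeError under the distributed scheduler.
import dask
from dask import delayed
from dask.distributed import Client

def answer():
    return 42

task = delayed(answer)()

dask.compute(task, scheduler="threads")    # local threads: reported to work
dask.compute(task, scheduler="processes")  # local processes: reported to work

client = Client()   # distributed client: where the TypeError above appeared
dask.compute(task)  # the scheduler is picked up from the active client
client.close()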

@lgray (Collaborator, author) commented Aug 31, 2023

@iasonkrom this is a limitation in the latest awkward array, I think. @agoose77 would know better.

Though I'm a bit confused about why it's trying to serialize a typetracer in the first place; that shouldn't be happening.

valsdav and others added 21 commits December 11, 2023 13:04
…nto local_executors_to_dask

Conflicts:
	src/coffea/dataset_tools/dataset_query.py
fix: improvements to dataset_query tools
@lgray (Collaborator, author) commented Dec 12, 2023

@nsmith- can you give this another run-through? I intend to merge later today unless there is something big.

@lgray lgray enabled auto-merge December 12, 2023 20:52
@lgray lgray merged commit 16dd8df into master Dec 12, 2023
14 checks passed
@lgray lgray deleted the local_executors_to_dask branch December 14, 2023 19:08