WIP: Migration guide for coffea 0.7 to coffea 2023 #775

lgray · 2023-03-18T08:54:37Z

lgray
Mar 18, 2023
Maintainer

I'll start building up a migration guide here for folks moving from coffea 0.7 to coffea 2023. There are significant differences in functionality due to the evolution of the awkward array package from v1 to v2, notably that all delayed computation is now accomplished through dask, using dask-awkward and dask-histogram (the latter via the hist.dask extension).

Using these packages we have been able to maintain the functionality and interfaces of coffea while fully integrating with the dask task-graph building system. This change is well justified since it brings qualitatively new functionality to coffea (like on-demand skimming) and makes analysis design significantly more flexible (data exploration and analysis scaling that do not need the processor pattern). Usage of the dask-awkward, dask-histogram, and hist.dask packages is mandatory. The new, advanced functionality afforded by these packages is completely opt-in to ease migration through piecewise adoption, and the processors still function as they have in the past with some minor conversion required (a wrapper is in preparation, watch #882).

In broad strokes, to migrate an analysis to coffea 2023 you will need to make the following changes:

- If you want a complete array or histogram object in memory on your local machine so that you can manipulate it, use array.compute(), ahistogram.compute(), or dask.compute({"some": array, "another": histogram}).
  - This should usually be done at the end of the code that constructs the arrays. Do not call .compute() prematurely, as it can drastically slow down your analysis code (within a processor, coffea's executors take care of it for you).
- In your analysis code, use import hist.dask as hda and perform array operations as you used to with bare awkward arrays (ak.some_function will call dak.some_function if passed a dask-awkward array), then instantiate histograms using these packages with the syntax you are accustomed to from awkward and hist.
  - You'll still need to import hist for convenient definition of axes.
  - Often it is sufficient to do import awkward as ak and set permit_dask=True in NanoEventsFactory.from_root
    - If you need specific dask_awkward functionality (like getting the list of expected columns to read) you'll need to
  - Most array operations that are available in awkward are available in dask_awkward. If you encounter a problem or a missing piece of functionality that you need, please open an issue on the dask-awkward GitHub page.
  - hist.dask histograms behave like empty hist histograms, and return a filled hist.hist.Hist when .compute() is called.
- Heavy corrections (i.e. histogram lookups or JECs) need to be injected into dask. If you are using a correction based on coffea.lookup_tools.lookup_base (through the evaluator interface or with objects you make yourself), this is done for you automatically. Otherwise, you should wrap your correction in a dask.delayed object and call .persist() on that object.

yimuchen · 2023-04-12T12:50:08Z

yimuchen
Apr 12, 2023

Right now, correctionlib (commonly used for b-tag scale factors) will not work with dask_awkward arrays [1], as an equivalent of ak.to_numpy is not implemented.

However, there is a full correctionlib_wrapper available in coffea.lookup_tools [2] that is fully dask_awkward compatible. Here you don't even need to use the ak.flatten/unflatten snippet; correctionlib_wrapper properly wraps jagged arrays.

[1] https://github.com/cms-nanoAOD/correctionlib/blob/master/src/correctionlib/highlevel.py#L93
[2] https://github.com/CoffeaTeam/coffea/blob/master/coffea/lookup_tools/correctionlib_wrapper.py
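
For a correction that is not based on coffea.lookup_tools, the guide above suggests wrapping it in a dask.delayed object and calling .persist() on it. A minimal sketch of just that step, using only dask and numpy, might look like the following. ToyScaleFactor and apply_correction are hypothetical names invented for this illustration, and how you feed the persisted object into your dask-awkward graph will depend on your correction.

```python
import dask
import numpy as np


class ToyScaleFactor:
    """Hypothetical stand-in for a heavy correction, e.g. a large lookup table."""

    def __init__(self, edges, values):
        self.edges = np.asarray(edges, dtype=float)
        self.values = np.asarray(values, dtype=float)

    def __call__(self, pt):
        # Find the bin for each pt and clamp overflow/underflow to the edge bins.
        idx = np.clip(np.digitize(pt, self.edges) - 1, 0, len(self.values) - 1)
        return self.values[idx]


# Wrap the correction once and persist it, so the object is held by dask
# rather than being re-embedded in every task that uses it.
sf = dask.delayed(ToyScaleFactor([0, 30, 60, 100], [0.9, 1.0, 1.1])).persist()


@dask.delayed
def apply_correction(corr, pt):
    # Runs lazily; corr is the unwrapped ToyScaleFactor at execution time.
    return corr(pt)


pt = np.array([20.0, 45.0, 80.0, 150.0])
weights = apply_correction(sf, pt).compute()
print(weights)
```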

jrueb · 2023-05-11T15:49:28Z

jrueb
May 11, 2023

You are stating that "This new, advanced functionality is completely opt-in to ease migration through piecewise adoption, and the processors still function as they have in the past." Can you give some hints on how to keep past functionality running? How do I use NanoAODSchema without the new functionality, meaning without dask but still not loading the entire file (as has been the case so far)?

1 reply

lgray May 11, 2023
Maintainer Author

NanoAODSchema (or any other schema) without dask will be rather slow, since it is now completely eagerly evaluated. I cannot change that, and I don't recommend it.

Converting to dask-awkward with coffea 2023 is more or less mandatory. We are here to help you do that if/when you run into problems. The "new, advanced functionality" referred to in the sentence is the skimming, checkpointing, removal/lack-of-need for processors. I'll make the scope of that more clear.

As far as processors go, we will soon be working on something that takes your processor and, so long as you have switched to dask-awkward and dask-histogram (which is very straightforward), basically renders it for you in the appropriate backend.

Largely we've found that 90% of the work is changing ak. to dak. and using hist.dask instead of hist.

nsmith- · 2023-06-20T22:49:53Z

nsmith-
Jun 20, 2023
Maintainer

I'll just mention that if you want to try coffea 2023 with LPCJobQueue at the LPC cluster, you can use the standard container, pip install coffea==2023.6.0rc0 inside the shell, and make the cluster with ship_env=True. It should then work, e.g.

from distributed import Client
from lpcjobqueue import LPCCondorCluster

cluster = LPCCondorCluster(ship_env=True)
cluster.adapt(minimum=1, maximum=10)
client = Client(cluster)

lgray · 2023-07-08T02:22:41Z

lgray
Jul 8, 2023
Maintainer Author

As of dask-awkward 2023.7.0 and awkward 2.3.0, ak.some_function will automatically dispatch to dak.some_function. This is a significant reduction in burden for migrating code!

~~We are however working through some growing pains in dask-awkward and need to restabilize the package. It's less robust than previous versions. Will get fixed quickly.~~ Ok, that took a few months, but now folks essentially do not need to import dask_awkward unless they need access to specific functionality.

JaLuka98 · 2023-10-25T13:18:18Z

JaLuka98
Oct 25, 2023

Hey guys,
I made a yml file that you can use to set up an environment with coffea 2023 and the latest features. Everything is installed via conda except for coffea and dask-awkward, because we need the respective release candidates for those packages, which we get from pip.

name: coffea
channels:
  - conda-forge
dependencies:
  - python=3.10
  - hist
  - dask-histogram
  - uproot
  - xrootd
  - pip:
    - 'dask-awkward>=2023.10.1'
    - 'coffea>=2023.10.0rc1'

Note that the single quotes for the pip package versions are necessary to escape the interpretation of the > character as shell redirection.
Install with conda env create -f <path_to_yml_file> -n custom_environment_name and jump right into it! Note that you can use mamba/micromamba for faster solving.
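
Once the environment is created, a quick way to check which versions ended up installed is a small script like the one below. It only uses the Python standard library; the report_versions helper and the package list are made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def report_versions(packages):
    """Return {package: installed version, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


versions = report_versions(
    ["coffea", "dask-awkward", "dask-histogram", "hist", "uproot"]
)
for name, ver in versions.items():
    print(f"{name}: {ver if ver is not None else 'not installed'}")
```

Run it with the new environment activated, so that the versions reported are the ones from that environment.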
