Update to CalVer coffea #1

Open
wants to merge 8 commits into
base: master
178 changes: 148 additions & 30 deletions 01-nanoevents.ipynb
Expand Up @@ -8,7 +8,7 @@
"\n",
"This is a rendered copy of [nanoevents.ipynb](https://github.com/CoffeaTeam/coffea/blob/master/binder/nanoevents.ipynb). You can optionally run it interactively on [binder at this link](https://mybinder.org/v2/gh/coffeateam/coffea/master?filepath=binder%2Fnanoevents.ipynb)\n",
"\n",
"NanoEvents is a Coffea utility to wrap flat nTuple structures (such as the CMS [NanoAOD](https://www.epj-conferences.org/articles/epjconf/pdf/2019/19/epjconf_chep2018_06021.pdf) format) into a single awkward array with appropriate object methods (such as Lorentz vector methods$^*$), cross references, and nested objects, all lazily accessed$^\\dagger$ from the source ROOT TTree via uproot. The interpretation of the TTree data is configurable via [schema objects](https://coffeateam.github.io/coffea/modules/coffea.nanoevents.html#classes), which are community-supplied for various source file types. These schema objects allow a richer interpretation of the file contents than the [uproot.lazy](https://uproot4.readthedocs.io/en/latest/uproot4.behaviors.TBranch.lazy.html) methods. Currently available schemas include:\n",
"NanoEvents is a Coffea utility to wrap flat nTuple structures (such as the CMS [NanoAOD](https://www.epj-conferences.org/articles/epjconf/pdf/2019/19/epjconf_chep2018_06021.pdf) format) into a single awkward array with appropriate object methods (such as Lorentz vector methods$^*$), cross references, and nested objects, all accessed in delayed$^\\dagger$ mode from the source ROOT TTree via uproot. The interpretation of the TTree data is configurable via [schema objects](https://coffeateam.github.io/coffea/modules/coffea.nanoevents.html#classes), which are community-supplied for various source file types. These schema objects allow a richer interpretation of the file contents than the [uproot.dask](https://uproot.readthedocs.io/en/latest/uproot._dask.dask.html) methods. Currently available schemas include:\n",
"\n",
" - `BaseSchema`, which provides a simple representation of the input TTree, where each branch is available verbatim as `events.branch_name`, effectively the same behavior as `uproot.lazy`. Any branches that uproot supports at \"full speed\" (i.e. that are fully split and either flat or single-jagged) can be read by this schema;\n",
" - `NanoAODSchema`, which is optimized to provide all methods and cross-references in CMS NanoAOD format;\n",
Expand All @@ -21,7 +21,7 @@
"\n",
"$^*$ Vector methods are currently made possible via the [coffea vector](https://coffeateam.github.io/coffea/modules/coffea.nanoevents.methods.vector.html) methods mixin class structure. In a future version of coffea, they will instead be provided by the dedicated scikit-hep [vector](https://vector.readthedocs.io/en/latest/) library, which provides a richer feature set. The coffea vector methods predate the release of the vector library.\n",
"\n",
"$^\\dagger$ _Lazy_ access refers to only fetching the needed data from the (possibly remote) file when a sub-array is first accessed. The sub-array is then _materialized_ and subsequent access of the sub-array uses a cached value in memory. As such, fully materializing a `NanoEvents` object may require a significant amount of memory.\n",
"$^\\dagger$ _Delayed_ access refers to fetching the needed data from the (possibly remote) file only when you are ready to compute the entire set of outputs. Until then, a minimal amount of metadata is used to trace the operations that will eventually be performed (RDataFrame's approach to this is frequently called \"lazily booking\" operations).\n",
"\n",
"\n",
"In this demo, we will use NanoEvents to read a small CMS NanoAOD sample. The events object can be instantiated as follows:"
Expand All @@ -30,17 +30,20 @@
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import awkward as ak\n",
"from coffea.nanoevents import NanoEventsFactory, NanoAODSchema\n",
"\n",
"fname = \"https://raw.githubusercontent.com/CoffeaTeam/coffea/master/tests/samples/nano_dy.root\"\n",
"events = NanoEventsFactory.from_root(\n",
" fname,\n",
" {fname: \"Events\"}, # We pass a dictionary of {filename1: treename1, filename2: treename2, ...} to load from\n",
" schemaclass=NanoAODSchema.v6,\n",
" metadata={\"dataset\": \"DYJets\"},\n",
" delayed=False, # You can set this to True and append `.compute()` calls to the variables you want to evaluate\n",
").events()"
]
},
Expand Down Expand Up @@ -68,7 +71,9 @@
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"events.Generator.id1"
Expand All @@ -77,7 +82,9 @@
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# all names can be listed with:\n",
Expand All @@ -103,7 +110,9 @@
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"events.Generator.id1?"
Expand All @@ -119,7 +128,9 @@
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"events.GenJet.fields"
Expand All @@ -135,7 +146,20 @@
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"events.GenJet.energy"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"events.GenJet.energy"
Expand All @@ -151,7 +175,9 @@
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# find distance between leading jet and all electrons in each event\n",
Expand All @@ -162,17 +188,22 @@
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# find minimum distance\n",
"ak.min(dr, axis=1)"
"drmin = ak.min(dr, axis=1)\n",
"drmin"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# a convenience method for this operation on all jets is available\n",
Expand All @@ -191,7 +222,9 @@
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"print(events.Jet.jetId)\n",
Expand All @@ -208,7 +241,9 @@
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"print(f\"Raw status flags: {events.GenPart.statusFlags}\")\n",
Expand All @@ -225,7 +260,9 @@
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"events.Electron.matched_gen.pdgId"
Expand All @@ -234,7 +271,9 @@
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"events.Muon[ak.num(events.Muon)>0].matched_jet.pt"
Expand All @@ -250,7 +289,9 @@
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"events.GenPart.parent.pdgId"
Expand All @@ -266,7 +307,9 @@
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"events.GenPart.parent.children.pdgId\n",
Expand All @@ -283,7 +326,9 @@
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"events.GenPart[\n",
Expand All @@ -302,18 +347,22 @@
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"mmevents = events[ak.num(events.Muon) == 2]\n",
"zmm = mmevents.Muon[:, 0] + mmevents.Muon[:, 1]\n",
"zmm = (mmevents.Muon[:, 0] + mmevents.Muon[:, 1])\n",
"zmm.mass"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# a convenience method is available to sum vectors along an axis:\n",
Expand All @@ -322,15 +371,19 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"tags": []
},
"source": [
"As expected for this sample, most of the dimuon events have a pair invariant mass close to that of a Z boson. But what about the last event? Let's take a look at the generator information:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"print(mmevents[-1].Muon.matched_gen.pdgId)\n",
Expand All @@ -349,7 +402,9 @@
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"mmevents[-1].Muon.matched_gen.parent.pdgId"
Expand All @@ -365,7 +420,9 @@
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"print(mmevents.Muon.matched_gen.sum().mass[-1])\n",
Expand All @@ -386,12 +443,73 @@
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"mmevents[\"Electron\", \"myvariable\"] = mmevents.Electron.pt + zmm.mass\n",
"mmevents.Electron.myvariable"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import dask\n",
"\n",
"# We'll use the latest dask-capable schema to load the same events in delayed (dask) mode\n",
"NanoAODSchema.warn_missing_crossrefs=False\n",
"\n",
"fname = \"https://raw.githubusercontent.com/CoffeaTeam/coffea/master/tests/samples/nano_dy.root\"\n",
"devents = NanoEventsFactory.from_root(\n",
" {fname: \"Events\"},\n",
" schemaclass=NanoAODSchema, # we do not use the v6 schema here\n",
" metadata={\"dataset\": \"DYJets\"},\n",
" delayed=True,\n",
").events()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"devents.Jet.nearest(devents.Electron)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"a, b = dask.compute(devents.Jet.jetId, devents.Jet.isTight)\n",
"print(a)\n",
"print(b)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"dmmevents = devents[ak.num(devents.Muon) == 2]\n",
"dzmm = (dmmevents.Muon[:, 0] + dmmevents.Muon[:, 1]).compute() # we insert the compute on the intermediate array\n",
"dzmm.mass"
]
}
],
"metadata": {
Expand All @@ -410,9 +528,9 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.13"
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 2
"nbformat_minor": 4
}
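
Note on the `delayed=True` cells: the updated footnote describes delayed access as tracing operations with minimal metadata and fetching data only at compute time. The sketch below illustrates that idea in plain Python. The `Delayed` class, `load_column`, and the column names are hypothetical, invented for this illustration; this is not how dask or dask-awkward are implemented, only a minimal model of "book now, read later".

```python
# Hypothetical illustration of delayed ("booked") evaluation; not dask's API.
class Delayed:
    """Records a computation and runs it only when compute() is called."""

    def __init__(self, func, *args):
        self.func = func
        self.args = args

    def compute(self):
        # Resolve any delayed arguments first, then apply this node's function.
        resolved = [a.compute() if isinstance(a, Delayed) else a for a in self.args]
        return self.func(*resolved)

    def __add__(self, other):
        # Adding two delayed values books a new node instead of computing.
        return Delayed(lambda x, y: x + y, self, other)


load_calls = []  # records which columns were actually read


def load_column(name):
    # Stand-in for reading a branch from a (possibly remote) file.
    load_calls.append(name)
    return {"pt_a": 40.0, "pt_b": 25.0}[name]


pt_a = Delayed(load_column, "pt_a")
pt_b = Delayed(load_column, "pt_b")
total = pt_a + pt_b               # only books the operation; nothing is read
reads_before = list(load_calls)   # still empty at this point
result = total.compute()          # both columns are read here
```

In this model, `reads_before` stays empty until `compute()` is called, which mirrors why the delayed cells in the notebook need an explicit `.compute()` or `dask.compute(...)` before values can be printed.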
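
Note on the dimuon cells: `zmm.mass` is the invariant mass of the summed muon four-vectors. For readers checking the numbers by hand, here is a minimal plain-Python sketch of that calculation from `(pt, eta, phi, mass)` components. The helper names `to_cartesian` and `pair_mass` and the example muon values are assumptions made for illustration; coffea's vector mixins do this on whole arrays and are not implemented this way.

```python
import math


def to_cartesian(pt, eta, phi, mass):
    """Convert (pt, eta, phi, mass) to (E, px, py, pz)."""
    px = pt * math.cos(phi)
    py = pt * math.sin(phi)
    pz = pt * math.sinh(eta)
    energy = math.sqrt(px**2 + py**2 + pz**2 + mass**2)
    return energy, px, py, pz


def pair_mass(a, b):
    """Invariant mass of the sum of two four-vectors given as (pt, eta, phi, mass)."""
    e, px, py, pz = (x + y for x, y in zip(to_cartesian(*a), to_cartesian(*b)))
    m2 = e**2 - px**2 - py**2 - pz**2
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(m2, 0.0))


MUON_MASS = 0.1057  # GeV, approximate

# Two made-up back-to-back muons in the transverse plane.
mu1 = (45.0, 0.0, 0.0, MUON_MASS)
mu2 = (45.0, 0.0, math.pi, MUON_MASS)
mass = pair_mass(mu1, mu2)
```

For this back-to-back configuration the momenta cancel, so the pair mass is just the sum of the two muon energies, roughly 90 GeV.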