CoffeaTeam · NJManganelli · Jul 19, 2024 · Jul 19, 2024 · Jul 19, 2024 · Jul 19, 2024
diff --git a/01-nanoevents.ipynb b/01-nanoevents.ipynb
@@ -8,7 +8,7 @@
     "\n",
     "This is a rendered copy of [nanoevents.ipynb](https://github.com/CoffeaTeam/coffea/blob/master/binder/nanoevents.ipynb). You can optionally run it interactively on [binder at this link](https://mybinder.org/v2/gh/coffeateam/coffea/master?filepath=binder%2Fnanoevents.ipynb)\n",
     "\n",
-    "NanoEvents is a Coffea utility to wrap flat nTuple structures (such as the CMS [NanoAOD](https://www.epj-conferences.org/articles/epjconf/pdf/2019/19/epjconf_chep2018_06021.pdf) format) into a single awkward array with appropriate object methods (such as Lorentz vector methods$^*$), cross references, and nested objects, all lazily accessed$^\\dagger$ from the source ROOT TTree via uproot. The interpretation of the TTree data is configurable via [schema objects](https://coffeateam.github.io/coffea/modules/coffea.nanoevents.html#classes), which are community-supplied  for various source file types. These schema objects allow a richer interpretation of the file contents than the [uproot.lazy](https://uproot4.readthedocs.io/en/latest/uproot4.behaviors.TBranch.lazy.html) methods. Currently available schemas include:\n",
+    "NanoEvents is a Coffea utility to wrap flat nTuple structures (such as the CMS [NanoAOD](https://www.epj-conferences.org/articles/epjconf/pdf/2019/19/epjconf_chep2018_06021.pdf) format) into a single awkward array with appropriate object methods (such as Lorentz vector methods$^*$), cross references, and nested objects, all accessed in delayed$^\\dagger$ mode from the source ROOT TTree via uproot. The interpretation of the TTree data is configurable via [schema objects](https://coffeateam.github.io/coffea/modules/coffea.nanoevents.html#classes), which are community-supplied  for various source file types. These schema objects allow a richer interpretation of the file contents than the [uproot.dask](https://uproot.readthedocs.io/en/latest/uproot._dask.dask.html) methods. Currently available schemas include:\n",
     "\n",
     "   - `BaseSchema`, which provides a simple representation of the input TTree, where each branch is available verbatim as `events.branch_name`, effectively the same behavior as `uproot.lazy`.  Any branches that uproot supports at \"full speed\" (i.e. that are fully split and either flat or single-jagged) can be read by this schema;\n",
     "   - `NanoAODSchema`, which is optimized to provide all methods and cross-references in CMS NanoAOD format;\n",
@@ -21,7 +21,7 @@
     "\n",
     "$^*$ Vector methods are currently made possible via the [coffea vector](https://coffeateam.github.io/coffea/modules/coffea.nanoevents.methods.vector.html) methods mixin class structure. In a future version of coffea, they will instead be provided by the dedicated scikit-hep [vector](https://vector.readthedocs.io/en/latest/) library, which provides a more rich feature set. The coffea vector methods predate the release of the vector library.\n",
     "\n",
-    "$^\\dagger$ _Lazy_ access refers to only fetching the needed data from the (possibly remote) file when a sub-array is first accessed. The sub-array is then _materialized_ and subsequent access of the sub-array uses a cached value in memory. As such, fully materializing a `NanoEvents` object may require a significant amount of memory.\n",
+    "$^\\dagger$ _delayed_ access refers to only fetching the needed data from the (possibly remote) file when you're ready to compute the entire set of outputs. Until then, a minimal amount of metadata is used to trace the operations to eventually be performed (frequently called \"lazily\" \"booking\" operations in RDataFrame's approach to this)\n",
     "\n",
     "\n",
     "In this demo, we will use NanoEvents to read a small CMS NanoAOD sample. The events object can be instantiated as follows:"
@@ -30,17 +30,20 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "tags": []
+   },
    "outputs": [],
    "source": [
     "import awkward as ak\n",
     "from coffea.nanoevents import NanoEventsFactory, NanoAODSchema\n",
     "\n",
     "fname = \"https://raw.githubusercontent.com/CoffeaTeam/coffea/master/tests/samples/nano_dy.root\"\n",
     "events = NanoEventsFactory.from_root(\n",
-    "    fname,\n",
+    "    {fname: \"Events\"}, # We pass a dictionary of {filename1: treename1, filename2: treename2, ...} to load from\n",
     "    schemaclass=NanoAODSchema.v6,\n",
     "    metadata={\"dataset\": \"DYJets\"},\n",
+    "    delayed=False, # You can turn this to True and insert `` commands at the end of variables\n",
     ").events()"
    ]
   },
@@ -68,7 +71,9 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "tags": []
+   },
    "outputs": [],
    "source": [
     "events.Generator.id1"
@@ -77,7 +82,9 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "tags": []
+   },
    "outputs": [],
    "source": [
     "# all names can be listed with:\n",
@@ -103,7 +110,9 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "tags": []
+   },
    "outputs": [],
    "source": [
     "events.Generator.id1?"
@@ -119,7 +128,9 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "tags": []
+   },
    "outputs": [],
    "source": [
     "events.GenJet.fields"
@@ -135,7 +146,20 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "events.GenJet.energy"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "tags": []
+   },
    "outputs": [],
    "source": [
     "events.GenJet.energy"
@@ -151,7 +175,9 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "tags": []
+   },
    "outputs": [],
    "source": [
     "# find distance between leading jet and all electrons in each event\n",
@@ -162,17 +188,22 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "tags": []
+   },
    "outputs": [],
    "source": [
     "# find minimum distance\n",
-    "ak.min(dr, axis=1)"
+    "drmin = ak.min(dr, axis=1)\n",
+    "drmin"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "tags": []
+   },
    "outputs": [],
    "source": [
     "# a convenience method for this operation on all jets is available\n",
@@ -191,7 +222,9 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "tags": []
+   },
    "outputs": [],
    "source": [
     "print(events.Jet.jetId)\n",
@@ -208,7 +241,9 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "tags": []
+   },
    "outputs": [],
    "source": [
     "print(f\"Raw status flags: {events.GenPart.statusFlags}\")\n",
@@ -225,7 +260,9 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "tags": []
+   },
    "outputs": [],
    "source": [
     "events.Electron.matched_gen.pdgId"
@@ -234,7 +271,9 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "tags": []
+   },
    "outputs": [],
    "source": [
     "events.Muon[ak.num(events.Muon)>0].matched_jet.pt"
@@ -250,7 +289,9 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "tags": []
+   },
    "outputs": [],
    "source": [
     "events.GenPart.parent.pdgId"
@@ -266,7 +307,9 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "tags": []
+   },
    "outputs": [],
    "source": [
     "events.GenPart.parent.children.pdgId\n",
@@ -283,7 +326,9 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "tags": []
+   },
    "outputs": [],
    "source": [
     "events.GenPart[\n",
@@ -302,18 +347,22 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "tags": []
+   },
    "outputs": [],
    "source": [
     "mmevents = events[ak.num(events.Muon) == 2]\n",
-    "zmm = mmevents.Muon[:, 0] + mmevents.Muon[:, 1]\n",
+    "zmm = (mmevents.Muon[:, 0] + mmevents.Muon[:, 1])\n",
     "zmm.mass"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "tags": []
+   },
    "outputs": [],
    "source": [
     "# a convenience method is available to sum vectors along an axis:\n",
@@ -322,15 +371,19 @@
   },
   {
    "cell_type": "markdown",
-   "metadata": {},
+   "metadata": {
+    "tags": []
+   },
    "source": [
     "As expected for this sample, most of the dimuon events have a pair invariant mass close to that of a Z boson. But what about the last event? Let's take a look at the generator information:"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "tags": []
+   },
    "outputs": [],
    "source": [
     "print(mmevents[-1].Muon.matched_gen.pdgId)\n",
@@ -349,7 +402,9 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "tags": []
+   },
    "outputs": [],
    "source": [
     "mmevents[-1].Muon.matched_gen.parent.pdgId"
@@ -365,7 +420,9 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "tags": []
+   },
    "outputs": [],
    "source": [
     "print(mmevents.Muon.matched_gen.sum().mass[-1])\n",
@@ -386,12 +443,73 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "tags": []
+   },
    "outputs": [],
    "source": [
     "mmevents[\"Electron\", \"myvariable\"] = mmevents.Electron.pt + zmm.mass\n",
     "mmevents.Electron.myvariable"
    ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "import dask\n",
+    "\n",
+    "# We'll use the latest dask_capable schema to load the same events in delayed (dask) mode\n",
+    "NanoAODSchema.warn_missing_crossrefs=False\n",
+    "\n",
+    "fname = \"https://raw.githubusercontent.com/CoffeaTeam/coffea/master/tests/samples/nano_dy.root\"\n",
+    "devents = NanoEventsFactory.from_root(\n",
+    "    {fname: \"Events\"},\n",
+    "    schemaclass=NanoAODSchema, #we'll not use the v6 schema here\n",
+    "    metadata={\"dataset\": \"DYJets\"},\n",
+    "    delayed=True,\n",
+    ").events()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "devents.Jet.nearest(devents.Electron)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "a, b = dask.compute(devents.Jet.jetId, devents.Jet.isTight)\n",
+    "print(a)\n",
+    "print(b)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "dmmevents = devents[ak.num(devents.Muon) == 2]\n",
+    "dzmm = (dmmevents.Muon[:, 0] + dmmevents.Muon[:, 1]).compute() # we insert the compute on the intermediate array\n",
+    "zmm.mass"
+   ]
   }
  ],
  "metadata": {
@@ -410,9 +528,9 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.9.13"
+   "version": "3.10.12"
   }
  },
  "nbformat": 4,
- "nbformat_minor": 2
+ "nbformat_minor": 4
 }