Meta refactor #74

Open · wants to merge 19 commits into base: main

2 changes: 1 addition & 1 deletion .ruff.toml
@@ -12,7 +12,7 @@ lint.select = [
"PIE", # flake8-pie
"PL", # pylint
"PT", # flake8-pytest-style
# "PTH", # flake8-use-pathlib
"PTH", # flake8-use-pathlib
"RET", # flake8-return
"RUF", # Ruff-specific
"SIM", # flake8-simplify
112 changes: 0 additions & 112 deletions README.md
@@ -3,115 +3,3 @@
Implementation of an automatic data processing flow for L200
data, based on
[Snakemake](https://snakemake.readthedocs.io/).


## Configuration

Data processing resources are configured via a single site-dependent (and
possibly user-dependent) configuration file, referred to as `config.json` in the
following. You may choose an arbitrary name, though.

Use the included [templates/config.json](templates/config.json) as a template
and adjust the data base paths as necessary. Note that, when running Snakemake,
the default path to the config file is `./config.json`.
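
A config file in a non-default location can be passed explicitly on the command
line. As a minimal sketch (the target name below is just a placeholder):
```shell
$ snakemake --configfile=/path/to/your/config.json <some-target>
```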


## Key-Lists

Data generation is based on key-lists, which are flat text files
(extension ".keylist") containing one entry of the form
`{experiment}-{period}-{run}-{datatype}-{timestamp}` per line.

Key-lists can be auto-generated based on the available DAQ files
using Snakemake targets of the form

* `all-{experiment}.keylist`
* `all-{experiment}-{period}.keylist`
* `all-{experiment}-{period}-{run}.keylist`
* `all-{experiment}-{period}-{run}-{datatype}.keylist`

which will generate the list of available file keys for all L200 files, for a
specific period, for a specific period and run, and so on.

For example:
```shell
$ snakemake all-l200-myper.keylist
```
will generate a key-list with all files regarding period `myper`.
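
The resulting key-list is a plain text file. Its contents might look like the
following (the keys shown here are hypothetical and only illustrate the
`{experiment}-{period}-{run}-{datatype}-{timestamp}` format):
```shell
$ cat all-l200-myper.keylist
l200-myper-r001-cal-20230101T000000Z
l200-myper-r001-phy-20230101T010000Z
l200-myper-r002-phy-20230102T000000Z
```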


## File-Lists

File-lists are flat files listing output files that should be generated,
with one file per line. A file-list will typically be generated for a given
data tier from a key-list, using the Snakemake targets of the form
`{label}-{tier}.filelist` (generated from `{label}.keylist`).

For file lists based on auto-generated key-lists like
`all-{experiment}-{period}-{tier}.filelist`, the corresponding key-list
(`all-{experiment}-{period}.keylist` in this case) will be created
automatically, if it doesn't exist.

Example:
```shell
$ snakemake all-mydet-mymeas-tier2.filelist
```

File-lists may of course also be derived from custom keylists, generated
manually or by other means, e.g. `my-dataset-raw.filelist` will be
generated from `my-dataset.keylist`.
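
To illustrate, a file-list simply names the output files to be produced, one
per line. As a rough sketch (the paths and file names below are placeholders;
the real ones depend on the configured tier directories):
```shell
$ cat my-dataset-raw.filelist
<raw-tier-dir>/l200-myper-r001-phy-20230101T010000Z-tier_raw.lh5
<raw-tier-dir>/l200-myper-r002-phy-20230102T000000Z-tier_raw.lh5
```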


## Main output generation

Usually, the main output will be determined by a file-list, i.e. by a key-list
and a data tier. The special output target `{label}-{tier}.gen` is used to
generate all files listed in `{label}-{tier}.filelist`. After the files are
created, the empty file `{label}-{tier}.gen` is created to mark the successful
data production.

Snakemake targets like `all-{experiment}-{period}-{tier}.gen` may be used
to automatically generate key-lists and file-lists (if not already present)
and produce all possible output for the given data tier, based on available
tier0 files which match the target.

Example:
```shell
$ snakemake all-mydet-mymeas-tier2.gen
```
Targets like `my-dataset-raw.gen` (derived from a key-list
`my-dataset.keylist`) are of course allowed as well.
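
For instance, to produce such a custom dataset end-to-end:
```shell
$ snakemake my-dataset-raw.gen
```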


## Monitoring

Snakemake supports monitoring by connecting to a
[panoptes](https://github.com/panoptes-organization/panoptes) server.

Run (e.g.)
```shell
$ panoptes --port 5000
```
in the background to run a panoptes server instance, which comes with a
GUI that can be accessed with a web browser on the specified port.

Then use the Snakemake option `--wms-monitor` to instruct Snakemake to push
progress information to the panoptes server:
```shell
snakemake --wms-monitor http://127.0.0.1:5000 [...]
```

## Using software containers

This dataflow doesn't use Snakemake's internal Singularity support, but
instead supports Singularity containers via
[`venv`](https://github.com/oschulz/singularity-venv) environments
for greater control.

To use this, the path to `venv` and the name of the environment must be set
in `config.json`.

This is only relevant when running Snakemake *outside* of the software
container, e.g. when using a batch system (see below). If Snakemake
and the whole workflow are run inside of a container instance, no
container-related settings in `config.json` are required.
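
As a rough sketch, the container-related entries in `config.json` might look
like the following (the key names here are illustrative placeholders; the
authoritative ones are defined in `templates/config.json`):
```shell
$ cat config.json    # excerpt; key names are placeholders
{
  "execenv": {
    "venv_path": "/path/to/singularity-venv",
    "venv_name": "legend-software"
  }
}
```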
112 changes: 45 additions & 67 deletions Snakefile
@@ -10,7 +10,7 @@ This includes:
- the same for partition level tiers
"""

import pathlib
from pathlib import Path
import os
import json
import sys
@@ -20,8 +20,8 @@ from collections import OrderedDict
import logging

import scripts.util as ds
from scripts.util.pars_loading import pars_catalog
from scripts.util.patterns import get_pattern_tier_raw
from scripts.util.pars_loading import ParsCatalog
from scripts.util.patterns import get_pattern_tier
from scripts.util.utils import (
subst_vars_in_snakemake_config,
runcmd,
@@ -31,6 +31,7 @@ from scripts.util.utils import (
metadata_path,
tmp_log_path,
pars_path,
det_status_path,
)

# Set with `snakemake --configfile=/path/to/your/config.json`
@@ -43,8 +44,9 @@ setup = config["setups"]["l200"]
configs = config_path(setup)
chan_maps = chan_map_path(setup)
meta = metadata_path(setup)
det_status = det_status_path(setup)
swenv = runcmd(setup)
part = ds.dataset_file(setup, os.path.join(configs, "partitions.json"))
part = ds.CalGrouping(setup, Path(det_status) / "cal_partitions.yaml")
basedir = workflow.basedir


@@ -66,38 +68,13 @@ include: "rules/psp.smk"
include: "rules/hit.smk"
include: "rules/pht.smk"
include: "rules/pht_fast.smk"
include: "rules/ann.smk"
include: "rules/evt.smk"
include: "rules/skm.smk"
include: "rules/blinding_calibration.smk"
include: "rules/qc_phy.smk"


# Log parameter catalogs in validity.jsonl files
hit_par_cat_file = os.path.join(pars_path(setup), "hit", "validity.jsonl")
if os.path.isfile(hit_par_cat_file):
os.remove(os.path.join(pars_path(setup), "hit", "validity.jsonl"))
pathlib.Path(os.path.dirname(hit_par_cat_file)).mkdir(parents=True, exist_ok=True)
ds.pars_key_resolve.write_to_jsonl(hit_par_catalog, hit_par_cat_file)

pht_par_cat_file = os.path.join(pars_path(setup), "pht", "validity.jsonl")
if os.path.isfile(pht_par_cat_file):
os.remove(os.path.join(pars_path(setup), "pht", "validity.jsonl"))
pathlib.Path(os.path.dirname(pht_par_cat_file)).mkdir(parents=True, exist_ok=True)
ds.pars_key_resolve.write_to_jsonl(pht_par_catalog, pht_par_cat_file)

dsp_par_cat_file = os.path.join(pars_path(setup), "dsp", "validity.jsonl")
if os.path.isfile(dsp_par_cat_file):
os.remove(dsp_par_cat_file)
pathlib.Path(os.path.dirname(dsp_par_cat_file)).mkdir(parents=True, exist_ok=True)
ds.pars_key_resolve.write_to_jsonl(dsp_par_catalog, dsp_par_cat_file)

psp_par_cat_file = os.path.join(pars_path(setup), "psp", "validity.jsonl")
if os.path.isfile(psp_par_cat_file):
os.remove(psp_par_cat_file)
pathlib.Path(os.path.dirname(psp_par_cat_file)).mkdir(parents=True, exist_ok=True)
ds.pars_key_resolve.write_to_jsonl(psp_par_catalog, psp_par_cat_file)


localrules:
gen_filelist,
autogen_output,
@@ -111,36 +88,36 @@ onstart:
shell('{swenv} python3 -B -c "import ' + pkg + '"')

# Log parameter catalogs in validity.jsonl files
hit_par_cat_file = os.path.join(pars_path(setup), "hit", "validity.jsonl")
if os.path.isfile(hit_par_cat_file):
os.remove(os.path.join(pars_path(setup), "hit", "validity.jsonl"))
pathlib.Path(os.path.dirname(hit_par_cat_file)).mkdir(parents=True, exist_ok=True)
ds.pars_key_resolve.write_to_jsonl(hit_par_catalog, hit_par_cat_file)

pht_par_cat_file = os.path.join(pars_path(setup), "pht", "validity.jsonl")
if os.path.isfile(pht_par_cat_file):
os.remove(os.path.join(pars_path(setup), "pht", "validity.jsonl"))
pathlib.Path(os.path.dirname(pht_par_cat_file)).mkdir(parents=True, exist_ok=True)
ds.pars_key_resolve.write_to_jsonl(pht_par_catalog, pht_par_cat_file)

dsp_par_cat_file = os.path.join(pars_path(setup), "dsp", "validity.jsonl")
if os.path.isfile(dsp_par_cat_file):
os.remove(dsp_par_cat_file)
pathlib.Path(os.path.dirname(dsp_par_cat_file)).mkdir(parents=True, exist_ok=True)
ds.pars_key_resolve.write_to_jsonl(dsp_par_catalog, dsp_par_cat_file)

psp_par_cat_file = os.path.join(pars_path(setup), "psp", "validity.jsonl")
if os.path.isfile(psp_par_cat_file):
os.remove(psp_par_cat_file)
pathlib.Path(os.path.dirname(psp_par_cat_file)).mkdir(parents=True, exist_ok=True)
ds.pars_key_resolve.write_to_jsonl(psp_par_catalog, psp_par_cat_file)
hit_par_cat_file = Path(pars_path(setup)) / "hit" / "validity.yaml"
if hit_par_cat_file.is_file():
hit_par_cat_file.unlink()
Path(hit_par_cat_file).parent.mkdir(parents=True, exist_ok=True)
ds.ParsKeyResolve.write_to_yaml(hit_par_catalog, hit_par_cat_file)

pht_par_cat_file = Path(pars_path(setup)) / "pht" / "validity.yaml"
if pht_par_cat_file.is_file():
pht_par_cat_file.unlink()
Path(pht_par_cat_file).parent.mkdir(parents=True, exist_ok=True)
ds.ParsKeyResolve.write_to_yaml(pht_par_catalog, pht_par_cat_file)

dsp_par_cat_file = Path(pars_path(setup)) / "dsp" / "validity.yaml"
if dsp_par_cat_file.is_file():
dsp_par_cat_file.unlink()
Path(dsp_par_cat_file).parent.mkdir(parents=True, exist_ok=True)
ds.ParsKeyResolve.write_to_yaml(dsp_par_catalog, dsp_par_cat_file)

psp_par_cat_file = Path(pars_path(setup)) / "psp" / "validity.yaml"
if psp_par_cat_file.is_file():
psp_par_cat_file.unlink()
Path(psp_par_cat_file).parent.mkdir(parents=True, exist_ok=True)
ds.ParsKeyResolve.write_to_yaml(psp_par_catalog, psp_par_cat_file)


onsuccess:
from snakemake.report import auto_report

rep_dir = f"{log_path(setup)}/report-{datetime.strftime(datetime.utcnow(), '%Y%m%dT%H%M%SZ')}"
pathlib.Path(rep_dir).mkdir(parents=True, exist_ok=True)
Path(rep_dir).mkdir(parents=True, exist_ok=True)
# auto_report(workflow.persistence.dag, f"{rep_dir}/report.html")

with open(os.path.join(rep_dir, "dag.txt"), "w") as f:
@@ -157,15 +134,15 @@ onsuccess:
if os.path.isfile(file):
os.remove(file)

# remove filelists
files = glob.glob(os.path.join(filelist_path(setup), "*"))
for file in files:
if os.path.isfile(file):
os.remove(file)
if os.path.exists(filelist_path(setup)):
os.rmdir(filelist_path(setup))
# # remove filelists
# files = glob.glob(os.path.join(filelist_path(setup), "*"))
# for file in files:
# if os.path.isfile(file):
# os.remove(file)
# if os.path.exists(filelist_path(setup)):
# os.rmdir(filelist_path(setup))

# remove logs
# remove logs
files = glob.glob(os.path.join(tmp_log_path(setup), "*", "*.log"))
for file in files:
if os.path.isfile(file):
@@ -190,16 +167,17 @@ rule gen_filelist:
lambda wildcards: get_filelist(
wildcards,
setup,
get_pattern_tier_raw(setup),
ignore_keys_file=os.path.join(configs, "ignore_keys.keylist"),
analysis_runs_file=os.path.join(configs, "analysis_runs.json"),
get_pattern_tier(setup, "raw", check_in_cycle=False),
ignore_keys_file=Path(det_status) / "ignored_daq_cycles.yaml",
analysis_runs_file=Path(det_status) / "runlists.yaml",
),
output:
os.path.join(filelist_path(setup), "{label}-{tier}.filelist"),
temp(Path(filelist_path(setup)) / "{label}-{tier}.filelist"),
run:
if len(input) == 0:
print(
"WARNING: No files found for the given pattern\nmake sure pattern follows the format: all-{experiment}-{period}-{run}-{datatype}-{timestamp}-{tier}.gen"
f"WARNING: No files found for the given pattern:{wildcards.label}",
"\nmake sure pattern follows the format: all-{experiment}-{period}-{run}-{datatype}-{timestamp}-{tier}.gen",
)
with open(output[0], "w") as f:
for fn in input:
21 changes: 21 additions & 0 deletions docs/Makefile
@@ -0,0 +1,21 @@
SHELL := /bin/bash
SOURCEDIR = source
BUILDDIR = build

all: apidoc
sphinx-build -M html "$(SOURCEDIR)" "$(BUILDDIR)" -W --keep-going

apidoc: clean-apidoc
sphinx-apidoc \
--private \
--module-first \
--force \
--output-dir "$(SOURCEDIR)/api" \
../scripts \
../rules

clean-apidoc:
rm -rf "$(SOURCEDIR)/api"

clean: clean-apidoc
rm -rf "$(BUILDDIR)"
15 changes: 15 additions & 0 deletions docs/source/developer.rst
@@ -0,0 +1,15 @@
Developers Guide
================

Snakemake is configured around a series of rules which specify how to generate a file (or set of files) from a set of input files.
These rules are defined in the ``Snakefile`` and in the files in the ``rules`` directory.
In general, a series of rules runs on some calibration data and generates
a final ``par_{tier}.yaml`` file, which can then be used by the ``tier`` rule to generate all the files in that tier.
Most rules come in two versions: the basic version, which uses a single run,
and the partition version, which groups many runs together.
This grouping is defined in the ``cal_grouping.yaml`` file in the `legend-datasets <https://github.com/legend-exp/legend-datasets>`_ repository.

Each rule specifies its inputs and outputs, along with how to generate the outputs, which can be
a shell command or a call to a Python function. These scripts are stored in the ``scripts`` directory.
Additional parameters can also be defined.
Full details can be found in the `Snakemake documentation <https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html>`_.
41 changes: 41 additions & 0 deletions docs/source/index.rst
@@ -0,0 +1,41 @@
Welcome to legend-dataflow's documentation!
===========================================

*legend-dataflow* is a Python package based on `Snakemake <https://snakemake.readthedocs.io/en/stable/index.html>`_
for running the data production of LEGEND.
It is designed to calibrate and optimise hundreds of channels in parallel before
bringing them all together to process the data. It takes as input the metadata
in the `legend-metadata <https://github.com/legend-exp/legend-metadata>`_ repository.

Getting started
---------------

It is recommended to install and use the package through the `legend-prodenv <https://github.com/legend-exp/legend-prodenv>`_.

Next steps
----------

.. toctree::
:maxdepth: 1

Package API reference <api/modules>

.. toctree::
:maxdepth: 1

tutorials

.. toctree::
:maxdepth: 1
:caption: Related projects

LEGEND Data Objects <https://legend-pydataobj.readthedocs.io>
Decoding Digitizer Data <https://legend-daq2lh5.readthedocs.io>
Digital Signal Processing <https://dspeed.readthedocs.io>
Pygama <https://pygama.readthedocs.io>

.. toctree::
:maxdepth: 1
:caption: Development

Source Code <https://github.com/legend-exp/legend-dataflow>