Merge pull request #32 from pepkit/master_Looper181
Update docs for Looper 1.8.1
donaldcampbelljr authored Jun 6, 2024
2 parents 8881bea + fa20901 commit 851e4ec
Showing 22 changed files with 1,370 additions and 856 deletions.
4 changes: 3 additions & 1 deletion .gitignore
@@ -2,4 +2,6 @@ site
venv/
.venv/

.DS_Store
/docs/looper/notebooks/hello_looper-master/
/docs/looper/notebooks/master.zip
57 changes: 54 additions & 3 deletions docs/looper/changelog.md
@@ -1,13 +1,64 @@
# Changelog

This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html) and [Keep a Changelog](https://keepachangelog.com/en/1.0.0/) format.
## [1.8.1] -- 2024-06-05

### Fixed
- added `-v` and `--version` to the CLI
- fixed running project level with `--project` argument

## [1.8.0] -- 2024-06-04

### Added
- looper destroy now destroys individual results when pipestat is configured: https://github.com/pepkit/looper/issues/469
- comprehensive smoketests: https://github.com/pepkit/looper/issues/464
- allow rerun to work on both failed or waiting flags: https://github.com/pepkit/looper/issues/463

### Changed
- looper now works with pipestat v0.6.0 and greater
- looper table and check now use pipestat and therefore require pipestat configuration. [#390](https://github.com/pepkit/looper/issues/390)
- Migrated `argparse` CLI definition to a pydantic basis for all commands. See: https://github.com/pepkit/looper/issues/438
- during project load, check if PEP file path is a file first, then check if it is a registry path: https://github.com/pepkit/looper/issues/456
- Looper now uses FutureYamlConfigManager due to the yacman refactor v0.9.3: https://github.com/pepkit/looper/issues/452

### Fixed
- inferring project name when loading PEP from csv: https://github.com/pepkit/looper/issues/484
- fix inconsistency resolving pipeline interface paths if multiple paths are supplied: https://github.com/pepkit/looper/issues/474
- fix bug with checking for completed flags: https://github.com/pepkit/looper/issues/470
- fix looper destroy not properly destroying all related files: https://github.com/pepkit/looper/issues/468
- looper rerun now only runs failed jobs as intended: https://github.com/pepkit/looper/issues/467
- looper inspect now inspects the looper config: https://github.com/pepkit/looper/issues/462
- Load PEP from CSV: https://github.com/pepkit/looper/issues/456
- looper now works with `sample_table_index`: https://github.com/pepkit/looper/issues/458

## [1.7.1] -- 2024-05-28

### Fixed
- pin pipestat version to be between pipestat>=0.8.0,<0.9.0 https://github.com/pepkit/looper/issues/494


## [1.7.0] -- 2024-01-26

### Added
- `--portable` flag to `looper report` to create a portable version of the html report
- `--lump-j` allows grouping samples into a defined number of jobs

### Changed
- `--lumpn` is now `--lump-n`
- `--lump` is now `--lump-s`
## [1.6.0] -- 2023-12-22

### Added
- `looper link` creates symlinks for results grouped by record_identifier. It requires pipestat to be configured. [#72](https://github.com/pepkit/looper/issues/72)
- basic tab completion.

### Changed
- looper now works with pipestat v0.6.0 and greater.
- `looper table`, `check` now use pipestat and therefore require pipestat configuration. [#390](https://github.com/pepkit/looper/issues/390)
- changed how looper configures pipestat [#411](https://github.com/pepkit/looper/issues/411)
- initializing pipeline interface also writes an example `output_schema.yaml` and `count_lines.sh` pipeline

### Fixed
- filtering via attributes that are integers.

## [1.5.1] -- 2023-08-14

381 changes: 233 additions & 148 deletions docs/looper/code/hello-world.md

Large diffs are not rendered by default.

142 changes: 48 additions & 94 deletions docs/looper/defining-a-project.md
@@ -2,128 +2,82 @@

## 1. Start with a basic PEP

To start, you need a project defined in the [standard Portable Encapsulated Project (PEP) format](http://pep.databio.org). Start by [creating a PEP](https://pep.databio.org/en/latest/simple_example/).
To start, you need a project defined in the [standard Portable Encapsulated Project (PEP) format](http://pep.databio.org). Start by [creating a PEP](https://pep.databio.org/spec/simple-example/).

## 2. Connect the PEP to looper
## 2. Specify the Sample Annotation

### 2.1 Specify `output_dir`

Once you have a basic PEP, you can connect it to looper. Just provide the required looper-specific piece of information -- `output_dir`, a parent folder where you want looper to store your results. You do this by adding a `looper` section to your PEP. The `output_dir` key is expected at the top level of the `looper` section of the project configuration file. Here's an example:
This information generally lives in a `project_config.yaml` file.

Simplest example:
```yaml
looper:
  output_dir: "/path/to/output_dir"
```

```yaml
pep_version: 2.0.0
sample_table: sample_annotation.csv
```
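The `sample_table` key points at a CSV file of sample metadata. A minimal `sample_annotation.csv` might look like this (sample names and protocols are hypothetical):

```csv
sample_name,protocol,organism
frog_1,RRBS,frog
frog_2,ATAC,frog
```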
### 2.2 Configure pipestat
*We recommend reading the [pipestat documentation](https://pipestat.databio.org) to learn more about the concepts described in this section.*
Additionally, you may configure pipestat, the tool used to manage pipeline results. Pipestat provides lots of flexibility, so there are multiple configuration options that you can provide in `looper.pipestat.sample` or `looper.pipestat.project`, depending on the pipeline level you intend to run.

Please note that the configuration options listed below *do not* specify the values passed to pipestat *per se*, but rather the names of the `Project` or `Sample` attributes that hold these values. This way the pipestat configuration can change with the pipeline submitted for each `Sample` when the PEP `sample_modifiers` are used.

- `results_file_attribute`: name of the `Sample` or `Project` attribute that indicates the path to the YAML results file that will be used to report results into. Default value: `pipestat_results_file`, so the path will be sourced from either `Sample.pipestat_results_file` or `Project.pipestat_results_file`. If the path provided this way is not absolute, looper will make it relative to `{looper.output_dir}`.
- `namespace_attribute`: name of the `Sample` or `Project` attribute that indicates the namespace to report into. Default values: `sample_name` for sample-level pipelines and `name` for project-level pipelines, so the value will be sourced from either `Sample.sample_name` or `Project.name`.
- `config_attribute`: name of the `Sample` or `Project` attribute that indicates the path to the pipestat configuration file. It's not needed if the intended pipestat backend is the YAML results file mentioned above, but it's required if the backend is a PostgreSQL database, since this is the only way to provide the database login credentials. Default value: `pipestat_config`, so the path will be sourced from either `Sample.pipestat_config` or `Project.pipestat_config`.
You can also add sample modifiers to the project file using the `derive` or `imply` attributes:

Non-configurable pipestat options:

- `schema_path`: never specified here, since it's sourced from `{pipeline.output_schema}`, which is specified in the pipeline interface file
- `record_identifier`: automatically set to `{pipeline.pipeline_name}`, which is specified in the pipeline interface file
For example:

```yaml
name: "test123"
pipestat_results_file: "project_pipestat_results.yaml"
pipestat_config: "/path/to/project_pipestat_config.yaml"
sample_modifiers:
  append:
    pipestat_config: "/path/to/pipestat_config.yaml"
    pipestat_results_file: "RESULTS_FILE_PLACEHOLDER"
  derive:
    attributes: ["pipestat_results_file"]
    sources:
      RESULTS_FILE_PLACEHOLDER: "{sample_name}/pipestat_results.yaml"
looper:
  output_dir: "/path/to/output_dir"
  # pipestat configuration starts here
  # the values below are defaults, so they are not needed, but configurable
  pipestat:
    sample:
      results_file_attribute: "pipestat_results_file"
      config_attribute: "pipestat_config"
      namespace_attribute: "sample_name"
    project:
      results_file_attribute: "pipestat_results_file"
      config_attribute: "pipestat_config"
      namespace_attribute: "name"
```

If you have a project that contains samples of different types, then you can use an `imply` modifier in your PEP to select which pipelines you want to run on which samples, like this:

```yaml
sample_modifiers:
  imply:
    - if:
        protocol: "RRBS"
      then:
        pipeline_interfaces: "/path/to/pipeline_interface.yaml"
    - if:
        protocol: "ATAC"
      then:
        pipeline_interfaces: "/path/to/pipeline_interface2.yaml"
```
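Conceptually, each `imply` rule checks the attributes in its `if` section and, on a match, merges the `then` attributes into the sample. A rough Python sketch of that idea (illustration only, not looper/peppy's actual implementation; paths are hypothetical):

```python
# Conceptual sketch of the `imply` sample modifier:
# a rule matches when every attribute in its `if` section matches;
# a list value means "any of these values".
def apply_imply(sample, rules):
    for rule in rules:
        matched = all(
            sample.get(attr) in (val if isinstance(val, list) else [val])
            for attr, val in rule["if"].items()
        )
        if matched:
            sample.update(rule["then"])
    return sample

rules = [
    {"if": {"protocol": "RRBS"},
     "then": {"pipeline_interfaces": "/path/to/pipeline_interface.yaml"}},
    {"if": {"protocol": "ATAC"},
     "then": {"pipeline_interfaces": "/path/to/pipeline_interface2.yaml"}},
]
sample = {"sample_name": "sample1", "protocol": "ATAC"}
print(apply_imply(sample, rules)["pipeline_interfaces"])
# → /path/to/pipeline_interface2.yaml
```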
## 3. Link a pipeline to your project

Next, you'll need to point the PEP to the *pipeline interface* file that describes the command you want looper to run.

### Understanding pipeline interfaces

Looper links projects to pipelines through a file called the *pipeline interface*. Any looper-compatible pipeline must provide a pipeline interface. To link the pipeline, you simply point each sample to the pipeline interfaces for any pipelines you want to run.

Looper pipeline interfaces can describe two types of pipeline: sample-level pipelines or project-level pipelines. Briefly, a sample-level pipeline is executed with `looper run`, which runs individually on each sample. A project-level pipeline is executed with `looper runp`, which runs a single job *per pipeline* on an entire project. Typically, you'll first be interested in the sample-level pipelines. You can read in more detail in the [pipeline tiers documentation](pipeline-tiers.md).

### Adding a sample-level pipeline interface

Sample pipelines are linked by adding a sample attribute called `pipeline_interfaces`. There are two easy ways to do this: you can simply add a `pipeline_interfaces` column in the sample table, or you can use an *append* modifier, like this:

```yaml
sample_modifiers:
  append:
    pipeline_interfaces: "/path/to/pipeline_interface.yaml"
```

You can also use `derive` to derive attributes from the PEP:

The value for the `pipeline_interfaces` key should be the *absolute* path to the pipeline interface file. The paths may also contain environment variables. Once your PEP is linked to the pipeline, you just need to make sure your project provides any sample metadata required by the pipeline.
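As an alternative to the `append` modifier, the `pipeline_interfaces` column can live directly in the sample table (a hypothetical example):

```csv
sample_name,protocol,pipeline_interfaces
sample1,ATAC,/path/to/pipeline_interface.yaml
sample2,RRBS,/path/to/pipeline_interface2.yaml
```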

### Adding a project-level pipeline interface

Project pipelines are linked in the `looper` section of the project configuration file:

```yaml
looper:
  pipeline_interfaces: "/path/to/project_pipeline_interface.yaml"
```

```yaml
derive:
  attributes: [read1, read2]
  sources:
    # Obtain tutorial data from http://big.databio.org/pepatac/ then set
    # path to your local saved files
    R1: "${TUTORIAL}/tools/pepatac/examples/data/{sample_name}_r1.fastq.gz"
    R2: "${TUTORIAL}/tools/pepatac/examples/data/{sample_name}_r2.fastq.gz"
```
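Conceptually, a derived attribute is produced by expanding environment variables in the source template and then substituting sample attributes into it. A rough Python sketch of the idea (illustration only, not looper/peppy's actual implementation; the paths and the `TUTORIAL` value are hypothetical):

```python
import os

# Conceptual sketch of attribute derivation: expand environment
# variables, then substitute sample attributes into the template.
def derive(template, sample):
    expanded = os.path.expandvars(template)  # ${TUTORIAL} -> its value
    return expanded.format(**sample)         # {sample_name} -> sample's value

os.environ["TUTORIAL"] = "/home/user/tutorial"  # hypothetical location
sample = {"sample_name": "tutorial1"}
path = derive("${TUTORIAL}/tools/pepatac/examples/data/{sample_name}_r1.fastq.gz", sample)
print(path)
# → /home/user/tutorial/tools/pepatac/examples/data/tutorial1_r1.fastq.gz
```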

### How to link to multiple pipelines

Looper decouples projects and pipelines, so you can have many projects using one pipeline, or many pipelines running on the same project. If you want to run more than one pipeline on a sample, you can simply add more than one pipeline interface, like this:
A more complicated example taken from [PEPATAC](https://pepatac.databio.org/en/latest/):

```yaml
sample_modifiers:
  append:
    pipeline_interfaces: ["/path/to/pipeline_interface.yaml", "/path/to/pipeline_interface2.yaml"]
```

Looper will submit jobs for both of these pipelines.

If you have a project that contains samples of different types, then you can use an `imply` modifier in your PEP to select which pipelines you want to run on which samples, like this:

```yaml
pep_version: 2.0.0
sample_table: tutorial.csv
sample_modifiers:
  derive:
    attributes: [read1, read2]
    sources:
      # Obtain tutorial data from http://big.databio.org/pepatac/ then set
      # path to your local saved files
      R1: "${TUTORIAL}/tools/pepatac/examples/data/{sample_name}_r1.fastq.gz"
      R2: "${TUTORIAL}/tools/pepatac/examples/data/{sample_name}_r2.fastq.gz"
  imply:
    - if:
        protocol: "RRBS"
      then:
        pipeline_interfaces: "/path/to/pipeline_interface.yaml"
    - if:
        protocol: "ATAC"
      then:
        pipeline_interfaces: "/path/to/pipeline_interface2.yaml"
    - if:
        organism: ["human", "Homo sapiens", "Human", "Homo_sapiens"]
      then:
        genome: hg38
        prealignment_names: ["rCRSd"]
        deduplicator: samblaster # Default. [options: picard]
        trimmer: skewer # Default. [options: pyadapt, trimmomatic]
        peak_type: fixed # Default. [options: variable]
        extend: "250" # Default. For fixed-width peaks, extend this distance up- and down-stream.
        frip_ref_peaks: None # Default. Use an external reference set of peaks instead of the peaks called from this run
```


## 5. Customize looper
## 3. Customize looper

That's all you need to get started linking your project to looper. But you can also customize things further. Under the `looper` section, you can provide a `cli` keyword to specify any command line (CLI) options from within the project config file. The subsections within this section direct the arguments to the respective `looper` subcommands. So, to specify, e.g. sample submission limit for a `looper run` command use:
You can also customize things further. Under the `looper` section, you can provide a `cli` keyword to specify any command line (CLI) options from within the project config file. The subsections within this section direct the arguments to the respective `looper` subcommands. So, to specify, e.g. sample submission limit for a `looper run` command use:

```yaml
looper:
```
4 changes: 2 additions & 2 deletions docs/looper/faq.md
Original file line number Diff line number Diff line change
Expand Up @@ -13,12 +13,12 @@ You can add that location to your path by appending it (`export PATH=$PATH:~/.lo

## How can I run my jobs on a cluster?

Looper uses the external package [divvy](http://code.databio.org/divvy) for cluster computing, making it flexible enough to use with any cluster resource environment. Please see the [tutorial on cluster computing with looper and divvy](running-on-a-cluster.md).
Looper uses the external package [divvy](https://pep.databio.org/divvy/) for cluster computing, making it flexible enough to use with any cluster resource environment. Please see the [tutorial on cluster computing with looper and divvy](running-on-a-cluster.md).


## What's the difference between `looper` and `pypiper`?

[`pypiper`](http://pypiper.readthedocs.io) is a more traditional workflow-building framework; it helps you build pipelines to process individual samples. [`looper`](http://looper.readthedocs.io) is completely pipeline-agnostic, and has nothing to do with individual processing steps; it operates groups of samples (as in a project), submitting the appropriate pipeline(s) to a cluster or server (or running them locally). The two projects are independent and can be used separately, but they are most powerful when combined. They complement one another, together constituting a comprehensive pipeline management system.
[`pypiper`](https://pep.databio.org/pypiper/) is a more traditional workflow-building framework; it helps you build pipelines to process individual samples. [`looper`](https://pep.databio.org/looper/) is completely pipeline-agnostic, and has nothing to do with individual processing steps; it operates groups of samples (as in a project), submitting the appropriate pipeline(s) to a cluster or server (or running them locally). The two projects are independent and can be used separately, but they are most powerful when combined. They complement one another, together constituting a comprehensive pipeline management system.

## Why isn't a sample being processed by a pipeline (`Not submitting, flag found: ['*_<status>.flag']`)?

4 changes: 4 additions & 0 deletions docs/looper/grouping-jobs.md
@@ -9,3 +9,7 @@ It's quite simple: if you want to run 100 samples in a single job submission scr
## Lumping jobs by input file size: `--lump`

But what if your samples differ greatly in input file size? For example, your project may include many small samples, which you'd like to lump together 10 to 1, but also a few control samples that are very large and should each get a dedicated job. If you just use `--lumpn` with 10 samples per job, you could end up lumping your large control samples together, which would be terrible. To alleviate this problem, `looper` provides the `--lump` argument, which uses input file size to group samples together. You specify a limit in gigabytes; looper goes through your samples and accumulates them until the total input file size reaches your limit, at which point it finalizes and submits the job. This keeps larger files in independent runs and groups smaller files together.
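The accumulation logic described above can be sketched as follows (an illustration of the idea only, not looper's actual implementation; sample names and sizes are hypothetical):

```python
# Sketch of size-based lumping: accumulate samples until the running
# total of input file sizes reaches the limit, then start a new job.
def lump_by_size(samples, limit_gb):
    jobs, current, total = [], [], 0.0
    for name, size_gb in samples:
        current.append(name)
        total += size_gb
        if total >= limit_gb:   # limit reached: finalize this job
            jobs.append(current)
            current, total = [], 0.0
    if current:                 # submit any leftover partial group
        jobs.append(current)
    return jobs

samples = [("small1", 0.5), ("small2", 0.5), ("control", 40.0), ("small3", 0.5)]
print(lump_by_size(samples, 1.0))
# → [['small1', 'small2'], ['control'], ['small3']]
```

Note how the large control sample lands in its own job while the small samples are grouped.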

## Lumping jobs by number of jobs: `--lump-j`

Alternatively, you can specify the total number of jobs, and looper will lump your samples into that many submission scripts.
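Distributing samples across a fixed number of jobs can be sketched like this (a round-robin illustration of the idea, not necessarily how looper partitions samples internally):

```python
# Sketch: distribute samples round-robin into a fixed number of jobs.
def lump_by_njobs(samples, n_jobs):
    jobs = [[] for _ in range(n_jobs)]
    for i, sample in enumerate(samples):
        jobs[i % n_jobs].append(sample)
    return jobs

print(lump_by_njobs(["s1", "s2", "s3", "s4", "s5"], 2))
# → [['s1', 's3', 's5'], ['s2', 's4']]
```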
14 changes: 7 additions & 7 deletions docs/looper/looper-config.md
@@ -8,7 +8,7 @@ Example looper config file using local PEP:
pep_config: $HOME/hello_looper-master/project/project_config.yaml
output_dir: "$HOME/hello_looper-master/output"
pipeline_interfaces:
sample: ["$HOME/hello_looper-master/pipeline/pipeline_interface"]
sample: "$HOME/hello_looper-master/pipeline/pipeline_interface"
project: "some/project/pipeline"
```
@@ -19,18 +19,18 @@ environment variables used by the PEP.
Example looper config file using PEPhub project:
```yaml
pep_config: pephub::databio/looper:default
pep_config: pepkit/hello_looper:default
output_dir: "$HOME/hello_looper-master/output"
pipeline_interfaces:
sample: ["$HOME/hello_looper-master/pipeline/pipeline_interface"]
project: "$HOME/hello_looper-master/project/pipeline"
sample: "$HOME/hello_looper-master/pipeline/pipeline_interface_sample.yaml"
project: "$HOME/hello_looper-master/pipeline/pipeline_interface_project.yaml"
```
Where:
- `output_dir` is pipeline output directory, where results will be saved.
- `pep_config` is a local config file or PEPhub registry path. (registry path should be specified in one
one of supported ways: `namespace/name`, `pephub::namespace/name`, `namespace/name:tag`, or `pephub::namespace/name:tag`)
- `pep_config` is a local config file or PEPhub registry path. (registry path should be specified in
one of supported ways: `namespace/name`, `namespace/name:tag`)
- `pipeline interfaces` is a local path to project or sample pipelines.

To run pipeline, go to the directory of .looper.config and execute command in your terminal:
`looper run --looper-config {looper_config_path}` or `looper runp --looper-config {looper_config_path}`.
`looper run --looper-config {looper_config_path}` or `looper runp --looper-config {looper_config_path}` (project-level pipeline).
15 changes: 14 additions & 1 deletion docs/looper/multiple-pipelines.md
@@ -1,6 +1,19 @@
# A project with multiple pipelines

In earlier versions of looper (v < 1.0), we used a `protocol_mappings` section to map samples with different `protocol` attributes to different pipelines. In the current pipeline interface (looper v > 1.0), we eliminated the `protocol_mappings`, because this can now be handled using sample modifiers, simplifying the pipeline interface. Now, each pipeline has exactly 1 pipeline interface. You link to the pipeline interface with a sample attribute. If you want the same pipeline to run on all samples, it's as easy as using an `append` modifier like this:
In earlier versions of looper (v < 1.0), we used a `protocol_mappings` section to map samples with different `protocol` attributes to different pipelines. In the current pipeline interface (looper v > 1.0), we eliminated the `protocol_mappings`, because this can now be handled using sample modifiers, simplifying the pipeline interface.
Now, each pipeline has exactly 1 pipeline interface.

The preferred method is to specify pipeline interfaces in the looper config file:

```yaml
pep_config: pephub::databio/looper:default
output_dir: "$HOME/hello_looper-master/output"
pipeline_interfaces:
  sample: "$HOME/hello_looper-master/pipeline/pipeline_interface"
  project: "$HOME/hello_looper-master/project/pipeline"
```
However, you can also link to the pipeline interface with a sample attribute. If you want the same pipeline to run on all samples, it's as easy as using an `append` modifier like this:

```
sample_modifiers:
```
