Merge pull request #274 from pepkit/dev
1.2.1
stolarczyk authored Aug 26, 2020
2 parents ba5e323 + 75e83a5 commit ac0645a
Showing 20 changed files with 258 additions and 91 deletions.
31 changes: 31 additions & 0 deletions .github/workflows/python-publish.yml
@@ -0,0 +1,31 @@
# This workflow will upload a Python package using Twine when a release is created
# For more information see: https://help.github.com/en/actions/language-and-framework-guides/using-python-with-github-actions#publishing-to-package-registries

name: Upload Python Package

on:
  release:
    types: [created]

jobs:
  deploy:

    runs-on: ubuntu-latest

    steps:
    - uses: actions/checkout@v2
    - name: Set up Python
      uses: actions/setup-python@v2
      with:
        python-version: '3.x'
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install setuptools wheel twine
    - name: Build and publish
      env:
        TWINE_USERNAME: ${{ secrets.PYPI_USERNAME }}
        TWINE_PASSWORD: ${{ secrets.PYPI_PASSWORD }}
      run: |
        python setup.py sdist bdist_wheel
        twine upload dist/*
41 changes: 41 additions & 0 deletions .github/workflows/run-pytest.yml
@@ -0,0 +1,41 @@
name: Run pytests

on:
  push:
    branches: [master, dev]
  pull_request:
    branches: [master, dev]

jobs:
  pytest:
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        python-version: [3.6, 3.7, 3.8]
        os: [ubuntu-latest, macos-latest]

    steps:
    - uses: actions/checkout@v2

    - name: Set up Python ${{ matrix.python-version }}
      uses: actions/setup-python@v2
      with:
        python-version: ${{ matrix.python-version }}

    - name: Install dev dependencies
      run: if [ -f requirements/requirements-dev.txt ]; then pip install -r requirements/requirements-dev.txt; fi

    - name: Install test dependencies
      run: if [ -f requirements/requirements-test.txt ]; then pip install -r requirements/requirements-test.txt; fi

    - name: Install package
      run: python -m pip install .

    - name: Run pytest tests
      run: pytest tests --remote-data --cov=./ --cov-report=xml

    - name: Upload coverage to Codecov
      uses: codecov/codecov-action@v1
      with:
        file: ./coverage.xml
        name: py-${{ matrix.python-version }}-${{ matrix.os }}
21 changes: 0 additions & 21 deletions .travis.yml

This file was deleted.

10 changes: 10 additions & 0 deletions docs/changelog.md
@@ -2,6 +2,16 @@

This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html) and [Keep a Changelog](https://keepachangelog.com/en/1.0.0/) format.

## [1.2.1] - 2020-08-26

### Added
- Environment variable expansion in custom sample YAML paths; [Issue 273](https://github.com/pepkit/looper/issues/273)
- `dynamic_variables_script_path` key in the pipeline interface: a path, absolute or relative to the pipeline interface file; [Issue 276](https://github.com/pepkit/looper/issues/276)
### Changed
- Resolve the project pipeline interface path relative to the config file, not the current directory; [Issue 268](https://github.com/pepkit/looper/issues/268)
### Fixed
- Unclear error when `output_dir` was not provided in the config `looper` section; [Issue 286](https://github.com/pepkit/looper/issues/286)

## [1.2.0] - 2020-05-26

**This version introduced backwards-incompatible changes.**
8 changes: 6 additions & 2 deletions docs/concentric-templates.md
@@ -16,7 +16,7 @@

## The submission template

To extend to submitting the commands to a cluster, it may be tempting to add these details directly to the command template, which cause the jobs to be submitted to SLURM instead of run directly. However, this would restrict the pipeline to *only* running via SLURM, since the submission code would be tightly coupled to the command code. Instead, looper retains flexibility by introducing a second template layer, the *submission template*. The submission template is specified at the level of the computing environment. A submission template can also be as simple or complex as required. For a command to be run in a local computing environment, a basic template will suffice:
To extend this to submitting commands to a cluster, we simply need to add some more information around the command above, specifying things like memory use, job name, *etc.* It may be tempting to add these details directly to the command template, causing the jobs to be submitted to SLURM instead of run directly. This *would* work; however, it would restrict the pipeline to *only* running via SLURM, since the submission code would be tightly coupled to the command code. Instead, looper retains flexibility by introducing a second template layer, the *submission template*. While the *command template* is specified by the pipeline interface, the *submission template* is specified at the level of the computing environment. A submission template can also be as simple or complex as required. For a command to be run in a local computing environment, a basic template will suffice:

```console
#! /usr/bin/bash
@@ -39,6 +39,8 @@
echo 'Start time:' `date +'%Y-%m-%d %T'`
srun {CODE}
```

In these templates, the `{CODE}` variable is populated with the rendered result of the command template -- that's what makes these templates concentric.
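
As a rough illustration of that flow, here is a minimal Python sketch of two-layer template population; the template strings and variable names are invented for the example and are not looper's internals:

```python
# Minimal sketch of concentric (two-layer) template population.
# Template strings and variable names are illustrative, not looper's internals.
command_template = "pipeline.py --input {sample_input} --genome {genome}"
submission_template = """#!/bin/bash
#SBATCH --job-name={JOBNAME}
#SBATCH --mem={MEM}
srun {CODE}
"""

# 1. Populate the inner command template with sample-level variables.
code = command_template.format(sample_input="sample1.fastq.gz", genome="hg38")

# 2. Feed the populated command into the outer submission template as {CODE}.
script = submission_template.format(JOBNAME="sample1_pipeline", MEM="8G", CODE=code)
print(script)
```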

## The advantages of concentric templates

Looper first populates the command template, and then provides the output as a variable that is used to populate the `{CODE}` variable in the submission template. This decoupling provides substantial advantages:
@@ -49,7 +51,9 @@
4. We can [group multiple individual commands](grouping-jobs.md) into a single submission script.
5. The submission template is universal and can be handled by dedicated submission template software.

In fact, looper uses [divvy](http://divvy.databio.org) to handle submission templates. The divvy submission templates can be used for interactive submission of jobs, or used by other software.
## Looper and divvy

The last point, that the submission template is universal, is exactly how looper works: looper uses [divvy](http://divvy.databio.org) to handle submission templates. Besides being useful for looper, this means the divvy submission templates can be used for interactive submission of jobs, or used by other software. It also means that to configure looper for your computing environment, you just have to configure divvy.

## Populating templates

56 changes: 53 additions & 3 deletions docs/how-to-merge-inputs.md
@@ -1,10 +1,60 @@
# How to handle multiple input files

*Dealing with multiple input files is described in detail in the [PEP documentation](http://pep.databio.org/en/latest/specification/#project-attribute-subsample_table).*

Briefly:

Sometimes you have multiple input files that you want to merge for one sample. For example, a common use case is a single library that was spread across multiple sequencing lanes, yielding multiple input files that need to be merged, and then run through the pipeline as one. Rather than putting multiple lines in your sample annotation sheet, which causes conceptual and analytical challenges, PEP has two ways to merge these:

1. Use shell expansion characters (like `*` or `[]`) in your file path definitions (good for simple merges)
2. Specify a *sample subannotation table* which maps input files to samples for samples with more than one input file (infinitely customizable for more complicated merges).
2. Specify *sample subannotation tables*, which map input files to samples for samples with more than one input file (infinitely customizable for more complicated merges).


## Multi-value sample attribute behavior in pipeline interface command templates

Both sample subannotation tables and shell expansion characters lead to sample attributes with multiple values, stored as a list of strings (`multi_attr1` and `multi_attr2`), as opposed to the standard scenario, where a single value is stored as a string (`single_attr`):

```
Sample
sample_name: sample1
subsample_name: ['0', '1', '2']
multi_attr1: ['one', 'two', 'three']
multi_attr2: ['four', 'five', 'six']
single_attr: test_val
```

### Access individual elements in lists

A pipeline interface author can leverage this fact and access the individual elements, e.g., iterate over them and append each to a string using Jinja2 syntax:

```bash
pipeline_name: test_iter
pipeline_type: sample
command_template: >
  --input-iter {%- for x in sample.multi_attr1 -%} --test-individual {x} {% endfor %} # iterate over multiple values
  --input-single {sample.single_attr} # use the single value as is

```
This results in a submission script that includes the following command:
```bash
--input-iter --test-individual one --test-individual two --test-individual three
--input-single test_val
```
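
To get a feel for how such an iterating template renders, here is a sketch using the `jinja2` package directly with standard double-brace variable syntax; looper handles its single-brace variables itself, so this approximates the mechanism rather than reproducing looper's exact rendering path:

```python
from jinja2 import Template

# Plain-Jinja2 approximation of the iterating command template above.
tpl = Template(
    "--input-iter "
    "{%- for x in sample.multi_attr1 %} --test-individual {{ x }}{% endfor %} "
    "--input-single {{ sample.single_attr }}"
)

class Sample:
    """Stand-in for a sample with one multi-value and one single-value attribute."""
    multi_attr1 = ["one", "two", "three"]
    single_attr = "test_val"

print(tpl.render(sample=Sample()))
# --input-iter --test-individual one --test-individual two --test-individual three --input-single test_val
```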
### Concatenate elements in lists
The most common use case is simply concatenating the multiple values and separating them with spaces -- **providing multiple input values to a single argument on the command line**. Therefore, all multi-value sample attributes that have not been processed with Jinja2 logic are automatically concatenated. For instance, the following command template in a pipeline interface will result in the submission script presented below:
Dealing with multiple input files is described in detail in the [PEP documentation](https://pepkit.github.io/docs/sample_subannotation/).
Pipeline interface:
```bash
pipeline_name: test_concat
pipeline_type: sample
command_template: >
  --input-concat {sample.multi_attr1} # concatenate all the values
```
Note: to handle different *classes* of input files, like read1 and read2, these are *not* merged and should be handled as different derived columns in the main sample annotation sheet (and therefore different arguments to the pipeline).
Command in the submission script:
```bash
--input-concat one two three
```
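
The concatenation fallback can be pictured with a small stand-in function; this illustrates the behavior described above and is not looper's implementation:

```python
def render_attr(value):
    """Join list-valued sample attributes with spaces; pass strings through."""
    return " ".join(value) if isinstance(value, (list, tuple)) else str(value)

sample = {"multi_attr1": ["one", "two", "three"], "single_attr": "test_val"}
print("--input-concat " + render_attr(sample["multi_attr1"]))  # --input-concat one two three
print("--input-single " + render_attr(sample["single_attr"]))  # --input-single test_val
```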
22 changes: 22 additions & 0 deletions docs/multiple-pipelines.md
@@ -0,0 +1,22 @@
# A project with multiple pipelines

In earlier versions of looper (v < 1.0), we used a `protocol_mappings` section to map samples with different `protocol` attributes to different pipelines. In the current pipeline interface (looper v > 1.0), we eliminated the `protocol_mappings`, because this can now be handled using sample modifiers, simplifying the pipeline interface. Now, each pipeline has exactly 1 pipeline interface. You link to the pipeline interface with a sample attribute. If you want the same pipeline to run on all samples, it's as easy as using an `append` modifier like this:

```
sample_modifiers:
  append:
    pipeline_interfaces: "test.yaml"
```

But if you want to submit different samples to different pipelines, depending on a sample attribute, like `protocol`, you can use an implied attribute:

```
sample_modifiers:
  imply:
    - if:
        protocol: [PRO-seq, pro-seq, GRO-seq, gro-seq] # OR
      then:
        pipeline_interfaces: ["peppro.yaml"]
```

This approach uses only PEP functionality to handle the connection to pipelines as sample attributes, which provides full control and power via the familiar sample modifiers. It eliminates the need to re-invent this complexity within looper, which is why the protocol mapping section was dropped to simplify the looper pipeline interface files. You can read more about the rationale of this change in [issue 244](https://github.com/pepkit/looper/issues/244#issuecomment-611154594).
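
For illustration only, the effect of that `imply` block can be sketched in a few lines of Python; the function and constant names are hypothetical and this is not peppy's API:

```python
# Hypothetical sketch of the imply modifier above, not peppy's API.
PRO_LIKE = {"PRO-seq", "pro-seq", "GRO-seq", "gro-seq"}

def imply_pipeline_interfaces(sample: dict) -> dict:
    """Return a copy of the sample with pipeline_interfaces implied from protocol."""
    implied = dict(sample)
    if implied.get("protocol") in PRO_LIKE:  # the "if" condition (an OR over the listed values)
        implied["pipeline_interfaces"] = ["peppro.yaml"]  # the "then" assignment
    return implied

print(imply_pipeline_interfaces({"sample_name": "s1", "protocol": "PRO-seq"}))
# {'sample_name': 's1', 'protocol': 'PRO-seq', 'pipeline_interfaces': ['peppro.yaml']}
```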
13 changes: 10 additions & 3 deletions docs/pipeline-interface-specification.md
@@ -199,13 +199,20 @@ compute:

### sample_yaml_path

Looper produces a yaml file that represents the sample. By default the file is saved in submission directory in `{sample.sample_name}.yaml`. You can override the default by specifying a `sample_yaml_path` attribute in the pipeline interface:
Looper produces a yaml file that represents the sample. By default the file is saved in the submission directory as `{sample.sample_name}.yaml`. You can override the default by specifying a `sample_yaml_path` attribute in the pipeline interface. This attribute, like the `command_template`, has access to any of the looper namespaces, in case you want to use them in the names of your sample yaml files.
The result of the rendered template is considered relative to the `looper.output_dir` path, unless it is an absolute path. For example, to save the file in the output directory under a custom name, use:

```
sample_yaml_path: {sample.sample_name}.yaml
sample_yaml_path: {sample.genome}_sample.yaml
```

This attribute, like the `command_template`, has access to any of the looper namespaces, in case you want to use them in the names of your sample yaml files.
To save the file elsewhere, specify an absolute path:

```
sample_yaml_path: $HOME/results/{sample.genome}_sample.yaml
```
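
A minimal sketch of that resolution order (render the template, expand environment variables and `~`, then treat non-absolute results as relative to `looper.output_dir`); it approximates what looper does via `ubiquerg.expandpath`, and the helper name here is hypothetical:

```python
import os

def resolve_sample_yaml_path(rendered: str, output_dir: str) -> str:
    """Expand env vars and '~', then anchor relative paths at looper's output_dir."""
    expanded = os.path.expanduser(os.path.expandvars(rendered))
    return expanded if os.path.isabs(expanded) else os.path.join(output_dir, expanded)

print(resolve_sample_yaml_path("$HOME/results/hg38_sample.yaml", "/data/out"))
# e.g. /home/<user>/results/hg38_sample.yaml (depends on $HOME)
print(resolve_sample_yaml_path("hg38_sample.yaml", "/data/out"))
# /data/out/hg38_sample.yaml
```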



## Validating a pipeline interface

2 changes: 1 addition & 1 deletion docs/variable-namespaces.md
@@ -2,7 +2,7 @@

## Populating the templates

Loper creates job scripts using [concentric templates](concentric-templates.md) consisting of a *command template* and a *submission template*. This layered design allows us to decouple the computing environment from the pipeline, which improves portability. The task of running jobs can be thought of as simply populating the templates with variables. To do this, Looper pools variables from several sources:
Looper creates job scripts using [concentric templates](concentric-templates.md) consisting of a *command template* and a *submission template*. This layered design allows us to decouple the computing environment from the pipeline, which improves portability. The task of running jobs can be thought of as simply populating the templates with variables. These variables are pooled from several sources:

1. the command line, where the user provides any on-the-fly variables for a particular run.
2. the PEP, which provides information on the project and samples.
2 changes: 1 addition & 1 deletion looper/_version.py
@@ -1 +1 @@
__version__ = "1.2.0"
__version__ = "1.2.1"
10 changes: 6 additions & 4 deletions looper/conductor.py
@@ -8,6 +8,7 @@

from attmap import AttMap
from eido import read_schema, validate_inputs
from ubiquerg import expandpath
from peppy.const import CONFIG_KEY, SAMPLE_YAML_EXT, SAMPLE_NAME_ATTR

from .processed_project import populate_sample_paths
@@ -167,15 +168,15 @@ def add_sample(self, sample, rerun=False):
        else:
            use_this_sample = False
        if not use_this_sample:
            msg = "> Skipping sample"
            msg = "> Skipping sample because no failed flag found"
            if flag_files:
                msg += ". Flags found: {}".format(flag_files)
            _LOGGER.info(msg)

        if self.prj.toggle_key in sample \
                and int(sample[self.prj.toggle_key]) == 0:
            _LOGGER.warning(
                "> Skiping sample ({}: {})".
                "> Skipping sample ({}: {})".
                format(self.prj.toggle_key, sample[self.prj.toggle_key])
            )
            use_this_sample = False
@@ -288,7 +289,7 @@ def _get_sample_yaml_path(self, sample):
namespaces = {"sample": sample,
"project": self.prj.prj[CONFIG_KEY],
"pipeline": self.pl_iface}
path = jinja_render_cmd_strictly(pth_templ, namespaces)
path = expandpath(jinja_render_cmd_strictly(pth_templ, namespaces))
return path if os.path.isabs(path) \
else os.path.join(self.prj.output_dir, path)

@@ -424,7 +425,8 @@ def write_script(self, pool, size):
        if self.collate:
            _LOGGER.debug("samples namespace:\n{}".format(self.prj.samples))
        else:
            _LOGGER.debug("sample namespace:\n{}".format(sample))
            _LOGGER.debug("sample namespace:\n{}".format(
                sample.__str__(max_attr=len(list(sample.keys())))))
        _LOGGER.debug("project namespace:\n{}".format(self.prj[CONFIG_KEY]))
        _LOGGER.debug("pipeline namespace:\n{}".format(self.pl_iface))
        _LOGGER.debug("compute namespace:\n{}".format(self.prj.dcc.compute))
11 changes: 1 addition & 10 deletions looper/looper.py
@@ -289,16 +289,7 @@ def __call__(self, args, rerun=False, **compute_kwargs):
        failures = defaultdict(list) # Collect problems by sample.
        processed_samples = set() # Enforce one-time processing.
        submission_conductors = {}
        try:
            comp_vars = self.prj.dcc[COMPUTE_KEY].to_map()
        except AttributeError:
            if not isinstance(self.prj.dcc[COMPUTE_KEY], Mapping):
                raise TypeError("Project's computing config isn't a mapping: {}"
                                " ({})".format(self.prj.dcc[COMPUTE_KEY],
                                               type(self.prj.dcc[COMPUTE_KEY])))
            from copy import deepcopy
            comp_vars = deepcopy(self.prj.dcc[COMPUTE_KEY])
        comp_vars.update(compute_kwargs or {})
        comp_vars = compute_kwargs or {}

        # Determine number of samples eligible for processing.
        num_samples = len(self.prj.samples)