Merge pull request #274 from pepkit/dev
1.2.1
stolarczyk authored Aug 26, 2020
2 parents ba5e323 + 75e83a5 commit ac0645a
Showing 20 changed files with 258 additions and 91 deletions.
31 changes: 31 additions & 0 deletions .github/workflows/python-publish.yml
@@ -0,0 +1,31 @@
# This workflow will upload a Python package using Twine when a release is created
# For more information see: https://help.github.com/en/actions/language-and-framework-guides/using-python-with-github-actions#publishing-to-package-registries

name: Upload Python Package

on:
  release:
    types: [created]

jobs:
  deploy:

    runs-on: ubuntu-latest

    steps:
    - uses: actions/checkout@v2
    - name: Set up Python
      uses: actions/setup-python@v2
      with:
        python-version: '3.x'
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install setuptools wheel twine
    - name: Build and publish
      env:
        TWINE_USERNAME: ${{ secrets.PYPI_USERNAME }}
        TWINE_PASSWORD: ${{ secrets.PYPI_PASSWORD }}
      run: |
        python setup.py sdist bdist_wheel
        twine upload dist/*
41 changes: 41 additions & 0 deletions .github/workflows/run-pytest.yml
@@ -0,0 +1,41 @@
name: Run pytests

on:
  push:
    branches: [master, dev]
  pull_request:
    branches: [master, dev]

jobs:
  pytest:
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        python-version: [3.6, 3.7, 3.8]
        os: [ubuntu-latest, macos-latest]

    steps:
    - uses: actions/checkout@v2

    - name: Set up Python ${{ matrix.python-version }}
      uses: actions/setup-python@v2
      with:
        python-version: ${{ matrix.python-version }}

    - name: Install dev dependencies
      run: if [ -f requirements/requirements-dev.txt ]; then pip install -r requirements/requirements-dev.txt; fi

    - name: Install test dependencies
      run: if [ -f requirements/requirements-test.txt ]; then pip install -r requirements/requirements-test.txt; fi

    - name: Install package
      run: python -m pip install .

    - name: Run pytest tests
      run: pytest tests --remote-data --cov=./ --cov-report=xml

    - name: Upload coverage to Codecov
      uses: codecov/codecov-action@v1
      with:
        file: ./coverage.xml
        name: py-${{ matrix.python-version }}-${{ matrix.os }}
21 changes: 0 additions & 21 deletions .travis.yml

This file was deleted.

10 changes: 10 additions & 0 deletions docs/changelog.md
@@ -2,6 +2,16 @@

This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html) and [Keep a Changelog](https://keepachangelog.com/en/1.0.0/) format.

## [1.2.1] - 2020-08-26

### Added
- Environment variable expansion in custom sample YAML paths; [Issue 273](https://github.com/pepkit/looper/issues/273)
- `dynamic_variables_script_path` key in the pipeline interface: a path, absolute or relative to the pipeline interface file; [Issue 276](https://github.com/pepkit/looper/issues/276)
### Changed
- Resolve the project pipeline interface path relative to the config file, not the current directory; [Issue 268](https://github.com/pepkit/looper/issues/268)
### Fixed
- Unclear error when `output_dir` was not provided in the config `looper` section; [Issue 286](https://github.com/pepkit/looper/issues/286)

## [1.2.0] - 2020-05-26

**This version introduced backwards-incompatible changes.**
8 changes: 6 additions & 2 deletions docs/concentric-templates.md
@@ -16,7 +16,7 @@

## The submission template

To extend to submitting the commands to a cluster, it may be tempting to add these details directly to the command template, which cause the jobs to be submitted to SLURM instead of run directly. However, this would restrict the pipeline to *only* running via SLURM, since the submission code would be tightly coupled to the command code. Instead, looper retains flexibility by introducing a second template layer, the *submission template*. The submission template is specified at the level of the computing environment. A submission template can also be as simple or complex as required. For a command to be run in a local computing environment, a basic template will suffice:
To extend this to submitting commands to a cluster, we simply need to add some more information around the command above, specifying things like memory use, job name, *etc.* It may be tempting to add these details directly to the command template, causing the jobs to be submitted to SLURM instead of run directly. This *would* work; however, it would restrict the pipeline to *only* running via SLURM, since the submission code would be tightly coupled to the command code. Instead, looper retains flexibility by introducing a second template layer, the *submission template*. While the *command template* is specified by the pipeline interface, the *submission template* is specified at the level of the computing environment. A submission template can also be as simple or complex as required. For a command to be run in a local computing environment, a basic template will suffice:

```console
#! /usr/bin/bash
@@ -39,6 +39,8 @@
echo 'Start time:' `date +'%Y-%m-%d %T'`
srun {CODE}
```

In these templates, the `{CODE}` variable is populated with the rendered result of the command template -- that's what makes these templates concentric.
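
As a rough illustration of that flow, here is a minimal Python sketch of two-layer template population; the template strings and variable names are invented for the example and are not looper's internals:

```python
# Minimal sketch of concentric (two-layer) template population.
# Template strings and variable names are illustrative, not looper's internals.
command_template = "pipeline.py --input {sample_input} --genome {genome}"
submission_template = """#!/bin/bash
#SBATCH --job-name={JOBNAME}
#SBATCH --mem={MEM}
srun {CODE}
"""

# 1. Populate the inner command template with sample-level variables.
code = command_template.format(sample_input="sample1.fastq.gz", genome="hg38")

# 2. Feed the populated command into the outer submission template as {CODE}.
script = submission_template.format(JOBNAME="sample1_pipeline", MEM="8G", CODE=code)
print(script)
```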

## The advantages of concentric templates

Looper first populates the command template, and then provides the output as a variable that is used to populate the `{CODE}` variable in the submission template. This decoupling provides substantial advantages:
@@ -49,7 +51,9 @@
4. We can [group multiple individual commands](grouping-jobs.md) into a single submission script.
5. The submission template is universal and can be handled by dedicated submission template software.

In fact, looper uses [divvy](http://divvy.databio.org) to handle submission templates. The divvy submission templates can be used for interactive submission of jobs, or used by other software.
## Looper and divvy

The last point, that the submission template is universal, is exactly how looper works: looper uses [divvy](http://divvy.databio.org) to handle submission templates. Besides being useful for looper, this means the divvy submission templates can be used for interactive submission of jobs, or used by other software. It also means that to configure looper for your computing environment, you just have to configure divvy.

## Populating templates

56 changes: 53 additions & 3 deletions docs/how-to-merge-inputs.md
@@ -1,10 +1,60 @@
# How to handle multiple input files

*Dealing with multiple input files is described in detail in the [PEP documentation](http://pep.databio.org/en/latest/specification/#project-attribute-subsample_table).*

Briefly:

Sometimes you have multiple input files that you want to merge for one sample. For example, a common use case is a single library that was spread across multiple sequencing lanes, yielding multiple input files that need to be merged, and then run through the pipeline as one. Rather than putting multiple lines in your sample annotation sheet, which causes conceptual and analytical challenges, PEP has two ways to merge these:

1. Use shell expansion characters (like `*` or `[]`) in your file path definitions (good for simple merges)
2. Specify a *sample subannotation table* which maps input files to samples for samples with more than one input file (infinitely customizable for more complicated merges).
2. Specify *sample subannotation tables*, which map input files to samples for samples with more than one input file (infinitely customizable for more complicated merges).


## Multi-value sample attribute behavior in pipeline interface command templates

Both sample subannotation tables and shell expansion characters lead to sample attributes with multiple values, stored as a list of strings (`multi_attr1` and `multi_attr2`), as opposed to the standard scenario, where a single value is stored as a string (`single_attr`):

```
Sample
sample_name: sample1
subsample_name: ['0', '1', '2']
multi_attr1: ['one', 'two', 'three']
multi_attr2: ['four', 'five', 'six']
single_attr: test_val
```

### Access individual elements in lists

A pipeline interface author can leverage this fact and access the individual elements, e.g., iterate over them and append each to a string using Jinja2 syntax:

```bash
pipeline_name: test_iter
pipeline_type: sample
command_template: >
  --input-iter {%- for x in sample.multi_attr1 -%} --test-individual {x} {% endfor %} # iterate over multiple values
  --input-single {sample.single_attr} # use the single value as is

```
This results in a submission script that includes the following command:
```bash
--input-iter --test-individual one --test-individual two --test-individual three
--input-single test_val
```
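
To get a feel for how such an iterating template renders, here is a sketch using the `jinja2` package directly with standard double-brace variable syntax; looper handles its single-brace variables itself, so this approximates the mechanism rather than reproducing looper's exact rendering path:

```python
from jinja2 import Template

# Plain-Jinja2 approximation of the iterating command template above.
tpl = Template(
    "--input-iter "
    "{%- for x in sample.multi_attr1 %} --test-individual {{ x }}{% endfor %} "
    "--input-single {{ sample.single_attr }}"
)

class Sample:
    """Stand-in for a sample with one multi-value and one single-value attribute."""
    multi_attr1 = ["one", "two", "three"]
    single_attr = "test_val"

print(tpl.render(sample=Sample()))
# --input-iter --test-individual one --test-individual two --test-individual three --input-single test_val
```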
### Concatenate elements in lists
The most common use case is simply concatenating the multiple values and separating them with spaces -- **providing multiple input values to a single argument on the command line**. Therefore, all multi-value sample attributes that have not been processed with Jinja2 logic are automatically concatenated. For instance, the following command template in a pipeline interface will result in the submission script presented below:
Dealing with multiple input files is described in detail in the [PEP documentation](https://pepkit.github.io/docs/sample_subannotation/).
Pipeline interface:
```bash
pipeline_name: test_concat
pipeline_type: sample
command_template: >
  --input-concat {sample.multi_attr1} # concatenate all the values
```
Note: to handle different *classes* of input files, like read1 and read2, these are *not* merged and should be handled as different derived columns in the main sample annotation sheet (and therefore different arguments to the pipeline).
Command in the submission script:
```bash
--input-concat one two three
```
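
The concatenation fallback can be pictured with a small stand-in function; this illustrates the behavior described above and is not looper's implementation:

```python
def render_attr(value):
    """Join list-valued sample attributes with spaces; pass strings through."""
    return " ".join(value) if isinstance(value, (list, tuple)) else str(value)

sample = {"multi_attr1": ["one", "two", "three"], "single_attr": "test_val"}
print("--input-concat " + render_attr(sample["multi_attr1"]))  # --input-concat one two three
print("--input-single " + render_attr(sample["single_attr"]))  # --input-single test_val
```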
22 changes: 22 additions & 0 deletions docs/multiple-pipelines.md
@@ -0,0 +1,22 @@
# A project with multiple pipelines

In earlier versions of looper (v < 1.0), we used a `protocol_mappings` section to map samples with different `protocol` attributes to different pipelines. In the current pipeline interface (looper v > 1.0), we eliminated the `protocol_mappings`, because this can now be handled using sample modifiers, simplifying the pipeline interface. Now, each pipeline has exactly 1 pipeline interface. You link to the pipeline interface with a sample attribute. If you want the same pipeline to run on all samples, it's as easy as using an `append` modifier like this:

```
sample_modifiers:
  append:
    pipeline_interfaces: "test.yaml"
```

But if you want to submit different samples to different pipelines, depending on a sample attribute, like `protocol`, you can use an implied attribute:

```
sample_modifiers:
  imply:
    - if:
        protocol: [PRO-seq, pro-seq, GRO-seq, gro-seq] # OR
      then:
        pipeline_interfaces: ["peppro.yaml"]
```

This approach uses only PEP functionality to handle the connection to pipelines as sample attributes, which provides full control and power via the familiar sample modifiers. It eliminates the need to re-invent this complexity within looper, which is why the protocol mapping section was dropped to simplify the looper pipeline interface files. You can read more about the rationale of this change in [issue 244](https://github.com/pepkit/looper/issues/244#issuecomment-611154594).
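
For illustration only, the effect of that `imply` block can be sketched in a few lines of Python; the function and constant names are hypothetical and this is not peppy's API:

```python
# Hypothetical sketch of the imply modifier above, not peppy's API.
PRO_LIKE = {"PRO-seq", "pro-seq", "GRO-seq", "gro-seq"}

def imply_pipeline_interfaces(sample: dict) -> dict:
    """Return a copy of the sample with pipeline_interfaces implied from protocol."""
    implied = dict(sample)
    if implied.get("protocol") in PRO_LIKE:  # the "if" condition (an OR over the listed values)
        implied["pipeline_interfaces"] = ["peppro.yaml"]  # the "then" assignment
    return implied

print(imply_pipeline_interfaces({"sample_name": "s1", "protocol": "PRO-seq"}))
# {'sample_name': 's1', 'protocol': 'PRO-seq', 'pipeline_interfaces': ['peppro.yaml']}
```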
13 changes: 10 additions & 3 deletions docs/pipeline-interface-specification.md
@@ -199,13 +199,20 @@ compute:

### sample_yaml_path

Looper produces a yaml file that represents the sample. By default the file is saved in submission directory in `{sample.sample_name}.yaml`. You can override the default by specifying a `sample_yaml_path` attribute in the pipeline interface:
Looper produces a yaml file that represents the sample. By default the file is saved in the submission directory as `{sample.sample_name}.yaml`. You can override the default by specifying a `sample_yaml_path` attribute in the pipeline interface. This attribute, like the `command_template`, has access to any of the looper namespaces, in case you want to use them in the names of your sample yaml files.
The result of the rendered template is considered relative to the `looper.output_dir` path, unless it is an absolute path. For example, to save the file in the output directory under a custom name, use:

```
sample_yaml_path: {sample.sample_name}.yaml
sample_yaml_path: {sample.genome}_sample.yaml
```

This attribute, like the `command_template`, has access to any of the looper namespaces, in case you want to use them in the names of your sample yaml files.
To save the file elsewhere, specify an absolute path:

```
sample_yaml_path: $HOME/results/{sample.genome}_sample.yaml
```
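
A minimal sketch of that resolution order (render the template, expand environment variables and `~`, then treat non-absolute results as relative to `looper.output_dir`); it approximates what looper does via `ubiquerg.expandpath`, and the helper name here is hypothetical:

```python
import os

def resolve_sample_yaml_path(rendered: str, output_dir: str) -> str:
    """Expand env vars and '~', then anchor relative paths at looper's output_dir."""
    expanded = os.path.expanduser(os.path.expandvars(rendered))
    return expanded if os.path.isabs(expanded) else os.path.join(output_dir, expanded)

print(resolve_sample_yaml_path("$HOME/results/hg38_sample.yaml", "/data/out"))
# e.g. /home/<user>/results/hg38_sample.yaml (depends on $HOME)
print(resolve_sample_yaml_path("hg38_sample.yaml", "/data/out"))
# /data/out/hg38_sample.yaml
```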



## Validating a pipeline interface

2 changes: 1 addition & 1 deletion docs/variable-namespaces.md
@@ -2,7 +2,7 @@

## Populating the templates

Loper creates job scripts using [concentric templates](concentric-templates.md) consisting of a *command template* and a *submission template*. This layered design allows us to decouple the computing environment from the pipeline, which improves portability. The task of running jobs can be thought of as simply populating the templates with variables. To do this, Looper pools variables from several sources:
Looper creates job scripts using [concentric templates](concentric-templates.md) consisting of a *command template* and a *submission template*. This layered design allows us to decouple the computing environment from the pipeline, which improves portability. The task of running jobs can be thought of as simply populating the templates with variables. These variables are pooled from several sources:

1. the command line, where the user provides any on-the-fly variables for a particular run.
2. the PEP, which provides information on the project and samples.
2 changes: 1 addition & 1 deletion looper/_version.py
@@ -1 +1 @@
__version__ = "1.2.0"
__version__ = "1.2.1"
10 changes: 6 additions & 4 deletions looper/conductor.py
@@ -8,6 +8,7 @@

from attmap import AttMap
from eido import read_schema, validate_inputs
from ubiquerg import expandpath
from peppy.const import CONFIG_KEY, SAMPLE_YAML_EXT, SAMPLE_NAME_ATTR

from .processed_project import populate_sample_paths
@@ -167,15 +168,15 @@ def add_sample(self, sample, rerun=False):
        else:
            use_this_sample = False
        if not use_this_sample:
            msg = "> Skipping sample"
            msg = "> Skipping sample because no failed flag found"
            if flag_files:
                msg += ". Flags found: {}".format(flag_files)
            _LOGGER.info(msg)

        if self.prj.toggle_key in sample \
                and int(sample[self.prj.toggle_key]) == 0:
            _LOGGER.warning(
                "> Skiping sample ({}: {})".
                "> Skipping sample ({}: {})".
                format(self.prj.toggle_key, sample[self.prj.toggle_key])
            )
            use_this_sample = False
@@ -288,7 +289,7 @@ def _get_sample_yaml_path(self, sample):
namespaces = {"sample": sample,
"project": self.prj.prj[CONFIG_KEY],
"pipeline": self.pl_iface}
path = jinja_render_cmd_strictly(pth_templ, namespaces)
path = expandpath(jinja_render_cmd_strictly(pth_templ, namespaces))
return path if os.path.isabs(path) \
else os.path.join(self.prj.output_dir, path)

@@ -424,7 +425,8 @@ def write_script(self, pool, size):
        if self.collate:
            _LOGGER.debug("samples namespace:\n{}".format(self.prj.samples))
        else:
            _LOGGER.debug("sample namespace:\n{}".format(sample))
            _LOGGER.debug("sample namespace:\n{}".format(
                sample.__str__(max_attr=len(list(sample.keys())))))
        _LOGGER.debug("project namespace:\n{}".format(self.prj[CONFIG_KEY]))
        _LOGGER.debug("pipeline namespace:\n{}".format(self.pl_iface))
        _LOGGER.debug("compute namespace:\n{}".format(self.prj.dcc.compute))
11 changes: 1 addition & 10 deletions looper/looper.py
@@ -289,16 +289,7 @@ def __call__(self, args, rerun=False, **compute_kwargs):
        failures = defaultdict(list) # Collect problems by sample.
        processed_samples = set() # Enforce one-time processing.
        submission_conductors = {}
        try:
            comp_vars = self.prj.dcc[COMPUTE_KEY].to_map()
        except AttributeError:
            if not isinstance(self.prj.dcc[COMPUTE_KEY], Mapping):
                raise TypeError("Project's computing config isn't a mapping: {}"
                                " ({})".format(self.prj.dcc[COMPUTE_KEY],
                                               type(self.prj.dcc[COMPUTE_KEY])))
            from copy import deepcopy
            comp_vars = deepcopy(self.prj.dcc[COMPUTE_KEY])
        comp_vars.update(compute_kwargs or {})
        comp_vars = compute_kwargs or {}

        # Determine number of samples eligible for processing.
        num_samples = len(self.prj.samples)