Week 3 (15 June to 22 June)

Summary

This week I started documenting everything I have done so far, including this week's work.

Beyond that, the main goal for this week was to generate an output file that summarizes the task outputs (associating folder names, task names, inputs, outputs, and the params passed to each task).

Summary_log.txt

Linking folders and tasks

First of all, a way was needed to match tasks with the folders where bionode-watermill generated their outputs. At first I used the miniUid variable from lib/sagas/lifecycle.js, which holds a short prefix of the uid generated by the task (for example 9fc99c0) and matches the folders created within data. However, this fails for tasks that use bionode-ncbi, which for some reason do not output to ./data but to ./ instead. To circumvent this, I used the full path of the output directory, available in the originalTask object as originalTask.dir. This way we not only get a more informative path (the full path to the task's output directory), but bionode-ncbi related tasks also no longer report a wrong output directory.
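The difference between the two approaches can be sketched as follows. This is only an illustration, not the code in lib/sagas/lifecycle.js: the helper names, the task object, its uid value and the paths are all made up, and only the dir property mirrors originalTask.dir.

```javascript
const path = require('path')

// Approach 1: guess the folder from a short uid prefix. This assumes every
// task writes to <dataDir>/<prefix>, which breaks for tasks that write
// somewhere else.
function folderFromUid (uid, prefixLength, dataDir) {
  return path.join(dataDir, uid.slice(0, prefixLength))
}

// Approach 2: trust the directory recorded on the task itself.
function folderFromTask (task) {
  return path.resolve(task.dir)
}

// Hypothetical tasks: one writes inside data/, the other one level above it.
const inData = { uid: '9fc99c0a1b2c3d4e', dir: '/tmp/pipeline/data/9fc99c0' }
const outsideData = { uid: 'ab12cd34ef56ab78', dir: '/tmp/pipeline' }

for (const task of [inData, outsideData]) {
  console.log('guessed :', folderFromUid(task.uid, 7, '/tmp/pipeline/data'))
  console.log('recorded:', folderFromTask(task))
}
```

For the first task both functions agree; for the second one the guessed folder points to a directory the task never wrote to, which is the situation described above for bionode-ncbi.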

I/O and params

Tasks also have other associated parameters that can be useful for the user to relate to the output folders and the tasks themselves. For instance, which inputs were given? What is the expected pattern for the output? Are there any other parameters that were passed to the task?

These are currently controlled by input, output and params, all of which are also available in originalTask. However, each of them can be either a string or an object, so to avoid errors or strange behavior when logging them, these variables had to be stringified before being added to the file. This renders something like this:

params: {"db":"sra","accession":"ERR045788"}

or

params: "ERR045788"
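A minimal sketch of this stringification step, assuming plain JSON.stringify is enough for the values involved. The formatField helper is hypothetical and not part of bionode-watermill.

```javascript
// Render one "name: value" line. JSON.stringify treats strings, objects,
// arrays and null uniformly; undefined is mapped to null because
// JSON.stringify(undefined) returns undefined rather than a string.
function formatField (name, value) {
  const safeValue = value === undefined ? null : value
  return name + ': ' + JSON.stringify(safeValue)
}

console.log(formatField('params', { db: 'sra', accession: 'ERR045788' }))
// params: {"db":"sra","accession":"ERR045788"}

console.log(formatField('params', 'ERR045788'))
// params: "ERR045788"
```

Stringifying also keeps the quotes around plain strings, so a string value and an object value remain distinguishable in the log.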

For now the output should look like the following example:

{
  "folderName": "/home/tiago/bin/bionode-watermill-tiagofilipe12/bionode-watermill/examples/pipelines/two-mappers/data/9fc99c0",
  "taskName": "Download reference genome for Streptococcus pneumoniae",
  "input": null,
  "output": "*_genomic.fna.gz",
  "params": {
    "url": "http://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/007/045/GCF_000007045.1_ASM704v1/GCF_000007045.1_ASM704v1_genomic.fna.gz"
  }
}
{
  "folderName": "/home/tiago/bin/bionode-watermill-tiagofilipe12/bionode-watermill/examples/pipelines/two-mappers",
  "taskName": "Download SRA ERR045788",
  "input": null,
  "output": "**/*.sra",
  "params": {
    "db": "sra",
    "accession": "ERR045788"
  }
}
{
  "folderName": "/home/tiago/bin/bionode-watermill-tiagofilipe12/bionode-watermill/examples/pipelines/two-mappers/data/ef7ee47",
  "taskName": "fastq-dump **/*.sra",
  "input": "**/*.sra",
  "output": [
    "*_1.fastq.gz",
    "*_2.fastq.gz"
  ],
  "params": {}
}

So, the output is an NDJSON-like file, following the philosophy of other bionode modules.

This was part of pull request #57, which was later deprecated by pull request #62.
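Writing entries of this shape could look roughly like the sketch below. It is an assumption about how such a log might be produced, not the code from the pull request: summaryEntry, appendSummary and the name property on the task are hypothetical, while dir, input, output and params follow the fields mentioned above. The file is written to a temporary directory so the example has no side effects on a real pipeline.

```javascript
const fs = require('fs')
const os = require('os')
const path = require('path')

// Build one summary entry from a task-like object. The keys mirror the
// example output; the shape of `task` itself is assumed.
function summaryEntry (task) {
  return {
    folderName: task.dir,
    taskName: task.name,
    input: task.input === undefined ? null : task.input,
    output: task.output === undefined ? null : task.output,
    params: task.params || {}
  }
}

// Append the entry, pretty-printed with two spaces, followed by a newline.
function appendSummary (file, task) {
  fs.appendFileSync(file, JSON.stringify(summaryEntry(task), null, 2) + '\n')
}

// Hypothetical usage with values taken from the example output.
const tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), 'summary-'))
const logFile = path.join(tmpDir, 'Summary_log.txt')

appendSummary(logFile, {
  dir: '/tmp/pipeline/data/ef7ee47',
  name: 'fastq-dump **/*.sra',
  input: '**/*.sra',
  output: ['*_1.fastq.gz', '*_2.fastq.gz'],
  params: {}
})

console.log(fs.readFileSync(logFile, 'utf8'))
```

Because each entry is pretty-printed over several lines, the file is not strict NDJSON (one object per line). Calling JSON.stringify without the indentation argument would give one entry per line, at the cost of readability.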

What can be improved?

  • We need to check the behavior of bionode-ncbi within bionode-watermill, since its outputs are being generated outside ./data, the expected main output directory.
  • The output should include the actual files used as input and output (resolvedInput and resolvedOutput).
  • The 'command' run by each task should be added.
  • The visualization of the relationships between tasks should be improved: which tasks run in parallel and which run one after another.