This week I started documenting everything I have done so far, including this week's work.
Besides that, the main goal for this week is to generate an output file that summarizes the task outputs (associating folder names, task names, inputs and outputs, and the params passed to each task).
First of all, I needed a way to match tasks with the folders where bionode-watermill generated their outputs. At first I used the miniUid variable from lib/sagas/lifecycle.js, which corresponds to the first 7 characters of the uid generated for the task and matches the folders created within data. However, this fails for tasks that use bionode-ncbi, which for some reason do not output to ./data but instead to ./. To circumvent this, I used the full path of the output directory, available in the originalTask object as originalTask.dir.
This way we not only get a more informative path (the full path to the task's output directory), but bionode-ncbi related tasks also no longer report a wrong output directory.
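The difference between the two approaches can be sketched as follows. This is a minimal illustration, not the actual bionode-watermill code: the task objects, their uid values and the helper names folderFromMiniUid and folderFromDir are hypothetical, and the 7-character prefix is an assumption taken from the folder names in the sample output further down.

```javascript
const path = require('path')

// Hypothetical working directory and task objects. Only `uid` and `dir`
// mirror names used in the text; the values are made up for illustration.
const cwd = path.join('/home', 'user', 'pipeline')

// A task whose outputs landed in ./data/<short uid>, as expected.
const downloadGenome = {
  uid: '9fc99c0e5b1d4a7f',
  dir: path.join(cwd, 'data', '9fc99c0')
}

// A bionode-ncbi based task whose outputs landed in ./ instead of ./data.
const downloadSra = {
  uid: '3a1b2c3d4e5f6a7b',
  dir: cwd
}

// First approach: rebuild the folder from a short uid prefix.
// This assumes every task writes to ./data/<prefix>.
const folderFromMiniUid = (task, baseDir) =>
  path.join(baseDir, 'data', task.uid.substring(0, 7))

// Second approach: use the directory recorded on the task itself.
const folderFromDir = (task) => task.dir

// For the bionode-ncbi task the first approach points at a ./data folder
// the task never wrote to, while the second reports the directory it used.
console.log(folderFromMiniUid(downloadSra, cwd))
console.log(folderFromDir(downloadSra))
```

Both helpers agree for tasks that write to ./data, so switching to the recorded directory only changes the result for the tasks that were previously reported wrongly.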
Tasks also have other associated parameters that can help the user relate the output folders to the tasks themselves. For instance, which inputs were given? What is the expected pattern for the output? Were any other parameters passed to the task?
This is currently controlled by input, output and params, all of which are also available in originalTask. However, each of these can be a string or an object, so to avoid errors or strange behavior when logging them, these variables had to be stringified before being added to the file, rendering something like this:
params: {"db":"sra","accession":"ERR045788"}
or
params: "ERR045788"
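A minimal sketch of that stringification step, assuming plain JSON.stringify is enough for these values (the helper name toLogLine is hypothetical and not part of bionode-watermill):

```javascript
// Hypothetical helper: turn a key and a value that may be a string, an
// object, an array or null into a single loggable line.
// JSON.stringify(undefined) returns undefined rather than a string, so a
// missing value is treated as null here.
const toLogLine = (key, value) =>
  `${key}: ${JSON.stringify(value === undefined ? null : value)}`

// An object and a plain string go through the same code path.
console.log(toLogLine('params', { db: 'sra', accession: 'ERR045788' }))
console.log(toLogLine('params', 'ERR045788'))
```

Stringifying everything the same way keeps the quotes around plain strings, which makes it possible to tell a string apart from an object when reading the file back.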
For now the output should look like the following example:
{
  "folderName": "/home/tiago/bin/bionode-watermill-tiagofilipe12/bionode-watermill/examples/pipelines/two-mappers/data/9fc99c0",
  "taskName": "Download reference genome for Streptococcus pneumoniae",
  "input": null,
  "output": "*_genomic.fna.gz",
  "params": {
    "url": "http://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/007/045/GCF_000007045.1_ASM704v1/GCF_000007045.1_ASM704v1_genomic.fna.gz"
  }
}
{
  "folderName": "/home/tiago/bin/bionode-watermill-tiagofilipe12/bionode-watermill/examples/pipelines/two-mappers",
  "taskName": "Download SRA ERR045788",
  "input": null,
  "output": "**/*.sra",
  "params": {
    "db": "sra",
    "accession": "ERR045788"
  }
}
{
  "folderName": "/home/tiago/bin/bionode-watermill-tiagofilipe12/bionode-watermill/examples/pipelines/two-mappers/data/ef7ee47",
  "taskName": "fastq-dump **/*.sra",
  "input": "**/*.sra",
  "output": [
    "*_1.fastq.gz",
    "*_2.fastq.gz"
  ],
  "params": {}
}
So the output is an NDJSON-like file, following the philosophy of other bionode modules.
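As a rough sketch of how such a file could be produced, the snippet below builds one record per task and appends it to a summary file. It is an illustration under assumptions rather than the code from the pull request: the function names, the file name and the name field on the task object are hypothetical; only dir, input, output and params come from the text above. It writes strict NDJSON (one object per line), whereas the example above is pretty-printed for readability.

```javascript
const fs = require('fs')
const os = require('os')
const path = require('path')

// Build the summary record for one task.
// `name` is an assumed property; `dir`, `input`, `output` and `params`
// follow the names used in the text.
const summarizeTask = (task) => ({
  folderName: task.dir,
  taskName: task.name,
  input: task.input === undefined ? null : task.input,
  output: task.output === undefined ? null : task.output,
  params: task.params || {}
})

// Append the record as a single line of JSON.
const appendSummary = (file, task) =>
  fs.appendFileSync(file, JSON.stringify(summarizeTask(task)) + '\n')

// Read the file back as an array of objects, skipping blank lines.
const readSummary = (file) =>
  fs.readFileSync(file, 'utf8')
    .split('\n')
    .filter((line) => line.trim() !== '')
    .map((line) => JSON.parse(line))

// Usage with a made-up task, written to a temporary directory.
const summaryFile = path.join(
  fs.mkdtempSync(path.join(os.tmpdir(), 'summary-')),
  'tasks.ndjson'
)
appendSummary(summaryFile, {
  dir: '/tmp/pipeline/data/ef7ee47',
  name: 'fastq-dump **/*.sra',
  input: '**/*.sra',
  output: ['*_1.fastq.gz', '*_2.fastq.gz'],
  params: {}
})
console.log(readSummary(summaryFile))
```

Keeping one object per line means the file can be appended to as each task finishes and parsed line by line afterwards, without loading or rewriting the whole file.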
This was part of pull request #57, which was later deprecated by pull request #62.
- We need to check bionode-ncbi behavior within bionode-watermill, since its outputs are being generated outside ./data, the expected main output directory.
- The output should include the actual files that are the input and output (resolvedInput and resolvedOutput).
- The 'command' that is run by each task should be added.
- The visualization of which tasks run in parallel and which run one after another should also be improved, showing the relationships between tasks.