CWL problem: input files listing using JavaScript with more than one step in workflow #156

Open
novikovant opened this issue Nov 6, 2020 · 7 comments

Comments

@novikovant

We have a problem with the expression below in workflows run on REANA (dev version 0.7.0a1), using run-cwl-workflow 0.7.0a1 with cwltool 1.0.20191022103248.

class: ExpressionTool
requirements: { InlineJavascriptRequirement: {} }
expression: '${return {"files": inputs.dir.listing};}'

This works well when used alone (in a one-step workflow), but in a multistep workflow it produces an error.

cwltool | MainThread | INFO | [step list_input] start
ERROR [step list_input] Output is missing expected field file:///var/reana/users/00000000-0000-0000-0000-000000000000/workflows/761dfcbc-43ed-4b27-8bb7-5b72b9b7ba44/workflow.json#main/list_input/files
cwltool | MainThread | ERROR | [step list_input] ...
WARNING [step list_input] completed permanentFail

The example workflow is below.
If only a single step remains, it also transfers the input files to the output. But as is (multistep), it produces the ERROR immediately, without any transfers.

jobdata.yml (upload any file or files of any size)

input_dir:
  class: Directory
  path: data/

Workflow

#!/usr/bin/env cwl-runner

cwlVersion: v1.0
class: Workflow

requirements:
  ScatterFeatureRequirement: {}
  SubworkflowFeatureRequirement: {}
  StepInputExpressionRequirement: {}
  InlineJavascriptRequirement: {}
inputs:
  input_dir: Directory 
steps:
  list_input:
    run:
      class: ExpressionTool
      requirements: { InlineJavascriptRequirement: {} }
      inputs:
        dir: Directory
      expression: '${return {"files": inputs.dir.listing};}'
      outputs:
        files: File[]
    in:
      dir: input_dir
    out: [files]
  wf:
    scatter: inf
    in:
      inf: list_input/files
    run:
      class: Workflow
      inputs:
        inf: File
      steps:
       count_file:
          run: lssh.yaml
          in:
            input_file: inf
          out: [out_file]
      outputs:
        out_file:
          type: File
          outputSource: count_file/out_file
    out: [out_file]
outputs:
  out_files:
     type: File[]
     outputSource: wf/out_file

And wf/lssh.yaml is not really important (it is not executed).
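
For comparison, the single-step case that runs without error can be sketched by keeping only list_input and wiring its output straight to the workflow output. This is reconstructed from the description above, not copied from the real pipeline, so treat it as an illustration of the working case rather than the exact file that was run:

```yaml
#!/usr/bin/env cwl-runner

cwlVersion: v1.0
class: Workflow

requirements:
  InlineJavascriptRequirement: {}
inputs:
  input_dir: Directory
steps:
  # Only the directory-listing step is kept; the scattered subworkflow is removed.
  list_input:
    run:
      class: ExpressionTool
      requirements: { InlineJavascriptRequirement: {} }
      inputs:
        dir: Directory
      expression: '${return {"files": inputs.dir.listing};}'
      outputs:
        files: File[]
    in:
      dir: input_dir
    out: [files]
outputs:
  # The listed files become the workflow output directly.
  out_files:
    type: File[]
    outputSource: list_input/files
```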

@novikovant
Author

Can anyone help us with this problem?
Do you have any questions?

@alintulu
Member

Hi @novikovant!

Sorry for the late response. I just tried out your workflow. However, since I don't have lssh.yaml, I had to amend it; otherwise I ran into Field "run" contains undefined reference to file:///var/reana/users/00000000-0000-0000-0000-000000000000/workflows/8583a802-760b-4dc5-a117-adee2288d7ba/lssh.yaml.

I used the same jobdata.yml file as you, and my workflow file looks like

$ cat workflow.cwl
#!/usr/bin/env cwl-runner

cwlVersion: v1.0
class: Workflow

requirements:
  ScatterFeatureRequirement: {}
  SubworkflowFeatureRequirement: {}
  StepInputExpressionRequirement: {}
  InlineJavascriptRequirement: {}
inputs:
  input_dir: Directory 
steps:
  list_input:
    run:
      class: ExpressionTool
      requirements: { InlineJavascriptRequirement: {} }
      inputs:
        dir: Directory
      expression: '${return {"files": inputs.dir.listing};}'
      outputs:
        files: File[]
    in:
      dir: input_dir
    out: [files]
  wf:
    scatter: inf
    in:
      inf: list_input/files
    run:
      class: Workflow
      inputs:
        inf: File
      steps:
       count_file:
          run:
            class: CommandLineTool
            inputs:
              input_file: File
            baseCommand: /bin/sh
            arguments:
              - prefix: -c
                valueFrom: |
                  echo $(inputs.input_file) > out_file
            outputs:
              out_file:
                type: File
                outputBinding:
                  glob: 'out_file'
          in:
            input_file: inf
          out: [out_file]
      outputs:
        out_file:
          type: File
          outputSource: count_file/out_file
    out: [out_file]
outputs:
  out_files:
     type: File[]
     outputSource: wf/out_file

I ran it using the following reana.yaml

$ cat reana.yaml
version: 0.6.0
inputs:
  parameters:
    input: jobdata.yml
  directories:
    - data
workflow:
  type: cwl
  file: workflow.cwl

I also had a data/ directory

$ ls data/
file1  file2  file3

Running locally on REANA 0.7.0a1 with run-cwl-workflow 0.7.0a1 and cwltool 1.0.20191022103248, I got the same error message (on a different field). However, I also saw Docker pull failures in the output:

$ reana-client logs
ERROR | [step count_file_2] Output is missing expected field file:///var/reana/users/00000000-0000-0000-0000-000000000000/workflows/1b6dfa95-596d-4108-b5c8-3268e5700d99/workflow.json#main/wf/6fdfd0e2-c129-4e3a-8c37-799c71238bca/count_file/out_file
...
Container job failed, error: rpc error: code = Unknown desc = failed to pull and unpack image "docker.io/frolvlad/alpine-bash:latest": failed to resolve reference "docker.io/frolvlad/alpine-bash:latest": failed to do request: Head https://registry-1.docker.io/v2/frolvlad/alpine-bash/manifests/latest: dial tcp: lookup registry-1.docker.io on 172.18.0.1:53: read udp 172.18.0.2:57280->172.18.0.1:53: i/o timeout
....

If you do not specify a docker image, REANA defaults to using frolvlad/alpine-bash. To debug I tried setting an image I knew I had loaded onto my kind plane, by adding

requirements:
  DockerRequirement:
    dockerPull:
      reanahub/reana-env-root6:6.18.04

This worked! The job finished successfully

$ reana-client logs
...
2020-11-17 11:06:33,733 | cwltool | MainThread | INFO | Final process status is success
workflow done
....
$ reana-client ls
NAME                                                                SIZE   LAST-MODIFIED      
workflow.json                                                       2488   2020-11-17T11:06:14
inputs.json                                                         54     2020-11-17T11:06:14
data/file1                                                          0      2020-11-17T11:05:54
data/file2                                                          0      2020-11-17T11:05:58
data/file3                                                          0      2020-11-17T11:06:02
outputs/out_file                                                    543    2020-11-17T11:06:24
outputs/out_file_2                                                  543    2020-11-17T11:06:28
outputs/out_file_2_3                                                543    2020-11-17T11:06:28
cwl/docker_outdir/out_file                                          543    2020-11-17T11:06:28
cwl/docker_stagedir/stg8f290b89-867e-4cf0-b08c-2c216ad2f50e/file1   0      2020-11-17T11:05:54
cwl/docker_stagedir/stgcbbd982f-f6ae-4092-8d96-e9f78ce24520/file3   0      2020-11-17T11:06:02
cwl/docker_stagedir/stg66c3a462-4bfd-41b0-b556-b10ccb9838d4/file2   0      2020-11-17T11:05:58

If you don't think the docker pulling is the issue, maybe I could help debug better if I had access to lssh.yaml?

@novikovant
Author

novikovant commented Nov 17, 2020

Thanks for the response and testing!
Indeed, your variant works. But we cannot use it, so please continue testing.
OK, now I see that lssh.yaml (and the Docker image) is important. Sorry for not providing the full information.

Here it is. It uses a base Ubuntu image (maybe this is too much for this example; possibly it is not the problem).
There is an additional "InitialWorkDirRequirement" and a different way of using "bash" (I think we need the variant below).

We tried with the Ubuntu image, with your image, and with your way of using bash in an external CWL step (like lssh.yaml). It doesn't work.
So a possible reason is that the problem occurs when steps are in external YAML files (with the directory listing as the first, inline, step; other external steps work fine in large workflows).
This may be unrelated, but we also had a problem with "cwltool --pack" (on the REANA side) for the very complex directory structure of the initial pipeline (in reana.yaml it was defined as workflows/workflow.yaml, which referenced many external YAML files, several of them via relative paths).

#!/usr/bin/env cwl-runner

cwlVersion: v1.0
class: CommandLineTool
label: count lines in file
hints:
  DockerRequirement:
    dockerPull: ubuntu:18.04
requirements:
  InitialWorkDirRequirement:
    listing:
      - entryname: run.sh
        entry: >-
          wc -l $(inputs.input_file.path) > $(inputs.out_file_name);
baseCommand: ["/bin/bash","run.sh"]
inputs:
  input_file:
    type: File
  out_file_name:
    type: string
    default: lines.txt
outputs:
  out_file:
    type: File
    outputBinding:
      glob: $(inputs.out_file_name)

@alintulu
Member

alintulu commented Nov 17, 2020

Thank you for the information, I can now reproduce your error. I will have a closer look into it and let you know!

@alintulu
Member

alintulu commented Nov 18, 2020

Good and bad news! I was able to locate the issue: it seems REANA does not currently support some of the CWL constructs in your workflow. However, I was able to run the same computational steps by amending it as follows:

#!/usr/bin/env cwl-runner

cwlVersion: v1.0
class: Workflow

requirements:
  ScatterFeatureRequirement: {}
  SubworkflowFeatureRequirement: {}
  StepInputExpressionRequirement: {}
  InlineJavascriptRequirement: {}
inputs:
  input_dir: Directory 
steps:
  list_input:
    run:
      class: ExpressionTool
      requirements: { InlineJavascriptRequirement: {} }
      inputs:
        dir: Directory
      expression: '${return {"files": inputs.dir.listing};}'
      outputs:
        files: File[]
    in:
      dir: input_dir
    out: [files]
  wf:
    scatter: inf
    in:
      inf: list_input/files
    run:
      class: Workflow
      inputs:
        inf: File
      steps:
       count_file:
          run:
            class: CommandLineTool
            label: count lines in file
            hints:
              DockerRequirement:
                dockerPull: ubuntu:18.04
            baseCommand: "/bin/bash"
            arguments:
              - prefix: -c
                valueFrom: |
                  wc -l $(inputs.input_file.path) > $(inputs.out_file_name)
            inputs:
              input_file: File
              out_file_name:
                type: string
                default: lines.txt
            outputs:
              out_file:
                type: File
                outputBinding:
                  glob: $(inputs.out_file_name)
          in:
            input_file: inf
          out: [out_file]
      outputs:
        out_file:
          type: File
          outputSource: count_file/out_file
    out: [out_file]
outputs:
  out_files:
     type: File[]
     outputSource: wf/out_file

As you suspected, the new way of using bash was not compatible. I also had issues moving the count_file step into an external lssh.yaml file. We have plenty of examples on REANA doing so (e.g. reanahub/reana-demo-root6-roofit); however, I was unable to do so in this example, perhaps due to the complex directory structure. Is this helpful for you?

@novikovant
Author

novikovant commented Nov 18, 2020

I understand.
Thanks for the support, but this is not a solution for us. We have a very complex workflow with many external steps, and they work fine with REANA (this is great!) except for the one with the directory listing.
As I have also mentioned, the way of calling "bash" is not the problem. Both ways work inline, and neither works in an external step.
So it would be nice if your team could solve this at some point, but for now please continue with other, more important tasks.
Personally, I think that systems such as REANA should use their own DMS (data management system), and you've already taken some steps toward that (allowing NFS as a shared FS, job_data_path, hostmounts), though some of them still need to be tested. Then there are the questions of speed-ups and queries for multi-job parallel I/O, etc.

A possible workaround for now is to manually write all files to the input. Another: maybe a single inline step right after the JavaScript step (or before it) would correct the behaviour (of the cwltool file staging subsystem), so that the other steps could be external.
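
The first workaround (listing every file explicitly) might look roughly like the sketch below. It is untested on REANA: the input name input_files is hypothetical, and the file names are simply the ones from the data/ listing earlier in this thread. The job file declares each file as a File object instead of passing the directory:

```yaml
# jobdata.yml: every input file is written out by hand
# (input_files is a hypothetical name; file1..file3 come from the data/ listing above)
input_files:
  - class: File
    path: data/file1
  - class: File
    path: data/file2
  - class: File
    path: data/file3
```

The workflow would then take a File[] input and scatter over it directly, so the ExpressionTool step with inputs.dir.listing is no longer needed:

```yaml
# workflow.cwl (fragment): list_input is dropped, the scatter reads the File[] input
inputs:
  input_files: File[]
steps:
  wf:
    scatter: inf
    in:
      inf: input_files
    run:  #... subworkflow unchanged
    out: [out_file]
```

The obvious cost is that the job file has to be regenerated whenever the contents of data/ change.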

@mr-c
Member

mr-c commented Jun 30, 2021

FYI, as an alternative to using that ExpressionTool, one could instead transform the Directory into File[] at the workflow step "in" level using valueFrom:

steps:
 wf:
   scatter: inf
   in:
     inf: 
       source: input_dir
       valueFrom: $(self.listing)
   run:  #...
