Bring your own tool dockerized
Virtual Research Environments integrate tools and pipelines to support a research community. We offer the possibility to integrate your application into one of these analytical platforms. Please read through this documentation and contact us with any doubt or suggestion you may have.
The open Virtual Research Environment offers a number of benefits for developers willing to integrate their tools in the platform:
- Open access platform publicly available
- A full web-based workbench with user support utilities, OIDC authentication service, and data handlers for local and remote datasets or repositories.
- Visibility for your tool and organization, with ownership recognition, tailored web forms, help pages and customized viewers.
- The possibility to add extra value to your tool by complementing it with other related tools already in the platform.
- Complete control of your tool through the administration panel, for monitoring, logging and management.
The application or pipeline to be integrated should:
- Be free and open source
- Be containerized (tested with Docker and Singularity)
- Run in non-interactive mode on a Linux-based operating system
Since the Virtual Research Environment is itself a dockerized system, tool integration follows a similar dockerization method: the tool and its dependencies are encapsulated in a container, allowing for easy sharing, version control, and deployment.
This guide walks through the process of Dockerizing a sequence extraction tool and integrating it into a Virtual Research Environment (VRE) framework.
These are the steps to follow to integrate your application as a new dockerized VRE tool. As a result, the VRE is able to control the whole tool execution cycle. It:
- Automatically builds the job-submission form on the website, with the parameter fields and input files of the Tool
- Validates input files and parameters (format and data type filtering, maximum/minimum values, etc.)
- Stages in the required input files into the Tool working directory on the compute host (if required)
- Schedules the Tool on the cloud/HPC backend in a scalable manner
- Monitors and logs tool progress during the execution
- Stages out output files from the run working directory (if required)
- Registers the output files resulting from the execution at the website
Within the OpenVRE environment, you will need to integrate the tool with the OpenVRE Tool Dockerized framework. To do that, the VRE needs three elements:
- A Docker image for your tool, containing the application
- A Docker image specific to the VRE framework, containing the VRE RUNNER wrapper
- A list of descriptive metadata fields annotating the tool (i.e. input file requirements, arguments, description)
The following guide will help you achieve that.
In this example we are going to use a SeqIO tool: a sequence extraction tool built with Biopython, designed to filter and extract sequences from a FASTA file based on specified IDs and a minimum sequence length.
The Dockerfile sets up the environment by installing dependencies like Biopython and placing the necessary Python script into the container.
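For reference, here is a minimal sketch of what such an extraction script could look like. The actual extract_sequences.py lives in the tool repository; this version parses FASTA with the standard library instead of Biopython to stay dependency-free, and its argument order (FASTA file, IDs file, output file, minimum length) is an assumption for illustration:

```python
import sys

def read_ids(ids_path):
    """Read one sequence ID per line."""
    with open(ids_path) as fh:
        return {line.strip() for line in fh if line.strip()}

def parse_fasta(fasta_path):
    """Yield (id, sequence) tuples from a FASTA file."""
    header, seq = None, []
    with open(fasta_path) as fh:
        for line in fh:
            line = line.rstrip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq)
                # keep only the ID token, drop the description
                header, seq = line[1:].split()[0], []
            else:
                seq.append(line)
    if header is not None:
        yield header, "".join(seq)

def extract(fasta_path, ids_path, out_path, min_length):
    """Write sequences whose ID is wanted and length >= min_length."""
    wanted = read_ids(ids_path)
    kept = 0
    with open(out_path, "w") as out:
        for seq_id, seq in parse_fasta(fasta_path):
            if seq_id in wanted and len(seq) >= min_length:
                out.write(f">{seq_id}\n{seq}\n")
                kept += 1
    return kept

if __name__ == "__main__":
    # argv: fasta_file ids_file output_file min_length
    extract(sys.argv[1], sys.argv[2], sys.argv[3], int(sys.argv[4]))
```

The real tool would use Biopython's SeqIO parser instead of the hand-rolled one above; the command-line contract is what matters for the integration.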
Create the Dockerfile in your project directory; it defines the environment and the tool configuration. Using the extraction tool mentioned above as an example:
```dockerfile
# Use a lightweight Python image
FROM python:3.9-slim

# Set the working directory inside the container
WORKDIR /home/

# Install Biopython for sequence handling
RUN pip install biopython

# Copy your Python script into the container
COPY seqio_tool/extract_sequences.py /home/seqio_tool/extract_sequences.py

# Make the Python script executable
RUN chmod +x /home/seqio_tool/extract_sequences.py

# Define the entry point for the container
ENTRYPOINT ["python", "/home/seqio_tool/extract_sequences.py"]
```
Make sure the ENTRYPOINT refers directly to the script/software that you want to launch on the platform. The VRE framework will use it as the direct command for the wrapper.
Once the Dockerfile is set up, build the Docker image with the command:
```shell
docker build -t my_tool_image .
```
In this example, the image is available on Docker Hub.
Clone the vre_template_tool_dockerized repository on your system:
```shell
git clone https://github.com/mapoferri/vre_dockerized_tool_techthon.git
cd vre_dockerized_tool_techthon/template/
```
This will be our working directory from this point on.
In Dockerfile_template, you only need to modify the FROM command:
```dockerfile
# INTEGRATE NEW TOOL CONTAINER
# FROM <your image name here>
FROM mapoferri/seqio-tool:latest
```
Remember to rename Dockerfile_template to Dockerfile to be able to build the image.
The VRE RUNNERs are the adapters that consume the job execution files sent by the VRE server, submitting a new job each time a user launches one through the web interface. The RUNNER runs the wrapped application or pipeline locally (the ENTRYPOINT of your tool's Docker image) and generates the outputs.
Since the modified Dockerfile will create the RUNNER for you automatically, following the [vre_template_tool]() format, the only modification you need to make is to update VRE_Tool_Template.py:
```python
class myTool(Tool):
    DEFAULT_KEYS = ['execution', 'project', 'description']
    PYTHON_SCRIPT_PATH = "/../seqio_tool/extract_sequences.py"
```
PYTHON_SCRIPT_PATH must point directly to your script as you saved it in your Docker image. Make sure the path is consistent.
Path consistency
Before running the final VRE Tool dockerized version of your tool, make sure that the path you used in your Dockerfile can be reached from the $WORK_DIR in the vre_tool_dockerized image. This path never changes in the VRE Tool Docker image (/home/vre_template_tool/), so keep it in mind when setting PYTHON_SCRIPT_PATH.
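To see why the leading /../ in PYTHON_SCRIPT_PATH works, note that concatenating the fixed RUNNER directory with that relative path climbs out of /home/vre_template_tool/ and lands exactly where the tool's Dockerfile copied the script. A quick sketch of the resolution:

```python
import os

# Fixed working directory of the VRE RUNNER inside the container
work_dir = "/home/vre_template_tool"

# Relative path stored in the Tool class (note the leading /../)
PYTHON_SCRIPT_PATH = "/../seqio_tool/extract_sequences.py"

# Concatenation followed by normalization resolves to the script
# location created by the tool Dockerfile's COPY instruction
resolved = os.path.normpath(work_dir + PYTHON_SCRIPT_PATH)
print(resolved)  # /home/seqio_tool/extract_sequences.py
```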
You also need to specify the inputs and arguments in this code. The default is one input file and one argument. This is how the runToolExecution section of VRE_Tool_Template.py has been modified to adapt to the SeqIO tool's dependencies:
```python
try:
    # Get input files
    input_file_1 = input_files.get('fasta_file')
    if not os.path.isabs(input_file_1):
        input_file_1 = os.path.normpath(os.path.join(self.parent_dir, input_file_1))
    input_file_2 = input_files.get('ids_file')
    if not os.path.isabs(input_file_2):
        input_file_2 = os.path.normpath(os.path.join(self.parent_dir, input_file_2))
    # TODO: add more input files to use, if it is necessary for you

    # Get arguments
    argument_1 = self.arguments.get('min_lenght')
    if argument_1 is None:
        errstr = "min_lenght must be defined."
        logger.fatal(errstr)
        raise Exception(errstr)
```
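Note that argument values arrive as strings from config.json (e.g. "50" for min_lenght), so a defensive cast before building the command line is a reasonable addition. This helper is hypothetical, not part of the template:

```python
def parse_min_length(raw):
    """Coerce the min_lenght argument (a string in config.json) to a
    non-negative integer, raising a clear error otherwise.
    Hypothetical helper for illustration only."""
    try:
        value = int(raw)
    except (TypeError, ValueError):
        raise ValueError(f"min_lenght must be an integer, got {raw!r}")
    if value < 0:
        raise ValueError("min_lenght must be >= 0")
    return value
```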
Finally, you need to change the cmd command in the same code section, following the requirements of your script, which will be called every time the user launches a job request.
In the template version:
```python
cmd = [
    'bash', '/home/my_demo_pipeline.sh', output_file_path
]
```
In the example SeqIO tool:
```python
cmd = [
    'python3',
    self.parent_dir + self.PYTHON_SCRIPT_PATH,  # extract_sequences.py
    input_file_1,      # FASTA file
    input_file_2,      # IDs file
    output_file_path,
    argument_1         # min_lenght
]
```
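The command list above can be sketched as a small helper that the RUNNER would then hand to subprocess. The function names here are illustrative, not the template's actual API:

```python
import subprocess

def build_cmd(script_path, fasta_file, ids_file, output_path, min_length):
    """Assemble the argv list in the same order as runToolExecution."""
    return [
        "python3",
        script_path,      # extract_sequences.py
        fasta_file,       # FASTA input
        ids_file,         # IDs input
        output_path,      # output FASTA
        str(min_length),  # min_lenght argument
    ]

def run_tool(cmd):
    """Execute the tool, capturing stdout/stderr for the RUNNER log."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, proc.stdout, proc.stderr
```

Building cmd as a list (rather than a shell string) avoids quoting issues when file paths contain spaces.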
Remember to rename VRE_Tool_Template.py to VRE_Tool.py to be able to build the image.
In this step, we will create two JSON files that provide a basic description of the tool. These files will be used for local testing of the integration with the VRE_RUNNER. You can find them in the template/vre_template_tool/tests/basic_docker directory.
- Run Configuration File (config.json): contains the list of input files selected by the user for a specific run, including:
  - Values of the arguments
  - List of expected output files
```json
{
    "input_files": [
        {
            "name": "fasta_file",
            "value": "unique_file_id_5e14abe0a37012.29503907",
            "required": true,
            "allow_multiple": false
        },
        {
            "name": "ids_file",
            "value": "unique_file_id_5e14abe0a37012.29503908",
            "required": true,
            "allow_multiple": false
        }
    ],
    "arguments": [
        {
            "name": "execution",
            "value": "/shared_data/userdata/user_1/run000"
        },
        {
            "name": "project",
            "value": "example"
        },
        {
            "name": "description",
            "value": "test"
        },
        {
            "name": "min_lenght",
            "value": "50"
        }
    ],
    "output_files": [
        {
            "name": "output_fasta",
            "required": true,
            "allow_multiple": false,
            "file": {
                "file_type": "FASTA",
                "data_type": "result",
                "meta_data": {
                    "visible": true,
                    "tool": "seqio_tool",
                    "description": "Demo output file."
                },
                "file_path": "/shared_data/public_tmp/outputfasta.fasta"
            }
        }
    ]
}
```
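To illustrate how the RUNNER consumes this file, here is a short sketch that loads config.json and flattens it into the input_files/arguments dictionaries used in runToolExecution. The exact dictionary shapes are an assumption based on the template code above:

```python
import json

def load_run_config(path):
    """Flatten config.json into simple name -> value dictionaries."""
    with open(path) as fh:
        cfg = json.load(fh)
    input_files = {item["name"]: item["value"] for item in cfg["input_files"]}
    arguments = {item["name"]: item["value"] for item in cfg["arguments"]}
    return input_files, arguments
```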
- Input Files Metadata File (in_metadata.json): contains metadata for each input file listed in config.json, including:
  - Absolute file path
  - Other relevant metadata information
```json
[
    {
        "_id": "unique_file_id_5e14abe0a37012.29503907",
        "type": "file",
        "file_path": "/shared_data/public_tmp/fasta_file.txt",
        "file_type": "TXT",
        "data_type": "input_file",
        "compressed": 0,
        "sources": [],
        "user_id": "user_id",
        "creation_time": {
            "$date": {"$numberLong": 1612777323000}
        },
        "meta_data": {
            "size": 0,
            "project": "example",
            "atime": {
                "$date": {"$numberLong": 1612777323000}
            },
            "parentDir": "unique_file_id_5e14abe0a37742.64003100",
            "lastAccess": {
                "$date": {"$numberLong": 1612777323000}
            }
        }
    },
    {
        "_id": "unique_file_id_5e14abe0a37012.29503908",
        "type": "file",
        "file_path": "/shared_data/public_tmp/ids.txt",
        "file_type": "TXT",
        "data_type": "input_file",
        "compressed": 0,
        "sources": [],
        "user_id": "user_id",
        "creation_time": {
            "$date": {"$numberLong": 1612777323000}
        },
        "meta_data": {
            "size": 0,
            "project": "example",
            "atime": {
                "$date": {"$numberLong": 1612777323000}
            },
            "parentDir": "unique_file_id_5e14abe0a37742.64003100",
            "lastAccess": {
                "$date": {"$numberLong": 1612777323000}
            }
        }
    }
]
```
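Note that the value fields in config.json are file identifiers, not paths; the RUNNER resolves them against in_metadata.json. A hedged sketch of that lookup (the function name is illustrative):

```python
import json

def resolve_paths(config_inputs, metadata_path):
    """Map each input name to a real path by matching the file _id from
    config.json against the entries in in_metadata.json."""
    with open(metadata_path) as fh:
        by_id = {entry["_id"]: entry["file_path"] for entry in json.load(fh)}
    return {name: by_id[file_id] for name, file_id in config_inputs.items()}
```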
For testing the image: if input files for running the test are provided, make sure to save/move them into the template/vre_template_tool/tests/basic_docker/volumes/public/ directory, since it is the default input location for the test_VRE_RUNNER.sh script.
These JSON files serve as standardized input files for the VRE_RUNNER installed in the Docker environment. In a production setting, these files will be dynamically generated by the VRE server during each execution initiated by the user via the web interface.
For testing purposes, the tool is temporarily called demo_tool; later on it can be renamed to whatever name is more fitting.
In the vre_template_tool_dockerized/template directory, run this command:
```shell
docker build -t demo_tool .
```
Once the VRE Tool dockerized version of your tool is complete, and before integrating it into the VRE environment, you can test it in the template/vre_template_tool/tests/basic_docker directory by running:
```shell
chmod +x test_VRE_RUNNER.sh
./test_VRE_RUNNER.sh
```
You will find the output data in whatever directory was specified in the metadata JSON files.
Once the RUNNER successfully executes the application in your Dockerized development environment, it is time to request registration of the new tool on the corresponding VRE server. To do so, some descriptive metadata on the new application is required, i.e. tool description and title, ownership, references, keywords, etc.
Again, two approaches are supported:
- Manual approach: generate the tool specification file, taking some examples as reference to fully annotate the new tool
  - JSON schemas:
  - Examples:
    - dpfrep RUNNER (example of an R-based tool): tool_specification.json
- Integrate the Tool in the corresponding MongoDB section, following the example here.
- In /volumes/openVRE/tools, create a new directory by copying the tool_skeleton one, named after the tool (the same ID that was used in Mongo).
- Modify the input.php file (especially the $tool_id) based on the requirements of the tool (more inputs, more arguments).
- Modify the index.html file in /volumes/openVRE/tools/$your_tool/assets/home/ so that your tool is consistent with MongoDB.
- Save your tool specification file in your repository and send everything together to the VRE administrators. They will validate the data and register the tool in the VRE.