FastScreen: a Compi pipeline for Fast Screening of PSS (Positively Selected Sites).

FastScreen

FastScreen is a Compi pipeline that identifies datasets likely to show evidence for positive selection, which should therefore be the subject of detailed, time-consuming analyses [1]. A Docker image is available for this pipeline in this Docker Hub repository.

FastScreen repositories

Using the FastScreen image in Linux

To use the FastScreen image, first create a directory named compi_fss_working_dir/input in your local file system. compi_fss_working_dir is the working directory of the pipeline, where the output results and intermediate files will be created. The input FASTA files to be analyzed must be placed in the compi_fss_working_dir/input directory.

Note that FastScreen requires each FASTA file to have at least 4 sequences. Otherwise, the pipeline will not start, and a list of the files with fewer than 4 sequences is created.
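The 4-sequence requirement can be checked before launching the container. The following is a minimal sketch, not part of FastScreen itself: the function names count_seqs and list_small_fastas are hypothetical, and it assumes that every sequence in a FASTA file starts with a header line beginning with '>'.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check (not part of FastScreen): find the FASTA files
# in an input directory that have fewer than 4 sequences.

# Count the sequences in a FASTA file by counting header lines (starting with '>').
# grep -c exits with status 1 when the count is 0, hence the '|| true'.
count_seqs() {
  grep -c '^>' "$1" || true
}

# Print the name of every regular file in directory $1 that has fewer than
# $2 sequences (default: 4).
list_small_fastas() {
  local dir="$1" min="${2:-4}" f
  for f in "$dir"/*; do
    [ -f "$f" ] || continue
    if [ "$(count_seqs "$f")" -lt "$min" ]; then
      basename "$f"
    fi
  done
}
```

For example, list_small_fastas /path/to/compi_fss_working_dir/input prints one file name per line, and prints nothing when every file meets the requirement.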

Test data

The sample data is available here. Download the FASTA files and put them inside the compi_fss_working_dir/input directory in your local file system. Note that the input folder must keep that name, as the pipeline looks for the FASTA files there.
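The directory layout described above can be prepared with a small helper. This is only a sketch: prepare_working_dir is a hypothetical name, and the helper simply creates the input folder and copies the given FASTA files into it.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of FastScreen): create <working_dir>/input
# and copy the given FASTA files into it.
prepare_working_dir() {
  local working_dir="$1"
  shift
  # The folder must be called "input", since the pipeline looks for FASTA files there.
  mkdir -p "$working_dir/input"
  if [ "$#" -gt 0 ]; then
    cp -- "$@" "$working_dir/input/"
  fi
}
```

For example, prepare_working_dir /path/to/compi_fss_working_dir sample1.fasta sample2.fasta, where the FASTA file names are placeholders for the downloaded files.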

Then, you should adapt and run the following commands:

WORKING_DIR=/path/to/compi_fss_working_dir

docker run -v ${WORKING_DIR}:/working_dir --rm pegi3s/pss-fs --logs /working_dir/logs

In these commands, you should replace:

  • /path/to/compi_fss_working_dir with the actual path in your file system.

Extra

To re-run the pipeline in the same working directory, run the following command first in order to clean it:

docker run -v ${WORKING_DIR}:/working_dir --entrypoint clean_working_dir pegi3s/pss-fs /working_dir/

Or, alternatively, delete every folder manually:

sudo rm -rf ${WORKING_DIR}/ali ${WORKING_DIR}/renamed_seqs ${WORKING_DIR}/logs ${WORKING_DIR}/tree ${WORKING_DIR}/FUBAR_files ${WORKING_DIR}/FUBAR_results ${WORKING_DIR}/short_list ${WORKING_DIR}/to_be_reevaluated_by_codeML ${WORKING_DIR}/codeML_random_list ${WORKING_DIR}/codeML_results ${WORKING_DIR}/tree.codeML ${WORKING_DIR}/codeML_short_list ${WORKING_DIR}/negative_list ${WORKING_DIR}/files_requiring_attention ${WORKING_DIR}/FUBAR_short_list ${WORKING_DIR}/renamed_seqs_mappings

For Developers

Pipeline implementation

The pipeline.xml analyzes each FASTA file in the input_dir directory in parallel (using bound foreach tasks) and writes the results to the specified working_dir. For each input FASTA file, ClustalOmega and FastTree are run first so that FUBAR can look for evidence for positive selection. If evidence for positive selection is found, the name of the file is added to the short_list file. If it is not found, the file is analyzed using CodeML. The tasks related to the execution of CodeML can be skipped by passing the parameter skip_code_ml.

Note that there is a limit of around 90 000 for the product of the number of sequences and the number of ungapped codons that CodeML can handle [1]. When this limit is exceeded, a random sample is taken from the initial dataset (in the codeml-check-limit task). In these cases, as many sequences as possible, minus one, are used.
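The size check can be illustrated with a little arithmetic. The sketch below is not the codeml-check-limit task: the function names are hypothetical, the limit is treated as exactly 90000 even though the text only gives an approximate figure, and the number of ungapped codons is assumed to be already known.

```shell
#!/usr/bin/env bash
# Illustrative arithmetic only (not the pipeline's codeml-check-limit task).
# Assumption: the approximate limit from the text is treated as exactly 90000.
CODEML_LIMIT=90000

# Succeed (exit status 0) when sequences ($1) x ungapped codons ($2) exceeds the limit.
exceeds_codeml_limit() {
  [ $(( $1 * $2 )) -gt "$CODEML_LIMIT" ]
}

# Print the largest number of sequences whose product with $1 ungapped codons
# stays within the limit (integer division rounds down).
max_seqs_within_limit() {
  echo $(( CODEML_LIMIT / $1 ))
}
```

For a dataset with 400 ungapped codons, max_seqs_within_limit 400 gives 225 sequences as the upper bound. If "as many sequences as possible, minus one" means one fewer than this bound, the sample would hold 224 sequences, but that reading is an assumption.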

The main output is the short_list file, which contains the names of the FASTA files where evidence for positive selection was found.

Apart from the short_list file, six other output files are produced:

  1. FUBAR_short_list: contains the names of the files where evidence for positive selection has been found by FUBAR.
  2. to_be_reevaluated_by_codeML: contains the names of the files that were re-evaluated by CodeML.
  3. codeML_random_list: contains the names of the files from which a random sequence sample was taken because they were too large to be analysed by CodeML.
  4. codeML_short_list: contains the names of the files where PSS were detected by CodeML model M2a.
  5. negative_list: contains the names of the files where no evidence for positive selection was found by either FUBAR or CodeML.
  6. files_requiring_attention: contains the names of the files that could not be processed without error (usually because they have in-frame stop codons that were introduced during the nucleotide alignment step).
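After a run, the output lists above can be summarized with a short script. This is a sketch under two assumptions: each list is a plain-text file in the working directory with one FASTA file name per line, and summarize_lists is a hypothetical helper name, not something the pipeline provides.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of FastScreen): print "<list name><TAB><entries>"
# for each output list that exists in the working directory $1.
# Assumption: each list is a plain-text file with one FASTA file name per line.
summarize_lists() {
  local dir="$1" name
  for name in short_list FUBAR_short_list to_be_reevaluated_by_codeML \
              codeML_random_list codeML_short_list negative_list \
              files_requiring_attention; do
    if [ -f "$dir/$name" ]; then
      # grep -c . counts non-empty lines; it exits with 1 when the count is 0.
      printf '%s\t%d\n' "$name" "$(grep -c . "$dir/$name" || true)"
    fi
  done
}
```

Calling summarize_lists /path/to/compi_fss_working_dir prints one line per list that is present and skips lists that were not created.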

Building the Docker image

To build the Docker image, compi-dk is required. Once you have it installed, simply run compi-dk build from the project directory to build the Docker image. The image will be created with the name specified in the compi.project file (i.e. pegi3s/pss-fs:latest). This file also specifies the version of compi that goes into the Docker image.

References

  • H. López-Fernández; C. P. Vieira; P. Ferreira; P. Gouveia; F. Fdez-Riverola; M. Reboiro-Jato; J. Vieira (2021) On the identification of clinically relevant bacterial amino acid changes at the whole genome level using Auto-PSS-Genome. Interdisciplinary Sciences: Computational Life Sciences. Volume 13, pp. 334–343. DOI
  • H. López-Fernández; P. Duque; N. Vázquez; F. Fdez-Riverola; M. Reboiro-Jato; C.P. Vieira; J. Vieira (2019) Inferring Positive Selection in Large Viral Datasets. 13th International Conference on Practical Applications of Computational Biology & Bioinformatics: PACBB 2019. Ávila, Spain. 26 - June DOI
