RNAVirHost: a machine learning-based method for predicting hosts of RNA viruses through viral genomes

Viruses are obligate intracellular parasites that depend on living organisms for their replication and survival. While metagenomic high-throughput sequencing has facilitated the discovery of the viral dark matter, determining the hosts of metagenome-derived viruses remains challenging owing to the complex composition of metagenomic sequencing samples. The high genetic diversity of RNA viruses poses a further challenge to alignment-based methods.

Here, we introduce RNAVirHost, a machine learning-based tool that predicts the hosts of RNA viruses solely based on viral genomes. It takes complete RNA viral genomes as input and predicts the natural host groups from kingdom level to order level.

RNAVirHost is designed as a two-layer classification framework that hierarchically predicts the host groups of query viruses. Layer 1 contains 5 branches (Chordata, Invertebrate, Viridiplantae, Fungi, Bacteria) at the kingdom and phylum level. Layer 2, designed for the Chordata subtree, has 10 leaves at the class and order level. RNAVirHost predicts the host lineage along this tree by combining virus taxonomic information, viral genomic traits, and sequence homology. It also applies prediction score cutoffs, which let users separate the predictions into different confidence levels.
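The two-layer traversal described above can be illustrated with a minimal sketch. It assumes each layer exposes per-class scores as a plain dictionary. The function name `predict_host_lineage`, the score values, and the Layer 2 labels are hypothetical placeholders and not part of RNAVirHost's API.

```python
from typing import Dict, Optional, Tuple

# Layer 1 branches as listed above.
LAYER1_BRANCHES = ("Chordata", "Invertebrate", "Viridiplantae", "Fungi", "Bacteria")


def predict_host_lineage(
    layer1_scores: Dict[str, float],
    layer2_scores: Optional[Dict[str, float]] = None,
) -> Tuple[str, Optional[str]]:
    """Pick the best Layer 1 branch, then descend only into the Chordata subtree."""
    unknown = set(layer1_scores) - set(LAYER1_BRANCHES)
    if unknown:
        raise ValueError(f"unexpected Layer 1 labels: {sorted(unknown)}")
    l1 = max(layer1_scores, key=layer1_scores.get)
    # Layer 2 is only defined for Chordata; other branches stop at Layer 1.
    if l1 != "Chordata" or not layer2_scores:
        return l1, None
    l2 = max(layer2_scores, key=layer2_scores.get)
    return l1, l2


# Hypothetical scores, for illustration only.
lineage = predict_host_lineage(
    {"Chordata": 0.81, "Invertebrate": 0.10, "Viridiplantae": 0.04,
     "Fungi": 0.03, "Bacteria": 0.02},
    {"Primates": 0.66, "Rodentia": 0.34},
)
print(lineage)
```

The point of the sketch is the control flow: a query that is not placed in Chordata at Layer 1 receives no Layer 2 label.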

Dependency:

python 3.x
BLAST 2.12.0
Prodigal 2.6.3
xgboost 1.7.4
pandas 2.0.3
scikit-learn 1.1.3
biopython 1.83
numpy 1.23.5

Quick Installation

We highly recommend using conda/mamba to install all dependencies.
```
conda install -c bioconda rnavirhost
```

Or you could build the environment from GitHub:

git clone https://github.com/GreyGuoweiChen/VirHost.git
cd VirHost

# Create the environment and install the dependencies using conda or mamba
mamba env create -f environment.yml

# Activate the environment
conda activate rnavirhost

# Distribute RNAVirHost to your conda environment
pip install .

(Alternative) You can also install RNAVirHost with pip alone, without the conda environment, but this may fail.

pip install rnavirhost

# You can check the installation by calling:
conda list rnavirhost

Usage:

RNAVirHost requires the taxonomic information of the query viruses as a .csv file in the following format.

|  | y\|virus order |
| --- | --- |
| NC_000858.1 | Ortervirales |
| NC_019922.1 | Norzivirales |
| ... | ... |

The file has (r+1) rows and two columns: one header row plus one row for each of the r sequences in the query fasta file. The first column holds the sequence IDs (accession numbers), and the second column holds the corresponding virus order labels. Sequences whose order label falls outside the range supported by RNAVirHost can still be passed to the host prediction stage; their sequence IDs are written to a separate output file.
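If the order labels come from another program, a file in this layout can be written with a few lines of Python. This is a minimal sketch that assumes the file is comma-separated and that the header of the second column is the literal string `y|virus order`. The helper name `write_taxa_table` is hypothetical.

```python
import csv
import io
from typing import Dict, TextIO

# Header of the label column as shown in the table above (assumed literal).
ORDER_COLUMN = "y|virus order"


def write_taxa_table(orders: Dict[str, str], handle: TextIO) -> int:
    """Write one header row plus one row per sequence; return the row count."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["", ORDER_COLUMN])  # the first header cell is empty
    for seq_id, order in orders.items():
        writer.writerow([seq_id, order])
    return len(orders) + 1  # r sequences plus the header row


buffer = io.StringIO()
n_rows = write_taxa_table(
    {"NC_000858.1": "Ortervirales", "NC_019922.1": "Norzivirales"}, buffer
)
print(buffer.getvalue())
```

In practice the handle would be an open file, for example `open("RVH_taxa.csv", "w", newline="")`, so that the result can be passed to `--taxa`.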

For user convenience, we provide a simple alignment-based method for classifying virus sequences at the order level using BLASTN. The code for this method is shown below. Generally, the order-level predictions provided by BLASTN are sufficiently accurate. However, if users desire to refine the classification using other programs, they can follow the file format outlined in the table above.

rnavirhost classify_order [-i INPUT_CONTIG] [-o TAXONOMIC_RESULT]

The input files include the fasta file and the corresponding taxonomic information table.

rnavirhost predict [-i INPUT_CONTIG] [--taxa TAXONOMIC_INFORMATION_OF_INPUT] [-o OUTPUT_DIRECTORY]

Program Options

classify_order:

-i: The input contig file in fasta format.
-o: The taxonomic information of query viruses at order level (default: RVH_taxa.csv).

predict:

-i, --input: The input contig file in fasta format.
--taxa: The input csv file corresponding to the virus order labels of the queries (default: RVH_taxa.csv).
-o, --output: The output directory (default: RVH_result).

Output

The output format is as:

|  | y\|virus order | pred\|L1 | pred\|L2 | evidence |
| --- | --- | --- | --- | --- |
| NC_000858.1 | Ortervirales | Chordata | Primates | pred_high_confidence |
| NC_001426.1 | Norzivirales | Bacteria |  | assign |
| ... | ... | ... | ... | ... |
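A result table with these columns can be summarised with the standard library. The sketch below assumes the table is comma-separated and uses an inline sample in place of a real result file. The function name `load_predictions` and the `id` key given to the unnamed first column are hypothetical.

```python
import csv
import io
from collections import Counter

# Inline stand-in for an RNAVirHost result table (assumed comma-separated).
SAMPLE = """,y|virus order,pred|L1,pred|L2,evidence
NC_000858.1,Ortervirales,Chordata,Primates,pred_high_confidence
NC_001426.1,Norzivirales,Bacteria,,assign
"""


def load_predictions(handle):
    """Return one dict per query, keyed by the column names shown above."""
    reader = csv.reader(handle)
    header = next(reader)
    header[0] = "id"  # the first header cell is empty in the table
    return [dict(zip(header, row)) for row in reader]


rows = load_predictions(io.StringIO(SAMPLE))
# Count the queries per evidence label and list the high-confidence ones.
evidence_counts = Counter(row["evidence"] for row in rows)
high_confidence = [
    row["id"] for row in rows if row["evidence"] == "pred_high_confidence"
]
print(evidence_counts, high_confidence)
```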

The evidence column takes one of 5 labels: pred_high_confidence, pred_low_confidence, assign, BLASTn, and unclassified. The "pred" prefix indicates that the prediction was made by the learning model. If the prediction score is higher than the built-in score cutoff, we regard it as a high-confidence prediction and report "pred_high_confidence"; otherwise, we report "pred_low_confidence". The "assign" evidence means that the reference viruses in the same order as the query virus infect a specific host group, so we assign that host group to the query directly, without prediction.
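The cutoff rule for the two "pred" labels amounts to a single comparison, sketched below. The function name `confidence_label` is hypothetical, and the cutoff of 0.8 is a placeholder, not the built-in RNAVirHost value.

```python
def confidence_label(score: float, cutoff: float) -> str:
    """Map a model prediction score to one of the two "pred" evidence labels."""
    # Scores strictly above the cutoff count as high confidence, as described above.
    return "pred_high_confidence" if score > cutoff else "pred_low_confidence"


# 0.8 is a placeholder cutoff, not the built-in RNAVirHost value.
print(confidence_label(0.93, cutoff=0.8))
print(confidence_label(0.55, cutoff=0.8))
```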

In addition, RNAVirHost encodes the query sequences at the protein level. To obtain this representation, we first translate the sequences into proteins with Prodigal. For sequences that do not encode proteins, it is difficult to generate a reliable result, so the learning model does not accept these non-coding sequences. For user convenience, however, we still try to predict their hosts with a best-alignment strategy using BLASTn: we align the query virus against our reference database, assign the host label of the best-hit reference, and report "BLASTn" evidence. Finally, we give the "unclassified" evidence to query viruses that have no order information.
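The best-alignment fallback can be sketched as follows. The sketch assumes BLASTn results in the standard 12-column tabular format (`-outfmt 6`) and ranks hits by bit score; how RNAVirHost itself ranks hits is not stated here, so that choice is an assumption. The function name `best_hit_hosts`, the reference IDs, and the host mapping are hypothetical.

```python
from typing import Dict, Iterable, Optional, Tuple

# Standard 12-column BLAST tabular output (-outfmt 6): the query ID is column 1,
# the subject ID is column 2, and the bit score is column 12.
EXAMPLE_HITS = [
    "q1\tref_A\t91.2\t500\t40\t2\t1\t500\t10\t509\t1e-150\t650",
    "q1\tref_B\t88.0\t480\t55\t3\t1\t480\t20\t499\t1e-120\t540",
    "q2\tref_B\t95.0\t300\t15\t0\t1\t300\t5\t304\t1e-100\t480",
]


def best_hit_hosts(
    blast_lines: Iterable[str], reference_hosts: Dict[str, str]
) -> Dict[str, Optional[str]]:
    """Assign each query the host label of its highest-bit-score reference hit."""
    best: Dict[str, Tuple[str, float]] = {}
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    # References without a known host map to None.
    return {q: reference_hosts.get(s) for q, (s, _) in best.items()}


# Hypothetical reference-to-host mapping, for illustration only.
hosts = best_hit_hosts(EXAMPLE_HITS, {"ref_A": "Chordata", "ref_B": "Viridiplantae"})
print(hosts)
```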

Example

# create the taxonomic file first
rnavirhost classify_order -i test/test.fasta 

# predict the host of query viruses
rnavirhost predict -i test/test.fasta -o RVH_result
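The two steps can also be chained from a Python script. The sketch below only uses the subcommands and options documented above. The helper names `build_commands` and `run_pipeline` are hypothetical, and `run_pipeline` is defined but not called, since it needs `rnavirhost` to be installed and on the PATH.

```python
import subprocess
from typing import List


def build_commands(
    fasta: str, taxa: str = "RVH_taxa.csv", outdir: str = "RVH_result"
) -> List[List[str]]:
    """Return the classify_order and predict invocations documented above."""
    return [
        ["rnavirhost", "classify_order", "-i", fasta, "-o", taxa],
        ["rnavirhost", "predict", "-i", fasta, "--taxa", taxa, "-o", outdir],
    ]


def run_pipeline(fasta: str) -> None:
    """Run both steps in order; requires rnavirhost to be installed and on PATH."""
    for command in build_commands(fasta):
        subprocess.run(command, check=True)  # stop at the first failing step


# Print the commands instead of running them.
commands = build_commands("test/test.fasta")
for command in commands:
    print(" ".join(command))
```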
