Current version: v0.1.1
skandiver is a program for identifying mobile genetic elements (prophages, plasmids, transposases, etc.) in assembled whole genome sequences using average nucleotide identity (ANI), genome fragmentation, and evolutionary divergence time. skandiver finds putative mobile genetic elements without gene annotation or training data, and can query large datasets of hundreds of assemblies or more within minutes.
- skani (Version 0.2.1 or higher)
- Database of representative genomes (Recommended GTDB Database, see below for installation details)
- Python 3 (with pandas and bio packages)
skandiver uses skani (developed by Jim Shaw, https://github.com/bluenote-1577/skani), a scalable and robust search tool for computing average nucleotide identity between whole genomes. skani can be installed using conda via:
conda install -c bioconda skani
Alternatively, a binary version of skani can be downloaded for x86-64 Linux systems via:
wget https://github.com/bluenote-1577/skani/releases/download/latest/skani
chmod +x skani
./skani -h
skani search requires a database of representative genomes to query against. The current recommended database is the Genome Taxonomy Database (GTDB), which contains >85,000 representative genomes. To set up this database, first ensure that the following requirements are met:
- skani is installed and in PATH (i.e. typing skani -h works). Visit https://github.com/bluenote-1577/skani for more information on setting up skani.
- ~120 GB free disk space is available for the uncompressed database and indexing.
First, download the compressed GTDB database and unzip it:
wget https://data.gtdb.ecogenomic.org/releases/release214/214.1/genomic_files_reps/gtdb_genomes_reps_r214.tar.gz
tar -xf gtdb_genomes_reps_r214.tar.gz
The GTDB archive stores its reference genome files in nested subdirectories, so they have to be gathered before indexing. Run the following to collect all genome file locations into a file called gtdb_file_names.txt:
find gtdb_genomes_reps_r214/ | grep '\.fna' > gtdb_file_names.txt
Finally, we can construct the indexed database to query against using:
skani sketch -l gtdb_file_names.txt -o gtdb_skani_database_ani -t 20
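If you would rather build the file list from Python (for example, to filter it before sketching), the find/grep step can be approximated as follows. This is a minimal sketch, not part of skandiver: the function name collect_genome_paths and its marker parameter are hypothetical, and unlike grep it matches only on the file name, not on the whole path.

```python
from pathlib import Path


def collect_genome_paths(root, list_file, marker=".fna"):
    """Write every file under `root` whose name contains `marker` to `list_file`.

    Hypothetical helper: returns the sorted list of paths that were written,
    one path per line, which is the layout passed to `skani sketch -l`.
    """
    root = Path(root)
    paths = sorted(
        str(p) for p in root.rglob("*")
        if p.is_file() and marker in p.name
    )
    Path(list_file).write_text("".join(f"{p}\n" for p in paths))
    return paths


# Example call (assumes the GTDB archive was extracted in the current directory):
# collect_genome_paths("gtdb_genomes_reps_r214", "gtdb_file_names.txt")
```

The resulting text file can then be passed to skani sketch with -l exactly as in the command above.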
Note: this process of setting up the database of representative genomes can be replicated for any directory of representative fna.gz files. You can create your own custom representative genome database to search against by downloading a set of representative whole genomes from NCBI, Ensembl, RefSeq, etc. Once the directory of genomes has been initialized, simply run:
find [PATH_TO_REP_DIRECTORY/] | grep '\.fna' > customdb_file_names.txt
skani sketch -l customdb_file_names.txt -o customdb_skani_database_ani -t 20
You now have a custom database of representative genomes that skandiver and skani can query against.
Once the three prerequisites (skani, a database of representative genomes, and Python) are in place, you are ready to set up skandiver. To begin, download the skandiver repository and run the setup script:
git clone https://github.com/YoukaiFromAccounting/skandiver
cd skandiver
chmod +x skandiver.sh
bash SETUP.sh
The provided setup script will test your environment for dependencies and download an example data set. You can also install all needed dependencies using the following:
sudo apt-get install python3-pip
pip3 install bio pandas
skandiver is now installed on your system, and can be called using the following command structure:
./skandiver.sh [INPUT_DIRECTORY] [OUTPUT_NAME] [CHUNK_SIZE] [PATH_TO_REPRESENTATIVE_GENOME_DB]
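To make the CHUNK_SIZE argument concrete, the sketch below cuts one sequence into consecutive, non-overlapping windows of that length. It is only an illustration in Python: the function name fragment_sequence and the id:start-end label format are hypothetical, and skandiver's own script may place or label its windows differently.

```python
def fragment_sequence(seq_id, sequence, chunk_size):
    """Yield (label, fragment) pairs for non-overlapping windows of `chunk_size`.

    Labels use 1-based, inclusive coordinates; the final window may be shorter
    than `chunk_size` when the sequence length is not an exact multiple.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    for start in range(0, len(sequence), chunk_size):
        piece = sequence[start:start + chunk_size]
        yield f"{seq_id}:{start + 1}-{start + len(piece)}", piece


# Toy example: a 10-base sequence cut into windows of 4 bases.
for label, piece in fragment_sequence("contig1", "ACGTACGTAC", 4):
    print(label, piece)
```

A smaller chunk size gives finer localization of a candidate element at the cost of more fragments to search.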
You can test skandiver against a sample whole genome assembly of Acinetobacter baumannii by executing the following command:
./skandiver.sh test_files/abaumannii results 10000 [PATH_TO_REPRESENTATIVE_GENOME_DB]
For example, if you followed the above instructions for setting up the GTDB database of representative genomes in the skandiver directory, you can run:
./skandiver.sh test_files/abaumannii results 10000 gtdb_skani_database_ani
This should output four files: results.txt, resultsskani.txt, resultsskanifiltered.txt, and resultssearch.fna. results.txt contains the summary of potential mobile genetic elements found by skandiver, while resultsskani.txt and resultsskanifiltered.txt contain the skani search results for the query whole genome assembly (resultsskanifiltered.txt keeps only genome matches with greater than 95% average nucleotide identity and greater than 90% align fraction). resultssearch.fna contains the entire fragmented genome assembly used for the skani search.
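The filtering step described above can be sketched in a few lines of Python. Treat this as an assumption-laden illustration rather than skandiver's actual code: the column names (ANI, Align_fraction_ref, Align_fraction_query) are assumed to match skani's tab-separated search output, the choice of the query-side align fraction is a guess, the function name filter_hits is hypothetical, and the sample rows are invented.

```python
import csv
import io

# Invented rows for illustration only; the header is an assumed skani search layout.
SAMPLE = (
    "Ref_file\tQuery_file\tANI\tAlign_fraction_ref\tAlign_fraction_query\tRef_name\tQuery_name\n"
    "refA.fna\tquery.fna\t99.10\t0.40\t97.50\tspecies_A\tchunk_1\n"
    "refB.fna\tquery.fna\t96.20\t0.30\t62.00\tspecies_B\tchunk_1\n"
    "refC.fna\tquery.fna\t91.70\t0.50\t98.00\tspecies_C\tchunk_2\n"
)


def filter_hits(tsv_text, min_ani=95.0, min_af=90.0, af_column="Align_fraction_query"):
    """Keep rows whose ANI and align fraction both exceed the thresholds."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        row for row in reader
        if float(row["ANI"]) > min_ani and float(row[af_column]) > min_af
    ]


kept = filter_hits(SAMPLE)
print([row["Ref_name"] for row in kept])  # only species_A passes both thresholds
```

To apply the same filter to a real run, read resultsskani.txt into a string and pass it to filter_hits, after checking that its header matches the assumed column names.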
The results file looks like the following for a sample whole genome assembly of Pseudomonas aeruginosa:
GenomeID/AccessionNumber QuerySpecies GenomePosition NumberofHits TotalDivergence AverageDivergence RefSpeciesHits
LFMS01000010.1 Pseudomonas_aeruginosa 46306-56305 2 0.00101 0.000505 Pseudomonas_taiwanensis, Pseudomonas_jinjuensis
LFMS01000011.1 Pseudomonas_aeruginosa 1662427-1672426 8 4954.9287 619.3660875 Stutzerimonas_stutzeri, Cronobacter_muytjensii, Cronobacter_universalis, Pseudomonas_putida, Pseudomonas_mosselii, Pseudomonas_saponiphila, Achromobacter_xylosoxidans
LFMS01000011.1 Pseudomonas_aeruginosa 1672427-1682426 6 1992.8886999999997 332.1481166666666 Stutzerimonas_stutzeri, Pseudomonas_putida, Pseudomonas_mosselii, Pseudomonas_saponiphila, Achromobacter_xylosoxidans
- GenomeID/AccessionNumber: the unique sequence identifier (accession number) of the query sequence that contains the fragment.
- QuerySpecies: the species name of the query assembly.
- GenomePosition: the coordinate range of the fragment of the whole genome assembly estimated to contain the mobile genetic element.
- NumberofHits: the number of unique species that the query fragment mapped to with >95% ANI and >90% align fraction (extremely high degree of similarity).
- TotalDivergence: the total divergence time for all species the query fragment mapped to, in millions of years.
- AverageDivergence: the average divergence time per species the query fragment mapped to, in millions of years.
- RefSpeciesHits: the reference species that the query fragment mapped to.
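In the example rows above, AverageDivergence appears to be TotalDivergence divided by NumberofHits. The short check below reproduces that arithmetic from the three rows; the function name average_divergence is a hypothetical helper, not a skandiver function.

```python
import math

# (GenomePosition, NumberofHits, TotalDivergence, AverageDivergence),
# copied from the example results rows above.
EXAMPLE_ROWS = [
    ("46306-56305", 2, 0.00101, 0.000505),
    ("1662427-1672426", 8, 4954.9287, 619.3660875),
    ("1672427-1682426", 6, 1992.8886999999997, 332.1481166666666),
]


def average_divergence(total_divergence, number_of_hits):
    """Mean divergence time per hit, in millions of years."""
    return total_divergence / number_of_hits


for position, hits, total, reported in EXAMPLE_ROWS:
    computed = average_divergence(total, hits)
    # Compare with a relative tolerance to absorb floating-point rounding.
    print(position, computed, math.isclose(computed, reported, rel_tol=1e-9))
```

Fragments with a high average divergence map to distantly related species, which is the signal skandiver reports as a potential mobile genetic element.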
As skandiver is considerably faster than gene annotation-based mobile element finders, you can bulk download a large set of whole genome assemblies in .fna or .fasta format (compressed or uncompressed both work) into the [INPUT_DIRECTORY] of skandiver to perform efficient analysis of potential mobile genetic elements in metagenomic data.
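Before a bulk run it can help to check what is actually in the input directory. The sketch below lists candidate assemblies and counts the FASTA records in each, reading gzip-compressed and plain files alike. It is a pre-flight illustration under stated assumptions: the suffix list, the use of .gz as the compression format, and the names list_assemblies and count_records are all hypothetical and are not taken from skandiver itself.

```python
import gzip
from pathlib import Path

# Assumed set of accepted file endings; adjust to match your data.
SUFFIXES = (".fna", ".fasta", ".fna.gz", ".fasta.gz")


def list_assemblies(input_dir):
    """Return sorted paths of files in `input_dir` that end in an accepted suffix."""
    return sorted(
        p for p in Path(input_dir).iterdir()
        if p.is_file() and p.name.endswith(SUFFIXES)
    )


def count_records(path):
    """Count FASTA header lines ('>') in a plain or gzip-compressed file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return sum(1 for line in handle if line.startswith(">"))


# Example call (assumes the test data from the setup script is present):
# for path in list_assemblies("test_files/abaumannii"):
#     print(path.name, count_records(path))
```

An empty listing or a file with zero records usually points to a wrong path or an incomplete download, which is cheaper to catch here than after a long search.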
Brian Zhang, xiaoleiz@andrew.cmu.edu (Contributing author)
Grace Oualline, gouallin@andrew.cmu.edu (Contributing author)
We would like to express our gratitude to the following individuals and organizations for their major contributions and support in the development of skandiver:
- Jim Shaw (https://github.com/bluenote-1577) for the creation and continuous support of skani, a fundamental tool utilized by skandiver for ANI computations, as well as providing valuable guidance regarding the overall quality and usability of skandiver.
- Yun William Yu (https://github.com/yunwilliamyu) for providing algorithmic support and troubleshooting expertise, greatly improving skandiver's efficiency.
This implementation of skandiver was based on the ideas and software from the following paper:
Shaw, J., & Yu, Y. W. (2023). Fast and robust metagenomic sequence comparison through sparse chaining with skani. Nature Methods, 20(11), 1661–1665. https://doi.org/10.1038/s41592-023-02018-3