ORFograph: ORF search in assembly graphs

A pipeline for generating potential gene sequences (ORFs, Open Reading Frames) from assembly graphs.

It combines two graph alignment tools, PathRacer and SPAligner, and uses their output as initial anchors to search for full gene sequences in assembly graphs.

Installation

The easiest way to install orf-search is via conda:

conda create -n orf_search -c tatianadvorkina -c conda-forge -c bioconda python=3.6 orf-search
conda activate orf_search

Alternatively, you can clone the git repository and install all dependencies yourself.

The main pipeline is written in Python3 and depends on the libraries and tools listed below (see also the requirements.txt file).

Python3
- biopython
- pyyaml
- edlib
- joblib
- argparse
- Mummer4
- HMMer
PathRacer
SPAligner

The PathRacer and SPAligner executables must be placed in the aligners/ folder. Suitable versions of both tools can be built from the archives above using the following instructions.

PathRacer:

cd spades-0.5-recomb/assembler/
mkdir build && cd build && cmake ../src
make pathracer

SPAligner:

cd spades-spaligner-paper/assembler/
mkdir build && cd build && cmake ../src
make spaligner

After building, both executables can be found in the build/bin/ folder.

Check that the installation was successful by running:

orf_search.py --test

Output

The output of the test run will be saved in the ./tiny_dataset_test/ folder:

ricinb_lectin2/                    PathRacer run results
toxin/                             SPAligner run results
orfs_raw.fasta                     Full list of ORFs found in the assembly graph
orfs_total.fasta                   List of ORFs after initial filtering
orfs_graphonly.fasta               List of ORFs that can be found only in the graph (not in contigs)
orfs_novel.fasta                   List of novel ORFs (not present in the list of given protein sequences)
orfs_final_clustered.fasta         List of ORFs clustered with 90% identity
orfs_final_most_reliable.fasta     List of representatives for each cluster (usually contains 2-3 sequences per cluster)

Running

Synopsis:

orf_search.py -m HMMS -g GRAPH -o OUT [-s SEQUENCES] [-r] [-c CONTIGS] [-f] [-t THREADS] [-a]

Main parameters are:

-m HMMS list of HMMs in HMMer format that represent domains for PathRacer input

-g GRAPH path to the assembly graph in GFA format; it can contain paths (P lines) that will be used in filtering

-o OUT output directory name

-s SEQUENCES list of known IPG sequences

-r run IPG sequence-to-graph alignment (may be time-consuming)

-c CONTIGS contig sequences in a FASTA file

-t THREADS number of threads (default: 1)

-a do not perform filtering based on contigs or known IPGs
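
As an illustration of how the options above fit together, here is a minimal Python sketch that assembles an orf_search.py command line. The helper name build_orf_search_command and all file names are hypothetical, and passing the HMM list as a single path is an assumption. The -f flag from the synopsis is left out because its meaning is not described here.

```python
import shlex


def build_orf_search_command(hmms, graph, out_dir, sequences=None, contigs=None,
                             run_alignment=False, threads=1, skip_filtering=False):
    """Build an argument list for orf_search.py from the documented options.

    Hypothetical helper for illustration only; it does not run the pipeline.
    """
    # Required options: HMMs (-m), assembly graph (-g), output directory (-o).
    cmd = ["orf_search.py", "-m", hmms, "-g", graph, "-o", out_dir]
    if sequences:
        cmd += ["-s", sequences]      # known IPG sequences
    if run_alignment:
        cmd.append("-r")              # align IPG sequences to the graph
    if contigs:
        cmd += ["-c", contigs]        # contigs in FASTA format
    cmd += ["-t", str(threads)]       # number of threads
    if skip_filtering:
        cmd.append("-a")              # skip contig/IPG based filtering
    return cmd


# Hypothetical input files; replace with your own paths.
command = build_orf_search_command(
    "domains.hmm", "assembly_graph.gfa", "orf_out",
    sequences="known_ipg.fasta", contigs="contigs.fasta", threads=4,
)
print(" ".join(shlex.quote(arg) for arg in command))
```

The resulting list can be passed to subprocess.run(command, check=True) once orf_search.py is on the PATH.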

Main algorithm

Aligning insecticide proteins/HMMs to the assembly graph. Our method uses SPAligner (Dvorkina et al., 2019) to align insecticide proteins to the assembly graph and retains all alignments with length exceeding 80% of the protein length. It also uses PathRacer (Shlemov and Korobeynikov, 2019) to align HMMs to the assembly graph and retains all alignments with e-value below 10^-9 and length exceeding 90% of the HMM length.
Start and stop codon search. For each alignment, our pipeline finds all putative start and stop codons in the assembly graph using breadth-first search (BFS). The BFS is performed on a graph where each vertex represents a pair: a position in the assembly graph and a frameshift string of length 0, 1, or 2 that stores the prefix of the current codon triplet. Information about sequences with start codons that are positioned after a stop codon in the graph is reflected in the CDS file.
CDS generation. For each start/stop codon of the partial alignment, toxinSPAdes generates a set of paths that lead from the start/stop codon position to the ends of the alignment. We run a bounded exhaustive search through all prefixes/suffixes (limited to generating at most 1000 paths). For each pair of start and stop codons, all prefixes and suffixes are concatenated with the partial alignment, and the resulting path (in nucleotides) is converted to the corresponding protein sequence. Finally, duplicate protein sequences are filtered out, and sequences that are found in a single contig (optionally) or that represent known genes are excluded from the final output.
ORFs filtering, clustering, and selecting representative ORFs. All paths conflicting with some contig-paths are filtered out from the list of putative ORF paths and a set of representative ORFs is formed.
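
The state space used in the start and stop codon search step above can be illustrated with a small Python sketch. It is not the pipeline's code. It assumes a toy graph in which segments are plain nucleotide strings joined by links without overlaps (graphs produced by assemblers usually have k-mer overlaps), it follows the forward strand only, and it searches only for stop codons downstream of a given position. The function name find_stop_codons and the max_states bound are hypothetical.

```python
from collections import deque

STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_stop_codons(segments, links, start_seg, start_off, max_states=10000):
    """BFS for in-frame stop codons reachable from (start_seg, start_off).

    segments: dict mapping segment id -> nucleotide string
    links:    dict mapping segment id -> list of successor segment ids
    Returns a sorted list of (segment, offset) pairs marking the last base
    of each stop codon found.
    """
    # A BFS state is (segment, offset, prefix), where prefix holds the
    # 0, 1, or 2 bases of the codon that is currently being read.
    start = (start_seg, start_off, "")
    seen = {start}
    queue = deque([start])
    stops = set()
    while queue and len(seen) <= max_states:
        seg, off, prefix = queue.popleft()
        if off == len(segments[seg]):
            # End of segment: continue in every successor, keeping the prefix.
            next_states = [(succ, 0, prefix) for succ in links.get(seg, [])]
        else:
            codon = prefix + segments[seg][off]
            if len(codon) == 3:
                if codon in STOP_CODONS:
                    stops.add((seg, off))
                    continue  # this reading frame ends here
                codon = ""  # a full non-stop codon was read, start a new one
            next_states = [(seg, off + 1, codon)]
        for state in next_states:
            if state not in seen:
                seen.add(state)
                queue.append(state)
    return sorted(stops)


# Toy graph: segment "a" branches into "b" and "c".
toy_segments = {"a": "ATGGC", "b": "TTAAGG", "c": "ATGACC"}
toy_links = {"a": ["b", "c"]}
print(find_stop_codons(toy_segments, toy_links, "a", 0))
```

Because the codon prefix is part of the state, the same graph position can be visited in different reading frames, while the seen set guarantees termination on graphs with cycles.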

Contacts

For any questions or suggestions, please do not hesitate to contact Tatiana Dvorkina (tedvorkina@gmail.com).
