TEstrainer

A pipeline to improve repeat libraries before genome annotation through curation of each repeat, better identification of satellite repeats, removal of multi-copy genes, and reclassification of the cleaned libraries.

TEstrainer can be used both on de novo libraries (generated by tools such as RepeatModeler, REPET, EDTA, etc.) and on libraries from databases (e.g. Repbase, Dfam).

Example command

./TEstrainer -l "seq/Camellia sinensis-families.fa" -g seq/ASM1731120v1.fasta -t 64 -r 12 -C

The iterative curation of repeats is intended for de novo libraries and extends repeat consensus sequences to as close to full length as possible. This lengthening is accomplished by repeating the following procedure (developed from ):

1. Using BLAST to identify copies of the repeat in the source genome
2. Extending the coordinates of these copies and extracting the extended sequences
3. Locally aligning the extracted sequences to each other to identify flanking regions that do not align
4. Trimming off these flanks
5. Creating a multiple sequence alignment (MSA) of the trimmed sequences
6. Creating a consensus sequence from the MSA, ensuring coverage of at least three nucleotides at every position along its length
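
The final step above can be illustrated with a minimal sketch. This is not TEstrainer's own code: the function name `consensus_with_min_coverage`, the majority-rule base call, and the choice to drop under-covered columns are all assumptions made for illustration. It only shows the idea of keeping alignment columns covered by at least three sequences.

```python
from collections import Counter


def consensus_with_min_coverage(aligned, min_coverage=3, gap="-"):
    """Majority-rule consensus over an MSA given as equal-length strings.

    Hypothetical helper: columns with fewer than `min_coverage` non-gap
    characters are dropped rather than called.
    """
    if not aligned:
        return ""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")

    consensus = []
    for column in zip(*aligned):
        # Collect the nucleotides (not gaps) present in this column.
        bases = [base.upper() for base in column if base != gap]
        if len(bases) < min_coverage:
            continue  # too few sequences cover this position
        # Take the most frequent nucleotide in the column.
        consensus.append(Counter(bases).most_common(1)[0][0])
    return "".join(consensus)


msa = [
    "--ACGT--",
    "-AACGTT-",
    "GAACGTTC",
    "--ACGTT-",
]
print(consensus_with_min_coverage(msa))
```

In this toy alignment the outermost columns are covered by only one or two sequences, so they are excluded from the consensus. A real pipeline would typically apply such a threshold to the poorly supported ends of the alignment.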

If a repeat's new consensus is more than 25% longer than its previous consensus sequence and aligns to the starting consensus sequence, this cycle is repeated.
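
The repeat criterion can be written as a small predicate. The sketch below is an illustration under stated assumptions, not the pipeline's implementation: the name `should_repeat_cycle` and its parameters are hypothetical, and the alignment check is reduced to a boolean that the caller is assumed to have computed elsewhere.

```python
def should_repeat_cycle(previous_len, new_len, aligns_to_start,
                        min_growth_percent=25):
    """Return True when another extension cycle is warranted.

    Hypothetical helper: the new consensus must be more than
    `min_growth_percent` longer than the previous one and must still
    align to the starting consensus.
    """
    if previous_len <= 0:
        raise ValueError("previous_len must be positive")
    # Integer arithmetic avoids floating-point rounding at the threshold.
    grew_enough = new_len * 100 > previous_len * (100 + min_growth_percent)
    return grew_enough and aligns_to_start


print(should_repeat_cycle(1000, 1300, True))   # grew by 30% and aligns
print(should_repeat_cycle(1000, 1250, True))   # exactly 25% is not "more than"
print(should_repeat_cycle(1000, 1300, False))  # no longer aligns to the start
```

Requiring the alignment to the starting consensus guards against a consensus that grows only because unrelated flanking sequence was pulled in.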

Required packages and databases

BLAST+ - Altschul, S.F. et al. (1997). https://doi.org/10.1089/10665270050081478

MAFFT - Katoh, K and Standley, D.M. (2013). https://doi.org/10.1093/molbev/mst010

TRF - Benson, G. (1999). https://doi.org/10.1093/nar/27.2.573

mreps - Kolpakov, R et al. (2003). https://doi.org/10.1093/nar/gkg617

sa-ssr - Pickett, B.D. (2016). https://doi.org/10.1093/bioinformatics/btw298

GNU Parallel - Tange, O. (2011). The USENIX Magazine, February 2011:42-47.

cd-hit (Optional, used for clustering) - Li, W. (2006). https://doi.org/10.1093/bioinformatics/btl158

RepeatModeler (Optional, used for reclassifying) - Flynn, J.M. (2020). https://doi.org/10.1073/pnas.1921046117

Python3 packages

pyranges - Stovner, E.B and Sætrom, P. (2020). https://doi.org/10.1093/bioinformatics/btz615

pyfaidx - Shirley MD, et al. (2015). https://doi.org/10.7287/peerj.preprints.970v1

biopython - Cock, P.A. et al. (2009). https://doi.org/10.1093/bioinformatics/btp163

pandas - McKinney, W. (2010). https://doi.org/10.25080/Majora-92bf1922-00a

numpy - Harris, C.R. et al. (2020). https://doi.org/10.1038/s41586-020-2649-2

R packages

optparse

tidyverse - Wickham, H. et al. (2019). https://doi.org/10.21105/joss.01686

BSgenome - Pagès H (2022). https://bioconductor.org/packages/BSgenome.

plyranges - Lee, S. et al. (2019). https://doi.org/10.1186/s13059-018-1597-8

Database

NCBI CDD - Lu, S. et al (2020). https://doi.org/10.1093/nar/gkz991
