A pipeline to dramatically improve repeat libraries before genome annotation through curation of each repeat, better identification of satellite repeats, removal of multi-copy genes and reclassification of cleaned libraries.
TEstrainer can be used on both de novo libraries (generated by tools such as RepeatModeler, REPET, EDTA, etc.) and libraries from databases (e.g. Repbase, Dfam).
Example command
./TEstrainer -l seq/Camellia sinensis-families.fa -g seq/ASM1731120v1.fasta -t 64 -r 12 -C
The iterative curation of repeats is intended for de novo libraries and extends repeat consensus sequences to as close to full length as possible. This lengthening is accomplished by repeating the following procedure (developed from ):
- Using BLAST to identify copies of the repeats in the source genome
- Extending their coordinates and extracting the sequences at the extended coordinates
- Locally aligning the extracted sequences to each other to identify regions of the flanks which do not align
- Trimming off these flanks
- Creating a multiple sequence alignment (MSA) of the trimmed sequences
- Creating a consensus sequence from the MSA, ensuring coverage of at least three nucleotides along the length of the sequences
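As a rough illustration of two of the steps above (extending coordinates and building a coverage-filtered consensus), here is a minimal Python sketch. It is not TEstrainer's own code: the function names, the 1000 bp flank, the majority-rule base call, and the reading of "coverage" as "number of sequences with a non-gap base in an alignment column" are all assumptions made for this example.

```python
from collections import Counter


def extend_interval(start, end, contig_length, flank=1000):
    """Extend a 0-based, half-open hit by `flank` bp on each side.

    The result is clamped to the contig boundaries. The flank size is a
    hypothetical default chosen for illustration only.
    """
    return max(0, start - flank), min(contig_length, end + flank)


def consensus_from_msa(aligned, min_coverage=3, gap="-"):
    """Majority-rule consensus of equal-length aligned sequences.

    Columns in which fewer than `min_coverage` sequences carry a base
    are dropped (assumed interpretation of the coverage requirement).
    """
    if not aligned:
        return ""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")

    consensus = []
    for column in zip(*aligned):
        bases = [base.upper() for base in column if base != gap]
        if len(bases) < min_coverage:
            continue  # too few sequences support this column
        consensus.append(Counter(bases).most_common(1)[0][0])
    return "".join(consensus)


if __name__ == "__main__":
    msa = ["--ACGT-", "A-ACGTA", "--ACGAA", "--ACGT-"]
    print(extend_interval(500, 1500, 2000))
    print(consensus_from_msa(msa))
```

Note that this sketch drops every poorly supported column, including internal ones, whereas a real implementation might only trim the poorly supported ends of the alignment.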
If a repeat's new consensus is more than 25% longer than its previous consensus sequence and aligns to the starting consensus sequence, this cycle is repeated.
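The repeat criterion can be expressed as a small predicate. This is a sketch under assumptions: the function and parameter names are hypothetical, and whether the new consensus aligns to the starting consensus is passed in as a boolean rather than computed, since that check would be done with an aligner.

```python
def should_repeat_cycle(previous_length, new_length, aligns_to_start,
                        min_growth=0.25):
    """Return True if another extension cycle is warranted.

    The new consensus must be more than `min_growth` (25% by default)
    longer than the previous one and must still align to the starting
    consensus (supplied by the caller as `aligns_to_start`).
    """
    grew_enough = new_length > previous_length * (1 + min_growth)
    return bool(aligns_to_start) and grew_enough
```

For example, a consensus growing from 1000 bp to 1251 bp that still aligns to the starting sequence would trigger another cycle, while one growing to exactly 1250 bp would not.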
BLAST+ - Altschul, S.F. et al. (1997). https://doi.org/10.1089/10665270050081478
MAFFT - Katoh, K and Standley, D.M. (2013). https://doi.org/10.1093/molbev/mst010
TRF - Benson, G. (1999). https://doi.org/10.1093/nar/27.2.573
mreps - Kolpakov, R et al. (2003). https://doi.org/10.1093/nar/gkg617
sa-ssr - Pickett, B.D. (2016). https://doi.org/10.1093/bioinformatics/btw298
GNU Parallel - Tange, O. (2011) The USENIX Magazine, February 2011:42-47.
cd-hit (Optional, used for clustering) - Li, W (2006) https://doi.org/10.1093/bioinformatics/btl158
RepeatModeler (Optional, used for reclassifying) - Flynn, J.M. (2020). https://doi.org/10.1073/pnas.1921046117
pyranges - Stovner, E.B and Sætrom, P. (2020). https://doi.org/10.1093/bioinformatics/btz615
pyfaidx - Shirley MD, et al. (2015). https://doi.org/10.7287/peerj.preprints.970v1
biopython - Cock, P.A. et al. (2009). https://doi.org/10.1093/bioinformatics/btp163
pandas - McKinney, W. (2010). https://doi.org/10.25080/Majora-92bf1922-00a
numpy - Harris, C.R. et al. (2020). https://doi.org/10.1038/s41586-020-2649-2
optparse
tidyverse - Wickham, H. et al. (2019). https://doi.org/10.21105/joss.01686
BSgenome - Pagès H (2022). https://bioconductor.org/packages/BSgenome
plyranges - Lee, S. et al. (2019). https://doi.org/10.1186/s13059-018-1597-8
NCBI CDD - Lu, S. et al. (2020). https://doi.org/10.1093/nar/gkz991