TEstrainer

A pipeline to dramatically improve repeat libraries before genome annotation through curation of each repeat, better identification of satellite repeats, removal of multi-copy genes and reclassification of cleaned libraries.

TEstrainer can be used on both de novo libraries (generated by tools such as RepeatModeler, REPET, EDTA, etc.) and libraries from databases (e.g. Repbase, Dfam).

Example command

./TEstrainer -l "seq/Camellia sinensis-families.fa" -g seq/ASM1731120v1.fasta -t 64 -r 12 -C

The iterative curation of repeats is intended for de novo libraries and lengthens repeat consensus sequences to as close to full length as possible. This lengthening is accomplished by repeating the following procedure (developed from ):
  1. Using BLAST to identify copies of the repeat in the source genome
  2. Extending the coordinates of these copies and extracting the sequences at the extended coordinates
  3. Locally aligning the extracted sequences to each other to identify regions of the flanks which do not align
  4. Trimming off these flanks
  5. Creating a multiple sequence alignment (MSA) of the trimmed sequences
  6. Creating a consensus sequence from the MSA, ensuring coverage of at least three nucleotides along the length of the sequences
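
As an illustration of steps 2 and 6, the following is a minimal Python sketch, not TEstrainer's actual code. It assumes 0-based half-open coordinates, a simple majority-rule consensus, and that "coverage of at least three nucleotides" means an alignment column is kept only when at least three sequences have a base there. The function names `extend_interval` and `consensus_from_msa` and the `flank` parameter are hypothetical.

```python
from collections import Counter


def extend_interval(start, end, flank, seq_len):
    """Extend a 0-based, half-open interval by `flank` bases on each side,
    clipping the result to the bounds of the sequence it lies on."""
    return max(0, start - flank), min(seq_len, end + flank)


def consensus_from_msa(rows, min_coverage=3, gap="-"):
    """Majority-rule consensus of aligned rows (assumed rule, for illustration).

    Columns in which fewer than `min_coverage` rows have a non-gap base
    are skipped, so poorly supported flanks do not enter the consensus.
    """
    if len({len(row) for row in rows}) > 1:
        raise ValueError("all MSA rows must have the same length")
    consensus = []
    for column in zip(*rows):
        bases = [base.upper() for base in column if base != gap]
        if len(bases) < min_coverage:
            continue  # too few sequences cover this column
        consensus.append(Counter(bases).most_common(1)[0][0])
    return "".join(consensus)


if __name__ == "__main__":
    msa = ["ACGT-", "ACGTA", "ACCT-", "-CGT-"]
    print(extend_interval(10, 20, 50, 1000))
    print(consensus_from_msa(msa))
```

In this toy alignment the last column is covered by a single sequence, so it is dropped from the consensus.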

If a repeat's new consensus is more than 25% longer than its previous consensus sequence and aligns to the starting consensus sequence, this cycle is repeated.
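
The criterion for repeating the cycle can be written as a small predicate. This is a sketch under stated assumptions rather than the pipeline's implementation: the function name `should_repeat_cycle` is hypothetical, and whether the new consensus aligns to the starting consensus is passed in as a boolean instead of being computed.

```python
def should_repeat_cycle(previous_len, new_len, aligns_to_start, min_growth=0.25):
    """Return True when another extension cycle is warranted.

    The new consensus must be strictly more than `min_growth` (25% by
    default) longer than the previous one and must still align to the
    starting consensus (supplied here as a boolean).
    """
    if previous_len <= 0:
        raise ValueError("previous_len must be positive")
    return aligns_to_start and new_len > previous_len * (1 + min_growth)


if __name__ == "__main__":
    print(should_repeat_cycle(1000, 1300, aligns_to_start=True))
    print(should_repeat_cycle(1000, 1200, aligns_to_start=True))
```

Growth of exactly 25% does not trigger another cycle, because the text requires the new consensus to be more than 25% longer.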

Required packages and databases

BLAST+ - Altschul, S.F. et al. (1997). https://doi.org/10.1089/10665270050081478

MAFFT - Katoh, K. and Standley, D.M. (2013). https://doi.org/10.1093/molbev/mst010

TRF - Benson, G. (1999). https://doi.org/10.1093/nar/27.2.573

mreps - Kolpakov, R. et al. (2003). https://doi.org/10.1093/nar/gkg617

sa-ssr - Pickett, B.D. (2016). https://doi.org/10.1093/bioinformatics/btw298

GNU Parallel - Tange, O. (2011). The USENIX Magazine, February 2011:42-47.

cd-hit (Optional, used for clustering) - Li, W. (2006). https://doi.org/10.1093/bioinformatics/btl158

RepeatModeler (Optional, used for reclassifying) - Flynn, J.M. (2020). https://doi.org/10.1073/pnas.1921046117

Python3 packages

pyranges - Stovner, E.B. and Sætrom, P. (2020). https://doi.org/10.1093/bioinformatics/btz615

pyfaidx - Shirley, M.D. et al. (2015). https://doi.org/10.7287/peerj.preprints.970v1

biopython - Cock, P.A. et al. (2009). https://doi.org/10.1093/bioinformatics/btp163

pandas - McKinney, W. (2010). https://doi.org/10.25080/Majora-92bf1922-00a

numpy - Harris, C.R. et al. (2020). https://doi.org/10.1038/s41586-020-2649-2
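
To confirm that the Python packages above are importable before running the pipeline, a check along the following lines may help. It is a hedged sketch, not part of TEstrainer: the helper name `missing_python_packages` is hypothetical, and the only package-specific detail it relies on is that biopython is imported as `Bio`.

```python
import importlib.util

# Package name -> top-level module name used for import.
REQUIRED_MODULES = {
    "pyranges": "pyranges",
    "pyfaidx": "pyfaidx",
    "biopython": "Bio",
    "pandas": "pandas",
    "numpy": "numpy",
}


def missing_python_packages(modules=REQUIRED_MODULES):
    """Return the sorted names of packages whose module cannot be found."""
    return sorted(
        name
        for name, module in modules.items()
        if importlib.util.find_spec(module) is None
    )


if __name__ == "__main__":
    missing = missing_python_packages()
    if missing:
        print("Missing Python packages: " + ", ".join(missing))
    else:
        print("All required Python packages were found.")
```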

R packages

optparse

tidyverse - Wickham, H. et al. (2019). https://doi.org/10.21105/joss.01686

BSgenome - Pagès, H. (2022). https://bioconductor.org/packages/BSgenome

plyranges - Lee, S. et al. (2019). https://doi.org/10.1186/s13059-018-1597-8

Database

NCBI CDD - Lu, S. et al. (2020). https://doi.org/10.1093/nar/gkz991