Question about the input of extract_codon_alignment.py #194

SWei2333 · 2024-12-16T14:52:49Z

Dear professor,

I am attempting to obtain multi-sequence alignments of orthologous genes using the results from TOGA. Some of the species are from my own analysis, while others are from Zoonomia. I have two questions:

(1) I ran a test with a few species, but I encountered the error message: "# Warning! TOGA didn't find transcript_id orthologs for the following species." I suspect this may be due to a misunderstanding of the formats for the input_dirs, reference_bed, and transcript_id files. Could you kindly provide some guidance on how to resolve this issue?

(2) Some of the files downloaded from Zoonomia have either been renamed or are missing, which may affect the execution of extract_codon_alignment.py. For instance, there are files like codonAlignments.allCESARexons.fa.gz and codonAlignments.fa.gz. Should I use one of these as codon.fasta? Also, some files like query_isoforms.tsv are missing. Could this be problematic?

I would greatly appreciate any advice you can provide.

Best wishes

SWei2333 · 2024-12-17T03:27:01Z

The first problem is solved: transcript_id is not a file, just an ID.
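To make that concrete, here is a minimal sketch that assembles the command line with the transcript ID passed as a plain string. The flags and argument order are copied from the invocation quoted later in this thread; the helper name `build_command` and the output directory layout are hypothetical, and the sketch only builds the command, it does not run it.

```python
import shlex
from pathlib import Path


def build_command(transcript_id, input_dirs, reference_bed, out_dir, macse_caller):
    """Assemble an extract_codon_alignment.py call for one transcript ID.

    The transcript ID is passed as a plain string argument, not as a file path.
    Flags (-o, -s, --macse_caller) and argument order mirror the command quoted
    in this thread; what -s does is not explained here, it is copied as-is.
    """
    out_path = Path(out_dir) / f"{transcript_id}.fa"  # hypothetical naming scheme
    return [
        "python3", "extract_codon_alignment.py",
        "-o", str(out_path),
        "-s",
        input_dirs, reference_bed, transcript_id,
        "--macse_caller", macse_caller,
    ]


cmd = build_command(
    "ENSMUST00000070533.Xkr4",   # an ID string, as in the thread
    "input_dirs",
    "toga.transcripts.bed",
    "cds.msa",
    "java -jar macse.jar",       # placeholder; substitute the real MACSE jar path
)
print(shlex.join(cmd))  # shell-quoted string, ready to inspect or paste
```

Looping over a list of IDs and calling `build_command` once per ID would give one output file per transcript.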

MichaelHiller · 2024-12-17T09:03:41Z

Hi,

These are probably cases where the transcript (gene) is not annotated as intact (or uncertain loss) but as missing, or where TOGA couldn't find an ortholog. This may happen in fragmented assemblies.
Our multiple codon alignments will miss a few species every now and then, when the gene could not be reliably identified.
Here is the updated README.txt:
Update Nov 2022.
We now also provide files that contain protein and codon alignments that also include exons classified as deleted or missing.
Essentially, these pairwise alignments contain all exon alignments after the CESAR (transcript alignment) step but before the exon classification step.

codonAlignments.allCESARexons.fa.gz
proteinAlignments.allCESARexons.fa.gz

In other words, pls use the new codonAlignments.fa.gz
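For readers who want to inspect what a downloaded codonAlignments.fa.gz contains for a given transcript, a minimal sketch follows. It assumes only that the file is gzipped FASTA and that the transcript ID appears somewhere in the header line; the exact header layout is not stated in this thread, so the match is a simple substring test (an ID that is a prefix of another ID would also match). The function names are hypothetical.

```python
import gzip
from typing import Iterator, List, Tuple


def read_fasta(handle) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an open text-mode FASTA handle."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def records_for_transcript(path: str, transcript_id: str) -> List[Tuple[str, str]]:
    """Return all records whose header contains transcript_id (substring match)."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as fh:
        return [(h, s) for h, s in read_fasta(fh) if transcript_id in h]
```

An empty result for a species would be consistent with the situation described above, where the gene could not be reliably identified in that assembly.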

SWei2333 · 2024-12-18T09:28:44Z

Thank you for your reply. I would like to ask another question.

I want to get a CDS multiple sequence alignment file for single-copy orthologous genes across multiple species. I used this script, cleaned the output with HmmCleaner, and filtered based on the integrity of start codons. The remaining genes now include those with alternatively spliced transcripts. I only need to keep the most suitable transcript, right? And I no longer need to filter for premature gene termination or gene duplications, correct?

python3 extract_codon_alignment.py -o ./cds.msa/ENSMUST00000070533.Xkr4.fa -s input_dirs toga.transcripts.bed ENSMUST00000070533.Xkr4 --macse_caller "java -jar /home/software/MACSE_V2_PIPELINES-11.05/UTILS/macse_v2.03.jar"
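The thread does not define what "most suitable" means, so the following is only one possible sketch of the filtering described above. It assumes that alignments are held in memory as `transcript_id -> {species: aligned_sequence}`, that gaps are written as `-`, that an intact start codon means the first ungapped codon is ATG, and that the preferred transcript per gene is the one with the most species passing that check (ties broken by total ungapped length). All names and the selection criterion are hypothetical.

```python
from typing import Dict, Optional


def has_intact_start(aligned_seq: str, start_codon: str = "ATG") -> bool:
    """True if the first ungapped codon equals start_codon (assumed criterion)."""
    ungapped = aligned_seq.replace("-", "").upper()
    return ungapped.startswith(start_codon)


def pick_transcript(alignments: Dict[str, Dict[str, str]]) -> Optional[str]:
    """Pick one transcript for a gene.

    alignments maps transcript_id -> {species: aligned_sequence}.
    Score = (number of species with an intact start, total ungapped length
    of those sequences). This scoring is an assumption, not TOGA's method.
    """
    if not alignments:
        return None

    def score(tid: str):
        kept = [s for s in alignments[tid].values() if has_intact_start(s)]
        return (len(kept), sum(len(s.replace("-", "")) for s in kept))

    # sorted() makes tie-breaking deterministic (alphabetically first ID wins)
    return max(sorted(alignments), key=score)


gene_alignments = {
    "T1": {"speciesA": "ATGAAA---", "speciesB": "CTGAAATTT"},
    "T2": {"speciesA": "ATGAAATTT", "speciesB": "ATG---TTT"},
}
best = pick_transcript(gene_alignments)
```

Other criteria, such as preferring the longest reference transcript, would slot into `score` in the same way.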
