Question about the input of extract_codon_alignment.py #194

Open
SWei2333 opened this issue Dec 16, 2024 · 3 comments


@SWei2333

Dear professor,

I am attempting to obtain multi-sequence alignments of orthologous genes using the results from TOGA. Some of the species are from my own analysis, while others are from Zoonomia. I have two questions:

(1) I ran a test with a few species, but I encountered the message: "# Warning! TOGA didn't find transcript_id orthologs for the following species." I suspect this is because I misunderstood the expected formats of the input_dirs, reference_bed, and transcript_id arguments. Could you kindly provide some guidance on how to resolve this issue?

(2) Some of the files downloaded from Zoonomia have either been renamed or are missing, which may affect the execution of extract_codon_alignment.py. For instance, there are two files, codonAlignments.allCESARexons.fa.gz and codonAlignments.fa.gz. Should I use one of these as codon.fasta? Also, some files, such as query_isoforms.tsv, are missing. Could this be a problem?

I would greatly appreciate any advice you can provide.

Best wishes

@SWei2333
Author

The first problem is solved: transcript_id isn't a file, just an ID.

@MichaelHiller
Collaborator

MichaelHiller commented Dec 17, 2024

Hi,

  1. These are probably cases where the transcript (gene) is not annotated as intact (or uncertain loss) but as missing, or where TOGA couldn't find an ortholog. This may happen in fragmented assemblies.
    Our multiple codon alignments will miss a few species every now and then, when the gene could not be reliably identified.

  2. Here is the updated README.tx
    Update Nov 2022.
    We now also provide files that contain protein and codon alignments that also include exons classified as deleted or missing.
    Essentially, these pairwise alignments contain all exon alignments after the CESAR (transcript alignment) step but before the exon classification step.

codonAlignments.allCESARexons.fa.gz
proteinAlignments.allCESARexons.fa.gz

In other words, please use the new codonAlignments.fa.gz
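
As an illustration of the two points above (a sketch, not taken from the TOGA repository), the following Python reads a gzip-compressed FASTA file such as codonAlignments.fa.gz and reports which species have no record for a given transcript ID. It assumes that the transcript ID appears somewhere in the FASTA header; the exact header layout of the Zoonomia files is not stated in this thread, so the substring match and the function names are hypothetical.

```python
import gzip


def read_fasta_gz(path):
    """Yield (header, sequence) pairs from a gzip-compressed FASTA file."""
    header, chunks = None, []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new record starts: emit the previous one, if any.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def records_for_transcript(path, transcript_id):
    """Return records whose header contains transcript_id.

    Assumption: the ID is a substring of the header; adjust the match
    to the real header format of your files.
    """
    return [(h, s) for h, s in read_fasta_gz(path) if transcript_id in h]


def species_missing_transcript(species_to_path, transcript_id):
    """Return the sorted species names that have no record for transcript_id."""
    return sorted(
        species
        for species, path in species_to_path.items()
        if not records_for_transcript(path, transcript_id)
    )
```

Species returned by `species_missing_transcript` would be the ones listed in the warning, for example because the gene is annotated as missing in a fragmented assembly.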

@SWei2333
Author

Thank you for your reply, and I want to ask another question.

I want to get a CDS multiple sequence alignment file for single-copy orthologous genes across multiple species. I used this script, applied HmmCleaner for cleaning, and filtered based on the integrity of start codons. The remaining genes now include some with several alternatively spliced transcripts. I only need to keep the most suitable transcript per gene, right? And I no longer need to filter for premature termination or gene duplications, correct?

python3 extract_codon_alignment.py -o ./cds.msa/ENSMUST00000070533.Xkr4.fa -s input_dirs toga.transcripts.bed ENSMUST00000070533.Xkr4 --macse_caller "java -jar /home/software/MACSE_V2_PIPELINES-11.05/UTILS/macse_v2.03.jar"
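
One possible way to keep a single transcript per gene is sketched below. This is only a heuristic for illustration, not a recommendation from the TOGA maintainers: it prefers the isoform whose alignment covers the most species and breaks ties by the total number of ungapped positions. The input structure (`{transcript_id: {species: aligned_sequence}}`) and the function names are hypothetical, and only `-` is treated as a gap character.

```python
def ungapped_length(seq, gap_chars="-"):
    """Count alignment positions that are not gap characters."""
    return sum(1 for char in seq if char not in gap_chars)


def pick_representative(isoform_alignments):
    """Pick one transcript ID from {transcript_id: {species: aligned_seq}}.

    Heuristic (an assumption, not TOGA's method): most species first,
    then most ungapped positions, then the transcript ID itself so the
    choice is deterministic.
    """
    def score(item):
        transcript_id, alignment = item
        n_species = len(alignment)
        n_ungapped = sum(ungapped_length(s) for s in alignment.values())
        return (n_species, n_ungapped, transcript_id)

    return max(isoform_alignments.items(), key=score)[0]


# Usage with made-up isoform names and sequences:
example = {
    "isoformA": {"sp1": "ATG---", "sp2": "ATGAAA"},
    "isoformB": {"sp1": "ATGAAA"},
}
best = pick_representative(example)  # isoformA: it covers more species
```

Whether species coverage or alignment completeness should weigh more depends on the downstream analysis, so the scoring tuple is the part to adapt.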
