-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
About extracting CDS and PEP from GFF files obtained from TOGA #164
Comments
Can you pls extract only projections (transcript.gene.chainID) that are classified as intact or partially intact? This info is in loss_summ_data.tsv.gz (https://genome.senckenberg.de/download/TOGA/README.txt). Other projections have frameshifts or stop codons or missing sequences and this may not translate. |
Hi |
Can you send an example? We have all the data also in the genome browser http://genome.senckenberg.de/ Send me one, I can have a look as well |
Thank you very much for your suggestions. While organizing my files, I suddenly realized a potential issue. The GTF files I downloaded from Zoonomia contain separate lines for start_codon and stop_codon. However, when converting them to GFF format using gffread, an error occurred, causing some bases of the stop codons to not be included in the CDS regions, resulting in many sequences not being multiples of 3. Thank you very much for your help. I will correct this error and try again. If the problem persists, I will seek your advice again. Thank you very much. |
Hi, C:76.1%[S:44.2%,D:31.9%],F:6.6%,M:17.3%,n:9226 | Thank you very much. |
For which assembly are these stats? And how does that compare to the BUSCO stats we have in the supplement tables? |
I downloaded the GTF file for Dinomys branickii using the mouse as the reference. In your table, the annotation of this species with the human as the reference has a BUSCO score of 9,025 97.82% 86 115(Complete (s + d) Complete% Fragmented Missing) |
This is the HLdinBra1 DNAzoo assembly, right? Can you pls download the protein alignment file and extract only QUERY sequences and BUSCO this? |
I used this file for the BUSCO analysis and obtained the same results as you did. However, I'm a bit confused about why the BUSCO scores decrease when I extract the peptide sequences (PEP) from the GTF file. Could there be an issue with my method of extracting the PEP? I have used gffread as well as itools. |
Likely TOGA masks frameshifts and internal stop codons, while your GTF to protein may not do that. This could explain it. |
Hello, In the past few days, I have tested a new species that I assembled myself. I ran the make-lastz-chain and TOGA processes on this genome, which has a size of 2.7G and is well-masked for repeat sequences. When TOGA completed, I extracted the query peptides from the prot.fasta file, but the number of sequences was only 1933. This seems strange. My reference genome is mm10, as provided in TOGA. I will provide my lastz log file and TOGA log file. Could you please give me some advice? Thank you very much. |
20000 genes are 1:0. This is the main problem. Can you visualize your chains in the UCSC browser? E.g. convert to bigChain and load as a custom track? |
Hi, |
Hi, I have tested the LASTZ alignment for the species twice, and it seems there are no errors. The file sizes are the same, but I'm still confused about where the issue might be. I also couldn't identify any problems using the UCSC browser. Could you provide some advice? Thank you! |
Can you pls point me to the query assembly you used? I can try to produce the chains here. Then you can compare it. |
Thank you for your help! |
Thx. Is the query actually available on NCBI or DNAzoo ? Or is this a private assembly? |
Sorry, it's a private assembly, and the reference is mm10 |
OK, then I need to know the taxID of the query. And did you already run RepeatModeler + RepeatMasker (softmasking) on the query? |
Sure, I have run RepeatModeler + RepeatMasker+Repeatproteinmasker+TRF to mask the genome.(soft mask). The query is assemblied by myself and the taxid is 423607. |
OK, I started the chains. Will take a few days as our cluster is pretty full |
Everything finished on our side. I think the chains have a similar size to yours. Here is the status of the genes: Looks good. I made a quick hack to transfer our ancestral gene set from hg38 to mm10. Compared to HLfukDam2 = GCF_012274545.1 |
Thank you very much for your help. I have downloaded all the files. I will
compare our log files to see where the issue with my TOGA might be.
Michael Hiller ***@***.***> 于2024年7月7日周日 03:40写道:
… Everything finished on our side. I think the chains have a similar size to
yours.
Pls download everything from
https://genome.senckenberg.de/download/HLfukMec1/
and send me a ping. I'll delete it then.
Here is the status of the genes:
g GENE loss_summ_data.tsv | cut -f3 | s | uniq -c
17755 I
2252 L
344 M
263 PG
108 PI
59 PM
999 UL
Looks good.
I made a quick hack to transfer our ancestral gene set from hg38 to mm10.
Compared to HLfukDam2 = GCF_012274545.1
your assembly is perfectly fine.
intact genes genes with inactivating mutations genes with missing sequence
vs_HLfukMec1 15019 1090 124
vs_HLfukDam2 11690 1080 3463
—
Reply to this email directly, view it on GitHub
<#164 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/BIAZCEZIXDAFZ357BX4Z6PLZLBB3BAVCNFSM6AAAAABIVUVZMOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMJRHEYDAOBYHA>
.
You are receiving this because you authored the thread.Message ID:
***@***.***>
|
Hi,
My command is "./toga.py mm10.Fmechowii.final.chain.gz /home/TOGA/TOGAInput/mouse_mm10/toga.transcripts.bed /home/TOGA/mm10.2bit Fmechowii.2bit --kt --pn ./output/Fmechowii -i /home/TOGA/TOGAInput/mouse_mm10/toga.isoforms.tsv --nc /home/TOGA/nextflow_config_files --nd ./output/Nextflow.log --cb 12,15,18,21,24,27,30,33,36,39,42,45,47,49,60,80,100 --cjn 300 --u12 /home/TOGA/TOGAInput/mouse_mm10/toga.U12introns.tsv --ms" |
The parameters look good. There will be very few genes that require more than your maximum of 100 GB. |
Thank you for your advice, I'll try again!
Michael Hiller ***@***.***> 于2024年7月10日周三 04:06写道:
… The parameters look good. There will be very few genes that require more
than your maximum of 100 GB.
No idea why the results are different.
Did you re-run TOGA with the chains that I computed. Then we can at least
exclude a chain problem.
—
Reply to this email directly, view it on GitHub
<#164 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/BIAZCEZGCVFSEERW6NMK5Q3ZLQ7EZAVCNFSM6AAAAABIVUVZMOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMJYGYZDGMRRGU>
.
You are receiving this because you authored the thread.Message ID:
***@***.***>
|
Hello, I used your chain file, and the result did improve a bit, and the
protein sequence on the reference alignment can be increased, but there are
still as many as 27565 transcripts annotated as one2zero, so I think there
is still a problem with TOGA, because their log files are quite different.
Can you advise on this, or could you please provide the version of TOGA and
command you are running.
Michael Hiller ***@***.***> 于2024年7月10日周三 04:06写道:
… The parameters look good. There will be very few genes that require more
than your maximum of 100 GB.
No idea why the results are different.
Did you re-run TOGA with the chains that I computed. Then we can at least
exclude a chain problem.
—
Reply to this email directly, view it on GitHub
<#164 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/BIAZCEZGCVFSEERW6NMK5Q3ZLQ7EZAVCNFSM6AAAAABIVUVZMOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMJYGYZDGMRRGU>
.
You are receiving this because you authored the thread.Message ID:
***@***.***>
|
We are running 1.1.4 (the latest public version). |
Hello, the version I am using is the latest version 1.1.7. I think the
difference in the log file may be the cause of the problem. There should
not be such a big difference when running in local or slurm mode. Could you
please compare and make comments?
Michael Hiller ***@***.***> 于2024年7月15日周一 18:50写道:
… We are running 1.1.4 (the latest public version).
Which version do you run?
I'll ask Yury to have a look
—
Reply to this email directly, view it on GitHub
<#164 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/BIAZCE5EQDP54UFYHUSZQY3ZMOSPFAVCNFSM6AAAAABIVUVZMOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMRYGIYTIMJVHE>
.
You are receiving this because you authored the thread.Message ID:
***@***.***>
|
Hello, UPD: Sorry, found the right link to the log. From what I see, 86 output files are empty, likely due to some threads failing with the “OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable” error. This is likely a problem on your cluster’s side (see, e.g., https://stackoverflow.com/questions/52026652/openblas-blas-thread-init-pthread-create-resource-temporarily-unavailable). Anyway, I wonder why TOGA did not catch the error properly.
|
Hi, @ReverendCasy I'm the partner of the questioner. Based on your suggestion, I think this error may be due to a user process limitation issue. However, I can only run it in local mode, so I tested it twice, v3 and v4, respectively, reducing the number of buckets to reduce the number of processes created (--cb). Both times, the operation ended successfully, but there was still this problem in the log file, and the running result was quite different from that of yours. In addition, attach my "_cesar_crashed_jobs. txt file". v3_cesar_crashed_jobs.txt To my surprise, v4 has the lowest number of buckets, so it should have fewer problems exceeding the user limit and better results. But looking from the result the v3 result is better, I am still can't understand what reason be. v3 command: v4 command: Thank you for your time to answer for me. I hope you can give me some suggestions. Looking forward to your reply. |
Hi @molinfzlvvv , |
Hi @ReverendCasy, As you suggested, I set the --cjn option to 150, but the results did not improve much. After carefully comparing the differences, I found that the only change was in the executor configuration. Specifically, I changed the executor option for all processes in the nextflow_config_files folder from 'slurm' to 'local'. But I still don't understand why there are so many differences in the log file content. I am using the latest version, 1.1.7; is it the same version you are using? Can you offer your opinion on this matter? Currently, I am struggling to execute TOGA in its entirety, as I frequently encounter tasks being interrupted or skipped. This poses a significant challenge. Looking forward to your answer. |
Hi @molinfzlvvv ,
Yep
Right now I assume that the problems arise from either the cluster settings or memory limitations. One thing you can try is repartitioning memory buckets (say, --cb 3,7,10,15,20,30,40,50) to get rid of the heaviest jobs. Also, since reducing the absolute number of CESAR jobs helped a bit in the last run, you could also try combining lighter memory buckets and smaller --cjn number. |
Thank you very much for your repeated assistance. I will upload the latest log soon. Your suggestion might be effective, but could it potentially lead to the loss of alignments for genes that occupy a lot of memory? |
Hi @molinfzlvvv ,
True, and for now I would like to make sure that the problem does no concern insufficient memory resources on your machine and/or erroneous job binning in TOGA. P.S. Have you or @SWei2333 contacted the cluster administrator regarding the user process limit? Might be the most obvious solution, given the OpenBLAS error spotted in the previous run. |
Thanks to your advice, I tried to fix the error about OpenBLAS. Fortunately, TOGA worked well after I reduced the OPENBLAS_NUM_THREADS parameter, and I think it was solid from the results. I sincerely thank you for your patience and assistance. |
Hi @molinfzlvvv , |
Hi @ReverendCasy, As an additional problem, I have a species that has a single point mutation in its genome sequence, but one of the transcripts has a premature stop codon due to exon skipping and so on. My current vision is to do an assembly of the transcriptome data and TOGA with the reference genome to identify mutations in different transcripts, Is this theoretically possible? |
Hi @molinfzlvvv |
Hi
Recently, I've been using annotation files for some species obtained via the TOGA method from Zoonomia. When I attempted to extract CDS and PEP from these annotation files and genomes, I found that more than half of the CDS sequences extracted using itools were not multiples of 3. Upon inspecting these sequences, I discovered that half of them had incomplete stop codons, with only one or two bases remaining. The other half ended with TGA but were still not multiples of 3. The reason for this is currently unknown.
Do you have any suggestions for this issue? Thank you very much.
Best wishes
The text was updated successfully, but these errors were encountered: