About extracting CDS and PEP from GFF files obtained from TOGA #164

SWei2333 · 2024-06-03T03:14:01Z

Hi
Recently, I've been using annotation files for some species obtained via the TOGA method from Zoonomia. When I attempted to extract CDS and PEP from these annotation files and genomes, I found that more than half of the CDS sequences extracted using itools were not multiples of 3. Upon inspecting these sequences, I discovered that half of them had incomplete stop codons, with only one or two bases remaining. The other half ended with TGA but were still not multiples of 3. The reason for this is currently unknown.

Do you have any suggestions for this issue? Thank you very much.

Best wishes

MichaelHiller · 2024-06-03T07:20:41Z

Can you pls extract only projections (transcript.gene.chainID) that are classified as intact or partially intact? This info is in loss_summ_data.tsv.gz (https://genome.senckenberg.de/download/TOGA/README.txt).

Other projections have frameshifts or stop codons or missing sequences and this may not translate.

SWei2333 · 2024-06-03T09:28:15Z

Hi
I compared the genes with not multiples of 3 to the loss_summ_data.tsv.gz file and found that a significant portion of the genes marked as "I" still have CDS sequences that are not multiples of 3, primarily due to incomplete stop codons.

MichaelHiller · 2024-06-03T09:31:41Z

Can you send an example? We have all the data also in the genome browser http://genome.senckenberg.de/
Pls type a projection into the browser of the query species and click on the TOGA annotation. This shows the protein alignment and a list of inactivating mutations.

Send me one, I can have a look as well

SWei2333 · 2024-06-03T09:51:01Z

Thank you very much for your suggestions. While organizing my files, I suddenly realized a potential issue. The GTF files I downloaded from Zoonomia contain separate lines for start_codon and stop_codon. However, when converting them to GFF format using gffread, an error occurred, causing some bases of the stop codons to not be included in the CDS regions, resulting in many sequences not being multiples of 3. Thank you very much for your help. I will correct this error and try again. If the problem persists, I will seek your advice again. Thank you very much.

SWei2333 · 2024-06-05T06:36:03Z

Hi,
I recently downloaded the genomes and corresponding TOGA.gtf files for several species from Zoonomia. I used gffread to extract the protein sequences (pep) from the GTF files and performed BUSCO scoring. I found that the scores were generally around 80%. Could the issue be with the way I extracted the protein sequences, or is it due to the inherent quality of the genomes resulting in GTF files of this level?
below is a result

Thank you very much.

MichaelHiller · 2024-06-05T07:34:50Z

For which assembly are these stats? And how does that compare to the BUSCO stats we have in the supplement tables?

SWei2333 · 2024-06-05T07:59:24Z

I downloaded the GTF file for Dinomys branickii using the mouse as the reference. In your table, the annotation of this species with the human as the reference has a BUSCO score of 9,025 97.82% 86 115（Complete (s + d) Complete% Fragmented Missing）

MichaelHiller · 2024-06-06T13:31:02Z

This is the HLdinBra1 DNAzoo assembly, right?
Of course, mouse or human as the reference will make a difference, but can't explain 97.8% vs. 76% completeness.

Can you pls download the protein alignment file and extract only QUERY sequences and BUSCO this?
https://genome.senckenberg.de/download/TOGA/human_hg38_reference/Rodentia/Dinomys_branickii__pacarana__HLdinBra1/proteinAlignments.fa.gz
This should reproduce the BUSCO values we report.

SWei2333 · 2024-06-07T06:12:58Z

I used this file for the BUSCO analysis and obtained the same results as you did. However, I'm a bit confused about why the BUSCO scores decrease when I extract the peptide sequences (PEP) from the GTF file. Could there be an issue with my method of extracting the PEP? I have used gffread as well as itools.

MichaelHiller · 2024-06-07T09:14:55Z

Likely TOGA masks frameshifts and internal stop codons, while your GTF to protein may not do that. This could explain it.
Can you look at BUSCO genes that are complete in our protein file but not "GTF to protein"?

SWei2333 · 2024-06-26T03:06:28Z

Hello,

In the past few days, I have tested a new species that I assembled myself. I ran the make-lastz-chain and TOGA processes on this genome, which has a size of 2.7G and is well-masked for repeat sequences. When TOGA completed, I extracted the query peptides from the prot.fasta file, but the number of sequences was only 1933. This seems strange. My reference genome is mm10, as provided in TOGA. I will provide my lastz log file and TOGA log file. Could you please give me some advice? Thank you very much.

Uploading toga_2024_06_25_at_14_54.log…
Uploading run.log…

SWei2333 · 2024-06-26T03:21:25Z

lastz log : https://drive.google.com/file/d/1yAay3gBEw-_IXPKjCe54rWOk17Xw7is8/view?usp=sharing
toga log : https://drive.google.com/file/d/1h0ry0-oc4C2pioiCjZos5Ouzu1zlYUZ1/view?usp=sharing

MichaelHiller · 2024-06-26T07:37:36Z

20000 genes are 1:0. This is the main problem.

Can you visualize your chains in the UCSC browser? E.g. convert to bigChain and load as a custom track?
How large is the gziped chain file. It must be a few hundred MB.
I assume the chains are incomplete.

SWei2333 · 2024-06-26T08:26:03Z

Hi，
The gziped chain file is 112M, I think the size is enough, I'll try to visualize it.

SWei2333 · 2024-07-01T07:00:12Z

Hi,

I have tested the LASTZ alignment for the species twice, and it seems there are no errors. The file sizes are the same, but I'm still confused about where the issue might be. I also couldn't identify any problems using the UCSC browser. Could you provide some advice?

Thank you!

MichaelHiller · 2024-07-01T07:10:47Z

Can you pls point me to the query assembly you used? I can try to produce the chains here. Then you can compare it.

SWei2333 · 2024-07-01T07:37:25Z

Thank you for your help!
https://drive.google.com/file/d/1DdAbWqeiowVAwY3svxh1Q_c2Mg1OehcY/view?usp=sharing
Above is my 2bit file.

MichaelHiller · 2024-07-01T08:35:02Z

Thx. Is the query actually available on NCBI or DNAzoo ? Or is this a private assembly?
Is hg38 your reference?

SWei2333 · 2024-07-01T09:48:47Z

Sorry, it's a private assembly, and the reference is mm10

MichaelHiller · 2024-07-01T09:59:03Z

OK, then I need to know the taxID of the query. And did you already run RepeatModeler + RepeatMasker (softmasking) on the query?

SWei2333 · 2024-07-01T11:20:02Z

Sure, I have run RepeatModeler + RepeatMasker+Repeatproteinmasker+TRF to mask the genome.(soft mask). The query is assemblied by myself and the taxid is 423607.
Thank you for your help!

MichaelHiller · 2024-07-01T11:40:32Z

OK, I started the chains. Will take a few days as our cluster is pretty full

MichaelHiller · 2024-07-06T19:40:10Z

Everything finished on our side. I think the chains have a similar size to yours.
Pls download everything from
https://genome.senckenberg.de/download/HLfukMec1/
and send me a ping. I'll delete it then.

Here is the status of the genes:
g GENE loss_summ_data.tsv | cut -f3 | s | uniq -c
17755 I
2252 L
344 M
263 PG
108 PI
59 PM
999 UL

Looks good.

I made a quick hack to transfer our ancestral gene set from hg38 to mm10. Compared to HLfukDam2 = GCF_012274545.1
your assembly is perfectly fine.
intact genes genes with inactivating mutations genes with missing sequence
vs_HLfukMec1 15019 1090 124
vs_HLfukDam2 11690 1080 3463

SWei2333 · 2024-07-07T13:25:02Z

Thank you very much for your help. I have downloaded all the files. I will compare our log files to see where the issue with my TOGA might be. Michael Hiller ***@***.***> 于2024年7月7日周日 03:40写道：

…

Everything finished on our side. I think the chains have a similar size to yours. Pls download everything from https://genome.senckenberg.de/download/HLfukMec1/ and send me a ping. I'll delete it then. Here is the status of the genes: g GENE loss_summ_data.tsv | cut -f3 | s | uniq -c 17755 I 2252 L 344 M 263 PG 108 PI 59 PM 999 UL Looks good. I made a quick hack to transfer our ancestral gene set from hg38 to mm10. Compared to HLfukDam2 = GCF_012274545.1 your assembly is perfectly fine. intact genes genes with inactivating mutations genes with missing sequence vs_HLfukMec1 15019 1090 124 vs_HLfukDam2 11690 1080 3463 — Reply to this email directly, view it on GitHub <#164 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/BIAZCEZIXDAFZ357BX4Z6PLZLBB3BAVCNFSM6AAAAABIVUVZMOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMJRHEYDAOBYHA> . You are receiving this because you authored the thread.Message ID: ***@***.***>

SWei2333 · 2024-07-09T01:09:56Z

Hi,
I looked closely at the log files and found a big difference in the output, but I'm not sure why. The differences I can think of might be the following,

The reference files I used were provided by toga's github repository, including mm10.2bit,toga /TOGAInput/mouse_mm10/toga.transcripts.bed, toga.isoforms.tsv, toga. U12introns.tsv
The version of TOGA used was obtained from git clone directly
Use the local runtime instead of slurm

My command is "./toga.py mm10.Fmechowii.final.chain.gz /home/TOGA/TOGAInput/mouse_mm10/toga.transcripts.bed /home/TOGA/mm10.2bit Fmechowii.2bit --kt --pn ./output/Fmechowii -i /home/TOGA/TOGAInput/mouse_mm10/toga.isoforms.tsv --nc /home/TOGA/nextflow_config_files --nd ./output/Nextflow.log --cb 12,15,18,21,24,27,30,33,36,39,42,45,47,49,60,80,100 --cjn 300 --u12 /home/TOGA/TOGAInput/mouse_mm10/toga.U12introns.tsv --ms"

MichaelHiller · 2024-07-09T20:06:15Z

The parameters look good. There will be very few genes that require more than your maximum of 100 GB.
No idea why the results are different.
Did you re-run TOGA with the chains that I computed. Then we can at least exclude a chain problem.

SWei2333 · 2024-07-10T02:23:57Z

Thank you for your advice, I'll try again! Michael Hiller ***@***.***> 于2024年7月10日周三 04:06写道：

…

The parameters look good. There will be very few genes that require more than your maximum of 100 GB. No idea why the results are different. Did you re-run TOGA with the chains that I computed. Then we can at least exclude a chain problem. — Reply to this email directly, view it on GitHub <#164 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/BIAZCEZGCVFSEERW6NMK5Q3ZLQ7EZAVCNFSM6AAAAABIVUVZMOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMJYGYZDGMRRGU> . You are receiving this because you authored the thread.Message ID: ***@***.***>

SWei2333 · 2024-07-15T01:52:55Z

Hello, I used your chain file, and the result did improve a bit, and the protein sequence on the reference alignment can be increased, but there are still as many as 27565 transcripts annotated as one2zero, so I think there is still a problem with TOGA, because their log files are quite different. Can you advise on this, or could you please provide the version of TOGA and command you are running. Michael Hiller ***@***.***> 于2024年7月10日周三 04:06写道：

…

The parameters look good. There will be very few genes that require more than your maximum of 100 GB. No idea why the results are different. Did you re-run TOGA with the chains that I computed. Then we can at least exclude a chain problem. — Reply to this email directly, view it on GitHub <#164 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/BIAZCEZGCVFSEERW6NMK5Q3ZLQ7EZAVCNFSM6AAAAABIVUVZMOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMJYGYZDGMRRGU> . You are receiving this because you authored the thread.Message ID: ***@***.***>

MichaelHiller · 2024-07-15T10:50:05Z

We are running 1.1.4 (the latest public version).
Which version do you run?
I'll ask Yury to have a look

SWei2333 · 2024-07-16T02:03:28Z

Hello, the version I am using is the latest version 1.1.7. I think the difference in the log file may be the cause of the problem. There should not be such a big difference when running in local or slurm mode. Could you please compare and make comments? Michael Hiller ***@***.***> 于2024年7月15日周一 18:50写道：

…

We are running 1.1.4 (the latest public version). Which version do you run? I'll ask Yury to have a look — Reply to this email directly, view it on GitHub <#164 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/BIAZCE5EQDP54UFYHUSZQY3ZMOSPFAVCNFSM6AAAAABIVUVZMOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMRYGIYTIMJVHE> . You are receiving this because you authored the thread.Message ID: ***@***.***>

ReverendCasy · 2024-07-16T12:39:33Z

Hello,
For some reason I cannot download the running log, could you please send it to me? Also, if it is there, could you also send me temp/_cesar_crashed_jobs.txt file?

UPD: Sorry, found the right link to the log. From what I see, 86 output files are empty, likely due to some threads failing with the “OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable” error. This is likely a problem on your cluster’s side (see, e.g., https://stackoverflow.com/questions/52026652/openblas-blas-thread-init-pthread-create-resource-temporarily-unavailable). Anyway, I wonder why TOGA did not catch the error properly.
Please let us know if fixing the cluster setup helps. In the meanwhile we will check the crashed job handling.

Hello, the version I am using is the latest version 1.1.7. I think the difference in the log file may be the cause of the problem. There should not be such a big difference when running in local or slurm mode. Could you please compare and make comments? Michael Hiller @.***> 于2024年7月15日周一 18:50写道：
…

molinfzlvvv · 2024-07-18T02:55:45Z

Hi, @ReverendCasy

I'm the partner of the questioner. Based on your suggestion, I think this error may be due to a user process limitation issue. However, I can only run it in local mode, so I tested it twice, v3 and v4, respectively, reducing the number of buckets to reduce the number of processes created (--cb). Both times, the operation ended successfully, but there was still this problem in the log file, and the running result was quite different from that of yours. In addition, attach my "_cesar_crashed_jobs. txt file".

v3_cesar_crashed_jobs.txt
v4_cesar_crashed_jobs.txt
result comparsion.docx

To my surprise, v4 has the lowest number of buckets, so it should have fewer problems exceeding the user limit and better results. But looking from the result the v3 result is better, I am still can't understand what reason be.

v3 command:
./toga.py mm10.HLfukMec1.allfilled.chain /home/TOGA/TOGAInput/mouse_mm10/toga.transcripts.bed mm10.2bit Fmechowii.2bit --kt --pn /opt/synData2/gene_loss/Fmev3 -i /home/TOGA/TOGAInput/mouse_mm10/toga.isoforms.tsv --nc /home/TOGA/nextflow_config_files --cb 10,20,30,40,50,60,70,80,90,100 --cjn 300 --u12 /home/TOGA/TOGAInput/mouse_mm10/toga.U12introns.tsv --ms

v4 command:
./toga.py mm10.HLfukMec1.allfilled.chain /home/TOGA/TOGAInput/mouse_mm10/toga.transcripts.bed mm10.2bit Fmechowii.2bit --kt --pn /opt/synData2/gene_loss/Fmev4 -i /home/TOGA/TOGAInput/mouse_mm10/toga.isoforms.tsv --nc /home/TOGA/nextflow_config_files --cb 10,30,50,70,90,100 --cjn 300 --u12 /home/TOGA/TOGAInput/mouse_mm10/toga.U12introns.tsv --ms

Thank you for your time to answer for me. I hope you can give me some suggestions. Looking forward to your reply.

ReverendCasy · 2024-07-18T09:19:50Z

Hi @molinfzlvvv ,
Thank you for the test. The number of jobs is controlled with the --cjn option; could you please run any of these commands with cjn set to, say, 150?

molinfzlvvv · 2024-07-25T07:11:16Z

Hi @ReverendCasy,

As you suggested, I set the --cjn option to 150, but the results did not improve much. After carefully comparing the differences, I found that the only change was in the executor configuration. Specifically, I changed the executor option for all processes in the nextflow_config_files folder from 'slurm' to 'local'. But I still don't understand why there are so many differences in the log file content. I am using the latest version, 1.1.7; is it the same version you are using? Can you offer your opinion on this matter?

Currently, I am struggling to execute TOGA in its entirety, as I frequently encounter tasks being interrupted or skipped. This poses a significant challenge. Looking forward to your answer.

ReverendCasy · 2024-07-28T13:16:45Z

Hi @molinfzlvvv ,
Sorry for the late reply. Could I also get output logs for the last run?

I am using the latest version, 1.1.7; is it the same version you are using?

Yep

Can you offer your opinion on this matter?

Right now I assume that the problems arise from either the cluster settings or memory limitations. One thing you can try is repartitioning memory buckets (say, --cb 3,7,10,15,20,30,40,50) to get rid of the heaviest jobs. Also, since reducing the absolute number of CESAR jobs helped a bit in the last run, you could also try combining lighter memory buckets and smaller --cjn number.
Hope that helps.

molinfzlvvv · 2024-07-29T01:06:43Z

Hi,@ReverendCasy

Thank you very much for your repeated assistance. I will upload the latest log soon. Your suggestion might be effective, but could it potentially lead to the loss of alignments for genes that occupy a lot of memory?

log.zip

ReverendCasy · 2024-07-29T12:44:59Z

Hi @molinfzlvvv ,

but could it potentially lead to the loss of alignments for genes that occupy a lot of memory?

True, and for now I would like to make sure that the problem does no concern insufficient memory resources on your machine and/or erroneous job binning in TOGA.

P.S. Have you or @SWei2333 contacted the cluster administrator regarding the user process limit? Might be the most obvious solution, given the OpenBLAS error spotted in the previous run.

molinfzlvvv · 2024-08-06T06:32:32Z

Hi,@ReverendCasy

Thanks to your advice, I tried to fix the error about OpenBLAS. Fortunately, TOGA worked well after I reduced the OPENBLAS_NUM_THREADS parameter, and I think it was solid from the results.

I sincerely thank you for your patience and assistance.

ReverendCasy · 2024-08-06T15:04:23Z

Hi @molinfzlvvv ,
You’re welcome. The results look much better indeed. Feel free to comment here if you spot any other inconsistencies in your results.

molinfzlvvv · 2024-08-07T02:22:16Z

Hi @ReverendCasy,

As an additional problem, I have a species that has a single point mutation in its genome sequence, but one of the transcripts has a premature stop codon due to exon skipping and so on. My current vision is to do an assembly of the transcriptome data and TOGA with the reference genome to identify mutations in different transcripts, Is this theoretically possible?

ReverendCasy · 2024-08-07T10:31:06Z

Hi @molinfzlvvv
Sorry, not sure if I follow. Do you suspect the premature stop codon to be a false positive call? Are there any other reference transcripts for the same gene from which you can recover the point mutation? Do you want to use transcriptomic data for verifying mutations predicted by TOGA or to run TOGA with transcriptome assembly as query? The latter will likely lead to skewed orthology predictions given the lost synteny data.
P.S. Since the last questions do not relate to the original issue, please feel free to mail me directly (yury.malovichko@senckenberg.de)

About extracting CDS and PEP from GFF files obtained from TOGA #164

About extracting CDS and PEP from GFF files obtained from TOGA #164

Comments

SWei2333 commented Jun 3, 2024

MichaelHiller commented Jun 3, 2024

SWei2333 commented Jun 3, 2024

MichaelHiller commented Jun 3, 2024

SWei2333 commented Jun 3, 2024

SWei2333 commented Jun 5, 2024

MichaelHiller commented Jun 5, 2024

SWei2333 commented Jun 5, 2024

MichaelHiller commented Jun 6, 2024

SWei2333 commented Jun 7, 2024

MichaelHiller commented Jun 7, 2024

SWei2333 commented Jun 26, 2024

SWei2333 commented Jun 26, 2024

MichaelHiller commented Jun 26, 2024

SWei2333 commented Jun 26, 2024

SWei2333 commented Jul 1, 2024

MichaelHiller commented Jul 1, 2024

SWei2333 commented Jul 1, 2024

MichaelHiller commented Jul 1, 2024

SWei2333 commented Jul 1, 2024

MichaelHiller commented Jul 1, 2024

SWei2333 commented Jul 1, 2024

MichaelHiller commented Jul 1, 2024

MichaelHiller commented Jul 6, 2024

SWei2333 commented Jul 7, 2024 via email

SWei2333 commented Jul 9, 2024

MichaelHiller commented Jul 9, 2024

SWei2333 commented Jul 10, 2024 via email

SWei2333 commented Jul 15, 2024 via email

MichaelHiller commented Jul 15, 2024

SWei2333 commented Jul 16, 2024 via email

ReverendCasy commented Jul 16, 2024 • edited Loading

molinfzlvvv commented Jul 18, 2024

ReverendCasy commented Jul 18, 2024

molinfzlvvv commented Jul 25, 2024

ReverendCasy commented Jul 28, 2024

molinfzlvvv commented Jul 29, 2024

ReverendCasy commented Jul 29, 2024 • edited Loading

molinfzlvvv commented Aug 6, 2024

ReverendCasy commented Aug 6, 2024

molinfzlvvv commented Aug 7, 2024

ReverendCasy commented Aug 7, 2024 • edited Loading

ReverendCasy commented Jul 16, 2024 •

edited

Loading

ReverendCasy commented Jul 29, 2024 •

edited

Loading

ReverendCasy commented Aug 7, 2024 •

edited

Loading