Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

About extracting CDS and PEP from GFF files obtained from TOGA #164

Open
SWei2333 opened this issue Jun 3, 2024 · 41 comments
Open

About extracting CDS and PEP from GFF files obtained from TOGA #164

SWei2333 opened this issue Jun 3, 2024 · 41 comments

Comments

@SWei2333
Copy link

SWei2333 commented Jun 3, 2024

Hi
Recently, I've been using annotation files for some species obtained via the TOGA method from Zoonomia. When I attempted to extract CDS and PEP from these annotation files and genomes, I found that more than half of the CDS sequences extracted using itools were not multiples of 3. Upon inspecting these sequences, I discovered that half of them had incomplete stop codons, with only one or two bases remaining. The other half ended with TGA but were still not multiples of 3. The reason for this is currently unknown.

Do you have any suggestions for this issue? Thank you very much.

Best wishes

@MichaelHiller
Copy link
Collaborator

Can you pls extract only projections (transcript.gene.chainID) that are classified as intact or partially intact? This info is in loss_summ_data.tsv.gz (https://genome.senckenberg.de/download/TOGA/README.txt).

Other projections have frameshifts or stop codons or missing sequences and this may not translate.

@SWei2333
Copy link
Author

SWei2333 commented Jun 3, 2024

Hi
I compared the genes with not multiples of 3 to the loss_summ_data.tsv.gz file and found that a significant portion of the genes marked as "I" still have CDS sequences that are not multiples of 3, primarily due to incomplete stop codons.

@MichaelHiller
Copy link
Collaborator

Can you send an example? We have all the data also in the genome browser http://genome.senckenberg.de/
Pls type a projection into the browser of the query species and click on the TOGA annotation. This shows the protein alignment and a list of inactivating mutations.

Send me one, I can have a look as well

@SWei2333
Copy link
Author

SWei2333 commented Jun 3, 2024

Thank you very much for your suggestions. While organizing my files, I suddenly realized a potential issue. The GTF files I downloaded from Zoonomia contain separate lines for start_codon and stop_codon. However, when converting them to GFF format using gffread, an error occurred, causing some bases of the stop codons to not be included in the CDS regions, resulting in many sequences not being multiples of 3. Thank you very much for your help. I will correct this error and try again. If the problem persists, I will seek your advice again. Thank you very much.

@SWei2333
Copy link
Author

SWei2333 commented Jun 5, 2024

Hi,
I recently downloaded the genomes and corresponding TOGA.gtf files for several species from Zoonomia. I used gffread to extract the protein sequences (pep) from the GTF files and performed BUSCO scoring. I found that the scores were generally around 80%. Could the issue be with the way I extracted the protein sequences, or is it due to the inherent quality of the genomes resulting in GTF files of this level?
below is a result

C:76.1%[S:44.2%,D:31.9%],F:6.6%,M:17.3%,n:9226 |
|7020 Complete BUSCOs (C) |
|4078 Complete and single-copy BUSCOs (S) |
|2942 Complete and duplicated BUSCOs (D) |
|613 Fragmented BUSCOs (F) |
|1593 Missing BUSCOs (M) |
|9226 Total BUSCO groups searched

Thank you very much.

@MichaelHiller
Copy link
Collaborator

For which assembly are these stats? And how does that compare to the BUSCO stats we have in the supplement tables?

@SWei2333
Copy link
Author

SWei2333 commented Jun 5, 2024

I downloaded the GTF file for Dinomys branickii using the mouse as the reference. In your table, the annotation of this species with the human as the reference has a BUSCO score of 9,025 97.82% 86 115(Complete (s + d) Complete% Fragmented Missing)

@MichaelHiller
Copy link
Collaborator

This is the HLdinBra1 DNAzoo assembly, right?
Of course, mouse or human as the reference will make a difference, but can't explain 97.8% vs. 76% completeness.

Can you pls download the protein alignment file and extract only QUERY sequences and BUSCO this?
https://genome.senckenberg.de/download/TOGA/human_hg38_reference/Rodentia/Dinomys_branickii__pacarana__HLdinBra1/proteinAlignments.fa.gz
This should reproduce the BUSCO values we report.

@SWei2333
Copy link
Author

SWei2333 commented Jun 7, 2024

I used this file for the BUSCO analysis and obtained the same results as you did. However, I'm a bit confused about why the BUSCO scores decrease when I extract the peptide sequences (PEP) from the GTF file. Could there be an issue with my method of extracting the PEP? I have used gffread as well as itools.

@MichaelHiller
Copy link
Collaborator

Likely TOGA masks frameshifts and internal stop codons, while your GTF to protein may not do that. This could explain it.
Can you look at BUSCO genes that are complete in our protein file but not "GTF to protein"?

@SWei2333
Copy link
Author

Hello,

In the past few days, I have tested a new species that I assembled myself. I ran the make-lastz-chain and TOGA processes on this genome, which has a size of 2.7G and is well-masked for repeat sequences. When TOGA completed, I extracted the query peptides from the prot.fasta file, but the number of sequences was only 1933. This seems strange. My reference genome is mm10, as provided in TOGA. I will provide my lastz log file and TOGA log file. Could you please give me some advice? Thank you very much.
图片1
Uploading toga_2024_06_25_at_14_54.log…
Uploading run.log…

@MichaelHiller
Copy link
Collaborator

20000 genes are 1:0. This is the main problem.

Can you visualize your chains in the UCSC browser? E.g. convert to bigChain and load as a custom track?
How large is the gziped chain file. It must be a few hundred MB.
I assume the chains are incomplete.

@SWei2333
Copy link
Author

Hi,
The gziped chain file is 112M, I think the size is enough, I'll try to visualize it.

@SWei2333
Copy link
Author

SWei2333 commented Jul 1, 2024

Hi,

I have tested the LASTZ alignment for the species twice, and it seems there are no errors. The file sizes are the same, but I'm still confused about where the issue might be. I also couldn't identify any problems using the UCSC browser. Could you provide some advice?

Thank you!

@MichaelHiller
Copy link
Collaborator

Can you pls point me to the query assembly you used? I can try to produce the chains here. Then you can compare it.

@SWei2333
Copy link
Author

SWei2333 commented Jul 1, 2024

Thank you for your help!
https://drive.google.com/file/d/1DdAbWqeiowVAwY3svxh1Q_c2Mg1OehcY/view?usp=sharing
Above is my 2bit file.

@MichaelHiller
Copy link
Collaborator

Thx. Is the query actually available on NCBI or DNAzoo ? Or is this a private assembly?
Is hg38 your reference?

@SWei2333
Copy link
Author

SWei2333 commented Jul 1, 2024

Sorry, it's a private assembly, and the reference is mm10

@MichaelHiller
Copy link
Collaborator

OK, then I need to know the taxID of the query. And did you already run RepeatModeler + RepeatMasker (softmasking) on the query?

@SWei2333
Copy link
Author

SWei2333 commented Jul 1, 2024

Sure, I have run RepeatModeler + RepeatMasker+Repeatproteinmasker+TRF to mask the genome.(soft mask). The query is assemblied by myself and the taxid is 423607.
Thank you for your help!

@MichaelHiller
Copy link
Collaborator

OK, I started the chains. Will take a few days as our cluster is pretty full

@MichaelHiller
Copy link
Collaborator

Everything finished on our side. I think the chains have a similar size to yours.
Pls download everything from
https://genome.senckenberg.de/download/HLfukMec1/
and send me a ping. I'll delete it then.

Here is the status of the genes:
g GENE loss_summ_data.tsv | cut -f3 | s | uniq -c
17755 I
2252 L
344 M
263 PG
108 PI
59 PM
999 UL

Looks good.

I made a quick hack to transfer our ancestral gene set from hg38 to mm10. Compared to HLfukDam2 = GCF_012274545.1
your assembly is perfectly fine.
intact genes genes with inactivating mutations genes with missing sequence
vs_HLfukMec1 15019 1090 124
vs_HLfukDam2 11690 1080 3463

@SWei2333
Copy link
Author

SWei2333 commented Jul 7, 2024 via email

@SWei2333
Copy link
Author

SWei2333 commented Jul 9, 2024

Hi,
I looked closely at the log files and found a big difference in the output, but I'm not sure why. The differences I can think of might be the following,

  1. The reference files I used were provided by toga's github repository, including mm10.2bit,toga /TOGAInput/mouse_mm10/toga.transcripts.bed, toga.isoforms.tsv, toga. U12introns.tsv
  2. The version of TOGA used was obtained from git clone directly
  3. Use the local runtime instead of slurm

My command is "./toga.py mm10.Fmechowii.final.chain.gz /home/TOGA/TOGAInput/mouse_mm10/toga.transcripts.bed /home/TOGA/mm10.2bit Fmechowii.2bit --kt --pn ./output/Fmechowii -i /home/TOGA/TOGAInput/mouse_mm10/toga.isoforms.tsv --nc /home/TOGA/nextflow_config_files --nd ./output/Nextflow.log --cb 12,15,18,21,24,27,30,33,36,39,42,45,47,49,60,80,100 --cjn 300 --u12 /home/TOGA/TOGAInput/mouse_mm10/toga.U12introns.tsv --ms"

@MichaelHiller
Copy link
Collaborator

The parameters look good. There will be very few genes that require more than your maximum of 100 GB.
No idea why the results are different.
Did you re-run TOGA with the chains that I computed. Then we can at least exclude a chain problem.

@SWei2333
Copy link
Author

SWei2333 commented Jul 10, 2024 via email

@SWei2333
Copy link
Author

SWei2333 commented Jul 15, 2024 via email

@MichaelHiller
Copy link
Collaborator

We are running 1.1.4 (the latest public version).
Which version do you run?
I'll ask Yury to have a look

@SWei2333
Copy link
Author

SWei2333 commented Jul 16, 2024 via email

@ReverendCasy
Copy link
Collaborator

ReverendCasy commented Jul 16, 2024

Hello,
For some reason I cannot download the running log, could you please send it to me? Also, if it is there, could you also send me temp/_cesar_crashed_jobs.txt file?

UPD: Sorry, found the right link to the log. From what I see, 86 output files are empty, likely due to some threads failing with the “OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable” error. This is likely a problem on your cluster’s side (see, e.g., https://stackoverflow.com/questions/52026652/openblas-blas-thread-init-pthread-create-resource-temporarily-unavailable). Anyway, I wonder why TOGA did not catch the error properly.
Please let us know if fixing the cluster setup helps. In the meanwhile we will check the crashed job handling.

Hello, the version I am using is the latest version 1.1.7. I think the difference in the log file may be the cause of the problem. There should not be such a big difference when running in local or slurm mode. Could you please compare and make comments? Michael Hiller @.***> 于2024年7月15日周一 18:50写道:

@molinfzlvvv
Copy link

Hi, @ReverendCasy

I'm the partner of the questioner. Based on your suggestion, I think this error may be due to a user process limitation issue. However, I can only run it in local mode, so I tested it twice, v3 and v4, respectively, reducing the number of buckets to reduce the number of processes created (--cb). Both times, the operation ended successfully, but there was still this problem in the log file, and the running result was quite different from that of yours. In addition, attach my "_cesar_crashed_jobs. txt file".

v3_cesar_crashed_jobs.txt
v4_cesar_crashed_jobs.txt
result comparsion.docx

To my surprise, v4 has the lowest number of buckets, so it should have fewer problems exceeding the user limit and better results. But looking from the result the v3 result is better, I am still can't understand what reason be.

v3 command:
./toga.py mm10.HLfukMec1.allfilled.chain /home/TOGA/TOGAInput/mouse_mm10/toga.transcripts.bed mm10.2bit Fmechowii.2bit --kt --pn /opt/synData2/gene_loss/Fmev3 -i /home/TOGA/TOGAInput/mouse_mm10/toga.isoforms.tsv --nc /home/TOGA/nextflow_config_files --cb 10,20,30,40,50,60,70,80,90,100 --cjn 300 --u12 /home/TOGA/TOGAInput/mouse_mm10/toga.U12introns.tsv --ms

v4 command:
./toga.py mm10.HLfukMec1.allfilled.chain /home/TOGA/TOGAInput/mouse_mm10/toga.transcripts.bed mm10.2bit Fmechowii.2bit --kt --pn /opt/synData2/gene_loss/Fmev4 -i /home/TOGA/TOGAInput/mouse_mm10/toga.isoforms.tsv --nc /home/TOGA/nextflow_config_files --cb 10,30,50,70,90,100 --cjn 300 --u12 /home/TOGA/TOGAInput/mouse_mm10/toga.U12introns.tsv --ms

Thank you for your time to answer for me. I hope you can give me some suggestions. Looking forward to your reply.

@ReverendCasy
Copy link
Collaborator

Hi @molinfzlvvv ,
Thank you for the test. The number of jobs is controlled with the --cjn option; could you please run any of these commands with cjn set to, say, 150?

@molinfzlvvv
Copy link

Hi @ReverendCasy,

As you suggested, I set the --cjn option to 150, but the results did not improve much. After carefully comparing the differences, I found that the only change was in the executor configuration. Specifically, I changed the executor option for all processes in the nextflow_config_files folder from 'slurm' to 'local'. But I still don't understand why there are so many differences in the log file content. I am using the latest version, 1.1.7; is it the same version you are using? Can you offer your opinion on this matter?

image

Currently, I am struggling to execute TOGA in its entirety, as I frequently encounter tasks being interrupted or skipped. This poses a significant challenge. Looking forward to your answer.

@ReverendCasy
Copy link
Collaborator

Hi @molinfzlvvv ,
Sorry for the late reply. Could I also get output logs for the last run?

I am using the latest version, 1.1.7; is it the same version you are using?

Yep

Can you offer your opinion on this matter?

Right now I assume that the problems arise from either the cluster settings or memory limitations. One thing you can try is repartitioning memory buckets (say, --cb 3,7,10,15,20,30,40,50) to get rid of the heaviest jobs. Also, since reducing the absolute number of CESAR jobs helped a bit in the last run, you could also try combining lighter memory buckets and smaller --cjn number.
Hope that helps.

@molinfzlvvv
Copy link

Hi,@ReverendCasy

Thank you very much for your repeated assistance. I will upload the latest log soon. Your suggestion might be effective, but could it potentially lead to the loss of alignments for genes that occupy a lot of memory?

log.zip

@ReverendCasy
Copy link
Collaborator

ReverendCasy commented Jul 29, 2024

Hi @molinfzlvvv ,

but could it potentially lead to the loss of alignments for genes that occupy a lot of memory?

True, and for now I would like to make sure that the problem does no concern insufficient memory resources on your machine and/or erroneous job binning in TOGA.

P.S. Have you or @SWei2333 contacted the cluster administrator regarding the user process limit? Might be the most obvious solution, given the OpenBLAS error spotted in the previous run.

@molinfzlvvv
Copy link

Hi,@ReverendCasy

Thanks to your advice, I tried to fix the error about OpenBLAS. Fortunately, TOGA worked well after I reduced the OPENBLAS_NUM_THREADS parameter, and I think it was solid from the results.

image

I sincerely thank you for your patience and assistance.

@ReverendCasy
Copy link
Collaborator

Hi @molinfzlvvv ,
You’re welcome. The results look much better indeed. Feel free to comment here if you spot any other inconsistencies in your results.

@molinfzlvvv
Copy link

Hi @ReverendCasy,

As an additional problem, I have a species that has a single point mutation in its genome sequence, but one of the transcripts has a premature stop codon due to exon skipping and so on. My current vision is to do an assembly of the transcriptome data and TOGA with the reference genome to identify mutations in different transcripts, Is this theoretically possible?

@ReverendCasy
Copy link
Collaborator

ReverendCasy commented Aug 7, 2024

Hi @molinfzlvvv
Sorry, not sure if I follow. Do you suspect the premature stop codon to be a false positive call? Are there any other reference transcripts for the same gene from which you can recover the point mutation? Do you want to use transcriptomic data for verifying mutations predicted by TOGA or to run TOGA with transcriptome assembly as query? The latter will likely lead to skewed orthology predictions given the lost synteny data.
P.S. Since the last questions do not relate to the original issue, please feel free to mail me directly (yury.malovichko@senckenberg.de)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants