-
Notifications
You must be signed in to change notification settings - Fork 34
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ConvertToReflFlat skips many transcripts #333
Comments
Hi, GTF files are poorly defined, so parsing files from different sources can be tricky. This is a good example. Let’s take your first gene where you are warned the gene is skipped. Here’s some junky R code so you can replicate, with the interesting output:
The only difference in these two records is the ID field at the end. All of the standard identifiers (transcript name, transcript ID, etc) are the same. Since this lab has come up with their own unique field to define that these two records are unique, it's not at all surprising that software not written by them would not know how to interpret this field. To use this GTF with our toolkit you'd have to modify it so the transcript_id field is unique for these two records. Perhaps concatenating the existing transcript ID + new ID would work? I'd start by isolating this one gene in it's own GTF and trying to parse it before and after your modifications to see if you have a reasonable solution. As for your expected behavior: I have yet to see a GTF where all transcripts are included in the refFlat file - there are a handful of genes that have the same gene symbol and appear on many chromosomes (and are thus not unique) that are filtered from the standard Ensembl GTF file. Perhaps reprocessing the 10x GTF file would include (almost?) all transcripts as they have spent some effort cleaning their GTF file before release. Any idea why there are two records for this transcript, given that they have the same coordinates and information? Perhaps they are defining alternate sets of used exons but happen to use the same first and last exon? All the exons point to the same transcript_id, so I'm really not sure what's going on here. |
Hi James, Thanks for your quick reply. I was afraid I could not "easily" use the same GTF for the Drop-seq samples but I had to give it a try! Agreed on my hyperbole regarding the expected behavior, typing too fast. I have no idea why those transcripts are duplicated. I went and checked the files they provide detailing the changes to the GTF, but I didnt find the explanation there (caveat: im a biologist, not a bioinformatician!). I will give it a try to your solution editing the GTF file but it would be nice to know why they went to the trouble of duplicating transcripts and assigning new IDs in the first place... In addition, and probably more important, do you think the massive skipping of duplicated records during the generation of the refFlat will have any impact on DigitalExpression? I am assuming that 1 of the duplicated transcripts is retained in the refFlat file, right? |
Hi,
We don’t retain either of the duplicate records. If you had “Gene A” on chromosome 1 and chromosome 2, which one would you keep? Generally that filtering is a call (that you heard!) to look more carefully at your input and decide if you want to deal with amount of filtering, or alter the input.
If most of your gene models are filtered because this GTF has many duplicate records, then I think the resulting DGE is only going to reflect the genes that were not filtered, which sounds like it might be pretty troublesome in this case.
…-Jim
On Mar 21, 2023, at 12:30 PM, Augusto Escalante ***@***.***> wrote:
Hi James,
Thanks for your quick reply.
I was afraid I could not "easily" use the same GTF for the Drop-seq samples but I had to give it a try!
Agreed on my hyperbole regarding the expected behavior, typing too fast.
I have no idea why those transcripts are duplicated. I went and checked the files <https://storage.cloud.google.com/generecovery/mouse_mm10_v2_input_files.tar.gz> they provide detailing the changes to the GTF, but I didnt find the explanation there (caveat: im a biologist, not a bioinformatician!).
I will give it a try to your solution editing the GTF file but it would be nice to know why they went to the trouble of duplicating transcripts and assigning new IDs in the first place...
In addition, and probably more important, do you think the massive skipping of duplicated records during the generation of the refFlat will have any impact on DigitalExpression? I am assuming that 1 of the duplicated transcripts is retained in the refFlat file, right?
—
Reply to this email directly, view it on GitHub <#333 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABCZXJ76J2UG6ECYO42JQ5TW5HJTTANCNFSM6AAAAAAWCOCJ7E>.
You are receiving this because you commented.
|
Yea...that's what I suspected when I saw the results of DGE today morning and that's why I started looking around... It seems even more complicated. The file you looked at (mouse_mm10_optimized_v2.gtf.gz) is the new version of that GTF, I am using the previous version (mouse_mm10_optimized_v2.gtf.gz) where, if I search for Chic1, just to follow with your example:
Only 1 transcript appears, right? Why is that a problem when parsing? |
You'll need to give me a link to the GTF file you're parsing. |
I thought so but the Pool lab have removed access to the version 1. No problem: here you go |
I assume that the GTFParser called by ConvertToRefFlat looks at the "transcript_name" field, right? However, when I build the metadata using another 10x reference (mm10 1.2.0, I happened to have that one sitting around), and even if the search for Chic1 is:
This GTF doesnt give out any warning regarding Chic1... |
Can you also give me the sequence dictionary you used to call ConvertToRefFlat? My mm10 reference does not have matching contig names. |
Sure, here |
So, I at least have a sense of the problem now for the Chic1 example. Most features have the same transcript ID and transcript name:
One exon is a weird outlier and confuses parsing, having the same transcript name, but a new transcript ID.
In your examples above, this was the last line in the problematic example. The exon spans the entire gene, which seems a bit unusual. Our parser is clustering GTF lines on the transcript ID, then adding them to the gene model by the transcript name (which is usually 1:1 unique with the transcript ID!) What transcript does that exon belong to? Is it a new transcript for the gene, or the same transcript as the other features? Maybe it's time to consider having a discussion with the author of the GTF file to decide exactly what this exon is. |
Lets see if I get a reply. In the meantime, my problem would be solved if the parser would look only at the transcript_name? Do you think that would be easy to implement? |
Maybe @jamesnemesh you can somehow integrate AGAT in your pipeline for uniform GTF processing: https://github.com/NBISweden/AGAT? |
We don't distribute R tools, but one might imagine pre-processing your GTF file with this tool, and having the output be a valid input for Drop-Seq tools. Does this not work currently? Do you have an example GTF output that you've processed this way and throws errors? |
@jamesnemesh: I have a GTF file that was created by AGAT that throws warnings (specifically: mostly Chromosome/Strand disagreement ones) when running |
Affected tool(s)
ConvertToRefFlat
Affected version(s)
Description
I am analyzing a public dataset generated with Drop-seq
GSE103892
. In order to merge the resulting expression matrix to other 10x Genomics datasets, including my own, I have used the genome and gtf files as provided by Pool Lab, which represents an improved version over the 10x Genomics-provided reference files.When running the provided script to generate the required Drop-seq metadata files, I observe that many transcripts and some exons are skipped at the
ConvertToRefFlat
step as can be observed in the attached log file.Then, during the next step, ReduceGtf, the same skipping warnings are printed and, in addition, many warnings in the form of:
WARNING 2023-03-14 09:14:56 EnhanceGTFRecords gene GTFRecord(chrX:103356476-103396092 + . [Chic1 gene]) != GeneFromGTF(chrX:103356476-103409092 + Chic1) -- skipping
are output...Both types of warnings can be observed in the attached logfile: logfile.txt
The resulting
refFlat
file is 10 times smaller than the one provided by you in the Cookbook.Is this normal?
Expected behavior
GTF to refFlat conversion includes all transcripts in the GTF
Actual behavior
Many transcripts are removed from refFlat.
The text was updated successfully, but these errors were encountered: