rowRanges of SingleCellExperiment output don't give 3'UTR coordinates #94

Open
frstyang opened this issue Oct 3, 2024 · 4 comments

Comments

@frstyang

frstyang commented Oct 3, 2024

The rowRanges of the SingleCellExperiment output by scUTRquant do not seem to capture the 3'UTRs of the transcripts. For example, the 3'UTR length of the ENST00000621592.8 transcript is 1993 according to the utr_length column, but the coordinates given by the rowRanges are chr8:127742452-127742951, which span only 500 nt. Looking at this transcript's 3'UTR in the genome browser, its length does seem to be 1993.

What coordinates is rowRanges providing? It seems to provide multiple coordinates for the same transcript. How can one extract the actual 3'UTR coordinates, the same ones used to compute the SingleCellExperiment output?

[3 images attached]

@mfansler
Collaborator

mfansler commented Oct 3, 2024

Thanks for the interest!

The GRanges metadata provides the window where the counting took place, which is typically 500 nt (summing exon subranges). The 3'UTR length is an inference based on the difference between the cleavage site position (the 3' position of the GRanges interval) and the CDS stop codon position of the associated Ensembl transcript that was closest to the cleavage site.

That is, we only count in the peak window that is adjacent to the cleavage site, but provide the 3'UTR length based on parsimony assumptions.
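To make the distinction concrete, here is a minimal sketch using the numbers from the report above. It only illustrates the arithmetic and is not scUTRquant code: the `Interval` class and `cleavage_site` helper are hypothetical, coordinates are assumed to be 1-based and inclusive (the GRanges convention), and the plus strand is an assumption, since the thread does not state the strand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Hypothetical stand-in for one GRanges row (1-based, inclusive)."""
    chrom: str
    start: int
    end: int
    strand: str

    @property
    def width(self) -> int:
        # Inclusive coordinates, so the width is end - start + 1.
        return self.end - self.start + 1


def cleavage_site(iv: Interval) -> int:
    """Return the 3' position of the interval, which depends on strand."""
    return iv.end if iv.strand == "+" else iv.start


# Counting window reported in the issue; the "+" strand is assumed here.
window = Interval("chr8", 127742452, 127742951, "+")
reported_utr_length = 1993  # value of the utr_length column in the issue

print(window.width)           # width of the counting window
print(cleavage_site(window))  # 3' end of the window
print(reported_utr_length > window.width)
```

The window width (500) and the reported `utr_length` (1993) describe different things: the first is where reads are counted, the second is the inferred distance from the stop codon to the cleavage site.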

Hope that clarifies the metadata, but feel free to request any additional information.

@frstyang
Author

frstyang commented Oct 3, 2024

Thanks, I think that makes sense!
So if there is only one GRanges interval for a transcript which is on the positive strand, the 3'UTR coordinates would be interval_start - utr_length, interval_start?
What if there are multiple intervals for a transcript? Do I take the most 3' endpoint among the intervals as one endpoint, and then take as the other endpoint the position that is utr_length upstream of it?

@mfansler
Collaborator

mfansler commented Oct 4, 2024

For the vast majority, yes. However, a rare edge case is if the annotated 3'UTR itself has an intron. In such a scenario, the procedure you outlined would identify the 3'UTR as starting downstream of where it actually begins.
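The procedure discussed above can be sketched as follows. This is an illustration under stated assumptions, not part of scUTRquant: `naive_utr_coords` is a hypothetical helper, coordinates are 1-based and inclusive, and `utr_length` is assumed to count every base up to and including the cleavage position (the exact off-by-one convention is not stated in the thread). Following the description of the cleavage site as the 3' position of the interval, the anchor is the largest interval end on the plus strand and the smallest interval start on the minus strand.

```python
def naive_utr_coords(intervals, utr_length, strand):
    """Estimate (start, end) of a 3'UTR assumed to have no intron.

    intervals  : list of (start, end) counting windows for one transcript
    utr_length : inferred 3'UTR length (e.g. the utr_length column)
    strand     : "+" or "-"
    """
    if strand == "+":
        three_prime = max(end for _, end in intervals)
        return three_prime - utr_length + 1, three_prime
    if strand == "-":
        three_prime = min(start for start, _ in intervals)
        return three_prime, three_prime + utr_length - 1
    raise ValueError("strand must be '+' or '-'")


# Single window from the issue (plus strand assumed).
print(naive_utr_coords([(127742452, 127742951)], 1993, "+"))

# Toy example of the edge case: a 3'UTR split into two exons,
# 100-199 and 300-499 on the plus strand, so utr_length = 100 + 200.
true_start = 100
est_start, est_end = naive_utr_coords([(300, 499)], 300, "+")
print(est_start, est_end)      # the estimate ignores the intron
print(est_start > true_start)  # estimated start lies downstream
```

In the toy example the estimate starts at 200 rather than 100, which is the downstream shift described above for 3'UTRs that contain an intron.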

To cover that edge case, I'm not sure there's any way around looking up the reference annotation, i.e., import the Ensembl/GENCODE annotation, derive 3'UTR coordinates from that, and then augment/truncate them according to the 3'-most position in GRanges.
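A rough sketch of that annotation-based approach is below. It assumes the annotated 3'UTR exons for the transcript have already been extracted from the Ensembl/GENCODE annotation as `(start, end)` pairs (1-based, inclusive); in an R/Bioconductor workflow something like `GenomicFeatures::threeUTRsByTranscript` would be a natural way to obtain them. The helper names `adjust_utr_exons` and `total_width` are hypothetical, and the handling of a cleavage site that falls inside a 3'UTR intron (the preceding exon is simply extended) is a simplification, not the maintainers' method.

```python
def adjust_utr_exons(utr_exons, strand, cleavage):
    """Truncate or extend annotated 3'UTR exons to end at a cleavage site.

    utr_exons : list of (start, end) annotated 3'UTR exon ranges
    strand    : "+" or "-"
    cleavage  : 3'-most position taken from the counting intervals
    """
    exons = sorted(utr_exons)  # genomic order, regardless of strand
    if strand == "+":
        # Drop exons that begin beyond the cleavage site, then move the
        # 3' boundary of the last remaining exon to the cleavage site.
        kept = [(s, e) for s, e in exons if s <= cleavage]
        if not kept:
            raise ValueError("cleavage site lies upstream of the 3'UTR")
        kept[-1] = (kept[-1][0], cleavage)
    elif strand == "-":
        # Mirror image: the 3' boundary is the lowest genomic coordinate.
        kept = [(s, e) for s, e in exons if e >= cleavage]
        if not kept:
            raise ValueError("cleavage site lies upstream of the 3'UTR")
        kept[0] = (cleavage, kept[0][1])
    else:
        raise ValueError("strand must be '+' or '-'")
    return kept


def total_width(exons):
    """Summed exonic length, i.e. the spliced 3'UTR length."""
    return sum(e - s + 1 for s, e in exons)


annotated = [(100, 199), (300, 499)]  # toy 3'UTR with one intron

truncated = adjust_utr_exons(annotated, "+", 450)  # proximal cleavage site
extended = adjust_utr_exons(annotated, "+", 600)   # beyond annotated end
print(truncated, total_width(truncated))
print(extended, total_width(extended))
```

Because the exon structure comes from the annotation, the start of the 3'UTR stays at its annotated position (100 in the toy example) even when the 3'UTR contains an intron, and only the 3' boundary is moved.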

I see that having this precomputed for each cleavage site would be a valuable addition. However, I'm not sure it's a proper feature request for scUTRquant itself. For example, if rowRanges is already occupied by the read counting intervals, where should the full-length 3'UTR ranges be stored? If you have thoughts/preferences about this, do share.

Otherwise, I could just precompute this for the UTRomes and deposit tables somewhere (e.g., FigShare).

@frstyang
Author

frstyang commented Oct 4, 2024

Precomputing the 3'UTR coordinates for the transcripts in the UTRomes and depositing them somewhere sounds good! That would be much appreciated.
