-
Notifications
You must be signed in to change notification settings - Fork 171
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Mapping kallisto equivalence classes to actual transcript sequences #421
Comments
Older version of kallisto had the |
Long-winded answer: Correct, pseudobam was very hard to maintain, especially within the logic of the new de bruijn graph structure that was implemented in the latest release. As such, it has been removed (and I'm unsure if I'll have the bandwidth to bring it back or implement an alternative -- though it remains on my radar). For looking at exact locations where a read maps to, just go with a standard aligner (since that's what they were built for) and use kallisto for getting you the actual quantifications (which is what kallisto was built for). If you use kallisto with the --num option, the read numbers (zero-indexed) are stored in the BUS file. You can then view your BUS file (with bustools text) and get your equivalence classes of interest and the read numbers that are associated with them, and then you can run those reads through a standard aligner. If you're interested in looking at what exon a read maps to, you could index exons rather than individual transcripts (but then this will no longer be splice-aware and you'd discard all reads that cross any exon-exon splice junction). Anyway, just some ideas :) |
Thanks for your reply! |
Ah yes, the equivalence class is basically a collection of transcripts. So if an equivalence class consists of transcripts A, B, and C — it means that the k-mers in your read are compatible with those three transcripts. In ENSEMBL, the transcript names usually start with ENST… (for human), and you can get their sequences through the Ensembl website or by simply looking at the transcript sequences in the FASTA file that you ran “kallisto index” on. In your kallisto output folder, there’s a .ec file — the first column is some arbitrary equivalence class ID and the second column represents the collection of transcript IDs associated with that equivalence class (the numerical transcript IDs are zero-indexed and correspond to the line numbers in the transcripts.txt file in your output folder). |
Yes, it is easy to know which transcripts are present in each equivalence class. Nonetheless, my intention is to link each equivalence class, to the actual transcript sequence (or genomic coordinates) it represents. |
One option might be to use the corresponding GTF file, convert it to BED format and then use bedtools to intersect the transcripts that belong to each equivalence class. This ways, you show end up with genomic regions that are in common to all transcripts in each equivalence class. If you prefer working with R over bedtools, you could implement the same idea using GRanges(List) objects. |
This is actually a great idea. |
In kallisto 0.48.0 there was a way to generate the GFA file of the index using the inspect command. There the transcript id was written for each unitig, so you could easily get the unitig sequence and blat it. This is useful for debugging a handful of equivalence classes but not suitable for building into a pipeline |
Hi there!
Thanks for developing this awesome tool! I am currently using kallisto-bustools to generate single-cell equivalence class counts.
I am wondering if there is a way of linking an equivalence class to a particular transcript sequence or region. For instance, eq class { t1, t2, t3} corresponds to exon 1 of t1, t2 and t3 which is chr1:200-300.
I assume this information must be present in the kallisto index, but I am not finding a straightforward way of extracting it. Any suggestion ?
The text was updated successfully, but these errors were encountered: