Justification for feature overlap filters #143
-
Hello,

First of all, thank you so much for making this tool, which I have been very impressed by! I was wondering: how were the feature overlap filters decided? I was looking at the table in this issue (#22 (comment)) and wondered if there is any literature or reasoning you could provide for why you remove some features over others, or why you'd never expect certain features to overlap.

Thank you so much again!
Replies: 1 comment
-
Hi Andrea (@watsonar), thanks for reaching out and asking this excellent question!

To be honest, there are no distinct publications that the current implementation relies on, at least not yet. Instead, I thought that such a step was simply needed in the workflow, and hence implemented a rather simple version as a placeholder in the code for future refinements/improvements, which should of course be based on large-scale data/results and be less influenced by rather anecdotal observations.

Essentially, the current implementation is based on best practices and common sense within the community (see Dfast, Prokka, PGAP): for example, we regularly see false-positive ORFs crossing tRNAs, or in turn false-positive tRNAs overlapping tmRNAs. These cases are filtered out. For CDS, and especially short CDS (sORFs), it's obviously much more complicated. Here, I'd like to allow some reasonable overlaps, e.g. < X bp if not encoded in the same frame. But as described above, I'm keen to get publication-backed thresholds for X before implementing any more sophisticated overlap features. In particular, for sORFs I implemented rather strict overlap filters to reduce the potentially large number of false positives.

So, if you (or anyone else) know of good publications that could provide solid statistical information useful for refining these overlaps, I'd love to hear about them. Thanks again and best regards!
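To make the discussion concrete, the filtering logic described above (drop putative ORFs that cross a tRNA/tmRNA; tolerate only short, out-of-frame CDS-CDS overlaps) could be sketched roughly like this. This is a hypothetical illustration, not Bakta's actual implementation: the `Feature` class, the `MAX_CDS_OVERLAP` threshold (standing in for the publication-backed "X"), and the frame check are all illustrative assumptions.

```python
# Hypothetical sketch of the overlap-filter idea discussed above.
# All names and thresholds are illustrative assumptions, not the
# tool's real data model or values.
from dataclasses import dataclass

@dataclass
class Feature:
    type: str    # e.g. 'cds', 'sorf', 'trna', 'tmrna'
    start: int   # 1-based, inclusive
    stop: int    # inclusive
    strand: str  # '+' or '-'

MAX_CDS_OVERLAP = 30  # bp; placeholder for the publication-backed threshold 'X'

def overlap(a: Feature, b: Feature) -> int:
    """Length of the overlap between two features (0 if disjoint)."""
    return max(0, min(a.stop, b.stop) - max(a.start, b.start) + 1)

def same_frame(a: Feature, b: Feature) -> bool:
    """Simplified frame test: same strand and start positions congruent mod 3.

    (A real implementation would derive the frame from the stop
    coordinate on the '-' strand; this is enough for illustration.)
    """
    return a.strand == b.strand and (a.start % 3) == (b.start % 3)

def keep_orf(orf: Feature, others: list[Feature]) -> bool:
    """Decide whether a putative (s)ORF survives the overlap filters."""
    for other in others:
        ov = overlap(orf, other)
        if ov == 0:
            continue
        if other.type in ('trna', 'tmrna'):
            return False  # ORFs crossing (tm)RNAs are treated as false positives
        if other.type in ('cds', 'sorf'):
            # tolerate short overlaps only if the two ORFs are in different frames
            if ov > MAX_CDS_OVERLAP or same_frame(orf, other):
                return False
    return True
```

For example, an ORF spanning positions 150–450 would be rejected against a tRNA at 100–175, while two CDS in different frames overlapping by only a few bp would both be kept.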