Filtering implementation epic #305
For this, I suggest we define acceptable filters for each component and disallow footguns where we can think of them. We will need good documentation for each filter.
I think the idea is that we have a flag on the results object that tracks if the result has failed a filter? We might need to think about how we track the failed filters?
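A minimal sketch of that flag idea, assuming a hypothetical result object (the class and field names here are illustrative, not readfish's actual API): each filter that fails is recorded by name, so the "flag" falls out naturally as "is the failed list empty?".

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical sketch only: FilterableResult and apply_filters are
# illustrative names, not part of the real readfish codebase.

@dataclass
class FilterableResult:
    read_id: str
    mapq: int
    failed_filters: list[str] = field(default_factory=list)

    @property
    def passed(self) -> bool:
        # The "flag": a result passes iff no filter recorded a failure.
        return not self.failed_filters

def apply_filters(
    result: FilterableResult, filters: dict[str, Callable[[FilterableResult], bool]]
) -> FilterableResult:
    # Record every failed filter by name, rather than stopping at the first.
    for name, predicate in filters.items():
        if not predicate(result):
            result.failed_filters.append(name)
    return result

r = apply_filters(
    FilterableResult("read-1", mapq=12),
    {"mapq > 40": lambda res: res.mapq > 40},
)
print(r.passed, r.failed_filters)  # False ['mapq > 40']
```

Tracking names rather than a single boolean also answers the "how do we track the failed filters" question: the chunk log can emit the list directly.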
This is complex. In my view each filter line should be independent, i.e. if you have:

```toml
# for alignment
filter = [
    "is_primary",
    "mapq > 40",
    "strand == -1",
]
```

failing on any one of these conditions would result in the alignment being nulled. In this state the molecule would be treated as a no_map and sequencing would proceed as specified in the toml for a no_map read. One can also imagine the possibility of wanting to take a specific action when a filter evaluates to false:

```toml
# for alignment
filter = [
    "mapq > 40:stop_receiving,unblock",
    "strand == -1",
]
```

In this case, a read which had a mapq > 40 would stop_receiving if true and unblock if false. If the strand was +1 it would be treated as no_map. This is a fairly complex filtering language and would need to be well documented with some use cases. Note this would also address:
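The mini-language above can be sketched with a small parser. This is a hypothetical illustration of the proposed `EXPR` / `EXPR:TRUE_ACTION,FALSE_ACTION` syntax, not an actual implementation; the function names and the `no_map` default are assumptions drawn from the comment.

```python
import operator
import re

# Supported comparison operators for the sketched filter mini-language.
OPS = {">": operator.gt, "<": operator.lt, "==": operator.eq}
EXPR_RE = re.compile(r"^(\w+)\s*(==|>|<)\s*(-?\d+)$")

def parse_filter(line: str):
    """Split 'mapq > 40:stop_receiving,unblock' into its parts.

    Without an explicit action suffix, a failing filter falls back to
    no_map behaviour, per the proposal above.
    """
    if ":" in line:
        expr, actions = line.split(":", 1)
        on_true, on_false = actions.split(",", 1)
    else:
        expr, on_true, on_false = line, None, "no_map"
    field_, op, value = EXPR_RE.match(expr.strip()).groups()
    return field_, OPS[op], int(value), on_true, on_false

def evaluate(line: str, record: dict):
    # Returns the action to take, or None if the filter passes with no
    # explicit true-action attached.
    field_, op, value, on_true, on_false = parse_filter(line)
    return on_true if op(record[field_], value) else on_false

print(evaluate("mapq > 40:stop_receiving,unblock", {"mapq": 55}))  # stop_receiving
print(evaluate("strand == -1", {"strand": 1}))  # no_map
```

One design note: keeping each filter line independent, as proposed, means the evaluator never needs operator precedence or boolean connectives, which keeps the language small enough to document exhaustively.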
Perhaps this should go in the chunk log (as was) with a tag concept à la SAM files?
I think this will have to be a property of a region? I can envision scenarios with barcoded samples where you want different filters depending on sample type.
In my view the available filters for a plugin should be defined, and whitelisted, by the plugin. Any filter passed to a plugin that is not explicitly defined by the plugin should be ignored, and on validation of a toml any such filters should be flagged to the user. In this case, a plugin can be valid with no defined filters, so filter availability is optional.
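That validation behaviour could look something like this sketch. Everything here is an assumption for illustration: the `PluginFilters` class, the `allowed` set, and the `validate_filters` helper are hypothetical names, not readfish's plugin interface.

```python
import logging

logging.basicConfig(level=logging.WARNING)

class PluginFilters:
    # A plugin declares the filter fields it accepts. A plugin that declares
    # nothing is still valid; filtering is simply unavailable for it.
    allowed: frozenset = frozenset({"mapq", "strand", "is_primary"})

def validate_filters(toml_filters: list[str], plugin: PluginFilters) -> list[str]:
    """Keep whitelisted filters; warn about (and drop) unknown ones."""
    accepted = []
    for line in toml_filters:
        # The filter field is the first token, before any action suffix.
        field_ = line.split()[0].split(":")[0]
        if field_ in plugin.allowed:
            accepted.append(line)
        else:
            logging.warning("ignoring filter not declared by plugin: %r", line)
    return accepted

print(validate_filters(["mapq > 40", "gc_content > 0.5"], PluginFilters()))
# ['mapq > 40'] plus a warning for the undeclared gc_content filter
```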
I don't quite understand this - I think targets will see every result regardless of filtering if we follow the above suggestions.
I think any read which has been filtered out should be labelled as "filtered" in the chunks log. This is different to an action_overridden (or whatever). In the context of action overridden, a decision was made about a specific read which matches the toml but has been changed for a reason outside the toml spec (e.g. first read or duplex). If a read has been filtered, the behaviour is specified by the toml, so readfish has not overridden what the toml asked for - it has done as it should have done.
After chatting with @alexomics on Friday we came to a couple of conclusions; I decided to write them up here after a few days to give me time to forget some of them.

```
signal -> basecalling(signal) -> filter_basecalls(calls) -> align(filtered_calls) ->
filter_alignments(alignments) -> decide(filtered_alignments)
```

Think that's it for now!
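The stage order above can be sketched as plain function composition, with a filter step sitting after basecalling and after alignment. All names, thresholds, and the toy record shapes below are assumptions purely for illustration.

```python
# Toy pipeline mirroring: signal -> basecall -> filter -> align -> filter -> decide.

def basecall(signal):
    # Pretend each signal value yields one call with a sequence length.
    return [{"read_id": i, "seq_len": s} for i, s in enumerate(signal)]

def filter_basecalls(calls):
    # Example basecall filter: drop very short calls (threshold is arbitrary).
    return [c for c in calls if c["seq_len"] >= 200]

def align(calls):
    # Pretend longer reads align well; mapq values are made up.
    return [{**c, "mapq": 60 if c["seq_len"] > 500 else 10} for c in calls]

def filter_alignments(alignments):
    # Example alignment filter from the discussion: mapq > 40.
    return [a for a in alignments if a["mapq"] > 40]

def decide(alignments):
    # Only filtered alignments reach the decision step.
    return ["stop_receiving" for _ in alignments]

signal = [100, 300, 800]
print(decide(filter_alignments(align(filter_basecalls(basecall(signal))))))
# ['stop_receiving']
```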
I agree with most of the above - 1-3 are all good. 4 - I'm not convinced about where filtering should be performed. Could you expand on the reasoning for filtering in targets rather than the plugin? With respect to 1-3 I broadly agree, but I wonder if there should be some suggestions or guidance about footguns?
I think we should make it clear that this is experimental and that no verification of the actual filtering conditions will be performed, and state clearly that if a user doesn't double-check their filtering statements, they could ruin their experiment - much in the same way we disclaim using readfish at all! If we filter in targets, rather than the plugin, it means we don't have to:
One thing it would be very useful to do now is to come up with some key, useful use cases that we can use as tests during development! That way we don't lose sight of why we are doing this. Do you have any ideas @mattloose, @alexomics? I'm thinking: but I can't think of any others off the top of my head.
I do agree - but I thought we had considered adding a whitelist for acceptable filters? However, all this is definitely at the user's own risk! I think your examples are good - I will have a think about more of them....
The reason I don't want to add a whitelist of acceptable filters is that we will end up having to maintain that list and edit it. If nanopore changes the layout of the base-call results returned from dorado v4.3, filtering base-calls will work for v4.2 but not v4.3, and we will then have to manually edit our whitelist. But then do we have two sets? One for v4.3 and one for v4.2? Whilst letting anyone filter anything does have the added danger of people making mistakes or doing something silly, I think that the amount of work that comes with a manual whitelist simply isn't worth it.
Suggested filters:
I think we should consider having filters at the region level - if there is a filter for a region you ignore the global filter; if there isn't a region filter you apply the global filter. This would allow testing of filter types on a single flowcell through different regions - e.g. mapQual > 0, > 20, > 40 all on one flowcell.
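The precedence rule proposed here - region filter, if present, wholly replaces the global filter - is small enough to sketch. The dict shapes and `resolve_filters` name below are hypothetical, not the actual region/config representation.

```python
# Sketch of region-over-global filter precedence: a region filter, when
# present, replaces (does not merge with) the global filter.

def resolve_filters(global_filters, region):
    region_filters = region.get("filter")
    return region_filters if region_filters is not None else global_filters

global_f = ["mapQual > 0"]
regions = [
    {"name": "control"},                          # no filter -> global applies
    {"name": "mapq20", "filter": ["mapQual > 20"]},
    {"name": "mapq40", "filter": ["mapQual > 40"]},
]
for r in regions:
    print(r["name"], resolve_filters(global_f, r))
```

Replace-rather-than-merge keeps each region's behaviour fully specified in one place, which matters for the single-flowcell comparison experiment described above.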
Feature being added - Filtering!

Filter reads and alignments based on a flexible "mini language" specified in the TOML.

- Would go into the config TOML under the respective place for the filtering, i.e. caller_settings/mapper_settings
- Applied in the tight targets loop
- Parsed into magic Enums and Classes in `_filter.py`
- I suppose we would store these on the respective classes? `_PluginModule` or something, I forget

Ideas

Issues that need resolving/clarification
- `sequence.metadata.length < 0`, `mapq < 0` and goodbye all reads. How can we safeguard against this? I suggest maybe starting only with PAF, and maybe adding some checks in validation.
- For the `Result` object, do we add it straight into the plugin `basecall`/`map_reads` methods (would involve having to write separate implementations for new plugins), and have the plugin return two Iterables, one of reads/`Results` instances that passed and one that failed?
- `unblock` or `proceed`?
- Add `fails_validation` to the toml/Conditions section, which defaults to `proceed`? This then relies on the exceeded max chunk behaviour.

Tasks

- `_filter.py`
- #304
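To make the "magic Enums and Classes in `_filter.py`" idea concrete, here is one hedged sketch of how a filter string might parse into an Enum-backed class. None of this reflects the real `_filter.py`; the `Op` and `Filter` names and the integer-only comparison are assumptions for illustration.

```python
from enum import Enum
import operator

class Op(Enum):
    # Enum of supported comparison operators in the mini-language sketch.
    GT = ">"
    LT = "<"
    EQ = "=="

    def apply(self, left, right):
        return {Op.GT: operator.gt, Op.LT: operator.lt, Op.EQ: operator.eq}[self](left, right)

class Filter:
    """Parse e.g. 'mapq > 40' into (field, Op, value) and act as a predicate."""

    def __init__(self, expr: str):
        # Try longer operator tokens first so '==' is not mistaken for two hits.
        for op in sorted(Op, key=lambda o: -len(o.value)):
            if op.value in expr:
                field_, value = expr.split(op.value, 1)
                self.field, self.op, self.value = field_.strip(), op, int(value)
                break

    def __call__(self, record: dict) -> bool:
        return self.op.apply(record[self.field], self.value)

f = Filter("mapq > 40")
print(f({"mapq": 55}), f({"mapq": 12}))  # True False
```

Storing parsed `Filter` objects on the plugin classes, as the comment suggests, would mean the expression is validated once at config load rather than per read.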