Filter low-quality sequences by Nextclade QC status #144

Open
huddlej opened this issue Jan 30, 2024 · 0 comments
Labels
enhancement New feature or request

Comments

@huddlej
Contributor

huddlej commented Jan 30, 2024

Context

Despite the many quality filters in our workflow, including a global clock rate filter, a local clock rate filter, an outliers list, and dropping sequences with poor alignments, our trees still occasionally include low-quality sequences that Nextclade would have flagged with a "bad" QC status. In most of my recent experiences with this issue, the low-quality sequences have too many private mutations.

In the best case, these low-quality sequences look strange in the tree. In the worst case, these sequences break the date inference for internal nodes and produce an invalid time tree topology, requiring the builds to be run again.

Description

Since we already plan to migrate from Nextalign to Nextclade, we should align sequences with Nextclade and produce its metadata output file with QC statuses. We should then filter out any sequences that have a QC status of "bad".

We could approach this functionality in a couple of ways:

  1. Modify the existing align rule to use Nextclade and produce QC output, add a subsequent post-alignment filter rule before the tree building rule to omit "bad" QC records (using augur filter on the alignment sequences and "metadata" from Nextclade), and pass the filtered alignment to the tree rule. We'd probably want to merge the original metadata records with the Nextclade metadata prior to filtering much like the merge we do in the flu_frequencies workflow.

OR

  2. Run Nextclade on all sequences upstream of the main phylogenetic workflow, merge the complete metadata with the Nextclade annotations, upload these combined metadata to S3, start the phylogenetic workflow from the combined metadata files, and apply custom filters on QC in the subsampling logic for each build. The only changes to the main phylogenetic workflow required by this approach would be additions to the build YAML files to include a filter for Nextclade QC status. The bigger changes happen outside of the main workflow in our sequence upload logic.
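
To make the merge-and-filter step from the first approach concrete, here is a minimal sketch in plain Python. It is not the workflow's implementation. It assumes the original metadata is keyed by a `strain` column, that the Nextclade TSV provides `seqName` and `qc.overallStatus` columns, and that records with no Nextclade annotation should be kept. The function name `merge_and_filter` is hypothetical.

```python
import csv
import io


def merge_and_filter(metadata_tsv, nextclade_tsv, id_column="strain"):
    """Left-join Nextclade QC status onto metadata and drop "bad" records."""
    # Index Nextclade's overall QC status by sequence name.
    nextclade_rows = csv.DictReader(io.StringIO(nextclade_tsv), delimiter="\t")
    qc_by_name = {row["seqName"]: row["qc.overallStatus"] for row in nextclade_rows}

    kept = []
    for record in csv.DictReader(io.StringIO(metadata_tsv), delimiter="\t"):
        # Assumption: records without a Nextclade row get an empty status and are kept.
        status = qc_by_name.get(record[id_column], "")
        if status != "bad":
            kept.append({**record, "qc.overallStatus": status})
    return kept


# Toy inputs for illustration only; the strain names are made up.
metadata = "strain\tdate\nA/one\t2023-01-01\nA/two\t2023-02-01\nA/three\t2023-03-01\n"
nextclade = "seqName\tqc.overallStatus\nA/one\tgood\nA/two\tbad\n"

kept = merge_and_filter(metadata, nextclade)
print([record["strain"] for record in kept])
```

In the workflow itself, the same logic would more likely live in a metadata merge step followed by augur filter on the merged file, as described above. The sketch only shows which records would survive.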

The benefit of the first approach is that we could implement it now without much additional infrastructure planning, since the changes all happen inside the phylogenetic workflow. The annotations would be very fast, since we'd only run Nextclade on the subsampled data. The main disadvantages are the added complexity in the workflow and the redundant runs of Nextclade across multiple builds for the same lineages and segments.

The benefits of the second approach are that it would introduce no complexity to the existing workflow and it would produce a valuable resource that other current workflows (like flu_frequencies) and future workflows (forecasts?) could benefit from. The disadvantage is the additional infrastructural complexity of setting up the Nextclade runs with GitHub Actions for different references (e.g., A/Wisconsin/67/2005 or A/Darwin/6/2021 for H3N2) and storing the merged metadata in S3 in a way that allows us to unambiguously grab the outputs from the desired Nextclade dataset.

I think we want to end up at the second approach eventually, so maybe it is worth the extra planning effort to figure that approach out now instead of using the first approach.

@huddlej huddlej added the enhancement New feature or request label Jan 30, 2024
@huddlej huddlej mentioned this issue Mar 18, 2024
12 tasks