Consider "low" mapping quality reads to be unaligned for the purpose of Marking Duplicates. #1460

Open
wants to merge 1 commit into
base: master

Conversation

yfarjoun
Contributor

@yfarjoun yfarjoun commented Feb 2, 2020

Fixes #128 and #1285 (which are similar issues and about a factor of 10 difference in issue number...)

The main thrust here is to treat "low mapping quality" as an indication that the read is effectively unmapped, since its location is not well determined. Two identical fragments that each have one read with low mapping quality (e.g., 0) should therefore be considered duplicates, or not, based on the well-mapped reads, and not on the semi-randomly placed low-mapping-quality reads.

Since this uses the same mechanism as unmapped reads, the low-mapping-quality read will also not be marked as a duplicate when its mate is marked so, unless the file is queryname sorted, in which case both unmapped and low-mapping-quality reads are marked like their well-aligned mate.
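The behaviour described above can be sketched roughly as follows. This is a simplified Python illustration, not Picard's actual code: `Read`, `MIN_MAPQ`, `end_key`, `fragment_key`, and `reads_to_flag` are hypothetical names, and the threshold value is an assumption chosen only for the example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

MIN_MAPQ = 1  # hypothetical threshold: reads below it are treated as unmapped


@dataclass(frozen=True)
class Read:
    name: str
    contig: Optional[str]  # None when the aligner left the read unmapped
    pos: int               # 5' position used for duplicate comparison (illustrative)
    reverse: bool
    mapq: int


def end_key(read: Read, min_mapq: int = MIN_MAPQ) -> Optional[Tuple[str, int, bool]]:
    """Return the read's placement, or None if it counts as unmapped."""
    if read.contig is None or read.mapq < min_mapq:
        return None
    return (read.contig, read.pos, read.reverse)


def fragment_key(r1: Read, r2: Read, min_mapq: int = MIN_MAPQ):
    """Key a pair by its confidently placed ends only, in a stable order."""
    keys = [k for k in (end_key(r1, min_mapq), end_key(r2, min_mapq)) if k is not None]
    return tuple(sorted(keys)) if keys else None  # no usable end: never a duplicate


def reads_to_flag(r1: Read, r2: Read, queryname_sorted: bool, min_mapq: int = MIN_MAPQ):
    """Names of the reads in a duplicate pair that would get the duplicate flag."""
    return [r.name for r in (r1, r2)
            if queryname_sorted or end_key(r, min_mapq) is not None]


# Two identical fragments whose second reads have MAPQ 0 and landed in different places.
pair_a = (Read("a/1", "chr1", 1000, False, 60), Read("a/2", "chr1", 1300, True, 0))
pair_b = (Read("b/1", "chr1", 1000, False, 60), Read("b/2", "chr7", 52000, True, 0))

# With no threshold the semi-random MAPQ 0 placements make the keys differ.
same_without_threshold = fragment_key(*pair_a, min_mapq=0) == fragment_key(*pair_b, min_mapq=0)
# With the threshold only the well-mapped ends are compared, so the keys agree.
same_with_threshold = fragment_key(*pair_a) == fragment_key(*pair_b)
```

In this sketch, `reads_to_flag(*pair_b, queryname_sorted=False)` flags only the well-mapped read, while passing `queryname_sorted=True` flags both mates, mirroring the queryname-sorted exception described above.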

Description

Give your PR a concise yet descriptive title
Please explain the changes you made here.
Explain the motivation for making this change. What existing problem does the pull request solve?
Mention any issues fixed, addressed or otherwise related to this pull request, including issue numbers or hard links for issues in other repos.
You can delete these instructions once you have written your PR description.


Checklist (never delete this)

Never delete this, it is our record that procedure was followed. If you find that for whatever reason one of the checklist points doesn't apply to your PR, you can leave it unchecked but please add an explanation below.

Content

  • Added or modified tests to cover changes and any new functionality
  • Edited the README / documentation (if applicable)
  • All tests passing on Travis

Review

  • Final thumbs-up from reviewer
  • Rebase, squash and reword as applicable

For more detailed guidelines, see https://github.com/broadinstitute/picard/wiki/Guidelines-for-pull-requests

@yfarjoun yfarjoun marked this pull request as ready for review February 10, 2020 20:17
@fleharty
Contributor

@yfarjoun Just letting you know that I'm using this on some somatic data to determine if this affects the false positive rates.

@yfarjoun
Contributor Author

great. thanks @fleharty

@fleharty
Contributor

fleharty commented Jun 29, 2020

@yfarjoun
Some preliminary results:
I ran 6 NA12878 samples in pairs against each other, for a total of choose(6, 2) = 30 runs.

With the current MarkDuplicates:
SNP FP = 30
SNP FP rate = 0.029 / Mb
Indel FP = 21
Indel FP rate = 0.020 / Mb

With "low" mapping quality reads treated as unaligned:
SNP FP = 44
SNP FP rate = 0.043 / Mb
Indel FP = 17
Indel FP rate = 0.017 / Mb

It looks like the SNP false positive rate goes up with this new Mark Duplicates method, but the indel false positive rate goes down.
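As a quick arithmetic check on the figures above, the following Python sketch recomputes the relative changes and the territory implied by each count and rate pair. It assumes that each rate is simply the FP count divided by the total evaluated territory in Mb, which the comment does not state. It also shows that choose(6, 2) evaluates to 15, whereas 30 matches the number of ordered pairs of 6 samples; whether each pair was run in both directions is an assumption, not something confirmed in this thread.

```python
import math

# Counts and rates copied from the comment above.
results = {
    "current":  {"snp": (30, 0.029), "indel": (21, 0.020)},
    "low_mapq": {"snp": (44, 0.043), "indel": (17, 0.017)},
}


def implied_territory_mb(count: int, rate_per_mb: float) -> float:
    """Total territory in Mb, assuming rate = count / territory."""
    return count / rate_per_mb


def relative_change(before: float, after: float) -> float:
    """Fractional change from before to after."""
    return (after - before) / before


territories = {
    (method, kind): implied_territory_mb(count, rate)
    for method, kinds in results.items()
    for kind, (count, rate) in kinds.items()
}

snp_change = relative_change(results["current"]["snp"][0], results["low_mapq"]["snp"][0])
indel_change = relative_change(results["current"]["indel"][0], results["low_mapq"]["indel"][0])

unordered_pairs = math.comb(6, 2)  # 15 unordered pairs
ordered_pairs = 6 * 5              # 30 ordered pairs (assumed interpretation of "30 runs")

for key, mb in territories.items():
    print(key, f"{mb:.0f} Mb")
print(f"SNP FP change: {snp_change:+.1%}, indel FP change: {indel_change:+.1%}")
```

All four count and rate pairs imply roughly the same total territory (about 1,000 Mb), so the reported rates are at least internally consistent with the counts under that assumption.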

@fleharty
Contributor

@yfarjoun I would like to follow up on this to understand why it would increase the FP rate for SNPs; it makes no sense to me. Unfortunately, I don't have time this week to look into it.

Do you want me to pursue this further?

@yfarjoun
Contributor Author

is there a typo in the new indel FP rate? Seems like a 0 is missing after the decimal.

@yfarjoun
Contributor Author

if you can DM me the locations perhaps I can take a look. thanks for setting this up!

@fleharty
Contributor

@yfarjoun
I edited the comment to fix the typo you found, and I sent you an e-mail with the locations.

@jessicaway
Member

@fleharty Do you know if we want to proceed with this PR?

Development

Successfully merging this pull request may close these issues.

Duplicates missed when mate has low mapping quality
3 participants