Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixing MergeBamAlignment creates invalid BAMs in the presence of secondary alignments #932

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

DenisVerkhoturov
Copy link

@DenisVerkhoturov DenisVerkhoturov commented Sep 19, 2017

Description

The problem originates in HitsForInserts#coordinateByHitIndex. This method makes decision whether read is paired or not on assumption that all the records have HI-tag but ignores other criteria such as RNEXT (Ref. name of the mate/next read) and PNEXT (Position of the mate/next read) SAM fields. So paired reads are determined as two unrelated ones in the case when HI-tag is absent.


Implementation

  • Added new method (coorditanteByMate) for correlating pair-aware alignments. It does not require presence of HI-tags and performs work based on the data contained in fields: RNAME, RNEXT, POS, PNEXT.
  • Added MergeBamAlignmentTest#testPairedReadsAreProcessedProperlyInAbsenceOfHitIndexTags running MergeBamAlignment with SAM file contained pair-aware alignments without HI-tag.
  • Added testCoordinateByMateMethodWorksProperly for the new method.
  • Changed BestMapqPrimaryAlignmentSelectionStrategy. It uses implemented method coordinateByMate instead of coordinateByHitIndex now. We kept the old method due to backward compatibility.

Performance

All the benchmarks were measured by JMH (Mode - thrpt, Cnt - 100, Units - ops/s)

Benchmark 100 paired alignments 500 paired alignments 10_000 paired alignments
Method Case Score Error Score Error Score Error
coordinateByHitIndex HI-tag and mate is absent 272_334.682 ± 3852.273 27_056.185 ± 543.235 95.384 ± 0.354
HI-tag is absent 282_456.059 ± 5726.669 35_603.756 ± 954.916 696.495 ± 7.419
HI-tag is present 280_571.088 ± 1900.595 27_734.340 ± 914.012 95.101 ± 0.935
coordinateByMate HI-tag and mate is absent 180_912.877 ± 1983.565 17_965.070 ± 242.398 55.060 ± 2.967
HI-tag is absent 174_695.474 ± 1786.275 17_139.180 ± 128.948 51.079 ± 0.542
HI-tag is present 185_531.494 ± 1975.027 18_120.429 ± 157.350 57.640 ± 3.658
New method is 1.7 times slower but this caused by using required fields that allow rearranging alignments in any situations. And the only situation `coordinateByIndex` is much faster is when we have no **Hi-tag** and pair-aware alignments are being rearranged incorrectly (case described in issue).

Resolves

MergeBamAlignment creates invalid BAMs in the presence of secondary alignments

Checklist (never delete this)

Never delete this, it is our record that procedure was followed. If you find that for whatever reason one of the checklist points doesn't apply to your PR, you can leave it unchecked but please add an explanation below.

Content

  • Added or modified tests to cover changes and any new functionality
  • Edited the README / documentation (if applicable)
  • All tests passing on Travis

Review

  • Final thumbs-up from reviewer
  • Rebase, squash and reword as applicable

@DenisVerkhoturov DenisVerkhoturov changed the title fix: MergeBamAlignment creates invalid BAMs in the presence of second… Fixing MergeBamAlignment creates invalid BAMs in the presence of secondary alignments Sep 19, 2017
@coveralls
Copy link

coveralls commented Sep 19, 2017

Coverage Status

Coverage decreased (-0.03%) to 76.734% when pulling 31d042e on EpamLifeSciencesTeam:epam-ls_MergeBamAlignment into 9a7ba13 on broadinstitute:master.

Copy link
Member

@pshapiro4broad pshapiro4broad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few comments about the Java code, not about the code's meaning.

final Integer hiTagValueOfFirst = firstOfPairOrFragment.get(i).getIntegerAttribute(SAMTag.HI.name());
final Integer hiTagValueOfSecond = secondOfPair.get(i).getIntegerAttribute(SAMTag.HI.name());

if (nonNull(hiTagValueOfFirst) && nonNull(hiTagValueOfSecond)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the rationale for using these functions instead of comparing against null directly? My understanding is that these methods are provided so we have convenient lambda forms for streaming operations, etc. Using them inline like this doesn't look like an improvement to me.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer to use functional style instead of procedural even if it could decrease performance a little bit because it usually increases code readability. Especially for functions of Java standard library. I could revert all cases when I use static import if it is the concern. So is it critical?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what you mean about functional vs procedural... one form is an expression consisting of a function call, the other is a binary expression using the!= operator. Both are equally functional and procedural, as I understand these terms.

Readability is a subjective argument; to me, using != null is more readable. Given that it requires a static import to avoid making the code much wordier than the original, I don't think it's worth the change.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Speaking about operators vs functions, I have always thought that since Java doesn't allow to make partial application of an operator it is convenient to use functions in such cases. Let's imagine we have equals function that takes two objects and checks their equality. So isNull function is just a partial application of equals function with fixed null as one of the arguments. Then nonNull is negated isNull. So that way, for the sake of consistency it is easier to use functions instead of operators in Java not only in streams but everywhere in mine opinion. Is it a little bit excessively? 😄

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It sounds like you have a more enlightened view of functional programming in Java than I do. :)

Thanks for the explanation, I now understand the distinction you're making between expressions and functions, and why it's more consistent to use functions everywhere. I can't say I'm convinced but I'll definitely be giving it more thought.

import java.util.Comparator;
import java.util.List;
import java.util.Objects;

import static java.util.Objects.isNull;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general static imports should be avoided as they make code harder to read.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replaced by operators as was discussed here.

final Integer hiTagValueOfSecond = secondOfPair.get(i).getIntegerAttribute(SAMTag.HI.name());

if (nonNull(hiTagValueOfFirst) && nonNull(hiTagValueOfSecond)) {
if (hiTagValueOfFirst < hiTagValueOfSecond) secondOfPair.add(i, null);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original code didn't do this, but the new code should use { } around all if/else body statements, even if they're a single line.

Copy link
Author

@DenisVerkhoturov DenisVerkhoturov Sep 20, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done here.

return nonNull(first) && nonNull(second)
&& first.getMateAlignmentStart() == second.getAlignmentStart()
&& first.getAlignmentStart() == second.getMateAlignmentStart()
&& Objects.equals(first.getReferenceName(), second.getMateReferenceName())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why use Object.equals() here? is it possible that one of these is null?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since both SamRecod#mReferenceName and SamRecord#mMateReferenceName are nullable we need to ensure that they are non-null or simply delegate it to Objects.equals to avoid NullPointerException. This is a theoretical case that may hardly ever occur in runtime. So, do we need to prevent this case or can I unwrap this check?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's OK, I was wondering if this was added to fix a known problem or was just precautionary.

Looking at SamRecord I see that it's common to check these fields for null so it seems wise to do it here too.

It would be great if picard had nullable constraints in the code base! That would be a huge PR though.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional can be rather helpful here and it would be just great if we could use Try and Lazy types from vavr to reduce service-code and to clarify Picard logic!=D

/**
* @return true if records belong to the same pairwise alignment
*/
private static boolean arePair(SAMRecord first, SAMRecord second) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like it should be a non-static method on SAMRecord, as it only deals with fields in that object and takes that object as its argument.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SAMRecord is part of htsjdk and this was the only constraint I haven't done it. Could I just make this function a SamRecord#arePair method and make associated pull-request to htsjdk?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please review associated request to htsjdk?

@coveralls
Copy link

coveralls commented Sep 20, 2017

Coverage Status

Coverage decreased (-0.04%) to 76.726% when pulling e418168 on EpamLifeSciencesTeam:epam-ls_MergeBamAlignment into 9a7ba13 on broadinstitute:master.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.04%) to 76.726% when pulling 11a9c1f on EpamLifeSciencesTeam:epam-ls_MergeBamAlignment into 9a7ba13 on broadinstitute:master.

1 similar comment
@coveralls
Copy link

Coverage Status

Coverage decreased (-0.04%) to 76.726% when pulling 11a9c1f on EpamLifeSciencesTeam:epam-ls_MergeBamAlignment into 9a7ba13 on broadinstitute:master.

@coveralls
Copy link

coveralls commented Sep 20, 2017

Coverage Status

Coverage decreased (-0.03%) to 76.73% when pulling 96bc12a on EpamLifeSciencesTeam:epam-ls_MergeBamAlignment into 9a7ba13 on broadinstitute:master.

@yfarjoun
Copy link
Contributor

@DenisVerkhoturov there are many "esthetic" changes mixed in with substantive changes. Could you please rebase and separate into two commits, one that is purely esthetic and one that is substantive? it will make it much easier to understand what you are doing.

@DenisVerkhoturov
Copy link
Author

@yfarjoun Sure, done=)

@coveralls
Copy link

coveralls commented Sep 22, 2017

Coverage Status

Coverage decreased (-0.03%) to 77.699% when pulling e4c8c0d on EpamLifeSciencesTeam:epam-ls_MergeBamAlignment into b63d73e on broadinstitute:master.

}

/**
* Identifies weather records present pairwise alignment or not.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

weather -=> whether

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, fixed here.

void coordinateByMate() {
final List<SAMRecord> newFirstOfPairOrFragment = new ArrayList<>();
final List<SAMRecord> newSecondOfPair = new ArrayList<>();

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this file preferred ++i to i++.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you might see (here for instance), there was a usage of a postfix increment in this file before. So I just followed this example.
All increment operators in this file changed to prefix form.


hitsForInsert.addFirstOfPairOrFragment(second);
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is 3 some function of HitsForInsert::numHits?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, fixed here.


// Now renumber any correlated alignments, and remove hit index if no correlated read.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for breaking this out and cleaning up hi, and so on.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your feedback=)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants