Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changed CrosscheckFingerprints hashing to linked hashing #1892

Merged
merged 1 commit into from
Nov 14, 2023

Conversation

tmelman
Copy link
Contributor

@tmelman tmelman commented Jun 13, 2023

Issue: CrosscheckFingerprints produces results in a different order each time.

Change: Switched from HashMap and HashSet to Java's LinkedHashMap and LinkedHashSet, respectively

Intended result: Produce stable sorted output.

Issue: #1854

Checklist (never delete this)

Never delete this, it is our record that procedure was followed. If you find that for whatever reason one of the checklist points doesn't apply to your PR, you can leave it unchecked but please add an explanation below.

Content

  • Added or modified tests to cover changes and any new functionality
  • Edited the README / documentation (if applicable)
  • All tests passing on github actions

Review

  • Final thumbs-up from reviewer
  • Rebase, squash and reword as applicable

For more detailed guidelines, see https://github.com/broadinstitute/picard/wiki/Guidelines-for-pull-requests

@tmelman tmelman linked an issue Jun 13, 2023 that may be closed by this pull request
@serge2016
Copy link

Great!

@lbergelson
Copy link
Member

@tmelman As a heads up, Collectors.toSet() is cursed by HashSet, so you may have to replace instances of stream collection with Collectors.toCollection(LinkedHashSet::new)) or continue to suffer the horrors in non-deterministic behavior.

@lbergelson
Copy link
Member

I looked at this a bit more, it seems like there is instability built in at the start since FingerPrintChecker.fingerprintFiles() builds an unsorted map from the output of a parallel and unpredictable process. I think it's going to require a more complicated change than just changing the other sets to linked sets. Either change the inputs to return their results in order, or sort the maps according to some criteria.

@lbergelson
Copy link
Member

I think we either need to change it so the input maps are built as the result of Futures instead of being added to the concurrent map as the are built. The errors are actually handled this way so it wouldn't be that hard to change.

The other option would be to pick an arbitrary sort order and sort the input maps after they're built and then maintain that order.

@droazen droazen marked this pull request as ready for review November 14, 2023 18:14
@droazen droazen merged commit 5721517 into master Nov 14, 2023
@droazen droazen deleted the tm_stable_order_crosscheckmetrics branch November 14, 2023 18:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Order of rows in crosscheck metrics files are non-deterministic
4 participants