Issue 416 - Parallelise evidence string generation #421
Merged: apriltuesday merged 5 commits into EBIvariation:master from apriltuesday:parallel-evidence, Mar 7, 2024.
Changes shown are from 3 of the 5 commits.

Commits:
32b8d54  chunk and parallelise evidence string generation (apriltuesday)
089ad47  output and aggregate counts for evidence strings (apriltuesday)
6d6e450  add tests and cleanup nextflow (apriltuesday)
ddd9eaa  adjust counts to appropriately skip invalid clinical significance values (apriltuesday)
0668213  bugfix for summing reports (apriltuesday)
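To make the approach concrete: the commits describe splitting the ClinVar XML into chunks, generating evidence strings per chunk in parallel, and aggregating per-chunk counts afterwards. A minimal sketch of that fan-out/fan-in shape follows; it is illustrative only — generate_evidence_for_chunk and the chunk file names are hypothetical, and the pipeline itself is driven by Nextflow (per the commit messages) rather than multiprocessing.

from multiprocessing import Pool

def generate_evidence_for_chunk(chunk_path):
    """Hypothetical worker: process one XML chunk and return its counts."""
    counts = {'evidence_string_count': 0}
    # ... parse the records in chunk_path, write evidence strings, update counts ...
    return counts

if __name__ == '__main__':
    chunk_paths = ['chunk_0.xml.gz', 'chunk_1.xml.gz']  # hypothetical chunk files
    with Pool() as pool:
        per_chunk = pool.map(generate_evidence_for_chunk, chunk_paths)
    # Fan-in: combine the per-chunk counts, as the aggregation script below does.
    print(sum(c['evidence_string_count'] for c in per_chunk))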
New file:
@@ -0,0 +1,29 @@
#!/usr/bin/env python3

import argparse
import glob

from cmat.output_generation.report import Report

parser = argparse.ArgumentParser('Aggregate counts reports')
parser.add_argument('--counts-yml', nargs='+', help='YAML files containing intermediate counts', required=True)


if __name__ == '__main__':
    args = parser.parse_args()

    # Load all the reports
    filenames = [f for files in args.counts_yml for f in glob.glob(files)]
    reports = []
    for filename in filenames:
        r = Report()
        r.load_from_file(filename)
        reports.append(r)

    # Sum them up and output the results
    complete_report = sum(reports, start=Report())
    complete_report.print_report()
    complete_report.dump_to_file(dir_out='.')
    complete_report.write_unmapped_terms(dir_out='.')
    if not complete_report.check_counts():
        raise RuntimeError('Aggregate counts not consistent')
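In effect the script folds the per-chunk reports into one via Report.__add__. A toy version of the same fold (assumes the cmat package is importable; the counts are invented):

from cmat.output_generation.report import Report

# Two per-chunk reports, as parallel workers would produce them.
r1, r2 = Report(), Report()
r1.evidence_string_count = 10
r2.evidence_string_count = 5

# Report defines __add__, so sum() can fold any number of chunk reports.
combined = sum([r1, r2], start=Report())
assert combined.evidence_string_count == 15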
New file:
@@ -0,0 +1,13 @@
#!/usr/bin/env python3

import argparse

from cmat.clinvar_xml_io import ClinVarDataset

parser = argparse.ArgumentParser('Count number of RCV records in the XML, print to stdout')
parser.add_argument('--clinvar-xml', help='ClinVar XML release', required=True)


if __name__ == '__main__':
    args = parser.parse_args()
    print(sum(1 for _ in ClinVarDataset(args.clinvar_xml)))
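Presumably this count is what lets the pipeline split the XML into evenly sized chunks up front; the arithmetic is plain ceiling division (a sketch with invented numbers):

import math

total_records = 2_500_000  # example output of this counting script
num_chunks = 64            # hypothetical parallelism level

# Ceiling division so the final chunk absorbs any remainder.
chunk_size = math.ceil(total_records / num_chunks)
print(chunk_size)  # 39063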
New file:
@@ -0,0 +1,138 @@
import logging
import os
from collections import Counter

import yaml

logger = logging.getLogger(__package__)

# Add representers to enable safe-dump
yaml.SafeDumper.add_representer(Counter,
                                lambda dumper, data: dumper.represent_mapping('tag:yaml.org,2002:map', data.items()))

UNMAPPED_TRAITS_FILE_NAME = 'unmapped_traits.tsv'
COUNTS_FILE_NAME = 'counts.yml'


class Report:
    """Holds counters and other records for a pipeline run."""

    def __init__(self, trait_mappings=None, consequence_mappings=None):
        # The main evidence string counter.
        self.evidence_string_count = 0
        # Complete evidence strings are ones with an EFO mapping.
        self.complete_evidence_string_count = 0

        # ClinVar record counters.
        self.clinvar_total = 0
        self.clinvar_fatal_no_valid_traits = 0
        self.clinvar_skip_unsupported_variation = 0
        self.clinvar_skip_no_functional_consequences = 0
        self.clinvar_skip_missing_efo_mapping = 0
        self.clinvar_skip_invalid_evidence_string = 0
        self.clinvar_done_one_complete_evidence_string = 0
        self.clinvar_done_multiple_complete_evidence_strings = 0
        self.clinvar_fatal = 0
        self.clinvar_skipped = 0
        self.clinvar_done = 0

        # Total number of trait-to-ontology mappings present in the database.
        self.total_trait_mappings = 0
        if trait_mappings:
            self.total_trait_mappings = sum([len(mappings) for mappings in trait_mappings.values()])
        # All distinct (trait name, EFO ID) mappings used in the evidence strings.
        self.used_trait_mappings = set()
        # All unmapped trait names which prevented evidence string generation, and their counts.
        self.unmapped_trait_names = Counter()

        # Variant-to-consequence mapping counts.
        self.total_consequence_mappings = 0
        if consequence_mappings:
            self.total_consequence_mappings = sum([len(mappings) for mappings in consequence_mappings.values()])
        self.repeat_expansion_variants = 0
        self.structural_variants = 0

    def __eq__(self, other):
        if not isinstance(other, Report):
            return NotImplemented
        return self.__dict__ == other.__dict__

    def __add__(self, other):
        if not isinstance(other, Report):
            return NotImplemented
        result = Report()
        for var_name in vars(self).keys() | vars(other).keys():
            if var_name == 'total_trait_mappings':
                result.total_trait_mappings = max(self.total_trait_mappings, other.total_trait_mappings)
            elif var_name == 'total_consequence_mappings':
                result.total_consequence_mappings = max(self.total_consequence_mappings,
                                                        other.total_consequence_mappings)
            elif var_name == 'used_trait_mappings':
                result.used_trait_mappings = self.used_trait_mappings | other.used_trait_mappings
            else:
                result.__setattr__(var_name, self.__getattribute__(var_name) + other.__getattribute__(var_name))
        return result

    def dump_to_file(self, dir_out, filename=COUNTS_FILE_NAME):
        with open(os.path.join(dir_out, filename), 'w') as f:
            yaml.safe_dump(vars(self), f)

    def load_from_file(self, filename):
        with open(filename, 'r') as f:
            data = yaml.safe_load(f)
        self.__dict__.update(**data)
        # yaml loads a dict, convert to Counter
        self.unmapped_trait_names = Counter(self.unmapped_trait_names)

    def compute_record_tallies(self):
        """Compute tallies of records fatal/skipped/done based on the more granular counts."""
        self.clinvar_fatal = self.clinvar_fatal_no_valid_traits
        self.clinvar_skipped = (self.clinvar_skip_unsupported_variation + self.clinvar_skip_no_functional_consequences +
                                self.clinvar_skip_missing_efo_mapping + self.clinvar_skip_invalid_evidence_string)
        self.clinvar_done = (self.clinvar_done_one_complete_evidence_string +
                             self.clinvar_done_multiple_complete_evidence_strings)

    def check_counts(self):
        """Return True if counts are consistent, False otherwise."""
        self.compute_record_tallies()
        expected_total = self.clinvar_fatal + self.clinvar_skipped + self.clinvar_done
        if expected_total != self.clinvar_total:
            logger.error(f'ClinVar evidence string tallies do not add up to the total amount: '
                         f'fatal + skipped + done = {expected_total}, total = {self.clinvar_total}')
            return False
        return True

    def print_report(self):
        """Print report of counts."""
        self.compute_record_tallies()
        report = f'''Total number of evidence strings generated\t{self.evidence_string_count}
            Total number of complete evidence strings generated\t{self.complete_evidence_string_count}

            Total number of ClinVar records\t{self.clinvar_total}
                Fatal: No traits with valid names\t{self.clinvar_fatal_no_valid_traits}
                Skipped: Can be rescued by future improvements\t{self.clinvar_skipped}
                    Unsupported variation type\t{self.clinvar_skip_unsupported_variation}
                    No functional consequences\t{self.clinvar_skip_no_functional_consequences}
                    Missing EFO mapping\t{self.clinvar_skip_missing_efo_mapping}
                    Invalid evidence string\t{self.clinvar_skip_invalid_evidence_string}
                Done: Generated at least one complete evidence string\t{self.clinvar_done}
                    One complete evidence string\t{self.clinvar_done_one_complete_evidence_string}
                    Multiple complete evidence strings\t{self.clinvar_done_multiple_complete_evidence_strings}
            Percentage of all potentially supportable ClinVar records which generated at least one complete evidence string\t{
                self.clinvar_done / (self.clinvar_skipped + self.clinvar_done):.1%}

            Total number of trait-to-ontology mappings in the database\t{self.total_trait_mappings}
            The number of distinct trait-to-ontology mappings used in the evidence strings\t{
                len(self.used_trait_mappings)}
            The number of distinct unmapped trait names which prevented complete evidence string generation\t{
                len(self.unmapped_trait_names)}

            Total number of variant to consequence mappings\t{self.total_consequence_mappings}
            Number of repeat expansion variants\t{self.repeat_expansion_variants}
            Number of structural variants\t{self.structural_variants}'''.replace('\n' + ' ' * 12, '\n')
        print(report)

    def write_unmapped_terms(self, dir_out):
        with open(os.path.join(dir_out, UNMAPPED_TRAITS_FILE_NAME), 'w') as unmapped_traits_file:
            for trait, number_of_occurrences in sorted(self.unmapped_trait_names.items(), key=lambda x: -x[1]):
                unmapped_traits_file.write(f'{trait}\t{number_of_occurrences}\n')
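One detail in __add__ worth spelling out: run-wide totals take the max, because every chunk reports the same database-wide figure and summing would multiply it by the number of chunks; set-valued fields take the union; the remaining integer and Counter fields are simply summed. A small illustration (assumes cmat is importable; the values are invented):

from collections import Counter
from cmat.output_generation.report import Report

a, b = Report(), Report()
a.total_trait_mappings = b.total_trait_mappings = 100   # same run-wide total in each chunk
a.used_trait_mappings = {('trait A', 'EFO:1')}
b.used_trait_mappings = {('trait A', 'EFO:1'), ('trait B', 'EFO:2')}
a.unmapped_trait_names = Counter({'mystery trait': 2})
b.unmapped_trait_names = Counter({'mystery trait': 1})

merged = a + b
assert merged.total_trait_mappings == 100                 # max, not 200
assert len(merged.used_trait_mappings) == 2               # set union
assert merged.unmapped_trait_names['mystery trait'] == 3  # Counter sum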
Review comment:
I was going to say that pickling the object might be safer, but I'm guessing you want to use the final output YAML elsewhere?
Reply:
Yes, I was thinking that having it in YAML might be more readable and flexible for downstream purposes (e.g. updating the metrics or generating Sankey diagrams), but it's probably overkill if those processes end up implemented in Python anyway... I'd probably keep it as is for now.
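For reference, the trade-off being weighed here: the YAML dump stays human-readable and can be consumed downstream without importing cmat at all, while pickle would round-trip the object exactly but is opaque and Python-only. A sketch of the downstream case (assumes a counts.yml produced by the pipeline is present):

import yaml

# Downstream tooling can read the aggregated counts with no cmat dependency.
with open('counts.yml') as f:
    counts = yaml.safe_load(f)
print(counts['clinvar_total'])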