
Issue 416 - Parallelise evidence string generation #421

Merged: 5 commits merged into EBIvariation:master from parallel-evidence, Mar 7, 2024

Conversation

@apriltuesday (Contributor) commented Mar 1, 2024

Closes #416
Closes #420

@apriltuesday marked this pull request as ready for review March 4, 2024 15:23
@apriltuesday self-assigned this Mar 4, 2024
@apriltuesday requested a review from tcezard March 4, 2024 15:23
@tcezard (Member) left a comment:


looks good.

Comment on lines +76 to +85
def dump_to_file(self, dir_out, filename=COUNTS_FILE_NAME):
with open(os.path.join(dir_out, filename), 'w') as f:
yaml.safe_dump(vars(self), f)

def load_from_file(self, filename):
with open(filename, 'r') as f:
data = yaml.safe_load(f)
self.__dict__.update(**data)
# yaml loads a dict, convert to counter
self.unmapped_trait_names = Counter(self.unmapped_trait_names)
@tcezard (Member) commented:

I was going to say that pickling the object might be safer, but I'm guessing that you want to use the final output YAML elsewhere?

@apriltuesday (Contributor, Author):

Yes, I was thinking that having it in YAML might be more readable and flexible for downstream purposes (e.g. updating the metrics or generating Sankey diagrams), but it's probably overkill if those processes end up implemented in Python anyway... I'd probably keep it as it is for now.
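
For reference, a minimal sketch of how a downstream script could consume the dumped counts file. The filename counts.yml is an assumption (the actual value of COUNTS_FILE_NAME isn't shown in this diff); only the unmapped_trait_names key comes from the snippet above.

import yaml
from collections import Counter

# Hypothetical output path; substitute whatever COUNTS_FILE_NAME resolves to.
with open('counts.yml') as f:
    data = yaml.safe_load(f)

# yaml.safe_load returns a plain dict, so rebuild the Counter
# the same way load_from_file does.
unmapped_trait_names = Counter(data.get('unmapped_trait_names', {}))
print(unmapped_trait_names.most_common(10))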

@apriltuesday (Contributor, Author) commented:

Counts from the test run match the last submission, aside from a very silly error that I just fixed.
Runtime was about 14 hours for the entire pipeline, as opposed to about 26 hours before. About 2 hours of that is consequence prediction, so we've basically halved the run time for evidence generation. I think it's definitely still worth speeding up the JSON validation.
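
For readers skimming the thread, the general shape of chunk-based parallelisation is sketched below. This is an illustration only, not the code merged in this PR; the names process_chunk, run_in_parallel and chunk_size are hypothetical, and the real per-record logic lives in the evidence string generation module.

from multiprocessing import Pool

def process_chunk(records):
    # Hypothetical worker: turn one chunk of records into evidence strings.
    # A stand-in dict is used here in place of the real evidence string builder.
    return [{'id': r} for r in records]

def run_in_parallel(all_records, n_processes=8, chunk_size=1000):
    # Split the input into chunks so each worker process handles a batch,
    # then flatten the per-chunk results back into one list.
    chunks = [all_records[i:i + chunk_size] for i in range(0, len(all_records), chunk_size)]
    with Pool(processes=n_processes) as pool:
        per_chunk = pool.map(process_chunk, chunks)
    return [evidence for chunk in per_chunk for evidence in chunk]

if __name__ == '__main__':
    print(len(run_in_parallel(list(range(10_000)))))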

@apriltuesday merged commit 1e1afdb into EBIvariation:master Mar 7, 2024
1 check passed
@apriltuesday deleted the parallel-evidence branch March 7, 2024 10:03

Successfully merging this pull request may close these issues:

- Properly count and exclude suppressed / flagged RCVs
- Scale up evidence string generation
2 participants