Improve multi-process handling across CDMs #121

Open
azimov opened this issue Jan 30, 2024 · 0 comments

azimov commented Jan 30, 2024

Running a job across multiple CDMs does not take advantage of the multi-process execution that targets provides (which would also support clusters).
As a result, job execution often spends time waiting on I/O from the CDMs, or users have to configure execution across the CDMs manually.

The individual tasks are also not fine-grained enough to separate I/O-blocking tasks from CPU-intensive tasks (such as PLP, SCCS or CohortMethod, which operate on local Andromeda objects with multiple processes).

Some aspects of this may be related to having multiple steps within an individual analytics package, which would be difficult to resolve with the way tasks are currently set up. Even so, exposing the targets workflow could be significantly improved by using meta-targets and internal targets functions to spawn multiple jobs.

This would also give us the advanced functionality of targets (e.g. use of SLURM clusters to execute multiple jobs). Even if it didn't, it would be healthy to decouple our execution infrastructure from targets, which is currently only really being used for dependency trees.

Current approach

  • Send study execution per CDM to Strategus
  • Strategus creates the targets tasks script internally
  • Strategus executes the tasks by calling targets

Proposed approach

  • Strategus takes 1) the analysis script and 2) the CDMs to execute on
  • Strategus creates the list of targets for the targets file across the configured CDMs
  • User executes targets::tar_make in a custom way (or the call is just masked by targets).

Note that in both cases, uploading results to a results database and running the meta-analysis step are still optional target types. However, in the latter case we will be able to clearly see a dependent task for all CDM executions.
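
To make the proposed approach concrete, here is a minimal sketch of what a generated targets file might look like. It is not how Strategus works today. It assumes the tarchetypes package is used for static branching, and `run_study_on_cdm`, `summarise_across_cdms` and the CDM names are hypothetical placeholders invented for illustration.

```r
# _targets.R (illustrative sketch; all functions and CDM names are hypothetical)
library(targets)
library(tarchetypes)

# Hypothetical stand-in for executing the analysis specification on one CDM.
run_study_on_cdm <- function(cdm_name) {
  data.frame(cdm = cdm_name, status = "complete")
}

# Hypothetical stand-in for the optional meta-analysis / results-upload step.
summarise_across_cdms <- function(...) {
  do.call(rbind, unname(list(...)))
}

# Assumed configuration: the CDMs to execute on.
cdms <- data.frame(cdm_name = c("cdm_a", "cdm_b"), stringsAsFactors = FALSE)

# One target per CDM; these targets do not depend on each other.
per_cdm <- tar_map(
  values = cdms,
  names = cdm_name,
  tar_target(cdm_result, run_study_on_cdm(cdm_name))
)

# A single downstream target that visibly depends on every CDM execution.
combined <- tar_combine(
  meta_analysis,
  per_cdm[["cdm_result"]],
  command = summarise_across_cdms(!!!.x)
)

# A targets script must end with the list of targets.
list(per_cdm, combined)
```

With a file like this in the project directory, the user would call `targets::tar_make()` themselves. Because the per-CDM targets are independent, a parallel backend configured through `tar_option_set()` (for example a crew controller) should be able to run them concurrently, though that has not been verified against Strategus.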

Stretch goal

Allow package maintainers to split out tasks within analytics packages by exposing an interface that allows targets to see them. For example, in PLP there can be a single-process task "pull covariates" and a multi-process task "train models". Internally, we still take advantage of multithreaded calls, e.g. in C++ code or external libraries. In this case, however, there are multiple models that use independent parameters and/or hyperparameters, so they will finish execution at different times. The same applies to any study that has many Target/Comparator/Indication comparisons, each of which needs an independent propensity score model, for example.
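
As a rough illustration of the stretch goal, the sketch below splits a PLP-style analysis into one upstream target and one dynamically branched target per model. It does not use the PatientLevelPrediction API. `pull_covariates`, `train_model` and the model names are hypothetical placeholders, and only the `tar_target()` arguments shown are real targets features.

```r
# _targets.R (illustrative sketch; functions and model names are hypothetical)
library(targets)

# Hypothetical I/O-bound step: extract covariates from the CDM once.
pull_covariates <- function() {
  data.frame(x = c(0.1, 0.4, 0.9), outcome = c(0, 0, 1))
}

# Hypothetical CPU-bound step: fit one model with its own settings.
train_model <- function(covariates, model_name) {
  list(model = model_name, n_rows = nrow(covariates))
}

list(
  # Runs on the main process rather than a worker when a parallel backend is used.
  tar_target(covariates, pull_covariates(), deployment = "main"),
  # The models to train; each element becomes its own branch below.
  tar_target(model_name, c("lasso", "gradient_boosting")),
  # One branch per model, so each can finish independently of the others.
  tar_target(
    trained_model,
    train_model(covariates, model_name),
    pattern = map(model_name),
    iteration = "list"
  )
)
```

Because each branch is a separate target, a slow model should not hold up the others, and a failed branch can be rerun without repeating the covariate extraction.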

@anthonysena anthonysena added this to the v1.0.0 milestone Jan 30, 2024
@anthonysena anthonysena modified the milestones: v1.0.0, Backlog Aug 8, 2024