
tool that interfaces with scheduler for long-running tasks #33

Open
ltalirz opened this issue May 10, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

@ltalirz
Collaborator

ltalirz commented May 10, 2024

Motivation

The current implementation of tools works for fast toy calculations, but scientifically relevant calculations in chemistry and materials science often make tradeoffs between compute cost and accuracy that result in calculations running for several hours or days, even on powerful hardware.

In the current implementation, the notebook is blocked for the duration of the calculation, and the calculation is killed once the IPython kernel is stopped.

We would therefore like langsim to be able to submit computationally intensive tasks to remote scheduling systems, check the status of these calculations, and retrieve the result once they have completed.

Thoughts

I think this is a tough one to make user friendly, particularly if you think about the original target audience: an experimentalist wanting to run calculations.
Do we ask them to install Slurm on their local workstation (they may be running Windows)? Do they need to apply for compute time on an HPC resource (and then figure out how to run the simulation code they need there)? I think with such asks we already lose a large fraction of the target audience.

The only feasible way I see for letting someone without HPC expertise run on HPC is either

  • A) Their computational colleagues configure the setup for them
  • B) They get some cloud account and let langsim connect to a dedicated cloud service for running DFT calculations with a well-defined API [1]

That said, adding basic functionality for interacting with schedulers is certainly feasible, if the user can provide all the necessary information (credentials for connecting, scheduler type, partitions they have access to, where codes are located, etc.).
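To make this concrete, here is a minimal sketch of what such scheduler support could start from: assembling a Slurm batch script from user-supplied settings. The function name `build_sbatch_script` and all parameters are illustrative, not part of langsim; submission and status polling would happen over SSH (or a REST API, see below).

```python
# Hypothetical sketch: build a Slurm batch script from user-provided
# settings. Nothing here is langsim API; it only illustrates the kind of
# information the user would have to supply (partition, walltime, command).
def build_sbatch_script(job_name, partition, walltime, command, ntasks=1):
    """Return the text of a minimal Slurm batch script."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --partition={partition}",
        f"#SBATCH --time={walltime}",
        f"#SBATCH --ntasks={ntasks}",
        "",
        command,
    ]
    return "\n".join(lines) + "\n"

script = build_sbatch_script("relax", "compute", "02:00:00", "srun pw.x -in pw.in")

# Submission would then go over SSH, e.g.
#   ssh user@cluster sbatch job.sh
# and the returned job id could be polled with `squeue -j <id>` and the
# results fetched with scp/rsync once the job has finished.
```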

There is some light at the end of the tunnel, as academic HPC centers, too, are moving from giving users SSH access to offering REST APIs (example), but this process is still underway and to my knowledge no clear standard has emerged.

Also, none of the APIs I've seen so far offer a mechanism for discovering which simulation codes are installed and how to `module load` them... perhaps we could draft a specification for what we would like such an API to look like and then approach HPC centers with this idea.
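Purely as a strawman for such a specification (every field name here is made up), a code-discovery endpoint might return something like:

```json
{
  "codes": [
    {
      "name": "quantum-espresso",
      "version": "7.2",
      "module": "module load qe/7.2",
      "executables": ["pw.x", "ph.x"]
    }
  ]
}
```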

[1] Or, if that is not available, some HPC cluster template with pre-installed software in standard locations (e.g. there are interesting efforts like the CernVMFS build cache from Compute Canada or also the spack build caches), but that already adds a lot of complexity.

@ltalirz ltalirz added the enhancement New feature or request label May 10, 2024
@chiang-yuan
Collaborator

chiang-yuan commented May 24, 2024

I recommend prefect.io as a pythonic way to submit and monitor custom jobs!
It also supports different job management systems and can be orchestrated both locally and on HPC.

@jan-janssen
Owner

Over the last two years I have worked on a library as part of the exascale project to address this challenge. We are currently in the process of merging the different parts together. Basically, it follows the concurrent futures Executor design from the Python standard library and extends it with the option to assign HPC resources (GPUs, MPI-parallel codes, thread-parallel codes), as well as to use the future object of one function as an input to the next function to realize dependencies:
https://pympipool.readthedocs.io/en/latest/examples.html#coupled-functions
This currently works inside the allocation of a given queuing system using the flux-framework scheduler, and we are extending it to run outside the queuing system:
https://github.com/pyiron-dev/remote-executor/blob/main/example.ipynb
In that case the queuing system handles all the dependencies of the individual tasks, so no daemon process is required.
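The dependency pattern described above can be sketched with the standard-library `concurrent.futures` alone. Note this is only an illustration of the design: pympipool resolves futures passed as arguments automatically, whereas here we emulate the dependency by waiting on the first future inside the second task.

```python
# Sketch of the "future of one function feeds the next" pattern using only
# the Python standard library. In pympipool a future can be passed directly
# as an argument; here we wait on it explicitly inside the dependent task.
from concurrent.futures import ThreadPoolExecutor

def add(a, b):
    return a + b

def double(x):
    return 2 * x

with ThreadPoolExecutor(max_workers=2) as pool:
    fut1 = pool.submit(add, 1, 2)                       # first task
    fut2 = pool.submit(lambda: double(fut1.result()))   # depends on fut1
    result = fut2.result()                              # -> 6
```

An HPC-aware executor extends exactly this interface with resource arguments (number of cores, GPUs, etc.) per submitted function.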
