
tool that interfaces with scheduler for long-running tasks #33

Open
ltalirz opened this issue May 10, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

@ltalirz
Collaborator

ltalirz commented May 10, 2024

Motivation

The current implementation of tools works for fast toy calculations, but scientifically relevant calculations in chemistry and materials science often make tradeoffs between compute cost and accuracy that result in calculations running for several hours or days, even on powerful hardware.

In the current implementation, the notebook is blocked for the duration of the calculation, and the calculation is killed once the IPython kernel is stopped.

We would therefore like langsim to be able to submit computationally intensive tasks to remote scheduling systems, check the status of these calculations, and retrieve the result once they have completed.

Thoughts

I think this is a tough one to make user friendly, particularly if you think about the original target audience: an experimentalist wanting to run calculations.
Do we ask them to install Slurm on their local workstation (they may be running Windows)? Do they need to apply for compute time on an HPC resource (and then figure out how to run the simulation code they need there)? I think with such asks we already lose a large fraction of the target audience.

The only feasible way I see for letting someone without HPC expertise run on HPC is either

  • A) Their computational colleagues configure the setup for them
  • B) They get some cloud account and let langsim connect to a dedicated cloud service for running DFT calculations with a well-defined API [1]

That said, adding basic functionality for interacting with schedulers is certainly feasible, if the user can provide all the necessary information (credentials for connecting, scheduler type, partitions they have access to, where codes are located, etc.).
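To make this concrete, here is a minimal sketch of what such scheduler support could start from: assembling a Slurm batch script from user-supplied settings. The function name `build_sbatch_script` and all parameters are illustrative, not part of langsim; submission and status polling would happen over SSH (or a REST API, see below).

```python
# Hypothetical sketch: build a Slurm batch script from user-provided
# settings. Nothing here is langsim API; it only illustrates the kind of
# information the user would have to supply (partition, walltime, command).
def build_sbatch_script(job_name, partition, walltime, command, ntasks=1):
    """Return the text of a minimal Slurm batch script."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --partition={partition}",
        f"#SBATCH --time={walltime}",
        f"#SBATCH --ntasks={ntasks}",
        "",
        command,
    ]
    return "\n".join(lines) + "\n"

script = build_sbatch_script("relax", "compute", "02:00:00", "srun pw.x -in pw.in")

# Submission would then go over SSH, e.g.
#   ssh user@cluster sbatch job.sh
# and the returned job id could be polled with `squeue -j <id>` and the
# results fetched with scp/rsync once the job has finished.
```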

There is some light at the end of the tunnel, as academic HPC centers, too, are moving from giving users SSH access to offering REST APIs (example), but this process is still underway and to my knowledge no clear standard has emerged.

Also, none of the APIs I've seen so far offer a mechanism for discovering which simulation codes are installed and how to `module load` them... perhaps we could draft a specification for what we would like such an API to look like and then approach HPC centers with this idea.
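Purely as a strawman for such a specification (every field name here is made up), a code-discovery endpoint might return something like:

```json
{
  "codes": [
    {
      "name": "quantum-espresso",
      "version": "7.2",
      "module": "module load qe/7.2",
      "executables": ["pw.x", "ph.x"]
    }
  ]
}
```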

[1] Or, if that is not available, some HPC cluster template with pre-installed software in standard locations (e.g. there are interesting efforts like the CernVMFS build cache from Compute Canada or also the spack build caches), but that already adds a lot of complexity.

@ltalirz ltalirz added the enhancement New feature or request label May 10, 2024
@chiang-yuan
Collaborator

chiang-yuan commented May 24, 2024

I recommend prefect.io as a pythonic way to submit and monitor custom jobs!
It also supports different job management systems and can be orchestrated both locally and on HPC.

@jan-janssen
Owner

Over the last two years I have worked on a library as part of the exascale project to address this challenge. We are currently in the process of merging the different parts together. Basically, it follows the concurrent futures Executor design from the Python standard library and extends it with the option to assign HPC resources (GPUs, MPI-parallel codes, thread-parallel codes), as well as to use the future object of one function as an input to the next function to realize dependencies:
https://pympipool.readthedocs.io/en/latest/examples.html#coupled-functions
This currently works inside the allocation of a given queuing system using the flux-framework scheduler, and we are extending it to run outside the queuing system:
https://github.com/pyiron-dev/remote-executor/blob/main/example.ipynb
In that case the queuing system handles all the dependencies of the individual tasks, so no daemon process is required.
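The dependency pattern described above can be sketched with the standard-library `concurrent.futures` alone. Note this is only an illustration of the design: pympipool resolves futures passed as arguments automatically, whereas here we emulate the dependency by waiting on the first future inside the second task.

```python
# Sketch of the "future of one function feeds the next" pattern using only
# the Python standard library. In pympipool a future can be passed directly
# as an argument; here we wait on it explicitly inside the dependent task.
from concurrent.futures import ThreadPoolExecutor

def add(a, b):
    return a + b

def double(x):
    return 2 * x

with ThreadPoolExecutor(max_workers=2) as pool:
    fut1 = pool.submit(add, 1, 2)                       # first task
    fut2 = pool.submit(lambda: double(fut1.result()))   # depends on fut1
    result = fut2.result()                              # -> 6
```

An HPC-aware executor extends exactly this interface with resource arguments (number of cores, GPUs, etc.) per submitted function.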
