tool that interfaces with scheduler for long-running tasks #33
Comments
I recommend prefect.io as a pythonic way to submit and monitor custom jobs!
Over the last two years I have worked on a library as part of the exascale project to address this challenge. We are currently in the process of merging the different parts together. Basically, it follows the concurrent.futures executor design from the Python standard library and extends it with the option to assign HPC resources (GPUs, MPI-parallel codes, thread-parallel codes), as well as to use the future object of one function as an input to the next function to realise dependencies:
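To illustrate the pattern being described, here is a minimal sketch using only the standard library's `concurrent.futures`; the resource-assignment extensions and passing futures directly as arguments are features of the library described above, so with the plain stdlib executor the dependency has to be resolved explicitly:

```python
from concurrent.futures import ThreadPoolExecutor

def add(a, b):
    return a + b

with ThreadPoolExecutor() as ex:
    f1 = ex.submit(add, 1, 2)
    # The library described extends this so that f1 itself could be passed
    # as an argument; with the plain stdlib executor we resolve it first.
    f2 = ex.submit(add, f1.result(), 3)
    result = f2.result()
```

The executor interface keeps user code unchanged while the backend decides where and with which resources each function actually runs.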
Motivation
The current implementation of the tools works for fast toy calculations, but scientifically relevant calculations in chemistry and materials science often trade compute cost against accuracy, resulting in calculations that run for several hours or days, even on powerful hardware.
In the current implementation, the notebook is blocked for the duration of the calculation, and the calculation is killed once the IPython kernel is stopped.
We would therefore like langsim to be able to submit computationally intensive tasks to remote scheduling systems, check the status of these calculations, and retrieve the results once they have completed.
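As a rough illustration of the submit / check-status / retrieve cycle, here is a hypothetical sketch that drives the Slurm command-line tools via `subprocess`; none of this exists in langsim, and the function names are invented:

```python
import subprocess

def submit_job(script_path):
    """Submit a batch script with sbatch and return the Slurm job id."""
    out = subprocess.run(["sbatch", script_path],
                         capture_output=True, text=True, check=True).stdout
    return parse_job_id(out)

def parse_job_id(sbatch_output):
    # sbatch prints e.g. "Submitted batch job 12345"
    return int(sbatch_output.strip().split()[-1])

def job_state(job_id):
    """Query the job state with squeue (-h: no header, %T: state only).
    An empty answer means the job is no longer in the queue."""
    out = subprocess.run(["squeue", "-h", "-j", str(job_id), "-o", "%T"],
                         capture_output=True, text=True).stdout.strip()
    return out or "COMPLETED"
```

Retrieval of results would then amount to copying output files back once `job_state` reports completion; over SSH, the same commands can be wrapped with `ssh user@host ...`.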
Thoughts
I think this is a tough one to make user-friendly, particularly if you think about the original target audience: an experimentalist wanting to run calculations.
Do we ask them to install Slurm on their local workstation (they may be running Windows)? Do they need to apply for compute time on an HPC resource (and then figure out how to run the simulation code they need there)? I think with such asks we already lose a large fraction of the target audience.
The only feasible way I see for letting someone without HPC expertise run on HPC is either
That said, adding the basic functionality for interacting with schedulers is certainly feasible, if the user can provide all necessary information (credentials for connecting, scheduler type, partitions they have access to, where codes are located, etc.).
There is some light at the end of the tunnel: academic HPC centers, too, are moving from giving users SSH access to offering REST APIs (example), but this process is still underway and to my knowledge no clear standard has emerged.
Also, none of the APIs I've seen so far offer a mechanism for discovering which simulation codes are installed and how to `module load` them... perhaps we could draft a specification for what we would like such an API to look like and then approach HPC centers with this idea.[1] Failing that, some HPC cluster template with pre-installed software in standard locations could work (e.g. there are interesting efforts like the CernVM-FS build cache from Compute Canada or the Spack build caches), but that already adds a lot of complexity.