CUDA-aware support #26798

mcacace · 2024-02-13T09:22:11Z

Dear all,
Quick question: I recently installed MOOSE (plus our applications) on the Leonardo booster module at CINECA in Italy. The installation went smoothly, but the tests fail when I try to run them. The errors are related to CUDA-aware support being disabled (the locally installed Open MPI module requires it by default). I was wondering whether there is a simple way to disable the CUDA-aware support at the TestHarness script level, or in other words, where I can add the --mca opal_warn_on_missing_libcuda 0 option. If not, I might go all the way and install Open MPI locally. Please keep in mind that this is not a top-priority discussion, since production runs are fine (at least as far as I have tested). I am only raising it because it is annoying not to be able to run the tests after a new installation (users keep asking me why the tests are failing...).
Thanks for any advice,
mauro

milljm · 2024-02-13T13:31:32Z

Maintainer

If I am reading you correctly, you may want to use:

export MOOSE_MPI_COMMAND='mpiexec --opal_warn_on_missing_libcuda 0'

This might work, if all you need is a way to add some arguments to mpiexec.
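As a minimal sketch of how this could look in practice, the snippet below sets the variable and then starts the tests. It uses the `--mca <name> <value>` form from the original question, which is how Open MPI's mpiexec normally takes MCA parameters. The `-j 4` job count is an arbitrary choice, and the commented-out `OMPI_MCA_` environment variable is an alternative Open MPI mechanism that was not tried in this thread.

```shell
# Tell the TestHarness which MPI launch command to use (variable name from this thread).
# Assumption: mpiexec comes from Open MPI, which accepts "--mca <name> <value>".
export MOOSE_MPI_COMMAND='mpiexec --mca opal_warn_on_missing_libcuda 0'
echo "MPI tests will be launched with: ${MOOSE_MPI_COMMAND}"

# Alternative (untested here): Open MPI also reads MCA parameters from
# environment variables named OMPI_MCA_<name>.
# export OMPI_MCA_opal_warn_on_missing_libcuda=0

# Run the tests only if the harness script exists in the current directory.
if [ -x ./run_tests ]; then
  ./run_tests -j 4
fi
```

Putting the export line in the job script or in a module file loaded by all users would spare each user from setting it by hand.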

If instead you need to pass arguments to your MOOSE-based application, I believe you're after the --cli-args TestHarness option:

./run_tests --help

<trimmed>
  --cli-args [CLI_ARGS] Append the following list of arguments to the 
                        command line (Encapsulate the command in quotes)
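A hedged sketch of how the option might be used follows. The override `Outputs/exodus=false` is purely illustrative (a hypothetical choice, not something recommended in this thread); substitute a parameter that your own input files define.

```shell
# Hypothetical override, for illustration only: replace with a parameter
# that exists in your input files.
CLI_ARGS='Outputs/exodus=false'

# The whole argument list is passed as one quoted string, as the help text asks.
if [ -x ./run_tests ]; then
  ./run_tests --cli-args "${CLI_ARGS}"
fi
```

Because the arguments are appended to the command line of every test, an override that changes results or outputs can make comparisons against reference files fail.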

I was looking for a decent doco page on our TestHarness, but this was all I could find: https://mooseframework.inl.gov/python/TestHarness.html

TestHarness options/influential-environment-variables documentation could really use some TLC... #26799

14 replies

mcacace Feb 20, 2024
Author

If I might add something: I had to reinstall everything from scratch on another cluster because of a system upgrade. Everything went smoothly and all MOOSE tests pass, but the tests from my app still crash (I cannot get any informative hint apart from the reported reason, which is just a crash). I also noticed that during compilation the system complains that NEML and libtorch are missing, but I think this is just a warning. And again, all simulations run fine when launched with the proper Slurm commands.

GiudGiud Feb 20, 2024
Collaborator

Ah thanks, that's good information.
Can you save all your work, then run git clean -xfd in your app (and maybe in MOOSE too),
then rebuild MOOSE and your app?

My concern is that there are some leftover compiled objects from before the system was upgraded.
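Since git clean -xfd permanently deletes every untracked and ignored file, a dry run first can be reassuring. The sketch below is one way to preview the cleanup; the helper name `preview_clean` is made up for this example.

```shell
# Dry-run preview of "git clean -xfd" for a repository directory.
# -x also removes ignored files (e.g. compiled objects), -f forces,
# -d includes untracked directories, -n only lists what would be removed.
preview_clean() {
  git -C "$1" status --short   # uncommitted changes that should be saved first
  git -C "$1" clean -xfdn      # files and directories that would be deleted
}

# Example: preview the current directory if it is a git work tree.
if git rev-parse --is-inside-work-tree >/dev/null 2>&1; then
  preview_clean .
fi
```

Once the list looks right, dropping the `n` from the flags performs the actual cleanup, after which MOOSE and the app can be rebuilt as usual.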

GiudGiud Feb 20, 2024
Collaborator

I don't see anything wrong in the diagnostics. Though do we expect more CUDA libraries in that log, @grmnptr?

mcacace Feb 20, 2024
Author

Sorry, but I think I attached the wrong files (those look to me like they come from my local conda installation). I'm out of office until tomorrow and will double-check. Sorry!

GiudGiud Feb 20, 2024
Collaborator

OK, no worries.
Let's look into it when you have them.

mcacace · 2024-02-21T09:05:11Z

Author

So I think I managed to solve the issue. It all started with me trying to get rid of this warning (after updating the framework)
"*** Warning, This code is deprecated and will be removed in future versions:
Please update your main.C to adapt new main function in MOOSE framework, see 'test/src/main.C in MOOSE as an example of moose::main()'."

So I thought it would be smart to update my main.C as follows:

From the old version:
// Create an instance of the application and store it in a smart pointer for easy cleanup
std::shared_ptr<MooseApp> app = AppFactory::createAppShared("GolemApp", argc, argv);
// Execute the application
app->run();

to the new version:

Moose::main<GolemApp>(argc, argv);

That causes all the tests to fail with error=crashed. In addition, and this is my fault, I had never looked at the log files from my students' runs on the cluster. I realized this morning that they did run to the end, but threw a segfault message before finalizing.

So the question that remains is: what is the right way to update the pointer to the app? Would something like the following work with the new syntax?

MooseApp * app = AppFactory::createApp("GolemApp", argc, argv);
app->run();
delete app;

Apart from that, I think we can close the issue. Again, sorry for bothering you all so much, and thanks for the support.
Mauro

1 reply

mcacace Feb 21, 2024
Author

On a side note, the CUDA-aware support issue has been solved by:
export MOOSE_MPI_COMMAND='mpiexec --opal_warn_on_missing_libcuda 0'
