QP2 singularity running in parallel (OpenMPI) on multiple nodes #333
-
Hello,

I have been able to build multiple QP2 singularities by cloning "https://github.com/QuantumPackage/qp2 --branch=dev", and these seem to work fine when run on a single node. The code cloned from the above link is parallelized on a single node, and I am able to run using multiple CPUs, which provides a significant speedup in a CIPSI calculation. I am having trouble, however, running the singularity on multiple nodes. The singularity template I am using, including the .def file, can be found here: "https://github.com/Ydrnan/qp2-singularity". I believe the issue to be improper communication between MPI and the QP2 singularity.

The hybrid singularity described in "https://apptainer.org/docs/user/1.0/mpi.html" works on its own in parallel across multiple nodes on our HPC. The OpenMPI version that works on our HPC with the apptainer tutorial above is 4.0.5. How can I implement MPI use on multiple nodes within QP2? Any advice on going about this in the proper way would be appreciated.

Thank you in advance,
Ben
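For reference, the hybrid-model launch from the apptainer tutorial looks roughly like this; the image name, executable, and node counts below are only placeholders:

```bash
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4

# Site-specific module names; OpenMPI 4.0.5 is the version mentioned above
module load openmpi/4.0.5 singularity

# Hybrid model: the host's mpirun launches one container instance per MPI rank
mpirun -n "$SLURM_NTASKS" singularity exec mpi_test.sif /opt/mpitest
```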
Replies: 4 comments
-
Hello,

MPI is not required to run QP in multi-node, as the communications occur through the ZeroMQ library over TCP sockets.

You first need to run a standard single-node calculation:
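For example (the EZFIO directory name is just a placeholder):

```bash
# Master: a standard single-node run
qp_run fci my_molecule.ezfio
```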
This will be the "master" run. It opens a ZeroMQ socket at an address and port number stored in the `<EZFIO>/work/qp_run_address` file. It should look like `tcp://192.168.1.91:47279`.

On another machine, you can run a "slave" calculation that will connect to the master to accelerate it:
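For example, using the same placeholder EZFIO directory:

```bash
# Slave: connects to the already-running master
qp_run --slave fci my_molecule.ezfio
```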
If the file system is shared, the slave calculation will read the `qp_run_address` file to get the address and port number of `qp_run` and attach to it. You can run as many slaves as you want, and you can start them at any time. If you want to use multiple slaves, then it is worth using MPI for the slave process:
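Presumably something along these lines (the exact launcher invocation and rank count are site-specific assumptions):

```bash
# Several slave processes launched through MPI; only rank 0 talks to the master over ZeroMQ
mpirun -n 16 qp_run --slave fci my_molecule.ezfio
```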
In this mode, only rank zero of the slave will make the ZeroMQ connection to the master, and the common data will be propagated to all the other slaves using an MPI broadcast, which is much faster than doing multiple ZeroMQ communications. If you look at the …
Warning: Only the Davidson diagonalization and the PT2 selection/perturbation take advantage of multi-node parallelism.

So to answer your question, the only thing you need is to make it possible for the slaves to connect to the master. The simplest way is to put them on the same network. If you can't do it, you can run …

You can have a look at this presentation to better understand how all this works: …

Important: There was something wrong on the …
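A quick way to check that a slave node can reach the master (using the placeholder EZFIO directory and the example address above; your address and port will differ):

```bash
# On a slave node: read the master's advertised ZeroMQ endpoint, then test TCP reachability
cat my_molecule.ezfio/work/qp_run_address
nc -zv 192.168.1.91 47279
```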
-
Hello @scemama,

Thank you for this information. First running 'qp_run fci ' as the master, followed by running 'qp_run --slave fci ' as a separate job within the same directory, seems to properly connect the slave to the master. However, shortly after the slave reads the tcp address from the master, I run into a floating point error and the calculation exits with error code 136. See the output for both master and slave. Once the slave job hits this floating point error, the calculation terminates. The master job continues to run, but it does not carry the calculation any further, as seen from it eventually being cancelled due to the time limit.

It is worth noting that the slurm file used to submit both these calculations uses the QP2 singularity, but also has the singularity, openmpi, gcc and libzmq modules loaded. Without the libzmq module loaded within the slurm file, the slave does not connect to the master as it does in the above FCI files. See slurm file:

My guess is that either 1) I have not properly created the singularity, though it works fine on one node, or 2) I do not have a necessary module loaded within the slurm file.

Thank you, and I appreciate any additional feedback,
Ben
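For illustration, a minimal sketch of such a submission script; the module names, image name, EZFIO directory, and resource requests are placeholders, not the actual file:

```bash
#!/bin/bash
#SBATCH --job-name=qp2_slave
#SBATCH --nodes=1
#SBATCH --ntasks=16
#SBATCH --time=24:00:00

# Module names/versions are site-specific placeholders
module load singularity openmpi gcc libzmq

# Launch the slave inside the QP2 singularity image
singularity exec qp2.sif qp_run --slave fci my_molecule.ezfio
```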
-
Hi @bavramidis, can you try again with the dev-stable branch?
-
Hi @scemama,

The issue is resolved using the dev-stable branch, as you had mentioned prior. Thank you for your help!

Ben