build failure with nccl #121

Open
loveshack opened this issue Nov 8, 2022 · 1 comment

Comments

@loveshack

This is from trying to update the spack package to 2.6.2 and provide NCCL/RCCL support, but it doesn't look as if it's related to spack. Building fails when I enable NCCL, but works without it; I'm puzzled why, as it must usually work.

The cmake args which fail (with openmpi-4.1.4, cuda-11.4.1, nccl-2.14.3-1) are

-DCOSMA_WITH_TESTS:STRING=OFF -DCOSMA_WITH_APPS:STRING=OFF -DCOSMA_WITH_PROFILING:STRING=OFF -DCOSMA_WITH_BENCHMARKS:STRING=OFF -DCOSMA_BLAS:STRING=CUDA -DCOSMA_SCALAPACK:STRING=CUSTOM -DBUILD_SHARED_LIBS=ON -DCOSMA_WITH_GPU_AWARE_MPI:STRING=ON -DCOSMA_WITH_NCCL=ON

It succeeds when -DCOSMA_WITH_NCCL=ON is removed.

There are two different failures, depending on whether openmpi is built with C++ support.

With openmpi+cxx, the failure is

[ 83%] Linking CXX shared library libcosma.so
cd /tmp/mdehsdl3/spack-stage/spack-stage-cosma-2.6.2-neo24soctuz3gh5w75eoivfgvyykwk7v/spack-build-neo24so/src/cosma && /usr/bin/cmake -E cmake_link_script CMakeFiles/cosma.dir/link.txt --verbose=1
/nobackup/projects/bdman01/mdehsdl3/spack.clean/lib/spack/env/gcc/g++ -fPIC -O2 -g -DNDEBUG -Wl,-rpath -Wl,/nobackup/projects/bdman01/mdehsdl3/spack.clean/opt/spack/linux-rhel8-power9le/gcc-8.5.0/openmpi-4.1.4-jdxn55a26z4fhc2xtgq7hiihcehuxhgs/lib -Wl,-rpath -Wl,/nobackup/projects/bdman01/mdehsdl3/spack.clean/opt/spack/linux-rhel8-power9le/gcc-8.5.0/hwloc-2.8.0-bkqulonwqaazeatswgiw3y73tkxry2yo/lib -L/nobackup/projects/bdman01/mdehsdl3/spack.clean/opt/spack/linux-rhel8-power9le/gcc-8.5.0/hwloc-2.8.0-bkqulonwqaazeatswgiw3y73tkxry2yo/lib -pthread -shared -Wl,-soname,libcosma.so -o libcosma.so CMakeFiles/cosma.dir/blas.cpp.o CMakeFiles/cosma.dir/buffer.cpp.o CMakeFiles/cosma.dir/communicator.cpp.o CMakeFiles/cosma.dir/context.cpp.o CMakeFiles/cosma.dir/interval.cpp.o CMakeFiles/cosma.dir/layout.cpp.o CMakeFiles/cosma.dir/local_multiply.cpp.o CMakeFiles/cosma.dir/mapper.cpp.o CMakeFiles/cosma.dir/math_utils.cpp.o CMakeFiles/cosma.dir/matrix.cpp.o CMakeFiles/cosma.dir/memory_pool.cpp.o CMakeFiles/cosma.dir/multiply.cpp.o CMakeFiles/cosma.dir/one_sided_communicator.cpp.o CMakeFiles/cosma.dir/strategy.cpp.o CMakeFiles/cosma.dir/two_sided_communicator.cpp.o CMakeFiles/cosma.dir/cinterface.cpp.o CMakeFiles/cosma.dir/environment_variables.cpp.o CMakeFiles/cosma.dir/pinned_buffers.cpp.o CMakeFiles/cosma.dir/gpu/nccl_utils.cpp.o CMakeFiles/cosma.dir/gpu/gpu_aware_mpi_utils.cpp.o  -Wl,-rpath,/tmp/mdehsdl3/spack-stage/spack-stage-cosma-2.6.2-neo24soctuz3gh5w75eoivfgvyykwk7v/spack-build-neo24so/libs/COSTA/src/costa:/tmp/mdehsdl3/spack-stage/spack-stage-cosma-2.6.2-neo24soctuz3gh5w75eoivfgvyykwk7v/spack-build-neo24so/libs/Tiled-MM/src/Tiled-MM:/nobackup/projects/bdman01/mdehsdl3/spack.clean/opt/spack/linux-rhel8-power9le/gcc-8.5.0/nccl-2.14.3-1-anhrq6463uiydo7xfah7tmhcrrup4zfb/lib:/nobackup/projects/bdman01/mdehsdl3/spack.clean/opt/spack/linux-rhel8-power9le/gcc-8.5.0/openmpi-4.1.4-jdxn55a26z4fhc2xtgq7hiihcehuxhgs/lib: ../../libs/COSTA/src/costa/libcosta.so 
../../libs/Tiled-MM/src/Tiled-MM/libTiled-MM.so /nobackup/projects/bdman01/mdehsdl3/spack.clean/opt/spack/linux-rhel8-power9le/gcc-8.5.0/nccl-2.14.3-1-anhrq6463uiydo7xfah7tmhcrrup4zfb/lib/libnccl.so /nobackup/projects/bdman01/mdehsdl3/spack.clean/opt/spack/linux-rhel8-power9le/gcc-8.5.0/openmpi-4.1.4-jdxn55a26z4fhc2xtgq7hiihcehuxhgs/lib/libmpi_cxx.so /nobackup/projects/bdman01/mdehsdl3/spack.clean/opt/spack/linux-rhel8-power9le/gcc-8.5.0/openmpi-4.1.4-jdxn55a26z4fhc2xtgq7hiihcehuxhgs/lib/libmpi.so /usr/lib/gcc/ppc64le-redhat-linux/8/libgomp.so /usr/lib64/libpthread.so /opt/software/builder/developers/compilers/cuda/11.4.1/1/default/lib64/libcublas.so /opt/software/builder/developers/compilers/cuda/11.4.1/1/default/lib64/libcudart.so 
CMakeFiles/cosma.dir/gpu/gpu_aware_mpi_utils.cpp.o: In function `cosma::gpu::check_runtime_status(cudaError)':
/nobackup/projects/bdman01/mdehsdl3/spack.clean/opt/spack/linux-rhel8-power9le/gcc-8.5.0/openmpi-4.1.4-jdxn55a26z4fhc2xtgq7hiihcehuxhgs/include/openmpi/ompi/mpi/cxx/intracomm_inln.h:102: multiple definition of `cosma::gpu::check_runtime_status(cudaError)'
CMakeFiles/cosma.dir/gpu/nccl_utils.cpp.o:/tmp/mdehsdl3/spack-stage/spack-stage-cosma-2.6.2-neo24soctuz3gh5w75eoivfgvyykwk7v/spack-src/src/cosma/gpu/utils.hpp:7: first defined here
collect2: error: ld returned 1 exit status
make[2]: *** [src/cosma/CMakeFiles/cosma.dir/build.make:413: src/cosma/libcosma.so] Error 1
make[2]: Leaving directory '/tmp/mdehsdl3/spack-stage/spack-stage-cosma-2.6.2-neo24soctuz3gh5w75eoivfgvyykwk7v/spack-build-neo24so'

and without cxx it's

[ 83%] Linking CXX shared library libcosma.so
cd /tmp/mdehsdl3/spack-stage/spack-stage-cosma-2.6.2-iy3pxeya5oy7n52rsyyzx2zjzv2qry5g/spack-build-iy3pxey/src/cosma && /usr/bin/cmake -E cmake_link_script CMakeFiles/cosma.dir/link.txt --verbose=1
/nobackup/projects/bdman01/mdehsdl3/spack.clean/lib/spack/env/gcc/g++ -fPIC -O2 -g -DNDEBUG -Wl,-rpath -Wl,/nobackup/projects/bdman01/mdehsdl3/spack.clean/opt/spack/linux-rhel8-power9le/gcc-8.5.0/openmpi-4.1.4-tngp6b2qcx64wd7ndf53dmdeovlmui4h/lib -Wl,-rpath -Wl,/nobackup/projects/bdman01/mdehsdl3/spack.clean/opt/spack/linux-rhel8-power9le/gcc-8.5.0/hwloc-2.8.0-bkqulonwqaazeatswgiw3y73tkxry2yo/lib -L/nobackup/projects/bdman01/mdehsdl3/spack.clean/opt/spack/linux-rhel8-power9le/gcc-8.5.0/hwloc-2.8.0-bkqulonwqaazeatswgiw3y73tkxry2yo/lib -pthread -shared -Wl,-soname,libcosma.so -o libcosma.so CMakeFiles/cosma.dir/blas.cpp.o CMakeFiles/cosma.dir/buffer.cpp.o CMakeFiles/cosma.dir/communicator.cpp.o CMakeFiles/cosma.dir/context.cpp.o CMakeFiles/cosma.dir/interval.cpp.o CMakeFiles/cosma.dir/layout.cpp.o CMakeFiles/cosma.dir/local_multiply.cpp.o CMakeFiles/cosma.dir/mapper.cpp.o CMakeFiles/cosma.dir/math_utils.cpp.o CMakeFiles/cosma.dir/matrix.cpp.o CMakeFiles/cosma.dir/memory_pool.cpp.o CMakeFiles/cosma.dir/multiply.cpp.o CMakeFiles/cosma.dir/one_sided_communicator.cpp.o CMakeFiles/cosma.dir/strategy.cpp.o CMakeFiles/cosma.dir/two_sided_communicator.cpp.o CMakeFiles/cosma.dir/cinterface.cpp.o CMakeFiles/cosma.dir/environment_variables.cpp.o CMakeFiles/cosma.dir/pinned_buffers.cpp.o CMakeFiles/cosma.dir/gpu/nccl_utils.cpp.o CMakeFiles/cosma.dir/gpu/gpu_aware_mpi_utils.cpp.o  -Wl,-rpath,/tmp/mdehsdl3/spack-stage/spack-stage-cosma-2.6.2-iy3pxeya5oy7n52rsyyzx2zjzv2qry5g/spack-build-iy3pxey/libs/COSTA/src/costa:/tmp/mdehsdl3/spack-stage/spack-stage-cosma-2.6.2-iy3pxeya5oy7n52rsyyzx2zjzv2qry5g/spack-build-iy3pxey/libs/Tiled-MM/src/Tiled-MM:/nobackup/projects/bdman01/mdehsdl3/spack.clean/opt/spack/linux-rhel8-power9le/gcc-8.5.0/nccl-2.14.3-1-anhrq6463uiydo7xfah7tmhcrrup4zfb/lib:/nobackup/projects/bdman01/mdehsdl3/spack.clean/opt/spack/linux-rhel8-power9le/gcc-8.5.0/openmpi-4.1.4-tngp6b2qcx64wd7ndf53dmdeovlmui4h/lib: ../../libs/COSTA/src/costa/libcosta.so 
../../libs/Tiled-MM/src/Tiled-MM/libTiled-MM.so /nobackup/projects/bdman01/mdehsdl3/spack.clean/opt/spack/linux-rhel8-power9le/gcc-8.5.0/nccl-2.14.3-1-anhrq6463uiydo7xfah7tmhcrrup4zfb/lib/libnccl.so /nobackup/projects/bdman01/mdehsdl3/spack.clean/opt/spack/linux-rhel8-power9le/gcc-8.5.0/openmpi-4.1.4-tngp6b2qcx64wd7ndf53dmdeovlmui4h/lib/libmpi.so /usr/lib/gcc/ppc64le-redhat-linux/8/libgomp.so /usr/lib64/libpthread.so /opt/software/builder/developers/compilers/cuda/11.4.1/1/default/lib64/libcublas.so /opt/software/builder/developers/compilers/cuda/11.4.1/1/default/lib64/libcudart.so 
CMakeFiles/cosma.dir/gpu/gpu_aware_mpi_utils.cpp.o: In function `cosma::gpu::check_runtime_status(cudaError)':
/tmp/mdehsdl3/spack-stage/spack-stage-cosma-2.6.2-iy3pxeya5oy7n52rsyyzx2zjzv2qry5g/spack-src/src/cosma/gpu/utils.hpp:7: multiple definition of `cosma::gpu::check_runtime_status(cudaError)'
CMakeFiles/cosma.dir/gpu/nccl_utils.cpp.o:/tmp/mdehsdl3/spack-stage/spack-stage-cosma-2.6.2-iy3pxeya5oy7n52rsyyzx2zjzv2qry5g/spack-src/src/cosma/gpu/utils.hpp:7: first defined here
collect2: error: ld returned 1 exit status
make[2]: *** [src/cosma/CMakeFiles/cosma.dir/build.make:412: src/cosma/libcosma.so] Error 1
make[2]: Leaving directory '/tmp/mdehsdl3/spack-stage/spack-stage-cosma-2.6.2-iy3pxeya5oy7n52rsyyzx2zjzv2qry5g/spack-build-iy3pxey'
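
Both logs report the same symbol, `cosma::gpu::check_runtime_status(cudaError)`, as defined at `utils.hpp:7` and emitted by two object files (`nccl_utils.cpp.o` and `gpu_aware_mpi_utils.cpp.o`). That is the usual symptom of a non-inline function defined in a header that more than one translation unit includes. The following is a minimal sketch of that pattern and the common remedy, not the actual COSMA source: the `sketch` namespace, the `fake_status` enum (standing in for `cudaError_t` so it compiles without CUDA), and the function body are all assumptions made for illustration.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-in for cudaError_t so the sketch builds without CUDA.
enum class fake_status { success = 0, failure = 1 };

namespace sketch {
namespace gpu {

// If this definition lived in a header WITHOUT `inline`, every .cpp file
// including the header would emit its own external definition, and linking
// the resulting object files into one library would fail with
// "multiple definition of ...".
// Marking it `inline` permits identical definitions in several translation
// units; the linker keeps a single copy.
inline void check_runtime_status(fake_status status) {
    if (status != fake_status::success) {
        throw std::runtime_error(
            "runtime error, code " +
            std::to_string(static_cast<int>(status)));
    }
}

} // namespace gpu
} // namespace sketch
```

Calling `sketch::gpu::check_runtime_status(fake_status::success)` returns normally, while passing `fake_status::failure` throws `std::runtime_error`. An alternative with the same effect on linking would be to keep only a declaration in the header and move the definition into a single `.cpp` file.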

Separately, what exactly does COSMA_WITH_GPU_AWARE_MPI mean? In the case of openmpi, it could mean configuring --with-cuda and/or using a UCX built with cuda and/or gdrcopy.

@simonpintarelli
Member

This should be fixed in #130
