SST reader stuck when using RDMA #4100

Open
abhishek1297 opened this issue Mar 19, 2024 · 13 comments

@abhishek1297

I am trying to run the basic SST-related examples from examples/hello/:
Reader: sstReader/sstReader.py
Writer: sstWriter/sstWriter.py

But, the SST reader always gets stuck when using RDMA data transport.

Installations

I am currently using a conda environment for adios2 python bindings. Here's what I do on the cluster,

>> module load \
    conda/23.5.0 \
    cmake/3.23.3_gcc-10.4.0 \
    openmpi/4.1.5_gcc-10.4.0 \
    gcc/10.4.0_gcc-10.4.0
>> module list
Currently Loaded Modules:
  1) conda/23.5.0                   7) singularity/3.8.7_gcc-10.4.0
  2) cmake/3.23.3_gcc-10.4.0        8) cuda/11.7.1_gcc-10.4.0
  3) libfabric/1.15.1_gcc-10.4.0    9) rdma-core/41.0_gcc-10.4.0
  4) opa-psm2/11.2.230_gcc-10.4.0  10) ucx/1.13.1_gcc-10.4.0
  5) pmix/4.1.2_gcc-10.4.0         11) openmpi/4.1.5_gcc-10.4.0
  6) go/1.18_gcc-10.4.0            12) gcc/10.4.0_gcc-10.4.0
>> conda create -n adios python=3.10 zeromq=4.3.4 -y

Letting mpi4py use existing OpenMPI

>> conda activate adios
>> echo $MPICC
/grid5000/spack/v1/opt/spack/linux-debian11-x86_64_v2/gcc-10.4.0/openmpi-4.1.5-34kj6dkmk4pg3e3nqniaidqj7l2rkkww/bin/mpicc
>> pip3 install --no-binary :all: mpi4py

The OpenMPI module has support for libfabric as well as UCX.

>> ompi_info
MCA osc: ucx (MCA v2.1.0, API v3.0.0, Component v4.1.5)
MCA pml: ucx (MCA v2.1.0, API v2.0.0, Component v4.1.5)
MCA btl: ofi (MCA v2.1.0, API v3.1.0, Component v4.1.5)
MCA mtl: ofi (MCA v2.1.0, API v2.0.0, Component v4.1.5)
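
The same check can be done programmatically by filtering the `ompi_info` output for the frameworks of interest. This is only an illustrative sketch: the function name `mca_components` is made up here, and the sample text is just the four lines quoted above.

```python
import re

# Sample lines copied from the `ompi_info` output above.
SAMPLE = """\
MCA osc: ucx (MCA v2.1.0, API v3.0.0, Component v4.1.5)
MCA pml: ucx (MCA v2.1.0, API v2.0.0, Component v4.1.5)
MCA btl: ofi (MCA v2.1.0, API v3.1.0, Component v4.1.5)
MCA mtl: ofi (MCA v2.1.0, API v2.0.0, Component v4.1.5)
"""

# Matches lines of the form "MCA <framework>: <component> (...)".
_MCA_LINE = re.compile(r"^\s*MCA\s+(\w+):\s+(\w+)\s+\(")


def mca_components(text):
    """Map each MCA framework (pml, btl, ...) to the set of components listed for it."""
    found = {}
    for line in text.splitlines():
        match = _MCA_LINE.match(line)
        if match:
            framework, component = match.groups()
            found.setdefault(framework, set()).add(component)
    return found


if __name__ == "__main__":
    components = mca_components(SAMPLE)
    print("UCX PML listed:", "ucx" in components.get("pml", set()))
    print("OFI BTL listed:", "ofi" in components.get("btl", set()))
```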

With the loaded modules, I build adios2 from source. I have attached the output.log from CMake build.

>> export CMAKE_PREFIX_PATH=$CMAKE_PREFIX_PATH:$CONDA_PREFIX
>> cmake -DCMAKE_INSTALL_PREFIX=$CONDA_PREFIX -DADIOS2_BUILD_EXAMPLES=ON ..
>> make -j12
>> make install

Running with UCX

Updating both files,
SST filepath to ../helloSst
io.set_parameter("DataTransport", "ucx")
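
To avoid editing both scripts every time the transport changes, the value could be read from the environment. This is only a sketch: `SST_DATA_TRANSPORT` is a made-up variable name (ADIOS2 does not read it), `apply_transport` is a hypothetical helper, and `FakeIO` merely stands in for the examples' `io` object so the snippet runs without ADIOS2. The only ADIOS2 call assumed is `io.set_parameter`, as used above.

```python
import os


def apply_transport(io, default="ucx", environ=None):
    """Set SST's DataTransport on `io` from a (hypothetical) environment variable."""
    environ = os.environ if environ is None else environ
    transport = environ.get("SST_DATA_TRANSPORT", default)
    io.set_parameter("DataTransport", transport)
    return transport


class FakeIO:
    """Stand-in for the ADIOS2 io object; it only records the parameters it is given."""

    def __init__(self):
        self.parameters = {}

    def set_parameter(self, key, value):
        self.parameters[key] = value


if __name__ == "__main__":
    io = FakeIO()
    apply_transport(io, environ={"SST_DATA_TRANSPORT": "fabric"})
    print(io.parameters)
```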

Writer

>> mpirun -mca pml ucx -n 1 python3 sstWriter.py
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:           gros-12
  Local device:         mlx5_0
  Local port:           1
  CPCs attempted:       rdmacm, udcm
--------------------------------------------------------------------------
DP Writer 0 (0x2729870): UCX init Success
Rank= 0 loop index = 0 data = [0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
Rank= 0 loop index = 1 data = [10. 11. 12. 13. 14. 15. 16. 17. 18. 19.]
Rank= 0 loop index = 2 data = [20. 21. 22. 23. 24. 25. 26. 27. 28. 29.]
Rank= 0 loop index = 3 data = [30. 31. 32. 33. 34. 35. 36. 37. 38. 39.]

Reader (Gets all timesteps correctly)

>> mpirun -mca pml ucx -n 1 python3 sstReader.py
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:           gros-12
  Local device:         mlx5_0
  Local port:           1
  CPCs attempted:       rdmacm, udcm
--------------------------------------------------------------------------
DP Reader 0 (0x3163e30): UCX init Success
Rank= 0 loop index = 0 stream step = 0 data = [0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
Rank= 0 loop index = 1 stream step = 1 data = [10. 11. 12. 13. 14. 15. 16. 17. 18. 19.]
Rank= 0 loop index = 2 stream step = 2 data = [20. 21. 22. 23. 24. 25. 26. 27. 28. 29.]
Rank= 0 loop index = 3 stream step = 3 data = [30. 31. 32. 33. 34. 35. 36. 37. 38. 39.]

Stuck when using libfabric

Updating both files with io.set_parameter("DataTransport", "fabric") or io.set_parameter("DataTransport", "RDMA"). Here, the writer waits for the reader by default. After the reader is launched, the writer starts writing, but the reader gets stuck in the engine.get call (in this example, the stream.read call). The writer prints a warning when the reader is interrupted.

Writer

>> mpirun -mca btl ofi -n 1 python3 sstWriter.py
Rank= 0 loop index = 0 data = [0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
Rank= 0 loop index = 1 data = [10. 11. 12. 13. 14. 15. 16. 17. 18. 19.]
Rank= 0 loop index = 2 data = [20. 21. 22. 23. 24. 25. 26. 27. 28. 29.]
Rank= 0 loop index = 3 data = [30. 31. 32. 33. 34. 35. 36. 37. 38. 39.]
Writer 0 (0x2cb73e0): Got an unexpected connection close event

Reader

>> mpirun -mca btl ofi -n 1 python3 sstReader.py
# Stuck. No output.
# Keyboard interrupt
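
To reproduce the hang without a manual keyboard interrupt, the reader can be launched under a timeout. A minimal sketch: `run_with_timeout` is a hypothetical helper and the time limit is arbitrary. The demo command is a plain Python sleep so the snippet runs anywhere; in practice the `mpirun` line above would be substituted.

```python
import subprocess
import sys


def run_with_timeout(command, seconds):
    """Run `command`; return its exit code, or None if it did not finish in time."""
    try:
        completed = subprocess.run(command, timeout=seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the direct child before re-raising. Processes
        # spawned by a launcher such as mpirun may need separate cleanup.
        return None
    return completed.returncode


if __name__ == "__main__":
    # Stand-in for: ["mpirun", "-mca", "btl", "ofi", "-n", "1", "python3", "sstReader.py"]
    stuck = [sys.executable, "-c", "import time; time.sleep(30)"]
    result = run_with_timeout(stuck, seconds=1)
    print("reader hung" if result is None else f"reader exited with {result}")
```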
@pnorbert
Contributor

pnorbert commented Mar 19, 2024 via email

@abhishek1297
Author

Hi,

LIBFABRIC does not appear in bpls -Vv output.

I am not sure I understand but how do we specify the transport in ADIOS2 exactly?

I see that the libfabric support is a bit complicated, but I got the "RDMA Transport for Staging: Available" message in the output logs, as described in the documentation, which says that SST defaults to libfabric:

Configuration:
ADIOS2 uses the CMake find_package() functionality to locate libfabric. CMake will automatically search system libraries, but if you need to specify a libfabric location other than in a default system location you can add a “-DLIBFABRIC_ROOT=” argument to direct CMake to libfabric’s location. If CMake finds libfabric, you should see the line “RDMA Transport for Staging: Available” near the end of the CMake output. This makes the RDMA DataTransport the default for SST data movement. (More information about SST engine parameters like DataTransport appears in the SST engine description.) If instead you see “RDMA Transport for Staging: Unconfigured”, RDMA will not be available to SST.
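
The status line described above can be pulled out of a saved configure log with a few lines of Python. This is a rough sketch: `rdma_status` is a hypothetical helper, and the sample log text is assembled by hand from the message quoted above, not taken from a real run.

```python
import re

# The status message quoted in the documentation excerpt above.
_STATUS = re.compile(r"RDMA Transport for Staging:\s*(\w+)")


def rdma_status(cmake_output):
    """Return 'Available', 'Unconfigured', or None if the status line is missing."""
    match = _STATUS.search(cmake_output)
    return match.group(1) if match else None


if __name__ == "__main__":
    sample = "-- some other configure output\nRDMA Transport for Staging: Available\n"
    print(rdma_status(sample))
```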

@pnorbert
Contributor

That doc was written before the UCX support was added. Since it found UCX, the RDMA transport is using that. For some reason the cmake config did not like the libfabric library it found.

@eisenhauer
Member

Unfortunately, it is the nature of libfabric that even if it is available at compile-time, SST may discover that the features available at run-time are not appropriate for our needs. Generally that determination is automatic: the transport is disabled at run-time and we fall back to something else.

But let's take a step back here. A couple of points: UCX is an RDMA transport. It's a relatively new addition to SST, and while our naming scheme isn't completely consistent, it's perfectly usable. The "RDMA Transport for Staging: Available" message happens whenever we find libfabric (previously our only direct-RDMA transport) or UCX. So, I'm not sure there's really a problem. Don't force the libfabric transport (which is still called "rdma", despite there being a UCX RDMA alternative) and you should be OK.

@pnorbert
Contributor

On Perlmutter I got:

-- Found LIBFABRIC: /opt/cray/libfabric/1.15.2.0/lib64/libfabric.so (Required is at least version "1.6")
-- Checking for module 'cray-drc'
--   No package 'cray-drc' found
-- Could NOT find CrayDRC (missing: CrayDRC_LIBRARIES)
-- Libfabric support for the HPE CXI provider: TRUE

But you got FALSE on your system. The test compile command for cmake/check_libfabric_cxi.c in cmake/DetectOptions.cmake:460 fails for you.

  if(LIBFABRIC_FOUND)
    set(ADIOS2_SST_HAVE_LIBFABRIC TRUE)
    find_package(CrayDRC)
    if(CrayDRC_FOUND)
      set(ADIOS2_SST_HAVE_CRAY_DRC TRUE)
    endif()

    try_compile(ADIOS2_SST_HAVE_CRAY_CXI
      ${ADIOS2_BINARY_DIR}/check_libfabric_cxi
      ${ADIOS2_SOURCE_DIR}/cmake/check_libfabric_cxi.c
      CMAKE_FLAGS
        "-DINCLUDE_DIRECTORIES=${LIBFABRIC_INCLUDE_DIRS}"
        "-DLINK_DIRECTORIES=${LIBFABRIC_LIBRARIES}")
    message(STATUS "Libfabric support for the HPE CXI provider: ${ADIOS2_SST_HAVE_CRAY_CXI}")
  endif()

@abhishek1297
Author

Okay. Thanks for the details.

CrayDRC might be the cause of this issue, but it is not present on the cluster. I suppose I will continue working with UCX RDMA.

@abhishek1297
Author

Hi again,

I want to mention that even when I set ADIOS2_USE_UCX=OFF while also explicitly setting the LIBFABRIC_ROOT path (CMake finds it regardless), I get the "RDMA Transport for Staging: Available" message from CMake, and yet LIBFABRIC is not listed among the supported features. This might be misleading.

@pnorbert
Contributor

Indeed, I can't see it either. The libfabric option was taken out of the user options and now it does not appear in the list of features even when it is on.

You can see the RDMA Transport for Staging: Available message only if either UCX or LIBFABRIC is on.

As @eisenhauer explained, unfortunately, a successful build with LIBFABRIC does not guarantee that it will work properly. So you have it, but it hangs instead of functioning properly.

@pnorbert pnorbert reopened this Mar 20, 2024
@eisenhauer
Member

Several action items here. One is that the "RDMA Transport for Staging" output needs to be more complex now that we've added more options. Probably it needs to be a list of possibly available RDMA transports, rather than just "Available". That would have at least made it clear that UCX was viable. Maybe we can also put that list in the bpls output.

@abhishek1297
Author

Is there any test or example for the usage of RDMA?

@eisenhauer
Member

In an ideal world, using RDMA would be completely transparent to the user. You'd specify the SST engine for streaming between reader and writer jobs, start them up (presumably on the same cluster where they can use a shared RDMA network for connectivity), SST would connect them and RDMA would be used for the data transfers. You could verify that RDMA was selected by specifying the environment variable SstVerbose=1 or maybe 2, but otherwise you'd just see faster data transfer than you would if you were using TCP.
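
For instance, the variable can be set for a single launch without touching the shell environment. A minimal sketch: `verbose_env` is a hypothetical helper, and the commented-out command is the reader invocation from the original report.

```python
import os


def verbose_env(level=1, base=None):
    """Return a copy of the environment with SstVerbose set to `level`."""
    env = dict(os.environ if base is None else base)
    env["SstVerbose"] = str(level)
    return env


if __name__ == "__main__":
    env = verbose_env(2)
    print("SstVerbose =", env["SstVerbose"])
    # Pass it to the launcher, e.g.:
    # subprocess.run(["mpirun", "-n", "1", "python3", "sstReader.py"], env=env)
```

With Open MPI, ranks started on other nodes may not inherit the variable automatically; forwarding it with `mpirun -x SstVerbose` is likely needed in that case.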

In practice, things can be a bit more complex. Maybe you're in a batch-only environment, so you need an example batch script for Slurm or LSF (usually you just have to background one or more jobs in the script and wait for them at the end). But on some platforms the installed version of libfabric doesn't default to reasonable things and you have to specify environment variables to fix it up (Summit), or the libfabric module is incompatible with other normally-loaded modules (Titan), or the network doesn't let two different jobs talk to each other (what the Cray DRC library above was meant to address), etc. Unfortunately that means that getting stuff to work on any specific machine can require a bit of sleuthing. We can provide some example batch scripts for machines that we've had access to (mostly US HPC platforms) but you might still have to do some digging (which we are happy to help with) on any other machine.
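
The "background the jobs and wait for them at the end" pattern looks roughly like this when written in Python rather than a shell script. It is only a sketch: `run_together` is a hypothetical helper, and the two demo commands are placeholders for the real writer and reader launch lines, which depend on the scheduler and the machine.

```python
import subprocess
import sys


def run_together(commands):
    """Start every command without waiting, then wait for all; return exit codes in order."""
    processes = [subprocess.Popen(command) for command in commands]
    return [process.wait() for process in processes]


if __name__ == "__main__":
    # Placeholders for e.g. ["mpirun", "-n", "1", "python3", "sstWriter.py"] and the reader.
    writer = [sys.executable, "-c", "print('writer done')"]
    reader = [sys.executable, "-c", "print('reader done')"]
    print(run_together([writer, reader]))
```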

@pnorbert We need to expand our info in read-the-docs about running on specific machines. We've got a tiny bit, for example: https://adios2.readthedocs.io/en/v2.9.2/advanced/ecp_hardware.html
But having a variety of example scripts that have worked on specific machines would be a significant help not only to users of those machines, but might give clues to folks trying to work through running on machines we don't have access to.

@abhishek1297
Author

Yes, the logs show that RDMA was picked up by SST. But that RDMA transport does not use UCX by default. If I do NOT set DataTransport=UCX on the writer's side, the reader gets blocked.

On the reader's side, whether or not DataTransport is set, it will still receive the data as long as the writer is using UCX.

@eisenhauer
Member

The SST reader will use the transport that the writer selected (it would be nice if they negotiated, but for various technical reasons, that's difficult). I would guess that the libfabric transport looks viable to SST, but then turns out not to function. Sorting out why might require both SST and libfabric verbosity to see what exactly is going on. It may be that libfabric claims to have a feature that turns out not to work or something like that. Libfabric is kind of a frankenstein of features...


3 participants