[TP comm overlap unit test]`CUDA Error: misaligned address` error when testing with recent cublas (or pytorch container) #1332

erhoo82 · 2024-11-14T06:53:12Z

I get `CUDA Error: misaligned address` when running the TP comm overlap unit test with a recent PyTorch container.
I think the error comes from cuBLAS versions that enable nvjet.

[rank1]: Traceback (most recent call last):
[rank1]:   File "/lustre/fsw/coreai_mlperf_training/slym/module_tests/tp_overlap/te.tp_tests/tests/pytorch/distributed/run_gemm_with_overlap.py", line 922, in <module>
[rank1]:     sys.exit(_main(_parse_args()))
[rank1]:              ^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/usr/local/lib/python3.12/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
[rank1]:     return f(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^
[rank1]:   File "/lustre/fsw/coreai_mlperf_training/slym/module_tests/tp_overlap/te.tp_tests/tests/pytorch/distributed/run_gemm_with_overlap.py", line 721, in _main
[rank1]:     all_outputs = _fp8_gemm()
[rank1]:                   ^^^^^^^^^^^
[rank1]:   File "/lustre/fsw/coreai_mlperf_training/slym/module_tests/tp_overlap/te.tp_tests/tests/pytorch/distributed/run_gemm_with_overlap.py", line 602, in _fp8_gemm
[rank1]:     return tex.fp8_gemm(
[rank1]:            ^^^^^^^^^^^^^
[rank1]:   File "/usr/local/lib/python3.12/dist-packages/transformer_engine/pytorch/cpp_extensions/gemm.py", line 180, in fp8_gemm
[rank1]:     _ = fn(*args)
[rank1]:         ^^^^^^^^^
[rank1]: RuntimeError: /workspace/TransformerEngine/transformer_engine/common/comm_gemm_overlap/comm_gemm_overlap.cpp:802 in function split_overlap_ag: CUDA Error: misaligned address


denera · 2024-11-14T07:10:43Z

/workspace/TransformerEngine/transformer_engine/common/comm_gemm_overlap/comm_gemm_overlap.cpp:802 is a cudaEventRecord call. It seems weird that this would trigger a misaligned address error, so I'm guessing the error actually originates from nvte_cublas_gemm just a few lines above that?
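One way to test this guess: CUDA reports kernel errors asynchronously, so the API call named in the message can simply be the first one to observe an earlier failure. Setting `CUDA_LAUNCH_BLOCKING=1` makes kernel launches synchronous, which usually moves the reported error closer to the launch that caused it. Below is a minimal sketch of a helper that prepares such an environment for rerunning the test. The helper name `make_debug_env` is hypothetical and not part of Transformer Engine, and the launch command in the trailing comment is only an assumed example, since the exact command line is not shown in this thread.

```python
import os


def make_debug_env(base_env=None):
    """Return a copy of the environment with synchronous CUDA launches enabled.

    `make_debug_env` is a hypothetical helper for illustration only.
    """
    env = dict(os.environ if base_env is None else base_env)
    # CUDA_LAUNCH_BLOCKING=1 disables asynchronous kernel launches, so an
    # error surfaces at the launch that caused it rather than at a later call.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


# Assumed usage (the real launch arguments are not given in this thread):
#   import subprocess
#   subprocess.run(
#       ["torchrun", "--nproc_per_node=2", "run_gemm_with_overlap.py"],
#       env=make_debug_env(),
#   )
```

If the traceback still points at the `cudaEventRecord` line with blocking launches enabled, the GEMM launch is a less likely culprit than assumed here.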

I'm not familiar with nvjet. Does cuBLAS have an environment variable that lets us at least temporarily disable this for debugging?

erhoo82 · 2024-11-14T07:17:00Z

Not sure if there is a way.

I got the same error in both of the cases below.

- The old container with LD_LIBRARY_PATH set to pick up the recent cuBLAS build. Without the recent cuBLAS build, the unit test runs fine in this container.
- The latest PyTorch container.
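For anyone reproducing the first case, the library swap can be sketched as follows. This is only an illustration of prepending a directory to `LD_LIBRARY_PATH` for a child process. The helper name `with_library_dir` and the directory `/opt/cublas-new/lib` are hypothetical, and the actual location of the cuBLAS build is not given in this thread.

```python
import os


def with_library_dir(lib_dir, base_env=None):
    """Return a copy of the environment with lib_dir first on LD_LIBRARY_PATH.

    `with_library_dir` is a hypothetical helper for illustration only.
    """
    env = dict(os.environ if base_env is None else base_env)
    current = env.get("LD_LIBRARY_PATH", "")
    # Put the new directory first so the dynamic loader searches it before
    # the libraries shipped with the container.
    env["LD_LIBRARY_PATH"] = lib_dir + os.pathsep + current if current else lib_dir
    return env


# Hypothetical directory holding the recent cuBLAS build.
new_env = with_library_dir("/opt/cublas-new/lib", {"LD_LIBRARY_PATH": "/usr/local/cuda/lib64"})
print(new_env["LD_LIBRARY_PATH"])
```

Running the unit test once with the original environment and once with the modified one gives the A/B comparison described above.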

The model e2e job with the latest cuBLAS build runs fine.
So I think the problem is confined to the unit test code.

denera · 2024-11-14T07:32:28Z

Thanks for the info! I'll take a look at the unit tests as soon as I can (likely first thing next week).

erhoo82 assigned denera Nov 14, 2024
