Could not work, even using the official script #1011
We should make sure your system is correctly configured and that the distributed job is launched correctly. It's odd that even the official example script fails.

Can you try running the following script?

```python
import sys

def _print(*args, **kwargs):
    print(*args, file=sys.stderr, **kwargs)

_print("Starting script")

import torch
_print("Imported PyTorch")

torch.distributed.init_process_group(backend="nccl")
_print("Initialized NCCL")

rank = torch.distributed.get_rank()
world_size = torch.distributed.get_world_size()
_print(f"{rank=}, {world_size=}")
```
Hi @timmoon10, this is the output of the script:
Interesting, so we need to figure out why the toy script worked while the FSDP script failed somewhere before this point: TransformerEngine/examples/pytorch/fsdp/fsdp.py, lines 205 to 207 at commit 8e039fd (see the initialization sketch below).
Differences I can see:
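For context, here is a minimal sketch of the kind of torchrun-style process-group setup an FSDP example typically performs around that point. This is an assumption for illustration, not the actual code at fsdp.py lines 205 to 207; the environment variables are the standard ones torchrun sets.

```python
import os
import torch
import torch.distributed as dist

# Sketch of a typical torchrun-launched initialization (not the exact fsdp.py code).
# torchrun sets RANK, WORLD_SIZE and LOCAL_RANK for every process it spawns.
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# NCCL rendezvous and process-group creation: the same step the toy script exercises.
dist.init_process_group(backend="nccl")

rank = dist.get_rank()
world_size = dist.get_world_size()
print(f"rank {rank}/{world_size} on cuda:{local_rank}")

dist.destroy_process_group()
```

If the toy script reaches the rank/world_size print but fsdp.py crashes before this step, whatever runs earlier in fsdp.py is the place to look.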
@hellangleZ you can also try this FSDP test from HuggingFace Accelerate that uses TE/FP8: https://github.com/huggingface/accelerate/tree/main/benchmarks/fp8. It handles the FSDP configuration.
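As a rough illustration only (a minimal sketch based on Accelerate's public API, not the benchmark's actual code; the layer size and learning rate are arbitrary), that path roughly amounts to letting Accelerate wrap the model in FSDP and enable FP8 mixed precision:

```python
# Minimal sketch (assumption): Accelerate driving FSDP + FP8 mixed precision.
# Launch with `accelerate launch` or torchrun so the distributed env vars are set.
import torch
from accelerate import Accelerator, FullyShardedDataParallelPlugin

fsdp_plugin = FullyShardedDataParallelPlugin()       # default FSDP settings
accelerator = Accelerator(mixed_precision="fp8",     # uses Transformer Engine when available
                          fsdp_plugin=fsdp_plugin)

model = torch.nn.Linear(1024, 1024)                  # stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# prepare() applies the FSDP wrapping and FP8 handling
model, optimizer = accelerator.prepare(model, optimizer)
```

The benchmark takes care of this configuration for you, which makes it a useful cross-check.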
```
(TE) root@bjdb-h20-node-118:/aml/TransformerEngine/examples/pytorch/fsdp# torchrun --standalone --nnodes=1 --nproc-per-node=$(nvidia-smi -L | wc -l) fsdp.py
W0712 09:57:45.035000 139805827512128 torch/distributed/run.py:757]
W0712 09:57:45.035000 139805827512128 torch/distributed/run.py:757] *****************************************
W0712 09:57:45.035000 139805827512128 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0712 09:57:45.035000 139805827512128 torch/distributed/run.py:757] *****************************************
Fatal Python error: Segmentation fault

Current thread 0x00007f2714afb740 (most recent call first):
  File "/root/miniconda3/envs/TE/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 113 in _call_store
  File "/root/miniconda3/envs/TE/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 64 in __init__
  File "/root/miniconda3/envs/TE/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 259 in create_backend
  File "/root/miniconda3/envs/TE/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/registry.py", line 36 in _create_c10d_handler
  File "/root/miniconda3/envs/TE/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/api.py", line 263 in create_handler
  File "/root/miniconda3/envs/TE/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/registry.py", line 66 in get_rendezvous_handler
  File "/root/miniconda3/envs/TE/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 235 in launch_agent
  File "/root/miniconda3/envs/TE/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 132 in __call__
  File "/root/miniconda3/envs/TE/lib/python3.12/site-packages/torch/distributed/run.py", line 870 in run
  File "/root/miniconda3/envs/TE/lib/python3.12/site-packages/torch/distributed/run.py", line 879 in main
  File "/root/miniconda3/envs/TE/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347 in wrapper
  File "/root/miniconda3/envs/TE/bin/torchrun", line 8 in <module>

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special (total: 20)
Segmentation fault (core dumped)
```
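For what it's worth, the traceback shows the segmentation fault happening inside torchrun's c10d rendezvous backend (the `_call_store` frame), i.e. while the launcher is talking to the TCPStore it creates for rendezvous, before fsdp.py itself ever runs. A minimal sketch to exercise that store path in isolation (the host and port below are placeholders, not values from this setup):

```python
# Minimal sketch (assumption): poke the c10d TCPStore that torchrun's rendezvous
# backend relies on, without launching any training script.
import datetime
from torch.distributed import TCPStore

# Single-process store: acts as both the server (is_master=True) and a client.
store = TCPStore("127.0.0.1", 29500, world_size=1, is_master=True,
                 timeout=datetime.timedelta(seconds=30))

store.set("sanity_key", "ok")     # same set/get path that _call_store goes through
print(store.get("sanity_key"))    # expected: b'ok'
```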