
Could not work, even using the official script #1011

Open

hellangleZ opened this issue Jul 12, 2024 · 5 comments

@hellangleZ

(TE) root@bjdb-h20-node-118:/aml/TransformerEngine/examples/pytorch/fsdp# torchrun --standalone --nnodes=1 --nproc-per-node=$(nvidia-smi -L | wc -l) fsdp.py
W0712 09:57:45.035000 139805827512128 torch/distributed/run.py:757]
W0712 09:57:45.035000 139805827512128 torch/distributed/run.py:757] *****************************************
W0712 09:57:45.035000 139805827512128 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0712 09:57:45.035000 139805827512128 torch/distributed/run.py:757] *****************************************
Fatal Python error: Segmentation fault

Current thread 0x00007f2714afb740 (most recent call first):
File "/root/miniconda3/envs/TE/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 113 in _call_store
File "/root/miniconda3/envs/TE/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 64 in init
File "/root/miniconda3/envs/TE/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 259 in create_backend
File "/root/miniconda3/envs/TE/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/registry.py", line 36 in _create_c10d_handler
File "/root/miniconda3/envs/TE/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/api.py", line 263 in create_handler
File "/root/miniconda3/envs/TE/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/registry.py", line 66 in get_rendezvous_handler
File "/root/miniconda3/envs/TE/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 235 in launch_agent
File "/root/miniconda3/envs/TE/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 132 in call
File "/root/miniconda3/envs/TE/lib/python3.12/site-packages/torch/distributed/run.py", line 870 in run
File "/root/miniconda3/envs/TE/lib/python3.12/site-packages/torch/distributed/run.py", line 879 in main
File "/root/miniconda3/envs/TE/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 347 in wrapper
File "/root/miniconda3/envs/TE/bin/torchrun", line 8 in

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special (total: 20)
Segmentation fault (core dumped)

@timmoon10
Collaborator

timmoon10 commented Jul 12, 2024

We should make sure your system is correctly configured and that the distributed job is launched correctly. It's odd that fsdp.py didn't print out the world size after initialization:

dist_print(f"WORLD_SIZE = {WORLD_SIZE}")

Can you try running the following script with python -m torch.distributed.launch --standalone --nnodes=1 --nproc-per-node=1 test.py?

import sys
def _print(*args, **kwargs):
    print(*args, file=sys.stderr, **kwargs)

_print("Starting script")
import torch
_print("Imported PyTorch")
torch.distributed.init_process_group(backend="nccl")
_print("Initialized NCCL")
rank = torch.distributed.get_rank()
world_size = torch.distributed.get_world_size()
_print(f"{rank=}, {world_size=}")

@hellangleZ (Author)

Hi @timmoon10, this is the output of the script:

image

@timmoon10
Collaborator

Interesting, so we need to figure out why the toy script worked while the FSDP script failed somewhere before:

dist.init_process_group(backend="nccl")
torch.cuda.set_device(LOCAL_RANK)
dist_print(f"WORLD_SIZE = {WORLD_SIZE}")

Differences I can see (a quick way to bisect them is sketched after the list):

  • python -m torch.distributed.launch vs torchrun
  • Multi-GPU vs single-GPU
  • torch.cuda.set_device
  • PyTorch and Transformer Engine imports:
    import torch
    import torch.distributed as dist
    from torch import nn
    from torch.distributed.fsdp import FullyShardedDataParallel, MixedPrecision
    from torch.distributed.fsdp.wrap import always_wrap_policy, transformer_auto_wrap_policy
    from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
        apply_activation_checkpointing,
        checkpoint_wrapper,
    )
    import transformer_engine.pytorch as te
    from transformer_engine.common.recipe import Format, DelayedScaling
    from transformer_engine.pytorch.distributed import prepare_te_modules_for_fsdp
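
A bisection sketch (hypothetical, not from the thread) that reproduces fsdp.py's startup one step at a time; running it with both launchers, and with --nproc-per-node set to 1 and then to the full GPU count, should show which of the differences above triggers the crash:

import os
import sys

def _print(*args, **kwargs):
    print(*args, file=sys.stderr, **kwargs)
    sys.stderr.flush()

_print("1. importing torch")
import torch
import torch.distributed as dist

_print("2. importing transformer_engine.pytorch")
# fsdp.py imports TE at module scope; a segfault here often points at a
# CUDA/cuDNN/driver mismatch rather than at the distributed setup.
import transformer_engine.pytorch as te

_print("3. init_process_group(nccl)")
dist.init_process_group(backend="nccl")

_print("4. torch.cuda.set_device")
# Set by torchrun; torch.distributed.launch may pass --local-rank instead.
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

_print(f"5. WORLD_SIZE = {dist.get_world_size()}")
dist.destroy_process_group()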

@sbhavani
Contributor

sbhavani commented Sep 5, 2024

@hellangleZ you can also try this FSDP test from HuggingFace Accelerate that uses TE/FP8: https://github.com/huggingface/accelerate/tree/main/benchmarks/fp8. It handles the FSDP configuration.
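
For orientation, a minimal sketch of what that path looks like in user code (hypothetical, not taken from the linked benchmark; it assumes a recent accelerate release with Transformer Engine installed, in which case mixed_precision="fp8" routes supported layers through TE):

import torch
from torch import nn
from accelerate import Accelerator

# FP8 via Transformer Engine is selected through mixed precision; recipe
# details and the FSDP settings themselves come from `accelerate config` /
# `accelerate launch`, not from the script.
accelerator = Accelerator(mixed_precision="fp8")

model = nn.Sequential(nn.Linear(1024, 1024), nn.GELU(), nn.Linear(1024, 1024))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# prepare() applies the configured FSDP wrapping and FP8 conversion.
model, optimizer = accelerator.prepare(model, optimizer)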
