
Could not work, even using the official script #1011

Open

hellangleZ opened this issue Jul 12, 2024 · 5 comments

@hellangleZ

(TE) root@bjdb-h20-node-118:/aml/TransformerEngine/examples/pytorch/fsdp# torchrun --standalone --nnodes=1 --nproc-per-node=$(nvidia-smi -L | wc -l) fsdp.py
W0712 09:57:45.035000 139805827512128 torch/distributed/run.py:757]
W0712 09:57:45.035000 139805827512128 torch/distributed/run.py:757] *****************************************
W0712 09:57:45.035000 139805827512128 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0712 09:57:45.035000 139805827512128 torch/distributed/run.py:757] *****************************************
Fatal Python error: Segmentation fault

Current thread 0x00007f2714afb740 (most recent call first):
File "/root/miniconda3/envs/TE/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 113 in _call_store
File "/root/miniconda3/envs/TE/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 64 in init
File "/root/miniconda3/envs/TE/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 259 in create_backend
File "/root/miniconda3/envs/TE/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/registry.py", line 36 in _create_c10d_handler
File "/root/miniconda3/envs/TE/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/api.py", line 263 in create_handler
File "/root/miniconda3/envs/TE/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/registry.py", line 66 in get_rendezvous_handler
File "/root/miniconda3/envs/TE/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 235 in launch_agent
File "/root/miniconda3/envs/TE/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 132 in call
File "/root/miniconda3/envs/TE/lib/python3.12/site-packages/torch/distributed/run.py", line 870 in run
File "/root/miniconda3/envs/TE/lib/python3.12/site-packages/torch/distributed/run.py", line 879 in main
File "/root/miniconda3/envs/TE/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 347 in wrapper
File "/root/miniconda3/envs/TE/bin/torchrun", line 8 in

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special (total: 20)
Segmentation fault (core dumped)

@timmoon10
Collaborator

timmoon10 commented Jul 12, 2024

We should make sure your system is correctly configured and that the distributed job is launched correctly. It's odd that fsdp.py didn't print out the world size after initialization:

dist_print(f"WORLD_SIZE = {WORLD_SIZE}")

Can you try running the following script with python -m torch.distributed.launch --standalone --nnodes=1 --nproc-per-node=1 test.py?

import sys
def _print(*args, **kwargs):
    print(*args, file=sys.stderr, **kwargs)

_print("Starting script")
import torch
_print("Imported PyTorch")
torch.distributed.init_process_group(backend="nccl")
_print("Initialized NCCL")
rank = torch.distributed.get_rank()
world_size = torch.distributed.get_world_size()
_print(f"{rank=}, {world_size=}")

@hellangleZ (Author)

Hi @timmoon10, this is the output of the script:

image

@timmoon10
Collaborator

Interesting, so we need to figure out why the toy script worked while the FSDP script failed somewhere before:

dist.init_process_group(backend="nccl")
torch.cuda.set_device(LOCAL_RANK)
dist_print(f"WORLD_SIZE = {WORLD_SIZE}")

Differences I can see (a quick way to bisect them is sketched after the list):

  • python -m torch.distributed.launch vs torchrun
  • Multi-GPU vs single-GPU
  • torch.cuda.set_device
  • PyTorch and Transformer Engine imports:
    import torch
    import torch.distributed as dist
    from torch import nn
    from torch.distributed.fsdp import FullyShardedDataParallel, MixedPrecision
    from torch.distributed.fsdp.wrap import always_wrap_policy, transformer_auto_wrap_policy
    from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
        apply_activation_checkpointing,
        checkpoint_wrapper,
    )
    import transformer_engine.pytorch as te
    from transformer_engine.common.recipe import Format, DelayedScaling
    from transformer_engine.pytorch.distributed import prepare_te_modules_for_fsdp
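
A bisection sketch (hypothetical, not from the thread) that reproduces fsdp.py's startup one step at a time; running it with both launchers, and with --nproc-per-node set to 1 and then to the full GPU count, should show which of the differences above triggers the crash:

import os
import sys

def _print(*args, **kwargs):
    print(*args, file=sys.stderr, **kwargs)
    sys.stderr.flush()

_print("1. importing torch")
import torch
import torch.distributed as dist

_print("2. importing transformer_engine.pytorch")
# fsdp.py imports TE at module scope; a segfault here often points at a
# CUDA/cuDNN/driver mismatch rather than at the distributed setup.
import transformer_engine.pytorch as te

_print("3. init_process_group(nccl)")
dist.init_process_group(backend="nccl")

_print("4. torch.cuda.set_device")
# Set by torchrun; torch.distributed.launch may pass --local-rank instead.
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

_print(f"5. WORLD_SIZE = {dist.get_world_size()}")
dist.destroy_process_group()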

@sbhavani
Contributor

sbhavani commented Sep 5, 2024

@hellangleZ you can also try this FSDP test from HuggingFace Accelerate that uses TE/FP8: https://github.com/huggingface/accelerate/tree/main/benchmarks/fp8. It handles the FSDP configuration.
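
For orientation, a minimal sketch of what that path looks like in user code (hypothetical, not taken from the linked benchmark; it assumes a recent accelerate release with Transformer Engine installed, in which case mixed_precision="fp8" routes supported layers through TE):

import torch
from torch import nn
from accelerate import Accelerator

# FP8 via Transformer Engine is selected through mixed precision; recipe
# details and the FSDP settings themselves come from `accelerate config` /
# `accelerate launch`, not from the script.
accelerator = Accelerator(mixed_precision="fp8")

model = nn.Sequential(nn.Linear(1024, 1024), nn.GELU(), nn.Linear(1024, 1024))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# prepare() applies the configured FSDP wrapping and FP8 conversion.
model, optimizer = accelerator.prepare(model, optimizer)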
