
Has anyone run into this error during training? torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -8) local_rank: 0 (pid: 69053) of binary: /root/miniconda3/bin/python. It appeared after switching from A100 to H20 GPUs; nothing else was changed. #749

Open
liangshuangI opened this issue Nov 21, 2024 · 3 comments

Comments

@liangshuangI

nohup: ignoring input
/root/miniconda3/lib/python3.10/site-packages/colossalai/pipeline/schedule/_utils.py:19: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
_register_pytree_node(OrderedDict, _odict_flatten, _odict_unflatten)
/root/miniconda3/lib/python3.10/site-packages/torch/utils/_pytree.py:254: UserWarning: <class 'collections.OrderedDict'> is already registered as pytree node. Overwriting the previous registration.
warnings.warn(
[2024-11-21 08:39:44] Experiment directory created at /high_perf_store/surround-view/liangshuang/Test/STDiT/outputs/stage1/010-STDiT3-XL-2
[2024-11-21 08:39:44] Training configuration:
{'adam_eps': 1e-15,
'bucket_config': {'1024': {1: (0.05, 36)},
'1080p': {1: (0.1, 5)},
'144p': {1: (1.0, 475),
51: (1.0, 51),
102: ((1.0, 0.33), 27),
204: ((1.0, 0.1), 13),
408: ((1.0, 0.1), 6)},
'2048': {1: (0.1, 5)},
'240p': {1: (0.3, 297),
51: (0.4, 20),
102: ((0.4, 0.33), 10),
204: ((0.4, 0.1), 5),
408: ((0.4, 0.1), 2)},
'256': {1: (0.4, 297),
51: (0.5, 20),
102: ((0.5, 0.33), 10),
204: ((0.5, 0.1), 5),
408: ((0.5, 0.1), 2)},
'360p': {1: (0.2, 141),
51: (0.15, 8),
102: ((0.15, 0.33), 4),
204: ((0.15, 0.1), 2),
408: ((0.15, 0.1), 1)},
'480p': {1: (0.1, 89)},
'512': {1: (0.1, 141)},
'720p': {1: (0.05, 36)}},
'ckpt_every': 200,
'config': 'configs/opensora-v1-2/train/stage1.py',
'dataset': {'data_path': '/high_perf_store/surround-view/liangshuang/Data/webvid-stage1/stage1.csv',
'transform_name': 'resize_crop',
'type': 'VariableVideoTextDataset'},
'dtype': 'bf16',
'ema_decay': 0.99,
'epochs': 5,
'grad_checkpoint': True,
'grad_clip': 1.0,
'load': None,
'log_every': 10,
'lr': 0.0001,
'mask_ratios': {'image_head': 0.05,
'image_head_tail': 0.025,
'image_random': 0.025,
'image_tail': 0.025,
'intepolate': 0.005,
'quarter_head': 0.005,
'quarter_head_tail': 0.005,
'quarter_random': 0.005,
'quarter_tail': 0.005,
'random': 0.05},
'model': {'enable_flash_attn': True,
'enable_layernorm_kernel': True,
'freeze_y_embedder': True,
'from_pretrained': '/high_perf_store/surround-view/liangshuang/Test/STDiT/outputs/adapt/006-STDiT3-XL-2-lr4-split/epoch4-global_step5700/',
'qk_norm': True,
'type': 'STDiT3-XL/2'},
'num_bucket_build_workers': 16,
'num_workers': 8,
'outputs': '/high_perf_store/surround-view/liangshuang/Test/STDiT/outputs/stage1',
'plugin': 'zero2',
'record_time': False,
'scheduler': {'sample_method': 'logit-normal',
'type': 'rflow',
'use_timestep_transform': True},
'seed': 42,
'start_from_scratch': False,
'text_encoder': {'from_pretrained': '/high_perf_store/surround-view/liangshuang/Open-Sora-weights-1.2/t5-v1_1-xxl',
'model_max_length': 300,
'shardformer': True,
'type': 't5'},
'vae': {'from_pretrained': '/high_perf_store/surround-view/liangshuang/Open-Sora-weights-1.2/OpenSora-VAE-v1.2/model.safetensors',
'micro_batch_size': 4,
'micro_frame_size': 17,
'type': 'OpenSoraVAE_V1_2'},
'wandb': False,
'warmup_steps': 1000}
[2024-11-21 08:39:44] Building dataset...
[2024-11-21 08:39:44] Dataset contains 7552 samples.
[2024-11-21 08:39:44] Number of buckets: 626
INFO: Pandarallel will run on 1 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
[2024-11-21 08:39:44] Building buckets...
[2024-11-21 08:39:46] Bucket Info:
[2024-11-21 08:39:46] Bucket [#sample, #batch] by aspect ratio:
{'0.52': [131, 9],
'0.53': [468, 39],
'0.54': [1, 0],
'0.56': [3733, 386],
'0.57': [1291, 123],
'0.60': [1, 0],
'0.67': [21, 0],
'0.68': [9, 0],
'0.75': [7, 0],
'0.78': [2, 0]}
[2024-11-21 08:39:46] Image Bucket [#sample, #batch] by HxWxT:
{}
[2024-11-21 08:39:46] Video Bucket [#sample, #batch] by HxWxT:
{('360p', 408): [36, 36],
('360p', 204): [73, 36],
('360p', 102): [226, 55],
('360p', 51): [521, 64],
('240p', 408): [101, 50],
('240p', 204): [153, 30],
('240p', 102): [509, 49],
('240p', 51): [1183, 58],
('256', 408): [65, 32],
('256', 204): [110, 21],
('256', 102): [376, 36],
('256', 51): [883, 43],
('144p', 408): [59, 9],
('144p', 204): [117, 8],
('144p', 102): [409, 14],
('144p', 51): [843, 16]}
[2024-11-21 08:39:46] #training batch: 557, #training sample: 5.53 K, #non empty bucket: 48
[2024-11-21 08:39:46] Building models...

Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards: 50%|█████ | 1/2 [00:11<00:11, 11.88s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:23<00:00, 12.01s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:23<00:00, 11.99s/it]
Missing keys: []
Unexpected keys: []
[2024-11-21 08:40:30] Model checkpoint loaded from /high_perf_store/surround-view/liangshuang/Test/STDiT/outputs/adapt/006-STDiT3-XL-2-lr4-split/epoch4-global_step5700/
[2024-11-21 08:40:30] [Diffusion] Trainable model params: 1.12 B, Total model params: 1.12 B
[extension] Compiling the JIT cpu_adam_x86 kernel during runtime now
[extension] Time taken to compile cpu_adam_x86 op: 0.06306171417236328 seconds
[extension] Compiling the JIT fused_optim_cuda kernel during runtime now
[extension] Time taken to compile fused_optim_cuda op: 0.0817575454711914 seconds
[2024-11-21 08:40:31] mask ratios: {'random': 0.05, 'intepolate': 0.005, 'quarter_random': 0.005, 'quarter_head': 0.005, 'quarter_tail': 0.005, 'quarter_head_tail': 0.005, 'image_random': 0.025, 'image_head': 0.05, 'image_tail': 0.025, 'image_head_tail': 0.025, 'identity': 0.8}
[2024-11-21 08:40:31] Preparing for distributed training...
[2024-11-21 08:40:31] Boosting model for distributed training
[2024-11-21 08:40:31] Training for 5 epochs with 557 steps per epoch
[2024-11-21 08:40:31] Beginning epoch 0...

Epoch 0: 0%| | 0/557 [00:00<?, ?it/s]tensor([51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51,
51, 51])
Epoch 0: 0%| | 1/557 [00:47<7:19:39, 47.45s/it]tensor([51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51,
51, 51])

Epoch 0: 0%| | 2/557 [01:13<5:22:05, 34.82s/it]tensor([408])
[2024-11-21 08:42:00,328] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -8) local_rank: 0 (pid: 69053) of binary: /root/miniconda3/bin/python
Traceback (most recent call last):
File "/root/miniconda3/bin/torchrun", line 33, in
sys.exit(load_entry_point('torch==2.2.2', 'console_scripts', 'torchrun')())
File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 347, in wrapper
return f(*args, **kwargs)
File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
run(args)
File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

scripts/train.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2024-11-21_08:42:00
host : localhost
rank : 0 (local_rank: 0)
exitcode : -8 (pid: 69053)
error_file: <N/A>
traceback : Signal 8 (SIGFPE) received by PID 69053
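
Since exitcode -8 means the worker was killed by signal 8 (SIGFPE) rather than raising a Python exception, there is no Python traceback to recover. A quick sanity check (a minimal sketch, not from the original logs) is to confirm which CUDA runtime and device PyTorch actually sees on the new H20 node:

```python
# Sketch: print the CUDA runtime PyTorch was built with and the visible GPU,
# useful when a GPU swap (A100 -> H20) starts triggering SIGFPE crashes.
import torch

print("torch:", torch.__version__)
print("CUDA runtime torch was built with:", torch.version.cuda)
print("device:", torch.cuda.get_device_name(0))
print("compute capability:", torch.cuda.get_device_capability(0))
```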

@liangshuangI
Author

I ran into this problem when I tried to train. I also tried running inference, and that went fine. The only change was switching the GPU from an A100 to an H20.

@Vincentyua

I ran into this problem too.

@liangshuangI
Author

I have now solved this problem. You need to make sure there is only one cuBLAS installed and that its version is greater than 12.3. If it isn't, just run pip install nvidia-cublas-cu12==12.4.5.8. That fixed it for me.
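
For reference, a minimal sketch (not from the original comment) for checking which nvidia-cublas wheels are installed and what versions they are, before applying the pip command above:

```python
# Sketch: list installed nvidia-cublas packages so you can confirm there is
# exactly one, and that its version is above 12.3 (per the fix above).
from importlib.metadata import distributions

for dist in distributions():
    name = dist.metadata["Name"] or ""
    if "cublas" in name.lower():
        print(name, dist.version)
```

If this prints more than one package, or a version at or below 12.3, uninstall the extras and run the pip install nvidia-cublas-cu12==12.4.5.8 command from the comment above.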
