I have now solved this problem. Make sure there is only one cuBLAS installed and that its version is greater than 12.3. If it is not, just run `pip install nvidia-cublas-cu12==12.4.5.8`. That worked for me.
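As a quick sanity check, here is a minimal sketch (assuming a standard pip-managed environment) that lists every `nvidia-cublas` distribution visible to the interpreter, so duplicate or stale copies can be spotted before rerunning training:

```python
# Minimal sketch: enumerate installed distributions and print any whose name
# mentions cublas. More than one entry, or a version <= 12.3, would match the
# duplicate/old-cuBLAS situation described above.
from importlib.metadata import distributions

for dist in distributions():
    name = dist.metadata["Name"] or ""
    if "cublas" in name.lower():
        print(name, dist.version)
```

If more than one entry shows up, uninstalling them all and reinstalling the single pinned version above should leave exactly one copy.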
nohup: ignoring input
/root/miniconda3/lib/python3.10/site-packages/colossalai/pipeline/schedule/_utils.py:19: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
_register_pytree_node(OrderedDict, _odict_flatten, _odict_unflatten)
/root/miniconda3/lib/python3.10/site-packages/torch/utils/_pytree.py:254: UserWarning: <class 'collections.OrderedDict'> is already registered as pytree node. Overwriting the previous registration.
warnings.warn(
[2024-11-21 08:39:44] Experiment directory created at /high_perf_store/surround-view/liangshuang/Test/STDiT/outputs/stage1/010-STDiT3-XL-2
[2024-11-21 08:39:44] Training configuration:
{'adam_eps': 1e-15,
'bucket_config': {'1024': {1: (0.05, 36)},
'1080p': {1: (0.1, 5)},
'144p': {1: (1.0, 475),
51: (1.0, 51),
102: ((1.0, 0.33), 27),
204: ((1.0, 0.1), 13),
408: ((1.0, 0.1), 6)},
'2048': {1: (0.1, 5)},
'240p': {1: (0.3, 297),
51: (0.4, 20),
102: ((0.4, 0.33), 10),
204: ((0.4, 0.1), 5),
408: ((0.4, 0.1), 2)},
'256': {1: (0.4, 297),
51: (0.5, 20),
102: ((0.5, 0.33), 10),
204: ((0.5, 0.1), 5),
408: ((0.5, 0.1), 2)},
'360p': {1: (0.2, 141),
51: (0.15, 8),
102: ((0.15, 0.33), 4),
204: ((0.15, 0.1), 2),
408: ((0.15, 0.1), 1)},
'480p': {1: (0.1, 89)},
'512': {1: (0.1, 141)},
'720p': {1: (0.05, 36)}},
'ckpt_every': 200,
'config': 'configs/opensora-v1-2/train/stage1.py',
'dataset': {'data_path': '/high_perf_store/surround-view/liangshuang/Data/webvid-stage1/stage1.csv',
'transform_name': 'resize_crop',
'type': 'VariableVideoTextDataset'},
'dtype': 'bf16',
'ema_decay': 0.99,
'epochs': 5,
'grad_checkpoint': True,
'grad_clip': 1.0,
'load': None,
'log_every': 10,
'lr': 0.0001,
'mask_ratios': {'image_head': 0.05,
'image_head_tail': 0.025,
'image_random': 0.025,
'image_tail': 0.025,
'intepolate': 0.005,
'quarter_head': 0.005,
'quarter_head_tail': 0.005,
'quarter_random': 0.005,
'quarter_tail': 0.005,
'random': 0.05},
'model': {'enable_flash_attn': True,
'enable_layernorm_kernel': True,
'freeze_y_embedder': True,
'from_pretrained': '/high_perf_store/surround-view/liangshuang/Test/STDiT/outputs/adapt/006-STDiT3-XL-2-lr4-split/epoch4-global_step5700/',
'qk_norm': True,
'type': 'STDiT3-XL/2'},
'num_bucket_build_workers': 16,
'num_workers': 8,
'outputs': '/high_perf_store/surround-view/liangshuang/Test/STDiT/outputs/stage1',
'plugin': 'zero2',
'record_time': False,
'scheduler': {'sample_method': 'logit-normal',
'type': 'rflow',
'use_timestep_transform': True},
'seed': 42,
'start_from_scratch': False,
'text_encoder': {'from_pretrained': '/high_perf_store/surround-view/liangshuang/Open-Sora-weights-1.2/t5-v1_1-xxl',
'model_max_length': 300,
'shardformer': True,
'type': 't5'},
'vae': {'from_pretrained': '/high_perf_store/surround-view/liangshuang/Open-Sora-weights-1.2/OpenSora-VAE-v1.2/model.safetensors',
'micro_batch_size': 4,
'micro_frame_size': 17,
'type': 'OpenSoraVAE_V1_2'},
'wandb': False,
'warmup_steps': 1000}
[2024-11-21 08:39:44] Building dataset...
[2024-11-21 08:39:44] Dataset contains 7552 samples.
[2024-11-21 08:39:44] Number of buckets: 626
INFO: Pandarallel will run on 1 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
[2024-11-21 08:39:44] Building buckets...
[2024-11-21 08:39:46] Bucket Info:
[2024-11-21 08:39:46] Bucket [#sample, #batch] by aspect ratio:
{'0.52': [131, 9],
'0.53': [468, 39],
'0.54': [1, 0],
'0.56': [3733, 386],
'0.57': [1291, 123],
'0.60': [1, 0],
'0.67': [21, 0],
'0.68': [9, 0],
'0.75': [7, 0],
'0.78': [2, 0]}
[2024-11-21 08:39:46] Image Bucket [#sample, #batch] by HxWxT:
{}
[2024-11-21 08:39:46] Video Bucket [#sample, #batch] by HxWxT:
{('360p', 408): [36, 36],
('360p', 204): [73, 36],
('360p', 102): [226, 55],
('360p', 51): [521, 64],
('240p', 408): [101, 50],
('240p', 204): [153, 30],
('240p', 102): [509, 49],
('240p', 51): [1183, 58],
('256', 408): [65, 32],
('256', 204): [110, 21],
('256', 102): [376, 36],
('256', 51): [883, 43],
('144p', 408): [59, 9],
('144p', 204): [117, 8],
('144p', 102): [409, 14],
('144p', 51): [843, 16]}
[2024-11-21 08:39:46] #training batch: 557, #training sample: 5.53 K, #non empty bucket: 48
[2024-11-21 08:39:46] Building models...
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards: 50%|█████ | 1/2 [00:11<00:11, 11.88s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:23<00:00, 12.01s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:23<00:00, 11.99s/it]
Missing keys: []
Unexpected keys: []
[2024-11-21 08:40:30] Model checkpoint loaded from /high_perf_store/surround-view/liangshuang/Test/STDiT/outputs/adapt/006-STDiT3-XL-2-lr4-split/epoch4-global_step5700/
[2024-11-21 08:40:30] [Diffusion] Trainable model params: 1.12 B, Total model params: 1.12 B
[extension] Compiling the JIT cpu_adam_x86 kernel during runtime now
[extension] Time taken to compile cpu_adam_x86 op: 0.06306171417236328 seconds
[extension] Compiling the JIT fused_optim_cuda kernel during runtime now
[extension] Time taken to compile fused_optim_cuda op: 0.0817575454711914 seconds
[2024-11-21 08:40:31] mask ratios: {'random': 0.05, 'intepolate': 0.005, 'quarter_random': 0.005, 'quarter_head': 0.005, 'quarter_tail': 0.005, 'quarter_head_tail': 0.005, 'image_random': 0.025, 'image_head': 0.05, 'image_tail': 0.025, 'image_head_tail': 0.025, 'identity': 0.8}
[2024-11-21 08:40:31] Preparing for distributed training...
[2024-11-21 08:40:31] Boosting model for distributed training
[2024-11-21 08:40:31] Training for 5 epochs with 557 steps per epoch
[2024-11-21 08:40:31] Beginning epoch 0...
Epoch 0: 0%| | 0/557 [00:00<?, ?it/s]tensor([51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51,
51, 51])
Epoch 0: 0%| | 1/557 [00:47<7:19:39, 47.45s/it]tensor([51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51,
51, 51])
Epoch 0: 0%| | 2/557 [01:13<5:22:05, 34.82s/it]tensor([408])
[2024-11-21 08:42:00,328] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -8) local_rank: 0 (pid: 69053) of binary: /root/miniconda3/bin/python
Traceback (most recent call last):
File "/root/miniconda3/bin/torchrun", line 33, in
sys.exit(load_entry_point('torch==2.2.2', 'console_scripts', 'torchrun')())
File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 347, in wrapper
return f(*args, **kwargs)
File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
run(args)
File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
scripts/train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-11-21_08:42:00
  host      : localhost
  rank      : 0 (local_rank: 0)
  exitcode  : -8 (pid: 69053)
  error_file: <N/A>
  traceback : Signal 8 (SIGFPE) received by PID 69053
============================================================
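For reference, exitcode -8 means the rank-0 worker was killed by SIGFPE (signal 8), which is consistent with the cuBLAS version mismatch described at the top. A minimal smoke test, assuming a working CUDA build of `torch`, is to push a single bf16 matmul through cuBLAS outside the training script:

```python
# Hypothetical smoke test: one bf16 matmul on the GPU exercises cuBLAS
# directly. If the duplicated/old cublas install is the culprit, this tends
# to die the same way the training worker did instead of printing a number.
import torch

a = torch.randn(64, 64, device="cuda", dtype=torch.bfloat16)
b = torch.randn(64, 64, device="cuda", dtype=torch.bfloat16)
print((a @ b).float().abs().mean().item())
```

If this crashes, fix the cuBLAS install first; if it prints a value, look elsewhere for the SIGFPE.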