I have now solved this problem. Make sure there is only one cuBLAS installed and that its version is greater than 12.3. If it is not, just run `pip install nvidia-cublas-cu12==12.4.5.8`. That worked for me.
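As a quick sanity check, here is a minimal sketch (assuming a standard pip-managed environment) that lists every `nvidia-cublas` distribution visible to the interpreter, so duplicate or stale copies can be spotted before rerunning training:

```python
# Minimal sketch: enumerate installed distributions and print any whose name
# mentions cublas. More than one entry, or a version <= 12.3, would match the
# duplicate/old-cuBLAS situation described above.
from importlib.metadata import distributions

for dist in distributions():
    name = dist.metadata["Name"] or ""
    if "cublas" in name.lower():
        print(name, dist.version)
```

If more than one entry shows up, uninstalling them all and reinstalling the single pinned version above should leave exactly one copy.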
nohup: ignoring input
/root/miniconda3/lib/python3.10/site-packages/colossalai/pipeline/schedule/_utils.py:19: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
_register_pytree_node(OrderedDict, _odict_flatten, _odict_unflatten)
/root/miniconda3/lib/python3.10/site-packages/torch/utils/_pytree.py:254: UserWarning: <class 'collections.OrderedDict'> is already registered as pytree node. Overwriting the previous registration.
warnings.warn(
[2024-11-21 08:39:44] Experiment directory created at /high_perf_store/surround-view/liangshuang/Test/STDiT/outputs/stage1/010-STDiT3-XL-2
[2024-11-21 08:39:44] Training configuration:
{'adam_eps': 1e-15,
'bucket_config': {'1024': {1: (0.05, 36)},
'1080p': {1: (0.1, 5)},
'144p': {1: (1.0, 475),
51: (1.0, 51),
102: ((1.0, 0.33), 27),
204: ((1.0, 0.1), 13),
408: ((1.0, 0.1), 6)},
'2048': {1: (0.1, 5)},
'240p': {1: (0.3, 297),
51: (0.4, 20),
102: ((0.4, 0.33), 10),
204: ((0.4, 0.1), 5),
408: ((0.4, 0.1), 2)},
'256': {1: (0.4, 297),
51: (0.5, 20),
102: ((0.5, 0.33), 10),
204: ((0.5, 0.1), 5),
408: ((0.5, 0.1), 2)},
'360p': {1: (0.2, 141),
51: (0.15, 8),
102: ((0.15, 0.33), 4),
204: ((0.15, 0.1), 2),
408: ((0.15, 0.1), 1)},
'480p': {1: (0.1, 89)},
'512': {1: (0.1, 141)},
'720p': {1: (0.05, 36)}},
'ckpt_every': 200,
'config': 'configs/opensora-v1-2/train/stage1.py',
'dataset': {'data_path': '/high_perf_store/surround-view/liangshuang/Data/webvid-stage1/stage1.csv',
'transform_name': 'resize_crop',
'type': 'VariableVideoTextDataset'},
'dtype': 'bf16',
'ema_decay': 0.99,
'epochs': 5,
'grad_checkpoint': True,
'grad_clip': 1.0,
'load': None,
'log_every': 10,
'lr': 0.0001,
'mask_ratios': {'image_head': 0.05,
'image_head_tail': 0.025,
'image_random': 0.025,
'image_tail': 0.025,
'intepolate': 0.005,
'quarter_head': 0.005,
'quarter_head_tail': 0.005,
'quarter_random': 0.005,
'quarter_tail': 0.005,
'random': 0.05},
'model': {'enable_flash_attn': True,
'enable_layernorm_kernel': True,
'freeze_y_embedder': True,
'from_pretrained': '/high_perf_store/surround-view/liangshuang/Test/STDiT/outputs/adapt/006-STDiT3-XL-2-lr4-split/epoch4-global_step5700/',
'qk_norm': True,
'type': 'STDiT3-XL/2'},
'num_bucket_build_workers': 16,
'num_workers': 8,
'outputs': '/high_perf_store/surround-view/liangshuang/Test/STDiT/outputs/stage1',
'plugin': 'zero2',
'record_time': False,
'scheduler': {'sample_method': 'logit-normal',
'type': 'rflow',
'use_timestep_transform': True},
'seed': 42,
'start_from_scratch': False,
'text_encoder': {'from_pretrained': '/high_perf_store/surround-view/liangshuang/Open-Sora-weights-1.2/t5-v1_1-xxl',
'model_max_length': 300,
'shardformer': True,
'type': 't5'},
'vae': {'from_pretrained': '/high_perf_store/surround-view/liangshuang/Open-Sora-weights-1.2/OpenSora-VAE-v1.2/model.safetensors',
'micro_batch_size': 4,
'micro_frame_size': 17,
'type': 'OpenSoraVAE_V1_2'},
'wandb': False,
'warmup_steps': 1000}
[2024-11-21 08:39:44] Building dataset...
[2024-11-21 08:39:44] Dataset contains 7552 samples.
[2024-11-21 08:39:44] Number of buckets: 626
INFO: Pandarallel will run on 1 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
[2024-11-21 08:39:44] Building buckets...
[2024-11-21 08:39:46] Bucket Info:
[2024-11-21 08:39:46] Bucket [#sample, #batch] by aspect ratio:
{'0.52': [131, 9],
'0.53': [468, 39],
'0.54': [1, 0],
'0.56': [3733, 386],
'0.57': [1291, 123],
'0.60': [1, 0],
'0.67': [21, 0],
'0.68': [9, 0],
'0.75': [7, 0],
'0.78': [2, 0]}
[2024-11-21 08:39:46] Image Bucket [#sample, #batch] by HxWxT:
{}
[2024-11-21 08:39:46] Video Bucket [#sample, #batch] by HxWxT:
{('360p', 408): [36, 36],
('360p', 204): [73, 36],
('360p', 102): [226, 55],
('360p', 51): [521, 64],
('240p', 408): [101, 50],
('240p', 204): [153, 30],
('240p', 102): [509, 49],
('240p', 51): [1183, 58],
('256', 408): [65, 32],
('256', 204): [110, 21],
('256', 102): [376, 36],
('256', 51): [883, 43],
('144p', 408): [59, 9],
('144p', 204): [117, 8],
('144p', 102): [409, 14],
('144p', 51): [843, 16]}
[2024-11-21 08:39:46] #training batch: 557, #training sample: 5.53 K, #non empty bucket: 48
[2024-11-21 08:39:46] Building models...
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards: 50%|█████ | 1/2 [00:11<00:11, 11.88s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:23<00:00, 12.01s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:23<00:00, 11.99s/it]
Missing keys: []
Unexpected keys: []
[2024-11-21 08:40:30] Model checkpoint loaded from /high_perf_store/surround-view/liangshuang/Test/STDiT/outputs/adapt/006-STDiT3-XL-2-lr4-split/epoch4-global_step5700/
[2024-11-21 08:40:30] [Diffusion] Trainable model params: 1.12 B, Total model params: 1.12 B
[extension] Compiling the JIT cpu_adam_x86 kernel during runtime now
[extension] Time taken to compile cpu_adam_x86 op: 0.06306171417236328 seconds
[extension] Compiling the JIT fused_optim_cuda kernel during runtime now
[extension] Time taken to compile fused_optim_cuda op: 0.0817575454711914 seconds
[2024-11-21 08:40:31] mask ratios: {'random': 0.05, 'intepolate': 0.005, 'quarter_random': 0.005, 'quarter_head': 0.005, 'quarter_tail': 0.005, 'quarter_head_tail': 0.005, 'image_random': 0.025, 'image_head': 0.05, 'image_tail': 0.025, 'image_head_tail': 0.025, 'identity': 0.8}
[2024-11-21 08:40:31] Preparing for distributed training...
[2024-11-21 08:40:31] Boosting model for distributed training
[2024-11-21 08:40:31] Training for 5 epochs with 557 steps per epoch
[2024-11-21 08:40:31] Beginning epoch 0...
Epoch 0: 0%| | 0/557 [00:00<?, ?it/s]tensor([51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51,
51, 51])
Epoch 0: 0%| | 1/557 [00:47<7:19:39, 47.45s/it]tensor([51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51,
51, 51])
Epoch 0: 0%| | 2/557 [01:13<5:22:05, 34.82s/it]tensor([408])
[2024-11-21 08:42:00,328] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -8) local_rank: 0 (pid: 69053) of binary: /root/miniconda3/bin/python
Traceback (most recent call last):
File "/root/miniconda3/bin/torchrun", line 33, in
sys.exit(load_entry_point('torch==2.2.2', 'console_scripts', 'torchrun')())
File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 347, in wrapper
return f(*args, **kwargs)
File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
run(args)
File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
scripts/train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-11-21_08:42:00
  host      : localhost
  rank      : 0 (local_rank: 0)
  exitcode  : -8 (pid: 69053)
  error_file: <N/A>
  traceback : Signal 8 (SIGFPE) received by PID 69053
============================================================
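For reference, exitcode -8 means the rank-0 worker was killed by SIGFPE (signal 8), which is consistent with the cuBLAS version mismatch described at the top. A minimal smoke test, assuming a working CUDA build of `torch`, is to push a single bf16 matmul through cuBLAS outside the training script:

```python
# Hypothetical smoke test: one bf16 matmul on the GPU exercises cuBLAS
# directly. If the duplicated/old cublas install is the culprit, this tends
# to die the same way the training worker did instead of printing a number.
import torch

a = torch.randn(64, 64, device="cuda", dtype=torch.bfloat16)
b = torch.randn(64, 64, device="cuda", dtype=torch.bfloat16)
print((a @ b).float().abs().mean().item())
```

If this crashes, fix the cuBLAS install first; if it prints a value, look elsewhere for the SIGFPE.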