I am encountering an error saving model checkpoints #6

Open
Khaledbouza opened this issue Jul 18, 2024 · 0 comments

I am encountering an AttributeError. I tried:

```python
# If using DDP
if hasattr(model, 'module'):
    model.module.save_pretrained(current_model_directory, max_shard_size='100GB')
else:
    model.save_pretrained(current_model_directory, max_shard_size='100GB')
```

but I got a complicated error about the dataloader.
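For reference, a minimal sketch of one way to unwrap the model before calling `save_pretrained`, checking explicitly for the DDP wrapper instead of relying on `hasattr`. The `save_model` helper name is made up for illustration; `model` and `current_model_directory` are the names from the snippet above, and the sketch assumes the model is either a plain `LlamaForCausalLM` or one wrapped in `torch.nn.parallel.DistributedDataParallel`:

```python
import torch.nn as nn

def save_model(model, current_model_directory):
    # Unwrap only when the model is actually wrapped by DDP; calling
    # .module on a plain LlamaForCausalLM raises the AttributeError
    # shown in the traceback below.
    if isinstance(model, nn.parallel.DistributedDataParallel):
        model_to_save = model.module
    else:
        model_to_save = model
    model_to_save.save_pretrained(current_model_directory, max_shard_size='100GB')
```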
What is this error about? It happens when I try to save a model checkpoint after some steps:

Update steps: 0%| | 100/150000 [32:29<462:48:31, 11.11s/it]2024-07-18 20:36:51.265 | INFO | __main__:main:529 - Saving model and optimizer to checkpoints/llama_100m-2024-07-18-20-01-46/model_100, update step 100
Traceback (most recent call last):
  File "run_pretrain.py", line 664, in <module>
    main(args)
  File "run_pretrain.py", line 531, in main
    model.module.save_pretrained(current_model_directory, max_shard_size='100GB')
  File "/home/koko/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1709, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'LlamaForCausalLM' object has no attribute 'module'
[rank0]: Traceback (most recent call last):
[rank0]:   File "run_pretrain.py", line 664, in <module>
[rank0]:     main(args)
[rank0]:   File "run_pretrain.py", line 531, in main
[rank0]:     model.module.save_pretrained(current_model_directory, max_shard_size='100GB')
[rank0]:   File "/home/koko/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1709, in __getattr__
[rank0]:     raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
[rank0]: AttributeError: 'LlamaForCausalLM' object has no attribute 'module'
wandb: / 0.053 MB of 0.053 MB uploaded
wandb: Run history:
wandb: loss █████████████▇▇▇▄▃▂▂▁▂▂▂▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb: lr ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb: throughput_batches ▁▇▇▇▇▇▁▇▇▇▂▇▇▇▇▇▇▂▇▇█▇▇▇▇▇▇▇▇▇▇▆▇▇▇▇▇▇▆▇
wandb: throughput_examples ▁▇▇▇▇▇▁▇▇▇▂▇▇▇▇▇▇▂▇▇█▇▇▇▇▇▇▇▇▇▇▆▇▇▇▇▇▇▆▇
wandb: throughput_tokens ▁▇▇▇▇▇▁▇▇▇▂▇▇▇▇▇▇▂▇▇█▇▇▇▇▇▇▇▇▇▇▆▇▇▇▇▇▇▆▇
wandb: tokens_seen ▁▁▁▁▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
wandb: update_step ▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
wandb:
wandb: Run summary:
wandb: loss 9.375
wandb: lr 0.0
wandb: throughput_batches 0.76993
wandb: throughput_examples 49.27552
wandb: throughput_tokens 9572.05922
wandb: tokens_seen 9905694
wandb: update_step 99
wandb:
wandb: 🚀 View run test at: https://wandb.ai/khaledbouzaiene365/test/runs/xe47q376
wandb: ⭐️ View project at: https://wandb.ai/khaledbouzaiene365/test
wandb: Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 1 other file(s)
wandb: Find logs at: ./wandb/run-20240718_200147-xe47q376/logs
wandb: WARNING The new W&B backend becomes opt-out in version 0.18.0; try it out with wandb.require("core")! See https://wandb.me/wandb-core for more information.
E0718 20:37:00.711596 140291664311360 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 179820) of binary: /home/koko/miniconda3/envs/myenv/bin/python
Traceback (most recent call last):
  File "/home/koko/miniconda3/envs/myenv/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.3.1', 'console_scripts', 'torchrun')())
  File "/home/koko/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/home/koko/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/home/koko/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/koko/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/koko/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

run_pretrain.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2024-07-18_20:37:00
host : DESKTOP-M0GCNFO.
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 179820)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
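As the linked page explains, the per-worker traceback can be surfaced in this summary (instead of `error_file: <N/A>`) by wrapping the script's entry point in PyTorch's `record` decorator. A minimal sketch, assuming the script's existing `main(args)` entry point; the argument parsing shown here is only a stand-in for whatever `run_pretrain.py` actually does:

```python
import argparse

from torch.distributed.elastic.multiprocessing.errors import record

@record  # writes the worker's exception to an error file that torchrun's summary can display
def main(args):
    # training loop, checkpoint saving, etc. would go here
    ...

if __name__ == "__main__":
    # Stand-in for the script's own argument parsing.
    parser = argparse.ArgumentParser()
    args = parser.parse_args()
    main(args)
```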
