I am encountering an error saving model checkpoints #6

Open
Khaledbouza opened this issue Jul 18, 2024 · 0 comments

I am encountering an AttributeError. I tried:

```python
# If using DDP
if hasattr(model, 'module'):
    model.module.save_pretrained(current_model_directory, max_shard_size='100GB')
else:
    model.save_pretrained(current_model_directory, max_shard_size='100GB')
```

but I got a complicated error about the dataloader.
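For reference, a minimal sketch of one way to unwrap the model before calling `save_pretrained`, checking explicitly for the DDP wrapper instead of relying on `hasattr`. The `save_model` helper name is made up for illustration; `model` and `current_model_directory` are the names from the snippet above, and the sketch assumes the model is either a plain `LlamaForCausalLM` or one wrapped in `torch.nn.parallel.DistributedDataParallel`:

```python
import torch.nn as nn

def save_model(model, current_model_directory):
    # Unwrap only when the model is actually wrapped by DDP; calling
    # .module on a plain LlamaForCausalLM raises the AttributeError
    # shown in the traceback below.
    if isinstance(model, nn.parallel.DistributedDataParallel):
        model_to_save = model.module
    else:
        model_to_save = model
    model_to_save.save_pretrained(current_model_directory, max_shard_size='100GB')
```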
What is this error about? It happens when I try to save a model checkpoint after some steps:

Update steps: 0%| | 100/150000 [32:29<462:48:31, 11.11s/it]2024-07-18 20:36:51.265 | INFO | __main__:main:529 - Saving model and optimizer to checkpoints/llama_100m-2024-07-18-20-01-46/model_100, update step 100
Traceback (most recent call last):
  File "run_pretrain.py", line 664, in <module>
    main(args)
  File "run_pretrain.py", line 531, in main
    model.module.save_pretrained(current_model_directory, max_shard_size='100GB')
  File "/home/koko/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1709, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'LlamaForCausalLM' object has no attribute 'module'
[rank0]: Traceback (most recent call last):
[rank0]:   File "run_pretrain.py", line 664, in <module>
[rank0]:     main(args)
[rank0]:   File "run_pretrain.py", line 531, in main
[rank0]:     model.module.save_pretrained(current_model_directory, max_shard_size='100GB')
[rank0]:   File "/home/koko/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1709, in __getattr__
[rank0]:     raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
[rank0]: AttributeError: 'LlamaForCausalLM' object has no attribute 'module'
wandb: / 0.053 MB of 0.053 MB uploaded
wandb: Run history:
wandb: loss █████████████▇▇▇▄▃▂▂▁▂▂▂▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb: lr ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb: throughput_batches ▁▇▇▇▇▇▁▇▇▇▂▇▇▇▇▇▇▂▇▇█▇▇▇▇▇▇▇▇▇▇▆▇▇▇▇▇▇▆▇
wandb: throughput_examples ▁▇▇▇▇▇▁▇▇▇▂▇▇▇▇▇▇▂▇▇█▇▇▇▇▇▇▇▇▇▇▆▇▇▇▇▇▇▆▇
wandb: throughput_tokens ▁▇▇▇▇▇▁▇▇▇▂▇▇▇▇▇▇▂▇▇█▇▇▇▇▇▇▇▇▇▇▆▇▇▇▇▇▇▆▇
wandb: tokens_seen ▁▁▁▁▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
wandb: update_step ▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
wandb:
wandb: Run summary:
wandb: loss 9.375
wandb: lr 0.0
wandb: throughput_batches 0.76993
wandb: throughput_examples 49.27552
wandb: throughput_tokens 9572.05922
wandb: tokens_seen 9905694
wandb: update_step 99
wandb:
wandb: 🚀 View run test at: https://wandb.ai/khaledbouzaiene365/test/runs/xe47q376
wandb: ⭐️ View project at: https://wandb.ai/khaledbouzaiene365/test
wandb: Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 1 other file(s)
wandb: Find logs at: ./wandb/run-20240718_200147-xe47q376/logs
wandb: WARNING The new W&B backend becomes opt-out in version 0.18.0; try it out with wandb.require("core")! See https://wandb.me/wandb-core for more information.
E0718 20:37:00.711596 140291664311360 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 179820) of binary: /home/koko/miniconda3/envs/myenv/bin/python
Traceback (most recent call last):
  File "/home/koko/miniconda3/envs/myenv/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.3.1', 'console_scripts', 'torchrun')())
  File "/home/koko/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/home/koko/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/home/koko/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/koko/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/koko/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

run_pretrain.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2024-07-18_20:37:00
host : DESKTOP-M0GCNFO.
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 179820)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
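As the linked page explains, the per-worker traceback can be surfaced in this summary (instead of `error_file: <N/A>`) by wrapping the script's entry point in PyTorch's `record` decorator. A minimal sketch, assuming the script's existing `main(args)` entry point; the argument parsing shown here is only a stand-in for whatever `run_pretrain.py` actually does:

```python
import argparse

from torch.distributed.elastic.multiprocessing.errors import record

@record  # writes the worker's exception to an error file that torchrun's summary can display
def main(args):
    # training loop, checkpoint saving, etc. would go here
    ...

if __name__ == "__main__":
    # Stand-in for the script's own argument parsing.
    parser = argparse.ArgumentParser()
    args = parser.parse_args()
    main(args)
```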
