I am encountering an AttributeError. I tried:

    # If using DDP, unwrap the model before saving
    if hasattr(model, 'module'):
        model.module.save_pretrained(current_model_directory, max_shard_size='100GB')
    else:
        model.save_pretrained(current_model_directory, max_shard_size='100GB')
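For context, here is a fuller sketch of the save logic I am aiming for. The rank-0 guard and the DistributedDataParallel check are my own assumptions about how the model is wrapped in this script; current_model_directory is just a variable from my code, not part of any library API:

    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel

    def save_checkpoint(model, current_model_directory):
        # In a multi-process run, only rank 0 should write checkpoint files.
        if dist.is_initialized() and dist.get_rank() != 0:
            return
        # Unwrap the DDP container only when the model is actually wrapped;
        # a plain LlamaForCausalLM has no .module attribute.
        model_to_save = model.module if isinstance(model, DistributedDataParallel) else model
        model_to_save.save_pretrained(current_model_directory, max_shard_size='100GB')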
Even with that change, I got a complicated error about the dataloader. What is this error about? It happens when the model is saved to checkpoints after some update steps:

Update steps: 0%| | 100/150000 [32:29<462:48:31, 11.11s/it]2024-07-18 20:36:51.265 | INFO | main:main:529 - Saving model and optimizer to checkpoints/llama_100m-2024-07-18-20-01-46/model_100, update step 100
Traceback (most recent call last):
  File "run_pretrain.py", line 664, in <module>
    main(args)
  File "run_pretrain.py", line 531, in main
    model.module.save_pretrained(current_model_directory, max_shard_size='100GB')
  File "/home/koko/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1709, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'LlamaForCausalLM' object has no attribute 'module'
[rank0]: Traceback (most recent call last):
[rank0]:   File "run_pretrain.py", line 664, in <module>
[rank0]:     main(args)
[rank0]:   File "run_pretrain.py", line 531, in main
[rank0]:     model.module.save_pretrained(current_model_directory, max_shard_size='100GB')
[rank0]:   File "/home/koko/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1709, in __getattr__
[rank0]:     raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
[rank0]: AttributeError: 'LlamaForCausalLM' object has no attribute 'module'
wandb: / 0.053 MB of 0.053 MB uploaded
wandb: Run history:
wandb: loss █████████████▇▇▇▄▃▂▂▁▂▂▂▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb: lr ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb: throughput_batches ▁▇▇▇▇▇▁▇▇▇▂▇▇▇▇▇▇▂▇▇█▇▇▇▇▇▇▇▇▇▇▆▇▇▇▇▇▇▆▇
wandb: throughput_examples ▁▇▇▇▇▇▁▇▇▇▂▇▇▇▇▇▇▂▇▇█▇▇▇▇▇▇▇▇▇▇▆▇▇▇▇▇▇▆▇
wandb: throughput_tokens ▁▇▇▇▇▇▁▇▇▇▂▇▇▇▇▇▇▂▇▇█▇▇▇▇▇▇▇▇▇▇▆▇▇▇▇▇▇▆▇
wandb: tokens_seen ▁▁▁▁▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
wandb: update_step ▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
wandb:
wandb: Run summary:
wandb: loss 9.375
wandb: lr 0.0
wandb: throughput_batches 0.76993
wandb: throughput_examples 49.27552
wandb: throughput_tokens 9572.05922
wandb: tokens_seen 9905694
wandb: update_step 99
wandb:
wandb: 🚀 View run test at: https://wandb.ai/khaledbouzaiene365/test/runs/xe47q376
wandb: ⭐️ View project at: https://wandb.ai/khaledbouzaiene365/test
wandb: Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 1 other file(s)
wandb: Find logs at: ./wandb/run-20240718_200147-xe47q376/logs
wandb: WARNING The new W&B backend becomes opt-out in version 0.18.0; try it out with wandb.require("core")! See https://wandb.me/wandb-core for more information.
E0718 20:37:00.711596 140291664311360 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 179820) of binary: /home/koko/miniconda3/envs/myenv/bin/python
Traceback (most recent call last):
  File "/home/koko/miniconda3/envs/myenv/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.3.1', 'console_scripts', 'torchrun')())
  File "/home/koko/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/home/koko/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/home/koko/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/koko/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/koko/miniconda3/envs/myenv/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
run_pretrain.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2024-07-18_20:37:00
host : DESKTOP-M0GCNFO.
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 179820)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html