Multi-GPU training crashes with "Cannot copy out of meta tensor; no data" #243

Open
rohan-mehta opened this issue Aug 18, 2023 · 0 comments
Labels
bug Something isn't working

rohan-mehta commented Aug 18, 2023

I cloned the repo and ran the provided training command from here on 1 node with 2 GPUs, and it failed with the stack trace below. I've made no changes to the repo. This is on an 8xA100 machine; the same command works fine on a single GPU.

Expected Behavior

Training should run on multiple GPUs the same way it does on a single GPU.

Current Behavior

Crashes when loading the language model. Full logs here: logs
Excerpt:

Traceback (most recent call last):
  File "/home/fsuser/open_flamingo/open_flamingo/train/train.py", line 484, in <module>
    main()
  File "/home/fsuser/open_flamingo/open_flamingo/train/train.py", line 260, in main
    model, image_processor, tokenizer = create_model_and_transforms(
  File "/home/fsuser/open_flamingo/open_flamingo/src/factory.py", line 57, in create_model_and_transforms
    lang_encoder = AutoModelForCausalLM.from_pretrained(
  File "/home/fsuser/miniconda3/envs/dothings/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 511, in from_pretrained
    return model_class.from_pretrained(
  File "/home/fsuser/miniconda3/envs/dothings/lib/python3.9/site-packages/transformers/modeling_utils.py", line 3084, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/home/fsuser/miniconda3/envs/dothings/lib/python3.9/site-packages/transformers/modeling_utils.py", line 3525, in _load_pretrained_model
    raise RuntimeError(f"Error(s) in loading state_dict for {model.__class__.__name__}:\n\t{error_msg}")
RuntimeError: Error(s) in loading state_dict for MosaicGPT:
        While copying the parameter named "transformer.wte.weight", whose dimensions in the model are torch.Size([50432, 2048]) and whose dimensions in the checkpoint are torch.Size([50432, 2048]), an exception occurred : ('Cannot copy out of meta tensor; no data!',).
        While copying the parameter named "transformer.blocks.0.ln_1.weight", whose dimensions in the model are torch.Size([2048]) and whose dimensions in the checkpoint are torch.Size([2048]), an exception occurred : ('Cannot copy out of meta tensor; no data!',).

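For context on the error itself: a "meta" tensor carries only shape and dtype and has no backing storage, so there is nothing to copy out of it. The trace suggests the MPT parameters are still on the meta device when transformers tries to materialize the checkpoint weights. A minimal sketch of how PyTorch produces this message (illustrative only, not the exact code path inside transformers):

```python
import torch

# A tensor on the "meta" device records shape/dtype but allocates no storage.
meta_param = torch.empty(50432, 2048, device="meta")

try:
    # Copying or moving it fails because there is no underlying data.
    meta_param.to("cpu")
except (NotImplementedError, RuntimeError) as err:
    print(err)  # e.g. "Cannot copy out of meta tensor; no data!"
```
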
Steps to Reproduce

torchrun --nnodes=1 --nproc_per_node=2 train.py \
  --lm_path anas-awadalla/mpt-1b-redpajama-200b \
  --tokenizer_path anas-awadalla/mpt-1b-redpajama-200b \
  --cross_attn_every_n_layers 2 \
  --dataset_resampled \
  --batch_size_mmc4 1 \
  --batch_size_laion 2 \
  --train_num_samples_mmc4 100 \
  --train_num_samples_laion 200 \
  --loss_multiplier_laion 0.2 \
  --workers=4 \
  --run_name OpenFlamingo-3B-vitl-mpt1b \
  --num_epochs 10 \
  --warmup_steps 5 \
  --mmc4_textsim_threshold 0.24 \
  --laion_shards "..." \
  --mmc4_shards "..."
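
To separate model loading from the multi-GPU path, a standalone check like the one below could be run in a single process (a sketch only, not something from the failing run; `trust_remote_code=True` is assumed because MosaicGPT ships custom modeling code on the Hub):

```python
# Sketch: load the same language model directly, outside of torchrun/train.py.
from transformers import AutoModelForCausalLM

lm = AutoModelForCausalLM.from_pretrained(
    "anas-awadalla/mpt-1b-redpajama-200b",
    trust_remote_code=True,  # assumed; MosaicGPT is custom code hosted on the Hub
)
print(sum(p.numel() for p in lm.parameters()))  # quick sanity signal
```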

Environment

- Python 3.9
- Installed requirements from `requirements.txt`
- `conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia`
