CUDA out of memory on an 11GB NVIDIA 2080Ti GPU. #6

Open
The-Hulk-007 opened this issue Apr 16, 2024 · 0 comments

Comments

@The-Hulk-007

Hi @anujinho
I am trying to reproduce your TRIDENT CCVAE model, but I keep running into a CUDA out-of-memory error on an 11 GB NVIDIA 2080 Ti GPU. Even if I lower meta_batch_size (e.g. to 10, 4, or 1), I cannot get training to run.
Below are my run command, hyperparameters, and a screenshot:

Config used: mini-5,1/train_conf.json

1. Running configuration:
python -m src.trident_train --cnfg /home/zzh/projectLists/trident/configs/mini-5,1/train_conf.json

2. train_conf.json hyperparameters:

{
"dataset": "miniimagenet",
"root": "./dataset/mini_imagenet",
"n_ways": 5,
"k_shots": 1,
"q_shots": 10,
"inner_adapt_steps_train": 5,
"inner_adapt_steps_test": 5,
"inner_lr": 0.001,
"meta_lr": 0.0001,
"meta_batch_size":20,
"iterations": 100000,
"reconstr": "std",
"wt_ce": 100,
"klwt": "False",
"rec_wt": 0.01,
"beta_l": 1,
"beta_s": 1,
"zl": 64,
"zs": 64,
"task_adapt": "True",
"experiment": "exp1",
"order": "False",
"device": "cuda:3"
}
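
In case it helps with diagnosis, here is a minimal sketch (the helper name and the idea of calling it around the learner(...) call in inner_adapt_trident are my assumptions, not part of the repo) for watching how memory grows per inner-adaptation step:

import torch

def log_cuda_memory(tag, device="cuda:3"):
    # Report allocated vs. reserved CUDA memory in MiB; an "allocated" value
    # that grows across inner steps points at the retained inner-loop graph.
    allocated = torch.cuda.memory_allocated(device) / 2**20
    reserved = torch.cuda.memory_reserved(device) / 2**20
    print(f"[{tag}] allocated={allocated:.1f} MiB  reserved={reserved:.1f} MiB")

# e.g. call log_cuda_memory(f"inner step {step}") before and after each
# adaptation step to see whether memory scales with inner_adapt_steps_train
# or with meta_batch_size.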

3. Output (warning, traceback, and screenshot):

  0%|  | 0/100000 [00:00<?, ?it/s]/home/zzh/anaconda3/envs/tip/lib/python3.9/site-packages/torch/_tensor.py:1013: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more informations. (Triggered internally at  /opt/conda/conda-bld/pytorch_1639180549130/work/build/aten/src/ATen/core/TensorBody.h:417.)
  return self._grad
  0%|                                                                                 | 4/100000 [00:27<193:24:54,  6.96s/it]
Traceback (most recent call last):
  File "/home/zzh/anaconda3/envs/tip/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/zzh/anaconda3/envs/tip/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/zzh/projectLists/trident/src/trident_train.py", line 103, in <module>
    evaluation_loss, evaluation_accuracy = inner_adapt_trident(
  File "/home/zzh/projectLists/trident/src/zoo/trident_utils.py", line 125, in inner_adapt_trident
    reconst_image, logits, mu_l, log_var_l, mu_s, log_var_s = learner(
  File "/home/zzh/anaconda3/envs/tip/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/zzh/anaconda3/envs/tip/lib/python3.9/site-packages/learn2learn/algorithms/maml.py", line 107, in forward
    return self.module(*args, **kwargs)
  File "/home/zzh/anaconda3/envs/tip/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/zzh/projectLists/trident/src/zoo/archs.py", line 814, in forward
    logits, mu_l, log_var_l, z_l = self.classifier_vae(x, z_s, update)
  File "/home/zzh/anaconda3/envs/tip/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/zzh/projectLists/trident/src/zoo/archs.py", line 752, in forward
    x = self.encoder(x, update)
  File "/home/zzh/anaconda3/envs/tip/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/zzh/projectLists/trident/src/zoo/archs.py", line 533, in forward
    x = self.net(x)
  File "/home/zzh/anaconda3/envs/tip/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/zzh/anaconda3/envs/tip/lib/python3.9/site-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "/home/zzh/anaconda3/envs/tip/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/zzh/anaconda3/envs/tip/lib/python3.9/site-packages/torch/nn/modules/activation.py", line 738, in forward
    return F.leaky_relu(input, self.negative_slope, self.inplace)
  File "/home/zzh/anaconda3/envs/tip/lib/python3.9/site-packages/torch/nn/functional.py", line 1475, in leaky_relu
    result = torch._C._nn.leaky_relu(input, negative_slope)

RuntimeError: CUDA out of memory. Tried to allocate 48.00 MiB (GPU 3; 10.76 GiB total capacity; 9.63 GiB already allocated; 49.12 MiB free; 9.64 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
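
The error message itself suggests setting max_split_size_mb to avoid fragmentation; a minimal sketch of how that could be configured (the value 128 is just an example, and the variable has to be set before the first CUDA allocation):

import os

# Must be set before PyTorch makes its first CUDA allocation in this process,
# otherwise the caching allocator ignores it.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported (and any .to("cuda") call made) only after the env var is set

That said, in my case the allocated and reserved sizes are almost identical (9.63 GiB vs 9.64 GiB), so fragmentation alone does not seem to explain it.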

[screenshot of the error output]

I also learned from the existing issues that the code for this paper is not run in a distributed manner. I have tried a number of workarounds but could not resolve this. Could you please help resolve this issue or offer some suggestions? I look forward to hearing from you. Thanks!
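
For context, my understanding (which may be wrong) is that with "order": "False" the run uses second-order MAML, and I am not sure how this flag maps onto learn2learn's first_order argument. Below is a minimal sketch of the first-order alternative, with a placeholder model, since first-order adaptation avoids retaining the full inner-loop graph and usually needs much less memory; I have not verified that this matches the paper's intended setup.

import torch.nn as nn
import learn2learn as l2l

# Placeholder network purely for illustration; the real model is the CCVAE
# built in src/zoo/archs.py.
model = nn.Linear(64, 5)

# first_order=True skips second-order gradients through the inner loop,
# which is typically the dominant memory cost in MAML-style training.
maml = l2l.algorithms.MAML(model, lr=0.001, first_order=True)
learner = maml.clone()  # per-task learner, as in inner_adapt_trident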
