
Error when training rtdetr #27

Open
iodncookie opened this issue Aug 22, 2023 · 0 comments
iodncookie commented Aug 22, 2023

python tools/train.py -f exps/rtdetr/rtdetr_r18vd_6x_coco.py -d 2 -b 20 -eb 24 -w 4 -ew 4 -lrs 0.1
The error output is as follows:

2023-08-22 17:46:14 | INFO | mmdet.core.trainer:493 - ---> start train epoch1
2023-08-22 17:46:16 | ERROR | mmdet.core.trainer:98 - one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor []] is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
2023-08-22 17:46:16 | INFO | mmdet.core.trainer:343 - Training of experiment is done and the best AP is 0.00
2023-08-22 17:46:16 | ERROR | mmdet.core.launch:147 - An error has been caught in function '_distributed_worker', process 'SpawnProcess-1' (478), thread 'MainThread' (139673154561728):
Traceback (most recent call last):
File "", line 1, in
File "/data/anaconda3/envs/miemie_det/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
│ │ └ 5
│ └ 8
└ <function _main at 0x7f083058cc10>
File "/data/anaconda3/envs/miemie_det/lib/python3.8/multiprocessing/spawn.py", line 129, in _main
return self._bootstrap(parent_sentinel)
│ │ └ 5
│ └ <function BaseProcess._bootstrap at 0x7f083073dee0>

File "/data/anaconda3/envs/miemie_det/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
│ └ <function BaseProcess.run at 0x7f083073d550>

File "/data/anaconda3/envs/miemie_det/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
│ │ │ │ │ └ {}
│ │ │ │ └
│ │ │ └ (<function _distributed_worker at 0x7f07b1e38160>, 0, (<function main at 0x7f077b30d940>, 2, 2, 0, 'nccl', 'tcp://127.0.0.1:5...
│ │ └
│ └ <function _wrap at 0x7f07b1956310>

File "/data/anaconda3/envs/miemie_det/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
│ │ └ (<function main at 0x7f077b30d940>, 2, 2, 0, 'nccl', 'tcp://127.0.0.1:56017', (╒═══════════════════════╤═════════════════════...
│ └ 0
└ <function _distributed_worker at 0x7f07b1e38160>
File "/home/a-bamboo/repositories/miemiedetection/mmdet/core/launch.py", line 147, in _distributed_worker
main_func(*args)
│ └ (╒═══════════════════════╤═══════════════════════════════════════════════════════════════════════════════════════════════════...
└ <function main at 0x7f077b30d940>
File "/home/a-bamboo/repositories/miemiedetection/tools/train.py", line 126, in main
trainer.train()
│ └ <function Trainer.train at 0x7f077a68bc10>
└ <mmdet.core.trainer.Trainer object at 0x7f077a65ce20>
File "/home/a-bamboo/repositories/miemiedetection/mmdet/core/trainer.py", line 96, in train
self.train_in_epoch()
│ └ <function Trainer.train_in_epoch at 0x7f077a68bd30>
└ <mmdet.core.trainer.Trainer object at 0x7f077a65ce20>
File "/home/a-bamboo/repositories/miemiedetection/mmdet/core/trainer.py", line 336, in train_in_epoch
self.train_in_iter()
│ └ <function Trainer.train_in_iter at 0x7f077a68be50>
└ <mmdet.core.trainer.Trainer object at 0x7f077a65ce20>
File "/home/a-bamboo/repositories/miemiedetection/mmdet/core/trainer.py", line 350, in train_in_iter
self.train_one_iter()
│ └ <function Trainer.train_one_iter at 0x7f077a68bee0>
└ <mmdet.core.trainer.Trainer object at 0x7f077a65ce20>
File "/home/a-bamboo/repositories/miemiedetection/mmdet/core/trainer.py", line 462, in train_one_iter
self.scaler.scale(loss).backward()
│ │ │ └ tensor(13467.0713, device='cuda:0', grad_fn=)
│ │ └ <function GradScaler.scale at 0x7f07b2133790>
│ └ <torch.cuda.amp.grad_scaler.GradScaler object at 0x7f077a65ce50>
└ <mmdet.core.trainer.Trainer object at 0x7f077a65ce20>
File "/data/anaconda3/envs/miemie_det/lib/python3.8/site-packages/torch/_tensor.py", line 363, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
│ │ │ │ │ │ │ └ None
│ │ │ │ │ │ └ False
│ │ │ │ │ └ None
│ │ │ │ └ None
│ │ │ └ tensor(13467.0713, device='cuda:0', grad_fn=)
│ │ └ <function backward at 0x7f07b1d6cee0>
│ └ <module 'torch.autograd' from '/data/anaconda3/envs/miemie_det/lib/python3.8/site-packages/torch/autograd/__init__.py'>
└ <module 'torch' from '/data/anaconda3/envs/miemie_det/lib/python3.8/site-packages/torch/__init__.py'>
File "/data/anaconda3/envs/miemie_det/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
│ │ └ <method 'run_backward' of 'torch._C._EngineBase' objects>
│ └ <torch._C._EngineBase object at 0x7f07be7d8d80>
└ <class 'torch.autograd.variable.Variable'>

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor []] is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
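The hint in the message is the right starting point: calling `torch.autograd.set_detect_anomaly(True)` before training makes the failing `backward()` also print the forward-pass stack trace of the operation that modified the saved tensor. Below is a minimal self-contained sketch of what this error class means and how the flag is used; it is a toy illustration, not code from miemiedetection, and the only API it relies on is standard PyTorch:

```python
import torch

# Hypothetical toy reproduction of the same class of error; NOT code
# from miemiedetection, just an illustration of what the message means.
x = torch.ones(3, requires_grad=True)
y = torch.exp(x)        # exp() saves its output y for the backward pass
y.add_(1)               # in-place op bumps y's version counter from 0 to 1
try:
    y.sum().backward()  # autograd finds y at version 1, expected version 0
except RuntimeError as e:
    print(e)            # "... modified by an inplace operation ..."

# The out-of-place form leaves the saved tensor untouched and succeeds:
x = torch.ones(3, requires_grad=True)
y = torch.exp(x)
y = y + 1               # builds a new tensor instead of mutating y
y.sum().backward()

# As the hint says, enabling this before training makes the failing
# backward() also report where the offending forward op was executed:
torch.autograd.set_detect_anomaly(True)
```

Note that the failing tensor in the log is 0-dimensional (`[torch.cuda.FloatTensor []]`), which suggests a scalar, possibly the loss itself or a scalar saved for backward, is modified in place (for example with `*=` or `add_()`) somewhere between the forward pass and `scaler.scale(loss).backward()`. Anomaly detection slows training considerably, so enable it only while debugging.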
