Running test.py needs more GPU memory than train.py #12
Hi @rajat-talak, thanks for reaching out. How many points did you use for the ModelNet dataset? Also, when computing the Jacobian, we only used 100 points; could you double-check the Jacobian size? To use a Jacobian with a larger number of points, you might need to aggregate the Jacobian computation. Let me know if the problem persists. Cheers.
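One way to read "aggregate the Jacobian computation" is to evaluate it on slices of the point cloud and join the pieces, so that only one slice's intermediates are alive at a time. The sketch below illustrates that pattern in plain Python. It assumes the per-point terms can be computed independently, which this thread does not confirm for `feature_jac`; `jacobian_in_chunks` and `toy_jacobian` are hypothetical names, not functions from the repository.

```python
from typing import Callable, List, Sequence

Point = Sequence[float]
Row = List[float]


def jacobian_in_chunks(
    points: Sequence[Point],
    per_point_jacobian: Callable[[Sequence[Point]], List[Row]],
    chunk_size: int,
) -> List[Row]:
    """Evaluate `per_point_jacobian` on slices of `points` and concatenate the rows."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    rows: List[Row] = []
    for start in range(0, len(points), chunk_size):
        # Only this slice is handed to the (potentially memory-hungry) function.
        chunk = points[start:start + chunk_size]
        rows.extend(per_point_jacobian(chunk))
    return rows


def toy_jacobian(chunk: Sequence[Point]) -> List[Row]:
    # Hypothetical stand-in for the real computation: one row per point.
    return [[2.0 * x, 2.0 * y, 2.0 * z] for x, y, z in chunk]


points = [(float(i), float(i + 1), float(i + 2)) for i in range(10)]
chunked = jacobian_in_chunks(points, toy_jacobian, chunk_size=3)
# The chunked result matches a single full pass over all points.
assert chunked == toy_jacobian(points)
```

With PyTorch tensors, the same pattern would slice along the point dimension and join the pieces with `torch.cat`; a smaller `chunk_size` lowers peak memory at the cost of more calls.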
Hi @Lilac-Lee, I encountered the same error. I used 1000 points for the ModelNet dataset. Details are as follows.
Traceback (most recent call last):
File "test.py", line 157, in <module>
main(ARGS)
File "test.py", line 112, in main
test(args, testset, dptnetlk)
File "test.py", line 106, in test
dptnetlk.test_one_epoch(model, testloader, args.device, 'test', args.data_type, args.vis)
File "/home/***/PytorchProject/***/dptlk_o/trainer.py", line 149, in test_one_epoch
p1, None, j, self.xtol, self.p0_zero_mean, self.p1_zero_mean, mode, data_type)
File "/home/***/PytorchProject/***/dptlk_o/model.py", line 189, in do_forward
r = net(q0, q1, mode, maxiter=maxiter, xtol=xtol, voxel_coords_diff=voxel_coords_diff, data_type=data_type, num_random_points=num_random_points)
File "/home/***/Software/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/***/PytorchProject/***/dptlk_o/model.py", line 210, in forward
r, g, itr = self.iclk_new(g0, p0, p1, maxiter, xtol, mode, voxel_coords_diff=voxel_coords_diff, data_type=data_type, num_random_points=num_random_points)
File "/home/***/PytorchProject/***/dptlk_o/model.py", line 297, in iclk_new
num_points, p0, mode, voxel_coords_diff=voxel_coords_diff, data_type=data_type) # B x N x K x D, K=1024, D=3 or 6
File "/home/***/PytorchProject/***/dptlk_o/model.py", line 231, in Cal_Jac
Mask_fn, A_fn, Ax_fn, BN_fn, self.device).to(self.device)
File "/home/***/PytorchProject/***/dptlk_o/utils.py", line 316, in feature_jac
A3BN3M3 = M3 * dBN3 * A3
RuntimeError: CUDA out of memory. Tried to allocate 22.01 GiB (GPU 1; 23.69 GiB total capacity; 2.61 GiB already allocated; 18.16 GiB free; 3.69 GiB reserved in total by PyTorch)
I located the problem in model.py, lines 294-297:
if mode == 'test':
    f0, Mask_fn, A_fn, Ax_fn, BN_fn, max_idx = self.ptnet(p0, -1)
    J = self.Cal_Jac(Mask_fn, A_fn, Ax_fn, BN_fn, max_idx,
                     num_points, p0, mode, voxel_coords_diff=voxel_coords_diff, data_type=data_type)  # B x N x K x D, K=1024, D=3 or 6
When debugging, I found that the value of num_points is 927. I think this may be what you mentioned about computing a Jacobian with a larger number of points. However, this function is too complicated for me to modify. I hope my comments are helpful to you. Thank you,
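For a rough sense of how the Jacobian size grows with the number of points, here is a back-of-the-envelope estimate based only on the shape comment in the snippet (B x N x K x D, K=1024, D=3 or 6). It assumes float32 storage and a batch size of 1, both of which are assumptions, and it counts only the final tensor; intermediates inside `feature_jac` are not included, so real peak usage will be higher. `jacobian_bytes` is a hypothetical helper.

```python
def jacobian_bytes(batch: int, num_points: int, feat_dim: int = 1024,
                   twist_dim: int = 6, bytes_per_elem: int = 4) -> int:
    """Bytes needed for one dense B x N x K x D tensor (float32 by default)."""
    return batch * num_points * feat_dim * twist_dim * bytes_per_elem


def to_mib(num_bytes: int) -> float:
    return num_bytes / 2**20


# Batch size 1 is an assumption; 100 and 927 are the point counts mentioned in this thread.
for n in (100, 927):
    print(f"N={n}: {to_mib(jacobian_bytes(1, n)):.2f} MiB")
# prints:
# N=100: 2.34 MiB
# N=927: 21.73 MiB
```

The estimate is linear in N, so any tensor in the pipeline that carries the point dimension grows in proportion to the number of points actually passed in.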
Hi, I found that the previous discussion of a similar issue in #10 is interesting and might be helpful. Could you take a look? Cheers.
Hi @Lilac-Lee, Thank you very much for your quick reply. I have looked at issue #10 and it is enlightening. My problem has been solved, but not through the method in issue #10. I found that data_utils.Resampler() is applied in train and val modes but not in test mode. When testing, one point cloud contains more than 45000 points, and computing the Jacobian for it runs out of memory. I modified lines 123-126 in test.py as follows (the last line is added).
Thank you again.
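For readers who want to see what resampling a point cloud to a fixed size can look like, here is a minimal, generic sketch. It is not the repository's `data_utils.Resampler` and not the line that was added to test.py; `resample_points` is a hypothetical helper that draws points at random, without replacement when the cloud is large enough and padding with repeated points otherwise.

```python
import random
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]


def resample_points(points: Sequence[Point], num_points: int,
                    seed: Optional[int] = None) -> List[Point]:
    """Return exactly `num_points` points drawn at random from `points`."""
    if num_points <= 0:
        raise ValueError("num_points must be positive")
    if not points:
        raise ValueError("points must not be empty")
    rng = random.Random(seed)
    if len(points) >= num_points:
        # Enough points: sample without replacement.
        return rng.sample(list(points), num_points)
    # Too few points: keep them all and pad with randomly repeated points.
    extra = [rng.choice(points) for _ in range(num_points - len(points))]
    return list(points) + extra


# A synthetic cloud of 45000 points, reduced to 1000 before any heavy computation.
cloud = [(float(i), 0.0, 0.0) for i in range(45000)]
small = resample_points(cloud, 1000, seed=0)
assert len(small) == 1000
```

Reducing the cloud before the Jacobian is computed keeps the point dimension at a size comparable to what the train and val modes see.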
Hi @Lilac-Lee,
When I try running test.py on my computer, it gives me RuntimeError: CUDA out of memory. This, however, does not happen for train.py. Why does the code require more GPU memory in testing than in training? Training uses a larger batch size than testing. Is this a bug?
PS: I changed the dataset from 3DMatch to ModelNet, and the issue remains. test.py for 3DMatch attempts to allocate 3.91 GB, while test.py for ModelNet attempts to allocate 22.01 GB.
Thank you,