
Running test.py needs more GPU memory than train.py #12

Open
rajat-talak opened this issue Nov 24, 2022 · 4 comments

Comments


rajat-talak commented Nov 24, 2022

Hi @Lilac-Lee,

When I try running test.py on my computer, it gives me a RuntimeError: CUDA out of memory. This, however, does not happen for train.py. Why does the code require more GPU memory in testing than in training? Training uses a larger batch size than testing does. Is this a bug?

PS - I changed the dataset from 3DMatch to ModelNet, and the issue remains. test.py for 3DMatch attempts to allocate 3.91 GB, while test.py for ModelNet attempts to allocate 22.01 GB.

Thank you,

@Lilac-Lee
Owner

Hi @rajat-talak, thanks for reaching out.

What is the number of points you used for the ModelNet dataset? Also, when computing the Jacobian we only use 100 points, so could you double-check the Jacobian size? To use a Jacobian with a larger number of points, you might need to aggregate the Jacobian computation.
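Roughly, the idea of aggregating is to process the points in chunks and accumulate their Jacobian contributions instead of materializing one huge intermediate tensor. A minimal sketch of that pattern (not the code in this repo; per_point_jac and the sum over points are assumptions about how the contributions combine):

def aggregate_jacobian(per_point_jac, points, chunk_size=100):
    # points: torch tensor of shape B x N x 3.
    # per_point_jac (hypothetical) maps a B x n x 3 chunk of points to its
    # Jacobian contribution; accumulating chunk by chunk avoids keeping the
    # full B x N x K x D intermediate on the GPU at once.
    total = None
    for start in range(0, points.shape[1], chunk_size):
        chunk = points[:, start:start + chunk_size]      # B x n_chunk x 3
        contrib = per_point_jac(chunk).sum(dim=1)        # reduce over this chunk
        total = contrib if total is None else total + contrib
    return total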

Let me know if this problem persists. Cheers.

@CVrookieee

Hi @Lilac-Lee,

I encountered the same error. I used 1000 points for the ModelNet dataset. Details are as follows.

Traceback (most recent call last):
  File "test.py", line 157, in <module>
    main(ARGS)
  File "test.py", line 112, in main
    test(args, testset, dptnetlk)
  File "test.py", line 106, in test
    dptnetlk.test_one_epoch(model, testloader, args.device, 'test', args.data_type, args.vis)
  File "/home/***/PytorchProject/***/dptlk_o/trainer.py", line 149, in test_one_epoch
    p1, None, j, self.xtol, self.p0_zero_mean, self.p1_zero_mean, mode, data_type)
  File "/home/***/PytorchProject/***/dptlk_o/model.py", line 189, in do_forward
    r = net(q0, q1, mode, maxiter=maxiter, xtol=xtol, voxel_coords_diff=voxel_coords_diff, data_type=data_type, num_random_points=num_random_points)
  File "/home/***/Software/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/***/PytorchProject/***/dptlk_o/model.py", line 210, in forward
    r, g, itr = self.iclk_new(g0, p0, p1, maxiter, xtol, mode, voxel_coords_diff=voxel_coords_diff, data_type=data_type, num_random_points=num_random_points)
  File "/home/***/PytorchProject/***/dptlk_o/model.py", line 297, in iclk_new
    num_points, p0, mode, voxel_coords_diff=voxel_coords_diff, data_type=data_type)   # B x N x K x D, K=1024, D=3 or 6
  File "/home/***/PytorchProject/***/dptlk_o/model.py", line 231, in Cal_Jac
    Mask_fn, A_fn, Ax_fn, BN_fn, self.device).to(self.device)
  File "/home/***/PytorchProject/***/dptlk_o/utils.py", line 316, in feature_jac
    A3BN3M3 =  M3 * dBN3 * A3
RuntimeError: CUDA out of memory. Tried to allocate 22.01 GiB (GPU 1; 23.69 GiB total capacity; 2.61 GiB already allocated; 18.16 GiB free; 3.69 GiB reserved in total by PyTorch)

I located the problem in model.py, lines 294-297.

if mode == 'test':
    f0, Mask_fn, A_fn, Ax_fn, BN_fn, max_idx = self.ptnet(p0, -1)
    J = self.Cal_Jac(Mask_fn, A_fn, Ax_fn, BN_fn, max_idx,
                     num_points, p0, mode, voxel_coords_diff=voxel_coords_diff, data_type=data_type)   # B x N x K x D, K=1024, D=3 or 6

When debugging, I found that the value of num_points is 927. I think this may be what you mentioned about computing a Jacobian with a larger number of points, but this function is too complicated for me to modify. I hope my comments are helpful to you.

Thank you,


Lilac-Lee commented May 31, 2023

Hi, I found the earlier discussion of a similar issue in #10 interesting, and it might be helpful here. Could you take a look? Cheers.

@CVrookieee

Hi @Lilac-Lee,

Thank you very much for your quick reply. I have looked at issue #10 and it's enlightening. My problem has been solved, though not through the method in issue #10. I found that data_utils.Resampler() is applied for the train and val modes but not for test. At test time a single point cloud can contain more than 45,000 points, and computing the Jacobian on that many points runs out of memory.

I modified lines 123-126 in test.py as follows (the last line is the one I added).

if args.dataset_type == 'modelnet':
    transform = torchvision.transforms.Compose([\
                data_utils.Mesh2Points(),\
                data_utils.OnUnitCube(),\
                data_utils.Resampler(args.num_points)])
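
For anyone else hitting this: conceptually, the resampling step just caps the number of points per cloud before the network sees it. A rough sketch of a Resampler-style transform (not the actual data_utils implementation, just the idea):

import numpy as np

class RandomResample:
    # Sketch of a Resampler-style transform: randomly keep num_points points
    # so the test-time Jacobian computation stays within GPU memory.
    def __init__(self, num_points):
        self.num_points = num_points

    def __call__(self, points):                  # points: N x 3 array
        n = points.shape[0]
        replace = n < self.num_points            # sample with replacement only if the cloud is small
        idx = np.random.choice(n, self.num_points, replace=replace)
        return points[idx]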

Thank you again.
