
Training Issues with TruFor #42

Open
iamwangyabin opened this issue Oct 15, 2024 · 8 comments

Comments

@iamwangyabin

After installing the package, I attempted to train TruFor using the default config. However, I encountered significant issues when trying to train on more than 2 GPUs: the training process frequently breaks down without providing any error information.

When I finally managed to train the model on 2 H100 GPUs, I observed NaN losses occurring intermittently during training, even though GradScaler is supposed to skip NaN values. Below is an example from the training log:


1384 [03:15:23.681523] Epoch: [2300/3613] eta: 0:33:55 lr: 0.000001 loss_ce: 0.3530 (nan) dice_loss: 0.3622 (0.3642) combined_loss: 0.3518 (nan) time: 1.5536 data: 0.0002 max mem: 55633
1385 [03:15:54.706125] Epoch: [2320/3613] eta: 0:33:24 lr: 0.000001 loss_ce: 0.3381 (nan) dice_loss: 0.2943 (0.3640) combined_loss: 0.2975 (nan) time: 1.5511 data: 0.0002 max mem: 55633
...
1392 [03:19:31.109580] Epoch: [2460/3613] eta: 0:29:47 lr: 0.000001 loss_ce: 0.3052 (nan) dice_loss: 0.2988 (0.3629) combined_loss: 0.3212 (nan) time: 1.5526 data: 0.0002 max mem: 55633

I've also noticed that the reported image-level Accuracy values are greater than 1, which is impossible for an accuracy metric. Here's an example from the log:



2356 [09:08:44.806920] Test: [4]  [1220/1355]  eta: 0:00:58  pixel-level F1: 3.6400 (0.2459)  pixel-level Accuracy: 15.1577 (0.9424)  time: 0.4322  data: 0.0002  max mem: 55633
2357 [09:08:53.450127] Test: [4]  [1240/1355]  eta: 0:00:50  pixel-level F1: 3.2157 (0.2455)  pixel-level Accuracy: 15.1957 (0.9425)  time: 0.4321  data: 0.0002  max mem: 55633
2358 [09:09:02.094049] Test: [4]  [1260/1355]  eta: 0:00:41  pixel-level F1: 3.6974 (0.2455)  pixel-level Accuracy: 14.7980 (0.9423)  time: 0.4322  data: 0.0002  max mem: 55633
2359 [09:09:10.735877] Test: [4]  [1280/1355]  eta: 0:00:32  pixel-level F1: 4.0385 (0.2458)  pixel-level Accuracy: 14.9980 (0.9424)  time: 0.4321  data: 0.0001  max mem: 55633
2360 [09:09:19.373706] Test: [4]  [1300/1355]  eta: 0:00:23  pixel-level F1: 3.4870 (0.2458)  pixel-level Accuracy: 15.2521 (0.9424)  time: 0.4319  data: 0.0002  max mem: 55633
2361 [09:09:28.008097] Test: [4]  [1320/1355]  eta: 0:00:15  pixel-level F1: 3.5316 (0.2453)  pixel-level Accuracy: 15.1872 (0.9424)  time: 0.4317  data: 0.0001  max mem: 55633
2362 [09:09:36.643513] Test: [4]  [1340/1355]  eta: 0:00:06  pixel-level F1: 3.8572 (0.2452)  pixel-level Accuracy: 15.2518 (0.9426)  time: 0.4317  data: 0.0001  max mem: 55633
2363 [09:09:42.688089] Test: [4]  [1354/1355]  eta: 0:00:00  pixel-level F1: 3.8572 (0.2451)  pixel-level Accuracy: 15.1298 (0.9426)  time: 0.4317  data: 0.0001  max mem: 55633
2364 [09:09:42.862711] Test: [4] Total time: 0:09:51 (0.4362 s / it)
2365 [09:09:42.863425] ***************************************************************
2366 [09:09:42.863506] ****An extra tail dataset should exist for accracy metrics!****
2367 [09:09:42.863562] ***************************************************************
2368 [09:09:42.863615] **** Length of tail: 5 ****
2369 [09:09:43.297684] ====================
2370 [09:09:43.298023] A batch that is not fully loaded was detected at the end of the dataset. The actual batch size for this batch is 5: The default batch size is 16
2371 [09:09:43.298088] ====================
2372 [09:09:43.298470] Actual Batchsize/ world_size {'_n': 2.5}
2373 [09:09:43.298647] {'pixel-level F1': tensor(0., device='cuda:0', dtype=torch.float64)}
2374 [09:09:43.328352] Actual Batchsize/ world_size {'_n': 2.5}
2375 [09:09:43.328514] {'pixel-level Accuracy': tensor(2.4969, device='cuda:0', dtype=torch.float64)}
2376 [09:09:43.330867] Test <remaining>: [4]  [0/1]  eta: 0:00:00  pixel-level F1: 3.8376 (0.2451)  pixel-level Accuracy: 15.0991 (0.9426)  time: 0.4655  data: 0.3176  max mem: 55633
2377 [09:09:43.331108] Test <remaining>: [4] Total time: 0:00:00 (0.4661 s / it)
2378 [09:09:45.373779] ---syncronized---
2379 [09:09:45.374095] pixel-level F1 reduced_count 43365
2380 [09:09:45.374198] pixel-level F1 reduced_sum 10573.642782675543
2381 [09:09:45.374291] image-level F1 reduced_count 2
2382 [09:09:45.374378] image-level F1 reduced_sum 1.0861882453763827
2383 [09:09:45.374466] pixel-level Accuracy reduced_count 43365
2384 [09:09:45.374551] pixel-level Accuracy reduced_sum 40856.14878845215
2385 [09:09:45.374638] image-level Accuracy reduced_count 2
2386 [09:09:45.374729] image-level Accuracy reduced_sum 15.531676273922066
2387 [09:09:45.374817] ---syncronized done ---
2388 [09:09:49.053563] Averaged stats: pixel-level F1: 3.8376 (0.2438)  pixel-level Accuracy: 15.0991 (0.9421)  image-level F1: 0.5431 (0.5431)  image-level Accuracy: 7.7658 (7.7658)
2389 [09:09:49.059104] Best pixel-level F1 = 0.24382895843826918

The image-level Accuracy is reported as 7.7658, which is not possible for a standard accuracy metric that should range from 0 to 1.

@iamwangyabin
Author

I have tested the MVSS training config. MVSS does not have the above problems: I can train it on more than 4 GPUs without error, and the metrics look correct.

@dddb11
Contributor

dddb11 commented Oct 15, 2024

Do you use the default training shell script in IMDLBenCo, and do you encounter NaN losses in every training run? Based on my experience training TruFor, the dice loss can sometimes be unstable, so we use a smaller learning rate. Also, if you have larger GPU memory (and therefore a larger batch size), you may need to adjust the learning rate accordingly.
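
As a rough rule of thumb, the learning rate can be scaled linearly with the effective (global) batch size, roughly like this (an illustrative sketch; the numbers are examples, not IMDLBenCo defaults):

# Linear LR scaling sketch; base_lr / base_batch_size are example values,
# not IMDLBenCo defaults.
def scaled_lr(base_lr, base_batch_size, per_gpu_batch_size, world_size):
    effective_batch_size = per_gpu_batch_size * world_size
    return base_lr * effective_batch_size / base_batch_size

# Tuned for a global batch of 16, now running 2 GPUs x 16 images each:
print(scaled_lr(base_lr=1e-4, base_batch_size=16,
                per_gpu_batch_size=16, world_size=2))  # 2e-04

and conversely it is worth lowering the rate when the dice loss starts producing NaNs.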

@iamwangyabin
Author

Do you use the default training shell script in IMDLBenCo, and do you encounter NaN losses in every training run? Based on my experience training TruFor, the dice loss can sometimes be unstable, so we use a smaller learning rate. Also, if you have larger GPU memory (and therefore a larger batch size), you may need to adjust the learning rate accordingly.

Yes, I understand that, and it's not a significant issue since GradScaler can skip these NaN losses during backpropagation.
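
For context, this is the standard torch.cuda.amp pattern I mean (a minimal sketch, not the exact IMDLBenCo training loop): when the scaled gradients contain inf/NaN, scaler.step() skips the optimizer update for that iteration, so an occasional NaN loss does not corrupt the weights, it only shows up in the logged running averages.

import torch

# Minimal AMP sketch (not the exact IMDLBenCo loop): GradScaler.step()
# skips the optimizer update whenever the scaled gradients contain inf/NaN.
model = torch.nn.Linear(8, 1).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()

for x, y in [(torch.randn(4, 8).cuda(), torch.randn(4, 1).cuda())]:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = torch.nn.functional.mse_loss(model(x), y)
    scaler.scale(loss).backward()
    scaler.step(optimizer)   # silently skipped if grads contain inf/NaN
    scaler.update()          # reduces the loss scale after a skipped step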

@iamwangyabin
Author

I observed that the count for the image-level metrics (e.g., "image-level Accuracy") is sometimes only 1...
That's why I get accuracy values larger than 1.
The total number of test images is 1000, and the pixel-level metrics are correct.

[12:39:12.063887] defaultdict(<class 'IMDLBenCo.training_scripts.utils.misc.SmoothedValue'>, {'pixel-level F1': <IMDLBenCo.training_scripts.utils.misc.SmoothedValue object at 0x7f3502242c50>, 'pixel-level Accuracy': <IMDLBenCo.training_scripts.utils.misc.SmoothedValue object at 0x7f3502242fe0>, 'image-level F1': <IMDLBenCo.training_scripts.utils.misc.SmoothedValue object at 0x7f3502218a30>, 'image-level Accuracy': <IMDLBenCo.training_scripts.utils.misc.SmoothedValue object at 0x7f3502218bb0>})
(Pdb) metric_logger.meters['pixel-level Accuracy'#]
[12:39:29.361305] *** SyntaxError: '[' was never closed
(Pdb) metric_logger.meters['pixel-level Accuracy']
[12:39:32.102573] <IMDLBenCo.training_scripts.utils.misc.SmoothedValue object at 0x7f3502242fe0>
(Pdb) metric_logger.meters['pixel-level Accuracy'].count
[12:39:44.795894] 1000
(Pdb) metric_logger.meters['pixel-level Accuracy'].total
[12:39:51.632652] 286.1601448059082
(Pdb) metric_logger.meters['image-level Accuracy'].count
[12:40:10.770748] 1
(Pdb) metric_logger.meters['image-level Accuracy'].total
[12:40:18.716873] 8.168

I think the problem exists here:

def update(self, value, n=1):
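
If the meter behaves roughly like the usual DeiT-style SmoothedValue (a hedged sketch reconstructed from the pdb output above, not the actual IMDLBenCo implementation), the global average is total / count, so a count of 1 combined with an inflated total immediately produces an "accuracy" above 1:

# Sketch of a DeiT-style SmoothedValue; field names mirror the pdb output
# above but this is not the actual IMDLBenCo code.
class SmoothedValue:
    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, value, n=1):
        # `value` is assumed to be an average over `n` samples
        self.total += value * n
        self.count += n

    @property
    def global_avg(self):
        return self.total / self.count

meter = SmoothedValue()
meter.update(8.168, n=1)   # the image-level numbers observed above
print(meter.global_avg)    # 8.168 -> reported as "accuracy"

So either n is wrong when update() is called, or the value being passed in is already inflated before it reaches the meter.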

@iamwangyabin
Author

Maybe I'm wrong about update() itself: MVSS also has a count of 1 for the image-level metric, but its total is correct.
Below is the same debug session run for MVSS:

(Pdb) metric_logger.meters['image-level Accuracy'].count
[13:37:36.288908] 1
(Pdb) metric_logger.meters['image-level Accuracy'].total
[13:37:49.646384] 0.494
(Pdb) metric_logger.meters['pixel-level Accuracy'].total
[13:38:18.780522] 690.0959243774414
(Pdb) metric_logger.meters['pixel-level Accuracy'].count
[13:38:27.091506] 1000

@SunnyHaze
Contributor

We have received this bug report; we will check it and localize the issue as soon as possible.

@iamwangyabin
Author

iamwangyabin commented Oct 15, 2024

We have received this bug report; we will check it and localize the issue as soon as possible.

Thank you for your quick response.

I have identified a bug in the TruFor implementation that causes incorrect calculations and significantly inflates the results. The issue lies in the shape of the predicted binary tensor output by the TruFor model: it should be a 1-dimensional tensor, but TruFor outputs a 2-dimensional (batch, 1) tensor, which then broadcasts against the 1-dimensional label tensor:


(Pdb) torch.sum((1 - predict) * (1 - label)).item()
[13:58:38.743372] 176.0
(Pdb) 1-predict
[13:59:01.708168] tensor([[1.],
        [1.],
        [1.],
        [1.],
        [1.],
        [1.],
        [1.],
        [1.],
        [1.],
        [1.],
        [1.],
        [1.],
        [1.],
        [1.],
        [1.],
        [1.]], device='cuda:0')
(Pdb) 1-label
[13:59:09.698975] tensor([1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0], device='cuda:0')
(Pdb) (1 - predict) * (1 - label)
[13:59:19.740963] tensor([[1., 1., 1., 1., 1., 1., 1., 0., 0., 1., 1., 1., 0., 0., 1., 0.],
        [1., 1., 1., 1., 1., 1., 1., 0., 0., 1., 1., 1., 0., 0., 1., 0.],
        [1., 1., 1., 1., 1., 1., 1., 0., 0., 1., 1., 1., 0., 0., 1., 0.],
        [1., 1., 1., 1., 1., 1., 1., 0., 0., 1., 1., 1., 0., 0., 1., 0.],
        [1., 1., 1., 1., 1., 1., 1., 0., 0., 1., 1., 1., 0., 0., 1., 0.],
        [1., 1., 1., 1., 1., 1., 1., 0., 0., 1., 1., 1., 0., 0., 1., 0.],
        [1., 1., 1., 1., 1., 1., 1., 0., 0., 1., 1., 1., 0., 0., 1., 0.],
        [1., 1., 1., 1., 1., 1., 1., 0., 0., 1., 1., 1., 0., 0., 1., 0.],
        [1., 1., 1., 1., 1., 1., 1., 0., 0., 1., 1., 1., 0., 0., 1., 0.],
        [1., 1., 1., 1., 1., 1., 1., 0., 0., 1., 1., 1., 0., 0., 1., 0.],
        [1., 1., 1., 1., 1., 1., 1., 0., 0., 1., 1., 1., 0., 0., 1., 0.],
        [1., 1., 1., 1., 1., 1., 1., 0., 0., 1., 1., 1., 0., 0., 1., 0.],
        [1., 1., 1., 1., 1., 1., 1., 0., 0., 1., 1., 1., 0., 0., 1., 0.],
        [1., 1., 1., 1., 1., 1., 1., 0., 0., 1., 1., 1., 0., 0., 1., 0.],
        [1., 1., 1., 1., 1., 1., 1., 0., 0., 1., 1., 1., 0., 0., 1., 0.],
        [1., 1., 1., 1., 1., 1., 1., 0., 0., 1., 1., 1., 0., 0., 1., 0.]],
       device='cuda:0')

Interesting bug. It can be fixed by applying squeeze() to the output tensor.
I have submitted a pull request that addresses the problem this way. However, note that other methods or parts of the codebase may be implemented in a similar way and may require the same modification.
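
A minimal reproduction of the broadcast, with the label values copied from the pdb session above (variable names are illustrative, not the exact TruFor/IMDLBenCo code):

import torch

# A (B, 1) prediction broadcasts against a (B,) label into a (B, B) matrix,
# so reductions like the true-negative count are inflated by a factor of ~B.
predict = torch.zeros(16, 1)   # TruFor-style (B, 1) binary prediction
label = torch.tensor([0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1])

tn_broadcast = torch.sum((1 - predict) * (1 - label))             # (16, 16) -> 176.0
tn_correct = torch.sum((1 - predict.squeeze(-1)) * (1 - label))   # (16,)    -> 11.0
print(tn_broadcast.item(), tn_correct.item())

With the squeeze, the true-negative term counts the 11 genuine negatives in the batch instead of 176, which matches the inflation seen in the image-level metrics.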

@dddb11
Contributor

dddb11 commented Oct 24, 2024

You are right; the dimensions of the predicted label are not aligned.
