
Training Issues with TruFor #42

Open
iamwangyabin opened this issue Oct 15, 2024 · 8 comments

Comments

@iamwangyabin

After installing the package, I attempted to train TruFor using the default config. However, I encountered significant issues when trying to train on more than 2 GPUs: the training process frequently breaks down without providing any error information.

When I finally managed to train the model on 2 H100 GPUs, I observed NaN losses occurring intermittently during training, even though GradScaler is supposed to skip NaN values. Below is an example from the training log:


1384 [03:15:23.681523] Epoch: [2300/3613] eta: 0:33:55 lr: 0.000001 loss_ce: 0.3530 (nan) dice_loss: 0.3622 (0.3642) combined_loss: 0.3518 (nan) time: 1.5536 data: 0.0002 max mem: 55633
1385 [03:15:54.706125] Epoch: [2320/3613] eta: 0:33:24 lr: 0.000001 loss_ce: 0.3381 (nan) dice_loss: 0.2943 (0.3640) combined_loss: 0.2975 (nan) time: 1.5511 data: 0.0002 max mem: 55633
...
1392 [03:19:31.109580] Epoch: [2460/3613] eta: 0:29:47 lr: 0.000001 loss_ce: 0.3052 (nan) dice_loss: 0.2988 (0.3629) combined_loss: 0.3212 (nan) time: 1.5526 data: 0.0002 max mem: 55633

I've also noticed that the reported image-level Accuracy values are greater than 1, which is impossible for an accuracy metric. Here's an example from the log:



2356 [09:08:44.806920] Test: [4]  [1220/1355]  eta: 0:00:58  pixel-level F1: 3.6400 (0.2459)  pixel-level Accuracy: 15.1577 (0.9424)  time: 0.4322  data: 0.0002  max mem: 55633
2357 [09:08:53.450127] Test: [4]  [1240/1355]  eta: 0:00:50  pixel-level F1: 3.2157 (0.2455)  pixel-level Accuracy: 15.1957 (0.9425)  time: 0.4321  data: 0.0002  max mem: 55633
2358 [09:09:02.094049] Test: [4]  [1260/1355]  eta: 0:00:41  pixel-level F1: 3.6974 (0.2455)  pixel-level Accuracy: 14.7980 (0.9423)  time: 0.4322  data: 0.0002  max mem: 55633
2359 [09:09:10.735877] Test: [4]  [1280/1355]  eta: 0:00:32  pixel-level F1: 4.0385 (0.2458)  pixel-level Accuracy: 14.9980 (0.9424)  time: 0.4321  data: 0.0001  max mem: 55633
2360 [09:09:19.373706] Test: [4]  [1300/1355]  eta: 0:00:23  pixel-level F1: 3.4870 (0.2458)  pixel-level Accuracy: 15.2521 (0.9424)  time: 0.4319  data: 0.0002  max mem: 55633
2361 [09:09:28.008097] Test: [4]  [1320/1355]  eta: 0:00:15  pixel-level F1: 3.5316 (0.2453)  pixel-level Accuracy: 15.1872 (0.9424)  time: 0.4317  data: 0.0001  max mem: 55633
2362 [09:09:36.643513] Test: [4]  [1340/1355]  eta: 0:00:06  pixel-level F1: 3.8572 (0.2452)  pixel-level Accuracy: 15.2518 (0.9426)  time: 0.4317  data: 0.0001  max mem: 55633
2363 [09:09:42.688089] Test: [4]  [1354/1355]  eta: 0:00:00  pixel-level F1: 3.8572 (0.2451)  pixel-level Accuracy: 15.1298 (0.9426)  time: 0.4317  data: 0.0001  max mem: 55633
2364 [09:09:42.862711] Test: [4] Total time: 0:09:51 (0.4362 s / it)
2365 [09:09:42.863425] ***************************************************************
2366 [09:09:42.863506] ****An extra tail dataset should exist for accracy metrics!****
2367 [09:09:42.863562] ***************************************************************
2368 [09:09:42.863615] **** Length of tail: 5 ****
2369 [09:09:43.297684] ====================
2370 [09:09:43.298023] A batch that is not fully loaded was detected at the end of the dataset. The actual batch size for this batch is 5: The default batch size is 16
2371 [09:09:43.298088] ====================
2372 [09:09:43.298470] Actual Batchsize/ world_size {'_n': 2.5}
2373 [09:09:43.298647] {'pixel-level F1': tensor(0., device='cuda:0', dtype=torch.float64)}
2374 [09:09:43.328352] Actual Batchsize/ world_size {'_n': 2.5}
2375 [09:09:43.328514] {'pixel-level Accuracy': tensor(2.4969, device='cuda:0', dtype=torch.float64)}
2376 [09:09:43.330867] Test <remaining>: [4]  [0/1]  eta: 0:00:00  pixel-level F1: 3.8376 (0.2451)  pixel-level Accuracy: 15.0991 (0.9426)  time: 0.4655  data: 0.3176  max mem: 55633
2377 [09:09:43.331108] Test <remaining>: [4] Total time: 0:00:00 (0.4661 s / it)
2378 [09:09:45.373779] ---syncronized---
2379 [09:09:45.374095] pixel-level F1 reduced_count 43365
2380 [09:09:45.374198] pixel-level F1 reduced_sum 10573.642782675543
2381 [09:09:45.374291] image-level F1 reduced_count 2
2382 [09:09:45.374378] image-level F1 reduced_sum 1.0861882453763827
2383 [09:09:45.374466] pixel-level Accuracy reduced_count 43365
2384 [09:09:45.374551] pixel-level Accuracy reduced_sum 40856.14878845215
2385 [09:09:45.374638] image-level Accuracy reduced_count 2
2386 [09:09:45.374729] image-level Accuracy reduced_sum 15.531676273922066
2387 [09:09:45.374817] ---syncronized done ---
2388 [09:09:49.053563] Averaged stats: pixel-level F1: 3.8376 (0.2438)  pixel-level Accuracy: 15.0991 (0.9421)  image-level F1: 0.5431 (0.5431)  image-level Accuracy: 7.7658 (7.7658)
2389 [09:09:49.059104] Best pixel-level F1 = 0.24382895843826918

The image-level Accuracy is reported as 7.7658, which is not possible for a standard accuracy metric that should range from 0 to 1.

@iamwangyabin
Author

I have tested the MVSS training config. MVSS does not have the above problems: I can train it on more than 4 GPUs without error, and the metrics look correct.

@dddb11
Contributor

dddb11 commented Oct 15, 2024

Do you use the default training shell script in IMDLBenCo, and do you encounter NaN losses in every training run? Based on my experience training TruFor, the dice loss can sometimes be unstable, so we use a smaller learning rate. Also, if you have larger GPU memory (and therefore a larger batch size), you may need to adjust the learning rate accordingly.
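
As a rough rule of thumb, the learning rate can be scaled linearly with the effective (global) batch size, roughly like this (an illustrative sketch; the numbers are examples, not IMDLBenCo defaults):

# Linear LR scaling sketch; base_lr / base_batch_size are example values,
# not IMDLBenCo defaults.
def scaled_lr(base_lr, base_batch_size, per_gpu_batch_size, world_size):
    effective_batch_size = per_gpu_batch_size * world_size
    return base_lr * effective_batch_size / base_batch_size

# Tuned for a global batch of 16, now running 2 GPUs x 16 images each:
print(scaled_lr(base_lr=1e-4, base_batch_size=16,
                per_gpu_batch_size=16, world_size=2))  # 2e-04

and conversely it is worth lowering the rate when the dice loss starts producing NaNs.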

@iamwangyabin
Author

Do you use the default training shell script in IMDLBenCo, and do you encounter NaN losses in every training run? Based on my experience training TruFor, the dice loss can sometimes be unstable, so we use a smaller learning rate. Also, if you have larger GPU memory (and therefore a larger batch size), you may need to adjust the learning rate accordingly.

Yes, I understand that, and it's not a significant issue since GradScaler can skip these NaN losses during backpropagation.
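
For context, this is the standard torch.cuda.amp pattern I mean (a minimal sketch, not the exact IMDLBenCo training loop): when the scaled gradients contain inf/NaN, scaler.step() skips the optimizer update for that iteration, so an occasional NaN loss does not corrupt the weights, it only shows up in the logged running averages.

import torch

# Minimal AMP sketch (not the exact IMDLBenCo loop): GradScaler.step()
# skips the optimizer update whenever the scaled gradients contain inf/NaN.
model = torch.nn.Linear(8, 1).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()

for x, y in [(torch.randn(4, 8).cuda(), torch.randn(4, 1).cuda())]:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = torch.nn.functional.mse_loss(model(x), y)
    scaler.scale(loss).backward()
    scaler.step(optimizer)   # silently skipped if grads contain inf/NaN
    scaler.update()          # reduces the loss scale after a skipped step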

@iamwangyabin
Author

I observed that the count for the image-level metrics (e.g., "image-level Accuracy") is sometimes only 1...
That's why I get accuracy values larger than 1.
The total number of test images is 1000, and the pixel-level metrics are correct.

[12:39:12.063887] defaultdict(<class 'IMDLBenCo.training_scripts.utils.misc.SmoothedValue'>, {'pixel-level F1': <IMDLBenCo.training_scripts.utils.misc.SmoothedValue object at 0x7f3502242c50>, 'pixel-level Accuracy': <IMDLBenCo.training_scripts.utils.misc.SmoothedValue object at 0x7f3502242fe0>, 'image-level F1': <IMDLBenCo.training_scripts.utils.misc.SmoothedValue object at 0x7f3502218a30>, 'image-level Accuracy': <IMDLBenCo.training_scripts.utils.misc.SmoothedValue object at 0x7f3502218bb0>})
(Pdb) metric_logger.meters['pixel-level Accuracy'#]
[12:39:29.361305] *** SyntaxError: '[' was never closed
(Pdb) metric_logger.meters['pixel-level Accuracy']
[12:39:32.102573] <IMDLBenCo.training_scripts.utils.misc.SmoothedValue object at 0x7f3502242fe0>
(Pdb) metric_logger.meters['pixel-level Accuracy'].count
[12:39:44.795894] 1000
(Pdb) metric_logger.meters['pixel-level Accuracy'].total
[12:39:51.632652] 286.1601448059082
(Pdb) metric_logger.meters['image-level Accuracy'].count
[12:40:10.770748] 1
(Pdb) metric_logger.meters['image-level Accuracy'].total
[12:40:18.716873] 8.168

I think the problem exists here:

def update(self, value, n=1):
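
If the meter behaves roughly like the usual DeiT-style SmoothedValue (a hedged sketch reconstructed from the pdb output above, not the actual IMDLBenCo implementation), the global average is total / count, so a count of 1 combined with an inflated total immediately produces an "accuracy" above 1:

# Sketch of a DeiT-style SmoothedValue; field names mirror the pdb output
# above but this is not the actual IMDLBenCo code.
class SmoothedValue:
    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, value, n=1):
        # `value` is assumed to be an average over `n` samples
        self.total += value * n
        self.count += n

    @property
    def global_avg(self):
        return self.total / self.count

meter = SmoothedValue()
meter.update(8.168, n=1)   # the image-level numbers observed above
print(meter.global_avg)    # 8.168 -> reported as "accuracy"

So either n is wrong when update() is called, or the value being passed in is already inflated before it reaches the meter.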

@iamwangyabin
Author

Maybe I'm wrong about update() itself: MVSS also has a count of 1 for the image-level metric, but its total is correct.
Below is the same debug session run for MVSS:

(Pdb) metric_logger.meters['image-level Accuracy'].count
[13:37:36.288908] 1
(Pdb) metric_logger.meters['image-level Accuracy'].total
[13:37:49.646384] 0.494
(Pdb) metric_logger.meters['pixel-level Accuracy'].total
[13:38:18.780522] 690.0959243774414
(Pdb) metric_logger.meters['pixel-level Accuracy'].count
[13:38:27.091506] 1000

@SunnyHaze
Contributor

We have received this bug report; we will check it and localize the issue as soon as possible.

@iamwangyabin
Author

iamwangyabin commented Oct 15, 2024

We have received this bug report; we will check it and localize the issue as soon as possible.

Thank you for your quick response.

I have identified a bug in the TruFor implementation that causes incorrect calculations and significantly inflates the results. The issue lies in the shape of the predicted binary tensor output by the TruFor model: it should be a 1-dimensional tensor, but TruFor outputs a 2-dimensional (batch, 1) tensor, which then broadcasts against the 1-dimensional label tensor:


(Pdb) torch.sum((1 - predict) * (1 - label)).item()
[13:58:38.743372] 176.0
(Pdb) 1-predict
[13:59:01.708168] tensor([[1.],
        [1.],
        [1.],
        [1.],
        [1.],
        [1.],
        [1.],
        [1.],
        [1.],
        [1.],
        [1.],
        [1.],
        [1.],
        [1.],
        [1.],
        [1.]], device='cuda:0')
(Pdb) 1-label
[13:59:09.698975] tensor([1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0], device='cuda:0')
(Pdb) (1 - predict) * (1 - label)
[13:59:19.740963] tensor([[1., 1., 1., 1., 1., 1., 1., 0., 0., 1., 1., 1., 0., 0., 1., 0.],
        [1., 1., 1., 1., 1., 1., 1., 0., 0., 1., 1., 1., 0., 0., 1., 0.],
        [1., 1., 1., 1., 1., 1., 1., 0., 0., 1., 1., 1., 0., 0., 1., 0.],
        [1., 1., 1., 1., 1., 1., 1., 0., 0., 1., 1., 1., 0., 0., 1., 0.],
        [1., 1., 1., 1., 1., 1., 1., 0., 0., 1., 1., 1., 0., 0., 1., 0.],
        [1., 1., 1., 1., 1., 1., 1., 0., 0., 1., 1., 1., 0., 0., 1., 0.],
        [1., 1., 1., 1., 1., 1., 1., 0., 0., 1., 1., 1., 0., 0., 1., 0.],
        [1., 1., 1., 1., 1., 1., 1., 0., 0., 1., 1., 1., 0., 0., 1., 0.],
        [1., 1., 1., 1., 1., 1., 1., 0., 0., 1., 1., 1., 0., 0., 1., 0.],
        [1., 1., 1., 1., 1., 1., 1., 0., 0., 1., 1., 1., 0., 0., 1., 0.],
        [1., 1., 1., 1., 1., 1., 1., 0., 0., 1., 1., 1., 0., 0., 1., 0.],
        [1., 1., 1., 1., 1., 1., 1., 0., 0., 1., 1., 1., 0., 0., 1., 0.],
        [1., 1., 1., 1., 1., 1., 1., 0., 0., 1., 1., 1., 0., 0., 1., 0.],
        [1., 1., 1., 1., 1., 1., 1., 0., 0., 1., 1., 1., 0., 0., 1., 0.],
        [1., 1., 1., 1., 1., 1., 1., 0., 0., 1., 1., 1., 0., 0., 1., 0.],
        [1., 1., 1., 1., 1., 1., 1., 0., 0., 1., 1., 1., 0., 0., 1., 0.]],
       device='cuda:0')

Interesting bug. It can be fixed by applying squeeze() to the output tensor.
I have submitted a pull request that addresses the problem this way. However, note that other methods or parts of the codebase may be implemented in a similar way and may require the same modification.
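
A minimal reproduction of the broadcast, with the label values copied from the pdb session above (variable names are illustrative, not the exact TruFor/IMDLBenCo code):

import torch

# A (B, 1) prediction broadcasts against a (B,) label into a (B, B) matrix,
# so reductions like the true-negative count are inflated by a factor of ~B.
predict = torch.zeros(16, 1)   # TruFor-style (B, 1) binary prediction
label = torch.tensor([0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1])

tn_broadcast = torch.sum((1 - predict) * (1 - label))             # (16, 16) -> 176.0
tn_correct = torch.sum((1 - predict.squeeze(-1)) * (1 - label))   # (16,)    -> 11.0
print(tn_broadcast.item(), tn_correct.item())

With the squeeze, the true-negative term counts the 11 genuine negatives in the batch instead of 176, which matches the inflation seen in the image-level metrics.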

@dddb11
Contributor

dddb11 commented Oct 24, 2024

You are right; the dimensions of the predicted label are not aligned.
