Training Issues with TruFor #42
Comments
I have tested the MVSS training config. MVSS does not have the above problems; I can train with more than 4 GPUs without error, and the metrics look good.
Do you use the default training shell script in IMDLBenCo and encounter NaN in every training run? Based on my experience training TruFor, dice loss can sometimes be unstable, so we use a smaller learning rate. Also, if you have larger GPU memory (and therefore a larger effective batch size), you may need to adjust the learning rate accordingly.
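For context, here is a minimal sketch of a generic soft dice loss with a smoothing term (this is not TruFor's exact loss implementation; the function name and shapes are assumptions for illustration). The smoothing term is one common way to tame the instability mentioned above:

```python
import torch

def soft_dice_loss(logits: torch.Tensor, target: torch.Tensor, eps: float = 1.0) -> torch.Tensor:
    """Generic soft dice loss for binary masks (illustrative, not TruFor's code).

    `logits` is assumed to have shape (N, 1, H, W) and `target` shape (N, H, W).
    The smoothing term `eps` keeps the denominator away from zero when a batch
    contains almost no tampered pixels, which is one common source of unstable
    gradients with dice-style losses.
    """
    prob = torch.sigmoid(logits).flatten(1)      # (N, H*W)
    target = target.flatten(1).float()           # (N, H*W)
    inter = (prob * target).sum(dim=1)
    union = prob.sum(dim=1) + target.sum(dim=1)
    dice = (2.0 * inter + eps) / (union + eps)
    return 1.0 - dice.mean()
```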
Yes, I understand that, and it's not a significant issue since GradScaler can skip these NaN losses during backpropagation.
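For readers unfamiliar with this behavior, a minimal AMP training-step sketch (the model, optimizer, and criterion names are placeholders) shows how `torch.cuda.amp.GradScaler` skips the optimizer step when the gradients contain inf/NaN:

```python
import torch
from torch.cuda.amp import GradScaler, autocast

scaler = GradScaler()

def train_step(model, batch, target, optimizer, criterion):
    optimizer.zero_grad(set_to_none=True)
    with autocast():
        pred = model(batch)
        loss = criterion(pred, target)
    scaler.scale(loss).backward()
    # If the scaled gradients contain inf/NaN, scaler.step() skips
    # optimizer.step() for this iteration and scaler.update() lowers the
    # loss scale, so an occasional NaN loss does not corrupt the weights.
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```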
I observed that the count for the image-level metric (e.g., "image-level Accuracy") is sometimes only 1...
I think the problem is here:
Maybe I'm wrong; MVSS also has a count of 1, but its total is correct.
We have noted this bug; we will check it and localize the issue as soon as possible.
Thank you for your quick response. I have identified a bug in the TruFor implementation that is causing incorrect calculations and significantly inflating the reported results. The issue lies in the shape of the predicted binary tensor output by the TruFor model.
Interesting bug. This can be solved by applying the squeeze() function to the output tensor.
You are right. The dimensions of the predicted labels are not aligned with the ground-truth labels.
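For reference, a minimal sketch of the fix, assuming the image-level head returns logits of shape (N, 1); the helper name and exact variable names in the TruFor wrapper are hypothetical:

```python
import torch

def image_level_accuracy(pred_logits: torch.Tensor, labels: torch.Tensor) -> float:
    """Hypothetical image-level accuracy helper (illustrative only).

    `pred_logits` is assumed to come out of the model with shape (N, 1);
    squeezing the trailing dimension aligns it with `labels` of shape (N,),
    so the comparison stays element-wise instead of broadcasting.
    """
    pred = (pred_logits.squeeze(-1) > 0).long()   # (N,)
    correct = (pred == labels).sum().item()
    return correct / labels.numel()
```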
After installing the package, I attempted to train TruFor using the default config. However, I encountered significant issues when trying to train on more than 2 GPUs: the training process frequently breaks down without providing any error information.
When I finally managed to train the model on 2 H100 GPUs, I observed NaN losses occurring intermittently during training, even though GradScaler is supposed to skip NaN values. Below is an example from the training log:
I've also noticed that the reported Image-level Accuracy values are greater than 1, which should be impossible for accuracy metrics. Here's an example from the log:
The image-level Accuracy is reported as 7.7600, which is not possible for a standard accuracy metric that should range from 0 to 1.
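An accuracy above 1 is consistent with a tensor-shape/broadcasting bug rather than a problem with the metric definition itself. A toy example (not the benchmark's actual code) shows how an un-squeezed prediction tensor can push the "correct" count above the number of samples:

```python
import torch

pred = torch.tensor([[1], [0], [1], [0]])   # predictions with shape (4, 1)
label = torch.tensor([1, 0, 1, 1])          # ground-truth labels with shape (4,)

# (4, 1) == (4,) broadcasts to a (4, 4) comparison matrix, so the number of
# "correct" entries can exceed the number of samples.
correct = (pred == label).sum().item()      # 8 on this toy data
accuracy = correct / label.numel()          # 2.0 -> impossible for a real accuracy

print(correct, accuracy)
```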