
🐛 Getting NaN values for MSE in Montevideo Bus #132

Closed
nithinmanoj10 opened this issue May 12, 2024 · 3 comments · Fixed by #135
Labels
bug (Something isn't working), v1.1.0 (Tasks for STGraph v1.1.0)

Comments

@nithinmanoj10
Contributor

Issue: The MSE becomes NaN after epoch 1 when training a TGCN on the Montevideo Bus dataset.

How to replicate the error: Run the v1.1.0 test script for the temporal_tgcn_dataloaders testpack, then inspect the outputs folder for Montevideo Bus to view the MSE values.

@nithinmanoj10 added the bug and v1.1.0 labels on May 12, 2024
@nithinmanoj10
Contributor Author

Current Findings

  1. The gradients of all weight and bias parameters inside the TGCN are NaN, except for linear2.weight and linear2.bias.
  2. This occurs only with the Montevideo Bus dataset and with none of the other datasets.

I have attached a log file showing the value and gradient of each parameter during the first epoch. The cost is printed at the start of each iteration within that epoch.

Montevideo_Bus.txt
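
For anyone wanting to reproduce this kind of per-parameter check, here is a minimal sketch in PyTorch. It is not the script that produced the attached log. The helper name nan_grad_report and the small stand-in model are assumptions made for illustration, and the real STGraph TGCN is not reproduced here.

```python
import torch
from torch import nn


def nan_grad_report(model: nn.Module) -> dict:
    """Map each parameter name to True if its gradient contains a NaN."""
    report = {}
    for name, param in model.named_parameters():
        if param.grad is None:
            continue  # parameter did not take part in the backward pass
        report[name] = bool(torch.isnan(param.grad).any())
    return report


# Hypothetical stand-in model; swap in the actual TGCN when debugging.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
x = torch.randn(16, 4)
y = torch.randn(16, 1)

loss = nn.functional.mse_loss(model(x), y)
loss.backward()

report = nan_grad_report(model)
for name, has_nan in report.items():
    print(f"{name}: {'NaN' if has_nan else 'ok'}")
```

Calling the helper right after loss.backward() in each iteration shows the first iteration at which a gradient turns NaN, which narrows down where to look.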

How to Proceed

Study how the gradients are computed, and search forums for reports of similar NaN gradients.
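
One tool that may help with this is PyTorch's anomaly detection, which raises an error naming the backward function that first produced a NaN. The snippet below is only a toy sketch of that mechanism: the sqrt-at-zero example is a deliberately constructed case, not the cause found in this issue, and nothing here is taken from the STGraph code.

```python
import torch

x = torch.tensor([0.0], requires_grad=True)

caught = None
with torch.autograd.detect_anomaly():
    # Forward pass is finite (0 * 0 = 0), but the backward pass of sqrt
    # at 0 divides 0 by 0, so the gradient reaching x is NaN.
    y = (torch.sqrt(x) * 0.0).sum()
    try:
        y.backward()
    except RuntimeError as err:
        caught = err  # anomaly mode turns the NaN into an exception

print(caught)
```

Anomaly detection slows training noticeably, so it is best wrapped around a single failing iteration rather than left on for a full run.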

@nithinmanoj10 linked a pull request on May 19, 2024 that will close this issue
@nithinmanoj10
Contributor Author

nithinmanoj10 commented May 19, 2024

First Fix

I was able to run the training script for Montevideo Bus for 5 epochs without getting any NaN values. The first fix can be found in this commit: c66ac7b

🎉 CUDA is available
Training...

                                                      
   (STGraph Static-Temporal) TGCN on Montevideo_Bus   
                       dataset                        
                                                      
 Epoch ┃ Time(s)  ┃ MSE    ┃ Used GPU Memory (Max MB) 
━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━
 0     │ 11.36216 │ 2.0212 │ 3189.5093                
 1     │ 5.95814  │ 1.9548 │ 3189.5337                
 2     │ 5.86139  │ 2.0020 │ 3189.5337                
 3     │ 5.90587  │ 1.9583 │ 3189.4888                
 4     │ 6.07822  │ 2.0046 │ 3189.4888                
Average Time taken: 5.992046

@nithinmanoj10
Contributor Author

nithinmanoj10 commented May 20, 2024

Things To-Do

  • Try to make the script run a bit faster
  • Modify train.py so that the right node features are used to train the model for all datasets - will provide this change in a future release
  • Make sure the right node features value is passed to train.py. Maybe add an attribute called num_node_feats to all dataset loaders - will provide this in a future release
  • Add the model parameters info to the benchmark_tools module, and rename the module from benchmark_tools to tools - will address this in a different Pull Request/Issue
