Training using Nvidia A100 GPU #114

Open
ajaysurya1221 opened this issue Jun 8, 2022 · 1 comment

Comments

@ajaysurya1221

Hi, I'm using one A100 GPU to train PICK, and I've set distributed to false.

[2022-06-08 01:41:58,561 - train - INFO] - One GPU or CPU training mode start...
[2022-06-08 01:41:58,565 - train - INFO] - Dataloader instances created. Train datasets: 100 samples Validation datasets: 20 samples.
[2022-06-08 01:41:59,276 - train - INFO] - Model created, trainable parameters: 68571598.
[2022-06-08 01:41:59,277 - train - INFO] - Optimizer and lr_scheduler created.
[2022-06-08 01:41:59,277 - train - INFO] - Max_epochs: 35 Log_per_step: 20 Validation_per_step: 100.
[2022-06-08 01:41:59,277 - train - INFO] - Training start...
[2022-06-08 01:41:59,289 - trainer - WARNING] - Training is using GPU 0!

Training has been stuck at this point for a long time, and after 10-15 minutes it throws a cuDNN error. Is there any solution?

CUDA version = 10.1 and PyTorch = 1.5.1+cu101

@bankh

bankh commented Jan 12, 2023

@ajaysurya1221 The A100 uses the Ampere architecture, whose compute capability is sm_8x (sm_80 for the A100). CUDA 10.1 predates Ampere, so the cu101 PyTorch build that PICK's implementation requires contains no kernels compiled for sm_80, and some of the CUDA computations do not run the way they should. You can try a different PyTorch version built against a newer CUDA (e.g., cu111).
If you switch to cu111 you will run into different issues, specifically on the decoder side of the model. There are a few quick patches you can apply to work around those as well.
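To confirm the mismatch before swapping builds, something like the following sketch may help. It compares the GPU's compute capability with the architectures the installed PyTorch binary was compiled for. The helper names (`parse_arch`, `has_native_kernels`, `report`) are made up for illustration and are not part of PICK or PyTorch. The check only looks at native `sm_XY` kernels and ignores PTX (`compute_XY`) entries, and it assumes a PyTorch version that provides `torch.cuda.get_arch_list()`; older releases may not have it, in which case the script just says so.

```python
def parse_arch(tag):
    """Turn 'sm_80' into (8, 0); return None for other tags such as 'compute_80'."""
    prefix, _, digits = tag.partition("_")
    if prefix != "sm" or not digits.isdigit() or len(digits) < 2:
        return None
    # The last digit is the minor version, everything before it is the major.
    return int(digits[:-1]), int(digits[-1])


def has_native_kernels(capability, arch_list):
    """Return True if arch_list holds binary kernels usable on this device.

    A binary built for sm_Xy runs on a device of capability X.z when z >= y.
    PTX entries are skipped, so this is a conservative check.
    """
    major, minor = capability
    for tag in arch_list:
        parsed = parse_arch(tag)
        if parsed is None:
            continue
        if parsed[0] == major and parsed[1] <= minor:
            return True
    return False


def report():
    """Print whether the installed PyTorch build targets GPU 0."""
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment.")
        return
    if not torch.cuda.is_available():
        print("PyTorch does not see a CUDA device.")
        return
    # Assumption: older PyTorch releases may lack get_arch_list().
    get_arch_list = getattr(torch.cuda, "get_arch_list", None)
    if get_arch_list is None:
        print("This PyTorch version cannot list its compiled architectures.")
        return
    capability = torch.cuda.get_device_capability(0)
    arch_list = get_arch_list()
    print("PyTorch", torch.__version__, "built with CUDA", torch.version.cuda)
    print("GPU 0 capability:", capability, "compiled for:", arch_list)
    if has_native_kernels(capability, arch_list):
        print("This build contains native kernels for the GPU.")
    else:
        print("No native kernels for this GPU; try a build with a newer CUDA.")
```

Calling `report()` on the training machine prints the result. The architecture lists can also be passed to `has_native_kernels` directly: for an A100, `has_native_kernels((8, 0), ["sm_70", "sm_75"])` returns `False`, where the list is an illustrative example rather than the actual contents of any particular wheel.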
