@ajaysurya1221 The A100 is an Ampere GPU with compute capability 8.0 (sm_80). CUDA 10.1 predates Ampere, so the cu101 PyTorch build that PICK's implementation requires ships no kernels compiled for that architecture, and some of the CUDA computations do not run the way they should. You can try a PyTorch version built against a newer CUDA (e.g., cu111).
If you switch to cu111 you will run into different issues, specifically on the decoder side of the model. There are a few quick patches you can use to work around those as well.
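To confirm the mismatch on your own machine, a small check along these lines may help. It is only a sketch: `arch_is_supported` and `report` are helper names made up for this example, and it assumes a PyTorch version recent enough to expose `torch.cuda.get_arch_list()` (older releases such as 1.5.1 may not have it, in which case the script just says so). The matching rule follows the usual CUDA convention: an `sm_XY` binary runs on devices with the same major version and an equal or higher minor version, and `compute_XY` PTX can be JIT-compiled for any device of equal or higher capability.

```python
def arch_is_supported(capability, arch_list):
    """Return True if a build with `arch_list` can run on a GPU with `capability`.

    capability: (major, minor) tuple, e.g. (8, 0) for an A100.
    arch_list:  strings such as "sm_75" or "compute_80".
    """
    major, minor = capability
    for arch in arch_list:
        kind, _, number = arch.partition("_")
        if not number.isdigit():
            continue  # ignore entries we do not understand
        arch_major, arch_minor = divmod(int(number), 10)
        if kind == "sm":
            # Compiled binaries: same major version, minor not newer than the device.
            if arch_major == major and arch_minor <= minor:
                return True
        elif kind == "compute":
            # PTX: can be JIT-compiled for any device of equal or higher capability.
            if (arch_major, arch_minor) <= (major, minor):
                return True
    return False


def report():
    """Print the local PyTorch/CUDA setup and whether GPU 0 is covered by the build."""
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed")
        return None

    print("PyTorch:", torch.__version__, "| built for CUDA:", torch.version.cuda)
    if not torch.cuda.is_available():
        print("No CUDA device is visible")
        return None

    capability = torch.cuda.get_device_capability(0)
    print("GPU 0:", torch.cuda.get_device_name(0), "| capability:", capability)

    # Assumption: get_arch_list() is missing on older PyTorch releases.
    get_arch_list = getattr(torch.cuda, "get_arch_list", None)
    if get_arch_list is None:
        print("This PyTorch version does not expose torch.cuda.get_arch_list()")
        return None

    supported = arch_is_supported(capability, get_arch_list())
    print("Build covers this GPU:", supported)
    return supported


if __name__ == "__main__":
    report()
```

If the last line prints `False` on the A100, the installed build has nothing it can run on that GPU, and moving to a build against a newer CUDA is the first thing to try.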
Hi, I'm using one A100 GPU to train PICK, and I've set distributed to false.
[2022-06-08 01:41:58,561 - train - INFO] - One GPU or CPU training mode start...
[2022-06-08 01:41:58,565 - train - INFO] - Dataloader instances created. Train datasets: 100 samples Validation datasets: 20 samples.
[2022-06-08 01:41:59,276 - train - INFO] - Model created, trainable parameters: 68571598.
[2022-06-08 01:41:59,277 - train - INFO] - Optimizer and lr_scheduler created.
[2022-06-08 01:41:59,277 - train - INFO] - Max_epochs: 35 Log_per_step: 20 Validation_per_step: 100.
[2022-06-08 01:41:59,277 - train - INFO] - Training start...
[2022-06-08 01:41:59,289 - trainer - WARNING] - Training is using GPU 0!
I've been stuck here for a long time, and after 10-15 minutes it throws a cuDNN error. Is there any solution?
CUDA version = 10.1 and PyTorch = 1.5.1+cu101