Training using Nvidia A100 GPU #114

ajaysurya1221 · 2022-06-08T05:48:16Z

Hi, I'm using one A100 GPU to train PICK, and I've set distributed to false.

[2022-06-08 01:41:58,561 - train - INFO] - One GPU or CPU training mode start...
[2022-06-08 01:41:58,565 - train - INFO] - Dataloader instances created. Train datasets: 100 samples Validation datasets: 20 samples.
[2022-06-08 01:41:59,276 - train - INFO] - Model created, trainable parameters: 68571598.
[2022-06-08 01:41:59,277 - train - INFO] - Optimizer and lr_scheduler created.
[2022-06-08 01:41:59,277 - train - INFO] - Max_epochs: 35 Log_per_step: 20 Validation_per_step: 100.
[2022-06-08 01:41:59,277 - train - INFO] - Training start...
[2022-06-08 01:41:59,289 - trainer - WARNING] - Training is using GPU 0!

I've been stuck here for a long time, and after 10-15 minutes it throws a cuDNN error. Is there any solution?

CUDA version = 10.1 and PyTorch = 1.5.1+cu101

bankh · 2023-01-12T17:15:53Z

@ajaysurya1221 The A100 uses the Ampere architecture, which has compute capability sm_8x (sm_80 for the A100). CUDA 10.1 predates Ampere, so some of the CUDA computations do not run the way they should under the cu101 build that PICK's implementation requires. You can try a different PyTorch version built against a newer CUDA (e.g., cu111).
Note that cu111 brings its own issues, specifically on the decoder side of the model. There are a few quick patches you can use to avoid those as well.
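A quick way to check for this kind of mismatch is to compare the GPU's compute capability with the architectures the installed PyTorch build was compiled for. The following is a minimal sketch, not part of PICK: `is_arch_supported` is a hypothetical helper, and it only considers precompiled `sm_XY` kernels (same major version, minor version no higher than the device's), ignoring PTX JIT fallback. `torch.cuda.get_arch_list` may be missing on older releases such as 1.5.1, so the script guards for that.

```python
import re


def is_arch_supported(capability, arch_list):
    """Return True if arch_list holds a binary kernel usable on the device.

    capability: (major, minor) tuple, e.g. (8, 0) for an A100.
    arch_list:  strings such as "sm_75" or "sm_80".
    A kernel built for sm_XY runs on a device with the same major
    version X and a minor version >= Y.
    """
    dev_major, dev_minor = capability
    for arch in arch_list:
        match = re.match(r"sm_(\d+)(\d)", arch)
        if match is None:
            continue  # skip PTX entries such as "compute_80"
        major, minor = int(match.group(1)), int(match.group(2))
        if major == dev_major and minor <= dev_minor:
            return True
    return False


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None

    if torch is not None and torch.cuda.is_available():
        capability = torch.cuda.get_device_capability(0)
        # Assumption: get_arch_list is absent on old builds, so fall back to [].
        arch_list = getattr(torch.cuda, "get_arch_list", lambda: [])()
        print("PyTorch:", torch.__version__, "CUDA:", torch.version.cuda)
        print("Device capability:", capability)
        print("Build arch list:", arch_list or "unavailable on this build")
        if arch_list:
            print("Supported:", is_arch_supported(capability, arch_list))
```

If the last line prints `Supported: False` on an A100, the installed build has no kernels for sm_80, and switching to a build against CUDA 11 or newer is the first thing to try.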
