-
My script for training Transformer: The training itself goes well (one epoch takes about 1.5 hours). However, when an epoch ends, the validation and testing process is very slow. For example, it takes about 30-40 minutes to finish validation (3000 samples in total). Does anybody have an idea? Perhaps something is missing, but I am not sure. I am using:
My platform info:
Replies: 9 comments
-
@ymjiang 30-40 minutes for validation definitely sounds abnormal. Are you using gluonnlp 0.5.0 when running the script? Would you mind trying the version on the master branch with the MXNet nightly build? Could you also share a bit more about the platform on which you run the experiment?
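If it helps with debugging, the exact installed versions can be printed directly; a minimal convenience sketch (not part of the training script):

```python
# Print the installed mxnet / gluonnlp versions and the number of visible GPUs,
# which is useful context when reporting issues like this one.
import mxnet as mx
import gluonnlp as nlp

print('mxnet   :', mx.__version__)
print('gluonnlp:', nlp.__version__)
print('gpus    :', mx.context.num_gpus())
```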
-
cc @eric-haibin-lin @szhengac in case they have other suggestions.
-
@szha Thanks for the quick response. I am using gluonnlp 0.5.1, cloned from the master branch (f4275c0). I will try gluonnlp 0.5.0. I have updated my platform info in the description.
-
Training throughput also seems low. What's your GPU utilization during training?
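For reference, per-GPU utilization can be watched with nvidia-smi, or sampled programmatically; a minimal sketch assuming the nvidia-ml-py (pynvml) package is installed:

```python
# Sample GPU utilization once per second for 10 seconds. Assumes the
# nvidia-ml-py package (import name: pynvml) is installed.
import time
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]
for _ in range(10):
    rates = [pynvml.nvmlDeviceGetUtilizationRates(h).gpu for h in handles]
    print('GPU utilization (%):', rates)
    time.sleep(1)
pynvml.nvmlShutdown()
```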
-
@eric-haibin-lin The utilization is about 50% for each GPU during training. What should the ideal throughput be? And by the way, I set the batch size to 7000 per GPU.
-
To achieve good performance with the Transformer, a large effective batch is typically needed, so we set num_accumulated to 16 for the pretrained model. A batch size of 7000 per GPU may be too large, since some buckets may not have enough samples to fill it, which causes load imbalance. I would therefore recommend increasing num_accumulated rather than the batch size. In the first 1-3 epochs the testing is slow, because the predictions are not yet accurate and decoding takes many steps before it emits the EOS token or reaches the maximum length. As training proceeds, you will find that testing takes less time. The final model typically takes 4-5 minutes to finish the testing.
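In case it is useful, gradient accumulation in Gluon essentially means setting grad_req to 'add' on the parameters and calling trainer.step only every num_accumulated batches. A minimal sketch of the idea (model, loss_fn and train_data are placeholders here, not the actual objects used by train_transformer.py):

```python
# Sketch of gradient accumulation with MXNet Gluon. `model`, `loss_fn` and
# `train_data` are placeholders; the real script exposes this behaviour via
# its num_accumulated option.
from mxnet import autograd, gluon

num_accumulated = 16
trainer = gluon.Trainer(model.collect_params(), 'adam',
                        {'learning_rate': 1e-3},
                        update_on_kvstore=False)
# Accumulate gradients across batches instead of overwriting them.
for p in model.collect_params().values():
    p.grad_req = 'add'

for step, (src, tgt, label) in enumerate(train_data, start=1):
    with autograd.record():
        loss = loss_fn(model(src, tgt), label)
    loss.backward()
    if step % num_accumulated == 0:
        # Normalize by the number of accumulated batches, then reset gradients.
        trainer.step(num_accumulated)
        for p in model.collect_params().values():
            p.zero_grad()
```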
-
On a single machine with 8 V100 GPUs, it is supposed to reach about 160k wps.
-
@szhengac Thank you very much for the detailed explanations. Let me try setting a larger num_accumulated.
-
Problem solved: I had hard-coded the optimizer to SGD in train_transformer.py for some reason. When I use Adam instead, the loss decreases much more quickly. At the final step, the validation only takes about 5 minutes, and the training speed reaches 130 kwps. Sorry about my mistake.
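For anyone running into the same issue, the fix is simply to construct the Trainer with the optimizer requested on the command line instead of a hard-coded one. An illustrative sketch (the model and args names are placeholders, not necessarily the exact variables in train_transformer.py):

```python
# Illustrative only: pick the optimizer from the CLI argument instead of
# hard-coding it. `model` and `args` are placeholders.
from mxnet import gluon

# Hard-coded (the mistake): always SGD, no matter what was passed on the CLI.
trainer = gluon.Trainer(model.collect_params(), 'sgd',
                        {'learning_rate': args.lr})

# Respecting the CLI choice, e.g. 'adam' with the settings from the Transformer paper.
trainer = gluon.Trainer(model.collect_params(), args.optimizer,
                        {'learning_rate': args.lr, 'beta2': 0.98, 'epsilon': 1e-9})
```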