-
My script for training Transformer: The training itself goes well (one epoch takes about 1.5 hours). However, when an epoch ends, the validation and testing process is very slow. For example, it takes about 30-40 minutes to finish validation (3000 samples in total). Does anybody have an idea? Perhaps something is missing, but I am not sure. I am using:
My platform info:
Replies: 9 comments
-
@ymjiang 30-40 minutes for validation definitely sounds abnormal. Are you using gluonnlp 0.5.0 when running the script? Would you mind trying the version on the master branch with the MXNet nightly build? Could you also share a bit more about the platform on which you run the experiment?
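If it helps with debugging, the exact installed versions can be printed directly; a minimal convenience sketch (not part of the training script):

```python
# Print the installed mxnet / gluonnlp versions and the number of visible GPUs,
# which is useful context when reporting issues like this one.
import mxnet as mx
import gluonnlp as nlp

print('mxnet   :', mx.__version__)
print('gluonnlp:', nlp.__version__)
print('gpus    :', mx.context.num_gpus())
```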
-
cc @eric-haibin-lin @szhengac in case they have other suggestions.
-
@szha Thanks for the quick response. I am using gluonnlp 0.5.1, cloned from the master branch (f4275c0). I will try gluonnlp 0.5.0. I have updated my platform info in the description.
-
Training throughput also seems low. What's your GPU utilization during training?
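For reference, per-GPU utilization can be watched with nvidia-smi, or sampled programmatically; a minimal sketch assuming the nvidia-ml-py (pynvml) package is installed:

```python
# Sample GPU utilization once per second for 10 seconds. Assumes the
# nvidia-ml-py package (import name: pynvml) is installed.
import time
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]
for _ in range(10):
    rates = [pynvml.nvmlDeviceGetUtilizationRates(h).gpu for h in handles]
    print('GPU utilization (%):', rates)
    time.sleep(1)
pynvml.nvmlShutdown()
```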
-
@eric-haibin-lin The utilization is about 50% for each GPU during training. What should the ideal throughput be? And by the way, I set the batch size to 7000 per GPU.
-
To achieve good performance with the Transformer, a large effective batch is typically needed, so we set num_accumulated to 16 for the pretrained model. A batch size of 7000 per GPU may be too large, since some buckets may not have enough samples to fill it, which causes load imbalance. I would therefore recommend increasing num_accumulated rather than the batch size. In the first 1-3 epochs the testing is slow, because the predictions are not yet accurate and decoding takes many steps before it emits the EOS token or reaches the maximum length. As training proceeds, you will find that testing takes less time. The final model typically takes 4-5 minutes to finish the testing.
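In case it is useful, gradient accumulation in Gluon essentially means setting grad_req to 'add' on the parameters and calling trainer.step only every num_accumulated batches. A minimal sketch of the idea (model, loss_fn and train_data are placeholders here, not the actual objects used by train_transformer.py):

```python
# Sketch of gradient accumulation with MXNet Gluon. `model`, `loss_fn` and
# `train_data` are placeholders; the real script exposes this behaviour via
# its num_accumulated option.
from mxnet import autograd, gluon

num_accumulated = 16
trainer = gluon.Trainer(model.collect_params(), 'adam',
                        {'learning_rate': 1e-3},
                        update_on_kvstore=False)
# Accumulate gradients across batches instead of overwriting them.
for p in model.collect_params().values():
    p.grad_req = 'add'

for step, (src, tgt, label) in enumerate(train_data, start=1):
    with autograd.record():
        loss = loss_fn(model(src, tgt), label)
    loss.backward()
    if step % num_accumulated == 0:
        # Normalize by the number of accumulated batches, then reset gradients.
        trainer.step(num_accumulated)
        for p in model.collect_params().values():
            p.zero_grad()
```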
-
On a single machine with 8 V100 GPUs, it is supposed to reach about 160k wps.
-
@szhengac Thank you very much for the detailed explanations. Let me try setting a larger num_accumulated.
-
Problem solved: I had hard-coded the optimizer to SGD in train_transformer.py for some reason. When I use Adam instead, the loss decreases much more quickly. At the final step, the validation only takes about 5 minutes, and the training speed reaches 130 kwps. Sorry about my mistake.
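For anyone running into the same issue, the fix is simply to construct the Trainer with the optimizer requested on the command line instead of a hard-coded one. An illustrative sketch (the model and args names are placeholders, not necessarily the exact variables in train_transformer.py):

```python
# Illustrative only: pick the optimizer from the CLI argument instead of
# hard-coding it. `model` and `args` are placeholders.
from mxnet import gluon

# Hard-coded (the mistake): always SGD, no matter what was passed on the CLI.
trainer = gluon.Trainer(model.collect_params(), 'sgd',
                        {'learning_rate': args.lr})

# Respecting the CLI choice, e.g. 'adam' with the settings from the Transformer paper.
trainer = gluon.Trainer(model.collect_params(), args.optimizer,
                        {'learning_rate': args.lr, 'beta2': 0.98, 'epsilon': 1e-9})
```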