Source code for "Accelerating Neural Transformer via an Average Attention Network"
The source code is developed upon THUMT. The THUMT version used for the experiments in our paper was downloaded on Jan 11, 2018.
We introduce two sub-layers for AAN in our ACL paper: one FFN layer (Eq. (1)) and one gating layer (Eq. (2)). However, after extensive experiments, we observe that the FFN layer is redundant and can be removed without loss of translation quality. In addition, removing the FFN layer reduces the number of model parameters and slightly improves the training speed. It also largely improves the decoding speed.
For re-implementation, we suggest that other researchers use the AAN model without the FFN sub-layer! See the use_ffn option below for how we disable this layer.
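For reference, here is a minimal TensorFlow sketch (not the exact THUMT code) of the AAN sub-layer described by Eqs. (1) and (2); it omits the residual connection, layer normalization and dropout, and the function name and the hidden_size/filter_size defaults are only illustrative.

```python
import tensorflow as tf

def average_attention(y, use_ffn=False, hidden_size=512, filter_size=2048):
    """y: decoder-side inputs of shape [batch, length, hidden_size]."""
    length = tf.shape(y)[1]
    positions = tf.cast(tf.range(1, length + 1), tf.float32)
    # cumulative average over the current and all preceding positions
    g = tf.cumsum(y, axis=1) / positions[None, :, None]
    if use_ffn:
        # optional position-wise FFN on top of the average (Eq. (1));
        # our experiments show it can be dropped without hurting quality
        h = tf.layers.dense(g, filter_size, activation=tf.nn.relu)
        g = tf.layers.dense(h, hidden_size)
    # gating layer (Eq. (2)): input and forget gates computed from [y; g]
    gates = tf.layers.dense(tf.concat([y, g], axis=-1), 2 * hidden_size)
    i, f = tf.split(tf.sigmoid(gates), 2, axis=-1)
    return i * y + f * g
```

Setting use_ffn=False corresponds to the recommended configuration and skips the two dense projections of the FFN layer.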
AAN is also available in several other NMT toolkits:
- Marian: an efficient NMT toolkit implemented in C++.
- Neutron: a PyTorch-based NMT toolkit.
- translate: a fairseq-based NMT translation toolkit.
- OpenNMT: a PyTorch-based NMT toolkit.
train.sh: provides the training script with the configuration we used.
test.sh: provides the testing script.
The directories train and test are generated on the WMT14 en-de translation task.
train/eval/log records the approximate BLEU score on the development set during training.
test/ contains the decoded development and test sets, for researchers interested in the translations generated by our model.
The processed WMT14 en-de dataset can be found at Transformer-AAN-Data. (The original files were downloaded from the Stanford NMT website.)
- Python: 2.7
- TensorFlow >= 1.4.1 (the version used for the experiments in our paper is 1.4.1)
batch_size=3125,device_list=[0],eval_steps=5000,train_steps=100000,save_checkpoint_steps=1500,shared_embedding_and_softmax_weights=true,shared_source_target_embedding=false,update_cycle=8,aan_mask=True,use_ffn=False
- train_steps: the total number of training steps; we used 100000 in most experiments.
- eval_steps: we compute the approximate BLEU score on the development set every 5000 training steps.
- shared_embedding_and_softmax_weights: we share the target-side word embedding and the target-side pre-softmax weights.
- shared_source_target_embedding: we use separate source and target vocabularies, so the source-side and target-side word embeddings are not shared.
- aan_mask:
  - This setting enables the mask-matrix multiplication for the accumulative-average computation (a sketch of both implementations follows this list).
  - Without this setting, we use the native tf.cumsum() implementation.
  - In practice, the speed of the two implementations is similar.
  - For long target sentences, we recommend the native implementation because it is more memory-efficient.
- use_ffn:
  - With this setting, the AAN model includes the FFN layer as presented in Eq. (1) of our paper.
  - Why do we add this option?
  - Because the FFN layer introduces many model parameters and significantly slows down our model.
  - Without the FFN layer, our AAN achieves very similar performance, as shown in Table 2 of our paper.
  - Furthermore, we surprisingly find that in some cases removing the FFN layer even improves AAN's performance.
- batch_size, device_list, update_cycle: these are used for parallel training. For one training step, the training procedure is as follows:
for device_i in device_list:  # runs in parallel
    for cycle_i in range(update_cycle):  # runs in sequence
        train a batch of size `batch_size`
        collect gradients and costs
update the model  # once, using the accumulated gradients
Therefore, the actual training batch size is batch_size x len(device_list) x update_cycle (e.g., 3125 x 1 x 8 = 25000 with our configuration).
- In our paper, we train the model on one GPU card, so we set device_list to [0]. For researchers with more available GPU cards, we encourage you to reduce update_cycle and extend device_list; this improves the training speed. In particular, training one model for WMT14 en-de with
batch_size=3125, device_list=[0,1,2,3,4,5,6,7], update_cycle=1
takes less than one day.
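Referring back to the aan_mask option above, here is a minimal TensorFlow 1.x sketch (not the exact THUMT code) of the two ways of computing the cumulative average; the helper name, argument names and tensor shapes are only illustrative.

```python
import tensorflow as tf

def cumulative_average(y, use_mask):
    """Average each position over itself and all preceding positions.

    y: float tensor of shape [batch, length, dim].
    """
    length = tf.shape(y)[1]
    positions = tf.cast(tf.range(1, length + 1), tf.float32)  # 1, 2, ..., n
    if use_mask:
        # aan_mask=True: lower-triangular averaging matrix, row i holds 1/(i+1)
        mask = tf.matrix_band_part(tf.ones([length, length]), -1, 0)
        mask = mask / positions[:, None]
        # out[b, i, d] = sum_j mask[i, j] * y[b, j, d]
        avg = tf.tensordot(y, mask, axes=[[1], [1]])  # [batch, dim, length]
        return tf.transpose(avg, [0, 2, 1])           # [batch, length, dim]
    else:
        # aan_mask=False: native cumulative sum divided by the position index
        return tf.cumsum(y, axis=1) / positions[None, :, None]
```

The masked variant materializes an n x n averaging matrix, which is why the native tf.cumsum() path is more memory-friendly for long target sentences.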
We have received several questions and comments from other researchers, and we would like to share some of the discussion here.
- Why can AAN accelerate the Transformer by a factor of 4~7?
  - The reported acceleration is for the Transformer without the caching strategy.
  - In theory, suppose the source and target sentences have lengths n_src and n_tgt respectively, and the model dimension is d. In one step of the Transformer decoder, the original model has a computational complexity of O(n_tgt * d^2) (self-attention) + O(n_src * d^2) (cross-attention) + O(d^2) (FFN). By contrast, AAN has a computational complexity of O(d^2) (AAN FFN + gate) + O(n_src * d^2) (cross-attention). Therefore, the theoretical acceleration is around (n_tgt + n_src) / n_src, and the longer the target sentence is, the larger the acceleration will be (a quick numerical check follows this list).
- More discussions are welcome :).
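As a quick numerical check of the per-step estimate above (simply plugging illustrative sentence lengths into the complexity terms; the numbers are examples, not measurements):

```python
# Rough per-step cost estimates from the complexity discussion above
# (constant factors are ignored, so this is only an order-of-magnitude check).
def transformer_step(n_src, n_tgt, d):
    return n_tgt * d**2 + n_src * d**2 + d**2  # self-attention + cross-attention + FFN

def aan_step(n_src, n_tgt, d):
    return d**2 + n_src * d**2                 # AAN FFN/gate + cross-attention

n_src, n_tgt, d = 25, 75, 512
print(transformer_step(n_src, n_tgt, d) / aan_step(n_src, n_tgt, d))
# -> ~3.9, close to (n_tgt + n_src) / n_src = 4; longer targets give larger speed-ups
```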
Please cite the following paper:
Biao Zhang, Deyi Xiong and Jinsong Su. Accelerating Neural Transformer via an Average Attention Network. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics.
@InProceedings{zhang-Etal:2018:ACL2018accelerating,
author = {Zhang, Biao and Xiong, Deyi and Su, Jinsong},
title = {Accelerating Neural Transformer via an Average Attention Network},
booktitle = {Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics},
month = {July},
year = {2018},
address = {Melbourne, Australia},
publisher = {Association for Computational Linguistics},
}
For any further comments or questions about AAN, please email Biao Zhang.