All notable changes to the project are documented in this file.
Version numbers are of the form 1.0.0.
Any version bump in the last digit is backwards-compatible, in that a model trained with the previous version can still
be used for translation with the new version.
Any bump in the second digit indicates a backwards-incompatible change,
e.g. due to changing the architecture or simply modifying model parameter names.
Note that Sockeye has checks in place to not translate with an old model that was trained with an incompatible version.
Each version section may have subsections for: Added, Changed, Removed, Deprecated, and Fixed.
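
The compatibility rule above can be sketched as a small check. This is only an illustration, not Sockeye's actual implementation: the function names are hypothetical, and treating a change in the first digit as incompatible is an assumption, since the text only mentions the second and last digits.

```python
def parse_version(version: str) -> tuple:
    """Split a version string such as '1.16.2' into three integers."""
    first, second, last = (int(part) for part in version.split("."))
    return first, second, last


def is_model_compatible(model_version: str, code_version: str) -> bool:
    """Hypothetical check: only the last digit may differ.

    Assumption: a differing first digit is treated like a differing
    second digit, i.e. as backwards-incompatible.
    """
    return parse_version(model_version)[:2] == parse_version(code_version)[:2]


if __name__ == "__main__":
    print(is_model_compatible("1.16.2", "1.16.5"))  # last digit bump only
    print(is_model_compatible("1.15.9", "1.16.0"))  # second digit bump
```
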
- Added Tensorboard logging for all parameter values and gradients as histograms/distributions. The logged values correspond to the current batch at checkpoint time.
- Tensorboard logging is now done with the MXNet-compatible 'mxboard', which supports logging of all kinds of events (scalars, histograms, embeddings, etc.). If installed, training events are written out to Tensorboard-compatible event files automatically.
- Removed the --use-tensorboard argument from sockeye.train. Tensorboard logging is now enabled by default if mxboard is installed.
- Changed the default target vocab name in the model folder to vocab.trg.0.json.
- Changed serialization format of top-k lexica to pickle/Numpy instead of JSON.
- sockeye-lexicon now supports two subcommands: create & inspect. The former provides the same functionality as the previous CLI. The latter allows users to pass source words to the top-k lexicon to inspect the set of allowed target words.
- Added ability to choose a smaller k at decoding runtime for lexicon restriction.
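
To illustrate what choosing a smaller k at runtime means, here is a toy sketch. It assumes a lexicon stored as a plain dictionary that maps each source word to its target candidates ordered from most to least likely; the actual serialized format (pickle/NumPy) and the function name below are not taken from Sockeye.

```python
from typing import Dict, List, Set


def allowed_target_words(lexicon: Dict[str, List[str]],
                         source_tokens: List[str],
                         k: int) -> Set[str]:
    """Union of the first k target candidates of every source token.

    Assumption: each candidate list is sorted best-first, so slicing to a
    smaller k keeps only the most likely translations.
    """
    allowed: Set[str] = set()
    for token in source_tokens:
        allowed.update(lexicon.get(token, [])[:k])
    return allowed


if __name__ == "__main__":
    toy_lexicon = {"haus": ["house", "home", "building"],
                   "katze": ["cat", "kitty"]}
    print(sorted(allowed_target_words(toy_lexicon, ["haus", "katze"], k=1)))
```

A smaller k shrinks the output vocabulary considered per sentence, which trades some coverage for speed.
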
- Added a flag --strip-unknown-words to sockeye.translate to remove any <unk> symbols from the output strings.
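
The effect of stripping unknown words can be pictured with the following minimal sketch. The helper name and the whitespace tokenization are assumptions made for the example; only the <unk> symbol comes from the entry above.

```python
def strip_unknown_words(output: str, unk_symbol: str = "<unk>") -> str:
    """Drop every token equal to unk_symbol from a whitespace-tokenized string."""
    return " ".join(token for token in output.split() if token != unk_symbol)


if __name__ == "__main__":
    print(strip_unknown_words("the <unk> cat sat <unk>"))
```
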
- Added a flag --fixed-param-names to prevent certain parameters from being optimized during training. This is useful if you want to keep pre-trained embeddings fixed during training.
- Added a flag --dry-run to sockeye.train to not perform any actual training, but print statistics about the model and mode of operation.
- sockeye.evaluate can now handle multiple hypotheses files by simply specifying them after --hypotheses (file1 file2...). For each metric the mean and standard deviation will be reported across files.
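
Reporting a metric across several hypotheses files amounts to the following aggregation. This is a sketch only: the scores are made-up placeholders, and using the population standard deviation is an assumption, since the entry does not say which variant is reported.

```python
import statistics
from typing import List, Tuple


def summarize_metric(scores: List[float]) -> Tuple[float, float]:
    """Return (mean, standard deviation) of one metric over several files.

    Assumption: population standard deviation (divide by N, not N - 1).
    """
    return statistics.mean(scores), statistics.pstdev(scores)


if __name__ == "__main__":
    # One hypothetical score per hypotheses file.
    mean, std = summarize_metric([20.0, 22.0, 24.0])
    print(f"mean={mean:.2f} std={std:.2f}")
```
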
- Optionally store the beam search history to a json output using the beam_store output handler.
- Use stack operator instead of expand_dims + concat in RNN decoder. Reduces memory usage.
- Updated to MXNet 1.1.0
- Source factors, as described in Linguistic Input Features Improve Neural Machine Translation (Sennrich & Haddow, WMT 2016). Additional source factors are enabled by passing --source-factors file1 [file2 ...] (-sf), where file1, etc. are token-parallel to the source (-s). An analogous parameter, --validation-source-factors, is used to pass factors for validation data. The flag --source-factors-num-embed D1 [D2 ...] denotes the embedding dimensions and is required if source factor files are given. Factor embeddings are concatenated to the source embeddings dimension (--num-embed).

  At test time, the input sentence and its factors can be passed in via STDIN or command-line arguments.
  - For STDIN, the input and factors should be in a token-based factored format, e.g., word1|factor1|factor2|... w2|f1|f2|... ...
  - You can also use file arguments, which mirrors training: --input takes the path to a file containing the source, and --input-factors a list of files containing token-parallel factors. At test time, an exception is raised if the number of expected factors does not match the factors passed along with the input.
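
The token-based factored format for STDIN can be read along these lines. The parser below is a sketch written for this note, not Sockeye's own code: the function name is hypothetical, and raising ValueError on a factor-count mismatch merely mimics the described behavior.

```python
from typing import List, Tuple


def parse_factored_line(line: str,
                        num_factors: int,
                        delimiter: str = "|") -> Tuple[List[str], List[List[str]]]:
    """Split 'word|f1|f2 ...' tokens into a word list and one list per factor."""
    words: List[str] = []
    factors: List[List[str]] = [[] for _ in range(num_factors)]
    for item in line.split():
        parts = item.split(delimiter)
        if len(parts) != num_factors + 1:
            raise ValueError(
                f"expected {num_factors} factors, got {len(parts) - 1} in '{item}'")
        words.append(parts[0])
        for index, factor in enumerate(parts[1:]):
            factors[index].append(factor)
    return words, factors


if __name__ == "__main__":
    print(parse_factored_line("the|DT|O cat|NN|B", num_factors=2))
```

Each returned factor list is token-parallel to the word list, which matches the layout of the factor files used at training time.
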
- Removed bias parameters from multi-head attention layers of the transformer.
- Loading/Saving auxiliary parameters of the models. Previously, aux parameters were not saved or used for initialization, so the parameters of certain layers (e.g., BatchNorm) were ignored and randomly initialized. This change makes it possible to properly load, save and initialize the layers which use auxiliary parameters.
- Device locking: Only one process will be acquiring GPUs at a time. This will lead to consecutive device ids whenever possible.
- Internal change: Standardized all data to be batch-major both at training and at inference time.
- When a device lock file exists and the process has no write permissions for the lock file we assume that the device is locked. Previously this led to a permission denied exception. Please note that in this scenario we cannot detect if the original Sockeye process did not shut down gracefully. This is not an issue when the Sockeye process has write permissions on existing lock files, as in that case locking is based on file system locks, which cease to exist when a process exits.
- Changed to a custom speedometer that tracks samples/sec AND words/sec. The original MXNet speedometer did not take variable batch sizes due to word-based batching into account.
- Fixed entry points in setup.py.
- Update to MXNet 1.0.0 which adds more advanced indexing features, benefitting the beam search implementation.
- --kvstore now accepts 'nccl' value. Only works if MXNet was compiled with USE_NCCL=1.
- --gradient-compression-type and --gradient-compression-threshold flags to use gradient compression. See MXNet FAQ on Gradient Compression.
- Taking the BOS and EOS tag into account when calculating the maximum input length at inference.
- Fixed a problem with the --num-samples-per-shard flag not being parsed as int.
- New CLI sockeye.prepare_data for preprocessing the training data only once before training, potentially splitting large datasets into shards. At training time only one shard is loaded into memory at a time, limiting the maximum memory usage.
- Instead of using the --source and --target arguments, sockeye.train now accepts a --prepared-data argument pointing to the folder containing the preprocessed and sharded data. Using the raw training data is still possible and now consumes less memory.
- Optionally apply query, key and value projections to the source and target hidden vectors in the CNN model before applying the attention mechanism. CLI parameter: --cnn-project-qkv.
- A warning will be printed if the checkpoint decoder slows down training.
- Exposing the xavier random number generator through --weight-init-xavier-rand-type.
- Exposing MXNet's Nesterov Accelerated Gradient and Adadelta optimizers.
- A tool that initializes embedding weights with pretrained word representations, sockeye.init_embedding.
- Added support for Swish-1 (SiLU) activation to transformer models (Ramachandran et al. 2017: Searching for Activation Functions, Elfwing et al. 2017: Sigmoid-Weighted Linear Units for Neural Network Function Approximation in Reinforcement Learning). Use --transformer-activation-type swish1.
- Added support for GELU activation to transformer models (Hendrycks and Gimpel 2016: Bridging Nonlinearities and Stochastic Regularizers with Gaussian Error Linear Units). Use --transformer-activation-type gelu.
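
For reference, the two activations named above can be written out for scalar inputs as follows. This is a plain-Python sketch of the published definitions, not the code used in Sockeye; in particular, the exact erf form of GELU is shown here, and whether Sockeye uses it or the tanh approximation from the paper is not stated in this changelog.

```python
import math


def swish1(x: float) -> float:
    """Swish-1 / SiLU: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def gelu(x: float) -> float:
    """GELU in its exact form: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


if __name__ == "__main__":
    for value in (-2.0, 0.0, 2.0):
        print(value, swish1(value), gelu(value))
```

Both functions behave like the identity for large positive inputs and approach zero for large negative inputs, which is what distinguishes them from a hard ReLU near the origin.
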
- Fast decoding for transformer models. Caches keys and values of self-attention before softmax.
  Changed decoding flag --bucket-width to apply only to source length.
- Gradient norm clipping (--gradient-clipping-type) and monitoring.
- Changed --clip-gradient to --gradient-clipping-threshold for consistency.
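
Gradient norm clipping in general works as sketched below: all gradients are rescaled together when their joint L2 norm exceeds a threshold. The function is a generic illustration with hypothetical names; which clipping types --gradient-clipping-type accepts and how Sockeye implements them is not described in this changelog.

```python
import math
from typing import List, Tuple


def clip_by_global_norm(gradients: List[List[float]],
                        threshold: float) -> Tuple[List[List[float]], float]:
    """Rescale all gradients so their joint L2 norm is at most threshold.

    Returns the (possibly rescaled) gradients and the norm before clipping,
    which is the value one would log for monitoring.
    """
    norm = math.sqrt(sum(g * g for grad in gradients for g in grad))
    if norm <= threshold or norm == 0.0:
        return [list(grad) for grad in gradients], norm
    scale = threshold / norm
    return [[g * scale for g in grad] for grad in gradients], norm


if __name__ == "__main__":
    clipped, norm_before = clip_by_global_norm([[3.0], [4.0]], threshold=1.0)
    print(clipped, norm_before)
```
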
- Sorting sentences during decoding before splitting them into batches.
- Default chunk size: The default chunk size when batching is enabled is now batch_size * 500 during decoding to avoid users accidentally forgetting to increase the chunk size.
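
Sorting inputs by length before batching, and restoring the original order afterwards, can be sketched as follows. The helper names and the translate_batch callback are hypothetical stand-ins; the sketch only shows the bookkeeping, not Sockeye's decoder.

```python
from typing import Callable, List


def length_sorted_batches(sentences: List[str], batch_size: int) -> List[List[int]]:
    """Group sentence indices into batches after sorting by token count."""
    order = sorted(range(len(sentences)), key=lambda i: len(sentences[i].split()))
    return [order[start:start + batch_size]
            for start in range(0, len(order), batch_size)]


def translate_all(sentences: List[str],
                  batch_size: int,
                  translate_batch: Callable[[List[str]], List[str]]) -> List[str]:
    """Translate in length-sorted batches, returning outputs in input order."""
    outputs: List[str] = [""] * len(sentences)
    for batch in length_sorted_batches(sentences, batch_size):
        results = translate_batch([sentences[i] for i in batch])
        for index, result in zip(batch, results):
            outputs[index] = result
    return outputs


if __name__ == "__main__":
    def shout(batch: List[str]) -> List[str]:
        """Stand-in for a real batch translation function."""
        return [sentence.upper() for sentence in batch]

    print(translate_all(["a b c", "a", "a b"], batch_size=2, translate_batch=shout))
```

Batches of similar length need less padding, which is the usual motivation for sorting first.
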
- Downscaled fixed positional embeddings for CNN models.
- Renamed --monitor-bleu flag to --decode-and-evaluate to illustrate that it computes other metrics in addition to BLEU.
- --decode-and-evaluate-use-cpu flag to use CPU for decoding validation data.
- --decode-and-evaluate-device-id flag to use a separate GPU device for validation decoding. If not specified, the existing and still default behavior is to use the last acquired GPU for training.
- A tool that extracts specified parameters from params.x into a .npz file for downstream applications or analysis.
- Added chrF metric (Popovic 2015: chrF: character n-gram F-score for automatic MT evaluation) to Sockeye.
- sockeye.evaluate now accepts bleu and chrf as values for --metrics.
- Transformer models do not ignore --num-embed anymore as they did silently before. As a result there is an error thrown if --num-embed != --transformer-model-size.
- Fixed the attention in upper layers (--rnn-attention-in-upper-layers), which was previously not passed correctly to the decoder.
- Removed RNN parameter (un-)packing and support for FusedRNNCells (removed --use-fused-rnns flag). These were not used, not correctly initialized, and performed worse than regular RNN cells. Moreover, they made the code much more complex. RNN models trained with previous versions are no longer compatible.
- Removed the lexical biasing functionality (Arthur et al. 2016) (removed arguments --lexical-bias and --learn-lexical-bias).
- Updated to MXNet 0.12.1, which includes an important bug fix for CPU decoding.
- Removed dependency on sacrebleu pip package. Now imports directly from contrib/.
- Transformers now always use the linear output transformation after combining attention heads, even if input & output depth do not differ.
- Fixed a bug where vocabulary slice padding was defaulting to CPU context. This was affecting decoding on GPUs with very small vocabularies.
- Fixed an issue with the use of ignore in CrossEntropyMetric::cross_entropy_smoothed. This was affecting runs with Eve optimizer and label smoothing. Thanks @kobenaxie for reporting.
- Lexicon-based target vocabulary restriction for faster decoding. New CLI for top-k lexicon creation, sockeye.lexicon.
  New translate CLI argument --restrict-lexicon.
- BLEU computation based on Sacrebleu.
- Fixed yet another bug with the data iterator.
- Fixed a bug with the revised data iterator not correctly appending EOS symbols for variable-length batches. This reverts part of the commit added in 1.10.1 but is now correct again.
- Fixed a bug with max_observed_{source,target}_len being computed on the complete data set, not only on the sentences actually added to the buckets based on --max_seq_len.
- --max-num-epochs flag to train for a maximum number of passes through the training data.
- Reduced memory footprint when creating data iterators: integer sequences are streamed from disk when being assigned to buckets.
- Updated MXNet dependency to 0.12 (w/ MKL support by default).
- Changed --smoothed-cross-entropy-alpha to --label-smoothing. Label smoothing should now require significantly less memory due to its addition to MXNet's SoftmaxOutput operator.
- --weight-normalization now applies not only to convolutional weight matrices, but to output layers of all decoders. It is also independent of weight tying.
- Transformers now use --embed-dropout. Before they were using --transformer-dropout-prepost for this.
- Transformers now scale their embedding vectors before adding fixed positional embeddings. This turns out to be crucial for effective learning.
- .param files now use 5 digit identifiers to reduce risk of overflowing with many checkpoints.
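
As a rough illustration of what label smoothing (the --label-smoothing entry above) computes for a single target token, consider the sketch below. It assumes the smoothing mass alpha is spread evenly over the other classes; that choice and the function name are assumptions for the example, not a statement about the SoftmaxOutput operator's internals.

```python
import math
from typing import List


def smoothed_cross_entropy(probs: List[float], target: int, alpha: float) -> float:
    """Cross-entropy against a smoothed one-hot target distribution.

    Assumption: the true class gets weight 1 - alpha and the remaining
    alpha is divided evenly among the other len(probs) - 1 classes.
    """
    off_target = alpha / (len(probs) - 1)
    loss = 0.0
    for index, prob in enumerate(probs):
        weight = (1.0 - alpha) if index == target else off_target
        loss -= weight * math.log(prob)
    return loss


if __name__ == "__main__":
    predicted = [0.5, 0.25, 0.25]
    print(smoothed_cross_entropy(predicted, target=0, alpha=0.0))
    print(smoothed_cross_entropy(predicted, target=0, alpha=0.1))
```

With alpha set to 0 this reduces to the ordinary cross-entropy of the true class.
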
- Added CUDA 9.0 requirements file.
- --loss-normalization-type. Added a new flag to control loss normalization. New default is to normalize by the number of valid, non-PAD tokens instead of the batch size.
- --weight-init-xavier-factor-type. Added new flag to control Xavier factor type when --weight-init=xavier.
- --embed-weight-init. Added new flag for initialization of embeddings matrices.
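
The difference between the two loss normalizations mentioned above can be seen in a small example. The function, the mode labels "valid" and "batch", and the PAD id of 0 are assumptions chosen for illustration.

```python
from typing import List


def normalized_loss(token_losses: List[List[float]],
                    labels: List[List[int]],
                    mode: str = "valid",
                    pad_id: int = 0) -> float:
    """Sum per-token losses over non-PAD positions and normalize.

    mode="valid": divide by the number of non-PAD tokens.
    mode="batch": divide by the number of sentences in the batch.
    """
    total = 0.0
    num_valid = 0
    for sentence_losses, sentence_labels in zip(token_losses, labels):
        for loss, label in zip(sentence_losses, sentence_labels):
            if label != pad_id:
                total += loss
                num_valid += 1
    if mode == "valid":
        return total / max(num_valid, 1)
    if mode == "batch":
        return total / max(len(labels), 1)
    raise ValueError(f"unknown normalization mode: {mode}")


if __name__ == "__main__":
    losses = [[1.0, 2.0, 9.0], [3.0, 9.0, 9.0]]  # losses at PAD positions are ignored
    targets = [[5, 6, 0], [7, 0, 0]]
    print(normalized_loss(losses, targets, "valid"))
    print(normalized_loss(losses, targets, "batch"))
```

Normalizing by valid tokens keeps the loss scale independent of how long the sentences in a batch happen to be.
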
- --smoothed-cross-entropy-alpha argument. See above.
- --normalize-loss argument. See above.
- Batch decoding. New options for the translate CLI: --batch-size and --chunk-size. Translator.translate() now accepts and returns lists of inputs and outputs.
- Exposing the MXNet KVStore through the --kvstore argument, potentially enabling distributed training.
- Optional smart rollback of parameters and optimizer states after updating the learning rate if not improved for x checkpoints. New flags: --learning-rate-decay-param-reset, --learning-rate-decay-optimizer-states-reset.
- The RNN variational dropout mask is now independent of the input (previously any zero initial state led to the first state being canceled).
- Correctly pass self.dropout_inputs float to mx.sym.Dropout in VariationalDropoutCell.
- Instead of truncating sentences exceeding the maximum input length they are now translated in chunks.
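
Translating an over-long input in chunks, rather than truncating it, can be pictured like this. The splitting strategy (fixed-size consecutive chunks) and the translate_chunk callback are assumptions for the sketch; how Sockeye actually splits and rejoins chunks is not described here.

```python
from typing import Callable, List


def split_into_chunks(tokens: List[str], max_len: int) -> List[List[str]]:
    """Cut a token list into consecutive pieces of at most max_len tokens."""
    return [tokens[start:start + max_len] for start in range(0, len(tokens), max_len)]


def translate_long_input(tokens: List[str],
                         max_len: int,
                         translate_chunk: Callable[[List[str]], List[str]]) -> List[str]:
    """Translate each chunk separately and concatenate the outputs in order."""
    output: List[str] = []
    for chunk in split_into_chunks(tokens, max_len):
        output.extend(translate_chunk(chunk))
    return output


if __name__ == "__main__":
    def shout(chunk: List[str]) -> List[str]:
        """Stand-in for a real translation function."""
        return [token.upper() for token in chunk]

    print(translate_long_input("a b c d e".split(), max_len=2, translate_chunk=shout))
```
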
- Convolutional decoder.
- Weight normalization (for CNN only so far).
- Learned positional embeddings for the transformer.
- --attention-* CLI params renamed to --rnn-attention-*.
- --transformer-no-positional-encodings generalized to --transformer-positional-embedding-type.