- Introduction
  - 1.1 Text-to-Text Transfer Transformer (T5)
  - 1.2 Multilingual T5
- Fine-tuning MT5
  - 2.1 Data preparation
  - 2.2 Encoding configuration
  - 2.3 Training results
  - 2.4 Model Testing and Discussion
- Tips to run the code
T5 is a pre-trained language model whose primary distinction is its use of a unified “text-to-text” format for all text-based NLP problems.
This approach is natural for generative tasks, where the task format requires the model to generate text conditioned on some input. It is more unusual for classification tasks, where T5 is trained to generate the literal text of the class label instead of a class index. The primary advantage of this approach is that it allows the use of a single set of hyperparameters for effective fine-tuning on any downstream task.
T5 uses a standard encoder-decoder Transformer architecture. It is pre-trained on the C4 Common Crawl dataset using a BERT-style masked-language-modeling "span-corruption" objective, in which consecutive spans of input tokens are replaced with a mask token and the model is trained to reconstruct the masked-out tokens.
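For intuition, here is a tiny hedged illustration of span corruption (the sentence and the masked spans are made up; the `<extra_id_*>` sentinel tokens follow the Hugging Face T5/mT5 convention):

```python
# Illustrative example of the span-corruption objective (not actual pre-training code).
original = "The quick brown fox jumps over the lazy dog"

# Consecutive spans are replaced by sentinel (mask) tokens in the input...
corrupted_input = "The quick <extra_id_0> jumps over <extra_id_1> dog"

# ...and the model is trained to generate only the masked-out spans.
reconstruction_target = "<extra_id_0> brown fox <extra_id_1> the lazy <extra_id_2>"
```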
The authors trained five size variants of T5: small, base, large, and models with 3 billion and 11 billion parameters.
MT5 is a multilingual variant of T5 that was pre-trained on a new Common Crawl-based mC4 dataset covering 101 languages. MT5 pre-training uses a suitable data-sampling strategy to boost lower-resource languages and to avoid overfitting or underfitting the model. Like T5, MT5 casts all tasks into the text-to-text format.
Similar to T5, the authors trained 5 different size variants of MT5: small model, base model, large model, XL, and XXL model. The increase in parameter counts compared to the corresponding T5 model variants comes from the larger vocabulary used in mT5.
MT5-small is fine-tuned here on a new task: predicting the language a given text is written in, using the XNLI dataset, which contains text in 15 languages. The XNLI 15-way parallel corpus consists of 15 tab-separated columns, each corresponding to one language as indicated by its column header. The column headers and the languages they represent are listed below:
ar: Arabic
bg: Bulgarian
de: German
el: Greek
en: English
es: Spanish
fr: French
hi: Hindi
ru: Russian
sw: Swahili
th: Thai
tr: Turkish
ur: Urdu
vi: Vietnamese
zh: Chinese (Simplified)
These column headers are used as the target text during fine-tuning. MT5 models are supported by the Hugging Face transformers package, and details about model evaluation and fine-tuning can be found in its documentation.
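As a starting point, a minimal sketch (not necessarily the exact code used in this project) of loading mT5-small and its tokenizer from the Hugging Face Hub:

```python
# Load the pre-trained mT5-small checkpoint and its tokenizer.
from transformers import AutoTokenizer, MT5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")
```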
The XNLI dataset is cleaned and then prepared as a two-column data frame with the column headers 'input_text' and 'target_text'. Since MT5 is a text-to-text model, a prefix is added to the original input sequence to specify which task the model should perform; such prefixes mainly help when a model is fine-tuned on multiple downstream tasks, e.g., machine translation between many languages. The prefix <idf.lang> is added as a special token to the tokenizer. As stated in the documentation, if the new number of tokens differs from model.config.vocab_size, the model's input token embedding matrix must be resized (a code sketch of this preparation follows the table). A few of the prepared training samples are shown below:
| input_text | target_text |
|---|---|
| <idf.lang> सांप नदी सांपों से भरा है। | hi |
| <idf.lang> Anaokulu öğrencilerinin taklit yapma konusunda o kadar fazla yardıma ihtiyaçları yok. | tr |
| <idf.lang> Важно показать пределы данных, или люди сделают плохие выводы, которые уничтожат исследование. | ru |
| <idf.lang> Музеят е в близост до египетския музей. | bg |
| <idf.lang> O Mungu kwa sababu jina jina tu nimelisahau lakini ni Amani ya Bunge | sw |
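A hedged sketch of this preparation step; the file name 'xnli15.tsv' and the dataset directory come from the repository notes, while the pandas calls themselves are illustrative. `tokenizer` and `model` are assumed from the loading sketch above.

```python
import pandas as pd

# Read the 15-way parallel corpus (one column per language).
df = pd.read_csv("dataset/xnli15.tsv", sep="\t")

# Melt it so each row becomes one (language code, sentence) pair.
df = df.melt(var_name="target_text", value_name="input_text").dropna()

# Prepend the task prefix to every input sentence.
df["input_text"] = "<idf.lang> " + df["input_text"]

# Register the prefix as a special token and resize the embedding matrix if needed.
tokenizer.add_special_tokens({"additional_special_tokens": ["<idf.lang>"]})
if len(tokenizer) != model.config.vocab_size:
    model.resize_token_embeddings(len(tokenizer))
```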
T5 paper (source): "There are some extra parameters in the decoder due to the encoder-decoder attention and there are also some computational costs in the attention layers that are quadratic in the sequence lengths."
Since the input and target token-id lengths are task-specific, the distribution of token-id lengths in the dataset needs to be analyzed first. Sequence lengths should then be chosen that cover most of the data without requiring high computational power (see the sketch after the list below).
- The maximum input sequence length is set to 40.
- The maximum target sequence length is set to 3.
- truncation=True truncates the sequence to a maximum length specified by the max_length argument.
- padding='max_length' pads the sequence to a length specified by the max_length argument.
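A hedged sketch of the length analysis and the encoding step with these settings; `df` and `tokenizer` are assumed from the earlier sketches.

```python
# Inspect the distribution of token-id lengths to pick sensible maximum lengths.
input_lens = df["input_text"].apply(lambda s: len(tokenizer(s).input_ids))
target_lens = df["target_text"].apply(lambda s: len(tokenizer(s).input_ids))
print(input_lens.describe())
print(target_lens.describe())

# Encode inputs and targets with the chosen maximum lengths.
inputs = tokenizer(
    df["input_text"].tolist(),
    max_length=40, truncation=True, padding="max_length", return_tensors="pt",
)
targets = tokenizer(
    df["target_text"].tolist(),
    max_length=3, truncation=True, padding="max_length", return_tensors="pt",
)
```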
The optimizer used is AdamW with a learning rate of 5e-4. The learning-rate scheduler is a linear schedule with warmup: during the warmup period the learning rate increases linearly from 0 to the initial value set in the optimizer, after which it decreases linearly back to 0. Warmup is a way to reduce the primacy effect of the early training examples. The training and validation losses computed during fine-tuning are plotted in the graph below.
The model.forward() function performs the forward pass on a batch of input and target token ids. The forward pass automatically creates the correct decoder_input_ids required for training from the labels. The computed loss is backpropagated to update the model weights at each training step. During training, the model is also validated and saved at regular intervals. A sketch of one training step is given below.
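A hedged sketch of the optimization setup and a single training step; the warmup and total step counts are placeholders rather than the actual values used, and `inputs`/`targets` are assumed from the encoding sketch above.

```python
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

optimizer = AdamW(model.parameters(), lr=5e-4)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=500, num_training_steps=10_000
)

# Replace padding ids in the labels so they are ignored by the loss (common practice).
labels = targets["input_ids"].clone()
labels[labels == tokenizer.pad_token_id] = -100

# Forward pass: passing labels makes the model build decoder_input_ids internally
# and return the cross-entropy loss.
outputs = model(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    labels=labels,
)
outputs.loss.backward()
optimizer.step()
scheduler.step()
optimizer.zero_grad()
```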
Model Test Accuracy: 99.49%
At inference time, the model.generate() function is used, which auto-regressively generates the decoder output given the input token ids. The tokenizer is then used to decode the output ids back into text. The model is tested on 10,000 examples, out of which only 51 are predicted incorrectly.
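A minimal sketch of this inference step, assuming the `model` and `tokenizer` objects from the earlier sketches (the example sentence is hypothetical, not from the dataset):

```python
import torch

model.eval()
enc = tokenizer("<idf.lang> Das ist ein Beispielsatz.", return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**enc, max_length=3)

# Decode the generated ids back into a language code.
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))  # expected: "de"
```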
To understand the errors better, let's take a closer look at the wrong predictions; all of them are listed in the table below. Almost 40% of the wrong predictions are either Hindi sentences predicted as Urdu or vice versa. In day-to-day colloquial conversation, it is very common to write Hindi text messages using English (Latin) letters instead of the original Hindi script, and the same goes for Urdu. The dataset contains both versions of examples in Hindi and Urdu. All the wrongly predicted sentences marked with red squares are Hindi or Urdu sentences written using English letters. But why?
- Both Hindi and Urdu originally developed from Khari Boli, a dialect of the Delhi region, and the spoken languages are extremely similar to one another. They have the same grammatical structure, and at the beginner level they share 70-80% of their vocabulary. If you heard people speaking in India, you would not know whether it was Hindi or Urdu. Although spoken Urdu and Hindi are very similar, the written languages are quite different, and it is their separate scripts and literary traditions that have largely contributed to their status as separate languages rather than dialects. Hindi developed from Sanskrit and is written left-to-right, while Urdu is written right-to-left in a script derived from Persian and Arabic.
- So when Hindi/Urdu is written using English letters, the model may find it difficult to differentiate between them since they sound very similar; hence they contribute to almost 40% of the wrong predictions.
Almost 20% of the wrong predictions are either Bulgarian sentences predicted as Russian or vice versa. All these wrongly predicted sentences marked with blue squares are Bulgarian or Russian sentences written using their original script. Despite using their original script, what might be the reason for the wrong predictions?
- Both Bulgarian and Russian belong to the Slavic language family. The most obvious common feature is that they both use the Cyrillic alphabet. However, each language has adapted it to its own sound system, and the two differ in grammar. Still, the scripts are very similar, with only minor differences. This similarity in the written text between the two languages might be the reason for this 20% of wrong predictions.
The rows marked with double blue squares contribute to almost 10% of the wrong predictions. The following are the similarities between them:
- The script of these texts is very similar to the English language, and the length of the sequences is short.
- All those texts contain the name of a person or place, whose short description is given below based on the context of the sentences.
- Brock (an American liberal political consultant), Cambridge (a city in eastern England), Eugene Debs (an American socialist), James Cook (a British explorer), Wolverstone (an English name).
- All these names are somehow related to the English language.
- The language of all these sentences is wrongly predicted as English.
A few rows (10, 15, 16, 17, 21, 25, 30, 34, 41) in the above table contain a few English words, but they are still not wrongly predicted as English. In all of those rows, however, either the sentence is not short or the script/vocabulary of the input sentence is not similar to English.
A few words in a sentence that are strongly tied to a particular language and cannot be translated into other languages (e.g., the name of a person or place) might influence the model output, especially if the input sentence is short and/or the script and vocabulary of the language from which those words originate are similar to those of the input text.
The model prediction 'zhur' does not correspond to any of the languages in the XNLI dataset. Alternatively, the prediction can be viewed as a combination of two language codes (Chinese 'zh' and Urdu 'ur') for the given input text. What might be the reason for such an output?
- Unlike encoder-based classifiers, there is no hard constraint that forces a well-formed prediction (e.g., an exact class label as the predicted text), since the decoder part of the mT5 model is auto-regressive. Hence this kind of output can occur in text-to-text generative models.
- This entire project was run in Google Colaboratory. To open the .ipynb file directly in Google Colab, click .
- Before running the code in Google Colab, change the runtime type to GPU: click the "Runtime" dropdown menu, select "Change runtime type", and then select GPU in the "Hardware accelerator" dropdown menu.
- Download the XNLI-15way dataset here directly, or, after cloning the repository, find the file 'xnli15.tsv' in the dataset directory. Make sure the file path in the code is correct before running the script.
- If you are running the code on a local machine, there is no need to mount Google Drive, and the training results can be saved locally by modifying the relevant directories in the code. Since many packages are pre-installed in Google Colab, only the required additional packages (klib, sentencepiece, and transformers) are installed separately. If you are running the code on a local machine, make sure that all the loaded packages are installed in the working environment before running the script.
- If you want to load the fine-tuned model, then the model checkpoints saved during fine-tuning can be downloaded from here.