Train RoBERTa Ukrainian Model from Scratch (language: en | ua)

Credits to https://github.com/youscan/language-models for their documentation and their roberta-ukrainian model.
This repository serves as a complete, working example of training, from downloading the data to testing the final model. The goal is to provide a tutorial with instructions in English (and Ukrainian) that locks the Python dependencies and can be run on GNU/Linux and on any system where Docker is available.

Requirements

A CUDA-capable GPU, unless training on the CPU
Use pipenv with Python 3.8: pipenv --python 3.8 (see the pipenv guide)
Install Python 3.8 (on Debian-based systems, for example, through apt; on Windows, through the 64-bit installer). python3.8 must be first on the $PATH; if needed, set alias python3.8="python".
python3.8 -m pip install pipenv && python3.8 -m pipenv install
Install CUDA if training on a GPU. A quick check that PyTorch can see the GPU is sketched below.
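
If CUDA is installed, a quick sanity check like the one below (illustrative, not part of run.sh) confirms that PyTorch can see the GPU before a long training run:

```python
# Illustrative check (not part of run.sh): verify that PyTorch detects a CUDA GPU.
import torch

if torch.cuda.is_available():
    print(f"CUDA available: {torch.cuda.get_device_name(0)}")
else:
    print("No CUDA device found; training will run on the CPU.")
```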

Main Process

Tested on Ubuntu 20.04.3 LTS with Python 3.8.10.
./run.sh uses the tiny wiki dataset, trains a tokenizer, trains a RoBERTa language model, and tests the model on a fill-mask example. To use the complete wiki dataset, delete the folder ./text, un-comment lines 13-15 in run.sh, and run ./run.sh again.
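
For orientation, the tokenizer-training step that run.sh drives looks roughly like the sketch below; the glob pattern, vocabulary size, and output directory are illustrative assumptions rather than the script's exact settings.

```python
# Rough sketch of training a byte-level BPE tokenizer on the extracted wiki text.
# Paths and vocab_size are illustrative; run.sh performs the real step with its own settings.
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer

files = [str(p) for p in Path("./text").glob("**/wiki_*")]  # extracted wiki files

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=files,
    vocab_size=52_000,  # assumed value
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model("./models/robertua-v1")  # writes vocab.json and merges.txt
```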

Windows 10 64-bit and Python 3.8.10

In a git-bash shell, execute ./run.sh. The full wiki dataset can be downloaded with the commands from run.sh if wget for Windows is installed.

Nvidia docker

Install Docker with NVIDIA GPU support, then run docker build -t robertua-train . && docker run --gpus all robertua-train. Remove --gpus all if using the CPU.

Training Script

run_language_modeling.py is adapted from a 2021 version of transformers' language modeling example
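
The script follows the standard masked-language-modeling recipe from transformers. The condensed sketch below shows the main ingredients; the paths, block size, and hyperparameters are illustrative and not the script's actual arguments.

```python
# Condensed sketch of masked-language-model training with transformers.
# Paths and hyperparameters are illustrative, not the values used by run.sh.
from transformers import (
    DataCollatorForLanguageModeling,
    LineByLineTextDataset,
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)

# Assumes the tokenizer from the previous step was saved to this directory.
tokenizer = RobertaTokenizerFast.from_pretrained("./models/robertua-v1", model_max_length=512)

config = RobertaConfig(vocab_size=tokenizer.vocab_size, max_position_embeddings=514)
model = RobertaForMaskedLM(config=config)

dataset = LineByLineTextDataset(
    tokenizer=tokenizer, file_path="./train.txt", block_size=128  # illustrative corpus file
)
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15  # mask 15% of tokens
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="./models/robertua-v1",
        num_train_epochs=1,
        per_device_train_batch_size=4,
    ),
    data_collator=collator,
    train_dataset=dataset,
)
trainer.train()
trainer.save_model("./models/robertua-v1")
```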

Training Time

Training duration depends on the hardware used and the dataset size. On a GP104 GeForce GTX 1070 8 GB, training was estimated to complete in 30 hours for the ukwiki-latest-pages-articles dataset.
For the tiny dataset included in this repository (text/AF/{wiki_00,wiki_03,wiki_04,wiki_06,wiki_08}) with a batch_size of 4, training took ~3 minutes on the GPU and ~33 minutes on an i7-10710U CPU.

| Device | Dataset | Training Time |
| --- | --- | --- |
| GeForce GTX 1070 8 GB GPU | full wiki | 30 hours |
| GeForce GTX 1070 8 GB GPU | tiny wiki | 3 minutes |
| i7-10710U CPU | tiny wiki | 33 minutes |

Final Model

The trained model, sized at 487 MB, is written to ./models/robertua-v1/.

Example training output:

[INFO|trainer.py:1168] 2021-09-09 18:18:15,218 >> ***** Running training *****
[INFO|trainer.py:1169] 2021-09-09 18:18:15,218 >>   Num examples = 1247
[INFO|trainer.py:1170] 2021-09-09 18:18:15,218 >>   Num Epochs = 1
[INFO|trainer.py:1171] 2021-09-09 18:18:15,218 >>   Instantaneous batch size per device = 8
[INFO|trainer.py:1172] 2021-09-09 18:18:15,218 >>   Total train batch size (w. parallel, distributed & accumulation) = 4
[INFO|trainer.py:1173] 2021-09-09 18:18:15,218 >>   Gradient Accumulation steps = 1
[INFO|trainer.py:1174] 2021-09-09 18:18:15,218 >>   Total optimization steps = 312
100%|██████████████████████████████████████████████████████████████████████████████████| 312/312 [02:40<00:00,  2.06it/s][INFO|trainer.py:1366] 2021-09-09 18:20:55,950 >> 

Training completed. Do not forget to share your model on huggingface.co/models =)


{'train_runtime': 160.7356, 'train_samples_per_second': 7.758, 'train_steps_per_second': 1.941, 'train_loss': 10.262784517728365, 'epoch': 1.0}
100%|██████████████████████████████████████████████████████████████████████████████████| 312/312 [02:40<00:00,  1.94it/s]
[INFO|trainer.py:1935] 2021-09-09 18:20:55,954 >> Saving model checkpoint to ./models/robertua-v1
[INFO|configuration_utils.py:391] 2021-09-09 18:20:55,955 >> Configuration saved in ./models/robertua-v1/config.json
[INFO|modeling_utils.py:1001] 2021-09-09 18:20:56,417 >> Model weights saved in ./models/robertua-v1/pytorch_model.bin
[INFO|tokenization_utils_base.py:2020] 2021-09-09 18:20:56,417 >> tokenizer config file saved in ./models/robertua-v1/tokenizer_config.json
[INFO|tokenization_utils_base.py:2026] 2021-09-09 18:20:56,418 >> Special tokens file saved in ./models/robertua-v1/special_tokens_map.json
{'sequence': 'вони їдуть до..', 'score': 0.008971989154815674, 'token': 18, 'token_str': '.'}
{'sequence': 'вони їдуть до\n.', 'score': 0.002718620002269745, 'token': 203, 'token_str': '\n'}
{'sequence': 'вони їдуть до,.', 'score': 0.0021304022520780563, 'token': 16, 'token_str': ','}
{'sequence': 'вони їдуть донко.', 'score': 0.0002438406809233129, 'token': 11254, 'token_str': 'нко'}
{'sequence': 'вони їдуть до вірогід.', 'score': 0.0001972682512132451, 'token': 42912, 'token_str': ' вірогід'}
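
Predictions like these can be reproduced by loading the saved checkpoint into a fill-mask pipeline; a minimal sketch, using the example sentence from the output above:

```python
# Minimal sketch: query the trained checkpoint with a fill-mask pipeline.
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="./models/robertua-v1",
    tokenizer="./models/robertua-v1",
)

# "вони їдуть до <mask>." -- "they are going to <mask>."
for prediction in fill_mask("вони їдуть до <mask>."):
    print(prediction)
```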

Model Extension

Use a new dataset to extend the ukr-roberta-base model.
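
One way to do this is to initialize from the published checkpoint instead of a fresh configuration and then continue masked-language-model training on the new data. A minimal sketch, assuming the checkpoint is available on the Hugging Face Hub as youscan/ukr-roberta-base:

```python
# Sketch: start from the published checkpoint instead of a fresh RobertaConfig,
# then reuse the same Trainer / data-collator setup as above on the new dataset.
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("youscan/ukr-roberta-base")  # assumed Hub id
model = AutoModelForMaskedLM.from_pretrained("youscan/ukr-roberta-base")
# ...continue masked-language-model training on the new dataset from here.
```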

Resources