- Script to pre-train a huggingface transformers BART model
- Trains BART as described in "BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension"
- Text infilling and Sentence Permutation functions are available now
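These two transforms follow the BART paper: text infilling replaces contiguous token spans with a single mask token, and sentence permutation shuffles the sentences of a document. The sketch below is only a conceptual illustration of the two noising steps, not this repository's implementation; the function names, the use of numpy, and the span-length handling are assumptions.

```python
import numpy as np

MASK = "[MASK]"

def text_infilling(tokens, masking_rate=0.3, poisson_lambda=3.0, rng=np.random.default_rng()):
    """Replace random token spans with a single MASK until ~masking_rate of tokens are masked."""
    tokens = list(tokens)
    num_to_mask = int(len(tokens) * masking_rate)
    masked = 0
    while masked < num_to_mask and len(tokens) > 1:
        span = max(1, int(rng.poisson(poisson_lambda)))
        span = min(span, num_to_mask - masked, len(tokens) - 1)
        start = int(rng.integers(0, len(tokens) - span + 1))
        tokens[start:start + span] = [MASK]   # the whole span collapses to one mask token
        masked += span
    return tokens

def sentence_permutation(sentences, rng=np.random.default_rng()):
    """Shuffle the order of sentences within a document."""
    order = rng.permutation(len(sentences))
    return [sentences[i] for i in order]

print(text_infilling("the quick brown fox jumps over the lazy dog".split()))
print(sentence_permutation(["I went home.", "It was raining.", "My shoes got wet."]))
```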
You can train a huggingface transformers model simply, as in the example below. (The example below works as-is using the bundled sample data.)
$ CUDA_VISIBLE_DEVICES=1 python -m scripts.train \
--model-config-path configs/base.json \
--train-dataset-path tests/data/sample1.txt \
--dev-dataset-path tests/data/sample1.txt \
--sp-model-path sp_model/sp_model_unigram_8K.model \
--device GPU \
--auto-encoding \
--batch-size 2 \
--steps-per-epoch 100 \
--mask-token "[MASK]" \
--mixed-precision
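The example above passes --sp-model-path together with --mask-token "[MASK]". Before launching a long run, it can be worth checking that the mask token actually exists in the sentencepiece vocabulary (otherwise --mask-token-id is the alternative). The snippet below only uses the standard sentencepiece API; whether the bundled sample model contains a "[MASK]" piece is not guaranteed.

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="sp_model/sp_model_unigram_8K.model")

print("vocab size :", sp.get_piece_size())
print("unk id     :", sp.unk_id())
# piece_to_id returns the unk id when the piece is not in the vocabulary
print("[MASK] id  :", sp.piece_to_id("[MASK]"))
print("sample ids :", sp.encode("hello world", out_type=int))
```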
File Paths:
--model-config-path MODEL_CONFIG_PATH
model config file
--train-dataset-path TRAIN_DATASET_PATH
training dataset, a text file or multiple files ex)
*.txt
--dev-dataset-path DEV_DATASET_PATH
dev dataset, a text file or multiple files ex) *.txt
--pretrained-checkpoint PRETRAINED_CHECKPOINT
pretrained checkpoint path
--output-path OUTPUT_PATH
output directory to save log and model checkpoints
--sp-model-path SP_MODEL_PATH
sentencepiece model path for the tokenizer
Training Parameters:
--mask-token MASK_TOKEN
mask token ex) [MASK]
--mask-token-id MASK_TOKEN_ID
mask token id of vocab
--epochs EPOCHS
--steps-per-epoch STEPS_PER_EPOCH
--learning-rate LEARNING_RATE
--min-learning-rate MIN_LEARNING_RATE
--warmup-steps WARMUP_STEPS
--warmup-rate WARMUP_RATE
--batch-size BATCH_SIZE
total training batch size across all devices
--dev-batch-size DEV_BATCH_SIZE
--num-total-dataset NUM_TOTAL_DATASET
--shuffle-buffer-size SHUFFLE_BUFFER_SIZE
--prefetch-buffer-size PREFETCH_BUFFER_SIZE
--max-sequence-length MAX_SEQUENCE_LENGTH
--weight-decay WEIGHT_DECAY
use weight decay
--clipnorm CLIPNORM clips gradients to a maximum norm.
--disable-text-infilling
disable text infilling input noising
--disable-sentence-permutation
disable sentence permutation input noising
--masking-rate MASKING_RATE
text infilling masking rate
--permutation-segment-token-id PERMUTATION_SEGMENT_TOKEN_ID
segment token id for sentence permutation
Other settings:
--tensorboard-update-freq TENSORBOARD_UPDATE_FREQ
log losses and metrics every this many steps
--mixed-precision Use mixed precision FP16
--auto-encoding train by auto-encoding with a line-by-line text dataset
--use-tfrecord train using tfrecord dataset
--repeat-each-file repeat each dataset file and sample uniformly for
training examples
--debug-nan-loss when training with this flag, print the number of NaN losses
(not supported on TPU)
--seed SEED random seed
--skip-epochs SKIP_EPOCHS
skip this number of epochs
--device {CPU,GPU,TPU}
device to train model
--max-over-sequence-policy {filter,slice}
Policy for sequences whose length exceeds the maximum
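Conceptually, filter drops over-length examples while slice truncates them to --max-sequence-length. The tf.data sketch below illustrates the two behaviors on a toy dataset of token-id tensors; it is not the script's actual pipeline.

```python
import tensorflow as tf

max_len = 8
# Toy dataset of variable-length token-id sequences (lengths 4, 12, 6, 20).
dataset = tf.data.Dataset.from_generator(
    lambda: ([1] * n for n in (4, 12, 6, 20)),
    output_signature=tf.TensorSpec(shape=(None,), dtype=tf.int32),
)

# "filter": drop every sequence longer than max_len.
filtered = dataset.filter(lambda ids: tf.size(ids) <= max_len)

# "slice": keep every sequence but truncate it to max_len.
sliced = dataset.map(lambda ids: ids[:max_len])

print([int(tf.size(x)) for x in filtered])  # [4, 6]
print([int(tf.size(x)) for x in sliced])    # [4, 8, 6, 8]
```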
- `model-config-path` is the huggingface BART model config file path (see the sketch below).
- `pretrained-checkpoint` is a trained model checkpoint path.
- `sp-model-path` is the sentencepiece tokenizer model path.
- With the `repeat-each-file` flag, you can repeat each dataset file forever, even if one of the datasets runs out.
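The config file referenced by `model-config-path` can be loaded directly with huggingface transformers as a quick sanity check. The snippet below is a sketch that assumes configs/base.json is a standard BartConfig JSON and that the TensorFlow model classes are used; the printed attributes are standard BartConfig fields.

```python
from transformers import BartConfig, TFBartForConditionalGeneration

config = BartConfig.from_json_file("configs/base.json")
print("vocab size    :", config.vocab_size)
print("encoder layers:", config.encoder_layers)

# Build an untrained model from the config (weights are randomly initialized).
model = TFBartForConditionalGeneration(config)
model(model.dummy_inputs)                # run once so the Keras weights are created
print("parameters    :", model.count_params())
```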