This repository contains the code and data processing for fine-tuning VinaLLaMA-7B and VinaLLaMA-7B-chat on the machine translation task, as described in the paper "VinaLLaMA-7B: A Large-Scale Vietnamese-English Machine Translation Model" by Hieu Pham, Dat Quoc Nguyen, Thi Ngoc Diep Do, Minh Nguyen, and Son N. Tran.
The model is fine-tuned on three data sources: teencode and slang data from the UIT-VSMEC social media corpus (translated to English using GPT-4), synthetic data (generated using GPT-4), and the parallel dataset mt_eng_vietnamese (HuggingFace).
The instruction prompts used for fine-tuning are MTInstruct, AlignInstruct, HintInstruct, and ReviseInstruct, taken from the paper "Tuning LLMs with Contrastive Alignment Instructions for Machine Translation in Unseen, Low-resource Languages" by Zhuoyuan Mao and Yen Yu.
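
To make the instruction formats more concrete, the following is a minimal sketch of how a parallel sentence pair might be turned into an MTInstruct-style record (translate a sentence) and an AlignInstruct-style record (judge whether a word pair is aligned). The prompt wording, the record fields (`instruction`, `input`, `output`), and the names `TranslationPair`, `build_mt_record`, and `build_align_record` are assumptions made for illustration; they are not taken from this repository or from the paper's exact templates.

```python
from dataclasses import dataclass


@dataclass
class TranslationPair:
    """One parallel example (hypothetical container, not the repo's schema)."""
    source_lang: str
    target_lang: str
    source_text: str
    target_text: str


def build_mt_record(pair: TranslationPair) -> dict:
    """MTInstruct-style record: ask for a translation of the source text.

    The instruction wording is an illustrative assumption.
    """
    instruction = (
        f"Translate the following text from {pair.source_lang} "
        f"to {pair.target_lang}."
    )
    return {
        "instruction": instruction,
        "input": pair.source_text,
        "output": pair.target_text,
    }


def build_align_record(pair: TranslationPair, source_word: str,
                       target_word: str, is_aligned: bool) -> dict:
    """AlignInstruct-style record: ask whether two words are aligned.

    The word pair and its label must come from an external word aligner;
    this function only formats them. Wording is an illustrative assumption.
    """
    instruction = (
        f"Given the {pair.source_lang} and {pair.target_lang} sentences, "
        f"is '{source_word}' aligned with '{target_word}'? "
        "Answer True or False."
    )
    return {
        "instruction": instruction,
        "input": f"{pair.source_text}\n{pair.target_text}",
        "output": "True" if is_aligned else "False",
    }


example = TranslationPair(
    source_lang="Vietnamese",
    target_lang="English",
    source_text="Tôi thích học máy.",
    target_text="I like machine learning.",
)
mt_record = build_mt_record(example)
align_record = build_align_record(example, "thích", "like", True)
```

Keeping every instruction type in the same three-field record shape lets the different prompt types be shuffled into a single fine-tuning set; whether this repository mixes them that way is not stated here.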