Nepali RoBERTa Model

This repository contains code to train a Byte-Level Byte Pair Encoding (BPE) tokenizer for Nepali text using the Hugging Face tokenizers library. Additionally, it provides code to train a RoBERTa model for masked language modeling tasks using the tokenizer.
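To make the term "Byte-Level BPE" concrete, here is a minimal pure-Python sketch of the idea, not the tokenizers library's implementation. Text is first split into its UTF-8 bytes (each Devanagari code point takes three bytes), and the most frequent adjacent pair is then merged into a new symbol. The function names are hypothetical, and a real trainer counts pairs over a whole corpus and repeats the merge until the target vocabulary size is reached.

```python
from collections import Counter


def to_byte_symbols(text):
    """Split text into single UTF-8 bytes, the base alphabet of byte-level BPE."""
    return [bytes([b]) for b in text.encode("utf-8")]


def most_frequent_pair(symbols):
    """Return the most frequent adjacent pair of symbols, or None if there is none."""
    pairs = Counter(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(symbols, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged
```

Because the base alphabet is the 256 possible byte values, any Nepali string can be represented without an unknown token, and joining the symbols back together always recovers the original bytes.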

Requirements

Python 3.x
transformers library
tokenizers library
torch library
datasets library

Usage

Clone the Repository

git clone https://github.com/bkhanal-11/nepali-roberta
cd nepali-roberta

Download Dataset

The dataset is available on Google Drive at this link. The original dataset has been split into smaller chunks for easier storage.

Install Dependencies

pip3 install -r requirements.txt

Training the Tokenizer

Move the dataset into the nepali-text directory. Adjust the parameters in train_tokenizer.py as needed, then run the training script:

python3 train_tokenizer.py
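For orientation, the following is a rough sketch of what training a byte-level BPE tokenizer with the Hugging Face tokenizers library can look like. It is not necessarily what train_tokenizer.py does. The function name, the .txt file extension, the special tokens, and the default vocab_size and min_frequency values are all assumptions made for illustration.

```python
from pathlib import Path

from tokenizers import ByteLevelBPETokenizer

# Special tokens conventionally used by RoBERTa models (an assumption here).
SPECIAL_TOKENS = ["<s>", "<pad>", "</s>", "<unk>", "<mask>"]


def train_byte_level_bpe(data_dir, output_dir, vocab_size=52_000, min_frequency=2):
    """Train a byte-level BPE tokenizer on every .txt file under data_dir."""
    files = sorted(str(p) for p in Path(data_dir).rglob("*.txt"))
    if not files:
        raise FileNotFoundError(f"no .txt files found under {data_dir}")

    tokenizer = ByteLevelBPETokenizer()
    tokenizer.train(
        files=files,
        vocab_size=vocab_size,        # illustrative value, tune for your corpus
        min_frequency=min_frequency,  # pairs seen fewer times are not merged
        special_tokens=SPECIAL_TOKENS,
        show_progress=False,
    )

    # save_model writes vocab.json and merges.txt into the directory.
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    tokenizer.save_model(str(output_dir))
    return tokenizer
```

Under these assumptions, a call such as train_byte_level_bpe("nepali-text", "tokenizer-output") would train on the downloaded chunks; the output directory name is hypothetical.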

Training the RoBERTa Model

Use the trained tokenizer to tokenize the Nepali text data. Modify the parameters in roberta-train.py if required, then run the RoBERTa training script:

python3 roberta-train.py
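As a rough illustration of the model setup step, the sketch below builds a randomly initialised RoBERTa model for masked language modeling with the transformers library. It is an assumption about how such a script might be structured, not a copy of the repository's training code. The function name and all hyperparameter values are hypothetical.

```python
from transformers import RobertaConfig, RobertaForMaskedLM


def build_roberta_mlm(vocab_size, **overrides):
    """Create an untrained RoBERTa masked-LM model (no pretrained weights)."""
    settings = dict(
        vocab_size=vocab_size,        # must match the trained tokenizer's vocabulary
        # RoBERTa position IDs start after the padding index,
        # so 514 positions allow sequences of up to 512 tokens.
        max_position_embeddings=514,
        num_hidden_layers=6,          # illustrative; roberta-base uses 12
        num_attention_heads=12,
        type_vocab_size=1,            # RoBERTa uses a single segment type
    )
    settings.update(overrides)        # caller-supplied values take precedence
    return RobertaForMaskedLM(RobertaConfig(**settings))
```

A model built this way is typically trained with the library's Trainer together with DataCollatorForLanguageModeling, which masks a random fraction of the input tokens in each batch.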

Using the Trained Model

After training, use the RoBERTa model for masked language modeling (predicting masked tokens), or fine-tune it for downstream tasks.
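One possible way to query a trained model is the fill-mask pipeline from the transformers library. The sketch below assumes that the model and its tokenizer were both saved to a single directory in the Hugging Face format (for example with save_pretrained) and that the mask token is <mask>. The helper name and the directory name in the comment are hypothetical.

```python
from transformers import pipeline


def fill_mask(model_dir, text, top_k=5):
    """Return (token, score) pairs for the single <mask> token in `text`."""
    unmasker = pipeline("fill-mask", model=model_dir, tokenizer=model_dir)
    predictions = unmasker(text, top_k=top_k)
    return [(p["token_str"], p["score"]) for p in predictions]


# Example call (hypothetical directory name):
# fill_mask("./nepali-roberta-model", "नेपाल <mask> देश हो।")
```

The scores are probabilities assigned by the model, so their usefulness depends entirely on how much data and training time the model has seen.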

An example of the tokenizer encoding and decoding a sentence is given below:

sentence2 = "सोमबार १११औँ अन्तर्राष्ट्रिय श्रमिक महिला दिवसको सन्दर्भमा अनेरास्ववियूले आयोजना गरेको टेम्पो चालक महिला सम्मान कार्यक्रमलाई सम्बोधन गर्दै भुसालले ५० प्रतिशत भन्दा बढी संख्यामा रहेका महिलाहरुले सबै क्षेत्रमा ५० प्रतिशतभन्दा बढी अधिकार प्राप्तिको निम्ति"
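# tokenizer is assumed to be the trained tokenizer, already loaded
encoded_input = tokenizer.encode(sentence2)
# decoding the token IDs reproduces the original sentence
tokenizer.decode(encoded_input.ids)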

Output:

'सोमबार १११औँ अन्तर्राष्ट्रिय श्रमिक महिला दिवसको सन्दर्भमा अनेरास्ववियूले आयोजना गरेको टेम्पो चालक महिला सम्मान कार्यक्रमलाई सम्बोधन गर्दै भुसालले ५० प्रतिशत भन्दा बढी संख्यामा रहेका महिलाहरुले सबै क्षेत्रमा ५० प्रतिशतभन्दा बढी अधिकार प्राप्तिको निम्ति'
