Advancing Bangla Punctuation Restoration by a Monolingual Transformer-Based Method and a Large-Scale Corpus
[Accepted at the BLP Workshop at EMNLP 2023. Paper link will be updated.]
The Bangla punctuation restoration corpus, BanglaPRCorpus, consists of 1.48 million source-target pairs. In each pair, the source sentence has punctuation missing, and the target sentence is the corrected version with the missing punctuation restored. Source sentences are produced by removing between 1 and 10 punctuation marks from each sentence. Sentence length in the corpus ranges from 2 to 127 words, with an average of 12.9 words.
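The pair construction described above can be illustrated with a minimal sketch. It is not the authors' generation script: the function name `make_source`, the punctuation set, and the random choice of which marks to drop are all assumptions made for illustration.

```python
import random
from typing import Optional

# Assumed punctuation set; the marks actually covered by BanglaPRCorpus may differ.
PUNCTUATION = set("।,?!;:")


def make_source(target: str, max_removed: int = 10,
                rng: Optional[random.Random] = None) -> str:
    """Return a copy of `target` with 1..max_removed punctuation marks removed.

    Hypothetical helper: the choice of marks to drop is uniform at random,
    which is an assumption and not taken from the paper.
    """
    rng = rng or random.Random()
    # Character positions of every punctuation mark in the target sentence.
    positions = [i for i, ch in enumerate(target) if ch in PUNCTUATION]
    if not positions:
        return target  # nothing to remove
    # Remove at least one mark, and never more than the sentence contains.
    k = rng.randint(1, min(max_removed, len(positions)))
    drop = set(rng.sample(positions, k))
    return "".join(ch for i, ch in enumerate(target) if i not in drop)


target = "আমি ভাত খাই, তুমি কি খাও?"
source = make_source(target, rng=random.Random(0))
print(source)  # the target with one or both punctuation marks removed
```

Each `(source, target)` pair produced this way matches the format described above: the model receives `source` and is trained to output `target`.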
Clone the GitHub repository of the paper.
git clone https://github.com/mehedihasanbijoy/Jatikarok-and-BanglaPRCorpus.git
Alternatively, you can manually download and extract the GitHub repository of Jatikarok-and-BanglaPRCorpus.
Install the required packages.
conda env create -f requirements.yml
Then activate the virtual environment and navigate to the repository directory.
conda activate jatikarok
cd Jatikarok-and-BanglaPRCorpus
Download the corpus.
gdown "https://drive.google.com/drive/folders/1V1OrkJ4okSgw5swmhrbXAZFqkDB8g7QX?usp=share_link" -O ./BanglaPRCorpus/BanglaPRCorpus/ --folder
Alternatively, download the folder manually from the Google Drive link above and place the extracted files in ./BanglaPRCorpus/BanglaPRCorpus/
Go to the ./BanglaPRCorpus directory and follow the instructions there.
The experiments in this paper involve benchmarking three methods (Jatikarok, BanglaT5, and T5 Small) on three corpora: BanglaPRCorpus, ProthomAloBalanced, and BanglaOPUS.
python main.py --CORPUS_PATH "./BanglaPRCorpus/BanglaPRCorpus/corpus.csv" --KNOWLEDGE_PATH "./KnowledgeToBeTransferred/gecJatikarok.pth" --CHECKPOINT_PATH "./ModelCheckpoints/prJatikarok.pth" --MODEL_NAME "jatikarok" --BATCH_SIZE 16 --N_EPOCHS 50
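Before launching a long training run, it can help to confirm that the input files named in the command are in place. The sketch below is a hypothetical pre-flight check, not part of the repository. It assumes that `--CORPUS_PATH` and `--KNOWLEDGE_PATH` point to files that must already exist, and that `--CHECKPOINT_PATH` is written during training and so is not checked.

```python
from pathlib import Path

# Paths copied from the command above. Treating both as required
# inputs is an assumption about how main.py uses them.
REQUIRED_INPUTS = [
    "./BanglaPRCorpus/BanglaPRCorpus/corpus.csv",
    "./KnowledgeToBeTransferred/gecJatikarok.pth",
]


def missing_inputs(paths, root="."):
    """Return the entries of `paths` that are not existing files under `root`."""
    return [p for p in paths if not (Path(root) / p).is_file()]


if __name__ == "__main__":
    missing = missing_inputs(REQUIRED_INPUTS)
    if missing:
        print("Missing input files:")
        for p in missing:
            print("  " + p)
    else:
        print("All input files found.")
```

Run it from the repository root so that the relative paths resolve the same way they do for `main.py`.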