Use Google Colab to avoid setup problems. For transformer models, install simpletransformers first, or use bn_nlp for static models. The dataset and training details are uploaded on my GitHub. There is a known problem in the sentiment analyzer; I will fix it soon.
SUST-Bangla Natural Language Toolkit. A Python module for Bangla NLP tasks.
Demo Version : 2.0.2
Requires Python 3.6 or later. Use a virtual environment to avoid unnecessary issues.
pip3 install sbnltk
pip3 install simpletransformers
pip3 install fasttext
pip3 install scikit-learn
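If an import fails later, a quick check along these lines can show which of the packages installed above is missing. This is a minimal sketch, not part of sbnltk: the helper names `python_is_supported` and `missing_modules` are hypothetical, and the import name `sbnltk` is an assumption based on the pip command (scikit-learn is imported as `sklearn`).

```python
import importlib.util
import sys

# Import names for the packages in the pip commands above.
# Assumption: the sbnltk package is imported as "sbnltk".
REQUIRED_MODULES = ["sbnltk", "simpletransformers", "fasttext", "sklearn"]


def python_is_supported(minimum=(3, 6)):
    """Return True if the running interpreter is at least `minimum`."""
    return sys.version_info >= minimum


def missing_modules(names=REQUIRED_MODULES):
    """Return the top-level module names that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Python 3.6+:", python_is_supported())
    print("Missing modules:", missing_modules() or "none")
```

Anything listed as missing can then be installed with the matching `pip3 install` command.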
- Clone this project
- Install all the requirements
- Run setup.py from the terminal
- Bangla Text Preprocessor
- Bangla word dust, punctuation, and stop word removal
- Bangla word sorting according to Bangla or English alphabet
- Bangla word normalization
- Bangla word stemmer
- Bangla Sentiment analysis (logisticRegression, LinearSVC, Multinomial_naive_bayes, Random_Forest)
- Bangla Sentiment analysis with BERT
- Bangla sentence POS tagger (static, sklearn)
- Bangla sentence POS tagger with BERT (Multilingual-cased, Multilingual-uncased)
- Bangla sentence NER (static, sklearn)
- Bangla sentence NER with BERT (BERT-Cased, Multilingual Cased/Uncased)
- Bangla word embedding / word2vec (gensim, glove, fasttext)
- Bangla sentence embedding (Contextual, Transformer/BERT)
- Bangla Document Summarization (Feature-based, Contextual, Semantic-based)
- Bangla Bi-lingual project (Bangla to English Google translator without IP blocking)
- Bangla document information Extraction
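To give a feel for the first items in the list, here is a minimal, library-independent sketch of punctuation and stop word removal in plain Python. It is not sbnltk's implementation or API: the function names and the three-word stop word list are hypothetical and chosen only for illustration.

```python
import string

# Hypothetical, deliberately tiny stop word list (illustration only).
STOP_WORDS = {"এবং", "ও", "কিন্তু"}

# ASCII punctuation plus the Bangla danda (sentence terminator).
PUNCTUATION = string.punctuation + "।"
_PUNCT_TABLE = str.maketrans({ch: " " for ch in PUNCTUATION})


def remove_punctuation(text):
    """Replace every punctuation character with a space."""
    return text.translate(_PUNCT_TABLE)


def remove_stop_words(words, stop_words=STOP_WORDS):
    """Drop the words that appear in the stop word set."""
    return [word for word in words if word not in stop_words]


def preprocess(sentence):
    """Strip punctuation, split on whitespace, then drop stop words."""
    return remove_stop_words(remove_punctuation(sentence).split())


if __name__ == "__main__":
    print(preprocess("আমি ভাত খাই, এবং তুমি মাছ খাও।"))
```

A real preprocessor needs a much larger stop word list and Unicode normalization, which this sketch leaves out.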
See the code docs for usage!
TASK | MODEL | ACCURACY | DATASET | About | Code DOCS |
---|---|---|---|---|---|
Preprocessor | Punctuation, stop word, and DUST removal; word normalization; others | - | - | docs | |
Word tokenizers | Basic tokenizers; Customized tokenizers | - | - | docs | |
Sentence tokenizers | Basic tokenizers; Customized tokenizers; Sentence Cluster | - | - | docs | |
Stemmer | StemmerOP | 85.5% | - | docs | |
Sentiment Analysis | logisticRegression | 88.5% | 20,000+ | docs | |
Sentiment Analysis | LinearSVC | 82.3% | 20,000+ | docs | |
Sentiment Analysis | Multinomial_naive_bayes | 84.1% | 20,000+ | docs | |
Sentiment Analysis | Random Forest | 86.9% | 20,000+ | docs | |
Sentiment Analysis | BERT | 93.2% | 20,000+ | docs | |
POS tagger | Static method | 55.5% | 1,40,973 words | docs | |
POS tagger | SK-LEARN classification | 81.2% | 6,000+ sentences | docs | |
POS tagger | BERT-Multilingual-Cased | 69.2% | 6,000+ | docs | |
POS tagger | BERT-Multilingual-Uncased | 78.7% | 6,000+ | docs | |
NER tagger | Static method | 65.3% | 4,08,837 entities | docs | |
NER tagger | SK-LEARN classification | 81.2% | 65,000+ | docs | |
NER tagger | BERT-Cased | 79.2% | 65,000+ | docs | |
NER tagger | BERT-Multilingual-Cased | 75.5% | 65,000+ | docs | |
NER tagger | BERT-Multilingual-Uncased | 90.5% | 65,000+ | docs | |
Word Embedding | Gensim-word2vec-100D, 1,00,00,000+ tokens | - | 2,00,00,000+ sentences | docs | |
Word Embedding | Glove-word2vec-100D, 2,30,000+ tokens | - | 5,00,000 sentences | docs | |
Word Embedding | fasttext-word2vec-200D, 3,00,000+ | - | 5,00,000 sentences | docs | |
Sentence Embedding | Contextual sentence embedding | - | - | docs | |
Sentence Embedding | Transformer embedding_hd | - | 3,00,000+ human data | docs | |
Sentence Embedding | Transformer embedding_gd | - | 3,00,000+ google data | docs | |
Extractive Summarization | Feature-based | 70.0% f1 score | - | docs | |
Extractive Summarization | Transformer sentence sentiment based | 67.0% | - | docs | |
Extractive Summarization | Word2vec sentences contextual based | 60.0% | - | docs | |
Bi-lingual projects | Google translator with large data detector | - | - | docs | |
Information Extraction | Static word features | - | | docs | |
Information Extraction | Semantic and contextual | - | | docs | |
Bangla Coreference Resolution | - | | | | |
Task | Version |
---|---|
Coreference Resolution | v1.1 |
Language translation | v1.1 |
Masked Language model | v1.1 |
Information retrieval Projects | v1.1 |
Entity Segmentation | v1.3 |
Factoid Question Answering | v1.2 |
Question Classification | v1.2 |
Sentiment word embedding | v1.3 |
Many other features | --- |
If you get a module error, install these packages manually:
- simpletransformers
- fasttext
Everything is automated here. When you call a model for the first time, it is downloaded automatically.
- With a GPU, you can run any model without warnings.
- Without a GPU, you will get some warnings, but they do not affect the results.
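Whether you will see those warnings depends on whether a GPU is visible to the process. A minimal sketch for checking this, assuming the transformer models run on PyTorch (simpletransformers is built on it); `gpu_available` is a hypothetical helper, not an sbnltk function:

```python
import importlib.util


def gpu_available():
    """Return True if PyTorch is installed and reports a usable CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return False  # PyTorch is not installed, so no CUDA device can be used
    import torch

    return bool(torch.cuda.is_available())


if __name__ == "__main__":
    device = "cuda" if gpu_available() else "cpu"
    print("Transformer models would run on:", device)
```

On Colab, the result depends on the runtime type selected for the notebook.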
With approximately 228 million native speakers and another 37 million second-language speakers, Bengali is the fifth most-spoken native language and the seventh most-spoken language by total number of speakers in the world. Yet it is still a low-resource language. Why?
For all sbnltk datasets and other existing datasets, see this link: Bangla NLP Dataset
For training, see this Colab Trainer. In the future I will make a Trainer module!
Very soon. We are working on a paper and on improving our modules. It will be released sequentially.
Accuracy can vary across datasets. We evaluated our models on random but small-scale datasets, because the human resources for this project are limited.
- If you find any issue, please create an issue or contact me.