Releases: sagorbrur/bnlp
BNLP 4.0.3
BNLP 4.0.2
- Update NLTK version from 3.8.1 to 3.8.2. NLTK Version 3.8.1 has security vulnerabilities.
Reference issue: #46
BNLP 4.0.1 Patch Release
- Minor change: added version constraints to the requirements to fix an installation problem with scipy.
BNLP 4.0.0-dev3
Internal build of BNLP 4.0.0.
BNLP 4.0.0-dev4
Fixed the build problem in dev version 3.
BNLP 4.0.0
BNLP 4.0.0: A redesign of BNLP version 3 using object-oriented design, with model reuse, a separate training module, and more
Highlights
BNLP v4.0.0 is a redesign built on object-oriented programming. In earlier versions, the pre-trained model was loaded every time a text was tokenized or embedded. In this version, the model is loaded only once and reused for tokenization, embedding, and other tasks. Automatic model downloading has also been added: if no pre-trained model path is passed, a pre-trained model is loaded automatically from the hub. In earlier versions, training was embedded in the same module as prediction, which made it difficult to add separate functionality for training and prediction. Training is therefore now a separate module for each task, such as tokenization and embeddings. The Corpus module is now a class, which makes it easier to reuse and extend.
API Changes
Model loading changes: Previously, the model was loaded every time a result was generated.
- The model is now loaded when a class is instantiated.
- If no model path is passed, a pre-trained model is loaded automatically from the hub.
3.3.2:
from bnlp import BengaliWord2Vec
bwv = BengaliWord2Vec()
model_path = "bengali_word2vec.model"
word = 'গ্রাম'
similar = bwv.most_similar(model_path, word, topn=10)
print(similar)

4.0.0:
from bnlp import BengaliWord2Vec
model_path = "path/mymodel.model"
bwv = BengaliWord2Vec(model_path=model_path)
word = 'গ্রাম'
vector = bwv.get_word_vector(word)
print(vector.shape)
Training module changes
The training module has been separated from the main module, and relevant features have been added to it.
3.3.2:
from bnlp import BengaliWord2Vec
bwv = BengaliWord2Vec()
data_file = "raw_text.txt"
model_name = "test_model.model"
vector_name = "test_vector.vector"
bwv.train(data_file, model_name, vector_name, epochs=5)

4.0.0:
from bnlp import Word2VecTraining
trainer = Word2VecTraining()
data_file = "raw_text.txt"
model_name = "test_model.model"
vector_name = "test_vector.vector"
trainer.train(data_file, model_name, vector_name, epochs=5)
Corpus is now a class
3.3.2:
from bnlp.corpus import stopwords, punctuations, letters, digits
print(stopwords)
print(punctuations)
print(letters)
print(digits)

4.0.0:
from bnlp import BengaliCorpus as corpus
print(corpus.stopwords)
print(corpus.punctuations)
print(corpus.letters)
print(corpus.digits)
print(corpus.vowels)
Contributors
- Ibrahim (automatic model downloading, fixing glove vector loading)
BNLP 4.0.0-dev2
v4.0.0dev2: adds the 4.0.0 dev2 version for building.
BNLP 3.3.2
Bug fix
- Fixed the dummy token replacement bug in the NLTK sentence tokenizer. It was not tokenizing the period (.) as the algorithm intended.
Incompatibility warning
- The upcoming BNLP version 4.0.0 (dev release available) will be completely incompatible with current and past versions. A deprecation warning has been added: every time this version is imported, it warns users to pin the exact version if they do not want to upgrade to the newer version.
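As a rough illustration of an import-time warning like the one described, the sketch below uses Python's standard `warnings` module. The function name, version constant, and message text are hypothetical; the actual wording and warning category used in BNLP may differ.

```python
import warnings

# Hypothetical version string and message; the real wording in BNLP may differ.
CURRENT_VERSION = "3.3.2"


def warn_about_upcoming_release() -> None:
    """Emit the kind of warning a package could raise from its __init__.py."""
    warnings.warn(
        f"Version 4.0.0 will be incompatible with {CURRENT_VERSION}. "
        "Pin the exact version in your requirements if you do not want to upgrade.",
        FutureWarning,
        stacklevel=2,
    )


# Capture the warning to show what a user would see on import.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warn_about_upcoming_release()

print(caught[0].category.__name__, "-", caught[0].message)
```

The sketch uses `FutureWarning` because Python shows it to end users by default, whereas `DeprecationWarning` is hidden by default unless it is triggered by code in `__main__`.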
v3.3.1: Patch release
Fixed a version incompatibility between Gensim and Python 3.10
- Removed the exact Gensim version pin and replaced it with the latest Gensim version to fix the build problem in Python 3.10 (#29)
BNLP 3.3.0
Bug Fix
- Removed wasabi text formatting to fix a build problem with updated versions across different operating systems and Python versions
New Feature
Text Cleaning
We adopted text-cleaning rules and code from clean-text and modified them for Bangla. You can now normalize and clean your text as follows.
from bnlp import CleanText
# Configure the cleaner once, then call it on any input text.
clean_text = CleanText(
    fix_unicode=True,
    unicode_norm=True,
    unicode_norm_form="NFKC",
    remove_url=False,
    remove_email=False,
    remove_emoji=False,
    remove_number=False,
    remove_digits=False,
    remove_punct=False,
    replace_with_url="<URL>",
    replace_with_email="<EMAIL>",
    replace_with_number="<NUMBER>",
    replace_with_digit="<DIGIT>",
    replace_with_punct="<PUNC>"
)
input_text = "আমার সোনার বাংলা।"
# Use a separate name for the result so the cleaner object stays reusable.
cleaned_text = clean_text(input_text)
print(cleaned_text)
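To give a sense of what options such as `unicode_norm_form="NFKC"` and `replace_with_url` refer to, here is a minimal standard-library sketch of Unicode normalization followed by URL replacement. It is not BNLP's implementation: `simple_clean` and `URL_PATTERN` are hypothetical names, and the URL pattern is deliberately simplified.

```python
import re
import unicodedata

# Simplified URL pattern for illustration only; real cleaners use stricter rules.
URL_PATTERN = re.compile(r"https?://\S+")


def simple_clean(text: str, norm_form: str = "NFKC", url_token: str = "<URL>") -> str:
    """Normalize Unicode, then replace URLs with a placeholder token."""
    text = unicodedata.normalize(norm_form, text)
    return URL_PATTERN.sub(url_token, text)


# U+FF11 (fullwidth digit one) folds to ASCII "1" under NFKC.
print(simple_clean("বাংলা １ https://example.com"))
```

NFKC folds compatibility characters such as fullwidth digits into their standard forms, so visually similar inputs compare equal after cleaning.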