Foysal87/bn_nlp

bn_nlp

A Bangla NLP toolkit built entirely from datasets and pretrained models. This is version 2.0 (a summarizer and a paper are planned for the next version). You can use it now.

This repository was made public on 29 January 2020.

What will you get here?

Required packages (Python 3.7)

  • numpy
  • scipy

Dataset

  • Bangla word count (6,15,621++)
  • Bangla root word count (83,665)
  • Bangla stop words (356++)
  • Bangla suffixes (100++)
  • Bangla root word POS tag count (1,33,973++)
  • Bangla word2vec embedding (7,25,061)
  • Bangla NER tags (4,08,837++)

The '++' sign means the data will grow later.
You must download the Word2Vec file from Google Drive, or an error will occur.

Punctuation removal

from bn_nlp.preprocessing import ban_processing
bp = ban_processing()
text = "সড়কের ‘কারণে’ বৃহস্পতিবার দেখা গেল পুরো এলাকা ‘হাবুডুবু’ খাচ্ছে অথৈ পানিতে।"
print(bp.punctuation_remove(text))

output

সড়কের  কারণে  বৃহস্পতিবার দেখা গেল পুরো এলাকা  হাবুডুবু  খাচ্ছে অথৈ পানিতে 
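
The behaviour in the sample output (each punctuation mark replaced by a space) can be illustrated with a minimal sketch. This is not the toolkit's implementation; the punctuation set and the name `remove_punctuation` are assumptions made for illustration.

```python
import string

# Assumed punctuation set: ASCII punctuation plus the Bangla danda (U+0964)
# and the curly single/double quotes. The toolkit's real list may differ.
PUNCTUATION = string.punctuation + "\u0964\u2018\u2019\u201c\u201d"

# Translation table that maps every punctuation character to a single space.
_TABLE = str.maketrans({ch: " " for ch in PUNCTUATION})


def remove_punctuation(text: str) -> str:
    """Replace each punctuation character with a space (hypothetical helper)."""
    return text.translate(_TABLE)


print(remove_punctuation("‘কারণে’ দেখা গেল।"))
```

Replacing with a space instead of deleting keeps neighbouring words from being glued together, at the cost of some doubled spaces.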

Stopword removal

Removes common words (stop words) from a sentence. You can find those words in 'stop_word.txt'.

from bn_nlp.preprocessing import ban_processing
bp = ban_processing()
text = "সড়কের ‘কারণে’ বৃহস্পতিবার দেখা গেল পুরো এলাকা ‘হাবুডুবু’ খাচ্ছে অথৈ পানিতে।"
print(bp.stop_word_remove(text))

output

সড়কের ‘কারণে’ বৃহস্পতিবার পুরো এলাকা ‘হাবুডুবু’ খাচ্ছে অথৈ পানিতে।

Stopword add

Adds a word to the stopword list.

from bn_nlp.preprocessing import ban_processing
bp = ban_processing()
text = "সড়কের ‘কারণে’ বৃহস্পতিবার দেখা গেল পুরো এলাকা ‘হাবুডুবু’ খাচ্ছে অথৈ পানিতে।"
bp.add_stopword('সড়কের')
print(bp.stop_word_remove(text))

output

‘কারণে’ বৃহস্পতিবার পুরো এলাকা ‘হাবুডুবু’ খাচ্ছে অথৈ পানিতে।
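
Stop word filtering of this kind can be sketched as a set lookup over whitespace-separated tokens. The snippet below is only an illustration under assumptions: the two-word stop list and the helper names are hypothetical, and the toolkit itself reads its list from 'stop_word.txt'.

```python
# Hypothetical, tiny stop word set; the toolkit loads its own list from a file.
STOP_WORDS = {"দেখা", "গেল"}


def add_stop_word(word, stop_words=STOP_WORDS):
    """Register an extra stop word in the given set."""
    stop_words.add(word)


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Drop every whitespace-separated token that is in the stop word set."""
    return " ".join(tok for tok in text.split() if tok not in stop_words)


print(remove_stop_words("বৃহস্পতিবার দেখা গেল পুরো এলাকা"))
```

A set gives constant-time membership checks, which matters once the list holds several hundred words.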

Dust Removal

Removes every non-Bangla character from a word.

from bn_nlp.preprocessing import ban_processing
bp = ban_processing()
text = "সড়কের12A'--,.:Bকারণে"
print(bp.dust_removal(text))

output

সড়কেরকারণে
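
One way to get this effect is a regular expression over the Unicode Bengali block (U+0980 to U+09FF). This is a sketch, not the toolkit's code: the name `keep_bangla_only` is hypothetical, and note that this version also strips whitespace and keeps Bangla digits, which may differ from the library's behaviour.

```python
import re

# Anything outside the Unicode Bengali block U+0980..U+09FF counts as "dust".
_NON_BANGLA = re.compile(r"[^\u0980-\u09FF]")


def keep_bangla_only(word: str) -> str:
    """Delete every character that is not in the Bengali block (hypothetical helper)."""
    return _NON_BANGLA.sub("", word)


print(keep_bangla_only("সড়কের12A'--,.:Bকারণে"))
```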

Word Normalize

Similar vowels are mapped to the same character for better accuracy.

from bn_nlp.preprocessing import ban_processing
bp = ban_processing()
text = "অসহনীয় ভারী বর্ষণে"
print(bp.word_normalize(text))

output

অসহনিয় ভারি বর্ষণে
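
The sample shows the long vowel sign ী being folded into the short ি. A minimal sketch of such a mapping follows; only that first pair is confirmed by the output above, while the ূ to ু pair and the name `normalize_vowels` are assumptions added for illustration.

```python
# Character-level folding table.
NORMALIZE_MAP = str.maketrans({
    "\u09c0": "\u09bf",  # ী -> ি (seen in the sample output)
    "\u09c2": "\u09c1",  # ূ -> ু (assumed, not confirmed by the README)
})


def normalize_vowels(text: str) -> str:
    """Fold similar vowel signs into one canonical sign (hypothetical helper)."""
    return text.translate(NORMALIZE_MAP)


print(normalize_vowels("ভারী বর্ষণে"))
```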

Bangla word to English-equivalent word conversion

from bn_nlp.preprocessing import ban_processing
bp = ban_processing()
text = "রাজধানী"
print(bp.bn2enCon(text))

output

rajadhani

Bangla word sort according to the English alphabet

from bn_nlp.preprocessing import ban_processing
bp = ban_processing()
vec = ['১', 'ঘণ্টার', 'ভারী', 'বর্ষণে', 'সোমবার', 'রাজধানীর', 'বিভিন্ন', 'এলাকায়', 'জলাবদ্ধতা', 'দেখা', 'দেয়']
print(bp.bn_word_sort(vec))

output

['১', 'ভারী', 'বিভিন্ন', 'বর্ষণে', 'দেখা', 'দেয়', 'এলাকায়', 'ঘণ্টার', 'জলাবদ্ধতা', 'রাজধানীর', 'সোমবার']

Bangla word sort according to the Bangla alphabet

from bn_nlp.preprocessing import ban_processing
bp = ban_processing()
vec = ['১', 'ঘণ্টার', 'ভারী', 'বর্ষণে', 'সোমবার', 'রাজধানীর', 'বিভিন্ন', 'এলাকায়', 'জলাবদ্ধতা', 'দেখা', 'দেয়']
print(bp.bn_word_sort_bn_sys(vec))

output

['এলাকায়', 'ঘণ্টার', 'জলাবদ্ধতা', 'দেখা', 'দেয়', 'বিভিন্ন', 'বর্ষণে', 'ভারী', 'রাজধানীর', 'সোমবার', '১']

Bangla Basic word tokenizer

from bn_nlp.tokenizer import wordTokenizer
wordtoken = wordTokenizer()
text = "১ ঘণ্টার ভারী বর্ষণে সোমবার রাজধানীর বিভিন্ন এলাকায় জলাবদ্ধতা দেখা দেয়"
print(wordtoken.basic_tokenizer(text))

output

['১', 'ঘণ্টার', 'ভারী', 'বর্ষণে', 'সোমবার', 'রাজধানীর', 'বিভিন্ন', 'এলাকায়', 'জলাবদ্ধতা', 'দেখা', 'দেয়']

Bangla Normalize word tokenizer

from bn_nlp.tokenizer import wordTokenizer
wordtoken = wordTokenizer()
text = "১ ঘণ্টার ভারী বর্ষণে সোমবার রাজধানীর বিভিন্ন এলাকায় জলাবদ্ধতা দেখা দেয়"
print(wordtoken.normalize_tokenizer(text))

output

['১', 'ঘণ্টার', 'ভারি', 'বর্ষণে', 'সোমবার', 'রাজধানির', 'বিভিন্ন', 'এলাকায়', 'জলাবদ্ধতা', 'দেখা', 'দেয়']

Bangla Basic Sentence tokenizer

from bn_nlp.tokenizer import sentenceTokenizer
senttoken = sentenceTokenizer()
text = "ভোগান্তিতে পড়েন নগরবাসী। ব্যাহত হয় যান চলাচল। গতকাল সকালবেলা ছিল অসহনীয় গরম।"
print(senttoken.basic_tokenizer(text))

output

['ভোগান্তিতে পড়েন নগরবাসী', ' ব্যাহত হয় যান চলাচল', ' গতকাল সকালবেলা ছিল অসহনীয় গরম']
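
The sample output is consistent with splitting on the Bangla danda (।, U+0964) and dropping the empty trailing piece. Here is a minimal sketch under that assumption; `split_sentences` is a hypothetical name, and other sentence terminators such as '?' or '!' are not handled.

```python
DANDA = "\u0964"  # Bangla full stop (।)


def split_sentences(text: str) -> list:
    """Split on the danda and discard pieces that contain only whitespace."""
    return [piece for piece in text.split(DANDA) if piece.strip()]


print(split_sentences("ব্যাহত হয় যান চলাচল। গতকাল সকালবেলা ছিল গরম।"))
```

As in the output above, leading spaces are kept on every sentence after the first; call `str.strip` on each piece if they are unwanted.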

Bangla Normalize Sentence tokenizer

No Dust. No punctuation. Normalize words.

from bn_nlp.tokenizer import sentenceTokenizer
senttoken = sentenceTokenizer()
text = "ভোগান্তিতে পড়েন নগরবাসী। ব্যাহত হয় যান চলাচল। গতকাল সকালবেলা ছিল অসহনীয় গরম।"
# Assumption: the normalized output below likely comes from a normalizing method
# (by analogy with wordTokenizer.normalize_tokenizer), not from basic_tokenizer.
print(senttoken.basic_tokenizer(text))

output

['ভোগান্তিতে পড়েন নগরবাসি', 'ব্যাহত হয় যান চলাচল', 'গতকাল সকালবেলা ছিল অসহনিয় গরম']

Bangla Word Checker

Does this word exist in the Bangla dictionary?

from bn_nlp.Stemmer import stemmerOP
stemmer = stemmerOP()
text = "ভোগান্তিতে"
print(stemmer.search(text))

output

True

Bangla word Stemmer

Finds the root word.

from bn_nlp.Stemmer import stemmerOP
stemmer = stemmerOP()
text = "ভোগান্তিতে"
print(stemmer.stem(text))
text = "ভোগান্তিতে পড়েন নগরবাসী"
print(stemmer.stemSent(text))

output

ভোগান্তি
ভোগান্তি পড় নগরবাসি
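
Since the dataset section lists both root words and suffixes, one plausible approach is dictionary-backed suffix stripping. The sketch below illustrates that general technique only; the root set, the suffix list, and the function names are hypothetical, and the toolkit's actual algorithm may work differently.

```python
# Hypothetical miniature resources; the toolkit ships far larger lists.
ROOTS = {"ভোগান্তি"}
SUFFIXES = ["তে", "ের", "র"]


def stem(word, roots=ROOTS, suffixes=SUFFIXES):
    """Strip the longest suffix whose removal leaves a known root word."""
    if word in roots:
        return word
    for suffix in sorted(suffixes, key=len, reverse=True):
        candidate = word[: -len(suffix)]
        if word.endswith(suffix) and candidate in roots:
            return candidate
    return word  # unknown word: return it unchanged


def stem_sentence(sentence, roots=ROOTS, suffixes=SUFFIXES):
    """Stem every whitespace-separated word of a sentence."""
    return " ".join(stem(w, roots, suffixes) for w in sentence.split())


print(stem("ভোগান্তিতে"))
```

Checking the candidate against a root dictionary avoids over-stemming words that merely happen to end in a suffix-like sequence.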

Bangla word2vec embedding

Pretrained word2vec embedding download link:

After downloading, paste this file into the bn_nlp directory.

from bn_nlp.word2vec_embedding import word2vec
w2v = word2vec()
text = "বর্ষণে"
print(w2v.closure_word(text, 5))
text2 = "বৃষ্টি"
print(w2v.dist(text, text2))
# You can get the embedding vector by calling 'w2v.embedding_vec'

output

['বর্ষণে', 'বৃষ্টিপাতে', 'বৃষ্টিতে', 'কালবৈশাখী', 'জলোচ্ছ্বাসে']
26.64097023010254
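
To make the idea of "closest words" and "distance" concrete, here is a small sketch with toy two-dimensional vectors. It assumes Euclidean distance, which the README does not confirm; the vectors and the function names are made up for illustration.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def closest_words(word, vectors, k):
    """Return the k words nearest to `word`; the word itself ranks first (distance 0)."""
    target = vectors[word]
    ranked = sorted(vectors, key=lambda w: euclidean(target, vectors[w]))
    return ranked[:k]


# Toy embedding table (hypothetical values, not real word2vec output).
toy_vectors = {"rain": [1.0, 0.0], "shower": [1.0, 1.0], "car": [5.0, 5.0]}
print(closest_words("rain", toy_vectors, 2))
print(euclidean(toy_vectors["rain"], toy_vectors["car"]))
```

With several hundred thousand real vectors, a vectorized NumPy computation would be preferable to this pure-Python loop.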

Bangla sent2sent embedding/similarity from word2vec

A smaller value means closer similarity. Built from word2vec. You can build an embedding vector from the similarity. I implemented dist directly, because distance is what we basically need.

from bn_nlp.sent2sent_embedding import sent2sent
s2s = sent2sent()
text1 = "আমি ভাত খাই"
text2 = "আমি পাস্তা খেতে চাই"
print(s2s.dist(text1, text2))
# The 'sent2sent_dist' function takes a vector and returns a 2D array with the distance from every sentence to every other sentence

output

37.503074645996094
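
One common way to derive a sentence distance from word vectors is to average the vectors of a sentence's words and compare the averages. The sketch below shows that approach and a pairwise distance matrix; it is an assumption about the general technique, not the toolkit's confirmed method, and all names and values are hypothetical.

```python
import math


def sentence_vector(sentence, vectors, dim):
    """Mean of the word vectors of all known words; zero vector if none are known."""
    known = [vectors[w] for w in sentence.split() if w in vectors]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pairwise_distances(sentences, vectors, dim):
    """2D list where entry [i][j] is the distance between sentence i and sentence j."""
    vecs = [sentence_vector(s, vectors, dim) for s in sentences]
    return [[euclidean(a, b) for b in vecs] for a in vecs]


# Toy word vectors (hypothetical values).
toy = {"a": [0.0, 0.0], "b": [3.0, 4.0]}
print(pairwise_distances(["a", "b", "a b"], toy, dim=2))
```

The matrix is symmetric with zeros on the diagonal, so only the upper triangle needs to be computed when the sentence list is large.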

Bangla Word Postag

from bn_nlp.posTag import postag
tagger = postag()
text = "সড়কের ‘কারণে’ বৃহস্পতিবার দেখা গেল পুরো এলাকা ‘হাবুডুবু’ খাচ্ছে অথৈ পানিতে।"
print(tagger.tag(text))

output

[('সড়ক', 'noun'), ('কারণে', 'preposition'), ('বৃহস্পতিবার', 'noun'), ('দেখা', 'verb'), ('গেল', 'verb'), ('পুরো', 'verb'), ('এলাকা', 'noun'), ('হাবুডুবু', 'noun'), ('খাচ্ছে', 'verb'), ('অথৈ', 'adverb'), ('পানি', 'noun')]

Bangla Word NER

Good accuracy for single entities.

from bn_nlp.NER import UncustomizeNER
ner = UncustomizeNER()
text = "আর্জেন্টিনা দক্ষিণ আমেরিকার একটি রাষ্ট্র। বুয়েনোস আইরেস দেশটির বৃহত্তম শহর ও রাজধানী।"
print(ner.NER(text))

output

{'আর্জেন্টিনা': 'LOC', 'দক্ষিণ আমেরিকার': 'LOC', 'রাষ্ট্র': 'LOC', 'বুয়েনোস আইরেস': 'PER', 'দেশটির': 'LOC', 'বৃহত্তম শহর': 'LOC'}

Thank you
Let's make better resources for Bangla