Skip to content

Noureldien334/Arabic-NLP

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

24 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Arabic NLP

In this project we used data scrapped from https://www.arab-books.com/

First, We scrapped books data from the website like Book Name, Author, Category, Pages’ number, Dar El-Nashr, Book Size, and Description.

Second, We cleaned the data from English words, numbers, and tags.

Third, We used an autocorrect Library "Hunspell-ar" to correct mismatcched words.

Fourth, we stemmed and lemmatized the data using Farassa, We tokenized, removed the stop words, lemmatized and stemmed our data.

Note: We have tried different lemmatizers and stemmers (e.g. Qalsadi, Snowball stemmer, ISRI stemmer and Madamira) but we found that Farassa has outperformed all of them on Arabic words.

Fifth, we train the model using RandomForestClassifier (RFC) on the features(description column) and labels(category column) after applying TF-IDF vectorizer then evaluated and tested the model on 30% of the dataset using new unseen data and achieved an accuracy of 73%.

Sixth, we created two Question-Answer Datasets, then we matched the input query with the questions in the two Datasets and retrieved the most relevant answer.

Seventh, we weighted all the words of All Descriptions of the books, we weighted the important keywords by calculating Tf-Idf for each one. Note: We removed all stop words from the Descriptions.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published