Arabic is the fifth most widely spoken language worldwide and, as the language of the Quran, it has remained largely unchanged through the centuries. In recent years, automatic text summarization has developed continuously and remarkably due to its importance and applications; however, the number of research studies investigating Arabic text summarization remains relatively small compared with other languages. In this project, we propose a text highlighter tool that gives the user summarized feedback on a given Arabic text, using EASC (Essex Arabic Summaries Corpus). The proposed system takes a single Arabic document as input and returns the top-ranked sentences, highlighted, as its final result. It is implemented using three different approaches: Bag of Words (BoW), which achieved an accuracy of 59%; Term Frequency - Inverse Document Frequency (TF-IDF), which achieved 60%; and Word2Vec (skip-gram) with K-Means clustering, which achieved 56% with first-appearance sentence selection and 48% with closest-to-the-center selection.
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. See deployment for notes on how to deploy the project on a live system.
The libraries used for the Bag of Words (BoW) summarizer (a sketch of how they fit together follows the list):
- To access the single article URL link
from urllib import request
- To extract the web information (article content)
from bs4 import BeautifulSoup as bs
- To clean the web information (article content)
import re
- To import the sentence and words tokenizer
import nltk
- To import the Arabic stopwords list
from nltk.corpus import stopwords
- To import the Arabic stemmer (ISRI Arabic stemmer)
from nltk.stem.isri import ISRIStemmer
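A minimal sketch of how these imports combine into a frequency-based (BoW) sentence ranker. The example URL (the deployment-test article), the cleaning regexes, and the number of highlighted sentences are illustrative assumptions, not the project's exact code:

```python
# A rough BoW sentence ranker, assuming an Arabic Wikipedia article as input.
from urllib import request
from bs4 import BeautifulSoup as bs
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.isri import ISRIStemmer

nltk.download("punkt")
nltk.download("stopwords")

# Example article (the one used in the deployment test below)
url = "https://ar.wikipedia.org/wiki/%D8%A7%D9%84%D9%83%D9%88%D9%86"
html = request.urlopen(url).read()
text = " ".join(p.get_text() for p in bs(html, "html.parser").find_all("p"))
text = re.sub(r"\[\d+\]", " ", text)      # drop citation marks
text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace

stemmer = ISRIStemmer()
stop_words = set(stopwords.words("arabic"))
sentences = nltk.sent_tokenize(text)

# Word frequencies over stemmed, stopword-filtered tokens
freq = {}
for word in nltk.word_tokenize(text):
    if word in stop_words or not word.isalpha():
        continue
    stem = stemmer.stem(word)
    freq[stem] = freq.get(stem, 0) + 1

# Score every sentence by the summed frequency of its stems, then keep the top 5
scored = [(sum(freq.get(stemmer.stem(w), 0) for w in nltk.word_tokenize(s)), s)
          for s in sentences]
highlights = [s for _, s in sorted(scored, reverse=True)[:5]]
print(highlights)
```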
The libraries used for the Term Frequency-Inverse Document Frequency (TF-IDF) summarizer (a scoring sketch follows the list):
- To import the sentence and words tokenizer
import nltk
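A minimal sketch of TF-IDF sentence ranking using only nltk, treating each sentence as a "document" when computing IDF. The helper name, scoring, and length normalization are illustrative assumptions:

```python
# TF-IDF sentence ranking with nltk only; "tfidf_rank" is an illustrative
# helper name, not the project's actual function.
import math
import nltk

nltk.download("punkt")

def tfidf_rank(text, top_n=5):
    sentences = nltk.sent_tokenize(text)
    tokenized = [nltk.word_tokenize(s) for s in sentences]
    n_sents = len(tokenized)

    # Document frequency: in how many sentences each word occurs
    df = {}
    for tokens in tokenized:
        for w in set(tokens):
            df[w] = df.get(w, 0) + 1

    scored = []
    for sent, tokens in zip(sentences, tokenized):
        if not tokens:
            continue
        score = sum((tokens.count(w) / len(tokens)) *   # TF
                    math.log(n_sents / df[w])           # IDF
                    for w in tokens)
        scored.append((score / len(tokens), sent))      # length-normalized
    return [s for _, s in sorted(scored, reverse=True)[:top_n]]
```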
The libraries used for the Word2Vec summarizer (a clustering sketch follows the list):
Install the pre-trained 100-dimension skip-gram model (trained on Wikipedia) and the utility file:
github.com/bakrianoo/aravec
install: full_grams_sg_100_wiki
- To access the single article URL link
from urllib import request
- To extract the web information (article content)
from bs4 import BeautifulSoup as bs
- To clean the web information (article content)
import re
- To import the sentence and words tokenizer
import nltk
- To import the Arabic stopwords list
from nltk.corpus import stopwords
- To import and use K-Means clustering (and related utilities)
from sklearn.manifold import TSNE
from sklearn import cluster
from sklearn.metrics import pairwise_distances_argmin_min
from sklearn.cluster import KMeans
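A minimal sketch of the Word2Vec + K-Means step: sentence vectors are built by averaging AraVec word vectors and then clustered, with both sentence-selection strategies (first appearance and closest to the center) shown. Loading the model with gensim under the path "full_grams_sg_100_wiki.mdl" is an assumption based on how AraVec models are distributed:

```python
# Sentence vectors from AraVec (100-dim skip-gram) averaged over word vectors,
# clustered with K-Means; both sentence-selection strategies are shown.
import numpy as np
import nltk
import gensim
from nltk.corpus import stopwords
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

model = gensim.models.Word2Vec.load("full_grams_sg_100_wiki.mdl")  # path assumed
stop_words = set(stopwords.words("arabic"))

def sentence_vector(sent):
    # Average the vectors of in-vocabulary, non-stopword tokens
    words = [w for w in nltk.word_tokenize(sent)
             if w not in stop_words and w in model.wv]
    if not words:
        return np.zeros(model.vector_size)
    return np.mean([model.wv[w] for w in words], axis=0)

def summarize(text, n_clusters=5):
    sentences = nltk.sent_tokenize(text)
    vectors = np.array([sentence_vector(s) for s in sentences])
    kmeans = KMeans(n_clusters=n_clusters, random_state=0).fit(vectors)

    # "Closest to the center": the sentence nearest each cluster centroid
    closest, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, vectors)
    # "First appearance": the earliest sentence assigned to each cluster
    first = [int(np.where(kmeans.labels_ == c)[0][0]) for c in range(n_clusters)]

    return ([sentences[i] for i in sorted(set(closest))],
            [sentences[i] for i in sorted(set(first))])
```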
The Python files are integrated as CGI scripts with the web pages, which were developed using HTML5/CSS3/JavaScript/PHP. Everything except the Word2Vec model is deployed at the following link: http://arabic.highlight.heliohost.org/
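A rough sketch of the CGI wiring, assuming the page posts the article text in a form field named "article"; the rank_sentences stub is only a placeholder for whichever of the three rankers the deployed script uses:

```python
#!/usr/bin/env python
# Minimal CGI wiring sketch: the "article" form field and the rank_sentences
# stub are illustrative, not the project's actual interface.
import cgi
import html

def rank_sentences(text, top_n=5):
    # Placeholder ranker: just keeps the first top_n sentences
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return set(sentences[:top_n])

form = cgi.FieldStorage()
article = form.getvalue("article", "")
top = rank_sentences(article)

print("Content-Type: text/html; charset=utf-8")
print()
parts = [s.strip() for s in article.split(".") if s.strip()]
body = " ".join(f"<mark>{html.escape(s)}.</mark>" if s in top
                else f"{html.escape(s)}." for s in parts)
print(f'<div dir="rtl">{body}</div>')
```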
Deployment Testing
- Input: the article text or its link, e.g. https://ar.wikipedia.org/wiki/%D8%A7%D9%84%D9%83%D9%88%D9%86
- Anaconda Spyder - The Python environment used
- HelioHost (Ricky) - The web host server
- VisualStudioCode - The web code editor
- Raghad Alshaikh - RaghadAlshaikh
- Ghaidaa Aflah - GhaidaaAflah
- Nada Alamouadi - NadaAlamouadi
Supervised by: Dr. Amal Almansour - AmalAlmansour
- The Arabic Word2Vec Pre-trained model: github.com/bakrianoo/aravec
- The EASC dataset: sourceforge.net/projects/easc-corpus/