Arabic is the fifth most widely spoken language worldwide and, as the language of the Quran, it has remained largely unchanged through the centuries. In recent years, automatic text summarization has developed continuously and remarkably due to its importance and applications; however, the number of research studies investigating Arabic text summarization remains relatively small compared with other languages. In this project, we propose a text highlighter tool that gives the user summarized feedback on a given Arabic text, using EASC (Essex Arabic Summaries Corpus). The proposed system takes a single Arabic document as input and returns the top-ranked sentences, highlighted, as its final result. It is implemented using three different approaches: Bag of Words (BoW), which achieved an accuracy of 59%; Term Frequency - Inverse Document Frequency (TF-IDF), which achieved 60%; and Word2Vec (skip-gram) with K-Means clustering, which achieved 56% with first-appearance sentence selection and 48% with closest-to-the-center selection.
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. See deployment for notes on how to deploy the project on a live system.
The libraries used for the Bag of Words (BoW) summarizer (a sketch of how they fit together follows the list):
- To access the single article URL link
from urllib import request
- To extract the web information (article content)
from bs4 import BeautifulSoup as bs
- To clean the web information (article content)
import re
- To import the sentence and words tokenizer
import nltk
- To import the Arabic stopwords list
from nltk.corpus import stopwords
- To import the Arabic stemmer (ISRI Arabic stemmer)
from nltk.stem.isri import ISRIStemmer
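A minimal sketch of how these imports combine into a frequency-based (BoW) sentence ranker. The example URL (the deployment-test article), the cleaning regexes, and the number of highlighted sentences are illustrative assumptions, not the project's exact code:

```python
# A rough BoW sentence ranker, assuming an Arabic Wikipedia article as input.
from urllib import request
from bs4 import BeautifulSoup as bs
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.isri import ISRIStemmer

nltk.download("punkt")
nltk.download("stopwords")

# Example article (the one used in the deployment test below)
url = "https://ar.wikipedia.org/wiki/%D8%A7%D9%84%D9%83%D9%88%D9%86"
html = request.urlopen(url).read()
text = " ".join(p.get_text() for p in bs(html, "html.parser").find_all("p"))
text = re.sub(r"\[\d+\]", " ", text)      # drop citation marks
text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace

stemmer = ISRIStemmer()
stop_words = set(stopwords.words("arabic"))
sentences = nltk.sent_tokenize(text)

# Word frequencies over stemmed, stopword-filtered tokens
freq = {}
for word in nltk.word_tokenize(text):
    if word in stop_words or not word.isalpha():
        continue
    stem = stemmer.stem(word)
    freq[stem] = freq.get(stem, 0) + 1

# Score every sentence by the summed frequency of its stems, then keep the top 5
scored = [(sum(freq.get(stemmer.stem(w), 0) for w in nltk.word_tokenize(s)), s)
          for s in sentences]
highlights = [s for _, s in sorted(scored, reverse=True)[:5]]
print(highlights)
```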
The libraries used for the Term Frequency-Inverse Document Frequency (TF-IDF) summarizer (a scoring sketch follows the list):
- To import the sentence and words tokenizer
import nltk
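A minimal sketch of TF-IDF sentence ranking using only nltk, treating each sentence as a "document" when computing IDF. The helper name, scoring, and length normalization are illustrative assumptions:

```python
# TF-IDF sentence ranking with nltk only; "tfidf_rank" is an illustrative
# helper name, not the project's actual function.
import math
import nltk

nltk.download("punkt")

def tfidf_rank(text, top_n=5):
    sentences = nltk.sent_tokenize(text)
    tokenized = [nltk.word_tokenize(s) for s in sentences]
    n_sents = len(tokenized)

    # Document frequency: in how many sentences each word occurs
    df = {}
    for tokens in tokenized:
        for w in set(tokens):
            df[w] = df.get(w, 0) + 1

    scored = []
    for sent, tokens in zip(sentences, tokenized):
        if not tokens:
            continue
        score = sum((tokens.count(w) / len(tokens)) *   # TF
                    math.log(n_sents / df[w])           # IDF
                    for w in tokens)
        scored.append((score / len(tokens), sent))      # length-normalized
    return [s for _, s in sorted(scored, reverse=True)[:top_n]]
```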
The libraries used for the Word2Vec summarizer (a clustering sketch follows the list):
Install the pre-trained 100-dimension skip-gram model (trained on Wikipedia) and the utility file:
github.com/bakrianoo/aravec
install: full_grams_sg_100_wiki
- To access the single article URL link
from urllib import request
- To extract the web information (article content)
from bs4 import BeautifulSoup as bs
- To clean the web information (article content)
import re
- To import the sentence and words tokenizer
import nltk
- To import the Arabic stopwords list
from nltk.corpus import stopwords
- To import and use K-Means clustering (and related utilities)
from sklearn.manifold import TSNE
from sklearn import cluster
from sklearn.metrics import pairwise_distances_argmin_min
from sklearn.cluster import KMeans
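A minimal sketch of the Word2Vec + K-Means step: sentence vectors are built by averaging AraVec word vectors and then clustered, with both sentence-selection strategies (first appearance and closest to the center) shown. Loading the model with gensim under the path "full_grams_sg_100_wiki.mdl" is an assumption based on how AraVec models are distributed:

```python
# Sentence vectors from AraVec (100-dim skip-gram) averaged over word vectors,
# clustered with K-Means; both sentence-selection strategies are shown.
import numpy as np
import nltk
import gensim
from nltk.corpus import stopwords
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

model = gensim.models.Word2Vec.load("full_grams_sg_100_wiki.mdl")  # path assumed
stop_words = set(stopwords.words("arabic"))

def sentence_vector(sent):
    # Average the vectors of in-vocabulary, non-stopword tokens
    words = [w for w in nltk.word_tokenize(sent)
             if w not in stop_words and w in model.wv]
    if not words:
        return np.zeros(model.vector_size)
    return np.mean([model.wv[w] for w in words], axis=0)

def summarize(text, n_clusters=5):
    sentences = nltk.sent_tokenize(text)
    vectors = np.array([sentence_vector(s) for s in sentences])
    kmeans = KMeans(n_clusters=n_clusters, random_state=0).fit(vectors)

    # "Closest to the center": the sentence nearest each cluster centroid
    closest, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, vectors)
    # "First appearance": the earliest sentence assigned to each cluster
    first = [int(np.where(kmeans.labels_ == c)[0][0]) for c in range(n_clusters)]

    return ([sentences[i] for i in sorted(set(closest))],
            [sentences[i] for i in sorted(set(first))])
```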
The Python files are integrated as CGI scripts with the web pages, which were developed using HTML5/CSS3/JavaScript/PHP. Everything except the Word2Vec model is deployed at the following link: http://arabic.highlight.heliohost.org/
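A rough sketch of the CGI wiring, assuming the page posts the article text in a form field named "article"; the rank_sentences stub is only a placeholder for whichever of the three rankers the deployed script uses:

```python
#!/usr/bin/env python
# Minimal CGI wiring sketch: the "article" form field and the rank_sentences
# stub are illustrative, not the project's actual interface.
import cgi
import html

def rank_sentences(text, top_n=5):
    # Placeholder ranker: just keeps the first top_n sentences
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return set(sentences[:top_n])

form = cgi.FieldStorage()
article = form.getvalue("article", "")
top = rank_sentences(article)

print("Content-Type: text/html; charset=utf-8")
print()
parts = [s.strip() for s in article.split(".") if s.strip()]
body = " ".join(f"<mark>{html.escape(s)}.</mark>" if s in top
                else f"{html.escape(s)}." for s in parts)
print(f'<div dir="rtl">{body}</div>')
```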
Deployment Testing
- Input: the article text or its link, e.g. https://ar.wikipedia.org/wiki/%D8%A7%D9%84%D9%83%D9%88%D9%86
- Anaconda Spyder - The Python environment used
- HelioHost (Ricky) - The web host server
- VisualStudioCode - The web code editor
- Raghad Alshaikh - RaghadAlshaikh
- Ghaidaa Aflah - GhaidaaAflah
- Nada Alamouadi - NadaAlamouadi
Supervised by: Dr. Amal Almansour - AmalAlmansour
- The Arabic Word2Vec Pre-trained model: github.com/bakrianoo/aravec
- The EASC dataset: sourceforge.net/projects/easc-corpus/