Classifying text segments from semantically close books by book name, then analyzing the misclassified segments with XAI

hosnaa/NLP_Gutenberg_Book_Classification

NLP_Gutenberg_Book_Classification

Problem Overview

In this task, we use semantically close books from Project Gutenberg and aim to classify each text segment by the name of the book it comes from.

Libraries and Dependencies

The library versions are the defaults provided by Google Colab.

  • Python
  • NLTK
  • scikit-learn
  • eli5
  • LIME
  • Matplotlib
  • Jupyter / Spyder / Colab

Output Example:

An example of the eli5 output showing the top 10 words for each of the 5 books: image

Steps:

  1. Start with 5 books that are semantically close to each other.
  2. Extract 200 samples from each book; each sample comprises 100 words.
  3. Preprocess these segments:
     • Tokenization
     • Punctuation and stop-word removal
     • Lowercasing
     • Lemmatization
  4. Feature engineering on the clean data from (3):
     • Bag of Words
     • TF-IDF
  5. Split the data into train/test sets (80/20) and apply 10-fold cross-validation.
  6. Modelling on the TF-IDF features:
     • Decision Tree
     • KNN
     • SVM
     • Logistic Regression
  7. Evaluation: accuracy and bias-variance tradeoff.
  8. Error analysis of the misclassified segments:
     • eli5
     • LIME
  9. Insights and analysis, then modify some hyperparameters (e.g. number of words per segment) and retrain.
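
The sampling and cleaning steps above could be sketched roughly as follows. This is a minimal illustration using only the standard library, not the repository's code: the function names, the random-window sampling strategy, and the tiny `STOP_WORDS` set are all assumptions. The project itself relies on NLTK for tokenization, stop words, and lemmatization (lemmatization is left out here).

```python
import random
import string

# Hypothetical stop-word subset for illustration; NLTK's list is much larger.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "was"}


def sample_segments(text, n_segments=200, segment_len=100, seed=0):
    """Draw n_segments random windows of segment_len consecutive words.

    Assumption: windows are sampled at random and may overlap.
    """
    words = text.split()
    if len(words) < segment_len:
        raise ValueError("text is shorter than one segment")
    rng = random.Random(seed)
    last_start = len(words) - segment_len
    starts = [rng.randint(0, last_start) for _ in range(n_segments)]
    return [words[s:s + segment_len] for s in starts]


def clean_tokens(tokens):
    """Lowercase, strip punctuation, and drop stop words and empty tokens."""
    table = str.maketrans("", "", string.punctuation)
    cleaned = (tok.lower().translate(table) for tok in tokens)
    return [tok for tok in cleaned if tok and tok not in STOP_WORDS]
```

Calling `sample_segments(book_text)` on each book and labelling every segment with its book name would give the 5 × 200 labelled samples described in the steps; changing `segment_len` is one way to vary the number of words per segment when retraining.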
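
The feature-engineering, splitting, and modelling steps above might look something like the scikit-learn sketch below. It assumes each segment has already been cleaned and joined back into a single string. The helper name `evaluate_models`, the hyperparameters, and the choice of a linear-kernel `SVC` for the SVM are illustrative assumptions, since the README does not specify them.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def evaluate_models(segments, labels, n_folds=10, seed=0):
    """Return {model name: (mean CV accuracy, held-out test accuracy)}."""
    # 80/20 split, stratified so every book is represented in both parts.
    x_train, x_test, y_train, y_test = train_test_split(
        segments, labels, test_size=0.2, stratify=labels, random_state=seed
    )
    # Hyperparameters here are placeholders, not the repository's settings.
    models = {
        "decision_tree": DecisionTreeClassifier(random_state=seed),
        "knn": KNeighborsClassifier(n_neighbors=5),
        "svm": SVC(kernel="linear"),
        "logistic_regression": LogisticRegression(max_iter=1000),
    }
    results = {}
    for name, clf in models.items():
        # Fitting TF-IDF inside the pipeline refits it on each CV training fold.
        pipe = make_pipeline(TfidfVectorizer(), clf)
        cv_acc = cross_val_score(pipe, x_train, y_train, cv=n_folds).mean()
        test_acc = pipe.fit(x_train, y_train).score(x_test, y_test)
        results[name] = (cv_acc, test_acc)
    return results
```

A large gap between the cross-validation accuracy and the held-out accuracy for a model is one rough signal to look at when reasoning about the bias-variance tradeoff. Swapping `TfidfVectorizer` for `CountVectorizer` would give the Bag of Words variant.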