gstdl/Amazon-IPhone-XR-Product-Review-Topic-Modeling

Introduction

Welcome to this repository. It contains my code for a Topic Modeling project. Currently, I'm using Latent Dirichlet Allocation (LDA) to build the model.

Why Should I Do Topic Modeling?

Scenario

Imagine a scenario where you run a customer service center. Your company receives at least 100 complaints a day, and the average resolution time is 1 hour per complaint. That is at least 100 hours of work per day, so if one employee can work only 10 hours a day, you'll need to employ 10 people to handle all the complaints, which cover very diverse topics.

As an employer, you have built a mechanism to categorize your customers' complaints. However, most of the time your customers don't care about the mechanism and submit their complaints on any page, as long as the page says "complaint".
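The staffing arithmetic in this scenario can be sketched as a small calculation. The function name `employees_needed` and its parameters are hypothetical and used only for illustration; the numbers are the ones from the scenario above.

```python
import math


def employees_needed(complaints_per_day, hours_per_complaint, hours_per_employee_per_day):
    """Return the minimum headcount needed to resolve one day's complaints.

    Hypothetical helper for illustration; it assumes work can be split
    freely between employees.
    """
    total_hours = complaints_per_day * hours_per_complaint
    # Round up: a fraction of an employee still means one more person.
    return math.ceil(total_hours / hours_per_employee_per_day)


# Scenario from the text: 100 complaints, 1 hour each, 10-hour workday.
print(employees_needed(100, 1.0, 10))
```

Under the same assumptions, halving the average resolution time to 0.5 hours would halve the required headcount.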

Problem

  1. Customers' complaints are not categorized properly, even though there is a mechanism meant to prevent this.
  2. Employees might spend extra time resolving complaints if they have to face too many diverse topics. In simple words, an employee might spend most of his/her time understanding the varying complaints. It would be much faster if he/she could work on one particular topic at a time.

Solution

  1. Develop a classification model to identify whether a complaint is correctly categorized. This can be done using Logistic Regression, Support Vector Machines, Tree-based Models, Deep Learning Models, or Ensemble Models. However, this is feasible only when the data is already labelled or annotated.
  2. Use an unsupervised machine learning model such as Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA, also known as Latent Semantic Indexing, LSI), Hierarchical Dirichlet Process (HDP), or other clustering methods. Unlike the first option, this option remains feasible even when the data is unlabelled.
  3. A simple if-then rule. If the solutions above don't give a satisfactory topic modeling result, or you think they're overkill, you can always write if-then rules to categorize your complaints.
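The third option, a simple if-then rule, can be sketched as a keyword lookup. This is only an illustration and is not code from this repository: the `RULES` table, its category names, its keywords, and the `categorize` function are all assumptions.

```python
import re

# Hypothetical keyword rules; real categories would come from your own business.
RULES = {
    "battery": ("battery", "charge", "charging"),
    "camera": ("camera", "photo", "picture"),
    "delivery": ("delivery", "shipping", "package"),
}


def categorize(text, rules=RULES, default="other"):
    """Return the first category whose keywords appear in the text."""
    # Lowercase and split into alphabetic words so matching ignores case
    # and punctuation.
    words = set(re.findall(r"[a-z]+", text.lower()))
    for category, keywords in rules.items():
        if words & set(keywords):
            return category
    return default


print(categorize("The battery drains too fast"))
print(categorize("Package arrived two weeks late"))
```

Rules like these are easy to read and audit, but they only match exact words, so they miss misspellings and synonyms. That limitation is one reason to consider the model-based options above.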

Business Impact

Complaints received will be pre-sorted based on the model output, so employees can resolve issues more quickly when they handle complaints on similar topics consecutively. If the average complaint resolution time is reduced, fewer employees are needed to operate the customer service center, which lowers operational cost.
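A minimal sketch of the pre-sorting idea, assuming each complaint already carries a topic label produced by some model. The `batch_by_topic` function and the sample labels are hypothetical and not part of this repository.

```python
from itertools import groupby


def batch_by_topic(complaints):
    """Group (topic, text) pairs so complaints on the same topic sit together.

    Python's sort is stable, so complaints keep their arrival order
    within each topic.
    """
    ordered = sorted(complaints, key=lambda item: item[0])
    return {
        topic: [text for _, text in group]
        for topic, group in groupby(ordered, key=lambda item: item[0])
    }


# Hypothetical queue in arrival order, with topic labels from a model.
queue = [
    ("camera", "Photos look blurry"),
    ("battery", "Battery drains overnight"),
    ("camera", "Camera app crashes"),
]

for topic, texts in batch_by_topic(queue).items():
    print(topic, texts)
```

Each employee can then take one batch and work through a single topic in a row instead of switching context on every complaint.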

A Work in Progress

Keep in mind that this is an unfinished project. It contains code only. Please expect the following updates to this repository:

  1. Documentation
  2. Topic Modeling using other methods (Low priority)
  3. A deployed web app on Heroku

You can check this Kanban Board for more details.

Data Source

The data used in this project was scraped from Amazon India's website. It consists of iPhone XR reviews obtained from this page. The scraped data is saved in the scraped_data folder.

Why am I using review data when the practical example is about customer complaints?

I'm using review data because complaint data is not easy to obtain and the procedure to build the model will be almost the same.

Skills Demonstrated

  1. Data Scraping / Crawling (Using BeautifulSoup)
  2. Data Visualization (Using Wordcloud, Matplotlib, & Plotly)
  3. Data Cleaning (Using Pandas, NumPy, spaCy, & NLTK)
  4. Natural Language Processing - Topic Modeling (Using gensim)
  5. Object Oriented Programming (in Python)

Accessing the Jupyter Notebooks

The Jupyter Notebooks contain the step-by-step procedure for completing this project. They explain my thought process and the reasoning behind certain pieces of code.

Please open the Jupyter Notebooks using Google Colab or by visiting the URLs listed below. Google Colab is required because some plots do not render outside Google Colab's Python environment. If you insist on using a Python environment outside Google Colab, you'll need to delete one code cell and rerun the Jupyter Notebook.

  1. Building the Data Scraper
  2. Pre-Modeling Data Analysis & Data Cleaning
  3. Topic Modeling (LDA)

About

Developing Machine Learning Model to Categorize Text Data
