Welcome! This repository contains the code for a project on Topic Modeling. Currently, I'm using Latent Dirichlet Allocation (LDA) to build the model.
Imagine a scenario where you run a customer service center and your company receives at least 100 complaints a day, with an average resolution time of 1 hour per complaint. If one employee can only work 10 hours a day, you'll need to employ 10 people to handle all the complaints, which cover very diverse topics.
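The staffing math in this scenario can be sketched in a few lines; the numbers are the hypothetical ones from the example above:

```python
import math

def employees_needed(complaints_per_day, hours_per_complaint, hours_per_employee):
    """Minimum head count to clear the daily complaint queue."""
    total_hours = complaints_per_day * hours_per_complaint
    return math.ceil(total_hours / hours_per_employee)

# 100 complaints/day at 1 hour each, 10 working hours per employee -> 10 people
print(employees_needed(100, 1, 10))
```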
As an employer, you have built a mechanism to categorize your customers' complaints. However, most of the time, customers don't care about the mechanism and submit their complaints on whatever random page says "complaint".
- Customers' complaints are not categorized properly, even though a mechanism exists to prevent this situation.
- Employees might spend extra time resolving complaints if they have to deal with too many diverse topics. In simple words, an employee might spend most of his/her time just understanding the varying complaints. Resolution is much faster when he/she can focus on a particular topic for several complaints in a row.
- Develop a classification model to identify whether a complaint is correctly categorized. This can be done using Logistic Regression, Support Vector Machines, tree-based models, deep learning models, or ensemble models. However, this is feasible only when the data is already labelled or annotated.
- Use an unsupervised machine learning model such as Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA, also known as LSI), Hierarchical Dirichlet Process (HDP), or another clustering method. Unlike the first option, this one is feasible even when the data is unlabelled.
- A simple if-then rule. If the solutions above don't produce satisfying topic modeling results, or you think they're overkill, you can always write an if-then rule to categorize your complaints.
Complaints received will be pre-sorted based on the model output, and employees can resolve issues more quickly when they handle complaints with similar topics consecutively. If the average resolution time is reduced, fewer employees are needed to operate the customer service center, thus minimizing operational cost.
Keep in mind that this is an unfinished project. It contains code only. Please expect the following updates to this repository:
- Documentation
- Topic Modeling using other methods (Low priority)
- A deployed web app in Heroku
You can check this Kanban Board for more details.
The data used in this project is review data for the iPhone XR, scraped from Amazon India's website (obtained from this page). The scraped data is saved in the scraped_data folder.
Why am I using review data when the practical example is about customer complaints?
I'm using review data because complaint data is hard to obtain, and the procedure for building the model is almost the same.
- Data Scraping / Crawling (Using BeautifulSoup)
- Data Visualization (Using Wordcloud, Matplotlib, & Plotly)
- Data Cleaning (Using Pandas, Numpy, spacy, & nltk)
- Natural Language Processing - Topic Modeling (Using gensim)
- Object Oriented Programming (in Python)
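The notebooks use spacy and nltk for the cleaning step; as a dependency-free sketch of the same idea (the stopword list below is a tiny made-up sample, not nltk's full list), the basic steps look like:

```python
import re

STOPWORDS = {"the", "a", "is", "and", "it", "i", "my", "to", "of"}  # sample only

def clean(text):
    """Lowercase, strip punctuation, tokenize, and drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)   # keep letters only
    tokens = text.split()
    return [t for t in tokens if t not in STOPWORDS]

print(clean("The battery of my iPhone XR is GREAT!!"))
# -> ['battery', 'iphone', 'xr', 'great']
```

A real pipeline would also lemmatize each token (e.g. with spacy) before feeding the lists into gensim.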
The Jupyter Notebooks contain step-by-step procedures for completing this project. They explain my thought process and the reasoning behind certain pieces of code.
Please open the Jupyter Notebooks using Google Colab or by visiting the URLs listed below. Google Colab is required because some plots do not render outside Google Colab's Python environment. If you insist on using a Python environment outside Google Colab, you'll need to delete one code cell and rerun the Jupyter Notebook.