NLP Capstone Project YouTube Demo Link
Gerson Lehrman Group (GLG) is a financial and information services firm. It is an insight network that connects decision makers to a network of experts so they can act with the confidence that comes from true clarity. GLG receives a large number of requests (including requests related to health and tech) from clients seeking insights on different topics. Manually preprocessing these client requests and extracting relevant topics/keywords is time-consuming and labor-intensive. This project uses Natural Language Processing (NLP) to improve topic/keyword detection from client-submitted reports and to identify the underlying patterns in submitted requests over time. The primary challenges are Named Entity Recognition (NER) and pattern recognition for hierarchical clustering of topics.
The purpose of this project is to develop an NLP model capable of recognizing and clustering topics related to technological and healthcare terms given a large text corpus and to develop an NER model capable of extracting entities from a given sentence.
- Python
- NumPy/pandas
- Scikit-learn
- Matplotlib
- Keras
- PyTorch
- Seaborn
- Streamlit
- Language Models
- SBERT
- NLTK
- Jupyter Notebook
- Visual Studio Code
- All the News 2.0 — This dataset contains 2,688,878 news articles and essays from 27 American publications, spanning January 1, 2016 to April 2, 2020.
- Annotated Corpus for NER — An annotated corpus for Named Entity Recognition based on the GMB (Groningen Meaning Bank) corpus, with enhanced and popular NLP features applied to the dataset. The entities in this dataset are (a loading sketch follows the list):
- geo = Geographical Entity
- org = Organization
- per = Person
- gpe = Geopolitical Entity
- tim = Time Indicator
- art = Artifact
- eve = Event
- nat = Natural Phenomenon
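As an illustration, this corpus is commonly distributed as a word-per-row CSV with BIO-style tags over the entity types above. A minimal loading sketch, assuming the usual Kaggle file layout (`ner_dataset.csv` with `Sentence #`, `Word`, `POS`, and `Tag` columns):

```python
# Sketch of loading the annotated corpus with pandas; the file name, encoding,
# and column names follow the common Kaggle distribution (an assumption).
import pandas as pd

df = pd.read_csv("ner_dataset.csv", encoding="latin1").ffill()  # forward-fill sentence ids
print(df[["Sentence #", "Word", "POS", "Tag"]].head())
# Tags use a BIO scheme over the entity types above, e.g. B-geo / I-geo,
# with O marking tokens outside any entity.
```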
Topic models are useful tools for discovering latent topics in collections of documents. In the sections below, we detail the various parts of the topic modeling pipeline, with highlights and key findings.
Data Cleaning and Data Exploration: The first step in the pipeline is cleaning and exploring the news article dataset. From the original data we extract only the news articles in the health and technology sections. We then apply several text-cleaning steps (see the sketch after this list):
- Punctuation and non-alphanumeric character removal.
- Tokenization: split the text into sentences and the sentences into words, lowercasing the words.
- Removal of words with fewer than 3 characters.
- Stopword removal.
- Lemmatization.
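A minimal sketch of these cleaning steps with NLTK; the function name and exact regex are illustrative, not the project's actual code:

```python
# Illustrative preprocessing pipeline: lowercase, strip non-alphanumeric
# characters, tokenize, drop short words and stopwords, then lemmatize.
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

STOPWORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean_text(text: str) -> list[str]:
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())    # punctuation / non-alphanumeric removal
    tokens = nltk.word_tokenize(text)                   # tokenization
    tokens = [t for t in tokens if len(t) >= 3]         # drop words shorter than 3 characters
    tokens = [t for t in tokens if t not in STOPWORDS]  # stopword removal
    return [lemmatizer.lemmatize(t) for t in tokens]    # lemmatization

print(clean_text("U.S. hospitals are adopting new AI-powered diagnostic tools."))
```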
Document Embedding: We embed documents to create representations in vector space that can be compared semantically. We assume that documents containing the same topic are semantically similar. To perform the embedding step, we first extract the sentences in each document using the NLTK sentence tokenizer, then apply the [Sentence-BERT (SBERT) framework](https://arxiv.org/abs/1908.10084) to each sentence to generate a vector representation per sentence, and finally combine the sentence vectors into a single embedding vector for the document. These embeddings, however, are primarily used to cluster semantically similar documents and are not directly used in generating the topics.
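A sketch of this step using the `sentence-transformers` package; the checkpoint name is an assumption (any 768-dimensional SBERT model matches the write-up), and the sentence vectors are pooled by simple averaging here for illustration:

```python
# Illustrative document embedding: sentence-split with NLTK, encode with SBERT,
# pool sentence vectors into one document vector.
import numpy as np
import nltk
from sentence_transformers import SentenceTransformer

nltk.download("punkt")
model = SentenceTransformer("all-mpnet-base-v2")  # assumed checkpoint, 768-dim embeddings

def embed_document(document: str) -> np.ndarray:
    sentences = nltk.sent_tokenize(document)  # split the document into sentences
    sentence_vecs = model.encode(sentences)   # one SBERT vector per sentence
    return sentence_vecs.mean(axis=0)         # pool into a single document vector

documents = [
    "A new vaccine trial showed promising results in early testing.",
    "The chipmaker unveiled a faster processor for data centers.",
]
doc_embeddings = np.vstack([embed_document(d) for d in documents])
```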
Feature Reduction: The document embedding step above produces a 768-dimensional dense vector per document. Working with such high-dimensional vectors is computationally heavy and complex, so we apply a dimensionality reduction technique called UMAP ([Uniform Manifold Approximation and Projection](http://arxiv.org/abs/1802.03426)) to reduce the number of features without losing important information.
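A sketch of the reduction step with the `umap-learn` package; the parameter values are illustrative defaults, not the project's tuned settings:

```python
# Illustrative dimensionality reduction of SBERT document vectors with UMAP.
import numpy as np
import umap

doc_embeddings = np.random.rand(1000, 768)  # stand-in for the SBERT document vectors

reducer = umap.UMAP(
    n_neighbors=15,    # size of the local neighborhood preserved
    n_components=5,    # reduce 768 dimensions to 5 before clustering
    metric="cosine",   # cosine distance suits text embeddings
    random_state=42,
)
reduced_embeddings = reducer.fit_transform(doc_embeddings)
```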
Document Clustering: Finally, we apply the [HDBSCAN](https://www.theoj.org/joss-papers/joss.00205/10.21105.joss.00205.pdf) (hierarchical density-based clustering) algorithm to extract clusters of semantically similar documents. It is an extension of DBSCAN that finds clusters of varying densities by converting DBSCAN into a hierarchical clustering algorithm. HDBSCAN models clusters using a soft-clustering approach, allowing noise to be modeled as outliers. This prevents unrelated documents from being assigned to any cluster and is expected to improve topic representations.
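A sketch of the clustering step with the `hdbscan` package, again with illustrative parameters:

```python
# Illustrative density-based clustering of the reduced document vectors.
import numpy as np
import hdbscan

reduced_embeddings = np.random.rand(1000, 5)  # stand-in for the UMAP output

clusterer = hdbscan.HDBSCAN(
    min_cluster_size=15,             # smallest group still treated as a cluster
    metric="euclidean",
    cluster_selection_method="eom",  # "excess of mass" selection over the hierarchy
)
labels = clusterer.fit_predict(reduced_embeddings)
# Label -1 marks noise/outliers, which are never forced into a cluster.
```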
Topic Representation: Topic representations are modeled from the documents in each cluster, where each cluster is assigned more than one global and local topic. Using the HDBSCAN algorithm, we access the hierarchical structure of the documents in each cluster; that is, within each cluster the documents are distributed in a parent-child hierarchy. Therefore, for each cluster we extract global and local topics by applying the [LDA](https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf) (Latent Dirichlet Allocation) model to those documents. Thus, we have two LDA models per cluster, responsible for generating global and local topics for the parent and child documents respectively.
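As a sketch, each of the two per-cluster LDA models could be fit as below with scikit-learn (the helper name, topic count, and vectorizer settings are illustrative); the same helper would be run once on a cluster's parent documents and once on its child documents:

```python
# Illustrative per-cluster topic extraction: fit one LDA model on a cluster's
# documents and return the top words of each topic.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def cluster_topics(cluster_docs: list[str], n_topics: int = 5, n_words: int = 8):
    vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words="english")
    doc_term = vectorizer.fit_transform(cluster_docs)  # bag-of-words matrix
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=42)
    lda.fit(doc_term)
    vocab = vectorizer.get_feature_names_out()
    # Rank each topic's words by their weight in the topic-word distribution.
    return [[vocab[i] for i in topic.argsort()[::-1][:n_words]]
            for topic in lda.components_]
```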
NER is a widely used NLP technique that recognizes entities contained in a piece of text, commonly people, organizations, locations, etc. This project also includes an NER model implemented with BERT and the Hugging Face PyTorch `transformers` library, used to quickly and efficiently fine-tune BERT to state-of-the-art performance on Named Entity Recognition. The `transformers` package provides a `BertForTokenClassification` class for token-level predictions. `BertForTokenClassification` is a fine-tuning model that wraps `BertModel` and adds a token-level classifier on top of it. The token-level classifier is a linear layer that takes the last hidden state of the sequence as input.
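A minimal inference sketch with `BertForTokenClassification`; the checkpoint name is an assumption, the label set mirrors the GMB tags above, and the classification head shown here is untrained, so meaningful predictions require fine-tuning first:

```python
# Illustrative token-level NER inference with Hugging Face transformers.
import torch
from transformers import BertTokenizerFast, BertForTokenClassification

labels = ["O", "B-geo", "I-geo", "B-org", "I-org", "B-per", "I-per",
          "B-gpe", "I-gpe", "B-tim", "I-tim", "B-art", "I-art",
          "B-eve", "I-eve", "B-nat", "I-nat"]

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
model = BertForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=len(labels)  # linear head over the last hidden states
)
# NOTE: the head is randomly initialized here; fine-tune on the corpus before use.

sentence = "Apple opened a new office in London last year."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                  # shape: (1, seq_len, num_labels)
predictions = logits.argmax(dim=-1)[0].tolist()      # best label id per token
for token, pred in zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]), predictions):
    print(f"{token:12s} {labels[pred]}")
```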
Below is an example of the input and output of our named entity model, served with a Streamlit app.
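A minimal sketch of such a Streamlit app; `predict_entities` is a hypothetical stand-in for the project's actual model-serving call:

```python
# Illustrative Streamlit front end for the NER model.
import streamlit as st

def predict_entities(text: str) -> list[tuple[str, str]]:
    # Placeholder: in the real app this would call the fine-tuned BERT NER model.
    return [(token, "O") for token in text.split()]

st.title("Named Entity Recognition Demo")
text = st.text_area("Enter a sentence:")
if st.button("Extract entities"):
    st.table(predict_entities(text))  # one (token, tag) row per word
```

Saved as `app.py`, this runs with `streamlit run app.py`.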
- Install Git LFS (for an installation guide, see the tutorial)
- Install Docker (for an installation guide, see the tutorial)
- Install Docker Compose (for an installation guide, see the tutorial)
To package the whole solution, which uses multiple images/containers, I used Docker Compose. Please follow the steps below for a successful installation.
- Clone the repo
```sh
git lfs clone https://github.com/kedir/GLG--Topic-Modeling-and-Document-Clustering.git
```
- Go to the project directory
```sh
cd GLG--Topic-Modeling-and-Document-Clustering
```
- Create a bridge network

  Since we have multiple containers communicating with each other, I created a bridge network called AIservice. First create the network by running this command:
```sh
docker network create AIservice
```
- Run the whole application by executing this command:
```sh
docker-compose up -d --build
```
Frontend app with Streamlit
You can view the frontend app in your browser at http://localhost:8501/. If you are launching the app in the cloud, replace localhost with your public IP address.
For more examples, please refer to the Documentation.
Contributions, issues, and feature requests are welcome!
Give a ⭐️ if you like this project!
Distributed under the MIT License. See LICENSE.txt for more information.
- Kedir Ahmed - @linkedin - kedirhamid@gmail.com
- Ranganai Gwati - ranganaigwati@gmail.com
- Aklilu Gebremichail - akliluet@gmail.com
- Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
- UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction
- HDBSCAN: Hierarchical Density-Based Clustering
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- BERTopic: Neural Topic Modeling with a Class-Based TF-IDF Procedure
- Named Entity Recognition with BERT
- Best-README-Template