Medical Search Engine using Word2Vec and FastText with Gensim

Business Objective

In Natural Language Processing (NLP), extracting context from text data is a significant challenge. Word embeddings, which represent words as semantically meaningful dense vectors, address many limitations of techniques such as one-hot encoding and TF-IDF. They improve generalization and performance in downstream NLP applications, even with limited data. Word embedding is a feature-learning technique that maps words or phrases in the vocabulary to vectors of real numbers, capturing contextual relationships.

General-purpose word embeddings may not perform well across all domains, so this project focuses on creating domain-specific medical word embeddings using Word2Vec and FastText in Python. Word2Vec is a family of models for learning distributed word representations, while FastText is an efficient library for learning word representations and sentence classification developed by the Facebook AI Research team.
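
Both models can be trained with Gensim. The sketch below is a minimal illustration rather than the repository's exact training code: the toy corpus and the hyperparameters (vector_size, window, epochs) are assumptions, and in the project the sentences come from the preprocessed Title and Abstract columns.

```python
from gensim.models import Word2Vec, FastText

# Toy tokenized corpus; in the project this is built from the
# preprocessed Title and Abstract columns of the dataset.
sentences = [
    ["randomized", "trial", "of", "antiviral", "treatment", "for", "covid"],
    ["observational", "study", "of", "vaccine", "efficacy", "in", "adults"],
]

# Skip-gram Word2Vec (sg=1 selects skip-gram instead of CBOW).
w2v = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1, epochs=10)

# FastText also learns character n-gram (subword) information,
# so it can build vectors for words it never saw during training.
ft = FastText(sentences, vector_size=100, window=5, min_count=1, sg=1, epochs=10)

print(w2v.wv["covid"][:5])          # raw embedding values
print(ft.wv.most_similar("trial"))  # nearest neighbours in embedding space
```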

The project's ultimate goal is to use the trained Skip-gram and FastText models to build a search engine with a Streamlit user interface.


Data Description

For this project, we are using a clinical trials dataset related to Covid-19. You can access the dataset here. The dataset comprises 10,666 rows and 21 columns, of which two are essential (loaded in the sketch below):

  • Title
  • Abstract
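
As a minimal sketch, the two columns can be loaded with pandas. The file path below is an assumption based on the input folder described later in this README.

```python
import pandas as pd

# Load the Covid-19 clinical trials dataset (path is an assumption
# based on the repository's input folder).
df = pd.read_csv("input/Dimension-covid.csv")

# Only the Title and Abstract columns are needed for the search engine.
corpus = df[["Title", "Abstract"]].dropna()
print(corpus.shape)
```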

Aim

The project's objective is to train Skip-gram and FastText models to learn word embeddings and then build a search engine for the clinical trials dataset with a Streamlit user interface.


Tech Stack

  • Language: Python
  • Libraries and Packages: pandas, numpy, matplotlib, plotly, gensim, streamlit, nltk

Approach

  1. Import the required libraries.
  2. Read the dataset.
  3. Data preprocessing (see the preprocessing sketch after this list):
    • Remove URLs
    • Convert text to lowercase
    • Remove numerical values
    • Remove punctuation
    • Tokenization
    • Remove stop words
    • Lemmatization
    • Remove '\n' characters from the columns.
  4. Exploratory Data Analysis (EDA):
    • Data visualization using a word cloud.
  5. Train the 'Skip-gram' model.
  6. Train the 'FastText' model.
  7. Explore the model embeddings with word-similarity queries.
  8. Create PCA plots for the Skip-gram and FastText embeddings.
  9. Convert each abstract and title to a vector using the Skip-gram and FastText models.
  10. Compute cosine similarity between query and document vectors.
  11. Pre-process the input query in the same way as the corpus.
  12. Define a function to return the top 'n' most similar results (see the search sketch after this list).
  13. Evaluate the results.
  14. Run the Streamlit application (a minimal UI sketch follows this list).
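
The preprocessing in step 3 (and the query preprocessing in step 11) can be expressed as a single helper. This is a minimal sketch using NLTK, not the repository's exact code; the function name preprocess and the regex patterns are illustrative assumptions.

```python
import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time downloads of the NLTK resources used below.
nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def preprocess(text):
    """Clean one Title/Abstract string and return a list of tokens."""
    text = text.replace("\n", " ")                       # drop newline characters
    text = re.sub(r"http\S+|www\.\S+", " ", text)        # remove URLs
    text = text.lower()                                  # lowercase
    text = re.sub(r"\d+", " ", text)                     # remove numbers
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    tokens = word_tokenize(text)                         # tokenize
    tokens = [t for t in tokens if t not in STOP_WORDS]  # remove stop words
    return [LEMMATIZER.lemmatize(t) for t in tokens]     # lemmatize

print(preprocess("A randomized trial of 2 antiviral drugs: https://example.org"))
```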
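Steps 9 to 12 reduce to averaging word vectors per document and ranking documents by cosine similarity against the query vector. The sketch below assumes the preprocess helper, the trained models (w2v, ft), and the corpus DataFrame from the earlier sketches; the function names document_vector, cosine_similarity, and top_n are illustrative.

```python
import numpy as np

def document_vector(model, tokens):
    """Average the word vectors of the tokens present in the model's vocabulary."""
    vectors = [model.wv[t] for t in tokens if t in model.wv]
    if not vectors:
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)

def cosine_similarity(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0

def top_n(query, model, doc_vectors, titles, n=10):
    """Return the n documents most similar to the query."""
    query_vec = document_vector(model, preprocess(query))
    scores = [cosine_similarity(query_vec, d) for d in doc_vectors]
    return sorted(zip(titles, scores), key=lambda x: x[1], reverse=True)[:n]

# Precompute one vector per document (here with the Skip-gram model), then search.
doc_vectors = [document_vector(w2v, preprocess(a)) for a in corpus["Abstract"]]
print(top_n("antiviral treatment for covid", w2v, doc_vectors, corpus["Title"].tolist(), n=5))
```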
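Step 14 can be served with a few lines of Streamlit. This is a minimal UI sketch, assuming the models, corpus, and helpers from the sketches above are available in the script; the actual UI lives in lib/Medical.py.

```python
import streamlit as st

st.title("Clinical Trials Search Engine")

query = st.text_input("Enter a search query")
model_name = st.radio("Embedding model", ["Skip-gram", "FastText"])

if query:
    model = w2v if model_name == "Skip-gram" else ft
    # Document vectors must come from the same model as the query vector;
    # in practice they would be precomputed and cached rather than rebuilt per query.
    doc_vectors = [document_vector(model, preprocess(a)) for a in corpus["Abstract"]]
    results = top_n(query, model, doc_vectors, corpus["Title"].tolist(), n=10)
    st.subheader("Top results")
    for title, score in results:
        st.write(f"{score:.3f} | {title}")
```

The app is launched with `streamlit run lib/Medical.py` (adjust the path to wherever the script lives).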

Modular Code Overview

  1. input: Contains the data used for analysis, a Covid-19 clinical trials dataset (Dimension-covid.csv).

  2. src: The most important folder, containing the modularized code for all steps of the pipeline. It includes:

    • engine.py
    • ML_pipeline: A folder with the pipeline functions split across appropriately named Python files; these functions are called from engine.py.
  3. output: Contains the best-fitted models trained on this data, which can be loaded and used in future applications without retraining from scratch.

  4. lib: A reference folder with:

    • The original Jupyter (IPython) notebook.
    • The Medical.py script for running the Streamlit UI.
