Medical Search Engine using Word2Vec and FastText with Gensim

Business Objective

In Natural Language Processing (NLP), extracting context from text data is a significant challenge. Word embeddings, which represent words as semantically meaningful dense vectors, address many limitations of techniques such as one-hot encoding and TF-IDF. They improve generalization and performance in downstream NLP applications, even with limited data. Word embedding is a feature-learning technique that maps words or phrases in the vocabulary to vectors of real numbers, capturing contextual relationships.

General-purpose word embeddings may not perform well across all domains, so this project focuses on creating domain-specific medical word embeddings using Word2Vec and FastText in Python. Word2Vec is a family of models for learning distributed word representations, while FastText is an efficient library for learning word representations and sentence classification developed by the Facebook AI Research team.
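
Both models can be trained with Gensim. The sketch below is a minimal illustration rather than the repository's exact training code: the toy corpus and the hyperparameters (vector_size, window, epochs) are assumptions, and in the project the sentences come from the preprocessed Title and Abstract columns.

```python
from gensim.models import Word2Vec, FastText

# Toy tokenized corpus; in the project this is built from the
# preprocessed Title and Abstract columns of the dataset.
sentences = [
    ["randomized", "trial", "of", "antiviral", "treatment", "for", "covid"],
    ["observational", "study", "of", "vaccine", "efficacy", "in", "adults"],
]

# Skip-gram Word2Vec (sg=1 selects skip-gram instead of CBOW).
w2v = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1, epochs=10)

# FastText also learns character n-gram (subword) information,
# so it can build vectors for words it never saw during training.
ft = FastText(sentences, vector_size=100, window=5, min_count=1, sg=1, epochs=10)

print(w2v.wv["covid"][:5])          # raw embedding values
print(ft.wv.most_similar("trial"))  # nearest neighbours in embedding space
```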

The project's ultimate goal is to use the trained Skip-gram and FastText models to build a search engine with a Streamlit user interface.


Data Description

For this project, we are using a clinical trials dataset related to Covid-19. You can access the dataset here. The dataset comprises 10,666 rows and 21 columns, of which two are essential (loaded in the sketch below):

  • Title
  • Abstract
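
As a minimal sketch, the two columns can be loaded with pandas. The file path below is an assumption based on the input folder described later in this README.

```python
import pandas as pd

# Load the Covid-19 clinical trials dataset (path is an assumption
# based on the repository's input folder).
df = pd.read_csv("input/Dimension-covid.csv")

# Only the Title and Abstract columns are needed for the search engine.
corpus = df[["Title", "Abstract"]].dropna()
print(corpus.shape)
```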

Aim

The project's objective is to train Skip-gram and FastText models to learn word embeddings and then build a search engine for the clinical trials dataset with a Streamlit user interface.


Tech Stack

  • Language: Python
  • Libraries and Packages: pandas, numpy, matplotlib, plotly, gensim, streamlit, nltk

Approach

  1. Import the required libraries.
  2. Read the dataset.
  3. Data preprocessing (see the preprocessing sketch after this list):
    • Remove URLs
    • Convert text to lowercase
    • Remove numerical values
    • Remove punctuation
    • Tokenization
    • Remove stop words
    • Lemmatization
    • Remove '\n' characters from the columns.
  4. Exploratory Data Analysis (EDA):
    • Data visualization using a word cloud.
  5. Train the 'Skip-gram' model.
  6. Train the 'FastText' model.
  7. Explore the model embeddings with word-similarity queries.
  8. Create PCA plots for the Skip-gram and FastText embeddings.
  9. Convert each abstract and title to a vector using the Skip-gram and FastText models.
  10. Compute cosine similarity between query and document vectors.
  11. Pre-process the input query in the same way as the corpus.
  12. Define a function to return the top 'n' most similar results (see the search sketch after this list).
  13. Evaluate the results.
  14. Run the Streamlit application (a minimal UI sketch follows this list).
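
The preprocessing in step 3 (and the query preprocessing in step 11) can be expressed as a single helper. This is a minimal sketch using NLTK, not the repository's exact code; the function name preprocess and the regex patterns are illustrative assumptions.

```python
import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time downloads of the NLTK resources used below.
nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def preprocess(text):
    """Clean one Title/Abstract string and return a list of tokens."""
    text = text.replace("\n", " ")                       # drop newline characters
    text = re.sub(r"http\S+|www\.\S+", " ", text)        # remove URLs
    text = text.lower()                                  # lowercase
    text = re.sub(r"\d+", " ", text)                     # remove numbers
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    tokens = word_tokenize(text)                         # tokenize
    tokens = [t for t in tokens if t not in STOP_WORDS]  # remove stop words
    return [LEMMATIZER.lemmatize(t) for t in tokens]     # lemmatize

print(preprocess("A randomized trial of 2 antiviral drugs: https://example.org"))
```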
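Steps 9 to 12 reduce to averaging word vectors per document and ranking documents by cosine similarity against the query vector. The sketch below assumes the preprocess helper, the trained models (w2v, ft), and the corpus DataFrame from the earlier sketches; the function names document_vector, cosine_similarity, and top_n are illustrative.

```python
import numpy as np

def document_vector(model, tokens):
    """Average the word vectors of the tokens present in the model's vocabulary."""
    vectors = [model.wv[t] for t in tokens if t in model.wv]
    if not vectors:
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)

def cosine_similarity(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0

def top_n(query, model, doc_vectors, titles, n=10):
    """Return the n documents most similar to the query."""
    query_vec = document_vector(model, preprocess(query))
    scores = [cosine_similarity(query_vec, d) for d in doc_vectors]
    return sorted(zip(titles, scores), key=lambda x: x[1], reverse=True)[:n]

# Precompute one vector per document (here with the Skip-gram model), then search.
doc_vectors = [document_vector(w2v, preprocess(a)) for a in corpus["Abstract"]]
print(top_n("antiviral treatment for covid", w2v, doc_vectors, corpus["Title"].tolist(), n=5))
```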
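Step 14 can be served with a few lines of Streamlit. This is a minimal UI sketch, assuming the models, corpus, and helpers from the sketches above are available in the script; the actual UI lives in lib/Medical.py.

```python
import streamlit as st

st.title("Clinical Trials Search Engine")

query = st.text_input("Enter a search query")
model_name = st.radio("Embedding model", ["Skip-gram", "FastText"])

if query:
    model = w2v if model_name == "Skip-gram" else ft
    # Document vectors must come from the same model as the query vector;
    # in practice they would be precomputed and cached rather than rebuilt per query.
    doc_vectors = [document_vector(model, preprocess(a)) for a in corpus["Abstract"]]
    results = top_n(query, model, doc_vectors, corpus["Title"].tolist(), n=10)
    st.subheader("Top results")
    for title, score in results:
        st.write(f"{score:.3f} | {title}")
```

The app is launched with `streamlit run lib/Medical.py` (adjust the path to wherever the script lives).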

Modular Code Overview

  1. input: Contains the data used for analysis, a Covid-19 clinical trials dataset (Dimension-covid.csv).

  2. src: The most important folder, containing the modularized code for all steps of the pipeline. It includes:

    • engine.py
    • ML_pipeline: A folder with the pipeline functions split across appropriately named Python files; these functions are called from engine.py.
  3. output: Contains the best-fitted models trained on this data, which can be loaded and used in future applications without retraining from scratch.

  4. lib: A reference folder with:

    • The original Jupyter (IPython) notebook.
    • The Medical.py script for running the Streamlit UI.
