Covid19_Search_Tool

Active Colab notebook: resources for working with the CORD19 (Novel Coronavirus 2019) NLP dataset

Getting started

Via Docker

The easiest way to run this package is with Docker.

Install Docker

Pull the Docker image from Docker Hub:

 docker pull rccreager/covid19-search-tool:Covid19_Search_Tool_03-25-20

Run the Docker image:

 docker run -it -p 8888:8888 rccreager/covid19-search-tool:Covid19_Search_Tool_03-25-20

(Optional) Start Jupyter from inside the running Docker container:

 jupyter notebook --ip 0.0.0.0 --no-browser --allow-root

(Optional) Open Jupyter on your local machine by copy-pasting the printed address into a web browser. It will look something like:
```
 http://127.0.0.1:8888/?token=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
```

Building Yourself:

Visit the COVID-19 Open Research Dataset Challenge (CORD-19) and download the data (requires a Kaggle account).
Clone this repository, move the data to Covid19_Search_Tool/data, and unzip the files.
Build the attached conda environment:

conda create --name cord19 python=3.6.9
source activate cord19
pip install -r requirements.txt
~/.profile

Download the NLTK packages for text processing and search:

python -m nltk.downloader punkt
python -m nltk.downloader stopwords
python -m nltk.downloader wordnet

Download the BERT model by going to Covid19_Search_Tool/models and running:

wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
unzip uncased_L-12_H-768_A-12.zip
pip install bert-serving-server==1.10 --no-deps
rm uncased_L-12_H-768_A-12.zip

Interactive visualization of COVID-19 related academic articles

TSNE Visualization of COVID-19 related academic articles

Color encodes journal
BERT sentence embeddings are computed from article abstracts
Using standard BERT pre-trained model (no retraining yet)
6200 total articles

Custom CORD19 NLP Search engine

BM25 natural language search engine
Data Processing
1. Remove duplicate articles
2. Remove (or annotate) non-academic articles (TODO)
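
As a rough illustration of the duplicate-removal step, the sketch below keeps the first article seen for each normalized title. The `title` field name and the choice to match on titles are assumptions made for this example; the repository may deduplicate on a different key.

```python
def normalize_title(title):
    """Lowercase and collapse whitespace so trivially different titles compare equal."""
    return " ".join(title.lower().split())


def remove_duplicates(articles):
    """Keep the first article seen for each normalized title."""
    seen = set()
    unique = []
    for article in articles:
        key = normalize_title(article["title"])
        if key not in seen:
            seen.add(key)
            unique.append(article)
    return unique


# Made-up records for illustration only.
articles = [
    {"title": "Coronavirus spike protein", "abstract": "..."},
    {"title": "coronavirus  Spike Protein", "abstract": "..."},
    {"title": "Vaccine trial design", "abstract": "..."},
]
unique_articles = remove_duplicates(articles)
```
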
NLP Preprocessing
1. Remove punctuation and special characters
2. Convert to lowercase
3. Tokenize into individual tokens (words mostly)
4. Remove stopwords (such as "and" and "to")
5. Lemmatize
Thanks DwightGunning for the great starting point here!
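
The preprocessing steps and BM25 ranking can be sketched in plain Python as follows. This is a minimal illustration, not the code in this repository: the tiny stopword set stands in for the NLTK list, lemmatization (step 5) is left out, and the function names and the `k1`/`b` defaults are assumptions chosen for the example.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list; the project downloads NLTK's full list.
STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def preprocess(text):
    """Steps 1-4: lowercase, strip punctuation, tokenize, drop stopwords.

    Step 5 (lemmatization) is omitted to keep the sketch dependency-free.
    """
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return [tok for tok in text.split() if tok not in STOPWORDS]


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with the Okapi BM25 formula."""
    tokenized = [preprocess(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Number of documents containing each term.
    doc_freq = Counter(term for d in tokenized for term in set(d))
    query_terms = preprocess(query)
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Made-up documents for illustration only.
docs = [
    "Incubation period of the novel coronavirus.",
    "Transmission of influenza in schools.",
    "Coronavirus transmission and incubation in households and hospitals worldwide.",
]
scores = bm25_scores("coronavirus incubation", docs)
best = max(range(len(docs)), key=scores.__getitem__)
```

Here the first and third documents both contain the two query terms, but the shorter first document ranks higher because BM25 normalizes term frequency by document length.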

Plan of action

Topic modeling with LDA @Rachael Creager
NLU feature engineering with TF-IDF @Maryana Alegro
NLU feature engineering with BERT @Matt Rubashkin
Feature engineering with metadata
Making an embedding search space via concatenating the TOPIC, NLU and metadata vectors @Kevin Li
Then create a cosine-similarity search engine that encodes each query into the same vector format as above
Streamlit app with a search bar and a way to visualize article information (Mike Lo)
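
The planned embedding search could look roughly like the sketch below: topic, NLU, and metadata vectors are concatenated into one vector per article, and a query encoded the same way is ranked against them by cosine similarity. All names, dimensions, and values here are hypothetical; the plan above does not fix any of them.

```python
import math


def concatenate(*vectors):
    """Join several feature vectors into one flat list."""
    combined = []
    for vec in vectors:
        combined.extend(vec)
    return combined


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def search(query_vec, article_vecs, top_k=2):
    """Return the names of the top_k articles most similar to the query."""
    ranked = sorted(
        article_vecs,
        key=lambda name: cosine_similarity(query_vec, article_vecs[name]),
        reverse=True,
    )
    return ranked[:top_k]


# Made-up toy vectors: 2 topic dims, 2 NLU dims, 1 metadata dim.
article_vecs = {
    "article_a": concatenate([0.9, 0.1], [0.8, 0.2], [1.0]),
    "article_b": concatenate([0.1, 0.9], [0.2, 0.8], [0.0]),
    "article_c": concatenate([0.5, 0.5], [0.5, 0.5], [1.0]),
}
query_vec = concatenate([1.0, 0.0], [1.0, 0.0], [1.0])
results = search(query_vec, article_vecs)
```

Because the query must be built with the same concatenation order as the articles, any real implementation would need one shared encoding function for both.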
