Active Colab notebook: Resources for working with the CORD-19 (Novel Coronavirus 2019) NLP dataset
The easiest way to run this package is with Docker.
- Install Docker
- Pull the Docker image from Docker Hub:
docker pull rccreager/covid19-search-tool:Covid19_Search_Tool_03-25-20
- Run the Docker image:
docker run -it -p 8888:8888 rccreager/covid19-search-tool:Covid19_Search_Tool_03-25-20
- (Optional) Start Jupyter from inside the Docker container:
jupyter notebook --ip 0.0.0.0 --no-browser --allow-root
- (Optional) Open Jupyter on your local machine by copy-pasting the printed address into a web browser. It will look something like:
http://127.0.0.1:8888/?token=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
- Visit COVID-19 Open Research Dataset Challenge (CORD-19) and download the data (requires Kaggle account)
- Clone this repository, move the data to Covid19_Search_Tool/data, and unzip the files
- Build the attached conda environment
conda create --name cord19 python=3.6.9
source activate cord19
pip install -r requirements.txt
~/.profile
- Download the NLTK packages for text processing and search
python -m nltk.downloader punkt
python -m nltk.downloader stopwords
python -m nltk.downloader wordnet
- Download the BERT model: go to Covid19_Search_Tool/models and run
wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
unzip uncased_L-12_H-768_A-12.zip
pip install bert-serving-server==1.10 --no-deps
rm uncased_L-12_H-768_A-12.zip
TSNE Visualization of COVID-19 related academic articles
- Color encodes journal
- BERT sentence embeddings are computed from article abstracts
- Using standard BERT pre-trained model (no retraining yet)
- 6200 total articles
- BM25 natural language search engine
- Data Processing
- Remove duplicate articles
- Remove (or annotate) non-academic articles (TODO)
- NLP Preprocessing
- Remove punctuation and special characters
- Convert to lowercase
- Tokenize into individual tokens (words mostly)
- Remove stopwords (such as "and", "to")
- Lemmatize
- Thanks DwightGunning for the great starting point here!
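
The preprocessing steps and BM25 ranking listed above can be sketched in a few lines of standard-library Python. This is an illustration only, not the code used in this repository: the `preprocess` function, the `BM25` class, the tiny `STOPWORDS` set, and the sample abstracts are all hypothetical, the `k1` and `b` values are common defaults rather than tuned settings, and lemmatization is left out to keep the sketch dependency-free (the project relies on NLTK for stopwords and lemmatization).

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (assumption); the project uses NLTK's corpus.
STOPWORDS = {"a", "an", "and", "the", "to", "of", "in", "is", "for", "on"}


def preprocess(text):
    """Lowercase, strip punctuation, split on whitespace, drop stopwords."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return [tok for tok in text.split() if tok not in STOPWORDS]


class BM25:
    """Okapi BM25 over a small in-memory corpus of token lists."""

    def __init__(self, corpus, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.doc_freqs = [Counter(doc) for doc in corpus]
        self.doc_lens = [len(doc) for doc in corpus]
        self.avg_len = sum(self.doc_lens) / len(corpus)
        n_docs = len(corpus)
        # Document frequency: number of documents containing each term.
        df = Counter(term for doc in corpus for term in set(doc))
        # The "1 +" inside the log keeps IDF positive for very common terms.
        self.idf = {
            t: math.log(1 + (n_docs - n + 0.5) / (n + 0.5)) for t, n in df.items()
        }

    def score(self, query_tokens, index):
        """BM25 score of one document for an already-tokenized query."""
        freqs, length = self.doc_freqs[index], self.doc_lens[index]
        norm = self.k1 * (1 - self.b + self.b * length / self.avg_len)
        total = 0.0
        for term in query_tokens:
            tf = freqs.get(term, 0)
            if tf:
                total += self.idf[term] * tf * (self.k1 + 1) / (tf + norm)
        return total

    def search(self, query, top_k=3):
        """Return indices of the best-matching documents, best first."""
        tokens = preprocess(query)
        scored = [(self.score(tokens, i), i) for i in range(len(self.doc_freqs))]
        scored = [(s, i) for s, i in scored if s > 0]
        return [i for s, i in sorted(scored, key=lambda p: (-p[0], p[1]))[:top_k]]


# Made-up abstracts, for illustration only.
abstracts = [
    "Incubation period of the novel coronavirus in Wuhan.",
    "Transmission dynamics of influenza in schools.",
    "Coronavirus spike protein structure and receptor binding.",
]
engine = BM25([preprocess(a) for a in abstracts])
results = engine.search("coronavirus incubation period")
```

Documents that share no terms with the query are dropped from the results, so `results` contains only the indices of abstracts that matched at least one query token.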
- Topic modeling with LDA @Rachael Creager
- NLU feature engineering with TF-IDF @Maryana Alegro
- NLU feature engineering with BERT @Matt Rubashkin
- Feature engineering with metadata
- Build an embedding search space by concatenating the topic, NLU, and metadata vectors @Kevin Li
- Then create a cosine-similarity search engine that encodes each query as the same kind of vector as above
- Streamlit app with a search bar and a way to visualize article information (Mike Lo)
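
One possible shape for the planned embedding search is sketched below in plain Python. It is an assumption-laden illustration, not the project's implementation: the function names (`l2_normalize`, `combine`, `cosine`, `search`), the vector sizes, and the toy values are hypothetical, and normalizing each block before concatenation is a design choice made here so that no single feature group dominates by scale.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def combine(topic_vec, nlu_vec, meta_vec):
    """Normalize each feature block, then concatenate into one search vector."""
    return l2_normalize(topic_vec) + l2_normalize(nlu_vec) + l2_normalize(meta_vec)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def search(query_vec, article_vecs, top_k=2):
    """Return the names of the articles most similar to the query, best first."""
    ranked = sorted(article_vecs, key=lambda name: -cosine(query_vec, article_vecs[name]))
    return ranked[:top_k]


# Toy vectors (assumptions): 2 topic weights, 3 NLU features, 1 metadata feature.
articles = {
    "article_a": combine([0.9, 0.1], [1.0, 0.0, 0.0], [1.0]),
    "article_b": combine([0.1, 0.9], [0.0, 1.0, 0.0], [1.0]),
    "article_c": combine([0.8, 0.2], [0.9, 0.1, 0.0], [1.0]),
}
# The query goes through the same combine() step, so it lives in the same space.
query = combine([1.0, 0.0], [1.0, 0.0, 0.0], [1.0])
top = search(query, articles)
```

Because the query is built with the same `combine` step as the articles, the two are directly comparable. In practice the topic, NLU, and metadata blocks would come from LDA, TF-IDF or BERT, and the dataset's metadata rather than hand-written lists.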