This is a summary of what was done; the full report is in NLP_COVID_Presentation.
- COVID-19 Open Research Dataset
- ~13 GB
- Over 135,000 scholarly articles
- Including over 68,000 with full text
- First goal: What do we know about COVID-19 symptoms?
- Second goal: How can we cluster papers into coherent groups?
- Chosen method: GloVe – Global Vectors for Word Representation
- Generate corpus using the provided dataset
- Create word vectors
- Measure cosine distance between word vectors
- Main idea: Cluster words, represented by their vectors, using the k-means algorithm
- Create feature vector for each paper using a bag-of-words (BOW) model
- Cluster vectors into coherent groups
- Visualize clusters in a 2D plot
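
The steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the project's actual code: the 2-D vectors in `WORD_VECTORS` are made-up stand-ins for trained GloVe vectors, the toy `papers` are invented, and the function names are hypothetical. It also assumes one plausible reading of the BOW step, namely that each paper's feature vector counts how many of its tokens fall into each word cluster. Plotting is left out.

```python
import math
import random
from collections import Counter

# Made-up 2-D vectors standing in for trained GloVe vectors (assumption).
WORD_VECTORS = {
    "fever": [0.9, 0.1],
    "cough": [0.8, 0.2],
    "fatigue": [0.85, 0.15],
    "protein": [0.1, 0.9],
    "genome": [0.2, 0.8],
    "rna": [0.15, 0.95],
}


def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def nearest_words(query, vectors, top_n=2):
    """Rank all other words by cosine distance to the query word."""
    others = [w for w in vectors if w != query]
    others.sort(key=lambda w: cosine_distance(vectors[query], vectors[w]))
    return others[:top_n]


def squared_euclidean(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its closest centroid.
        labels = [
            min(range(k), key=lambda c: squared_euclidean(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def paper_features(tokens, word_to_cluster, k):
    """Bag-of-words over word clusters: count tokens per cluster."""
    counts = Counter(word_to_cluster[t] for t in tokens if t in word_to_cluster)
    return [counts.get(c, 0) for c in range(k)]


# Step 1: cluster the words by their vectors.
K = 2
words = list(WORD_VECTORS)
word_labels = kmeans([WORD_VECTORS[w] for w in words], K)
word_to_cluster = dict(zip(words, word_labels))

# Step 2: build one feature vector per (toy) paper, then cluster the papers.
papers = {
    "paper_a": ["fever", "cough", "fatigue", "cough", "the"],
    "paper_b": ["fever", "fatigue"],
    "paper_c": ["genome", "rna", "protein"],
    "paper_d": ["rna", "protein", "rna", "genome"],
}
features = {name: paper_features(toks, word_to_cluster, K) for name, toks in papers.items()}
paper_labels = dict(zip(features, kmeans(list(features.values()), K)))

closest_to_fever = nearest_words("fever", WORD_VECTORS)
```

On this toy data the symptom words and the biology words end up in separate word clusters, and the papers split the same way. With real GloVe vectors, the number of clusters and the k-means initialization would need tuning, and a library implementation would likely be a better choice than this hand-rolled loop.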