CORD-19 Data Processing & Topic Modeling

The Data Processing notebook interactively guides the user through processing the machine-readable corpus of COVID-19 research made available by the White House on 2020-03-16. After downloading the original dataset, the user only needs to enter their directory path (using the text boxes embedded in the notebook) to read in, process, and export the data. This workflow is designed for anyone who wants to use Python to explore and analyze the COVID-19 text.
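As a rough illustration of the read, process, and export steps described above, here is a minimal sketch that uses only the Python standard library. It is not the notebook's actual code. It assumes each paper is a JSON file with `paper_id`, `metadata.title`, and a `body_text` list of `{"text": ...}` sections; the function names `extract_text` and `process_directory` and the output columns are hypothetical.

```python
import csv
import json
from pathlib import Path


def extract_text(paper: dict) -> dict:
    """Flatten one parsed paper into a single row (field names are assumed)."""
    body = paper.get("body_text", [])
    return {
        "paper_id": paper.get("paper_id", ""),
        "title": paper.get("metadata", {}).get("title", ""),
        # Join all body sections into one string, separated by spaces.
        "text": " ".join(section.get("text", "") for section in body),
    }


def process_directory(input_dir: str, output_csv: str) -> int:
    """Read every *.json file under input_dir and export one CSV row per paper."""
    rows = []
    for path in sorted(Path(input_dir).rglob("*.json")):
        with path.open(encoding="utf-8") as handle:
            rows.append(extract_text(json.load(handle)))

    with open(output_csv, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["paper_id", "title", "text"])
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Calling `process_directory("path/to/dataset", "processed.csv")` with your own paths would write one row per paper and return the number of papers read.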

The Topic Modeling notebook supports interactive topic modeling of the COVID-19 research text made available by the White House as of 2020-03-16. After generating the processed outputs with the CORD-19 Data Processing notebook, the user only needs to enter their directory path (using the text boxes embedded in the notebook) to read in the pre-processed data before making their topic-modeling selections. This workflow is designed for anyone who wants to use Python to explore and analyze the COVID-19 text.
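Topic models generally take a document-term count matrix as input. The sketch below shows one way to build that matrix from pre-processed text with the standard library alone. It is an illustration under stated assumptions, not the notebook's method: the stopword list, the tokenizer pattern, and the names `tokenize` and `document_term_counts` are all hypothetical, and no topic model is fitted here.

```python
import re
from collections import Counter

# A tiny illustrative stopword list (assumed); a real run would use a fuller one.
STOPWORDS = {"the", "of", "and", "in", "to", "a", "is", "for", "with", "on"}


def tokenize(text: str) -> list:
    """Lowercase the text and keep tokens of 2+ characters that are not stopwords."""
    tokens = re.findall(r"[a-z][a-z0-9-]+", text.lower())
    return [token for token in tokens if token not in STOPWORDS]


def document_term_counts(texts, min_doc_freq=1):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in document i."""
    per_doc = [Counter(tokenize(text)) for text in texts]
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for counts in per_doc for term in counts)
    vocab = sorted(term for term, df in doc_freq.items() if df >= min_doc_freq)
    matrix = [[counts[term] for term in vocab] for counts in per_doc]
    return vocab, matrix
```

Raising `min_doc_freq` drops terms that occur in few documents, which is one of the selections that typically affects the resulting topics.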

In response to the COVID-19 pandemic, the Allen Institute for AI has partnered with leading research groups to prepare and distribute the COVID-19 Open Research Dataset (CORD-19), a free resource of over 29,000 scholarly articles, including over 13,000 with full text, about COVID-19 and the coronavirus family of viruses for use by the global research community.

This dataset is intended to mobilize researchers to apply recent advances in natural language processing to generate new insights in support of the fight against this infectious disease. The corpus will be updated weekly as new research is published in peer-reviewed publications and archival services like bioRxiv, medRxiv, and others.

Read more about the dataset: https://www.semanticscholar.org/cord19