COVID19 Data Scraper

A Python web scraper that uses a variety of tools to aggregate up-to-date COVID19 (novel coronavirus 2019) case data. It currently operates at the level of US counties.

For every method implemented below, each state is fully configurable in the YAML file stateConfig.yml, so new data sources can be added without code changes.
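As a rough illustration of the config-driven idea, the sketch below picks a scraping method from a per-state `type` field. The key names (`type`, `url`, `path`, `table_index`), the handler functions, and the example.com URLs are all hypothetical; the actual schema of stateConfig.yml may look quite different. The dict stands in for what `yaml.safe_load` would return after reading such a file.

```python
# Hypothetical per-state configuration, written as the dict that
# yaml.safe_load() might return. Key names are assumptions, not the real schema.
STATE_CONFIG = {
    "Ohio": {"type": "json", "url": "https://example.com/ohio.json", "path": "data/counties"},
    "Utah": {"type": "html_table", "url": "https://example.com/utah.html", "table_index": 0},
}


def describe_json(cfg):
    """Describe a JSON scrape; a real handler would fetch and parse the URL."""
    return f"fetch JSON from {cfg['url']} and read '{cfg['path']}'"


def describe_html_table(cfg):
    """Describe an HTML table scrape; a real handler would parse the page."""
    return f"fetch {cfg['url']} and parse table #{cfg['table_index']}"


# Map each configured method name to the function that handles it.
HANDLERS = {"json": describe_json, "html_table": describe_html_table}


def plan_scrape(state, config=STATE_CONFIG):
    """Look up a state's config entry and dispatch on its method type."""
    cfg = config[state]
    try:
        handler = HANDLERS[cfg["type"]]
    except KeyError:
        raise ValueError(f"no handler for method {cfg['type']!r}") from None
    return handler(cfg)
```

With this shape, adding a state that uses an existing method means adding one config entry; new code is only needed when a new method type appears.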

Methods

Scrape HTML <table> using requests module and pandas
Scrape the JSON returned by a web API call and process it in a configurable manner using dpath
HTML text scrape using lxml for Xpath search and regular expressions (re)
Scrape PDF for table data using pdftotext
Scrape image for table data using Pillow for image manipulation and pytesseract optical character recognition (OCR) functionality
Pre-rendering JavaScript on a page using requests-html
Pre-establishing session id using requests.Session, used for Tableau
Index-data lookup of 'post' JSON data, used for Tableau
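Two of the methods above, configurable JSON path lookup and regular-expression text scraping, can be sketched with the standard library alone. This is a minimal sketch, not the project's code: `get_path` is a hypothetical stand-in for the slash-separated lookups that dpath provides, and the JSON payload, HTML snippet, and county figures are made up for illustration.

```python
import json
import re


def get_path(obj, path, sep="/"):
    """Walk nested dicts and lists along a separator-delimited path."""
    for key in path.split(sep):
        # List elements are addressed by integer index, dict entries by key.
        obj = obj[int(key)] if isinstance(obj, list) else obj[key]
    return obj


# Made-up API response; a real run would obtain this text over HTTP.
raw = (
    '{"features": ['
    '{"attributes": {"county": "Adams", "cases": 12}}, '
    '{"attributes": {"county": "Brown", "cases": 3}}]}'
)
payload = json.loads(raw)

# The path strings are the part that could live in configuration.
counties = {
    feature["attributes"]["county"]: feature["attributes"]["cases"]
    for feature in get_path(payload, "features")
}
first_cases = get_path(payload, "features/0/attributes/cases")

# Regex text scrape over a made-up HTML fragment.
html_text = "<p>Clark County: 41 cases</p><p>Dane County: 7 cases</p>"
pattern = re.compile(r"(\w[\w ]*?) County: (\d+) cases")
text_counts = {name: int(count) for name, count in pattern.findall(html_text)}
```

Keeping the path and the pattern as plain strings is what lets them move into a per-state config entry instead of living in code.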

Proposed Methods for Remaining States

County-level page scraping

Progress

45 / 50 US states
- ✅ Alabama
- ✅ Alaska
- ❌ American Samoa
- ✅ Arizona
- ✅ Arkansas
- ✅ California
- ✅ Colorado
- ✅ Connecticut
- ✅ Delaware
- ✅ District of Columbia
- ✅ Florida
- ✅ Georgia
- ❌ Guam
- ❌ Hawaii
- ✅ Idaho
- ✅ Illinois
- ✅ Indiana
- ✅ Iowa
- ✅ Kansas
- ❌ Kentucky
- ✅ Louisiana
- ✅ Maine
- ❌ Marshall Islands
- ✅ Maryland
- ✅ Massachusetts
- ✅ Michigan
- ❌ Micronesia
- ✅ Minnesota
- ✅ Mississippi
- ✅ Missouri
- ✅ Montana
- ✅ Nebraska
- ❌ Nevada
- ✅ New Hampshire
- ✅ New Jersey
- ✅ New Mexico
- ✅ New York
- ✅ North Carolina
- ✅ North Dakota
- ❌ Northern Mariana Islands
- ✅ Ohio
- ✅ Oklahoma
- ✅ Oregon
- ✅ Pennsylvania
- ❌ Puerto Rico
- ❌ Republic of Palau
- ✅ Rhode Island
- ✅ South Carolina
- ✅ South Dakota
- ✅ Tennessee
- ✅ Texas
- ❌ US Virgin Islands
- ✅ Utah
- ❌ Vermont
- ✅ Virginia
- ✅ Washington
- ❌ West Virginia
- ✅ Wisconsin
- ✅ Wyoming
1 / 9 districts, territories, and freely associated states

Notable Python Modules Used

pandas
yaml
numpy
ssl
requests
urllib
json
dpath.util
lxml
pdftotext (system utility, not python module of same name)
Pillow
pytesseract

TODO

Expand coverage internationally
Aggregate time-series data
Schedule automated run
Better death count
Scrape historical data (wayback machine or other methods)
Individual case level data
Confirm data with official reports using NLP or other text processing approaches

Terms of Use

This GitHub repo and its contents herein, including all data, code, mapping, and analysis, copyright 2020 Stuart Wheaton, all rights reserved, is provided to the public strictly for educational and academic research purposes. The Website relies upon publicly available data from multiple sources that do not always agree. I hereby disclaim any and all representations and warranties with respect to the Website, including accuracy, fitness for use, and merchantability. Reliance on the Website for medical guidance or use of the Website in commerce is strictly prohibited.
