A Python web scraper that uses several tools to aggregate up-to-date COVID-19 (novel coronavirus 2019) case data. It currently operates at the level of US counties.
For every method implemented below, each state is fully configurable in the YAML file stateConfig.yml, so new data sources can be added without code changes.
- Scrape HTML <table> elements using the requests module and pandas
- Scrape JSON returned from a web API call, processed in a configurable manner using dpath
- HTML text scrape using lxml for XPath search and regular expressions (re)
- Scrape PDF for table data using pdftotext
- Scrape image for table data using Pillow for image manipulation and pytesseract optical character recognition (OCR) functionality
- Pre-rendering JavaScript on a page using requests-html
- Pre-establishing session id using requests.Session, used for Tableau
- Index-data lookup of 'post' json data, used for Tableau
- County-level page scraping
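
As a rough illustration of how a config-driven design like the one described above might dispatch to different scraping methods, here is a minimal sketch. It is not the project's actual code: the config keys (`method`, `records_path`, `county_key`, `cases_key`, `pattern`), the state names, and every function name are hypothetical, and the config is written as a plain dict shaped like what a YAML loader could return for stateConfig.yml. The `get_path` helper is a simplified stand-in for a dpath lookup, so the example only needs the standard library.

```python
import json
import re

# Hypothetical per-state config, shaped like a parsed stateConfig.yml.
# All keys and state names are assumptions made for this sketch.
STATE_CONFIG = {
    "ExampleState": {
        "method": "json",
        "records_path": "features",
        "county_key": "attributes/NAME",
        "cases_key": "attributes/CASES",
    },
    "OtherState": {
        "method": "text",
        "pattern": r"(?P<county>[A-Z][a-z]+) County: (?P<cases>\d+)",
    },
}


def get_path(obj, path):
    """Walk a slash-separated path through nested dicts (stand-in for dpath)."""
    for key in path.split("/"):
        obj = obj[key]
    return obj


def scrape_json(payload, cfg):
    """Extract {county: cases} from a JSON string using configured paths."""
    data = json.loads(payload)
    return {
        get_path(rec, cfg["county_key"]): int(get_path(rec, cfg["cases_key"]))
        for rec in get_path(data, cfg["records_path"])
    }


def scrape_text(payload, cfg):
    """Extract {county: cases} from plain text using a configured regex."""
    return {
        m.group("county"): int(m.group("cases"))
        for m in re.finditer(cfg["pattern"], payload)
    }


# Map each configured method name to the function that handles it.
SCRAPERS = {"json": scrape_json, "text": scrape_text}


def scrape_state(state, payload, config=STATE_CONFIG):
    """Look up the state's config and run the scraper it names."""
    cfg = config[state]
    return SCRAPERS[cfg["method"]](payload, cfg)


if __name__ == "__main__":
    sample_json = '{"features": [{"attributes": {"NAME": "Adams", "CASES": 12}}]}'
    print(scrape_state("ExampleState", sample_json))
    print(scrape_state("OtherState", "Adams County: 12. Baker County: 3"))
```

In this sketch, supporting a new state that publishes JSON or plain text only requires a new config entry. A new scraping method would need one more function and one more entry in `SCRAPERS`.
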
- 45 / 50 US states
- β Alabama
- β Alaska
- β American Samoa
- β Arizona
- β Arkansas
- β California
- β Colorado
- β Connecticut
- β Delaware
- β District of Columbia
- β Florida
- β Georgia
- β Guam
- β Hawaii
- β Idaho
- β Illinois
- β Indiana
- β Iowa
- β Kansas
- β Kentucky
- β Louisiana
- β Maine
- β Marshall Islands
- β Maryland
- β Massachusetts
- β Michigan
- β Micronesia
- β Minnesota
- β Mississippi
- β Missouri
- β Montana
- β Nebraska
- β Nevada
- β New Hampshire
- β New Jersey
- β New Mexico
- β New York
- β North Carolina
- β North Dakota
- β Northern Mariana Islands
- β Ohio
- β Oklahoma
- β Oregon
- β Pennsylvania
- β Puerto Rico
- β Republic of Palau
- β Rhode Island
- β South Carolina
- β South Dakota
- β Tennessee
- β Texas
- β US Virgin Islands
- β Utah
- β Vermont
- β Virginia
- β Washington
- β West Virginia
- β Wisconsin
- β Wyoming
- 1 / 9 districts, territories, and freely associated states
- pandas
- yaml
- numpy
- ssl
- requests
- urllib
- json
- dpath.util
- lxml
- pdftotext (system utility, not the Python module of the same name)
- pillow
- pytesseract
- Expand coverage internationally
- Aggregate time-series data
- Schedule automated run
- Better death count
- Scrape historical data (wayback machine or other methods)
- Individual case level data
- Confirm data with official reports using NLP or other text processing approaches
This GitHub repo and its contents herein, including all data, code, mapping, and analysis, copyright 2020 Stuart Wheaton, all rights reserved, is provided to the public strictly for educational and academic research purposes. The Website relies upon publicly available data from multiple sources that do not always agree. I hereby disclaim any and all representations and warranties with respect to the Website, including accuracy, fitness for use, and merchantability. Reliance on the Website for medical guidance or use of the Website in commerce is strictly prohibited.