swheaton/covid19-data-scraper

COVID19 Data Scraper

A Python web scraper that uses a range of tools to aggregate up-to-date COVID19 (novel coronavirus 2019) case data. It currently operates at the level of US counties.

For any implemented method below, each state is fully configurable in the YAML file stateConfig.yml, so new data sources can be added without code changes.
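As a rough illustration of the config-driven idea, the sketch below dispatches each state to a scraping method based on a per-state entry. The key names (`type`, `url`), the handler names, and the example URLs are hypothetical and are not the actual schema of stateConfig.yml; a plain dict stands in for what a YAML loader would return.

```python
from typing import Callable, Dict

# Hypothetical stand-in for a parsed stateConfig.yml. In YAML this might read:
#   Alabama:
#     type: html_table
#     url: https://example.org/al
STATE_CONFIG: Dict[str, Dict[str, str]] = {
    "Alabama": {"type": "html_table", "url": "https://example.org/al"},
    "Alaska": {"type": "json_api", "url": "https://example.org/ak.json"},
}


def scrape_html_table(cfg: Dict[str, str]) -> str:
    # Placeholder: a real handler would download and parse the page.
    return f"would parse <table> at {cfg['url']}"


def scrape_json_api(cfg: Dict[str, str]) -> str:
    # Placeholder: a real handler would call the API and walk the JSON.
    return f"would fetch JSON from {cfg['url']}"


# Map each configured method name to the function that implements it.
HANDLERS: Dict[str, Callable[[Dict[str, str]], str]] = {
    "html_table": scrape_html_table,
    "json_api": scrape_json_api,
}


def run_state(name: str) -> str:
    """Look up a state's config entry and call the matching handler."""
    cfg = STATE_CONFIG[name]
    try:
        handler = HANDLERS[cfg["type"]]
    except KeyError:
        raise ValueError(f"no handler for method {cfg['type']!r}") from None
    return handler(cfg)
```

Under this layout, adding a state that uses an existing method means adding one config entry; only a new method needs new code.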

Methods

  • Scrape HTML <table> using requests module and pandas
  • Scrape the JSON returned by a web API call, processed in a configurable manner using dpath
  • Scrape HTML text using lxml for XPath search and regular expressions (re)
  • Scrape PDF for table data using pdftotext
  • Scrape image for table data using Pillow for image manipulation and pytesseract optical character recognition (OCR) functionality
  • Pre-render JavaScript on a page using requests-html
  • Pre-establish a session ID using requests.Session (used for Tableau)
  • Index-data lookup of 'post' JSON data (used for Tableau)
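Two of the methods above can be sketched with only the standard library. Here `lookup` is a hand-rolled stand-in for the kind of slash-separated path lookup that dpath offers (it is not dpath's API), and the sample payload, HTML snippet, and regular expression are invented for illustration rather than taken from any state's site.

```python
import json
import re
from typing import Any, Dict


def lookup(doc: Any, path: str) -> Any:
    """Walk nested dicts and lists along a slash-separated path like 'a/0/b'."""
    node = doc
    for part in path.strip("/").split("/"):
        if isinstance(node, list):
            node = node[int(part)]  # list segments are numeric indices
        else:
            node = node[part]
    return node


# Invented API payload; real responses differ per state.
SAMPLE_JSON = (
    '{"features": ['
    '{"attributes": {"county": "Adams", "cases": 12}},'
    '{"attributes": {"county": "Brown", "cases": 7}}]}'
)


def counties_from_json(raw: str, list_path: str,
                       name_path: str, count_path: str) -> Dict[str, int]:
    """Extract {county: cases} using paths that could come from config."""
    records = lookup(json.loads(raw), list_path)
    return {lookup(r, name_path): int(lookup(r, count_path)) for r in records}


# Invented page fragment for the text-scrape method.
SAMPLE_HTML = "<p>Adams County: 12 cases</p><p>Brown County: 7 cases</p>"
COUNTY_RE = re.compile(r"(\w[\w ]*?) County:\s*(\d+)")


def counties_from_text(html: str) -> Dict[str, int]:
    """Extract {county: cases} from free text with a regular expression."""
    return {name: int(count) for name, count in COUNTY_RE.findall(html)}
```

Because the paths and the pattern are plain strings, they are the sort of values that could live in a per-state config entry instead of in code.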

Proposed Methods for Remaining States

  • County-level page scraping

Progress

  • 45 / 50 US states
    • ✅ Alabama
    • ✅ Alaska
    • ❌ American Samoa
    • ✅ Arizona
    • ✅ Arkansas
    • ✅ California
    • ✅ Colorado
    • ✅ Connecticut
    • ✅ Delaware
    • ✅ District of Columbia
    • ✅ Florida
    • ✅ Georgia
    • ❌ Guam
    • ❌ Hawaii
    • ✅ Idaho
    • ✅ Illinois
    • ✅ Indiana
    • ✅ Iowa
    • ✅ Kansas
    • ❌ Kentucky
    • ✅ Louisiana
    • ✅ Maine
    • ❌ Marshall Islands
    • ✅ Maryland
    • ✅ Massachusetts
    • ✅ Michigan
    • ❌ Micronesia
    • ✅ Minnesota
    • ✅ Mississippi
    • ✅ Missouri
    • ✅ Montana
    • ✅ Nebraska
    • ❌ Nevada
    • ✅ New Hampshire
    • ✅ New Jersey
    • ✅ New Mexico
    • ✅ New York
    • ✅ North Carolina
    • ✅ North Dakota
    • ❌ Northern Mariana Islands
    • ✅ Ohio
    • ✅ Oklahoma
    • ✅ Oregon
    • ✅ Pennsylvania
    • ❌ Puerto Rico
    • ❌ Republic of Palau
    • ✅ Rhode Island
    • ✅ South Carolina
    • ✅ South Dakota
    • ✅ Tennessee
    • ✅ Texas
    • ❌ US Virgin Islands
    • ✅ Utah
    • ❌ Vermont
    • ✅ Virginia
    • ✅ Washington
    • ❌ West Virginia
    • ✅ Wisconsin
    • ✅ Wyoming
  • 1 / 9 districts, territories, and freely associated states

Notable Python Modules Used

  • pandas
  • yaml
  • numpy
  • ssl
  • requests
  • urllib
  • json
  • dpath.util
  • lxml
  • pdftotext (the system utility, not the Python module of the same name)
  • pillow
  • pytesseract
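Since pdftotext here is the command-line utility, it would be driven through a subprocess. The sketch below is one possible way to do that, assuming the poppler `pdftotext` binary is on the PATH; the function names and the rule of splitting cells on runs of two or more spaces are illustrative assumptions, not the project's actual parser.

```python
import re
import subprocess
from typing import List


def pdf_to_text(pdf_path: str) -> str:
    """Run the pdftotext CLI and return its output.

    '-layout' preserves the physical column layout; '-' sends text to stdout.
    """
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def split_rows(text: str) -> List[List[str]]:
    """Split layout-preserved text into cells on runs of 2+ whitespace chars.

    Lines that yield fewer than two cells (blank lines, page footers) are
    skipped, which is a heuristic and will not suit every PDF.
    """
    rows = []
    for line in text.splitlines():
        cells = [c for c in re.split(r"\s{2,}", line.strip()) if c]
        if len(cells) > 1:
            rows.append(cells)
    return rows
```

A call such as `split_rows(pdf_to_text("report.pdf"))` (with a hypothetical file name) would then give a list of rows ready to load into a pandas DataFrame.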

TODO

  • Expand coverage internationally
  • Aggregate time-series data
  • Schedule automated run
  • Better death count
  • Scrape historical data (Wayback Machine or other methods)
  • Individual case level data
  • Confirm data with official reports using NLP or other text processing approaches

Terms of Use

This GitHub repo and its contents herein, including all data, code, mapping, and analysis, copyright 2020 Stuart Wheaton, all rights reserved, is provided to the public strictly for educational and academic research purposes. The Website relies upon publicly available data from multiple sources that do not always agree. I hereby disclaim any and all representations and warranties with respect to the Website, including accuracy, fitness for use, and merchantability. Reliance on the Website for medical guidance or use of the Website in commerce is strictly prohibited.
