
International Infectious Disease Data Archive (IIDDA)

CC BY-NC-SA 4.0

Classic IIDDA

David Earn started the IIDDA project to make historical epidemiological data available to the research community. This GitHub repository replaces classic IIDDA, which is currently offline. The classic IIDDA datasets are here.

Featured Datasets

The following table contains links that will download a zip archive containing one or more datasets and DataCite 4.3 metadata, as well as links to these metadata. The metadata include lists of all of the files used to produce the associated dataset. To understand how these links work please go here. The datasets below are classified as unharmonized, harmonized, and normalized -- please see the section on data harmonization for an explanation of these terms.

CANMOD Digitization Project

The CANMOD network funded the systematic curation and digitization of historical Canadian infectious disease data. Released data from this project appear in the table below.

❗Please acknowledge any use of these data by citing this preprint.

| Description | Links | Size | Compressed | Breakdown | Shortest Frequency | Time Range | Command to reproduce |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Canadian Disease Incidence Data (CANDID), Unharmonized | Data, Metadata | 335MB | 11.2MB | prov/disease | wk, mo, qr, yr (depending on breakdown) | 1903-2020 | make derived-data/canmod-cdi-unharmonized/canmod-cdi-unharmonized.csv |
| Canadian Disease Incidence Data (CANDID), Harmonized | Data, Metadata | 266MB | 9.1MB | prov/disease | wk, mo, qr, yr (depending on breakdown) | 1903-2020 | make derived-data/canmod-cdi-harmonized/canmod-cdi-harmonized.csv |
| Canadian Disease Incidence Data (CANDID), Normalized | Data, Metadata | 235MB | 10.1MB | prov/disease | wk, mo, qr, yr (depending on breakdown) | 1903-2020 | make derived-data/canmod-cdi-normalized/canmod-cdi-normalized.csv |
| Unharmonized population | Data | 33.5MB | 2.5MB | prov/sex/age-group | yr, 10yr | 1881-2020 | Not a single command |
| Normalized population | Data, Metadata | 2.5MB | 0.5MB | prov | wk (interpolated) | 1881-2020 | make derived-data/canmod-pop-normalized/canmod-pop-normalized.csv |

Name harmonization for the harmonized and normalized files is done using the following lookup tables.

  • Disease name lookup
    • Harmonized names are in disease and nesting_disease
    • Historical names are in historical_disease, historical_disease_family, and historical_disease_subclass
    • Remaining columns provide context and notes on how the mappings were chosen
  • Location name lookup

The current results on cross-tabulations for checking data quality in this project can be found here.

An example of investigating the provenance of a strange smallpox record in these data is here.

IIDDA API

The above tables contain links to featured data, but all data in the archive can be accessed using this API.

The list of all dataset IDs in the API can be found here. To download any of these datasets, along with their metadata, one may use the following URL formula.

https://math.mcmaster.ca/iidda/api/download?resource=csv&resource=metadata&dataset_ids={DATASET_ID}
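
For instance, the following base-R sketch (not the official client) downloads one dataset and its metadata using this URL formula. It assumes the response is a zip archive, as with the featured-dataset links, and uses a dataset ID that appears elsewhere in this README.

dataset_id <- "cdi_ca_1956_wk_prov_dbs"  # example ID used elsewhere in this README
url <- sprintf(
  "https://math.mcmaster.ca/iidda/api/download?resource=csv&resource=metadata&dataset_ids=%s",
  dataset_id
)
zip_path <- tempfile(fileext = ".zip")
download.file(url, zip_path, mode = "wb")  # assumes the endpoint returns a zip archive
unzip(zip_path, exdir = dataset_id)
list.files(dataset_id)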

There is also an R binding of the API. Here is a quick-start guide.

Data Dictionary

All fields in IIDDA datasets must appear in the data dictionary. If new fields must be added, a column metadata file needs to be added to this directory.

Data Harmonization

The featured datasets are each classified as one of the following types.

  • Unharmonized : Minimally processed to allow data from different sources to be stored in the same long-format dataset.
  • Harmonized : Excludes low-quality records and includes location and disease names that simplify the combination of data from different sources (e.g., using poliomyelitis whenever infantile paralysis is reported historically).
  • Normalized : Excludes overlapping data enabling aggregation without double-counting and facilitating integration of complementary data. All normalized datasets are also harmonized.

Please see the following references for background on these terms.

The files in lookup-tables are used in the harmonization of historical names.

Reproducing IIDDA Datasets

❗This is an advanced topic. If you would just like to access the data please see the featured datasets, links to classic IIDDA data, and the IIDDA API.

There are three alternatives, each with different pros and cons.

  1. Makefile (Host OS) Runs natively on the host OS with make handling dependencies.
    Pros: Simple to set up, no container overhead, leverages native tools.
    Cons: Requires make and other tools installed on the host system.
  2. Makefile (Docker) Runs inside a Docker container with make managing dependencies.
    Pros: Ensures consistency across environments, isolates dependencies.
    Cons: Slightly more complex setup, requiring Docker installation.
  3. Interactive (e.g., RStudio)
    Runs interactively in an IDE like RStudio on the host OS, without requiring make or docker.
    Pros: Easy for users unfamiliar with make or docker, ideal for debugging when contributing data/code/fixes.
    Cons: Requires manual understanding of dataset dependencies, less automated.

Running Natively

If you have all or most of the requirements, you can take the following three steps to make all of the derived datasets in the archive.

  1. (one-time) Clone this repository
  2. (one-time) make install
  3. make

For instructions on making a specific dataset see the Dependency Management section, but here is a simple example.

make derived-data/cdi_ca_1956_wk_prov_dbs/cdi_ca_1956_wk_prov_dbs.csv

Running in a Docker Container

The requirements are satisfied by a docker image that can be obtained with the following command.

docker pull stevencarlislewalker/iidda

With this image, one can skip steps 1 and 2 in the section on Running Natively and replace step 3 with the following command.

docker run --rm \
    -v "$(pwd):/usr/home/iidda" \
    stevencarlislewalker/iidda \
    make

Making specific datasets in the container can be done by modifying the make command to make a specific target. For example,

docker run --rm \
    -v "$(pwd):/usr/home/iidda" \
    stevencarlislewalker/iidda \
    make derived-data/cdi_ca_1956_wk_prov_dbs/cdi_ca_1956_wk_prov_dbs.csv

Datasets made in the container will be available in the derived-data directory, just as they would using make locally.

Running Interactively

The simplest way to reproduce an IIDDA dataset is to go into the pipelines directory and use a tool like RStudio to work with a source -- there is one source per sub-folder. Each source directory has sub-folders that may include any of the following.

  • scans -- Contains files of scans of original source documents.
  • digitizations -- Contains files in a format (typically .xlsx or .csv) that can be read into R or Python as tabular data rather than as images. Files in digitizations often have the same information as the files in scans, but in a format that is easier to read.
  • prep-scripts -- Contains scripts for generating a tidy derived dataset from the information in the other sub-folders.
  • access-scripts -- Contains scripts for programmatically obtaining scans or digitizations.

The scripts in prep-scripts can be run from the iidda project root directory to generate one or more datasets with metadata in a sub-folder of the top-level derived-data directory.
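
As a hypothetical sketch of this interactive workflow (the source and script names below are placeholders from the Project Structure example, not real files in the archive), one might run a prep script as follows.

setwd("/path/to/iidda")  # the project root
source("pipelines/source_1/prep-scripts/prep-script_1.R")
# the resulting dataset and metadata should appear under derived-data/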

❗The derived-data folder is not pushed to the central repository because its contents can be produced by running the prep-scripts.

❗This simple approach will not work if the dataset you are attempting to reproduce depends on another dataset that has not yet been made. You can find lists of the dependencies for a particular dataset in the dataset-dependencies folder. If you have make then you should be able to use this utility to automatically respect these dependencies.

Dependency Management

The Makefile can be used to build the entire derived-data directory by typing make into a terminal. To make a specific dataset, run make derived-data/{DATASET_ID}/{DATASET_ID}.csv. These commands require that all recommended requirements be met.

Dependencies are declared using the .d files in the dataset-dependencies folder, each of which lists the dependencies of the derived dataset of the same name. More technical dependencies (e.g., depending on the source metadata) do not need to be explicitly declared and are produced automatically in the .d files within the derived-data directory. The following table summarizes dependency declarations and automation.

| File Type | Purpose | Path Formula |
| --- | --- | --- |
| Derived dataset | Generated data that is of interest | derived-data/{DATASET_ID}/{DATASET_ID}.csv |
| User maintained dependency file | Manual editing allows the user to manage the dependencies of the derived dataset | dataset-dependencies/{DATASET_ID}/{DATASET_ID}.d |
| Generated dependency file | Updated version of the user maintained dependency file with technical changes that do not require user attention but are necessary for the build system | derived-data/{DATASET_ID}/{DATASET_ID}.d |

❗The derived-data directory is not pushed to the repository, because it is generated by pipelines. This is why only one of the path formulas above is associated with an active link. Most of the data declared in the dataset-dependencies folder is pushed to the API and can be accessed there if you would rather not go to the trouble of reproducing the datasets yourself.

Requirements

  • Necessary
  • Recommended
    • Unix-like OS (including macOS).
    • Different R packages are used to create different derived datasets. The r-package-recommendations-check.R script will install missing packages and check if any package versions are different from what the maintainer has used.
      • If you have make you should be able to run make install to get this package check (among other potentially useful things).
      • If you have Rscript you should be able to run Rscript R/r-package-recommendations-check.R.
  • See here for additional requirements that project maintainers must also satisfy.

Project Structure

Although the project contains several top-level directories, the most important are pipelines, dataset-dependencies, derived-data, metadata, and lookup-tables. The derived-data folder is not found within this central repository because its contents can be produced by running the prep-scripts. The following example illustrates the structure of these folders.

- pipelines
    - source_1
        - prep-scripts
            - prep-script_1.R
            - prep-script_1.R.json
            - ...
        - access-scripts
        - digitizations
            - digitization_1.xlsx
            - digitization_1.xlsx.json
            - ...
        - scans
            - scan_1.pdf
            - scan_1.pdf.json
            - ...
    - source_2
    - ...
- dataset-dependencies
    - tidy-dataset_1
        - tidy-dataset_1.d
    - tidy-dataset_2
        - tidy-dataset_2.d
    - ...
- derived-data
    - tidy-dataset_1
        - tidy-dataset_1.csv
        - tidy-dataset_1.json
        - tidy-dataset_1.d
    - tidy-dataset_2
        - tidy-dataset_2.csv
        - tidy-dataset_2.json
        - tidy-dataset_2.d
    - ...
- metadata
    - columns
        - column_1.json
        - column_2.json
        - ...
    - organizations
        - org_1.json
        - org_2.json
        - ...
    - sources
        - source_1.json
        - source_2.json
        - ...
    - tidy-datasets
        - tidy-dataset_1.json
        - tidy-dataset_2.json
        - ...
- lookup-tables
    - lookup-table-1.csv
    - lookup-table-2.csv
    - ...

Data Sources and Pipelines

Data sources are folders in the pipelines directory containing source data and source code. To create a new data source, create a new folder within the pipelines directory using a name that gives an identifier for the source.

Source Data

We distinguish between two types of source data: scans and digitizations. A scan is a file containing images of a hardcopy data source. We assume that such a file cannot be processed into a format that is usable by an epidemiologist without some form of manual data entry (although we recognize that AI is a fast moving field!). A digitization on the other hand is a file containing information that can be cleaned and processed using code. Examples of digitizations are csv and xlsx files, but also pdf files that can be reliably processed using data extraction tools. Scans of books on the other hand cannot be processed using such tools.

To contribute source data, create a new data source or find an existing source. Within this source folder, create scans and/or digitizations folders in which to place each scan and digitization file. The file name with the extension removed will become the unique identifier for that resource, so follow the rules and guidelines when creating these names. For each file, create a metadata file of the same name but with .json added after the existing extension. See other data sources for valid formats for scans and digitizations. Here is a typical example of a source with scans and digitizations folders.
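
The following sketch illustrates only the naming convention (the source and file names are hypothetical): the file name with its extension removed is the resource identifier, and the metadata file appends .json after the existing extension.

dir.create("pipelines/example_source/digitizations", recursive = TRUE)
# resource file; its identifier would be "example_digitization"
file.create("pipelines/example_source/digitizations/example_digitization.xlsx")
# metadata file for that resource
file.create("pipelines/example_source/digitizations/example_digitization.xlsx.json")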

Source Code

We distinguish between two types of source code: prep scripts and access scripts. A prep script is used to convert a digitization or set of digitizations into a tidier dataset or to support one or more such scripts. An access script is used to automatically access another data archive or portal to produce a file to be placed in a digitizations or scans folder. Source code file names should follow the same rules as source-data and are also each associated with a metadata file following the conventions outlined in source-data. Here is a typical example of a source with both prep-scripts and access-scripts folders.

Derived Data and Tidy Datasets

The data sources in the pipelines folder can be used to produce derived data that has been 'tidied'. These tidy datasets are the ultimate goal of the project. Each dataset has metadata. See here for how to reproduce all of these datasets, and for pointers on how to avoid the trouble of reproducing them.

Identifiers

Several types of entities in the archive are each associated with a unique and human-readable identifier that will never change. Examples of such entities include data sources, resources within a data source (e.g., a scan of an old book), and datasets that can be derived from source material.

For example, the dataset cdi_bot_ca_1933-55_wk_prov contains data on the communicable disease incidence (cdi) of botulism (bot) in Canada (ca) from 1933-55 (1933-55), weekly (wk), broken down by province (prov).

We do our best to keep the underscore-delimited format of the identifiers consistent, but our only promises about the identifiers are as follows.

  • They contain only lowercase letters, digits, underscores, and dashes.
  • They never change.
  • Along with the type of entity, they uniquely identify an entity.

To clarify the last point, no two entities of the same type have the same identifier, but different types of entities can share an identifier to indicate a close association. For example, a tidy dataset and the prep script that produces it should have the same identifier.

Dots cannot be used in dataset identifiers as they would interfere with assumptions made by the tools.
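
As a small illustration of these rules (not part of the IIDDA tooling), an identifier can be checked with a pattern that allows only lowercase letters, digits, underscores, and dashes.

is_valid_id <- function(id) grepl("^[a-z0-9_-]+$", id)
is_valid_id("cdi_bot_ca_1933-55_wk_prov")  # TRUE
is_valid_id("cdi.bot.ca_1933-55")          # FALSE -- dots are not allowed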

Metadata

All entities associated with an identifier are also associated with metadata. The following table illustrates how to find the metadata for each type of entity.

| Type of Entity | Synonym | Path Formula (with example link) |
| --- | --- | --- |
| Source | Pipeline | metadata/sources/{SOURCE_ID}.json |
| Tidy Dataset | Derived Data | metadata/tidy-datasets/{DATASET_ID}.json |
| Column | | metadata/columns/{COLUMN_NAME}.json |
| Digitization | | pipelines/{SOURCE_ID}/digitizations/{DIGITIZATION_ID}.{FILE_EXT}.json |
| Scan | | pipelines/{SOURCE_ID}/scans/{SCAN_ID}.{FILE_EXT}.json |
| Prep Script | | pipelines/{SOURCE_ID}/prep-scripts/{PREP_ID}.{FILE_EXT}.json |
| Access Script | | pipelines/{SOURCE_ID}/access-scripts/{ACCESS_ID}.{FILE_EXT}.json |

Ultimately we want to remove the need for synonyms, which arose organically while producing the archive.
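
For example, a metadata file can be read using the path formulas above; this sketch assumes the jsonlite package and uses a placeholder source ID.

library(jsonlite)
source_metadata <- read_json("metadata/sources/source_1.json")  # placeholder ID
str(source_metadata, max.level = 1)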

Lookup Tables

The datasets in the lookup-tables folder are useful for data harmonization. Each lookup table is produced in a partially manual and partially automated manner. Each lookup table is associated with a derived dataset that summarizes all of the unique historical names in the datasets declared in the .d dependency file for that dataset. The script that produces this derived dataset also produces a lookup table, which has additional columns that define the harmonized names. If new historical names are discovered, an error message prompts the pipeline author to update the lookup table with harmonized names for the new historical names. Once this lookup table contains harmonized names for all historical names, it can be used to harmonize the names of any dataset through a dataset join. This manual/automated hybrid is an example of a human-in-the-loop system.
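
To make the join concrete, here is a minimal base-R sketch of harmonizing historical disease names with a lookup table. The file paths are placeholders, and the column names come from the disease name lookup described under Featured Datasets.

lookup  <- read.csv("lookup-tables/example-disease-lookup.csv")          # placeholder path
records <- read.csv("derived-data/example-dataset/example-dataset.csv")  # placeholder path
harmonized <- merge(
  records,
  lookup[, c("historical_disease", "disease", "nesting_disease")],
  by = "historical_disease",
  all.x = TRUE
)
# rows with NA in `disease` correspond to historical names that still need
# harmonized entries in the lookup table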

Contributions

Thank you 🙏

Contributing Source Data and Pipelines

Just create a sub-folder of pipelines, and place source data in its digitizations or scans sub-folders.

That's it ... unless you want a gold star, in which case please do contribute prep script source code and do as much of the following as possible.

This is probably not enough information, but if you are interested in contributing, please contact the maintainer, who would be happy to help and perhaps expand the docs on how to contribute.

Contributing Fixes to Data and Pipelines

Make a change to something in the pipelines folder and open a pull request. If you are just fixing data entry errors, that's all there is to do. If you are fixing code please read Reproducing IIDDA Datasets.

Contributing to IIDDA Project Development

Please contact the maintainer if you would like to contribute more than data and pipelines for processing them.

There are additional requirements for those involved in project development.

This additional setup allows one to deploy datasets to the IIDDA API using make commands of the following type.

make derived-data/{DATASET_ID}/{DATASET_ID}.deploy

One may also delete dataset versions from the API using the DeleteVersions class in the iidda-utilities python package.

Maintainer

https://github.com/stevencarlislewalker

Funding

This work was supported by NSERC through the CANMOD network.