# International Infectious Disease Data Archive (IIDDA)
David Earn started the IIDDA project to make historical epidemiological data available to the research community. This GitHub repository replaces classic IIDDA, which is currently offline. The classic IIDDA datasets are here.
The following table contains links that will download a zip archive containing one or more datasets and DataCite 4.3 metadata, as well as links to these metadata. The metadata include lists of all of the files used to produce the associated dataset. To understand how these links work please go here. The datasets below are classified as unharmonized, harmonized, and normalized -- please see the section on data harmonization for an explanation of these terms.
The CANMOD network funded the systematic curation and digitization of historical Canadian infectious disease data. Released data from this project appear in the table below.
❗Please acknowledge any use of these data by citing this preprint.
Description | Links | Uncompressed Size | Compressed Size | Breakdown | Shortest Frequency | Time Range | Command to reproduce |
---|---|---|---|---|---|---|---|
Canadian Disease Incidence Data (CANDID), Unharmonized | Data, Metadata | 335MB | 11.2MB | prov/disease | wk,mo,qr,yr (depending on breakdown) | 1903-2020 | make derived-data/canmod-cdi-unharmonized/canmod-cdi-unharmonized.csv |
Canadian Disease Incidence Data (CANDID), Harmonized | Data, Metadata | 266MB | 9.1MB | prov/disease | wk,mo,qr,yr (depending on breakdown) | 1903-2020 | make derived-data/canmod-cdi-harmonized/canmod-cdi-harmonized.csv |
Canadian Disease Incidence Data (CANDID), Normalized | Data, Metadata | 235MB | 10.1MB | prov/disease | wk,mo,qr,yr (depending on breakdown) | 1903-2020 | make derived-data/canmod-cdi-normalized/canmod-cdi-normalized.csv |
Unharmonized population | Data | 33.5MB | 2.5MB | prov/sex/age-group | yr,10yr | 1881-2020 | Not a single command |
Normalized population | Data, Metadata | 2.5MB | 0.5MB | prov | wk (interpolated) | 1881-2020 | make derived-data/canmod-pop-normalized/canmod-pop-normalized.csv |
Name harmonization for the harmonized and normalized files is done using the following lookup tables.
- Disease name lookup
  - Harmonized names are in `disease` and `nesting_disease`
  - Historical names are in `historical_disease`, `historical_disease_family`, and `historical_disease_subclass`
  - Remaining columns provide context and notes on how the mappings were chosen
- Location name lookup
  - Harmonized names are in `iso_3166` and `iso_3166_2` (https://www.iso.org/iso-3166-country-codes)
  - Historical names are in `location`
  - Context for `location` is in `location_type`
The current results on cross-tabulations for checking data quality in this project can be found here.
An example of investigating the provenance of a strange smallpox record in these data is here.
The above tables contain links to featured data, but all data in the archive can be accessed using this API.
The list of all dataset IDs in the API can be found here. To download any of these datasets, along with their metadata, one may use the following URL formula.
```
https://math.mcmaster.ca/iidda/api/download?resource=csv&resource=metadata&dataset_ids={DATASET_ID}
```
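For example, one way to fetch a dataset from R using this formula -- a sketch, not from the official docs; it assumes the response is an archive, like the featured-data downloads above, and uses a dataset ID that appears later in this README.

```r
# Sketch: download one dataset plus metadata via the IIDDA API URL formula.
dataset_id <- "cdi_ca_1956_wk_prov_dbs"   # example ID; see the list of all dataset IDs
url <- paste0(
  "https://math.mcmaster.ca/iidda/api/download",
  "?resource=csv&resource=metadata&dataset_ids=", dataset_id
)
# Assumes the endpoint returns an archive; adjust destfile if it returns raw CSV.
download.file(url, destfile = paste0(dataset_id, ".zip"), mode = "wb")
```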
There is also an R binding of the API. Here is a quick-start guide.
All fields in IIDDA datasets must appear in the data dictionary. If new fields must be added, a column metadata file needs to be added to this directory.
The featured datasets are each classified as one of the following types.
- Unharmonized: Minimally processed to allow data from different sources to be stored in the same long-format dataset.
- Harmonized: Excludes low-quality records and includes location and disease names that simplify the combination of data from different sources (e.g., poliomyelitis whenever infantile paralysis is reported historically).
- Normalized: Excludes overlapping data, enabling aggregation without double-counting and facilitating integration of complementary data. All normalized datasets are also harmonized.
Please see the following references for background on these terms.
- A general primer for data harmonization
- Harmonization-information trade-offs for sharing individual participant data in biomedicine
- Tidy data
The files in lookup-tables are used in the harmonization of historical names.
❗This is an advanced topic. If you would just like to access the data please see the featured datasets, links to classic IIDDA data, and the IIDDA API.
There are three alternative ways to build the datasets, each with different pros and cons.
- Makefile (Host OS)
  - Runs natively on the host OS with `make` handling dependencies.
  - Pros: Simple to set up, no container overhead, leverages native tools.
  - Cons: Requires `make` and other tools installed on the host system.
- Makefile (Docker)
  - Runs inside a Docker container with `make` managing dependencies.
  - Pros: Ensures consistency across environments, isolates dependencies.
  - Cons: Slightly more complex setup, requiring Docker installation.
- Interactive (e.g., RStudio)
  - Runs interactively in an IDE like RStudio on the host OS, without requiring `make` or `docker`.
  - Pros: Easy for users unfamiliar with `make` or `docker`; ideal for debugging when contributing data/code/fixes.
  - Cons: Requires manual understanding of dataset dependencies, less automated.
If you have all or most of the requirements, you can try taking the following three steps to make all of the derived datasets in the archive.
- (one-time) Clone this repository
- (one-time) `make install`
- `make`
For instructions on making a specific dataset see the Dependency Management section, but here is a simple example.
```
make derived-data/cdi_ca_1956_wk_prov_dbs/cdi_ca_1956_wk_prov_dbs.csv
```
The requirements are satisfied by a docker image that can be obtained with the following command.
```
docker pull stevencarlislewalker/iidda
```
With this image, one can skip steps 1 and 2 in the section on Running Locally and replace step 3 with the following command.
```
docker run --rm \
  -v "$(pwd):/usr/home/iidda" \
  stevencarlislewalker/iidda \
  make
```
Making specific datasets in the container can be done by modifying the `make` command to make a specific target. For example,

```
docker run --rm \
  -v "$(pwd):/usr/home/iidda" \
  stevencarlislewalker/iidda \
  make derived-data/cdi_ca_1956_wk_prov_dbs/cdi_ca_1956_wk_prov_dbs.csv
```

Datasets made in the container will be available in the `derived-data` directory, just as they would be when using `make` locally.
The simplest way to reproduce an IIDDA dataset is to go into the pipelines directory and use a tool like RStudio to work with a source -- there is one source per sub-folder. Each source directory has sub-folders that may include any of the following.
- `scans` -- Contains files of scans of original source documents.
- `digitizations` -- Contains files in a format (typically `.xlsx` or `.csv`) that can be read into R or Python as tabular data, as opposed to as images. Files in `digitizations` often have the same information as the files in `scans`, but in a format that is easier to read.
- `prep-scripts` -- Contains scripts for generating a tidy derived dataset from the information in the other sub-folders.
- `access-scripts` -- Contains scripts for programmatically obtaining `scans` or `digitizations`.
The scripts in `prep-scripts` can be run from the `iidda` project root directory to generate one or more datasets with metadata in a sub-folder of the top-level `derived-data` directory.
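For example, a single prep script could be run interactively like this -- a sketch that assumes the working directory is the project root and uses placeholder source and script names (real ones are listed in the pipelines folder).

```r
# Run one prep script from the project root; its outputs go to a
# sub-folder of derived-data.
source("pipelines/source_1/prep-scripts/prep-script_1.R")
```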
❗The `derived-data` folder is not pushed to the central repository because its contents can be produced by running the `prep-scripts`.
❗This simple approach will not work if the dataset you are attempting to reproduce depends on another dataset that has not yet been made. You can find lists of the dependencies for a particular dataset in the dataset-dependencies folder. If you have `make` then you should be able to use this utility to automatically respect these dependencies.
The Makefile can be used to build the entire `derived-data` directory by typing `make` into a terminal. To make a specific dataset, use `make derived-data/{DATASET_ID}/{DATASET_ID}.csv`. These commands require that all recommended requirements be met.
Dependencies are declared using the `.d` files in the dataset-dependencies folder, each of which lists the dependencies of the derived dataset of the same name. More technical dependencies (e.g., depending on the source metadata) do not need to be explicitly declared and are produced automatically in the `.d` files within the `derived-data` directory. The following table summarizes dependency declarations and automation.
File Type | Purpose | Path Formula |
---|---|---|
Derived dataset | Generated data that is of interest | derived-data/{DATASET_ID}/{DATASET_ID}.csv |
User maintained dependency file | Manual editing allows user to manage the dependencies of the derived dataset | dataset-dependencies/{DATASET_ID}/{DATASET_ID}.d |
Generated dependency file | Updated version of the user maintained dependency file with technical changes that do not require user attention but are necessary for the build system | derived-data/{DATASET_ID}/{DATASET_ID}.d |
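For example, for the dataset ID `cdi_ca_1956_wk_prov_dbs` used earlier, these path formulas resolve to the following.

```
derived-data/cdi_ca_1956_wk_prov_dbs/cdi_ca_1956_wk_prov_dbs.csv
dataset-dependencies/cdi_ca_1956_wk_prov_dbs/cdi_ca_1956_wk_prov_dbs.d
derived-data/cdi_ca_1956_wk_prov_dbs/cdi_ca_1956_wk_prov_dbs.d
```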
❗The `derived-data` directory is not pushed to the repository, because it is generated by pipelines. This is why only one of the path formulas above is associated with an active link. Most of the data declared in the dataset-dependencies folder is pushed to the API and can be accessed there if you would rather not go through the trouble of reproducing the datasets yourself.
- Necessary
  - R > 4.0.
  - Have Rscript on the path.
  - The iidda, iidda.analysis, and iidda.api R packages included in iidda-tools. Please follow these instructions to install all three packages.
  - Make.
- Recommended
  - Unix-like OS (includes macOS).
  - Different R packages are used to create different derived datasets. The r-package-recommendations-check.R script will install missing packages and check if any package versions are different from what the maintainer has used.
    - If you have `make` you should be able to run `make install` to get this package check (among other potentially useful things).
    - If you have `Rscript` you should be able to run `Rscript R/r-package-recommendations-check.R`.
- See here for additional requirements that project maintainers must also satisfy.
Although the project contains several top-level directories, the most important are pipelines, dataset-dependencies, `derived-data`, metadata, and lookup-tables. The `derived-data` folder is not found within this central repository because its contents can be produced by running the `prep-scripts`. The following example illustrates the structure of these folders.
- pipelines
  - source_1
    - prep-scripts
      - prep-script_1.R
      - prep-script_1.R.json
      - ...
    - access-scripts
    - digitizations
      - digitization_1.xlsx
      - digitization_1.xlsx.json
      - ...
    - scans
      - scan_1.pdf
      - scan_1.pdf.json
      - ...
  - source_2
    - ...
- dataset-dependencies
  - tidy-dataset_1
    - tidy-dataset_1.d
  - tidy-dataset_2
    - tidy-dataset_2.d
  - ...
- derived-data
  - tidy-dataset_1
    - tidy-dataset_1.csv
    - tidy-dataset_1.json
    - tidy-dataset_1.d
  - tidy-dataset_2
    - tidy-dataset_2.csv
    - tidy-dataset_2.json
    - tidy-dataset_2.d
  - ...
- metadata
  - columns
    - column_1.json
    - column_2.json
    - ...
  - organizations
    - org_1.json
    - org_2.json
    - ...
  - sources
    - source_1.json
    - source_2.json
    - ...
  - tidy-datasets
    - tidy-dataset_1.json
    - tidy-dataset_2.json
    - ...
- lookup-tables
  - lookup-table-1.csv
  - lookup-table-2.csv
  - ...
Data sources are folders in the `pipelines` directory containing source data and source code. To create a new data source, create a new folder within the `pipelines` directory using a name that gives an identifier for the source.
We distinguish between two types of source data: scans and digitizations. A scan is a file containing images of a hardcopy data source. We assume that such a file cannot be processed into a format that is usable by an epidemiologist without some form of manual data entry (although we recognize that AI is a fast moving field!). A digitization on the other hand is a file containing information that can be cleaned and processed using code. Examples of digitizations are `csv` and `xlsx` files, but also `pdf` files that can be reliably processed using data extraction tools. Scans of books on the other hand cannot be processed using such tools.
To contribute source data, create a new data source or find an existing source. Within this source folder create `scans` and/or `digitizations` folders to place each scan and digitization file. The file name with the extension removed will become the unique identifier for that resource, so follow the rules and guidelines when creating these names. For each file, create a metadata file of the same name but with `.json` added after the existing extension. See other data sources for valid formats for scans and digitizations. Here is a typical example of a source with scans and digitizations folders.
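For example, a new source could be laid out from R roughly as follows -- a sketch with a hypothetical source identifier and file names, not a required workflow.

```r
# Sketch only: hypothetical source identifier and resource names.
source_id <- "example_source"
dir.create(file.path("pipelines", source_id, "digitizations"), recursive = TRUE)
dir.create(file.path("pipelines", source_id, "scans"), recursive = TRUE)
# Each resource file is paired with a metadata file named by appending .json
# to the full file name, e.g.
#   pipelines/example_source/digitizations/table_1.xlsx
#   pipelines/example_source/digitizations/table_1.xlsx.json
```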
We distinguish between two types of source code: prep scripts and access scripts. A prep script is used to convert a digitization or set of digitizations into a tidier dataset, or to support one or more such scripts. An access script is used to automatically access another data archive or portal to produce a file to be placed in a `digitizations` or `scans` folder. Source code file names should follow the same rules as source-data and are also each associated with a metadata file following the conventions outlined in source-data. Here is a typical example of a source with both prep-script and access-script folders.
The data sources in the `pipelines` folder can be used to produce derived data that has been 'tidied'. These derived datasets are the ultimate goal of the project. Each dataset has metadata. See here for how to reproduce all of these datasets, and for pointers on how to avoid going through the trouble of reproducing them.
The following types of entities in the archive are each associated with a unique and human-readable identifier that will never change.
For example, the dataset `cdi_bot_ca_1933-55_wk_prov` contains data on the communicable disease incidence (`cdi`) of botulism (`bot`) in Canada (`ca`) from 1933-55 (`1933-55`), weekly (`wk`), broken down by province (`prov`). Examples of entities include data sources, resources within a data source (e.g., a scan of an old book), and datasets that can be derived from source material.
We do our best to keep the underscore-delimited format of the identifiers consistent, but our only promises about the identifiers are as follows.
- They contain only lowercase letters, digits, underscores, and dashes.
- They never change.
- Along with the type of entity, they uniquely identify an entity.
To clarify the last point, no two entities of the same type have the same identifier, but different types of entities can share an identifier to indicate a close association. For example, a tidy dataset and the prep script that produces it should have the same identifier.
Dots cannot be used in dataset identifiers as they would interfere with assumptions made by the tools.
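As a quick illustration of the character rules, the following check (not part of the project tooling) accepts only identifiers made of the promised characters.

```r
# Returns TRUE only for identifiers made of lowercase letters, digits,
# underscores, and dashes.
is_valid_id <- function(x) grepl("^[a-z0-9_-]+$", x)
is_valid_id("cdi_bot_ca_1933-55_wk_prov")  # TRUE
is_valid_id("Botulism.Canada")             # FALSE: uppercase letters and a dot
```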
All entities associated with an identifier are also associated with metadata. The following table illustrates how to find the metadata for each type of entity.
Ultimately we want to remove the need for synonyms, which arose organically while producing the archive.
The datasets in the lookup-tables folder are useful for data harmonization. Each lookup table is produced in a partially manual and partially automated manner. Each lookup table is associated with a derived dataset that summarizes all of the unique historical names in the datasets declared in the `.d` dependency file for that dataset. The script that produces this derived dataset also produces a lookup table, which has additional columns that define the harmonized names. If new historical names are discovered, an error message prompts the pipeline author to update the lookup table with harmonized names for the new historical names. Once this lookup table contains harmonized names for all historical names, it can be used to harmonize the names of any dataset through a dataset join. This manual/automated hybrid is an example of a human-in-the-loop system.
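For example, a harmonizing join might look roughly like this in R -- a sketch that assumes the disease lookup-table columns described earlier, an unharmonized dataset with a `historical_disease` column, and hypothetical file paths.

```r
library(dplyr)
library(readr)

# Hypothetical paths; real lookup tables live in lookup-tables/ and real
# unharmonized datasets in derived-data/ (once they have been made).
lookup <- read_csv("lookup-tables/disease-lookup.csv")
unharmonized <- read_csv("derived-data/example-dataset/example-dataset.csv")

# Attach harmonized disease names by joining on the historical name.
harmonized <- unharmonized |>
  left_join(
    lookup |> select(historical_disease, disease, nesting_disease),
    by = "historical_disease"
  )
```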
Thank you 🙏
Just create a sub-folder of pipelines, and place source data in its `digitizations` or `scans` sub-folders.
That's it ... unless you want a gold star, in which case please do contribute prep script source code and do as much of the following as possible.
- Before embarking on prep scripting, please make sure that these requirements are satisfied.
- Write R scripts that prepare these data using valid IIDDA columns -- see `?iidda::register_prep_script` before starting.
- Generate valid IIDDA metadata for data sources, derived data, and source data using `iidda::register_prep_script`.
This is probably not enough information, but if you are interested in contributing please contact the maintainer, who would be happy to help and perhaps expand the docs on how to contribute.
Make changes to something in the pipelines folder and open a pull request. If you are just fixing data entry errors, that's all there is to do. If you are fixing code, please read Reproducing IIDDA Datasets.
Please contact the maintainer if you would like to contribute more than data and pipelines for processing them.
There are additional requirements for those involved in project development.
- Python >= 3.9.
- iidda-utilities (Private repo of Python and R tools. Contact the maintainer for access.)
- The iidda_api Python package included in iidda-tools.
This additional setup allows one to deploy datasets to the IIDDA API using `make` commands of the following type.

```
make derived-data/{DATASET_ID}/{DATASET_ID}.deploy
```
One may also delete dataset versions from the API using the `DeleteVersions` class in the `iidda-utilities` Python package.
https://github.com/stevencarlislewalker
This work was supported by NSERC through the CANMOD network.