epicentre-msf/datadict-cmd

datadict-cmd: Utilities to run the R package datadict from the command line

Installing required R packages

From command line:

Rscript R/_install_deps.R

Or, from R:

install.packages(c("readxl", "remotes"))
remotes::install_github("epicentre-msf/datadict")

Check if data dictionary is valid

Rscript R/cmd_valid_dict.R [dict] [verbose]

Arguments:

Note that command-line arguments are currently unnamed and so must be specified in order (this may be improved in the future).

  1. dict: path to data dictionary file (must be .xlsx)
  2. verbose: TRUE/FALSE indicating whether to give warnings describing the checks that have failed (if any). Optional, defaults to TRUE.

Outputs:

TRUE if all checks pass, FALSE if any check fails. If verbose = TRUE and any check fails, a warning describing the failed checks is also printed.

Examples:

Path to valid dictionary, verbose unspecified (defaults to TRUE)

$ Rscript R/cmd_valid_dict.R data/dict_valid.xlsx
[1] TRUE

Path to valid dictionary, verbose = FALSE

$ Rscript R/cmd_valid_dict.R data/dict_valid.xlsx FALSE
[1] TRUE

Path to nonvalid dictionary, verbose unspecified (defaults to TRUE)

$ Rscript R/cmd_valid_dict.R data/dict_nonvalid.xlsx
[1] FALSE
Warning message:
- Missing values in column(s): "type"
- Duplicated values in column `variable_name`: "source_water" 

Path to nonvalid dictionary, verbose = FALSE

$ Rscript R/cmd_valid_dict.R data/dict_nonvalid.xlsx FALSE
[1] FALSE
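
Because the result is printed as `[1] TRUE` or `[1] FALSE`, a calling shell script can branch on it. The following is a minimal sketch, assuming the result line is written to standard output as its first line, as in the examples above. The helper name `dict_check_passed` is hypothetical and not part of datadict-cmd.

```shell
# Hypothetical helper (not part of datadict-cmd): succeeds only when the
# first line of the captured script output is exactly "[1] TRUE".
dict_check_passed() {
  first_line=$(printf '%s\n' "$1" | head -n 1)
  [ "$first_line" = "[1] TRUE" ]
}

# Usage sketch (requires R and this repository):
#   out=$(Rscript R/cmd_valid_dict.R data/dict_valid.xlsx FALSE)
#   if dict_check_passed "$out"; then echo "dictionary OK"; fi
```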

Check if dataset is valid (i.e. corresponds to data dictionary)

Rscript R/cmd_valid_data.R [data] [dict] [format_coded] [verbose]

Arguments:

  1. data: path to dataset file (must be .xlsx), only first sheet is read
  2. dict: path to data dictionary file (must be .xlsx)
  3. format_coded: whether Coded-list type variables are encoded as raw values (“value”) or labels (“label”) within the dataset. E.g. the variable sex might be coded as 0/1 (“value”) or “Male”/“Female” (“label”). Defaults to “label”. Must be specified by the user uploading the data.
  4. verbose: TRUE/FALSE indicating whether to give warnings describing the checks that have failed (if any). Optional, defaults to TRUE.

Outputs:

TRUE if all checks pass, FALSE if any check fails. If verbose = TRUE and any check fails, a warning describing the failed checks is also printed. If the dictionary itself does not pass all checks, the script fails with an error.

Examples:

Path to valid dataset, verbose unspecified (defaults to TRUE)

$ Rscript R/cmd_valid_data.R data/data_valid.xlsx data/dict_valid.xlsx
[1] TRUE

Path to nonvalid dataset, verbose unspecified (defaults to TRUE)

$ Rscript R/cmd_valid_data.R data/data_nonvalid.xlsx data/dict_valid.xlsx
[1] FALSE
Warning message:
- Columns defined in `dict` but not present in `data`: "ilness_other"
- Variables of type 'Numeric' contain nonvalid values: "age_years" 

Path to nonvalid dataset, set verbose to FALSE

$ Rscript R/cmd_valid_data.R data/data_nonvalid.xlsx data/dict_valid.xlsx label FALSE
[1] FALSE

Path to valid dataset, but set format_coded to “value” when in fact the format in the dataset is “label”

$ Rscript R/cmd_valid_data.R data/data_valid.xlsx data/dict_valid.xlsx value
[1] FALSE
Warning message:
- Variables of type 'Coded list' contain nonvalid values: "location", "cluster", "source_water", "sex", "age_under_one", "arrived", "departed", "born", "died", "illness", "oedema", "source_water_other", "cause_death", "cause_death_other", "ilness_other" 

Path to valid dataset, but dictionary is nonvalid

$ Rscript R/cmd_valid_data.R data/data_valid.xlsx data/dict_nonvalid.xlsx
Error: Dictionary does not pass all checks
Execution halted
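
The dataset check therefore has three possible outcomes: TRUE, FALSE, or an error when the dictionary is nonvalid. A minimal sketch of telling them apart in a calling script, assuming Rscript exits with a non-zero status when the script stops with an error and that the result line is the first line on standard output. The helper name `classify_validation` is hypothetical.

```shell
# Hypothetical helper (not part of datadict-cmd).
# $1: exit status of the Rscript call; $2: its captured standard output.
# Prints "error", "valid", or "invalid".
classify_validation() {
  status=$1
  first_line=$(printf '%s\n' "$2" | head -n 1)
  if [ "$status" -ne 0 ]; then
    echo "error"      # e.g. the dictionary did not pass its own checks
  elif [ "$first_line" = "[1] TRUE" ]; then
    echo "valid"
  else
    echo "invalid"
  fi
}

# Usage sketch (requires R and this repository):
#   out=$(Rscript R/cmd_valid_data.R data/data_valid.xlsx data/dict_valid.xlsx label FALSE)
#   classify_validation "$?" "$out"
```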

Check k anonymity, with manual specification of indirect identifiers

Rscript R/cmd_k_anonymity.R [data] [vars]

Arguments:

  1. data: path to dataset file (must be .xlsx), only first sheet is read
  2. vars: comma-separated list of the variables to treat as indirect identifiers

Outputs:

Integer, the observed minimum value of k in the dataset. If this value is greater than or equal to the pre-specified k anonymity threshold for the project, then the data is sufficiently pseudonymized. If the observed value of k is lower than the threshold, further pseudonymization is required.
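
The observed k can be read as the size of the smallest group of rows that share the same combination of indirect-identifier values. The sketch below illustrates that idea with standard shell tools on a headerless CSV file, not the .xlsx input the script expects. The function name `min_group_size` is hypothetical, and this is an illustration of the concept rather than how datadict computes k internally.

```shell
# Hypothetical illustration (not part of datadict-cmd).
# $1: headerless CSV file with no embedded commas in values
# $2: comma-separated field numbers of the indirect identifiers, e.g. 1,2
# Prints the size of the smallest group of identical identifier combinations.
min_group_size() {
  cut -d, -f"$2" "$1" | sort | uniq -c |
    awk 'NR == 1 || $1 + 0 < min { min = $1 + 0 } END { print min }'
}
```

Adding more identifier columns can only split groups further, which is why k tends to drop as variables are added to the list.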

Examples:

Assuming a pre-specified k of 5, the example below is sufficiently pseudonymized

$ Rscript R/cmd_k_anonymity.R data/data_valid.xlsx location,cluster,sex
[1] 37

Assuming a pre-specified k of 5, the example below is not sufficiently pseudonymized

$ Rscript R/cmd_k_anonymity.R data/data_valid.xlsx location,cluster,sex,source_water
[1] 1

Specify a variable that doesn’t exist in the dataset

$ Rscript R/cmd_k_anonymity.R data/data_valid.xlsx location,var_doesnt_exit
Error: The following variables do no exist in the dataset: "var_doesnt_exit"
Execution halted

Check k anonymity, pulling indirect identifiers from data dictionary

Rscript R/cmd_k_anonymity_dict.R [data] [dict]

Arguments:

  1. data: path to dataset file (must be .xlsx), only first sheet is read
  2. dict: path to data dictionary file (must be .xlsx)

Outputs:

Integer, the observed minimum value of k in the dataset. If this value is greater than or equal to the pre-specified k anonymity threshold for the project, then the data is sufficiently pseudonymized. If the observed value of k is lower than the threshold, further pseudonymization is required.

Examples:

Assuming a pre-specified k of 5, the example below is sufficiently pseudonymized

$ Rscript R/cmd_k_anonymity_dict.R data/data_valid.xlsx data/dict_valid.xlsx
[1] 37
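
Comparing the reported k against the project threshold can be automated. A minimal sketch, assuming the script prints a single line of the form `[1] 37` as in the examples above. The helper name `k_meets_threshold` is hypothetical, and the threshold of 5 is only the example value used in this README.

```shell
# Hypothetical helper (not part of datadict-cmd).
# $1: captured output line such as "[1] 37"; $2: required k threshold.
# Succeeds when the observed k is greater than or equal to the threshold.
k_meets_threshold() {
  k=${1#"[1] "}   # strip R's "[1] " index prefix
  [ "$k" -ge "$2" ]
}

# Usage sketch (requires R and this repository):
#   out=$(Rscript R/cmd_k_anonymity_dict.R data/data_valid.xlsx data/dict_valid.xlsx)
#   if k_meets_threshold "$out" 5; then
#     echo "sufficiently pseudonymized"
#   else
#     echo "further pseudonymization required"
#   fi
```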
