add documentation about constant data removal process. #191

Merged: 7 commits, Oct 25, 2024
37 changes: 29 additions & 8 deletions R/remove_constants.R
@@ -18,23 +18,44 @@
 #'
 #' # introduce an empty column
 #' data$empty_column <- NA
+
+#' # introduce some missing values across some columns
+#' data$study_id[3] = NA_character_
+#' data$date.of.admission[3] = NA_character_
+#' data$date.of.admission[4] = NA_character_
+#' data$dateOfBirth[3] = NA_character_
+#' data$dateOfBirth[4] = NA_character_
+#' data$dateOfBirth[5] = NA_character_
 #'
-#' # remove the constant columns, empty rows and columns where empty rows and
-#' # columns are defined as those with 100% NA.
-#' dat <- remove_constants(
+#' # with cutoff = 1, rows 3, 4, and 5 are not removed
+#' test <- cleanepi::remove_constants(
 #'   data = data,
 #'   cutoff = 1
 #' )
 #'
-#' # remove the constant columns, empty rows and columns where empty rows and
-#' # columns are defined as those with 50% NA.
-#' dat <- remove_constants(
-#'   data = data,
+#' # drop rows or columns with a percentage of constant values
+#' # equal to or more than 50%
+#' test <- cleanepi::remove_constants(
+#'   data = test,
 #'   cutoff = 0.5
 #' )
 #'
+#' # drop rows or columns with a percentage of constant values
+#' # equal to or more than 25%
+#' test <- cleanepi::remove_constants(
+#'   data = test,
+#'   cutoff = 0.25
+#' )
+#'
+#' # drop rows or columns with a percentage of constant values
+#' # equal to or more than 15%
+#' test <- cleanepi::remove_constants(
+#'   data = test,
+#'   cutoff = 0.15
+#' )
+#'
 #' # check the report to see what has happened
-#' report <- attr(dat, "report")
+#' report <- attr(test, "report")
 #' report$constant_data
 remove_constants <- function(data, cutoff = 1.0) {
   checkmate::assert_number(cutoff, lower = 0.0, upper = 1.0, na.ok = FALSE,
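
The roxygen examples in this hunk call `remove_constants()` with `cutoff` values of 1, 0.5, 0.25 and 0.15. As a rough illustration of what such a threshold can be compared against, the base R sketch below computes the fraction of missing values per row and per column of a toy data frame and flags those at or above a cutoff. The data frame, the variable names, and the `>=` rule are assumptions made for illustration only; they are not taken from the {cleanepi} source, whose exact comparison may differ.

```r
# Toy data frame (hypothetical values, not the package's example data).
df <- data.frame(
  id    = c("a", "b", NA, "d"),
  admit = c("2020-01-01", NA, NA, "2020-01-04"),
  birth = c("1990-01-01", NA, NA, NA),
  stringsAsFactors = FALSE
)

# Fraction of missing values in each row and in each column.
row_na_fraction <- rowMeans(is.na(df))
col_na_fraction <- colMeans(is.na(df))

# Assumed rule: flag anything whose NA fraction is at or above the cutoff.
cutoff <- 0.5
rows_flagged <- which(row_na_fraction >= cutoff)
cols_flagged <- names(df)[col_na_fraction >= cutoff]
```

Lowering `cutoff` in this sketch flags more rows and columns, which is the direction the successive calls in the examples move in.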
36 changes: 28 additions & 8 deletions man/remove_constants.Rd

Some generated files are not rendered by default.

3 changes: 2 additions & 1 deletion vignettes/cleanepi.Rmd
@@ -217,7 +217,8 @@ dat %>%
 fixed_thead = TRUE)
 ```
 
-The `remove_constants()` function returns a dataset where all constant columns, empty rows and columns are removed.
+The `remove_constants()` function returns a dataset from which all constant columns, empty rows, and empty columns have been iteratively removed.
+If the first iteration of constant data removal produces a dataset with new empty rows, empty columns, or constant columns, the process is repeated until no constant data remains. Rows and columns deleted at any iteration are listed in the report object.
 
 
 ## Cleaning column names
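
To see why a single pass may not be enough, consider the minimal base R sketch below. The three helper functions and the toy data frame are hypothetical and written only for this illustration; they do not reproduce the {cleanepi} or {janitor} code. Dropping the constant `id` column leaves row 2 entirely `NA`, so a second pass is needed to remove it.

```r
# Hypothetical helpers in base R, for illustration only.
drop_empty_rows <- function(df) {
  df[rowSums(!is.na(df)) > 0, , drop = FALSE]
}
drop_empty_cols <- function(df) {
  df[, colSums(!is.na(df)) > 0, drop = FALSE]
}
drop_constant_cols <- function(df) {
  is_constant <- vapply(df, function(col) length(unique(col)) <= 1, logical(1))
  df[, !is_constant, drop = FALSE]
}

# One pass: empty rows, then empty columns, then constant columns.
single_pass <- function(df) {
  drop_constant_cols(drop_empty_cols(drop_empty_rows(df)))
}

df <- data.frame(
  id = c(1, 1, 1),
  v  = c("a", NA, "b"),
  w  = c("c", NA, "d"),
  stringsAsFactors = FALSE
)

once  <- single_pass(df)    # `id` is dropped, which leaves row 2 entirely NA
twice <- single_pass(once)  # the second pass removes that new empty row
```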
5 changes: 3 additions & 2 deletions vignettes/design_principle.Rmd
@@ -105,14 +105,15 @@ By incorporating the `standardize_column_names()` function, {cleanepi} streamlin
 **2. Removal of empty rows and columns and constant columns**
 
 This module aims at eliminating irrelevant and redundant rows and columns, including empty rows and columns as well as constant columns.
-The process of removing constant data is performed iteratively until there is no constant data.
+The main function was initially (i.e. up to `version 1.0.2`) built on the {janitor} R package. We used `janitor::remove_empty()` to remove empty rows and columns and `janitor::remove_constant()` to remove constant columns. `janitor::remove_empty()` removes the empty rows first, then the empty columns, which maximizes the chance of keeping more columns after this operation.
+Because removing constant data can itself leave a dataset with new empty rows, empty columns, or constant columns, we introduced iterative constant data removal in more recent versions of the package (`> v.1.0.2`): the removal is repeated until no constant data remains. The report from this operation lists the rows and columns removed at every iteration.
 
 - **Main function:** `remove_constants()`
 - **Input:** Accepts a `data.frame` or `linelist` object, along with:
 * A cut-off that determines the percent of missing values beyond which a row or column should be deleted.
 - **Output:** Returns the input object after applying the specified operations.
 - **Report:**
-* A data frame with four columns showcasing detected empty rows and columns and constant columns at every iteration.
+* A data frame with four columns listing the empty rows and columns and the constant columns removed at every iteration.
 - **Mode:**
 * explicit
 
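
The iterative design described in this hunk can be sketched in base R as a loop that repeats the three removal steps until a pass removes nothing, recording what was dropped at each iteration. This is a simplified illustration under stated assumptions, not the {cleanepi} implementation: the function name `remove_constants_sketch()` and the four report column names are hypothetical, and the cut-off is left out, so only fully empty rows and columns are treated as empty.

```r
# Hypothetical sketch of iterative constant data removal (base R only).
remove_constants_sketch <- function(df) {
  report <- NULL
  iteration <- 0L
  repeat {
    # 1. Empty rows first, the order described for janitor::remove_empty().
    is_empty_row <- rowSums(!is.na(df)) == 0
    empty_rows <- rownames(df)[is_empty_row]
    df <- df[!is_empty_row, , drop = FALSE]

    # 2. Then empty columns.
    is_empty_col <- colSums(!is.na(df)) == 0
    empty_cols <- names(df)[is_empty_col]
    df <- df[, !is_empty_col, drop = FALSE]

    # 3. Then constant columns.
    is_constant <- vapply(df, function(col) length(unique(col)) <= 1, logical(1))
    constant_cols <- names(df)[is_constant]
    df <- df[, !is_constant, drop = FALSE]

    # Stop as soon as a full pass removes nothing.
    if (length(c(empty_rows, empty_cols, constant_cols)) == 0) break

    # Otherwise record what this iteration removed (assumed column names).
    iteration <- iteration + 1L
    report <- rbind(report, data.frame(
      iteration        = iteration,
      empty_rows       = paste(empty_rows, collapse = ", "),
      empty_columns    = paste(empty_cols, collapse = ", "),
      constant_columns = paste(constant_cols, collapse = ", "),
      stringsAsFactors = FALSE
    ))
  }
  attr(df, "report") <- report
  df
}

df <- data.frame(
  id = c(1, 1, 1),
  v  = c("a", NA, "b"),
  w  = c("c", NA, "d"),
  stringsAsFactors = FALSE
)
out <- remove_constants_sketch(df)
report <- attr(out, "report")
```

In this toy run the first iteration removes the constant `id` column and the second removes row 2, which only became empty after `id` was dropped. The loop always terminates because every iteration that continues has removed at least one row or column.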