add an example on how cutoff works in remove_constants()? #190

Closed
avallecam opened this issue Oct 23, 2024 · 3 comments · Fixed by #191

@avallecam

I am trying to understand how cutoff works. In the ?remove_constants documentation, the current example produces the same output as the default cutoff = 1, so it does not show what the argument does. Could we provide an example that makes this feature more explicit?

I tried this reprex:

library(cleanepi)

data <- readRDS(system.file("extdata", "test_df.RDS", package = "cleanepi"))

# introduce an empty column and three partially empty columns
data$empty_column <- NA
data$empty_3_column <- c(rep(NA_character_,3),rep("1",7))
data$empty_5_column <- c(rep(NA_character_,5),rep("1",5))
data$empty_8_column <- c(rep(NA_character_,8),rep("1",2))

data
#>     study_id event_name country_code country_name date.of.admission dateOfBirth
#> 1    PS001P2      day 0            2       Gambia        01/12/2020  06/01/1972
#> 2    PS002P2      day 0            2       Gambia        28/01/2021  02/20/1952
#> 3  PS004P2-1      day 0            2       Gambia        15/02/2021  06/15/1961
#> 4    PS003P2      day 0            2       Gambia        11/02/2021  11/11/1947
#> 5    P0005P2      day 0            2       Gambia        17/02/2021  09/26/2000
#> 6    PS006P2      day 0            2       Gambia        17/02/2021         -99
#> 7    PB500P2      day 0            2       Gambia        28/02/2021  11/03/1989
#> 8    PS008P2      day 0            2       Gambia        22/02/2021  10/05/1976
#> 9    PS010P2      day 0            2       Gambia        02/03/2021  09/23/1991
#> 10   PS011P2      day 0            2       Gambia        05/03/2021  02/08/1991
#>    date_first_pcr_positive_test sex empty_column empty_3_column empty_5_column
#> 1                  Dec 01, 2020   1           NA           <NA>           <NA>
#> 2                  Jan 01, 2021   1           NA           <NA>           <NA>
#> 3                  Feb 11, 2021 -99           NA           <NA>           <NA>
#> 4                  Feb 01, 2021   1           NA              1           <NA>
#> 5                  Feb 16, 2021   2           NA              1           <NA>
#> 6                  May 02, 2021   2           NA              1              1
#> 7                  Feb 19, 2021   1           NA              1              1
#> 8                  Sep 20, 2021   2           NA              1              1
#> 9                  Feb 26, 2021   1           NA              1              1
#> 10                 Mar 03, 2021   2           NA              1              1
#>    empty_8_column
#> 1            <NA>
#> 2            <NA>
#> 3            <NA>
#> 4            <NA>
#> 5            <NA>
#> 6            <NA>
#> 7            <NA>
#> 8            <NA>
#> 9               1
#> 10              1

# remove the constant columns, empty rows and columns where empty rows and
# columns are defined as those with 100% NA.
dat_default <- remove_constants(
  data = data,
  cutoff = 1
)

dat_default
#>     study_id date.of.admission dateOfBirth date_first_pcr_positive_test sex
#> 1    PS001P2        01/12/2020  06/01/1972                 Dec 01, 2020   1
#> 2    PS002P2        28/01/2021  02/20/1952                 Jan 01, 2021   1
#> 3  PS004P2-1        15/02/2021  06/15/1961                 Feb 11, 2021 -99
#> 4    PS003P2        11/02/2021  11/11/1947                 Feb 01, 2021   1
#> 5    P0005P2        17/02/2021  09/26/2000                 Feb 16, 2021   2
#> 6    PS006P2        17/02/2021         -99                 May 02, 2021   2
#> 7    PB500P2        28/02/2021  11/03/1989                 Feb 19, 2021   1
#> 8    PS008P2        22/02/2021  10/05/1976                 Sep 20, 2021   2
#> 9    PS010P2        02/03/2021  09/23/1991                 Feb 26, 2021   1
#> 10   PS011P2        05/03/2021  02/08/1991                 Mar 03, 2021   2

# remove the constant columns, empty rows and columns where empty rows and
# columns are defined as those with 50% NA.
dat_cutoff <- remove_constants(
  data = data,
  cutoff = 0.5
)

dat_cutoff
#>     study_id date.of.admission dateOfBirth date_first_pcr_positive_test sex
#> 1    PS001P2        01/12/2020  06/01/1972                 Dec 01, 2020   1
#> 2    PS002P2        28/01/2021  02/20/1952                 Jan 01, 2021   1
#> 3  PS004P2-1        15/02/2021  06/15/1961                 Feb 11, 2021 -99
#> 4    PS003P2        11/02/2021  11/11/1947                 Feb 01, 2021   1
#> 5    P0005P2        17/02/2021  09/26/2000                 Feb 16, 2021   2
#> 6    PS006P2        17/02/2021         -99                 May 02, 2021   2
#> 7    PB500P2        28/02/2021  11/03/1989                 Feb 19, 2021   1
#> 8    PS008P2        22/02/2021  10/05/1976                 Sep 20, 2021   2
#> 9    PS010P2        02/03/2021  09/23/1991                 Feb 26, 2021   1
#> 10   PS011P2        05/03/2021  02/08/1991                 Mar 03, 2021   2

Created on 2024-10-23 with reprex v2.1.1
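
A possible reading of why both calls above return the same table: each added column holds at most one distinct non-missing value, so it may be dropped as a constant column whatever the cutoff is. That is an assumption about the function's behaviour, not something confirmed in this thread. The base R sketch below only inspects the added columns; it does not call cleanepi, and the `added` data frame is a hypothetical stand-in for them.

```r
# Hypothetical stand-in for the four columns added in the reprex above
added <- data.frame(
  empty_column   = rep(NA, 10),
  empty_3_column = c(rep(NA_character_, 3), rep("1", 7)),
  empty_5_column = c(rep(NA_character_, 5), rep("1", 5)),
  empty_8_column = c(rep(NA_character_, 8), rep("1", 2))
)

# share of missing values per column: 1.0, 0.3, 0.5, 0.8
na_share <- colMeans(is.na(added))

# number of distinct non-missing values per column: 0, 1, 1, 1
n_values <- vapply(
  added,
  function(x) length(unique(x[!is.na(x)])),
  integer(1)
)

na_share
n_values
```

Since no added column has more than one distinct non-missing value, these columns cannot show the effect of cutoff on their own if constants are removed first or repeatedly.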

@Karim-Mane

Karim-Mane commented Oct 23, 2024

@avallecam Are you referring to something like the example below?

data <- readRDS(system.file("extdata", "test_df.RDS", package = "cleanepi"))

# introduce an empty column and three partially empty columns
data$empty_column <- NA
data$empty_3_column <- c(rep(NA_character_,3),rep("1",7))
data$empty_5_column <- c(rep(NA_character_,5),rep("1",5))
data$empty_8_column <- c(rep(NA_character_,8),rep("1",2))

# introduce some missing values across some columns
data$study_id[3] <- NA_character_
data$date.of.admission[3] <- NA_character_
data$dateOfBirth[3] <- NA_character_

# with cutoff = 1, line 3 is not removed
test_1 <- cleanepi::remove_constants(
    data = data,
    cutoff = 1
)
test_1
#>    study_id date.of.admission dateOfBirth date_first_pcr_positive_test sex
#> 1   PS001P2        01/12/2020  06/01/1972                 Dec 01, 2020   1
#> 2   PS002P2        28/01/2021  02/20/1952                 Jan 01, 2021   1
#> 3      <NA>              <NA>        <NA>                 Feb 11, 2021 -99
#> 4   PS003P2        11/02/2021  11/11/1947                 Feb 01, 2021   1
#> 5   P0005P2        17/02/2021  09/26/2000                 Feb 16, 2021   2
#> 6   PS006P2        17/02/2021         -99                 May 02, 2021   2
#> 7   PB500P2        28/02/2021  11/03/1989                 Feb 19, 2021   1
#> 8   PS008P2        22/02/2021  10/05/1976                 Sep 20, 2021   2
#> 9   PS010P2        02/03/2021  09/23/1991                 Feb 26, 2021   1
#> 10  PS011P2        05/03/2021  02/08/1991                 Mar 03, 2021   2

# with cutoff = 0.5, line 3 will be removed as >50% (3/5) of the values on this line
# are empty
test_point_5 <- cleanepi::remove_constants(
    data = data,
    cutoff = 0.5
)
test_point_5
#>    study_id date.of.admission dateOfBirth date_first_pcr_positive_test sex
#> 1   PS001P2        01/12/2020  06/01/1972                 Dec 01, 2020   1
#> 2   PS002P2        28/01/2021  02/20/1952                 Jan 01, 2021   1
#> 4   PS003P2        11/02/2021  11/11/1947                 Feb 01, 2021   1
#> 5   P0005P2        17/02/2021  09/26/2000                 Feb 16, 2021   2
#> 6   PS006P2        17/02/2021         -99                 May 02, 2021   2
#> 7   PB500P2        28/02/2021  11/03/1989                 Feb 19, 2021   1
#> 8   PS008P2        22/02/2021  10/05/1976                 Sep 20, 2021   2
#> 9   PS010P2        02/03/2021  09/23/1991                 Feb 26, 2021   1
#> 10  PS011P2        05/03/2021  02/08/1991                 Mar 03, 2021   2

Created on 2024-10-23 with reprex v2.1.0

@avallecam

Yes, something in that direction. I wanted a clear understanding of what the denominator is, and to see how the output changes across different cutoff values so that I could understand how the function works.

After exploring I drafted this reprex. Do you think this could be added to the examples?

data <- readRDS(system.file("extdata", "test_df.RDS", package = "cleanepi"))

# introduce an empty column
data$empty_column <- NA

# introduce some missing values across some columns
data$study_id[3] <- NA_character_
data$date.of.admission[3] <- NA_character_
data$date.of.admission[4] <- NA_character_
data$dateOfBirth[3] <- NA_character_
data$dateOfBirth[4] <- NA_character_
data$dateOfBirth[5] <- NA_character_

# original
data
#>    study_id event_name country_code country_name date.of.admission dateOfBirth
#> 1   PS001P2      day 0            2       Gambia        01/12/2020  06/01/1972
#> 2   PS002P2      day 0            2       Gambia        28/01/2021  02/20/1952
#> 3      <NA>      day 0            2       Gambia              <NA>        <NA>
#> 4   PS003P2      day 0            2       Gambia              <NA>        <NA>
#> 5   P0005P2      day 0            2       Gambia        17/02/2021        <NA>
#> 6   PS006P2      day 0            2       Gambia        17/02/2021         -99
#> 7   PB500P2      day 0            2       Gambia        28/02/2021  11/03/1989
#> 8   PS008P2      day 0            2       Gambia        22/02/2021  10/05/1976
#> 9   PS010P2      day 0            2       Gambia        02/03/2021  09/23/1991
#> 10  PS011P2      day 0            2       Gambia        05/03/2021  02/08/1991
#>    date_first_pcr_positive_test sex empty_column
#> 1                  Dec 01, 2020   1           NA
#> 2                  Jan 01, 2021   1           NA
#> 3                  Feb 11, 2021 -99           NA
#> 4                  Feb 01, 2021   1           NA
#> 5                  Feb 16, 2021   2           NA
#> 6                  May 02, 2021   2           NA
#> 7                  Feb 19, 2021   1           NA
#> 8                  Sep 20, 2021   2           NA
#> 9                  Feb 26, 2021   1           NA
#> 10                 Mar 03, 2021   2           NA

# with cutoff = 1, lines 3, 4, and 5 are not removed
test <- cleanepi::remove_constants(
  data = data,
  cutoff = 1
)

# line 3 has 60% (3/5) missing values
# line 4 has 40% (2/5) missing values
# line 5 has 20% (1/5) missing values
test
#>    study_id date.of.admission dateOfBirth date_first_pcr_positive_test sex
#> 1   PS001P2        01/12/2020  06/01/1972                 Dec 01, 2020   1
#> 2   PS002P2        28/01/2021  02/20/1952                 Jan 01, 2021   1
#> 3      <NA>              <NA>        <NA>                 Feb 11, 2021 -99
#> 4   PS003P2              <NA>        <NA>                 Feb 01, 2021   1
#> 5   P0005P2        17/02/2021        <NA>                 Feb 16, 2021   2
#> 6   PS006P2        17/02/2021         -99                 May 02, 2021   2
#> 7   PB500P2        28/02/2021  11/03/1989                 Feb 19, 2021   1
#> 8   PS008P2        22/02/2021  10/05/1976                 Sep 20, 2021   2
#> 9   PS010P2        02/03/2021  09/23/1991                 Feb 26, 2021   1
#> 10  PS011P2        05/03/2021  02/08/1991                 Mar 03, 2021   2

# drop rows or columns in which 50% or more of the values are missing
cleanepi::remove_constants(
  data = test,
  cutoff = 0.5
)
#>    study_id date.of.admission dateOfBirth date_first_pcr_positive_test sex
#> 1   PS001P2        01/12/2020  06/01/1972                 Dec 01, 2020   1
#> 2   PS002P2        28/01/2021  02/20/1952                 Jan 01, 2021   1
#> 4   PS003P2              <NA>        <NA>                 Feb 01, 2021   1
#> 5   P0005P2        17/02/2021        <NA>                 Feb 16, 2021   2
#> 6   PS006P2        17/02/2021         -99                 May 02, 2021   2
#> 7   PB500P2        28/02/2021  11/03/1989                 Feb 19, 2021   1
#> 8   PS008P2        22/02/2021  10/05/1976                 Sep 20, 2021   2
#> 9   PS010P2        02/03/2021  09/23/1991                 Feb 26, 2021   1
#> 10  PS011P2        05/03/2021  02/08/1991                 Mar 03, 2021   2

# drop rows or columns in which 25% or more of the values are missing
cleanepi::remove_constants(
  data = test,
  cutoff = 0.25
)
#>    study_id date.of.admission dateOfBirth date_first_pcr_positive_test sex
#> 1   PS001P2        01/12/2020  06/01/1972                 Dec 01, 2020   1
#> 2   PS002P2        28/01/2021  02/20/1952                 Jan 01, 2021   1
#> 5   P0005P2        17/02/2021        <NA>                 Feb 16, 2021   2
#> 6   PS006P2        17/02/2021         -99                 May 02, 2021   2
#> 7   PB500P2        28/02/2021  11/03/1989                 Feb 19, 2021   1
#> 8   PS008P2        22/02/2021  10/05/1976                 Sep 20, 2021   2
#> 9   PS010P2        02/03/2021  09/23/1991                 Feb 26, 2021   1
#> 10  PS011P2        05/03/2021  02/08/1991                 Mar 03, 2021   2

# drop rows or columns in which 15% or more of the values are missing
cleanepi::remove_constants(
  data = test,
  cutoff = 0.15
)
#>    study_id date.of.admission dateOfBirth date_first_pcr_positive_test sex
#> 1   PS001P2        01/12/2020  06/01/1972                 Dec 01, 2020   1
#> 2   PS002P2        28/01/2021  02/20/1952                 Jan 01, 2021   1
#> 6   PS006P2        17/02/2021         -99                 May 02, 2021   2
#> 7   PB500P2        28/02/2021  11/03/1989                 Feb 19, 2021   1
#> 8   PS008P2        22/02/2021  10/05/1976                 Sep 20, 2021   2
#> 9   PS010P2        02/03/2021  09/23/1991                 Feb 26, 2021   1
#> 10  PS011P2        05/03/2021  02/08/1991                 Mar 03, 2021   2

Created on 2024-10-24 with reprex v2.1.1
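
To make the denominator in the reprex above explicit, here is a minimal base R sketch that computes the share of missing values per row and compares it with each cutoff. It does not call cleanepi: the `toy` data frame and the `rows_over()` helper are hypothetical, and the strict `>` comparison is an assumption taken from the ">50%" wording earlier in the thread. For these fractions (0.6, 0.4, 0.2), `>=` would flag the same rows.

```r
# Hypothetical 4 x 5 table mirroring the missingness pattern of lines 3 to 5
toy <- data.frame(
  id    = c("a", NA,  "c", "d"),
  admit = c("x", NA,  NA,  "w"),
  birth = c("y", NA,  NA,  NA),
  test  = c("p", "q", "r", "s"),
  sex   = c("1", "2", "1", "2")
)

# share of missing values per row; the denominator is the number of columns (5)
na_share <- rowMeans(is.na(toy))  # 0.0, 0.6, 0.4, 0.2

# indices of the rows whose missing share exceeds the cutoff (assumed strict)
rows_over <- function(cutoff) unname(which(na_share > cutoff))

rows_over(1)     # no rows
rows_over(0.5)   # row 2
rows_over(0.25)  # rows 2 and 3
rows_over(0.15)  # rows 2, 3 and 4
```

The rows flagged at each cutoff follow the same pattern as lines 3, 4, and 5 in the outputs above, where the input already had its constant columns removed, leaving five columns as the denominator.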

@Karim-Mane

Sounds good to me. I will add this to the examples section of the function documentation.

Karim-Mane linked a pull request on Oct 24, 2024 that will close this issue
Karim-Mane self-assigned this on Oct 24, 2024
Karim-Mane added the documentation label (Improvements or additions to documentation) on Oct 24, 2024