add an example on how cutoff works in remove_constants()? #190
Comments
@avallecam Are you referring to something like the example below?
data <- readRDS(system.file("extdata", "test_df.RDS", package = "cleanepi"))
# introduce an empty column
data$empty_column <- NA
data$empty_3_column <- c(rep(NA_character_,3),rep("1",7))
data$empty_5_column <- c(rep(NA_character_,5),rep("1",5))
data$empty_8_column <- c(rep(NA_character_,8),rep("1",2))
# introduce some missing values across some columns
data$study_id[3] = NA_character_
data$date.of.admission[3] = NA_character_
data$dateOfBirth[3] = NA_character_
# with cutoff = 1, line 3 is not removed
test_1 <- cleanepi::remove_constants(
data = data,
cutoff = 1
)
test_1
#> study_id date.of.admission dateOfBirth date_first_pcr_positive_test sex
#> 1 PS001P2 01/12/2020 06/01/1972 Dec 01, 2020 1
#> 2 PS002P2 28/01/2021 02/20/1952 Jan 01, 2021 1
#> 3 <NA> <NA> <NA> Feb 11, 2021 -99
#> 4 PS003P2 11/02/2021 11/11/1947 Feb 01, 2021 1
#> 5 P0005P2 17/02/2021 09/26/2000 Feb 16, 2021 2
#> 6 PS006P2 17/02/2021 -99 May 02, 2021 2
#> 7 PB500P2 28/02/2021 11/03/1989 Feb 19, 2021 1
#> 8 PS008P2 22/02/2021 10/05/1976 Sep 20, 2021 2
#> 9 PS010P2 02/03/2021 09/23/1991 Feb 26, 2021 1
#> 10 PS011P2 05/03/2021 02/08/1991 Mar 03, 2021 2
# with cutoff = 0.5, line 3 will be removed as >50% (3/5) of the values on this line
# are empty
test_point_5 <- cleanepi::remove_constants(
data = data,
cutoff = 0.5
)
test_point_5
#> study_id date.of.admission dateOfBirth date_first_pcr_positive_test sex
#> 1 PS001P2 01/12/2020 06/01/1972 Dec 01, 2020 1
#> 2 PS002P2 28/01/2021 02/20/1952 Jan 01, 2021 1
#> 4 PS003P2 11/02/2021 11/11/1947 Feb 01, 2021 1
#> 5 P0005P2 17/02/2021 09/26/2000 Feb 16, 2021 2
#> 6 PS006P2 17/02/2021 -99 May 02, 2021 2
#> 7 PB500P2 28/02/2021 11/03/1989 Feb 19, 2021 1
#> 8 PS008P2 22/02/2021 10/05/1976 Sep 20, 2021 2
#> 9 PS010P2 02/03/2021 09/23/1991 Feb 26, 2021 1
#> 10 PS011P2 05/03/2021 02/08/1991 Mar 03, 2021 2
Created on 2024-10-23 with reprex v2.1.0
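The reprex above adds four partially or fully empty columns but does not spell out their missing-value fractions. A minimal base R sketch of that arithmetic follows. It does not call cleanepi and makes no claim about how remove_constants() treats these columns; the `cols` data frame is a hypothetical stand-in that rebuilds only the four added columns for 10 rows.

```r
# Rebuild only the four columns added in the reprex (10 rows each).
# `cols` is a hypothetical helper object, not part of cleanepi.
n <- 10
cols <- data.frame(
  empty_column   = rep(NA_character_, n),
  empty_3_column = c(rep(NA_character_, 3), rep("1", 7)),
  empty_5_column = c(rep(NA_character_, 5), rep("1", 5)),
  empty_8_column = c(rep(NA_character_, 8), rep("1", 2)),
  stringsAsFactors = FALSE
)

# Fraction of missing values per column: NA count divided by number of rows.
col_prop_missing <- colMeans(is.na(cols))
print(col_prop_missing)
```

The fractions are 1, 0.3, 0.5, and 0.8, which are the values to compare against a given cutoff.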
Yes, kind of in that direction. I wanted a clear understanding of what the denominator is, and to see how the outcome varies, in order to understand how the function behaves with different cutoff values. After exploring, I drafted this reprex. Do you think this could be added to the examples?
data <- readRDS(system.file("extdata", "test_df.RDS", package = "cleanepi"))
# introduce an empty column
data$empty_column <- NA
# introduce some missing values across some columns
data$study_id[3] = NA_character_
data$date.of.admission[3] = NA_character_
data$date.of.admission[4] = NA_character_
data$dateOfBirth[3] = NA_character_
data$dateOfBirth[4] = NA_character_
data$dateOfBirth[5] = NA_character_
# original
data
#> study_id event_name country_code country_name date.of.admission dateOfBirth
#> 1 PS001P2 day 0 2 Gambia 01/12/2020 06/01/1972
#> 2 PS002P2 day 0 2 Gambia 28/01/2021 02/20/1952
#> 3 <NA> day 0 2 Gambia <NA> <NA>
#> 4 PS003P2 day 0 2 Gambia <NA> <NA>
#> 5 P0005P2 day 0 2 Gambia 17/02/2021 <NA>
#> 6 PS006P2 day 0 2 Gambia 17/02/2021 -99
#> 7 PB500P2 day 0 2 Gambia 28/02/2021 11/03/1989
#> 8 PS008P2 day 0 2 Gambia 22/02/2021 10/05/1976
#> 9 PS010P2 day 0 2 Gambia 02/03/2021 09/23/1991
#> 10 PS011P2 day 0 2 Gambia 05/03/2021 02/08/1991
#> date_first_pcr_positive_test sex empty_column
#> 1 Dec 01, 2020 1 NA
#> 2 Jan 01, 2021 1 NA
#> 3 Feb 11, 2021 -99 NA
#> 4 Feb 01, 2021 1 NA
#> 5 Feb 16, 2021 2 NA
#> 6 May 02, 2021 2 NA
#> 7 Feb 19, 2021 1 NA
#> 8 Sep 20, 2021 2 NA
#> 9 Feb 26, 2021 1 NA
#> 10 Mar 03, 2021 2 NA
# with cutoff = 1, lines 3, 4, and 5 are not removed
test <- cleanepi::remove_constants(
data = data,
cutoff = 1
)
# line 3 has 60% (3/5) empty values
# line 4 has 40% (2/5) empty values
# line 5 has 20% (1/5) empty values
test
#> study_id date.of.admission dateOfBirth date_first_pcr_positive_test sex
#> 1 PS001P2 01/12/2020 06/01/1972 Dec 01, 2020 1
#> 2 PS002P2 28/01/2021 02/20/1952 Jan 01, 2021 1
#> 3 <NA> <NA> <NA> Feb 11, 2021 -99
#> 4 PS003P2 <NA> <NA> Feb 01, 2021 1
#> 5 P0005P2 17/02/2021 <NA> Feb 16, 2021 2
#> 6 PS006P2 17/02/2021 -99 May 02, 2021 2
#> 7 PB500P2 28/02/2021 11/03/1989 Feb 19, 2021 1
#> 8 PS008P2 22/02/2021 10/05/1976 Sep 20, 2021 2
#> 9 PS010P2 02/03/2021 09/23/1991 Feb 26, 2021 1
#> 10 PS011P2 05/03/2021 02/08/1991 Mar 03, 2021 2
# drop rows or columns with constant values equal to or more than 50%
cleanepi::remove_constants(
data = test,
cutoff = 0.5
)
#> study_id date.of.admission dateOfBirth date_first_pcr_positive_test sex
#> 1 PS001P2 01/12/2020 06/01/1972 Dec 01, 2020 1
#> 2 PS002P2 28/01/2021 02/20/1952 Jan 01, 2021 1
#> 4 PS003P2 <NA> <NA> Feb 01, 2021 1
#> 5 P0005P2 17/02/2021 <NA> Feb 16, 2021 2
#> 6 PS006P2 17/02/2021 -99 May 02, 2021 2
#> 7 PB500P2 28/02/2021 11/03/1989 Feb 19, 2021 1
#> 8 PS008P2 22/02/2021 10/05/1976 Sep 20, 2021 2
#> 9 PS010P2 02/03/2021 09/23/1991 Feb 26, 2021 1
#> 10 PS011P2 05/03/2021 02/08/1991 Mar 03, 2021 2
# drop rows or columns with constant values equal to or more than 25%
cleanepi::remove_constants(
data = test,
cutoff = 0.25
)
#> study_id date.of.admission dateOfBirth date_first_pcr_positive_test sex
#> 1 PS001P2 01/12/2020 06/01/1972 Dec 01, 2020 1
#> 2 PS002P2 28/01/2021 02/20/1952 Jan 01, 2021 1
#> 5 P0005P2 17/02/2021 <NA> Feb 16, 2021 2
#> 6 PS006P2 17/02/2021 -99 May 02, 2021 2
#> 7 PB500P2 28/02/2021 11/03/1989 Feb 19, 2021 1
#> 8 PS008P2 22/02/2021 10/05/1976 Sep 20, 2021 2
#> 9 PS010P2 02/03/2021 09/23/1991 Feb 26, 2021 1
#> 10 PS011P2 05/03/2021 02/08/1991 Mar 03, 2021 2
# drop rows or columns with constant values equal to or more than 15%
cleanepi::remove_constants(
data = test,
cutoff = 0.15
)
#> study_id date.of.admission dateOfBirth date_first_pcr_positive_test sex
#> 1 PS001P2 01/12/2020 06/01/1972 Dec 01, 2020 1
#> 2 PS002P2 28/01/2021 02/20/1952 Jan 01, 2021 1
#> 6 PS006P2 17/02/2021 -99 May 02, 2021 2
#> 7 PB500P2 28/02/2021 11/03/1989 Feb 19, 2021 1
#> 8 PS008P2 22/02/2021 10/05/1976 Sep 20, 2021 2
#> 9 PS010P2 02/03/2021 09/23/1991 Feb 26, 2021 1
#> 10 PS011P2 05/03/2021 02/08/1991 Mar 03, 2021 2
Created on 2024-10-24 with reprex v2.1.1
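To make the denominator explicit, here is a minimal base R sketch that rebuilds the five-column `test` data frame printed above and computes the fraction of missing values per row. It does not call cleanepi, and it is not the package's implementation. The helper `rows_dropped()` is hypothetical, and the `>=` comparison is an assumption taken from the "equal to or more than" wording in the reprex comments.

```r
# Rebuild the five-column `test` data frame from the printed output above.
test <- data.frame(
  study_id = c("PS001P2", "PS002P2", NA, "PS003P2", "P0005P2",
               "PS006P2", "PB500P2", "PS008P2", "PS010P2", "PS011P2"),
  date.of.admission = c("01/12/2020", "28/01/2021", NA, NA, "17/02/2021",
                        "17/02/2021", "28/02/2021", "22/02/2021",
                        "02/03/2021", "05/03/2021"),
  dateOfBirth = c("06/01/1972", "02/20/1952", NA, NA, NA, "-99",
                  "11/03/1989", "10/05/1976", "09/23/1991", "02/08/1991"),
  date_first_pcr_positive_test = c("Dec 01, 2020", "Jan 01, 2021",
                                   "Feb 11, 2021", "Feb 01, 2021",
                                   "Feb 16, 2021", "May 02, 2021",
                                   "Feb 19, 2021", "Sep 20, 2021",
                                   "Feb 26, 2021", "Mar 03, 2021"),
  sex = c(1, 1, -99, 1, 2, 2, 1, 2, 1, 2),
  stringsAsFactors = FALSE
)

# Denominator = number of columns (5); numerator = NA count in each row.
# Note that -99 is an ordinary value here, not a missing one.
prop_missing <- rowMeans(is.na(test))

# ASSUMPTION: a row is dropped when its missing fraction is >= cutoff.
# `rows_dropped()` is a hypothetical helper for illustration only.
rows_dropped <- function(cutoff) unname(which(prop_missing >= cutoff))

for (cutoff in c(1, 0.5, 0.25, 0.15)) {
  cat("cutoff =", cutoff, "-> rows dropped:",
      paste(rows_dropped(cutoff), collapse = ", "), "\n")
}
```

Rows 3, 4, and 5 have missing fractions of 0.6, 0.4, and 0.2. Under the stated assumption, no row is dropped at cutoff = 1, row 3 is dropped at 0.5, rows 3 and 4 at 0.25, and rows 3, 4, and 5 at 0.15, which matches the outputs in the reprex.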
Sounds good to me. I will add this to the function's examples section.
I am trying to understand how cutoff works. When looking at the ?remove_constants documentation, the current example does not show a difference in the output from the default cutoff = 1. Could we provide one that makes this feature more explicit?
I tried this reprex:
Created on 2024-10-23 with reprex v2.1.1