Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Decide and document if remove_constants() should remove rows or cols first #171

Closed
Bisaloo opened this issue Aug 7, 2024 · 3 comments · Fixed by #191
Closed

Decide and document if remove_constants() should remove rows or cols first #171

Bisaloo opened this issue Aug 7, 2024 · 3 comments · Fixed by #191
Assignees
Labels
enhancement New feature or request

Comments

@Bisaloo
Copy link
Member

Bisaloo commented Aug 7, 2024

If cols are removed first, then a row might suddenly turn constant.

See the last row in this reprex:

x <- cars

x[nrow(x)+1, ] <- NA
x$year <- 2020

tail(x)
#>    speed dist year
#> 46    24   70 2020
#> 47    24   92 2020
#> 48    24   93 2020
#> 49    24  120 2020
#> 50    25   85 2020
#> 51    NA   NA 2020

Created on 2024-08-07 with reprex v2.1.1

Should it be removed or not? Should this choice be left to the user as an extra argument?

In all cases, the chosen behaviour should be documented and tested.

@avallecam
Copy link
Member

avallecam commented Sep 30, 2024

... or if this process should be done iteratively until getting no empty row/col or constant column

#load library
library(tidyverse)
#create dataset
df <- tibble(
  x = c(1,2),
  y = c(1,3)
) %>% 
  mutate(invariant = rep("a",nrow(.))) %>% 
  mutate(invariant2 = rep("b",nrow(.))) %>% 
  mutate(empty1 = rep(NA_Date_, nrow(.))) %>% 
  mutate(empty2 = rep(NA_Date_, nrow(.))) %>% 
  mutate(empty3 = rep(NA_Date_, nrow(.))) %>% 
  add_row(x = NA_integer_, invariant = "a") %>% 
  add_row(x = NA_integer_, invariant = "a") %>%
  add_row(x = NA_integer_, invariant = "a") %>% 
  add_row(x = NA_integer_)

df
#> # A tibble: 6 × 7
#>       x     y invariant invariant2 empty1 empty2 empty3
#>   <dbl> <dbl> <chr>     <chr>      <date> <date> <date>
#> 1     1     1 a         b          NA     NA     NA    
#> 2     2     3 a         b          NA     NA     NA    
#> 3    NA    NA a         <NA>       NA     NA     NA    
#> 4    NA    NA a         <NA>       NA     NA     NA    
#> 5    NA    NA a         <NA>       NA     NA     NA    
#> 6    NA    NA <NA>      <NA>       NA     NA     NA

df %>% 
  janitor::remove_constant(na.rm = TRUE) %>% 
  janitor::remove_empty(which = "rows")
#> # A tibble: 2 × 2
#>       x     y
#>   <dbl> <dbl>
#> 1     1     1
#> 2     2     3

df %>% 
  janitor::remove_empty(which = "rows") %>% 
  janitor::remove_constant(na.rm = TRUE) %>% 
  janitor::remove_empty(which = "rows")
#> # A tibble: 2 × 2
#>       x     y
#>   <dbl> <dbl>
#> 1     1     1
#> 2     2     3

df %>% 
  janitor::remove_empty(which = "cols") %>% 
  janitor::remove_empty(which = "rows") %>% 
  janitor::remove_constant(na.rm = TRUE) %>% 
  janitor::remove_empty(which = "rows")
#> # A tibble: 2 × 2
#>       x     y
#>   <dbl> <dbl>
#> 1     1     1
#> 2     2     3

packageVersion("cleanepi")
#> [1] '1.0.2'

Created on 2024-09-30 with reprex v2.1.0

@Karim-Mane
Copy link
Member

Karim-Mane commented Oct 18, 2024

@Bisaloo - the order no longer matters. We introduced an iterative constant data removal process when solving issue #178. In the current remove_constants(), this process is iteratively carried on until there is no more constant data.

x <- cars

x[nrow(x)+1, ] <- NA
x$year <- 2020
tail(x)
#>    speed dist year
#> 46    24   70 2020
#> 47    24   92 2020
#> 48    24   93 2020
#> 49    24  120 2020
#> 50    25   85 2020
#> 51    NA   NA 2020
dim(x)
#> [1] 51  3

xx = cleanepi::remove_constants(x)
#> Constant data was removed after 2 iterations. See the report for more details.
dim(xx)
#> [1] 50  2
tail(xx)
#>    speed dist
#> 45    23   54
#> 46    24   70
#> 47    24   92
#> 48    24   93
#> 49    24  120
#> 50    25   85

Created on 2024-10-18 with reprex v2.1.0

@avallecam
Copy link
Member

In all cases, the chosen behaviour should be documented and tested.

I agree to have this documented and tested. Now, I am wondering if there may be a scenario when iteratively running cleanepi::remove_constants() may generate unwanted output.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

3 participants