Improve network_plot and correlate for 1- and 2-column data.frames #122

Merged

Conversation

antoine-sachet
Contributor

@antoine-sachet antoine-sachet commented Oct 28, 2020

This PR closes #118 #119 #120 and #121. This is what now works:

library(dplyr)
library(corrr)

# Correlation between numeric vectors
correlate(1:10, 1:10, quiet = TRUE)
#> # A tibble: 1 x 2
#>   rowname     x
#>   <chr>   <dbl>
#> 1 x           1

# Correlation of a df with only one column
mtcars %>% 
  select(cyl) %>% 
  correlate(quiet = TRUE)
#> # A tibble: 1 x 2
#>   rowname   cyl
#>   <chr>   <dbl>
#> 1 cyl         1

# Network plot with only one variable
mtcars %>% 
  select(cyl) %>% 
  correlate(quiet = TRUE) %>% 
  network_plot()

# Network plot with only two variables
mtcars %>% 
  select(cyl, mpg) %>% 
  correlate(quiet = TRUE) %>% 
  network_plot()

Created on 2020-10-28 by the reprex package (v0.3.0)

All of those were previously failing with an error.

For context, I worked on this because I have a shiny app that lets users select columns out of a data.frame and that plots their correlations in a network plot. I wanted the plots to work even when only 1 or 2 columns were selected.

I thought this would be a straightforward change, but it turns out fixing one bug uncovered a couple of others, so I ended up working on it longer than I expected :)

If I may, I'd like to give some feedback as an outside contributor new to the code: I think you ought to refactor the code a bit and move the functions into more natural places. I had to grep my way through the code because the file names did not match the function names at all. In particular, cor_df.R contains many foo.cor_df S3 methods that could each live in their own foo.R file along with their generic function. For example, the function fashion has S3 methods spread across 3 generically named files, when a single fashion.R file would be more standard. To be clear, I think the package is great and this is meant as my 2 cents of constructive criticism to foster collaboration :)

I love that network_plot function! Keep up the great work 👍

Member

@juliasilge juliasilge left a comment


Thank you so much for this PR, @antoine-sachet, complete with test cases! 🚀 It's clear that not many people have tried out the vector interface and found that problem.

I'll be honest that I'm not entirely convinced that results for the 1 or 2 term case are helpful or meaningful 😜 but I read what you wrote about your use case and returning nicer results is certainly not unreasonable.

I have just a couple of changes here to make.

DESCRIPTION (outdated, resolved)
NEWS.md (outdated, resolved)
Comment on lines +195 to +204
points <- if (ncol(rdf) == 1) {
  # 1 var: a single central point
  matrix(c(0, 0), ncol = 2, dimnames = list(colnames(rdf)))
} else if (ncol(rdf) == 2) {
  # 2 vars: 2 opposing points
  matrix(c(0, -0.1, 0, 0.1), ncol = 2, dimnames = list(colnames(rdf)))
} else {
  # More than 2 vars: multidimensional scaling to obtain x and y coordinates for points.
  suppressWarnings(stats::cmdscale(distance, k = 2))
}
Member


Instead of assigning to an if() statement, can you use switch() here?
https://adv-r.hadley.nz/control-flow.html#switch

Contributor Author

@antoine-sachet antoine-sachet Nov 1, 2020


That's a good idea but not feasible here as there is no way to set a default value when switching on a numeric variable (which is actually really surprising!).

For example, the following does not work as it will set points to NULL when ncol(rdf) > 3:

points <- switch(
    ncol(rdf),
    # 1 var: a single central point
    matrix(c(0, 0), ncol = 2, dimnames = list(colnames(rdf))),
    # 2 vars: 2 opposing points
    matrix(c(0, -0.1, 0, 0.1), ncol = 2, dimnames = list(colnames(rdf))),
    # More than 2 vars: multidimensional scaling to obtain x and y coordinates for points.
    suppressWarnings(stats::cmdscale(distance, k = 2))
  )

You can set a default when switching on a character variable but I am not sure that's what you meant. I find having to cast ncol(rdf) to character to make it work a bit weird.

points <- switch(
    as.character(ncol(rdf)),
    # 1 var: a single central point
    "1" = matrix(c(0, 0), ncol = 2, dimnames = list(colnames(rdf))),
    # 2 vars: 2 opposing points
    "2" = matrix(c(0, -0.1, 0, 0.1), ncol = 2, dimnames = list(colnames(rdf))),
    # More than 2 vars: multidimensional scaling to obtain x and y coordinates for points.
    suppressWarnings(stats::cmdscale(distance, k = 2))
  )

Please let me know if I've missed something but I think we can stick with the if/else otherwise.

Member


Ah right, I forgot that there is no default value allowed for numeric expressions. 👍
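
A minimal, self-contained check of the switch() behaviour discussed in this thread, using base R only. The helper names pick_numeric and pick_character are hypothetical and exist purely for illustration; they are not part of corrr.

```r
# Hypothetical helper: numeric switch() picks the n-th alternative.
# There is no default branch, so an out-of-range n yields NULL (invisibly).
pick_numeric <- function(n) {
  switch(n, "one", "two", "many")
}

# Hypothetical helper: character switch() treats a trailing unnamed
# argument as the default, at the cost of casting n to character.
pick_character <- function(n) {
  switch(as.character(n), "1" = "one", "2" = "two", "many")
}

pick_numeric(2)           # "two"
is.null(pick_numeric(5))  # TRUE: no alternative matched
pick_character(5)         # "many": falls through to the default
```

This is why the if/else form is the simpler fit when the "more than 2" branch has to cover every larger column count.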

@juliasilge
Member

Also closes #115

@thisisdaryn
Collaborator

> Also closes #115

I don't think it closes #115. I agree with this comment that a true fix would involve either a) changing correlate() to return something that isn't a cor_df i.e. not square or b) removing the y argument.
(I would also add a 3rd possibility, c) checking to make sure that x and y have the same number of variables/columns, but I don't think that is a good option).

The underlying problem in #115 is that correlate() wraps cor(), which can return a non-square output if x and y have different numbers of columns.

More detailed examples of #115 still being an issue after this PR below this line:


Two examples that I believe still give incorrect behavior:
  1. The following code still mislabels the rows (the row names of the output should come from the column names of the first input argument, and the column names of the output should come from the column names of the second argument):
> correlate(mtcars[,c("mpg", "cyl", "disp")],
+           mtcars[,c("hp", "drat", "wt")], quiet = TRUE)
# A tibble: 3 x 4
  rowname     hp   drat     wt
  <chr>    <dbl>  <dbl>  <dbl>
1 hp      NA      0.681 -0.868
2 drat     0.832 NA      0.782
3 wt       0.791 -0.710 NA 

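The correlations shown do not match the row labels they appear under: the values are those of the mpg, cyl and disp rows (with the diagonal set to NA), but the rows are labelled hp, drat and wt.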

The correct row names are seen here:

> cor(mtcars[,1:3], mtcars[,4:6])
             hp       drat         wt
mpg  -0.7761684  0.6811719 -0.8676594
cyl   0.8324475 -0.6999381  0.7824958
disp  0.7909486 -0.7102139  0.8879799
  2. The following code still fails. It fails after getting a non-square output from cor() and attempting to cast it to a cor_df:
> correlate(mtcars[,1:3], mtcars[,1:4])

Correlation method: 'pearson'
Missing treated using: 'pairwise.complete.obs'

 Error in as_cordf(x, diagonal = diagonal) : 
  Input object x is not a square. The number of columns must be equal to the number of rows. 
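
For reference, the shape and labelling of the underlying cor() result can be checked with base R alone. This is only a sketch of what cor() returns for the inputs used above; it makes no claim about how correlate() should handle the result.

```r
# Base R only: cor(x, y) returns an ncol(x) by ncol(y) matrix.
x <- mtcars[, c("mpg", "cyl", "disp")]
y <- mtcars[, c("hp", "drat", "wt")]

m <- stats::cor(x, y)
rownames(m)  # "mpg" "cyl" "disp": row names come from x
colnames(m)  # "hp" "drat" "wt": column names come from y

# With unequal column counts the result is not square.
dim(stats::cor(mtcars[, 1:3], mtcars[, 1:4]))  # 3 4
```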

@antoine-sachet
Contributor Author

antoine-sachet commented Nov 1, 2020

Hi,

I've made the requested edits, except for replacing the if/else with a switch (because it's not as good a fit as it seems; see the details in the discussion above).

Regarding the usefulness of this PR... Yes you're right it is mostly pointless 😅 !

This is actually why I felt obliged to explain why on earth I was concerned about the 1 and 2 term cases.

Now that it's implemented, I do feel it is a more satisfying behaviour to plot something that makes sense rather than fail.

@thisisdaryn Agreed, I was not trying to address the use of a data.frame as y. Reopen #115.

@juliasilge
Member

Thanks for the reminders on what happens when x and y are different in dimension and such, as discussed over in #115. I see what you both are saying, and we do still have to make a call on how to better handle that situation. 👍

@juliasilge juliasilge merged commit da9cc25 into tidymodels:master Nov 2, 2020
@juliasilge
Member

Thank you so much for your work on this @antoine-sachet! 🚀

@github-actions

github-actions bot commented Mar 6, 2021

This pull request has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Mar 6, 2021
Successfully merging this pull request may close these issues.

network_plot() does not work with only 2 variables or less
3 participants