-
Notifications
You must be signed in to change notification settings - Fork 56
/
README.Rmd
144 lines (98 loc) · 5.46 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-"
)
```
# corrr <a href='https://corrr.tidymodels.org'><img src='man/figures/logo.png' style='float: right' height="139" /></a>
<!-- badges: start -->
[![R-CMD-check](https://github.com/tidymodels/corrr/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/tidymodels/corrr/actions/workflows/R-CMD-check.yaml)
[![CRAN_Status_Badge](https://www.r-pkg.org/badges/version/corrr)](https://cran.r-project.org/package=corrr)
[![Codecov test coverage](https://codecov.io/gh/tidymodels/corrr/branch/main/graph/badge.svg)](https://app.codecov.io/gh/tidymodels/corrr?branch=main)
<!-- badges: end -->
corrr is a package for exploring **corr**elations in **R**. It focuses on creating and working with **data frames** of correlations (instead of matrices) that can be easily explored via corrr functions or by leveraging tools like those in the [tidyverse](https://www.tidyverse.org/). This, along with the primary corrr functions, is represented below:
<img src='man/figures/to-cor-df.png'>
You can install:
- the latest released version from CRAN with
```{r install_cran, eval = FALSE}
install.packages("corrr")
```
- the latest development version from GitHub with
```{r install_git, eval = FALSE}
# install.packages("remotes")
remotes::install_github("tidymodels/corrr")
```
## Using corrr
Using `corrr` typically starts with `correlate()`, which acts like the base correlation function `cor()`. It differs by defaulting to pairwise deletion, and returning a correlation data frame (`cor_df`) of the following structure:
- A `tbl` with an additional class, `cor_df`
- An extra "term" column
- Standardized variances (the matrix diagonal) set to missing values (`NA`) so they can be ignored.
### API
The corrr API is designed with data pipelines in mind (e.g., to use `%>%` from the magrittr package). After `correlate()`, the primary corrr functions take a `cor_df` as their first argument, and return a `cor_df` or `tbl` (or output like a plot). These functions serve one of three purposes:
Internal changes (`cor_df` out):
- `shave()` the upper or lower triangle (set to `r NA`).
- `rearrange()` the columns and rows based on correlation strengths.
Reshape structure (`tbl` or `cor_df` out):
- `focus()` on select columns and rows.
- `stretch()` into a long format.
Output/visualizations (console/plot out):
- `fashion()` the correlations for pretty printing.
- `rplot()` the correlations with shapes in place of the values.
- `network_plot()` the correlations in a network.
## Databases and Spark
The `correlate()` function also works with database tables. The function will automatically push the calculations of the correlations to the database, collect the results in R, and return the `cor_df` object. This allows for those results integrate with the rest of the `corrr` API.
## Examples
```{r example, message = FALSE, warning = FALSE}
library(MASS)
library(corrr)
set.seed(1)
# Simulate three columns correlating about .7 with each other
mu <- rep(0, 3)
Sigma <- matrix(.7, nrow = 3, ncol = 3) + diag(3)*.3
seven <- mvrnorm(n = 1000, mu = mu, Sigma = Sigma)
# Simulate three columns correlating about .4 with each other
mu <- rep(0, 3)
Sigma <- matrix(.4, nrow = 3, ncol = 3) + diag(3)*.6
four <- mvrnorm(n = 1000, mu = mu, Sigma = Sigma)
# Bind together
d <- cbind(seven, four)
colnames(d) <- paste0("v", 1:ncol(d))
# Insert some missing values
d[sample(1:nrow(d), 100, replace = TRUE), 1] <- NA
d[sample(1:nrow(d), 200, replace = TRUE), 5] <- NA
# Correlate
x <- correlate(d)
class(x)
x
```
**NOTE: Previous to corrr 0.4.3, the first column of a `cor_df` dataframe was named "rowname". As of corrr 0.4.3, the name of this first column changed to "term".**
As a `tbl`, we can use functions from data frame packages like `dplyr`, `tidyr`, `ggplot2`:
```{r, message = FALSE, warning = FALSE}
library(dplyr)
# Filter rows by correlation size
x %>% filter(v1 > .6)
```
corrr functions work in pipelines (`cor_df` in; `cor_df` or `tbl` out):
```{r combination, warning = FALSE, fig.height = 4, fig.width = 5}
x <- datasets::mtcars %>%
correlate() %>% # Create correlation data frame (cor_df)
focus(-cyl, -vs, mirror = TRUE) %>% # Focus on cor_df without 'cyl' and 'vs'
rearrange() %>% # rearrange by correlations
shave() # Shave off the upper triangle for a clean result
fashion(x)
rplot(x)
datasets::airquality %>%
correlate() %>%
network_plot(min_cor = .2)
```
## Contributing
This project is released with a [Contributor Code of Conduct](https://contributor-covenant.org/version/2/1/CODE_OF_CONDUCT.html). By contributing to this project, you agree to abide by its terms.
- For questions and discussions about tidymodels packages, modeling, and machine learning, please [post on RStudio Community](https://community.rstudio.com/new-topic?category_id=15&tags=tidymodels,question).
- If you think you have encountered a bug, please [submit an issue](https://github.com/tidymodels/corrr/issues).
- Either way, learn how to create and share a [reprex](https://reprex.tidyverse.org/articles/articles/learn-reprex.html) (a minimal, reproducible example), to clearly communicate about your code.
- Check out further details on [contributing guidelines for tidymodels packages](https://www.tidymodels.org/contribute/) and [how to get help](https://www.tidymodels.org/help/).