-
Notifications
You must be signed in to change notification settings - Fork 1
/
README.Rmd
202 lines (150 loc) · 7.2 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
---
output: github_document
editor_options:
chunk_output_type: console
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# ICIKendallTau
<!-- badges: start -->
[![ICIKendallTau status badge](https://moseleybioinformaticslab.r-universe.dev/badges/ICIKendallTau)](https://moseleybioinformaticslab.r-universe.dev)
<!-- badges: end -->
You can see the pkgdown site [here](https://moseleybioinformaticslab.github.io/ICIKendallTau/).
## Installation
You can install the current version of ICIKendallTau via GitHub:
``` r
remotes::install_github("MoseleyBioinformaticsLab/ICIKendallTau")
```
You can also install Windows or Mac binaries using our r-universe:
```r
options(repos = c(
moseleybioinformaticslab = 'https://moseleybioinformaticslab.r-universe.dev',
CRAN = "https://cloud.r-project.org"))
install.packages("ICIKendallTau")
```
## Problem
* How to handle missing data (i.e. `NA`'s) in calculating a correlation between two variables.
* Current calculations of correlation are based on having all pairs of observations for two variables.
* However, whether an observation is present or missing is semi-quantitative information for many analytical measurements with sensitivity limits.
* i.e. in many cases, missing observations are not "missing-at-random", but "missing-not-at-random" due to falling below the detection limit.
* In these cases, NA is informative.
* Therefore, in **most** analytical measurements (gene expression, proteomics, metabolomics), missing measurements should be included, and contribute to the correlation.
If you want to read more on **how** we solve this problem, see the package [vignette](https://moseleybioinformaticslab.github.io/ICIKendallTau/articles/ici-kendalltau.html).
## Package Functions
The functions that implement this include:
* `ici_kt`: the C++ workhorse, actually calculating a correlation between an X and Y.
* The option `perspective` will control how the `NA` values influence ties.
* When comparing samples, you likely want to use `perspective = "global"`.
* `ici_kendallt`: Handles comparisons for a large matrix.
* Rows are features, columns are samples.
* Implicitly parallel, but have to call:
* `library(furrr)`
* `plan(multiprocess)`
* Otherwise will only use a single core.
We've also included a function for testing if the missingness in your data comes from left-censorship, `test_left_censorship`. We walk through creating example data and testing it in the vignette [Testing for Left Censorship](https://moseleybioinformaticslab.github.io/ICIKendallTau/articles/testing-for-left-censorship).
In addition to testing, you can also visualize the missing data pattern by feature rank using the `rank_order_data` function, and use `visdat::vis_miss()` on the original and reordered missing data.
## Examples
The most common case is a large matrix of independent samples (columns) and measured features in each of the samples (i.e. gene expression).
Here we will make some artificial data to show how the correlation changes as we introduce missingness.
```{r fake_data, message = FALSE}
set.seed(1234)
library(ICIKendallTau)
s1 = sort(rnorm(1000, mean = 100, sd = 10))
s2 = s1 + 10
matrix_1 = cbind(s1, s2)
r_1 = ici_kendalltau(matrix_1)
r_1$cor
```
Now we introduce some missing values at the low end of each one.
We will just do the simplest thing and introduce `NA` values in the bottom set.
```{r introduce_NA}
s3 = s1
s3[sample(100, 50)] = NA
s4 = s2
s4[sample(100, 50)] = NA
matrix_2 = cbind(s3, s4)
r_2 = ici_kendalltau(matrix_2)
r_2$cor
```
## Is It Fast?
The C++ code implementation (thanks {Rcpp}!) is based on the SciPy implementation, which uses two merge sorts of the ranks of each vector, and then looks for differences between them.
This is the fastest method we know of, and has a complexity of O(nlogn).
The naive way of computing it, which explicitly examines all of the pairs, has a complexity of n^2.
Our implementation was compared to the {pcaPP::cov.fk} function, and the use of {Rcpp} and our inefficient copying of vectors makes ours 3X slower than that one.
Which honestly isn't too bad.
```{r speed}
library(microbenchmark)
x = rnorm(1000)
y = rnorm(1000)
x2 = rnorm(40000)
y2 = rnorm(40000)
microbenchmark(
cor(x, y, method = "kendall"),
ici_kt(x, y, "global"),
ici_kt(x2, y2, "global"),
times = 5
)
```
In the case of 40,000 features, the average time on a modern CPU is 14 milliseconds.
Of course, if you want to use it to calculate Kendall-tau-b without incorporating missingness, it can do that just fine as well.
```{r check_values}
k_tau = ici_kt(x, y, "global")
all.equal(k_tau[[1]] ,cor(x, y, method = "kendall"))
```
We also provide the `kt_fast` function, if you want something that treats `NA` values similarly to `stats::cor`.
```{r kt_fast}
k_tau_fast = kt_fast(x, y)
k_tau_fast
```
## P-Values
ICI-Kt functions only calculates the tau-b variant that handles ties.
P-value calculations use the asymptotic approximation in all cases, and thus
may vary slightly from the p-values returned by R's `cor.test` and Python's `scipy.stats.kendalltau`
depending on the number of values in *x* and *y*.
## Parallelism
If you have {future} and the {furrr} packages installed, then it is also possible to split up the a set of matrix comparisons across compute resources for any multiprocessing engine registered with {future}.
```{r}
#| label: future
#| eval: false
library(furrr)
future::plan(multicore, workers = 4)
r_3 = ici_kendalltau(matrix_2)
```
## Many Many Comparisons
In the case of hundreds of thousands of comparisons to be done, the result matrices can become very, very large, and require lots of memory for storage.
They are also inefficient, as both the lower and upper triangular components are stored.
An alternative storage format is as a `data.frame`, where there is a single row for each comparison performed.
This is actually how the results are stored internally, and then they are converted to a matrix form if requested (the default).
To keep the `data.frame` output, add the argument `return_matrix=FALSE` to the call of `ici_kendalltau`.
```{r}
#| label: matrix
r_4 = ici_kendalltau(matrix_2, return_matrix = FALSE)
r_4
```
## Other Correlations
`ici_kendalltau` and `ici_kt` calculate the p-value of the correlation as part of the overall calculation.
`stats::cor` does not, and `stats::cor.test` can only calculate the p-value for a single comparison of two vectors.
It is sometimes advantageous to obtain p-values for a large number of correlations.
We provide `cor_fast`, which works analogously to `kt_fast`, with the ability to choose `pearson` or `spearman` as the method.
Note that if a matrix is provided, the columns must be named.
```{r}
#| label: cor-fast
r_5 = cor_fast(x, y, method = "pearson")
r_5
```
```{r}
#| label: cor-fast2
m_3 = cbind(x, y, x)
colnames(m_3) = c("s1", "s2", "s3")
r_6 = cor_fast(m_3)
r_6
```
## Code of Conduct
Please note that the ICIKendallTau project is released with a [Contributor Code of Conduct](https://contributor-covenant.org/version/2/0/CODE_OF_CONDUCT.html). By contributing to this project, you agree to abide by its terms.