Permutation-based Maximum Covariance Analysis (PMCA) identifies common patterns in coupled datasets. To avoid the identification of spurious patterns, a chief criticism of Maximum Covariance Analysis, PMCA provides a permutation-based False Positive Rate (FPR) for each association. This FPR allows for ranking or filtering the results that fall below a given FPR tolerance.
You can install the released version of PMCA from github with:
library(devtools)
devtools::install_github("robyn-ball/PMCA")
This example uses a subset of the data outlined in the manuscript.
The full dataset is available at GEO, GSE72833.
Xexample (6 x 28) are the six substage proportions for each mouse (28 mice).
Yexample (1000 x 28) are the corresponding gene expression values (1000 genes) for each mouse (28 mice). Note that for simplicity, Yexample is only a subset of the gene expression data.
We wish to identify the substage-specific genes, i.e., we wish to map the genes of Yexample onto the substages of X.
library(PMCA)
mca.real <- get.mca(Xexample,Yexample)
Calculate Z_x, Z_y, and sigma for matrices Xexample and Yexample.
Next, plot the singular values of the covariance matrix of Xexample and Yexample.
plot(mca.real$sigma)
Now we make some decisions based on the singular values.
The following parameters are + by: by which component do you want the FPR <= alpha + alpha: specify alpha such that the FPR <= alpha by the given component (by) + method: “overall” to calculate an overall FPR across the entire component, “each” if you want a FPR for each rowterm of Xexample. Note that “overall” is faster. + B: number of permutations
- plot: TRUE will plot one of the FPR distributions across all B permutations.
The following will calculate an FPR for each rowterm of Xexample (substage) and ensure that the FPR is <= 0.05 by the 3rd component. It use 1000 permutations to calculate the FPR and plot a histogram of the FPR distribution for the 3rd component.
by<- 3; method <- "each"; B <- 1000; alpha <- 0.05; plot <- TRUE
To determine the degree of similarity of patterns, we calculate the distance between each element of Xexample and each element of Yexample across all components.
To avoid identifying spurious patterns, we use permutations to estimate the FPR for each of the associations.
For each of the B permutations, we break the relationship between Xexample and Yexample by independently shuffling the columns of Xexample -> Xstar and calculate the distance between the elements of Xstar and Yexample. Thus, for every element of Xexample and Yexample, we compare the observed distance (relative similarity) to B distances where the relationship between Xexample and Yexample was broken.
These permutation-based distances provide the basis for estimating the FPR.
First, calculate the observed distances
scores.real <- get.scores(mca.real$Zx,mca.real$Zy)
scores.neg <- get.scores(-mca.real$Zx,mca.real$Zy) #for anti-associated lists
Note, that we can also find patterns that are anti-associated (scores.neg). For example, it is useful to know the genes that have decreasing expression as a substage proportion is increasing, and vice versa.
set.seed(B) # for reproducibility
scores.rand <- permutation.proc(Xexample,Yexample,method=method,B=B) # get scores for all B permutations
Now use the permutation-based scores to identify the optimal window width and thus ensure that the FPR <= 0.05 by the 3rd component (as we specified above).
tau controls the width of the window w/tau. We start with w = the standard deviation of Z_x and tau=1. Note that a larger tau will be more strict and result in finer lists of associations. Try a larger or smaller tau if the lists are too stringent (fine) or too relaxed (large).
w <- apply(mca.real$Zx,2,sd) # get starting window vector
set.seed(B) # for reproducibilty
it.result <- iterative.proc(scores.rand,alpha=alpha,w=w,method=method,by=by,rowterms=rownames(Yexample), plot=plot,tau=1)
Print tau and the estimated FPR at each component and row of Xexample.
it.result$tau
#> [1] 1.9
it.result$FPR
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 0.341371 0.132415 0.047533 0.012241 0.002643
#> [2,] 0.397579 0.095322 0.015693 0.002542 0.000295
#> [3,] 0.397601 0.115143 0.024542 0.004695 0.000685
#> [4,] 0.343955 0.133472 0.048512 0.009684 0.001905
#> [5,] 0.341421 0.137083 0.048415 0.011889 0.001733
#> [6,] 0.389366 0.093160 0.031470 0.006232 0.001067
The way to read the FPR is that for row 1, the FPR for column 3 is 0.047 and for column 4, the FPR is 0.012.
You will get a finer list as you use more columns so that the genes that map if you use columns 1-3 have a FPR <= 0.05 and the genes that map to a rowterm of Xexample using the 4th component will have a lower FPR.
g = list of rownames(Yexample) that match patterns of rownames(Xexample)
g <- match.patterns(scores.real,w=it.result$wopt, rowterms = rownames(Yexample))
g[[i]][[j]]: all the rownames(Yexample) that match the pattern of rownames(Xexample)[i] when the component = j
For example, if you want to know what genes map to “LP_Dip” (row 6) when using 4 components, use g[[6]][[4]].
- g[[1]][[4]] are those genes that match celltype 1 and have a FPR < 0.05
- g[[1]][[5]] are those genes that match celltype 1 and have a FPR < 0.02
Note that necessarily, every gene in g[[1]][[5]] is also in g[[1]][[4]], g[[1]][[3]], g[[1]][[2]], g[[1]][[1]].
To see the number of rowterms of Yexample (genes) that map to the rowterms of Xexample (substages) and the intersections of these lists, use get.inter().
int <- get.inter(g,2)
rownames(int) <- colnames(int) <- rownames(Xexample)
int
#> Spermatogonia preleptotene Early.leptotene LL_Zyg
#> Spermatogonia 58 0 0 0
#> preleptotene 0 71 0 1
#> Early.leptotene 0 0 303 0
#> LL_Zyg 0 1 0 93
#> Early.pachytene 0 0 0 55
#> LP_Dip 0 0 0 0
#> Early.pachytene LP_Dip
#> Spermatogonia 0 0
#> preleptotene 0 0
#> Early.leptotene 0 0
#> LL_Zyg 55 0
#> Early.pachytene 134 0
#> LP_Dip 0 162
int[i,j] = number of rownames(Yexample) (genes) that mapped to both rownames(Xexample)[i] [substage i] & rownames(Xexample)[j] [substage j]
Here, we see that 58 genes are specific to Spermatogonia, 162 genes are specific to LP_Dip, and 55 genes are shared between LL_Zyg and EP.
Using a more stringent FPR, select component 3:
int <- get.inter(g,3)
rownames(int) <- colnames(int) <- rownames(Xexample)
int
#> Spermatogonia preleptotene Early.leptotene LL_Zyg
#> Spermatogonia 44 0 0 0
#> preleptotene 0 5 0 0
#> Early.leptotene 0 0 7 0
#> LL_Zyg 0 0 0 70
#> Early.pachytene 0 0 0 40
#> LP_Dip 0 0 0 0
#> Early.pachytene LP_Dip
#> Spermatogonia 0 0
#> preleptotene 0 0
#> Early.leptotene 0 0
#> LL_Zyg 40 0
#> Early.pachytene 115 0
#> LP_Dip 0 154
Now, only 44 genes are specific to Spermatogonia and 40 genes are shared between LL_Zyg and EP
Similarly, the anti-associations can be extracted using scores.neg.
h <- match.patterns(scores.neg,w=it.result$wopt, rowterms = rownames(Yexample))
int <- get.inter(h,2)
rownames(int) <- colnames(int) <- rownames(Xexample)
int
#> Spermatogonia preleptotene Early.leptotene LL_Zyg
#> Spermatogonia 100 0 0 0
#> preleptotene 0 6 0 1
#> Early.leptotene 0 0 216 0
#> LL_Zyg 0 1 0 52
#> Early.pachytene 0 0 0 44
#> LP_Dip 0 0 0 0
#> Early.pachytene LP_Dip
#> Spermatogonia 0 0
#> preleptotene 0 0
#> Early.leptotene 0 0
#> LL_Zyg 44 0
#> Early.pachytene 151 0
#> LP_Dip 0 230
Now, we see that 100 genes are anti-associated to spermatogonia, etc.
A detailed description of the mathematics behind PMCA can be found here.