-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathmismatch_models.Rmd
212 lines (160 loc) · 11.3 KB
/
mismatch_models.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
---
title: "Modeling mismatched seed traits and microbiome"
author: "Quentin D. Read"
date: "`r format(Sys.time(), '%B %d, %Y')`"
output:
html_document:
toc: true
toc_depth: 2
toc_float: true
code_folding: show
---
```{css, echo = FALSE}
.gt_table {
margin-bottom: 20px;
margin-top: 20px;
}
```
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
```
# Summary
Here, I took individual seed traits, fit a mixed model with country of origin as a fixed effect and maternal plant as a random intercept to each trait, then extracted the posterior predictions for each maternal plant from the models. Next, I aggregated the seed microbiomes from each maternal plant by adding the counts together. Then, for each of 100 posterior samples for each trait, I fit a random forest model using the aggregated microbiomes of each maternal plant to predict the modeled trait value for the maternal plant.
The random forest models performed poorly and do not predict trait given maternal microbiome. This is true for both 16S and ITS microbiome.
Two potential ways to extend this are (1) to make it a multivariate approach instead of a series of univariate models. This would involve replacing the univariate mixed models with a multivariate response, and replacing the series of random forests with a multioutput neural network or something similar. This is unlikely to yield a different result in the case of this dataset but may be worth working on for other cases. The other extension is (2) to use Aldex2 or some other approach to get a distribution of estimates for the aggregated maternal seed microbiome, instead of just adding up the offsprings' microbiomes. Again, this is unlikely to be useful here but might be for future cases.
# Setup
```{r load packages}
library(data.table)
library(brms)
library(ggplot2)
library(ranger)
library(purrr)
library(ALDEx2)
library(mixOmics)
options(mc.cores = 4, brms.backend = 'cmdstanr', brms.file_refit = 'on_change')
# Import data. Corrected labels, data accessed from g drive 2023-06-06
abundance16S <- fread('project/data/16S_abundance.csv')
abundanceITS <- fread('project/data/ITS_Abundance.csv')
traits <- fread('project/data/seedling_data_updated.csv')
```
```{r ggplot theme, include = FALSE}
windowsFonts(`fgbook` = windowsFont('Franklin Gothic Book'))
theme_set(
theme_bw(base_family = 'fgbook') +
theme(panel.grid = element_blank(),
strip.background = element_blank())
)
```
# Trait model
Fit a very simple linear mixed model to each trait separately. All default priors are used. Log transformation is used for the height and depth traits. (Log+0.1 is used for leaf height due to one having a zero value). For the count traits and days to germination, a GLMM is fit with Poisson response distribution.
There is relatively little variation in the number of primary roots (42 of 50 had 1 root) and the number of leaves (40 of 50 had 1 leaf). Thus it seems unlikely to be fruitful to use seed microbiome to predict these, but they are included for completeness.
> NOTE: I am also working on some efforts to make this a multivariate model using multivariate regression followed by a multioutput neural network. However, I do not feel that it is providing much of a benefit in the case of this dataset. Thus, I am not going to show it here, just the individual univariate models. But it may be worth using the multivariate approach in other cases.
```{r}
rdepthfit <- brm(log(rooting_depth_cm) ~ country_origin + (1 | maternal_plant_code), data = traits,
prior = c(prior(normal(0, 3), class = b)),
chains = 4, iter = 2000, warmup = 1000, seed = 836,
file = 'project/fits/rdepthfit')
lheightfit <- brm(log(leaf_height_cm + 0.1) ~ country_origin + (1 | maternal_plant_code), data = traits,
prior = c(prior(normal(0, 3), class = b)),
chains = 4, iter = 2000, warmup = 1000, seed = 1340,
file = 'project/fits/lheightfit')
rtipcountfit <- brm(root_tip_count ~ country_origin + (1 | maternal_plant_code), data = traits, family = poisson,
prior = c(prior(normal(0, 1), class = b)),
chains = 4, iter = 3000, warmup = 2000, seed = 1349,
control = list(adapt_delta = 0.95),
file = 'project/fits/rtipcountfit')
nprimrootfit <- brm(no_primary_roots ~ country_origin + (1 | maternal_plant_code), data = traits, family = poisson,
prior = c(prior(normal(0, 1), class = b)),
chains = 4, iter = 3000, warmup = 2000, seed = 1351,
control = list(adapt_delta = 0.95),
file = 'project/fits/nprimrootfit')
nleaffit <- brm(no_leaves ~ country_origin + (1 | maternal_plant_code), data = traits, family = poisson,
prior = c(prior(normal(0, 1), class = b)),
chains = 4, iter = 3000, warmup = 2000, seed = 1408,
control = list(adapt_delta = 0.95),
file = 'project/fits/nleaffit')
daygermfit <- brm(days_to_germinate ~ country_origin + (1 | maternal_plant_code), data = traits, family = poisson,
prior = c(prior(normal(0, 3), class = b)),
chains = 4, iter = 3000, warmup = 2000, seed = 1409,
file = 'project/fits/daygermfit')
```
Extract the posterior distribution of the BLUPs for each of the maternal plants that have data (9 for all of the traits). If you add the fixed and random effect together you get a predicted value for each maternal plant for each of the 4000 posterior samples, so the posterior predictions for each trait are a 4000x9 matrix.
```{r}
trait_fits <- list(rdepthfit, lheightfit, rtipcountfit, nprimrootfit, nleaffit, daygermfit)
trait_names <- c('rooting_depth_cm', 'leaf_height_cm', 'root_tip_count', 'no_primary_roots', 'no_leaves', 'days_to_germinate')
pred_grids <- map(trait_names, ~ unique(traits[!is.na(get(.)), .(maternal_plant_code, country_origin)]))
trait_post_blups <- map2(trait_fits, pred_grids, ~ predict(.x, newdata = .y, summary = FALSE))
```
# Option 1: use aggregated microbiome to predict traits
This is a "quick and dirty" method. Aggregate the microbiome samples by maternal plant by summing the counts up. This will be a reasonably good estimate of the "mean" microbiome from each maternal plant but will not account for the uncertainty resulting from us being only able to obliquely observe the maternal microbiome through sampling the microbiomes of its offspring.
As of 2023-10-16, the aggregation is done as follows:
- Remove maternal plant J from the data as it has no seedling traits.
- Remove any taxa that have all 0 abundance across all remaining microbiomes.
- Do CLR transformation, first adding 1 to all counts.
- Take the average log-ratio transformed value for each taxon, grouped by maternal plant.
```{r}
aggregate_abundance <- function(abundance) {
# Remove maternal plant J as it has no seedling traits.
J_columns <- grep('J$', names(abundance), value = TRUE)
abundance[, c(J_columns) := NULL]
# Remove any taxa that now have all zero abundance in the remaining 9 maternal plants.
zero_rows <- rowSums(abundance[,-1]) == 0
abundance <- abundance[!zero_rows]
# Do CLR transformation, adding 1 to all counts. Replace original values with transformed.
abundance_CLR <- logratio.transfo(abundance[,-1], logratio = 'CLR', offset = 1)
abundance_CLR <- cbind(abundance[, .(V1)], as(abundance_CLR, 'matrix'))
# Reshape to longform
abundance_long <- melt(abundance_CLR, id.vars = 'V1')
abundance_long[, maternal_plant_code := substr(variable, 3, 3)]
abundance_long[, sampleID := substr(variable, 1, 2)]
# Aggregate by taking the average log ratio of each taxon grouped by maternal plant
abundance_agg <- abundance_long[, .(value = mean(value)), by = .(V1, maternal_plant_code)]
# Return to wideform
abundance_agg_wide <- dcast(abundance_agg, maternal_plant_code ~ V1)
}
# Do aggregation, retaining identifier column.
abundance16S_agg_forfitting <- aggregate_abundance(abundance16S)
abundanceITS_agg_forfitting <- aggregate_abundance(abundanceITS)
```
Define function to fit the random forest for a single posterior sample, passing in the predicted trait value and which microbiome is being used.
```{r}
fit_rf <- function(y, pred_grid, abundance_dt) {
y <- data.table(maternal_plant_code = pred_grid$maternal_plant_code, y = y)
dt <- y[abundance_dt, on = .(maternal_plant_code)]
dt[, maternal_plant_code := NULL]
setnames(dt, c('y', paste0('otu', 1:(ncol(dt)-1)))) # Must do this because the OTU names are not valid column names.
rf <- ranger(y ~ ., data = dt, min.node.size = 3, num.trees = 1000)
}
```
Apply the function to a random subset of 100 of the posterior samples.
> NOTE: It would be easy to parallelize this if it becomes necessary later, to more easily work with more posterior samples.
```{r}
set.seed(1104)
trait_post_blups_subsample <- purrr::map(trait_post_blups, ~ .[sample(1:nrow(.), 100), ])
post_rf_fits_16S <- map2(trait_post_blups_subsample, pred_grids, ~ apply(.x, 1, fit_rf, pred_grid = .y, abundance_dt = abundance16S_agg_forfitting))
post_rf_fits_ITS <- map2(trait_post_blups_subsample, pred_grids, ~ apply(.x, 1, fit_rf, pred_grid = .y, abundance_dt = abundanceITS_agg_forfitting))
```
Examine the out-of-bag R-squared values for all 100 posterior samples, for both 16S and ITS.
```{r}
rsqs_16S <- map2(post_rf_fits_16S, trait_names, ~ data.frame(trait = .y, rsquared = map_dbl(.x, 'r.squared'))) |> list_rbind()
rsqs_ITS <- map2(post_rf_fits_ITS, trait_names, ~ data.frame(trait = .y, rsquared = map_dbl(.x, 'r.squared'))) |> list_rbind()
ggplot(rsqs_16S, aes(x = rsquared)) +
geom_histogram() +
stat_summary(aes(y = rsquared, x = 0.1, xintercept = stat(y)), geom = 'vline', fun.y = 'median', color = 'red') +
facet_wrap(~ trait) +
scale_y_continuous(name = 'count', expand = expansion(add = c(0, 1)), breaks = c(0, 4, 8, 12)) +
labs(x = 'out-of-bag R^2^') +
theme(axis.title.x = ggtext::element_markdown()) +
ggtitle('16S')
ggplot(rsqs_ITS, aes(x = rsquared)) +
geom_histogram() +
stat_summary(aes(y = rsquared, x = 0.1, xintercept = stat(y)), geom = 'vline', fun.y = 'median', color = 'red') +
facet_wrap(~ trait) +
scale_y_continuous(name = 'count', expand = expansion(add = c(0, 1)), breaks = c(0, 4, 8, 12)) +
labs(x = 'out-of-bag R^2^') +
theme(axis.title.x = ggtext::element_markdown()) +
ggtitle('ITS')
```
It is very poor performance for both 16S and ITS: almost all the out of bag r-squared values are <0 and the mean is about -0.2, indicating that the model using microbiome to predict traits (after adjusting for maternal plant identity and country of origin using the linear mixed model) is even worse than just predicting that every individual has the average trait.
# Option 2: use distribution from Aldex2 model to predict traits
See the vignette especially [part 4.1 in the vignette](https://bioconductor.org/packages/release/bioc/vignettes/ALDEx2/inst/doc/ALDEx2_vignette.html#41_The_aldexclr_module). As far as I can tell Aldex2 is generating samples from an underlying distribution for each sample though we really want it to generate samples from an underlying distribution for each maternal plant, with samples grouped together. So far, I have been unable to get a better result. For now we are just using the aggregated microbiome.