-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathstanmodelpresentation.qmd
240 lines (182 loc) · 12 KB
/
stanmodelpresentation.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
---
title: "Multilevel trait-microbiome mismatch model"
author: Quentin D. Read
format:
revealjs:
embed-resources: true
transition: slide
engine: knitr
---
## The problem
- We have maternal plants from different populations
- We want to measure the association between seedling traits and seed microbiome in their offspring
+ And whether this association is different by population but that's just an extra wrinkle
:::: {.columns}
::: {.column}
![](https://upload.wikimedia.org/wikipedia/commons/thumb/c/cd/Flag_of_Afghanistan_%282013%E2%80%932021%29.svg/800px-Flag_of_Afghanistan_%282013%E2%80%932021%29.svg.png)
:::
::: {.column}
![](https://upload.wikimedia.org/wikipedia/commons/thumb/b/b4/Flag_of_Turkey.svg/800px-Flag_of_Turkey.svg.png)
:::
::::
## The problem
- But it is not possible to observe seed microbiome and seedling traits for the same seed
- The seed can either be germinated to measure traits, or destroyed to measure microbiome, but not both
:::: {.columns}
::: {.column}
![](https://upload.wikimedia.org/wikipedia/commons/thumb/1/10/Eierkarton_10_%28fcm%29.jpg/740px-Eierkarton_10_%28fcm%29.jpg)
:::
::: {.column}
![](https://upload.wikimedia.org/wikipedia/commons/thumb/f/f0/Fried_Egg_2.jpg/800px-Fried_Egg_2.jpg)
:::
::::
## How to resolve the problem
- We will treat the observed traits of seedlings that are offspring of the same maternal plant as realizations of an unobserved latent value: the trait value of the "average" offspring from that mother plant
- We'll do the same for the observed microbiomes of seeds from the same mother: they're realizations of the unobserved "average" seed microbiome community among offspring from that mother
## Overview of model
- So far the model I've created measures the association between microbiome and **one single trait**
- Estimate latent average seedling trait of offspring from each mother from the trait observations
- Estimate latent average seed microbiome of offspring from each mother from the microbiome observations
- Regress latent trait on latent microbiome, population of origin, and microbiome:population interaction
## The data
- $\mathbf{y}$: a vector of standardized trait values with length $N_t$
+ $N_t$: number of offspring for which traits were measured
- $\mathbf{X}$: a matrix of CLR-transformed seed microbiome taxon abundances with dimensions $N_m \times D$
+ $N_m$: number of offspring for which microbiomes were measured
+ $D$: number of taxa in the microbiome dataset
## The data, continued
- $\mathbf{p}$: a vector of covariates at maternal plant level (here it is just population of origin) with length $M$
+ $M$: number of maternal plants
- $\mathbf{M}_{t}$: a vector of length $N_t$ assigning maternal plant to each seedling with traits
- $\mathbf{M}_{m}$: a vector of length $N_m$ assigning maternal plant to each seed with microbiome
## The parameters
- $\mathbf{y}_M$: vector of length $M$, latent unobserved trait value of average offspring of each mother
- $\mathbf{X}_M$: matrix with dimensions $M \times D$, latent unobserved seed microbiome of average offspring of each mother
- $\sigma_{yM}$: standard deviation of traits across seedlings
- $\Sigma_{XM}$: covariance matrix of seed microbiomes
## The parameters, continued
- $\beta_0$: intercept of trait regression
- $\beta_{pop}$: main effect of population on trait
- $\beta_{tax}$: vector of length $D$, main effect of each taxon on trait
- $\beta_{taxpop}$: vector of length $D$, interactive effect between each taxon and population on trait
- $\sigma$: standard deviation of traits across maternal plants
## The model
- Observed traits are realizations of unobserved latent offspring trait for each mother $\mathbf{y}_{M}$
+ $\mathbf{y}_{i} \sim \text{Normal}(\mathbf{y}_{M, M_{t}(i)}, \sigma_{yM})$
- Observed microbiomes are realizations of unobserved latent offspring microbiome for each mother $\mathbf{X}_{M}$
+ $\mathbf{X}_{i} \sim \text{MultiNormal}(\mathbf{X}_{M, M_{m}(i)}, \Sigma_{XM})$
+ (In Stan we split the covariance matrix into a correlation matrix $\Omega_{XM}$ and a vector of scaling parameters $\tau$)
## The model, continued
- Now that we have latent traits and microbiomes that match up, we can do the regression!
+ $\mathbf{y}_{Mi} \sim \text{Normal}(\beta_0 + \beta_{pop}\mathbf{p}_{i} + \beta_{tax}\mathbf{X}_{Mi} + \beta_{taxpop}\mathbf{p}_{i}\mathbf{X}_{Mi}, \sigma)$
- No interactions between taxa are currently included
## The Stan code
- Everything I explained before, plus priors on the parameters
```
// Multivariate microbiome by single trait model
// QDR 2024-05-02
data {
int<lower=1> Nmicro; // Number of offspring with microbiome data
int<lower=1> Ntrait; // Number of offspring with trait data
int<lower=1> M; // Number of maternal plants
int<lower=1> D; // Number of taxa in the microbiome data
vector[Ntrait] y; // Vector of standardized values of a trait for each offspring with trait data
array[Nmicro] vector[D] X; // Array of CLR transformed abundance vectors for each offspring with microbiome data
vector<lower=0,upper=1>[M] pop; // Population of origin of the maternal plants (0=Afghanistan, 1=Turkey)
array[Nmicro] int<lower=1,upper=M> Mmicro; // Mapping to maternal plants 1-M for microbiome data
array[Ntrait] int<lower=1,upper=M> Mtrait; // Mapping to maternal plants 1-M for trait data
}
parameters {
real b0; // Global intercept of trait across all maternal plants
real b_pop; // Fixed effects at maternal plant level of population, taxon abundance, and their interaction
row_vector[D] b_tax;
row_vector[D] b_tax_pop;
vector[M] y_M; // Predicted trait value for each maternal plant
array[M] vector[D] X_M; // Array of predicted abundance vectors for each maternal plant
real<lower=0> sigma_yM; // Variation (SD) among maternal plants in traits
corr_matrix[D] Omega_xM; // Correlation matrix of taxon abundances
vector<lower=0>[D] tau; // Parameter to scale correlation matrix to variance-covariance matrix
real<lower=0> sigma; // Variation (SD) of model residuals
}
model {
// Priors
b0 ~ normal(0, 10); // Fixed effects get normal priors
b_pop ~ normal(0, 10);
b_tax ~ normal(0, 10);
b_tax_pop ~ normal(0, 10);
y_M ~ normal(0, 10); // Maternal plant level intercepts for traits and taxon abundances get normal priors
for (i in 1:M) {
X_M[i] ~ normal(0, 10);
}
sigma_yM ~ exponential(1); // SD parameters get exponential priors (must be >0)
sigma ~ exponential(1);
tau ~ cauchy(0, 2.5); // Scaling parameters of covariance matrix get half-Cauchy priors (must be >0)
Omega_xM ~ lkj_corr(2); // Correlation matrix gets an LKJ prior (see http://stla.github.io/stlapblog/posts/StanLKJprior.html)
// Likelihood
// Trait of each offspring is drawn from a normal distribution with mean being the mean trait of its mother
for (i in 1:Ntrait) {
y[i] ~ normal(y_M[Mtrait[i]], sigma_yM);
}
// Taxon abundance of each offspring is drawn from a multivariate normal distribution with mean being the mean abundance of its mother
for (i in 1:Nmicro) {
X[i] ~ multi_normal(X_M[Mmicro[i]], quad_form_diag(Omega_xM, tau));
}
// Trait of each maternal plant is modeled as a function of its population of origin, its microbiome, and their interactions
for (i in 1:M) {
y_M[i] ~ normal(b0 + b_pop * pop[i] + b_tax * X_M[i] + b_tax_pop * (pop[i] .* X_M[i]), sigma);
}
}
```
## Fit model to Alicia's data
- For testing purposes, a subset of 10 of the >1200 taxa were selected
- Rooting depth is used as the trait
- Total 63 seed microbiomes and 50 rooting depth values, from 9 maternal plants across 2 populations
## Results
- The model produces numbers, which is good, I guess?
- Here are the first few rows of parameter estimates
+ (Note all the regression parameter estimates are close to 0)
```
variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 lp__ -34.4 -34.1 11.7 12.5 -54.7 -16.6 1.01 501. 506.
2 b0 -0.321 -0.266 2.61 2.42 -4.56 3.84 1.01 1301. 3463.
3 b_pop 0.0616 0.00204 5.04 4.79 -8.09 8.38 1.00 1050. 1768.
4 b_tax[1] -0.202 -0.0640 9.01 8.68 -15.5 14.5 1.00 1585. 2775.
5 b_tax[2] -0.0897 -0.156 6.57 6.06 -10.4 10.8 1.01 872. 2115.
6 b_tax[3] -0.126 -0.348 3.06 2.52 -5.01 5.18 1.00 1431. 1642.
7 b_tax[4] 0.275 0.234 5.92 5.34 -9.73 9.92 1.01 1059. 1529.
8 b_tax[5] 0.0626 -0.219 8.75 8.38 -14.3 14.5 1.00 4529. 4495.
9 b_tax[6] -0.278 -0.378 9.28 9.68 -14.9 14.8 1.01 845. 1892.
10 b_tax[7] 0.650 0.397 8.43 8.23 -13.0 15.0 1.02 234. 77.8
11 b_tax[8] 0.344 0.464 5.87 5.41 -9.26 10.2 1.00 3381. 5214.
12 b_tax[9] 0.498 0.623 6.31 6.00 -9.69 10.6 1.01 908. 3962.
13 b_tax[10] -0.0318 -0.0688 4.30 3.70 -7.08 6.80 1.01 2033. 3064.
14 b_tax_pop[1] 0.0244 0.384 9.49 9.76 -15.8 15.0 1.01 1222. 5599.
15 b_tax_pop[2] 0.513 0.608 9.53 8.82 -15.0 17.3 1.02 433. 78.3
16 b_tax_pop[3] -0.455 -0.393 7.09 6.56 -12.7 11.5 1.01 487. 222.
17 b_tax_pop[4] 0.906 1.33 7.89 7.59 -12.3 13.7 1.00 2262. 4013.
18 b_tax_pop[5] -0.470 -0.634 9.52 9.25 -16.2 15.1 1.00 1611. 1743.
19 b_tax_pop[6] -0.742 -0.370 9.71 9.80 -18.5 15.0 1.01 357. 108.
20 b_tax_pop[7] -0.0000176 -0.0125 8.91 8.75 -14.0 14.4 1.01 1062. 2898.
21 b_tax_pop[8] -0.0345 -0.249 9.00 8.71 -14.7 15.0 1.00 2088. 2716.
22 b_tax_pop[9] -0.949 -0.920 9.51 9.47 -16.4 14.8 1.01 345. 468.
23 b_tax_pop[10] -0.394 -0.466 7.03 6.86 -11.5 11.0 1.01 1127. 1551.
24 y_M[1] -0.280 -0.279 0.198 0.191 -0.601 0.0487 1.00 1681. 2529.
25 y_M[2] -1.06 -1.07 0.256 0.257 -1.47 -0.630 1.01 833. 1833.
26 y_M[3] -0.479 -0.475 0.486 0.476 -1.29 0.406 1.02 266. 73.8
27 y_M[4] -0.116 -0.118 0.250 0.254 -0.532 0.293 1.01 484. 207.
28 y_M[5] -0.553 -0.551 0.263 0.251 -1.00 -0.124 1.01 1040. 672.
29 y_M[6] -0.795 -0.772 0.498 0.441 -1.69 0.00997 1.02 209. 63.2
30 y_M[7] 1.39 1.39 0.216 0.222 1.06 1.77 1.01 534. 594.
```
## Future directions
- The model could be extended to multivariate trait outcomes and interactions between taxa, *at least in theory*
- A major issue is scaling up the model
+ Only tested this on a random subset of 10 taxa from a few dozen individual plants
+ It only took a few minutes to sample the model but this will blow up as community size increases
+ Full Bayesian computation on anything much bigger than that is going to be very tough
## Future directions, continued
- The most important thing to do right now is generate some fake data with known parameters and fit the model to it
- This will demonstrate that the model works
- Because Stan forces you to rigorously define your model in terms of probability distributions, this will actually be pretty easy to do!
- Stay tuned ...