forked from daviddalpiaz/r4sl
-
Notifications
You must be signed in to change notification settings - Fork 0
/
11-generative.Rmd
330 lines (230 loc) · 13.9 KB
/
11-generative.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
# Generative Models
In this chapter, we continue our discussion of classification methods. We introduce three new methods, each a **generative** method. This in comparison to logistic regression, which is a **discriminative** method.
Generative methods model the joint probability, $p(x, y)$, often by assuming some distribution for the conditional distribution of $X$ given $Y$, $f(x \mid y)$. Bayes theorem is then applied to classify according to $p(y \mid x)$. Discriminative methods directly model this conditional, $p(y \mid x)$. A detailed discussion and analysis can be found in [Ng and Jordan, 2002](https://papers.nips.cc/paper/2020-on-discriminative-vs-generative-classifiers-a-comparison-of-logistic-regression-and-naive-bayes.pdf).
Each of the methods in this chapter will use Bayes theorem to build a classifier.
$$
p_k(x) = P(Y = k \mid X = x) = \frac{\pi_k \cdot f_k(x)}{\sum_{g = 1}^{G} \pi_g \cdot f_g(x)}
$$
We call $p_k(x)$ the **posterior** probability, which we will estimate then use to create classifications. The $\pi_g$ are called the **prior** probabilities for each possible class $g$. That is, $\pi_g = P(Y = g)$, unconditioned on $X$. The $f_g(x)$ are called the **likelihoods**, which are indexed by $g$ to denote that they are conditional on the classes. The denominator is often referred to as a **normalizing constant**.
The methods will differ by placing different modeling assumptions on the likelihoods, $f_g(x)$. For each method, the priors could be learned from data or pre-specified.
For each method, classifications are made to the class with the highest estimated posterior probability, which is equivalent to the class with the largest
$$
\log(\hat{\pi}_k \cdot \hat{f}_k({\mathbf x})).
$$
By substituting the corresponding likelihoods, simplifying, and eliminating unnecessary terms, we could derive the discriminant function for each.
To illustrate these new methods, we return to the iris data, which you may remember has three classes. After a test-train split, we create a number of plots to refresh our memory.
```{r}
set.seed(430)
iris_obs = nrow(iris)
iris_idx = sample(iris_obs, size = trunc(0.50 * iris_obs))
# iris_index = sample(iris_obs, size = trunc(0.10 * iris_obs))
iris_trn = iris[iris_idx, ]
iris_tst = iris[-iris_idx, ]
```
```{r, fig.height=8, fig.width=8}
caret::featurePlot(x = iris_trn[, c("Sepal.Length", "Sepal.Width",
"Petal.Length", "Petal.Width")],
y = iris_trn$Species,
plot = "density",
scales = list(x = list(relation = "free"),
y = list(relation = "free")),
adjust = 1.5,
pch = "|",
layout = c(2, 2),
auto.key = list(columns = 3))
```
```{r, fig.height=8, fig.width=8}
caret::featurePlot(x = iris_trn[, c("Sepal.Length", "Sepal.Width",
"Petal.Length", "Petal.Width")],
y = iris_trn$Species,
plot = "ellipse",
auto.key = list(columns = 3))
```
```{r, fig.height=4, fig.width=7}
caret::featurePlot(x = iris_trn[, c("Sepal.Length", "Sepal.Width",
"Petal.Length", "Petal.Width")],
y = iris_trn$Species,
plot = "box",
scales = list(y = list(relation = "free"),
x = list(rot = 90)),
layout = c(4, 1))
```
Especially based on the pairs plot, we see that it should not be too difficult to find a good classifier.
Notice that we use `caret::featurePlot` to access the `featurePlot()` function without loading the entire `caret` package.
## Linear Discriminant Analysis
LDA assumes that the predictors are multivariate normal conditioned on the classes.
$$
X \mid Y = k \sim N(\mu_k, \Sigma)
$$
$$
f_k({\mathbf x}) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}\exp\left[-\frac{1}{2}(\mathbf x - \mu_k)^{\prime}\Sigma^{-1}(\mathbf x - \mu_k)\right]
$$
Notice that $\Sigma$ does **not** depend on $k$, that is, we are assuming the same $\Sigma$ for each class. We then use information from all the classes to estimate $\Sigma$.
To fit an LDA model, we use the `lda()` function from the `MASS` package.
```{r}
library(MASS)
iris_lda = lda(Species ~ ., data = iris_trn)
iris_lda
```
Here we see the estimated $\hat{\pi}_k$ and $\hat{\mu}_k$ for each class.
```{r}
is.list(predict(iris_lda, iris_trn))
names(predict(iris_lda, iris_trn))
head(predict(iris_lda, iris_trn)$class, n = 10)
head(predict(iris_lda, iris_trn)$posterior, n = 10)
```
As we should come to expect, the `predict()` function operates in a new way when called on an `lda` object. By default, it returns an entire list. Within that list `class` stores the classifications and `posterior` contains the estimated probability for each class.
```{r}
iris_lda_trn_pred = predict(iris_lda, iris_trn)$class
iris_lda_tst_pred = predict(iris_lda, iris_tst)$class
```
We store the predictions made on the train and test sets.
```{r}
calc_class_err = function(actual, predicted) {
mean(actual != predicted)
}
```
```{r}
calc_class_err(predicted = iris_lda_trn_pred, actual = iris_trn$Species)
calc_class_err(predicted = iris_lda_tst_pred, actual = iris_tst$Species)
```
As expected, LDA performs well on both the train and test data.
```{r}
table(predicted = iris_lda_tst_pred, actual = iris_tst$Species)
```
Looking at the test set, we see that we are perfectly predicting both setosa and versicolor. The only error is labeling a virginica as a versicolor.
```{r}
iris_lda_flat = lda(Species ~ ., data = iris_trn, prior = c(1, 1, 1) / 3)
iris_lda_flat
```
Instead of learning (estimating) the proportion of the three species from the data, we could instead specify them ourselves. Here we choose a uniform distributions over the possible species. We would call this a "flat" prior.
```{r}
iris_lda_flat_trn_pred = predict(iris_lda_flat, iris_trn)$class
iris_lda_flat_tst_pred = predict(iris_lda_flat, iris_tst)$class
```
```{r}
calc_class_err(predicted = iris_lda_flat_trn_pred, actual = iris_trn$Species)
calc_class_err(predicted = iris_lda_flat_tst_pred, actual = iris_tst$Species)
```
This actually gives a better test accuracy!
## Quadratic Discriminant Analysis
QDA also assumes that the predictors are multivariate normal conditioned on the classes.
$$
X \mid Y = k \sim N(\mu_k, \Sigma_k)
$$
$$
f_k({\mathbf x}) = \frac{1}{(2\pi)^{p/2}|\Sigma_k|^{1/2}}\exp\left[-\frac{1}{2}(\mathbf x - \mu_k)^{\prime}\Sigma_{k}^{-1}(\mathbf x - \mu_k)\right]
$$
Notice that now $\Sigma_k$ **does** depend on $k$, that is, we are allowing a different $\Sigma_k$ for each class. We only use information from class $k$ to estimate $\Sigma_k$.
```{r}
iris_qda = qda(Species ~ ., data = iris_trn)
iris_qda
```
Here the output is similar to LDA, again giving the estimated $\hat{\pi}_k$ and $\hat{\mu}_k$ for each class. Like `lda()`, the `qda()` function is found in the `MASS` package.
Consider trying to fit QDA again, but this time with a smaller training set. (Use the commented line above to obtain a smaller test set.) This will cause an error because there are not enough observations within each class to estimate the large number of parameters in the $\Sigma_k$ matrices. This is less of a problem with LDA, since all observations, no matter the class, are being use to estimate the shared $\Sigma$ matrix.
```{r}
iris_qda_trn_pred = predict(iris_qda, iris_trn)$class
iris_qda_tst_pred = predict(iris_qda, iris_tst)$class
```
The `predict()` function operates the same as the `predict()` function for LDA.
```{r}
calc_class_err(predicted = iris_qda_trn_pred, actual = iris_trn$Species)
calc_class_err(predicted = iris_qda_tst_pred, actual = iris_tst$Species)
```
```{r}
table(predicted = iris_qda_tst_pred, actual = iris_tst$Species)
```
Here we find that QDA is not performing as well as LDA. It is misclassifying versicolors. Since QDA is a more complex model than LDA (many more parameters) we would say that QDA is overfitting here.
Also note that, QDA creates quadratic decision boundaries, while LDA creates linear decision boundaries. We could also add quadratic terms to LDA to allow it to create quadratic decision boundaries.
## Naive Bayes
Naive Bayes comes in many forms. With only numeric predictors, it often assumes a multivariate normal conditioned on the classes, but a very specific multivariate normal.
$$
{\mathbf X} \mid Y = k \sim N(\mu_k, \Sigma_k)
$$
Naive Bayes assumes that the predictors $X_1, X_2, \ldots, X_p$ are independent. This is the "naive" part of naive Bayes. The Bayes part is nothing new. Since $X_1, X_2, \ldots, X_p$ are assumed independent, each $\Sigma_k$ is diagonal, that is, we assume no correlation between predictors. Independence implies zero correlation.
This will allow us to write the (joint) likelihood as a product of univariate distributions. In this case, the product of univariate normal distributions instead of a (joint) multivariate distribution.
$$
f_k(x) = \prod_{j = 1}^{j = p} f_{kj}(x_j)
$$
Here, $f_{kj}(x_j)$ is the density for the $j$-th predictor conditioned on the $k$-th class. Notice that there is a $\sigma_{kj}$ for each predictor for each class.
$$
f_{kj}(x_j) = \frac{1}{\sigma_{kj}\sqrt{2\pi}}\exp\left[-\frac{1}{2}\left(\frac{x_j - \mu_{kj}}{\sigma_{kj}}\right)^2\right]
$$
When $p = 1$, this version of naive Bayes is equivalent to QDA.
```{r}
library(e1071)
iris_nb = naiveBayes(Species ~ ., data = iris_trn)
iris_nb
```
Many packages implement naive Bayes. Here we choose to use `naiveBayes()` from the package `e1071`. (The name of this package has an interesting history. Based on the name you wouldn't know it, but the package contains many functions related to machine learning.)
The `Conditional probabilities:` portion of the output gives the mean and standard deviation of the normal distribution for each predictor in each class. Notice how these mean estimates match those for LDA and QDA above.
Note that `naiveBayes()` will work without a factor response, but functions much better with one. (Especially when making predictions.) If you are using a `0` and `1` response, you might consider coercing to a factor first.
```{r}
head(predict(iris_nb, iris_trn))
head(predict(iris_nb, iris_trn, type = "class"))
head(predict(iris_nb, iris_trn, type = "raw"))
```
Oh look, `predict()` has another new mode of operation. If only there were a way to unify the `predict()` function across all of these methods...
```{r}
iris_nb_trn_pred = predict(iris_nb, iris_trn)
iris_nb_tst_pred = predict(iris_nb, iris_tst)
```
```{r}
calc_class_err(predicted = iris_nb_trn_pred, actual = iris_trn$Species)
calc_class_err(predicted = iris_nb_tst_pred, actual = iris_tst$Species)
```
```{r}
table(predicted = iris_nb_tst_pred, actual = iris_tst$Species)
```
Like LDA, naive Bayes is having trouble with virginica.
```{r, echo = FALSE}
classifiers = c("LDA", "LDA, Flat Prior", "QDA", "Naive Bayes")
train_err = c(
calc_class_err(predicted = iris_lda_trn_pred, actual = iris_trn$Species),
calc_class_err(predicted = iris_lda_flat_trn_pred, actual = iris_trn$Species),
calc_class_err(predicted = iris_qda_trn_pred, actual = iris_trn$Species),
calc_class_err(predicted = iris_nb_trn_pred, actual = iris_trn$Species)
)
test_err = c(
calc_class_err(predicted = iris_lda_tst_pred, actual = iris_tst$Species),
calc_class_err(predicted = iris_lda_flat_tst_pred, actual = iris_tst$Species),
calc_class_err(predicted = iris_qda_tst_pred, actual = iris_tst$Species),
calc_class_err(predicted = iris_nb_tst_pred, actual = iris_tst$Species)
)
results = data.frame(
classifiers,
train_err,
test_err
)
colnames(results) = c("Method", "Train Error", "Test Error")
```
```{r, echo = FALSE}
knitr::kable(results)
```
Summarizing the results, we see that Naive Bayes is the worst of LDA, QDA, and NB for this data. So why should we care about naive Bayes?
The strength of naive Bayes comes from its ability to handle a large number of predictors, $p$, even with a limited sample size $n$. Even with the naive independence assumption, naive Bayes works rather well in practice. Also because of this assumption, we can often train naive Bayes where LDA and QDA may be impossible to train because of the large number of parameters relative to the number of observations.
Here naive Bayes doesn't get a chance to show its strength since LDA and QDA already perform well, and the number of predictors is low. The choice between LDA and QDA is mostly down to a consideration about the amount of complexity needed.
## Discrete Inputs
So far, we have assumed that all predictors are numeric. What happens with categorical predictors?
```{r}
iris_trn_mod = iris_trn
iris_trn_mod$Sepal.Width = ifelse(iris_trn$Sepal.Width > 3,
ifelse(iris_trn$Sepal.Width > 4,
"Large", "Medium"),
"Small")
unique(iris_trn_mod$Sepal.Width)
```
Here we make a new dataset where `Sepal.Width` is categorical, with levels `Small`, `Medium`, and `Large`. We then try to train classifiers using only the sepal variables.
```{r}
naiveBayes(Species ~ Sepal.Length + Sepal.Width, data = iris_trn_mod)
```
Naive Bayes makes a somewhat obvious and intelligent choice to model the categorical variable as a multinomial. It then estimates the probability parameters of a multinomial distribution.
```{r}
lda(Species ~ Sepal.Length + Sepal.Width, data = iris_trn_mod)
```
LDA however creates dummy variables, here with `Large` is the reference level, then continues to model them as normally distributed. Not great, but better then not using a categorical variable.
## `rmarkdown`
The `rmarkdown` file for this chapter can be found [**here**](11-generative.Rmd). The file was created using `R` version `r paste0(version$major, "." ,version$minor)`. The following packages (and their dependencies) were loaded when knitting this file:
```{r, echo = FALSE}
names(sessionInfo()$otherPkgs)
```