-
Notifications
You must be signed in to change notification settings - Fork 0
/
FinalProject_Leano_Marianne.Rmd
366 lines (306 loc) · 18.5 KB
/
FinalProject_Leano_Marianne.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
---
title: "FinalProject_Leano_Marianne"
author: "Marianne Leano"
date: "8/19/2020"
output: word_document
---
# Q1 Data Preparation
### 1a. Load the dataset insurance.csv into memory.
``` {r}
#set working directory
setwd("/Users/marianneleano/Library/Mobile Documents/com~apple~CloudDocs/Desktop/dsci 512/")
insurance <- read.csv(file = '512_finalproject/insurance.csv')
colnames(insurance)
```
### 1b. In the data frame, transform the variables **charges** by setting **insurance$charges = log(insurance$changes)**. Do not transform it outside of the data frame.
``` {r}
insurance$charges = log(insurance$charges)
```
### 1c. Using the data set from 1b, use the model.matrix() function to create another data set that uses dummy variables in place of categorical variables. Verify that the first column only has ones (1) as values, and then discard the column only after verifying it has only ones as values.
``` {r}
dummy <- model.matrix(~. -charges, data=insurance)
```
``` {r}
#remove first column with 1 as values
dummy = dummy[, -1]
colnames(dummy)
```
### 1d. Use the **sample()** function with set.seed equal to 1 to generate row indexes for your training and tests sets, with 2/3 of the row indexes for your training set and 1/3 for your test set. **Do not use any method other than the sample() function for splitting your data**.
``` {r}
set.seed(1)
train_size <- floor(2*nrow(insurance)/3)
train_idx <- sample(seq_len(nrow(insurance)), size=train_size)
```
### 1e. Create a training and test data set from the data set created in 1b using the training and test row indexes created in 1d.
``` {r}
train_1e <- insurance[train_idx, ]
test_1e <- insurance[-train_idx, ]
```
### 1f. Create a training and test data set from data set created in 1c using the training and test row indexes created in 1d.
``` {r}
train_1f <- dummy[train_idx, ]
test_1f <- dummy[-train_idx, ]
```
# Q2 Build a multiple linear regression model
### 2a. Perform multiple linear regression with **charges** as the response and the predictors are **age, sex, bmi, children, smoker, and region**. Print out the results using the summary() function. **Use the training dataset created in step 1e to train your model**.
``` {r}
lm.fit = lm(charges ~ age + sex + bmi + children + smoker + region, data=insurance[train_idx, ])
summary(lm.fit)
```
### 2b. Is there a relationship between the predictors and the response?
To determine the relationship between **charges** and **age, sex, bmi, children, smoker and region**, we look at the p-value in our summary() output. If the p-value is greater than 5% their will be no indication of a relationship. If the p-value is less than 5%, than their is a relationship. As the p-value is *2.2e-16%*, which is less than 5%, which means that there is a relationship between the predictors and the response.
### 2c. Does sex have a statistically significant relationship to the response?
Sex does play a significant relationship to the response charges. We determine this by looking at the Pr(>|t|) values. Sex, specifically for male is *0.027847%* is less than 5% which means that sex has a significant relationship with response charges.
### 2d. Perform best subset selection using the stepAIC() function from the MASS library, choose best model based on AIC. For the "direction" parameter in the stepAIC() method, set direction = "backward".
The best model will include the following predictors: **age, sex, bmi, children, smoker, region**.
``` {r}
library(MASS)
full = lm(charges ~ age+sex+bmi+children+smoker+region, data = insurance[train_idx, ])
lm.bwd = stepAIC(full, direction = "backward")
lm.bwd
```
### 2e. Compute the test error of the best model in 2d based on AIC using LOOCV using trainControl() and train() from the caret library. Report the MSE by squaring the reported RMSE.
The MSE for the multiple linear regression model in 2d is **0.1833683**.
``` {r, warning=FALSE}
library(caret)
train_control = trainControl(method = "LOOCV")
model_2e <- train(charges ~ age+sex+bmi+children+smoker+region, data = insurance[train_idx, ], trControl = train_control, method = "lm")
print(model_2e)
```
``` {r}
MSE = (0.4282153)^2
MSE
```
### 2f. Calculate the test error of the best model in 2d based on AIC using 10-fold Cross-Validation. Use train and trainControl from the caret library. Refer to model selected in 2d based on AIC. Report the MSE.
The MSE using the 10-fold CV model is **0.1792372**.
``` {r}
library(caret)
train_control_2f <- trainControl(method = "CV", number = 10)
model_2f <- train(charges ~ age+sex+bmi+children+smoker+region, data = insurance[train_idx, ], trControl = train_control_2f, method = "lm")
print(model_2f)
```
``` {r}
MSE_2f = (0.4233641)^2
MSE_2f
```
### 2g. Calculate and report the test MSE using the best model from 2d and test data set created in step 1e.
The MSE using the best model from 2d and the test data set in 1e is **0.2347613**.
``` {r}
library(caret)
train_control = trainControl(method = "LOOCV")
model_2g <- train(charges ~ age+sex+bmi+children+smoker+region, data=insurance[-train_idx, ], trControl=train_control, method="lm")
print(model_2g)
```
``` {r}
MSE_2g = (0.4845217)^2
MSE_2g
```
### 2h. Compare the test MSE calculated in step 2.f using 10-fold cross-validation with the test MSE calculated in step 2.g. How similar are they?
MSE_2f in 2f using 10-fold CV is **0.1792372**.
MSE_2g in 2g using LOOCV is **0.2347613**.
The MSE for both models are different from each other. We use the test data set to compare the predictions to the actual response variable in the test data so we are able to evaluate the model's accuracy. The 10-fold CV performed better than the LOOCV because the MSE is closer to zero. This means the 10-fold CV has a good variance-bias tradeoff with low variance.
# Q3 Build a regression model tree.
### 3a. Build a regression tree model using function tree(), where **charges** is the response and the predictors are **age, sex, bmi, children, smoker, and region**.
Our final model will use the following predictors: **smoker, age, children and bmi**.
``` {r}
insurance$sex <- as.factor(insurance$sex)
insurance$smoker <- as.factor(insurance$smoker)
insurance$region <- as.factor(insurance$region)
```
``` {r, warning=FALSE}
library(tree)
tree.insurance = tree(charges ~ age+sex+bmi+children+smoker+region, data=insurance, subset=train_idx)
summary(tree.insurance)
```
### 3b. Find the optimal tree by using cross-validation and display the results in a graphic. Report the best size.
The size for the best optimal model is **6**.
``` {r, warning=FALSE}
cv.insurance = cv.tree(tree.insurance)
plot(cv.insurance$size, cv.insurance$dev, type='b')
```
### 3c. Justify the number you picked for the optimal tree with regard to the principle of variance-bias trade-off.
We would like to choose the size that has the least amount of error rate. The size for the best optimal model is 6. We would like to go with the size with the lowest error rate because we want to first test how well our model is able to get trained by some data and then predict the data it hasn't seen. If our test error is high after pruning, that just means we can continue with a subtree to find a simpler model and lower test error. We want to find a balance between variance and bias, with respect to how simple our tree can be. It is important to do cross validation so we can select the optimal value that controls the trade-off between a subtree's complexity and its fit to the training data.
### 3d. Prune the tree using the optimal size found in 3b.
``` {r}
prune.insurance = prune.tree(tree.insurance, best=6)
```
### 3e. Plot the best tree model and give labels.
``` {r}
plot(prune.insurance)
text(prune.insurance, pretty=0)
```
### 3f. Calculate the test MSE for the best model.
The MSE using the regression tree model is **0.2019634**.
Based on the two ways I computed test error after pruning, the test error before using the test data set is **0.2019634**. Before pruning the test error is 0.1890688.
``` {r, warning=FALSE}
#test error before pruning
yhat0 = predict(tree.insurance, insurance[-train_idx,])
insurance.test0 = insurance[-train_idx, "charges"]
mean((yhat0 - insurance.test0)^2)
```
``` {r, warning=FALSE}
#test error after pruning
yhat = predict(prune.insurance, newdata=insurance[-train_idx,])
insurance.test = insurance[-train_idx, "charges"]
mean((yhat - insurance.test)^2)
```
# Q4 Build a random forest model.
### 4a. Build a random forest model using function randomForest(), where **charges** is the response and the predictors are **age, sex, bmi, children, smoker, and region**.
``` {r, include=TRUE, echo=FALSE, warning=FALSE}
require(randomForest)
require(MASS)
set.seed(101)
```
Based on the randomForest() function, we can see that the MSR and % variance are based on OOB estimates, a device to get honest error estimates.
``` {r}
rf.insurance = randomForest(charges ~ age+sex+bmi+children+smoker+region, data=insurance, subset=train_idx, importance=TRUE, na.action = na.exclude)
rf.insurance
```
### 4b. Compute the test error using the test data set.
The MSE using the random forest model is **0.1779684**.
``` {r}
yhat.rf = predict(rf.insurance, newdata = insurance[-train_idx,])
insurance.test = insurance[-train_idx, "charges"]
mean((yhat.rf - insurance.test)^2)
```
### 4c. Extract variable importance measure using the importance() function.
``` {r}
importance(rf.insurance)
```
### 4d. Plot the variable importance using the function, varImpPlot(). Which are the top 3 important predictors in this model?
The top 2 important predictors in this model is **smoker** and **age**, and the third important predictor is between **children** and **bmi** depending on the graphs. This plot ranks the usefulness of the variables. This means that smoker and age are the variables that will give the prediction and contribute most to the model. **Children** may be an important variable because if a policy holder has more children, the charges would be higher. On the other hand, **bmi** may be an important variable because a policy holder's health background may affect the charges.
``` {r}
varImpPlot(rf.insurance)
```
# Q5 Build a support vector machine model.
### 5a. The response is charges and the predictors are age, sex, bmi, children, smoker, and region. Please use the svm() function with radial kernel and gamma=5 and cost = 50.
``` {r}
library(e1071)
svm.fit = svm(charges ~ age+sex+bmi+children+smoker+region, data=insurance[train_idx, ], kernel="radial", gamma=5, cost=50)
summary(svm.fit)
```
### 5b. Perform a grid search to find the best model with potential cost: 1, 10, 50, 100 and potential gamma: 1,3 and 5 and potential kernel: "linear","radial" and "sigmoid". And use the training set created in step 1e.
``` {r}
tune.out = tune(svm, charges ~ age+sex+bmi+children+smoker+region, data=insurance[train_idx, ], ranges=list(cost=c(1,10,50,100), gamma=c(1,3,5), kernel=c("linear", "radial", "sigmoid")))
```
### 5c. Print out the model results. What are the best model parameters?
The best model parameters are **kernel=radial, cost=1, and gamma=1**.
``` {r}
summary(tune.out)
```
### 5d. Forecast charges using the test dataset and the best model found in 5c.
``` {r}
pred = predict(tune.out$best.model, newdata=insurance[-train_idx, ])
```
### 5e. Compute the MSE (Mean Squared Error) on the test data.
The MSE for the SVM model is **0.2259622**.
``` {r}
trueobservation = insurance[-train_idx, "charges"]
```
``` {r}
MSE_5e = mean((trueobservation - pred)^2)
MSE_5e
```
# Q6 Perform the k-means cluster analysis.
### 6a. Remove the sex, smoker, and region, since they are not numerical values.
``` {r}
insurance = insurance[, c(-2,-5,-6)]
colnames(insurance)
```
### 6b. Determine the optimal number of clusters. Justify your answer. It may take longer running time since it uses a large dataset.
The optimal number of clusters is 2. This means that when k=2, this is considered a good cluster for which the within-cluster variation is small as possible. In addition, 3 would also be an optimal number of clusters. But it looks like 2 performs a little better than 3.
``` {r}
library("cluster")
library("factoextra")
set.seed(101)
```
``` {r}
fviz_nbclust(insurance, kmeans, method = "gap_stat")
```
### 6c. Perform k-means clustering using the 3 clusters.
``` {r}
km.res <- kmeans(insurance, 3, nstart = 25)
km.res
```
### 6d. Visualize the clusters in different colors.
Using fviz_cluster() can be good to use to visualize our results and assess the choice of the number of clusters as well as comparing two different cluster analyses. fviz_cluster() takes the k-means results and the original data as arguments. In the resulting plot below the observations are represented by points using principal components to find a low-dimensional representation of the observations that explains a good fraction of the variance.
For the insurance data, we clustered into 3 sub groups that visualizes all 4 columns of data observations. The sum of squares summary looks at the the deviations of each point around its cluster mean. For our percentage from between sum of squares divided by total sum of squares it is 74.7 %. This is a percent variance that explains the cluster means. It is pretty high which means that the k we chose, which is meant to be the most optimal is doing a good job.
``` {r}
fviz_cluster(km.res, data = insurance)
```
# Q7 Build a neural networks model.
### 7a. Remove the sex, smoker, and region, since they are not numerical values.
Occurred in question 6b.
### 7b. Standardize the inputs using the scale() function.
``` {r}
scaled.insurance = scale(insurance[,-5])
scaled.insurance
```
### 7c. Convert the standardized inputs to a data frame using the as.data.frame() function.
``` {r}
scaled.insurance = as.data.frame(scaled.insurance)
```
### 7d. Split the dataset into a training set containing 80% of the original data and the test set containing the remaining 20%.
``` {r}
set.seed(101)
index <- sample(1:nrow(scaled.insurance), 0.80*nrow(scaled.insurance))
train <- scaled.insurance[index, ]
test <- scaled.insurance[-index, ]
```
### 7e. The response is charges and the predictors are age, bmi, and children. Please use 1 hidden layer with 1 neuron.
``` {r}
library(neuralnet)
nn.model <- neuralnet(charges ~ age+bmi+children, data=train, hidden = 1)
```
### 7f. Plot the neural networks.
``` {r}
plot(nn.model)
```
### 7g. Forecast the charges in the test dataset.
``` {r}
predict.nn = compute(nn.model, test[, c("age", "bmi", "children")])
```
### 7h. Get the observed charges of the test dataset.
``` {r}
observ.test = test$charges
```
### 7i. Compute test error (MSE).
The MSE using the neural network model is **0.6754152**.
``` {r}
mean((observ.test - predict.nn$net.result)^2)
```
# Q8 Putting it all together.
### 8a. For predicting insurance charges, your supervisor asks you to choose the best model among the multiple regression, regression tree, random forest, support vector machine, and neural network models. Compare the test MSEs of the models generated in steps 2.g, 3.f, 4.b, 5.e, and 7.i. Display the names for these types of these models, using these labels: Multiple Linear Regression, Regression Tree, Random Forest, Support Vector Machine, and Neural Network and their corresponding test MSEs in a data.frame. Label the column in your data frame with the labels as Model.Type, and label the column with the test MSEs as Test.MSE and round the data in this column to 4 decimal places. Present the formatted data to your supervisor and recommend which model is best and why.
Model.Type | Test.MSE
--------------------------- | ---------
Multiple Linear Regression | 0.2348
Regression Tree | 0.2020
Random Forest | 0.1780
Support Vector Machine | 0.2260
Neural Network | 0.6754
For the insurance data set, the best model to use is the **Random Forest Model** with a MSE of **0.1780**. Our goal is to find the model with the smallest MSE. The smallest MSE will represent that the predicted responses are very close to the true responses. We calculate based on MSE because it takes into consideration of the bias-variance trade off. The random forest model is the best because while achieving a MSE closer to zero, it simultaneously has low variance and low bias which are both non-negative.
### 8b. Another supervisor from the sales department has requested your help to create a predictive model that his sales representatives can use to explain to clients what the potential costs could be for different kinds of customers, and they need an easy and visual way of explaining it. What model would you recommend, and what are the benefits and disadvantages of your recommended model compared to other models?
The model I would recommend is using the **Random Forest Model**.
The *benefits* of using the random forest model are:
1. Random Forest can be used to solve both classification and regression problems.
2. Random Forest works well with both categorical and continuous variables.
3. Random Forest creates many trees on the subset of the data & combines the output of all the trees, which means it reduces the overfitting problem which reduces the variance and improves accuracy.
4. The model output using the varImpPlot() function is a great visual when talking to potential clients and how each variable may affect their charges.
The *disadvantages* of using the random forest model are:
1. Random Forest creates a lot of trees which means it will use a lot of computational power and resources. The complexity of the model is not great.
2. Random Forest require more time to train which means a longer running period.
### 8c. The supervisor from the sales department likes your regression tree model. But she says that the sales people say the numbers in it are way too low and suggests that maybe the numbers on the leaf nodes predicting charges are log transformations of the actual charges. You realize that in step 1.b of this project that you had indeed transformed charges using the log function. And now you realize that you need to reverse the transformation in your final output. The solution you have is to reverse the log transformation of the variables in the regression tree model you created and redisplay the result.
### Follow these steps:
### a. Copy your pruned tree model to a new variable.
``` {r}
copy_of_my_pruned_tree = prune.tree(tree.insurance, best=6)
```
### b. In your new variable, find the data.frame named "frame" and reverse the log transformation on the data.frame column yval using the exp() function.
``` {r}
copy_of_my_pruned_tree$frame$yval <- exp(copy_of_my_pruned_tree$frame$yval)
```
### c. After you reverse the log transform on the yval column, then replot the tree with labels.
``` {r}
plot(copy_of_my_pruned_tree)
text(copy_of_my_pruned_tree, pretty=0)
```