-
Notifications
You must be signed in to change notification settings - Fork 0
/
HW2 - Vaddy, Venkat Srinidhi.Rmd
887 lines (634 loc) · 36.1 KB
/
HW2 - Vaddy, Venkat Srinidhi.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
---
title: "HW - 2"
author: "Srinidhi"
date: "2/2/2022"
output: html_document
---
## Homework #2
### By: Venkat Srinidhi Vaddy, 2173456
```{r}
#load the libraries
library(GGally)
library(MASS)
library(ggplot2)
library(dplyr)
library(class)
library(dummies)
library(tidyverse)
```
#### 1. Data exploration + Logistic Regression
```{r}
#load the data
df = read.csv("wdbc.data")
#displaying the top 5 rows of dataframe
head(df)
```
##### (a) Describe the data: sample size n, number of predictors p, and number of observations in each class.
```{r}
# number of rows in the dataframe
nrow(df)
# number of predictors except target
length(df) - 1
# number of observations in each class
table(df$M)
```
The number of rows or entries in dataframe is 568 with 31 predictors or variables. Including the target/response variable, the total length of the dataframe is 32.
There are 357 entries in Benign class and 211 entries in Malignant class.
##### (b) Divide the data into a training set of 400 observations, and a test set; from now on, unless specified, work only on the training set.
```{r}
#setting aside 400 samples
#setting seed to fix the dataset
set.seed(0)
#sampling 400 ids
train_id = sample(nrow(df), 400)
#selecting only X17.99 and X10.38 columns in the training and testing datasets for the selected rows
train_df = df[train_id, 2:4]
test_df = df[-train_id, 2:4]
head(train_df)
```
##### (c) Make a scatterplot displaying Y (color or shape encoded) and the predictors X1, X2 (on the x- and y-axis). Based on this scatterplot, do you think it will be possible to accurately predict the outcome from the predictors? Motivate your answer.
```{r}
#plotting scatter plot between the target variable and the predictors
# displaying target variable with color encoding
ggplot(train_df, aes(x = X17.99, y = X10.38, color = M)) +
geom_point() +
labs(x = 'Average cell nuclei radius', y = 'Average cell nuclei texture', title = 'Average radius vs Average texture', color = 'Outcome')
```
It is possible to accurately predict the classes from the predictors as there is clear color coded distinction between the two classes - B and M. We can observe a decision boundary between the observations of the two classes. The benign class has observations with lower average cell nuclei radius and lower average cell texture. The malignant class has observations with higher average cell nuclei radius and higher average cell texture.
##### (d) Fit a logistic regression model to predict Y and make a table, like Table 4.3 in the textbook, displaying the coefficient estimates, standard errors, and p-values (use command summary). Give an interpretation of the values of the coefficient estimates.
```{r}
train_df$M <- as.factor(train_df$M)
#fitting the logistic regression model to the data with X17.99 and X10.38 as the two predictors
glm.model = glm(formula = M~., family = binomial(link = "logit"), data = train_df)
#will print the categorical value assigned to each class
contrasts(train_df$M)
#to print the coefficients of the model
coef(glm.model)
#summary of the model - coefficients, t value, p value
summary(glm.model)
```
The predictor X17.99 has a coefficient of 0.9678 and the p value for this coefficient is < 2e-16 which is very small and less than 0.05. It is statiscally significant and can reject the null hypothesis that there is no correlation between the predictor X17.99 and the outcome.
A unit increase in X17.99 leads to increase in the log odds of the response by 0.9678 that is, the log odds of the tumor being Malignant increases by 0.9678 when other predictor is kept constant.
The predictor X10.38 has a coefficient of 0.2486 and the p value for this coefficient is < 2e-16 which is very small and less than 0.05. It is statistically significant and we can reject the null hypothesis that there is no correlation between the predictor X10.38 and the outcome.
A unit increase in the predictor X10.38 leads to the increase in the log odds of the tumor being malignant by 0.2486 while keeping X17.99 as constant.
#### (e) Use the coefficient estimates to manually calculate the predicted probability P(Y = M | (X1,X2) = (10,12)) writing explicitly every step. Compare your result with the prediction computed with predict.
```{r}
#Logistic regression assumes a parametric model
# P(Y = 1 | X) = exp(β0 +β1X1 +...+βpXp) / 1+exp(β0 +β1X1 +...+βpXp)
X1 = 10
X2 = 12
predicted_probability = exp(-19.1519513 + (0.9678167*X1) + (0.2486162*X2)) / (1 + exp(-19.1519513 + (0.9678167*X1) + (0.2486162*X2)))
predicted_probability
```
The manually calculated predicted probability for X1 = 10 and X2 = 12 is 0.001515656
```{r}
# Test set performances
newdata <- data.frame(X17.99 = c(10), X10.38 = c(12))
glm.prob.test = predict(glm.model,type = "response", newdata = newdata)
glm.prob.test
```
The predicted probability using the logistic regression model glm for X1 = 10 and X2 = 12 is 0.001515656.
Both the predicted probabilities using manual calculation and the logistic regression model are the same.
#### (f) Use the glm previously fitted and the Bayes rule to compute the pre- dicted outcome Yˆ from the associated probability estimates (computed with predict) both on the training and the test set. Then compute the confusion table and prediction accuracy (rate of correctly classified observations) both on the training and test set. Comment on the results.
```{r}
glm.prob.train = predict(glm.model, type = "response")
glm.label.train = rep("B", nrow(train_df))
glm.label.train[glm.prob.train > .5] = "M"
# Confusion matrix
print("Training data - confusion matrix")
tt.glm.train = table(True = train_df$M, Predicted = glm.label.train)
tt.glm.train
# Model Accuracy
print("Training data - Model accuracy")
mean(glm.label.train == train_df$M)
# Test set performances
glm.prob.test = predict(glm.model,type = "response", newdata = test_df)
glm.label.test = rep("B", nrow(test_df))
glm.label.test[glm.prob.test > .5] = "M"
print("Testing data - confusion matrix")
tt.glm.test = table(True = test_df$M, Predicted =glm.label.test )
tt.glm.test
print("Training data - Model accuracy")
mean(glm.label.test == test_df$M)
```
1. The model has an accuracy of 89.25% on the training set. The model has performed well on the training data, on which it was trained.
2. For the training data confusion matrix, we see that 229 benign and 128 tumors are predicted accurately as True Psoitive and True Negatives.
3. The model has an accuracy of 89.286% on the testing set. The model has performed a little better than on training set.
4. From the confusion matrix for testing data, we see that 103 benign tumors and 47 malignant tumors are predicted accurately as True Positive and True Negatives.
We don't want malignant tumors to be predicted as benign tumors as that will put patients as risk and will change their treatment. The confusion matrix shows that 28 and 8 malignant tumors are predicted as benign for the training and testing datasets.
The linear model has performed generally well on both the datasets.
#### (g) Plot an image of the decision boundary (like the one in Figure 2.13 in the textbook, but without the purple dashed line) as follows:
#### Generate a dense set (e.g. 10000 observations) of possible values for the predictors (X1,X2) within reasonable ranges; (the command expand.grid might come in handy)
#### Use the glm model previously fitted to predict the outcome probabil- ities for every observation you have generated and use Bayes rule to compute the predicted outcomes;
#### Plot predicted outcomes and associated predictors in a scatter plot together with the training set.
#### Generate the same plot for probability cutoff values of 0.25 and 0.75. Comment on the results.
```{r}
#storing training dataset with predicted probabilities in new dataframe
ML_fits = mutate(train_df , pred = predict(glm.model, data=train_df, type = "response"))
nbp <- 1000;
#creating sequence of both the columns within minimum and maximum ranges
PredA <- seq(min(train_df$X17.99), max(train_df$X17.99), length = nbp)
PredB <- seq(min(train_df$X10.38), max(train_df$X10.38), length = nbp)
#using expand.grid function to create 10000 rows of the two columns
Grid <- expand.grid(X17.99 = PredA, X10.38 = PredB)
dense_id <- sample(nrow(Grid), 10000)
Grid <- Grid[dense_id,]
#predicting probabilities using logistic regression on the dense set
v <- predict(glm.model,Grid, type = "response")
#splitting into target classes B and M using 0.5 as threshold probability
glm.label.dense = rep("B", nrow(Grid))
glm.label.dense[v > .5] = "M"
lm.label.dense = rep("B", nrow(ML_fits))
lm.label.dense[ML_fits$pred > .5] = "M"
#plotting the predicted values of denseset with training dataset
ggplot(data = ML_fits, aes(x = X17.99, y = X10.38, color = lm.label.dense, shape = lm.label.dense), cex = 0.6 ) +
geom_point(data = Grid, aes(X17.99, X10.38, color = glm.label.dense) ,pch = ".") + geom_point(cex = 1.5) + ggtitle("Decision region") + theme(legend.text = element_text(size = 10))
```
#### 0.25
```{r}
#storing training dataset with predicted probabilities in new dataframe
ML_fits = mutate(train_df , pred = predict(glm.model, data=train_df))
nbp <- 1000;
#creating sequence of both the columns within minimum and maximum ranges
PredA <- seq(min(train_df$X17.99), max(train_df$X17.99), length = nbp)
PredB <- seq(min(train_df$X10.38), max(train_df$X10.38), length = nbp)
#using expand.grid function to create 10000 rows of the two columns
Grid <- expand.grid(X17.99 = PredA, X10.38 = PredB)
dense_id <- sample(nrow(Grid), 10000)
Grid <- Grid[dense_id,]
v <- predict(glm.model,Grid)
#splitting into target classes B and M using 0.5 as threshold probability
glm.label.dense = rep("B", nrow(Grid))
glm.label.dense[v > .25] = "M"
lm.label.dense = rep("B", nrow(ML_fits))
lm.label.dense[ML_fits$pred > .25] = "M"
#plotting the predicted values of denseset with training dataset
ggplot(data = ML_fits, aes(x = X17.99, y = X10.38, color = lm.label.dense, shape = lm.label.dense), cex = 0.2 ) +
geom_point(data = Grid, aes(X17.99, X10.38, color = glm.label.dense) ,pch = ".") + geom_point(cex = 1.5) + ggtitle("Decision region") + theme(legend.text = element_text(size = 10))
```
###### 0.75
```{r}
#storing training dataset with predicted probabilities in new dataframe
ML_fits = mutate(train_df , pred = predict(glm.model, data=train_df))
nbp <- 1000;
#creating sequence of both the columns within minimum and maximum ranges
PredA <- seq(min(train_df$X17.99), max(train_df$X17.99), length = nbp)
PredB <- seq(min(train_df$X10.38), max(train_df$X10.38), length = nbp)
#using expand.grid function to create 10000 rows of the two columns
Grid <- expand.grid(X17.99 = PredA, X10.38 = PredB)
dense_id <- sample(nrow(Grid), 10000)
Grid <- Grid[dense_id,]
v <- predict(glm.model,Grid)
#splitting into target classes B and M using 0.5 as threshold probability
glm.label.dense = rep("B", nrow(Grid))
glm.label.dense[v > .75] = "M"
lm.label.dense = rep("B", nrow(ML_fits))
lm.label.dense[ML_fits$pred > .75] = "M"
#plotting the predicted values of denseset with training dataset
ggplot(data = ML_fits, aes(x = X17.99, y = X10.38, color = lm.label.dense, shape = lm.label.dense), cex = 0.2 ) +
geom_point(data = Grid, aes(X17.99, X10.38, color = glm.label.dense) ,pch = ".") + geom_point(cex = 1.5) + ggtitle("Decision region") + theme(legend.text = element_text(size = 10))
```
The overlay of the training dataset on the dense set creates a linear decision boundary to predict for any observation to be either benign or malignant.
We keep increasing the threshold probability from 0.25 to 0.5 to then 0.75 which shifts the decision boundary towards the right by small margins. The model has the highest accuracy when threshold is taken as 0.5. For all the observations, if the probability is greater than 0.5, it is predicted as malignant tumor.
#### (h) Plot the ROC curve, computed on the test set, for a dense grid of possible cutoffs (e.g. 20 intervals).
# ROC curve (test set)
# FPR = FP/N; TPR = TP/P
```{r}
n_segm = 20
TPR = replicate(n_segm, 0)
FPR = replicate(n_segm, 0)
p_th = seq(0,1,length.out = n_segm)
for (i in 1:n_segm)
{
glm.label.test = rep("B", nrow(test_df))
glm.label.test[glm.prob.test > p_th[i]] = "M"
tt.glm.test = table(True = test_df$M, Predicted = glm.label.test)
TPR[i] = mean(glm.label.test[test_df$M == 'M'] == test_df$M[test_df$M == 'M'])
FPR[i] = mean(glm.label.test[test_df$M == 'B'] != test_df$M[test_df$M == 'B'])
}
# plot(x = FPR, y = TPR, 'l')
ggplot() + geom_path(aes(x = FPR, y = TPR))
```
### 2. Linear discriminant analysis model
#### (a) Now fit a linear discriminant analysis model to the training set you created in Exercise 1. Make a table displaying the estimated ‘Prior probabilities of groups’ and ‘Group means’. Describe in words the meaning of these estimates and how they are related to the posterior probabilities.
```{r}
#creating lda model and fitting to the training data after centering and scaling
lda.model = lda(M~., data = train_df, centre = TRUE, scale = TRUE)
lda.model
```
Prior Probabilities:
61% of the total tumors in training set are benign and 39% of total tumors in the training set are Malignant.
Group means:
The mean of X17.99 variable for Benign tumors is 12.22459 and the mean of X17.99 variable for Malignant tumors is 17.65397
The mean of X10.38 variable for Benign tumors is 17.62635 and the mean of X10.38 variable for Malignant tumors is 21.62968
These values are deployed to compute posterior probability where QDA assumes that the variance of both classes B and M is the same.
#### (b) Use the fitted model and Bayes rule to compute the predicted outcome Yˆ from the predicted posterior probabilities, both on the training and test set. Then, compute the confusion table and prediction accuracy both on the training and test set. Comment on the results.
```{r}
#predicting using lda model on training and testing dataset
lda.pred.train = predict(lda.model, train_df)
lda.pred.test = predict(lda.model, test_df)
names(lda.pred.train)
#Training dataset - confusion matrix
print("Training data - confusion matrix")
tt.lda.train = table(True = train_df$M, Predicted = lda.pred.train$class)
tt.lda.train
print("Training data - accuracy")
mean(lda.pred.train$class == train_df$M)
#testing dataset
print("Testing data - confusion matrix")
tt.lda.test = table(True = test_df$M, Predicted = lda.pred.test$class)
tt.lda.test
print("Testing data - accuracy")
mean(lda.pred.test$class == test_df$M)
#using posterior probability for determining class
lda.pred.test_cutoff = rep("B", nrow(test_df))
lda.pred.test_cutoff[lda.pred.test$posterior[,2] > .5] = "M"
#printing confusion matrix and accuracy again after using posterior probability
tt.lda.test_cutoff = table(True = test_df$M, Predicted = lda.pred.test$class)
tt.lda.test_cutoff
print("Testing data - confusion matrix")
mean(lda.pred.test_cutoff == test_df$M)
```
1. The model has an accuracy of 87.5% on the training set. The model has performed well on the training data, on which it was trained.
2. For the training data confusion matrix, we see that 107 benign and 41 tumors are predicted accurately as True Positive and True Negatives.
3. The model has an accuracy of 88.09% on the testing set. The model has performed a little better than on training set.
4. From the confusion matrix for testing data, we see that 107 benign tumors and 41 malignant tumors are predicted accurately as True Positive and True Negatives.
We don't want malignant tumors to be predicted as benign tumors as that will put patients as risk and will change their treatment. The confusion matrix shows that 40 and 14 malignant tumors are predicted as benign for the training and testing datasets.
The LDA model has performed generally well on both the datasets
#### (c) Plot an image of the decision boundary (follow the instructions in 1(g)). Generate the same plot for cutoff values of 0.25 and 0.75. Comment on the results.
#### 0.5
```{r}
nbp <- 1000;
#creating sequence of predictors
PredA <- seq(min(train_df$X17.99), max(train_df$X17.99), length = nbp)
PredB <- seq(min(train_df$X10.38), max(train_df$X10.38), length = nbp)
#creating grid of 10000 rows using expand.grid
Grid <- expand.grid(X17.99 = PredA, X10.38 = PredB)
dense_id <- sample(nrow(Grid), 10000)
Grid <- Grid[dense_id,]
#predict using lda model
v <- predict(lda.model,Grid, type = "response")
#categorizing using threshold probability of 0.5
lda.label.dense = rep("B", nrow(Grid))
lda.label.dense[v$posterior[,2] > 0.5] = "M"
lm.label.dense = rep("B", nrow(train_df))
lm.label.dense[lda.pred.train$posterior[,2] > .5] = "M"
#plotting overlaying of training and dense data with color coding of the classes
ggplot(data = train_df, aes(x = X17.99, y = X10.38, color = lm.label.dense, shape = lm.label.dense), cex = 0.2 ) +
geom_point(data = Grid, aes(X17.99, X10.38, color = lda.label.dense) ,pch = ".") + geom_point(cex = 1.5) + ggtitle("Decision region") + theme(legend.text = element_text(size = 10))
```
# 0.25
```{r}
nbp <- 1000;
#creating sequence of predictors
PredA <- seq(min(train_df$X17.99), max(train_df$X17.99), length = nbp)
PredB <- seq(min(train_df$X10.38), max(train_df$X10.38), length = nbp)
#creating grid of 10000 rows using expand.grid
Grid <- expand.grid(X17.99 = PredA, X10.38 = PredB)
dense_id <- sample(nrow(Grid), 10000)
Grid <- Grid[dense_id,]
#predict using lda model
v <- predict(lda.model,Grid, type = "response")
#categorizing using threshold probability of 0.25
lda.label.dense = rep("B", nrow(Grid))
lda.label.dense[v$posterior[,2] > 0.25] = "M"
lm.label.dense = rep("B", nrow(train_df))
lm.label.dense[lda.pred.train$posterior[,2] > .25] = "M"
#plotting overlaying of training and dense data with color coding of the classes
ggplot(data = train_df, aes(x = X17.99, y = X10.38, color = lm.label.dense, shape = lm.label.dense), cex = 0.2 ) +
geom_point(data = Grid, aes(X17.99, X10.38, color = lda.label.dense) ,pch = ".") + geom_point(cex = 1.5) + ggtitle("Decision region") + theme(legend.text = element_text(size = 10))
```
# 0.75
```{r}
nbp <- 1000;
#creating sequence of predictors
PredA <- seq(min(train_df$X17.99), max(train_df$X17.99), length = nbp)
PredB <- seq(min(train_df$X10.38), max(train_df$X10.38), length = nbp)
#creating grid of 10000 rows using expand.grid
Grid <- expand.grid(X17.99 = PredA, X10.38 = PredB)
dense_id <- sample(nrow(Grid), 10000)
Grid <- Grid[dense_id,]
#predict using lda model
v <- predict(lda.model,Grid, type = "response")
#categorizing using threshold probability of 0.75
lda.label.dense = rep("B", nrow(Grid))
lda.label.dense[v$posterior[,2] > 0.75] = "M"
lm.label.dense = rep("B", nrow(train_df))
lm.label.dense[lda.pred.train$posterior[,2] > .75] = "M"
#plotting overlaying of training and dense data with color coding of the classes
ggplot(data = train_df, aes(x = X17.99, y = X10.38, color = lm.label.dense, shape = lm.label.dense), cex = 0.2 ) +
geom_point(data = Grid, aes(X17.99, X10.38, color = lda.label.dense) ,pch = ".") + geom_point(cex = 1.5) + ggtitle("Decision region") + theme(legend.text = element_text(size = 10))
```
The overlay of the training dataset on the dense set creates a linear decision boundary using LDA model to predict for any observation to be either benign or malignant.
We keep increasing the threshold probability from 0.25 to 0.5 to then 0.75 which shifts the decision boundary towards the right by small margins. The model has the highest accuracy when threshold is taken as 0.5. For all the observations, if the probability is greater than 0.5, it is predicted as malignant tumor.
#### (d) Plot the ROC curve, computed on the test set, for a dense grid of possible cutoffs (e.g. 20 intervals).
# ROC curve (test set)
# FPR = FP/N; TPR = TP/P
```{r}
n_segm = 20
TPR = replicate(n_segm, 0)
FPR = replicate(n_segm, 0)
p_th = seq(0,1,length.out = n_segm)
for (i in 1:n_segm)
{
lda.label.test = rep("B", nrow(test_df))
lda.label.test[lda.pred.test$posterior[,2] > p_th[i]] = "M"
tt.lda.test = table(True = test_df$M, Predicted = lda.label.test)
TPR[i] = mean(lda.label.test[test_df$M == 'M'] == test_df$M[test_df$M == 'M'])
FPR[i] = mean(lda.label.test[test_df$M == 'B'] != test_df$M[test_df$M == 'B'])
}
# plot(x = FPR, y = TPR, 'l')
ggplot() + geom_path(aes(x = FPR, y = TPR))
```
#### (e) Compute an estimate of the Area under the ROC Curve (AUC).
```{r}
simple_auc <- function(TPR, FPR){
TPR = sort(TPR)
FPR = sort(FPR)
# inputs already sorted, best scores first
dFPR <- c(diff(FPR), 0)
dTPR <- c(diff(TPR), 0)
sum(TPR * dFPR) + sum(dTPR * dFPR)/2
}
simple_auc(TPR, FPR)
```
### 3. Quadratic discriminant analysis model
#### (a) Now fit a quadratic discriminant analysis model to the training set you created in Exercise 1. Make a table displaying the estimated ‘Prior probabilities of groups’ and ‘Group means’. Describe in words the meaning of these estimates and how they are related to the posterior probabilities.
```{r}
qda.model = qda(M~., data = train_df, centre = TRUE)
qda.model
```
Prior Probabilities:
61% of the total tumors in training set are benign and 39% of total tumors in the training set are Malignant.
Group means:
The mean of X17.99 variable for Benign tumors is 12.22459 and the mean of X17.99 variable for Malignant tumors is 17.65397
The mean of X10.38 variable for Benign tumors is 17.62635 and the mean of X10.38 variable for Malignant tumors is 21.62968
These values are deployed to compute posterior probability where QDA assumes that the variance of both classes B and M is the same.
#### (b) Use the fitted model and Bayes rule to compute the predicted outcome Yˆ from the predicted posterior probabilities, both on the training and test set. Then, compute the confusion table and prediction accuracy both on the training and test set. Comment on the results.
```{r}
#predicting on training and testing dataset using qda model
qda.pred.train = predict(qda.model, train_df)
qda.pred.test = predict(qda.model, test_df)
names(qda.pred.train)
#confusion matrix
print("Training data - confusion matrix")
tt.qda.train = table(True = train_df$M, Predicted = qda.pred.train$class)
tt.qda.train
print("Training data - accuracy")
mean(qda.pred.train$class == train_df$M)
print("Testing data - confusion matrix")
tt.qda.test = table(True = test_df$M, Predicted = qda.pred.test$class)
tt.qda.test
print("Testing data - accuracy")
mean(qda.pred.test$class == test_df$M)
#using posterior probability for determining class
qda.pred.test_cutoff = rep("B", nrow(test_df))
qda.pred.test_cutoff[qda.pred.test$posterior[,2] > .5] = "M"
#showing accuracy again using posterior probability
mean(qda.pred.test_cutoff == test_df$M)
```
1. The model has an accuracy of 87.5% on the training set. The model has performed well on the training data, on which it was trained.
2. For the training data confusion matrix, we see that 232 benign and 118 tumors are predicted accurately as True Positive and True Negatives.
3. The model has an accuracy of 88.09% on the testing set. The model has performed a little better than on training set.
4. From the confusion matrix for testing data, we see that 105 benign tumors and 43 malignant tumors are predicted accurately as True Positive and True Negatives.
We don't want malignant tumors to be predicted as benign tumors as that will put patients as risk and will change their treatment. The confusion matrix shows that 38 and 12 malignant tumors are predicted as benign for the training and testing datasets.
The QDA model has performed generally well on both the datasets but QDA has performed better than LDA.
#### (c) Plot an image of the decision boundary (follow the instructions in 1(g)). Generate the same plot for cutoff values of 0.25 and 0.75. Comment on the results.
```{r}
nbp <- 1000;
#creating sequence of 1000 length of both predictors
PredA <- seq(min(train_df$X17.99), max(train_df$X17.99), length = nbp)
PredB <- seq(min(train_df$X10.38), max(train_df$X10.38), length = nbp)
#expanding grid to 10000
Grid <- expand.grid(X17.99 = PredA, X10.38 = PredB)
dense_id <- sample(nrow(Grid), 10000)
Grid <- Grid[dense_id,]
#predicting on dense data using qda model
v <- predict(qda.model,Grid, type = "response")
#catergorising using 0.5 as threshold probability
qda.label.dense = rep("B", nrow(Grid))
qda.label.dense[v$posterior[,2] > 0.5] = "M"
lm.label.dense = rep("B", nrow(train_df))
lm.label.dense[lda.pred.train$posterior[,2] > .5] = "M"
#plotting with overlaying of training data and dense data
ggplot(data = train_df, aes(x = X17.99, y = X10.38, color = lm.label.dense, shape = lm.label.dense), cex = 0.2 ) +
geom_point(data = Grid, aes(X17.99, X10.38, color = qda.label.dense) ,pch = ".") + geom_point(cex = 1.5) + ggtitle("Decision region") + theme(legend.text = element_text(size = 10))
```
# 0.25 - qda
```{r}
nbp <- 1000;
#creating sequence of 1000 length of both predictors
PredA <- seq(min(train_df$X17.99), max(train_df$X17.99), length = nbp)
PredB <- seq(min(train_df$X10.38), max(train_df$X10.38), length = nbp)
#expanding grid to 10000
Grid <- expand.grid(X17.99 = PredA, X10.38 = PredB)
dense_id <- sample(nrow(Grid), 10000)
Grid <- Grid[dense_id,]
#predicting on dense data using qda model
v <- predict(qda.model,Grid, type = "response")
#catergorising using 0.5 as threshold probability
qda.label.dense = rep("B", nrow(Grid))
qda.label.dense[v$posterior[,2] > 0.25] = "M"
lm.label.dense = rep("B", nrow(train_df))
lm.label.dense[lda.pred.train$posterior[,2] > .25] = "M"
#plotting with overlaying of training data and dense data
ggplot(data = train_df, aes(x = X17.99, y = X10.38, color = lm.label.dense, shape = lm.label.dense), cex = 0.2 ) +
geom_point(data = Grid, aes(X17.99, X10.38, color = qda.label.dense) ,pch = ".") + geom_point(cex = 1.5) + ggtitle("Decision region") + theme(legend.text = element_text(size = 10))
```
# 0.75 - qda
```{r}
nbp <- 1000;
#creating sequence of 1000 length of both predictors
PredA <- seq(min(train_df$X17.99), max(train_df$X17.99), length = nbp)
PredB <- seq(min(train_df$X10.38), max(train_df$X10.38), length = nbp)
#expanding grid to 10000
Grid <- expand.grid(X17.99 = PredA, X10.38 = PredB)
dense_id <- sample(nrow(Grid), 10000)
Grid <- Grid[dense_id,]
#predicting on dense data using qda model
v <- predict(qda.model,Grid, type = "response")
#catergorising using 0.5 as threshold probability
qda.label.dense = rep("B", nrow(Grid))
qda.label.dense[v$posterior[,2] > 0.75] = "M"
lm.label.dense = rep("B", nrow(train_df))
lm.label.dense[lda.pred.train$posterior[,2] > .75] = "M"
#plotting with overlaying of training data and dense data
ggplot(data = train_df, aes(x = X17.99, y = X10.38, color = lm.label.dense, shape = lm.label.dense), cex = 0.2 ) +
geom_point(data = Grid, aes(X17.99, X10.38, color = qda.label.dense) ,pch = ".") + geom_point(cex = 1.5) + ggtitle("Decision region") + theme(legend.text = element_text(size = 10))
```
The overlay of the training dataset on the dense set creates a linear decision boundary using QDA model to predict for any observation to be either benign or malignant.
We keep increasing the threshold probability from 0.25 to 0.5 to then 0.75 which shifts the decision boundary towards the right by small margins. The model has the highest accuracy when threshold is taken as 0.5. For all the observations, if the probability is greater than 0.5, it is predicted as malignant tumor.
#### (d) Plot the ROC curve for qda, computed on the test set, for a dense grid of possible cutoffs (e.g. 20 intervals).
# ROC curve (test set)
# FPR = FP/N; TPR = TP/P
```{r}
n_segm = 20
TPR_qda = replicate(n_segm, 0)
FPR_qda = replicate(n_segm, 0)
p_th = seq(0,1,length.out = n_segm)
for (i in 1:n_segm)
{
qda.label.test = rep("B", nrow(test_df))
qda.label.test[qda.pred.test$posterior[,2] > p_th[i]] = "M"
tt.qda.test = table(True = test_df$M, Predicted = qda.label.test)
TPR_qda[i] = mean(qda.label.test[test_df$M == 'M'] == test_df$M[test_df$M == 'M'])
FPR_qda[i] = mean(qda.label.test[test_df$M == 'B'] != test_df$M[test_df$M == 'B'])
}
# plot(x = FPR, y = TPR, 'l')
ggplot() + geom_path(aes(x = FPR_qda, y = TPR_qda))
```
#### (e) Compute an estimate of the Area under the ROC Curve (AUC) for qda
```{r}
simple_auc(TPR_qda, FPR_qda)
```
#### 4. Now we decide to use a kNN classifier.
##### (a) For all choices of k = {1, 2, 3, 4, 20} (number of neighbors), compute the predicted outcome Yˆ both on the training and test set. Then, compute the confusion table and prediction accuracy both on the training and test set. Comment on the results.
```{r}
test_df$M = as.factor(test_df$M)
#selects only the numeric values
train.num = train_df[,!sapply(train_df, is.factor)]
#scaling the training dataset
train.num = scale(train.num)
test.num = test_df[,!sapply(test_df, is.factor)]
test.num = scale(test.num)
train.M.dummy = model.matrix(~M - 1, data = train_df)
test.M.dummy = model.matrix(~M - 1, data = test_df)
```
```{r}
## "Fit" KNN (the output is a vector of predicted outcomes)
#creating a vector of k values
k = c(1,2,3,4,20)
for (i in k){
cat("Taking k = ", i, "\n")
knn.train.pred <- knn(train = train.num,
test = train.num,
cl = train_df$M, k = i)
knn.test.pred <- knn(train = train.num,
test = test.num,
cl = train_df$M, k = i)
tt.knn.train = table(True = train_df$M, Predicted = knn.train.pred)
print(tt.knn.train)
cat("Training Accuracy = ", mean(knn.train.pred == train_df$M), "\n")
tt.knn.test = table(True = test_df$M, False = knn.test.pred)
print(tt.knn.test)
cat("Testing Accuracy = ", mean(knn.test.pred == test_df$M), "\n")
}
```
Observations:
For k = 1, the training set accuracy is 1 and testing accuracy is 0.88. This is probably because the model overfits the data at k = 1 due to low bias and high variance.
For k =2, the the training set accuracy is 0.9225 and testing accuracy is 0.88. this model performs better than k = 1 as it doesn't overfit the training dataset as much.
For k = 3, the model works well on both datasets with accuracy as 0.92 and 0.91 for training and testing datasets. Moreober, the variance is lower and bias is higher at k=3 than at k=1.
For k = 20, the model accuracy is very different for training and testing datasets due to higher bias and lower variance.
k = 3 seems like the ideal value for achieving the best performance of the knn model.
##### (b) Plot an image of the decision boundary (follow the instructions in 1(h)), for k = {1, 2, 3, 4, 20} (number of neighbors). Comment on the results.
```{r}
nbp <- 1000;
PredA <- seq(min(train_df$X17.99), max(train_df$X17.99), length = nbp)
PredB <- seq(min(train_df$X10.38), max(train_df$X10.38), length = nbp)
Grid <- expand.grid(X17.99 = PredA, X10.38 = PredB)
dense_id <- sample(nrow(Grid), 10000)
Grid <- Grid[dense_id,]
Grid <- as.data.frame(scale(Grid))
train.num <- as.data.frame((train.num))
knn.prob.dense <- knn(train = train.num,
test = Grid,
cl = train_df$M, k = 1)
ggplot(data = train.num, aes(x = X17.99, y = X10.38, color = knn.train.pred, shape = knn.train.pred, cex = 0.2 )) +
geom_point(data = Grid, aes(X17.99, X10.38, color = knn.prob.dense) ,pch = ".") + geom_point(cex = 0.4) + xlim(-2, 2)+
ylim(-2, 2) + ggtitle("Decision region") + theme(legend.text = element_text(size = 10))
```
##### K = 2
```{r}
nbp <- 1000;
PredA <- seq(min(train_df$X17.99), max(train_df$X17.99), length = nbp)
PredB <- seq(min(train_df$X10.38), max(train_df$X10.38), length = nbp)
Grid <- expand.grid(X17.99 = PredA, X10.38 = PredB)
dense_id <- sample(nrow(Grid), 10000)
Grid <- Grid[dense_id,]
Grid <- as.data.frame(scale(Grid))
train.num <- as.data.frame((train.num))
knn.prob.dense <- knn(train = train.num,
test = Grid,
cl = train_df$M, k = 2)
ggplot(data = train.num, aes(x = X17.99, y = X10.38, color = knn.train.pred, shape = knn.train.pred, cex = 0.2 )) +
geom_point(data = Grid, aes(X17.99, X10.38, color = knn.prob.dense) ,pch = ".") + geom_point(cex = 0.4) + xlim(-2, 2)+
ylim(-2, 2) + ggtitle("Decision region") + theme(legend.text = element_text(size = 10))
```
##### K = 3
```{r}
nbp <- 1000;
PredA <- seq(min(train_df$X17.99), max(train_df$X17.99), length = nbp)
PredB <- seq(min(train_df$X10.38), max(train_df$X10.38), length = nbp)
Grid <- expand.grid(X17.99 = PredA, X10.38 = PredB)
dense_id <- sample(nrow(Grid), 10000)
Grid <- Grid[dense_id,]
Grid <- as.data.frame(scale(Grid))
train.num <- as.data.frame((train.num))
knn.prob.dense <- knn(train = train.num,
test = Grid,
cl = train_df$M, k = 3)
ggplot(data = train.num, aes(x = X17.99, y = X10.38, color = knn.train.pred, shape = knn.train.pred, cex = 0.2 )) +
geom_point(data = Grid, aes(X17.99, X10.38, color = knn.prob.dense) ,pch = ".") + geom_point(cex = 0.4) + xlim(-2, 2)+
ylim(-2, 2) + ggtitle("Decision region") + theme(legend.text = element_text(size = 10))
```
##### K = 4
```{r}
nbp <- 1000;
PredA <- seq(min(train_df$X17.99), max(train_df$X17.99), length = nbp)
PredB <- seq(min(train_df$X10.38), max(train_df$X10.38), length = nbp)
Grid <- expand.grid(X17.99 = PredA, X10.38 = PredB)
dense_id <- sample(nrow(Grid), 10000)
Grid <- Grid[dense_id,]
Grid <- as.data.frame(scale(Grid))
train.num <- as.data.frame((train.num))
knn.prob.dense <- knn(train = train.num,
test = Grid,
cl = train_df$M, k = 4)
ggplot(data = train.num, aes(x = X17.99, y = X10.38, color = knn.train.pred, shape = knn.train.pred, cex = 0.2 )) +
geom_point(data = Grid, aes(X17.99, X10.38, color = knn.prob.dense) ,pch = ".") + geom_point(cex = 0.4) + xlim(-2, 2)+
ylim(-2, 2) + ggtitle("Decision region") + theme(legend.text = element_text(size = 10))
```
##### K = 20
```{r}
nbp <- 1000;
PredA <- seq(min(train_df$X17.99), max(train_df$X17.99), length = nbp)
PredB <- seq(min(train_df$X10.38), max(train_df$X10.38), length = nbp)
Grid <- expand.grid(X17.99 = PredA, X10.38 = PredB)
dense_id <- sample(nrow(Grid), 10000)
Grid <- Grid[dense_id,]
Grid <- as.data.frame(scale(Grid))
train.num <- as.data.frame((train.num))
knn.prob.dense <- knn(train = train.num,
test = Grid,
cl = train_df$M, k = 20)
ggplot(data = train.num, aes(x = X17.99, y = X10.38, color = knn.train.pred, shape = knn.train.pred, cex = 0.2 )) +
geom_point(data = Grid, aes(X17.99, X10.38, color = knn.prob.dense) ,pch = ".") + geom_point(cex = 0.4) + xlim(-2, 2)+
ylim(-2, 2) + ggtitle("Decision region") + theme(legend.text = element_text(size = 10))
```
1. The model overfits at k = 1. The model has very high variance and low bvias as it classifies each observation into its class. The blue decision area is for malignant tumors.
2. As k value increases, the decision boundary generalises more. when k = 3, the model is more accurately able to predict the class for the observation without overfitting on the training dataset.
3. When k=20, the model has high bias and low variance. It almost seems like a binary decision model with no outliers.
k = 3 seems like the ideal value of k for getting the best performance of the knn model.
##### (c) Compute and plot the prediction accuracy (both on the training and test set), for k = {1,2,...,19,20} (number of neighbors). Which value of k would you choose? Comment on the results.
``` {r}
pred <- data.frame(matrix(ncol = 3, nrow = 0))
colnames(pred) <- c("k", "train", "test")
for (i in 1:20){
knn.train.pred <- knn(train = train.num,
test = train.num,
cl = train_df$M, k = i)
knn.test.pred <- knn(train = train.num,
test = test.num,
cl = train_df$M, k = i)
pred <- rbind(pred,
data.frame(k=i, train=mean(knn.train.pred ==train_df$M), test=mean(knn.test.pred == test_df$M)))
}
pred
```
```{r}
pred %>% ggplot(aes(x=k)) +
geom_line(aes(y=train), color = "blue" ) +
geom_line(aes(y=test), color = "green") + ggtitle("k value vs train test accuracies")
```
The training plot is depicted in blue and the testing plot is given in green.
The training data accuracy reduces initially due to overfitting at k = 1 and then starts increasing as k increases. The testing data accuracy starts increases at k =1 due to overfitting to training data but then fluctuates. k = 3 is the ideal k value as the training and testing accuracies are similar and based on bias- variance trade off, it has the best performance.