#' ---
#' title: "Homework 1"
#' author: "Allan Kimaina"
#' date: "February 5, 2018"
#' output:
#' pdf_document: default
#' html_document: default
#' ---
#'
#'
knitr::opts_chunk$set(echo = F)
library(dplyr)
wine = read.csv("data/Wine.csv")
colnames(wine)[1] <- "COUNTRY"
library(car)
# load package
library(sjPlot)
library(sjmisc)
library(sjlabelled)
library(ggpubr)
library(ggpmisc)
library(gridExtra)
library(stargazer)
library(e1071)
library(jtools)
library(effects)
library(multcompView)
library(ggplot2)
library(ggrepel)
#'
#'
#' # Question 1
#'
#' The data in the file wine.csv (in the data sets folder on Canvas) give the average wine consumption rates (in liters per person) and number of ischemic heart attack deaths (per 1000 men aged 55 to 64 years) for 18 industrialized countries.
#' Do these data suggest that heart disease death rates are associated with average wine consumption? If so, how can that be described?
#'
#' Do any countries have substantially higher or lower death rates than others with similar wine consumption rates?
#'
#' Analyze the data and write a brief report that includes a summary of findings, a graphical display and a section describing the methods used to answer the questions of interest.
#'
#'
#'
#' The goal of the study is to determine the significance of the relationship between wine consumption and ischemic heart attack mortality.
#'
#'
plot(wine$WINE, wine$MORTALITY,
#main= "Absolute Losses vs. Relative Losses(in%)",
xlab= "Wine Consumption",
ylab= "Heart Attack Mortality",
col= "blue", pch = 19, cex = 1, lty = "solid", lwd = 2)
#'
#'
#'
#'
#'
wine.model = lm(MORTALITY ~ WINE, data = wine)
#cor(wine$WINE, wine$MORTALITY) # calculate correlation between WINE and MORTALITY
#'
#'
#' The scatter plot suggests a strong negative association between wine consumption and ischemic heart attack deaths: countries with low wine consumption have high heart attack mortality rates compared to countries with high wine consumption. Computing the correlation coefficient gives -0.746, indicating that heart attack mortality is inversely related to wine consumption. However, the correlation alone could be misleading, since the plot hints at a nonlinear (possibly exponential) relationship between the predictor and the response, i.e., heart attack mortality increases sharply as wine consumption per liter decreases.
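#'
#' For reference, a one-line sketch of how that coefficient is obtained (mirroring the commented-out call above):
cor(wine$WINE, wine$MORTALITY) # Pearson correlation, approximately -0.746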
#'
#' Before settling on the model that best fits this association, let's first examine the characteristics of the predictor and response variables graphically.
#'
#' #### Data Examination & Exploration
#'
#'
par(mfrow=c(1, 2)) # divide graph area in 2 columns
plot(density(wine$WINE), main="Density Plot: WINE", ylab="Density", sub=paste("Skewness:", round(e1071::skewness(wine$WINE), 2))) # density plot for 'WINE'
polygon(density(wine$WINE), col="red")
plot(density(wine$MORTALITY), main="Density Plot: MORTALITY", ylab="Density", sub=paste("Skewness:", round(e1071::skewness(wine$MORTALITY), 2))) # density plot for 'MORTALITY'
polygon(density(wine$MORTALITY), col="red")
#'
#'
#' Looking at the density plots, the response variable is approximately normal without noticeable skew, whereas the predictor variable has a skewness of 1.83. This suggests there may be outliers in the predictor (WINE), which could drastically distort or bias our model. It therefore warrants further outlier detection and possibly outlier treatment.
#'
#' Let's see if any data point lies more than 1.5 times the interquartile range (IQR, the distance between the 25th and 75th percentiles) beyond those quartiles.
#'
#' ##### Outlier Detection
#'
#'
par(mfrow=c(1, 2)) # divide graph area in 2 columns
boxplot(wine$WINE, main="WINE", sub=paste("Outlier rows: ", boxplot.stats(wine$WINE)$out)) # box plot for 'WINE'
boxplot(wine$MORTALITY, main="MORTALITY", sub=paste("Outlier rows: ", boxplot.stats(wine$MORTALITY)$out)) # box plot for 'MORTALITY'
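#'
#' As a numeric cross-check on the box plots, a minimal base-R sketch of the 1.5 * IQR fences for wine consumption (the fence rule stated above):
wine_q <- quantile(wine$WINE, c(0.25, 0.75))          # 25th and 75th percentiles
wine_fence <- wine_q + c(-1.5, 1.5) * IQR(wine$WINE)  # lower and upper fences
wine$COUNTRY[wine$WINE < wine_fence[1] | wine$WINE > wine_fence[2]] # countries flagged as outliers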
#'
#'
#' The box plots suggest two outlying countries, France and Italy. Even though France and Italy are separated from the rest of the data points, we do not have sufficient information to conclude that these two points are outliers: other countries not in this sample might well fill the gap. In fact, an exponential relationship is already visible in the original scatter plot.
#'
#' ##### Exponential relationship
#' From the scatter and density plots above, the explanatory variable (wine consumption) is strongly positively skewed, with many data points near zero and few at the extreme right. This pattern suggests a multiplicative effect on the response (heart attack mortality), so a log transformation may bring the association closer to linear. Let's assess this possibility by plotting the log-transformed data:
#'
#'
wine.model.transformed <- lm( log(wine$MORTALITY)~log(wine$WINE))
plot(log(wine$WINE), log(wine$MORTALITY),
#main= "Absolute Losses vs. Relative Losses(in%)",
xlab= "log(Wine Consumption)",
ylab= "log(Heart Attack Mortality)",
main= "Log Transformation Plot",
col= "blue", pch = 19, cex = 1, lty = "solid", lwd = 2)
abline(wine.model.transformed, col="red")
#summary(wine.model.transformed)
#'
#'
#' Transforming both the predictor and the response yields an approximately linear relationship, so a linear model can be fitted easily on the log scale. Before fitting this model, let us assess other candidate models.
#'
#' Another possible explanation for the curvilinear relationship is that the response depends on the predictor through a second-degree polynomial (quadratic).
#'
wine.model.poly<-lm(wine$MORTALITY~ poly(wine$WINE, 2) )
ggplot(wine, aes(y=MORTALITY, x=WINE)) +
geom_point(alpha = .5) +
geom_point(color = "blue") +
stat_smooth(method = "lm", formula = y ~ poly(x,2), color="red")+
labs(x='Wine Consumption (Liters)', y='Heart Attack Mortality (per 1000)')+
labs(title = "Quadratic Plot")+
theme_classic(base_size = 18)+ guides(fill=FALSE)
#'
#'
#' There is little reason to believe that the relationship between wine consumption and heart attack mortality is quadratic; the R-squared values discussed in the next section bear this out.
#'
#' #### Possible Models
#'
#' From the previous section, we have been able to identify 3 possible models
#'
#' 1. Simple Linear model
#' 2. Log Transformed Linear Model (exponential)
#' 3. Quadratic model (highly unlikely)
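#'
#' A quick numerical comparison of the candidates' explanatory power (a sketch; the same values appear on the annotated plots below):
sapply(list(simple = wine.model, log_log = wine.model.transformed, quadratic = wine.model.poly),
       function(m) summary(m)$adj.r.squared)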
#'
#'
#'
simple_lm_plot <- ggplot(wine, aes(y=MORTALITY, x=WINE)) +
geom_point(alpha = .5) +
geom_point(color = "blue") +
stat_smooth(method = "lm", color="red")+
labs(x='Wine Consumption (Liters)', y='Heart Attack Mortality (per 1000)',
title = "Simple Linear Model: y~x")+
stat_poly_eq(aes(label = paste(..eq.label.., ..rr.label.., ..adj.rr.label.., sep = "~~~")),
label.x.npc = "right", label.y.npc = 0.87, geom = "text",
formula = y ~ poly(x,2), parse = TRUE, size = 2.5)+
stat_fit_glance(method = "lm",
method.args = list(formula = y ~ x),
geom = "text", size = 2.5,
label.y.npc = 0.85, label.x.npc = "right",
aes(label = paste("P-value = ", signif(..p.value.., digits = 4), sep = "")))+
theme_classic()+ guides(fill=FALSE)+
theme(
axis.title = element_text( size=8),
axis.title.x = element_text( size=8),
axis.title.y = element_text( size=8)
)
log_lm_plot<-ggplot(wine, aes(y=MORTALITY, x=WINE)) +
geom_point(alpha = .5) +
geom_point(color = "blue") +
stat_smooth(method = "lm", color="red")+
labs(x='log(Wine Consumption)', y='log(Heart Attack Mortality)',
title = "Log Transformed Linear Model: log(y)~log(x)")+
stat_poly_eq(aes(label = paste(..eq.label.., ..rr.label.., ..adj.rr.label.., sep = "~~~")),
label.x.npc = "right", label.y.npc = 0.87, geom = "text",
formula = y ~x, parse = TRUE, size = 3)+
stat_fit_glance(method = "lm",
method.args = list(formula = y~x),
geom = "text", size = 3,
label.y.npc = 0.85, label.x.npc = "right",
aes(label = paste("P-value = ", signif(..p.value.., digits = 4), sep = "")))+
scale_x_log10() +
scale_y_log10()+
theme_classic()+ guides(fill=FALSE)+
theme(
axis.title = element_text( size=8),
axis.title.x = element_text( size=8),
axis.title.y = element_text( size=8)
)
quadratic_plot <- ggplot(wine, aes(y=MORTALITY, x=WINE)) +
geom_point(alpha = .5) +
geom_point(color = "blue") +
stat_smooth(method = "lm", formula = y ~ poly(x,2), color="red")+
labs(x='Wine Consumption (Liters)', y='Heart Attack Mortality (per 1000)',
title = "Quadratic Model: y ~ poly(x,2)")+
stat_poly_eq(aes(label = paste(..eq.label.., ..rr.label.., ..adj.rr.label.., sep = "~~~")),
label.x.npc = "right", label.y.npc = 0.87, geom = "text",
formula = y ~ poly(x,2), parse = TRUE, size = 2.5)+
stat_fit_glance(method = "lm",
method.args = list(formula = y ~ poly(x,2)),
geom = "text", size = 2.5,
label.y.npc = 0.85, label.x.npc = "right",
aes(label = paste("P-value = ", signif(..p.value.., digits = 4), sep = "")))+
theme_classic()+ guides(fill=FALSE)+
theme(
axis.title = element_text( size=8),
axis.title.x = element_text( size=8),
axis.title.y = element_text( size=8)
)
grid.arrange(log_lm_plot, arrangeGrob(simple_lm_plot,
quadratic_plot, ncol = 2), # Second row with 2 plots in 2 different columns
nrow = 2) # number of rows
#'
#'
#' Explanatory power (R-squared) and the residual plots can help us determine which model best fits the relationship between wine consumption and heart attack mortality.
#'
#'
#' ### Model Selection:
#'
#' Both the simple linear model (y~x) and the quadratic model (y~poly(x,2)) have comparatively low explanatory power (R-squared and adjusted R-squared). In the simple linear model, only 66% of the variability in heart attack mortality is explained by wine consumption, while in the quadratic model only 64% of the variability in
#' the response is explained by the predictor.
#'
#'
par(mfrow = c(2, 2))
plot(wine.model, which=c(2,1))      # simple linear model: Q-Q and residuals vs. fitted
plot(wine.model.poly, which=c(2,1)) # quadratic model: Q-Q and residuals vs. fitted
#'
#'
#' Furthermore, the residual plots for both models follow a systematic trend while also exhibiting heteroskedasticity. Randomness and unpredictability are crucial components of any regression model, so, among other reasons, we will not pursue further diagnostics on them: the linearity and equal-standard-deviation assumptions are not met, and the rule of thumb is to transform a nonlinear relationship. This gives us further reason to select the log-transformed model as the best fit.
#'
#' #### Log Transformed Linear Model (exponential)
#'
#'
#'
stargazer(wine.model.transformed,
header=F,
type = "latex",
no.space = T,
single.row = T
)
#'
#'
#'
#' Graphically, the log-transformed regression line covers most data points and has a tighter, more uniform confidence band than the other two models. The model's p-value is highly significant (approximately 4.914e-06), and its explanatory power is good: over 73.6% of the variability in heart attack mortality is explained by wine consumption.
#'
#'
#'
#' ### Model Diagnostic:
#'
#'
par(mfrow = c(2, 2))
plot(wine.model.transformed)
#'
#'
#'
#' **Residuals vs. fitted values**
#'
#' * We observe an acceptable scatter plot in the sense that the residuals show no noticeable systematic pattern against the fitted values, hence no "leaking" explanatory information. The other candidate models showed noticeable patterns.
#' * The residuals bounce randomly around the 0 line and roughly form a horizontal band around it, suggesting that the variances of the error terms are approximately equal.
#' * The plot suggests a few noticeable outliers. This is not concerning, since not all outliers are influential; the last plot (Residuals vs. Leverage) confirms this.
#'
#' **Normal Q-Q Plot**
#'
#' * This plot shows that the residuals are approximately normally distributed, since the points fall close to the straight dashed line with no noticeable deviation.
#'
#' **Scale-Location Plot**
#'
#' * This plot confirms the assumption of equal variance (homoscedasticity). The residuals appear randomly scattered with no concerning patterns, and the red trend line is roughly horizontal, although not perfectly straight.
#'
#' **Residuals vs. Leverage Plot**
#'
#' * No data points fall outside the Cook's distance lines (red dashed lines), suggesting there are no influential outliers (high-leverage points) affecting the regression.
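#'
#' To back the leverage reading with numbers, a small sketch computing Cook's distances and flagging points above the common 4/n rule of thumb (the 4/n cutoff is an assumption, not the only convention in use):
cooks_d <- cooks.distance(wine.model.transformed)
which(cooks_d > 4 / nrow(wine)) # observations exceeding the 4/n threshold, if any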
#'
#' ### Summary
#'
#'
#' **Do these data suggest that heart disease death rates are associated with average wine consumption? If so, how can that be described?**
#'
#' The data suggest a highly significant negative relationship between heart attack mortality and wine consumption; the extremely low p-value (<0.05) confirms this. Since the relationship showed an exponential pattern, we transformed both variables so that the linearity and equal-variance assumptions would hold. After transformation, the data suggest that a one-unit increase in wine consumption (log scale) is associated with a decrease in ischemic heart attack mortality of 0.356 (log scale).
#'
#' \begin{center}
#'
#' \(log(y) = \beta_{0} + \beta_{1} log(x) + \epsilon\)
#'
#' \(log(HeartAttackMortality) = \beta_{0} + \beta_{1} log(WineConsumption) + \epsilon\)
#'
#' \(log(HeartAttackMortality) = 2.556 - 0.356*log(WineConsumption) + \epsilon\)
#'
#' \end{center}
#'
#' **What does this tell us?**
#'
#' This implies a multiplicative relationship between the response and the predictor.
#'
#' * **Multiplicative changes in e** - Multiplying WineConsumption by e multiplies HeartAttackMortality by e^(-0.356), approximately 0.70, i.e., a decrease of about 30%
#' * **A 1% increase in WineConsumption** - A 1% increase in Wine Consumption is associated with a decrease in Heart Attack Mortality of approximately 0.356%
#' * **A 10% increase in WineConsumption** - A 10% increase in Wine Consumption multiplies Heart Attack Mortality by 1.10^(-0.356), approximately 0.967, i.e., a decrease of about 3.3%
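#'
#' These percentage effects follow directly from the fitted slope; a short sketch verifying them numerically:
beta1 <- coef(wine.model.transformed)[2] # approximately -0.356
(exp(beta1) - 1) * 100   # % change in mortality when consumption is multiplied by e (about -30%)
(1.10^beta1 - 1) * 100   # % change in mortality for a 10% increase in consumption (about -3.3%)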
#'
#'
#' This tells us that a 10% increase in wine consumption is associated with roughly a 3.3% decrease in ischemic heart attack mortality. However, it is important to note that this relationship could be confounded by factors not included in this dataset.
#'
#'
#'
#'
#' **Do any countries have substantially higher or lower death rates than others with similar wine consumption rates?**
#'
#'
#'
ggplot(wine, aes(WINE, MORTALITY, label = COUNTRY)) +
geom_vline(xintercept = c(75.90), linetype = 3) +
geom_point(color = "blue") +
geom_label_repel(
aes(
fill = "black"),
fontface = 'bold', color = 'black',
size = 2,
box.padding = unit(0.25, "lines"),
point.padding = unit(0.5, "lines")
) +
labs(x='Wine Consumption (Liters)', y='Heart Attack Mortality (per 1000)')+
theme_classic(base_size = 18)+ guides(fill=FALSE)
#'
#'
#' Yes, both the data table and the plot above indicate that some countries have substantially higher or lower death rates than others with similar wine consumption rates.
#'
#' * Austria and Germany have the same heart attack mortality (4.7), yet wine consumption in Austria (25.1) is far higher than in Germany (15.1)
#' * France and Italy have the same wine consumption (75.9), yet heart attack mortality in France (2.1) is lower than in Italy (3.2)
#' * Scotland and England have the same wine consumption (3.2), yet heart attack mortality in Scotland (9.0) is far higher than in England (7.1)
#' * The United States (5.1) and the Netherlands (5.2) have similar wine consumption, yet heart attack mortality in the United States (9.3) is far higher than in the Netherlands (5.3)
#'
#' **What does this tell us?**
#'
#' There are likely variables, not collected or included in this dataset, that would help explain these differences.
#'
#'
#' \onecolumn
#'
#' # Question 2
#'
#' Meadowfoam is a small plant that grows in the Pacific Northwest and is domesticated for its seed oil. A study was set up to determine if meadowfoam can be made into a profitable crop. In a controlled growth chamber, the plant was grown at 6 different light intensities and two different timings of onset of light treatment. The outcome of interest is the number of flowers per plant, measured by averaging the numbers of flowers produced by 10 seedlings in each group. Growth was replicated at each combination of time and light intensity.
#'
#' a. First put the data into a dataset with four variables: number of flowers, light intensity, timing and replicate.
#'
flowers = read.csv("Flowers.csv")
#replicate <- flowers%>%mutate(replicate = match(INTENS, unique(INTENS)))
flowers$REPLICATE=as.factor(rep(c(1,2),12))
# print the dataset
stargazer(flowers,
header=F,
type = "latex",
no.space = T,
summary = F,
single.row = T
)
#'
#'
#' ##### *T-test Result*
#'
t.test.value <- t.test(data =flowers, FLOWERS~REPLICATE)
t.test.value
#'
#'
#' We fail to reject the null hypothesis: the p-value of .5787 indicates that the two replicate groups have no significant difference. In the interest of parsimony, we will not include this predictor in our model.
#'
#'
#'
#' b. Create a categorical form of the light intensity with 6 categories.
#'
# categorical form of light intensity
INTENS_CATEGORY <- data.frame(
INTENS=c(150,300,450,600,750,900),
INTENS_CATEGORY_ =c('Dimmest','Dimmer', 'Dim', 'Bright', 'Brighter', 'Brightest')
)
# attach the intensity categories and a labeled timing factor
flowers.df<- flowers%>%inner_join(INTENS_CATEGORY, by='INTENS') %>% # add categories
mutate(TIME_CATEGORY_=as.factor(ifelse(TIME==2, "EARLY","LATE"))) # add timing labels
# print out the new categories
stargazer(INTENS_CATEGORY,
header=F,
type = "latex",
summary = F,
no.space = T,
single.row = T
)
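#'
#' An equivalent, more compact way to build the same intensity categories is with factor() labels (a sketch only; the join above is what the analysis uses):
flowers.alt <- flowers
flowers.alt$INTENS_CATEGORY_ <- factor(flowers.alt$INTENS,
                                       levels = c(150, 300, 450, 600, 750, 900),
                                       labels = c("Dimmest", "Dimmer", "Dim", "Bright", "Brighter", "Brightest"))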
#'
#'
#'
#' c. The research questions are: What are effects of intensity and timing? Is there an interaction between the two factors?
#'
#' d. First create an analysis of variance using timing and the categorical form of the light intensity variable. Determine if there is an effect of each factor.
#'
flowers.model.cat <- lm(data=flowers.df, FLOWERS~INTENS_CATEGORY_+TIME_CATEGORY_)
stargazer(flowers.model.cat,
header=F,
type = "latex",
no.space = T,
single.row = T
)
#summary(flowers.model.cat)
# from the p-values, the "Brighter" and "Brightest" categories have no significant effect on the number of flowers; the other categories and timing do
#'
#'
#' * *INTENS_CATEGORY_Brighter* - Holding other predictors constant, the "Brighter" light intensity condition is associated with a decrease of 4.5 flowers relative to the baseline category. However, this is insignificant.
#' * *INTENS_CATEGORY_Brightest* - Holding other predictors constant, the "Brightest" condition is associated with a decrease of 6 flowers. However, this is insignificant.
#' * *INTENS_CATEGORY_Dim* - Holding other predictors constant, the "Dim" condition is associated with an increase of 9 flowers. However, this is insignificant.
#' * *INTENS_CATEGORY_Dimmer* - Holding other predictors constant, the "Dimmer" condition is associated with an increase of 14 flowers.
#' * *INTENS_CATEGORY_Dimmest* - Holding other predictors constant, the "Dimmest" condition is associated with an increase of 23 flowers.
#' * *TIME_CATEGORY_LATE* - Holding other predictors constant, "Late" onset of the light treatment is associated with a decrease of about 12 flowers.
#'
#' Looking at the p-values, the light intensity categories "Dimmer" and "Dimmest" and the timing level "Late" are statistically significant. The dimmer the light, the more flowers are produced, and "Late" onset of the light treatment decreases the number of flowers.
#'
#' ##### *Significance of REPLICATE in the model*
#'
# does adding REPLICATE have any effect in the model?
flowers.model.with.replicate <- lm(data=flowers.df, FLOWERS~INTENS_CATEGORY_+TIME_CATEGORY_+REPLICATE)
stargazer(flowers.model.with.replicate,
header=F,
type = "latex",
no.space = T,
single.row = T
)
#summary(flowers.model.with.replicate)
#'
#'
#' Adding the REPLICATE factor as a predictor showed it to be insignificant (p = 0.254, greater than our alpha level of .05), confirming the t-test result.
#'
#'
#' e. Then create an interaction between light intensity and timing by multiplying the two variables and test for the presence of an interaction.
#'
flowers.model.cat.int <- lm(data=flowers.df, FLOWERS~INTENS_CATEGORY_*TIME_CATEGORY_)
stargazer(flowers.model.cat.int,
header=F,
type = "latex",
no.space = T,
single.row = T
)
#summary(flowers.model.cat.int)
# the model without interaction: R-squared 0.823, adjusted R-squared 0.761
# the model with interactions: R-squared 0.849, adjusted R-squared 0.710
# the negligible difference between the two R-squared values gives no strong evidence of an interaction between light intensity and timing
#'
#'
#' None of the interaction terms is statistically significant (all p-values exceed the significance level of .05). This means light intensity (as a categorical variable) and timing of onset of the light treatment do not interact: the effect of timing does not differ across light intensity levels, and vice versa.
#'
#' f. Now repeat the process but using light intensity as a continuous variable.
#'
#' ##### *Continuous, Without Interaction Terms*
#'
#'
flowers.model.cont <- lm(data=flowers.df, FLOWERS~INTENS+TIME_CATEGORY_)
#summary(flowers.model.cont)
#sjt.lm(flowers.model.cont, show.fstat = TRUE)
stargazer(flowers.model.cont,
header=F,
type = "latex",
no.space = T,
single.row = T
)
# R2 / adj. R2 .799 / .780
#'
#'
#' * *INTENS* - Holding other predictors constant, a one-unit increase in light intensity is associated with a decrease of 0.04 flowers. The extremely low p-value (<0.05) indicates that this is highly significant.
#' * *TIME_CATEGORY_LATE* - Holding other predictors constant, "Late" timing of light exposure is associated with a decrease of about 12 flowers; the p-value (0.000146) indicates this is statistically significant.
#'
#' This means the number of flowers decreases as light intensity increases, and late timing of light exposure also decreases the number of flowers.
#'
#' ##### *Continuous, With Interaction Terms*
#'
#'
flowers.model.cont.int <- lm(data=flowers.df, FLOWERS~INTENS*TIME_CATEGORY_)
#summary(flowers.model.cont.int)
#sjt.lm(flowers.model.cont.int, show.fstat = TRUE)
stargazer(flowers.model.cont.int,
header=F,
type = "latex",
no.space = T,
single.row = T
)
#R2 / adj. R2 .799 / .769
#'
#'
#' The high p-value (0.9096, well above .05) provides insufficient evidence of an interaction between timing of exposure and light intensity treated as a continuous variable.
#'
#' g. Then perform F-tests to compare the four models you have created (light as continuous and categorical, with and without the interaction).
#'
#' ##### *Model 1 as the reference (light as a categorical variable, without interaction)*
#'
#'
flowers.anova.1 <- anova(flowers.model.cat, flowers.model.cat.int, flowers.model.cont, flowers.model.cont.int)
#summary(flowers.anova.1)
stargazer(flowers.anova.1,
header=F,
type = "latex",
summary = F,
no.space = T,
single.row = T
)
# there is no significant difference between the 4 models because all the p-values are extremely high
#'
#'
#' * There is no significant difference between the 4 models because all the p-values are extremely high, i.e., > .05.
#'
#' ##### *Model 2 as the reference (light as a categorical variable, with interaction)*
#'
#'
flowers.anova.2 <- anova( flowers.model.cat.int, flowers.model.cat, flowers.model.cont, flowers.model.cont.int)
#summary(flowers.anova.2)
stargazer(flowers.anova.2,
header=F,
type = "latex",
summary = F,
no.space = T,
single.row = T
)
# there is no significant difference between the 4 models because all the p-values are extremely high
#'
#'
#' * There is no significant difference between the 4 models because all the p-values are extremely high, i.e., > .05.
#'
#' ##### *Model 3 as the reference (light as a continuous variable, without interaction)*
#'
#'
flowers.anova.3 <- anova( flowers.model.cont, flowers.model.cat, flowers.model.cat.int, flowers.model.cont.int)
#summary(flowers.anova.3)
stargazer(flowers.anova.3,
header=F,
type = "latex",
summary = F,
no.space = T,
single.row = T
)
# there is no significant difference between the 4 models because all the p-values are extremely high
#'
#'
#' * There is no significant difference between the 4 models because all the p-values are extremely high, i.e., > .05.
#'
#' ##### *Model 4 as the reference (light as a continuous variable, with interaction)*
#'
#'
flowers.anova.4 <- anova(flowers.model.cont.int, flowers.model.cat, flowers.model.cat.int, flowers.model.cont)
#summary(flowers.anova.4)
stargazer(flowers.anova.4,
header=F,
type = "latex",
summary = F,
no.space = T,
single.row = T
)
# there is no significant difference between the 4 models because all the p-values are extremely high
#'
#'
#' * There is no significant difference between the 4 models because all the p-values are extremely high, i.e., > .05.
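#'
#' Note that anova() on a sequence of models performs sequential F-tests, which are easiest to interpret for nested pairs. One directly nested comparison here is the continuous-intensity model against the categorical one (a straight line in INTENS is a special case of six free category means), which serves as a lack-of-fit test; a minimal sketch:
anova(flowers.model.cont, flowers.model.cat) # does the 6-level factor improve on a linear trend?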
#'
#' h. Predict the number of flowers grown at each combination of light and timing for each of the four models.
#'
flowers.df$prediction1 = predict(flowers.model.cat, flowers.df, type = "response")
flowers.df$prediction2 = predict(flowers.model.cat.int, flowers.df, type = "response")
flowers.df$prediction3 = predict(flowers.model.cont, flowers.df, type = "response")
flowers.df$prediction4 = predict(flowers.model.cont.int,flowers.df, type = "response")
stargazer(flowers.df%>%select(FLOWERS,prediction1,prediction2,prediction3, prediction4),
header=F,
type = "latex",
summary = F,
no.space = T,
single.row = T
)
#'
#'
#'
#' i. Compare each prediction to the observed number of flowers and calculate the difference (observed – predicted). This is the residual. Calculate the residual mean squared error for each model by adding the squared residuals together and dividing by the number of residual degrees of freedom. This should equal the mean squared error in each ANOVA table.
#'
# residuals (observed - predicted)
flowers.df$residual1 = flowers.df$FLOWERS-flowers.df$prediction1
flowers.df$residual2 = flowers.df$FLOWERS-flowers.df$prediction2
flowers.df$residual3 = flowers.df$FLOWERS-flowers.df$prediction3
flowers.df$residual4 = flowers.df$FLOWERS-flowers.df$prediction4
# squared residuals
flowers.df$residual_mse1 = flowers.df$residual1^2
flowers.df$residual_mse2 = flowers.df$residual2^2
flowers.df$residual_mse3 = flowers.df$residual3^2
flowers.df$residual_mse4 = flowers.df$residual4^2
# MSE: sum of squared residuals divided by the residual degrees of freedom of each model
flowers.mse = colSums(flowers.df[,c("residual_mse1","residual_mse2","residual_mse3","residual_mse4")]) /
c(df.residual(flowers.model.cat), df.residual(flowers.model.cat.int),
df.residual(flowers.model.cont), df.residual(flowers.model.cont.int))
stargazer(flowers.mse,
header=F,
type = "latex",
summary = F,
no.space = T,
single.row = T
)
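#'
#' As a sanity check, each hand-computed MSE should match the squared residual standard error of the corresponding fit; a short sketch for the first model (sigma() requires R >= 3.3):
c(by_hand = sum(flowers.df$residual_mse1) / df.residual(flowers.model.cat),
  from_lm = sigma(flowers.model.cat)^2) # the two values should agree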
#'
#'
#' j. Now plot the residuals vs. the predicted for each model and see if there are any patterns. If you see any, what might you do to remove them?
#'
#'
par(mfrow = c(2, 2)) # Split the plotting panel into a 2 x 2 grid
# model 1
plot(flowers.df$prediction1,flowers.df$residual1,
ylab = "Residuals",
xlab = "Predicted",
main = "Model 1"
)
abline(0,0)
# model 2
plot(flowers.df$prediction2,flowers.df$residual2,
ylab = "Residuals",
xlab = "Predicted",
main = "Model 2"
)
abline(0,0)
# model 3
plot(flowers.df$prediction3,flowers.df$residual3,
ylab = "Residuals",
xlab = "Predicted",
main = "Model 3"
)
abline(0,0)
# model 4
plot(flowers.df$prediction4,flowers.df$residual4,
ylab = "Residuals",
xlab = "Predicted",
main = "Model 4"
)
abline(0,0)
#'
#'
#' A systematic pattern is visible in the residual plot for model 2, meaning predictive information is leaking into the residuals because insignificant predictors (the interaction terms) were included. The pattern can be removed by dropping the interaction terms, as sketched below.
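#'
#' In code, the interaction can be dropped without retyping the formula; a small sketch using update():
flowers.model.cat.reduced <- update(flowers.model.cat.int, . ~ . - INTENS_CATEGORY_:TIME_CATEGORY_)
# this reduced fit is identical to flowers.model.cat above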
#'
#' k. Finally, take the model you think describes the data the best and write a short report for your grandmother who would like to grow these flowers carefully explaining to her how she should best grow them and why. Note that your grandmother is curious about how much changes in light and timing might affect her flowers and how sensitive her results will be to the settings she makes.
#'
#'
#'
#' ## Report
#'
#' The main goal of this study and analysis is to identify the best combination of light intensity and timing of onset of light treatment. Timing of onset of light treatment is a dummy variable with two levels (Late and Early), while light intensity is a continuous variable taking the values (150, 300, 450, 600, 750, 900). We further transformed this continuous variable into a third, categorical variable with levels ('Dimmest', 'Dimmer', 'Dim', 'Bright', 'Brighter', 'Brightest'). We also added a REPLICATE variable that distinguishes the two replicates at each combination of timing of onset of light treatment and light intensity.
#'
#'
#' ### Technical Analysis
#' Before fitting any model, we explored the data by checking whether there was any significant difference between the replicates. A t-test p-value of .5787 indicated that the two replicate groups did not differ significantly.
#'
#' #### Evaluation whether results differed by replicate
#' Even after establishing that there was no significant difference between the two groups, we added the REPLICATE factor as a predictor in the model to confirm this. It was insignificant there as well (p = 0.254), confirming the t-test, so we removed it in order to achieve a parsimonious model.
#'
#' From the initial step-by-step analysis, it was established that we have no significant evidence of interaction between timing of exposure and light intensity as a categorical variable; the p-values were greater than the significance level of .05. The interaction plot shows no sign of interaction either: none of the lines cross.
#'
#'
#' Beyond the categorical case, we also established that there was no significant evidence of interaction between timing of exposure and light intensity as a continuous variable.
#'
#'
#'
par(mfrow = c(2, 1))
plot(effect(term="INTENS:TIME_CATEGORY_",mod=flowers.model.cont.int),multiline=TRUE)
plot(effect(term="INTENS_CATEGORY_:TIME_CATEGORY_",mod=flowers.model.cat.int,default.levels=20),multiline=TRUE)
#'
#'
#'
#' Plotting the interaction effects, we get essentially parallel lines that do not cross, in both the continuous and the categorical model. These plots further confirm the absence of interaction.
#'
#' After checking for interactions, we assessed whether there was any significant difference among the 4 models. The extremely high p-values indicated that there was none.
#'
#' #### Model Selection
#' In order to identify the model that best describes the relationship between the dependent and independent variables, several assumptions and conditions must be taken into account:
#'
#' * Linearity: The relationship between the independent and dependent variables is linear.
#' * Homoscedasticity: The variance of the residuals is the same for any value of the independent variable.
#' * Independence: Observations are independent of each other.
#' * Normality: Residuals must be normally distributed.
#'
#' Most of the models exhibited all these properties. The next step is to assess the best-fitting model with respect to the study question and other considerations such as parsimony, R-squared, adjusted R-squared, and residual standard error.
#'
#' Since the interaction terms in the two models with interactions (categorical and continuous) were insignificant, we dropped those models in favor of the two models without interactions, in order to pick the most parsimonious model.
#'
#'
stargazer(flowers.model.cat.int,flowers.model.cont.int,
header=F,
type = "latex",
summary = F,
no.space = T,
ci = TRUE,
keep = c("\\bprecip\\b"), # pattern matches no coefficient name, so the table shows model statistics only
title = "Models With Interaction Terms",
column.labels = c( "Categorical Model", "Continuous Model"),
notes = "Coefficients have been removed!",
dep.var.caption = "-" ,
single.row = T
)
#'
#'
#' Another reason for dropping them was their lower explanatory power (adjusted R-squared) compared to the models without interaction: in the categorical model with the interaction term, only 71% of the variability in the number of flowers is explained by the covariates, while in the continuous case with interaction it is 77%.
#'
#' #### Models Without Interaction
#'
#' We are now down to the two models without interaction. Both have high explanatory power (R-squared and adjusted R-squared).
#'
#'
stargazer(flowers.model.cat,flowers.model.cont,
header=F,
type = "latex",
summary = F,
no.space = T,
ci = TRUE,
title = "Models Without Interaction",
column.labels = c( "Categorical Model", "Continuous Model"),
notes = "Notes",
single.row = T
)
#'
#'
#' In the continuous model, 80% of the variability in the number of flowers is explained by the covariates (light intensity and timing of treatment), while in the categorical model it is 82%. The adjusted R-squared of the continuous model was higher than that of the categorical model, likely because the categorical model uses more coefficients. The continuous model is therefore more parsimonious.
#'
#'
#' The model that best describes the data is the **continuous model without interaction**, because:
#'
#' * It achieves the desired level of explanatory power (80%) with as few predictor variables as possible (parsimony).
#' * It has a higher adjusted R-squared, 0.78, compared to 0.76 for the categorical model. In the categorical model, some predictors (Brighter and Brightest) were highly insignificant, which is one reason that model's adjusted R-squared is lower.
#' * It has a lower residual standard error, 6.441, compared to 6.719 in the categorical model: on average, the response (FLOWERS) deviates from the fitted regression line by 6.441 rather than 6.719.
#' * All of its predictors are highly significant, whereas the categorical model includes two insignificant predictors; the rule of thumb is to remove insignificant predictors from the model.
#'
#'
#' #### Model Diagnostics
#'
#'
par(mfrow = c(2, 2))
plot(flowers.model.cont)
#'
#'
#'
#'
#' **Residuals vs. fitted values**
#'
#' * We observe an acceptable scatter plot in the sense that the residuals show no noticeable systematic pattern against the fitted values, hence no "leaking" explanatory information.
#' * The residuals bounce randomly around the 0 line and roughly form a horizontal band around it, suggesting that the variances of the error terms are approximately equal.
#'
#' **Normal Q-Q Plot**
#'
#' * This plot shows that the residuals are approximately normally distributed, since the points fall close to the straight dashed line with no noticeable deviation.
#'
#' **Scale-Location Plot**
#'
#' * This plot confirms the assumption of equal variance (homoscedasticity). The residuals appear randomly scattered with no concerning patterns. We have a smooth horizontal red line confirming this fact.
#'
#' **Residuals vs. Leverage Plot**
#'
#' * No data points fall outside the Cook's distance lines; in fact, the red dashed lines are not even visible. This suggests there are no influential outliers (high-leverage points) affecting the regression.
#'
#' #### Summary
#' In summary the selected model tells us that:
#'
#' * Holding other predictors constant, a one-unit increase in light intensity results in a decrease of 0.04 flowers; the extremely low p-value (<0.05) indicates this is highly significant.
#' * Holding other predictors constant, "Late" timing of light exposure results in a decrease of about 12 flowers; the p-value (0.000146) indicates this is statistically significant.
#'
#' This means the number of flowers decreases as light intensity increases, and late timing of onset of the light treatment also reduces the number of flowers.
#'
#' ### Grandmother's Report
#'
#' The goal of the analysis was to identify the combination of light intensity and timing of onset of light treatment that yields the most flowers. To get the most flowers, Grandmother should grow her plants under the dimmest light intensity and start the light treatment as early as possible. Her results are sensitive to both settings: each one-unit increase in light intensity costs about 0.04 flowers per plant, and starting the treatment late costs about 12 flowers per plant.
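#'
#' To make the advice concrete, a short sketch predicting flower counts from the selected model at the best and worst settings observed in the experiment (intensities 150 and 900, taken from the design):
predict(flowers.model.cont,
        newdata = data.frame(INTENS = c(150, 900),
                             TIME_CATEGORY_ = c("EARLY", "LATE"))) # dimmest/early vs. brightest/late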
#'
#'
#'
#'
#'
#' \onecolumn
#'
#' # Source Code
#'
#'
#'
#'