---
title: "Red Wine Quality by Anav Gupta"
author: "Anav Gupta"
output: html_notebook
---
```{r global_options, echo=FALSE}
# Global chunk options: figure size and location, hidden code and messages
library("knitr")
knitr::opts_chunk$set(fig.width=7,fig.height=6,fig.path='Figs/',
fig.align='center',tidy=TRUE,
echo=FALSE,warning=FALSE,message=FALSE)
```
Red Wine Quality by Anav Gupta
========================================================
# Packages
Before we start exploring the data, we will first load the packages that we will need for this exploration.
```{r packages, echo=FALSE, message=FALSE, warning=FALSE}
# Load all of the packages that you end up using in your analysis in this code
# chunk.
# Notice that the parameter "echo" was set to FALSE for this code chunk. This
# prevents the code from displaying in the knitted HTML output. You should set
# echo=FALSE for all code chunks in your file, unless it makes sense for your
# report to show the code that generated a particular plot.
# The other parameters for "message" and "warning" should also be set to FALSE
# for other code chunks once you have verified that each plot comes out as you
# want it to. This will clean up the flow of your report.
library(ggplot2)
library(ggthemes)
library(dplyr)
library(tidyr)
library(gridExtra)
library(RColorBrewer)
library(GGally)
```
***
# Load the Data
Let's load the data necessary for this exploration.
```{r Load_the_Data, echo=FALSE}
# Load the Data
redwine <- read.csv('wineQualityReds.csv')
str(redwine)
# Converting the quality into a factor variable
redwine$quality <- factor(redwine$quality, levels = seq(1, 10, 1))
str(redwine)
# Converting the quality into a numerical variable
redwine$quality.num <- as.numeric(redwine$quality)
# Creating a group variable of the qualities of wine
redwine$quality.group <- as.factor(cut(redwine$quality.num, c(2, 4, 6, 8)))
```
This tidy data set contains 1,599 red wines with 11 variables on the chemical
properties of the wine. At least 3 wine experts rated the quality of each wine,
providing a rating between 0 (very bad) and 10 (very excellent).
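
The conversion in the chunk above relies on a detail of R factors that is easy to miss: `as.numeric()` on a factor returns the integer level codes, not the labels. The codes match the scores here only because the levels were declared as 1 to 10. The sketch below illustrates this with made-up scores (the `scores` vector is hypothetical and not taken from the data set).

```r
# Toy quality scores (hypothetical values, not from wineQualityReds.csv)
scores <- c(3, 5, 6, 8)

# Levels declared as 1..10: the integer code of each level equals the score
f_full <- factor(scores, levels = seq(1, 10, 1))
as.numeric(f_full)                # 3 5 6 8

# Levels inferred from the data: codes are positions 1..4, not the scores
f_auto <- factor(scores)
as.numeric(f_auto)                # 1 2 3 4

# Converting through the labels recovers the scores for any level set
as.numeric(as.character(f_auto))  # 3 5 6 8
```

If the level set ever changes, converting through `as.character()` is the safer route.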
***
# Univariate Plots Section
In this section we will analyze each variable individually.
```{r Data Summary, echo=FALSE}
summary(redwine)
```
### Quality
Now let's have a look at the quality variable.
```{r Qualitym, echo=FALSE}
qplot(data = redwine, quality, fill = I('#f79420'))
```
Most of the wines in this sample have a quality score of 5 or 6.
***
### Fixed.Acidity
```{r echo=FALSE}
summary(redwine$fixed.acidity)
```
```{r echo=FALSE}
table(cut(redwine$fixed.acidity, c(4, 5, 6, 7, 8, 9, 10, 12, 16)))
quantile(redwine$fixed.acidity, seq(0, 1, .1))
quantile(redwine$fixed.acidity, seq(0.9, 1, .01))
```
```{r echo=FALSE}
qplot(data = redwine, x = fixed.acidity, fill = I('#f79420'), bins = 60) +
xlim(4, 14)
```
The distribution of fixed acidity peaks at around 6 to 8 units.
We can clearly see the presence of outliers in the data set.
This seems appropriate, as higher values of fixed acidity would make the wine more acidic.
***
### Volatile Acidity
```{r echo=FALSE}
summary(redwine$volatile.acidity)
table(cut(redwine$volatile.acidity, c(.1, .3, .4, .5,.6, .7, .8, 1, 1.6)))
quantile(redwine$volatile.acidity, seq(0, 1, .1))
quantile(redwine$volatile.acidity, seq(0.9, 1, .01))
```
```{r echo=FALSE}
qplot(data = subset(redwine, volatile.acidity <= quantile(redwine$volatile.acidity,
.99)), x = volatile.acidity, fill = I('#f79420'), bins = 60)
```
High levels of volatile acidity can lead to an unpleasant, vinegar-like taste.
This seems to be the reason why only a small percentage of wines have high
volatile acidity. Most of the wines seem to contain 0.4 to 0.6 units of
volatile acidity.
***
### Citric Acid
```{r echo=FALSE}
summary(redwine$citric.acid)
```
```{r echo=FALSE}
qplot(data = redwine, x = citric.acid, fill = I('#f79420'), bins = 60)
```
It seems that citric acid is absent from many of the wine samples. The citric
acid distribution is quite flat. It will be interesting to explore the quality
of the wines that don't contain any citric acid.
***
### Residual sugar
```{r echo=FALSE}
summary(redwine$residual.sugar)
```
```{r echo=FALSE}
grid.arrange(
ggplot(redwine, aes( x = 1, y = residual.sugar)) +
geom_jitter(alpha = 0.1) +
geom_boxplot(alpha = 0.2, color = 'red') ,
ggplot(redwine, aes(x = residual.sugar)) +
geom_histogram(bins=60),
ncol=2)
```
It seems that the majority of the wines have a residual sugar level between 1
and 3. As the graphs above show, some wine samples have far more residual
sugar. It will be interesting to compare the quality of wines based on the
residual sugar.
***
### Chlorides
```{r echo=FALSE}
summary(redwine$chlorides)
```
```{r echo=FALSE}
grid.arrange(
ggplot(data = redwine, aes(x = 1, y = chlorides)) +
geom_jitter(alpha = 0.1) +
geom_boxplot(alpha = 0.2, color = 'red') +
scale_y_continuous(breaks = c(seq(0, .2, .05), .6)),
ggplot(data = redwine, aes(x = chlorides)) +
geom_histogram(fill = I('#f79420'), bins = 60),
ncol = 2
)
```
```{r echo=FALSE}
ggplot(data = subset(redwine, chlorides <= quantile(redwine$chlorides, .95)),
aes(x = chlorides)) +
geom_histogram(fill = I('#f79420'), bins = 60)
```
We can clearly see that chlorides are found in very small quantities in the
wine samples. The chlorides seem to follow a roughly normal distribution with
many outliers. About 95% of the wines contain chlorides in the range of 0.040
to 0.125.
***
### Free Sulfur Dioxide
```{r echo=FALSE}
summary(redwine$free.sulfur.dioxide)
```
```{r echo=FALSE}
quantile(redwine$free.sulfur.dioxide, seq(0, 1, .1))
quantile(redwine$free.sulfur.dioxide, seq(0.9, 1, .01))
```
```{r echo=FALSE}
qplot(data = redwine, x = (free.sulfur.dioxide), fill = I('orange'),
bins = 60)
```
```{r echo=FALSE}
ggplot(aes(free.sulfur.dioxide), data = subset(redwine, free.sulfur.dioxide <=
quantile(redwine$free.sulfur.dioxide, .99))) +
geom_histogram(fill = I('orange'), bins = 60)
```
From the graph as well as from the quantile function, we can clearly see that in
the majority of the wines (90%) the amount of free sulfur dioxide is within 31
units. At low concentrations sulfur dioxide is undetectable, but at free
concentrations over 50 ppm it becomes evident in the nose as well as the taste
of the wine.
I suppose this is why only 1 percent of wines have a concentration greater
than 50.
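
The plots in this section drop the extreme tail by filtering on a quantile. As a minimal sketch of that idea, the block below uses a made-up vector `x` (hypothetical, not the wine data) with one large outlier and keeps only the values at or below the 0.99 quantile.

```r
# Hypothetical sample: 99 ordinary values and one large outlier
x <- c(seq(1, 99), 500)

# The 0.99 quantile serves as the cutoff for the top 1 percent
cutoff <- quantile(x, 0.99)

# Keep only the observations at or below the cutoff
trimmed <- x[x <= cutoff]

length(x)        # 100
length(trimmed)  # 99
max(trimmed)     # 99
```

Trimming this way only changes what is plotted. The summary statistics reported earlier still use the full data.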
***
### Total Sulfur Dioxide
```{r echo=FALSE}
summary(redwine$total.sulfur.dioxide)
```
```{r echo=FALSE}
quantile(redwine$total.sulfur.dioxide, seq(0, 1, .1))
```
```{r echo=FALSE}
qplot(x = total.sulfur.dioxide, data = redwine,
fill = I('Orange'), bins = 60) +
scale_x_continuous(limits = c(0, 165), breaks = seq(0, 160, 20))
```
Total sulfur dioxide is the total amount of sulfur dioxide in the wine (free
plus bound), and hence we expect some kind of relation between the free and
total sulfur dioxide.
***
### Density
```{r echo=FALSE}
summary(redwine$density)
```
```{r echo=FALSE}
ggplot(data = redwine, aes(x = density)) +
geom_histogram(fill = I('Orange'), bins = 60) +
geom_vline(aes(xintercept = mean(density), color = I('black')))
```
We can clearly see that the density of the wine varies over a narrow range.
The median and mean of the density are nearly equal, which suggests a symmetric
distribution that is close to normal.
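
Comparing the mean with the median is a quick symmetry check, not a proof of normality. The sketch below shows the idea on two made-up vectors (`symmetric` and `skewed` are hypothetical). A normal Q-Q plot via `qqnorm()` would be a reasonable follow-up.

```r
# Two hypothetical samples with five values each
symmetric <- c(1, 2, 3, 4, 5)    # balanced around 3
skewed    <- c(1, 1, 2, 2, 10)   # one large value pulls the mean up

# Difference between mean and median: near zero for a symmetric sample
gap_symmetric <- mean(symmetric) - median(symmetric)
gap_skewed    <- mean(skewed) - median(skewed)

gap_symmetric  # 0
gap_skewed     # 1.2
```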
***
### pH
```{r echo=FALSE}
summary(redwine$pH)
```
```{r echo=FALSE}
ggplot(data = redwine, aes(x = pH)) +
geom_histogram(fill = I('Orange'), bins = 60) +
scale_x_continuous(breaks = seq(2.5, 4, .1)) +
geom_vline(aes(xintercept = mean(pH), color = I('black')))
```
pH is an index that indicates how acidic or alkaline a water-soluble substance
is. We can clearly see that the pH of the wine varies over a narrow range. The
median and mean of the pH are nearly equal, which suggests a roughly normal
distribution. The pH values of the wines fall between about 3 and 4.
***
### Sulphates
```{r echo=FALSE}
summary(redwine$sulphates)
```
```{r echo=FALSE}
quantile(redwine$sulphates, seq(0, 1, .1))
quantile(redwine$sulphates, seq(0.9, 1, .01))
table(cut(redwine$sulphates, c(0, .5 , .6, .7, .8, .9, 1, 2)))
```
```{r echo=FALSE}
qplot(data = redwine, x = sulphates, fill = I('orange'), bins = 60) +
scale_x_continuous(breaks = seq(0, 1.5, .1)) +
geom_vline(aes(xintercept = mean(sulphates), color = I('black'))) +
geom_vline(aes(xintercept = median(sulphates), color = I('blue')))
```
From the histogram as well as from the table, we can clearly see that the
majority of the wines have a sulphate concentration of 0.5 to 0.7 units.
***
### Alcohol
```{r echo=FALSE}
summary(redwine$alcohol)
```
```{r echo=FALSE}
table(cut(redwine$alcohol, c(8, 9, 10, 11, 12, 13, 15)))
```
```{r echo=FALSE}
ggplot(data = redwine, aes(x = alcohol)) +
geom_histogram(fill = I('orange'), bins = 60) +
scale_x_continuous(breaks = c(seq(6, 16, 1), 10.5), minor_breaks = seq(10, 11, .1)) +
geom_vline(aes(xintercept = mean(alcohol), color = I('black'))) +
geom_vline(aes(xintercept = median(alcohol), color = I('blue')))
```
From the histogram as well as the table summary, we can see that about 50
percent of the wines have an alcohol content of 9 to 10 percent.
***
# Univariate Analysis
In this section we will list the analysis of the univariate exploration.
### What is the structure of your dataset?
Our data set consists of 1,599 observations with 11 physicochemical inputs and
an output that gives the quality of the wine. The quality variable is a factor
variable with the following levels:
(Worst) ----------------> (Best)
1, 2, 3, 4, 5, 6, 7, 8, 9, 10
Other Observations:
* Most wines have a pH of 3 to 4.
* Most wines have an alcohol content of 9 to 10 percent.
### What is/are the main feature(s) of interest in your dataset?
We can see that the variables free sulfur dioxide and total sulfur dioxide will
be connected in some way. In the same manner, fixed and total acidity are
connected to each other.
It will be interesting to find out whether the quantity of alcohol in the wine
has any influence on the quality of the wine.
Another question is how the quantity of citric acid, which adds freshness and
flavor to wines, affects the quality of the wine.
### What other features in the dataset do you think will help support your \
investigation into your feature(s) of interest?
Chlorides, residual sugar and pH are some of the variables that will support my investigation into my features of interest.
### Did you create any new variables from existing variables in the dataset?
Yes, I created new variables from the existing quality variable. First, I
converted quality into a factor and kept a numeric copy, 'quality.num'. Second,
I created 'quality.group' by cutting the quality scores into three equal-width
intervals: (2,4], (4,6] and (6,8].
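
As a small illustration of how `cut()` assigns scores to groups, the sketch below applies the same break points to a made-up vector of scores (`quality_scores` is hypothetical). Intervals are closed on the right by default, so a score of 4 lands in the lowest group.

```r
# Hypothetical quality scores covering the observed range
quality_scores <- 3:8

# Same break points as quality.group: (2,4], (4,6], (6,8]
groups <- cut(quality_scores, c(2, 4, 6, 8))

levels(groups)              # "(2,4]" "(4,6]" "(6,8]"
as.character(groups[2])     # score 4 falls in "(2,4]"
table(groups)               # two scores per group
```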
### Of the features you investigated, were there any unusual distributions? \
Did you perform any operations on the data to tidy, adjust, or change the form \
of the data? If so, why did you do this?
I converted quality into a factor variable. This helps me visualize how the
input variables differ across wines of different quality.
# Bivariate Plots Section
This section will try to explore two variables at a time.
```{r echo=FALSE}
ggcorr(subset(redwine, select = c(-X, -quality, -quality.group)),
label = TRUE, label_size = 2, label_alpha = TRUE,
angle = -45) +
theme(legend.title = element_text(size = 14))
my_fn <- function(data, mapping, method="loess", ...){
p <- ggplot(data = data, mapping = mapping) +
geom_point(shape = I('.')) +
geom_smooth(method=method, ...)
p
}
```
From the above graph we can see some notable correlations among the following
variables:
* pH and Fixed Acidity: -0.683
* Fixed Acidity and Citric Acid: 0.672
* Density and Fixed Acidity: 0.668
* Free Sulfur Dioxide and Total Sulfur Dioxide: 0.668
* Volatile Acidity and Citric Acid: -0.552
* Citric Acid and pH: -0.542
* Density and Alcohol: -0.5
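
The coefficients listed above can also be read from a plain correlation matrix. The sketch below is a toy illustration with a hypothetical data frame `toy`, and it assumes the plot used Pearson correlation, which is the default method of base R's `cor()`.

```r
# Hypothetical data frame with one rising and one falling column
toy <- data.frame(
  x    = c(1, 2, 3, 4, 5, 6),
  up   = c(2, 4, 6, 8, 10, 12),  # rises exactly with x
  down = c(6, 5, 4, 3, 2, 1)     # falls exactly as x rises
)

# Pairwise Pearson correlations, rounded like the plot labels
corr_matrix <- round(cor(toy), 3)
corr_matrix

corr_matrix["x", "up"]    # 1
corr_matrix["x", "down"]  # -1
```

Values near 1 or -1 indicate a strong linear relation, and values near 0 indicate little or none.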
```{r message=FALSE,warning=FALSE,echo=FALSE, fig.width=20, fig.height=30}
ggpairs(subset(redwine, select = c(-X, -quality, -quality.group)),
lower = list(continuous = wrap(my_fn, method = 'lm')),
upper = list(combo = wrap('box', outlier.shape = I('.'))),
axisLabels = 'internal',
corSize = 10)
```
Now let's compare each physicochemical input of the wine with the quality
(output) of the wine.
### Volatile Acidity and Quality
```{r echo=FALSE}
# scale_y_continuous(breaks = seq(0.3, 1.6, .2))
ggplot(data = subset(redwine, volatile.acidity <=
quantile(redwine$volatile.acidity, .95)),
aes(quality, volatile.acidity)) +
geom_point(alpha = 1/5, position = 'jitter') +
geom_boxplot(aes(fill = quality), alpha = .5) +
scale_y_continuous() +
stat_summary(geom = 'point', fun.y = 'mean', shape = 8, color = "red")
```
From the box plot above, we can draw some connection between volatile acidity
and the quality of the wine.
In lower-quality wines, volatile acidity is widely dispersed, and the
dispersion decreases as we move to better-quality wines.
The median as well as the mean of the volatile acidity decreases as the quality
of the wine increases.
From the box plot it seems that 0.3 to 0.5 is the ideal range for volatile
acidity.
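
The trend read from the box plot can be checked numerically by computing the mean and median per quality level. The sketch below does this on a made-up data frame (`toy` and its values are hypothetical, chosen only to show the mechanics).

```r
# Hypothetical data: two wines per quality level
toy <- data.frame(
  quality = factor(c(3, 3, 5, 5, 7, 7)),
  volatile.acidity = c(0.9, 0.7, 0.6, 0.5, 0.4, 0.3)
)

# Mean and median of volatile acidity within each quality level
group_means   <- tapply(toy$volatile.acidity, toy$quality, mean)
group_medians <- tapply(toy$volatile.acidity, toy$quality, median)

group_means    # decreases from quality 3 to quality 7
group_medians
```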
***
### Fixed Acidity and Quality
Fixed Acidity for the wines having quality 3 or 4 is very dispersed.
For the wines with quality score 5, 6 and 7 we can see from the a
```{r echo=FALSE}
ggplot(data = redwine, aes(quality, fixed.acidity)) +
geom_boxplot(aes(fill = redwine$quality)) +
scale_y_continuous(breaks = seq(4, 16, 2)) +
stat_summary(geom = 'point', fun.y = 'mean', shape = 22, fill = I('black'))
```
It's difficult to spot a trend in the box plots above, although it can be said
that for each quality level the middle 50 percent of fixed acidity values lie
between 7 and 10 units.
***
### Citric Acid and Quality
```{r echo=FALSE}
subset(redwine, citric.acid != 0) %>%
group_by(quality) %>%
summarise(n = n())
```
```{r echo=FALSE}
ggplot(data = redwine, aes(quality, citric.acid)) +
geom_boxplot(aes(fill = redwine$quality)) +
stat_summary(geom = 'point', fun.y = 'mean', shape = 22, fill = I('black'))
```
The trend we can see is that the citric acid content (mean as well as median)
increases as the quality of the wine increases.
***
### pH and Quality
```{r echo=FALSE}
ggplot(data = redwine, aes(x = quality, y = pH)) +
geom_boxplot(aes(fill = redwine$quality)) +
scale_y_continuous(breaks = seq(3, 4, .1)) +
stat_summary(geom = 'point', fun.y = 'mean', shape = 22, fill = I('black'))
```
The box plot doesn't tell us much about the quality of a wine from its pH
value. What it does show is that the pH of the wine generally remains within
the range of 3 to 4, with about 50 percent of the values within 3.2 to 3.4.
***
### Residual Sugar vs Quality
```{r echo=FALSE}
ggplot(data = redwine, aes(x = quality, y = residual.sugar)) +
geom_boxplot(aes(fill = redwine$quality)) +
scale_y_continuous(limits = c(0.8, quantile(redwine$residual.sugar, 0.90)), minor_breaks = seq(1, 3, .5)) +
stat_summary(geom = 'point', fun.y = 'mean', shape = 4)
```
From the box plot above, there doesn't seem to be any relationship or trend
between residual sugar content and the quality of the wine. In about 87 percent
of the wines, the residual sugar falls within the range of 1 to 3.
The median residual sugar content remains roughly constant across all qualities
of wine.
Some wines with quality 5 and 6 have a high amount of residual sugar.
For all qualities of wine, more than 50 percent of the samples have residual
sugar within the range of 2 to 3.
***
### Chlorides vs Quality
```{r warning=FALSE, echo=FALSE}
ggplot(data = redwine, aes(x = chlorides)) +
geom_histogram(aes(fill = quality.group), bins = 60) +
scale_x_continuous(limits = c(0.025, quantile(redwine$chlorides, 0.95)))
```
```{r echo=FALSE}
ggplot(data = redwine, aes(x = quality, y = chlorides)) +
geom_boxplot(aes(fill = redwine$quality)) +
scale_y_continuous(limits = c(0, quantile(redwine$chlorides, 0.95)),
minor_breaks = seq(0.05, .2, .05)) +
stat_summary(geom = 'point', fun.y = 'mean', shape = 22, fill = I('black'))
```
There doesn't seem to be any trend between the quantity of chlorides and the
quality of the wine.
For most of the wines, the quantity of chlorides falls between 0.05 and 0.1
units.
***
### Free Sulfur Dioxide vs Quality
```{r echo=FALSE}
ggplot(data = subset(redwine, total.sulfur.dioxide <=
quantile(redwine$total.sulfur.dioxide, .95)),
aes(x = quality, y = free.sulfur.dioxide)) +
geom_boxplot(aes(fill = quality)) +
scale_y_continuous(breaks = seq(0, 70, 10), minor_breaks = seq(0, 20, 2)) +
stat_summary(geom = 'point', fun.y = 'mean', shape = 22, fill = I('black'))
```
From the box plot above, free sulfur dioxide doesn't seem to have any effect on
the quality of the wine.
***
### Total Sulfur Dioxide vs Quality
```{r echo=FALSE}
ggplot(data = subset(redwine, total.sulfur.dioxide <=
quantile(redwine$total.sulfur.dioxide, .99)),
aes(x = quality, y = total.sulfur.dioxide)) +
geom_boxplot(aes(fill = quality)) +
scale_y_continuous() +
stat_summary(geom = 'point', fun.y = 'mean', shape = 22, fill = I('black'))
```
Total sulfur dioxide doesn't seem to have any kind of relationship with the
quality of a wine. It just so happens that in more than 50 percent of the wines
the total sulfur dioxide present is less than or equal to 50.
***
### Sulphates vs Quality
```{r echo=FALSE}
ggplot(data = subset(redwine, sulphates <=
quantile(redwine$sulphates, .99)),
aes(x = quality, y = sulphates)) +
geom_boxplot(aes(fill = quality)) +
scale_y_continuous() +
stat_summary(geom = 'point', fun.y = 'mean', shape = 22, fill = I('black'))
```
The median as well as the mean quantity of sulphates increases as the quality
of the wine increases.
***
### Density vs Quality
```{r echo=FALSE}
ggplot (data = redwine, aes( density)) +
geom_histogram(aes(x = density, fill = quality.group), bins = 60)
```
```{r echo=FALSE}
ggplot (data = redwine, aes(quality, density)) +
geom_boxplot(aes(fill = quality)) +
stat_summary(geom = 'point', fun.y = 'mean', shape = 22, fill = I('black'))
```
As we have already seen, the density of the wine varies over a narrow range.
It's difficult to find a trend between the density and the quality of a wine.
***
### Alcohol vs Quality
```{r echo=FALSE}
ggplot (data = redwine, aes(alcohol)) +
geom_histogram(aes(x = alcohol, fill = quality), bins = 60)
```
```{r echo=FALSE}
ggplot(data = subset(redwine, volatile.acidity <= quantile(
redwine$volatile.acidity, .98)), aes(quality, alcohol)) +
geom_boxplot(aes(fill = quality)) +
stat_summary(geom = 'point', fun.y = 'mean', shape = 22, fill = I('black'))
```
```{r}
ggplot(data = redwine, aes(x = quality.group, y = alcohol)) +
geom_jitter(alpha = 0.3) +
geom_boxplot(aes(fill = quality.group), alpha = 0.5) +
stat_summary(geom = 'point', fun.y = mean,
color = 'red', shape = 8, size = 4)
```
The box plots above seem to suggest that as the alcohol content of the wine
increases, its quality increases. However, this cannot be said with certainty,
since some of the wines of quality 5 also have a high alcohol content.
***
So far we have compared the physicochemical properties of the wine with its
quality. Now let's try to relate the physicochemical properties to each
other.
### Fixed Acidity and Citric Acid
We know that citric acid is a non-volatile acid and that fixed acidity measures
the non-volatile acid content of the wine.
```{r message=FALSE, echo=FALSE}
ggplot(data = redwine, aes(fixed.acidity, citric.acid)) +
geom_point(color = I('orange'), alpha = 1 / 5) +
geom_smooth()
```
```{r echo=FALSE}
with (data = redwine, cor.test(fixed.acidity, citric.acid))
```
We do see a positive correlation between citric acid and fixed acidity, which
was somewhat expected.
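
For readers unfamiliar with `cor.test()`, the sketch below shows which parts of its result are usually worth reading. It runs on two made-up vectors (`x` and `y` are hypothetical, not columns of the wine data).

```r
# Hypothetical paired measurements that rise together, with some noise
x <- 1:8
y <- c(2, 1, 4, 3, 6, 5, 8, 7)

# Pearson correlation test (the default method of cor.test)
ct <- cor.test(x, y)

ct$estimate   # sample correlation coefficient r
ct$conf.int   # 95 percent confidence interval for r
ct$p.value    # p-value for the null hypothesis r = 0
```

The estimate gives the strength and direction of the linear relation, while the confidence interval shows how precisely it is known.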
***
### Fixed Acidity and pH
```{r message=FALSE, echo=FALSE}
ggplot(data = redwine, aes(round(fixed.acidity / .2) * .2, pH)) +
geom_point(color = I('orange'), alpha = 1 / 5) +
geom_smooth()
```
In the graph, it seems that the smooth line is going through the middle of the
major portion of points.
```{r echo=FALSE}
with (data = redwine, cor.test(fixed.acidity, pH))
```
From the above we can say that fixed acidity and pH are negatively correlated.
This seems plausible as well: as the acid content of the wine increases, the
pH value, which measures the degree of acidity or alkalinity, should decrease.
A lower pH means a more acidic substance.
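
The pH scale is logarithmic: pH is the negative base-10 logarithm of the hydrogen ion concentration. The sketch below uses a hypothetical helper, `hydrogen_ion_conc()`, to show that a drop of one pH unit corresponds to a tenfold increase in concentration.

```r
# Hypothetical helper: hydrogen ion concentration (mol/L) for a given pH
hydrogen_ion_conc <- function(pH) 10^(-pH)

# A wine at pH 3 compared with a wine at pH 4
ratio <- hydrogen_ion_conc(3) / hydrogen_ion_conc(4)
ratio              # 10

# Going the other way: concentration back to pH
-log10(0.001)      # 3
```

This is why the narrow pH range of 3 to 4 still covers a tenfold difference in acidity.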
***
### Density and Fixed Acidity
```{r message=FALSE, echo=FALSE}
ggplot(data = redwine, aes(fixed.acidity, density)) +
geom_point(color = I('orange'), alpha = 1/5) +
stat_smooth()
```
The data is dispersed in the upper half of the smooth line. I think we do see
some correlation, and the smooth line somewhat supports that belief.
```{r echo=FALSE}
with(data = redwine, cor.test(density, fixed.acidity))
```
From the graph as well as from the correlation coefficient (Pearson's r), we can
say that density and fixed acidity are positively correlated.
***
### Free vs Total sulfur Dioxide
These two variables tell us about the concentration of sulfur dioxide in the
wine, either in free form or in total.
So even from the definition of these two variables, we can postulate that they
must be correlated. Let's test that postulate.
```{r message=FALSE, warning=FALSE, echo=FALSE}
ggplot(data = redwine, aes(free.sulfur.dioxide, total.sulfur.dioxide)) +
scale_x_continuous(limits =
c(0, quantile(redwine$free.sulfur.dioxide, .95))) +
scale_y_continuous(limits =
c(0, quantile(redwine$total.sulfur.dioxide, .95))) +
geom_point(color = I('orange'), alpha = 1/5) +
stat_smooth()
```
```{r echo=FALSE}
with (data = redwine, cor.test(free.sulfur.dioxide, total.sulfur.dioxide))
```
From the above analysis, it seems that they both are positively correlated.
```{r echo=FALSE}
with(data = redwine, cor.test(quality.num, alcohol))
with(data = redwine, cor.test(quality.num, fixed.acidity))
```
### Density vs Alcohol
```{r echo=FALSE}
with (data = redwine, cor.test(density, alcohol))
```
```{r message=FALSE, echo=FALSE}
ggplot(data = redwine, aes(alcohol, density)) +
geom_point(color = I('orange'), alpha = .5) +
geom_smooth()
```
We can see that there is a negative correlation between density and alcohol.
The smooth line passes through most of the dense regions of points.
***
# Bivariate Analysis
This section lists the analysis of the bi-variate explorations.
### Talk about some of the relationships you observed in this part of the \
investigation. How did the feature(s) of interest vary with other features in \
the dataset?
The alcohol content of the wine correlates with its quality.
As the quality of the wine increases, the average (mean and median) alcohol
content increases.
Volatile acidity correlates mildly (negatively) with the quality of the wine.
As the quality of the wine increases, the median as well as the mean volatile
acidity decreases.
The citric acid content (mean as well as median) increases as the quality of
the wine increases.
The median as well as the mean quantity of sulphates increases as the quality
of the wine increases.
The pH value of all wines remains in the range of 3 to 4. In particular, for
every quality level, about 50 percent or more of the wines have a pH value
within 3.2 to 3.4.
There doesn't seem to be any kind of relation between residual sugar and the
quality of the wine, but it should be noted that in more than 50 percent of the
wines the residual sugar was within 2 to 3 units.
### Did you observe any interesting relationships between the other features \
(not the main feature(s) of interest)?
Fixed acidity and citric acid tend to correlate with each other. Since citric
acid is a non-volatile acid and fixed acidity gives the total non-volatile acid
content, this relationship makes sense.
Fixed acidity and pH are negatively correlated with each other. If the fixed
acidity of the wine increases, its pH value decreases. This makes sense: with
more non-volatile acid in the wine, its acidity increases and hence its pH
decreases. The lower the pH of a solution, the more acidic it is.
Density and fixed acidity tend to correlate with each other (mildly). If we
increase the fixed acidity of the wine, its density tends to increase.
### What was the strongest relationship you found?
Our feature of interest, quality, has its strongest relationship with alcohol
content. The quality of the wine is positively correlated with its alcohol
content. This is the strongest relation we found.
# Multivariate Plots Section
In this section we will examine multiple variables at a time.
### Density vs Fixed Acidity by Quality
```{r message=FALSE, echo=FALSE}
ggplot(data = redwine, aes(fixed.acidity, density)) +
geom_point(aes(color = quality.group)) +
geom_smooth(aes(color = quality.group)) +
scale_color_brewer(type = 'seq', guide = guide_legend("Quality"),
labels = c('worst', 'normal', 'better'))
```
It can be clearly seen that the density of the wine increases with the increase
in the fixed acidity of the wine. From above, we can see that the trend is
followed irrespective of the wine quality.
***
### Citric Acid vs Fixed Acidity by Quality
```{r message=FALSE, echo=FALSE}
ggplot(data = redwine, aes(fixed.acidity, citric.acid)) +
geom_point(aes(color = quality.group)) +
geom_smooth(aes(color = quality.group)) +
scale_color_brewer(type = 'seq', guide = guide_legend("Quality"),
labels = c('worst', 'normal', 'better'))
```
We can see that in all quality groups, the citric acid content tends to
increase as the fixed acidity increases.
***
### Fixed Acidity vs pH by Quality
```{r echo=FALSE}
ggplot(data = redwine, aes(fixed.acidity, pH)) +
geom_point(aes(color = quality.group)) +
geom_smooth(aes(color = quality.group)) +
scale_color_brewer(type = 'seq', guide = guide_legend("Quality"),
labels = c('worst', 'normal', 'better'))
```
We can see that for the wines of better quality, the slope of the smoothed line
remains almost constant, but for the wines of the worst quality the slope
changes frequently as the fixed acidity increases.
***
### Density vs Alcohol by Quality
```{r echo=FALSE}
ggplot(data = redwine, aes(alcohol, density )) +
geom_point(aes(color = quality.group), alpha = 0.5) +
geom_smooth(aes(color = quality.group)) +
scale_color_brewer(type = 'seq', guide = guide_legend("Quality"),
labels = c('worst', 'normal', 'better'))
```
It can be clearly seen that for the wines of the worst quality, the slope of
the smoothed line tends to remain constant throughout the graph. In contrast,
the smoothed line for the wines of better quality tends to wobble as the
alcohol concentration increases. Generally, as the alcohol concentration
increases, the density decreases.
***
### Free sulfur dioxide vs Total Sulfur Dioxide by Quality
```{r message=FALSE, warning=FALSE, echo=FALSE}
ggplot(data = redwine, aes(total.sulfur.dioxide, free.sulfur.dioxide)) +
scale_y_continuous(limits =
c(0, quantile(redwine$free.sulfur.dioxide, 0.95))) +
scale_x_continuous(limits =
c(0, quantile(redwine$total.sulfur.dioxide, 0.95))) +
geom_point(aes(color = quality.group), alpha = 0.5) +
geom_smooth(aes(color = quality.group)) +
scale_color_brewer(type = 'seq', guide = guide_legend("Quality"),
labels = c('worst', 'normal', 'better'))
```
We can see that for lower values of total sulfur dioxide there is very little
variance in the value of free sulfur dioxide. As the total sulfur dioxide
increases, the variation in the amount of free sulfur dioxide increases as
well.
With the help of the quality groups we can see that the smoothed line changes
direction frequently for the wines in the normal category. The change is not
that frequent in the other groups of wines.
***
### Density vs Alcohol over Fixed Acidity by Quality
```{r}
ggplot(data = redwine, aes(alcohol / fixed.acidity, density)) +
geom_point(aes(color = quality.group), alpha = 1 / 3, size = 1) +
scale_color_brewer(type = "qual", palette = 2,
labels = c('worst', 'normal', 'best')) +
labs(color = 'Quality group') +
geom_smooth(aes(color = quality.group))
```
The density of the wine is strongly correlated with the alcohol content of the
wine and its fixed acidity. We can see that for all qualities of wine, the
smooth line tends to pass through the middle of the points, suggesting
correlation.
So as the ratio of alcohol to fixed acidity increases, the density tends to
decrease.
***
### Linear model for density
Here we will create a linear model of density with alcohol and fixed acidity as predictors.
```{r echo=FALSE}
densityLm <- lm(data = redwine, I(density) ~ I(alcohol) + I(fixed.acidity))
summary(densityLm)
```
So it turns out that 65 percent of variance in density is explained by alcohol
content and the fixed acidity of the wine.
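
The share of variance explained is the R-squared reported by `summary()`. As a minimal sketch of what that number means, the block below fits the same kind of model on a made-up data frame (`toy` and its values are hypothetical) and recomputes R-squared from the residual and total sums of squares.

```r
# Hypothetical measurements, not taken from the wine data
toy <- data.frame(
  alcohol       = c(9, 10, 11, 12, 13, 14),
  fixed.acidity = c(8, 6, 9, 7, 10, 6),
  density       = c(0.9990, 0.9975, 0.9982, 0.9968, 0.9976, 0.9950)
)

fit <- lm(density ~ alcohol + fixed.acidity, data = toy)

# R-squared = 1 - (residual sum of squares) / (total sum of squares)
rss <- sum(residuals(fit)^2)
tss <- sum((toy$density - mean(toy$density))^2)
r_squared <- 1 - rss / tss

all.equal(r_squared, summary(fit)$r.squared)  # TRUE
```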
***
### Quality Linear Model
Now let's try to make a linear model for the quality of the wine.
One thing to check before building the linear model is that the variables in
the model are not strongly correlated with each other. Correlated predictors
make it ambiguous which component is responsible for a change in the response.
From the correlation matrix we know that these variables are not strongly
correlated with each other.
* Alcohol
* pH
* Sulphates
* Residual Sugar
* Chlorides
* Total Sulfur Dioxide
```{r echo=FALSE}
m1 <- lm(data = redwine, I(quality.num) ~ alcohol + sulphates + pH +
residual.sugar + chlorides + total.sulfur.dioxide)
summary(m1)
```
This model tries to explain the variation in the quality of the wine.
We get an R-squared value of 0.3142.
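
The check described above, that predictors should not be strongly correlated with each other, can be sketched as follows. The data frame `toy` is made up, and the cutoff of 0.7 is an arbitrary assumption for illustration, not a rule taken from this analysis.

```r
# Hypothetical predictors: pH falls exactly as fixed.acidity rises
toy <- data.frame(
  alcohol       = c(10, 9, 12, 9.5, 11, 10.5),
  fixed.acidity = c(6, 7, 8, 9, 10, 11),
  pH            = c(3.6, 3.5, 3.4, 3.3, 3.2, 3.1)
)

r <- cor(toy)

# Keep each pair once by blanking the diagonal and the lower triangle
r[lower.tri(r, diag = TRUE)] <- NA

# Pairs whose absolute correlation exceeds the assumed cutoff of 0.7
high <- which(abs(r) > 0.7, arr.ind = TRUE)
flagged <- data.frame(
  var1 = rownames(r)[high[, "row"]],
  var2 = colnames(r)[high[, "col"]],
  r    = r[high]
)
flagged  # one row: fixed.acidity and pH
```

A flagged pair is a hint to keep only one of the two variables in the model.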
***
# Multivariate Analysis
### Talk about some of the relationships you observed in this part of the \
investigation. Were there features that strengthened each other in terms of \
looking at your feature(s) of interest?
There seems to be a strong relationship between Density and the Alcohol and the
fixed acidity.
### Were there any interesting or surprising interactions between features?
There is a surprising interaction between density and the combination of
alcohol and fixed acidity. It may have something to do with the chemistry of
the fluids, but nonetheless it is an interesting relation that one can find
without any prior knowledge of the chemistry behind these interactions.
### OPTIONAL: Did you create any models with your dataset? Discuss the \
strengths and limitations of your model.
I created a model for the examination of the quality of the wine variable with
the help of some of the physicochemical inputs that were provided with the
dataset.
This model contains six of the original 11 inputs in the wine data set.
All these are pretty much not correlated to each other. This is good as it will