-
Notifications
You must be signed in to change notification settings - Fork 0
/
DataAnalysis1.qmd
428 lines (337 loc) · 18.1 KB
/
DataAnalysis1.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
---
title: "Basil Water Stress Experiment 1 DE 2024 | Data Analysis"
subtitle: "Bioteksa Trials"
author: "Victor Hugo Rivero"
bibliography: "https://api.citedrive.com/bib/0ffe6857-a531-47fa-a32f-6ed0fd2467a1/references.bib?x=eyJpZCI6ICIwZmZlNjg1Ny1hNTMxLTQ3ZmEtYTMyZi02ZWQwZmQyNDY3YTEiLCAidXNlciI6ICI4Mjk3IiwgInNpZ25hdHVyZSI6ICJlM2ZkY2U1NzgyMWRlOTJiMWU0MzI4MjQ5N2NhZWZiNzZlOGIwZmFjNGRhZGI4NjRlYTNjOGY3MDA2NTYzYTAxIn0=/bibliography.bib"
number-sections: true
format:
confluence-html:
code-overflow: wrap
fig-width: 8
fig-height: 4
code-fold: true
toc: true
overflow: wrap
toc-depth: 1
toc-location: left
embed-resources: true # single html file
anchor-sections: true
smooth-scroll: true
execute:
tbl-cap-location: bottom
jupyter: python3
---
# Dataset and libraries used for the analysis
Below is a list of libraries used in conjunction with the R programming language, Quarto, and Jupyter for the creation of this statistical analysis.
All datasets and libraries needed were imported at the start of the quarto project. Data was imported into a GitHub Repository to access it directly. [Access the repository](https://github.com/biotechdesigner/quinoa-analysis-coursework/tree/main). The raw data `data` was selected to include only the traits desired. This data is referred to as `selected_data` in the project code. The dataset `data` was modified to treat missing data as described in @sec-miss. This data is referred in this project as `imputed_data`
```{python data_libraries_import}
#| output: false
import math
import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from itables import options, show
# Import data
data = pd.read_csv('btk_de_basil_2024.csv')
# Convert 'TimeStamp' column to datetime
data['TimeStamp'] = pd.to_datetime(data['TimeStamp'])
# Get list of treatments in the dataset
treatments = data['Treatment'].unique()
# Group data by treatment
grouped_data_treatment = data.groupby('Treatment')
# Get numerical variables
numerical_variables = data.select_dtypes(include=['float64', 'int64']).columns.tolist()
# Get categorical variables, excluding 'TimeStamp'
categorical_variables = [col for col in data.select_dtypes(include=['object', 'category']).columns if col != 'TimeStamp']
# Descriptive analysis for numerical variables
numerical_descriptive_stats = {var: grouped_data_treatment[var].describe().transpose() for var in numerical_variables}
# Descriptive analysis for categorical variables
categorical_descriptive_stats = {}
for var in categorical_variables:
counts = data.pivot_table(index='Treatment', columns=var, aggfunc='size', fill_value=0)
categorical_descriptive_stats[var] = counts
# Function to create a descriptive analysis of a variable by treatment
def descriptive_analysis(variable):
if variable in numerical_descriptive_stats:
analysis = numerical_descriptive_stats[variable]
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
print(f"Descriptive analysis for numeric variable '{variable}':")
display(analysis)
elif variable in categorical_descriptive_stats:
analysis = categorical_descriptive_stats[variable]
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
print(f"Descriptive analysis for categorical variable '{variable}':")
display(analysis)
else:
print(f"The variable '{variable}' is not in the dataset")
# Example usage
# descriptive_analysis('PlantHeight')
# Show lists of variables and treatments
print("Treatments available:", treatments)
print("Numeric variables:", numerical_variables)
print("Cualitative variables", categorical_variables)
# Only include the data of the final date of the experiment for data analysis
final_date = '2024-05-25'
data_may25 = data[data['TimeStamp'] == final_date]
# data.head()
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
grouped_data_treatment.head()
```
# test
```{python test}
#| label: tbl-test
#| #| tbl-cap: "Descriptive analysis of Variables"
#| warning: false
#| scrollable: true
#| output: true
descriptive_analysis('PlantHeight')
```
# Data Description
The dataset contains measurements from 360 quinoa accessions, detailing various seed traits retreived from @craine-2023. The data includes continuous measurements such as lysine content (measured in mg/g), yield (measured in g/plant) and Thousand Seed Weight, or TSW (measured in grams).
# Statistical analysis
## Plant Height
### Descriptive analysis
```{python descriptive-analysis-plantheight}
#| label: tbl-descriptive-analysis-plantheight
#| tbl-cap: "Descriptive analysis of Plant Height"
#| warning: false
#| scrollable: true
#| output: true
#todo add standard error
descr_analysis_PlantHeight = grouped_data_treatment['PlantHeight'].describe().transpose()
# Set display options to show all columns and rows
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
# Print the full descriptive statistics table
show(descr_analysis_PlantHeight, "")
```
### Normality Test
Debido a que se tiene una muestra total
```{python normality-test-plantheight}
#| label: tbl-normality-test-plantheight
#| tbl-cap: "Normality test of Plant Height by treatment"
#| warning: false
#| output: true
#todo add dispersion graphic
from scipy.stats import shapiro, kstest, anderson
import scipy.stats as stats
normality_results = {}
variables = ['PlantHeight', 'StemDiameter', 'LeafArea', 'LeafNumber']
# Normality test for every treatment and variable
for Treatment, group in grouped_data_treatment:
normality_results[Treatment] = {}
for variable in variables:
# filter non-null data
valid_data = group[variable].dropna()
if len(valid_data) > 0: # Validation that there is enough data to test
# Shapiro-Wilk test
shapiro_test = shapiro(valid_data)
# Kolmogorov-Smirnov test
ks_test = kstest(valid_data, 'norm')
# Anderson-Darling test
anderson_test = anderson(valid_data)
# Save results
normality_results[Treatment][variable] = {
'Shapiro-Wilk': shapiro_test,
'Kolmogorov-Smirnov': ks_test,
'Anderson-Darling': anderson_test
}
# Show results
#import pprint
#pprint.pprint(normality_results)
#print(normality_results)
def plot_histogram_qq(data, treatment, variable, normality_results):
valid_data = data[data['Treatment'] == treatment][variable].dropna()
if len(valid_data) > 0:
# Histograma
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.hist(valid_data, bins=10, color='blue', edgecolor='black', alpha=0.7)
plt.title(f'Histogram of Frequency for {variable} ({treatment})')
plt.xlabel(variable)
plt.ylabel('Frecuency')
# Gráfico Q-Q
plt.subplot(1, 2, 2)
stats.probplot(valid_data, dist="norm", plot=plt)
plt.title(f'Q-Q Graphic of {variable} ({treatment})')
plt.xlabel('Normal Theoretical quantiles')
plt.ylabel('Normal Data Quantiles')
# Obtener resultados de los tests
shapiro_res = normality_results[treatment][variable]['Shapiro-Wilk']
ks_res = normality_results[treatment][variable]['Kolmogorov-Smirnov']
anderson_res = normality_results[treatment][variable]['Anderson-Darling']
# Crear texto descriptivo
description = (f"Shapiro-Wilk p-value: {shapiro_res.pvalue:.3e}\n"
f"Kolmogorov-Smirnov p-value: {ks_res.pvalue:.3e}\n"
f"Anderson-Darling statistic: {anderson_res.statistic:.3f}")
# Add descriptive text to histogram
plt.subplot(1, 2, 1)
plt.text(0.95, 0.95, description, transform=plt.gca().transAxes, fontsize=10,
verticalalignment='top', horizontalalignment='right', bbox=dict(facecolor='white', alpha=0.5))
# Show graphics
plt.tight_layout()
plt.show()
# Normality test
print("Normality analysis")
print(f"Shapiro-Wilk p-value: {shapiro_res.pvalue:.3e} ({'Not enough evidence to reject' if shapiro_res.pvalue > 0.05 else 'Rejects'} Normality hipothesis)")
print(f"Kolmogorov-Smirnov p-value: {ks_res.pvalue:.3e} ({'Not enough evidence to reject' if ks_res.pvalue > 0.05 else 'Rejects'} Normality hipothesis)")
print(f"Anderson-Darling statistic: {anderson_res.statistic:.3f} (Compare with critical values: {anderson_res.critical_values})")
else:
print(f"Not enough data for {variable} in treatment {treatment}")
# Example of use for treatment B100 and PlantHeight Variable
plot_histogram_qq(data, 'B80', 'PlantHeight', normality_results)
```
### Frequency Distribution
```{python frequency-distribution-plantheight}
#| label: fig-frequency-distribution-plantheight
#| tbl-cap: "Frequency Distribution of Plant Height"
#| warning: false
#| output: true
fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(18, 12))
axes = axes.flatten()
for ax, (treatment, group) in zip(axes, grouped_data_treatment):
ax.hist(group['PlantHeight'], bins=10, edgecolor='black', alpha=0.7)
ax.set_title(f'Frequency Distribution of Plant Height for Treatment {treatment}')
ax.set_xlabel('Plant Height')
ax.set_ylabel('Frequency')
ax.grid(axis='y', alpha=0.75)
# Hide any empty subplots
for i in range(len(grouped_data_treatment), len(axes)):
fig.delaxes(axes[i])
# Adjust layout
plt.tight_layout()
# Save the combined figure as an image
plt.savefig('combined_histograms.png')
# Show the combined figure
plt.show()
```
## Stem Diameter
### Descriptive analysis
```{python descriptive-analysis-stemdiameter}
#| label: tbl-descriptive-analysis-stemdiameter
#| tbl-cap: "Descriptive analysis of Stem Diameter"
#| warning: false
#| scrollable: true
#| output: true
descr_analysis_StemDiam = grouped_data_treatment['StemDiameter'].describe().transpose()
# Set display options to show all columns and rows
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
# Print the full descriptive statistics table
show(descr_analysis_StemDiam, "")
```
## Missing Data Assessment {#sec-miss}
There is not really missing data, the cuantitative variables of the plant were measured every week, but the water stress incidents were taken every day because irrigation was made everyday by hand, so pandas library is used because it can handle missing data without the need to manipulate the data sheet.
## Frequency Distribution Analysis {#sec-dis}
```{python histogram_plot}
#| label: frequency--distribution
#| fig-cap: "Histogram plot of the frequency distribution of cuantitative variables"
#| warning: false
```
From @fig-lysine-distribution, it is visually inferred that the data seems to be approximately normally distributed with a single peak and symmetric shape, however, it doesn´t follow exactly the bell curve, so it is needed to do a nromality test to make sure how the data is distributed.
## Normality Test
```{r normal_distribution}
#| label: tbl-normal-distribution
#| tbl-cap: "Shapiro_Wilk test results for lysine content"
#| warning: false
#Test for normal distribution
normality <- shapiro.test(imputed_data$Lysine.14.mgG)
result <- data.frame(
W = normality$statistic,
Pvalue = normality$p.value
)
kable(result, align = "llr")
```
The Shapiro-Wilk normality test has a p-value of almost 0 which is far below any conventional alpha level (e.g., 0.05). despite the high W value, the test finds significant evidence to suggest that the lysine content does not follow a normal distribution. For that reason, this will be needed to take into account the correlation and regression analyses.
## Correlation analysis {#sec-corr}
```{r correlation_analysis_yield}
#| label: fig-correlation_yield
#| fig-cap: "Correlation analysis between Lysine content and Yield traits"
#| warning: false
#| fig.width: 8
#| fig.height: 4
#Add a column in imputed data with Lysine content transformed to logaritmic values
imputed_data$Lysine.14.mgG_log <- log(imputed_data$Lysine.14.mgG)
#Correlation analysis
cor_value_yield <- cor(imputed_data$Lysine.14.mgG_log, imputed_data$Yield_g,
method = "spearman")
# Correlation plot
ggplot(imputed_data, aes(x = Lysine.14.mgG_log, y = Yield_g)) +
geom_point(shape = 21, fill = '#0f993d', color = 'white', size = 3) +
annotate("text", x = Inf, y = Inf, label = paste("Spearman Correlation: ",
round(cor_value_yield, 5)),
hjust = 1.1, vjust = 1.1) +
geom_smooth(method = "lm", se = FALSE) +
labs(x = "Lysine content log(mg/g)", y = "Yield (g/plant)")
```
### Correlation of Lysine content and TSW
```{r correlation_analysis_tsw}
#| label: fig-correlation_tsw
#| fig-cap: "Correlation analysis between Lysine content and Thousand Seed Weight (TSW)"
#| warning: false
#Correlation analysis
cor_value_tsw <- cor(imputed_data$Lysine.14.mgG_log, imputed_data$TSW,
method = "spearman")
# Correlation plot
ggplot(
imputed_data, aes(x = Lysine.14.mgG_log, y = TSW)) +
geom_point(shape = 21, fill = '#0f993d', color = 'white', size = 3) +
annotate("text", x = Inf, y = Inf, label = paste("Spearman Correlation: ",
round(cor_value_tsw, 5)), hjust = 1.1, vjust = 1.1) +
geom_smooth(method = "lm", se = FALSE) +
labs(x = "Lysine content log(mg/g)", y = "TSW (g)"
)
```
As can be seen, the correlation level between yield variables and lysine content is moderately positive (0.30 with yield and 0.39 with TSW) similar as reported in @craine-2023, indicating that when the lysine content in quinoa seeds is higher, the yield and TSW will also be higher.
## Regression analysis {#sec-reg}
### Regression analysis and ecuation of Lysine content and Yield
This analysis is crucial for understanding the relationship between these two variables, indicating whether higher lysine content correlates with higher yield. In @tbl-regression_yield, it is observed that the p-value is less than 0.05, leading to the rejection of the null hypothesis and the conclusion that there is a significant relationship between lysine content and yield. Furthermore, the R^2 value indicates that approximately 10% of the variability in Yield_g is explained by the model. This suggests that there are other factors influencing the yield that are not accounted for in this analysis. Additionally, this analysis is based on a single value derived from many, so it was expected that lysine content would not explain the entire model. For this same reason, the reported regression equation will not be as precise for determining yield values.
```{r regression_analysis_yield}
#| label: tbl-regression_yield
#| tbl-cap: "Regression analysis between Lysine content and Yield"
#| warning: false
#Regression analysis
reg1_yield <- lm(Yield_g ~ Lysine.14.mgG_log, data = imputed_data)
sum_reg1_yield <- summary(reg1_yield)
broom_yield_summary <- broom::glance(sum_reg1_yield)
knitr::kable(broom_yield_summary, align = "llllllrr")
```
```{r coef_analysis_yield}
#| label: coef_yield
#| warning: false
#| cap: "Regression ecuation between Lysine content and Yield"
#Regression and coefficient analysis
coefs_yield <- coef(reg1_yield)
paste("Y =", coefs_yield[1], "+", coefs_yield[2], "* X")
```
### Regression analysis and ecuation of Lysine content and Thousand Seed Weight TSW
The outcome of this analysis delineates the potential relationship between lysine content and Thousand Seed Weight (TSW). Similar to the previous regression analysis, the null hypothesis can be rejected, leading to the conclusion that there is a highly significant relationship between lysine content and TSW. However, the R^2 value is quite low (13%), indicating that there are other factors affecting TSW that are not considered in this analysis, which was to be expected.
```{r regression_analysis_tsw}
#| label: tbl-regression_tsw
#| tbl-cap: "Regression analysis between Lysine content and Thousand Seed Weight (TSW)"
#| warning: false
#Regression analysis
reg1_tsw <- lm(TSW ~ Lysine.14.mgG_log, data = imputed_data)
sum_reg1_tsw <- summary(reg1_tsw)
broom_tsw_summary <- broom::glance(sum_reg1_tsw)
knitr::kable(broom_tsw_summary, align = "llllllrr")
```
```{r coef_analysis_tsw}
#| label: coef_tsw
#| warning: false
#| cap: "Regression ecuation between Lysine content and Thousand Seed Weight (TSW)"
#Regression and coefficient analysis
coefs_yield <- coef(reg1_tsw)
paste("Y =", coefs_yield[1], "+", coefs_yield[2], "* X")
```
# Discussion
The analysis of quinoa seed traits, particularly focusing on lysine content, yield, and thousand seed weight (TSW), provides some insights into quinoa's genetic diversity and agricultural potential. In the initial phase of the analysis, a significant proportion of data pertaining to lysine content was found to be missing in @fig-missing-data-interest, presenting a substantial challenge in the data evaluation process. This issue could arise from a multitude of factors, each with varying implications. Consequently, a decision was made to retain the data rows and employ multiple imputation techniques. This approach aimed to provide an alternative perspective to the analyses previously conducted by @craine-2023. Following the implementation of multiple imputation, the deviation of the dataset from a normal distribution suggested potential specific influences on the yield and Thousand Seed Weight (TSW) of the quinoa seeds as it can be seen in @tbl-normal-distribution. Subsequent regression and correlation analyses, focusing on the lysine content in relation to yield and TSW, confirmed the existence of a moderate positive correlation, as seen in @fig-correlation_yield and @fig-correlation_tsw. It was observed that lysine content in quinoa seeds positively impacts both yield and TSW. Nonetheless, it is important to recognize that this is not the only influential factor in the model construction, which was confirmed in @tbl-regression_yield and @tbl-regression_tsw with the R^2 value. In conclusion, a comprehensive analysis incorporating other characteristics within the dataset is essential to fully understand the myriad factors influencing the yield of quinoa seeds.
# References
::: {#refs}
:::