-
Notifications
You must be signed in to change notification settings - Fork 5
/
homework7_Rebecca_Nora-gaussian_distribution.Rmd
144 lines (106 loc) · 6.04 KB
/
homework7_Rebecca_Nora-gaussian_distribution.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
---
title: "The Gaussian Distribution"
author: "Nora Krebs and Rebecca Amberger"
date: "7 January 2019"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Task of the week
A group of geoscience students use a tape to measure the long axis of a coral boulder stranded on a sandy beach. The students have collected *n* data points of the axis diameter *x*. We know that the tape measure is only accurate to a certain point, and assume, for the sake of simplicity, that all measurements come from the same Gaussian distribution. We express the spread of measurements with the distribution's variance $\sigma^2$.
Given *n* measurements, what is the boulder's most believable actual size?
## Gaussian Distribution
To solve the problem, we have to compute the most likely diameter of the coastal boulder. This can be done by using the Gaussian distribution at a fixed variance in Bayes' Theorem calculations.
The Gaussian Distribution can be described by the following equation:
$$p(x|\mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma}} e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^2}$$
The individual terms are referred to as following:
$\mu$ is the mean\
$\sigma$ is the standard deviation\
$\sigma^2$ is the variance\
$\sigma^{-2}$ is the precision
A standard Normal Distribution is achieved, if $\mu = 0$ and $\sigma = 1$.
## Coding
To calculate the unknown mean $\mu$, which is the actual diameter of the coral boulder's long axis, we use Bayes' Theorem.\
Luckily the true mean $\mu$ [m] is already defined as 5.2 so we can adjust the prior for the calculation of Bayes' Theorem and compare the results.
```{r}
# Define true mean 'mu' of Gaussian distribution equation
mu <- 5.2
```
Now we define the constant variance $\sigma^2$ [m^2^], which is known and expresses the accuracy of the tape measurement.
```{r}
# Set known variance of Gaussians noise
sig <- 0.1
```
With the command rnorm() we create *n* random data points (student measurements) at a Gaussian Distribution with a mean of $\mu$. The standard deviation in the command can be calculated by the square root of the variance.
```{r}
# Create 'n' data points (measurements of students)
n <- 10
# Random values of a normal distribution --> rnorm
dat <- rnorm(n, mu, sqrt(sig))
# Plot a probability density estimate of the data
plot(density(dat), col = "green", lwd = 2,
main = "Random Gaussian Data, known variance")
```
### Prior
To calculate the prior of Bayes' Theorem, we have to specify our prior on $\mu$. Here, we assume a realistic range for the maximum boulder diameter between 3.5 m and 6 m with steps of 0.1 which results in 26 values.
```{r}
# Set range of prior values
#(background knowledge that boulders on this site are between 3.5 and 6 --> from current studies)
prior_mu <- seq(3.5, 6, 0.1)
```
Now we use the runif() command to allocate a random probability weight to each of the 26 values. We are aware that this is not an ideal assumption, since a more accurate probability distribution should be known for the prior. However, a random distribution will help us to evaluate the validity of Bayes' Theorem in the case the given problem.
By renormalisation, we guarantee that all weights add up to 1.
```{r}
# Set weights for prior values
weights_mu <- runif(length(prior_mu))
# Renormalise prior (to obtain a PDF)
p_mu <- weights_mu / sum(weights_mu)
# Plot prior
plot(p_mu ~ prior_mu, type = "h", col = "red",
xlab = "boulder diameter", ylab = "Prior")
```
### Likelihood
Now we want to calculate the likelihood of each prior rock diameter.
```{r}
# Creation of a vector with 26 digits
likeli <- rep(NA, length(prior_mu))
# The likelihood is then calculated in a loop, for each possible boulder diameter individually. If we assume that the student diameter measurements are independent of each other, we can compute the likelihood of the sequence by multiplying their individual probabilities. The resulting vector will then contain the probabilities of each case of diameters between 3.5 and 6.0, given the data, measured by students.
for (i in seq_along(prior_mu)) {
likeli[i] <- prod(dnorm(dat, mean = prior_mu[i], sd = sqrt(sig)))
}
# To illustrate the outcome of this step, we plot a graph that shows the calculated likelihood of each possible diameter.
plot(likeli ~ prior_mu, type = "h", col = "red",
xlab = "boulder diameter", ylab = "Likelihood")
# We also want to give show the outcome of the most likely diameter, computed in the current session.
mlike <- which.max(likeli)
prior_mu[mlike]
```
### Posterior
To calculate the posterior, we multiply the likelihood with the prior and renormalise the posterior, so all probabilities add to 1.
```{r}
# Calculating posterior
posterior <- likeli * p_mu
# Renormalise the posterior
posterior <- posterior / sum(posterior)
```
```{r}
# Our final outcome is illustrated in the following plot.
plot(posterior ~ prior_mu, type = "h", col = "red",
xlab = "boulder diameter", ylab = "Posterior")
# To see, which rate is most likely, we extract it by the "max()" command.
mposterior <- which(c(posterior) == max (posterior))
prior_mu[mposterior]
```
## Evaluation of the results
The most likely rock diameter is changing with every generated session, but varies generally around a value of 5.2. The generated random probabilities of the prior therefore highly affect the outcome, although they don´t disturb it. To improve the accuracy of the outcome, the prior should be weighted in a uniform way.
In general Bayes' Theorem is a good approach to the solution of the introduced problem. Given that n measurements were taken, the code could be adjusted to different numbers of measurements.
To analyse the outcome of the session, we want to compare the probability of the most likely diameter and the probability of a diameter of 5.2.
```{r}
#probability of the most likely diameter [%]:
mp <- max(posterior)*100
round(mp, 2)
#probability of a diameter of 5.2m [%]:
p <- posterior[18]*100
round(p,2)
```