Merge pull request #67 from saundersg/master
Lesson 8 Summaries update
saundersg authored Sep 19, 2023
2 parents 1dd4733 + 0473588 commit 1a63193
Showing 29 changed files with 565 additions and 264 deletions.
Binary file added Images/normalShadingSituations.png
10 changes: 6 additions & 4 deletions Lesson01_Campus.Rmd
@@ -467,11 +467,13 @@ As with all the classes you take at BYU-Idaho, it is up to you to decide what yo
<div class="SummaryHeading">Remember...</div>
<div class="Summary">

- In this class you will use the online textbook that has been written for you by your statistics teachers. All of the assignments and quizzes will be based on the readings, so study it well.
- Most weeks will cover two lessons
- By doing the work, staying on schedule, and living the Honor Code you *will* succeed in this class!

- The three **rules of probability** are:
1. In this class you will use the online textbook that has been written for you by your statistics teachers. All of the assignments and quizzes, available in I-Learn, will be based on the readings, so study it well. Most weeks will cover two lessons.
2. You have successfully located the online textbook. Ensure you have also located the course in I-Learn and can access the quizzes and assignments that are there.
3. Ensure you have located the contact information for your instructor in the I-Learn course. Recording the contact information of peers from class would also be a wise idea.
4. This course uses MS Excel for all statistical analysis. Check that you have access to the software on your computer. If not, see I-Learn for details on how to obtain it through the University for free.
5. By doing the work, staying on schedule, and living the Honor Code you *will* succeed in this class.
6. The three **rules of probability** are:
1. A probability is a number between 0 and 1.
$$0 \leq P(X) \leq 1$$
2. If you list all the outcomes of a probability experiment (such as rolling a die), the probability that one of these outcomes will occur is 1. In other words, the sum of the probabilities of all the outcomes is 1.
24 changes: 14 additions & 10 deletions Lesson02.Rmd
@@ -468,7 +468,7 @@ As an example of a convenience sample, an auditor could haphazardly select items

<br>

### Types of Variables
### Types of Data

Whenever we collect data, we record information about the things we are studying. There are two basic types of data that can be recorded: quantitative measurements and categorical labels. We will call these types of data simply "quantitative" or "categorical" variables. We use the word "variable" to denote the idea that the quantitative measurements or categorical labels can vary from person to person, or item to item, in our study.

@@ -562,18 +562,22 @@ Since there was not enough evidence to suggest that ImmunAvance's financial stat
<div class="SummaryHeading">Remember...</div>
<div class="Summary">

- The **Statistical Process** has five steps: **D**esign the study, **C**ollect the data, **D**escribe the data, **M**ake inferences, **T**ake action.
1. The **Statistical Process** has five steps: **D**esign the study, **C**ollect the data, **D**escribe the data, **M**ake inference, **T**ake action. These can be remembered by the mnemonic "**D**aniel **C**an **D**iscern **M**ore **T**ruth."

- In a **designed experiment**, researchers control the conditions of the study. In an **observational study**, researchers don't control the conditions but only observe what happens.
2. In a **designed experiment**, researchers control the conditions of the study, typically with a treatment group and a control group, and then observe how the treatments impact the subjects. In a purely **observational study**, researchers don't control the conditions but only observe what happens.

- There are many sampling methods used to obtain a **sample** from a **population**:
+ A **simple random sample (SRS)** is a random selection taken from a population
+ A **systematic sample** is every ***k***<sup>th</sup> item in the population, beginning at a random starting point
+ A **cluster sample** is all items in one or more randomly selected clusters, or blocks
+ A **stratified sample** divides data into similar groups and an **SRS** is taken from each group
+ A **convenience sample** is one easily obtained in a less-than-systematic way and should be avoided whenever possible
3. The **population** is the entire group of all possible subjects that could be included in the study. The **sample** is the subset of the population that is actually selected to participate in the study. Statistics use information from the sample to make claims about what is true about the entire population.

- **Quantitative variables** represent things that are numeric in nature, such as the value of a car or the number of students in a classroom. **Categorical variables** represent non-numerical data that can only be considered as labels, such as colors or brands of shoes.
4. There are many sampling methods used to obtain a **sample** from a **population**. The best methods use some sort of randomness (like pulling names out of a hat, rolling dice, flipping coins, or using a computer-generated list of random numbers) to avoid bias.
a. A **simple random sample (SRS)** is a random sample taken from the full list of the population. This is the least biased (best) sampling method, but can only be implemented when a full list of the population is accessible.
b. A **stratified sample** divides the population into similar groups and then takes an **SRS** from each group. The main reason to use this sampling method is when a study wants to compare and contrast certain groups within the population, say to compare freshmen, sophomores, juniors, and seniors at a university.
c. A **systematic sample** samples every ***k***<sup>th</sup> item in the population, beginning at a random starting point. This is best applied when subjects are lined up in some way, like at a fast food restaurant, an airport security line, or an assembly line in a factory.
d. A **cluster sample** consists of taking all items in one or more randomly selected clusters, or blocks. For example, ecologists could draw grids on a map of a forest to create small sampling regions and then sample all trees they find in a few randomly selected regions. Note that this differs from a stratified sample in that only a few sub-groups (clusters) are selected and that all subjects within the selected clusters are included in the study.
e. A **convenience sample** involves selecting items that are relatively easy to obtain and *does not use random selection* to choose the sample. This method of sampling can be assumed to *always bring bias* into the sample.

5. The best way to avoid bias when trying to make conclusions about a population from a single sample of that population is to use a random sampling method to obtain the sample.

6. **Quantitative variables** represent things that are numeric in nature, such as the value of a car or the number of students in a classroom. **Categorical variables** represent non-numerical data that can only be considered as labels, such as colors or brands of shoes.

</div>
<br>
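
The four random sampling methods in the summary above can be sketched in a few lines of Python. This is only an illustration and not part of the lesson, which uses Excel: the function names, the ID numbers 1 to 100, and the four class-year groups are all hypothetical.

```python
import random

def simple_random_sample(population, n, rng):
    """SRS: every item on the full population list has the same chance of selection."""
    return rng.sample(population, n)

def stratified_sample(strata, n_per_group, rng):
    """Take a separate SRS from every group (stratum)."""
    return {group: rng.sample(members, n_per_group) for group, members in strata.items()}

def systematic_sample(population, k, rng):
    """Every k-th item, beginning at a random starting point among the first k."""
    start = rng.randrange(k)
    return population[start::k]

def cluster_sample(clusters, n_clusters, rng):
    """Randomly pick a few clusters and keep every item inside them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [item for name in chosen for item in clusters[name]]

rng = random.Random(2023)            # fixed seed so the sketch is repeatable
population = list(range(1, 101))     # hypothetical ID numbers 1..100
groups = {                           # hypothetical split into four class years
    "freshmen": population[0:25],
    "sophomores": population[25:50],
    "juniors": population[50:75],
    "seniors": population[75:100],
}

print(simple_random_sample(population, 10, rng))
print(stratified_sample(groups, 3, rng))
print(systematic_sample(population, 10, rng))
print(cluster_sample(groups, 2, rng))
```

Note the contrast the summary draws: the stratified sample takes a few items from every group, while the cluster sample takes every item from a few groups.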
15 changes: 11 additions & 4 deletions Lesson03.Rmd
@@ -169,7 +169,7 @@ To help us visualize these data, we will create a graph called a histogram. To m
</table>
</center>

For each of these intervals, we draw a bar on the histogram. The width of the bars is determined by the width of the interval (5000 in this example). The height of the bars is equal to the number of observations that fall in each interval. As we look at the histogram shown below, we see bars ranging from \$0 to \$35,000. We also see higher bars in the middle between \$10,000 to \$20,000 show that these values are more commonly occurring than the other values. If we computed the average of the values contained in our histogram, we would compute the number
For each of these intervals, called *bins*, we draw a bar on the histogram. The width of the bars is determined by the width of the bin (5000 in this example). The height of the bars is equal to the number of observations that fall in each bin. As we look at the histogram shown below, we see bars ranging from \$0 to \$35,000. The higher bars in the middle, between \$10,000 and \$20,000, show that these values occur more often than the other values. If we computed the average of the values contained in our histogram, we would compute the number
$$
\frac{15,100 + 19,000 + 4,800 + 6,500 + 14,900 + 600 + 23,500 + 11,500 + 12,900 + 32,200}{10} = 14,100
$$
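
As a rough cross-check of the binning and the average above, the following Python sketch counts the ten values into bins of width 5000 and recomputes the mean. Python is used only for illustration (the lesson itself uses Excel). `bin_counts` is a hypothetical helper, and treating each bin as including its lower edge but not its upper edge is an assumption, since the lesson's table is not shown here.

```python
from statistics import mean

# The ten values from the average computed above.
values = [15100, 19000, 4800, 6500, 14900, 600, 23500, 11500, 12900, 32200]

def bin_counts(data, width, low, high):
    """Count the observations in each bin [start, start + width)."""
    counts = {}
    for start in range(low, high, width):
        counts[(start, start + width)] = sum(start <= x < start + width for x in data)
    return counts

counts = bin_counts(values, width=5000, low=0, high=35000)

# A text histogram: one '#' per observation in the bin.
for (start, end), count in counts.items():
    print(f"${start:>6,} to ${end:>6,}: {'#' * count}")

print(mean(values))  # 14100
```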
@@ -512,12 +512,19 @@ Even though measures of center are important, we need to consider the shape, cen
<div class="SummaryHeading">Remember...</div>
<div class="Summary">

1. Histograms are created by dividing the number line into several equal parts, starting at or below the minimum value occurring in the data and ending at or above the maximum value in the data. The number of data points occurring in each interval (called a bin) is counted. A bar is then drawn for each bin so that the height of the bar shows the number of data points contained in that bin.

- A **histogram** allows us to visually interpret data. Histograms can be left-skewed, right-skewed, or symmetrical and bell-shaped.
2. A **histogram** allows us to visually interpret data to quickly recognize which values are most common and which values are least common in the data.

- The **mean**, **median**, and **mode** are measures of the center of a distribution. The mean is the most common measure of center and is computed by adding up the observed data and dividing by the number of observations in the data set.
3. Histograms can be **left-skewed** (the majority of the data is on the right of the histogram, less common values stretch to the left side), **right-skewed** (majority of the data is on the left side with less common values stretching to the right), or **symmetrical and bell-shaped** (most data is in the middle with less common values stretching out to either side).

- A **parameter** is a true (but usually unknown) number that describes a population. A **statistic** is an estimate of a parameter obtained from a sample of the population.
4. The **mean**, **median**, and **mode** are measures of the center of a distribution. The mean is the most common measure of center and is computed by adding up the observed data and dividing by the number of observations in the data set. The median represents the 50th percentile in the data. The mean can be calculated in Excel using `=AVERAGE(...)`, the median by using `=MEDIAN(...)`, and the mode by `=MODE(...)` where the `...` in each case consists of the cell references that highlight the data.

5.

6. In a symmetrical and bell-shaped distribution of data, the mean, median, and mode are all roughly the same in value. However, in a skewed distribution, the mean is strongly influenced by outliers and tends to be pulled in the direction of the skew. In a left-skewed distribution, the mean will tend to be to the left of the median. In a right-skewed distribution, the mean will tend to be to the right of the median.

7. A **parameter** is a true (but usually unknown) number that describes a population. A **statistic** is an estimate of a parameter obtained from a sample of the population.

</div>
<br>
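
A small made-up example may help show how skew pulls the mean away from the median. The two data sets below are invented for illustration and do not come from the lesson. Python's `statistics` functions stand in for Excel's `=AVERAGE(...)`, `=MEDIAN(...)`, and `=MODE(...)`.

```python
from statistics import mean, median, mode

symmetric = [1, 2, 3, 3, 3, 4, 5]
right_skewed = [1, 2, 3, 3, 3, 4, 40]   # one large value stretches the right tail

for name, data in [("symmetric", symmetric), ("right-skewed", right_skewed)]:
    print(f"{name}: mean={mean(data)}, median={median(data)}, mode={mode(data)}")
```

In the symmetric data all three measures equal 3. Replacing the largest value with 40 leaves the median and mode at 3 but pulls the mean up to 8, to the right of the median.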
16 changes: 14 additions & 2 deletions Lesson04.Rmd
@@ -1317,9 +1317,21 @@ Additional formatting can be applied to further improve the appearance of the ch
<div class="SummaryHeading">Remember...</div>
<div class="Summary">

- The **standard deviation** is a number that describes how spread out the data are. A larger standard deviation means the data are more spread out than data with a smaller standard deviation.
1. A **percentile** is calculated in Excel using `=PERCENTILE(..., 0.#)` where the `0.#` is the percentile written as a decimal number. So the 20th percentile would be written as 0.2.

- **Quartiles/percentiles**, **Five-Number Summaries**, and **Boxplots** are tools that help us understand data. The five-number summary of a data set contains the minimum value, the first quartile, the median, the third quartile, and the maximum value. A boxplot is a graphical representation of the five-number summary.
2. A **percentile** is a number such that a specified percentage of the data are at or below this number. For example, if 80% of college students were 70 inches tall or shorter, then the 80th percentile of heights of college students would be 70 inches.

3. **Standard deviation** is calculated in Excel for a sample of data using `=STDEV.S(...)`.

4. The **standard deviation** is a number that describes how spread out the data typically are from the mean of that data. A larger standard deviation means the data are more spread out from their mean than data with a smaller standard deviation. The standard deviation is never negative. A standard deviation of zero implies all values in the data set are exactly the same.

5. To compute any of the **five-number summary** values in Excel, use the Excel function `=QUARTILE.INC(..., #)` where `#` is either a 0 (gives the minimum), 1 (gives the first quartile), 2 (gives the second quartile, i.e., median), 3 (gives the third quartile), or 4 (gives the maximum).

6. The **five-number summary** consists of (1) the minimum value in the data, (2) the first quartile (25th percentile) of the data, (3) the median of the data (50th percentile), (4) the third quartile (75th percentile) of the data, and (5) the maximum value occurring in the data.

7. To create a **boxplot** in Excel, highlight the data, go to *Insert* on the menu ribbon, choose the *histogram icon*, and select the *Boxplot* option from the menu that appears.

8. **Boxplots** are a visualization of the five-number summary of a data set.

</div>
<br>
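
For readers who want to check these summaries outside Excel, here is a minimal Python sketch. It assumes that the `"inclusive"` method of `statistics.quantiles` interpolates the same way as `=QUARTILE.INC(...)` and `=PERCENTILE(...)`, so treat any agreement with Excel as something to verify rather than a guarantee. The ten data values are illustrative, and `five_number_summary` and `percentile` are hypothetical helper names.

```python
from statistics import quantiles, stdev

# Ten illustrative values (not tied to any particular exercise).
data = [15100, 19000, 4800, 6500, 14900, 600, 23500, 11500, 12900, 32200]

def five_number_summary(values):
    """Return (minimum, first quartile, median, third quartile, maximum)."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

def percentile(values, p):
    """Return the p-th percentile for a whole number p from 1 to 99."""
    return quantiles(values, n=100, method="inclusive")[p - 1]

print(five_number_summary(data))  # (600, 7750.0, 13900.0, 18025.0, 32200)
print(percentile(data, 20))       # 6160.0
print(stdev(data))                # sample standard deviation, like =STDEV.S(...)
print(stdev([5, 5, 5]))           # 0.0, since all values are exactly the same
```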
13 changes: 8 additions & 5 deletions Lesson05.Rmd
@@ -536,16 +536,19 @@ One way to get the Format Axis menu to appear is to right-click anywhere on the 
<div class="SummaryHeading">Remember...</div>
<div class="Summary">

- A **normal density curve** is symmetric and bell-shaped. The curve lies above the horizontal axis and the total area under the curve is equal to 1.
1. A **normal density curve** is symmetric and bell-shaped with a mean of $\mu$ and a standard deviation of $\sigma$. The curve lies above the horizontal axis and the total area under the curve is equal to 1. A **standard normal distribution** has a mean of 0 and a standard deviation of 1.

- A **standard normal distribution** has a mean of 0 and a standard deviation of 1. The **68-95-99.7% rule** states that when data are normally distributed, approximately 68% of the data lie within $z=1$ standard deviation ($\sigma$) from the mean, approximately 95% of the data lie within $z=2$ standard deviations from the mean, and approximately 99.7% of the data lie within $z=3$ standard deviations from the mean.
2. A **z-score** is calculated as: $\displaystyle{z = \frac{\text{value}-\text{mean}}{\text{standard deviation}} = \frac{x-\mu}{\sigma}}$

- A **z-score** tells us how many standard deviations away from the mean a given value is. It is calculated as: $\displaystyle{z = \frac{\text{value}-\text{mean}}{\text{standard deviation}} = \frac{x-\mu}{\sigma}}$
3. A **z-score** tells us how many standard deviations above ($+Z$) or below ($-Z$) the mean ($\mu$) a given value ($x$) is.

- The [**Normal Probability Applet**](https://byuimath.com/apps/normprob.html) allows us to use z-scores to calculate proportions, probabilities, and percentiles of being "above," "below," or "between" certain values.
4. To calculate probabilities for an observation $x$, calculate the $z$-score using $\mu$, $\sigma$, and $x$ and then use the [Normal Probability Applet](https://byuimath.com/apps/normprob.html) to shade the appropriate area of the distribution for the desired probability. The area shaded depends on both the direction of interest (above, below, between) and the sign of the z-score as depicted in the images below. In every case, the **probability is** given by the **Area** box at the top of the applet.

- Percentiles can be calculated using the [**Normal Probability Applet**](https://byuimath.com/apps/normprob.html) by (1) shading the left tail only, (2) entering the desired percentile in the "Area" box, and (3) using the z-score from where the blue shaded region ends solve for $x$ in the equation $z=\frac{x-\mu}{\sigma}$.
<img src="./Images/normalShadingSituations.png">

5. The **68-95-99.7% rule** states that when data are normally distributed, approximately 68% of the data lie within $z=1$ standard deviation ($\sigma$) from the mean, approximately 95% of the data lie within $z=2$ standard deviations from the mean, and approximately 99.7% of the data lie within $z=3$ standard deviations from the mean. For example, this rule approximates that 2.5% of observations will be less than a z-score of $z=-2$.

6. Percentiles can be calculated using the [**Normal Probability Applet**](https://byuimath.com/apps/normprob.html) by (1) shading the left tail only, (2) entering the desired percentile in the "Area" box, and (3) using the z-score where the blue shaded region ends to solve for $x$ in the equation $z=\frac{x-\mu}{\sigma}$.
<br/>

<!-- - A **Q-Q plot** is used to assess whether or not a set of data is normally distributed. -->
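
The z-score, shaded-area, and percentile calculations summarized above can also be sketched in Python with the standard library's `NormalDist`. This is an illustrative alternative to the applet, not a description of how the applet works. The helper names and the example values $\mu = 100$ and $\sigma = 15$ are hypothetical.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma

def prob_below(z):
    """Area to the left of z."""
    return standard_normal.cdf(z)

def prob_above(z):
    """Area to the right of z."""
    return 1 - standard_normal.cdf(z)

def prob_between(z_low, z_high):
    """Area between two z-scores."""
    return standard_normal.cdf(z_high) - standard_normal.cdf(z_low)

def value_at_percentile(p, mu, sigma):
    """Find z with left-tail area p, then solve z = (x - mu) / sigma for x."""
    z = standard_normal.inv_cdf(p)
    return mu + z * sigma

# Hypothetical example: mean 100, standard deviation 15.
z = z_score(130, mu=100, sigma=15)
print(z, prob_above(z))

# The 68-95-99.7% rule, computed rather than memorized.
for k in (1, 2, 3):
    print(k, round(prob_between(-k, k), 4))

print(value_at_percentile(0.90, mu=100, sigma=15))
```

The exact area below $z=-2$ is a little under the 2.5% that the 68-95-99.7% rule suggests, which is why the rule is described as an approximation.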
13 changes: 8 additions & 5 deletions Lesson06.Rmd
@@ -364,15 +364,18 @@ The mean and standard deviation of $\bar{x}$ are:
<div class="SummaryHeading">Remember...</div>
<div class="Summary">

- The **distribution of sample means** is a distribution of all possible sample means ($\bar x$) for a particular sample size.
1. The **distribution of sample means** is a distribution of all possible sample means ($\bar x$) for a particular sample size.

- The **mean** of the distribution of sample means is the mean $\mu$ of the population: $\mu_{\bar{x}} = \mu$.
2. The **Central Limit Theorem** states that the sampling distribution of the sample mean will be approximately normal if the sample size $n$ of a sample is sufficiently large. In this class, $n\ge 30$ is considered to be sufficiently large.

- The **standard deviation** of the distribution of sample means is the standard deviation $\sigma$ of the population divided by the square root of $n$: $\sigma_{\bar{x}} = \sigma/\sqrt{n}$.
3. The **mean** of the distribution of sample means is the mean $\mu$ of the population: $\mu_{\bar{x}} = \mu$.

- The distribution of sample means is **normal** in either of two situations: (1) when the data is normally distributed **or** (2) when, thanks to the **Central Limit Theorem (CLT)**, our sample size ($n$) is large.
4. The **standard deviation** of the distribution of sample means is the standard deviation $\sigma$ of the population divided by the square root of $n$: $\sigma_{\bar{x}} = \sigma/\sqrt{n}$.

5. The distribution of sample means is **normal** in either of two situations: (1) when the data is normally distributed **or** (2) when, thanks to the **Central Limit Theorem (CLT)**, our sample size ($n$) is large.

6. The **Law of Large Numbers** states that as the sample size ($n$) gets larger, the sample mean ($\bar x$) will get closer to the population mean ($\mu$). This can be seen in the equation $\sigma_{\bar{x}} = \sigma/\sqrt{n}$: as $n$ increases, $\sigma_{\bar{x}}$ gets smaller.

- The **Law of Large Numbers** states that as the sample size ($n$) gets larger, the sample mean ($\bar x$) will get closer to the population mean ($\mu$). This can be seen in the equation for $\sigma_{\bar{x}} = \sigma/\sqrt{n}$. Notice as $n$ increases, then $\sigma_\bar{x}$ will get smaller.
</div>
<br>
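
A short simulation can make the formulas $\mu_{\bar{x}} = \mu$ and $\sigma_{\bar{x}} = \sigma/\sqrt{n}$ more concrete. The sketch below is an illustration under stated assumptions: the population is taken to be exponential with mean 1 and standard deviation 1 (a right-skewed choice made here, not in the lesson), and `standard_error` and `simulate_sample_means` are hypothetical helper names.

```python
import random
from math import sqrt
from statistics import mean, stdev

def standard_error(sigma, n):
    """Standard deviation of the distribution of sample means: sigma / sqrt(n)."""
    return sigma / sqrt(n)

def simulate_sample_means(n, reps, rng):
    """Draw `reps` samples of size n from an exponential population
    (mean 1, standard deviation 1) and return the sample means."""
    return [sum(rng.expovariate(1.0) for _ in range(n)) / n for _ in range(reps)]

rng = random.Random(6)                 # fixed seed so the run is repeatable
means = simulate_sample_means(n=30, reps=2000, rng=rng)

print(mean(means))                     # should be close to the population mean, 1
print(stdev(means))                    # should be close to the value on the next line
print(standard_error(sigma=1.0, n=30))

# Larger samples give a smaller standard deviation of the sample mean.
for n in (30, 120, 480):
    print(n, standard_error(sigma=1.0, n=n))
```

Even though the population here is skewed, a histogram of `means` with $n = 30$ would look roughly bell-shaped, which is what the Central Limit Theorem describes.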
