---
title: "Homework 1b (bonus) - Data Wrangling"
format:
  html:
    embed-resources: true
date-modified: 2024-09-16
author: George G. Vega Yon, Ph.D.
---
# Due Date
~~Tuesday, September 24~~ Thursday, September 26
# Data Wrangling
The learning objectives are to conduct data wrangling and generate some summary statistics.
You will need to download two datasets from https://github.com/USCbiostats/data-science-data: the [individual](https://raw.githubusercontent.com/USCbiostats/data-science-data/master/01_chs/chs_individual.csv)
and [regional](https://raw.githubusercontent.com/USCbiostats/data-science-data/master/01_chs/chs_regional.csv)
CHS datasets in `01_chs`.
The individual data include personal and health characteristics of children in
12 communities across Southern California. The regional data include air quality
measurements at the community level.
Once downloaded, you can merge these datasets using the location variable. Once
combined, you will need to do the following:
1. After merging the data, make sure you don’t have any duplicates by counting
the number of rows. Make sure it matches the number of rows in the individual dataset.
In the case of missing values, impute them using the average within the groups
defined by the variables "male" and "hispanic." If you are interested (and feel adventurous)
in the topic of data imputation, take a look at this paper on "Multiple Imputation"
using the Amelia R package [here](https://gking.harvard.edu/files/gking/files/amelia_jss.pdf).
2. Create a new categorical variable named “obesity_level” using the BMI measurement
(underweight BMI<14; normal BMI 14-22; overweight BMI 22-24; obese BMI>24).
To make sure the variable is correctly coded, create a summary table that contains
the minimum BMI, maximum BMI, and the total number of observations per category.
3. Create another categorical variable named "smoke_gas_exposure" that summarizes
"Second Hand Smoke" and "Gas Stove." The variable should have four categories
in total.
4. Create four summary tables, one each by town, sex, obesity level, and
"smoke_gas_exposure," showing the average (or proportion, if binary) and
standard deviation of “Forced expiratory volume in 1 second (ml)” and the
asthma indicator.
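
The logic of steps 1 and 2 can be sketched on a handful of made-up rows. This is a minimal illustration in plain Python, not a solution on the CHS files: the column names (`sid`, `town`, `pm25`) and all values are hypothetical stand-ins, and the treatment of a BMI of exactly 22 as "overweight" is an assumption, since the ranges above do not say which category that boundary belongs to.

```python
from collections import defaultdict
from statistics import mean

# Toy stand-ins for the two tables; names and values are hypothetical.
individual = [
    {"sid": 1, "town": "A", "male": 1, "hispanic": 0, "bmi": 13.5},
    {"sid": 2, "town": "A", "male": 1, "hispanic": 0, "bmi": None},
    {"sid": 3, "town": "B", "male": 0, "hispanic": 1, "bmi": 23.0},
    {"sid": 4, "town": "B", "male": 1, "hispanic": 0, "bmi": 18.5},
    {"sid": 5, "town": "B", "male": 0, "hispanic": 1, "bmi": 25.0},
]
regional = [
    {"town": "A", "pm25": 8.7},
    {"town": "B", "pm25": 12.4},
]

# Step 1a: many-to-one merge on the location key. Each town may appear only
# once in the regional table, otherwise the merge would duplicate children.
by_town = {}
for region in regional:
    if region["town"] in by_town:
        raise ValueError(f"duplicate town in regional data: {region['town']}")
    by_town[region["town"]] = region

# A child whose town is missing from the regional table raises KeyError here.
merged = [{**child, **by_town[child["town"]]} for child in individual]

# The merged table must have exactly one row per child.
assert len(merged) == len(individual)

# Step 1b: impute missing BMI with the mean of the child's (male, hispanic) group.
observed = defaultdict(list)
for row in merged:
    if row["bmi"] is not None:
        observed[(row["male"], row["hispanic"])].append(row["bmi"])
group_mean = {key: mean(values) for key, values in observed.items()}

for row in merged:
    if row["bmi"] is None:
        row["bmi"] = group_mean[(row["male"], row["hispanic"])]


def obesity_level(bmi):
    """Map a BMI value to a category (assumption: exactly 22 is 'overweight')."""
    if bmi < 14:
        return "underweight"
    if bmi < 22:
        return "normal"
    if bmi <= 24:
        return "overweight"
    return "obese"


# Step 2: add the category and build the min / max / count check table.
summary = {}
for row in merged:
    level = obesity_level(row["bmi"])
    row["obesity_level"] = level
    stats = summary.setdefault(level, {"min": row["bmi"], "max": row["bmi"], "n": 0})
    stats["min"] = min(stats["min"], row["bmi"])
    stats["max"] = max(stats["max"], row["bmi"])
    stats["n"] += 1

for level, stats in summary.items():
    print(level, stats)
```

The check table is what makes the coding verifiable: within each category, the minimum and maximum BMI should fall inside the range stated for that category, and the counts should add up to the number of rows in the merged data.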