methods.txt

Description of methods

Part 1: Calculating lung cancer incidence per 100,000 people standardized by gender and age

Part 2: Gathering and incorporating additional data

Part 3: Modeling


Part 1

Scripts: cancerdata3.py, lung_dataframe.py, population.py, cancer_rates.py, age_standardize.py

cancerdata3.py parses the SEER cancer incidence data and exports a dataframe with all respiratory cancer cases to RESPIR.csv. Since this data is not public (research document must be signed) I have not included the raw data files here.

lung_dataframe.py reads in RESPIR.csv and cleans it up. First we filter for lung cancer cases. Only relevant columns are chosen (sex, age at diagnosis, year of diagnosis, year of birth, race, county FIPS code). Also, two age buckets are created (<65, >65) so that we can standardize the data by age (account for the effect that age has on lung cancer risk). I did this because the smoking estimates data that I found was age and gender standardized (according to the U.S. Census Methodology).

population.py parses the SEER cancer population data necessary in order to calculate incidence rates per 100,000 people. Age bins are also created here on the population data. Again, this data is not public and so is not included in my repo.

cancer_rates.py joins the SEER incidence and population data together. I've created two dataframes in this script. The first called grouped_overall only has a single entry (row) per county. The second called grouped has 4 entries per county (for different gender and age groups). Cancer incidence per 100,00 people is calculated for these two dataframes by dividing the cancer counts by the population figures and multiplying by 100,000.

age_standardize.py joins the countywide population figures from combined_overall onto combined. Then, cancer rates are calculated for each county subgroup for all counties multiplying the existing cancer rate in this group (group cancer count / group population) by the ratio of that group's population to the overall county size (group population / county population). This gives us the group cancer count / county population. Now we can simply group this dataframe by county and sum up all the cancer rates among the 4 groups per county to arrive at our age and gender standardized lung cancer incidence rates per 100,000.


Part 2

Scripts: join_smoking_air.py, radon.py, AQI.py, TRI.py, feature_engineering.py


join_smoking_air.py reads in smoking estimates data and air quality data from the Health Tracking Network. The smoking estimates data comprises estimates of  % adult smokers (both daily and non-daily) per county. The air quality data comprises a number of metrics. I chose to include two variables: average concentrations of PM 2.5 and Days of Ozone above the National Quality Standard.

radon.py grabs 5 years of U.S. radon data from EPA surveys conducted between 1988-1992. Measurements generally took place within the same year for each county over this 5 year interval. I decided to calculate the average radon level (measured in pCi/L) per county between 1988-1992.

AQI.py reads in the EPA AirData comprised of Air Quality Index (AQI) metrics. I chose metrics: Days PM 2.5, Days PM10, Median AQI, Max AQI. Details regarding these metrics can be found here:

https://www.epa.gov/outdoor-air-quality-data/about-air-data-reports

TRI.py reads in the EPA Toxic Release Inventory (TRI) data. This data is voluntarily reported by companies in numerous industries, and comprises releases of many types of harmful chemicals into the air, water and land. Due to the sparsity of data on individual chemicals (example: asbestos releases in Los Angeles County in 2005), I decided to group all chemicals together that are labeled carcinogens (related to cancer). Nearly all these chemicals are measure in pounds, but a few are measure in grams. I simply summed the release totals without converting grams to pounds.

feature_engineering.py performs transformations on variables.
Log transformed variables: radon, ozone, Max AQI, Days PM2.5, releases
I also used a Gaussian Mixture Model (GMM) to model probabilities of high/low radon levels given that there seem two distinct distributions for radon levels, as can be seen in eda.py.
Lastly, I created scaled versions of Days PM2.5 and Days PM10 per 366 days per year since the number of days of measurements varied by county.

Part 3

Scripts: bic.py, linear_models.py

bic.py reads in our final dataframe lung_final.csv which comprises all the data. The variables are split up into groups of highly correlated variables (e.g. radon group: mean radon, log mean radon, Probability High Radon). The ic_calc function calculates the AIC and BIC for lasso regression models fit with different sets of variables, each containing a single variable from the specified groups. I ran the function many times with different groups specified; the results of my feature selection process can be seen in README.md (Table 1: Comparing Feature Sets Using BIC Scores).

linear_models.py contains all the modeling code.

There are 4 portions to this script, corresponding to the four models used. Each contains multiple different types of plots to visualize the data. Root Mean Square Error (RMSE) is used to compare models. Same set of BIC-minimizing features used (daily smoking, log mean radon, Days of Harmful PM2.5, Median AQI).

Library: scikit-learn
The first portion of this script runs pooled linear regression, lasso regression and elasticnet regression models. Hyperparameters for lasso and elasticnet are determined via gridsearching.

Library: PyMC3
The second portion of the script calculates point estimates and 95% confidence intervals for the unpooled regression model.

Sample size: 1,000
# Iterations: 50,000
Burn-in: 10,000

Library: PyMC3
The third portion of the script calculates point estimates and 95% confidence intervals for the state-grouped multilevel regression model.

Sample size: 1,000
# Iterations: 150,000
Burn-in: 100,000

Library: PyMC3
The fourth portion of the script calculates point estimates and 95% confidence intervals for the county-level multilevel regression model.

Sample size: 5,000
# Iterations: 150,000
Burn-in: 100,000