The purpose of this project is to analyze the extent to which demographic changes in the U.S. population can account for changes in average American household size since 2000. The methodology, based upon Galster (2024), is to construct a multidimensional matrix, with each cell representing a unique combination of demographic (age, sex, race, etc.) and geographic (PUMA) characteristics. The value in each cell is the average household size among persons with that combination of characteristics in the year 2000. The same cell definitions are then applied to data from the year 2020 to determine which demographic groups account for the largest changes in average household size.
The project uses American Community Survey (ACS) IPUMS microdata, which are freely available to the public after registering for an API key.
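To make the approach concrete, here is a highly simplified sketch of the idea in R. It is not the project's actual implementation: the tibble names (`persons_2000`, `persons_2020`) and column names (`hhsize`, `perwt`, `age_bucket`, `cpuma`) are hypothetical.

```r
library(dplyr)

# Hypothetical person-level tibbles: one row per person, with household size
# (hhsize), a person weight (perwt), and bucketed demographic/geographic columns.

# Step 1: average household size within each demographic-geographic cell in 2000.
cell_means_2000 <- persons_2000 |>
  group_by(cpuma, age_bucket, sex, race) |>
  summarize(avg_hhsize_2000 = weighted.mean(hhsize, perwt), .groups = "drop")

# Step 2: attach the 2000 cell averages to the 2020 population and compare the
# actual 2020 average household size with the composition-based counterfactual.
decomposition <- persons_2020 |>
  left_join(cell_means_2000, by = c("cpuma", "age_bucket", "sex", "race")) |>
  summarize(
    actual_2020         = weighted.mean(hhsize, perwt),
    counterfactual_2020 = weighted.mean(avg_hhsize_2000, perwt, na.rm = TRUE)
  )
```

Under this setup, the gap between the counterfactual and actual 2020 averages suggests how much of the observed change is, or is not, explained by shifts in demographic composition alone.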
The following is an overview of the project's file tree:
```
household-size/
├── README.md # The document you're currently reading
├── household-size.Rproj # RStudio project file
├── renv/ # R environment directory
│ ├── activate.R
│ ├── settings.dcf
│ └── library/
├── src/ # Source code directory
│ ├── utils/ # Helper functions for other scripts
│ │ ├── aggregation-tools.R # Functions for aggregating observations
│ │ ├── bucketing-tools.R # Functions for bucketing observations by age, income, etc
│ │ ├── data-validation.R # Functions for validating data during the pipeline
│ │ ├── graphing-tools.R # Functions to help graphing
│ │ └── create-synthetic-data.R # For testing? Maybe move to tests/
│ └── scripts # Scripts that execute the analysis
│ ├── import-ipums.R # Imports raw data and saves to database
│ ├── process-ipums.R # Adds bucket and other cols, saves to database
│ ├── fig01.R
│ ├── fig02.R
│ ├── ...
│ ├── quick-facts.Rmd # Produces quick statistics from our paper
│ └── proof-of-concept-bucketing.R # May be deprecated soon
│
└── tests/
├── testthat/ # Unit tests
│ ├── test-difference-means.R
│ ├── test-join-columns-db.R
│ ├── test-join-columns-tb.R
│ └── ...
└── test-data/ # Mock data for unit tests
├── mock_data.csv
├── mock_age_lookup.csv
├── mock_income_lookup.csv
├── expected_age_output.csv
└── expected_income_output.csv
```
1. Sign up for an IPUMS account and obtain an API key.
2. Clone the repository:
```sh
git clone https://github.com/lorae/household-size.git
```
3. Open the R project file (`household-size.Rproj`) in RStudio.
4. Verify that RTools is installed on your computer. RTools is required for compiling some of the packages in this project's library.
a. Load the `pkgbuild` package, installing it if necessary:
```r
if (!requireNamespace("pkgbuild", quietly = TRUE)) {
  install.packages("pkgbuild")
}
library("pkgbuild")
```
b. Use the `find_rtools()` function from the `pkgbuild` package to verify that RTools is installed and configured:
```r
pkgbuild::find_rtools()
```
- If the output is `[1] TRUE`, RTools is installed and ready to use. Proceed to step 5.
- If the output is `[1] FALSE`:
**i.** Verify your current version of R by running:
```r
R.version.string
```
**ii.** Navigate to [CRAN RTools](https://cran.r-project.org/bin/windows/Rtools/) and download the version of RTools that corresponds to your R version (e.g., RTools44 for R 4.4.x).
**iii.** Close any instances of R and RStudio, then open the executable and follow the installation instructions. After installation, reopen RStudio and rerun the above steps to verify RTools installation.
5. Install the `renv` package, if it is not already installed, by running the following command in R:
```r
if (!requireNamespace("renv", quietly = TRUE)) {
  install.packages("renv")
}
```
Because the next step calls `renv::restore()` with the `renv::` prefix, there is no need to attach the package with `library(renv)` first.
6. Run `renv::restore()` to install all the required packages as specified in the `renv.lock` file:
```r
renv::restore()
```
This will ensure that your environment is set up with the same package versions used in the project. [TODO: modify these instructions to reflect the menu of 4 options a user sees the first time they enter this command.]
7. Run the main script to perform the analysis.
[TODO: instructions here]
8. View the results in [TODO: DESCRIBE HOW RESULTS APPEAR]
**How granular is this analysis, geographically?**
The Census Bureau divides the United States into Public Use Microdata Areas (PUMAs). They're designed to collectively cover the entire U.S. population, not cross state boundaries, and "contain no fewer than 100,000 people each." There are roughly 2,000 PUMAs total.
Due to migration and demographic shifts, the Census Bureau periodically redraws PUMA boundaries. This can pose a challenge for comparing values in the same region over multiple years. Fortunately, IPUMS also defines Consistent Public Use Microdata Areas (ConsPUMAs, or CPUMAs) for various date ranges. The ConsPUMAs we use in this analysis are CPUMA0010, which "represents areas that are consistent among 2000 and 2010 PUMAs" and "can be consistently identified from the geographic codes available in U.S. Census PUMS from 2000 to 2021 samples."
Compared to PUMAs, ConsPUMAs are larger: there are 1,085 ConsPUMAs defined in CPUMA0010. The land area of these ConsPUMAs varies widely: in sparsely populated regions they can be quite large, while in densely populated regions they can be quite small. In Minnesota, for example, the smallest ConsPUMA encompasses only a small portion of Minneapolis, while the largest encompasses much of the state.
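Once the microdata are loaded, the CPUMA0010 variable serves as the geographic grouping key. As a minimal sketch (the tibble name `ipums_tb` is illustrative; `CPUMA0010` and `PERWT` are standard IPUMS USA variable names):

```r
library(dplyr)

# Weighted person counts per ConsPUMA
persons_per_conspuma <- ipums_tb |>
  group_by(CPUMA0010) |>
  summarize(persons = sum(PERWT), .groups = "drop")

# Number of ConsPUMAs represented in the sample
n_distinct(persons_per_conspuma$CPUMA0010)
```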
**How granular is this analysis, demographically?**
The variables we use to bin individuals into groups are listed below. See our codebook [link to codebook] for more information.
Age: 5-year bins:
- 0-4 years
- 5-9 years
- 10-14 years
- 15-19 years
- 20-24 years
- 25-29 years
- 30-34 years
- ...
- 75-79 years
- 80-84 years
- 85+
[perhaps a table of how many individuals fall in each age group?]
Race: Add info here
Household income: Add information here on how it's defined, whether it's pre-tax/transfer, whether it's only earned or earned and unearned, how it's deflated to compare over time, etc.
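As an illustration of the 5-year age bucketing described above, bins like these could be built with base R's `cut()`. This is only a sketch; the project's own bucketing helpers live in `src/utils/bucketing-tools.R` and may differ.

```r
# Illustrative 5-year age buckets; the helpers in src/utils/bucketing-tools.R may differ.
age_breaks <- c(seq(0, 85, by = 5), Inf)
age_labels <- c(paste(seq(0, 80, by = 5), seq(4, 84, by = 5), sep = "-"), "85+")

bucket_age <- function(age) {
  cut(age, breaks = age_breaks, labels = age_labels, right = FALSE)
}

# Returns a factor with levels "0-4", "5-9", ..., "80-84", "85+"
bucket_age(c(0, 17, 34, 85, 99))
```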
Tests are located in the `tests/testthat` folder. To run all tests:
```r
library("testthat")
test_dir("tests/testthat/")
```
Ad hoc performance tests can be run using the `profvis()` function from the `profvis` package. For example, to test the performance of `main.R`, run:
```r
library("profvis")
profvis(source("main.R"))
```
As we process the data, we run several tests to validate that the pipeline is working as expected. The module containing these validation functions is located at `src/utils/data-validation.R`.
Current data checks include:
- Row count validation during data bucketing: After the IPUMS data is read and loaded into the database, the number of observations is recorded in `obs_count`. Subsequent data bucketing steps are validated against this expected count using the `validate_row_counts()` function. If any rows are unexpectedly dropped, execution will stop and an error will be raised.
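Conceptually, the check resembles the minimal sketch below; the actual `validate_row_counts()` in `src/utils/data-validation.R` may have a different signature.

```r
library(dplyr)

# Illustrative sketch only; the project's real validate_row_counts() may differ.
validate_row_counts_sketch <- function(data, expected_count, step = "processing step") {
  # summarize(n()) works for both in-memory tibbles and lazy database tables
  actual_count <- data |> summarize(n = n()) |> pull(n)
  if (actual_count != expected_count) {
    stop(sprintf("Row count mismatch after %s: expected %s rows, found %s.",
                 step, expected_count, actual_count))
  }
  invisible(TRUE)
}
```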
In this project, variables representing data are assigned suffixes based on the following convention:
- `_tb` for tibbles stored in the R global environment.
- `_db` for tables stored within a database connection.

For example, `ipums_raw_tb` would represent a tibble of raw IPUMS data, while `ipums_raw_db` would represent a connection to a database table containing raw IPUMS data. This helps avoid confusion and accidental mixing of tibbles and database tables.
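For illustration, the distinction might look like the sketch below. The duckdb backend, file paths, and table name are assumptions made for this example; the project's actual database setup may differ.

```r
library(DBI)
library(duckdb)
library(dplyr)
library(readr)

# _tb: an in-memory tibble
ipums_raw_tb <- read_csv("tests/test-data/mock_data.csv")

# _db: a lazy reference to a table inside a database connection
con <- dbConnect(duckdb::duckdb(), dbdir = "example.duckdb")
dbWriteTable(con, "ipums_raw", ipums_raw_tb, overwrite = TRUE)
ipums_raw_db <- tbl(con, "ipums_raw")

# Both support dplyr verbs, but _db objects are evaluated in the database
# and must be collect()-ed to become tibbles.
ipums_raw_db |> count() |> collect()
```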
- Galster, G. C. (2024). Is There Enough Housing Production? It Matters Which Indicators Are Used to Answer. Housing Policy Debate. https://doi.org/10.1080/10511482.2024.2334018
- McClure, K., & Schwartz, A. (2024). Where Is the Housing Shortage? Housing Policy Debate, 1–15. https://doi.org/10.1080/10511482.2024.2334011