Commit

Merge branch 'main' of jaanli.github:jaanli/american-community-survey
jaanphare committed Apr 11, 2024
2 parents 6765336 + c7b337e commit 2653e9c
Showing 114 changed files with 381 additions and 320 deletions.
86 changes: 60 additions & 26 deletions README.md
@@ -53,14 +53,14 @@ A typical Framework project looks like this:

## Command reference

| Command | Description |
| ----------------- | -------------------------------------------------------- |
| `yarn install` | Install or reinstall dependencies |
| `yarn dev` | Start local preview server |
| `yarn build` | Build your static site, generating `./dist` |
| `yarn deploy` | Deploy your project to Observable |
| `yarn clean` | Clear the local data loader cache |
| `yarn observable` | Run commands like `observable help` |
| Command | Description |
| ----------------- | ------------------------------------------- |
| `yarn install` | Install or reinstall dependencies |
| `yarn dev` | Start local preview server |
| `yarn build` | Build your static site, generating `./dist` |
| `yarn deploy` | Deploy your project to Observable |
| `yarn clean` | Clear the local data loader cache |
| `yarn observable` | Run commands like `observable help` |
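
For example, a typical local loop looks like this (the preview URL and port come from Framework's own terminal output, not from this README):

```
yarn install   # install dependencies
yarn dev       # start the local preview server and open the URL it prints
yarn build     # generate the static site in ./dist before deploying
```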

## GPT-4 reference

@@ -93,14 +93,14 @@ Example plot of this data: https://s13.gifyu.com/images/SCGH2.gif (code here: ht

Example visualization: live demo here - https://jaanli.github.io/american-community-survey/ (visualization code [here](https://github.com/jaanli/american-community-survey/))

![image](https://github.com/jaanli/exploring_american_community_survey_data/assets/5317244/0428e121-c4ec-4a97-826f-d3f944bc7bf2)
![image](https://github.com/jaanli/exploring_data_processing_data/assets/5317244/0428e121-c4ec-4a97-826f-d3f944bc7bf2)

## Requirements

Clone the repo; create and activate a virtual environment:
```
git clone https://github.com/jaanli/exploring_american_community_survey_data.git
cd exploring_american_community_survey_data
git clone https://github.com/jaanli/american-community-survey.git
cd american-community-survey
python3 -m venv .venv
source .venv/bin/activate
```
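
The steps below drive dbt against DuckDB (see `dbt_project.yml` further down), so both need to be available in this environment; a minimal sketch, assuming the standard package names and no pinned versions:

```
# inside the activated virtual environment (package names assumed; pin versions as needed)
pip install dbt-core dbt-duckdb duckdb
```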
@@ -123,28 +123,62 @@ brew install duckdb
## Usage for 2022 ACS Public Use Microdata Sample (PUMS) Data

To retrieve the list of URLs for all 50 states' PUMS files from the Census Bureau's server, run the following (the archives themselves are downloaded and extracted in a later step):

```
cd data_processing
dbt run --select "public_use_microdata_sample.list_urls" \
--vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2021/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2021.csv", "output_path": "~/data/american_community_survey"}'
```

Then save the URLs:

```
dbt run --select "public_use_microdata_sample.urls" \
--vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2021/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2021.csv", "output_path": "~/data/american_community_survey"}' \
--threads 8
```
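
If you want to confirm the URL models materialized something before moving on, listing `output_path` is enough; the exact filenames are not documented here, so treat this as a sketch:

```
ls -lh ~/data/american_community_survey/
```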

Then execute the dbt model that downloads and extracts the microdata archives (takes ~2 minutes on a MacBook):

```
dbt run --select "public_use_microdata_sample.download_and_extract_archives" \
--vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2022/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2022.csv", "output_path": "~/data/american_community_survey"}' \
--threads 8
```
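
A quick sanity check that the archives were extracted (the `2022/1-Year/` layout matches the `tree` output shown later in this README):

```
ls ~/data/american_community_survey/2022/1-Year/ | head
```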

Then generate the CSV paths:

```
dbt run --select "public_use_microdata_sample.csv_paths" \
--vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2021/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2022.json", "output_path": "~/data/american_community_survey"}' \
--threads 8
```
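
To confirm the CSV paths were captured, query the resulting parquet file with DuckDB; `csv_paths.parquet` is the filename passed to the SQL-generation script below:

```
duckdb -c "SELECT COUNT(*) AS n_csv_files FROM '~/data/american_community_survey/csv_paths.parquet'"
```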
(Alternatively, all of the non-generated models can be run in one combined command:)
```
cd american_community_survey
dbt run --exclude "public_use_microdata_sample.generated+" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2022/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2022.csv", "output_path": "~/data/american_community_survey"}'
```

Then parse the data dictionary:

```
dbt run --select "public_use_microdata_sample.parse_data_dictionary" \
--vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2021/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2021.csv", "output_path": "~/data/american_community_survey"}' \
--threads 8
```
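
To spot-check the parsed dictionary, print its top-level type and size (the `PUMS_Data_Dictionary_2022.json` filename matches the one passed to the mapping script below; adjust it if you parsed a different year):

```
python3 -c "import json, pathlib; d = json.load(pathlib.Path('~/data/american_community_survey/PUMS_Data_Dictionary_2022.json').expanduser().open()); print(type(d).__name__, len(d))"
```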

Then generate the SQL commands needed to map every state's individual people or housing unit variables to easier-to-use (and easier-to-read) names:

```
python scripts/generate_sql_data_dictionary_mapping_for_extracted_csv_files.py \
~/data/american_community_survey/public_use_microdata_sample_csv_paths.parquet \
~/data/american_community_survey/PUMS_Data_Dictionary_2022.json
python scripts/generate_sql_with_enum_types_and_mapped_values_renamed.py ~/data/american_community_survey/csv_paths.parquet ~/data/american_community_survey/PUMS_Data_Dictionary_2022.json
```

Then execute these generated SQL queries (adjust `--threads` to match the number of available processor cores on your system):
```
dbt run --select "public_use_microdata_sample.generated+" --vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2022/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2022.csv", "output_path": "~/data/american_community_survey"}' --threads 1
dbt run --select "public_use_microdata_sample.generated+" \
--vars '{"public_use_microdata_sample_url": "https://www2.census.gov/programs-surveys/acs/data/pums/2022/1-Year/", "public_use_microdata_sample_data_dictionary_url": "https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2022.csv", "output_path": "~/data/american_community_survey"}' \
--threads 8
```

Inspect the output folder to see what has been created in the `output_path` specified in the previous command:
```
❯ tree -hF -I '*.pdf' ~/data/american_community_survey
[ 224] /Users/me/data/american_community_survey/
[ 224] /Users/me/data/data_processing/
├── [ 128] 2022/
│ └── [3.4K] 1-Year/
│ ├── [ 128] csv_hak/
@@ -169,7 +203,7 @@ To see the size of the csv output:

```
❯ du -sh ~/data/american_community_survey/2022
6.4G /Users/me/data/american_community_survey/2022
6.4G /Users/me/data/data_processing/2022
```

And the compressed representation size:
@@ -284,12 +318,12 @@ Check that you can execute a SQL query against these files:
```
duckdb -c "SELECT COUNT(*) FROM '~/data/american_community_survey/*individual_people_united_states*2021.parquet'"
```
1. Create a data visualization using the compressed parquet files by adding to the `american_community_survey/models/public_use_microdata_sample/figures` directory, and using examples from here https://github.com/jaanli/american-community-survey/ or here https://github.com/jaanli/lonboard/blob/example-american-community-survey/examples/american-community-survey.ipynb
6. Create a data visualization using the compressed parquet files by adding to the `data_processing/models/public_use_microdata_sample/figures` directory, and using examples from here https://github.com/jaanli/american-community-survey/ or here https://github.com/jaanli/lonboard/blob/example-american-community-survey/examples/american-community-survey.ipynb

To save time, there is a bash script with these steps in `scripts/process_one_year_of_american_community_survey_data.sh` that can be used as follows:
To save time, there is a bash script with these steps in `scripts/process_one_year_of_data_processing_data.sh` that can be used as follows:
```
chmod a+x scripts/process_one_year_of_american_community_survey_data.sh
./scripts/process_one_year_of_american_community_survey_data.sh 2021
chmod a+x scripts/process_one_year_of_data_processing_data.sh
./scripts/process_one_year_of_data_processing_data.sh 2021
```

The argument specifies the year to be downloaded, transformed, compressed, and saved. It takes about 5 minutes per year of data.
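
To process several years in one go, the same script can be looped over year arguments; a sketch, assuming the script name shown above (use whichever of the two names actually exists in `scripts/`):

```
for year in 2019 2020 2021; do
  ./scripts/process_one_year_of_american_community_survey_data.sh "$year"
done
```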
@@ -570,7 +604,7 @@ dbt run --select "public_use_microdata_sample.microdata_area_shapefile_paths"
```
5. Check that the paths are correct:
```
❯ duckdb -c "SELECT * FROM '/Users/me/data/american_community_survey/microdata_area_shapefile_paths.parquet';"
❯ duckdb -c "SELECT * FROM '/Users/me/data/data_processing/microdata_area_shapefile_paths.parquet';"
```
Displays:

@@ -579,11 +613,11 @@
│ shp_path │
│ varchar │
├─────────────────────────────────────────────────────────────────────────────────────────────┤
│ /Users/me/data/american_community_survey/PUMA5/2010/tl_2010_02_puma10/tl_2010_02_puma10.shp │
│ /Users/me/data/data_processing/PUMA5/2010/tl_2010_02_puma10/tl_2010_02_puma10.shp │
│ · │
│ · │
│ · │
│ /Users/me/data/american_community_survey/PUMA5/2010/tl_2010_48_puma10/tl_2010_48_puma10.shp │
│ /Users/me/data/data_processing/PUMA5/2010/tl_2010_48_puma10/tl_2010_48_puma10.shp │
├─────────────────────────────────────────────────────────────────────────────────────────────┤
│ 54 rows (40 shown) │
└─────────────────────────────────────────────────────────────────────────────────────────────┘
10 changes: 5 additions & 5 deletions data_processing/dbt_project.yml
@@ -1,12 +1,12 @@
# Name your project! Project names should contain only lowercase characters
# and underscores. A good package name should reflect your organization's
# name or the intended use of these models
name: "american_community_survey"
name: "data_processing"
version: "1.0.0"
config-version: 2

# This setting configures which "profile" dbt uses for this project.
profile: "american_community_survey"
profile: "data_processing"

# Variables that can be changed from the command line using the `--vars` flag:
# example: dbt run --vars 'my_variable: my_value'
@@ -28,8 +28,8 @@ macro-paths: ["macros"]
snapshot-paths: ["snapshots"]

clean-targets: # directories to be removed by `dbt clean`
- "target"
- "dbt_packages"
- "target"
- "dbt_packages"

# Configuring models
# Full documentation: https://docs.getdbt.com/docs/configuring-models
@@ -38,7 +38,7 @@ clean-targets: # directories to be removed by `dbt clean`
# directory as views. These settings can be overridden in the individual model
# files using the `{{ config(...) }}` macro.
models:
american_community_survey:
data_processing:
# Config indicated by + and applies to all files under models/example/
# example:
+materialized: view
52 changes: 28 additions & 24 deletions data_processing/models/public_use_microdata_sample/config.yml
@@ -1,27 +1,31 @@
version: 2

models:
  - name: list_urls
    config:
      public_use_microdata_sample_url: "{{ var('public_use_microdata_sample_url') }}"
      output_path: "{{ var('output_path') }}"
  - name: download_and_extract_archives
    config:
      public_use_microdata_sample_url: "{{ var('public_use_microdata_sample_url') }}"
      output_path: "{{ var('output_path') }}"
  - name: parse_data_dictionary
    config:
      public_use_microdata_sample_data_dictionary_url: "{{ var('public_use_microdata_sample_data_dictionary_url') }}"
      output_path: "{{ var('output_path') }}"
  - name: list_shapefile_urls
    config:
      microdata_area_shapefile_url: "{{ var('microdata_area_shapefile_url') }}"
      output_path: "{{ var('output_path') }}"
  - name: download_and_extract_shapefiles
    config:
      microdata_area_shapefile_url: "{{ var('microdata_area_shapefile_url') }}"
      output_path: "{{ var('output_path') }}"
  - name: combine_shapefiles
    config:
      microdata_area_shapefile_url: "{{ var('microdata_area_shapefile_url') }}"
      output_path: "{{ var('output_path') }}"
  - name: list_urls
    config:
      public_use_microdata_sample_url: "{{ var('public_use_microdata_sample_url') }}"
      output_path: "{{ var('output_path') }}"
  - name: download_and_extract_archives
    config:
      public_use_microdata_sample_url: "{{ var('public_use_microdata_sample_url') }}"
      output_path: "{{ var('output_path') }}"
  - name: csv_paths
    config:
      public_use_microdata_sample_url: "{{ var('public_use_microdata_sample_url') }}"
      output_path: "{{ var('output_path') }}"
  - name: parse_data_dictionary
    config:
      public_use_microdata_sample_data_dictionary_url: "{{ var('public_use_microdata_sample_data_dictionary_url') }}"
      output_path: "{{ var('output_path') }}"
  - name: list_shapefile_urls
    config:
      microdata_area_shapefile_url: "{{ var('microdata_area_shapefile_url') }}"
      output_path: "{{ var('output_path') }}"
  - name: download_and_extract_shapefiles
    config:
      microdata_area_shapefile_url: "{{ var('microdata_area_shapefile_url') }}"
      output_path: "{{ var('output_path') }}"
  - name: combine_shapefiles
    config:
      microdata_area_shapefile_url: "{{ var('microdata_area_shapefile_url') }}"
      output_path: "{{ var('output_path') }}"
@@ -25,7 +25,7 @@ def model(dbt, session):
base_url = dbt.config.get('public_use_microdata_sample_url') # Assuming this is correctly set

# Fetch URLs from your table or view
query = "SELECT * FROM list_urls"
query = "SELECT * FROM list_urls "
result = session.execute(query).fetchall()
columns = [desc[0] for desc in session.description]
url_df = pd.DataFrame(result, columns=columns)
@@ -50,25 +50,4 @@ def model(dbt, session):
paths_df = pd.DataFrame(extracted_files, columns=['csv_path'])

# Return the DataFrame with paths to the extracted CSV files
return paths_df

# Mock dbt and session for demonstration; replace with actual dbt and session in your environment
class MockDBT:
def config(self, key):
return {
'public_use_microdata_sample_url': 'https://example.com/path/to/your/csv/files',
'output_path': '~/path/to/your/output/directory'
}.get(key, '')

class MockSession:
def execute(self, query):
# Mock response; replace with actual fetching logic
return [{"URL": "https://example.com/path/to/your/csv_file.zip"} for _ in range(10)]

dbt = MockDBT()
session = MockSession()

if __name__ == "__main__":
# Directly calling model function for demonstration; integrate properly within your dbt project
df = model(dbt, session)
print(df)
return paths_df
@@ -905,7 +905,7 @@ CASE FYRBLTP
WGTP78::VARCHAR AS "Housing Weight replicate 78",
WGTP79::VARCHAR AS "Housing Weight replicate 79",
WGTP80::VARCHAR AS "Housing Weight replicate 80",
FROM read_csv('/Users/me/data/american_community_survey/2022/1-Year/csv_hal/psam_h01.csv',
FROM read_csv('~/data/american_community_survey/2022/1-Year/csv_hal/psam_h01.csv',
parallel=False,
all_varchar=True,
auto_detect=True)
auto_detect=True)
@@ -905,7 +905,7 @@ CASE FYRBLTP
WGTP78::VARCHAR AS "Housing Weight replicate 78",
WGTP79::VARCHAR AS "Housing Weight replicate 79",
WGTP80::VARCHAR AS "Housing Weight replicate 80",
FROM read_csv('/Users/me/data/american_community_survey/2022/1-Year/csv_hak/psam_h02.csv',
FROM read_csv('~/data/american_community_survey/2022/1-Year/csv_hak/psam_h02.csv',
parallel=False,
all_varchar=True,
auto_detect=True)
auto_detect=True)
@@ -905,7 +905,7 @@ CASE FYRBLTP
WGTP78::VARCHAR AS "Housing Weight replicate 78",
WGTP79::VARCHAR AS "Housing Weight replicate 79",
WGTP80::VARCHAR AS "Housing Weight replicate 80",
FROM read_csv('/Users/me/data/american_community_survey/2022/1-Year/csv_haz/psam_h04.csv',
FROM read_csv('~/data/american_community_survey/2022/1-Year/csv_haz/psam_h04.csv',
parallel=False,
all_varchar=True,
auto_detect=True)
auto_detect=True)
@@ -905,7 +905,7 @@ CASE FYRBLTP
WGTP78::VARCHAR AS "Housing Weight replicate 78",
WGTP79::VARCHAR AS "Housing Weight replicate 79",
WGTP80::VARCHAR AS "Housing Weight replicate 80",
FROM read_csv('/Users/me/data/american_community_survey/2022/1-Year/csv_har/psam_h05.csv',
FROM read_csv('~/data/american_community_survey/2022/1-Year/csv_har/psam_h05.csv',
parallel=False,
all_varchar=True,
auto_detect=True)
auto_detect=True)
@@ -905,7 +905,7 @@ CASE FYRBLTP
WGTP78::VARCHAR AS "Housing Weight replicate 78",
WGTP79::VARCHAR AS "Housing Weight replicate 79",
WGTP80::VARCHAR AS "Housing Weight replicate 80",
FROM read_csv('/Users/me/data/american_community_survey/2022/1-Year/csv_hca/psam_h06.csv',
FROM read_csv('~/data/american_community_survey/2022/1-Year/csv_hca/psam_h06.csv',
parallel=False,
all_varchar=True,
auto_detect=True)
auto_detect=True)
@@ -905,7 +905,7 @@ CASE FYRBLTP
WGTP78::VARCHAR AS "Housing Weight replicate 78",
WGTP79::VARCHAR AS "Housing Weight replicate 79",
WGTP80::VARCHAR AS "Housing Weight replicate 80",
FROM read_csv('/Users/me/data/american_community_survey/2022/1-Year/csv_hco/psam_h08.csv',
FROM read_csv('~/data/american_community_survey/2022/1-Year/csv_hco/psam_h08.csv',
parallel=False,
all_varchar=True,
auto_detect=True)
auto_detect=True)
@@ -905,7 +905,7 @@ CASE FYRBLTP
WGTP78::VARCHAR AS "Housing Weight replicate 78",
WGTP79::VARCHAR AS "Housing Weight replicate 79",
WGTP80::VARCHAR AS "Housing Weight replicate 80",
FROM read_csv('/Users/me/data/american_community_survey/2022/1-Year/csv_hct/psam_h09.csv',
FROM read_csv('~/data/american_community_survey/2022/1-Year/csv_hct/psam_h09.csv',
parallel=False,
all_varchar=True,
auto_detect=True)
auto_detect=True)