reglab/disaggregation

Enabling Data Disaggregation of Asian American Subgroups: A Dataset of Wikidata Names for Race Imputation

This repository contains code and data related to our work on using Wikidata to derive name-race distributions for race imputation of Asian American subgroups. Please reach out to Derek Ouyang at douyang1@stanford.edu with any questions or feedback.

Quick Start Guide

If you would like to perform predictions on your own data right away, the best place to start is notebooks/0_starter_code.Rmd.

The two key data files we generated through this work are:

These are also available at Harvard Dataverse: https://doi.org/10.7910/DVN/LEOECM

Full File Structure

notebooks

  • 0_starter_code.Rmd: A template for users to prepare their own data and perform predictions.
  • 1_run_wqs.Rmd: Generates the raw Wikidata queries. See scripts/subgroup_queries.json, scripts/wikidata_query_helper_functions.R, and data/raw_wikidata/.
  • 2_process_wiki_result.Rmd: Processes the raw Wikidata queries into first name and surname lists. See scripts/name_cleaning_helper_functions.R, external_data/, and data/intermediate_data/.
  • 3_process_ipums.Rmd: Processes the raw IPUMS extracts into first name and surname lists. See external_data/ and data/intermediate_data/.
  • 4_create_disagg_geo_tables.Rmd: Processes raw American Community Survey data into geography-race tables. See external_data/ and data/geography/.
  • 5_create_name_prior_tables.Rmd: Processes first name and surname lists into name-race tables. See scripts/name_table_helper_functions.R and data/name_race_data/.
  • 6_validate_afc.Rmd: Conducts the validation on EHR data presented in the paper. Note that the EHR data itself is not publicly available. See scripts/imputation_helper_functions.R and output/results/.
  • 7_create_results.Rmd: Produces the figures and tables presented in the paper. See output/figures/.
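The notebooks above produce name-race tables and geography-race tables for race imputation. As a rough illustration of how two such tables can be combined, here is a minimal sketch of a BISG-style Bayesian update, in which the posterior is proportional to P(race | name) times P(geography | race). It is written in Python purely for illustration; the repository's own code is in R, and this is not necessarily the procedure implemented in scripts/imputation_helper_functions.R. The function name, the subgroup labels, and all probabilities below are hypothetical.

```python
from typing import Dict


def bisg_posterior(
    p_race_given_name: Dict[str, float],
    p_geo_given_race: Dict[str, float],
) -> Dict[str, float]:
    """Combine a name-race row with a geography-race row.

    Assumes name and geography are conditionally independent given race,
    so posterior(race) is proportional to P(race | name) * P(geo | race).
    """
    unnormalized = {
        race: p_name * p_geo_given_race.get(race, 0.0)
        for race, p_name in p_race_given_name.items()
    }
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("no subgroup has nonzero support in both tables")
    return {race: weight / total for race, weight in unnormalized.items()}


# Hypothetical toy rows; real values would come from data/name_race_data/
# and data/geography/.
name_row = {"Chinese": 0.6, "Korean": 0.3, "Vietnamese": 0.1}
geo_row = {"Chinese": 0.02, "Korean": 0.01, "Vietnamese": 0.04}

posterior = bisg_posterior(name_row, geo_row)
for race, prob in posterior.items():
    print(f"{race}: {prob:.3f}")
```

In this toy case the geography row shifts some probability toward the subgroup that is more concentrated in that area, while the name row still dominates the result.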

scripts

external_data

Two additional SSA name list files can be added to external_data/. They must use the filenames below so that notebooks/6_validate_afc.Rmd and notebooks/7_create_results.Rmd load them correctly. These files enable the hybrid approach described in the paper. The SSA name lists are not available for direct public download, but can be requested from the original author at lauderdale@health.bsd.uchicago.edu.

  • external_data/SSA_Givennames.csv
  • external_data/SSA_Surnames.csv
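
Because the notebooks look for these exact filenames, it can help to confirm the files are in place before running them. The following is a small, hypothetical pre-flight check, written in Python for illustration and not part of the repository; the helper name missing_ssa_files is an assumption. Run it from the repository root.

```python
from pathlib import Path

# Filenames as listed above; the notebooks expect them inside external_data/.
SSA_FILES = ("SSA_Givennames.csv", "SSA_Surnames.csv")


def missing_ssa_files(external_dir: str = "external_data") -> list:
    """Return the expected SSA filenames that are not present in external_dir."""
    base = Path(external_dir)
    return [name for name in SSA_FILES if not (base / name).is_file()]


missing = missing_ssa_files()
if missing:
    print("Missing SSA files:", ", ".join(missing))
else:
    print("Both SSA name list files are present.")
```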

data

  • raw_wikidata/: Unprocessed Wikidata query outputs, stored as a separate .rds file for each subgroup.
  • geography/: Geography-race information from ACS, in raw count form and as distribution tables.
  • intermediate_data/: Various intermediate files in the processing pipeline from raw Wikidata queries to name-race tables.
  • name_race_data/: The final name-race tables, including alternative versions.

output

  • results/: Various .rds files holding the input data necessary to produce the figures in the paper.
  • figures/: PNG files of figures in the paper.
