- Presentation (Screencast)
- Slides
- UI (down) (Screencast)
- Data wrangling work (separate branch)
A tool to see where federal money is going with additional information on associated legislators.
Companies are interested in how US federal contract spending is distributed across industries and states, to inform investment and expansion decisions. Taxpayers are also interested: it is their money being spent on their behalf, and the spending should reflect their priorities.
While federal laws mandate that federal spending data be publicly accessible, the available exploratory tools focus on individual payments and present the data without context. FedSpend addresses this gap in accessibility by linking federal payments to the recipient's House and Senate representatives. The data are aggregated to abstract away government peculiarities and presented in a UI alongside state demographic information.
Federal spending data are provided by the US Government as a PostgreSQL dump archive file, updated monthly.
Every row in the federal spending dataset is an obligation (i.e., a promise to pay); outlays (i.e., actual payments) are not included. Modifications to a contract, such as an increase or decrease in the obligated amount, are entered as additional rows containing only the adjustment amount. See the example below.
Date | NAICS code | Recipient | Amount | District | State |
---|---|---|---|---|---|
2019 | 519130 | Bojack Horseman Inc. | $45,000 | 36 | CA |
2019 | 519130 | Bojack Horseman Inc. | -$20,000 | 36 | CA |
2019 | 519130 | Bojack Horseman Inc. | $5,000 | 36 | CA |
In this example, the total value of the contract is $30,000.
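The arithmetic can be sketched directly, with the amounts taken from the example table above:

```python
# Each row is one obligation entry; modifications appear as additional
# rows carrying only the adjustment amount.
rows = [45_000, -20_000, 5_000]

# The total contract value is the running sum of all entries.
total = sum(rows)
print(total)  # 30000
```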
Historical and current data on House and Senate members are maintained as a public resource by a community of volunteers. These data are linked to government contracts by the year and the city/state of the recipient.
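As a minimal sketch of that linkage, a lookup keyed on year and state can attach senators to a contract row; the field layout below is illustrative, not the real schema (a House lookup would also key on district):

```python
# Community-maintained legislator data, keyed here by (year, state)
# for the Senate side.
legislators = {
    (2019, "CA"): ["Dianne Feinstein", "Kamala Harris"],
}

# One contract row from the spending data (illustrative fields).
contract = {"year": 2019, "state": "CA", "recipient": "Bojack Horseman Inc."}

# Link the contract to the senators serving its recipient's state that year.
senators = legislators.get((contract["year"], contract["state"]), [])
print(senators)  # ['Dianne Feinstein', 'Kamala Harris']
```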
An overview is provided here, while full details are in the wiki.
- Set up a PostgreSQL database and restore the full database (~600GB) from the archive file (~60GB).
- Set up a CockroachDB cluster.
- Set up a Spark cluster for the migration of data from PostgreSQL to CockroachDB.
- Migrate each table (~60 tables) from PostgreSQL to CockroachDB via Spark.
- Calculate aggregate tables from PostgreSQL using Spark and save to CockroachDB.
- Set up a web server with Apache and UI with Flask.
- Set up Airflow to run the data pipeline and version-control procedure, then refresh the UI.
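The per-table migration step might look roughly like the following PySpark sketch. The hostnames, credentials, and table name are assumptions; because CockroachDB speaks the PostgreSQL wire protocol, the same JDBC driver can serve both ends.

```python
def jdbc_url(host, port, db):
    """Build a JDBC URL for a PostgreSQL-wire-protocol database."""
    return f"jdbc:postgresql://{host}:{port}/{db}"

# Hypothetical endpoints for the source and destination databases.
PG_URL = jdbc_url("pg-host", 5432, "usaspending")
CRDB_URL = jdbc_url("crdb-host", 26257, "fedspend")

def migrate_table(spark, table):
    """Read one table from PostgreSQL and append it to CockroachDB."""
    df = (spark.read.format("jdbc")
          .option("url", PG_URL)
          .option("dbtable", table)
          .option("user", "reader")
          .load())
    (df.write.format("jdbc")
       .option("url", CRDB_URL)
       .option("dbtable", table)
       .option("user", "writer")
       .mode("append")
       .save())

# Usage sketch (requires a running Spark cluster and both databases):
#   spark = SparkSession.builder.appName("pg-to-crdb").getOrCreate()
#   migrate_table(spark, "transaction_normalized")
```

Partitioned reads (the `partitionColumn`/`numPartitions` JDBC options) would let Spark pull each large table in parallel rather than through a single connection.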
The data come prepackaged as a PostgreSQL database, itself a stable choice, but there are several modern alternatives. Prioritizing pattern flexibility and efficiency over infinite scale, the PIE theorem points to another relational database. CockroachDB was chosen for its built-in distributed execution and its resilience.
Consider the database dump posted in February 2020. Rather than overwriting the existing data with the new archive, we select and append only the new rows (green), then run a version-control procedure on the carryover rows (yellow). The goal is to verify that the carryover data, which we expect to be unchanged, are in fact identical: a hash function is applied to the obligation column in both versions and the outputs are compared.
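A minimal sketch of that hash comparison, using Python's standard `hashlib` and illustrative row data:

```python
import hashlib

def column_digest(amounts):
    """Hash the obligation column; identical data yield identical digests."""
    h = hashlib.sha256()
    for amount in amounts:
        h.update(str(amount).encode())
    return h.hexdigest()

# Carryover rows from two consecutive dumps (illustrative values).
jan_carryover = [45_000, -20_000, 5_000]
feb_carryover = [45_000, -20_000, 5_000]

# Matching digests mean the carryover data survived the refresh unchanged.
print(column_digest(jan_carryover) == column_digest(feb_carryover))  # True
```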