Skip to content

Latest commit

 

History

History
232 lines (153 loc) · 28.1 KB

README.md

File metadata and controls

232 lines (153 loc) · 28.1 KB

Who Owns Massachusetts Processing and Deduplication

This repository deduplicates property owners in Massachusetts using the MassGIS standardized assessors' parcel dataset and legal entity data sourced from OpenCorporates under their 'public-benefic project' program. The process is most explicitly constructed with reference to Hangen and O'Brien's methods (2024), which are themselves similar (though not identical) to methods used by Taylor Shelton and Eric Seymour (2024) Henry Gomory (2021) and the Anti-Eviction Mapping Project's Evictorbook (see e.g., McElroy and Amir-Ghassemi 2021). It also builds on Eric's experience leading development of a tool called TenantPower with Mutual Aid Medford and Somerville in 2020, which used the dedupe Python package in a manner similar to Immergluck et al. (2020).

While we share large parts of their approach (i.e., relying on community detection on company-officer relationships, following cosine-similarity deduplication of names), we believe that our results are more robust for several reasons. Inspired, in part, by Preis (2024), we expend a great deal of effort on address standardization so that we can use addresses themselves as network entities (prior approaches, with the exception of Preis, have just concatenated addresses and names prior to deduplication). This is a substantial change: "similar" addresses, by whatever measure, can still be very different addresses. By relying on standardized unique addresses, we believe that we are substantially reducing our false positive rate.

Community detection---based on both network analysis and cosine similarity---is accomplished using the igraph package's implementation of the fast greedy modularity optimization algorithm. Cosine similarity is calculated using the quanteda package.

While the full process requires that you source OpenCorporates data, you can run the cosine-similarity-based deduplication process using only the assessors tables. (See the documentation for the OC_PATH configuration variable.)

Getting Started

Data Dictionary

Please consult the data dictionary for field definitions.

renv

This library's dependencies are managed using renv. To install necessary dependencies, simply install renv and run renv::restore(). If you are using Windows, you'll probably have to install the Rtools bundle appropriate for your version of R.

PostgreSQL/PostGIS

The respository uses an instance of PostgreSQL with the PostGIS extension as its primary data store. You'll need to set up a PostGIS instance on either localhost or a server.

Setting up .Renviron

The scripts expect to find your PostgreSQL credentials, host, port, etc. in an .Renviron file with the following environment variables defined:

DB_HOST="yourhost"
DB_USER="yourusername"
DB_PASS="yourpassword"
# Or whatever your port
DB_PORT=5432
DB_NAME="yourdbname"
# Will likely need to be "require" for remote.
DB_SSL="allow"
# Necessary to run `mapbox_preprocess.R`
MB_USER="your-mapbox-username"
MB_TOKEN="your-mapbox-token"

Optionally, you can use the PUSH_DBS configuration parameter to specify a different database you'd like to point subroutine results to, allowing you to separate, for example, a development environment from a production environment. If you'd like to make of this parameter, you'll need to pass a string value to the appropriate named elements in PUSH_DBS (see section 'Configuration (config.R)' below), or to load_results("yourstring") and define...

YOURSTRING_DB_HOST="yourhost"
YOURSTRING_DB_USER="yourusername"
YOURSTRING_DB_PASS="yourpassword"
# Or whatever your port
YOURSTRING_DB_PORT=5432
YOURSTRING_DB_NAME="yourdbname"
# Will likely need to be "require" for remote.
YOURSTRING_DB_SSL="allow"

If you modify your .Renviron mid-RStudio session, you can simply run readRenviron('.Renviron') to reload.

.Renviron is in .gitignore to ensure that you don't commit your credentials.

Running Deduplication Process (run.R)

This is a very time-consuming process, even for small subsets (this is due to the size of the companies and officers tables, which must be processed for reliable results even for smaller spatial subsets). On a 2021 Apple M1 Max chip with 64 GB of memory, the full state is taking a little under 13 hours.

We provide an onmibus manage_run() function in run.R. It does preflight testing and triggers three sequences: a data ingestion sequence (load_read_write_all(), see R/loaders.R), a data processing sequence (proc_all(), see R/processors.R) and a deduplication sequence (dedupe_all(), see R/deduplicators.R).

We recommend running from the terminal using...

Rscript run.R

This is because when the process is run interactively (i.e., in an RStudio environment), intermediate results are stored in an output object, which has memory costs. You can then read the results with load_results.R, as described above.

If the process is run interactively, it automatically outputs results to objects in your environment (including intermediate results if RETURN_INTERMEDIATE is TRUE in config.R. It also writes results to .csv and .Rda files in /results, but doesn't ever try to read these---the PostgreSQL database is the only output location from which our scripts read data.

Configuring the Deduplication Process (config.R)

We expose a large number of configuration variables in config.R, which is sourced in run.R. In order...

Variable Description
COMPLETE_RUN Default: FALSEA little helper that overrides values such that ROUTINES=list(load = TRUE, proc = TRUE, dedupe = TRUE), REFRESH=TRUE, MUNI_IDS=NULL,and COMPANY_TEST=FALSE. This ensures a fresh, statewide run on complete datasets, not subsets.
REFRESH Default: TRUEIf TRUE, datasets will be reingested regardless of whether results already exist in the database.
PUSH_DBS Default: list(load = "", proc = "", dedupe = "") Named list with string values. If "", looks for .Renviron database connection parameters of the format "DB_NAME". If string passed, looks for parameters of the format "YOURSTRING_DB_NAME" where YOURSTRING can be passed upper or lower case, though parameters must be all uppercase. Note that whatever dedupe is set to is treated as "production", meaning that select intermediate tables from previous subroutines are pushed there as well. Requires that you set .Renviron parameters (see section 'Setting Up .Renviron' above).
ROUTINES Default: list(load = TRUE, proc = TRUE, dedupe = TRUE) Allows the user to run individual subroutines (i.e., load, process, deduplicate). The subroutines are not totally indepdent, but each will run in a simplified manner when it is set to FALSE here, returning only results needed by subsequent subroutines.
MUNI_IDS Default: c(274, 49, 35)If NULL, runs process for all municipalities in Massachusetts. If "hns", runs process for Healthy Neighborhoods Study Municipalities (minus Everett because they don't make owner names consistently available). If "mapc", runs process for all municipalities in the MAPC region. Otherwise, a vector of numbers or strings, but must match municipality IDs used by the state. (Consult muni_ids.csv for these.) If numbers, they will be 0-padded.
MOST_RECENT Default: FALSE If TRUE (and the complete vintages MassGIS collection is being used), reads the most recent vintage for each municipality. If FALSE, attempts to determine which vintage has the largest number of municipalities reporting, selecting that year where possible (and selecting the most recent where a given municipality did not report in that year).
COMPANY_TEST_COUNT Default: 50000The OpenCorporates datasets are big. For that reason, during development it's useful to read in test subsets. This is the number of companies to read in when COMPANY_TEST is TRUE.
COMPANY_TEST Default: TRUEIf TRUE, reads in only COMPANY_TEST_COUNT companies and any officers associated with those companies. (Usually on the order of 4x the number of companies.)
RETURN_INTERMEDIATE Default: TRUEIf TRUE, run() returns intermediate tables. Otherwise, loads only the tables yielded by the last subroutine requested into the R environment while writing all tables to the appropriate databases. (I.e., if ROUTES is list(load=TRUE, proc=TRUE, dedupe=FALSE) and RETURN_INTERMEDIATE is TRUE, it will load tables by proc_all() into the R environment).
COSINE_THRESH Default: 0.85The minimum cosine similarity treated as a match. Lower numbers yield matches on less closely related strings.
INDS_THRESH Default: 0.95The minimum cosine similarity treated as a match for non-institutional owners. Lower numbers yield matches on less closely related strings. This should generally be higher than COSINE_THRESH because there are so many more duplicative names. Note that this is address-bounded, so even close matches will not appear as the same unless there is a shared address.
ZIP_INT_THRESH Default: 1One of our address-parsing tricks is to use ZIP codes that fall entirely within a single MA municipality to fill missing cities, and MA municipalities that fall entirely within a ZIP code to fill missing ZIP codes. This adjusts how close to 'entirely' these need to be - note that a value of 1 introduces substantial computational efficiencies because we can simply use a spatial join with a sf::st_contains_properly predicate rather than the much more expensive intersection. (It also means, unfortunately, that there are none of the second case – no municipalities fall entirely within ZIP codes without some fuzziness.
QUIET Default: FALSEIf TRUE, suppresses log messages. Logs are written to a datetime-stamped file in /logs.
CRS Default: 2249EPSG code for coordinate reference system of spatial outputs and almost any spatial analysis in the workflow. 2249 is NAD83 / Massachusetts Mainland in US feet. (The almost is because ZIPS are processed nationwide using NAD 83 / Conus Albers, AKA EPSG 5070. We don't expose this.)
DATA_PATH Default: "data" This is the folder where input datasets (i.e., OpenCorporates data and MassGIS parcel databases) are located. Do not change unless you also plan on moving luc_crosswalk.csv and muni_ids.csv.
RESULTS_PATH Default: "results" This is the folder where resulting .csv and .Rda files will be written. Note that tables will always be written to the PostGIS database, so this is for backup/uncredentialed result transfer only.
OC_PATH Default: "2024-04-12" Either the name of the folder (within /data) that contains the OpenCorporates bulk data or NULL. Scripts depend on companies.csv and officers.csv. If NULL, a simplified cosine-similarity deduplication routine will run, returning a simpler set of tables.
GDB_PATH Default: "L3_AGGREGATE_FGDB_20240703"This is either a folder (within /data) containing all the vintages of the MassGIS parcel data or a single most recent vintage geodatabase (in /data).

Loading Results (load_results.R)

If you want to simply read the results without worrying about triggering the deduplication process, you can simply begin a new RScript, source load_results.R, and run a one-liner like so...

source('load_results.R')
load_results("your_db_prefix", load_boundaries=TRUE, summarize=TRUE)

This will load companies, munis, officers, owners, sites, sites_to_owners, parcels_point, metacorps_cosine and metacorps_network into your R environment. If load_boundaries is true, it will also return munis, zips, tracts, and block_groups.

Please consult the data dictionary for field definitions.

If summarize is TRUE, it will return a number of summary fields for officers, metacorps_cosine, and metacorps_network that are useful for diagnosing cases of over-inclusion in the network analysis. These appear in the data dictionary as well.

This requires that you have .Renviron set up with appropriate prefixes (see 'Setting up .Renviron', above).

Note that for statewide results, these are very large tables and therefore it might take 5-10 minutes depending on your network connection/whether you're reading from a local or remote database.

Writing to Django (write_to_django.R)

The Who Owns Massachusetts application runs on top of a backend written in Django. We provide a script (write_to_django.R) to support writing to the tables created and managed by the Django application, which assumes that you've already run the deduplication process (run.R) and stored its results in a database. by this application. (It also assumes, obviously, that you have set up our who-owns-mass-backend Django application on a server/localhost). The key thing here is that rather than overwriting the tables, it truncates them (removes all rows) and appends results to the tables. This is important because overwriting tables causes all kinds of mayhem with e.g., Django's indexing and constraint management.

To run the script from the terminal, you can use the -l or --load_prefix flags to specify the prefix used to indicate the database that contains the results of the deduplication process and the -d or --django-prefix flags to specify the prefix used to indicate the database containing your managed Django models (as stored in your .Renviron file). For example...

Rscript write_to_django.R -l 'prod' -d 'django'
# or...
Rscript write_to_django.R --load_prefix 'prod' --django_prefix 'django'

To use the provided script within RStudio you can simply run...

source('write_to_django.R')
write_to_django(load_prefix='prod', django_prefix='django')

...where load_prefix and django_prefix are the prefixes, separated from the database connection parameters by an underscore, stored in your .Renviron. For example, the above would look for...

DJANGO_DB_HOST="yourhost"
DJANGO_DB_USER="yourusername"
# ....etc and
PROD_DB_HOST="yourhost"
PROD_DB_USER="yourusername"
# ...etc

Mapbox Preprocessing (mapbox_preprocess.R)

The Who Owns Massachusetts application leverages several preprocessed data layers, which must be pushed to the Mapbox Tilesets API. This script runs the necessary preprocessing steps and pushes the results to Mapbox. Note that the script requires the following environment variables be defined in .Renviron:

# Necessary to run `mapbox_preprocess.R`
MB_USER="your-mapbox-username"
MB_TOKEN="your-mapbox-token"

Note also that the Mapbox token you use must be scoped to have DATASETS:READ VISION:READ, TILESETS:LIST, TILESETS:READ, and TILESETS:WRITE access. You can run from a terminal like so...

Rscript mapbox_preprocess.R -l 'prod'
# or...
Rscript mapbox_preprocess.R --load_prefix 'prod'

To use the provided script within RStudio you can simply run...

source('mapbox_preprocess.R')
mapbox_preprocess(load_prefix='prod')

Data

Required External Data

Successful execution of all features of this software requires that you source the following datasets:

  • MassGIS. "Property Tax Parcels." https://www.mass.gov/info-details/massgis-data-property-tax-parcels.

    • You should be able to use either the 'all vintages' data product (which is packaged as many geodatabases) or the 'most recent' geodatabase (which is a single GDB). If the former, pass the name of the folder to the GDB_PATH config parameter. If the latter, pass the geodatabase filename.
  • OpenCorporates. Bulk Data Product. Massachusetts extract.

    • Unfortunately, we can't provide a copy of this due to our licensing agreement, but OpenCorporates has a 'public-benefit project' program that might be worth looking into. Also, you can run a simpler cosine similarity deduplication process if you set OC_PATH to NULL in config.R.

Additional Data Sources

In addition, the script pulls in data from a range of sources to enrich our datasets. All of these are ingested from API and web sources by the script, so there is no need to source them independently.

  • MassGIS. 2023. Geographic Placenames. October.

    • This is used to standardize municipality names. Placenames are tied to places using a spatial join. (I.e., Roxbury > Boston). A simplified and transformed version of this is written to the database and is an intermediate output from loaders.R.
  • MassGIS. 2024. Municipalities.

    • This is used to standardize municipality names (i.e., Roxbury > Boston). Placenames are tied to places using a spatial join. This is written to the database and is an intermediate output from loaders.R.
  • US Census Bureau TIGER/Line. ZIP Code Tabulation Areas. 2022. Fetched using Tidycensus.

    • These are used to both attach ZIP codes to parcels whose site locations are missing them and to perform a range of address standardizations (for example, many ZIP codes lie within one municipality, meaning that we can assign a municipality when one is missing assuming that it has an unambiguous ZIP code).
  • US Census Bureau TIGER/Line. States and Equivalent Entities. 2022. Fetched using Tidycensus.

    • Used to tie ZIP codes to states. (This is useful because many, though not all ZIP codes lie within a single state, so a ZIP code can be used to assign a state when one is missing in many cases.)
  • US Census Bureau TIGER/Line. Census Tracts and Block Groups. 2022. Fetched using Tidycensus.

    • Used to locate parcels for subsequent analysis.
  • MassGIS. 2024. Master Address Data.

  • City of Boston. 2024. Boston Live Street Address Management System (SAM) Addresses.

    • We use these geolocated addresses for three primary purposes: linking MA owner addresses to locations, identifying unique addresses which are treated as network entities, and, estimating unit counts for properties missing them (following, largely, a method provided by the Metropolitan Area Planning Council).

Acknowledgements

This work received grant support from the Conservation Law Foundation and was developed under the auspices of the Healthy Neighborhoods Study in the Department of Urban Studies and Planning at MIT with input from the Metropolitan Area Planning Council. OpenCorporates has also been a supportive data partner.

References