This repository contains photos of U.S. congressional representatives through the ages, as well as the code necessary to regenerate this data from scratch. We currently have approximately 10,175 of 12,475 representatives accounted for, including every member who has served since 1945.
This project is part of voteview.com, a website dedicated to providing information about historical U.S. legislators, including NOMINATE ideological scores, historical roll-call votes, and biographical information. voteview.com is a project of the University of California, Los Angeles Department of Political Science. The corresponding maintainer for this repository is Aaron Rudkin.
`members.csv` contains a list of all members we have photos for at the time the file was generated. This will allow you to map familiar names to the ICPSR IDs that index our photo filenames. The file is sorted by most recent Congress served, then alphabetically. The photos presented are scaled to 600px in height at a 4x5 aspect ratio. Files smaller than 600px in height are not upscaled, and images already very near a 4x5 aspect ratio are not cropped.
Example results:
Name | ICPSR | State | Party | Congress | Chamber | Born | Died | Image | Source | Provenance |
---|---|---|---|---|---|---|---|---|---|---|
WELLSTONE, Paul David | 049101 | Minnesota | Democratic Party | 107 | Senate | 1944 | 2002 | images/bio_guide/049101.jpg | bio_guide | |
CLINTON, William Jefferson (Bill) | 099909 | President | Democratic Party | 106 | President | 1946 | | images/wiki/099909.jpg | wiki | |
GUILL, Ben Hugh | 003874 | Texas | Republican Party | 81 | House | 1909 | 1994 | images/manual/003874.jpg | manual | Representing Texas |
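
Building on the columns shown above, a few lines of Python are enough to map a familiar name to an ICPSR ID and image path. This is a minimal sketch, not a supported API; the column names are taken from the example table and should be verified against your copy of `members.csv`:

```python
import csv

# Build a lookup from member name to the full record. Column names
# follow the example table above; verify them against your copy of
# members.csv before relying on this sketch.
with open("members.csv", newline="", encoding="utf-8") as f:
    by_name = {row["Name"]: row for row in csv.DictReader(f)}

wellstone = by_name["WELLSTONE, Paul David"]
print(wellstone["ICPSR"])  # e.g. 049101
print(wellstone["Image"])  # e.g. images/bio_guide/049101.jpg
```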
In order to add images, you will need to install several dependencies. The scrapers, which seek and download new images, require Python and several libraries. The image processing side, which resizes and crops images, requires ImageMagick, smartcrop-cli, JPEGTran, and jpegoptim.
To install external dependencies, users running environments with support for `apt` or `brew` can run `dependencies.sh`. To install Python dependencies, use `poetry install` via Poetry or `pip install -r requirements.txt`.
- Run a scraper or manually add a photo to the appropriate raw folder (likely `images/raw/manual/`).
- If a manual image has been added, add a provenance statement to `config/provenance.json` (see the sketch after this list).
- Run `constrain_images.py` to generate processed versions of the images from the raw images.
- Run `config/compile_members.py` to update the database with the new images. This ensures that the `members.csv` file is up to date and that `verify.py`'s tests work.
- Run `verify.py` to ensure data integrity.
- If added images upgrade earlier images (for example, `bio_guide` images replacing `wiki` images), run `verify.py --flush` to remove the files that are no longer used.
- Open a pull request to submit your images to us.
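
As a sanity check before opening a pull request, you can confirm that every manual raw image has a matching provenance entry. The sketch below assumes `config/provenance.json` maps filename stems (zero-padded ICPSR IDs) to provenance strings; that structure is an assumption and should be verified against the actual file:

```python
import json
from pathlib import Path

# Assumption: config/provenance.json is a dict keyed by the image
# filename stem (the zero-padded ICPSR ID). Check the real file
# before relying on this sketch.
provenance = json.loads(Path("config/provenance.json").read_text())

missing = [
    img.name
    for img in Path("images/raw/manual").iterdir()
    if img.is_file() and img.stem not in provenance
]
if missing:
    print("Raw manual images without a provenance entry:", missing)
```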
`check_missing.py` allows users to check for representatives whose photos are missing and generates a table based on the criteria provided.
Arguments:
- `--type flat`: Use a flatfile database instead of our default MongoDB instance. Most end users should use this argument.
- `--min N`: Provide a number `N` which represents the minimum Congress to scan for missing photos (default `81` [1947-1949]).
- `--max N`: Provide a number `N` which represents the maximum Congress to scan for missing photos. Default is left blank.
- `--chamber chamber`: Provide a chamber `chamber` describing a specific chamber of Congress. Valid options are `House` or `Senate`. Default is left blank.
- `--state state`: Provide a two-character `state` postal abbreviation to limit searches to one state. Example: `CO` for Colorado.
- `--sort sort`: Provide a string `sort` which describes which field to sort on. Valid options are `bioname`, `icpsr`, `state_abbrev`, `party_code`, and `congress`. Default is `congress`. When grouping, you can also sort by `Amount`.
- `--year`: If specified, the table will include "year" instead of "congress", and the `--min` and `--max` arguments will expect a year.
- `--raw`: If specified, the script will check for images where we have processed copies but no raw copies. Clones of the repository that have not yet re-scraped the raw files from `bio_guide` and `wiki` should see all such images; clones that have scraped images should report no missing raw files.
- `--group [state_abbrev | congress]`: If specified, instead of printing a table of individual missing images, a count grouped by the group parameter will be printed. Useful to see which states or congresses are complete.
Example usage:
`python check_missing.py --type flat --min 50 --state CT --chamber House --sort bioname`
`bio_guide.py` allows users to scrape the Congressional Bioguide for photos.
Arguments:
- `--type flat`: Use a flatfile database instead of our default MongoDB instance. Most end users should use this argument.
- `--min N`: Provide a number `N` which represents the minimum Congress to scan for missing photos (default `20`).
Example usage:
`python bio_guide.py --type flat --min 50`
`wiki.py` allows users to scrape Wikipedia for photos.
Arguments:
- `--type flat`: Use a flatfile database instead of our default MongoDB instance. Most end users should use this argument.
- `--min N`: Provide a number `N` which represents the minimum Congress to scan for missing photos (default `20`).
- `--icpsr ICPSR --url "http://..."`: Provide an ICPSR and a URL to manually scrape a Wikipedia article for that ICPSR. Useful when the default name or search is inadequate. The resulting page will still be checked against the scoring algorithm to ensure the page is appropriate for the member.
- `--override 1`: By default, we cache data from Wikipedia articles so that we don't check every congressperson every time we run. Use this argument to override the cached data and re-scrape every user who would otherwise fit the parameters. Useful during a congressional cutover.
- `--blacklist ICPSR`: Mutually exclusive with all other arguments; tells the scraper not to scrape this ICPSR from Wikipedia in the future. Useful when the correct page has a photo that is incorrectly scraped (e.g. a house or memorial photo or military insignia instead of a photo of the person).
Example usage:
`python wiki.py --type flat --min 50`
`manual_wiki_override.sh` will scrape photos for all currently known cases where the default scraper scrapes an incorrect photo or misses the search query.
Some photos were collected manually from other sources. In addition to distributing the already-resized versions of these, raw versions of these photos (best available quality/resolution) are stored in `images/raw/manual/`. Information about where each of these images came from is stored in `config/provenance.json`. These images are automatically downsampled and cropped when running the processing steps below.
We use facial recognition for two purposes. First, for intelligent cropping of the images. This use requires OpenCV and is described in the `face_detect_crop()` method of `constrain_images.py`. This use does not require any configuration or external API access.
Our second use is gaze detection. To ensure a more uniform set of images, we want all images to face the same direction. We use the Azure Face API for this. If API keys are present in `config/facial_recognition.json`, then `constrain_images.py` (which resizes and re-aspects input images) will additionally detect which direction the face is pointing and, if necessary, flip the image so that it faces stage left (our right). Code describing the lookup is in `constrain_images.py` under `needs_horizontal_flip()`. To set up this API, copy `config/facial_recognition_blank.json` to `config/facial_recognition.json` and fill out the two fields with valid credentials.
- `constrain_images.py` will resize, re-aspect, flip, and optimize images. Images will move from `images/raw/<source>/<file>.<ext>` to `images/<source>/<file>.jpg` (a simplified sketch follows this list).
- `scrape_all.sh` will scrape Bioguide and Wikipedia, perform the manual Wikipedia overrides, and then constrain the images, in that order. This should regenerate the repository essentially as-is from scratch.
- `config/config.json`: User-Agent for the scraper and some default URLs, as well as database connection info if you are connecting to a MongoDB database to search members.
- `config/facial_recognition_blank.json`: A blank template for inserting an Azure Face API key/endpoint; see the Facial Recognition section for details.
- `config/bio_guide_results.json`: Blacklist for the Congressional Bioguide.
- `config/wiki_results.json`: Blacklist for Wikipedia and greylist (articles recently scraped, confirmed to contain nothing, skipped for a while).
- `config/parties.json`: Party metadata, used both for checking Wikipedia articles and for outputting party names.
- `config/states.json`: State metadata, used both for checking Wikipedia articles and for outputting state names.
- `config/database-raw.json`: Large raw database dump, used for flat-file searches. Generated by `config/dump_db_to_flatfile.py`.
- `config/haarcascade_frontalface_default.xml`: [Pre-trained OpenCV](https://github.com/opencv/opencv/tree/master/data/haarcascades) facial detection classifier.
- `config/dump_db_to_flatfile.py`: Dumps the current Mongo database to a flatfile (use this first, to update the local flat file). Requires our local MongoDB instance.
- `config/compile_members.py`: Dumps the current images to a `members.csv` file. Can take `--type flat` to dump from the flat file. (Use this after updates to correctly log the information in the members file.)
- `verify.py`: Runs basic sanity tests to ensure the data is intact. Used in our Travis CI build.
- `upload_raw.sh`: Uploads the current folder's raw images to our S3 store.
- `download_raw.sh`: Downloads our S3 store's set of raw images to your local copy of this repository.
- `constrain_images.py`: Powers some of the image resizing behind the scenes.
We welcome contributions of photos or code improvements. For code improvements, please open a pull request.
For sources for photos, please see our Issues page. If you are contributing a photo to an existing issue, just reply with a comment including the photo (highest resolution possible, with information about where the photo is from and any rights issues). If no existing issue seems applicable, or if you are letting us know about a new source of many photos, please open a new Issue. We believe that the use of low-resolution images of historical public figures, freely obtained largely from public domain or government sources, constitutes fair use. Please ensure that any images you suggest are cleared for use by voteview.com and users of this repository.