This repository contains data related to the activities of ~35,000 police officers in the Chicago Police Department (CPD), including ~11,000 tactical response reports from 2004-2016 and ~110,000 civilian and administrative complaints from 2000-2018. The data was obtained through a series of requests under the Freedom of Information Act (FOIA), coordinated by the Invisible Institute.
Details about the FOIA requests and which information about the CPD they cover can be found in the file raw/datasets.csv. The original data that serves as the starting point for this repository was imported from the Invisible Institute's download page.
This code requires Python >= 3.8 and GNU Make >= 4.3 (it will not work with earlier versions).
You will need the xlrd and openpyxl packages to read .xls and .xlsx files, respectively. Optionally, if you plan to contribute changes to the code in this repository, you will also need the black package for code formatting.
All Python dependencies can be installed by running
pip install -r requirements.txt
in the repository root folder.
We have included a .pdf of the documentation in the current release version. If you want to compile the documentation yourself from the source file docs/main.tex, you can either compile it as you normally would with your favourite LaTeX compiler (e.g., with pdflatex and bibtex), or run

make

in the docs/ folder to compile it with latexrun.
To build the cleaned and linked data, run

make

in the repository root folder. This will create a single cleaned and linked set of data in the final/ folder, where all records (officers, complaints, and tactical response reports) are associated with unique IDs that enable linkage among the records. See the documentation main.pdf for an in-depth discussion of the data cleaning and linking.
In brief, the make command performs two primary data-processing steps.
First, in the cleaning step, the raw Excel files are converted to .csv files and field names are uniformized across files. To perform just the cleaning step, run the following command in the repository root folder:
make prepare
This will create a tidy/ folder containing cleaned versions of the original raw data.
Second, in the linking step, records of officers appearing in the different data files are linked by cleaning and matching their attributes, removing erroneous entries, etc. The linking step produces the final clean data files listed below. To perform just the linking step (after you have already run the cleaning step), run the following command in the repository root folder:
make finalize
This will create a final/ folder containing the final cleaned and linked version of the data.
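The core idea of the linking step can be sketched as follows: officer records from different source files receive a shared ID when their cleaned identifying attributes match. The field names and matching key below are hypothetical illustrations, not the repository's actual linking logic (see the documentation for that):

```python
# Hypothetical officer records from different source files; field names
# are illustrative only.
records = [
    {"source": "roster", "first": "JOHN", "last": "SMITH", "appointed": "2001-05-07"},
    {"source": "trr", "first": "John ", "last": "smith", "appointed": "2001-05-07"},
    {"source": "complaints", "first": "JANE", "last": "DOE", "appointed": "1998-03-02"},
]

def link_key(r):
    # Normalize attributes before matching, so formatting differences
    # between source files do not prevent a match
    return (r["first"].strip().lower(), r["last"].strip().lower(), r["appointed"])

ids = {}
for r in records:
    key = link_key(r)
    # Records with the same normalized key share a unique officer ID
    r["uid"] = ids.setdefault(key, len(ids) + 1)

for r in records:
    print(r["source"], r["uid"])  # the roster and TRR records share uid 1
```

The real pipeline additionally handles erroneous and duplicate entries, which is why final/erroneous_officers.csv exists as a separate output.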
Once you have completed the above build step, the repository will contain the cleaned and linked data. In particular, the following files will have been generated:
- final/roster.csv: A merged and linked roster of all unique officers in the data
- final/officer_profiles.csv: A list of all officers, including duplicate entries when an officer appears in multiple source files
- final/erroneous_officers.csv: A list of probable erroneous/duplicate officer records in the original data
- final/unit_assignments.csv: A list of unit assignments for each officer with start and end date
- final/unit_descriptions.csv: A list of unit names
- final/complaints.csv: Formal complaints filed against officers
- final/complaints_officers.csv: The officers involved in the complaints, with allegations, findings, and sanctions
- final/tactical_response_reports.csv: Forms that officers are required to file when their response involves use of force
- final/tactical_response_reports_discharges.csv: Details about the weapons used as part of the use of force recorded in the TRR
- final/awards.csv: A list of awards requested for officers, request date, and result
- final/salary.csv: A list of officer salaries, positions, and paygrades
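Because the final files share officer IDs, joining them is straightforward. The sketch below uses only the standard library; the ID column name ("uid") is an assumption for illustration, so check description.md for the actual field names before using it:

```python
import csv

def officers_with_complaints(roster_path, complaints_officers_path, id_col="uid"):
    """Join the officer roster to per-officer complaint rows on a shared
    ID column. The default column name 'uid' is an assumption; see
    description.md for the actual field names."""
    # Index roster rows by officer ID
    with open(roster_path, newline="") as f:
        roster = {row[id_col]: row for row in csv.DictReader(f)}
    # Attach the matching roster fields to each complaint row
    joined = []
    with open(complaints_officers_path, newline="") as f:
        for row in csv.DictReader(f):
            officer = roster.get(row[id_col])
            if officer is not None:
                joined.append({**officer, **row})
    return joined
```

For example, officers_with_complaints("final/roster.csv", "final/complaints_officers.csv") would return one dict per complaint row, augmented with the matching officer's roster fields.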
A detailed description of the fields present in all of these files may be found in description.md.
You will find Jupyter notebooks in the examples/ folder that reproduce the visualizations in the documentation. In Jupyter Lab/Notebook, run Kernel -> Restart & Run All to run a notebook. Note: the notebooks are currently coded such that they must be run in linear, top-to-bottom order (hence Kernel -> Restart & Run All).
If you use this dataset in your own project, please cite our paper published in the NeurIPS 2021 Track on Datasets and Benchmarks:
Thibaut Horel, Lorenzo Masoero, Raj Agrawal, Daria Roithmayr, and Trevor Campbell. The CPD Data Set: Personnel, Use of Force, and Complaints in the Chicago Police Department. Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, 2021.
@inproceedings{Horel_NeurIPS21,
author = {Horel, Thibaut and Masoero, Lorenzo and Agrawal, Raj and Roithmayr, Daria and Campbell, Trevor},
booktitle = {Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks},
editor = {J. Vanschoren and S. Yeung},
pages = {},
publisher = {Curran},
title = {The CPD Data Set: Personnel, Use of Force, and Complaints in the Chicago Police Department},
url = {https://datasets-benchmarks-proceedings.neurips.cc/paper_files/paper/2021/file/7f6ffaa6bb0b408017b62254211691b5-Paper-round2.pdf},
volume = {1},
year = {2021}
}
Copyright 2021 Thibaut Horel, Trevor Campbell, Lorenzo Masoero, Raj Agrawal, Daria Roithmayr
The code that cleans and links the data, as well as the code that produces the
documentation for this project, is licensed under the MIT License;
see MIT-LICENSE.txt
for the license text.
The dataset that is produced by the code is licensed under the
Creative Commons 4.0 Attribution NonCommercial ShareAlike License;
see CC-BY-NC-SA-LICENSE.txt
for the license text.
The header image in this README is by Bert Kaufmann via Wikimedia Commons (CC BY 2.0).