forked from stanford-policylab/opp
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathtutorial.Rmd
968 lines (797 loc) · 40.3 KB
/
tutorial.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
---
title: "Open Policing Project Tutorial"
output:
html_document:
code_folding: hide
---
## Setup
First, let's load the necessary libraries and data that will allow us to
begin our investigation!
```{r setup, message=FALSE, warning=FALSE}
## Libraries to include
library(tidyverse)
library(lubridate)
# For Veil of Darkness
library(lutz)
library(suncalc)
library(splines)
## Load the data
# Replace the path below with the path to where your data lives
data_path <- "~/Desktop/pa_philadelphia_2019_02_25.rds"
stops <- read_rds(data_path)
# Additional data and fixed values we'll be using
population_2017 <- tibble(
subject_race = c(
"asian/pacific islander", "black", "hispanic", "other/unknown","white"
),
num_people = c(110864, 648846, 221777, 39858, 548312)
) %>%
mutate(subject_race = as.factor(subject_race))
center_lat <- 39.9525839
center_lng <- -75.1652215
```
## Covering the basics
The core of `R` is the dataframe. We've given you one to start with, in the form
of `stops`. Think of dataframes like a spreadsheet: they have rows and columns.
Usually, rows are a "datapoint": in `stops`, each row corresponds to a single
stop from Philadelphia. The columns are the "variables": again, in `stops`,
these are the things we know about the stop, like where the stop happened, the
age of the stop subject, whether an arrest was made, and so on.
First, let's take a look at what columns exist in our `stops` dataframe.
```{r}
colnames(stops)
```
How many stops do we have in our dataset? (**Hint**: Try the `nrow()` function.)
```{r}
nrow(stops)
```
What date range does our data cover? One way to do this would be to examine the
`min()` and `max()` dates in our dataframe. The problem is we can't just apply
`min()` to our dataframe: doing so produces an error.
The `min()` function wants its input to be just one column from the `stops`
dataframe. To access a single column from a dataframe, use `$`: what goes before
`$` is the dataframe, and what comes after is the name of the column we want.
(**Hint**: What happens if you type `stops$date` into the console?)
```{r}
min(stops$date)
max(stops$date)
```
Since we only have four and a half months of data for 2018, let's filter it out.
Note that there are ways to deal with partial years in analysis, but to make
things easier for ourselves, let's focus on 2014-2017.
To do this we'll want to:
1. Use the `filter()` function to specify the years we want, and
2. Use the assignment operator `<-` (it's like = in `R`) to overwrite `stops`.
#### Aside: `filter()`
The `filter()` function is used to separate rows from the dataframe that
interest us from rows that do not. In particular, `filter(DATA, CONDITION)`
returns `DATA` with all of the rows that don't satisfy `CONDITION` removed. For
instance, we might want to only look at stops where the subject was arrested. To
do this, we would type `filter(stops, arrest_made == TRUE)`, since we only want
rows from `stops` where the `arrest_made` column is (i.e., `==`) `TRUE`. (**Note**:
we could actually just type `filter(stops, arrest_made)`. Do you see why?)
You might have noticed in the colnames, that we have no `year` column (our data
only has `date`). These dates are full dates that look like `YYYY-MM-DD`.
Luckily, with the `lubridate` package, getting the year from
the date is as easy as `year(date)`!
```{r}
stops <- filter(stops, year(date) < 2018)
```
How many stops do we have now?
```{r}
# NOTE that this is equivalent to `nrow(stops)`
stops %>% nrow()
```
**tidyverse tip**: The "pipe" operator, `%>%`, helps to keep our code clean. It
just places the previous item into the first argument of the function. So,
`x %>% f()` is simply `f(x)`. While in a one-function call the pipe might feel
silly and unnecessary, it's going to become _really_ helpful once we start
wanting to do multiple transformations to our data. For instance, that means
that instead of typing `nrows(filter(stops, year < 2018))` we can write
`stops %>% filter(year < 2018) %>% nrows()`, which is easier to understand.
Now, you may have noticed before that there was a column called `type`. Let's
take a closer look to see what that means.
#### Aside: `select()`
A few minutes ago, we used `filter()` to whittle our dataframe down to just the
rows that interested us. What happens if we want to whittle down the columns
instead? The `select()` function is designed to solve exactly this problem. It
takes a dataframe, and gets rid of all of the columns you don't ask for. For
instance, `select(stops, lat, long)` produces a dataframe with exactly two
columns: `lat` and `long`. If we wanted to know the time as well, we could type
`select(stops, lat, long, time)`.
Try using `select()` to look at the `type` column.
```{r}
stops %>%
select(type) %>%
head(10) # this allows us to look at just the first 10 rows
```
Ah! Some of our stops are vehicular (i.e., traffic) stops, but some are
pedestrian stops. Many of the larger cities in our database provided us with
both pedestrian and vehicular data. We've provided it all for you so that you
can dive in and uncover stories relating to pedestrian stops too! Many of the
analysis techniques we'll be covering today can be applied to the pedestrian
data as well. And we encourage you to spend some time on your own replicating the
relevant analyses on the pedestrian data. But for now, let's filter to just
vehicular stops, since one of our analyses is only relevant for traffic stops,
and since this matches what the majority of the data in our database looks like.
```{r}
stops <- stops %>% filter(type == "vehicular")
```
How many stops do we have now?
```{r}
stops %>% nrow()
```
It'd be nice to see if the 1.1 million stops were evenly distributed across
years, or if stop counts have changed. To find stop counts per year, we need to
define a notion of `year` (recall that our data only has `date`).
The `count()` function gives us an easy way to tally up observations over any
number of groupings we desire. In this case, let's count over year.
#### Aside: `mutate()`
Notice, though, that we have the same problem as before: there's no `year`
column in the `stops` dataframe. We'll use the `mutate()` function to fix that.
The `mutate()` function adds new columns to a dataframe based on old columns.
The basic setup is `mutate(DATA, NEW_COL = FUN(OLD_COL))` where `DATA` is our
dataframe, `NEW_COL` is the name of the new column we want, and `FUN` is the
function we apply to the old column, `OLD_COL`, to get it.
Use the `year()` and `mutate()` functions to add a new column called `year`
to our `stops` dataframe and then use `count()` to determine the number of stops per year.
```{r}
stops %>%
mutate(year = year(date)) %>%
count(year)
```
How about counts over race?
```{r}
stops %>%
count(subject_race)
```
Let's make another table that gives us the proprtion of stops by race.
(There are a few equivalent ways to do this---choose the method that feels
most natural to you.)
```{r}
# This method builds off of using `count` as above
stops %>%
count(subject_race) %>%
mutate(prop = n / sum(n))
# This method uses the group_by/summarize paradigm
stops %>%
group_by(subject_race) %>%
summarize(
n = n(),
prop = n / nrow(.)
)
```
At first glance, we see there are far more stops of black drivers than any other
race. About two-thirds of stops in our dataset were of black drivers! This
stat on is own, though, doesn't actually say much. We'll return to this more
rigorously later on.
How about counting how many stops by year _and_ race?
```{r}
stops %>%
# notice that you can also mutate `date` *within* the count funciton
count(year = year(date), subject_race)
```
Now we've gotten to the point where we have too many rows to interpret in a
single table. A simple visualization could really help us out here.
```{r}
stops %>%
count(year = year(date), subject_race) %>%
ggplot(aes(x = year, y = n, color = subject_race)) +
geom_point() +
geom_line()
```
We used `ggplot` to make this plot. This package provides a lot of options for
customization, and there are many tweaks we could make to this plot to make it
nicer. Its syntax is a little complicated, though, and in our data
exploration, it's enough just to get a coarse visual like this.
From this plot we see that, at least for black and white drivers, the annual
trends are very different by race. (It's hard to tell from this plot for
drivers of other races because the counts are comparatively so much smaller.)
All races experienced a spike in enforcement in 2015, but thereafter, there
were fewer white drivers stopped (in 2016 and 2017), whereas there continued to
be an _increase_ in number of black drivers stopped over those two years.
This is already a potential lead! We'd have to investigate further to see what
the trend looks like when adjusting for population changes over time or other
factors we haven't considered in this example. But it's often simple
explorations like this that can uncover the path to stories.
**Fun fact**: This same trend does _not_ hold true for pedestrian stops. In fact,
if we hadn't filtered to only vehicular stops, we wouldn't have noticed this,
because this same plot over the combined pedestrian and vehicular data looks
unremarkable. The takeaway is that looking at trends by sub-categories can often
be very helpful. (E.g., in Nashville, looking at the different listed stop
reasons uncovers the extent of the disparities.)
To give us time to dive into a variety of analysis methods, we'll let you
investigate Philly's annual traffic stop patterns on your own time. But because
it seems like the disparities could be changing over time, we'll filter to
only 2017 and focus on that data for the rest of our analysis (and overwrite
`stops`).
```{r}
stops <- stops %>% filter(year(date) == 2017)
```
## Benchmark test
We saw before that over two-thirds of stops were of black drivers. The by-race
stop counts are only meaningful, though, when compared to some baseline. If
the Philadelphia population was about two-thirds black, then two-thirds of stops
being of black drivers wouldn't be at all surprising.
### Stop rates
In order to do this baseline comparison, we need to understand the racial
demographics in our Philly population data. The data as we've given it to you
has raw population numbers. To make it useful, we'll need to compute the
_proportion_ of Philadelphia residents in each demographic group. (Hint: use the
`mutate()` function.)
```{r}
population_2017 %>%
mutate(prop = num_people / sum(num_people))
```
As an eyeball comparison leads us to see that black drivers are being stopped
disproportionately, relative to the city's population. But let's be a bit more
rigorous about this. If we join the two tables together, we can compute stop
rates by race (i.e., number of stops per person). Remember to take into acount
how many years are in your stop data, in order to get a true value of stops per
capita; we're using only 2017 for stops and for population, so we're in good
shape.
#### Aside: `left_join()`
The way we put tables together is with the `left_join()` function. We need to
input three things into this function: the two tables we'd like to combine,
along with instructions on how to join them. In this case, the two tables we
want to join are the table of stops counted according to `subject race` and
`population_2017`. The instruction for combining the tables is to join rows that
contain the same race, so we have to also give `left_join()` the argument `by =
"subject_race"`. This means that `left_join()` will look at the first table---
i.e., the table stops counted by race---and then go to the `subject_race` column
in each row. Then, `left_join()` will take what it finds there---in this case,
`"asian/pacific islander"`, `"black"`, `"hispanic"`, `"other/unknown"`, and
`"white"`---and look in the second table, i.e., `population_2017`, for all the
rows that contain the same information in `population_2017`'s race column. Then,
it will add the second row at the end of the first to create a new row with
information from both. What we end up with is a dataframe with all of the
columns from _both_ tables.
The process is a little complicated, and we won't use it too much, so don't
worry if the abstract description doesn't make sense. To get a better
understanding of what's going on, try joining the two tables described above,
being sure to include the `by = "subject_race"` argument.
```{r}
stops %>%
count(subject_race) %>%
left_join(
population_2017,
by = "subject_race"
) %>%
mutate(stop_rate = n / num_people)
```
Good! Now we can divide the black stop rate by the white stop rate to be able to make
a quantitative statement about how much more often black drivers are stops
compared to white drivers, relative to their share of the city's population.
Black drivers are stopped at a rate 3.4 times higher than white drivers, and
Hispanic drivers are stopped at a rate 1.5 times higher than white drivers.
### Search rates
Let's do the same sort of benchmark comparison for search and frisk rates. These
are easier than the last one since we don't need an external population benchmark.
We can use the stopped population as our baseline, defining search rate
to be proportion of stopped people who were subsequently searched, and frisk rate
as proportion of stopped people who were subsequently frisked. Let's get
these values by race.
#### Aside: `group_by()` and `summarize()`
One thing that we often want to do with data is disaggregate it. That is, we
might want to take the data and break it down into smaller subpopulations. Then,
when we ask questions, we can ask about each piece---for instance, each
demographic group, or each police district---instead of asking about the population
as a whole.
The way to do this in `R` is with `group_by()` and `summarize()`. The standard way
to use `group_by()` is to call `group_by(DATA, COL_NAME)`, where `DATA` is our
dataframe and `COL_NAME` is the name of a column. What `group_by()` then does is
take all the rows in the dataframe `DATA` and put them into different groups,
one for each different value in the column `COL_NAME`. So, for instance, if we
called `group_by(stops, district)`, `R` would hand back to us the `stops`
dataframe with all of its columns broken into different groups, one for each
police district.
The second step is to do something with those groups. That's what `summarize()`
does. The way `summarize()` works is to take a dataframe broken into groups by
`group_by()` and calculate a statistic for each group . The basic syntax is
`summarize(DATA, STAT = FUN(COL_NAME))`, where `DATA` is some dataframe broken
up by `group_by()`, `STAT` is some statistic we want to calculate, `FUN` is the
function that calculates that statistic, and `COL_NAME` is the name of the
column (or columns) used to calculate the statistic.
Let's put it all together with an example. A straightforward one is: `stops %>%
group_by(subject_race) %>% summarize(arrest_rate = mean(arrest_made))`. Typing
this into the `R` console will calculates arrest rates disaggregated by race.
Using `group_by()` and `summarize()`, let's calculate frisk rates by race.
```{r}
stops %>%
group_by(subject_race) %>%
summarize(
search_rate = mean(search_conducted, na.rm = T),
frisk_rate = mean(frisk_performed, na.rm = T)
)
```
Note that with functions like `mean()`, if any of the values that are being
averaged are `NA`, the output value of the mean will simply be `NA`. We don't have
to worry about that with our Philly data, because it's super clean. But you may
encounter this problem in some of your own data. The easy workaround is to pass
the argument `na.rm = T` into `mean()`---as you see in the code above. This tells
`R` to remove the `NA` values and compute the mean over all the non-NA values.
Now let's dive into these results! As with the stop rates, we can make a
quantitative claim about disparities in search and frisk rates by dividing the
minority rate by the white rate. Here we see that among drivers who were
stopped, black drivers were searched at a rate 1.5 times higher than white
drivers, and Hispanic drivers were searched at a rate 1.2 times higher than
white drivers. Black drivers were frisked at a rate 2.1 times higher than
white drivers were, and Hispanic drivers were frisked at a rate 1.5 times
higher than white drivers were.
### Caveats about the benchmark test
While these baseline stats give us a sense that there are racial disparities in
policing practices in Philadelphia, they are not evidence of discrimination. The
argument against the benchmark test is that we haven't identified the correct
baseline to compare to.
For the stop rate benchmark, what we really want to know is what the true
distribution is for individuals breaking traffic laws or exhibiting other
criminal behavior in their vehicles. If black and Hispanic drivers are
disproportionately stopped relative to their rates of offending, that would be
stronger evidence. Some people then proposed to use benchmarks that approximate
those offending rates, like arrests, for example. However, we know arrests to
themselves be racially skewed (especially for low-level drug offenses, for
example), so it wouldn't give us the true offending population's racial
distribution. Furthermore, only `r stops %>% summarize(p = round(100 *
mean(arrest_made), 1)) %>% pull(p)`% of stops result in an arrest, so the
arrested population will naturally not match the stopped population.
An even simpler critique of the population bechmark for stop rates is that it
doesn't account for possible race-specific differences in driving behavior,
including amount of time spent on the road (and adherence to traffic laws, as
mentioned above). If black drivers, hypothetically, spend more time on the
road than white drivers, that in and of itself could explain the higher stop
rates we see for black drivers, even in the absence of discrimination.
Search and frisk rates are slightly less suspect, since among the stopped
population, it's more reasonable to believe that people of different races
offend at equal rates. In the context of searches, this means assuming that
all races exhibit probable cause of possessing contraband at equal rates. And
in the case of frisks, this means assuming that all races exhibit reasonable
articulable suspicion of possessing a weapon at equal rates. Of course, one
could also argue against these assumptions. One could claim that the stopped
population isn't a good measure of the true racial distribution of probable
cause. This is all to say that while benchmark stats are a good place to
start, more investigation is required before we can draw any conclusions.
## Outcome test
To circumvent the benchmarking problem, it's common to turn to the search
decision, rather than the stop decision. This is because we have a notion of
what a "successful" search is. The legal justification for performing a search
is probable cause that the driver possesses contraband. So a successful search
is one which uncovers contraband.
We thus turn to rates of successful searches. That is, what proportion of
searches, by race, were successful? This proportion is known as the contraband
recovery rate, or the "hit rate." If racial groups have different hit rates, it
can imply that racial groups are being subjected to different standards.
As a caricatured example, suppose among white drivers who were searched,
officers found contraband 99% of the time, while among black drivers who were
searched, officers found contraband only 1% of the time. This would lead us to
believe that officers made sure they were _certain_ white individuals had
contraband before deciding to search, but that they were searching black
individuals on a whiff of evidence.
Let's investigate hit rates by race in Philly in 2017.
```{r}
stops %>%
filter(search_conducted) %>%
group_by(subject_race) %>%
summarize(
hit_rate = mean(contraband_found, na.rm = T)
)
```
We see that hit rates are slightly lower for black and Hispanic drivers than
for white drivers.
However, what if hit rates vary by police district? If the bar for stopping
people, irrespective of race, is lower in certain police districts, and black
individuals are more likely to live in neighborhoods in those districts, then
the observed disparities may not reflect bias.
### Adjusting for location
Now let's compute hit rates by race _and_ district. Name your result `hit_rates`.
```{r}
hit_rates <-
stops %>%
filter(search_conducted) %>%
group_by(subject_race, district) %>%
summarize(hit_rate = mean(contraband_found, na.rm = T))
hit_rates
```
Again, this is too many hit rates to compare in one table. To plot the hit rates
of black vs. white drivers and of Hispanic vs. white drivers, we need to
reshape our table to have each row containing a district, a minority race,
minority hit rate in that district, and white hit rate in that district. We'll
walk you through the code below that reshapes the data for us.
```{r}
# Reshape table to show hit rates of minorities vs white drivers
hit_rates <-
hit_rates %>%
filter(subject_race %in% c("black", "white", "hispanic")) %>%
spread(subject_race, hit_rate, fill = 0) %>%
rename(white_hit_rate = white) %>%
gather(minority_race, minority_hit_rate, c(black, hispanic)) %>%
arrange(district)
hit_rates
```
**Quick Tip**: You might have noticed a strange symbol in the code above:
`%in%`. This is really just shorthand for the sentence, "the `subject_race` is
`"black"` or it is `"white"` or it is `"hispanic"`." Without the `%in%` syntax, we
would have to write this in `R` as: `subject_race == "black" | subject_race ==
"white" | subject_race == "hispanic"`. The `%in%` syntax is a lot shorter and
clearer, as we think you'll agree.
Now let's plot it!
```{r}
# We'll use this just to make our axes' limits nice and even
max_hit_rate <-
hit_rates %>%
select(ends_with("hit_rate")) %>%
max()
hit_rates %>%
ggplot(aes(
x = white_hit_rate,
y = minority_hit_rate
)) +
geom_point() +
# This sets a diagonal reference line (line of equal hit rates)
geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
# These next few lines just make the axes pretty and even
scale_x_continuous("White hit rate",
limits = c(0, max_hit_rate + 0.01),
labels = scales::percent
) +
scale_y_continuous("Minority hit rate",
limits = c(0, max_hit_rate + 0.01),
labels = scales::percent
) +
# This makes sure that 1% on the x-axis is the same as 1% on the y-axis
coord_fixed() +
# This allows us to compare black v. white and Hispanic v. white side by
# side, in panels
facet_grid(. ~ minority_race)
# Depending on your version of ggplot2, you may be able to use the syntax
# below (the newer ggplot2 syntax)---which is clearer, in my opinion.
# But older versions of ggplot2 will only accept the above syntax
# facet_grid(cols = vars(minority_race))
```
This is a good start. However, it's hard to tell where the mass is. That is,
maybe the points above the dotted line (i.e., the points where minority hit
rates are _higher_ than white hit rates), maybe those districts have all of
Philadelphia's population, and the districts below the line only represent
a few people. While this is unlikely, it's still good to have a marker of how
much weight or emphasis to give each of the points on our plot.
Let's therefore size each of the points by number of searches.
```{r}
# Get corresponding number of searches (to size points).
# Again, for each district we want to know the number of white+black searches
# and white+Hispanic searches. This requires the same spreading and gathering
# as our previous data-munging.
search_counts <-
stops %>%
filter(
search_conducted,
subject_race %in% c("black", "white", "hispanic")
) %>%
count(district, subject_race) %>%
spread(subject_race, n, fill = 0) %>%
rename(num_white_searches = white) %>%
gather(minority_race, num_minority_searches, c(black, hispanic)) %>%
mutate(num_searches = num_minority_searches + num_white_searches) %>%
select(district, minority_race, num_searches)
hit_rates %>%
left_join(
search_counts,
by = c("district", "minority_race")
) %>%
ggplot(aes(
x = white_hit_rate,
y = minority_hit_rate
)) +
geom_point(aes(size = num_searches), pch = 21) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
scale_x_continuous("White hit rate",
limits = c(0, max_hit_rate + 0.01),
labels = scales::percent
) +
scale_y_continuous("Minority hit rate",
limits = c(0, max_hit_rate + 0.01),
labels = scales::percent
) +
coord_fixed() +
facet_grid(. ~ minority_race)
# same facet_grid syntax issue here as before!
# facet_grid(cols = vars(minority_race))
```
It seems like we can now say with confidence that even looking within location,
nearly every police district has lower hit rates for minorities than white
drivers.
Note that we can also run a simple model (logistic regression) to quantify how
much lower the odds are of recovering contraband from searching a black or
Hispanic driver versus from searching a white driver, adjusting for location.
### Investigating anomalies
One thing that we should always keep an eye out for, both as journalists and as
data scientists, are anomalies in the data that could be changing the meaning of
our results.
In the plot above, you may have noticed a rather larger point (i.e., a district
with a lot of searches) that has a hit rate of essentially zero for both
white and minority drivers. That's a red flag that something's up.
Let's figure out what district that is by filtering our `hit_rates` table
to the smallest white hit rate.
```{r}
hit_rates %>%
filter(white_hit_rate == min(white_hit_rate))
```
Next, let's take a look at what addresses are usually written in for stops in
this district.
```{r}
# Hint: Note that districts in our dataset are encoded as characters,
# so 77 is actually "77"
stops %>%
filter(district == "77") %>%
count(location, sort = T)
```
Depending on how you investigated the `location`, you might have found that the
majority of the stops happen at the airport, or have location listed as `NA` or
unavailable. Indeed, if we check the Philadelphia Police's website, we'll find
that there is no District 77, and that the airport does not fall within the
jurisdiction of any particular police district. Between this information and the
fact that hit rates are essentially zero for this district, we can conclude that
the types of searches happening in this location are qualitatively different
from searches in our other locations.
Without knowing more about stops and searches in this district, it makes sense
to remove them from our hit rate analysis. Try computing citywide hit rates
again, this time filtering out district 77.
```{r}
# Remember: Districts in our dataset are encoded as characters,
# so 77 is actually "77"
stops %>%
filter(search_conducted, district != "77") %>%
group_by(subject_race) %>%
summarize(
hit_rate = mean(contraband_found, na.rm = T)
)
```
Wow! A sizeable share of searches of white individuals must have been taking
place at the airport with zero hit rates, deflating the white hit rate we
were calculating before. But this didn't influence hit rates of black or
Hispanic drivers as much. Our aggregate hit rate analysis now shows stronger
evidence of bias, similar to our by-location plot. (Note, thought, that
aggregate and by-location hit rates do not necessarily always tell the same
story---a phenomenon known as Simpson's Paradox---which is why it's important
to check both to get the full picture.)
### Caveats about the outcome test
While the outcome test is a very simple and intuitively appealing test, it doesn't
allow us to observe the actual threshold for searching someone, it only allows
us to observe outcomes. There is a subtle statistical flaw with the outcome
test, known as _infra-marginality_, which in certain cases can lead to the outcome
test indicating discrimination when equal thresholds are actually being applied,
and in other cases can lead to the outcome test indicating equal hit rates when
different thresholds are actually being applied. We thus often use the threshold
test (a more complex test that uses both search and hit rates to infer thresholds),
in order to validate the outcome test.
(**Fun fact**: The threshold test confirms the results we see in Philadelphia.
Officers are indeed applying lower thresholds when deciding to search black and
Hispanic drivers than when deciding to search white drivers.)
We don't have time to go into the threshold test now. The important takeaway,
though, is that every statistical test has its flaws. So when reporting on
data, we have to make sure not overstate our findings.
## Veil of Darkness test
The outcome test is a good way to get around the benchmarking problem at the
search decision, but when it comes to assessing bias at the stop decision,
it's a little trickier. The outcome test usually doesn't work, since rarely
is there a notion of what a "successful" stop is. (There are, of course, some
exceptions to this, like when stop reason is listed as suspected Criminal
Possession of a Weapon (CPW)---which is the case in NYC stop-question-frisk
data.)
An alternative approach to assessing bias in stop decisions was proposed
by Grogger and Ridgeway in 2006. This approach, known as
the Veil of Darkness test, relies on the hypothesis that officers who are
engaged in racial profiling are less likely to be able to identify a driver's
race after dark than during daylight. Under this hypothesis, if stops made
after dark had smaller proportion of black drivers stopped than stops made
during daylight, this would be evidence of the presence of racial profiling.
The naive approach is just to compare whether daytime stops or nighttime stops
have higher proportion of black drivers. This, however, is problematic. More black
drivers being stopped at night could be the case for a multitude of reasons:
different deployment and enforcement patterns, different driving patterns,
etc. All of these things correlate with clock time and would be confounding our
results. So let's find a way to adjust for clock time.
To adjust for clock time, we want to compare times that were light at some
point during the year (i.e., occurred before dusk) but were dark at other points
during the year (i.e., occurred after dusk). This is called the "inter-twilight
period": the range from the earliest time dusk occurs in the year to the latest
time dusk occurs in the year.
Let's calculate the sunset and dusk times for Philly in 2017.
(Note that we distinguish here between _sunset_ and _dusk_. Sunset is the point
in time where the sun dips below the horizon. Dusk, also known as the end of
civil twilight, is the time when the sun is six degrees below the horizon, and
it is widely considered to be dark. We'll explain why this is relevant in a
moment.)
```{r}
# Get timezone for Philly
tz <- lutz::tz_lookup_coords(center_lat, center_lng, warn = F)
# Helper function
time_to_minute <- function(time) {
hour(hms(time)) * 60 + minute(hms(time))
}
# Compute sunset time for each date in our dataset
sunset_times <-
stops %>%
mutate(
lat = center_lat,
lon = center_lng
) %>%
select(date, lat, lon) %>%
distinct() %>%
getSunlightTimes(
data = .,
keep = c("sunset", "dusk"),
tz = tz
) %>%
mutate_at(vars("sunset", "dusk"), ~format(., "%H:%M:%S")) %>%
mutate(
sunset_minute = time_to_minute(sunset),
dusk_minute = time_to_minute(dusk),
date = ymd(str_sub(date, 1, 10))
) %>%
select(date, sunset, dusk, ends_with("minute"))
```
Great. Now, using `sunset_times`, what is Philly's inter-twilight period?
```{r}
sunset_times %>%
filter(dusk == min(dusk) | dusk == max(dusk))
```
So the earliest sunset in 2017 was at around 4:30 p.m., in December (and it was
fully dark about 30 minutes later, at dusk, around 5:00 p.m.). The latest sunset time
in 2017 was at around 8:30 p.m., in late June (and it was fully dark about 30
minutes later, at dusk, around 9:00 p.m.).
For a time in this range, like 6:30 p.m., part of the year it's still light, and part
of the year, it's still dark. So we can compare the proportion of black drivers
stopped at 6:30 p.m. when it's light (usually in the summer) to the proportion of
black drivers stopped at 6:30 p.m. when it's dark (usually in the winter). This way,
we're really seeing the effect of darkness, since driving and deployment
patterns tend to change only with clock time. For example, the demographics
of commuters who are on the road at 6:30 p.m. likely stays pretty steady throughout
the year, compared to demographics of commuters at 5:00 p.m. versus at 9:00 p.m.
One quick note: We want to make sure to filter out stops that occurred in the
approximately 30-minute period between sunset and dusk (end of civil twilight),
since it is ambiguous whether that period is light or dark, which would muddle
up our analysis.
Let's join the sunset times to our stops data and:
1. Filter to the inter-twilight period;
2. Remove that ambigous pre-dusk period; and
3. For simplicity, compare only black and white drivers.
```{r}
vod_stops <-
stops %>%
left_join(
sunset_times,
by = "date"
) %>%
mutate(
minute = time_to_minute(time),
minutes_after_dark = minute - dusk_minute,
is_dark = minute > dusk_minute,
min_dusk_minute = min(dusk_minute),
max_dusk_minute = max(dusk_minute),
is_black = subject_race == "black"
) %>%
filter(
# Filter to get only the intertwilight period
minute >= min_dusk_minute,
minute <= max_dusk_minute,
# Remove ambigous period between sunset and dusk
!(minute > sunset_minute & minute < dusk_minute),
# Compare only white and black drivers
subject_race %in% c("black", "white")
)
```
How many stops are in our new, filtered data frame?
```{r}
vod_stops %>% nrow()
```
Now, to get some intuition, let's use our 6:30 p.m. example. If we filter the data
to stops that occurred in the narrow window from 6:30 p.m. to 6:45 p.m., we
can compute what proportion of stops were of black drivers for stops when it
was dark and compare that to the proportion for stops when it was not dark yet.
```{r}
vod_stops %>%
filter(time > hm("18:30"), time < hm("18:45")) %>%
group_by(is_dark) %>%
summarize(prop_black = mean(is_black))
```
Indeed, we see that the proportion is smaller when 6:30--6:45 p.m. is dark
than when it's light. According to the veil of darkness hypothesis, this is
because officers engaged in racial profiling can't determine drivers' races as
well at night. We know that the proportion of black drivers stopped dropped by
about 5% when it was dark and race was harder to identify. It's hard to say what
this means in terms of the extent or pervasiveness, though. We *cannot* make
*any* claims about whether some small percent of the officers were engaged in
stops that were 100% profiled on race, or whether all officers were engaged in
racial profiling in some small percentage of their stops, or anything of the
sort.
Furthermore, while this stat gives us some intuition that racial profiling might
be present in Philly, we've only looked at one time. In order to make sure the
result is robust, we want to check this for _every_ time in our inter-twilight
period. To do this, we can use _logistic regression_. The basic idea here is
that we're finding what our data implied about how darkness and clock time
influence the whether a stopped driver will be black. We then extract the
coefficient of darkness. (You can think of this loosely as the measure of how
much darkness---as opposed to clock time---influences whether a stopped driver
will be black.)
```{r}
mod1 <- glm(
is_black ~ is_dark + splines::ns(minute, df = 6),
family = binomial,
data = vod_stops
)
summary(mod1)$coefficients["is_darkTRUE", c("Estimate", "Std. Error")]
```
The fact that the coefficient is negative means that darkness _lessens_ the
likelihood that a stopped driver will be black, after adjusting for clock
time. This matches our intuitive result from 6:30 p.m.
The fact that the standard error is small means that this is a _statistically significant_
finding. Another important thing to do, besides checking
statistical significance of your results, is to do a bunch of different
robustness checks. The idea is that all models make some choices about how to
turn the question at hand into math, and we want to vary those choices to make
sure that none of them change the model result---that the result holds regardless
of, for example, how you choose to control for time. In our model we used a
natural spline with six degrees of freedom, but we could have chosen to use
3 degrees of freedom, or to do something way simpler like bin time to one-minute
or five-minute windows.
Another robustness check that often comes up with this type of modeling is
adding in a control for sub-geography. For the outcome test, we controlled
for police district under the proposition that maybe it was an intentional
policy decision for certain police districts to lower search thresholds
than other police districts---and if those districts happened to be
disproportionately black, that would "explain away" the bias we saw in the
aggregate outcome test results.
In the case of the Veil of Darkness test, one could make an analogous claim:
is the result that we're seeing just a product of certain districts being policed
more after dark and those districts have a lower proportion of black drivers
than the districts being policed before dark? If so, then adjusting for location
in our model would lead to the darkness coefficient being zero. But if the
coefficient on darkness doesn't change much (i.e., stays negative in a
statistically significant way), then that's another marker that our finding is
robust.
```{r}
mod2 <- glm(
is_black ~ is_dark + splines::ns(minute, df = 6) + as.factor(district),
family = binomial,
data = vod_stops
)
summary(mod2)$coefficients["is_darkTRUE", c("Estimate", "Std. Error")]
```
Looks like no matter how you slice it, the veil of darkness test shows racial
profiling of black drivers is present in Philly traffic stops.
### Caveats on the Veil of Darkness test
The veil-of-darkness test is a popular technique for assessing disparate
treatment, but, like all statistical methods, it comes with caveats.
(Hopefully that's one of the takeaways today---data and statistical analysis
can be very powerful, but it's good to know what the limitations are.) Perhaps
most importantly, darkness---after adjusting for time of day---is a function of
the date. As such, to the extent that driver behavior changes throughout the
year, and these changes are correlated with race, the test can suggest
discrimination where there is none. One way to account for seasonal effects
is to consider the brief period around daylight savings. Driving behavior
likely doesn't change much at, say, 6:00 p.m. the day before daylight savings to
the day after; but the sky is light on one day and dark on the next. Clever
checks like this can help circumvent some of the concerns about the veil of
darkness test. However, if driving behavior is more related to lighting than
time of day, the daylight savings test wouldn't help.
Another thing to note is that artificial lighting (e.g., from street lamps)
can weaken the relationship between sunlight and visibility, and so the method
may underestimate the extent to which stops are predicated on perceived race.
Other things, like vehicle make, year, and model often correlate with race and
are still visible at night, which could lead to the test under-estimating the
extent of racial profiling. Similarly, the test doesn't control for stop
reason, which is often correlated with both race and time of day. (Consider,
for example, pretextual broken tail light stops.)
Finally, the test only speaks to presence of racial profiling in the
intertwilight period -- doesn't say anything about other hours.
Nevertheless, despite these shortcomings, the test provides a useful if
imperfect measure of bias in stop decisions.
## Parting thoughts
Hopefully this gets you started on the basics of how to analyze data, and gives
you a little glimpse of some more complex modeling you can do with the data too.
While these analyses (from simple hit rates to modeling racial
profiling) can be really helpful in quantifying racial disparities, it's often
basic exploration of the data that yields the most interesting stories. So, the
number one rule in working with data is: *be curious!*
Make sure to read the [full paper](https://5harad.com/papers/100M-stops.pdf) for a
more in-depth analysis. And we look forward to reading about all of the analysis
you all do, and all the interesting stories you uncover!