forked from michaelpollmann/spatialTreat-example
-
Notifications
You must be signed in to change notification settings - Fork 0
/
01_place_in_grid.Rmd
318 lines (281 loc) · 12.7 KB
/
01_place_in_grid.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
---
jupyter:
jupytext:
formats: Rmd,ipynb
text_representation:
extension: .Rmd
format_name: rmarkdown
format_version: '1.2'
jupytext_version: 1.13.0
kernelspec:
display_name: R
language: R
name: ir
---
# 01: Place Data in Grid
## Required R packages
* `dplyr` for data management
* `reticulate` to call Python code for fast calculation of distances between many points
* `sf` to transform latitude and longitude into planar coordinates
```{r load-packages, message=FALSE, warning=FALSE, results="hold"}
library(dplyr)
library(reticulate)
library(sf)
```
and set the seed of the random number generator for reproducability.
```{r set-seed}
set.seed(24601)
```
## Required Python packages
The Python 3 code calculating distances between many points requires
* `numpy` for matrix operations
* `numba` for just-in-time compilation for faster computation
The Python code called from this R code is contained in the two files `dist_between_vectors.py` and `distance.py` which are included in the repository.
## Load the simulated example data
The data files included in the Github repository contain data
simulated to resemble actual business locations in the San Francisco Bay Area.
Neither locations nor visits are real.
* dat_S.rds contains the locations of grocery stores
* dat_A.rds contains the locations of (and foot-traffic to) nearby businesses of various industries
```{r load-data}
dat_S <- readRDS("data/dat_S.rds")
dat_A <- readRDS("data/dat_A.rds")
```
## Calculate distances from businesses to grocery stores
For (much) faster computation of the distances between the `r nrow(dat_A)` businesses and the `r nrow(dat_S)` grocery stores, we call Python code.
First, make the Python code available in R.
See the top of this document for the required Python packages.
```{r source-python}
source_python("dist_between_vectors.py")
```
Note that the function `dist_between` automatically omits from its ouput pairs of points that are far apart
(not within a 2.5 mile square of each other).
Next, calculate and save distances between businesses and grocery stores.
```{r dist-AS}
dat_D <- dist_between(as.matrix(dat_S %>% select(s_id,latitude,longitude)),
as.matrix(dat_A %>% select(a_id,latitude,longitude)))
colnames(dat_D) <- c("s_id","a_id","dist_m")
dat_D <- as_tibble(dat_D) %>%
mutate(s_id = as.integer(s_id),
a_id = as.integer(a_id),
dist_m = as.integer(dist_m),
dist_km = dist_m / 1000,
dist_mi = dist_m*0.62137119223733/1000)
saveRDS(dat_D,"data/dat_D.rds")
```
Similarly, calculate the distances between all grocery stores
```{r dist-SS}
dat_DS <- dist_between(as.matrix(dat_S %>% select(s_id,latitude,longitude)),
as.matrix(dat_S %>% select(s_id,latitude,longitude)))
colnames(dat_DS) <- c("s1_id","s2_id","dist_m")
dat_DS <- as_tibble(dat_DS) %>%
mutate(s1_id = as.integer(s1_id),
s2_id = as.integer(s2_id),
dist_m = as.integer(dist_m),
dist_km = dist_m / 1000,
dist_mi = dist_m*0.62137119223733/1000) %>%
filter(s1_id != s2_id)
saveRDS(dat_DS,"data/dat_DS.rds")
```
## Random points not near real grocery stores
The neural network code looks for counterfactual treatment locations in the larger neighborhoods of pre-specified points.
The first set of pre-specified points are the *real* treatment locations.
However, to find counterfactual locations that are not close to real locations, we also draw (many) random locations.
These random locations should be:
1. near other businesses (because locations not near any businesses are implausible counterfactual locations anyway),
2. not near real grocery stores (because these are covered by the first set of pre-specified points), and
3. not too close to one another (because the neural network searches the neighborhood of each point, we don't need the points to be close together)
To achieve 1 and 2, start with the locations of other businesses that are not very close (<0.2 miles) but also not very far (>2 miles) from their respective *nearest* grocery store.
```{r pick-isolated}
dat_S_isolated <- dat_D %>%
# find distance to nearest grocery store
group_by(a_id) %>%
summarize(dist_mi = min(dist_mi)) %>%
# keep in the relevant range
filter(between(dist_mi,0.2,2))
```
and move the points slightly around the business such that the center of an area does not point exactly at a different business
```{r move-random}
dat_S_isolated_random <- dat_S_isolated %>%
inner_join(dat_A, by="a_id") %>%
# randomly shift by approximately 0.025 - 0.05 miles to move away from business
# flip needed due to mean shift!
mutate(random_shock_lat = rnorm(n=n(),mean=0.0004,sd=0.0001),
random_shock_lon = rnorm(n=n(),mean=0.0004,sd=0.0001),
sign_flip_lat = (runif(n=n()) > 0.5)*2-1,
sign_flip_lon = (runif(n=n()) > 0.5)*2-1
) %>%
mutate(latitude=latitude+random_shock_lat*sign_flip_lat,
longitude=longitude+random_shock_lon*sign_flip_lon) %>%
mutate(s_id = as.integer(1000 + row_number())) %>%
select(s_id, latitude, longitude)
```
here, the random shocks are in degrees of latitude and longitude.
When all observations are reasonably close to one another, this is fine because these random points are not used directly as counterfactual locations.
When observations come from different regions that are far apart, the one degree of latitude / longitude may have signify different distances.
In those cases, it may be better to, for instance, first project each point into 2D space.
The random points here get `s_id` starting with 1,001.
Because there are only `r nrow(dat_S)` real grocery store locations in the sample, this makes it easy to distinguish between real and random locations, and has no other meaning.
To skip random points that are quite close to one another, and thereby do not help us explore different neighborhoods, we can check the distance between the random points.
Whenever two random points are close (<100 meters) to one another, we drop one of the two.
This again is not important conceptually, but may help computationally.
```{r skip-close}
tmp <- dist_between(as.matrix(dat_S_isolated_random),as.matrix(dat_S_isolated_random))
colnames(tmp) <- c("s1_id","s2_id","dist_m")
tmp <- as_tibble(tmp) %>%
filter(s2_id < s1_id) %>%
filter(dist_m < 100) %>%
arrange(s2_id) %>%
.$s2_id
dat_S_candidate <- dat_S_isolated_random %>% filter(!(s_id %in% tmp))
```
Finally, save the remaining random points.
```{r save-candidates}
saveRDS(dat_S_candidate,"data/dat_S_candidate_random.rds")
```
Note that these random locations by themselves are NOT usually good counterfactual locations.
The hope is that the neural network will discover some good counterfactual locations near some of the random locations.
## Find 2D (grid) coordinates and save files for neural nets
Project latitude and longitude into 2D space using the `sf` package, for real treatment locations, the random candidate locations, and the other businesses.
```{r calc-2d-proj}
# use NAD83(2011) projection, EPSG:6419 for California 3 zone (~ Bay Area)
mat_S_xy <- cbind(dat_S$s_id,
sf_project(from="WGS84", to="EPSG:6419",
pts=cbind(dat_S$longitude,dat_S$latitude)))
colnames(mat_S_xy) <- c("s_id","x","y")
mat_S_candidate_xy <- cbind(dat_S_candidate$s_id,
sf_project(from="WGS84", to="EPSG:6419",
pts=cbind(dat_S_candidate$longitude,
dat_S_candidate$latitude)))
colnames(mat_S_candidate_xy) <- c("s_id","x","y")
mat_A_xy <- cbind(dat_A$a_id,
sf_project(from="WGS84", to="EPSG:6419",
pts=cbind(dat_A$longitude,dat_A$latitude)))
colnames(mat_A_xy) <- c("a_id","x","y")
# convert to tables instead of matrices for easier handling
dat_S_xy <- as_tibble(mat_S_xy)
dat_S_candidate_xy <- as_tibble(mat_S_candidate_xy)
dat_A_xy <- as_tibble(mat_A_xy)
```
To simplify the setup of the neural network, for each real treatment or random location, save the locations of all nearby businesses as relative locations.
```{r define-nearby}
dist_keep_mi <- 2 * sqrt(2) # * sqrt(2) fills a square with base 4 miles instead of a circle with diameter 4
```
First, for each real treatment location.
```{r all-near-real}
dat_S_A <- rbind(
# other businesses
dat_D %>%
filter(dist_mi < dist_keep_mi) %>%
select(s_id,a_id) %>%
# grid position of the grocery store
inner_join(dat_S_xy, by="s_id") %>%
rename(x_s = x, y_s = y) %>%
# grid position of other business
inner_join(dat_A_xy, by="a_id") %>%
# relative positions
mutate(x = x-x_s, y = y-y_s) %>%
# industry of other business
inner_join(dat_A %>% select(a_id,naics_code), by="a_id") %>%
select(s_id, a_id, x, y, naics_code),
# other treatment locations
dat_DS %>%
filter(dist_mi < dist_keep_mi) %>%
select(s1_id,s2_id) %>%
# grid position of the grocery store
inner_join(dat_S_xy, by=c("s1_id"="s_id")) %>%
rename(x_s = x, y_s = y) %>%
# grid position of other business
inner_join(dat_S_xy, by=c("s2_id"="s_id")) %>%
# relative positions
mutate(x = x-x_s, y = y-y_s) %>%
# industry of other business
inner_join(dat_S %>% select(s_id,naics_code), by=c("s2_id"="s_id")) %>%
mutate(s_id = s1_id, a_id = -s2_id) %>%
select(s_id, a_id, x, y, naics_code) %>%
arrange(s_id)
) %>%
arrange(s_id) %>%
mutate(x = round(x,2),
y = round(y,2))
```
Second, for the random points, we need the distances to other businesses and treatment locations.
```{r distances-to-random}
# distance to other businesses
dat_D_candidate <- dist_between(as.matrix(dat_S_candidate),
as.matrix(dat_A %>% select(a_id,latitude,longitude)))
colnames(dat_D_candidate) <- c("s_id","a_id","dist_m")
dat_D_candidate <- as_tibble(dat_D_candidate) %>%
mutate(s_id = as.integer(s_id),
a_id = as.integer(a_id),
dist_mi = dist_m*0.62137119223733/1000)
# distance to grocery stores
dat_DS_candidate <- dist_between(as.matrix(dat_S_candidate),
as.matrix(dat_S %>% select(s_id,latitude,longitude)))
colnames(dat_DS_candidate) <- c("s1_id","s2_id","dist_m")
dat_DS_candidate <- as_tibble(dat_DS_candidate) %>%
mutate(s_id = as.integer(s1_id),
a_id = as.integer(s2_id),
dist_mi = dist_m*0.62137119223733/1000)
```
Then we can gather all the relative distances for businesses near the random points in the same way as we did it for the real treatment locations.
```{r all-near-random}
dat_S_candidate_A <- rbind(
# other businesses
dat_D_candidate %>%
filter(dist_mi < dist_keep_mi) %>%
select(s_id,a_id) %>%
# grid position of the grocery store
inner_join(dat_S_candidate_xy, by="s_id") %>%
rename(x_s = x, y_s = y) %>%
# grid position of other business
inner_join(dat_A_xy, by="a_id") %>%
# relative positions
mutate(x = x-x_s, y = y-y_s) %>%
# industry of other business
inner_join(dat_A %>% select(a_id,naics_code), by="a_id") %>%
select(s_id, a_id, x, y, naics_code),
# real treatment locations
dat_DS_candidate %>%
filter(dist_mi < dist_keep_mi) %>%
select(s1_id,s2_id) %>%
# grid position of the grocery store
inner_join(dat_S_candidate_xy, by=c("s1_id"="s_id")) %>%
rename(x_s = x, y_s = y) %>%
# grid position of other business
inner_join(dat_S_xy, by=c("s2_id"="s_id")) %>%
# relative positions
mutate(x = x-x_s, y = y-y_s) %>%
# industry of other business
inner_join(dat_S %>% select(s_id,naics_code), by=c("s2_id"="s_id")) %>%
mutate(s_id = s1_id, a_id = -s2_id) %>%
select(s_id, a_id, x, y, naics_code) %>%
arrange(s_id)
) %>%
arrange(s_id) %>%
mutate(x = round(x,2),
y = round(y,2))
```
Finally, save the data as `.csv` files as input into the (Python) neural net.
```{r save-csv}
write.csv(x = dat_S_A %>%
filter(floor(naics_code/100) != 4451) %>%
select(s_id,a_id,x,y,naics_code),
file = "neural-net/grid_S_I.csv",
row.names=FALSE, quote=FALSE)
write.csv(x = dat_S_A %>%
filter(floor(naics_code/100) == 4451) %>%
select(s_id,a_id,x,y),
file = "neural-net/grid_S_S.csv",
row.names=FALSE, quote=FALSE)
write.csv(x = dat_S_candidate_A %>%
filter(floor(naics_code/100) != 4451) %>%
select(s_id,a_id,x,y,naics_code),
file = "neural-net/grid_S_random_I.csv",
row.names=FALSE, quote=FALSE)
write.csv(x = dat_S_candidate_A %>%
filter(floor(naics_code/100) == 4451) %>%
select(s_id,a_id,x,y),
file = "neural-net/grid_S_random_S.csv",
row.names=FALSE, quote=FALSE)
```