-
Notifications
You must be signed in to change notification settings - Fork 0
/
r4da-week12.qmd
669 lines (534 loc) · 20.3 KB
/
r4da-week12.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
---
title: "Lab 12: Logistic Regression"
author: "Viktoriia Semenova"
editor: source
date: "November 22, 2023"
format:
html:
theme: slides.scss
code-fold: false
code-tools: true
code-summary: "Show the code"
code-overflow: wrap
pdf:
toc: true
colorlinks: true
geometry:
- top=30mm
- left=30mm
highlight-style: github
editor_options:
chunk_output_type: console
---
## Setup
```{r setup}
#| results: 'hide'
#| message: false
#| warning: false
# put all packages we use in a vector
p_needed <- c(
"tidyverse", # shortcut for ggplot2, dplyr, readr
"janitor", # clean variable names in datasets
"modelsummary", # descriptive & regression tables
"haven", # to style the code
"broom", # model output
"clarify", # for simulations
"patchwork" # for plot arragngement
)
# check if they are already installed, install if not installed
lapply(
p_needed[!(p_needed %in% rownames(installed.packages()))],
install.packages
)
# load the packages
lapply(p_needed, library, character.only = TRUE)
# set the theme for all plots to theme_bw
ggplot2::theme_set(ggplot2::theme_classic())
theme_update(legend.position = "bottom")
UA <- read_csv("data/UA_survey.csv")
```
Today most of our lab will be based on the following paper:
> Peisakhin, L. and Rozenas, A. 2018. Electoral Effects of Biased Media: Russian Television in Ukraine. *American Journal of Political Science* 62: 535-550. https://doi.org/10.1111/ajps.12355
In brief, the paper addresses the following question:
> We use plausibly exogenous variation in the availability of the Russian analog television signal in Ukraine to study how a media source with a conspicuous political agenda impacts political behavior and attitudes. Using highly disaggregated election data and an original survey, we estimate that Russian television substantially increased average electoral support for parties and candidates with a “pro-Russian” agenda in the 2014 presidential and parliamentary elections. Evidence suggests that this effect is attributable to persuasion rather than differential mobilization. The effectiveness of biased media varied in a politically consequential way: Its impact was largest on voters with strong pro-Russian priors but was less effective, and to some degree even counter-effective, in persuading those with strong pro-Western priors. Our finding suggests that exposing an already polarized society to a biased media source can result in even deeper polarization.
We will simplify the analysis and focus on the following piece of the DGP:
```{dot}
//| echo: false
//| fig-width: 3
digraph D {
node [shape=ellipse];
edge [len = 1.2, arrowhead = vee];
a [label = "Russian TV\nReception"];
b [label = "Pro-Russian Vote"];
c [label = "Living within\n25 km from Border"];
{rank=same a b};
{rank=sink c};
a->b;
c->a;
c->b;
}
```
### Data
Dataset `UA_survey.csv` contains the following variables:
- `russian_tv`: indicator for whether voter's precinct received Russian TV (1) or not (0)
- `pro_russian_vote`: indicator for whether respondent voted for pro-Russian party in 2014 Ukrainian elections (1) or not (0)
- `within_25km`: indicator for whether respondent's precinct is within 25 kilometers from Russian border (1) or not (0)
```{r}
# summary statistics
datasummary_skim(UA)
# summary statistics split by DV
modelsummary::datasummary_balance(1 ~ pro_russian_vote, UA)
```
```{r}
ggplot(
UA,
aes(
y = pro_russian_vote,
fill = if_else(within_25km == 0, "No", "Yes"),
color = if_else(within_25km == 0, "No", "Yes"),
x = if_else(russian_tv == 0, "No", "Yes")
)
) +
geom_point(
position = position_jitterdodge(
jitter.width = 0.3,
jitter.height = 0.1
),
size = 3, shape = 21
) +
labs(
y = "Pro-Russian Vote",
color = "Living within 25 km",
fill = "Living within 25 km",
x = "Russian TV Reception"
) +
# theme(legend.position = "bottom") +
scale_color_viridis_d(end = 0.8, alpha = 0.8) +
scale_fill_viridis_d(end = 0.8, alpha = 0.3) +
scale_y_continuous(breaks = c(0, 1), labels = c("No", "Yes"))
```
---
## Linear Probability Model
Sometimes when modeling a binary variable, people just use a linear model. Because the dependent variable is still interpreted as a probability, this is called a _Linear Probability Model_. Can you think of any advantages of using this approach? any drawbacks?
$$
\text{Pro-Russian Vote} \sim \beta_0 + \beta_1 \text{Russian TV} + \beta_2 \text{Living within 25km} + \varepsilon
$$
```{r}
# fitting a linear model with OLS
m1 <- lm(pro_russian_vote ~ russian_tv + within_25km, data = UA)
# generate predictions
newdata <- expand_grid(
russian_tv = 0:1,
within_25km = 0:1
)
# generate all combinations for variables
newdata
# look at predictions
predict(m1, newdata, interval = "c") %>%
bind_cols(newdata)
```
We can also do the same using simulations approach and get our uncertainty right away:
```{r}
# calculate the expected values (no first differences!)
sim_m1 <- sim(m1) # simulate coefficients once per model
evs <- sim_setx(
sim = sim_m1,
x = list(
russian_tv = 0:1,
within_25km = 0:1
)
)
summary(evs)
# plot(evs) +
# theme(legend.position = "bottom") +
# scale_color_viridis_d() +
# scale_fill_viridis_d() +
# labs(x = "Expected Probability of Pro-Russian Vote")
```
> Do you spot any problems with our predictions?
```{r}
fd_within_25km <- sim_setx(
sim = sim_m1,
x = list(
russian_tv = 0,
within_25km = 1
),
x1 = list(
russian_tv = 1,
within_25km = 1
)
)
summary(fd_within_25km)
# fd_outside_25km <- sim_setx(sim = sim_m1,
# x = list(russian_tv = 0,
# within_25km = 0),
# x1 = list(russian_tv = 1,
# within_25km = 0))
# summary(fd_outside_25km)
```
> Are the first differences in _expected probabilities of pro-Russian vote_ the same or different between those who _watch Russian TV_ and those who do not?
> Do we need both `fd_within_25km` and `fd_outside_25km` objects to evaluate the effect of watching Russian TV or is one of the sufficient? Why?
Here is an example of a plot of expected values for this model. Can you think of any information that is missing on this plot?
```{r}
evs_plotting <- evs %>%
as.data.frame() %>%
pivot_longer(everything()) %>%
separate(name, sep = ", ", into = c("russian_tv", "within_25km")) %>%
mutate(
russian_tv = if_else(str_detect(russian_tv, "0$"), "No", "Yes"),
within_25km = if_else(str_detect(within_25km, "0$"), "No", "Yes")
) %>%
group_by(russian_tv, within_25km) %>%
# group_by(name) %>%
summarize(
mean = mean(value), # calculate the mean and cis
ci_low = quantile(value, 0.025), # lower ci for 95% CIs
ci_upper = quantile(value, 0.975) # lower ci for 95% CIs
)
evs_plotting %>%
ggplot(aes(
y = russian_tv,
x = mean,
xmin = ci_low,
xmax = ci_upper,
color = within_25km,
shape = within_25km
)) +
geom_pointrange(
aes(
xmin = ci_low,
xmax = ci_upper
),
position = position_dodge(width = 0.5) # adds
) +
labs(
x = "Probability of Pro-Russian Vote",
shape = "Living within 25 km",
color = "Living within 25 km",
y = "Russian TV Reception"
) +
scale_color_viridis_d(end = 0.8) +
scale_fill_viridis_d(end = 0.8)
```
## Logistic Regression
Now let's fit the model that is specifically designed to model binary dependent variables, namely the logistic regression.
#### Fitting the Model in `R`
Here you can spot the differences in the syntax:
- we are using `glm()` (`g`eneralized `l`inear `m`odel) command instead of `lm()`
- we are specifying the `family` argument to `binomial`. Both ways will give you a logistic regression:
- `family = binomial`
- `family = binomial(link = "logit")`
```{r}
# OLS
m1 <- lm(pro_russian_vote ~ russian_tv + within_25km, data = UA)
# logistic regression
m2 <- glm(pro_russian_vote ~ russian_tv + within_25km, data = UA, family = binomial(link = "logit"))
summary(m2)
```
#### Interpretation of Coefficients and Mechanics of Logistic Regression
In logistic regression, coefficient interpretation is not as straightforward as in the linear models. The reason for that is the follows: we are no longer modeling the dependent variable as an unbound cintinuous variable. We are applying a transformation that ensures that no matter how larger/small value we obtain when plugging in the values for independent variables and multiplying them with respective coefficients, the predicted probability is always bound between 0 and 1.
In our example, this means we apply a transformation to the *linear predictor* $\beta_0 + \beta_1 \text{Russian TV} + \beta_2 \text{Living within 25km}$ to ensure that outcome is bounded between 0 and 1. One of the functions that allows us to achieve this is an _inverse logit_ function.
The inverse logit (aka logistic or sigmoid) function takes a value between $-\infty$ and $+\infty$ and maps it to a value between 0 and 1:
$$
logit^{-1}(x) = logistic(x) = \frac{\exp(x)}{1+\exp(x)} = \frac{1}{1+\exp(-x)}
$$
Let's look at what this transformation does:
```{r}
# custom function
inv_logit <- function(x) {
exp(x) / (1 + exp(x))
}
inv_logit_example <- tibble(
x = seq(-10, 10, length.out = 50),
y = inv_logit(seq(-10, 10, length.out = 50))
)
inv_logit_example
ggplot(
data = inv_logit_example,
aes(x, y)
) +
geom_line(alpha = 0.5)
```
This transformation allows us to still model the _probability_ $\pi = \text{Pr(Pro-Russian Vote} = 1)$ , although not directly. Instead we are modelling the _log odds of $\pi$_. This handy trick comes with a price: the coefficients $\beta$ now represent not changes in _probability_ but changes in the _log odds of $\pi$_ $\log\Big(\frac{\pi}{1-\pi}\Big)$, which is a rather unintuitive quantity.
$$
\log\Big(\frac{\pi}{1-\pi}\Big) = \underbrace{\beta_0 + \beta_1~X_1 + \cdots + \beta_k~X_k}_{\text{linear predictor}}
$$
The equivalent of the equation above but with probability on the left-hand side would be:
$$\pi = \frac{\exp\{\beta_0 + \beta_1~X_1 + \cdots + \beta_k~X_k\}}{1 + \exp\{\beta_0 + \beta_1~X_1 + \cdots + \beta_k~X_k\}}$$
$$
\pi = \dfrac{\overbrace{\exp(X\beta)}^{\text{always greater than 0}}}{\underbrace{1 + \exp(X\beta)}_{\text{1 + "always greater than 0"}}} \Rightarrow 0 < \pi < 1
$$
Knowing these formulas, we will be able to calculate the predicted probabilities from the linear predictor ($X\beta$).
#### Model Output
```{r}
modelsummary(
list(
"OLS" = m1,
"Logit" = m2
),
gof_omit = "RMSE|AIC|BIC|F"
)
```
If we needed to interpret coefficients for the logit model directly, we could say the following:
> The **log-odds** of voting for a pro-Russian party are expected to be 1.79 more for those exposed to Russian TV compared to those without exposure to Russian TV (the baseline group), holding all else constant.
> The **odds** of voting for a pro-Russian party or those exposed to Russian TV are expected to be 5.98 ($e^{1.79}$) **times** the odds for those without exposure to Russian TV, holding all else constant.
In general, from the regression tables alone we can only refer to the significance and sign of the coefficients, but not their relative sizes.
#### Predicted Probabilities (without uncertainty)
A much more meaningful way to go about it would be to work with predicted probabilities. Let's calculate them by hand first:
```{r}
# by hand
# russian_tv = 1, within_25km = 1
pp1 <- m2$coefficients[1] + m2$coefficients[2] * 1 + m2$coefficients[3] * 1
# russian_tv = 0, within_25km = 1
pp2 <- m2$coefficients[1] + m2$coefficients[2] * 0 + m2$coefficients[3] * 1
```
> Why are these values negative?
```{r}
# custom inverse logit function
inv_logit <- function(x) {
exp(x) / (1 + exp(x))
}
inv_logit(pp1)
inv_logit(pp2)
```
We can also use `predict()` function to obtain these quantities. However, we need to be careful when specifying the argument `type` in this case:
```{r}
pred_data <- expand_grid(russian_tv = 0:1, within_25km = 0:1)
predict(m2, newdata = pred_data)
predict(m2, newdata = pred_data, type = "link")
predict(m2, newdata = pred_data, type = "response")
```
> What is the difference between these functions/arguments?
---
## Simulations for Logit Interpretation
### Expected Values (aka Predicted Probabilities)
```{r}
# enure reproducibility in the knitted document
set.seed(2023)
# simulate coefficients from each model once, and then reuse them
sim_m2 <- sim(m2)
evs_logit <- sim_setx(
sim = sim_m2,
x = list(
russian_tv = 0:1,
within_25km = 0:1
)
)
plot(evs_logit)
summary(evs_logit)
```
> How would you interpret the results? Do you notice any differences to the linear model?
```{r}
library(patchwork)
# simple combination of plots
plot(evs_logit) + plot(evs)
# more informative plot
(plot(evs_logit) + labs(title = "Logit Model")) +
(plot(evs) + labs(title = "Linear Model")) +
plot_annotation(title = "Distributions of Simulated Predicted Probabilities of Pro-Russian Vote") +
plot_layout(guides = "collect") &
theme(legend.position = "bottom")
```
### First Differences
Let's have a better look at the first differences to evaluate the effect of the Russian propaganda. That means, we will evaluate the differences between being exposed to Russian TV and not being exposed to it, holding other variables constant:
```{r}
fd_within_25km <- sim_setx(
sim = sim_m2,
x = list(
russian_tv = 0,
within_25km = 1
),
x1 = list(
russian_tv = 1,
within_25km = 1
)
)
summary(fd_within_25km)
fd_outside_25km <- sim_setx(
sim = sim_m2,
x = list(
russian_tv = 0,
within_25km = 0
),
x1 = list(
russian_tv = 1,
within_25km = 0
)
)
summary(fd_outside_25km)
```
> How do these first differences compare to the first differences in the linear model?
> If we needed to present the results of fitting such a model in a paper, which scenario would you choose?
Here is how you can extract the quantities you need and make a different plot for presenting your results:
```{r}
pps_tv <- sim_setx(
sim = sim_m2, # object with simulated coefs
x = list(
russian_tv = 1, # scenario with desired (plausible) values
within_25km = 0:1
)
)
pps_no_tv <- sim_setx(
sim = sim_m2, # object with simulated coefs
x = list(
russian_tv = 0, # scenario with desired (plausible) values
within_25km = 0:1
)
)
fds_tv <- pps_tv - pps_no_tv
## plotting using pointrange geom
pps_plotting_tv <- summary(pps_tv, ci = TRUE, level = 0.95) %>%
as.data.frame() %>%
janitor::clean_names() %>%
rownames_to_column(var = "within_25km") %>%
mutate(tv = "Yes")
pps_plotting_no_tv <- summary(pps_no_tv, ci = TRUE, level = 0.95) %>%
as.data.frame() %>%
janitor::clean_names() %>%
rownames_to_column(var = "within_25km") %>%
mutate(tv = "No")
pps_plotting <- pps_plotting_tv %>%
bind_rows(pps_plotting_no_tv) %>%
mutate(within_25km = if_else(str_detect(within_25km, "within_25km = 0"), "No", "Yes"))
pps_plotting %>%
ggplot(aes(
x = estimate,
xmin = x2_5_percent,
xmax = x97_5_percent,
y = tv,
shape = within_25km,
color = within_25km
)) +
geom_pointrange(
position = position_dodge(width = .5),
size = 1.2, # change the size of the point (x)
linewidth = 1.2 # change the size of the line (xmin-xmax)
) +
labs(
title = "Russian TV Reception in Precinct and Predicted Probability of Pro-Russian Voting",
x = "Probability of Pro-Russian Vote",
shape = "Living within 25 km",
color = "Living within 25 km",
y = "Russian TV Reception\nin Precinct"
) +
xlim(0, 1) +
scale_color_viridis_d(end = 0.8) +
theme(plot.title.position = "plot")
```
In addition to expected values, we almost always want to present the first differences as well:
```{r}
fds_tv_plotting <- summary(fds_tv) %>%
as.data.frame() %>%
rownames_to_column(var = "within_25km") %>%
clean_names() %>%
mutate(
within_25km = if_else(str_detect(within_25km, "within_25km = 0"), "No", "Yes")
)
fds_tv_plotting %>%
ggplot(aes(
x = estimate,
xmin = x2_5_percent,
xmax = x97_5_percent,
y = within_25km,
# shape = within_25km,
# color = within_25km
)) +
geom_pointrange(
position = position_dodge(width = .5),
size = 1.2,
linewidth = 1.2
) +
labs(
title = "Effect of Russian TV Reception in Precinct on the Probability of Pro-Russian Vote",
subtitle = "First Differences with 95% Confidence Intervals",
x = "Difference in Probability of Pro-Russian Vote \nbetween Precincts with and without Russian TV Reception",
# shape = "Living within 25 km",
# color = "Living within 25 km",
y = "Living within 25 km"
) +
scale_color_viridis_d(end = 0.8) +
theme(plot.title.position = "plot") +
geom_vline(xintercept = 0, linetype = "dashed")
```
## Example: Relationship between Mountaneous Terrain and Probability of Civil War Onset
Now we will briefly revisit a highly cited paper by Fearon and Laitin and replicate their analysis. We will be working with the data from the following study:
> Fearon, James D., and David D. Laitin. 2003. “Ethnicity, Insurgency, and Civil War.” *American Political Science Review* 97 (1). Cambridge University Press: 75–90. https://doi:10.1017/S0003055403000534.
The authors summarize their findings in the abstarct as follows:
> An influential conventional wisdom holds that civil wars proliferated rapidly with the end of the Cold War and that the root cause of many or most of these has been ethnic and religious antagonisms. We show that the current prevalence of internal war is mainly the result of a steady accumulation of protracted conflicts since the 1950s and 1960s rather than a sudden change associated with a new, post-Cold War international system. We also find that after controlling for per capita income, more ethnically or religiously diverse countries have been no more likely to experience significant civil violence in this period. We argue for understanding civil war in this period in terms of insurgency or rural guerrilla warfare, a particular form of military practice that can be harnessed to diverse political agendas. **The factors that explain which countries have been at risk for civil war are not their ethnic or religious characteristics but rather the conditions that favor insurgency.** These include poverty—which marks financially and bureaucratically weak states and also favors rebel recruitment—political instability, **rough terrain**, and large populations.
Let's have a look at the dataset:
```{r}
fearon <- read_dta("data/fearon_rep.dta")
glimpse(fearon)
# summary statistics
datasummary_skim(fearon)
```
We will be working with `clarify` later, and in order to get plots on the reader-friendly scales of the variable we need to do transformations right inside the formula in the model rather than create separate variables:
```{r}
fearon <- fearon %>%
mutate(
mountain = exp(log_mountain),
population = exp(log_population)
)
m1_fearon <- glm(
civilwar ~ priorwar + gdp_lagged + log_population + log(mountain) +
noncontiguous + oil + newstate + instability + democracy1 +
ethnicfrac + relifrac,
data = fearon,
family = binomial(link = logit)
)
```
```{r}
pps_mountain <- sim_setx(
sim = sim(m1_fearon), # object with simulated coefs
x = list(
mountain = seq(min(fearon$mountain), max(fearon$mountain), length.out = 100)
)
)
plot(pps_mountain) +
geom_rug(
data = fearon,
aes(x = mountain)
) +
labs(
y = "Probability of Civil War onset",
x = "Mountainous Terrain in %"
)
```
Let's see how large is the change in probability of civil war onset when terrain changes from 10% to 90%?
```{r}
sim_setx(
sim = sim(m1_fearon), # object with simulated coefs
x = list(
mountain = 10
),
x1 = list(
mountain = 90
)
) %>%
summary()
```
> Does this effect seem substantively large? Would you agree with the conclusions the authors come to?
We can also have a look at the _average case_ in the dataset:
```{r}
median_scenario <- fearon %>%
summarise_all(median, na.rm = T)
# use dput() command to easily copypaste the values
pps_mountain <- sim_setx(
sim = sim(m1_fearon),
# object with simulated coefs
x = list(
priorwar = 0,
gdp_lagged = 1.99500000476837,
log_population = 9.00417709350586,
noncontiguous = 0,
oil = 0,
newstate = 0,
instability = 0,
democracy1 = -3,
ethnicfrac = 0.325454533100128,
relifrac = 0.357599973678589,
mountain = 11.3000006150937
)
)
plot(pps_mountain)
```