forked from clauswilke/dataviz
-
Notifications
You must be signed in to change notification settings - Fork 0
/
proportional_ink.Rmd
391 lines (314 loc) · 25.5 KB
/
proportional_ink.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
```{r echo = FALSE, message = FALSE}
# run setup script
source("_common.R")
library(tidyr)
library(lubridate)
library(ggridges)
library(ggforce)
library(treemapify)
```
# (PART\*) Part II: Principles of figure design {-}
# The principle of proportional ink {#proportional-ink}
In many different visualization scenarios, we represent data values by the extent of a graphical element. For example, in a bar plot, we draw bars that begin at 0 and end at the data value they represent. In this case, the data value is not only encoded in the end point of the bar but also in the height or length of the bar. If we drew a bar that started at a different value than 0, then the length of the bar and the bar endpoint would convey contradicting information. Such figures are internally inconsistent, because they show two different values with the same graphical element. Contrast this to a scenario where we visualize the data value with a dot. In this case, the value is only encoded in the location of the dot but not in the size or shape of the dot.
Similar issues will arise whenever we use graphical elements such as bars, rectangles, shaded areas of arbitrary shape, or any other elements that have a clear visual extent which can be either consistent or inconsistent with the data value shown. In all these cases, we need to make sure that there is no inconsistency. This concept has been termed by Bergstrom and West as the *principle of proportional ink* [@BergstromWest2016]:
> **The principle of proportional ink:** The sizes of shaded areas in a visualization need to be proportional to the data values they represent.
(It is common practice to use the word "ink" to refer to any part of a visualization that deviates from the background color. This includes lines, points, shared areas, and text. In this chapter, however, we are talking primarily about shaded areas.) Violations of this principle are quite common, in particular in the popular press and in the world of finance.
## Visualizations along linear axes
We first consider the most common scenario, visualization of amounts along a linear scale. Figure \@ref(fig:hawaii-income-bars-bad) shows the median income in the five counties that make up the state of Hawaii. It is a typical figure one might encounter in a newspaper article. A quick glance at the figure suggests that the county of Hawaii is incredibly poor while the county of Honolulu is much richer than the other counties. However, Figure \@ref(fig:hawaii-income-bars-bad) is quite misleading, because all bars begin at \$50,000 median income. Thus, while the endpoint of each bar correctly represents the actual median income in each county, the bar height represents the extent to which median incomes exceed $50,000, an arbitrary number. And human perception is such that the bar height is the key quantity we perceive when looking at this figure, not the location of the bar endpoint relative to the *y* axis.
(ref:hawaii-income-bars-bad) Median income in the five counties of the state of Hawaii. This figure is misleading, because the *y* axis scale starts at \$50,000 instead of \$0. As a result, the bar heights are not proportional to the values shown, and the income differential between the county of Hawaii and the other four counties appears much bigger than it actually is. Data source: 2015 Five-Year American Community Survey.
```{r hawaii-income-bars-bad, fig.cap = '(ref:hawaii-income-bars-bad)'}
p_income_base <- ggplot(filter(hawaii_income, year == 2015), aes(x = reorder(county, desc(median_income)), y = median_income)) +
geom_col(fill = "#56B4E9") +
xlab("county") +
theme_dviz_hgrid() +
theme(
axis.ticks.x = element_blank(),
plot.margin = margin(3, 7, 3, 1.5)
)
p_income_bad <- p_income_base +
coord_cartesian(xlim = c(0.5, 5.55), ylim = c(50000, 75000), expand = FALSE) +
scale_y_continuous(
name = "median income (USD)",
breaks = 10000*(5:7),
labels = function(x) paste0("$", scales::comma(x))
)
stamp_bad(p_income_bad)
```
An appropriate visualization of these data makes for a less exciting story (Figure \@ref(fig:hawaii-income-bars-good)). While there are differences in median income between the counties, they are nowhere near as big as Figure \@ref(fig:hawaii-income-bars-bad) suggested. Overall, the median incomes in the different counties are somewhat comparable.
(ref:hawaii-income-bars-good) Median income in the five counties of the state of Hawaii. Here, the *y* axis scale starts at \$0 and therefore the relative magnitudes of the median incomes in the five counties are accurately shown. Data source: 2015 Five-Year American Community Survey.
```{r hawaii-income-bars-good, fig.cap = '(ref:hawaii-income-bars-good)'}
p_income_good <- p_income_base +
coord_cartesian(xlim = c(0.5, 5.55), ylim = c(0, 78000), expand = FALSE, clip = "off") +
scale_y_continuous(
name = "median income (USD)",
breaks = 20000*(0:3),
labels = function(x) paste0("$", scales::comma(x))
) +
theme(axis.line.x = element_blank())
p_income_good
```
```{block type='rmdtip', echo=TRUE}
Bars on a linear scale should always start at 0.
```
Similar visualization problems frequently arise in the visualization of time series, such as those of stock prices. Figure \@ref(fig:fb-stock-drop-bad) suggests a massive collapse in the stock price of Facebook occurred around Nov. 1, 2016. In reality, the price decline was moderate relative to the total price of the stock (Figure \@ref(fig:fb-stock-drop-good)). The *y*-axis range in Figure \@ref(fig:fb-stock-drop-bad) would be questionable even without the shading underneath the curve. But with the shading, the figure becomes particularly problematic. The shading emphasizes the distance from the location of the *x* axis to the specific *y* values shown, and thus it creates the visual impression that the height of the shaded area at a given day represents the stock price of that day. Instead, it only represents the difference in stock price from the baseline, which is $110 in Figure \@ref(fig:fb-stock-drop-bad).
(ref:fb-stock-drop-bad) Stock price of Facebook (FB) from Oct. 22, 2016 to Jan. 21, 2017. This figure seems to imply that the Facebook stock price collapsed around Nov. 1, 2016. However, this is misleading, because the *y* axis starts at $110 instead of $0.
```{r fb-stock-drop-bad, fig.cap = '(ref:fb-stock-drop-bad)'}
df_fb_drop <- filter(tech_stocks, ticker == "FB", date >= ymd("2016-10-22") & date < ymd("2017-01-22"))
fb_drop_bad <- ggplot(df_fb_drop, aes(x=date, height=price - 110, y = 110)) +
geom_ridgeline(alpha = 0.7) +
scale_x_date(name = NULL, #name = "day",
breaks = ymd(c("2016-11-01", "2016-12-01", "2017-01-01")),
labels = c("Nov 1, 2016", "Dec 1, 2016", "Jan 1, 2017"),
expand=c(0, 0)) +
scale_y_continuous(name="stock price (USD)",
limits = c(110, 135),
expand=c(0, 0)) +
theme_dviz_open() +
background_grid(major = 'y', minor = 'none') +
theme(plot.margin = margin(14, 7, 3, 1.5))
stamp_bad(fb_drop_bad)
```
(ref:fb-stock-drop-good) Stock price of Facebook (FB) from Oct. 22, 2016 to Jan. 21, 2017. By showing the stock price on a *y* scale from $0 to $150, this figure more accurately relays the magnitude of the FB price drop around Nov. 1, 2016.
```{r fb-stock-drop-good, fig.cap = '(ref:fb-stock-drop-good)'}
fb_drop_good <- ggplot(df_fb_drop, aes(x=date, height=price, y = 0)) +
geom_ridgeline(alpha = 0.7) +
scale_x_date(name = NULL, #name = "day",
breaks = ymd(c("2016-11-01", "2016-12-01", "2017-01-01")),
labels = c("Nov 1, 2016", "Dec 1, 2016", "Jan 1, 2017"),
expand=c(0,0)) +
scale_y_continuous(name="stock price (USD)",
limits = c(0, 150),
expand=c(0,0)) +
theme_dviz_open() +
background_grid(major = 'y', minor = 'none') +
theme(plot.margin = margin(14, 7, 3, 1.5))
stamp_phantom(fb_drop_good)
```
The examples of Figures \@ref(fig:hawaii-income-bars-good) and Figure \@ref(fig:fb-stock-drop-good) could suggest that bars and shaded areas are not useful to represent small changes over time or differences between conditions, since we always have to draw the whole bar or area starting from 0. However, this is not the case. It is perfectly valid to use bars or shaded areas to show differences between conditions, as long as we make it explicit which differences we are showing. For example, we can use bars to visualize the change in median income in Hawaiian counties from 2010 to 2015 (Figure \@ref(fig:hawaii-income-change)). For all counties except Kalawao, this change amounts to less than $5000. (Kalawao is an unusual county, with fewer than 100 inhabitants, and it can experience large swings in median income from a small number of people moving into or out of the county.) And for Hawaii County, the change is negative, i.e., the median income in 2015 was lower than it was in 2010. We represent negative values by drawing bars that go in the opposite direction, i.e., that extend from 0 down rather than up.
(ref:hawaii-income-change) Change in median income in Hawaiian counties from 2010 to 2015. Data source: 2010 and 2015 Five-Year American Community Surveys.
```{r hawaii-income-change, fig.cap = '(ref:hawaii-income-change)'}
hawaii_income_diff <- select(hawaii_income, county, year, median_income) %>%
spread(year, median_income) %>%
mutate(income_diff = `2015` - `2010`,
income_ratio = `2015` / `2010`)
ggplot(hawaii_income_diff, aes(x = reorder(county, desc(filter(hawaii_income, year == 2015)$median_income)),
y = income_diff)) +
geom_col(fill = "#56B4E9") +
xlab("county") +
scale_y_continuous(
name = "5-year change in median income (USD)",
limits = c(-5000, 25000),
expand = c(0, 0),
labels = c("-$5,000", "$0", "$5,000", "$10,000", "$15,000", "$20,000", "$25,000")
) +
theme_dviz_hgrid() +
theme(axis.ticks.x = element_blank(),
plot.margin = margin(7, 7, 3, 1.5))
```
Similarly, we can draw the change in Facebook stock price over time as the difference from its temporary high point on Oct. 22, 2016 (Figure \@ref(fig:fb-stock-drop-inverse)). By shading an area that represents the distance from the high point, we are accurately representing the absolute magnitude of the price drop without making any implicit statement about the magnitude of the price drop relative to the total stock price.
(ref:fb-stock-drop-inverse) Loss in Facebook (FB) stock price relative to the price of Oct. 22, 2016. Between Nov. 1, 2016 and Jan. 1, 2017, the price remained approximately \$15 lower than it was at its high point on Oct. 22, 2016. But then the price started to recover in Jan. 2017.
```{r fb-stock-drop-inverse, fig.cap = '(ref:fb-stock-drop-inverse)'}
df_fb_drop2 <- filter(tech_stocks,
ticker == "FB", date >= ymd("2016-10-22") & date < ymd("2017-01-22")) %>%
mutate(price_drop = price - max(price))
fb_drop_inverse <- ggplot(df_fb_drop2, aes(x=date, height=price_drop, y = 0)) +
geom_ridgeline(alpha = 0.7, min_height = -50) +
scale_x_date(name = NULL, #name = "day",
breaks = ymd(c("2016-11-01", "2016-12-01", "2017-01-01")),
labels = c("Nov 1, 2016", "Dec 1, 2016", "Jan 1, 2017"),
expand=c(0, 0)) +
scale_y_continuous(name="price loss (USD)",
limits = c(-25, 5),
expand=c(0, 0)) +
theme_dviz_open() +
background_grid(major = 'y', minor = 'none') +
theme(plot.margin = margin(7, 7, 3, 1.5))
stamp_phantom(fb_drop_inverse)
```
## Visualizations along logarithmic axes
When we are visualizing data along a linear scale, the areas of bars, rectangles, or other shapes are automatically proportional to the data values. The same is not true if we are using a logarithmic scale, because data values are not linearly spaced along the axis. Therefore, one could argue that, for example, bar graphs on a log scale are inherently flawed. On the flip side, the area of each bar will be proportional to the logarithm of the data value, and thus bar graphs on a log scale satisfy the principle of proportional ink in log-transformed coordinates. In practice, I think neither of these two arguments can resolve whether log-scale bar graphs are appropriate. Instead, the relevant question is whether we want to visualize amounts or ratios.
In Chapter \@ref(coordinate-systems-axes), I have explained that a log scale is the natural scale to visualize ratios, because a unit step along a log scale corresponds to multiplication with or division by a constant factor. In practice, however, log scales are often used not specifically to visualize ratios but rather just because the numbers shown vary over many orders of magnitude. As an example, consider the gross domestic products (GDPs) of countries in Oceania. In 2007, these varied from less than a billion U.S. dollars (USD) to over 300 billion USD (Figure \@ref(fig:oceania-gdp-logbars)). Visualizing these numbers on a linear scale would not work, because the two countries with the largest GDPs (New Zealand and Australia) would dominate the figure.
(ref:oceania-gdp-logbars) GDP in 2007 of countries in Oceania. The lengths of the bars do not accurately reflect the data values shown, since bars start at the arbitrary value of 0.3 billion USD. Data source: Gapminder.
```{r oceania-gdp-logbars, fig.width = 6, fig.asp = 0.5, fig.cap = '(ref:oceania-gdp-logbars)'}
library(gapminder)
df_oceania <- filter(gapminder_unfiltered, year == 2007, continent == "Oceania") %>%
mutate(GDP = pop*gdpPercap) %>%
arrange(desc(GDP))
oc_bad <- ggplot(df_oceania, aes(x = reorder(country, -GDP), y = log10(GDP))) +
geom_col(fill = "#56B4E9") +
scale_y_continuous(breaks = log10(c(3.1e8, 1e9, 3.e9, 1e10, 3.e10, 1e11, 3.e11, 1e12)),
labels = c("0.3", "1.0", "3.0", "10", "30", "100", "300", "1000"),
name = "GDP (billion USD)") +
scale_x_discrete(name = NULL) +
coord_flip(ylim = log10(c(3.1e8, 9.9e11)), expand = FALSE) +
theme_dviz_vgrid(12, rel_small = 1) +
theme(axis.ticks.y = element_blank(),
plot.margin = margin(12, 6, 3, 1.5))
stamp_bad(oc_bad)
```
However, the visualization with bars on a log scale (Figure \@ref(fig:oceania-gdp-logbars)) does not work either. The bars start at an arbitrary value of 0.3 billion USD, and at a minimum the figure suffers from the same problem of Figure \@ref(fig:hawaii-income-bars-bad), that the bar lengths are not representative of the data values. The added difficulty with a log scale, though, is that we cannot simply let the bars start at 0. In Figure \@ref(fig:oceania-gdp-logbars), the value 0 would lie infinitely far to the left. Therefore, we could make our bars arbitrary long by pushing their origin further and further way, see e.g. Figure \@ref(fig:oceania-gdp-logbars-long). This problem always arises when we try to visualize amounts (which is what the GDP values are) on a log scale.
(ref:oceania-gdp-logbars-long) GDP in 2007 of countries in Oceania. The lengths of the bars do not accurately reflect the data values shown, since bars start at the arbitrary value of 10<sup>-9</sup> billion USD. Data source: Gapminder.
```{r oceania-gdp-logbars-long, fig.width = 6, fig.asp = 0.5, fig.cap = '(ref:oceania-gdp-logbars-long)'}
oc_bad2 <- ggplot(df_oceania, aes(x = reorder(country, -GDP), y = log10(GDP))) +
geom_col(fill = "#56B4E9") +
scale_y_continuous(breaks = 2*(0:6),
labels = function(x) label_log10(10^x/1e9),
name = "GDP (billion USD)") +
scale_x_discrete(name = NULL) +
coord_flip(ylim = log10(c(1, 9.9e11)), expand = FALSE) +
theme_dviz_vgrid(12, rel_small = 1) +
theme(axis.ticks.y = element_blank(),
plot.margin = margin(12, 6, 3, 1.5))
stamp_bad(oc_bad2)
```
For the data of Figure \@ref(fig:oceania-gdp-logbars), I think bars are inappropriate. Instead, we can simply place a dot at the appropriate location along the scale for each country's GDP and avoid the issue of bar lengths altogether (Figure \@ref(fig:oceania-gdp-dots)). Importantly, by placing the country names right next to the dots rather than along the *y* axis, we avoid generating the visual perception of a magnitude conveyed by the distance from the country name to the dot.
(ref:oceania-gdp-dots) GDP in 2007 of countries in Oceania. Data source: Gapminder.
```{r oceania-gdp-dots, fig.width = 6, fig.asp = 0.5, fig.cap = '(ref:oceania-gdp-dots)'}
ggplot(df_oceania, aes(x = reorder(country, -GDP), y = log10(GDP))) +
geom_point(size = 3.5, color = "#0072B2") +
geom_label(aes(label = country, y = log10(GDP) - .08), hjust = 1, size = 12/.pt,
family = dviz_font_family,
fill = "white", alpha = 0.5, label.padding = grid::unit(2, "pt"),
label.r = grid::unit(0, "pt"), label.size = 0) +
scale_y_continuous(breaks = log10(c(3e8, 1e9, 3.e9, 1e10, 3.e10, 1e11, 3.e11, 1e12)),
labels = c("0.3", "1.0", "3.0", "10", "30", "100", "300", "1000"),
name = " GDP (billion USD)",
limits = log10(c(3e7, 9.9e11)),
expand = c(0, 0)) +
scale_x_discrete(name = NULL, breaks = NULL) +
coord_flip() +
theme_dviz_vgrid(12, rel_small = 1) +
theme(plot.margin = margin(12, 6, 3, 1.5))
```
If we want to visualize ratios rather than amounts, however, bars on a log scale are a perfectly good option. In fact, they are preferable over bars on a linear scale in that case. As an example, let's visualize the GDP values of countries in Oceania relative to the GDP of Papua New Guinea. The resulting figure does a good job highlighting the key relationships between the GDPs of the various countries (Figure \@ref(fig:oceania-gdp-relative)). We can see that New Zealand has over eight times the GDP of Papua New Guinea and Australia over 64 times, while Tonga and the Federated States of Micronesia have less than one-sixteenth of the GDP of Papua New Guinea. French Polynesia and New Caledonia are close but have a slightly smaller GDPs than Papua New Guinea does.
(ref:oceania-gdp-relative) GDP in 2007 of countries in Oceania, relative to the GDP of Papua New Guinea. Data source: Gapminder.
```{r oceania-gdp-relative, fig.width = 6, fig.asp = 0.5, fig.cap = '(ref:oceania-gdp-relative)'}
GDP_PNG <- filter(df_oceania, country == "Papua New Guinea")$GDP
df_oceania_ratios <- mutate(df_oceania, gdp_ratio = GDP/GDP_PNG) %>%
filter(country != "Papua New Guinea")
ggplot(df_oceania_ratios, aes(x = reorder(country, gdp_ratio), y = gdp_ratio)) +
geom_col(fill = "#56B4E9") +
scale_y_log10(breaks = c(1/16, 1/8, 1/4, 1/2, 1, 2, 4, 8, 16, 32, 64),
labels = c("1/16", "1/8", "1/4", "1/2", "1", "2", "4", "8", "16", "32", "64"),
name = "GDP relative to Papua New Guinea",
limits = c(.055, 70),
expand = c(0, 0)) +
scale_x_discrete(name = NULL) +
coord_flip() +
theme_dviz_vgrid(12, rel_small = 1) +
theme(axis.line = element_blank(),
axis.ticks.length = grid::unit(0, "pt"),
axis.ticks.y = element_blank(),
plot.margin = margin(12, 6, 3, 1.5))
```
Figure \@ref(fig:oceania-gdp-relative) also highlights that the natural midpoint of a log scale is 1, with bars representing numbers above 1 going in one direction and bars representing numbers below one going in the other direction. Bars on a log scale represent ratios and must always start at 1, and bars on a linear scale represent amounts and must always start at 0.
```{block type='rmdtip', echo=TRUE}
When bars are drawn on a log scale, they represent ratios and need to be drawn starting from 1, not 0.
```
## Direct area visualizations
All preceding examples visualized data along one linear dimension, so that the data value was encoded both by area and by location along the *x* or *y* axis. In these cases, we can consider the area encoding as incidental and secondary to the location encoding of the data value. Other visualization approaches, however, represent the data value primarily or directly by area, without a corresponding location mapping. The most common one is the pie chart (Figure \@ref(fig:RI-pop-pie)). Even though technically the data values are mapped onto angles, which are represented by location along a circular axis, in practice we are typically not judging the angles of a pie chart. Instead, the dominant visual property we notice is the size of the areas of each pie wedge.
(ref:RI-pop-pie) Number of inhabitants in Rhode Island counties, shown as a pie chart. Both the angle and the area of each pie wedge are proportional to the number of inhabitants in the respective county. Data source: 2010 Decennial U.S. Census.
```{r RI-pop-pie, fig.cap = '(ref:RI-pop-pie)'}
RI_pop <- US_census %>%
filter(state == "Rhode Island") %>%
extract(name, "county", regex = "(.+) County") %>%
select(county, pop2010) %>%
mutate(
label_pop = scales::comma(signif(pop2010, 3)),
label_comb = paste(county, label_pop, sep = "\n")
) %>%
arrange(desc(pop2010))
RI_pop$county = factor(RI_pop$county, levels = RI_pop$county)
RI_pie <- RI_pop %>%
mutate(
total = sum(pop2010),
end_angle = 2*pi*cumsum(pop2010)/total, # ending angle for each pie slice
start_angle = lag(end_angle, default = 0), # starting angle for each pie slice
mid_angle = 0.5*(start_angle + end_angle), # middle of each pie slice, for the text label
hjust = ifelse(mid_angle>pi, 1, 0),
vjust = ifelse(mid_angle<pi/2 | mid_angle>3*pi/2, 0, 1)
)
rpie = 1
RI_pie$rlabel = c(.3, .4, .4, 1.02, 1.02) * rpie
RI_pie$size = c(16, 14, 14, 12, 12)/.pt
plot_RI_pies <- ggplot(RI_pie) +
geom_arc_bar(
aes(
x0 = 0, y0 = 0, r0 = 0, r = rpie,
start = start_angle, end = end_angle, fill = county
),
color = "white", size = 0.75
) +
geom_text(
aes(
x = rlabel*sin(mid_angle),
y = rlabel*cos(mid_angle),
label = label_pop,
hjust = hjust, vjust = vjust,
size = size
),
family = dviz_font_family
) +
coord_fixed() +
scale_x_continuous(limits = c(-1.1, 1.1), expand = c(0, 0), name = NULL, breaks = NULL, labels = NULL) +
scale_y_continuous(limits = c(-1.1, 1.1), expand = c(0, 0),name = NULL, breaks = NULL, labels = NULL) +
#scale_fill_brewer(type = "qual", palette = "Pastel1") +
scale_fill_OkabeIto(darken = -.3) +
scale_size_identity() +
guides(fill = guide_legend(override.aes = list(size = 1.))) +
theme_dviz_open() +
theme(
axis.line.x = element_blank(),
axis.title.x = element_blank(),
axis.ticks.y = element_blank(),
legend.title.align = 0.5,
legend.key.size = grid::unit(25, "pt"),
legend.spacing.x = grid::unit(2, "pt"),
legend.spacing.y = grid::unit(2, "pt"),
plot.margin = margin(7, 7, 3, 1.5)
)
plot_RI_pies
```
Because the area of each pie wedge is proportional to its angle which is proportional to the data value the wedge represents, pie charts satisfy the principle of proportional ink. However, we perceive the area in a pie chart differently from the same area in a bar plot. The fundamental reason is that human perception primarily judges distances and not areas. Thus, if a data value is encoded entirely as a distance, as is the case with the length of a bar, we perceive it more accurately than when the data value is encoded through a combination of two or more distances that jointly create an area. To see this difference, compare Figure \@ref(fig:RI-pop-pie) to Figure \@ref(fig:RI-pop-bars), which shows the same data as bars. The difference in the number of inhabitants between Providence County and the other counties appears larger in Figure \@ref(fig:RI-pop-bars) than in Figure \@ref(fig:RI-pop-pie).
(ref:RI-pop-bars) Number of inhabitants in Rhode Island counties, shown as bars. The length of each bar is proportional to the number of inhabitants in the respective county. Data source: 2010 Decennial U.S. Census.
```{r RI-pop-bars, fig.width = 5.5, fig.cap = '(ref:RI-pop-bars)'}
ggplot(RI_pop, aes(x = factor(county, levels = rev(county)), y = pop2010, fill = county)) +
geom_col() +
scale_fill_OkabeIto(darken = -.3, guide = "none") +
scale_y_continuous(
expand = c(0, 0),
breaks = c(0, 2e5, 4e5, 6e5),
labels = c("0", "200,000", "400,000", "600,000"),
name = "number of inhabitants"
) +
scale_x_discrete(name = "county") +
coord_flip() +
theme_dviz_vgrid() +
theme(
#axis.line = element_blank(),
#axis.ticks.length = grid::unit(0, "pt"),
axis.ticks.y = element_blank(),
plot.margin = margin(3, 14, 3, 1.5)
)
```
The problem that human perception is better at judging distances than at judging areas also arises in treemaps (Figure \@ref(fig:RI-pop-treemap)), which can be thought of as a square versions of pie charts. Again, in comparison to Figure \@ref(fig:RI-pop-bars), the differences in the number of inhabitants among the counties appears less pronounced in Figure \@ref(fig:RI-pop-treemap).
(ref:RI-pop-treemap) Number of inhabitants in Rhode Island counties, shown as a treemap. The area of each rectangle is proportional to the number of inhabitants in the respective county. Data source: 2010 Decennial U.S. Census.
```{r RI-pop-treemap, fig.width = 4.5, fig.asp = .75, fig.cap = '(ref:RI-pop-treemap)'}
# fix label for Bristol
RI_pop$label_comb[5] <- paste(RI_pop$county[5], RI_pop$label_pop[5])
RI_pop$label_comb = factor(RI_pop$label_comb, levels = RI_pop$label_comb)
p <- ggplot(RI_pop, aes(area = pop2010, fill = county, label = label_comb)) +
geom_treemap(
color = "white", size = .75*.pt,
fixed = FALSE
) +
geom_treemap_text(
family = dviz_font_family,
colour = "black", place = "centre",
grow = FALSE,
fixed = FALSE
) +
scale_fill_OkabeIto(darken = -.3, guide = "none") +
coord_cartesian(clip = "off")
p
```