-
Notifications
You must be signed in to change notification settings - Fork 1
/
BasicDataAndPlots.jmd
238 lines (165 loc) · 8.11 KB
/
BasicDataAndPlots.jmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
# Acquiring some data and plotting it
## A simple data analysis tutorial using Julia
### By: Daniel Lakeland
### Lakeland Applied Sciences LLC
So you want to answer some questions you have about the
world... Suppose you'd like to see how the population of several
states has changed over time. We'll rely on CSV files published by the
Census. The goal of this tutorial will be to simply walk you through
acquiring the data, processing it into a form where it's easy to
analyze, and plotting some aspects of the data to help you see what
happened.
In the companion Discussion we'll talk about why the code looks like
it does and what other options there are.
## Getting the data
We'll rely on the [Queryverse](https://www.queryverse.org/)
meta-package which will provide us with ways to reads CSV files and to
pipe datasets through filtering and mutation operations. To find out
more you can
[watch a great introduction](https://youtu.be/OFPNph-WxLM?t=73) by the
author. And we'll use
[DataFrames](https://juliadata.github.io/DataFrames.jl/stable/), which
will let us create in-memory tabular data objects. We will also use
[Gadfly](http://gadflyjl.org/stable/) which is a plotting package
particularly oriented towards statistical data analysis plots. We
won't cover the Queryverse built-in plotting system VegaLite in this
tutorial, but might provide some separate tutorials.
Let's get started by loading up the required packages, and grabbing
some data:
```julia
using Queryverse,DataFrames;
cenfile="https://www2.census.gov/programs-surveys/popest/datasets/2010-2019/counties/totals/co-est2019-alldata.csv"
df = DataFrame(load(cenfile));
display(first(df,5))
```
## Reshaping the data to long form
Of course, the shape of this data is all wrong. There are 164 columns
each column has information on a different variable in a different
year! This data format is convenient for people wishing to by-hand
click some buttons and graph some data in Excel, because it places
related data items together in the dataset and they can select those
related items with a mouse. It is, however, horrible for someone
wishing to **program** a computer to do analysis in bulk.
This is a common data problem. To solve it we will want to reshape the
data using some
[DataFrame functions](https://juliadata.github.io/DataFrames.jl/stable/man/reshaping_and_pivoting/). In
particular `stack` will take a dataset and "stack" it into a long
form, with one row for each column mentioned. In this case we want all
the columns except Not the ones that identify the row.
## Let's Stack It
We'll stack the data frame, and then subset to the columns we might
care about to avoid a lot of extra junk on our screen, and select the
rows that represent statewide totals (where `STNAME == CTYNAME`), and
we'll add a `year` column for later use.
```julia
dflong = stack(df,Not([:SUMLEV,:REGION,:DIVISION,:STATE,:COUNTY,:STNAME,:CTYNAME ])) |> @filter(_.STNAME == _.CTYNAME) |> @mutate(year=-1) |> DataFrame;
select!(dflong,[:STNAME,:year,:variable, :value,:DIVISION,:REGION])
display(first(dflong,5))
```
Since the table is now "long form" we have one row per county per
measurement. We can therefore select out just the ones that have
population measurements. Let's take a look at which measurements those
are.
```julia
unique(dflong[:,:variable])
```
## Munging the data
Clearly POPESTIMATE variables are the ones we want, but the year has
been encoded into the symbol name because there was one column per
year... What we want is to split this column into one containing
POPESTIMATE and one containing the year.
Since we're going to want to edit the variable names in the variable
column, it's not convenient for them to be symbols. Let's convert them
to String and in the process, also strip off the year and put it into
a numerical `year` column.
We notice that the census always put the year pasted on to the end of
the symbol name (when appropriate). But it's not for every
symbol. We'll strip off the last 4 characters, and try to convert it
to an integer using `tryparse`.
```julia
display(first(dflong,5))
dflong2 = DataFrame(dflong |> @mutate(year = something(tryparse(Int,string(_.variable)[end-3:end]),-1), variable=String(_.variable)) |> @mutate(variable = _.year > 0 ? _.variable[1:end-4] : _.variable) )
display(first(dflong2,5))
```
There are lots of variables, so let's just select out the ones that
look like POPESTIMATE.
```julia
display(unique(dflong2.variable) )
pop = dflong2 |> @filter(_.variable == "POPESTIMATE") |>DataFrame
display(first(pop,5))
```
# Visualizing Aspects of Population Data
Now we have a DataFrame called `pop` that we can examine. Let's find
out what is going on in the population of these states over
time. We'll need to make some plots using Gadfly.
```julia
using Gadfly
set_default_plot_size(20cm,10cm);
display(plot(pop,x=:year,y=:value,Geom.line,color=:STNAME))
display(plot(pop,x=:value,Geom.density(bandwidth=1e6)))
```
Normally Jupyter only displays the last thing in a cell, so we
explicitly display each graph.
These plots get us some basic information, but they have all sorts of
problems. For example the color key for the 50 states takes up more of
the plot than the plot does. And there are too many lines to really
tell what's going on. And the density plot for population is pretty
interesting, but it has no labels and the units are not very
convenient. Let's fix the density plot first.
```julia
display(plot(DataFrame(pop |> @mutate(stpop=_.value/1e6)),
x=:stpop,Geom.density(bandwidth=1),
Guide.title("Distribution of State Populations"),
Guide.xlabel("Population (Millions of People)")))
```
Of course, this is the density for all observations across all the
years. Let's look at how the distribution of populations changed
between say 2015 and 2019:
```julia
set_default_plot_size(20cm,10cm)
display(hstack(plot(DataFrame(pop |> @mutate(stpop=_.value/1e6) |> @filter(_.year == 2015)),
x=:stpop,Geom.density(bandwidth=1),
Guide.title("Distribution of State Populations 2015"),
Guide.xlabel("Population (Millions of People)")),
plot(DataFrame(pop |> @mutate(stpop=_.value/1e6) |> @filter(_.year == 2019)),
x=:stpop,Geom.density(bandwidth=1),
Guide.title("Distribution of State Populations 2019"),
Guide.xlabel("Population (Millions of People)"))))
```
We can see that the distribution is relatively stable as might be
expected since tens of millions of people don't tend to all move
between states every few years. Let's select all the states with more
than 10 Million people, and look at how they trended in time, and
compare to states with less than 2 Million people.
```julia
bigstates = unique(pop[pop.value .> 10e6,:STNAME])
bigdata = DataFrame(pop |> @filter(_.STNAME in bigstates) |> @mutate(stpop=_.value/1e6))
bigplot = plot(bigdata, x=:year,y=:stpop, Geom.line,color=:STNAME,Geom.point,Guide.title("Population of Large States"))
smallstates = unique(pop[pop.value .< 2e6,:STNAME])
smalldata = DataFrame(pop |> @filter(_.STNAME in smallstates) |> @mutate(stpop=_.value/1e6))
smallplot = plot(smalldata, x=:year,y=:stpop,Geom.line,color=:STNAME,Geom.point,Guide.title("Population of Small States"))
set_default_plot_size(9inch,4inch)
hstack(bigplot,smallplot)
```
So far so good. We notice that Idaho has been growing nonlinearly
through time. Let's fit a quadratic to its growth curve and then we'll
call it a day.
```julia
using GLM
iddata = smalldata |> @filter(_.STNAME == "Idaho") |> DataFrame
idgrowth = lm(@formula(stpop ~ (year-2015)+(year-2015)^2),iddata)
display(coef(idgrowth))
preds = DataFrame(predict(idgrowth,DataFrame(year=[2020,2021]),interval=:prediction))
preds.State = ["Idaho","Idaho"]
preds.year = [2020,2021]
preds
```
Who knows how many people will be in Idaho in 2021, perhaps COVID-19
will cause an even bigger influx than the recent trend as people leave
CA. But in any case, now we know what we'd expect if the trend
continued as in the past. Extrapolating using a quadratic can be
problematic, but this isn't too far outside the range of plausible. If
we want to do more modeling, we should really learn some Bayesian
methods 😉.
That's it for now. If you have some interest in what we did, and why
we did it, you can read the Discussion document.