-
Notifications
You must be signed in to change notification settings - Fork 0
/
07-loops-R.html
247 lines (241 loc) · 16.2 KB
/
07-loops-R.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="generator" content="pandoc">
<title>Software Carpentry: Programming with R</title>
<link rel="shortcut icon" type="image/x-icon" href="/favicon.ico" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<link rel="stylesheet" type="text/css" href="css/bootstrap/bootstrap.css" />
<link rel="stylesheet" type="text/css" href="css/bootstrap/bootstrap-theme.css" />
<link rel="stylesheet" type="text/css" href="css/swc.css" />
<link rel="alternate" type="application/rss+xml" title="Software Carpentry Blog" href="http://software-carpentry.org/feed.xml"/>
<meta charset="UTF-8" />
<!-- HTML5 shim, for IE6-8 support of HTML5 elements -->
<!--[if lt IE 9]>
<script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>
<![endif]-->
</head>
<body class="lesson">
<div class="container card">
<div class="banner">
<a href="http://software-carpentry.org" title="Software Carpentry">
<img alt="Software Carpentry banner" src="img/software-carpentry-banner.png" />
</a>
</div>
<article>
<div class="row">
<div class="col-md-10 col-md-offset-1">
<a href="index.html"><h1 class="title">Programming with R</h1></a>
<h2 class="subtitle">Analyzing multiple data sets</h2>
<section class="objectives panel panel-warning">
<div class="panel-heading">
<h2 id="learning-objectives"><span class="glyphicon glyphicon-certificate"></span>Learning Objectives</h2>
</div>
<div class="panel-body">
<ul>
<li>Explain what a <code>for</code> loop does.</li>
<li>Correctly write <code>for</code> loops to repeat simple calculations.</li>
<li>Trace changes to a loop variable as the loop runs.</li>
<li>Trace changes to other variables as they are updated by a <code>for</code> loop.</li>
<li>Use a function to get a list of filenames that match a simple pattern.</li>
<li>Use a <code>for</code> loop to process multiple files.</li>
</ul>
</div>
</section>
<p>We have created a function called <code>calcGDP</code> that calculates the Gross Domestic Product:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">calcGDP <-<span class="st"> </span>function(dat, <span class="dt">year=</span><span class="ot">NULL</span>, <span class="dt">country=</span><span class="ot">NULL</span>) {
if(!<span class="kw">is.null</span>(year)){
dat <-<span class="st"> </span>dat[dat$year %in%<span class="st"> </span>year, ]
}
if(!<span class="kw">is.null</span>(country))
{
dat <-<span class="st"> </span>dat[dat$country %in%<span class="st"> </span>country,]
}
gdp <-<span class="st"> </span>dat$pop *<span class="st"> </span>dat$gdpPercap
new <-<span class="st"> </span><span class="kw">cbind</span>(dat, <span class="dt">gdp=</span>gdp)
<span class="kw">return</span>(new)
}
gdp.argentina <-<span class="st"> </span><span class="kw">calcGDP</span>(<span class="dt">dat=</span>gapminder, <span class="dt">country=</span><span class="st">"Argentina"</span>)</code></pre></div>
<p>Following the plot example in lesson <a href="05-data-analysis.html">05</a> we can plot the gross domestic product of Argentina over the years:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">plot</span>(gdp.argentina$year, gdp.argentina$gdp)</code></pre></div>
<p><img src="fig/07-loops-gapminder-Argentina.png" title="plot of gross domestic product - Argentina" alt="plot of gross domestic product - Argentina" style="display: block; margin: auto;" /></p>
<p>but we have data for 142 countries in our dataset. We want to create plots for all of them with a single statement. To do that, we’ll have to teach the computer how to repeat things.</p>
<h3 id="for-loops">For Loops</h3>
<p>Suppose we want to print each word in a sentence. One way is to use six <code>print</code> statements:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">best_practice <-<span class="st"> </span><span class="kw">c</span>(<span class="st">"Let"</span>, <span class="st">"the"</span>, <span class="st">"computer"</span>, <span class="st">"do"</span>, <span class="st">"the"</span>, <span class="st">"work"</span>)
print_words <-<span class="st"> </span>function(sentence) {
<span class="kw">print</span>(sentence[<span class="dv">1</span>])
<span class="kw">print</span>(sentence[<span class="dv">2</span>])
<span class="kw">print</span>(sentence[<span class="dv">3</span>])
<span class="kw">print</span>(sentence[<span class="dv">4</span>])
<span class="kw">print</span>(sentence[<span class="dv">5</span>])
<span class="kw">print</span>(sentence[<span class="dv">6</span>])
}
<span class="kw">print_words</span>(best_practice)</code></pre></div>
<pre class="output"><code>[1] "Let"
[1] "the"
[1] "computer"
[1] "do"
[1] "the"
[1] "work"
</code></pre>
<p>but that’s a bad approach for two reasons:</p>
<ol style="list-style-type: decimal">
<li><p>It doesn’t scale: if we want to print the elements in a vector that’s hundreds long, we’d be better off just typing them in.</p></li>
<li><p>It’s fragile: if we give it a longer vector, it only prints part of the data, and if we give it a shorter input, it returns <code>NA</code> values because we’re asking for elements that don’t exist!</p></li>
</ol>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">best_practice[-<span class="dv">6</span>]</code></pre></div>
<pre class="output"><code>[1] "Let" "the" "computer" "do" "the"
</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">print_words</span>(best_practice[-<span class="dv">6</span>])</code></pre></div>
<pre class="output"><code>[1] "Let"
[1] "the"
[1] "computer"
[1] "do"
[1] "the"
[1] NA
</code></pre>
<aside class="callout panel panel-info">
<div class="panel-heading">
<h2 id="tip"><span class="glyphicon glyphicon-pushpin"></span>Tip</h2>
</div>
<div class="panel-body">
<p>R has has a special variable, <code>NA</code>, for designating missing values that are <strong>N</strong>ot <strong>A</strong>vailable in a data set. See <code>?NA</code> and <a href="http://cran.r-project.org/doc/manuals/r-release/R-intro.html#Missing-values">An Introduction to R</a> for more details.</p>
</div>
</aside>
<p>Here’s a better approach:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">print_words <-<span class="st"> </span>function(sentence) {
for (word in sentence) {
<span class="kw">print</span>(word)
}
}
<span class="kw">print_words</span>(best_practice)</code></pre></div>
<pre class="output"><code>[1] "Let"
[1] "the"
[1] "computer"
[1] "do"
[1] "the"
[1] "work"
</code></pre>
<p>This is shorter—certainly shorter than something that prints every character in a hundred-letter string—and more robust as well:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">print_words</span>(best_practice[-<span class="dv">6</span>])</code></pre></div>
<pre class="output"><code>[1] "Let"
[1] "the"
[1] "computer"
[1] "do"
[1] "the"
</code></pre>
<p>The improved version of <code>print_words</code> uses a <a href="reference.html#for-loop">for loop</a> to repeat an operation—in this case, printing—once for each thing in a collection. The general form of a loop is:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">for (variable in collection) {
do things with variable
}</code></pre></div>
<p>We can name the <a href="reference.html#loop-variable">loop variable</a> anything we like (with a few <a href="http://cran.r-project.org/doc/manuals/R-intro.html#R-commands_003b-case-sensitivity-etc">restrictions</a>, e.g. the name of the variable cannot start with a digit). <code>in</code> is part of the <code>for</code> syntax. Note that the body of the loop is enclosed in curly braces <code>{ }</code>. For a single-line loop body, as here, the braces aren’t needed, but it is good practice to include them as we did.</p>
<p>Here’s another loop that repeatedly updates a variable:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">len <-<span class="st"> </span><span class="dv">0</span>
vowels <-<span class="st"> </span><span class="kw">c</span>(<span class="st">"a"</span>, <span class="st">"e"</span>, <span class="st">"i"</span>, <span class="st">"o"</span>, <span class="st">"u"</span>)
for (v in vowels) {
len <-<span class="st"> </span>len +<span class="st"> </span><span class="dv">1</span>
}
<span class="co"># Number of vowels</span>
len</code></pre></div>
<pre class="output"><code>[1] 5
</code></pre>
<p>It’s worth tracing the execution of this little program step by step. Since there are five elements in the vector <code>vowels</code>, the statement inside the loop will be executed five times. The first time around, <code>len</code> is zero (the value assigned to it on line 1) and <code>v</code> is <code>"a"</code>. The statement adds 1 to the old value of <code>len</code>, producing 1, and updates <code>len</code> to refer to that new value. The next time around, <code>v</code> is <code>"e"</code> and <code>len</code> is 1, so <code>len</code> is updated to be 2. After three more updates, <code>len</code> is 5; since there is nothing left in the vector <code>vowels</code> for R to process, the loop finishes.</p>
<p>Note that a loop variable is just a variable that’s being used to record progress in a loop. It still exists after the loop is over, and we can re-use variables previously defined as loop variables as well:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">letter <-<span class="st"> "z"</span>
for (letter in <span class="kw">c</span>(<span class="st">"a"</span>, <span class="st">"b"</span>, <span class="st">"c"</span>)) {
<span class="kw">print</span>(letter)
}</code></pre></div>
<pre class="output"><code>[1] "a"
[1] "b"
[1] "c"
</code></pre>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="co"># after the loop, letter is</span>
letter</code></pre></div>
<pre class="output"><code>[1] "c"
</code></pre>
<p>Note also that finding the length of a vector is such a common operation that R actually has a built-in function to do it called <code>length</code>:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">length</span>(vowels)</code></pre></div>
<pre class="output"><code>[1] 5
</code></pre>
<p><code>length</code> is much faster than any R function we could write ourselves, and much easier to read than a two-line loop; it will also give us the length of many other things that we haven’t met yet, so we should always use it when we can (see this <a href="01-supp-data-structures.html">lesson</a> to learn more about the different ways to store data in R).</p>
<section class="challenge panel panel-success">
<div class="panel-heading">
<h2 id="challenge---using-loops"><span class="glyphicon glyphicon-pencil"></span>Challenge - Using loops</h2>
</div>
<div class="panel-body">
<ol style="list-style-type: decimal">
<li>Write a function called <code>total</code> that calculates the sum of the values in a vector. (R has a built-in function called <code>sum</code> that does this for you. Please don’t use it for this exercise.)</li>
</ol>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">ex_vec <-<span class="st"> </span><span class="kw">c</span>(<span class="dv">4</span>, <span class="dv">8</span>, <span class="dv">15</span>, <span class="dv">16</span>, <span class="dv">23</span>, <span class="dv">42</span>)
<span class="kw">total</span>(ex_vec)</code></pre></div>
<pre class="output"><code>[1] 108
</code></pre>
</div>
</section>
<h3 id="generating-multiple-plots">Generating Multiple Plots</h3>
<p>We now have almost everything we need to generate the gdp plots for all countries. The only thing missing is creating a vector that contains all the countries in our dataset. Using the function <code>unique</code> on the <code>country</code> column of <code>gapminder</code> object, we can extract the 142 countries:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">countries <-<span class="st"> </span><span class="kw">unique</span>(gapminder$country)
<span class="kw">length</span>(countries)
<span class="kw">tail</span>(countries, <span class="dt">n=</span><span class="dv">3</span>)</code></pre></div>
<pre class="output"><code>[1] 142
[1] "Yemen Rep." "Zambia" "Zimbabwe"
</code></pre>
<aside class="callout panel panel-info">
<div class="panel-heading">
<h2 id="tip-1"><span class="glyphicon glyphicon-pushpin"></span>Tip</h2>
</div>
<div class="panel-body">
<p>For larger projects, it is recommended to organize separate parts of the analysis into multiple subdirectories, e.g. one subdirectory for the raw data, one for the code, and one for the results like figures. We have done that here to some extent, putting all of our data files into the subdirectory “data”. For more advice on this topic, you can read <a href="http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1000424">A quick guide to organizing computational biology projects</a> by William Stafford Noble.</p>
</div>
</aside>
<section class="challenge panel panel-success">
<div class="panel-heading">
<h2 id="challenge---using-loops-to-generate-multiple-plots"><span class="glyphicon glyphicon-pencil"></span>Challenge - Using loops to generate multiple plots</h2>
</div>
<div class="panel-body">
<p>Our goal is to generate a script (please go back to your <code>data_analysis</code> R script) that analyses our <code>gapminder</code> data as follows:</p>
<ol style="list-style-type: decimal">
<li><p>Reads the data into a data frame.</p></li>
<li><p>For each country the script generates three plots; one of the total population, one of life expectancy and one of gross domestic product over the years.</p></li>
<li><p>Save the script and commit the changes.</p></li>
</ol>
</div>
</section>
<!---
> ## Key Points {.callout}
>
> * Use `for (variable in collection)` to process the elements of a collection one at a time.
> * The body of a `for` loop is surrounded by curly braces (`{ }`).
> * Use `uni(thing)` to determine the length of something that contains other values.
> * Use `list.files(path = "path", pattern = "pattern", full.names = TRUE)` to create a list of files whose names match a pattern.
-->
<aside class="callout panel panel-info">
<div class="panel-heading">
<h2 id="next-steps"><span class="glyphicon glyphicon-pushpin"></span>Next Steps</h2>
</div>
<div class="panel-body">
<p>We have now solved our original problem: we can generate any number of plots with a single command. More importantly, we have met two of the most important ideas in programming:</p>
<ul>
<li>Use functions to make code easier to re-use and easier to understand.</li>
<li>Use vectors and data frames to store related values, and loops to repeat operations on them.</li>
</ul>
<p>We have one more big idea to introduce…</p>
</div>
</aside>
</div>
</div>
</article>
<div class="footer">
<a class="label swc-blue-bg" href="http://software-carpentry.org">Software Carpentry</a>
<a class="label swc-blue-bg" href="https://github.com/swcarpentry/lesson-template">Source</a>
<a class="label swc-blue-bg" href="mailto:admin@software-carpentry.org">Contact</a>
<a class="label swc-blue-bg" href="LICENSE.html">License</a>
</div>
</div>
<!-- Javascript placed at the end of the document so the pages load faster -->
<script src="http://software-carpentry.org/v5/js/jquery-1.9.1.min.js"></script>
<script src="css/bootstrap/bootstrap-js/bootstrap.js"></script>
</body>
</html>