-
Notifications
You must be signed in to change notification settings - Fork 0
/
01-basic_data_handling.Rmd
305 lines (225 loc) · 15.7 KB
/
01-basic_data_handling.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
```{r, echo=FALSE}
library(knitr)
options(scipen = F)
#This code automatically tidies code so that it does not reach over the page
opts_chunk$set(tidy.opts=list(width.cutoff=50),tidy=TRUE, rownames.print = FALSE, rows.print = 10)
```
# Data handling
This chapter covers the basics of data handling in R
## Basic data handling
[You can download the corresponding R-Code here](./Code/01-basic_data_handling (1).R)
### Creating objects
Anything created in R is an object. You can assign values to objects using the assignment operator ``` <-```:
```{r}
x <- "hello world" #assigns the words "hello world" to the object x
#this is a comment
```
Note that comments may be included in the code after a ```#```. The text after ```#``` is not evaluated when the code is run; they can be written directly after the code or in a separate line.
To see the value of an object, simply type its name into the console and hit enter:
```{r}
x #print the value of x to the console
```
You can also explicitly tell R to print the value of an object:
```{r}
print(x) #print the value of x to the console
```
Note that because we assign characters in this case (as opposed to e.g., numeric values), we need to wrap the words in quotation marks, which must always come in pairs. Although RStudio automatically adds a pair of quotation marks (i.e., opening and closing marks) when you enter the opening marks it could be that you end up with a mismatch by accident (e.g., ```x <- "hello```). In this case, R will show you the continuation character “+”. The same could happen if you did not execute the full command by accident. The "+" means that R is expecting more input. If this happens, either add the missing pair, or press ESCAPE to abort the expression and try again.
To change the value of an object, you can simply overwrite the previous value. For example, you could also assign a numeric value to "x" to perform some basic operations:
```{r}
x <- 2 #assigns the value of 2 to the object x
print(x)
x == 2 #checks whether the value of x is equal to 2
x != 3 #checks whether the value of x is NOT equal to 3
x < 3 #checks whether the value of x is less than 3
x > 3 #checks whether the value of x is greater than 3
```
Note that the name of the object is completely arbitrary. We could also define a second object "y", assign it a different value and use it to perform some basic mathematical operations:
```{r}
y <- 5 #assigns the value of 2 to the object x
x == y #checks whether the value of x to the value of y
x*y #multiplication of x and y
x + y #adds the values of x and y together
y^2 + 3*x #adds the value of y squared and 3x the value of x together
```
<b>Object names</b>
Please note that object names must start with a letter and can only contain letters, numbers, as well as the ```.```, and ```_``` separators. It is important to give your objects descriptive names and to be as consistent as possible with the naming structure. In this tutorial we will be using lower case words separated by underscores (e.g., ```object_name```). There are other naming conventions, such as using a ```.``` as a separator (e.g., ```object.name```), or using upper case letters (```objectName```). It doesn't really matter which one you choose, as long as you are consistent.
### Data types
The most important types of data are:
Data type | Description
------------- | --------------------------------------------------------------------------
Numeric | Approximations of the real numbers, $\normalsize\mathbb{R}$ (e.g., mileage a car gets: 23.6, 20.9, etc.)
Integer | Whole numbers, $\normalsize\mathbb{Z}$ (e.g., number of sales: 7, 0, 120, 63, etc.)
Character | Text data (strings, e.g., product names)
Factor | Categorical data for classification (e.g., product groups)
Logical | TRUE, FALSE
Date | Date variables (e.g., sales dates: 21-06-2015, 06-21-15, 21-Jun-2015, etc.)
Variables can be converted from one type to another using the appropriate functions (e.g., ```as.numeric()```,```as.integer()```,```as.character()```, ```as.factor()```,```as.logical()```, ```as.Date()```). For example, we could convert the object ```y``` to character as follows:
```{r}
y <- as.character(y)
print(y)
```
Notice how the value is in quotation marks since it is now of type character.
Entering a vector of data into R can be done with the ``` c(x1,x2,..,x_n)``` ("concatenate") command. In order to be able to use our vector (or any other variable) later on we want to assign it a name using the assignment operator ``` <-```. You can choose names arbitrarily (but the first character of a name cannot be a number). Just make sure they are descriptive and unique. Assigning the same name to two variables (e.g. vectors) will result in deletion of the first. Instead of converting a variable we can also create a new one and use an existing one as input. In this case we ommit the ```as.``` and simply use the name of the type (e.g. ```factor()```). There is a subtle difference between the two: When converting a variable, with e.g. ```as.factor()```, we can only pass the variable we want to convert without additional arguments and R determines the factor levels by the existing unique values in the variable or just returns the variable itself if it is a factor already. When we specifically create a variable (just ```factor()```, ```matrix()```, etc.), we can and should set the options of this type explicitly. For a factor variable these could be the labels and levels, for a matrix the number of rows and columns and so on.
```{r }
#Numeric:
top10_track_streams <- c(163608, 126687, 120480, 110022, 108630, 95639, 94690, 89011, 87869, 85599)
#Character:
top10_artist_names <- c("Axwell /\\ Ingrosso", "Imagine Dragons", "J. Balvin", "Robin Schulz", "Jonas Blue", "David Guetta", "French Montana", "Calvin Harris", "Liam Payne", "Lauv") # Characters have to be put in ""
#Factor variable with two categories:
top10_track_explicit <- c(0,0,0,0,0,0,1,1,0,0)
top10_track_explicit <- factor(top10_track_explicit,
levels = 0:1,
labels = c("not explicit", "explicit"))
#Factor variable with more than two categories:
top10_artist_genre <- c("Dance","Alternative","Latino","Dance","Dance","Dance","Hip-Hop/Rap","Dance","Pop","Pop")
top10_artist_genre <- as.factor(top10_artist_genre)
#Date:
top_10_track_release_date <- as.Date(c("2017-05-24", "2017-06-23", "2017-07-03", "2017-06-30", "2017-05-05", "2017-06-09", "2017-07-14", "2017-06-16", "2017-05-18", "2017-05-19"))
#Logical
top10_track_explicit_1 <- c(FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,TRUE,TRUE,FALSE,FALSE)
```
In order to "call" a vector we can now simply enter its name:
```{r}
top10_track_streams
```
```{r}
top_10_track_release_date
```
In order to check the type of a variable the ```class()``` function is used.
```{r}
class(top_10_track_release_date)
```
The video below gives a general overview of vectors and provides a more in-depth discussion.
<div align="center">
<iframe width="560" height="315" src="https://www.youtube.com/embed/Xioc33EZxgU" frameborder="0" allowfullscreen></iframe>
</div>
### Data structures
Now let's create a table that contains the variables in columns and each observation in a row (like in SPSS or Excel). There are different data structures in R (e.g., Matrix, Vector, List, Array). In this course, we will mainly use <b>data frames</b>.
<p style="text-align:center;"><img src="https://github.com/IMSMWU/Teaching/raw/master/MRDA2017/Graphics/dataframe.JPG" alt="data types" height="320"></p>
Data frames are similar to matrices but are more flexible in the sense that they may contain different data types (e.g., numeric, character, etc.), where all values of vectors and matrices have to be of the same type (e.g. character). It is often more convenient to use characters instead of numbers (e.g. when indicating a persons sex: "F", "M" instead of 1 for female , 2 for male). Thus we would like to combine both numeric and character values while retaining the respective desired features. This is where "data frames" come into play. Data frames can have different types of data in each column. For example, we can combine the vectors created above in one data frame using ```data.frame()```. This creates a separate column for each vector, which is usually what we want (similar to SPSS or Excel).
```{r}
music_data <- data.frame(top10_track_streams,
top10_artist_names,
top10_track_explicit,
top10_artist_genre,
top_10_track_release_date,
top10_track_explicit_1)
```
#### Accessing data in data frames
When entering the name of a data frame, R returns the entire data frame:
```{r}
music_data # Returns the entire data frame
```
Hint: You may also use the ```View()```-function to view the data in a table format (like in SPSS or Excel), i.e. enter the command ```View(data)```. Note that you can achieve the same by clicking on the small table icon next to the data frame in the "Environment"-window on the right in RStudio.
Sometimes it is convenient to return only specific values instead of the entire data frame. There are a variety of ways to identify the elements of a data frame. One easy way is to explicitly state, which rows and columns you wish to view. The general form of the command is ```data.frame[rows,columns]```. By leaving one of the arguments of ```data.frame[rows,columns]``` blank (e.g., ```data.frame[rows,]```) we tell R that we want to access either all rows or columns, respectively. Here are some examples:
```{r}
music_data[ , 2:4] # all rows and columns 2,3,4
music_data[ ,c("top10_artist_names", "top_10_track_release_date")] # all rows and columns "top10_artist_names" and "top_10_track_release_date"
music_data[1:5, c("top10_artist_names", "top_10_track_release_date")] # rows 1 to 5 and columns "top10_artist_names"" and "top_10_track_release_date"
```
You may also create subsets of the data frame, e.g., using mathematical expressions:
```{r}
music_data[top10_track_explicit == "explicit",] # show only tracks with explicit lyrics
music_data[top10_track_streams > 100000,] # show only tracks with more than 100,000 streams
music_data[top10_artist_names == 'Robin Schulz',] # returns all observations from artist "Robin Schulz"
music_data[top10_track_explicit == "explicit",] # show only explicit tracks
```
The same can be achieved using the ```subset()```-function
```{r}
subset(music_data,top10_track_explicit == "explicit") # selects subsets of observations in a data frame
#creates a new data frame that only contains tracks from genre "Dance"
music_data_dance <- subset(music_data,top10_artist_genre == "Dance")
music_data_dance
rm(music_data_dance) # removes an object from the workspace
```
You may also change the order of the variables in a data frame by using the ```order()```-function
```{r}
#Orders by genre (ascending) and streams (descending)
music_data[order(top10_artist_genre,-top10_track_streams),]
```
#### Inspecting the content of a data frame
The ```head()``` function displays the first X elements/rows of a vector, matrix, table, data frame or function.
```{r}
head(music_data, 3) # returns the first X rows (here, the first 3 rows)
```
The ```tail()``` function is similar, except it displays the last elements/rows.
```{r}
tail(music_data, 3) # returns the last X rows (here, the last 3 rows)
```
```names()``` returns the names of an R object. When, for example, it is called on a data frame, it returns the names of the columns.
```{r}
names(music_data) # returns the names of the variables in the data frame
```
```str()``` displays the internal structure of an R object. In the case of a data frame, it returns the class (e.g., numeric, factor, etc.) of each variable, as well as the number of observations and the number of variables.
```{r}
str(music_data) # returns the structure of the data frame
```
```nrow()``` and ```ncol()``` return the rows and columns of a data frame or matrix, respectively. ```dim()``` displays the dimensions of an R object.
```{r}
nrow(music_data) # returns the number of rows
ncol(music_data) # returns the number of columns
dim(music_data) # returns the dimensions of a data frame
```
```ls()``` can be used to list all objects that are associated with an R object.
```{r}
ls(music_data) # list all objects associated with an object
```
#### Append and delete variables to/from data frames
To call a certain column in a data frame, we may also use the ```$``` notation. For example, this returns all values associated with the variable "top10_track_streams":
```{r}
music_data$top10_track_streams
```
Assume that you wanted to add an additional variable to the data frame. You may use the ```$``` notation to achieve this:
```{r}
# Create new variable as the log of the number of streams
music_data$log_streams <- log(music_data$top10_track_streams)
# Create an ascending count variable which might serve as an ID
music_data$obs_number <- 1:nrow(music_data)
head(music_data)
```
To delete a variable, you can simply create a ```subset``` of the full data frame that excludes the variables that you wish to drop:
```{r}
music_data <- subset(music_data,select = -c(log_streams)) # deletes the variable log streams
head(music_data)
```
You can also rename variables in a data frame, e.g., using the ```rename()```-function from the ```plyr``` package. In the following code "::" signifies that the function "rename" should be taken from the package "plyr". This can be useful if multiple packages have a function with the same name. Calling a function this way also means that you can access a function without loading the entire package via ```library()```.
```{r, message=FALSE, warning=FALSE}
library(plyr)
music_data <- plyr::rename(music_data, c(top10_artist_genre="genre",top_10_track_release_date="release_date"))
head(music_data)
```
Note that the same can be achieved using:
```{r, message=FALSE, warning=FALSE}
names(music_data)[names(music_data)=="genre"] <- "top10_artist_genre"
head(music_data)
```
Or by referring to the index of the variable:
```{r, message=FALSE, warning=FALSE}
names(music_data)[4] <- "genre"
head(music_data)
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
library(knitr)
library(dplyr)
library(stringr)
options(scipen = F)
#This code automatically tidies code so that it does not reach over the page
opts_chunk$set(tidy.opts=list(width.cutoff=50),tidy=FALSE, rownames.print = FALSE, rows.print = 10, eval = TRUE, warning = FALSE, message = FALSE)
top10_track_streams <- c(163608, 126687, 120480, 110022, 108630, 95639, 94690, 89011, 87869, 85599)
top10_artist_names <- c("Axwell /\\ Ingrosso", "Imagine Dragons", "J. Balvin", "Robin Schulz", "Jonas Blue", "David Guetta", "French Montana", "Calvin Harris", "Liam Payne", "Lauv") # Characters have to be put in ""
top10_track_explicit <- c(0,0,0,0,0,0,1,1,0,0)
top10_track_explicit <- factor(top10_track_explicit,
levels = 0:1,
labels = c("not explicit", "explicit"))
top10_artist_genre <- c("Dance","Alternative","Latino","Dance","Dance","Dance","Hip-Hop/Rap","Dance","Pop","Pop")
top10_artist_genre <- as.factor(top10_artist_genre)
top_10_track_release_date <- as.Date(c("2017-05-24", "2017-06-23", "2017-07-03", "2017-06-30", "2017-05-05", "2017-06-09", "2017-07-14", "2017-06-16", "2017-05-18", "2017-05-19"))
top10_track_explicit_1 <- c(FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,TRUE,TRUE,FALSE,FALSE)
music_data <- data.frame(top10_track_streams,
top10_artist_names,
top10_track_explicit,
top10_artist_genre,
top_10_track_release_date,
top10_track_explicit_1,
stringsAsFactors = FALSE)
```