-
Notifications
You must be signed in to change notification settings - Fork 1
/
Rscript01DataFormat.R
442 lines (381 loc) · 18 KB
/
Rscript01DataFormat.R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
################################################################################
### ###
### ---- Rscript01DataFormat.R: a script for the introduction to RSiena ---- ###
### ###
### version: September 8, 2020 ###
################################################################################
#
# Rscript01DataFormat.R is followed by
# RScriptSNADescriptives.R, code for descriptive analysis of the data, and
# Rscript02SienaVariableFormat.R, which formats data and specifies the model,
# and Rscript03SienaRunModel.R, which runs the model and estimates parameters
# Rscript04SienaBehaviour.R, which illustrates an example of analysing the
# coevolution of networks and behaviour
#
# The entire model fitting is summarised at the end of RscriptSienaRunModel.R
# (without comments)
#
# This is an R script for getting started with RSiena, written by
# Tom Snijders, with earlier contributions from Robin Gauthier,
# Ruth Ripley, Johan Koskinen, Paulina Preciado, and Zsofia Boda,
# with some examples borrowed from Christian Steglich.
# Lines starting with # are not processed by R but treated as comments.
# The script has a lot of explanation of R possibilities that will be
# familiar for readers well acquainted with R, and can be skipped by them.
#
# There are various really easy online introductions to R. See, for example
#
# http://www.statmethods.net/
# http://www.burns-stat.com/pages/Tutor/hints_R_begin.html
# http://data.princeton.edu/R/gettingStarted.html
# https://stats.idre.ucla.edu/r/
#
# You can go to any of these sites to learn the basics of R
# or refresh your knowledge.
# There is a lot of documentation available at
# https://www.r-project.org/other-docs.html
# including some short introductions, handy reference cards,
# and introductions in many languages besides English.
#
#
# Some general points to note:
#
# R is case-sensitive. Be aware of capitalization!
#
# The left-arrow "<-" is very frequently used: it denotes an assignment,
# "a <- b" meaning that object a gets the value b.
# Often b is a complicated expression that has to be evaluated by R,
# and computes a result that then is stored as the object a.
#
# Help within R can be called by typing a question mark and the name of the
# function you need help for. For example
# ?library
# will bring up a file titled "loading/attaching and listing of packages".
# Comments are made at the end of commands after #,
# or in lines starting with # telling R to ignore everything beyond it.
# That is why everything up to now in this file is on lines starting with #.
#
# This session will be using the s50 data which are available in RSiena.
# You can also get them from
# http://www.stats.ox.ac.uk/~snijders/siena/siena_datasets.htm
#
# Any command in R is a function, and ends by parentheses
# that enclose the arguments of the function,
# or enclose nothing if no argument is needed, such as for the function q()
# In general the command syntax for calling R's functions is function(x) where
# function is an available function and x the name of the object operated on.
#
#################### - CALLING THE DATA AND PRELIMINARY MANIPULATIONS - ########
# The library command loads the packages needed during the session.
library(RSiena)
# Some additional packages are used by RSiena,
# the so-called required packages; these will be loaded automatically.
# You need to have INSTALLED all of them.
?install.packages
# Or click on the tab "Packages", "Install package(s)", then select a
# CRAN mirror close to you (e.g. Bristol if you are in the UK) and finally
# select from the list the package you wish to install.
# Where are you?
getwd()
# By something like
# setwd('C:/SienaTest')
# you can set the directory but note the quotes and forward slash.
# It is also possible to set the directory using the menus if you have them.
# On a windows machine, you can predetermine the working directory
# in the <Properties> of the shortcut to R;
# these are accessible by right-clicking the shortcut.
# If you want to set the working directory to for example
# "C:\Documents and Settings\johan\My Documents\RSiena_course"
# simply copy and paste from windows explorer or type
# setwd('C:/Documents and Settings/johan/My Documents/RSiena_course')
# or
# setwd('C:\\Documents and Settings\\johan\\My Documents\\RSiena_course')
#
# but note that "\" has to be changed; both '/' and '\\' work!!!
# What is in this directory?
list.files()
# What is available in RSiena?
?RSiena
# (these are .htlm HELP PAGES)
# At the bottom of this page, when you click on "Index",
# a list of all the available functions is shown in your browser.
# The same list is shown in the graphical user interface for R by requesting
library(help=RSiena)
# An extensive manual is at the Siena website
# at http://www.stats.ox.ac.uk/~snijders/siena/RSiena_Manual.pdf ;
# it is updated frequently.
# Each data is named (for example below we name it friend.data.w1)
# so that we can call it as an object within R.
# If you read an object straight into R, it will treat it as a
# dataset, or in R terminology a "data frame".
# Here this is not what we want, therefore on reading
# we will immediately convert it to a matrix.
# R will read in many data formats, the command to read them is read.table.
?read.table
# In the help page for read.table, look at the section "Value",
# which is there in every help page:
# it indicates the class of the object that is produced by the function.
# For read.table, the value is a data frame; below we see what this is.
#
# If we wished to read a .csv file we would have
# used the read.csv command.
# The pathnames must have forward slashes, or double backslashes.
# If single backslashes are used, one of the error messages will be:
# 1: '\R' is an unrecognized escape in a character string
#-----------------------------------------------------------------------------
# Quick start (Data assignment).
# Please make sure the s50 data set is in your working directory.
# The data set is on the Siena website ("Data sets" tab) and must be
# unzipped in your working directory.
# friend.data.w1 <- as.matrix(read.table("s50-network1.dat"))
# friend.data.w2 <- as.matrix(read.table("s50-network2.dat"))
# friend.data.w3 <- as.matrix(read.table("s50-network3.dat"))
# drink <- as.matrix(read.table("s50-alcohol.dat"))
# smoke <- as.matrix(read.table("s50-smoke.dat"))
# To make this script self-contained, you may instead run the commands:
library(RSiena)
friend.data.w1 <- s501
friend.data.w2 <- s502
friend.data.w3 <- s503
drink <- s50a
smoke <- s50s
# because s501 etc. are internal data objects of RSiena.
# Explanation of data structures and formats below
################# - DIFFERENT DATA FORMATS - ###################################
###
### The assignments here involve reading in data to a "data frame"
### data <- read.table("data.dat")
### reads your text file into a "data frame";
### check the class of the object "data"
### > class(data)
### [1] "data.frame"
### If your data is not a ".dat" file, alternative import methods are
### ----- ".csv" ---------------------------------------------------------------
### data <- read.table("c:/data.csv", header=TRUE,
### sep=",", row.names="iden")
### ---- ".csv" ----------------------------------------------------------------
### data <- read.csv("data.csv",header=T)
### look at ?read.csv and ?read.table to understand these function calls
### ---- ".spss" ---------------------------------------------------------------
### library(foreign)
### data <- read.spss("c:/data.spss")
### ---- ".dta" ----------------------------------------------------------------
### library(foreign)
### data <- read.dta("c:/data.dta")
###
################################################################################
################# - FROM DATA FRAME TO MATRIX - ################################
###
### ---- Data frame ------------------------------------------------------------
### The data frame is like a spreadsheet of cases by variables,
### where the variables are the columns, and these have the names
### names( data )
### a spreadsheet view of a data frame is given by the fix() command
### fix( data )
###
### All changes you make in this spreadsheet will be saved automatically,
### so beware.
###
### As an example create two vectors:
height <- c( 120, 150, 180, 200, 190, 177, 170, 125, 141, 157 )
weight <- c( 11, 14, 17, 18, 17, 18, 11, 12, 10, 15 )
### The function c() combines its argument into a vector
### (or into a list, but we are not concerned with that possibility now.).
### These two vectors can be collected as variables in a data frame
mydata <- data.frame( height, weight )
### and look at the results
mydata
### The columns of a data frame may be extracted
### using a "$" sign and their names.
### For example:
names( mydata )
mydata$height
### Or by "[]" and column number, e.g.
mydata[1]
mydata[ , 1]
mydata[ 1, ]
### If you wish to see the structure of an object, such as mydata, then request
str(mydata)
### Objects often have attributes:
attributes(mydata)
### ---- Matrix ----------------------------------------------------------------
### A "matrix" is a numeric object ordered like a matrix with dimensions
### ( dim() ) given by the number of rows and columns, e.g.
dim( friend.data.w1 )
### If you wish to play around with a copy of the matrix,
###e.g. having the name "mydata", you can make the copy by the command
mydata <- friend.data.w1
### The earlier object that we created with the name "mydata" now has been lost.
### Elements if matrices can be accessed by using square brackets.
### For example, the element of "mydata" in row 2 and column 3 is given by
mydata[ 2, 3 ]
### the first three rows of a matrix called mydata are given by
mydata[ 1:3, ]
### columns 2, 5, and 6 are given by
mydata[ , c( 2, 5, 6) ]
### Indeed there are a lot of zeros - networks tend to be sparse.
### ---- Converting data frame to matrix ---------------------------------------
###
### Most classes can be converted to other classes through "as.typeofclass ()",
### e.g., if "mydata" would be an object with a matrix-like structure,
### then it could be converted to the class "matrix" by the command
mydata <- as.matrix( mydata )
################################################################################
################## - EXAMPLE FOR ARC LIST - ####################################
###
### From the Siena website you can download the data set arclistdata.dat.
### The function read.table can be used with the internet location:
ArcList <-
read.table( "https://www.stats.ox.ac.uk/~snijders/siena/arclistdata.dat",
header=FALSE )
### Note the capitalization.
### Now ArcList is a data.frame
### (we saw this above in the help page for read.table).
### What are its dimensions?
dim(ArcList)
### You can get an impression of it by looking at the start of the object:
head(ArcList)
### This is a data file in arclist format, with columns:
### sender id, receiver id, value of tie, wave.
### Add names to the columns:
names(ArcList) <- c( "sid", "recid", "bff", "wid" )
### and again request
head(ArcList)
### The bff ("best friend") variable does not have much variability:
table(ArcList$bff)
### This tells us that non-ties are not included (they would have the value 0),
### and that there are no tie values other than 1.
### It may be nice to order the rows by sender, then by receiver, then by wave.
### The tedious way to do this is
ArcList <- ArcList[ order( ArcList$sid, ArcList$recid, ArcList$wid), ]
### To understand this, you may look at the help page for function order()
### The rows of Arclist are ordered first by ArcList$sid, then by ArcList$recid,
### and then by ArcList$wid;
### these reordered rows then are put into the object ArcList,
### thus overwriting the earlier contents of this object.
### This way is tedious because it is repeated all the time that the names
### sid, recid, wid refer to the ArcList object.
### The less tedious way uses the function "with".
### The with(a, b) function tells R that b must be calculated,
### while the otherwise unknown names refer to data set a:
ArcList <- with( ArcList, ArcList[ order( sid, recid, wid), ] )
### An arc list does not give information about the number of nodes,
### because isolates are not recorded.
### The set of non-isolates is
union(unique(ArcList$sid), unique(ArcList$recid))
### and, given that we have the information that there are 50 nodes
### labeled 1 to 50, the isolates are the following two nodes:
setdiff(1:50, union(unique(ArcList$sid), unique(ArcList$recid)))
### Now suppose we want to create a separate set of records for each wave.
### Select by wave:
SAff.1 <- with(ArcList, ArcList[ wid == 1, ] ) #extracts edges in wave 1
SAff.2 <- with(ArcList, ArcList[ wid == 2, ] ) #extracts edges in wave 2
SAff.3 <- with(ArcList, ArcList[ wid == 3, ] ) #extracts edges in wave 3
### This can be arranged more efficiently as
SAff <- lapply(1:3, function(m){ with(ArcList, ArcList[ wid == m, ] ) } )
### with the correspondence that SAff.1 is SAff[[1]], etc.
### For transforming an adjacency matrix, e.g., friend.data.w1, into an arclist,
### create indicator matrix of the non-zero entries:
ones <- !friend.data.w1 %in% 0
### create empty edge list of desired length
edges <- matrix(0, sum(ones), 3)
### fill the columns of the edge list
edges[, 1] <- row(friend.data.w1)[ones]
edges[, 2] <- col(friend.data.w1)[ones]
edges[, 3] <- friend.data.w1[ones]
### if desired, order edge list by senders and then receivers
edges <- edges[order(edges[, 1], edges[, 2]), ]
### What did we make:
edges
### For transforming an arclist into a matrix,
### first remove the fourth column indicating the wave, so that we are left
### with sender, receiver and value of the tie,
### and transform it into matrix format (at first it is a data.frame)
SAff.1.copy <- SAff.1[, 1:3]
SAff.1.copy <- as.matrix(SAff.1.copy)
### create empty adjacency matrix
adj <- matrix(0, 50, 50)
### put edge values in desired places
adj[SAff.1.copy[,1:2]] <- SAff.1.copy[, 3]
### Now adj is the adjacency matrix.
### Also see Section 4.1 of the Siena manual
### and the help page for sienaDependent.
################################################################################
################ - READING IN PAJEK DATA - #####################################
###
### Skip this section if handling Pajek data is of no concern to you.
### If you have data in Pajek format you can use the package "network" in order
### to convert it to a network object. This example is from ?read.paj
### require( network )
###
### par( mfrow = c( 2, 2 ) )
###
### test.net.1 <-
### read.paj("http://vlado.fmf.uni-lj.si/pub/networks/data/GD/gd98/A98.net" )
### plot( test.net.1,main = test.net.1$gal$title )
### test.net.2 <-
### read.paj("http://vlado.fmf.uni-lj.si/pub/networks/data/mix/USAir97.net" )
### plot( test.net.2,main = test.net.2$gal$title )
###
################################################################################
# Before we work with the data, we want to be sure it is correct. A simple way
# to check that our data is a matrix is the command class()
class( friend.data.w1 )
# to list the properties of an object attributes( friend.data.w1 )
# (different classes have different attributes)
# To check that all the data has been read in, we can use the dim() command.
# The adjacency matrix should have the same dimensions as the original data
# (here, 50 by 50).
dim(friend.data.w1)
dim(drink)
# To check the values are correct, including missing values, we can use
# the following commands to tabulate the variables.
table( friend.data.w1, useNA = 'always' )
table( friend.data.w2, useNA = 'always' )
table( friend.data.w3, useNA = 'always' )
table( drink, useNA = 'always' )
table( smoke, useNA = 'always' )
# NA is the R code for missing data (Not Available).
# This data set happens to have no missings
# (see the data description on the Siena website).
# If there are any missings,
# it is necessary to tell R about the missing data codes.
# Let us do as if the missing codes for the friendship network were 6 and 9.
# This leads to the following commands.
# (For new R users: the c() function used here as "c(6,9)" constructs
# a vector [c for column] consisting of the numbers 6 and 9.
# This function is used a lot in basic R.)
friend.data.w1[ friend.data.w1 %in% c(6,9) ] <- NA
friend.data.w2[ friend.data.w2 %in% c(6,9) ] <- NA
friend.data.w3[ friend.data.w3 %in% c(6,9) ] <- NA
# Commands for descriptive analysis are in the script: RscriptSNADescriptives.R
############## - SELECTING SUBSETS OF DATA - ###################################
# To select a subset of the data based on an actor variable, say,
# those who have the value 2 or 3 on drinking at time 1
# (the possibilities are endless, but hopefully this will serve as a pattern)
use <- drink[, 1] %in% c(2, 3)
# This creates a logical vector which is TRUE for the cases where the condition
# is satisfied. To view or check, display the vectors next to each other:
cbind(drink[ , 1 ], use)
data.frame(drink[ , 1 ], use)
# and the number of selected cases is displayed by
sum( use )
# or
table( use )
# Given this selection, submatrices can be formed in case the analyses
# are to be done for this subset only:
friend1.data.w1 <- friend.data.w1[ use, use ]
friend1.data.w2 <- friend.data.w2[ use, use ]
drink1 <- drink[ use, ]
# A useful option in R that allows you to save your workspace:
save.image("WorkspaceRscript01.RData")
# Later you can load this in a new session by
# load("WorkspaceRscript01.RData")
# If next time you would like to continue from here, you will not need to open
# and run this script again, since you will be able to load this current state
# of your workspace. But packages will have to be attached again.
################################################################################
###
### ---- PROCEED TO RscriptSNADescriptives.R FOR DESCRIPTIVE ANALYSIS ----------
###
################################################################################