-
Notifications
You must be signed in to change notification settings - Fork 2
/
AskaryHemmat_review_yawei.Rmd
201 lines (170 loc) · 8.68 KB
/
AskaryHemmat_review_yawei.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
---
title: "R_assignment"
output: html_document
---
## Review: You can find my comments start like this"##Review". Final files look good but you need to order them as numeric. right now there are some mistakes in ordering the postion I guess you ordered them as character. Overall comment: It looks good but needs some improvements and errors need to get fixed.
## R Markdown
This is yawei's R assignment.
My R assignment will consist of three parts:
Inspect and analysis two data files.
Additional analysis and visualization.
Reviewing two assignments from my two peers in class.
#loading fang_et_al_genotypes&snp_position
## Review- your original code was this:fang_et_al_genotypes <- read_tsv ("/Users/yaweili/yawei-R-Assignment/fang_et_al_genotypes.txt") snp_position <- read_tsv ("/Users/yaweili/yawei-R-Assignment/snp_position.txt") it's better to download the input files directly from the internet. Should not use absolute paths for input files. I chenged them so I can run it.
```{r}
library(tidyverse)
fang_et_al_genotypes <- read_tsv("https://github.com/EEOB-BioData/BCB546X-Fall2019/raw/master/assignments/UNIX_Assignment/fang_et_al_genotypes.txt")
snp_position <- read_tsv("https://github.com/EEOB-BioData/BCB546X-Fall2019/raw/master/assignments/UNIX_Assignment/snp_position.txt")
```
#Data inspection
#file.size
#the structure of the dataset
#number of columns, number of rows
## Review- Inspection looks good. just one thing, you should use another code for inspection of file size. Now it prints N/A
```{r}
file.size("fang_et_al_genotypes")
file.size("snp_position")
dim(fang_et_al_genotypes)
dim(snp_position)
colnames(fang_et_al_genotypes)
colnames(snp_position)
rownames(fang_et_al_genotypes)
rownames(snp_position)
```
#Data processing
#for the fang
#filter the group into both maize and teosinte
```{r}
maize_fang_et_al_genotypes <- filter(fang_et_al_genotypes, Group == "ZMMLR" | Group == "ZMMIL" | Group == "ZMMMR")
teosinte_fang_et_al_genotypes <- filter(fang_et_al_genotypes, Group == "ZMPBA" | Group == "ZMPIL" | Group == "ZMPJA")
```
#we need to remove the first three columnes of the two groups.
```{r}
maize_fang_et_al_genotypes <- maize_fang_et_al_genotypes[,-c(1,2,3)]
teosinte_fang_et_al_genotypes <- teosinte_fang_et_al_genotypes[,-c(1,2,3)]
```
#Transpose the dataset.
```{r}
maize_fang_et_al_genotypes <- t (maize_fang_et_al_genotypes)
teosinte_fang_et_al_genotypes <- t (teosinte_fang_et_al_genotypes)
```
#for the snp
#we need to extract the columnes of (1,3,4) from the data of snp
```{r}
snp_position <- snp_position[,-c(2,5:15)]
```
#remove the first row of SNP
#Then we can combine both fang and snp files.
```{r}
maize_fang_et_al_genotypes <- cbind(snp_position, maize_fang_et_al_genotypes)
teosinte_fang_et_al_genotypes <- cbind(snp_position, teosinte_fang_et_al_genotypes)
```
#Now we can start to creat files we want.
#For maize (Group = ZMMIL, ZMMLR, and ZMMMR in the third column of the fang_et_al_genotypes.txt file) we want 20 files in total:
#10 files (1 for each chromosome) with SNPs ordered based on increasing position values and with missing data encoded by this symbol: ?
## Review- till this part it looks like great but in the below chunk, I get this error: Error: All arguments must be named. And this error is for the last part of this chunk
```{r}
for (i in 1:10) {
maize <- filter(maize_fang_et_al_genotypes, Chromosome == i)
maize <- rename( maize,NA,"?")
maize <- arrange(maize, Position)
outpath <- "/Users/yaweili/yawei-R-Assignment/maize_increasing_data/"
nam <- sapply(
names(maize),function(x){
paste("maize_in", i, ".csv", sep='')
})
out_filePath <- sapply(nam, function(x){
paste(outpath, x, sep='/')})
write.csv(maize, file=out_filePath[i])
}
```
#10 files (1 for each chromosome) with SNPs ordered based on decreasing position values and with missing data encoded by this symbol: -
#1 file with all SNPs with unknown positions in the genome (these need not be ordered in any particular way)
## Review- getting another error:Error in filter(maize_ded, Chromosome == i) : object 'maize_ded' not found
```{r}
for (i in 1:10) {
maize <- filter(maize_ded, Chromosome == i)
maize <- arrange(maize, desc(Position))
outpath <- "/Users/yaweili/yawei-R-Assignment/maize_decreasing_data/"
nam <- sapply(
names(maize),function(x){
paste("maize_de", i, ".csv", sep='')
})
out_filePath <- sapply(nam, function(x){
paste(outpath, x, sep='/')})
write.csv(maize, file=out_filePath[i])
}
```
#1 file with all SNPs with multiple positions in the genome (these need not be ordered in any particular way)
#For teosinte (Group = ZMPBA, ZMPIL, and ZMPJA in the third column of the fang_et_al_genotypes.txt file) we want 20 files in total:
#10 files (1 for each chromosome) with SNPs ordered based on increasing position values and with missing data encoded by this symbol: ?
## Review: Error: All arguments must be named
```{r}
for (i in 1:10) {
teosinte <- filter(teosinte_fang_et_al_genotypes, Chromosome == i)
teosinte <- rename(teosinte,NA,"?")
teosinte <- arrange(teosinte, Position)
outpath <- "/Users/yaweili/yawei-R-Assignment/teosinte_increasing_data/"
nam <- sapply(
names(teosinte),function(x){
paste("teosinte_in", i, ".csv", sep='')
})
out_filePath <- sapply(nam, function(x){
paste(outpath, x, sep='/')})
write.csv(teosinte, file=out_filePath[i])
}
```
#10 files (1 for each chromosome) with SNPs ordered based on decreasing position values and with missing data encoded by this symbol: -
## Review: Error: Incomplete expression: for (i in 1:10) {teosinte <- filter(teosinte_ded, Chromosome == i)
teosinte <- arrange(teosinte, desc(Position))
outpath <- "Users/yaweili/yawei-R-Assignment/teosinte_decreasing_data/"
nam <- sapply(
names(teosinte),function(x){
paste("teosinte_dd", i, ".csv", sep='')
})
out_filePath <- sapply(nam, function(x){
paste(outpath, x, sep='/')})
write.csv(teosinte, file=out_filePath[i])
```{r}
for (i in 1:10) {
teosinte <- filter(teosinte_ded, Chromosome == i)
teosinte <- arrange(teosinte, desc(Position))
outpath <- "Users/yaweili/yawei-R-Assignment/teosinte_decreasing_data/"
nam <- sapply(
names(teosinte),function(x){
paste("teosinte_dd", i, ".csv", sep='')
})
out_filePath <- sapply(nam, function(x){
paste(outpath, x, sep='/')})
write.csv(teosinte, file=out_filePath[i])
```
#In conclusion, there are 40 files produced.
##Part II
#I need to use ggplot to visualize my data in this part.
#But first I need to reshape the original data (make it tidy) using the melt command in the reshape2 package before attempting this part.
## Review- Errors: Error: Negative column indexes in `[` must match number of columns: *`.data` has 3 columns * Position 2 equals -5 * Position 3 equals -6 * Position 4 equals -7 * Position 5 equals -8 * … and 7 more problems
```{r}
if (!require("tidyverse")) install.packages("tidyverse")
library(tidyverse)
if (!require("reshape2")) install.packages("reshape2")
library(reshape2)
maize_fang_et_al_genotypes <- filter(fang_et_al_genotypes, Group == "ZMMLR" | Group == "ZMMIL" | Group == "ZMMMR")
teosinte_fang_et_al_genotypes <- filter(fang_et_al_genotypes, Group == "ZMPBA" | Group == "ZMPIL" | Group == "ZMPJA")
maize_fang_et_al_genotypes <- maize_fang_et_al_genotypes[,-c(1,2,3)]#cut head
maize_fang_et_al_genotypes <- t (maize_fang_et_al_genotypes)
teosinte_fang_et_al_genotypes <- teosinte_fang_et_al_genotypes[,-c(1,2,3)]#Guillotine
teosinte_fang_et_al_genotypes <- t (teosinte_fang_et_al_genotypes)
snp_position <- snp_position[,-c(2,5:15)]#Guillotine
arrange(snp_position, SNP_ID)
maize_fang_et_al_genotypes <- cbind(snp_position, maize_fang_et_al_genotypes)
teosinte_fang_et_al_genotypes <- cbind(snp_position, teosinte_fang_et_al_genotypes)
rm(fang_et_al_genotypes)#clear up
rm(snp_position)#clear up
setwd("/Users/yaweili/yawei-R-Assignment/")
```
#SNPs per chromosome
Plot the total number of SNPs in our dataset on each chromosome. Also plot the distribution of SNPs on chromosomes.
#Missing data and amount of heterozygosity
Create a new column to indicate whether a particular site is homozygous (has the same nucleotide on both chromosomes (i.e., A/A, C/C, G/G, T/T) or heterozygous (otherwise)). Make a graph that shows the proportion of homozygous and heterozygous sites as well as missing data in each sample (you won't be able to see the sample names). Make another graph that shows the same data for each group. Normalize the height of individual bars using one of the ggplot "position adjustments" options.
#Your own visualization
Visualize one other feature of the dataset. The choice is up to you!