AskaryHemmat_review_yawei.Rmd


  
---
title: "R_assignment"
output: html_document
---
## Review: You can find my comments start like this"##Review". Final files look good but you need to order them as numeric. right now there are some mistakes in ordering the postion I guess you ordered them as character. Overall comment: It looks good but needs some improvements and errors need to get fixed.
## R Markdown
This is yawei's R assignment.
My R assignment will consist of three parts:
Inspect and analysis two data files. 
Additional analysis and visualization.
Reviewing two assignments from my two peers in class.
#loading fang_et_al_genotypes&snp_position
## Review- your original code was this:fang_et_al_genotypes <- read_tsv ("/Users/yaweili/yawei-R-Assignment/fang_et_al_genotypes.txt") snp_position <- read_tsv ("/Users/yaweili/yawei-R-Assignment/snp_position.txt") it's better to download the input files directly from the internet. Should not use absolute paths for input files. I chenged them so I can run it.
```{r}
library(tidyverse)

fang_et_al_genotypes <- read_tsv("https://github.com/EEOB-BioData/BCB546X-Fall2019/raw/master/assignments/UNIX_Assignment/fang_et_al_genotypes.txt")
snp_position <- read_tsv("https://github.com/EEOB-BioData/BCB546X-Fall2019/raw/master/assignments/UNIX_Assignment/snp_position.txt")
```
#Data inspection
#file.size
#the structure of the dataset
#number of columns, number of rows
## Review- Inspection looks good. just one thing, you should use another code for inspection of file size. Now it prints N/A
```{r}
file.size("fang_et_al_genotypes")
file.size("snp_position")
dim(fang_et_al_genotypes)
dim(snp_position)
colnames(fang_et_al_genotypes)
colnames(snp_position)
rownames(fang_et_al_genotypes)
rownames(snp_position)
```
#Data processing

#for the fang

#filter the group into both maize and teosinte
```{r}
maize_fang_et_al_genotypes <- filter(fang_et_al_genotypes, Group == "ZMMLR" | Group == "ZMMIL" | Group == "ZMMMR")
teosinte_fang_et_al_genotypes <- filter(fang_et_al_genotypes, Group == "ZMPBA" | Group == "ZMPIL" | Group == "ZMPJA")
```


#we need to remove the first three columnes of the two groups.
```{r}
maize_fang_et_al_genotypes <- maize_fang_et_al_genotypes[,-c(1,2,3)]
teosinte_fang_et_al_genotypes <- teosinte_fang_et_al_genotypes[,-c(1,2,3)]
```

#Transpose the dataset.
```{r}
maize_fang_et_al_genotypes <- t (maize_fang_et_al_genotypes)
teosinte_fang_et_al_genotypes <- t (teosinte_fang_et_al_genotypes)
```

#for the snp
#we need to extract the columnes of (1,3,4) from the data of snp
```{r}
snp_position <- snp_position[,-c(2,5:15)]
```


#remove the first row of SNP 


#Then we can combine both fang and snp files.
```{r}
maize_fang_et_al_genotypes <- cbind(snp_position, maize_fang_et_al_genotypes)
teosinte_fang_et_al_genotypes <- cbind(snp_position, teosinte_fang_et_al_genotypes)
```


#Now we can start to creat files we want.
#For maize (Group = ZMMIL, ZMMLR, and ZMMMR in the third column of the fang_et_al_genotypes.txt file) we want 20 files in total:

#10 files (1 for each chromosome) with SNPs ordered based on increasing position values and with missing data encoded by this symbol: ?
## Review- till this part it looks like great but in the below chunk, I get this error: Error: All arguments must be named. And this error is for the last part of this chunk
```{r}
for (i in 1:10) {
  maize <- filter(maize_fang_et_al_genotypes, Chromosome == i)
  maize <- rename( maize,NA,"?")
  maize <- arrange(maize, Position)
  outpath <- "/Users/yaweili/yawei-R-Assignment/maize_increasing_data/"
  nam <- sapply(
    names(maize),function(x){
                    paste("maize_in", i, ".csv", sep='')
    })
  out_filePath <- sapply(nam, function(x){
                     paste(outpath, x, sep='/')})
  write.csv(maize, file=out_filePath[i])
}
```


#10 files (1 for each chromosome) with SNPs ordered based on decreasing position values and with missing data encoded by this symbol: -


#1 file with all SNPs with unknown positions in the genome (these need not be ordered in any particular way)

## Review- getting another error:Error in filter(maize_ded, Chromosome == i) : object 'maize_ded' not found
```{r}
for (i in 1:10) {
  maize <- filter(maize_ded, Chromosome == i)
  maize <- arrange(maize, desc(Position))
  outpath <- "/Users/yaweili/yawei-R-Assignment/maize_decreasing_data/"
  nam <- sapply(
    names(maize),function(x){
                    paste("maize_de", i, ".csv", sep='')
    })
  out_filePath <- sapply(nam, function(x){
                     paste(outpath, x, sep='/')})
  write.csv(maize, file=out_filePath[i])
}
```

#1 file with all SNPs with multiple positions in the genome (these need not be ordered in any particular way)

#For teosinte (Group = ZMPBA, ZMPIL, and ZMPJA in the third column of the fang_et_al_genotypes.txt file) we want 20 files in total:

#10 files (1 for each chromosome) with SNPs ordered based on increasing position values and with missing data encoded by this symbol: ?

## Review: Error: All arguments must be named
```{r}
for (i in 1:10) {
  teosinte <- filter(teosinte_fang_et_al_genotypes, Chromosome == i)
  teosinte <- rename(teosinte,NA,"?")
  teosinte <- arrange(teosinte, Position)
  outpath <- "/Users/yaweili/yawei-R-Assignment/teosinte_increasing_data/"
  nam <- sapply(
    names(teosinte),function(x){
                    paste("teosinte_in", i, ".csv", sep='')
    })
  out_filePath <- sapply(nam, function(x){
                     paste(outpath, x, sep='/')})
  write.csv(teosinte, file=out_filePath[i])
}
```


#10 files (1 for each chromosome) with SNPs ordered based on decreasing position values and with missing data encoded by this symbol: -

## Review: Error: Incomplete expression: for (i in 1:10) {teosinte <- filter(teosinte_ded, Chromosome == i)
  teosinte <- arrange(teosinte, desc(Position))
  outpath <- "Users/yaweili/yawei-R-Assignment/teosinte_decreasing_data/"
  nam <- sapply(
    names(teosinte),function(x){
                    paste("teosinte_dd", i, ".csv", sep='')
    })
  out_filePath <- sapply(nam, function(x){
                     paste(outpath, x, sep='/')})
  write.csv(teosinte, file=out_filePath[i])
```{r}
for (i in 1:10) {
  teosinte <- filter(teosinte_ded, Chromosome == i)
  teosinte <- arrange(teosinte, desc(Position))
  outpath <- "Users/yaweili/yawei-R-Assignment/teosinte_decreasing_data/"
  nam <- sapply(
    names(teosinte),function(x){
                    paste("teosinte_dd", i, ".csv", sep='')
    })
  out_filePath <- sapply(nam, function(x){
                     paste(outpath, x, sep='/')})
  write.csv(teosinte, file=out_filePath[i])
```
#In conclusion, there are 40 files produced.
##Part II
#I need to use ggplot to visualize my data in this part. 
#But first I need to reshape the original data (make it tidy) using the melt command in the reshape2 package before attempting this part.

## Review- Errors: Error: Negative column indexes in `[` must match number of columns: *`.data` has 3 columns * Position 2 equals -5 * Position 3 equals -6 * Position 4 equals -7 * Position 5 equals -8 * … and 7 more problems
```{r}
if (!require("tidyverse")) install.packages("tidyverse")
library(tidyverse)
if (!require("reshape2")) install.packages("reshape2")
library(reshape2)
maize_fang_et_al_genotypes <- filter(fang_et_al_genotypes, Group == "ZMMLR" | Group == "ZMMIL" | Group == "ZMMMR")
teosinte_fang_et_al_genotypes <- filter(fang_et_al_genotypes, Group == "ZMPBA" | Group == "ZMPIL" | Group == "ZMPJA")
maize_fang_et_al_genotypes <- maize_fang_et_al_genotypes[,-c(1,2,3)]#cut head
maize_fang_et_al_genotypes <- t (maize_fang_et_al_genotypes)
teosinte_fang_et_al_genotypes <- teosinte_fang_et_al_genotypes[,-c(1,2,3)]#Guillotine
teosinte_fang_et_al_genotypes <- t (teosinte_fang_et_al_genotypes)
snp_position <- snp_position[,-c(2,5:15)]#Guillotine
arrange(snp_position, SNP_ID)
maize_fang_et_al_genotypes <- cbind(snp_position, maize_fang_et_al_genotypes)
teosinte_fang_et_al_genotypes <- cbind(snp_position, teosinte_fang_et_al_genotypes)
rm(fang_et_al_genotypes)#clear up
rm(snp_position)#clear up
setwd("/Users/yaweili/yawei-R-Assignment/")
```
#SNPs per chromosome
Plot the total number of SNPs in our dataset on each chromosome. Also plot the distribution of SNPs on chromosomes.
#Missing data and amount of heterozygosity
Create a new column to indicate whether a particular site is homozygous (has the same nucleotide on both chromosomes (i.e., A/A, C/C, G/G, T/T) or heterozygous (otherwise)). Make a graph that shows the proportion of homozygous and heterozygous sites as well as missing data in each sample (you won't be able to see the sample names). Make another graph that shows the same data for each group. Normalize the height of individual bars using one of the ggplot "position adjustments" options.
#Your own visualization
Visualize one other feature of the dataset. The choice is up to you!