-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathvisualise_implicon_methylation_consistency_human.Rmd
144 lines (106 loc) · 5 KB
/
visualise_implicon_methylation_consistency_human.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
---
title: "Visualisation of methylation consistency in imprinted Amplicons"
output:
html_document: default
html_notebook: default
author: Felix Krueger
---
```{r}
library(tidyverse)
```
## Data Import
Data come in the following form:
```
readID sample allele implicon 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
1 BxC_2_clone_A5 CAST Impact 0 0 0 0 0 0 0 NA NA NA 0 0 0 0 0
2 BxC_2_clone_A5 CAST Prickle1 NA 1 1 1 1 1 1 0 1 1 1
3 BxC_2_clone_A5 CAST Commd1 NA 0 1 0 0 0 0 0 0 0 0 0 0
4 BxC_2_clone_A5 CAST Gnas NA 0 0 0 0 0 0 0 0 0
```
Each read comes annotated with its sample name, the implicon it overlaps, and the positions that were found methylated ('1') or unmethylated ('0). If a position of the implicon was not covered for any reason, a value of 'NA' was assigned. The read IDs are all unique.
```{r}
# This file is a collation of all reads that overlap targetted implicons for all samples
# Edit 26 March 2021: We had to change the input column type to "double", as a few columns were auto-detected as "character"
read_tsv("methylation_state_consistency_human.txt", col_types = cols(.default = "d", sample = col_character(), implicon = col_character())) -> input
colour.meth <- "#3e4444"
colour.unmeth <- "#82b74b"
colours <- c(colour.meth,colour.unmeth)
```
## Defining a function to do the following
The function takes in a certain Sample name as well as an Implicon name, and then:
- filters the comprehensive data table for a single Implicon
- filters for a single Sample
- transform the data to long format
- group the data by annotated CpG position, and exclude positions that were not covered at all (NA call)
- extract up to 5000 reads,
- sort reads from highly to lowly methylated, and finally
- plot reads per gene, per sample
```{r}
plot_implicons <- function(sample.name,gene.name){
#sample.name <- "Human_blood_F2"
#gene.name <- "H19"
print (paste("Now processing data set: ",sample.name,", Implicon name: ", gene.name))
# Filtering the Comprehensive Input table
input %>%
filter(implicon == gene.name ) %>%
filter(sample == sample.name) %>%
sample_n(min(5000,nrow(.))) -> filtered # random subset of 5000 rows
filtered
nrow(filtered) -> rows.filtered
# Transform the filtered data set to Tidyverse Long Format
filtered %>%
gather(key = position,value=state, -readID, -implicon, -sample) %>%
type_convert(cols(position=col_double())) -> filtered.tidy
# further filter tidied dataset to remove positions that were not covered at all and sort reads by readID
filtered.tidy %>%
group_by(position) %>%
mutate(keep = any(!is.na(state))) %>% # if a position contained only NAs,
# then FALSE is set for any () of the rows or that position
filter(keep) %>% # this removes positions from the dataset that were set to FALSE
select(-keep) %>% # remove the column 'keep' again
ungroup() %>% # remove grouping by position
group_by(readID) %>% # now group by readID
mutate(sum=sum(state,na.rm=T)) %>% # count up all methylated calls, and save in new column called 'sum'
ungroup() %>% # remove grouping by readID
arrange(readID) -> filtered.tidy2 # now sort by readID
# Now plot the data
filtered.tidy2 %>%
arrange(sum) %>% # reads are sorted from methylated to unmethylated
mutate(readID = factor(readID,levels=unique(readID))) %>% # set readID as level so they don't get sorted alphabetically
ggplot(aes(position, readID,fill=factor(state,levels=c("1","0")))) +
geom_tile() +
labs(title= paste0("Gene: ", gene.name),
subtitle = paste0("# reads: ",rows.filtered),
caption = paste0("Sample name: ", sample.name),
x = "\nCpG position") +
ylab(element_blank()) +
coord_cartesian(expand=F) + # this will prevent leaving empty values on either side of the plot
scale_fill_manual(values=colours, name="Methylation State",labels = c("meth", "unmeth")) +
theme(plot.title = element_text(hjust = 0.5),
legend.title = element_text(colour = "steelblue4", face = "bold"),
legend.text = element_text(face = "italic", colour="steelblue4"),
axis.line.y = element_blank(),
axis.ticks.y = element_blank(),
axis.text.x = element_text(vjust = 0.5),
axis.text.y = element_blank(),
axis.ticks.x = element_line()) -> plot.human
print (plot.human)
}
```
## Loop through all Implicon and sample names
```{r}
input %>%
distinct(sample) %>%
pull(sample) -> samplenames
# as_vector()-> samplenames ### as_vector() works as well.
input %>%
distinct(implicon) %>%
pull(implicon) -> implicons
# Looping through all Implicons, and then plotting all samples for each Implicon
for (imp in implicons){
# # print (imp)
invisible(sapply(samplenames, plot_implicons,imp))
}
# Just a single sample for testing purposes
# invisible(sapply("Tx", plot_implicons, "snrpn"))
```