-
Notifications
You must be signed in to change notification settings - Fork 7
/
01_encode.Rmd
339 lines (243 loc) · 10.3 KB
/
01_encode.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
---
title: "01_encode_portal"
author: "JR"
date: "10/18/2020"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
Description:
Here we will be downloading data from the ENCODE portal. Specifically, the
"chromatin" interaction data, then DNA binding data, cell line HEPG2, "TF-Chip-seq".
We furhter selected "TF Chip seq", Control chip seq and histone chip seq.
We further selected several different read lengths to cover as many DNA binding
proteins (DBPs) as possible.
Read lengths: 100, 76, 75, 36
ONLY SINGLE END READS (this eliminates 54 samples)
We end up with a total of 554 biological samples, almost all have a biological replicate
and some have technical replicates, which we won't use.
So now we need to make a sample sheet that connects the file id to the DBP and
then use that for batch download from ENCODE.
First navigate to a directory in your computer of your choosing -- for example:
```
cd /scratch/rinnclass
```
This will take you to your home directory
The exact ENCODE report can be found here:
https://www.encodeproject.org/report/?type=Experiment&status=released&assay_slims=DNA+binding&biosample_ontology.term_name=HepG2&assay_title=TF+ChIP-seq&biosample_ontology.classification=cell+line&files.read_length=100&files.read_length=76&files.read_length=75&files.read_length=36&assay_title=Control+ChIP-seq&assay_title=Histone+ChIP-seq&files.run_type=single-ended
On the top of the site there is a "download TSV" click that and we can get started!
You now have a .TSV (tab seperated file or /t)
Next we need to get the FASTQ link for each of the replicates.
Go to the same ENCODE website selection where the experiments are listed and
click dowload.
To see these files in your director
```
ls -lah
# we will go over this more in next lecture
```
Teh text file is a list of URLs for the FASTQs. Let's use unix to grab a file
using curl and wget bash commands. First let's take a look at their manual. One
of the best things in bash is the manual page for each command! We recommend
checking the manual for any function (google will have them too :).
```
man curl
man wget
```
These two commands will go to the internet and download a file(s) into your current
directory.
Let's use this to start playing with unix commands. First delete the URLS so there
is are only two URLs to download (it would take a very long time otherwise)
Navigate to a directory using your terminal. We can make a files.txt document using
the command nano.
Let's take a peak
```
man nano
```
Basically a really easy to use text editor in BASH/unix
So let's make our first file copy and paste the url file you downloaded (with only 2 urls)
You can do this by these simple commands.
NANO is very useful and we will use a lot! Vi and EMACS are alternatives.
In the terminal type
```
nano files.txt
```
This automatically opens a nano window, but also creates a new file at same time
Now let's PASTE in the URL
To escape the window hit cntrol X, then return
Now let's see what happened:
```
nano files.txt
```
Voila it's there! Just remmeber if you accidently hit a key there is no ctrol Z :)
So now we have a text file in the directory we are in named files.txt
Let's paste in some URLs from the ENCODE download list (just 2 for now :)
Now let's paste these two URLs where we can download raw data files.
```
https://www.encodeproject.org/files/ENCFF212GYT/@@download/ENCFF212GYT.fastq.gz
https://www.encodeproject.org/files/ENCFF434BJG/@@download/ENCFF434BJG.fastq.gz
```
Let's read the file with cat.
```
cat files.txt
```
Do you see the URLS?
Now we can use CURL or WGET to read this file and go grab the actual file from WWW
we need to know some more BASH commands first:
If we only had one URL in the file we could just do
```
man curl
curl URL.com
```
But since we have a file with multiple lines we want to envoke the shell to
read all of the URLS and then pass all those onto the CURL command.
Let's take a look at xargs.
```
man xargs
```
Xargs invokes the shell to redirect the output of a command as the argument of another command
or in otherwords it passes whatever the computer is thinking about into the next
argument or command.
```
xargs -L 1 curl -O -J -L < files.txt
```
This may look a bit scary at first but let's break it down.
We almost want to read this right to left. Ultimately files.txt is put into
curl and that is put into the memory of the computer to go to URL and grab file.
xargs will ultimately be invoked to skip empty lines. Notice pipe is not needed
xargs is how the computer will think and or manipulate standard input.
Here we see that the -L flag will skip to next non empty line. Since there is
a header in the file we are saying "make the first line empyt and go to next"
So the file is inputed to curl which can check for updated links etc and then
xargs cleans it all up to ensure curl loops through all non-empty lines.
Let's take a look at what curl is thinking about.
```
xargs -L 1 curl -O -J -L < files.txt
```
Lets look at curl
```
man curl
```
Here we see that
# -O (output)
by default unix keeps everything in "standard output" or "memory" then it wants you
to tell it when to print that out etc. Standard output is a good term to remember in general.
Anywhoo we are using the -O flag to have curl print the file it is comminting
to standard output. So in short this ensures it will print the file after retreiving
it's contents and you can change the name of this output file.
# -J (replace string)
this makes sure after one URL is commited to standard input (the resulting file
to standard output) that it erases the previous standard input. If this was not
flagged then the next line would be appended to the previous and we would get
one monster compiled file of all the URLS!
# -L (location)
if the URL has been changed to a new one it will be sent forward to the new
location
NOTE these files are large so make sure you are working
somewhere and erase them after -- we won't be using these long term.
```
xargs -L 1 curl -O -J -L < files.txt
```
# What do you get in your directory?
This can be done many different ways and we chose an example that examplifies
how the computer thinks (xargs) and the importance (in BASH) of standard input
and standard output and how this information can be passed along.
#This could also be done with wget
```
man wget
```
here we see there is a flag for a "list" or -i
Try:
```
wget -i files.txt
```
Same result? This seems so much better but if urls changed or were updated etc
we would want to move back to more of an arugment with xargs.
In short there are so many ways to do the same thing!
# What if you were downloading a thousand files for a day or so? Do you turn you
power savings to never turn off?
# The bash solution: SCREEN
```
man screen
```
control A is for attach
D is for detach
screen -r #session# is to reattach (to session number below)
screen -list (tells you all the screen session you have runnign)
Ok, we now know how to access the WWW and download anything that is available!
#But how do we know if we downloaded the right file? Kinda scary right, what if the
file was missing a few lines or had some random internet glitch that made a gap
in the data? Yikes!
# Bash solution: m5sum
The original generator of this file will often provide a md5sum with the file you
want to download from them. This is a digital key that represents the exact nature
of the original file. md5sum is a command that can scan a file and produce this key
and if the files are identical md5sum logic will produce identical keys. Phew!
**** Please note how important this simple aspect is! What if you got new data from a
sequencing platform -- you typically download through and FTP site. Do you know
if your download was 100% successful? Not with out md5sum checks. Always request
an md5sum for any sequencing data you download ***
```
man md5sum
```
*** In fact the pre-run 17 of 1099 downloads failed to have the exact same file!
# So let's see if we downloaded the right files?
First let's get the md5sum values for the two URLS. Use "accession" in url
to search encode portal and see what the md5sum is for the two files.
you may have another name for it on your computer such as md5 on macosx. but
on most servers it will be md5sum
# so lets run it.
```
md5sum *.gz
```
note you can use md5sum on compressed files too such as .gz
check to see if your md5sum matches that on encode website.
# Would you want to do 1,099 times?
Probably not so luckily we can check a list! let's make a list with nano
```
nano md5sums.txt
```
The syntax for this is md5sum# " two spaces " and file.
For example:
```
4b3e7dc77448bc4971367a3b40196cc0 ENCFF434BJG.fastq.gz
```
First we need the file accession number which is embedded in the URL above
Paste these into md5sums.txt
```
ENCFF212GYT.fastq.gz
ENCFF434BJG.fastq.gz
```
Now we need the md5sum values from ENCODE website by searching these accession in
portal.
```
4b3e7dc77448bc4971367a3b40196cc0
14fcf34bf1846ba82c4341838bc6e1b4
```
paste in the two md5sums from the ENCODE website --- with two spaces ----,
ctrl x and enter now we have a file with two md5sums.
# Note this approach is not very useful for 1,000+ fastq files being downloaded in
full class data set. In the design file lecture we will use BASH and ENCODE API to
attach md5sums to each file automatically. Then download them all and check the
files. So don't worry we will get there in a way that doesn't involve cutting
and pasting :) We are simply exploring the importance of these basic principles
of data science.
```
md5sum -c md5sums.txt > md5sums_status.txt
# we will talk more about the > and printing files in next lecture
cat md5sums_status.txt
```
Typically this works out alright and if you want to just see if the number of
matched md5sum checks is same as number of files you can add: " | wc -l ""
this will read the standard output from the md5sum -- using the pipe " |"
kinda similar to xargs the pipe passes information from the left argument to
the right, which is wc (word count and -l means lines) this will tell you how
many lines of matched md5sums were found -- check the number is not less than
files downloaded.
```
md5sum -c md5sums.txt | wc -l
```
We will go over the pipe and lots of other basic bash commands in the next lecutre.
For now:
Congratulations -- all of ENCODE is now available ! Next we will continue
practicing BASH/unix in the .TSV file we downloaded earlier.