---
title: "E01 - Intro (Analysis of Gene Expression @ UCT Prague)"
author:
- Jiri Novotny jiri.novotny@img.cas.cz
- Michal Kolar kolarmi@img.cas.cz
- Studuj bioinformatiku! http://studuj.bioinformatiku.cz
institute: "Laboratory of Genomics and Bioinformatics @ Institute of Molecular Genetics of the ASCR"
output:
  rmdformats::readthedown:
    highlight: "kate"
    lightbox: true
    thumbnails: true
    gallery: true
    toc_depth: 4
    self_contained: true
    number_sections: false
    toc_collapsed: false
    df_print: "paged"
date: "`r Sys.Date()`"
---
```{r, child = here::here("_assets/custom.Rmd"), eval = TRUE}
```
# Welcome to the Analysis of Gene Expression (AGE)!
Here you will find some introductory information, e.g. about virtual machines and R programming tips.
You can find source code for these materials at <https://github.com/gorgitko/analysis_of_gene_expression>
## Contact
Michal Kolar: kolarmi@img.cas.cz
Jiri Novotny: jiri.novotny@img.cas.cz
***
# Virtual machines
We will work on Linux virtual machines (VMs) provided by [Metacentrum](https://cloud2.metacentrum.cz/).
**The instructions for setting up your VM are in the MS Teams
[class notebook](https://vscht.sharepoint.com/sites/143-O365-AGE/_layouts/15/Doc.aspx?sourcedoc={eda58aae-2580-4edd-b598-96db0c74fce2}&action=edit&wd=target%28_Content%20Library%2FOrganization.one%7C953d92bb-56b0-45a5-bd2d-409979e0959e%2FVirtual%20Machines%20%28VMs%5C%29%7Cb44057ea-406b-4556-bf9c-90dc38447daa%2F%29&wdorigin=703)**
(or in [this GitHub repository](https://github.com/bio-platform/bio-class-deb10) from Metacentrum Cloud Team).
Some of you already have experience with VMs from the *Genomics: Algorithms and Analysis* class.
If you are totally new to the Linux shell, we recommend looking at the
[Shell Novice tutorial](https://swcarpentry.github.io/shell-novice/), which can be followed by
[Shell Extras](https://carpentries-incubator.github.io/shell-extras/).
## SSH config on your local machine
SSH config defined in `~/.ssh/config` <sup>1</sup> on your local machine gives you a simple way to manage SSH hosts.
For the AGE VM, you can save an entry there as follows (replace `Hostname` and `User` with yours):
```
Host age
Hostname 192.168.8.153
User jirinovo
```
Then you can simply SSH to your VM using `ssh age`.
It will automatically use your private key in `~/.ssh`.
<sup>1</sup> `~` refers to your home directory and it is expanded to e.g. `/home/jirinovo`
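If you want to check how SSH interprets such an entry before connecting, the following sketch may help. It writes the same entry to a scratch file (a stand-in for `~/.ssh/config`, chosen here only for illustration) and asks `ssh` to print the options it resolves for the `age` host without opening a connection.

```shell
# Write the example entry to a scratch file instead of the real ~/.ssh/config.
scratch_cfg="$(mktemp)"        # hypothetical scratch file used only for this demo
cat > "$scratch_cfg" <<'EOF'
Host age
    Hostname 192.168.8.153
    User jirinovo
EOF

if command -v ssh >/dev/null 2>&1; then
    # -F selects the config file; -G prints the resolved options and exits without connecting
    resolved="$(ssh -F "$scratch_cfg" -G age 2>/dev/null | grep -E '^(hostname|user) ')"
else
    resolved="ssh is not installed here"
fi
echo "$resolved"
```

With your real config in place, `ssh -G age` alone performs the same check.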
## AGE files
We have two persistent storage locations, mounted by the `startNFS` command:
- `/data/shared` for shared data, and
- `/data/persistent/<user>` for your personal data.
To avoid data loss, I recommend using the persistent storage.
For easier access from your home directory, create these symlinks:
```{bash}
ln -s /data/shared ~/shared
ln -s /data/persistent/$USER/ ~/persistent
```
Files for AGE are stored in `/data/shared/AGE_current`.
Because we all have write permission in `/data/shared` and we don't want to overwrite each other's work
`r emo::ji("slightly_smiling_face")`, please create a personal directory into which you will
copy files at the beginning of each exercise:
```{bash}
mkdir -p ~/persistent/AGE/Exercises
```
And make a symlink in your home directory:
```{bash}
ln -s ~/persistent/AGE ~/AGE
```
Now please copy these files and directories to your personal directory:
```{bash}
cp -r ~/shared/AGE_current/Exercises/{_assets,E01-intro} ~/AGE/Exercises
```
And make a new RStudio project using the existing directory (`~/AGE/Exercises`):
![File > New Project...](`r here("E01-intro/_rmd_images/new_project_01.png")`)
![Choose "Existing Directory"](`r here("E01-intro/_rmd_images/new_project_02.png")`)
![Navigate to and select directory "~/AGE/Exercises"](`r here("E01-intro/_rmd_images/new_project_03.png")`)
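To see where the files set up in this section physically end up, the layout can be rehearsed in a throwaway directory. This is only a sketch: the `sandbox` tree and the fallback user name `student` are stand-ins invented for illustration, and nothing here touches `/data` or your real home directory.

```shell
# Rehearse the AGE directory layout inside a scratch directory.
sandbox="$(mktemp -d)"                      # hypothetical stand-in for the VM's filesystem
fake_home="$sandbox/home"                   # plays the role of ~
user_name="${USER:-student}"                # falls back to a made-up name if USER is unset
mkdir -p "$fake_home" "$sandbox/data/shared" "$sandbox/data/persistent/$user_name"

# The same steps as in this section, with paths redirected into the sandbox
ln -s "$sandbox/data/shared" "$fake_home/shared"
ln -s "$sandbox/data/persistent/$user_name" "$fake_home/persistent"
mkdir -p "$fake_home/persistent/AGE/Exercises"
ln -s "$fake_home/persistent/AGE" "$fake_home/AGE"

# pwd -P resolves symlinks: the exercises directory lives in persistent storage
resolved="$(cd "$fake_home/AGE/Exercises" && pwd -P)"
echo "$resolved"
```

The printed path ends in `data/persistent/<user>/AGE/Exercises`, which is why work saved under `~/AGE` ends up in persistent storage.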
## Accessing files on your VM
Basically, you have three options:
1. RStudio file browser. Each file can be downloaded separately; multiple files are packed into a ZIP archive.
Files can be uploaded one by one, or you can upload a ZIP archive, which is then automatically extracted.
2. Some GUI file manager supporting SFTP, such as [Krusader](https://krusader.org/) or [WinSCP](https://winscp.net/eng/index.php).
3. [sshfs](https://github.com/libfuse/sshfs), which can mount a directory from a remote server.
That means you can use the remote directory just like your local one and you are only limited by your network bandwidth.
On Ubuntu (and derived) distros, you can install `sshfs` with `sudo apt install sshfs` (i.e. do that on your local machine).
It is recommended to use `sshfs` as a regular user, not root. For that, create a directory on your local machine, e.g.
`mkdir ~/age_vm`. Then you can mount a remote directory: `sshfs user@hostname:directory ~/age_vm`.
Or, if you set up `~/.ssh/config` as described above, you can just write `sshfs age:/home/jirinovo ~/age_vm`.
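A directory mounted with `sshfs` should also be unmounted when you are done. The helper functions below are a sketch, assuming a Linux local machine with `sshfs` installed and the `age` host alias from the SSH config section. The function and variable names, and the use of the `reconnect` option, are choices made for this illustration and are not part of the course setup.

```shell
# Hypothetical helpers for mounting and unmounting the VM home directory locally.
AGE_MOUNT_DIR="$HOME/age_vm"          # local mount point, as in the text above
AGE_REMOTE="age:/home/jirinovo"       # replace with your own host alias and remote path

mount_age_vm() {
    mkdir -p "$AGE_MOUNT_DIR" || return 1
    # -o reconnect asks sshfs to re-establish the connection after a network drop
    sshfs "$AGE_REMOTE" "$AGE_MOUNT_DIR" -o reconnect
}

unmount_age_vm() {
    # fusermount -u unmounts a FUSE filesystem on Linux without root privileges
    fusermount -u "$AGE_MOUNT_DIR"
}
```

After pasting these definitions into your local shell, `mount_age_vm` mounts the directory and `unmount_age_vm` detaches it again.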
***
## [tmux](https://github.com/tmux/tmux/wiki) - terminal multiplexer
Normally, your SSH session is lost if you turn off your computer or lose your network connection.
[tmux](https://github.com/tmux/tmux/wiki) is a very helpful utility to manage multiple persistent shell sessions.
It is somewhat similar to the rather old `screen`, but `tmux` is a better choice in my opinion.
`tmux` has the following hierarchy:
![In sessions, you create windows, which can be split into panes.](`r here("E01-intro/_rmd_images/tmux.png")`)
**The important thing is that `tmux` sessions are persistent until you restart the server.**
So you can return to your work after a network outage, or start a long-running task and
turn off your computer.
There are three ways to issue commands inside a `tmux` session:
- **Shortcuts**: triggered by a prefix key followed by a command key.
By default, `tmux` uses `Ctrl + b` as the `Prefix` key.
For example, to create a new window you press `Ctrl + b` followed by `c` (in short: `Prefix + c`).
- **Command mode**: Enter command mode by pressing `Prefix` then `:`.
This will open a command prompt at the bottom of the screen, which will accept `tmux` commands.
For example, `kill-window` will kill the current window.
- **Command line**: Commands can also be entered directly on the command line within a `tmux` session.
Usually these commands are prefaced by `tmux`.
For example, `tmux kill-window` will do the same as in the command mode above.
Some basic commands:
- Start a new session: `tmux new -s my_session`
- Attach to an existing session: `tmux attach -t my_session`
- Create a new window in the current session: `tmux new-window` or `Prefix + c`
- Go to next window: `Prefix + n`
- Go to first/second/... window: `Prefix + 0`, `Prefix + 1`, ... (windows are numbered from zero)
- Detach from the current session: `tmux detach` or `Prefix + d`
- List windows in the current session: `Prefix + w`
- List sessions: `Prefix + s`
- Enter the copy mode (listing the window output): `Prefix + [`, exit with `q` or `Escape`
- In the copy mode you can scroll by keyboard arrows or page up/down.
- You can also search in output down or up by `Ctrl + s` and `Ctrl + r`, respectively.
When you type your search string and press `Enter`, you can simply go to next match by `n`.
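A common use of the commands above is to start a long-running job in a detached session and check on it later. The sketch below assumes `tmux` is installed; the session name `age_demo` and the `sleep 60` placeholder job are made up for illustration.

```shell
session="age_demo"                       # hypothetical session name
session_state="tmux not available"

if command -v tmux >/dev/null 2>&1; then
    # -d creates the session without attaching, so this also works from a script
    if tmux new-session -d -s "$session" 'sleep 60'; then
        tmux list-sessions               # non-interactive counterpart of Prefix + s
        if tmux has-session -t "$session" 2>/dev/null; then
            session_state="running"
        fi
        # Interactively, you would now attach with: tmux attach -t age_demo
        tmux kill-session -t "$session"  # clean up the demo session
    else
        session_state="could not start"
    fi
fi
echo "$session_state"
```

In real use you would skip the `kill-session` step, log out, and later reconnect with `tmux attach -t age_demo`.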
> `Ctrl` and `Alt` keys are usually written as `C-` and `M-`, respectively.
You can read quick guides [here](https://www.hamvocke.com/blog/a-quick-and-easy-guide-to-tmux/) and
[here](https://www.linode.com/docs/networking/ssh/persistent-terminal-sessions-with-tmux/) (more useful shortcuts there).
Also, there is a nice [cheatsheet](https://tmuxcheatsheet.com/).
***
## [conda](https://docs.conda.io/en/latest/): virtual environment and package manager
You may have heard about [conda](https://docs.conda.io/en/latest/), which is already installed on your VM.
In case you need to install some additional software, `conda` is one of the easiest ways to do so.
It can also install software on your personal computer (especially one running Windows) that is normally hard to get working.
In bioinformatics, `conda` has become the preferred way of distributing various tools.
- [Tutorial](https://conda.io/docs/user-guide/getting-started.html)
- [Cheatsheet](https://conda.io/projects/conda/en/latest/_downloads/843d9e0198f2a193a3484886fa28163c/conda-cheatsheet.pdf)
- [Nature Methods paper](https://www.nature.com/articles/s41592-018-0046-7)
### Virtual environments (venvs)
`conda` is able to create so-called "virtual environments" (venvs).
Those are environments isolated from the system, in which you can install software packages.
You can also export the list of packages installed in a venv, so that anyone can recreate the venv with the same package versions.
Venvs are just directories with a structure partially following the [UNIX file system hierarchy](https://www.geeksforgeeks.org/unix-file-system/)
(e.g. `/bin`, `/etc`, ...). The root venv is called `base` and is located in the `conda` installation directory
(by default `~/miniconda3`, on your VM `/opt/bio-class/miniconda`).
Basically, when you activate or deactivate a venv, `conda` takes care of changing the `PATH` environment variable,
which is used to search for executables. How does `conda` do that? Simply by prepending the venv's `bin` directory (`/path/to/venv/directory/bin`) to `PATH`,
so software in a venv is found before, for example, your system-wide software.
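The effect of prepending a directory to `PATH` can be reproduced without `conda`. The sketch below builds a fake venv in a temporary directory containing a made-up executable called `age_demo_tool` (both are assumptions for illustration) and shows that the shell finds it only while the venv's `bin` directory is at the front of `PATH`.

```shell
# Mimic what activating a venv does to PATH, using a throwaway directory instead of conda.
fake_venv="$(mktemp -d)"                  # hypothetical venv directory
mkdir -p "$fake_venv/bin"

# A made-up executable that exists only inside the fake venv
cat > "$fake_venv/bin/age_demo_tool" <<'EOF'
#!/bin/sh
echo "age_demo_tool from the venv"
EOF
chmod +x "$fake_venv/bin/age_demo_tool"

old_path="$PATH"
PATH="$fake_venv/bin:$PATH"               # "activate": the venv's bin directory is searched first
found_at="$(command -v age_demo_tool)"
echo "$found_at"

PATH="$old_path"                          # "deactivate": restore the previous PATH
if command -v age_demo_tool >/dev/null 2>&1; then
    after="still found"
else
    after="not found"
fi
echo "$after"
```

The same lookup order explains why a tool installed in an active venv shadows a system-wide tool of the same name.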
### conda packages
Conda packages contain precompiled software for a specific OS, together with other data files such as text, images, etc.
Each package also knows its own dependencies, which are automatically installed.
Packages are available in channels ("directories"), which are maintained by users,
e.g. there are channels [bioconda](https://anaconda.org/bioconda), [intel](https://anaconda.org/intel), [R](https://anaconda.org/r) etc.
Packages are hosted in the [Anaconda cloud](https://anaconda.org/). Hosting is free for public packages, paid for private ones.
Because `conda` is normally installed locally in your home directory, you can use it on machines where you would normally need administrator access to install software.
That is very useful for computing on Metacentrum (and other computational grids)!
Fortunately, on your VM you have root permissions and most of the required software is already installed.
Still, `conda` makes it very easy to install software that would otherwise require compilation.
Also, by using venvs, you can install multiple versions of the same software on your VM.
### conda vs. Miniconda vs. Anaconda
- `conda`: the package manager itself.
- [Miniconda](https://docs.conda.io/en/latest/miniconda.html): minimal distribution of `conda`.
- [Anaconda Distribution](https://www.anaconda.com/distribution/): scientific distribution of `conda` with included packages (all of them can be installed with `conda`).
### conda basics
There is no need to install `conda` on your VM. However, if you need it somewhere else, I recommend going with
[Miniconda](https://docs.conda.io/en/latest/miniconda.html).
To switch to the `base` venv, use the `startConda` command.
You will then see `(base)` on the left side of your bash prompt.
**Create a new venv**
This will create an empty venv: `conda create -n my_new_venv`
You can also specify packages which will be immediately installed to a new venv, e.g. `conda create -n my_new_venv bwa bowtie star salmon`
**Switch to the new venv**
`conda activate my_new_venv`
**Deactivate current venv**
`conda deactivate` (it takes no venv name; it always leaves the currently active venv)
**Install packages in venv**
`conda install fastqc salmon`
You can also specify the package version: `conda install fastqc=0.10.1`
**Update packages in venv**
`conda update salmon`
Or all packages at once: `conda update --all`
**Remove packages in venv**
`conda uninstall fastqc`
**List packages installed in the current venv**
`conda list`
**List venvs**
`conda info --envs`
**Update conda** (you must be in `base` venv)
`conda update conda`
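The commands above, together with the export step mentioned in the venv section, can be combined into one round trip. This is a sketch wrapped in a function so that nothing runs until you call it. The function name, the venv name `age_demo`, and the file name `age_demo.yml` are hypothetical, and `salmon` is only an example package (it requires the `bioconda` channel to be configured).

```shell
# Hypothetical helper: create a venv, export it, delete it, and recreate it from the export.
age_env_roundtrip() {
    env_name="age_demo"                    # made-up venv name
    env_file="age_demo.yml"                # made-up export file

    # -y skips the confirmation prompt
    conda create -y -n "$env_name" salmon || return 1
    # Write the venv's package names and versions to a YAML file
    conda env export -n "$env_name" > "$env_file" || return 1
    # Delete the venv ...
    conda env remove -y -n "$env_name" || return 1
    # ... and rebuild it from the exported file (the venv name is stored inside the file)
    conda env create -f "$env_file"
}
```

Sharing the exported YAML file is what lets someone else recreate the venv with the same package versions.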
**Channels**
After a fresh installation, `conda` searches for packages only in the `defaults` channel, which contains just basic packages.
For example, when you try to install `fastqc` on a fresh `conda` installation...
```
$ conda install fastqc
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
PackagesNotFoundError: The following packages are not available from current channels:
- fastqc
Current channels:
- https://repo.anaconda.com/pkgs/main/linux-64
- https://repo.anaconda.com/pkgs/main/noarch
- https://repo.anaconda.com/pkgs/r/linux-64
- https://repo.anaconda.com/pkgs/r/noarch
To search for alternate channels that may provide the conda package you're looking for, navigate to
https://anaconda.org
and use the search bar at the top of the page.
```
...it fails, because `fastqc` was not found in `defaults` channel.
However, `conda` on your VM already has several **very useful** channels added.
You can view current channels with:
```
$ conda config --show channels
channels:
- conda-forge
- bioconda
- defaults
```
- [conda-forge](https://conda-forge.org/) contains various packages, which are automatically built from GitHub repositories.
There are many popular tools and libraries, such as `graphicsmagick` or `libgfortran`.
- [bioconda](https://bioconda.github.io/) is a channel for bioinformatics.
All the popular bioinformatics tools can be found there, such as the previously shown `fastqc`.
Current channels are stored in `~/.condarc`.
Packages with the same name can exist in different channels.
In such a case, the package is installed from the channel with the higher priority (channels listed higher come first).
If you want to install a package from a channel that is not in your list,
you have to specify it: `conda install -c intel numpy`
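To make the priority rule concrete, the sketch below writes a `.condarc` containing the channel list shown above into a temporary directory (so your real `~/.condarc` is untouched) and prints the channels from top to bottom. The temporary path and the `awk` one-liner are illustrative assumptions; `conda` does its own parsing of the file.

```shell
scratch="$(mktemp -d)"                 # hypothetical scratch directory
condarc="$scratch/.condarc"

# Same channel list as in the `conda config --show channels` output above
cat > "$condarc" <<'EOF'
channels:
  - conda-forge
  - bioconda
  - defaults
EOF

# Print the list items in file order: the first line is the highest-priority channel
priority_order="$(awk '$1 == "-" { print $2 }' "$condarc")"
echo "$priority_order"
top_channel="$(echo "$priority_order" | head -n 1)"
echo "highest priority: $top_channel"
```

With this list, a package available in both `conda-forge` and `defaults` would be taken from `conda-forge`.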
***
## Fish shell
For a more comfortable experience in Linux shell, I can recommend [Fish shell](https://fishshell.com/),
the *f*riendly *i*nteractive *sh*ell, which is already installed on your VMs.
> Beware that Fish's syntax is not compatible with the standard sh or bash!
For example, command substitution is written as `(command)` instead of `$(command)`.
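For comparison, here is the bash form of command substitution, with the fish equivalent shown only as a comment. The variable name `today` is just an example.

```shell
# bash / sh: command substitution uses $(...)
today="$(date +%F)"
echo "Today is $today"

# fish equivalent (not valid bash, shown for comparison only):
#   set today (date +%F)
#   echo "Today is $today"
```

Scripts written for bash should therefore still be run with `bash`, even if fish is your interactive shell.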
***
## micro editor
If you need to edit some file in Linux shell, I can recommend [micro](https://micro-editor.github.io/) editor,
which is also installed on your VMs. It's a lightweight alternative to the classic
[GNU nano](https://www.nano-editor.org/), with keybindings and features very similar
to editors with graphical user interface.
***
# Programming in R
We expect you to have a basic knowledge of [R](https://www.r-project.org/).
Here are some links to help you refresh or build your skills.
## Basics
- [swirl](https://swirlstats.com/students.html) - interactive tutorial directly in R session. **Highly recommended**.
- [R Nuts and Bolts](https://bookdown.org/rdpeng/rprogdatascience/r-nuts-and-bolts.html)
- [Base R cheatsheet](https://raw.githubusercontent.com/rstudio/cheatsheets/master/base-r.pdf)
## More complete tutorials/books
- [Big Book of R](https://www.bigbookofr.com/index.html) - a database of R books/tutorials/materials.
- [Data Analysis for the Life Sciences](https://leanpub.com/dataanalysisforthelifesciences)
- [Modern Statistics for Modern Biology](https://www.huber.embl.de/msmb/index.html)
- [R for Data Science](https://r4ds.had.co.nz/)
- [R tutorial for UCT subject "Statistická analýza dat" (in Czech)](https://github.com/lich-uct/r_tutorial)
- [Bioconductor for Genomic Data Science](https://kasperdanielhansen.github.io/genbioconductor/)
- [Coding Club](https://ourcodingclub.github.io/tutorials.html)
- [STAT 545](https://stat545.com/) - mainly about [tidyverse](https://www.tidyverse.org/).
- [R tutorial from University of Georgia](https://www.cyclismo.org/tutorial/R/index.html)
- [Programming with R](http://swcarpentry.github.io/r-novice-inflammation/)
- [R for Reproducible Scientific Analysis](http://swcarpentry.github.io/r-novice-gapminder/)
- [RMarkdown for Scientists](https://rmd4sci.njtierney.com/)
## Advanced R
- [Advanced R](https://adv-r.hadley.nz/) and [solutions](http://advanced-r-solutions.rbind.io/index.html) for its exercises.
- [What They Forgot to Teach You About R](https://rstats.wtf/)
- [R Inferno](https://www.burns-stat.com/pages/Tutor/R_inferno.pdf) - R pitfalls `r emo::ji("exploding_head")`
## Other useful links
- [Quick R reference with examples](https://onepager.togaware.com/survivor.html)
- [**Awesome list of cheatsheets**](https://rstudio.com/resources/cheatsheets/).
Some of them are also accessible in RStudio: `Help -> Cheatsheets`
- [RStudio cheatsheet](https://raw.githubusercontent.com/rstudio/cheatsheets/master/rstudio-ide.pdf)
- [rdrr.io](https://rdrr.io/) - a great database of package references from multiple sources (CRAN, Bioconductor, GitHub, etc.).
- [Heatmaps in R](https://www.biostars.org/p/205417/)
- I found some interesting heatmap packages there that I hadn't known before:
[ComplexHeatmap](https://jokergoo.github.io/ComplexHeatmap-reference/book/),
[Superheat](https://rlbarter.github.io/superheat/index.html),
[Heatplus](https://bioconductor.org/packages/release/bioc/html/Heatplus.html),
[iheatmapr](https://docs.ropensci.org/iheatmapr/articles/full_vignettes/iheatmapr.html)
***
## Keyboard shortcuts in RStudio
You can view them in `Tools -> Keyboard Shortcuts Help`.
Personally, I use these the most:
- `F1`: view help page of function under cursor.
- `F2`: go to definition of function under cursor.
- `Ctrl + Space`: autocompletion.
- `Alt + Enter`: run current line or selected text in console.
- `Alt + Up/Down`: move current line up/down.
- `Ctrl + D`: delete current line.
## What to do when RStudio is stuck
Sometimes something weird happens and RStudio stops responding. In most cases, the R session is stuck and can't simply be restarted from RStudio.
First SSH to your VM and then you can try the following:
- `sudo service rstudio-server restart`. That will restart your RStudio Server. It should also kill the R session running under RStudio.
- `pkill -f rsession`. That will kill all running R sessions.
> **Be careful as unsaved files in RStudio will be lost!**
If RStudio still allows you to switch between file tabs, copy the contents of unsaved files (`Ctrl + A`, `Ctrl + C`) into a local text editor, and paste them back after the restart.
***
# HTML rendering
This chunk is not evaluated (`eval = FALSE`). Otherwise you will probably end up in recursive hell `r emo::ji("exploding_head")`
```{r, eval = FALSE, message = FALSE, warning = FALSE}
library(conflicted)
library(knitr)
library(here)
library(emo)
if (!require(rmdformats)) {
  BiocManager::install("rmdformats")
}
# You can set global chunk options. Options set in individual chunks will override this.
opts_chunk$set(warning = FALSE, message = FALSE, eval = FALSE)
rmarkdown::render(
  here("E01-intro/intro.Rmd"),
  output_file = here("E01-intro/intro.html"),
  envir = new.env(),
  knit_root_dir = here()
)
```