Exercises/E01-intro/intro.Rmd

---
title: "E01 - Intro (Analysis of Gene Expression @ UCT Prague)"
author:
  - Jiri Novotny jiri.novotny@img.cas.cz
  - Michal Kolar kolarmi@img.cas.cz
  - Studuj bioinformatiku! http://studuj.bioinformatiku.cz
institute: "Laboratory of Genomics and Bioinformatics @ Institute of Molecular Genetics of the ASCR"
output:
  rmdformats::readthedown:
    highlight: "kate"
    lightbox: true
    thumbnails: true
    gallery: true
    toc_depth: 4
    self_contained: true
    number_sections: false
    toc_collapsed: false
    df_print: "paged"
date: "`r Sys.Date()`"
---

```{r, child = here::here("_assets/custom.Rmd"), eval = TRUE}
```

# Welcome to the Analysis of Gene Expression (AGE)!

Here you will find some introductory information, e.g. virtual machines, R programming tips etc.

You can find source code for these materials at <https://github.com/gorgitko/analysis_of_gene_expression>

## Contact

Michal Kolar: kolarmi@img.cas.cz

Jiri Novotny: jiri.novotny@img.cas.cz

***

# Virtual machines

We will work at Linux virtual machines (VM) provided by [Metacentrum](https://cloud2.metacentrum.cz/).
**The instructions how to set up your VM are in MS Teams
[class notebook](https://vscht.sharepoint.com/sites/143-O365-AGE/_layouts/15/Doc.aspx?sourcedoc={eda58aae-2580-4edd-b598-96db0c74fce2}&action=edit&wd=target%28_Content%20Library%2FOrganization.one%7C953d92bb-56b0-45a5-bd2d-409979e0959e%2FVirtual%20Machines%20%28VMs%5C%29%7Cb44057ea-406b-4556-bf9c-90dc38447daa%2F%29&wdorigin=703)**
(or in [this GitHub repository](https://github.com/bio-platform/bio-class-deb10) from Metacentrum Cloud Team).
Some you already have experience with VMs from *Genomics: Algorithms and Analysis* class.

If you are totally new to Linux shell, we recommend to look at the
[Shell Novice tutorial](https://swcarpentry.github.io/shell-novice/), which can be followed by
[Shell Extras](https://carpentries-incubator.github.io/shell-extras/).

## SSH config on your local machine

SSH config defined in `~/.ssh/config` <sup>1</sup> on your local machine provides you a simple management of SSH hosts.
In case of the VM for AGE, you can save it there as (replace `Hostname` and `User` by yours):

```
Host age
  Hostname 192.168.8.153
  User jirinovo
```

Then you can simply SSH to your VM using `ssh age`.
It will automatically use your private key in `~/.ssh`.

<sup>1</sup> `~` refers to your home directory and it is expanded to e.g. `/home/jirinovo`

## AGE files

We have two persistent storages mounted by `startNFS` command:

- `/data/shared` for shared data, and
- `/data/persistent/<user>` for your personal data.

To avoid data lose, I would recommend you to use the persistent storage.
For easier access from your home directory, create these symlinks:

```{bash}
ln -s /data/shared ~/shared
ln -s /data/persistent/$USER/ ~/persistent
```

Files for AGE are stored in `/data/shared/AGE_current`.
Because we all have the write permission in `/data/shared` and we don't want to overwrite work of others
`r emo::ji("slightly_smiling_face")`, please, create your personal directory in which you will
copy files on the beginning of each exercise:

```{bash}
mkdir -p ~/persistent/AGE/Exercises
```

And make a symlink in your home directory:

```{bash}
ln -s ~/persistent/AGE ~/AGE
```

Now, please, copy these files and directories to your personal directory:

```{bash}
cp -r ~/shared/AGE_current/Exercises/{_assets,E01-intro} ~/AGE/Exercises
```

And make a new RStudio project using the existing directory (`~/AGE/Exercises`):

![File > New Project...](`r here("E01-intro/_rmd_images/new_project_01.png")`)

![Choose "Existing Directory"](`r here("E01-intro/_rmd_images/new_project_02.png")`)

![Navigate to and select directory "~/AGE/Exercises"](`r here("E01-intro/_rmd_images/new_project_03.png")`)

## Accessing files on your VM

Basically you have three options:

1. RStudio file browser. Each file can be downloaded separately, multiple files are put into ZIP.
   Files can be uploaded one by one or you can upload ZIP, which is then automatically extracted.
2. Some GUI file manager supporting SFTP, such as [Krusader](https://krusader.org/) or [WinSCP](https://winscp.net/eng/index.php).
3. [sshfs](https://github.com/libfuse/sshfs), which can mount a directory from a remote server.
   That means you can use the remote directory just like your local one and you are only limited by your network bandwidth.
   On Ubuntu (and derived) distros, you can install `sshfs` with `sudo apt install sshfs` (i.e. do that on your local machine).
   It is recommended to use `sshfs` as a regular user, not root. For that, create a directory on your local machine, e.g.
   `mkdir ~/age_vm`. Then you can mount a remote directory: `sshfs user@hostname:directory ~/age_vm`.
   Or if you setup `~/.ssh/config` as described above, you can just write `sshfs age:/home/jirinovo ~/age_vm`

***

## [tmux](https://github.com/tmux/tmux/wiki) - terminal multiplexer

Normally, your SSH session will be lost if you turn off your computer or lost network connection.
[tmux](https://github.com/tmux/tmux/wiki) is a very helpful utility to manage multiple persistent shell sessions.
It is somewhat similar to quite old `screen`, but `tmux` is a better choice in my opinion.

`tmux` has a following hierarchy:

!["In sessions, you create windows which can be split to panes.](`r here("E01-intro/_rmd_images/tmux.png")`)

**The important thing is that `tmux` sessions are persistent until you restart the server.**
And so you can return back to your work after network outage, or run a long-term task and
quit your computer.

There are three ways how to issue commands inside a `tmux` session:

- **Shortcuts**: triggered by a prefix key followed by a command key.
  By default, `tmux` uses `Ctrl + b` as the `Prefix` key.
  For example, to create a new window you press `Ctrl + b` followed by `c` (in short: `Prefix + c`).
- **Command mode**: Enter command mode by pressing `Prefix` then `:`.
  This will open a command prompt at the bottom of the screen, which will accept `tmux` commands.
  For example, `kill-window` will kill the current window.
- **Command line**: Commands can also be entered directly to the command line within a `tmux` session.
  Usually these commands are prefaced by `tmux`.
  For example, `tmux kill-window` will do the same as in the command mode above.

Some basic commands:

- Start a new session: `tmux new -s my_session`
- Attach to an existing session: `tmux attach -t my_session`
- Create a new window in the current session: `tmux new-window` or `Prefix + c`
- Go to next window: `Prefix + n`
- Go to first/second/... window: `Prefix + 0`, `Prefix + 1`, ... (windows are numbered from zero)
- Detach from the current session: `tmux detach` or `Prefix + d`
- List windows in the current session: `Prefix + w`
- List sessions: `Prefix + s`
- Enter the copy mode (listing the window output): `Prefix + [`, exit with `q` or `Escape`
  - In the copy mode you can scroll by keyboard arrows or page up/down.
  - You can also search in output down or up by `Ctrl + s` and `Ctrl + r`, respectively.
    When you type your search string and press `Enter`, you can simply go to next match by `n`.

> `Ctrl` and `Alt` keys are usually written as `C-` and `M-`, respectively.

You can read quick guides [here](https://www.hamvocke.com/blog/a-quick-and-easy-guide-to-tmux/) and
[here](https://www.linode.com/docs/networking/ssh/persistent-terminal-sessions-with-tmux/) (more useful shortcuts there).
Also, there is a nice [cheatsheet](https://tmuxcheatsheet.com/).

***

## [conda](https://docs.conda.io/en/latest/): virtual environment and package manager

You may have heard about [conda](https://docs.conda.io/en/latest/), which is already installed on your VM.
In case you need to install some additional software, `conda` is one of the easiest ways to do so.
Also on your personal computers (especially running Windows) it can install software, which is normally hard to get working.
In case of bioinformatics, `conda` has become the preferred way for distribution of various tools.

- [Tutorial](https://conda.io/docs/user-guide/getting-started.html)
- [Cheatsheet](https://conda.io/projects/conda/en/latest/_downloads/843d9e0198f2a193a3484886fa28163c/conda-cheatsheet.pdf)
- [Nature Methods paper](https://www.nature.com/articles/s41592-018-0046-7)

### Virtual environments (venvs)

`conda` is able to create so-called "virtual environments" (venvs).
Those are environments isolated from system in which you can install software packages.
You can also export packages installed in a venv so everyone is able to recreate the venv with the same package versions.

Venvs are just directories with structure partially following the [UNIX file system hierarchy](https://www.geeksforgeeks.org/unix-file-system/)
(e.g. `/bin`, `/etc`, ...). The root venv is called `base` and is located in `conda` installation directory
(by default `~/miniconda3`, on your VM `/opt/bio-class/miniconda`).

Basically, when you activate or deactivate a venv, `conda` is taking care of changes in `PATH` environment variable,
which is used to search for executables. How is `conda` doing that? Simply by prepending `/path/to/venv/directory` to `PATH`,
so software in a venv is found earlier than, for example, your system wide software.

### conda packages

Conda packages contain precompiled software for specific OS, together with other data files like text, images etc.
Each package also knows its own dependencies, which are automatically installed.
Packages are available in channels ("directories"), which are maintained by users,
e.g. there are channels [bioconda](https://anaconda.org/bioconda), [intel](https://anaconda.org/intel), [R](https://anaconda.org/r) etc.
Packages are hosted in the [Anaconda cloud](https://anaconda.org/). Hosting is free for public packages, paid for private ones.
Because `conda` is normally installed locally in your home directory, you can use it on machines where you would normally need administrator access to install software.
That is very useful for computing on Metacentrum (and other computational grids)!

Fortunately at your VM, you have root permissions and most of the required software is already installed.
But still, `conda` makes installation of software, which requires compilation, super easy.
And also, by using venvs, you can install multiple versions of software to your VM.

### conda vs. Miniconda vs. Anaconda

- `conda`: the package manager itself.
- [Miniconda](https://docs.conda.io/en/latest/miniconda.html): minimal distribution of `conda`.
- [Anaconda Distribution](https://www.anaconda.com/distribution/): scientific distribution of `conda` with included packages (all of them can be installed with `conda`).

### conda basics

There is no need to install `conda` on your VM, however, if you would need it somewhere else, I will recommend you to go with
[Miniconda](https://docs.conda.io/en/latest/miniconda.html).

To switch to `base` venv, use `startConda` command.
Now you can see `(base)` on the left side of your bash prompt.

**Create a new venv**

This will create an empty venv: `conda create -n my_new_venv`

You can also specify packages which will be immediately installed to a new venv, e.g. `conda create -n my_new_venv bwa bowtie star salmon`

**Switch to the new venv**

`conda activate my_new_venv`

**Deactivate current venv**

`conda deactivate my_new_venv`

**Install packages in venv**

`conda install fastqc salmon`

You can also specify the package version: `conda install fastqc=0.10.1`

**Update packages in venv**

`conda update salmon`

Or all packages at once: `conda update --all`

**Remove packages in venv**

`conda uninstall fastqc`

**List packages installed in the current venv**

`conda list`

**List venvs**

`conda info --envs`

**Update conda** (you must be in `base` venv)

`conda update conda`

**Channels**

After fresh installation, `conda` will search for packages in `defaults` channel, where are only basic packages.
For example, when you try to install `fastqc` on a fresh `conda` installation...

```
$ conda install fastqc
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
PackagesNotFoundError: The following packages are not available from current channels:

  - fastqc

Current channels:

  - https://repo.anaconda.com/pkgs/main/linux-64
  - https://repo.anaconda.com/pkgs/main/noarch
  - https://repo.anaconda.com/pkgs/r/linux-64
  - https://repo.anaconda.com/pkgs/r/noarch

To search for alternate channels that may provide the conda package you're looking for, navigate to

    https://anaconda.org

and use the search bar at the top of the page.
```

...it fails, because `fastqc` was not found in `defaults` channel.
However, `conda` on your VM has already several **very useful** channels added.
You can view current channels with:

```
$ conda config --show channels
channels:
  - conda-forge
  - bioconda
  - defaults
```

- [conda-forge](https://conda-forge.org/) contains various packages, which are automatically built from GitHub repositories.
  There are many popular tools and libraries, such as `graphicsmagick` or `libgfortran`.
- [bioconda](https://bioconda.github.io/) is channel for bioinformatics.
  All the popular bioinfotools can be found there, such as previously shown `fastqc`.

Current channels are stored in `~/.condarc`.
Packages can have the same names, but can lie in different channels.
In such a case, package is installed from a more priority channel (top > down).

If you want to install package from a non-current channel,
you have to specify it: `conda install -c intel numpy`

***

## Fish shell

For a more comfortable experience in Linux shell, I can recommend [Fish shell](https://fishshell.com/),
the *f*riendly *i*nteractive *sh*ell, which is already installed on your VMs.

> Beware that Fish's syntax is not compatible with the standard sh or bash!
  For example, command substitution is written as `(command)` instead of `$(command)`.

***

## micro editor

If you need to edit some file in Linux shell, I can recommend [micro](https://micro-editor.github.io/) editor,
which is also installed on your VMs. It's a lighweight alternative to the classic
[GNU nano](https://www.nano-editor.org/), with keybindings and features very similar
to editors with graphical user interface.

***

# Programming in R

We expect you to have a basic knowledge of [R](https://www.r-project.org/).
Here are some links to refresh or gain your skills.

## Basics

- [swirl](https://swirlstats.com/students.html) - interactive tutorial directly in R session. **Highly recommended**.
- [R Nuts and Bolts](https://bookdown.org/rdpeng/rprogdatascience/r-nuts-and-bolts.html)
- [Base R cheatsheet](https://raw.githubusercontent.com/rstudio/cheatsheets/master/base-r.pdf)

## More complete tutorials/books

- [Big Book of R](https://www.bigbookofr.com/index.html) - a database of R books/tutorials/materials.
- [Data Analysis for the Life Sciences](https://leanpub.com/dataanalysisforthelifesciences)
- [Modern Statistics for Modern Biology](https://www.huber.embl.de/msmb/index.html)
- [R for Data Science](https://r4ds.had.co.nz/)
- [R tutorial for UCT subject "Statistická analýza dat" (in Czech)](https://github.com/lich-uct/r_tutorial)
- [Bioconductor for Genomic Data Science](https://kasperdanielhansen.github.io/genbioconductor/)
- [Coding Club](https://ourcodingclub.github.io/tutorials.html)
- [STAT 545](https://stat545.com/) - mainly about [tidyverse](https://www.tidyverse.org/).
- [R tutorial from University of Georgia](https://www.cyclismo.org/tutorial/R/index.html)
- [Programming with R](http://swcarpentry.github.io/r-novice-inflammation/)
- [R for Reproducible Scientific Analysis](http://swcarpentry.github.io/r-novice-gapminder/)
- [RMarkdown for Scientists](https://rmd4sci.njtierney.com/)

## Advanced R

- [Advanced R](https://adv-r.hadley.nz/) and [solutions](http://advanced-r-solutions.rbind.io/index.html) for its exercises.
- [What They Forgot to Teach You About R](https://rstats.wtf/)
- [R Inferno](https://www.burns-stat.com/pages/Tutor/R_inferno.pdf) - R pitfalls `r emo::ji("exploding_head")`

## Other useful links

- [Quick R reference with examples](https://onepager.togaware.com/survivor.html)
- [**Awesome list of cheatsheets**](https://rstudio.com/resources/cheatsheets/).
  Some of them are also accessible in RStudio: `Help -> Cheatsheets`
- [RStudio cheatsheet](https://raw.githubusercontent.com/rstudio/cheatsheets/master/rstudio-ide.pdf)
- [rdrr.io](https://rdrr.io/) - a great database of package references from multiple sources (CRAN, Bioconductor, GitHub, etc.).
- [Heatmaps in R](https://www.biostars.org/p/205417/)
  - Found here some interesting heatmap packages I hadn't know before:
    [ComplexHeatmap](https://jokergoo.github.io/ComplexHeatmap-reference/book/),
    [Superheat](https://rlbarter.github.io/superheat/index.html),
    [Heatplus](https://bioconductor.org/packages/release/bioc/html/Heatplus.html),
    [iheatmapr](https://docs.ropensci.org/iheatmapr/articles/full_vignettes/iheatmapr.html)

***

## Keyboard shortcuts in RStudio

You can view them in `Tools -> Keyboard Shortcuts Help`.

Personally I am using the most:

- `F1`: view help page of function under cursor.
- `F2`: go to definition of function under cursor.
- `Ctrl + Space`: autocompletion.
- `Alt + Enter`: run current line or selected text in console.
- `Alt + Up/Down`: move current line up/down.
- `Ctrl + D`: delete current line.

## What to do when RStudio is stuck

Sometimes something weird happens and RStudio is not responding. In most cases, R session is stuck and can't be simply restarted from RStudio.
First SSH to your VM and then you can try the following:

- `sudo service rstudio-server restart`. That will restart your RStudio Server. It should also kill the R session running under RStudio.
- `pkill -f rsession`. That will kill all running R sessions.

> **Be careful as unsaved files in RStudio will be lost!**
  If RStudio still allows you to change files in tabs, Ctrl + A / Ctrl + V unsaved ones to your local text editor and after restart paste them back.

***
***

# HTML rendering

This chunk is not evaluated (`eval = FALSE`). Otherwise you will probably end up in recursive hell `r emo::ji("exploding_head")`

```{r, eval = FALSE, message = FALSE, warning = FALSE}
library(conflicted)
library(knitr)
library(here)
library(emo)

if (!require(rmdformats)) {
  BiocManager::install("rmdformats")
}

# You can set global chunk options. Options set in individual chunks will override this.
opts_chunk$set(warning = FALSE, message = FALSE, eval = FALSE)
rmarkdown::render(
  here("E01-intro/intro.Rmd"),
  output_file = here("E01-intro/intro.html"),
  envir = new.env(),
  knit_root_dir = here()
)
```