with vignette
heike committed Feb 1, 2024
1 parent fa6731d commit 2dc11ea
Showing 9 changed files with 69 additions and 7 deletions.
3 changes: 1 addition & 2 deletions .Rbuildignore
@@ -3,8 +3,7 @@
^README.*$
^visual_test$
^docs$
^vignettes$
^dont-build-vignettes$
^don't-build-vignettes$
^\.github$
^_pkgdown\.yml$
^pkgdown$
1 change: 1 addition & 0 deletions .gitignore
@@ -4,3 +4,4 @@

*.Rproj
docs
inst/doc
4 changes: 3 additions & 1 deletion DESCRIPTION
@@ -24,7 +24,8 @@ Imports:
rlang
Suggests:
RColorBrewer,
knitr
knitr,
rmarkdown
Description: Visualize sequences in (modified) logo plots. The design choices
used by these logo plots allow sequencing data to be more easily analyzed.
Because it is integrated into the 'ggplot2' geom framework, these logo plots
@@ -39,3 +40,4 @@ Collate:
RoxygenNote: 7.2.3
Encoding: UTF-8
LazyData: true
VignetteBuilder: knitr
2 changes: 1 addition & 1 deletion README.Rmd
@@ -132,7 +132,7 @@ seq_info %>%
ggplot() +
geom_logo(aes(x = class, y = info, group = element,
label = element, fill = interaction(Polarity, Water)),
alpha = 0.6, position="classic") +
alpha = 0.6) +
scale_fill_brewer("Amino Acid\nproperties", palette = "Paired") +
theme(legend.position = "bottom") +
facet_wrap(~position, ncol = 12)
4 changes: 2 additions & 2 deletions README.html

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion README.md
@@ -176,7 +176,7 @@ seq_info %>%
ggplot() +
geom_logo(aes(x = class, y = info, group = element,
label = element, fill = interaction(Polarity, Water)),
alpha = 0.6, position="classic") +
alpha = 0.6) +
scale_fill_brewer("Amino Acid\nproperties", palette = "Paired") +
theme(legend.position = "bottom") +
facet_wrap(~position, ncol = 12)
Binary file modified man/figures/unnamed-chunk-11-1.png
2 changes: 2 additions & 0 deletions vignettes/.gitignore
@@ -0,0 +1,2 @@
*.html
*.R
58 changes: 58 additions & 0 deletions vignettes/Design_considerations.Rmd
@@ -0,0 +1,58 @@
---
title: "Considerations for re-designing the traditional logo sequence plot"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Design considerations}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```

In a traditional logo sequence plot [@logo], sequences of nucleotides or amino acids are summarized by
creating a frequency breakdown of the letters used in each position across the sequences (assuming that the sequences are aligned and all have the same length $L$).

The Shannon information $I$ [@shannon] is used as a measure of entropy (in bits) to describe the amount of mixture in each position.

Let all sequences have length $L$, with an alphabet $A$ of size $K$. The alphabet is the set of all symbols/characters in a sequence. For nucleotides, the alphabet $A$ has $K=4$ elements, and generally consists of the set of letters {A, C, G, T} (standing for **A**denine, **C**ytosine, **G**uanine, **T**hymine) for DNA, or the letters {A, C, G, U} for RNA (**U**racil replaces thymine).

In the case of peptide sequences, each letter represents one of 21 amino acids (see e.g. [Wikipedia's codon chart](https://en.wikipedia.org/wiki/DNA_and_RNA_codon_tables#/media/File:Aminoacids_table.svg)).

Let $f_a(p)$ be the relative frequency with which letter $a \in A$ is observed in position $p$, $1 \le p \le L$, of a set of sequences of length $L$.
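
As a minimal sketch of this definition, the relative frequencies can be tabulated in base R. The toy alignment `seqs` and the helper name `position_frequencies()` are made up for illustration and are not part of the package.

```r
# Hypothetical toy alignment: four DNA sequences of equal length L = 5.
seqs <- c("ACGTA", "ACGTT", "ACGAA", "ACCTA")
alphabet <- c("A", "C", "G", "T")

# Split each sequence into single characters:
# rows are sequences, columns are positions p = 1, ..., L.
chars <- do.call(rbind, strsplit(seqs, ""))

# Relative frequency f_a(p): one row per letter a, one column per position p.
position_frequencies <- function(chars, alphabet) {
  freqs <- apply(chars, 2, function(column) {
    # Count every letter of the alphabet, including letters that do not occur.
    as.numeric(table(factor(column, levels = alphabet))) / length(column)
  })
  freqs <- matrix(freqs, nrow = length(alphabet))
  dimnames(freqs) <- list(alphabet, seq_len(ncol(chars)))
  freqs
}

freqs <- position_frequencies(chars, alphabet)
freqs
```

Each column of `freqs` sums to 1, and letters that never occur in a position get a frequency of 0.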

The **Shannon conservation index** $I$ in position $p$ is defined as
\[
I(p) := \log_2 (K) + \sum_{a \ \in \ A} f_{a}(p) \log_2\left(f_{a}(p)\right).
\]
By defining $f \log_2 (f) := 0$ for $f = 0$, $I(p)$ is well defined for all frequencies $f \in [0,1]$. The measure reaches a minimum of 0 when $f_a = 1/K$ for all $a \in A$, while a maximum of $\log_2 (K)$ is reached when all frequencies $f_a$ are 0 except for one $a_0$ with $f_{a_0}=1$. (Note that the sum is the negative of the Shannon entropy of the position. The choice of the base of the logarithm determines the unit of the entropy measure: base 2 gives bits, base $e$ gives the natural entropy measured in 'nats', and base 10 gives 'dits'.)

For 21 amino acids, the maximal conservation is $-\log_2 (1/21) = \log_2 (21) \approx 4.39$ bits, which is reached if only a single amino acid is observed in the position, while perfect diversity (a minimal conservation of 0 bits) is reached when all 21 amino acids are observed with the same frequency.
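
The two extremes can be checked numerically. The following is a small sketch in base R that computes the conservation as $\log_2(K)$ minus the Shannon entropy, so that equal frequencies give 0 bits and a single observed letter gives $\log_2(K)$ bits. The function name `shannon_conservation()` is an illustrative choice, not a function exported by the package.

```r
# Shannon conservation (in bits) for a vector of relative frequencies `f`
# over an alphabet of size K = length(f).
shannon_conservation <- function(f) {
  stopifnot(all(f >= 0), abs(sum(f) - 1) < 1e-8)
  K <- length(f)
  # Convention: f * log2(f) := 0 for f = 0.
  terms <- ifelse(f > 0, f * log2(f), 0)
  log2(K) + sum(terms)
}

# All 21 amino acids equally frequent: conservation is 0 (up to rounding error).
uniform <- shannon_conservation(rep(1 / 21, 21))

# Only one amino acid observed: conservation is log2(21), about 4.39 bits.
single <- shannon_conservation(c(1, rep(0, 20)))

c(uniform = uniform, single = single)
```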

In the traditional logo sequence plot, a set of sequences is summarized by scaling the height of the letter corresponding to each amino acid (letter in the alphabet) by its contribution to a position's total conservation. The letters are then stacked by size, with the amino acid of the largest contribution on top.
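
A short sketch of this scaling for a single position follows. It assumes the usual convention that the height of letter $a$ is $f_a(p) \cdot I(p)$, so that the heights add up to the position's conservation. The frequencies are made up for illustration.

```r
# Hypothetical relative frequencies for one position of a DNA alignment.
f <- c(A = 0.7, C = 0.2, G = 0.1, T = 0)

# Conservation of the position in bits, with f * log2(f) := 0 for f = 0.
entropy_terms <- ifelse(f > 0, f * log2(f), 0)
conservation <- log2(length(f)) + sum(entropy_terms)

# Assumed convention: each letter's height is its share of the conservation.
heights <- f * conservation

# Stacking order from bottom to top: the largest contribution ends up on top.
stack_order <- sort(heights[heights > 0])
stack_order
```

Letters with a frequency of 0 have height 0 and are dropped from the stack.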


## Shortcomings of the traditional logo plot




The **color choice** for representing the letters of amino acids is related to water solubility and polarity (e.g.\ hydrophobic, non-polar amino acids are shown in red) but this is not explicitly stated in a legend. Further, the use of letters to represent amino acids results in shapes of different visual dominance; the letter `I` is much less visually pronounced than for example `W`.

\hh{This also has the potential to create ambiguity in the representation: e.g.\ in the standard Helvetica typeface, the letter F stacked over a T is not easily distinguishable from the letter E stacked over an I.}

**Unidentified amino acids** are ignored in the original plot. It is important to keep track of at least the position and the frequency of these occurrences, as they might indicate a problem with the sequencing.

The plotting of **sequences of subfamilies** in separate logo plots does not facilitate a comparison of them. Researchers are in particular interested in differences in the conservation of amino acids between subfamilies.

The **number of sequences** in each of the subfamilies is not shown directly. This number influences the variability of the observed frequencies, and with it the variability of the conservation. It is therefore important to keep track of these numbers to be able to assess how and whether the size of each subfamily affects conservation.
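
The effect of the number of sequences can be illustrated with a small simulation. This is a sketch under stated assumptions and not part of the package: letters are drawn uniformly from an alphabet of size $K = 4$, so the true conservation is 0 bits, and the helper names, sample sizes and seed are arbitrary choices.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Conservation in bits for a vector of relative frequencies.
conservation <- function(f) {
  terms <- ifelse(f > 0, f * log2(f), 0)
  log2(length(f)) + sum(terms)
}

# Draw n letters uniformly from an alphabet of size K and estimate the
# conservation from the observed frequencies; repeat `reps` times.
simulate_conservation <- function(n, K = 4, reps = 500) {
  replicate(reps, {
    draws <- sample.int(K, size = n, replace = TRUE)
    f <- tabulate(draws, nbins = K) / n
    conservation(f)
  })
}

small <- simulate_conservation(n = 5)    # a subfamily with 5 sequences
large <- simulate_conservation(n = 500)  # a subfamily with 500 sequences

summary(small)
summary(large)
```

With few sequences the estimated conservation varies widely and tends to lie well above the true value of 0, which is why the subfamily size matters when comparing conservation between subfamilies.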



```{r setup}
library(gglogo)
```
