diff --git a/.Rbuildignore b/.Rbuildignore
index 715a82a..34fd92f 100644
--- a/.Rbuildignore
+++ b/.Rbuildignore
@@ -3,8 +3,7 @@
 ^README.*$
 ^visual_test$
 ^docs$
-^vignettes$
-^dont-build-vignettes$
+^don't-build-vignettes$
 ^\.github$
 ^_pkgdown\.yml$
 ^pkgdown$
diff --git a/.gitignore b/.gitignore
index 8f42c41..d6f5d58 100644
--- a/.gitignore
+++ b/.gitignore
@@ -4,3 +4,4 @@
 *.Rproj
 docs
+inst/doc
diff --git a/DESCRIPTION b/DESCRIPTION
index 47ef33a..45bbe9f 100644
--- a/DESCRIPTION
+++ b/DESCRIPTION
@@ -24,7 +24,8 @@ Imports:
     rlang
 Suggests:
     RColorBrewer,
-    knitr
+    knitr,
+    rmarkdown
 Description: Visualize sequences in (modified) logo plots. The design choices
     used by these logo plots allow sequencing data to be more easily analyzed.
     Because it is integrated into the 'ggplot2' geom framework, these logo plots
@@ -39,3 +40,4 @@ Collate:
 RoxygenNote: 7.2.3
 Encoding: UTF-8
 LazyData: true
+VignetteBuilder: knitr
diff --git a/README.Rmd b/README.Rmd
index 366c8e7..95a8ddd 100644
--- a/README.Rmd
+++ b/README.Rmd
@@ -132,7 +132,7 @@ seq_info %>%
   ggplot() +
   geom_logo(aes(x = class, y = info, group = element, label = element, fill = interaction(Polarity, Water)),
-            alpha = 0.6, position="classic") +
+            alpha = 0.6) +
   scale_fill_brewer("Amino Acid\nproperties", palette = "Paired") +
   theme(legend.position = "bottom") +
   facet_wrap(~position, ncol = 12)
diff --git a/README.html b/README.html
index d841719..d5d2109 100644
--- a/README.html
+++ b/README.html
@@ -494,11 +494,11 @@

Implementation details

   ggplot() +
   geom_logo(aes(x = class, y = info, group = element, label = element, fill = interaction(Polarity, Water)),
-            alpha = 0.6, position="classic") +
+            alpha = 0.6) +
   scale_fill_brewer("Amino Acid\nproperties", palette = "Paired") +
   theme(legend.position = "bottom") +
   facet_wrap(~position, ncol = 12)
-

+

Available alphabets

diff --git a/README.md b/README.md
index ec5b87b..992d314 100644
--- a/README.md
+++ b/README.md
@@ -176,7 +176,7 @@ seq_info %>%
   ggplot() +
   geom_logo(aes(x = class, y = info, group = element, label = element, fill = interaction(Polarity, Water)),
-            alpha = 0.6, position="classic") +
+            alpha = 0.6) +
   scale_fill_brewer("Amino Acid\nproperties", palette = "Paired") +
   theme(legend.position = "bottom") +
   facet_wrap(~position, ncol = 12)
diff --git a/man/figures/unnamed-chunk-11-1.png b/man/figures/unnamed-chunk-11-1.png
index 8f39897..7deda4f 100644
Binary files a/man/figures/unnamed-chunk-11-1.png and b/man/figures/unnamed-chunk-11-1.png differ
diff --git a/vignettes/.gitignore b/vignettes/.gitignore
new file mode 100644
index 0000000..097b241
--- /dev/null
+++ b/vignettes/.gitignore
@@ -0,0 +1,2 @@
+*.html
+*.R
diff --git a/vignettes/Design_considerations.Rmd b/vignettes/Design_considerations.Rmd
new file mode 100644
index 0000000..e5b2b6e
--- /dev/null
+++ b/vignettes/Design_considerations.Rmd
@@ -0,0 +1,58 @@
+---
+title: "Considerations for re-designing the traditional logo sequence plot"
+output: rmarkdown::html_vignette
+vignette: >
+  %\VignetteIndexEntry{Design considerations}
+  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEncoding{UTF-8}
+---
+
+```{r, include = FALSE}
+knitr::opts_chunk$set(
+  collapse = TRUE,
+  comment = "#>"
+)
+```
+
+In a traditional logo sequence plot [@logo], sequences of nucleotides or amino acids are summarized by
+creating a frequency break-down of letters used in each position across the sequences (using the assumption that the sequences are aligned, and have all the same length $L$).
+
+The Shannon information $I$ [@shannon] is used as a measure of entropy (in bits) to describe the amount of mixture in each position.
+
+Let all sequences have length $L$, with an alphabet $A$ of size $K$. The alphabet is the set of all symbols/characters in a sequence. 
For nucleotides, the alphabet $A$ has $K=4$ elements, and generally consists of the set of letters {A, C, G, T} (standing for **A**denine, **C**ytosine, **G**uanine, **T**hymine) for DNA or the letters {A, C, G, U} for RNA (**U**racil replaces thymine).
+
+In the case of peptide sequences, each letter represents one of 21 amino acids (see e.g. [Wikipedia's codon chart](https://en.wikipedia.org/wiki/DNA_and_RNA_codon_tables#/media/File:Aminoacids_table.svg)).
+
+Let $f_a(p)$ be the (relative) frequency with which letter $a \in A$ is observed in position $p$, $1 \le p \le L$, of a set of sequences of length $L$.
+
+The **Shannon conservation index** $I$ in position $p$ is defined as
+\[
+I(p) := \log_2 (K) + \sum_{a \ \in \ A} f_{a}(p) \log_2\left(f_{a}(p)\right).
+\]
+By defining the expression $f \log_2 (f) := 0$ for $f = 0$, $I(p)$ is well defined for all frequencies $f \in [0,1]$. The measure reaches a minimum of 0 when $f_a = 1/K$ for all $a \in A$, while a maximum of $\log_2 (K)$ is reached when all frequencies $f_a$ are 0 except for one $a_0$ with $f_{a_0}=1$. (Note that the negative of the second term is a measure of information/entropy; the base of the logarithm determines how the entropy measure is named: base $e$ results in the natural entropy measured in 'nats', while base 10 is measured in 'dits'.)
+
+For 21 amino acids, the maximal conservation is $-\log_2 (1/21) = 4.39$ bits, which is reached if only a single amino acid is observed in the position, while perfect diversity/minimal conservation of 0 bits is reached when all 21 amino acids are observed with the same frequency.
+
+In the traditional logo sequence plot, a set of sequences is summarised by scaling the heights of the letters corresponding to each amino acid (letter in the alphabet) by their contribution to a position's total conservation. The letters are then stacked by size, with the amino acid of the largest contribution on top. 
+
+
+## Shortcomings of the traditional logo plot
+
+
+
+
+The **color choice** for representing the letters of amino acids is related to water solubility and polarity (e.g.\ hydrophobic, non-polar amino acids are shown in red) but this is not explicitly stated in a legend. Further, the use of letters to represent amino acids results in shapes of different visual dominance; the letter `I` is much less visually pronounced than, for example, `W`.
+
+\hh{This also has the potential to lead to ambiguity in the representation: e.g.\ using the standard Helvetica representation, the letter F over a T is not (easily) distinguishable from a letter E over an I.}
+
+**Non identified amino acids** are ignored in the original plot -- it is important to keep track of at least the position and the frequency of these occurrences, as they might indicate a problem with the sequencing.
+
+The plotting of **sequences of subfamilies** in separate logo plots does not facilitate a comparison of them. Researchers are particularly interested in differences in the conservation of amino acids between subfamilies.
+
+The **number of sequences** in each of the subfamilies is not shown directly. This influences the inherent variability. It is therefore important to keep track of these numbers to be able to assess how and whether the size of each subfamily affects conservation.
+
+
+
+```{r setup}
+library(gglogo)
+```
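
The extremes the vignette states for the conservation index (0 bits when all letters are equally frequent, $\log_2(K)$ bits when a single letter is observed) can be checked numerically. The following is a minimal R sketch and not part of the patch: `conservation_index` is a hypothetical helper, not a gglogo function, and it assumes the index is computed as $\log_2(K)$ minus the Shannon entropy of the position's relative frequencies.

```r
# Hypothetical helper (not part of gglogo): conservation in bits for one position.
# `freq` holds the relative frequencies of all K letters of the alphabet.
conservation_index <- function(freq, K = length(freq)) {
  stopifnot(all(freq >= 0), abs(sum(freq) - 1) < 1e-8)
  nz <- freq[freq > 0]             # convention: f * log2(f) := 0 for f = 0
  entropy <- -sum(nz * log2(nz))   # Shannon entropy in bits
  log2(K) - entropy
}

uniform <- rep(1 / 21, 21)         # all 21 amino acids equally frequent
single  <- c(1, rep(0, 20))        # only one amino acid observed

conservation_index(uniform)        # approximately 0 bits
conservation_index(single)         # log2(21), approximately 4.39 bits
```

Dropping zero frequencies before taking logarithms implements the vignette's convention $f \log_2(f) := 0$ and avoids `NaN` values from `0 * log2(0)`.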