- Introduction
- Installation
- Quick Input Data
- Visualizations
- Customization via Simple Inputs
- The ‘colLevels()’ Helper Function
- Session information
- References
dittoViz is a tool built to make analysis and visualization of data easier, faster, and more accessible. All included visualizations were built with color-vision impaired users and viewers in mind. Thus, it provides functions for jump-starting many useful visualization types, which all utilize red-green color-blindness optimized colors by default, and which allow sufficient customization, via discrete inputs, for out-of-the-box creation of publication-ready figures.
The intention is for dittoViz to be a generalized version of the popular yet scRNAseq focused dittoSeq package. Although we have started with simply pulling the plotting features out of all dittoSeq’s visualizations, the goal is to include many more plot types and to grow dittoViz into a comprehensive, data-type agnostic visualization suite. If there is a plot type missing that you would like to suggest be added, please do so by creating an issue on the github!
The default colors of this package are red-green color-blindness
friendly. To make it so, I used the suggested colors from (Wong 2011)
and adapted them slightly by appending darker and lighter versions to
create a 24 color vector. All plotting functions use these colors,
stored in dittoColors()
, by default.
Additionally:
- Shapes displayed in the legends are generally enlarged as this can be almost as helpful as the actual color choice for colorblind individuals.
- When sensible, dittoViz functions have a shape.by input for having groups displayed through shapes in addition to color. (But note: even as a red-green color impaired individual myself writing this vignette, I often still recommend using color and generally use shapes as either a duplication of the color-data or for showing additional groupings.)
- scatterPlots can be generated with letters overlaid (set do.letter = TRUE)
dittoViz is available through CRAN.
install.packages("dittoViz")
Alternatively, you can install from GitHub:
# Install remotes if needed
if (!requireNamespace("remotes", quietly = TRUE))
install.packages("remotes")
# Install dittoViz
remotes::install_github("dtm2451/dittoViz")
All dittoViz functions have a data_frame
input which takes a
data.frame-like object with observations / individual data points in
rows, and features of those observations in columns.
To get started here, we’ll make use of the ?dittoExampleData
documentation point which was created entirely in order to allow
single-line, fast, generation of an example data_frame
input:
library(dittoViz)
example("dittoExampleData", echo = FALSE)
head(example_df, 10)
## conditions timepoint SNP groups score gene1 gene2 gene3
## obs1 condition1 d0 TRUE D 0.5 2.000000 5.000000 2.584963
## obs2 condition1 d0 TRUE B 1.0 2.000000 4.754888 0.000000
## obs3 condition1 d0 TRUE A 1.5 2.584963 4.857981 2.321928
## obs4 condition1 d0 TRUE D 2.0 3.000000 4.754888 2.807355
## obs5 condition1 d0 TRUE B 2.5 2.584963 4.807355 2.584963
## obs6 condition1 d0 TRUE A 3.0 2.000000 5.523562 2.000000
## obs7 condition1 d0 TRUE C 3.5 3.321928 4.754888 1.000000
## obs8 condition1 d0 FALSE C 4.0 3.459432 5.491853 2.584963
## obs9 condition1 d0 FALSE B 4.5 1.584963 5.169925 2.807355
## obs10 condition1 d0 FALSE A 5.0 1.584963 4.700440 1.584963
## gene4 gene5 PC1 PC2 clustering sample category
## obs1 1.000000 3.584963 -0.3507501 -0.8289334 4 1 A
## obs2 0.000000 4.392317 -0.6020143 0.6392979 2 1 A
## obs3 1.000000 4.087463 0.5290334 -0.4395358 3 1 A
## obs4 1.584963 3.906891 -0.2842554 -0.3427118 4 1 A
## obs5 0.000000 3.000000 -0.9282349 0.3398934 2 1 A
## obs6 1.000000 4.392317 1.1998347 0.1273304 1 1 A
## obs7 1.000000 4.459432 2.3852965 1.3523117 1 1 A
## obs8 1.000000 4.247928 0.2386108 1.8155724 1 1 A
## obs9 1.000000 4.169925 1.5209987 -0.0205300 3 1 A
## obs10 2.000000 3.807355 -1.5091400 -0.1225641 4 1 A
## subcategory
## obs1 1
## obs2 1
## obs3 1
## obs4 1
## obs5 1
## obs6 1
## obs7 1
## obs8 1
## obs9 1
## obs10 1
summary(example_df)
## conditions timepoint SNP groups
## condition1:60 Length:120 Mode :logical Length:120
## condition2:60 Class :character FALSE:64 Class :character
## Mode :character TRUE :56 Mode :character
##
##
##
## score gene1 gene2 gene3
## Min. : 0.50 Min. :0.000 Min. :4.087 Min. :0.000
## 1st Qu.:15.38 1st Qu.:2.000 1st Qu.:4.755 1st Qu.:1.585
## Median :30.25 Median :2.585 Median :4.954 Median :2.322
## Mean :30.25 Mean :2.521 Mean :4.910 Mean :2.150
## 3rd Qu.:45.12 3rd Qu.:3.000 3rd Qu.:5.087 3rd Qu.:2.585
## Max. :60.00 Max. :3.585 Max. :5.728 Max. :3.459
## gene4 gene5 PC1 PC2
## Min. :0.000 Min. :3.000 Min. :-2.18928 Min. :-2.52977
## 1st Qu.:1.000 1st Qu.:3.907 1st Qu.:-0.53074 1st Qu.:-0.42993
## Median :1.585 Median :4.087 Median :-0.06990 Median :-0.03467
## Mean :1.412 Mean :4.110 Mean : 0.07182 Mean : 0.03301
## 3rd Qu.:2.000 3rd Qu.:4.322 3rd Qu.: 0.57025 3rd Qu.: 0.55958
## Max. :3.000 Max. :4.700 Max. : 2.38530 Max. : 2.54280
## clustering sample category subcategory
## Length:120 Min. : 1.00 Length:120 Length:120
## Class :character 1st Qu.: 3.75 Class :character Class :character
## Mode :character Median : 6.50 Mode :character Mode :character
## Mean : 6.50
## 3rd Qu.: 9.25
## Max. :12.00
There are many different types of dittoViz visualizations. Each has intuitive defaults which allow creation of immediately usable plots. Each also has many additional tweaks available through discrete inputs that can help ensure you can create precisely-tuned, deliberately-labeled, publication-quality plots directly ‘out-of-the-box’!
These show data points overlaid on a scatter plot, using two numeric
columns of the data_frame
as the axes and optionally additional
columns for color, shape, faceting, or other features.
# Simplest Form
scatterPlot(
data_frame = example_df,
x.by = "PC1", y.by = "PC2"
)
# Additionally colored by a discrete or numeric feature
scatterPlot(
data_frame = example_df,
x.by = "PC1", y.by = "PC2",
color.by = "clustering")
scatterPlot(
data_frame = example_df,
x.by = "PC1", y.by = "PC2",
color.by = "gene1")
# Additionally shaped or faceted by a discrete feature
scatterPlot(
data_frame = example_df,
x.by = "PC1", y.by = "PC2",
color.by = "clustering",
shape.by = "clustering")
scatterPlot(
data_frame = example_df,
x.by = "PC1", y.by = "PC2",
color.by = "clustering",
split.by = "conditions")
Various additional features can be overlaid on top of these plots.
Adding each is controlled by an input that starts with add.
or do.
such as:
do.label
do.ellipse
do.letter
do.contour
do.hover
add.trajectory.by.groups
add.trajectory.curves
Additional inputs that apply to and adjust these features will then
start with the XXXX part that comes after add.XXXX
or do.XXXX
, as
exemplified below. (Tab-completion friendly!)
A few examples:
# A bit noisy looking because we're adding a LOT of the extras here...
scatterPlot(
example_df, x.by = "PC1", y.by = "PC2", color.by = "clustering",
do.label = TRUE,
labels.repel = FALSE,
do.ellipse = TRUE,
do.contour = TRUE,
contour.color = "gray90",
add.trajectory.by.groups = list(
c(1,2),
c(1,4,3)),
trajectory.group.by = "clustering")
This plotter displays continuous data on a continuous-axis, grouped in the other axis by any discrete feature such as sample or condition.
Data can be represented with violin plots, box plots, jitter
representing individual data points, and/or ridge plots, and the plots
input controls which of these data representations are used. (The
function name can be a bit of a misnomer in that ridge plots will use
the x-axis for the continuous data, and the y-axis for your groupings.)
The group.by
input controls how the data are grouped in the x-axis.
The optional color.by
input controls the colors that fill in violin,
box, and ridge plots.
yPlot(
example_df, "gene1", group.by = "clustering"
)
yPlot()
is the main function, but ridgePlot()
, ridgeJitter()
, and
boxPlot()
are wrappers which just adjust the default for the plots
input from c(“vlnplot”, “boxplot”, “jitter”) to c(“ridgeplot”),
c(“ridgeplot”, “jitter”), or c(“boxplot”,“jitter”), respectively.
ridgePlot(example_df, "gene1", group.by = "clustering")
ridgeJitter(example_df, "gene1", group.by = "clustering")
boxPlot(example_df, "gene1", group.by = "clustering")
Tweaks to the individual data representation types can be made with discrete inputs, all of which start with the representation types’ name. For example…
yPlot(example_df, "gene1", group.by = "clustering",
plots = c("vlnplot", "jitter", "boxplot"), # <- order matters
# change the color and size of jitter points
jitter.color = "blue", jitter.size = 0.5, jitter.width = 1,
# change the outline color and width, and remove the fill of boxplots
boxplot.color = "white", boxplot.width = 0.1,
boxplot.fill = FALSE,
# change how the violin plot widths are normalized across groups
vlnplot.scaling = "count",
# also add lines at specified quantiles
vlnplot.quantiles = c(0.1, 0.9),
# add a dotted line representing some important cutoff
add.line = 0.9
)
These functions quantify and display frequencies of values of a discrete data per sample / clustering / some other discrete data.
For both, data can be represented as either percentages or counts, and
this is controlled by the scale
input.
# barPlot
barPlot(example_df, var = "sample", group.by = "clustering")
barPlot(example_df, var = "sample", group.by = "clustering",
scale = "count")
freqPlot separates each ‘var’-value into its own facet, and thus puts
more emphasis on each individual element. An additional sample.by
input controls splitting of cells within group.by
-groups into
individual samples.
# freqPlot
freqPlot(
data_frame = example_df,
var = "clustering",
sample.by = "sample",
group.by = "category")
An alternative to scatterPlot where instead of plotting each data point individually, observations are first binned into hexagonal regions, and then a summary from all observations of each bin is plotted.
The plot type is most handy for very densely packed data, and allows plotting 1) observation density alone via color, or 2) observation density via opacity in addition to some other feature via color.
# Simplest Form, Density by color
scatterHex(
data_frame = example_df,
x.by = "PC1", y.by = "PC2"
)
# Colored instead via a discrete or numeric feature
scatterHex(
data_frame = example_df,
x.by = "PC1", y.by = "PC2",
color.by = "clustering")
scatterHex(
data_frame = example_df,
x.by = "PC1", y.by = "PC2",
color.by = "gene1")
One key point of control lies how the color.by
data are summarized.
This is controlled with the color.method
input. As we saw above, this
input defaults to the “max” or most common value for discrete
color.by
-data, and to the median value for numeric color.by
data.
For discrete color.by
data, we also offer color.method = "max.prop"
for highlighting diversity among regions of the plot.
For continuous color.by
data, any named function which inputs a vector
of numbers and outputs a single numberic value can be used. E.g. mean,
min, max, sum, or sd (standard deviation) might all be useful in a given
context.
scatterHex(
data_frame = example_df,
x.by = "PC1", y.by = "PC2",
color.by = "gene1",
color.method = "sum")
All additional features of scatterPlot, that don’t rely on plotting of all individual data points, are available here as well:
do.label
do.ellipse
do.contour
do.hover
add.trajectory.by.groups
add.trajectory.curves
A few examples:
# A bit noisy looking because we're adding a LOT of the extras here...
scatterHex(
example_df, x.by = "PC1", y.by = "PC2", color.by = "clustering",
do.label = TRUE,
labels.repel = FALSE,
do.ellipse = TRUE,
do.contour = TRUE,
contour.color = "gray90",
add.trajectory.by.groups = list(
c(1,2),
c(1,4,3)),
trajectory.group.by = "clustering")
Many adjustments can be made with simple additional inputs. Here,
we’ll go through a few that are consistent across most dittoViz
functions, but there are many more. Be sure to check the function
documentation (e.g. ?scatterPlot
) to explore more! Often, there will
be a section towards the bottom of a function’s documentation dedicated
to its specific tweaks!
The data shown in a given plot can be adjusted with the rows.use
input. This can be provided as either a list of rownames to include, as
an integer vector with the indices of which rows to keep, or as a
logical vector that states whether each row should be included.
# Original
barPlot(example_df, var = "sample", group.by = "clustering", scale = "count")
# First 10 cells
barPlot(example_df, var = "sample", group.by = "clustering", scale = "count",
# String method
rows.use = c("obs1", "obs2", "obs3", "obs4", "obs5")
# Index method, which would achieve the same effect
# rows.use = 1:5
)
# "1"-cluster only
barPlot(example_df, var = "sample", group.by = "clustering", scale = "count",
# Logical method
rows.use = example_df$clustering == 1)
Most dittoViz plot types can be faceted into separate plots based on the
values of 1 or 2 discrete columns with a split.by
input.
yPlot(example_df, "gene1", group.by = "clustering",
split.by = "category",
sub = "faceted by category")
scatterPlot(example_df, x.by = "PC1", y.by = "PC2", color.by = "gene1",
split.by = c("category", "subcategory"),
sub = "faceted by category and subcategory")
Extra control over how this is done can be achieved with the
split.adjust
input. split.adjust
allows inputs to be passed through
to the ggplot functions used for achieving the faceting.
yPlot(example_df, "gene1", group.by = "clustering", split.by = "category",
split.adjust = list(scales = "free_y"))
When splitting is by only one metadata, the shape of the facet grid can
be controlled with split.ncol
and split.nrow
.
ridgePlot(example_df, "gene1", group.by = "clustering", split.by = "category",
split.ncol = 1)
Relevant inputs are generally main
, sub
, xlab
, ylab
, and
legend.title
.
barPlot(example_df, "sample", group.by = "clustering",
main = "Encounters",
sub = "By Type",
xlab = NULL, # NULL = remove
ylab = "Generation 1",
legend.title = "Types",
x.labels.rotate = FALSE)
Colors are normally set with color.panel
or max.color
and
min.color
. When color.panel is used (discrete data), an additional
input called colors
sets the order in which those are actually used to
make swapping around colors easy when nearby groups appear too similar
in scatter plots!
# original - discrete
scatterPlot(example_df, "PC1", "PC2", "clustering")
# swapped colors
scatterPlot(example_df, "PC1", "PC2", "clustering",
colors = 4:1)
# different colors
scatterPlot(example_df, "PC1", "PC2", "clustering",
color.panel = c("red", "orange", "purple", "yellow", "skyblue"))
# original - continuous
scatterPlot(example_df, "PC1", "PC2", "gene1")
# different colors
scatterPlot(example_df, "PC1", "PC2", "gene1",
max.color = "red", min.color = "gray90")
Simply add data.out = TRUE
to any of the individual plotters and a
representation of the underlying data will be output.
barPlot(example_df, "sample", group.by = "clustering",
data.out = TRUE)
## $p
##
## $data
## label grouping count label.count.total.per.facet percent
## 1 1 1 3 26 0.11538462
## 2 2 1 1 26 0.03846154
## 3 3 1 2 26 0.07692308
## 4 4 1 3 26 0.11538462
## 5 5 1 2 26 0.07692308
## 6 6 1 2 26 0.07692308
## 7 7 1 2 26 0.07692308
## 8 8 1 1 26 0.03846154
## 9 9 1 3 26 0.11538462
## 10 10 1 4 26 0.15384615
## 11 11 1 2 26 0.07692308
## 12 12 1 1 26 0.03846154
## 13 1 2 2 27 0.07407407
## 14 2 2 3 27 0.11111111
## 15 3 2 0 27 0.00000000
## 16 4 2 2 27 0.07407407
## 17 5 2 3 27 0.11111111
## 18 6 2 2 27 0.07407407
## 19 7 2 0 27 0.00000000
## 20 8 2 3 27 0.11111111
## 21 9 2 2 27 0.07407407
## 22 10 2 4 27 0.14814815
## 23 11 2 3 27 0.11111111
## 24 12 2 3 27 0.11111111
## 25 1 3 2 31 0.06451613
## 26 2 3 3 31 0.09677419
## 27 3 3 6 31 0.19354839
## 28 4 3 3 31 0.09677419
## 29 5 3 3 31 0.09677419
## 30 6 3 4 31 0.12903226
## 31 7 3 2 31 0.06451613
## 32 8 3 2 31 0.06451613
## 33 9 3 2 31 0.06451613
## 34 10 3 2 31 0.06451613
## 35 11 3 1 31 0.03225806
## 36 12 3 1 31 0.03225806
## 37 1 4 3 36 0.08333333
## 38 2 4 3 36 0.08333333
## 39 3 4 2 36 0.05555556
## 40 4 4 2 36 0.05555556
## 41 5 4 2 36 0.05555556
## 42 6 4 2 36 0.05555556
## 43 7 4 6 36 0.16666667
## 44 8 4 4 36 0.11111111
## 45 9 4 3 36 0.08333333
## 46 10 4 0 36 0.00000000
## 47 11 4 4 36 0.11111111
## 48 12 4 5 36 0.13888889
Many dittoViz functions can be supplied do.hover = TRUE
to have them
convert the output into an interactive plotly object that will display
additional data about each data point when the user hovers their cursor
on top.
Generally, a second input, hover.data
, is used for denoting what extra
data to display. This input takes in a vector of column names in the
order you wish for them to be displayed. However, when the types of
underlying data possible to be shown are constrained because the plot
pieces represent summary data (barPlot and freqPlot), the hover.data
input is not used.
# These can be finicky to render in knitting, but still, example code:
scatterPlot(example_df, "PC1", "PC2", "gene1",
do.hover = TRUE,
hover.data = c("gene1", "sample", "clustering", "timepoint"))
barPlot(example_df, "sample", group.by = "clustering",
do.hover = TRUE)
Sometimes, datasets have so many data points that working with each one
rendered individually becomes prohibitively computationally intensive,
especially when opening them in a vector-based graphics editor, such as
Illustrator. In such instances, it can be helpful to have the per-point
graphics layers flattened to a pixel representation. Generally, dittoViz
offers this capability via do.raster
and raster.dpi
inputs.
# Note: dpi gets re-set by the styling code of this vignette, so this is
# just a code example -- the rendered plot won't look quite as intended.
scatterPlot(example_df, "PC1", "PC2", "gene1",
do.raster = TRUE,
raster.dpi = 300)
dittoViz includes one helper functions that make it easier to determine the discrete values within a column of your data frame. There are base R functions which can achieve the same purpose, but the utility of this function is it works regardless of whether the target data is a character, numeric, or factor type.
This functionality is useful for certain inputs, ex: vars.use
of
freqPlot()
, for determining the full option set which you might want
to pull only a few options from.
# Tells the unique values of any discrete data column
# The 'conditions' column is a factor, while 'clustering' is a character vector
colLevels("conditions", example_df)
## [1] "condition1" "condition2"
colLevels("clustering", example_df)
## [1] "1" "2" "3" "4"
There are 2 addition inputs for the function, rows.use
& used.only
:
You can use the function along with standard dittoViz rows.use
subsetting to see if any groupings are removed by the given subsetting.
# We can use rows.use to achieve any subsetting prior to value/level summary.
colLevels(
"conditions", example_df,
rows.use = example_df$conditions!='condition1'
)
## [1] "condition2"
And if you wish to know the recorded levels of your data, regardless of
whether any data points exist for those levels, you can set used.only
to FALSE. used.only
is TRUE by default.
# Note: Set 'used.only' (default = TRUE) to FALSE to show unused levels
# of data that are already factors. By default, only the used options
# of the data are given.
colLevels("conditions", example_df,
rows.use = example_df$conditions!="condition1"
)
## [1] "condition2"
colLevels("conditions", example_df,
rows.use = example_df$conditions!="condition1",
used.only = FALSE
)
## [1] "condition1" "condition2"
sessionInfo()
## R version 4.3.2 (2023-10-31)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Pop!_OS 22.04 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] dittoViz_1.0.0 ggplot2_3.3.6
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.3 dplyr_1.1.2 compiler_4.3.2
## [4] highr_0.10 tidyselect_1.2.0 ggbeeswarm_0.7.2
## [7] scales_1.2.1 yaml_2.3.7 fastmap_1.1.1
## [10] lattice_0.22-5 hexbin_1.28.3 R6_2.5.1
## [13] labeling_0.4.2 generics_0.1.3 isoband_0.2.7
## [16] knitr_1.43 Cairo_1.6-0 MASS_7.3-60
## [19] tibble_3.2.1 munsell_0.5.0 pillar_1.9.0
## [22] rlang_1.1.1 utf8_1.2.3 xfun_0.39
## [25] cli_3.6.1 withr_2.5.0 magrittr_2.0.3
## [28] digest_0.6.31 grid_4.3.2 rstudioapi_0.15.0
## [31] cowplot_1.1.1 beeswarm_0.4.0 lifecycle_1.0.3
## [34] vipor_0.4.5 ggrastr_1.0.2 vctrs_0.6.3
## [37] evaluate_0.21 glue_1.6.2 farver_2.1.1
## [40] fansi_1.0.4 colorspace_2.1-0 rmarkdown_2.22
## [43] ggplot.multistats_1.0.0 tools_4.3.2 pkgconfig_2.0.3
## [46] htmltools_0.5.5 ggridges_0.5.4
Wong, Bang. 2011. “Points of View: Color Blindness.” Nature Methods 8 (6): 441–41. https://doi.org/10.1038/nmeth.1618.