Add new vignettes for pkgdown

- building out the articles for the pkgdown website
- new articles on safe processes
- train and test data splits
- beginning the FAQs pages
SomaLogic · Dec 13, 2023 · 396408c · 396408c
1 parent d7ca7f7
commit 396408c
Showing 4 changed files with 302 additions and 1 deletion.
diff --git a/_pkgdown.yml b/_pkgdown.yml
@@ -45,7 +45,7 @@ home:
 
 navbar:
   structure:
-    left: [intro, reference, articles, workflows, news]
+    left: [intro, reference, articles, workflows, FAQs, news]
     right: [search, github]
   components:
     workflows:
@@ -63,12 +63,33 @@ navbar:
         href: articles/stat-binary-classification.html
       - text: Linear Regression
         href: articles/stat-linear-regression.html
+
+    FAQs:
+      text: FAQs
+      menu:
+      - text: SomaScan FAQs
+      - text: Standard Process
+        href: articles/todo.html
+      - text: Limits of Detection (LoD)
+        href: articles/todo.html
+      - text: ---
+
+      - text: Best Practices
+      - text: Lifting SomaScan
+        href: articles/todo.html
+      - text: Non-Standard Matrices
+        href: articles/todo.html
+      - text: CSF
+        href: articles/todo.html
 
 articles:
   - title: Loading and Wrangling
     navbar: ~
     contents:
     - loading-and-wrangling
+    - safely-rename-df
+    - safely-map-values-join
+    - train-test-setup
 
   - title: Command Line Merge Tool
     navbar: ~

diff --git a/vignettes/safely-map-values-join.Rmd b/vignettes/safely-map-values-join.Rmd
@@ -0,0 +1,85 @@
+---
+title: "Safely Map Values via dplyr::left_join()"
+author: "Stu Field, SomaLogic Operating Co., Inc."
+description: >
+  How to safely and reliably map values between
+  data frame columns.
+output:
+  rmarkdown::html_vignette:
+    fig_caption: yes
+vignette: >
+  %\VignetteIndexEntry{Safely Map Values via dplyr::left_join()}
+  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEncoding{UTF-8}
+---
+
+```{r setup, include = FALSE}
+library(dplyr)
+knitr::opts_chunk$set(
+  collapse = TRUE,
+  comment = "#>"
+)
+```
+
+
+# Introduction
+
+Mapping values in one column to specific values in another (new) column 
+of a data frame is a common task in data science. Doing so *safely* is
+often a struggle. There are some existing methods in the `tidyverse`
+that are useful, but in my opinion come with some drawbacks:
+
+  * `dplyr::recode()`
+    + can be clunky to implement -> LHS/RHS syntax 
+      difficult (for me) to remember
+  * `dplyr::case_when()`
+    + complex syntax -> difficult to remember; overkill
+      for mapping purposes
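+
+For reference, a minimal sketch of both alternatives applied to the same
+small vector (the object `x_demo` is hypothetical and is not used elsewhere
+in this article):
+
+```{r recode-case-when}
+# hypothetical input vector, for illustration only
+x_demo <- c("a", "b", "c")
+
+# `recode()`: old value on the LHS, new value on the RHS
+dplyr::recode(x_demo, a = "cat", b = "dog", c = "bird")
+
+# `case_when()`: one logical condition per mapped value
+dplyr::case_when(
+  x_demo == "a" ~ "cat",
+  x_demo == "b" ~ "dog",
+  x_demo == "c" ~ "bird"
+)
+```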
+
+Below is what I see as a *safe* way to map (re-code) values in an existing
+column to a new column.
+
+
+--------------
+
+
+## Mapping Example
+
+```{r map-values}
+# wish to map values of 'x'
+df <- withr::with_seed(101, {
+  data.frame(id    = 1:10L,
+             value = rnorm(10),
+             x     = sample(letters[1:3L], 10, replace = TRUE)
+  )
+})
+df
+
+# create a [n x 2] lookup-table (aka hash map)
+# n = no. values to map
+# x = existing values to map
+# new_x = new mapped values for each `x`
+map <- data.frame(x = letters[1:4L], new_x = c("cat", "dog", "bird", "turtle"))
+map
+
+# use `dplyr::left_join()`
+# note: 'turtle' is absent because `d` is not in `df$x` (thus ignored)
+dplyr::left_join(df, map, by = "x")
+```
+
+
+## Un-mapped Values -> `NAs`
+
+In the example below, `b` maps to `NA` because the mapping object now
+lacks an entry for `b` (compare to row 2 above).
+The map is also built with a slightly different syntax, via `tibble::enframe()`.
+
+```{r unmapped-NA}
+# note: `b` is missing in the map
+map_vec <- c(a = "cat", c = "bird", d = "turtle")
+map2 <- tibble::enframe(map_vec, name = "x", value = "new_x")
+map2
+
+# note: un-mapped values generate NAs: `b -> NA`
+dplyr::left_join(df, map2, by = "x")
+```
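+
+One way to catch un-mapped values before they turn into `NA`s is to compare
+the keys up front. The sketch below is self-contained and uses hypothetical
+objects (`df_chk`, `map_chk`) rather than the ones above. It is one possible
+check, not the only one:
+
+```{r check-unmapped}
+# hypothetical data and lookup table; `b` has no entry in the map
+df_chk  <- data.frame(id = 1:4, x = c("a", "b", "c", "a"))
+map_chk <- data.frame(x = c("a", "c", "d"), new_x = c("cat", "bird", "turtle"))
+
+# values of `x` that have no entry in the lookup table
+unmapped <- setdiff(df_chk$x, map_chk$x)
+unmapped
+
+# the same check, returning the affected rows
+dplyr::anti_join(df_chk, map_chk, by = "x")
+
+# duplicate keys in the map would duplicate rows in the join
+stopifnot(!anyDuplicated(map_chk$x))
+```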
diff --git a/vignettes/safely-rename-df.Rmd b/vignettes/safely-rename-df.Rmd
@@ -0,0 +1,106 @@
+---
+title: "Safely Rename Data Frames"
+author: "Stu Field, SomaLogic Operating Co., Inc."
+description: >
+  How to safely and reliably rename the variables of a
+  data frame (or `soma_adat`) in R.
+output:
+  rmarkdown::html_vignette:
+    fig_caption: yes
+vignette: >
+  %\VignetteIndexEntry{Safely Rename Data Frames}
+  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEncoding{UTF-8}
+---
+
+```{r setup, include = FALSE}
+library(SomaDataIO)
+library(dplyr)
+knitr::opts_chunk$set(
+  collapse = TRUE,
+  comment = "#>"
+)
+```
+
+
+# Introduction
+
+Renaming variables/features of a data frame (or `tibble`)
+is a common task in data science. Doing so *safely* is often a struggle.
+It can be done *safely* with the `dplyr::rename()` function in 2
+steps:
+
+1. Set up the mapping in a named vector
+1. Apply the `dplyr::rename()` function via `!!!` syntax
+
+Alternatively, roll your own `rename()` function (see the final section).
+
+* **Note**: all entries in the mapping (i.e. key) object *must* be
+  present as `names` in the data frame object.
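+
+A minimal sketch of how this requirement could be checked up front. It
+uses a hypothetical key (`key_chk`) containing one entry that is absent
+from `mtcars`, and assumes a `dplyr` version in which `rename()` accepts
+the tidyselect helper `any_of()`:
+
+```{r check-key}
+# hypothetical key; `not_a_column` is deliberately absent from `mtcars`
+key_chk <- c(MPG = "mpg", CARB = "carb", FOO = "not_a_column")
+
+# entries of the key with no matching column in the data
+missing_vars <- setdiff(unname(key_chk), names(mtcars))
+missing_vars
+
+# `any_of()` renames what it finds and skips absent entries
+renamed <- dplyr::rename(mtcars, dplyr::any_of(key_chk))
+names(renamed)
+```
+
+Skipping absent entries silently is a trade-off: it avoids an error but can
+hide a typo in the key, so inspecting `missing_vars` first is the more
+cautious choice.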
+
+
+## Example with `mtcars`
+
+```{r rename-df}
+# Create map/key of the names to map
+key <- c(MPG = "mpg", CARB = "carb")   # named vector
+key
+
+# rename `mtcars`
+rename(mtcars, !!! key) |> head()
+```
+
+
+
+## A SomaScan example (`example_data`)
+
+Occasionally you may need to
+rename `AptNames` (`seq.1234.56`) -> `SeqIds` (`1234-56`) when
+analyzing SomaScan data.
+
+```{r rename-sim}
+getAnalytes(example_data) |> 
+  head()
+
+# create map (named vector)
+key2 <- getAnalytes(example_data)  
+names(key2) <- getSeqId(key2)     # re-name `seq.XXXX` -> SeqIds
+key2 <- c(key2, ID = "SampleId")  # SampleId -> ID
+head(key2, 10L)
+
+# rename analytes of `example_data`
+getAnalytes(example_data) |>
+  head(10L)
+
+new <- rename(example_data, !!! key2)
+
+getAnalytes(new) |>
+  head(10L)
+```
+
+## Alternative to `dplyr`
+
+If you prefer to avoid the `dplyr` import/dependency, you can achieve a
+similar result with similar syntax by writing your own renaming function:
+
+```{r rename2}
+rename2 <- function (.data, ...) {
+  map <- c(...)
+  loc <- setNames(match(map, names(.data), nomatch = 0L), names(map))
+  loc <- loc[loc > 0L]
+  newnames <- names(.data)
+  newnames[loc] <- names(loc)
+  setNames(.data, newnames)
+}
+```
+
+Now, with *similar* syntax (but cannot use `!!!`):
+
+```{r rename-usage}
+# rename `mtcars` in-line
+rename2(mtcars, MPG = "mpg", CARB = "carb") |>
+  head()
+
+# rename `mtcars` via named `key`
+rename2(mtcars, key) |>
+  head()
+```
+
diff --git a/vignettes/train-test-setup.Rmd b/vignettes/train-test-setup.Rmd
@@ -0,0 +1,89 @@
+---
+title: "Common train-test data setups"
+author: "Stu Field, SomaLogic Operating Co., Inc."
+description: >
+  Simple code syntax for common machine learning (ML)
+  training vs test setups in R.
+output:
+  rmarkdown::html_vignette:
+    fig_caption: yes
+vignette: >
+  %\VignetteIndexEntry{Common train-test data setups}
+  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEncoding{UTF-8}
+---
+
+```{r setup, include = FALSE}
+library(dplyr)
+knitr::opts_chunk$set(
+  collapse = TRUE,
+  comment = "#>"
+)
+options(width = 90)
+```
+
+
+# Introduction
+
+Most machine learning (ML) analyses require a random split of original data
+into training/test data sets, where the model is fit on the training data
+and performance is evaluated on the test data set. The split proportions can
+vary, though 80/20 training/test is common. It often depends on the number of
+available samples and the class distribution in the splits.
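+
+When the class distribution matters, the split can be drawn within each
+class. A minimal sketch, assuming `am` in `mtcars` stands in for a
+hypothetical class variable and using `tibble::rowid_to_column()` to add
+an identifier (the objects `dat`, `train_s`, and `test_s` are illustrative
+only):
+
+```{r stratified-split}
+# add an identifier so the test set can be recovered with a join
+dat <- tibble::rowid_to_column(mtcars, "id")
+
+# sample 80% of the rows within each level of `am`
+train_s <- withr::with_seed(1, {
+  dat |>
+    dplyr::group_by(am) |>
+    dplyr::slice_sample(prop = 0.8) |>
+    dplyr::ungroup()
+})
+
+# the remaining rows form the test set
+test_s <- dplyr::anti_join(dat, train_s, by = "id")
+
+# class counts in each split
+dplyr::count(train_s, am)
+dplyr::count(test_s, am)
+```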
+
+Among many alternatives, there are 3 common approaches. All are equally
+viable; the choice depends on how the analyst weighs the pros/cons of each.
+I recommend one of these below:
+
+1. base R data frame indexing with `sample()` and `[`
+1. use `dplyr::slice_sample()` or `dplyr::sample_frac()` in
+   combination with `dplyr::anti_join()`
+1. use the [rsample](https://rsample.tidymodels.org) package (not demonstrated)
+
+---------
+
+
+## Original Raw Data
+
+In most analyses, you typically start with a raw original data set that
+you must split randomly into training and test sets.
+
+```{r raw-data}
+raw <- SomaDataIO::rn2col(head(mtcars, 10L), "CarModel") |>
+  SomaDataIO::add_rowid("id") |> # set up identifier variable for the join()
+  tibble::as_tibble()
+raw
+```
+
+
+---------
+
+
+## Option #1: `sample()`
+
+```{r train-test1}
+n     <- nrow(raw)
+idx   <- withr::with_seed(1, sample(seq_len(n), floor(n / 2))) # sample random 50% (n = 5)
+train <- raw[idx, ]
+test  <- raw[-idx, ]
+train
+
+test
+```
+
+
+## Option #2: `anti_join()`
+
+```{r train-test2}
+# sample random 50% (n = 5)
+train <- withr::with_seed(1, dplyr::slice_sample(raw, n = floor(n / 2)))
+
+# or using `dplyr::sample_frac()`
+# train <- withr::with_seed(1, dplyr::sample_frac(raw, size = 0.5))
+
+# use anti_join() to get the sample setdiff
+test <- dplyr::anti_join(raw, train, by = "id")
+train
+
+test
+```
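+
+Whichever option is used, it can be worth confirming that the two sets are
+disjoint and together cover the original data. A minimal, self-contained
+sketch with hypothetical objects (`raw_chk`, `train_chk`, `test_chk`):
+
+```{r split-checks}
+# hypothetical data with an identifier column
+raw_chk   <- data.frame(id = 1:10, value = seq(0.1, 1, by = 0.1))
+idx_chk   <- withr::with_seed(1, sample(seq_len(nrow(raw_chk)), 5))
+train_chk <- raw_chk[idx_chk, ]
+test_chk  <- raw_chk[-idx_chk, ]
+
+# no sample appears in both sets
+stopifnot(length(intersect(train_chk$id, test_chk$id)) == 0L)
+
+# every original sample is in exactly one set
+stopifnot(setequal(c(train_chk$id, test_chk$id), raw_chk$id))
+```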