Add new vignettes for pkgdown
- building out the articles for the pkgdown
  website
- new articles on safe processes
- train and test data splits
- beginning the FAQs pages
stufield committed Dec 13, 2023
1 parent d7ca7f7 commit 396408c
Showing 4 changed files with 302 additions and 1 deletion.
23 changes: 22 additions & 1 deletion _pkgdown.yml
@@ -45,7 +45,7 @@ home:

navbar:
structure:
left: [intro, reference, articles, workflows, news]
left: [intro, reference, articles, workflows, FAQs, news]
right: [search, github]
components:
workflows:
@@ -63,12 +63,33 @@ navbar:
href: articles/stat-binary-classification.html
- text: Linear Regression
href: articles/stat-linear-regression.html

FAQs:
text: FAQs
menu:
- text: SomaScan FAQs
- text: Standard Process
href: articles/todo.html
- text: Limits of Detection (LoD)
href: articles/todo.html
- text: ---

- text: Best Practices
- text: Lifting SomaScan
href: articles/todo.html
- text: Non-Standard Matrices
href: articles/todo.html
- text: CSF
href: articles/todo.html

articles:
- title: Loading and Wrangling
navbar: ~
contents:
- loading-and-wrangling
- safely-rename-df
- safely-map-values-join
- train-test-setup

- title: Command Line Merge Tool
navbar: ~
85 changes: 85 additions & 0 deletions vignettes/safely-map-values-join.Rmd
@@ -0,0 +1,85 @@
---
title: "Safely Map Values via dplyr::left_join()"
author: "Stu Field, SomaLogic Operating Co., Inc."
description: >
How to safely and reliably map values between
data frame columns.
output:
rmarkdown::html_vignette:
fig_caption: yes
vignette: >
%\VignetteIndexEntry{Safely Map Values via dplyr::left_join()}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
library(dplyr)
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```


# Introduction

Mapping values in one column to specific values in another (new) column
of a data frame is a common task in data science. Doing so *safely* is
often a struggle. There are some existing methods in the `tidyverse`
that are useful, but in my opinion come with some drawbacks:

* `dplyr::recode()`
+ can be clunky to implement -> LHS/RHS syntax
difficult (for me) to remember
* `dplyr::case_when()`
+ complex syntax -> difficult to remember; overkill
for mapping purposes
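
For comparison, here is a minimal sketch of one small mapping written with each
of the two functions listed above. The vector `x` and the animal labels are
illustrative assumptions, not the data used later in this vignette.

```r
library(dplyr)

# illustrative input (assumption); not the `df` used in this vignette
x <- c("a", "b", "c", "a")

# `dplyr::recode()`: existing value on the LHS, new value on the RHS
recoded <- recode(x, a = "cat", b = "dog", c = "bird")

# `dplyr::case_when()`: one logical condition per mapped value
cased <- case_when(
  x == "a" ~ "cat",
  x == "b" ~ "dog",
  x == "c" ~ "bird"
)

# both approaches should produce the same character vector
identical(recoded, cased)
```

In both calls the mapping is hard-coded inside the function call, so it cannot
be inspected or reused as a separate object.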

Below is what I see as a *safe* way to map (re-code) values in an existing
column to a new column.


--------------


## Mapping Example

```{r map-values}
# wish to map values of 'x'
df <- withr::with_seed(101, {
data.frame(id = 1:10L,
value = rnorm(10),
x = sample(letters[1:3L], 10, replace = TRUE)
)
})
df
# create a [n x 2] lookup-table (aka hash map)
# n = no. values to map
# x = existing values to map
# new_x = new mapped values for each `x`
map <- data.frame(x = letters[1:4L], new_x = c("cat", "dog", "bird", "turtle"))
map
# use `dplyr::left_join()`
# note: 'turtle' is absent because `d` is not in `df$x` (thus ignored)
dplyr::left_join(df, map, by = "x")
```


## Un-mapped Values -> `NAs`

In the output below, `b` maps to `NA`. This is because the mapping object now
lacks a mapping for `b` (compare to row 2 above). The map is also built with a
slightly different syntax, via `tibble::enframe()`.

```{r unmapped-NA}
# note: `b` is missing in the map
map_vec <- c(a = "cat", c = "bird", d = "turtle")
map2 <- tibble::enframe(map_vec, name = "x", value = "new_x")
map2
# note: un-mapped values generate NAs: `b -> NA`
dplyr::left_join(df, map2, by = "x")
```
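
If silent `NA`s are undesirable, one option is to check the lookup table
against the data before joining. The sketch below is a suggestion under stated
assumptions: the small `df` and `map2` objects are re-created here for
illustration, and `safe_map()` is a hypothetical helper, not a function from
`dplyr` or `SomaDataIO`.

```r
library(dplyr)

# illustrative stand-ins (assumptions) for the objects used above
df   <- data.frame(id = 1:4, x = c("a", "b", "c", "a"))
map2 <- data.frame(x = c("a", "c", "d"), new_x = c("cat", "bird", "turtle"))

# values present in the data but absent from the lookup table
unmapped <- setdiff(df$x, map2$x)
unmapped

# lookup keys should be unique, otherwise the join duplicates rows
stopifnot(anyDuplicated(map2$x) == 0L)

# hypothetical helper: fail early instead of silently generating NAs
safe_map <- function(data, map, by) {
  miss <- setdiff(data[[by]], map[[by]])
  if (length(miss) > 0L) {
    stop("No mapping for: ", paste(miss, collapse = ", "), call. = FALSE)
  }
  left_join(data, map, by = by)
}

# errors here because `b` has no entry in `map2`
res <- try(safe_map(df, map2, by = "x"), silent = TRUE)
inherits(res, "try-error")
```

Whether an un-mapped value should be an error or an `NA` depends on the
analysis; the check simply makes that decision explicit.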
106 changes: 106 additions & 0 deletions vignettes/safely-rename-df.Rmd
@@ -0,0 +1,106 @@
---
title: "Safely Rename Data Frames"
author: "Stu Field, SomaLogic Operating Co., Inc."
description: >
How to safely and reliably rename the variables of a
data frame (or `soma_adat`) in R.
output:
rmarkdown::html_vignette:
fig_caption: yes
vignette: >
%\VignetteIndexEntry{Safely Rename Data Frames}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
library(SomaDataIO)
library(dplyr)
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```


# Introduction

Renaming variables/features of a data frame (or `tibble`)
is a common task in data science. Doing so *safely* is often a struggle.
This can be achieved *safely* with the `dplyr::rename()` function in 2
steps:

1. Set up the mapping in a named vector
1. Apply the `dplyr::rename()` function via `!!!` syntax

Alternatively, you can roll your own `rename()` function.

* **Note**: all entries in the mapping (i.e. key) object *must* be
present as `names` in the data frame object.
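
The note above can be illustrated with a key that deliberately contains one
entry missing from the data. This is a minimal sketch: the `FOO` entry is
made up for the example, and the use of `any_of()` (a `tidyselect` helper
re-exported by `dplyr`) is a suggested alternative that does not appear in the
original text.

```r
library(dplyr)

# `FOO` points at a column that does not exist in `mtcars` (deliberate)
key <- c(MPG = "mpg", CARB = "carb", FOO = "not_a_column")

# entries of the key that are absent from the data
setdiff(key, names(mtcars))

# `rename()` with `!!!` refuses to proceed when an entry is missing
res <- try(rename(mtcars, !!!key), silent = TRUE)
inherits(res, "try-error")

# `any_of()` renames the columns it finds and skips the rest
renamed <- rename(mtcars, any_of(key))
names(renamed)
```

The error is arguably the *safe* behavior, because a typo in the key cannot go
unnoticed; `any_of()` trades that guarantee for convenience.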


## Example with `mtcars`

```{r rename-df}
# Create map/key of the names to map
key <- c(MPG = "mpg", CARB = "carb") # named vector
key
# rename `mtcars`
rename(mtcars, !!! key) |> head()
```



## A SomaScan example (`example_data`)

Occasionally it might be required to
rename `AptNames` (`seq.1234.56`) -> `SeqIds` (`1234-56`) when
analyzing SomaScan data.

```{r rename-sim}
getAnalytes(example_data) |>
head()
# create map (named vector)
key2 <- getAnalytes(example_data)
names(key2) <- getSeqId(key2) # re-name `seq.XXXX` -> SeqIds
key2 <- c(key2, ID = "SampleId") # SampleId -> ID
head(key2, 10L)
# rename analytes of `example_data`
getAnalytes(example_data) |>
head(10L)
new <- rename(example_data, !!! key2)
getAnalytes(new) |>
head(10L)
```
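
To make the `seq.1234.56` -> `1234-56` conversion concrete without loading
`example_data`, the sketch below reproduces the naming pattern with a base R
regular expression. It illustrates the pattern only and is not the
implementation of `getSeqId()`; the three `AptNames` are made up.

```r
# illustrative AptNames (assumption); not taken from `example_data`
apts <- c("seq.1234.56", "seq.2345.6", "seq.10000.28")

# capture the two numeric parts and join them with a hyphen
seq_ids <- sub("^seq\\.([0-9]+)\\.([0-9]+)$", "\\1-\\2", apts)
seq_ids

# same layout as `key2`: names are the new names, values are the old names
key_sketch <- setNames(apts, seq_ids)
key_sketch
```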

## Alternative to `dplyr`
If you prefer to avoid the `dplyr` import/dependency, you can achieve a
similar result with similar syntax by writing your own renaming function:

```{r rename2}
rename2 <- function (.data, ...) {
map <- c(...)
loc <- setNames(match(map, names(.data), nomatch = 0L), names(map))
loc <- loc[loc > 0L]
newnames <- names(.data)
newnames[loc] <- names(loc)
setNames(.data, newnames)
}
```

Now, with *similar* syntax (but cannot use `!!!`):

```{r rename-usage}
# rename `mtcars` in-line
rename2(mtcars, MPG = "mpg", CARB = "carb") |>
head()
# rename `mtcars` via named `key`
rename2(mtcars, key) |>
head()
```
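
One behavioral difference is worth knowing: `dplyr::rename()` throws an error
when asked to rename a column that does not exist, whereas `rename2()` skips
such entries silently. The sketch below demonstrates this and adds
`rename2_strict()`, a hypothetical wrapper (not part of the original text) for
when the stricter behavior is preferred.

```r
# copy of `rename2()` from above so that this block is self-contained
rename2 <- function(.data, ...) {
  map <- c(...)
  loc <- setNames(match(map, names(.data), nomatch = 0L), names(map))
  loc <- loc[loc > 0L]
  newnames <- names(.data)
  newnames[loc] <- names(loc)
  setNames(.data, newnames)
}

# `not_a_column` is skipped without warning; only `mpg` is renamed
out <- rename2(mtcars, MPG = "mpg", FOO = "not_a_column")
names(out)[1:2]

# hypothetical strict wrapper: fail if any entry is absent from the data
rename2_strict <- function(.data, ...) {
  map  <- c(...)
  miss <- setdiff(map, names(.data))
  if (length(miss) > 0L) {
    stop("Not found in data: ", paste(miss, collapse = ", "), call. = FALSE)
  }
  rename2(.data, map)
}

strict <- try(rename2_strict(mtcars, FOO = "not_a_column"), silent = TRUE)
inherits(strict, "try-error")
```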

89 changes: 89 additions & 0 deletions vignettes/train-test-setup.Rmd
@@ -0,0 +1,89 @@
---
title: "Common train-test data setups"
author: "Stu Field, SomaLogic Operating Co., Inc."
description: >
Simple code syntax for common machine learning (ML)
training vs test setups in R.
output:
rmarkdown::html_vignette:
fig_caption: yes
vignette: >
%\VignetteIndexEntry{Common train-test data setups}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
library(dplyr)
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
options(width = 90)
```


# Introduction

Most machine learning (ML) analyses require a random split of original data
into training/test data sets, where the model is fit on the training data
and performance is evaluated on the test data set. The split proportions can
vary, though 80/20 training/test is common. It often depends on the number of
available samples and the class distribution in the splits.
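
When the class distribution matters, one common way to keep it similar across
the splits is to sample within each class. The following is a minimal sketch
under stated assumptions: `mtcars` stands in for real data, `am` plays the
role of the class label, and the `id` column is added only so the test set can
be recovered with `dplyr::anti_join()`.

```r
library(dplyr)

# illustrative data (assumption): `am` acts as a binary class label
raw <- mutate(mtcars, id = row_number())

# sample 80% *within* each class so both splits keep a similar class balance
train <- withr::with_seed(1, {
  raw |>
    group_by(am) |>
    slice_sample(prop = 0.8) |>
    ungroup()
})

# everything not sampled into `train` becomes the test set
test <- anti_join(raw, train, by = "id")

# compare class proportions across the two splits
prop.table(table(train$am))
prop.table(table(test$am))
```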

Among many alternatives, 3 approaches are common. All are equally viable;
the choice depends on how the analyst weighs the pros and cons of each.
I recommend one of the following:

1. base R data frame indexing with `sample()` and `[`
1. use `dplyr::slice_sample()` or `dplyr::sample_frac()` in
combination with `dplyr::anti_join()`
1. use the [rsample](https://rsample.tidymodels.org) package (not demonstrated)
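
Although the `rsample` approach is not demonstrated in this article, a minimal
sketch might look like the following. It assumes the `rsample` package is
installed; the small `raw` data frame is an illustrative stand-in built from
`mtcars`.

```r
library(rsample)

# illustrative data (assumption): first 10 rows of `mtcars` plus an identifier
raw    <- head(mtcars, 10L)
raw$id <- seq_len(nrow(raw))

# create the split object, then extract the two data sets from it
split <- withr::with_seed(1, initial_split(raw, prop = 0.5))
train <- training(split)
test  <- testing(split)

nrow(train)
nrow(test)
```

The split object keeps both pieces together, which can be convenient when the
same split is passed around an analysis.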

---------


## Original Raw Data

In most analyses, you typically start with a raw original data set that
you must split randomly into training and test sets.

```{r raw-data}
raw <- SomaDataIO::rn2col(head(mtcars, 10L), "CarModel") |>
SomaDataIO::add_rowid("id") |> # set up identifier variable for the join()
tibble::as_tibble()
raw
```


---------


## Option #1: `sample()`

```{r train-test1}
n <- nrow(raw)
idx <- withr::with_seed(1, sample(1:n, floor(n / 2))) # sample random 50% (n = 5)
train <- raw[idx, ]
test <- raw[-idx, ]
train
test
```
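
Two quick checks are worth running after an index-based split: the two sets
should be disjoint, and together they should cover every row. The sketch below
uses a made-up data frame and an 80/20 split for illustration. It also shows a
known pitfall of negative indexing in base R: an empty index selects no rows
rather than all of them.

```r
# illustrative data (assumption) with an identifier column
raw <- data.frame(id = 1:10, value = seq(0.1, 1, by = 0.1))
n   <- nrow(raw)

# sample a random 80% of the row indices
idx   <- withr::with_seed(1, sample(seq_len(n), floor(0.8 * n)))
train <- raw[idx, ]
test  <- raw[-idx, ]

# the splits should be disjoint and together cover every row
stopifnot(
  length(intersect(train$id, test$id)) == 0L,
  nrow(train) + nrow(test) == n
)

# pitfall: with an empty index, `-idx` selects *no* rows rather than all rows
empty <- integer(0)
nrow(raw[-empty, ])                      # 0
nrow(raw[setdiff(seq_len(n), empty), ])  # 10
```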


## Option #2: `anti_join()`

```{r train-test2}
# sample random 50% (n = 5)
train <- withr::with_seed(1, dplyr::slice_sample(raw, n = floor(n / 2)))
# or using `dplyr::sample_frac()`
# train <- withr::with_seed(1, dplyr::sample_frac(raw, size = 0.5))
# use anti_join() to get the sample setdiff
test <- dplyr::anti_join(raw, train, by = "id")
train
test
```
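
The unique `id` column matters for this option. Without it, `anti_join()`
matches on the remaining columns, so any test row that happens to be identical
to a training row is dropped as well. A minimal sketch with a made-up data
frame containing one duplicated row:

```r
library(dplyr)

# illustrative data (assumption): rows 1 and 2 are identical
dup   <- data.frame(x = c(1, 1, 2, 3), y = c("a", "a", "b", "c"))
train <- dup[c(1, 3), ]

# no identifier: row 2 matches training row 1 and is removed from the test set
test_bad <- anti_join(dup, train, by = c("x", "y"))
nrow(test_bad)  # 1

# with a unique identifier, only the sampled rows are removed
dup_id   <- mutate(dup, id = row_number())
train_id <- dup_id[c(1, 3), ]
test_ok  <- anti_join(dup_id, train_id, by = "id")
nrow(test_ok)   # 2
```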
