Commit
- building out the articles for the pkgdown website
- new articles on safe processes
- train and test data splits
- beginning the FAQs pages
Showing 4 changed files with 302 additions and 1 deletion.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,85 @@ | ||
--- | ||
title: "Safely Map Values via dplyr::left_join()" | ||
author: "Stu Field, SomaLogic Operating Co., Inc." | ||
description: > | ||
How to safely and reliably map values between | ||
data frame columns. | ||
output: | ||
rmarkdown::html_vignette: | ||
fig_caption: yes | ||
vignette: > | ||
%\VignetteIndexEntry{Safely Map Values via dplyr::left_join()} | ||
%\VignetteEngine{knitr::rmarkdown} | ||
%\VignetteEncoding{UTF-8} | ||
--- | ||
|
||
```{r setup, include = FALSE} | ||
library(dplyr) | ||
knitr::opts_chunk$set( | ||
collapse = TRUE, | ||
comment = "#>" | ||
) | ||
``` | ||
|
||
|
||
# Introduction | ||
|
||
Mapping values in one column to specific values in another (new) column | ||
of a data frame is a common task in data science. Doing so *safely* is | ||
often a struggle. There are some existing methods in the `tidyverse` | ||
that are useful, but in my opinion come with some drawbacks: | ||
|
||
* `dplyr::recode()` | ||
+ can be clunky to implement -> LHS/RHS syntax | ||
difficult (for me) to remember | ||
* `dplyr::case_when()` | ||
+ complex syntax -> difficult to remember; overkill | ||
for mapping purposes | ||
|
||
Below is what I see as a *safe* way to map (re-code) values in an existing | ||
column to a new column. | ||
|
||
|
||
-------------- | ||
|
||
|
||
## Mapping Example | ||
|
||
```{r map-values} | ||
# wish to map values of 'x' | ||
df <- withr::with_seed(101, { | ||
data.frame(id = 1:10L, | ||
value = rnorm(10), | ||
x = sample(letters[1:3L], 10, replace = TRUE) | ||
) | ||
}) | ||
df | ||
# create a [n x 2] lookup-table (aka hash map) | ||
# n = no. values to map | ||
# x = existing values to map | ||
# new_x = new mapped values for each `x` | ||
map <- data.frame(x = letters[1:4L], new_x = c("cat", "dog", "bird", "turtle")) | ||
map | ||
# use `dplyr::left_join()` | ||
# note: 'turtle' is absent because `d` is not in `df$x` (thus ignored) | ||
dplyr::left_join(df, map, by = "x") | ||
``` | ||
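
A lookup-table join is only safe if each key appears once in the table. The sketch below adds two guard rails around the join. It is a minimal illustration, not part of the original article: the data are simplified, `set.seed()` is used instead of `withr`, and the checks are my own suggestion.

```r
library(dplyr)

# simplified stand-in for the data above (assumption: base `set.seed()`)
set.seed(101)
df <- data.frame(id = 1:10, x = sample(letters[1:3], 10, replace = TRUE))
map <- data.frame(x = letters[1:4], new_x = c("cat", "dog", "bird", "turtle"))

# a lookup table should hold exactly one row per key;
# duplicated keys would duplicate rows of `df` in the join
stopifnot(anyDuplicated(map$x) == 0L)

# name the join column explicitly rather than relying on shared column names
mapped <- left_join(df, map, by = "x")

# a safe mapping never changes the number of rows
stopifnot(nrow(mapped) == nrow(df))
mapped
```

Checking the row count after the join is a cheap way to catch a lookup table that accidentally contains repeated keys.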
|
||
|
||
## Un-mapped Values -> `NAs` | ||
|
||
In the output below, `b` maps to `NA`. This is because the mapping object now | ||
lacks a mapping for `b` (compare to row 2 above). | ||
The lookup table here is built with a slightly different syntax, via `tibble::enframe()`. | ||
|
||
```{r unmapped-NA} | ||
# note: `b` is missing in the map | ||
map_vec <- c(a = "cat", c = "bird", d = "turtle") | ||
map2 <- tibble::enframe(map_vec, name = "x", value = "new_x") | ||
map2 | ||
# note: un-mapped values generate NAs: `b -> NA` | ||
dplyr::left_join(df, map2, by = "x") | ||
``` |
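
Silent `NA`s are easy to miss in a large data set. One possible follow-up, sketched here with small hypothetical data and a placeholder label of my own choosing (`"unmapped"`), is to list the un-mapped values before joining and to replace the resulting `NA`s explicitly afterwards:

```r
library(dplyr)

# hypothetical, deterministic data for illustration
df <- data.frame(id = 1:5, x = c("a", "b", "c", "a", "b"))
map_vec <- c(a = "cat", c = "bird", d = "turtle")
map2 <- data.frame(x = names(map_vec), new_x = unname(map_vec))

# values of `x` that have no entry in the lookup table
unmapped <- setdiff(unique(df$x), map2$x)
unmapped

# join, then replace the resulting NAs with an explicit placeholder
mapped <- left_join(df, map2, by = "x") |>
  mutate(new_x = coalesce(new_x, "unmapped"))
mapped
```

Whether to stop, warn, or fill in a placeholder depends on the analysis; the point is only that the decision is made deliberately.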
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,106 @@ | ||
--- | ||
title: "Safely Rename Data Frames" | ||
author: "Stu Field, SomaLogic Operating Co., Inc." | ||
description: > | ||
How to safely and reliably rename variable names of a | ||
data frame (or `soma_adat`) in R. | ||
output: | ||
rmarkdown::html_vignette: | ||
fig_caption: yes | ||
vignette: > | ||
%\VignetteIndexEntry{Safely Rename Data Frames} | ||
%\VignetteEngine{knitr::rmarkdown} | ||
%\VignetteEncoding{UTF-8} | ||
--- | ||
|
||
```{r setup, include = FALSE} | ||
library(SomaDataIO) | ||
library(dplyr) | ||
knitr::opts_chunk$set( | ||
collapse = TRUE, | ||
comment = "#>" | ||
) | ||
``` | ||
|
||
|
||
# Introduction | ||
|
||
Renaming variables/features of a data frame (or `tibble`) | ||
is a common task in data science. Doing so *safely* is often a struggle. | ||
This can be achieved *safely* with the `dplyr::rename()` function in 2 | ||
steps: | ||
|
||
1. Set up the mapping in a named vector | ||
1. Apply the `dplyr::rename()` function via `!!!` syntax | ||
|
||
Alternatively, roll your own `rename()` function. | ||
|
||
* **Note**: all values of the mapping (i.e. key) object *must* be | ||
present as `names` in the data frame object. | ||
|
||
|
||
## Example with `mtcars` | ||
|
||
```{r rename-df} | ||
# Create map/key of the names to map | ||
key <- c(MPG = "mpg", CARB = "carb") # named vector | ||
key | ||
# rename `mtcars` | ||
rename(mtcars, !!! key) |> head() | ||
``` | ||
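
If a key might contain entries that are absent from the data frame, `dplyr::rename()` with `!!!` stops with an error. The sketch below shows that behavior and one way around it, `any_of()`, which `dplyr` re-exports from `tidyselect`. The extra entry `FOO = "not_a_column"` is hypothetical and exists only for illustration.

```r
library(dplyr)

# `FOO = "not_a_column"` is a hypothetical entry with no match in `mtcars`
key_extra <- c(MPG = "mpg", CARB = "carb", FOO = "not_a_column")

# unquote-splicing the key errors because `not_a_column` does not exist
res <- tryCatch(rename(mtcars, !!!key_extra), error = function(e) "error")
res

# `any_of()` renames the matching columns and skips the rest
renamed <- rename(mtcars, any_of(key_extra))
names(renamed)
```

The strict error is usually the safer default; `any_of()` is an opt-in for cases where a partial match is expected.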
|
||
|
||
|
||
## A SomaScan example (`example_data`) | ||
|
||
Occasionally it might be required to | ||
rename `AptNames` (`seq.1234.56`) -> `SeqIds` (`1234-56`) when | ||
analyzing SomaScan data. | ||
|
||
```{r rename-sim} | ||
getAnalytes(example_data) |> | ||
head() | ||
# create map (named vector) | ||
key2 <- getAnalytes(example_data) | ||
names(key2) <- getSeqId(key2) # re-name `seq.XXXX` -> SeqIds | ||
key2 <- c(key2, ID = "SampleId") # SampleId -> ID | ||
head(key2, 10L) | ||
# rename analytes of `example_data` | ||
getAnalytes(example_data) |> | ||
head(10L) | ||
new <- rename(example_data, !!! key2) | ||
getAnalytes(new) |> | ||
head(10L) | ||
``` | ||
|
||
## Alternative to `dplyr` | ||
If you prefer to avoid the `dplyr` import/dependency, you can achieve a | ||
similar result with similar syntax by writing your own renaming function: | ||
|
||
```{r rename2} | ||
rename2 <- function (.data, ...) { | ||
map <- c(...) | ||
loc <- setNames(match(map, names(.data), nomatch = 0L), names(map)) | ||
loc <- loc[loc > 0L] | ||
newnames <- names(.data) | ||
newnames[loc] <- names(loc) | ||
setNames(.data, newnames) | ||
} | ||
``` | ||
|
||
Now, with *similar* syntax (but without support for `!!!`): | ||
|
||
```{r rename-usage} | ||
# rename `mtcars` in-line | ||
rename2(mtcars, MPG = "mpg", CARB = "carb") |> | ||
head() | ||
# rename `mtcars` via named `key` | ||
rename2(mtcars, key) |> | ||
head() | ||
``` | ||
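
One difference from `dplyr::rename()` is worth knowing: because of `nomatch = 0L` and the `loc > 0L` filter, `rename2()` silently skips entries that do not match a column. The sketch below demonstrates this and adds a stricter wrapper. `rename2_strict()` and the `FOO` entry are hypothetical names introduced here, not part of the original article.

```r
# same definition as in the article, repeated so the chunk is self-contained
rename2 <- function(.data, ...) {
  map <- c(...)
  loc <- setNames(match(map, names(.data), nomatch = 0L), names(map))
  loc <- loc[loc > 0L]
  newnames <- names(.data)
  newnames[loc] <- names(loc)
  setNames(.data, newnames)
}

# `FOO = "not_a_column"` is hypothetical; it is dropped without a warning
out <- rename2(mtcars, MPG = "mpg", FOO = "not_a_column")
names(out)[1:3]

# hypothetical strict wrapper: fail early if any old name is missing
rename2_strict <- function(.data, ...) {
  map <- c(...)
  missing_names <- setdiff(map, names(.data))
  if (length(missing_names) > 0L) {
    stop("columns not found: ", paste(missing_names, collapse = ", "))
  }
  rename2(.data, map)
}
```

Calling `rename2_strict(mtcars, FOO = "not_a_column")` would stop with an error, which mirrors the stricter behavior of `dplyr::rename()`.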
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,89 @@ | ||
--- | ||
title: "Common train-test data setups" | ||
author: "Stu Field, SomaLogic Operating Co., Inc." | ||
description: > | ||
Simple code syntax for common machine learning (ML) | ||
training vs test setups in R. | ||
output: | ||
rmarkdown::html_vignette: | ||
fig_caption: yes | ||
vignette: > | ||
%\VignetteIndexEntry{Common train-test data setups} | ||
%\VignetteEngine{knitr::rmarkdown} | ||
%\VignetteEncoding{UTF-8} | ||
--- | ||
|
||
```{r setup, include = FALSE} | ||
library(dplyr) | ||
knitr::opts_chunk$set( | ||
collapse = TRUE, | ||
comment = "#>" | ||
) | ||
options(width = 90) | ||
``` | ||
|
||
|
||
# Introduction | ||
|
||
Most machine learning (ML) analyses require a random split of original data | ||
into training/test data sets, where the model is fit on the training data | ||
and performance is evaluated on the test data set. The split proportions can | ||
vary, though 80/20 training/test is common. It often depends on the number of | ||
available samples and the class distribution in the splits. | ||
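
When class balance matters, one option is to sample within each class so that both splits keep roughly the original class proportions. The following is a minimal base R sketch of such a stratified 80/20 split. The data frame `dat` and its `class` column are hypothetical, and this approach is my own illustration rather than something covered in the article.

```r
# hypothetical data: 20 samples, two classes with a 3:1 imbalance
dat <- data.frame(id = 1:20, class = rep(c("ctrl", "case"), times = c(15, 5)))

set.seed(1)
# sample 80% of the row indices within each class
idx_by_class <- lapply(split(seq_len(nrow(dat)), dat$class), function(i) {
  i[sample.int(length(i), size = floor(0.8 * length(i)))]
})
idx <- sort(unlist(idx_by_class, use.names = FALSE))

train <- dat[idx, ]
test <- dat[-idx, ]

# class counts in each split
table(train$class)
table(test$class)
```

Indexing with `sample.int()` rather than calling `sample()` on the index vector directly avoids the surprise when a class has a single row, in which case `sample(i, ...)` would sample from `1:i`.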
|
||
Among many alternatives, there are 3 common approaches. All are equally | ||
viable, and the choice depends on the analyst's weighing of the pros and cons of each. | ||
I recommend one of the following: | ||
|
||
1. base R data frame indexing with `sample()` and `[` | ||
1. use `dplyr::slice_sample()` or `dplyr::sample_frac()` in | ||
combination with `dplyr::anti_join()` | ||
1. use the [rsample](https://rsample.tidymodels.org) package (not demonstrated) | ||
|
||
--------- | ||
|
||
|
||
## Original Raw Data | ||
|
||
In most analyses, you typically start with a raw original data set that | ||
you must split randomly into training and test sets. | ||
|
||
```{r raw-data} | ||
raw <- SomaDataIO::rn2col(head(mtcars, 10L), "CarModel") |> | ||
SomaDataIO::add_rowid("id") |> # set up identifier variable for the join() | ||
tibble::as_tibble() | ||
raw | ||
``` | ||
|
||
|
||
--------- | ||
|
||
|
||
## Option #1: `sample()` | ||
|
||
```{r train-test1} | ||
n <- nrow(raw) | ||
idx <- withr::with_seed(1, sample(1:n, floor(n / 2))) # sample random 50% (n = 5) | ||
train <- raw[idx, ] | ||
test <- raw[-idx, ] | ||
train | ||
test | ||
``` | ||
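
A quick sanity check after any index-based split is to confirm that the two sets are disjoint and together cover the original data. The sketch below is a self-contained base R variant of the split; it uses `set.seed()` and a plain `id` column in place of the `withr` and `SomaDataIO` helpers, which is an assumption made for portability.

```r
raw <- head(mtcars, 10L)
raw$id <- seq_len(nrow(raw))   # simple row identifier

n <- nrow(raw)
set.seed(1)
idx <- sample(seq_len(n), floor(n / 2))   # random 50% of the row indices
train <- raw[idx, ]
test <- raw[-idx, ]

# the split should be a partition: no overlap, nothing lost
stopifnot(
  length(intersect(train$id, test$id)) == 0L,
  setequal(c(train$id, test$id), raw$id)
)
```

Note that negative indexing with an empty `idx` returns zero rows rather than all rows, so it is worth guarding against `floor(n / 2) == 0` for very small data sets.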
|
||
|
||
## Option #2: `anti_join()` | ||
|
||
```{r train-test2} | ||
# sample random 50% (n = 5) | ||
train <- withr::with_seed(1, dplyr::slice_sample(raw, n = floor(n / 2))) | ||
# or using `dplyr::sample_frac()` | ||
# train <- withr::with_seed(1, dplyr::sample_frac(raw, size = 0.5)) | ||
# use anti_join() to get the sample setdiff | ||
test <- dplyr::anti_join(raw, train, by = "id") | ||
train | ||
test | ||
``` |
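
The same pattern extends to the 80/20 split mentioned in the introduction by using the `prop` argument of `dplyr::slice_sample()`. This is a minimal sketch with `set.seed()` and a hand-made `id` column (assumptions for self-containment), not the article's exact setup.

```r
library(dplyr)

raw <- head(mtcars, 10L)
raw$id <- seq_len(nrow(raw))   # identifier for the anti_join()

set.seed(1)
train <- slice_sample(raw, prop = 0.8)     # 8 of 10 rows
test <- anti_join(raw, train, by = "id")   # the remaining 2 rows

nrow(train)
nrow(test)
```

`slice_sample()` is the current `dplyr` interface; `sample_frac()` still works but is marked as superseded in the `dplyr` documentation.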