Commit

re-knit vignette Mds for github
sfirke committed Jul 17, 2018
1 parent 7227b0c commit 3cd8748
Showing 2 changed files with 67 additions and 134 deletions.
4 changes: 2 additions & 2 deletions vignettes/janitor.md
@@ -1,6 +1,6 @@
Overview of janitor functions
================
2018-07-16
2018-07-17

- [Major functions](#major-functions)
- [Cleaning](#cleaning)
@@ -156,7 +156,7 @@ excel_numeric_to_date(41103.01) # ignores decimal places, returns Date object
#> [1] "2012-07-13"
excel_numeric_to_date(41103.01, include_time = TRUE) # returns POSIXlt object
#> [1] "2012-07-13 00:14:24"
excel_numeric_to_date(41103, date_system = "mac pre-2011")
excel_numeric_to_date(41103.01, date_system = "mac pre-2011")
#> [1] "2016-07-14"
```

197 changes: 65 additions & 132 deletions vignettes/tabyls.md
@@ -1,50 +1,40 @@
tabyls: a tidy, fully-featured approach to counting things
================
2018-06-15
2018-07-17

## Motivation: why tabyl?
Motivation: why tabyl?
----------------------

Analysts do a lot of counting. Indeed, it’s been said that “[data
science is mostly counting
things](https://twitter.com/joelgrus/status/833691273873600512).” But
the base R function for counting, `table()`, leaves much to be desired:
Analysts do a lot of counting. Indeed, it's been said that "[data science is mostly counting things](https://twitter.com/joelgrus/status/833691273873600512)." But the base R function for counting, `table()`, leaves much to be desired:

- It doesn’t accept data.frame inputs (and thus doesn’t play nicely
with the `%>%` pipe)
- It doesn’t output data.frames
- Its results are hard to format. Compare the look and formatting
choices of an R table to a Microsoft Excel PivotTable or even the
table formatting provided by SPSS.
- It doesn't accept data.frame inputs (and thus doesn't play nicely with the `%>%` pipe)
- It doesn't output data.frames
- Its results are hard to format. Compare the look and formatting choices of an R table to a Microsoft Excel PivotTable or even the table formatting provided by SPSS.

`tabyl()` is an approach to tabulating variables that addresses these
shortcomings. It’s part of the janitor package because counting is such
a fundamental part of data cleaning and exploration.
`tabyl()` is an approach to tabulating variables that addresses these shortcomings. It's part of the janitor package because counting is such a fundamental part of data cleaning and exploration.

`tabyl()` is tidyverse-aligned and is primarily built upon the dplyr and
tidyr packages.
`tabyl()` is tidyverse-aligned and is primarily built upon the dplyr and tidyr packages.

## How it works
How it works
------------

On its surface, `tabyl()` produces frequency tables using 1, 2, or 3
variables. Under the hood, `tabyl()` also attaches a copy of these
counts as an attribute of the resulting data.frame.
On its surface, `tabyl()` produces frequency tables using 1, 2, or 3 variables. Under the hood, `tabyl()` also attaches a copy of these counts as an attribute of the resulting data.frame.

The result looks like a basic data.frame of counts, but because it’s
also a `tabyl` containing this metadata, you can use `adorn_` functions
to add additional information and pretty formatting.
The result looks like a basic data.frame of counts, but because it's also a `tabyl` containing this metadata, you can use `adorn_` functions to add additional information and pretty formatting.
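
The attached metadata can be inspected directly. The following is a minimal sketch, assuming the janitor package is installed; it uses the built-in `mtcars` data instead of `starwars`, the object name `cyl_by_am` is hypothetical, and the `core` attribute name is the one this vignette refers to further down. Other attributes may vary between janitor versions.

``` r
library(janitor)

# Two-way tabyl of cylinder count by transmission type (built-in data)
cyl_by_am <- tabyl(mtcars, cyl, am)

# The object is still a data.frame, with the "tabyl" class added
class(cyl_by_am)

# The untouched counts travel with the object as the "core" attribute
core <- attr(cyl_by_am, "core")
is.data.frame(core)
```

Because the counts are stored separately, later `adorn_` calls can convert the visible columns to percentages and still recover the original Ns.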

# Examples
Examples
========

This vignette demonstrates `tabyl` in the context of studying humans in
the `starwars` dataset from dplyr:
This vignette demonstrates `tabyl` in the context of studying humans in the `starwars` dataset from dplyr:

``` r
library(dplyr)
humans <- starwars %>%
  filter(species == "Human")
```

## One-way tabyl
One-way tabyl
-------------

Tabulating a single variable is the simplest kind of tabyl:

@@ -64,10 +54,7 @@ t1
#> yellow 2 0.05714286
```

When `NA` values are present, `tabyl()` also displays “valid”
percentages, i.e., with missing values removed from the denominator. And
while `tabyl()` is built to take a data.frame and column names, you can
also produce a one-way tabyl by calling it directly on a vector:
When `NA` values are present, `tabyl()` also displays "valid" percentages, i.e., with missing values removed from the denominator. And while `tabyl()` is built to take a data.frame and column names, you can also produce a one-way tabyl by calling it directly on a vector:

``` r
x <- c("big", "big", "small", "small", "small", NA)
@@ -78,8 +65,7 @@ tabyl(x)
#> <NA> 1 0.1666667 NA
```
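
To make the two percentage columns concrete, here is a small sketch (assuming janitor is installed) that recomputes them by hand for the vector above. The names `res`, `by_hand_percent`, and `by_hand_valid` are illustrative only.

``` r
library(janitor)

x <- c("big", "big", "small", "small", "small", NA)
res <- tabyl(x)

# percent divides by all 6 values; valid_percent divides by the 5 non-missing ones
by_hand_percent <- 2 / length(x)       # share of "big" among all values
by_hand_valid   <- 2 / sum(!is.na(x))  # share of "big" among non-missing values

# show_na = FALSE drops the NA row from the result
tabyl(x, show_na = FALSE)
```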

Most `adorn_` helper functions are built for 2-way tabyls, but those
that make sense for a 1-way tabyl do work:
Most `adorn_` helper functions are built for 2-way tabyls, but those that make sense for a 1-way tabyl do work:

``` r
t1 %>%
@@ -95,12 +81,10 @@ t1 %>%
#> Total 35 100.0%
```

## Two-way tabyl
Two-way tabyl
-------------

This is often called a “crosstab” or “contingency” table. Calling
`tabyl` on two columns of a data.frame produces the same result as the
common combination of `dplyr::count()`, followed by `tidyr::spread()` to
wide form:
This is often called a "crosstab" or "contingency" table. Calling `tabyl` on two columns of a data.frame produces the same result as the common combination of `dplyr::count()`, followed by `tidyr::spread()` to wide form:

``` r
t2 <- humans %>%
@@ -112,8 +96,7 @@ t2
#> male 9 1 12 1 1 2
```
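
The equivalence with `dplyr::count()` plus `tidyr::spread()` can be checked side by side. This sketch assumes dplyr, tidyr, and janitor are installed and uses `mtcars` so that it does not depend on the `starwars` columns; `via_dplyr` and `via_tabyl` are hypothetical names.

``` r
library(dplyr)
library(tidyr)
library(janitor)

# The long way: count each combination, then spread to wide form
via_dplyr <- mtcars %>%
  count(cyl, am) %>%
  spread(am, n, fill = 0)

# The tabyl way: a single call
via_tabyl <- mtcars %>%
  tabyl(cyl, am)

via_dplyr
via_tabyl
```

The counts match; the tabyl version additionally carries the class and metadata that the `adorn_` functions rely on.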

Since it’s a `tabyl`, we can enhance it with `adorn_` helper functions.
For instance:
Since it's a `tabyl`, we can enhance it with `adorn_` helper functions. For instance:

``` r

@@ -126,13 +109,12 @@ t2 %>%
#> male 34.62% (9) 3.85% (1) 46.15% (12) 3.85% (1) 3.85% (1) 7.69% (2)
```

Adornments have options to control axes, rounding, and other relevant
formatting choices (more on that below).
Adornments have options to control axes, rounding, and other relevant formatting choices (more on that below).

## Three-way tabyl
Three-way tabyl
---------------

Just as `table()` accepts three variables, so does `tabyl()`, producing
a list of tabyls:
Just as `table()` accepts three variables, so does `tabyl()`, producing a list of tabyls:

``` r
t3 <- humans %>%
@@ -159,9 +141,7 @@ t3
#> yellow 0 0 0 1 0 1
```

If the `adorn_` helper functions are called on a list of data.frames -
like the output of a three-way `tabyl` call - they will call
`purrr::map()` to apply themselves to each data.frame in the list:
If the `adorn_` helper functions are called on a list of data.frames - like the output of a three-way `tabyl` call - they will call `purrr::map()` to apply themselves to each data.frame in the list:

``` r
library(purrr)
@@ -192,32 +172,21 @@ humans %>%
#> Total 15.4% (4) 50.0% (13) 19.2% (5) 3.8% (1) 7.7% (2) 3.8% (1)
```

This automatic mapping supports interactive data analysis that switches
between combinations of 2 and 3 variables. That way, if a user starts
with `humans %>% tabyl(eye_color, skin_color)`, adds some `adorn_`
calls, then decides to split the tabulation by gender and modifies their
first line to `humans %>% tabyl(eye_color, skin_color, gender)`, they
don’t have to rewrite the subsequent adornment calls to use `map()`.
This automatic mapping supports interactive data analysis that switches between combinations of 2 and 3 variables. That way, if a user starts with `humans %>% tabyl(eye_color, skin_color)`, adds some `adorn_` calls, then decides to split the tabulation by gender and modifies their first line to `humans %>% tabyl(eye_color, skin_color, gender)`, they don't have to rewrite the subsequent adornment calls to use `map()`.

However, if it feels more natural to call these with `map()` or `lapply()`,
that is still supported. For instance, `t3 %>%
lapply(adorn_percentages)` would produce the same result as `t3 %>%
adorn_percentages`.
However, if it feels more natural to call these with `map()` or `lapply()`, that is still supported. For instance, `t3 %>% lapply(adorn_percentages)` would produce the same result as `t3 %>% adorn_percentages`.
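
A rough sketch of that equivalence, assuming janitor is installed. It uses a three-way tabyl built from `mtcars` and the `"all"` denominator (chosen here only to avoid dividing by empty rows); `auto` and `manual` are hypothetical names.

``` r
library(janitor)

# A list of two-way tabyls, one element per value of gear
t3 <- tabyl(mtcars, cyl, am, gear)

# Let the adorn_ function map over the list by itself ...
auto <- adorn_percentages(t3, "all")

# ... or apply it to each element explicitly
manual <- lapply(t3, adorn_percentages, "all")

names(auto)
names(manual)
```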

### Other features of tabyls

- When called on a factor, `tabyl` will show missing levels (levels
not present in the data) in the result
    - This can be suppressed if not desired
- `NA` values can be displayed or suppressed
- `tabyls` print without displaying row numbers
- When called on a factor, `tabyl` will show missing levels (levels not present in the data) in the result
    - This can be suppressed if not desired
- `NA` values can be displayed or suppressed
- `tabyls` print without displaying row numbers
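
The factor-level behavior can be seen with a toy factor. This is a sketch assuming janitor is installed; the factor `f` and the result names are made up for illustration.

``` r
library(janitor)

# "c" is a declared level that never occurs in the data
f <- factor(c("a", "b", "a"), levels = c("a", "b", "c"))

with_empty    <- tabyl(f)                               # "c" appears with n = 0
without_empty <- tabyl(f, show_missing_levels = FALSE)  # "c" is dropped

with_empty
without_empty
```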

## The `adorn_*` functions
The `adorn_*` functions
-----------------------

These modular functions build on a `tabyl` to approximate the
functionality of a PivotTable in Microsoft Excel. They print elegant
results for interactive analysis or for sharing in a report, e.g., with
`knitr::kable()`. For example:
These modular functions build on a `tabyl` to approximate the functionality of a PivotTable in Microsoft Excel. They print elegant results for interactive analysis or for sharing in a report, e.g., with `knitr::kable()`. For example:

``` r
humans %>%
@@ -231,61 +200,35 @@ humans %>%
```

| gender/eye\_color | blue | blue-gray | brown | dark | hazel | yellow | Total |
| :---------------- | :------- | :-------- | :------- | :----- | :------ | :----- | :-------- |
|:------------------|:---------|:----------|:---------|:-------|:--------|:-------|:----------|
| female | 33% (3) | 0% (0) | 56% (5) | 0% (0) | 11% (1) | 0% (0) | 100% (9) |
| male | 35% (9) | 4% (1) | 46% (12) | 4% (1) | 4% (1) | 8% (2) | 100% (26) |
| Total | 34% (12) | 3% (1) | 49% (17) | 3% (1) | 6% (2) | 6% (2) | 100% (35) |

### The adorn functions are:

- **`adorn_totals()`**: Add totals row, column, or both. Replaces the
older janitor functions `add_totals_row` and `add_totals_col`
- **`adorn_percentages()`**: Calculate percentages along either axis
or over the entire tabyl
- **`adorn_pct_formatting()`**: Format percentage columns, controlling
the number of digits to display and whether to append the `%` symbol
- **`adorn_rounding()`**: Round a data.frame of numbers (usually the
result of `adorn_percentages`), either using the base R `round()`
function or using janitor’s `round_half_up()` to round all ties up
([thanks,
StackOverflow](http://stackoverflow.com/a/12688836/4470365)).
    - e.g., round 10.5 up to 11, consistent with Excel’s tie-breaking
      behavior.
    - This contrasts with rounding 10.5 down to 10 as in base R’s
      `round(10.5)`.
    - `adorn_rounding()` returns columns of class `numeric`, allowing
      for graphing, sorting, etc. It’s a less-aggressive substitute
      for `adorn_pct_formatting()`; these two functions should not be
      called together.
- **`adorn_ns()`**: add Ns to a tabyl. These can be drawn from the
tabyl’s underlying counts, which are attached to the tabyl as
metadata, or they can be supplied by the user.
- **`adorn_title()`**: add a title to a tabyl (or other data.frame).
Options include putting the column title in a new row on top of the
data.frame or combining the row and column titles in the
data.frame’s first name slot.

These adornments should be called in a logical order, e.g., you probably
want to add totals before percentages are calculated. In general, call
them in the order they appear above.

Users of janitor version \<= 0.3.1 should replace the deprecated
`adorn_crosstab()` function with combinations of the above `adorn_`
functions.

## BYOt (Bring Your Own tabyl)

You can also call `adorn_` functions on other data.frames, not only the
results of calls to `tabyl()`. E.g., `mtcars %>% adorn_totals("col") %>%
adorn_percentages("col")` performs as expected, despite `mtcars` not
being a `tabyl`.

This can be handy when you have a data.frame that is not a simple
tabulation generated by `tabyl` but would still benefit from the
`adorn_` formatting functions.

A simple example: calculate the proportion of records meeting a certain
condition, then format the results.
- **`adorn_totals()`**: Add totals row, column, or both. Replaces the older janitor functions `add_totals_row` and `add_totals_col`
- **`adorn_percentages()`**: Calculate percentages along either axis or over the entire tabyl
- **`adorn_pct_formatting()`**: Format percentage columns, controlling the number of digits to display and whether to append the `%` symbol
- **`adorn_rounding()`**: Round a data.frame of numbers (usually the result of `adorn_percentages`), either using the base R `round()` function or using janitor's `round_half_up()` to round all ties up ([thanks, StackOverflow](http://stackoverflow.com/a/12688836/4470365)).
    - e.g., round 10.5 up to 11, consistent with Excel's tie-breaking behavior.
    - This contrasts with rounding 10.5 down to 10 as in base R's `round(10.5)`.
    - `adorn_rounding()` returns columns of class `numeric`, allowing for graphing, sorting, etc. It's a less-aggressive substitute for `adorn_pct_formatting()`; these two functions should not be called together.
- **`adorn_ns()`**: add Ns to a tabyl. These can be drawn from the tabyl's underlying counts, which are attached to the tabyl as metadata, or they can be supplied by the user.
- **`adorn_title()`**: add a title to a tabyl (or other data.frame). Options include putting the column title in a new row on top of the data.frame or combining the row and column titles in the data.frame's first name slot.
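
The tie-breaking difference described for `adorn_rounding()` can be reproduced on bare numbers. A minimal sketch, assuming janitor is installed:

``` r
library(janitor)

# Base R rounds exact ties to the nearest even digit
round(10.5)

# janitor's helper always rounds ties up
round_half_up(10.5)

# The two agree whenever the even neighbour is also the upper one
round(11.5)
round_half_up(11.5)
```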

These adornments should be called in a logical order, e.g., you probably want to add totals before percentages are calculated. In general, call them in the order they appear above.

Users of janitor version &lt;= 0.3.1 should replace the deprecated `adorn_crosstab()` function with combinations of the above `adorn_` functions.

BYOt (Bring Your Own tabyl)
---------------------------

You can also call `adorn_` functions on other data.frames, not only the results of calls to `tabyl()`. E.g., `mtcars %>% adorn_totals("col") %>% adorn_percentages("col")` performs as expected, despite `mtcars` not being a `tabyl`.

This can be handy when you have a data.frame that is not a simple tabulation generated by `tabyl` but would still benefit from the `adorn_` formatting functions.

A simple example: calculate the proportion of records meeting a certain condition, then format the results.

``` r
percent_above_165_cm <- humans %>%
@@ -301,10 +244,7 @@ percent_above_165_cm %>%
#> 2 male 100.0%
```

Here’s a more complex example. We’ll create a table containing the mean
of a 3rd variable when grouped by two other variables, then use `adorn_`
functions to round the values and append Ns. The first part is pretty
straightforward:
Here's a more complex example. We'll create a table containing the mean of a 3rd variable when grouped by two other variables, then use `adorn_` functions to round the values and append Ns. The first part is pretty straightforward:

``` r
library(tidyr) # for spread()
@@ -323,9 +263,7 @@ mpg_by_cyl_and_am
#> 3 8 15.0 15.4
```

Now to `adorn_` it. Since this is not the result of a `tabyl()` call, it
doesn’t have the underlying Ns stored in the `core` attribute, so we’ll
have to supply them:
Now to `adorn_` it. Since this is not the result of a `tabyl()` call, it doesn't have the underlying Ns stored in the `core` attribute, so we'll have to supply them:

``` r
mpg_by_cyl_and_am %>%
@@ -341,13 +279,8 @@ mpg_by_cyl_and_am %>%
#> 3 8 15.1 (12) 15.4 (2)
```

If needed, Ns can be manipulated in their own data.frame before they are
appended. E.g., if you have a tabyl with values of N in the thousands,
you could divide them by 1000, round, and append “k” before inserting
them with `adorn_ns`.
If needed, Ns can be manipulated in their own data.frame before they are appended. E.g., if you have a tabyl with values of N in the thousands, you could divide them by 1000, round, and append "k" before inserting them with `adorn_ns`.

### Questions? Comments?

File [an issue on GitHub](https://github.com/sfirke/janitor/issues) if
you have suggestions related to `tabyl()` and its `adorn_` helpers or
encounter problems while using them.
File [an issue on GitHub](https://github.com/sfirke/janitor/issues) if you have suggestions related to `tabyl()` and its `adorn_` helpers or encounter problems while using them.
