Skip to content

Commit

Permalink
easier to do it this way.
Browse files Browse the repository at this point in the history
  • Loading branch information
svmiller committed Jan 11, 2024
1 parent 3b9b278 commit 7b89708
Show file tree
Hide file tree
Showing 2 changed files with 2 additions and 1,034 deletions.
303 changes: 1 addition & 302 deletions README.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -30,7 +30,7 @@ cat(

<img src="http://svmiller.com/images/stevemisc-hexlogo.png" alt="My stevemisc hexlogo" align="right" width="200" style="padding: 0 15px; float: right;"/>

`{stevemisc}` is an R package that includes various functions and tools that I have written over the years to assist me in my research, teaching, and public presentations (i.e. stuff I put on my blog). I offer it here for a public release because 1) I am vain and think I want an entire, eponymous ecosystem in the R programming language (i.e. the "steveverse") and 2) I think there are tools here that are broadly useful for users that I'm trying to bundle with other things that I offer (prominently [`{steveproj}`](https://github.com/svmiller/steveproj)). Namely, `{stevemisc}` offers tools to assist in data organization, data presentation, data recoding, and data simulation. The usage section will elaborate some of its uses.
`{stevemisc}` is an R package that includes various functions and tools that I have written over the years to assist me in my research, teaching, and public presentations (i.e. stuff I put on my blog). I offer it here for a public release because 1) I am vain and think I want an entire, eponymous ecosystem in the R programming language (i.e. the "steveverse") and 2) I think there are tools here that are broadly useful for users that I'm trying to bundle with other things that I offer (prominently [`{steveproj}`](https://github.com/svmiller/steveproj)). Namely, `{stevemisc}` offers tools to assist in data organization, data presentation, data recoding, and data simulation.

## Installation

Expand All @@ -45,304 +45,3 @@ You can install the development version of `{stevemisc}` from Github via the `{d
```r
devtools::install_github("svmiller/stevemisc")
```

## Usage

The documentation files will include several of these as "examples." I offer them here as proofs of concept. There are lots of cool stuff in `{stevemisc}` and I cannot review all of them here. Instead, I'll offer what I think are the most important ones.

### `carrec()`: A Port of `car::recode()`

`carrec()` (phonetically: "car-wreck") is a simple port of `car::recode()` that I put in this package because of various function clashes in the `{car}` package. For those who cut their teeth on Stata, this package offers Stata-like recoding features that are tough to find in the R programming language. It comes with a shortcut as well, `carr()`.

For example, assume the following vector that is some variable of interest on a 1-10 scale. You want to code the variables that are 6 and above to be 1 and code the variables of 1-5 to be 0. Here's how you would do that.

```{r, cache=F, message=F}
library(tidyverse)
library(stevemisc)
x <- seq(1, 10)
x
carrec(x, "1:5=0;6:10=1")
carr(x, "1:5=0;6:10=1")
```

### `cor2data()`: Simulate Variables from a Standard Normal Distribution with Pre-Specified Correlations

`cor2data()` is great for instructional purposes for simulating data from a standard normal distribution in which the ensuing data are generated to approximate some pre-specified correlations. This is useful for teaching how statistical models are supposed to operate under ideal circumstances. For example, here's how [I used this function to teach about instrumental variable models](http://post8000.svmiller.com/lab-scripts/instrumental-variables-lab.html). Notice the correlations I devise and how they satisfy they assumptions of exclusion, exogeneity, and relevance.

```{r, cache=F, message=F}
vars = c("control", "treat", "instr", "e")
Cor <- matrix(cbind(1, 0, 0, 0,
0, 1, 0.85, -0.5,
0, 0.85, 1, 0,
0, -0.5, 0, 1),nrow=4)
rownames(Cor) <- colnames(Cor) <- vars
Fake <- as_tibble(cor2data(Cor, 1000, 8675309)) # Jenny I got your number...
Fake$y <- with(Fake, 5 + .5*control + .5*treat + e)
Fake
```

### `corvectors()`: Create Multivariate Data by Permutation

`corvectors()` is a port of `correlate()` from the `{correlate}` package. This package is no longer on CRAN, but it's wonderful for creating multivariate data with set correlations in which variables can be on any number of raw scales. I used this function to create fake data to mimic the API data in `{survey}`, which I make available as `fakeAPI` in the `{stevedata}` package. Here is a smaller version of that.

```{r, cache=F}
data(api, package="survey")
cormatrix <- cor(apipop %>%
select(meals, col.grad, full) %>% na.omit)
nobs <- 1e3
corvectors(cbind(runif(nobs, 0, 100),
rbnorm(nobs, 20.73, 14.14, 0, 100),
rbnorm(nobs, 87.52, 12.93, 0, 100)), cormatrix) %>%
as.data.frame() %>% as_tibble() %>%
rename(meals = V1, colgrad = V2, fullqual = V3)
```

### `db_lselect()`: Lazily Select Variables From Multiple Tables in a Relational Database

`db_lselect()` allows you to select variables from multiple tables in an SQL database. It returns a lazy query that combines all the variables together into one data frame (as a tibble). The user can choose to run `collect()` after this query if they see fit. [I wrote about this on my website in 2020](http://svmiller.com/blog/2020/11/smarter-ways-to-store-your-wide-data-with-sql-magic-purrr/) and how it applies to real-world problems. Here is a proof of concept of how this works.

```{r, cache=F}
library(DBI)
library(RSQLite)
set.seed(8675309)
A <- data.frame(uid = c(1:10),
a = rnorm(10),
b = sample(letters, 10),
c = rbinom(10, 1, .5))
B <- data.frame(uid = c(11:20),
a = rnorm(10),
b = sample(letters, 10),
c = rbinom(10, 1, .5))
C <- data.frame(uid = c(21:30), a = rnorm(10),
b = sample(letters, 10),
c = rbinom(10, 1, .5),
d = rnorm(10))
con <- dbConnect(SQLite(), ":memory:")
copy_to(con, A, "A",
temporary=FALSE)
copy_to(con, B, "B",
temporary=FALSE)
copy_to(con, C, "C",
temporary=FALSE)
# This returns no warning because columns "a" and "b" are in all tables
c("A", "B", "C") %>% db_lselect(con, c("uid", "a", "b"))
# This returns two warnings because column "d" is not in 2 of 3 tables.
# ^ this is by design. It'll inform the user about data availability.
c("A", "B", "C") %>% db_lselect(con, c("uid", "a", "b", "d"))
```

### `get_sims()`: Get Simulations from a Model Object (with New Data)

`get_sims()` is a function to simulate quantities of interest by way of a multivariate normal distribution for "new data" from a regression model. This coincides with an "informal Bayesian" approach to estimating quantities of interest that importantly also provide the user some idea of upper and lower bounds around an estimated quantity of interest.

It's flexible to linear models, generalized linear models, and their mixed model equivalents. Of note: the simulations from the mixed models omit (alternatively: "do not consider") the random intercepts. In my travels, this is because reviewers do not care about these quantities and just want to see quantities from the fixed effects in the model. If you'd like a more comprehensive simulation approach for those parameters in your mixed model, I recommend `{merTools}` for mixed models estimated in `{lme4}`.

Here is what this would look like for a linear model.

```{r, cache=F, warning=F}
library(stevedata)
M1 <- lm(immigsent ~ agea + female + eduyrs + uempla + hinctnta + lrscale, data=ESS9GB)
broom::tidy(M1)
library(modelr)
# Note: the DV must be in the "new data".
# It doesn't matter what value it is.
# It just needs to be there as a column.
ESS9GB %>%
data_grid(.model=M1, immigsent = 0,
lrscale = c(min(lrscale, na.rm=T),
max(lrscale, na.rm=T))) -> newdat
Sims <- get_sims(M1, newdat, 1000, 8675309)
Sims
```


### `get_var_info()`: Get Labelled Data from Your Variables

`get_var_info()` allows for what I like to term "peeking" at your labelled data. If you do not have a codebook handy, but you know the data are labelled, `get_var_info()` (and its shortcut: `gvi()`) will extract the pertinent information for you. `{stevemisc}` comes with a toy data set---`ess9_labelled`---in which there are two labelled variables for the country and internet consumption from the ninth round of the European Social Survey. You can extract that information with this package.

Do note that it assumes a pipe-based workflow. It's there for when you're having to sit down in an R session and recode data without the assistance of a dual-monitor setup or physical codebook.

```{r, cache=F}
ess9_labelled
# alternatively, below:
# ess9_labelled %>% gvi(netusoft)
# we'll do it this way, though...
ess9_labelled %>% get_var_info(netusoft)
```

### `jenny()`: Set the Only Reproducible Seed that Matters, and Get a Nice Message for It

There are infinite reproducible seeds. There is only one correct one. `jenny()` will set a reproducible seed of 8675309 for you and reward you with a nice message. It will get catty with you if try to use `jenny()` to set any other reproducible seed.


```{r, cache=F}
jenny() # good, seed set for 8675309
jenny(12345) # bad, and no seed set. Use set.seed() instead, you goon.
```


### `p_z()`: Convert the *p*-value you want to the *z*-value it is


I *loathe* how statistical instruction privileges obtaining a magical *p*-value by reference to an area underneath the standard normal curve, only to botch what the actual *z*-value is corresponding to the magical *p*-value. This simple function converts the *p*-value you want (typically .05, thanks to R.A. Fisher) to the *z*-value it actually is for the kind of claims we typically make in inferential statistics. If we're going to do inference the wrong way, let's at least get the *z*-value right.

```{r, cache=F}
p_z(.05)
p_z(c(.001, .01, .05, .1))
```

### `print_refs()`: Print and Format \code{.bib} Entries as References

`print_refs()` takes a `.bib` entry (or entries) and formats it as a reference (or set of references). This function is useful if you want to populate a syllabus with a reading list and have more agency over how it's formatted.

For example, here's a list of things you should read and cite, along with an illustration of the defaults by which the function works (American Political Science Association style, to LaTeX). `stevepubs`, in this package, contains an incomplete list of my publications.

Remember: *extremely Smokey Bear voice* "only YOU can jack my *h*-index to infinity."

```{r, cache=F, message = F, warning = F}
# Note, this function does spam with some messages/warnings.
# You can disable that in a chunk, as I do here.
library(bib2df)
print_refs(capture.output(df2bib(stevepubs)))
```

### `r1sd()` and `r2sd()`: Rescaling Data by One (or Two) Standard Deviations

`r1sd()` and `r2sd()` allow the user to rescale data by one or two standard deviations. What functions does what should be intuitive from the function name. Generally, regression modelers should center their regression inputs so that everything has a meaningful center (and that the *y*-intercept should be meaningful). The regression coefficients that emerge communicate something more interesting as well: magnitude effects. [Gelman (2008)](http://www.stat.columbia.edu/~gelman/research/published/standardizing7.pdf) argues rescaling by two standard deviations has the added advantage of making binary inputs roughly comparable to anything that you standardized.

```{r}
x <- rnorm(50)
r1sd(x)
r2sd(x)
```

## `r2sd_at()`: Rescale Multiple Columns by Two Standard Deviations (and Rename)

`r2sd_at()` is a wrapper for `mutate_at()` and `rename_at()` in `{dplyr}`. It both rescales the supplied vectors to new vectors and renames the vectors to all have a prefix of `z_`. This is my preferred convention for these things.

```{r}
mtcars %>% tbl_df() %>%
select(mpg, disp, hp) %>%
r2sd_at(c("mpg", "hp", "disp"))
```

### `ps_btscs()` and `sbtscs()`: Create "Peace Years" or "Spells" by Cross-Sectional Unit

`sbtscs()` allows you to create spells ("peace years" in the international conflict context) between observations of some event. This will allow the researcher to better model temporal dependence in binary time-series cross-section ("BTSCS") models. Much of it is liberally copy-pasted from Dave Armstrong's `{DAMisc}` package. I just added some `{dplyr}` stuff underneath to speed it up and prevent it from choking when there are a lot of cross-sectional units without an "event" for a "spell."

I explain this in [this blog post from 2017](http://svmiller.com/blog/2017/06/quickly-create-peace-years-for-btscs-models-with-stevemisc/). It's incidentally the first thing I added to `{stevemisc}`. I offer, with it, the `usa_mids` data frame that has all militarized interstate disputes for the United States in non-directed dyad-year form from the Gibler-Miller-Little ("GML") data. `ps_btscs()` is a more general version of `sbtscs()` that performs well when NAs bracket the event data. The latter function features prominently in [`{peacesciencer}`](https://github.com/svmiller/peacesciencer).

```{r, cache=F}
# ?usa_mids
ps_btscs(usa_mids, midongoing, year, dyad)
sbtscs(usa_mids, midongoing, year, dyad)
```

### `revcode()`: Reverse Code a Numeric Variable (i.e. Invert the Scale)

`revcode()` allows you to reverse code a numeric variable. This is useful, say, if you have a Likert item that ranges from 1 ("strongly disagree") to 5 ("strongly agree"), but wants the 5s to be "strongly disagree" and the 1s to be "strongly agree." This function passes over NAs you may have in your variable. It assumes that the observed values include both the minimum and the maximum and that the increments between them are 1. This is usually the case in a discrete ordered-categorical variable (like a Likert item). Use this function with that in mind.

```{r}
tibble(x = c(1:10),
y = c(1:5, 1:5)) %>%
mutate(xrev = revcode(x),
yrev = revcode(y))
```

### `show_ranef()`: Get a Caterpillar Plot of the Random Effects from a Mixed Model

`show_ranef()` allows a user estimating a mixed model to quickly plot the random intercepts (with conditional variances) of a given random effect in a mixed model. In cases where there is a random slope over the intercept, the function plots the random slope as another caterpillar plot (as another facet). These are great for a quick visualization of the random intercepts.

```{r, cache=F}
library(lme4)
M1 <- lmer(Reaction ~ Days + (Days | Subject), data=sleepstudy)
show_ranef(M1, "Subject")
show_ranef(M1, "Subject", reorder=FALSE)
```

### `smvrnorm()`: Simulate from a Multivariate Normal Distribution

This is a simple port and rename of `mvrnorm()` from the `{MASS}` package. I do this because the `{MASS}` package conflicts with a lot of things in my workflow. This will be very handy doing so-called "informal Bayesian" approaches to generating quantities of interest from a regression model.

```{r}
M1 <- lm(immigsent ~ agea + female + eduyrs + uempla + hinctnta + lrscale, data=ESS9GB)
broom::tidy(M1)
as_tibble(smvrnorm(1000, coef(M1), vcov(M1)))
```

### `theme_steve()`, `theme_steve_web()`, `theme_steve_ms()`: Steve's Preferred `{ggplot2}` Themes

`theme_steve()` was a preferred theme of mine a few years ago. It is basically `theme_bw()` from `{ggplot2}` theme, but with me tweaking a few things. I've since moved to `theme_steve_web()` for most things now, prominently on my website. It incorporates the "Open Sans" and "Titillium Web" fonts that I like so much. `post_bg()` is for changing the backgrounds on plots to better match my website for posts that I write. `theme_steve_ms()` is a new addition that uses [the "Crimson Text" font](https://fonts.google.com/specimen/Crimson+Text) to match my plots to my LaTeX manuscripts. For those unaware, "Crimson Text" is basically what [`cochineal`](https://ctan.org/pkg/cochineal?lang=en) is.

```{r}
mtcars %>%
ggplot(.,aes(mpg, hp)) +
geom_point() +
theme_steve() +
labs(title = "A Plot with Steve's Preferred {ggplot2} Theme",
subtitle = "It's basically `theme_bw()` with some minor tweaks.")
mtcars %>%
ggplot(.,aes(mpg, hp)) +
geom_point() +
theme_steve_web() +
labs(title = "A Plot with Steve's Preferred {ggplot2} Theme",
subtitle = "I use `theme_steve_web()` for most things. It has nicer fonts.")
mtcars %>%
ggplot(.,aes(mpg, hp)) +
geom_point() +
theme_steve_ms() +
labs(title = "A Plot with Steve's Preferred {ggplot2} Theme",
subtitle = "I use `theme_steve_ms()` will not look pretty in this application, but will in LaTeX.")
mtcars %>%
ggplot(.,aes(mpg, hp)) +
geom_point() +
theme_steve_font(font = "Comic Sans MS") +
labs(title = "A Plot with Steve's Preferred {ggplot2} Theme",
subtitle = "I use `theme_steve_font()` for the occasional document that uses Palatino type fonts. Here: it's Comic Sans.")
```

### The Student-t Distribution (Location-Scale Version)

Finally, I added a few functions for extending the "standard" t-distribution in R into the three-parameter "location-scale" version. This generalizes the Student-t and is useful for getting acclimated with more general Student-t distributions, which are quite common in Bayesian analyses. `dst()` (density), `pst()` (distribution function), `qst()` (quantile), and `rst()` (random number generation) are available. Here, for example, is using `rst()` to simulate data from one of the most common Student-t distributions in the world of Bayesian priors: the one with three degrees of freedom, a mean of zero, and a standard deviation of ten.

```{r, cache=F}
dat <- tibble(x = rst(10000, 3, 0, 10))
dat %>%
ggplot(.,aes(x)) +
geom_density() +
theme_steve_web() +
labs(title = "Simulated Data from a Student-t (3,0,10) Distribution",
subtitle = "This prior is common in the world of Bayesian priors and used to be a common default prior in {brms}.")
```
Loading

0 comments on commit 7b89708

Please sign in to comment.