document y_mis functionality
ConnorDonegan committed Apr 4, 2024
1 parent 59e0b33 commit 4433fff
Showing 39 changed files with 1,640 additions and 1,445 deletions.
2 changes: 1 addition & 1 deletion DESCRIPTION
@@ -1,7 +1,7 @@
Package: geostan
Title: Bayesian Spatial Analysis
Version: 0.6.0
Date: 2024-03-01
Date: 2024-04-04
URL: https://connordonegan.github.io/geostan/
BugReports: https://github.com/ConnorDonegan/geostan/issues
Authors@R: c(
9 changes: 7 additions & 2 deletions NEWS.md
@@ -1,8 +1,13 @@
# geostan 0.6.0

## New Additions
Updates:

1. Missing outcome data is now allowed in most models
2. A bug in `prep_icar_data` has been fixed

The model fitting functions (`stan_glm`, `stan_car`, etc.) now allow for missing data in the outcome variable. This is explained in the `geostan::stan_glm` documentation, next to the discussion of handling censored observations. When missing observations are present, only a warning is issued. This functionality is available for any GLM (`stan_glm`), any ESF model (`stan_esf`), and any model for count data (Poisson and binomial models, including CAR and SAR models). The only models for which this functionality is not currently available are CAR and SAR models that are being fit to continuous outcome variables.
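
As a rough sketch of what allowing missing outcomes involves, the following base R snippet indexes the missing rows, swaps in a placeholder (Stan cannot accept `NA`), and leaves those rows out of the likelihood. The toy values and the names `y_stan` and `lambda` are assumptions made for illustration; this is not geostan's internal code.

```r
# Toy outcome vector with two missing counts (made-up values)
y_raw <- c(12, NA, 7, NA, 30)

# Index the missing and the observed rows
y_mis_idx <- which(is.na(y_raw))
y_obs_idx <- which(!is.na(y_raw))

# Stan cannot accept NA, so substitute a placeholder value
y_stan <- y_raw
y_stan[y_mis_idx] <- 0

# Hypothetical Poisson means for each row (assumed for illustration)
lambda <- c(10, 8, 9, 15, 28)

# Only the observed rows contribute to the log-likelihood,
# so the placeholder never influences the result
log_lik <- sum(dpois(y_stan[y_obs_idx], lambda[y_obs_idx], log = TRUE))
```

Because the sum runs over `y_obs_idx` only, the choice of placeholder is arbitrary.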

The model fitting functions (`stan_glm`, `stan_car`, etc.) now allow for missing data in the outcome variable and a new vignette provides the details. This functionality is not available for auto-Gaussian models - that is, CAR and SAR models that have been fit to continuous outcome variables - but is available for all other available models (including eigenvector spatial filtering `stan_esf` models for continuous outcomes, and all models for count outcomes [binomial and Poisson models]).
The `prep_icar_data` function, which is used inside `stan_icar`, did not have the expected behavior in all cases - this has been fixed thanks to this [pull request](https://github.com/ConnorDonegan/geostan/pull/18).

# geostan 0.5.4

8 changes: 6 additions & 2 deletions R/prep-censored-data.R
@@ -1,8 +1,8 @@



#' @description Return index of observed, censored y; elsewhere, use results to replace NAs with zeros
#' This will stop if there are missing values and the censor_point argument is not being used; outside of this call, must check that censor_point argument is only used with Poisson likelihood.
#' @description Return index of observed and missing y (elsewhere, these results should be used to replace NAs with zeros or an indicator integer: Stan cannot accept NA values).
#' Outside of this call, must check that censor_point argument is only used with Poisson likelihood.
#'
#' @param censor the censor_point argument
#' @param frame from model.frame(formula, tmpdf, na.action = NULL)
@@ -15,6 +15,10 @@ handle_censored_y <- function(censor, frame) {
}
y_mis_idx <- which(is.na(y_raw))
y_obs_idx <- which(!is.na(y_raw))
n_mis <- length(y_mis_idx)
if ( n_mis > 0 ) {
message( paste0(n_mis, " NA values identified in the outcome variable\nFound in rows: ", paste0(y_mis_idx, collapse = ', ' )) )
}
return (list(n_mis = length(y_mis_idx),
n_obs = length(y_obs_idx),
y_mis_idx = y_mis_idx,
2 changes: 1 addition & 1 deletion R/stan_car.R
@@ -144,7 +144,7 @@
#'
#' ## Additional functionality
#'
#' The CAR models can also incorporate spatially-lagged covariates, measurement/sampling error in covariates (particularly when using small area survey estimates as covariates), and censored outcomes (such as arise when a disease surveillance system suppresses data for privacy reasons). For details on these options, please see the Details section in the documentation for \link[geostan]{stan_glm}.
#' The CAR models can also incorporate spatially-lagged covariates, measurement/sampling error in covariates (particularly when using small area survey estimates as covariates), missing outcome data, and censored outcomes (such as arise when a disease surveillance system suppresses data for privacy reasons). For details on these options, please see the Details section in the documentation for \link[geostan]{stan_glm}.
#'
#' @return An object of class \code{geostan_fit} (a list) containing:
#' \describe{
2 changes: 1 addition & 1 deletion R/stan_esf.R
@@ -97,7 +97,7 @@
#'
#' ## Additional functionality
#'
#' The CAR models can also incorporate spatially-lagged covariates, measurement/sampling error in covariates (particularly when using small area survey estimates as covariates), and censored outcomes (such as arise when a disease surveillance system suppresses data for privacy reasons). For details on these options, please see the Details section in the documentation for \link[geostan]{stan_glm}.
#' The ESF models can also incorporate spatially-lagged covariates, measurement/sampling error in covariates (particularly when using small area survey estimates as covariates), missing outcome data, and censored outcomes (such as arise when a disease surveillance system suppresses data for privacy reasons). For details on these options, please see the Details section in the documentation for \link[geostan]{stan_glm}.
#'
#' @return An object of class \code{geostan_fit} (a list) containing:
#' \describe{
7 changes: 7 additions & 0 deletions R/stan_glm.R
@@ -132,6 +132,13 @@
#' \deqn{x \sim Gauss(z, s^2)}
#' \deqn{logit(z) \sim student(\nu_z, \mu_z, \sigma_z)}
#'
#'
#' ### Missing data
#'
#' For most geostan models, missing (NA) observations are allowed in the outcome variable. However, there cannot be any missing covariate data. Models that can handle missing outcome data are: any Poisson or binomial model (GLM, SAR, CAR, ESF, ICAR), and all GLMs and ESF models. The only models that cannot handle missing outcome data are the CAR and SAR models when the outcome is a continuous variable (auto-normal/Gaussian models).
#'
#' When observations are missing, they will simply be ignored when calculating the likelihood in the MCMC sampling process (reflecting the absence of information). The estimated model parameters (including any covariates and spatial trend) will then be used to produce estimates or fitted values for the missing observations. The `fitted` and `posterior_predict` functions will work as normal in this case, and return values for all rows in your data.
#'
#' ### Censored counts
#'
#' Vital statistics systems and disease surveillance programs typically suppress case counts when they are smaller than a specific threshold value. In such cases, the observation of a censored count is not the same as a missing value; instead, you are informed that the value is an integer somewhere between zero and the threshold value. For Poisson models (`family = poisson()`), you can use the `censor_point` argument to encode this information into your model.
2 changes: 1 addition & 1 deletion R/stan_icar.R
@@ -162,7 +162,7 @@
#' ```
#' ## Additional functionality
#'
#' The CAR models can also incorporate spatially-lagged covariates, measurement/sampling error in covariates (particularly when using small area survey estimates as covariates), and censored outcomes (such as arise when a disease surveillance system suppresses data for privacy reasons). For details on these options, please see the Details section in the documentation for \link[geostan]{stan_glm}.
#' The CAR models can also incorporate spatially-lagged covariates, measurement/sampling error in covariates (particularly when using small area survey estimates as covariates), missing outcome data, and censored outcomes (such as arise when a disease surveillance system suppresses data for privacy reasons). For details on these options, please see the Details section in the documentation for \link[geostan]{stan_glm}.
#'
#' @return An object of class \code{geostan_fit} (a list) containing:
#' \describe{
2 changes: 1 addition & 1 deletion R/stan_sar.R
@@ -147,7 +147,7 @@
#'
#' ## Additional functionality
#'
#' The SAR models can also incorporate spatially-lagged covariates, measurement/sampling error in covariates (particularly when using small area survey estimates as covariates), and censored outcomes (such as arise when a disease surveillance system suppresses data for privacy reasons). For details on these options, please see the Details section in the documentation for \link[geostan]{stan_glm}.
#' The SAR models can also incorporate spatially-lagged covariates, measurement/sampling error in covariates (particularly when using small area survey estimates as covariates), missing outcome data, and censored outcomes (such as arise when a disease surveillance system suppresses data for privacy reasons). For details on these options, please see the Details section in the documentation for \link[geostan]{stan_glm}.
#'
#'
#' @return An object of class \code{geostan_fit} (a list) containing:
87 changes: 51 additions & 36 deletions README.Rmd
@@ -15,42 +15,20 @@ knitr::opts_chunk$set(
fig.align = 'center'
)
```

<img src="man/figures/logo.png" align="right" width="160" />

## geostan: Bayesian spatial analysis

The [**geostan**](https://connordonegan.github.io/geostan/) R package supports a complete spatial analysis workflow with Bayesian models for areal data, including a suite of functions for visualizing spatial data and model results. For demonstrations and discussion, see the package [help pages](https://connordonegan.github.io/geostan/reference/index.html) and [vignettes](https://connordonegan.github.io/geostan/articles/index.html) on spatial autocorrelation, spatial measurement error models, spatial regression with raster layers, and building custom spatial model in Stan.

The package is particularly suitable for public health research with spatial data, and complements the [**surveil**](https://connordonegan.github.io/surveil/) R package for time series analysis of public health surveillance data.

**geostan** models were built using [**Stan**](https://mc-stan.org), a state-of-the-art platform for Bayesian modeling.

[![DOI](https://joss.theoj.org/papers/10.21105/joss.04716/status.svg)](https://doi.org/10.21105/joss.04716)

### Disease mapping and spatial regression

Statistical models for data recorded across areal units like states, counties, or census tracts.

### Observational uncertainty
The [**geostan**](https://connordonegan.github.io/geostan/) R package supports a complete spatial analysis workflow with Bayesian models for areal data, including a suite of functions for visualizing spatial data and model results. **geostan** models were built using [**Stan**](https://mc-stan.org), a state-of-the-art platform for Bayesian modeling. The package is designed partly for public health research with spatial data, for which it complements the [**surveil**](https://connordonegan.github.io/surveil/) R package for time series analysis of public health surveillance data.

Incorporate information on data reliability, such as standard errors of American Community Survey estimates, into any **geostan** model.
Features include:

### Censored observations

Vital statistics and disease surveillance systems like CDC Wonder censor case counts that fall below a threshold number; **geostan** can model disease or mortality risk with censored observations.

### Spatial analysis tools

Tools for visualizing and measuring spatial autocorrelation and map patterns, for exploratory analysis and model diagnostics.

### The RStan ecosystem

Interfaces easily with many high-quality R packages for Bayesian modeling.

### Custom spatial models

Tools for building custom spatial models in [Stan](https://mc-stan.org/).
* **Disease mapping and spatial regression** Statistical models for data recorded across areal units like states, counties, or census tracts.
* **Spatial analysis tools** Tools for visualizing and measuring spatial autocorrelation and map patterns, for exploratory analysis and model diagnostics.
* **Observational uncertainty** Incorporate information on data reliability, such as standard errors of American Community Survey estimates, into any **geostan** model.
* **Missing and censored observations** Vital statistics and disease surveillance systems like CDC Wonder censor case counts that fall below a threshold number; **geostan** can model disease or mortality risk for small areas with censored observations or with missing observations.
* **The RStan ecosystem** Interfaces easily with many high-quality R packages for Bayesian modeling.
* **Custom spatial models** Tools for building custom spatial models in [Stan](https://mc-stan.org/).

## Installation

@@ -62,26 +40,29 @@ install.packages("geostan")

## Support

All functions and methods are documented (with examples) on the website [reference](https://connordonegan.github.io/geostan/reference/index.html) page. See the package [vignettes](https://connordonegan.github.io/geostan/articles/index.html) for more on exploratory spatial data analysis, spatial measurement error models, and spatial regression with large raster layers.
All functions and methods are documented (with examples) on the website [reference](https://connordonegan.github.io/geostan/reference/index.html) page. See the package [vignettes](https://connordonegan.github.io/geostan/articles/index.html) for more on exploratory spatial analysis, spatial measurement error models, spatial regression with raster layers, and building custom spatial models in Stan.

To ask questions, report a bug, or discuss ideas for improvements or new features please visit the [issues](https://github.com/ConnorDonegan/geostan/issues) page, start a [discussion](https://github.com/ConnorDonegan/geostan/discussions), or submit a [pull request](https://github.com/ConnorDonegan/geostan/pulls).

## Usage

Load the package and the `georgia` county mortality data set (ages 55-64, years 2014-2018):
Load the package and the `georgia` county mortality data set:
```{r}
library(geostan)
data(georgia)
```

The `sp_diag` function provides visual summaries of spatial data, including a histogram, Moran scatter plot, and map:
This data set has county population and mortality data by sex for ages 55-64, for the period 2014-2018. As is common for public access data, some of the observations are missing because the CDC has censored them.

The `sp_diag` function provides visual summaries of spatial data, including a histogram, Moran scatter plot, and map. Here is a visual summary of crude female mortality rates (as deaths per 10,000):

```{r fig.width = 8}
A <- shape2mat(georgia, style = "B")
sp_diag(georgia$rate.female, georgia, w = A)
mortality_rate <- georgia$rate.female * 10e3
sp_diag(mortality_rate, georgia, w = A)
```
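
The scaling factor `10e3` in the chunk above is R notation for 10 × 10^3, that is, 10,000 (the same value as `1e4`). The following base R sketch, using made-up counts and populations rather than the `georgia` data, shows the conversion to deaths per 10,000 and how a suppressed (`NA`) count carries through:

```r
# Toy counts and populations (made-up values; NA marks a suppressed count)
deaths <- c(25, NA, 40)
pop_at_risk <- c(5000, 1200, 16000)

# Crude rate = deaths / population at risk
crude_rate <- deaths / pop_at_risk

# 10e3 is 10 * 10^3 = 10,000, so this is deaths per 10,000 at risk
rate_per_10k <- crude_rate * 10e3

# The suppressed count stays NA in the crude rates
is.na(rate_per_10k)
```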

There are three censored observations in the `georgia` female mortality data, which means there were 9 or fewer deaths in those counties. The following code fits a spatial conditional autoregressive (CAR) model to female county mortality data. By using the `censor_point` argument we include our information on the censored observations to obtain results for all counties:
The following code fits a spatial conditional autoregressive (CAR) model to female county mortality data. These models are used for estimating disease risk in small areas like counties, and for analyzing covariation of health outcomes with other area qualities. The R syntax for fitting the models is similar to using `lm` or `glm`. We provide the population at risk (the denominator for mortality rates) as an offset term, using the log-transform. In this case, three of the observations are missing because they have been censored; per CDC criteria, this means that there were 9 or fewer deaths in those counties. By using the `censor_point` argument and setting it to `censor_point = 9`, the model will account for the censoring process when providing estimates of the mortality rates:

```{r}
cars <- prep_car_data(A)
@@ -93,7 +74,9 @@ fit <- stan_car(deaths.female ~ offset(log(pop.at.risk.female)),
cores = 4, # for multi-core processing
refresh = 0) # to silence some printing
```
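
Conceptually, a count censored at 9 tells the model only that the true value lies between 0 and 9, so it contributes the cumulative probability P(Y ≤ 9) rather than a point probability. The base R sketch below illustrates that idea for a Poisson model; the population and rate values are assumptions, and this is not geostan's internal code.

```r
censor_point <- 9

# Hypothetical county: population at risk and underlying mortality rate
pop_at_risk <- 2500
mortality_rate <- 0.004
mu <- pop_at_risk * mortality_rate   # expected number of deaths

# Fully observed count (say, 7 deaths): log P(Y = 7)
log_lik_observed <- dpois(7, lambda = mu, log = TRUE)

# Censored count: log P(Y <= censor_point)
log_lik_censored <- ppois(censor_point, lambda = mu, log.p = TRUE)

# The censored term equals the log of the summed point probabilities
log_lik_by_sum <- log(sum(dpois(0:censor_point, lambda = mu)))
```

A missing observation, by contrast, contributes nothing to the likelihood, which is why a censored count carries more information than a missing one.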

Passing a fitted model to the `sp_diag` function will return a set of diagnostics for spatial models:

```{r fig.width = 8}
sp_diag(fit, georgia, w = A)
```
@@ -103,5 +86,37 @@ The `print` method returns a summary of the probability distributions for model
```{r}
print(fit)
```
More demonstrations can be found in the package [help pages](https://connordonegan.github.io/geostan/reference/index.html) and [vignettes](https://connordonegan.github.io/geostan/articles/index.html).

Applying the `fitted` method to the fitted model will return the fitted values from the model - in this case, the fitted values are the estimates of the county mortality rates. Multiplying them by 10,000 gives mortality rate per 10,000 at risk:

```{r}
mortality_est <- fitted(fit) * 10e3
county_name <- georgia$NAME
head( cbind(county_name, mortality_est) )
```

The mortality estimates are stored in the column named "mean", and the limits of the 95% credible interval are found in the columns "2.5%" and "97.5%".
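
To show where summaries with these names come from, here is a base R sketch that summarizes a set of simulated draws for a single area. The draws are simulated for illustration only (they are not geostan output), and `summary_row` is a hypothetical name.

```r
set.seed(1)

# Simulated draws of one county's mortality rate, scaled to per 10,000
# (assumed values, standing in for MCMC samples)
draws <- rgamma(4000, shape = 50, rate = 10000) * 10e3

# Posterior mean plus the limits of the 95% credible interval
summary_row <- c(mean = mean(draws),
                 quantile(draws, probs = c(0.025, 0.975)))

# quantile() labels its output "2.5%" and "97.5%"
names(summary_row)
```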

Details and demonstrations can be found in the package [help pages](https://connordonegan.github.io/geostan/reference/index.html) and [vignettes](https://connordonegan.github.io/geostan/articles/index.html).

## Citing geostan

If you use geostan in published work, please include a citation.

Donegan, Connor (2022) "geostan: An R package for Bayesian spatial analysis" *The Journal of Open Source Software*. 7, no. 79: 4716. [https://doi.org/10.21105/joss.04716](https://doi.org/10.21105/joss.04716).

[![DOI](https://joss.theoj.org/papers/10.21105/joss.04716/status.svg)](https://doi.org/10.21105/joss.04716)

```
@Article{,
title = {{geostan}: An {R} package for {B}ayesian spatial analysis},
author = {Connor Donegan},
journal = {The Journal of Open Source Software},
year = {2022},
volume = {7},
number = {79},
pages = {4716},
doi = {10.21105/joss.04716},
}
```
