WIP: Group summary stat #6066

teunbrand · 2024-08-28T13:03:00Z

This is my stab at fixing #3501.

It is still WIP because the name of the stat and how it is used are open for discussion, and the documentation and tests still need expanding. For now, I'd welcome any feedback on the direction of the PR.

To recap #3501, we generally want a more flexible summary stat that does not (necessarily) group by x. This is stat_summarise() in this PR (name is up for discussion).
The 'twist' I've given this PR is that the default summary function is a bit dandy, which we'll go over as I narrate the examples.

By default, the stat does nothing to the input data and renders points.

devtools::load_all("~/packages/ggplot2")
#> ℹ Loading ggplot2

ggplot(mpg, aes(displ, hwy)) +
  stat_summarise()

In #3501, it was suggested that we can pass any function to the stat that returns a data.frame.
This is indeed how this stat works.

ggplot(mpg, aes(drv, hwy)) +
  geom_jitter(aes(colour = drv)) +
  stat_summarise(
    geom = "segment",
    fun = function(data) {
      data.frame(
        x = min(data$x) - 0.5, 
        xend = max(data$x) + 0.5, 
        y = mean(data$y))
    }
  )
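
To make the contract concrete outside of a plot, here is a small sketch of what such a function returns when it is called once per group. The `toy` data frame and the name `segment_summary` are made up for illustration, and the `split()`/`lapply()` loop only approximates what a stat does internally.

```r
# Hypothetical stand-in for the per-group data a stat would pass to `fun`
toy <- data.frame(
  x = c(1, 1, 2, 2),
  y = c(10, 20, 30, 50),
  group = c(1, 1, 2, 2)
)

# Same summary function as in the segment example
segment_summary <- function(data) {
  data.frame(
    x = min(data$x) - 0.5,
    xend = max(data$x) + 0.5,
    y = mean(data$y)
  )
}

# Call the function once per group and stack the results
summaries <- do.call(rbind, lapply(split(toy, toy$group), segment_summary))
summaries
```

Each group collapses to a single row holding the columns that the segment geom needs.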

However, the 'trick' in this PR is that the default summary function uses non-standard evaluation (NSE): it evaluates the expressions passed to fun.args. Instead of writing a separate function for every task, we can pass the expressions directly.

ggplot(mpg, aes(drv, hwy)) +
  geom_jitter(aes(colour = drv)) +
  stat_summarise(
    geom = "segment",
    fun.args = vars(
      xend = max(x) + 0.5, 
      x = min(x) - 0.5, 
      y = mean(y)
    )
  )

Like some other tidyverse functions, the expressions are evaluated sequentially, meaning that later expressions can use results of prior expressions. This allows us to, for example, compute a convex hull in a reasonably straightforward manner.

ggplot(mpg, aes(displ, hwy, colour = drv)) +
  geom_point() +
  stat_summarise(
    geom = "polygon",
    fun.args = vars(
      hull = chull(x, y),
      x = x[hull],
      y = y[hull]
    ),
    fill = NA
  )

Unlike similarly flavoured functions (e.g. summarise, reframe, mutate in {dplyr}), not every evaluated expression needs to result in column-compatible data right away. This allows us to declare temporary variables, as long as they're explicitly removed.

ggplot(mpg, aes(displ, colour = drv)) +
  stat_summarise(
    geom = "line",
    fun.args = vars(
      dens = density(x), # temporary variable
      x = dens$x, y = dens$y,
      dens = NULL # delete variable
    )
  )
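
For readers who want to see the sequential evaluation and the temporary-variable behaviour in isolation, here is a rough sketch using {rlang}. The helper name `eval_sequentially` and the input vector are hypothetical, and this is not necessarily how the default function in this PR is written.

```r
library(rlang)

# Hypothetical helper: evaluate quosures one by one against a growing list,
# so later expressions can see the results of earlier ones.
eval_sequentially <- function(data, exprs) {
  env <- as.list(data)
  nms <- names(exprs)
  for (i in seq_along(exprs)) {
    # Assigning NULL to a list element removes it, which deletes temporaries
    env[[nms[[i]]]] <- eval_tidy(exprs[[i]], data = env)
  }
  # Keep only the variables that were computed and still exist
  keep <- intersect(names(env), nms)
  as.data.frame(env[keep])
}

out <- eval_sequentially(
  data.frame(x = c(1, 2, 2.5, 4, 7)),
  quos(
    dens = density(x, n = 64), # temporary, not column-compatible
    x = dens$x,
    y = dens$y,
    dens = NULL                # drop the temporary again
  )
)
str(out)
```

The intermediate `dens` object never has to fit into a column because only the final set of variables is converted to a data frame.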

Created on 2024-08-28 with reprex v2.1.1

thomasp85 · 2024-09-13T07:46:12Z

I'm not sure I think the very minor decrease in code length warrants the completely new mode of specifying a function, TBH

teunbrand · 2024-09-13T08:25:16Z

The stat just takes a regular fun argument like any other summary stat. The stat itself doesn't really have any special tricks up its sleeve.
It is the default fun that allows you to specify how variables should be manipulated using NSE.

One can achieve something similar using stat_summary() too:

library(ggplot2)
library(rlang)

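fun <- function(data, ...) {
  # stat_summary() may pass a bare vector of y values; wrap it in a data frame
  if (!is.data.frame(data)) {
    data <- data.frame(y = data)
  }
  # Collect the quosures supplied through fun.args
  exprs <- list2(...)
  if (length(exprs) < 1) {
    return(data)
  }
  # Work on a plain list so intermediate results need not be column-compatible
  data <- unclass(data)
  nms <- names(exprs)
  # Evaluate sequentially: later expressions can use earlier results
  for (i in seq_along(exprs)) {
    data[[nms[[i]]]] <- eval_tidy(exprs[[i]], data)
  }
  # Return only the variables named in the expressions
  vctrs::data_frame(!!!data[unique(names(exprs))])
}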

ggplot(mpg, aes(drv, displ)) +
  stat_summary(
    fun.data = fun,
    fun.args = vars(
      ymin = min(y),
      ymax = max(y),
      y    = mean(y)
    ),
    geom = "pointrange"
  )

Created on 2024-09-13 with reprex v2.1.1

thomasp85 · 2024-09-13T08:30:53Z

I get that, but this is not something we push people towards in stat_summary() so I don't think the comparison holds.

Thinking of it, this is basically giving the reins of compute_group() to the user through the fun argument. I think it should be called stat_manual() and the default fun value should be identity. There is no reason to confine this to summarising functions in any way through the naming.

teunbrand · 2024-09-13T08:41:16Z

> I think it should be called stat_manual() and the default fun value should be identity.

I like this better than the current form
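
As a rough illustration of that suggestion, a stat that simply hands each group's data to a user-supplied function could look like the sketch below. The names `StatManualSketch`, `stat_manual_sketch` and `hull` are hypothetical, and this is not the implementation that ended up in #6103; it only follows the usual ggproto pattern for writing a stat.

```r
library(ggplot2)

# Hypothetical stat: compute_group() delegates to the user's function
StatManualSketch <- ggproto(
  "StatManualSketch", Stat,
  compute_group = function(data, scales, fun = identity, args = list()) {
    do.call(fun, c(list(data), args))
  }
)

# Hypothetical layer constructor wrapping the stat
stat_manual_sketch <- function(mapping = NULL, data = NULL, geom = "point",
                               position = "identity", ..., fun = identity,
                               args = list(), na.rm = FALSE,
                               show.legend = NA, inherit.aes = TRUE) {
  layer(
    stat = StatManualSketch, data = data, mapping = mapping, geom = geom,
    position = position, show.legend = show.legend, inherit.aes = inherit.aes,
    params = list(fun = fun, args = args, na.rm = na.rm, ...)
  )
}

# With the identity default, the data passes through unchanged
p_identity <- ggplot(mpg, aes(displ, hwy)) +
  stat_manual_sketch()

# Any function from data frame to data frame works, e.g. a convex hull
hull <- function(data) data[chull(data$x, data$y), , drop = FALSE]

p_hull <- ggplot(mpg, aes(displ, hwy, colour = drv)) +
  geom_point() +
  stat_manual_sketch(geom = "polygon", fun = hull, fill = NA)
```

Printing `p_hull` draws one hull outline per `drv` group, and `layer_data(p_hull, 2)` shows the rows that the function returned.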

teunbrand · 2024-09-13T11:00:46Z

Closing in favour of #6103

draft summarise stat

3cf4243

teunbrand mentioned this pull request Sep 13, 2024

Manual stat #6103

Open

teunbrand closed this Sep 13, 2024
