WIP: Group summary stat #6066

Closed

teunbrand
Collaborator

This is my stab at fixing #3501.

It is still WIP because we might want to exchange thoughts on the name of the stat and how it is used, and because the documentation and tests still need expanding. Initially I'd welcome any feedback on the direction of the PR.

To recap #3501, we generally want a more flexible summary stat that does not (necessarily) group by x. This is stat_summarise() in this PR (name is up for discussion).
The 'twist' I've given this PR is that the default summary function is a bit dandy, which we'll go over as I narrate the examples.

By default, the stat does nothing to the input data and renders points.

devtools::load_all("~/packages/ggplot2")
#> ℹ Loading ggplot2

ggplot(mpg, aes(displ, hwy)) +
  stat_summarise()

In #3501, it was suggested that we can pass any function to the stat that returns a data.frame.
This is indeed how this stat works.

ggplot(mpg, aes(drv, hwy)) +
  geom_jitter(aes(colour = drv)) +
  stat_summarise(
    geom = "segment",
    fun = function(data) {
      data.frame(
        x = min(data$x) - 0.5, 
        xend = max(data$x) + 0.5, 
        y = mean(data$y))
    }
  )

However, the 'trick' in this PR is that the default summary function is an NSE function that evaluates expressions passed to fun.args. Hence, instead of having to construct a separate function for every task, we can pass NSE expressions directly.

ggplot(mpg, aes(drv, hwy)) +
  geom_jitter(aes(colour = drv)) +
  stat_summarise(
    geom = "segment",
    fun.args = vars(
      xend = max(x) + 0.5, 
      x = min(x) - 0.5, 
      y = mean(y)
    )
  )

Like some other tidyverse functions, the expressions are evaluated sequentially, meaning that later expressions can use results of prior expressions. This allows us to, for example, compute a convex hull in a reasonably straightforward manner.

ggplot(mpg, aes(displ, hwy, colour = drv)) +
  geom_point() +
  stat_summarise(
    geom = "polygon",
    fun.args = vars(
      hull = chull(x, y),
      x = x[hull],
      y = y[hull]
    ),
    fill = NA
  )
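To make the sequential evaluation concrete outside of a plot, here is a minimal base R sketch. `eval_in_order()` is a hypothetical helper written for illustration and is not the code in this PR; it only assumes that each expression is evaluated against the current set of variables and that the result is stored under the expression's name.

```r
# Hypothetical helper (not part of the PR): evaluate named expressions one by
# one, so that later expressions can see the results of earlier ones.
eval_in_order <- function(data, exprs) {
  vars <- as.list(data)
  nms <- names(exprs)
  for (i in seq_along(exprs)) {
    # `vars` acts as the data mask; other names (e.g. chull) are looked up
    # in the calling environment.
    vars[[nms[i]]] <- eval(exprs[[i]], vars, parent.frame())
  }
  as.data.frame(vars)
}

# A unit square plus one interior point
pts <- data.frame(x = c(0, 1, 1, 0, 0.5), y = c(0, 0, 1, 1, 0.5))

hull_pts <- eval_in_order(pts, alist(
  hull = chull(x, y),  # indices of the hull vertices
  x = x[hull],         # uses `hull` from the previous step
  y = y[hull]
))

# Only the four corners remain; the interior point is not on the hull
print(hull_pts)
```

The order matters: `x = x[hull]` only works because `hull` was stored by the preceding expression.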

The thing that works differently relative to similarly flavoured functions (e.g. summarise, reframe, mutate in {dplyr}) is that not every evaluated expression needs to result in column-compatible data right away. This allows us to declare temporary variables, as long as they're explicitly removed.

ggplot(mpg, aes(displ, colour = drv)) +
  stat_summarise(
    geom = "line",
    fun.args = vars(
      dens = density(x), # temporary variable
      x = dens$x, y = dens$y,
      dens = NULL # delete variable
    )
  )
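For comparison, the same density computation can be written as an ordinary function in the style of the segment example above, where the temporary variable is simply a local. `density_fun` is an illustrative name, and the commented-out `stat_summarise()` call assumes this PR's branch is loaded.

```r
# Illustrative function form of the fun.args version above;
# `density_fun` is a made-up name, not something defined in the PR.
density_fun <- function(data) {
  dens <- density(data$x)  # local variable, discarded when the function returns
  data.frame(x = dens$x, y = dens$y)
}

# Called directly on a small data frame with an `x` column
out <- density_fun(data.frame(x = c(1.8, 2.0, 2.8, 3.1, 4.7, 5.3)))
str(out)

# With this PR's branch loaded, it could be passed to the stat as:
# ggplot(mpg, aes(displ, colour = drv)) +
#   stat_summarise(geom = "line", fun = density_fun)
```

In the function form there is nothing to delete, because `dens` never becomes part of the returned data frame.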

Created on 2024-08-28 with reprex v2.1.1

@thomasp85
Member

I'm not sure the very minor decrease in code length warrants a completely new way of specifying a function, TBH.

@teunbrand
Collaborator Author

The stat just takes a regular fun argument like any other summary stat. The stat itself doesn't really have any special tricks up its sleeve.
It is the default fun that allows you to specify how variables should be manipulated using NSE.

One can achieve something similar using stat_summary() too:

library(ggplot2)
library(rlang)

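# Summary function that evaluates the expressions passed via `fun.args`
fun <- function(data, ...) {
  # The input may be a bare vector of y-values (as stat_summary() passes to
  # `fun.data`), so wrap it in a data frame
  if (!is.data.frame(data)) {
    data <- data.frame(y = data)
  }
  exprs <- list2(...)
  if (length(exprs) < 1) {
    return(data)
  }
  # Work on a plain list so intermediate results need not be column-compatible
  data <- unclass(data)
  nms <- names(exprs)
  # Evaluate sequentially: later expressions can use earlier results
  for (i in seq_along(exprs)) {
    data[[nms[[i]]]] <- eval_tidy(exprs[[i]], data)
  }
  # Return only the variables named in the expressions
  vctrs::data_frame(!!!data[unique(names(exprs))])
}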

ggplot(mpg, aes(drv, displ)) +
  stat_summary(
    fun.data = fun,
    fun.args = vars(
      ymin = min(y),
      ymax = max(y),
      y    = mean(y)
    ),
    geom = "pointrange"
  )

Created on 2024-09-13 with reprex v2.1.1

@thomasp85
Member

I get that, but this is not something we push people towards in stat_summary() so I don't think the comparison holds.

Thinking of it, this is basically handing the reins of compute_group() to the user through the fun argument. I think it should be called stat_manual() and the default fun value should be identity. There is no reason to confine this to summarising functions in any way through the naming.
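A rough sketch of what that suggestion could look like, written against ggplot2's public ggproto extension mechanism. This is not the implementation in any PR: `StatManualSketch`, `stat_manual_sketch()` and the `args` parameter are hypothetical names chosen for illustration.

```r
library(ggplot2)

# Hypothetical stat: compute_group() simply delegates to a user-supplied
# function, which defaults to identity (data passes through unchanged).
StatManualSketch <- ggproto(
  "StatManualSketch", Stat,
  compute_group = function(data, scales, fun = identity, args = list()) {
    do.call(fun, c(list(data), args))
  }
)

# Hypothetical layer constructor following the usual stat_*() pattern
stat_manual_sketch <- function(mapping = NULL, data = NULL, geom = "point",
                               position = "identity", ..., fun = identity,
                               args = list(), na.rm = FALSE,
                               show.legend = NA, inherit.aes = TRUE) {
  layer(
    stat = StatManualSketch, data = data, mapping = mapping, geom = geom,
    position = position, show.legend = show.legend, inherit.aes = inherit.aes,
    params = list(fun = fun, args = args, na.rm = na.rm, ...)
  )
}

# Per-group convex hull: `fun` receives each group's data frame
p <- ggplot(mpg, aes(displ, hwy, colour = drv)) +
  geom_point() +
  stat_manual_sketch(
    geom = "polygon", fill = NA,
    fun = function(data) data[chull(data$x, data$y), , drop = FALSE]
  )
```

Because `fun` is free to return any number of rows, nothing in this shape is specific to summarising, which is the point of the proposed rename.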

@teunbrand
Collaborator Author

> I think it should be called stat_manual() and the default fun value should be identity.

I like this better than the current form

@teunbrand teunbrand mentioned this pull request Sep 13, 2024
@teunbrand
Collaborator Author

Closing in favour of #6103

@teunbrand teunbrand closed this Sep 13, 2024

Successfully merging this pull request may close these issues.

Can ggplot2 have a Stat that simply summarises data by group?