Support configurable summary() #624

vorpalvorpal · 2020-11-02T04:03:30Z

Because I am fairly incompetent, I seem to keep introducing duplicate rows into my data frames. I was wondering if, in the initial data summary section of the skim() output, "duplicate rows" might be a useful additional metric.

elinw · 2020-11-04T13:29:53Z

Skimr is pretty column-oriented and you're asking for something row-oriented. That said, I think that sum(duplicated(x)) would give that number. Of course, in many data sets it is expected that there will be repeats.
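As a quick illustration of the sum(duplicated(x)) suggestion, here is a minimal base R sketch. The data frame and the variable name n_duplicate_rows are made up for the example and are not part of skimr.

```r
# Toy data frame in which rows 3 and 5 repeat an earlier row.
df <- data.frame(
  id = c(1, 2, 2, 3, 3, 3),
  value = c("a", "b", "b", "c", "c", "d"),
  stringsAsFactors = FALSE
)

# duplicated() flags every row that repeats an earlier row;
# summing the logical vector counts those rows.
n_duplicate_rows <- sum(duplicated(df))
n_duplicate_rows
```

Note that this counts only the repeats, not the first occurrence of each repeated row.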

michaelquinn32 · 2020-11-05T19:36:28Z

I think we can go a bit further. The most useful place for this would be in the summary, i.e.

skimr/R/summary.R, lines 12 to 14 in 22dfec2:

    summary.skim_df <- function(object, ...) {
      if (is.null(object)) {
        stop("dataframe is null.")

I think the implementation depends on how far we should push this.

Should the summary function be customizable with an sfl?
How would that impact printing the summary?
Are we confident that only a minimal number of functions would ever be needed there?

elinw · 2020-11-05T19:40:06Z

I was thinking the same thing, i.e. should we make it customizable because this might be the first of many requests to add things. I do think that for our user scenario of "someone gives you a data set and you're trying to understand it" it might be very useful. If there are a lot of duplicates it might be smart to store it in a way that reflects that.

elinw · 2022-01-01T03:10:16Z

@michaelquinn32 if we are fixing issues on summary we could think about this one.

michaelquinn32 · 2022-01-01T04:37:07Z

This is a little more than the current updates to summary(), since we'll need to modify the skim object to store this information. I can get to it soon.

elinw · 2022-01-04T23:53:49Z

What I was thinking is that eventually when we have a more flexible summary that would really allow a user to do this.

michaelquinn32 · 2022-01-08T23:02:16Z

Could put this on the roadmap too.

Right now, the issue is that we generate all of the summary components as skimr attributes, which we then extract in the summary function.
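To make the constraint concrete, here is a rough base R sketch of that pattern. It is not skimr's actual code: toy_skim(), toy_summary() and the toy_* attribute names are hypothetical stand-ins. The point is that anything the summary reports has to be computed and attached while the original data is still available.

```r
# Hypothetical stand-in for skim(): computes per-column stats and attaches
# data-level facts as attributes (attribute names are illustrative only).
toy_skim <- function(data) {
  result <- data.frame(
    variable = names(data),
    n_missing = unname(vapply(data, function(col) sum(is.na(col)), integer(1))),
    stringsAsFactors = FALSE
  )
  attr(result, "toy_data_rows") <- nrow(data)
  attr(result, "toy_data_cols") <- ncol(data)
  # A new metric such as duplicate rows must be computed here, because the
  # summary step below only sees the skimmed result, not the original data.
  attr(result, "toy_duplicate_rows") <- sum(duplicated(data))
  result
}

# Hypothetical stand-in for summary(): only extracts stored attributes.
toy_summary <- function(skimmed) {
  list(
    rows = attr(skimmed, "toy_data_rows"),
    columns = attr(skimmed, "toy_data_cols"),
    duplicate_rows = attr(skimmed, "toy_duplicate_rows")
  )
}

skimmed <- toy_skim(
  data.frame(x = c(1, 1, NA), y = c("a", "a", "b"), stringsAsFactors = FALSE)
)
toy_result <- toy_summary(skimmed)
toy_result
```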

For a 3.0, we could extend skim_with() to provide a custom summary function. We could store the result of this as a single attribute in the skim_df, and we might consider a custom print handling function (like in #667) or maybe we can simplify the output.

gt() handles grouping variables.
http://www.danieldsjoberg.com/gt-and-gtsummary-presentation/#11

So we could require a summary function to produce

[stat group type] [stat name] [value]

That should give a result that is pretty similar to what we currently generate.
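A minimal base R sketch of what that three-column shape could look like follows. The helper summarize_long() and the column names stat_group, stat_name and value are assumptions made for illustration; nothing like this exists in skimr today.

```r
# Hypothetical helper: apply named groups of functions to a data frame and
# return one row per statistic as [stat group type] [stat name] [value].
summarize_long <- function(data, groups) {
  rows <- lapply(names(groups), function(group_name) {
    funs <- groups[[group_name]]
    data.frame(
      stat_group = group_name,
      stat_name = names(funs),
      # Values are stored as character so that mixed types share one column.
      value = unname(vapply(funs, function(f) as.character(f(data)), character(1))),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

df <- data.frame(x = c(1, 1, 2), y = c("a", "a", "b"), stringsAsFactors = FALSE)
long <- summarize_long(
  df,
  list(
    counts = list(
      number_of_rows = nrow,
      number_of_columns = ncol,
      number_of_duplicate_rows = function(d) sum(duplicated(d))
    )
  )
)
long
```

Storing every value as character is a simplification chosen for this sketch; a real design would need to decide how to handle typed values.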

You could even think of a summary interface that is similar to the rest of skimr, basically using sfl()s.

my_skim <- skim_with(
  .summary = skimr_summary_fun(
    metadata = sfl(
      name = get_data_name,
      group_variables = dplyr::groups
    ),
    counts = sfl(
      number_of_rows = nrow,
      number_of_columns = length,
      number_of_duplicate_rows = ~ sum(duplicated(.))
    ),
    .include_column_types = TRUE
  )
)

The last part is set as a function argument, since counting column types is something we currently do on the skim_df result. The other option there would be to support name = function() values, where the function returns something that can be coerced into stat name - value pairs, and the argument name becomes the name of the group. That's a lot more flexible, and could probably support summary functions that tell you which columns are most similar or something like that.
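The name = function() option could work roughly like the base R sketch below, using column-type counts as the example. Both count_column_types() and as_stat_rows() are hypothetical names invented here, and the coercion rule (any named vector becomes name-value rows) is an assumption about how such an interface might behave.

```r
# Hypothetical summary function: returns named counts, one per column class.
count_column_types <- function(data) {
  types <- vapply(data, function(col) class(col)[1], character(1))
  table(types)
}

# Coerce any named result into stat name - value rows, using the
# argument name as the group label.
as_stat_rows <- function(group_name, result) {
  data.frame(
    stat_group = group_name,
    stat_name = names(result),
    value = as.character(as.vector(result)),
    stringsAsFactors = FALSE
  )
}

df <- data.frame(
  x = c(1, 2),
  y = c("a", "b"),
  z = c(3.5, 4.5),
  stringsAsFactors = FALSE
)
column_types <- as_stat_rows("column_types", count_column_types(df))
column_types
```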

What do you think?

elinw · 2023-01-02T11:59:33Z

I just reread this and yes I really think that an sfl for summary would be the way to go.

elinw added the enhancement label Nov 16, 2020

michaelquinn32 changed the title ~~[feature request] Duplicate rows~~ Support configurable summary() Jan 18, 2022

michaelquinn32 mentioned this issue Jan 18, 2022

partition() should have an argument to include the summary as a frame #685

Open
