Truncated factor levels #746

Open
jarbet opened this issue Oct 22, 2024 · 5 comments


jarbet commented Oct 22, 2024

By default, it seems skim truncates long factor levels. Is there an option to print the entire factor level?

suppressPackageStartupMessages(library(skimr));
suppressPackageStartupMessages(library(rchallenge));

data('german', package = 'rchallenge');

# notice the factor levels are truncated in the skim output
table(german$status);
#> 
#>                        no checking account 
#>                                        274 
#>                                 ... < 0 DM 
#>                                        269 
#>                           0<= ... < 200 DM 
#>                                         63 
#> ... >= 200 DM / salary for at least 1 year 
#>                                        394
table(german$credit_history);
#> 
#>             delay in paying off in the past 
#>                                          40 
#>    critical account/other credits elsewhere 
#>                                          49 
#> no credits taken/all credits paid back duly 
#>                                         530 
#>    existing credits paid back duly till now 
#>                                          88 
#>     all credits at this bank paid back duly 
#>                                         293
table(german$purpose);
#> 
#>              others           car (new)          car (used) furniture/equipment 
#>                 234                 103                 181                 280 
#>    radio/television domestic appliances             repairs           education 
#>                  12                  22                  50                   0 
#>            vacation          retraining            business 
#>                   9                  97                  12
skim(german)
| | |
|:---|:---|
| Name | german |
| Number of rows | 1000 |
| Number of columns | 21 |
| _______________________ | |
| Column type frequency: | |
| factor | 18 |
| numeric | 3 |
| ________________________ | |
| Group variables | None |

Data summary

**Variable type: factor**

| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|:---|---:|---:|:---|---:|:---|
| status | 0 | 1 | FALSE | 4 | …: 394, no : 274, …: 269, 0<=: 63 |
| credit_history | 0 | 1 | FALSE | 5 | no : 530, all: 293, exi: 88, cri: 49 |
| purpose | 0 | 1 | FALSE | 10 | fur: 280, oth: 234, car: 181, car: 103 |
| savings | 0 | 1 | FALSE | 5 | unk: 603, …: 183, …: 103, 100: 63 |
| employment_duration | 0 | 1 | FALSE | 5 | 1 <: 339, >= : 253, 4 <: 174, < 1: 172 |
| installment_rate | 0 | 1 | TRUE | 4 | < 2: 476, 25 : 231, 20 : 157, >= : 136 |
| personal_status_sex | 0 | 1 | FALSE | 4 | mal: 548, fem: 310, fem: 92, mal: 50 |
| other_debtors | 0 | 1 | FALSE | 3 | non: 907, gua: 52, co-: 41 |
| present_residence | 0 | 1 | TRUE | 4 | >= : 413, 1 <: 308, 4 <: 149, < 1: 130 |
| property | 0 | 1 | FALSE | 4 | bui: 332, unk: 282, car: 232, rea: 154 |
| other_installment_plans | 0 | 1 | FALSE | 3 | non: 814, ban: 139, sto: 47 |
| housing | 0 | 1 | FALSE | 3 | ren: 714, for: 179, own: 107 |
| number_credits | 0 | 1 | TRUE | 4 | 1: 633, 2-3: 333, 4-5: 28, >= : 6 |
| job | 0 | 1 | FALSE | 4 | ski: 630, uns: 200, man: 148, une: 22 |
| people_liable | 0 | 1 | FALSE | 2 | 0 t: 845, 3 o: 155 |
| telephone | 0 | 1 | FALSE | 2 | no: 596, yes: 404 |
| foreign_worker | 0 | 1 | FALSE | 2 | no: 963, yes: 37 |
| credit_risk | 0 | 1 | FALSE | 2 | goo: 700, bad: 300 |

**Variable type: numeric**

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|:---|---:|---:|---:|---:|---:|---:|---:|---:|---:|:---|
| duration | 0 | 1 | 20.90 | 12.06 | 4 | 12.0 | 18.0 | 24.00 | 72 | ▇▇▂▁▁ |
| amount | 0 | 1 | 3271.25 | 2822.75 | 250 | 1365.5 | 2319.5 | 3972.25 | 18424 | ▇▂▁▁▁ |
| age | 0 | 1 | 35.54 | 11.35 | 19 | 27.0 | 33.0 | 42.00 | 75 | ▇▆▃▁▁ |

Created on 2024-10-22 with reprex v2.1.1


jxu commented Nov 25, 2024

This is handled by the skimmer function top_counts. Maybe you can change max_char:

skimr/R/stats.R

Lines 65 to 78 in d5126aa

#' @describeIn stats Compute and collapse a contingency table into a single
#'   character scalar. Wraps [sorted_count()].
#' @param max_levels The maximum number of levels to be displayed.
#' @export
top_counts <- function(x, max_char = 3, max_levels = 4) {
  counts <- sorted_count(x)
  if (length(counts) > max_levels) {
    top <- counts[seq_len(max_levels)]
  } else {
    top <- counts
  }
  top_names <- substr(names(top), 1, max_char)
  paste0(top_names, ": ", top, collapse = ", ")
}
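
To see what max_char does in isolation, here is a small base-R sketch that mirrors the logic quoted above. It is only an illustration, not skimr's code: `top_counts_sketch` is a hypothetical name, and `sort(table(x), decreasing = TRUE)` stands in for `sorted_count()`, which may order ties or treat missing values differently.

```r
# Hypothetical stand-in for skimr's top_counts(), using only base R.
# Assumption: sort(table(x)) approximates skimr's sorted_count().
top_counts_sketch <- function(x, max_char = 3, max_levels = 4) {
  counts <- sort(table(x), decreasing = TRUE)
  # Keep at most max_levels of the most frequent levels
  top <- counts[seq_len(min(length(counts), max_levels))]
  # This substr() call is where long level names get cut off
  top_names <- substr(names(top), 1, max_char)
  paste0(top_names, ": ", top, collapse = ", ")
}

# Toy factor reusing three level names from german$status
x <- factor(c(
  rep("no checking account", 3),
  rep("0<= ... < 200 DM", 2),
  "... < 0 DM"
))

short <- top_counts_sketch(x)                 # names cut to 3 characters
long  <- top_counts_sketch(x, max_char = 20)  # full names survive
print(short)
print(long)
```

With the default of 3, several levels collapse to the same unreadable prefix, which matches what the skim output in the original report shows.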

Author

jarbet commented Nov 25, 2024

top_counts <- function(x, max_char = 3, max_levels = 4) {
  counts <- sorted_count(x)
  if (length(counts) > max_levels) {
    top <- counts[seq_len(max_levels)]
  } else {
    top <- counts
  }
  top_names <- substr(names(top), 1, max_char)
  paste0(top_names, ": ", top, collapse = ", ")
}

How can I pass max_char = 20 to skim?


jxu commented Nov 25, 2024

I guess you can make a custom skim function? I haven't tried it

Author

jarbet commented Nov 26, 2024

> I guess you can make a custom skim function? I haven't tried it

Hmm, I looked at the source code for skim and it does not call the top_counts function directly, so I can't just write top_counts2 and skim2 functions. Do you know what other function I'd need to edit to get this to work? If not, I'll probably just need to try a different package.

Collaborator

elinw commented Nov 26, 2024

Yes, you could definitely write top_counts2 or any other additional function that you want to include in a custom skim. The way skimr is written gives you complete flexibility for this, rather than adding a million options to meet individual needs. That said, we did quite a bit of experimenting with the top_counts function when we changed the skimr API. Skimr is "compact" on purpose.
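
For anyone landing here later, a custom skim along those lines might look roughly like the sketch below. It has not been confirmed in this thread. It assumes skimr 2.x is installed, and that a skimmer supplied to `skim_with()` under an existing name (here `top_counts`) replaces the default for that column type. `skim_wide` is a hypothetical name, and `iris` is used only so the example runs without extra packages.

```r
# A sketch, assuming skimr 2.x; skipped entirely if skimr is not installed.
if (requireNamespace("skimr", quietly = TRUE)) {
  # Build a custom skim function. The factor skimmer named top_counts is
  # redefined to keep up to 20 characters of each level name.
  # Assumption: a same-named skimmer overrides the default one.
  skim_wide <- skimr::skim_with(
    factor = skimr::sfl(
      top_counts = ~ skimr::top_counts(., max_char = 20)
    )
  )

  # Use it exactly like skim(); substitute your own data frame here.
  print(skim_wide(iris))
}
```

The same pattern would take `max_levels` as well, if showing more than the four most common levels matters. Wider labels make the summary table wider, which is the trade-off behind the compact default.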
