Numbers that should be zero are displayed in scientific notation #479

Open
jmgirard opened this issue Jul 21, 2019 · 19 comments

Comments

@jmgirard

This may be an issue with R as opposed to skimr, but it always bothers me. I'd appreciate help if you have any. Basically, numbers that should be essentially zero are displayed in scientific notation. Is there some way to avoid this other than the round() function? Here is an example:

library(tibble)
library(dplyr)
library(skimr)
t <- tibble::tibble(
  x = c(7.250, 71.283, 7.925, 53.100, 8.050, 8.458, 51.862, 21.075, 11.133, 30.071)
)
t <- dplyr::mutate(t, 
  x_c = x - mean(x), 
  x_z = x_c / sd(x)
)
skim(t)
#> Skim summary statistics
#>  n obs: 10 
#>  n variables: 3 
#> 
#> -- Variable type:numeric -------------------------------------------------------
#>  variable missing complete  n     mean   sd     p0    p25    p50   p75  p100     hist
#>         x       0       10 10 27.02    23.6   7.25   8.15  16.1  46.41 71.28 ▇▂▂▁▁▃▁▂
#>       x_c       0       10 10 -8.9e-16 23.6 -19.77 -18.87 -10.92 19.39 44.26 ▇▂▂▁▁▃▁▂
#>       x_z       0       10 10 -3.3e-17  1    -0.84  -0.8   -0.46  0.82  1.88 ▇▂▂▁▁▃▁▂
@elinw
Collaborator

elinw commented Jul 22, 2019

I can confirm this is also true in v2. This is odd to me given that we spent so much effort on thinking about how many digits to display. Does only displaying 4 for x make sense given that the values start with 3 digits to the right of the decimal? @michaelquinn32
It works as I would expect when knitting or when using kable() in the console.

@michaelquinn32
Collaborator

Scientific notation appearing like that is a product of using floating point numbers. It's bigger than skimr or even R.
http://www.lahey.com/float.htm

I'll keep thinking about it, but I'm not sure of a way to get it right:

  • We don't want to round the underlying data, because that's adding computation and moving us further away from "truth," what the user really wanted to compute.
  • We could round when printing, but what if small values like .00001 are meaningfully different from .000001? Rounding is going to mash that.

Thanks for bringing this up!
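
To make the floating-point point concrete, here is a minimal base R sketch that reuses the data from the original report. It only illustrates that the centered mean is zero up to rounding error and how to test for that with a tolerance. The names `residue`, `tol`, and `near_zero` are made up for this example, and the exact residue may differ by platform.

```r
# Same data as in the original report.
x <- c(7.250, 71.283, 7.925, 53.100, 8.050, 8.458, 51.862, 21.075, 11.133, 30.071)
x_c <- x - mean(x)

# Mathematically the mean of the centered values is 0. In double precision
# it can come out as a tiny non-zero residue instead.
residue <- mean(x_c)

# Compare against a tolerance rather than testing `residue == 0`.
tol <- sqrt(.Machine$double.eps)  # roughly 1.5e-8
near_zero <- abs(residue) < tol
print(near_zero)
```

The tolerance here is an arbitrary choice for illustration. As noted above, a value like .00001 can be meaningful, so any threshold is a judgment call about the data.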

@elinw
Collaborator

elinw commented Jul 23, 2019

I think we just need to show 0 in these cases, which is what happens when you knit. We don't want to change the underlying data that is passed out of skimr but I think we want to display the data in expected ways. Among other things it is strange to see just one number with scientific notation in the column.
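
One base R way to get this kind of display-only zero is `zapsmall()`, which rounds values relative to the largest absolute value in the vector. The sketch below is not what skimr or knitr does internally; the `stats` vector is a hypothetical row of summary statistics built from the numbers in the original report.

```r
# Hypothetical row of summary statistics: a mean that is zero up to
# floating-point error, next to an ordinary-sized standard deviation.
stats <- c(mean = -8.9e-16, sd = 23.6)

# zapsmall() returns a rounded copy; values that are negligible relative to
# the largest entry become 0.
shown <- zapsmall(stats)
print(shown)

# The original vector is untouched, so later computations keep full precision.
print(stats[["mean"]] != 0)
```

The trade-off is the one raised earlier in the thread: because the rounding is relative to the largest value, a small but meaningful number sitting next to a large one would also be shown as 0.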

@elinw
Collaborator

elinw commented Aug 6, 2019

I guess the issue is that some numbers probably should appear in scientific notation and some shouldn't, then the question is how to control that. I'm wondering if this is a situation where the user should have to take some action, which could range from creating a custom statistic to having some kind of setting that could be defined. But I worry about opening the door to too many settings. Though I do agree that we don't want to force the user to make the underlying values rounded.

@elinw
Collaborator

elinw commented Nov 9, 2019

I'm putting help wanted on this since we'd be willing to look at a contributed solution.

@michaelquinn32
Collaborator

This is something that comes up in the pillar release notes:
https://www.tidyverse.org/blog/2018/03/pillar-1-2-1/

Small numbers can be meaningfully different. I believe that it's a mistake to round.

tiny <- c(1e-310, 1e-320, 1e-330)
tiny

#> [1] 1.000000e-310 9.999889e-321  0.000000e+00

tibble(tiny)

#> # A tibble: 3 x 1
#>         tiny
#>        <dbl>
#> 1  1.00e-310
#> 2 10.00e-321
#> 3  0.   

The more I think about it, the more I believe that we should be pushing as much printing-related logic as possible into pillar. This seems to be the general tidyverse approach. And it also means that a PR in this case might not be appropriate.

@jmgirard
Author

> Small numbers can be meaningfully different. I believe that it's a mistake to round.

I think the key phrase there is "can be". So maybe add an optional argument specifying the threshold at which to round?

@michaelquinn32
Collaborator

I think the option then is to open a pillar PR where you can control scientific notation. In base R, that would be something like options(scipen=999).
https://stat.ethz.ch/R-manual/R-devel/library/base/html/options.html

There is a separate discussion on this in the RStudio forum:
https://community.rstudio.com/t/how-to-disable-scientific-notation-in-tibble/64006
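
For reference, this is roughly what `scipen` does in base R. The sketch only covers base `format()`; whether skimr's or pillar's printing is affected by this option is not established in this thread, so treat that as an open question. The names `tiny`, `default_str`, and `fixed_str` are made up for the example.

```r
tiny <- 1e-16

# With the default penalty of 0, R picks whichever notation is narrower,
# which for this value is scientific.
old <- options(scipen = 0)
default_str <- format(tiny)

# A large positive penalty biases base R formatting toward fixed notation.
options(scipen = 999)
fixed_str <- format(tiny)

# Restore the user's original setting.
options(old)

print(default_str)
print(fixed_str)
```

Note that `scipen` changes notation, not precision: the value is still printed in full, just as a long string of zeros rather than as 0.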

@elinw
Collaborator

elinw commented Oct 18, 2020

Part of the issue is if you only want to do this for print or if you really want to round the data in the skimr object. If you are not going to do further calculations with the object and therefore don't care about the accuracy you could write a custom skim that rounds.
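
A custom skim that rounds could look something like the sketch below. `rounded_mean()` is a hypothetical helper, not part of skimr, and the skimr call is guarded so the snippet still runs where the package is not installed. It assumes skimr v2, where `skim_with()` and `sfl()` are the documented way to override a default statistic by name.

```r
# Hypothetical helper (not part of skimr): a mean rounded to `digits` decimals.
rounded_mean <- function(x, digits = 3) {
  round(mean(x, na.rm = TRUE), digits)
}

# Centered data from the original report.
x <- c(7.250, 71.283, 7.925, 53.100, 8.050, 8.458, 51.862, 21.075, 11.133, 30.071)
x_c <- x - mean(x)
print(rounded_mean(x_c))

# Assumes skimr v2: a skimmer named `mean` replaces the default `mean`.
if (requireNamespace("skimr", quietly = TRUE)) {
  my_skim <- skimr::skim_with(numeric = skimr::sfl(mean = rounded_mean))
  print(my_skim(data.frame(x_c = x_c)))
}
```

As the comment above points out, this rounds the statistic that is stored in the skim object, not just what is printed, so it is only appropriate when the result will not feed further calculations.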

@jmgirard
Author

> Part of the issue is if you only want to do this for print or if you really want to round the data in the skimr object. If you are not going to do further calculations with the object and therefore don't care about the accuracy you could write a custom skim that rounds.

This is a good point. For my use case, at least, I'd like the underlying data to be unchanged and just the printing to be rounded.

@ben-schwen

@jmgirard that's exactly what #620 was supposed to do...

@jmgirard
Author

@ben-schwen Were you waiting on a response from me on this?

@elinw
Collaborator

elinw commented Jan 26, 2021

So I would not be in favor of adding arguments. That defeats the purpose of a simple function. One possibility would be to add a special function, the way we have with skim_without_charts(), but I'm not totally sure how that would work. I think it would have to be print_without_scientific_notation, since that's what we are really talking about.
In version 1 we had a way to do custom formatting, but we gave that up to be able to solve a lot of other problems and have a really usable object. I think I'd be less opposed to introducing an argument to print.skim. Is there a function that detects this? I think that would be the first challenge.

@jmgirard
Author

I think something like print(skim(df), digits = 3) would be nice.
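
One thing worth keeping in mind for a `digits`-style argument: in base R, `digits` means significant digits, which on its own does not turn a tiny residue into 0. The sketch below uses base `format()` and `round()` only and assumes the default `scipen` of 0. It says nothing about what a print method for skim objects accepts.

```r
# The mean reported for x_c in the original post.
residue <- -8.9e-16

# Significant digits: the value keeps its magnitude, so it is still shown
# in scientific notation.
sig3 <- format(residue, digits = 3)
print(sig3)

# Decimal places: rounding to 3 decimals collapses the residue to zero.
rounded3 <- round(residue, digits = 3)
print(rounded3)
```

So a print option along these lines would need decimal-place semantics (or a near-zero threshold) to address the case in this issue.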

@elinw
Collaborator

elinw commented Jan 27, 2021 via email

@jxu

jxu commented Nov 22, 2024

Has there been any resolution to this? I have a bunch of variables that are percentages and I'd rather see them as 63.9% or 0.639 instead of 6.39e-1. This seems to be caused by other variables having very large values like 8.38e7.
I'm aware I can write a custom statistic function, but this seems like a common enough task that it should be built-in.

@elinw
Collaborator

elinw commented Nov 24, 2024 via email

@jxu

jxu commented Nov 25, 2024

I guess the simplest is making the object respect print digits options like @jmgirard suggested.
I solved my issue by omitting the variable columns with very large values, but I think the print options for each stat (like the mean) should be based on the range of each variable, not on the range of the stats for all variables.
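
The effect described here can be reproduced in base R, which picks one notation for a whole vector. This is only an analogy using base `format()`, not pillar's or skimr's actual formatting code, and it assumes the default `scipen` of 0. The `stats` vector is a hypothetical column of means using the two values mentioned above.

```r
# Hypothetical column of means: a proportion next to a very large value.
stats <- c(pct = 0.639, big = 8.38e7)

# Formatting the vector jointly forces a common notation, and the large
# value pushes the proportion into scientific notation as well.
joint <- format(stats)
print(joint)

# Formatting each value on its own lets each one use its natural notation.
separate <- vapply(stats, format, character(1))
print(separate)
```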

@elinw
Collaborator

elinw commented Nov 26, 2024

Structurally, skimr cannot work by modifying the individual functions with options. You get what you get with our opinionated defaults, and if you want something different you add your own functions in a custom skim. That's the architecture that we provide: you can do whatever you want and save your work, and you don't need to make changes to the core. The skimr vignette shows how to add your own functions, in which you can include your own options. I recommend saving your commonly used skimmer variations so that you can pull up what you want.


5 participants