
Excel precision #308

Open
JanMarvin opened this issue Aug 25, 2022 · 5 comments

@JanMarvin
Owner

See this SO

@JanMarvin JanMarvin mentioned this issue Sep 9, 2022
@JanMarvin
Owner Author

JanMarvin commented Sep 12, 2022

Well, this quickly became more complex than I'd hoped 😦

> library(openxlsx2)
> # this is what I want to get
> chr <- "433256.529740090366"
> chr
[1] "433256.529740090366"
>  # from this number
> num <- 433256.529740090366
> num
[1] 433256.5

What I had assumed

We write this tiny data frame, and the value is exactly this. We open the file in Excel. We save it. The value magically changes. Conclusion: Excel is to blame. Unfortunately, this is not entirely correct.

What actually happens

> wb <- wb_workbook()$add_worksheet()$add_data(x = num)
> wb$worksheets[[1]]$sheet_data$cc$v
[1] "433256.52974009"

This is a bit strange. Why does this happen? Because we push the number through as.character().

data[sel] <- lapply(data[sel], as.character)

This is completely unaffected by openxlsx_options(), and this is where the truncation happens: as.character() uses only 15 significant digits. So far, so bad. This raises the question: can it be circumvented? Unfortunately, the answer is not simple. We could use e.g. format(num, nsmall = 12) to get the exact string representation of the number we initially added. But, and this is the catch, it would affect every value we push through. Integer-valued cells, for instance, would be written as 1.000000..., and in a world of floating point, all the other nice floats would produce equally wonderful values.
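Since R, Excel, and most other tools store these values as IEEE 754 doubles, the trade-off is easy to demonstrate outside R as well. A quick Python sketch (purely illustrative, not openxlsx2 code):

```python
num = 433256.529740090366

# 15 significant digits -- what as.character() effectively produces -- lose information:
short = f"{num:.15g}"
print(short)                 # "433256.52974009"
print(float(short) == num)   # False: the round trip fails

# 17 significant digits always round-trip an IEEE 754 double:
exact = f"{num:.17g}"
print(float(exact) == num)   # True

# But a fixed number of decimals (the nsmall approach) pads every value:
print(f"{1:.12f}")           # "1.000000000000"
```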

What to do?

Hence, the least we should do is document this behaviour in the package. Unless whoever writes to Excel converts their values to strings with a specific precision, there is no way the full precision will survive.

Another possible way would be to provide an "nsmall" option and use lapply(..., format, nsmall) when writing. I've added this in an example file here: From_R.xlsx. After saving it in Excel on Mac, it becomes this From_Excel.xlsx: 433256.52974009037.

What happens on the Excel side of things?

Luckily, Excel doesn't do any better. If I simply copy & paste num into a cell in Excel, the value written is also slightly off. It returns:

"433256.52974009002"

So there is some truncation going on in Excel as well (as I stated in the initial SO). And I'm unable to recover the final 2:

> wb$worksheets[[1]]$sheet_data$cc
   r row_r c_r c_s c_t c_cm c_ph c_vm                   v               f f_t f_ref f_ca f_si is
1 A1     1   A                         433256.52974009002                                       
2 B1     1   B   1                    4.3325652974009E+16 A1*100000000000     

Now, when converting this value into a number, the following happens in R:

> options(scipen = 200)
> as.numeric("433256.52974009002") * 1e12
[1] 433256529740090048

Ergo, we are unable to recreate Excel's precision, and we should document this as well.
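A quick cross-check (Python here, but these are the same IEEE 754 doubles R uses) confirms that Excel's stored string no longer denotes the double we started from:

```python
num = 433256.529740090366             # the value we wrote from R
excel = float("433256.52974009002")   # the string Excel stored after copy & paste

# The two strings map to two different doubles, so the original
# value genuinely cannot be recovered from Excel's output.
print(excel == num)   # False
```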

@jmbarbone
Collaborator

You know it's complicated when there's an entire Wikipedia article on it: Numeric precision in Microsoft Excel.

For data going into Excel, can we just supply how R interprets the values and let Excel perform its own truncation? If we're writing these as strings, maybe leveraging that format() approach is better for preserving as much information as possible.

For data coming out of Excel, maybe we can take a cue from utils::read.table(), which leverages utils::type.convert(numerals = "no.loss") to read numbers in as strings. Users adjusting or computing on numeric data from Excel should be cautious about precision differences between software. However, an option to not touch the numbers at all may be good if what they're doing in {openxlsx2} doesn't require it.
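The point of numerals = "no.loss" can be illustrated with a small sketch (a Python stand-in; type.convert() itself is R): converting a long numeral to a double and printing it back can never reproduce all its digits, while the string keeps them all.

```python
s = "433256.529740090366"   # 18 significant digits in the file

# Converting to a double drops precision: the shortest round-trip
# representation of a double needs at most 17 significant digits,
# so the original 18-digit string cannot come back.
print(repr(float(s)) == s)   # False

# Keeping the cell as a string preserves every digit:
print(s)
```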

@JanMarvin
Owner Author

Well, reading is not affected by this. If all you want to do is add another sheet, there is no conversion to numeric. The conversion only occurs if we convert the cell character to numeric, so truncation would only happen if a cell is replaced with a numeric. I tend not to change reading from xlsx and conversion to an R data frame. The part where we write numbers can use format(); that should be simple to integrate. I can open a pull request soon™

@JanMarvin
Owner Author

format(..., nsmall = ..., drop0trailing = TRUE) might be what we want.
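For illustration, the combined effect of a fixed nsmall and drop0trailing = TRUE can be sketched like this (a hypothetical Python helper, not the openxlsx2 implementation):

```python
def to_chr(x, nsmall=12):
    # Fixed number of decimals (the nsmall part), then strip the
    # trailing zeros again (the drop0trailing = TRUE part).
    return f"{x:.{nsmall}f}".rstrip("0").rstrip(".")

print(to_chr(1.0))                  # "1" -- integers stay clean
print(to_chr(433256.529740090366))  # keeps enough decimals to round-trip
print(float(to_chr(433256.529740090366)) == 433256.529740090366)  # True
```

Because 12 decimals resolve far below the spacing of doubles at this magnitude, the formatted string still round-trips to the original double.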

@JanMarvin
Owner Author

I've pushed a pull request that uses format(), though I'm not convinced that it is the right solution. As can be seen, we blow up our floating-point data quite a bit, even with fewer digits. The underlying issue, of course, is that with floating points it's always the question how many digits are actually required and from which point on it's all just fictional.

After all, we need to make sure that the extra digits we write actually serve a benefit and that we are not just blowing up the file size to pat ourselves on the back for our great precision, while opening the door for other potential corner cases.

> options(digits = 22)
> #' Convert number to character
> #' @param x input integer or numeric
> #' @noRd
> as_character <- function(x) {
+   is_na <- is.na(x) & !is.nan(x)
+   z <- format(x, trim = TRUE, digits = options()$digits, format = "g")
+   if (any(is_na)) z[is_na] <- NA_character_
+   z
+ }
> # want a bunch of different digits?
> sprintf("%.20f", as.numeric(openxlsx2:::as_character(3123231.32455432123)))
[1] "3123231.32455432135611772537"
> openxlsx2:::as_character(3123231.32455432123)
[1] "3123231.324554320424795"
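The "fictional digits" point is easy to verify: beyond 17 significant digits, a longer string denotes the very same double (Python again, same IEEE 754 doubles as R):

```python
x = 3123231.32455432123

s17 = f"{x:.17g}"
s21 = f"{x:.21g}"

print(s17)
print(s21)                       # longer...
print(float(s21) == float(s17))  # True: the extra digits carry no information
print(float(s17) == x)           # True: 17 digits already round-trip
```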
