Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Floating point roundtrip and loss of precision #22

Open
edwintorok opened this issue Jun 1, 2023 · 3 comments
Open

Floating point roundtrip and loss of precision #22

edwintorok opened this issue Jun 1, 2023 · 3 comments

Comments

@edwintorok
Copy link

  • I would suggest to have a flag that enforces standard compliant JSON on output and raise an exception if not.
    Users that care about performance can keep that flag off, but users that care about correctness can turn that flag on to catch bugs in the application. (and even users who care about performance may want to turn that flag on during tests).

  • Also the floating point output format should be changed from the default 'string_of_float' to at least "%.17g" (if accuracy and performance is desired), or a dynamic choice betwen %.15g,%.16g and %.17g (shortest that roundtrips), or another algorithm that ensures full accuracy of floating point data is preserved. See also Owl_dataframe shouldn' t use 'string_of_float' owlbarn/owl#640 this is quite a common trap that serialization code falls into...

The following snippet reproduces the issue:

Ezjsonm.to_string (Ezjsonm.dict ["nantest", Ezjsonm.float Float.nan; "epsilon", Ezjsonm.float (1.+.Float.epsilon)] )|> print_endline;;
{"nantest":nan,"epsilon":1}

The JSON produced is not spec conformant and cannot be parsed by anything other than jsonm:

jq <nan.jq
parse error: Invalid literal at line 1, column 15

There is also a second problem here that outputting a float loses precision and doesn't retain the full range of an IEEE-754 double. That is the fault of OCaml's default 'string_of_float', but any serialization code should use well defined precision instead of relying on the default. "%.17g" should output enough digits to fully preserve floats (albeit they may look a bit "ugly" with more digits than required).
Elsewhere a dynamic choice between "%.15g" and "%.17g" is made to fully preserve the original float (the hexadecimal encoding of floats wouldn't be valid JSON), that might be an alternative.

Yojson by default would do the same for NaN:

Yojson.to_string (`Assoc ["testnan", `Float Float.nan; "eps", `Float (1. +. Float.epsilon)])|>print_endline;;
{"testnan":NaN,"eps":1.0000000000000002}

But it has a flag to force producing standard conforming JSON (and the application can then do the necessary encoding of NaN/Infinite beforehand):

Yojson.to_string ~std:true (`Assoc ["testnan", `Float Float.nan; "eps", `Float (1. +. Float.epsilon)])|>print_endline;;
Exception: Yojson.Json_error "NaN value not allowed in standard JSON".

It also fully preserves all the digits in 'eps'.

See also #12 (comment)

With these changes I may be able to switch back to using jsonm, which I'd still prefer due to its better input validation (for now we had to switch to using yojson to ensure spec compliant output)

@edwintorok
Copy link
Author

Please let me know whether the former flag belongs in 'jsonm' (or a higher layer such as 'ezjsonm'), and what the preferred option would be for preserving float accuracy.
I'd be happy to provide pull requests implementing either once the design is agreed on.

@dbuenzli
Copy link
Owner

dbuenzli commented Jun 1, 2023

So you really prefer not to do the check for nan or infinity on the floats you encode yourself with jsonm but rather let yojson happily let you input non-validated, possibly null byte holding, strings ?

Programmers' priorities will never cease to amaze me.

  • I would suggest to have a flag that enforces standard compliant JSON on output and raise an exception if not.

The problem with this is that most programs will never see the exception anyways until their program fails at some point on production. Which is akin to discover that you are generating ill formed jsonm. I'm not sure which is better but all together it looks rather equivalent to me.

(And of course doing the check by default will entail that you do the check twice since the client still needs to do it to decide what it does with these values)

@edwintorok
Copy link
Author

edwintorok commented Jun 1, 2023

We only discovered we were generating invalid JSON after about 10 years when we looked at why one of our product's client applications was carrying a patch to the C# JSON parser: it was so that it could parse JSON produced by jsonm, but instead of that team raising a bug with us they worked it around by patching their parser.
Had we found out sooner that we are producing invalid json we could've patched the application, or some other layer (e.g. ezjsonm).

However rather than duplicating those safety checks in all applications it would be useful to have a common layer to do that (whether jsonm, ezjsonm, or something else), otherwise each application has to rediscover this bug (perhaps after many years like we did).
NaN,Infinity and -Infinity are valid IEEE-754 floating point numbers, that JSON cannot represent them is a limitation of the JSON spec and unexpected (especially if an application has multiple serializers, JSON being just one of them, AFAICT this works fine in XML where encoding a 'nan' doesn't break the grammar).

So you really prefer not to do the check for nan or infinity on the floats you encode yourself with jsonm but rather let yojson happily let you input non-validated, possibly null byte holding, strings ?

that is why I'd like to move back to using jsonm.
I'll have to do some measurements but I don't think that the additional call to classify_float would slow down an application significantly, and being correct is more important than being slightly faster. I'd like to think that 'jsonm' values correctness and safety more than absolute performance.
(if one wants something that sacrifices speed over correctness then you can use Yojson as you mention it doesn't do input validation, which saves time but can cause issues later)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants