Replace invalid unicode with replacement chars #12

michalmuskala · 2017-12-13T18:28:28Z

This could be an optional mode for the parser.

vorce · 2018-02-27T18:41:45Z

Hey @michalmuskala, I'm interested in trying to create a PR for this. Do you have any implementation ideas or other pointers?

michalmuskala · 2018-03-13T14:08:36Z

I think the best way to handle this would be to expand the string_decode function we're passing around in the decoder. Right now it's only passed the final string at the end of everything, but we could instead pass the entire string/7 function there and have a few versions of it. We'd probably need four versions: regular, copying, one that replaces invalid unicode but does not copy when not necessary, and one that replaces and does copy. It will be a bit of duplication, but it should be mostly copy-paste, and that way should be the most performant. We should avoid additional conditionals in the main decode loop.
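To make the "pick the function once, keep the loop branch-free" idea concrete, here is a rough sketch. It is not Jason's actual internals: the module name, the `:strict`/`:replace` modes, and the error tuple are all made up for illustration, and it leans on `String.replace_invalid/2`, which only exists in Elixir 1.16 and later.

```elixir
defmodule DecoderModeSketch do
  # Hypothetical illustration only. None of these names exist in Jason.

  # The mode is resolved to a function exactly once, up front, so the
  # code that actually handles the string never branches on the mode.
  def decode_string(bin, mode \\ :strict) when is_binary(bin) do
    string_fun = pick(mode)
    string_fun.(bin)
  end

  defp pick(:strict), do: &strict/1
  defp pick(:replace), do: &lenient/1

  # Strict variant: invalid UTF-8 is an error.
  defp strict(bin) do
    if String.valid?(bin) do
      {:ok, bin}
    else
      {:error, :invalid_byte}
    end
  end

  # Lenient variant: invalid UTF-8 is replaced with U+FFFD.
  # Requires Elixir 1.16+ for String.replace_invalid/2.
  defp lenient(bin), do: {:ok, String.replace_invalid(bin)}
end
```

With this shape, `DecoderModeSketch.decode_string(<<"ab", 0xFF>>)` returns an error tuple, while passing `:replace` as the second argument returns the string with the bad byte substituted. A real decoder would pass the chosen function through its continuation arguments rather than call it at the end like this.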

As for how to handle the replacement char, there are actually two ways: one is to replace every invalid byte with its own replacement char, the other is to collect each run of consecutive invalid bytes and use just one replacement char in place of the whole run. I've seen references on the internet to both methods. It would likely require some research to see which one is the most compliant with the standard. We should also probably treat invalid escapes as replacement chars and not errors in that case, so that basically every string decoding error turns into a replacement char.
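The difference between the two approaches is easiest to see side by side. The sketch below is only an illustration on plain binaries (the module and function names are invented, and it ignores JSON escapes entirely); it relies on the `utf8` segment type in binary patterns, which only matches well-formed code points.

```elixir
defmodule ReplacementSketch do
  # Hypothetical illustration of the two strategies; not part of Jason.
  @replacement "\uFFFD"

  # Strategy 1: every invalid byte becomes its own U+FFFD.
  def per_byte(bin) when is_binary(bin), do: per_byte(bin, <<>>)

  defp per_byte(<<cp::utf8, rest::binary>>, acc),
    do: per_byte(rest, <<acc::binary, cp::utf8>>)

  defp per_byte(<<_invalid, rest::binary>>, acc),
    do: per_byte(rest, acc <> @replacement)

  defp per_byte(<<>>, acc), do: acc

  # Strategy 2: a run of consecutive invalid bytes becomes a single U+FFFD.
  # The boolean tracks whether we are already inside such a run.
  def collapsed(bin) when is_binary(bin), do: collapsed(bin, <<>>, false)

  defp collapsed(<<cp::utf8, rest::binary>>, acc, _in_run),
    do: collapsed(rest, <<acc::binary, cp::utf8>>, false)

  defp collapsed(<<_invalid, rest::binary>>, acc, true),
    do: collapsed(rest, acc, true)

  defp collapsed(<<_invalid, rest::binary>>, acc, false),
    do: collapsed(rest, acc <> @replacement, true)

  defp collapsed(<<>>, acc, _in_run), do: acc
end

input = <<"a", 0xFF, 0xFE, "b">>
ReplacementSketch.per_byte(input)   # "a��b"
ReplacementSketch.collapsed(input)  # "a�b"
```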

Moosieus · 2023-08-14T17:39:19Z

This seems to be an infrequent but challenging issue that people still encounter.

Shortly after the last post on this issue, the Unicode Standard was updated to promote W3C's standard for consistent substitution (see here, under the heading "U+FFFD Substitution of Maximal Subparts").

The basic gist is:

- Valid-but-truncated code sequences get replaced with one "�".
- All other illegal bytes are replaced with one "�" each.
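For anyone who wants to see those two rules in code, here is a minimal sketch using binary pattern matching. It is an illustration written for this thread, not the library linked above and not anything in Elixir or OTP; the module name and guards are invented. The `second_ok` guard encodes the allowed second-byte ranges for 3- and 4-byte sequences from the Unicode Standard's table of well-formed UTF-8 byte sequences.

```elixir
defmodule MaximalSubparts do
  # Hypothetical sketch of "U+FFFD Substitution of Maximal Subparts".
  @replacement "\uFFFD"

  # A generic continuation byte.
  defguardp cont(b) when b in 0x80..0xBF

  # Valid second byte for a given 3- or 4-byte lead byte.
  defguardp second_ok(lead, b)
            when (lead == 0xE0 and b in 0xA0..0xBF) or
                   (lead in 0xE1..0xEC and b in 0x80..0xBF) or
                   (lead == 0xED and b in 0x80..0x9F) or
                   (lead in 0xEE..0xEF and b in 0x80..0xBF) or
                   (lead == 0xF0 and b in 0x90..0xBF) or
                   (lead in 0xF1..0xF3 and b in 0x80..0xBF) or
                   (lead == 0xF4 and b in 0x80..0x8F)

  def replace(bin) when is_binary(bin), do: do_replace(bin, <<>>)

  # Well-formed code point: copy it through.
  defp do_replace(<<cp::utf8, rest::binary>>, acc),
    do: do_replace(rest, <<acc::binary, cp::utf8>>)

  # Truncated 4-byte sequence with three valid bytes: one U+FFFD.
  defp do_replace(<<lead, b2, b3, rest::binary>>, acc)
       when lead in 0xF0..0xF4 and second_ok(lead, b2) and cont(b3),
       do: do_replace(rest, acc <> @replacement)

  # Truncated 3- or 4-byte sequence with two valid bytes: one U+FFFD.
  defp do_replace(<<lead, b2, rest::binary>>, acc)
       when lead in 0xE0..0xF4 and second_ok(lead, b2),
       do: do_replace(rest, acc <> @replacement)

  # Anything else (lone lead, stray continuation, illegal byte): one U+FFFD per byte.
  defp do_replace(<<_byte, rest::binary>>, acc),
    do: do_replace(rest, acc <> @replacement)

  defp do_replace(<<>>, acc), do: acc
end

input = <<0x61, 0xF1, 0x80, 0x80, 0xE1, 0x80, 0xC2, 0x62, 0x80, 0x63, 0x80, 0xBF, 0x64>>
MaximalSubparts.replace(input)  # "a���b�c��d"
```

In the sample input, `F1 80 80` and `E1 80` are truncated sequences and each collapse to a single "�", while the lone `C2`, `80`, `80`, and `BF` bytes each get their own.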

I couldn't find anything in Elixir or Erlang that did this, so I wrote my own. Ideally though Elixir or OTP would provide a native solution (as other languages do).

Moosieus · 2023-12-20T23:39:34Z

Would it be possible to integrate the matches from String.replace_invalid/2 into the existing Jason.Decoder.string functions?
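Until something like that is integrated, one workaround is to sanitize the raw input before it reaches the decoder. This is only a sketch: it assumes Elixir 1.16 or later (where `String.replace_invalid/2` is available), the sample payload is invented, and it costs an extra pass over the input compared to doing the replacement inside the decoder's string functions.

```elixir
# A JSON payload with a stray 0xC3 byte (a truncated 2-byte sequence)
# inside a string value. Invented sample data.
raw = <<"{\"name\": \"caf", 0xC3, "\"}">>

String.valid?(raw)
# false

# Requires Elixir 1.16+. The default replacement is U+FFFD.
clean = String.replace_invalid(raw)
# clean is now {"name": "caf�"}

String.valid?(clean)
# true

# clean can then be handed to the decoder as usual, for example
# Jason.decode(clean), assuming the jason dependency is available.
```

Note that this sanitizes the whole document, not just string contents, and it does not address invalid escape sequences.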

michalmuskala added the enhancement and help wanted labels on Jan 26, 2018

kamenlitchev mentioned this issue Apr 18, 2019

Propagate 'opts'; Implement ':decimals' decoding mode #76

Closed

josevalim mentioned this issue Oct 2, 2023

Support for escape: :binary_safe #174

Closed

Moosieus mentioned this issue Oct 8, 2023

U+FFFD Substitution of Maximal Subparts elixir-unicode/unicode#7

Closed

josevalim mentioned this issue Oct 31, 2023

Support UUID livebook-dev/kino_db#56

Closed
