Replace invalid unicode with replacement chars #12

Open
michalmuskala opened this issue Dec 13, 2017 · 4 comments

@michalmuskala
Owner

This could be an optional mode for the parser.

@vorce

vorce commented Feb 27, 2018

Hey @michalmuskala I'm interested in trying to create a PR for this. Do you have any implementation ideas or other pointers?

@michalmuskala
Owner Author

I think the best way to handle this would be to expand the string_decode function we're passing around in the decoder. Right now it's only passed the final string at the end of everything, but we could instead pass the entire string/7 function there and have a few versions of it. We'd probably need four versions: regular, copying, one that replaces invalid unicode but does not copy when not necessary, and one that replaces and does copy. It will be a bit of duplication, but it should be mostly copy-paste, and that way should be the most performant. We should avoid additional conditionals in the main decode loop.

As to how to handle the replacement char, there are actually two ways: one is to replace every invalid byte with one replacement char, the other is to collect all consecutive invalid bytes and use just one replacement char in place of them. I've seen references on the internet to both methods. It would likely require some research to see which one is the most compliant with the standard. We should also probably treat invalid escapes as replacement chars and not errors in that case, basically turning every string decoding error into replacement chars.
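To make the difference between the two approaches concrete, here is a minimal sketch in Python (used purely for illustration; this is not Jason's code). The function name `decode_lossy` and its `collapse` flag are hypothetical. It leans on Python's strict UTF-8 decoder to locate each offending byte, so it is simple rather than fast.

```python
REPLACEMENT = "\ufffd"


def decode_lossy(data: bytes, collapse: bool) -> str:
    """Decode UTF-8, substituting U+FFFD for invalid bytes.

    collapse=False: one replacement char per invalid byte.
    collapse=True:  one replacement char per run of invalid bytes.
    (Hypothetical helper, for illustration only.)
    """
    out = []
    pos = 0
    last_was_replacement = False
    while pos < len(data):
        try:
            # The rest of the input is valid: take it and stop.
            out.append(data[pos:].decode("utf-8"))
            break
        except UnicodeDecodeError as err:
            # err.start is the offset of the first bad byte within data[pos:].
            good = data[pos:pos + err.start].decode("utf-8")
            if good:
                out.append(good)
                last_was_replacement = False
            if not (collapse and last_was_replacement):
                out.append(REPLACEMENT)
            last_was_replacement = True
            # Skip exactly one byte and retry from the next one.
            pos += pos_step if (pos_step := err.start + 1) else 1
    return "".join(out)


print(ascii(decode_lossy(b"a\xff\xfeb", collapse=False)))
print(ascii(decode_lossy(b"a\xff\xfeb", collapse=True)))
```

With the input `a`, `0xFF`, `0xFE`, `b`, the per-byte variant yields two replacement chars between `a` and `b`, while the collapsing variant yields one.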

@Moosieus

This seems to be an infrequent but challenging issue that people still encounter.

Shortly after the last post on this issue, the Unicode Standard was updated to promote W3C's standard for consistent substitution (described under the heading "U+FFFD Substitution of Maximal Subparts").

The basic gist is:

  • Valid-but-truncated code sequences get replaced with one "�".
  • All other illegal bytes are replaced with one "�" each.
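As a rough illustration of those two rules, here is a sketch in Python (for illustration only; it is not the implementation mentioned in this comment, and not Jason's code). The names `replace_maximal_subparts` and `_continuation_ranges` are hypothetical. The byte ranges follow the well-formed UTF-8 byte sequence table in the Unicode Standard, and the sample input is the example sequence the standard uses for this practice.

```python
REPLACEMENT = "\ufffd"


def _continuation_ranges(lead: int):
    """Allowed (low, high) ranges for the bytes that may follow `lead`.

    Returns None when `lead` cannot start a multi-byte sequence.
    """
    if 0xC2 <= lead <= 0xDF:
        return [(0x80, 0xBF)]
    if lead == 0xE0:
        return [(0xA0, 0xBF), (0x80, 0xBF)]
    if 0xE1 <= lead <= 0xEC or 0xEE <= lead <= 0xEF:
        return [(0x80, 0xBF), (0x80, 0xBF)]
    if lead == 0xED:
        return [(0x80, 0x9F), (0x80, 0xBF)]
    if lead == 0xF0:
        return [(0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)]
    if 0xF1 <= lead <= 0xF3:
        return [(0x80, 0xBF), (0x80, 0xBF), (0x80, 0xBF)]
    if lead == 0xF4:
        return [(0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)]
    return None


def replace_maximal_subparts(data: bytes) -> str:
    """Decode UTF-8, emitting one U+FFFD per maximal invalid subpart."""
    out = []
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:
            out.append(chr(lead))  # plain ASCII
            i += 1
            continue
        ranges = _continuation_ranges(lead)
        if ranges is None:
            out.append(REPLACEMENT)  # illegal byte: one replacement each
            i += 1
            continue
        # Consume as many bytes as still match a well-formed sequence.
        j = i + 1
        for low, high in ranges:
            if j < n and low <= data[j] <= high:
                j += 1
            else:
                break
        if j - i == len(ranges) + 1:
            out.append(data[i:j].decode("utf-8"))  # complete sequence
        else:
            out.append(REPLACEMENT)  # valid-but-truncated prefix: one replacement
        i = j
    return "".join(out)


sample = bytes.fromhex("61f18080e180c26280638 0bf64".replace(" ", ""))
print(ascii(replace_maximal_subparts(sample)))
```

For that sample, the truncated prefixes `F1 80 80`, `E1 80`, and `C2` each become a single "�", while the stray `80`, `80`, and `BF` bytes each get their own.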

I couldn't find anything in Elixir or Erlang that did this, so I wrote my own. Ideally, though, Elixir or OTP would provide a native solution (as other languages do).

@Moosieus

Would it be possible to integrate the matches from String.replace_invalid/2 into the existing Jason.Decoder.string functions?
