Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not discard byte that is not \ #1056

Closed
wants to merge 1 commit into from

Conversation

djmitche
Copy link
Contributor

When parsing escaped utf-16 surrogates, they must be in the form \uHHHH\uLLLL for high and low surrogates, respectively. If the character following the last H is not \, then the parse is invalid.

In that case, do not consume the character that is not \. In this implementation, this bug is of no consequence because the invalid parse results in an error that is returned to the caller. However, serde_json_lenient can optionally replace the invalid character with REPLACEMENT CHARACTER, in which case it is important to account for consumed and unconsumed bytes correctly.

This mirrors the fix in
google/serde_json_lenient#12.

This continues to pass all tests. I have not added any new tests since the incorrect behavior has no external effect.

When parsing escaped utf-16 surrogates, they must be in the form
`\uHHHH\uLLLL` for high and low surrogates, respectively. If the
character following the last `H` is not `\`, then the parse is invalid.

In that case, do not consume the character that is not `\`. In this
implementation, this bug is of no consequence because the invalid parse
results in an error that is returned to the caller. However,
`serde_json_lenient` can optionally replace the invalid character with
REPLACEMENT CHARACTER, in which case it is important to account for
consumed and unconsumed bytes correctly.

This mirrors the fix in
google/serde_json_lenient#12.
Copy link
Member

@dtolnay dtolnay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is correct as written, and incorrect after this PR. The error function on the line after the removed line sets the error location based on read.position(), not read.peek_position(). Before this PR, you'd correctly get an error pointing to the character that was not a backslash but was required to be a backslash. After this PR, the error would be on the character before, which is not right.

fn main() {
    println!("{:?}", serde_json::from_str::<String>(r#""\ud800...""#));
}

Correct behavior:

Err(Error("unexpected end of hex escape", line: 1, column: 8))

Column 8 is:

"\ud800..."
       ^

@djmitche
Copy link
Contributor Author

Hm, I guess the distinction is that the error points to the character just read, whereas the replacement in serde_json_lenient is assuming the invalid character wasn't read. Let me see if I can find a different way to implement the REPLACEMENT CHARACTER thing, then.

@djmitche djmitche closed this Aug 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

2 participants