Do not discard byte that is not `\` #1056

djmitche · 2023-08-17T17:00:29Z

When parsing escaped UTF-16 surrogates, they must be in the form `\uHHHH\uLLLL` for high and low surrogates, respectively. If the character following the last `H` is not `\`, then the parse is invalid.

In that case, do not consume the character that is not `\`. In this implementation, this bug is of no consequence because the invalid parse results in an error that is returned to the caller. However, `serde_json_lenient` can optionally replace the invalid character with REPLACEMENT CHARACTER, in which case it is important to account for consumed and unconsumed bytes correctly.

This mirrors the fix in google/serde_json_lenient#12.

This continues to pass all tests. I have not added any new tests since the incorrect behavior has no external effect.
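
For readers unfamiliar with the `\uHHHH\uLLLL` form described above, here is a minimal sketch of how a high and a low surrogate combine into one code point. The function name `combine_surrogates` is hypothetical and is not taken from the serde_json source; it only illustrates the standard UTF-16 arithmetic.

```rust
/// Combine a UTF-16 high/low surrogate pair into a single `char`.
/// Hypothetical helper for illustration; not part of serde_json.
/// Returns `None` if either unit lies outside its surrogate range.
fn combine_surrogates(high: u16, low: u16) -> Option<char> {
    let high_ok = (0xD800..=0xDBFF).contains(&high);
    let low_ok = (0xDC00..=0xDFFF).contains(&low);
    if !high_ok || !low_ok {
        return None;
    }
    // Each surrogate contributes 10 bits; the result is offset by 0x10000.
    let code = 0x10000 + (((high as u32) - 0xD800) << 10) + ((low as u32) - 0xDC00);
    char::from_u32(code)
}
```

A lone high surrogate such as `\ud800` followed by anything other than a second `\u` escape has no valid low half, which is the invalid case this PR is about.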


dtolnay

I think this is correct as written, and incorrect after this PR. The error function on the line after the removed line sets the error location based on `read.position()`, not `read.peek_position()`. Before this PR, you'd correctly get an error pointing to the character that was not a backslash but was required to be a backslash. After this PR, the error would be on the character before, which is not right.

fn main() {
    println!("{:?}", serde_json::from_str::<String>(r#""\ud800...""#));
}

Correct behavior:

Err(Error("unexpected end of hex escape", line: 1, column: 8))

Column 8 is:

"\ud800..."
       ^
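
The difference between the two positions can be sketched with a toy byte cursor. Everything here (`Cursor`, `bump`, `error_column`) is hypothetical and much simpler than serde_json's actual reader; it assumes only that `position()` reports the last consumed byte and `peek_position()` reports the next one, as described in the comment above.

```rust
/// Toy byte cursor for illustration; not serde_json's actual reader.
struct Cursor<'a> {
    bytes: &'a [u8],
    index: usize, // number of bytes consumed so far
}

impl<'a> Cursor<'a> {
    fn new(input: &'a str) -> Self {
        Cursor { bytes: input.as_bytes(), index: 0 }
    }

    /// Consume and return the next byte, if any.
    fn bump(&mut self) -> Option<u8> {
        let byte = self.bytes.get(self.index).copied();
        if byte.is_some() {
            self.index += 1;
        }
        byte
    }

    /// Look at the next byte without consuming it.
    fn peek(&self) -> Option<u8> {
        self.bytes.get(self.index).copied()
    }

    /// 1-based column of the most recently consumed byte.
    fn position(&self) -> usize {
        self.index
    }

    /// 1-based column of the byte that `bump` would return next.
    fn peek_position(&self) -> usize {
        self.index + 1
    }
}

/// Skip the 7 bytes of `"\uHHHH`, then check that a backslash follows.
/// Returns the column an error based on `position()` would report,
/// or `None` if the backslash is present.
fn error_column(input: &str, consume_bad_byte: bool) -> Option<usize> {
    let mut cur = Cursor::new(input);
    for _ in 0..7 {
        cur.bump();
    }
    let found = if consume_bad_byte { cur.bump() } else { cur.peek() };
    if found == Some(b'\\') {
        None
    } else {
        Some(cur.position())
    }
}
```

With the input from the example, consuming the offending byte yields column 8, while only peeking leaves `position()` at column 7, one character too early.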

djmitche · 2023-08-17T18:31:01Z

Hm, I guess the distinction is that the error points to the character just read, whereas the replacement in `serde_json_lenient` is assuming the invalid character wasn't read. Let me see if I can find a different way to implement the REPLACEMENT CHARACTER thing, then.

djmitche mentioned this pull request Aug 17, 2023

Do not discard the byte after a high surrogate google/serde_json_lenient#12

Merged

dtolnay requested changes Aug 17, 2023

djmitche closed this Aug 17, 2023
