Parsing syntax like Rust's raw string literals #441

Cookie04DE · 2023-06-08T12:29:46Z

First of all, thank you very much for creating chumsky.
I tried a few times to parse my syntax (using handwritten parsers and other crates) without success, but with chumsky I finally managed to do it!
I'm stuck on the following problem though:
In Rust there's the raw string literal:

// I can use double quotes here without terminating the string literal
r#"{"example": "json"}"#
// If I want a literal `"#` inside my string I just use one additional hash sign:
r##"this is the "coding"#trending page"##

I want to parse this syntax with chumsky.
Here's what I got:

fn raw_string_literal() -> impl Parser<char, String, Error = Simple<char>> + Clone {
    let start = just("r")
        .ignore_then(just("#").repeated())
        .then_ignore(just("\""));
    let end = just("\"").then(just("#").repeated().exactly(?));
    end
        .not()
        .repeated()
        .delimited_by(start, end)
        .map(|chars| {
            chars.into_iter().fold(String::new(), |mut s, c| {
                s.push(c);
                s
            })
        })
}

The problem is that I can't know at parser build time how many hash signs have been used to start the string literal and thus how many are needed to terminate it again.
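
To make the dependency concrete, here is a minimal sketch in plain Rust (no chumsky involved; `closing_delimiter` and `body_end` are hypothetical helper names used only for illustration). It shows that the same text ends at a different position depending on how many hash signs opened the literal, which is why the end parser cannot be fixed in advance:

```rust
/// Builds the delimiter that closes a raw string opened with `hashes` hash signs.
/// (Hypothetical helper, for illustration only.)
fn closing_delimiter(hashes: usize) -> String {
    format!("\"{}", "#".repeat(hashes))
}

/// Returns the byte offset at which the body ends, if a closing delimiter is present.
/// (Hypothetical helper, for illustration only.)
fn body_end(after_open_quote: &str, hashes: usize) -> Option<usize> {
    after_open_quote.find(closing_delimiter(hashes).as_str())
}

fn main() {
    // Text following the opening quote of some raw string literal: a"#b"##
    let rest = "a\"#b\"##";
    // Opened with one hash, the first `"#` terminates the literal.
    println!("{:?}", body_end(rest, 1).map(|i| &rest[..i]));
    // Opened with two hashes, only `"##` terminates it.
    println!("{:?}", body_end(rest, 2).map(|i| &rest[..i]));
}
```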

Answered by Zij-IT

Zij-IT · 2023-06-08T13:30:05Z

What you are looking for is the Parser::then_with method, which lets you build a second parser from the output of the first. Here, it gives us the number of '#'s that the start parser consumed, so the end parser can require exactly that many. Here is what I came up with based on what you described:

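use chumsky::prelude::*;

fn raw_str_lit() -> impl Parser<char, String, Error = Simple<char>> + Clone {
    just("r")
        .ignore_then(
            // This `Parser::map` saves us any allocations because `Vec` doesn't allocate for
            // ZSTs, and we only need the length anyhow.
            // This isn't necessary, and premature optimization is the root of all evil, so sue me for it ;D
            just('#').map(|_| ()).repeated().collect::<Vec<_>>(),
        )
        .then_ignore(just('"'))
        .then_with(|start| {
            // The closing delimiter: a quote followed by exactly as many '#'s as opened the literal.
            let end = just('"').ignore_then(just('#').repeated().exactly(start.len()));
            // Collect every character that does not begin the closing delimiter, then
            // consume the delimiter itself so it is not left behind in the input.
            end.clone().not().repeated().collect::<String>().then_ignore(end)
        })
}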


#[cfg(test)]
mod tests {
    use super::raw_str_lit;
    use chumsky::Parser;

    #[test]
    fn empty_raw() {
        let empty_raw = r###"r##""##"###;
        assert_eq!(raw_str_lit().parse(empty_raw), Ok("".into()));
    }

    #[test]
    fn non_empty_raw() {
        let non_empty_raw = r##"r#"hi""you"#""##;
        assert_eq!(raw_str_lit().parse(non_empty_raw), Ok("hi\"\"you".into()));
    }

    #[test]
    fn nested_raw() {
        let non_empty = r#####"r###"r##"hello there world"##"###"#####;
        assert_eq!(
            raw_str_lit().parse(non_empty),
            Ok("r##\"hello there world\"##".into())
        );
    }

    #[test]
    fn json_raw() {
        let json = r##"r#"{"example": "json"}"#"##;
        assert_eq!(
            raw_str_lit().parse(json),
            Ok("{\"example\": \"json\"}".into())
        );
    }

    #[test]
    fn coding_trend_raw() {
        let json = r###"r##"this is the "coding"#trending page"##"###;
        assert_eq!(
            raw_str_lit().parse(json),
            Ok("this is the \"coding\"#trending page".into())
        );
    }
}

Does this seem to be what you are after?

5 replies

Cookie04DE Jun 8, 2023
Author

Thanks, that was the missing piece!

zesterer Jun 12, 2023
Maintainer

FWIW, then_with has been removed in 1.0 because we want to avoid creating parsers on the fly (it's potentially quite expensive and harms the static analysis capabilities of chumsky). In 1.0, you have two options:

The new context-sensitive combinators (.configure, .then_with_ctx, etc.)
The custom combinator, which can allow you to drop down to imperative-style code to easily implement something like this
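
As a rough illustration of what the imperative-style logic might look like, here is a sketch in plain Rust that does not use chumsky at all. `parse_raw_string` is a hypothetical name, and wiring this logic into the custom combinator is not shown, since the combinator's exact signature is not given in this thread:

```rust
/// Parses a raw string literal at the start of `input`.
/// Returns the literal's body and the unparsed remainder, or `None` if
/// `input` does not start with a well-formed raw string literal.
/// (Hypothetical function, for illustration only.)
fn parse_raw_string(input: &str) -> Option<(&str, &str)> {
    let rest = input.strip_prefix('r')?;
    // Count the opening hashes. '#' is ASCII, so the count is also a byte offset.
    let hashes = rest.bytes().take_while(|&b| b == b'#').count();
    let rest = rest[hashes..].strip_prefix('"')?;
    // The literal ends at the first quote followed by the same number of hashes.
    let closing = format!("\"{}", "#".repeat(hashes));
    let end = rest.find(closing.as_str())?;
    Some((&rest[..end], &rest[end + closing.len()..]))
}

fn main() {
    // The raw string from the original question, followed by some trailing input.
    let source = "r##\"this is the \"coding\"#trending page\"## + tail";
    println!("{:?}", parse_raw_string(source));
}
```

The hash count is an ordinary local variable here, so no parser has to be constructed on the fly. That is the property the imperative approach trades declarative structure for.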

mbund Jun 4, 2024

Would you be able to provide a 1.0 combinator (using one of the new context sensitive combinators or a custom combinator)? I had the same question as OP but I'm using the newer API which dropped support for .then_with.

zesterer Jun 4, 2024
Maintainer

You can use custom, or you can make use of chumsky's new context-sensitive parsers.

(note: I'm writing this from memory, so there might be minor errors)

let hashes = just('#').repeated();
// The start of a raw string is some number of hashes, and then an opening quote
let start = hashes.count().then_ignore(just('"'));
// The end of a string is a closing quote, and then the exact number of hashes that came at the start
let end = just('"').then(hashes.configure(|cfg, ctx| cfg.exactly(*ctx)));
// The inside of the string is any number of characters that do not form part of the end of the string
// (we also use `to_slice` here because we're only interested in the inner part of the raw string for the sake of our AST)
let inner = any().and_is(end.not()).repeated().to_slice();

// Put it all together to make a parser
// (`ignore_with_ctx` will feed the output of `start` into the second parser as context)
let raw_string = start.ignore_with_ctx(inner.then_ignore(end));

mbund Jun 5, 2024

Works great, thanks!
