diff --git a/Cargo.toml b/Cargo.toml
index a87b239..89b7583 100644
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name = "yaml-rust2"
-version = "0.5.0"
+version = "0.6.0"
 authors = [
   "Yuheng Chen ",
   "Ethiraric "
diff --git a/README.md b/README.md
index cb6d4c7..4b5df49 100644
--- a/README.md
+++ b/README.md
@@ -15,7 +15,7 @@ Add the following to the Cargo.toml of your project:

 ```toml
 [dependencies]
-yaml-rust2 = "0.5"
+yaml-rust2 = "0.6"
 ```

 Use `yaml_rust2::YamlLoader` to load YAML documents and access them as `Yaml` objects:
diff --git a/documents/2024-03-15-FirstRelease.md b/documents/2024-03-15-FirstRelease.md
new file mode 100644
index 0000000..2c193f0
--- /dev/null
+++ b/documents/2024-03-15-FirstRelease.md
@@ -0,0 +1,153 @@
+# `yaml-rust2`'s first real release
+If you are not interested in how this crate was born and just want to know what differs from `yaml-rust`, scroll down to
+["This release" or click here](#this-release).
+
+## The why
+Sometime in August 2023, an ordinary developer (that's me) felt the urge to start scribbling about an OpenAPI linter. I
+had worked with the OpenAPI format and tried different linters, but none of them felt right. Needing 3 different
+linters to lint my OpenAPI was a pain. Like any sane person would do, I decided to write my own (author's note: you are
+not not sane if you wouldn't). In order to get things started, I needed a YAML parser.
+
+On August 14th 2023, I forked `yaml-rust` and started working on it. The crate stated that some YAML features were not
+yet available and I felt that was an issue I could tackle. I started by getting to know the code, understanding it,
+adding warnings, refactoring, tinkering, documenting, ... Anything I could do that made me feel the codebase was
+better, I would do. I wanted this crate to be as clean as it could be.
+
+## Fixing YAML compliance
+In my quest to understand YAML better, I found [the YAML test suite](https://github.com/yaml/yaml-test-suite/): a
+compilation of corner cases and intricate YAML examples with their expected output / behavior. Interestingly enough,
+there was an [open pull request on yaml-rust](https://github.com/chyh1990/yaml-rust/pull/187) by
+[tanriol](https://github.com/tanriol) which integrated the YAML test suite as part of the crate tests. Comments mention
+that the maintainer wasn't around anymore and that new contributions would probably never be accepted.
+
+That, however, was a problem for future-past-me, as I was determined (somehow) to have `yaml-rust` pass every single
+test of the YAML test suite. Slowly, over the course of multiple months (from August 2023 to January 2024), I would
+sometimes pick a test from the test suite, fix it, commit and start again. On the 23rd of January, the last commit
+fixing a test was created.
+
+According to the [YAML test matrix](https://matrix.yaml.info/), there is to this day only 1 library that is fully
+compliant (aside from the Perl parser generated by the reference). This would make `yaml-rust2` the second library to be
+fully YAML-compliant. You really wouldn't believe how much you have to stretch YAML so that it's not valid YAML anymore.
+
+## Performance
+With so many improvements, the crate was now perfect!.. Except for performance. Adding conditions for every little bit
+of compliance has led the code to be much more complex and branch-y, which CPUs hate. The code was around 20% slower
+than it was when I started.
+
+For a bit over 3 weeks, I stared at flamegraphs and made my CPU repeat the same instructions until it could do it
+faster. There have been a bunch of improvements for performance since `yaml-rust`'s last commit. Here are a few of them:
+
+* Avoid putting characters in a `VecDeque` buffer when we can push them directly into a `String`.
+* Be a bit smarter about reallocating temporaries: it's best if we know the size in advance, but when we don't we can
+  sometimes avoid pushing characters 1 at a time.
+* The scanner skips over characters one at a time. When skipping them, it needs to check whether they're a linebreak to
+  update the location. Sometimes, we know we skip over a letter (which is not a linebreak). Several "skip" functions
+  have been added for specific uses.
+
+And the big winner, for a decrease of around 15% in runtime, was: use a statically-sized buffer instead of a dynamically
+allocated one. (Almost) Every character goes from the input stream into the buffer and then gets read from the buffer.
+This means that `VecDeque::push` and `VecDeque::pop` were called very frequently. The former always has to check for
+capacity. Using an `ArrayDeque` removed the need for constant capacity checks, at the cost of a minor decrease in
+performance if a line is deeply indented. Hopefully, nobody has 42 nested YAML objects.
+
+Here is the final performance breakdown:
+
+![Comparison of the performance between `yaml-rust`, `yaml-rust2` and the C `libfyaml`. `yaml-rust2` is faster in every
+test than `yaml-rust`, but `libfyaml` remains faster overall.](./img/benchmarks-v0.6.svg)
+
+Here is a short description of what the files contain:
+
+ * `big`: A large array of records with few fields. One of the fields is a description, a large text block scalar
+   spanning multiple lines. Most of the scanning happens in block scalars.
+ * `nested`: Very short key-value pairs that nest deeply.
+ * `small_objects`: A large array of 2 key-value mappings.
+ * `strings_array`: A large array of lipsum one-liners (~150-175 characters in length).
+
+As you can see, `yaml-rust2` performs better than `yaml-rust` on every benchmark. However, when compared against the C
+[`libfyaml`](https://github.com/pantoniou/libfyaml), we can see that there is still much room for improvement.
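To make the statically-sized buffer point above concrete, here is a minimal Rust sketch of a fixed-capacity ring buffer. It is not the crate's code: `FixedRing` and its methods are hypothetical names, and the crate itself relies on an existing `ArrayDeque` type. The sketch only illustrates why such a buffer never takes the grow-and-copy path that a `VecDeque` must be prepared for on every push.

```rust
/// Hypothetical fixed-capacity FIFO of characters, standing in for the
/// `ArrayDeque` mentioned in the post. Storage is inline and never reallocates.
struct FixedRing<const N: usize> {
    buf: [char; N],
    head: usize, // index of the oldest element
    len: usize,  // number of stored elements
}

impl<const N: usize> FixedRing<N> {
    fn new() -> Self {
        Self { buf: ['\0'; N], head: 0, len: 0 }
    }

    /// Appends `c`, or hands it back if the buffer is full.
    /// There is no growth path: a full buffer is the caller's problem.
    fn push_back(&mut self, c: char) -> Result<(), char> {
        if self.len == N {
            return Err(c);
        }
        self.buf[(self.head + self.len) % N] = c;
        self.len += 1;
        Ok(())
    }

    /// Removes and returns the oldest character, if any.
    fn pop_front(&mut self) -> Option<char> {
        if self.len == 0 {
            return None;
        }
        let c = self.buf[self.head];
        self.head = (self.head + 1) % N;
        self.len -= 1;
        Some(c)
    }
}

fn main() {
    let mut ring: FixedRing<4> = FixedRing::new();
    for c in "yaml".chars() {
        ring.push_back(c).expect("buffer has room");
    }
    // A fifth push is rejected instead of triggering a reallocation.
    assert_eq!(ring.push_back('!'), Err('!'));
    assert_eq!(ring.pop_front(), Some('y'));
    // Popping freed a slot, so the write index wraps around to slot 0.
    assert_eq!(ring.push_back('!'), Ok(()));
    let rest: String = std::iter::from_fn(|| ring.pop_front()).collect();
    assert_eq!(rest, "aml!");
}
```

A full buffer is reported to the caller instead of growing, which matches the trade-off described in the post: cheap pushes in the common case, extra handling when the fixed capacity is not enough.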
+
+I'd like to end this section with a small disclaimer: I am not a benchmark expert. I tried to have a heterogeneous set
+of files that would highlight how the parser performs when stressed in different ways. I invite you to take a look at [the
+code generating the YAML files](https://github.com/Ethiraric/yaml-rust2/tree/master/tools/gen_large_yaml) and, if you
+are more knowledgeable than I am, improve upon them. `yaml-rust2` performs better with these files because those are the
+ones I could work with. If you find a file with which `yaml-rust2` is slower than `yaml-rust`, do file an issue!
+
+## This release
+### Improvements from `yaml-rust`
+This release should improve over `yaml-rust` on 3 major points:
+
+ * Performance: We all love fast software. I want to help you achieve it. I haven't managed to make this crate twice as
+   fast, but you should notice a 15-20% improvement in performance.
+ * Compliance: You may not notice it, since I didn't know most of the bugs I fixed were bugs to begin with, but this
+   crate should now be fully YAML-compliant.
+ * Documentation: The documentation of `yaml-rust` is unfortunately incomplete. Documentation here is not exhaustive,
+   but most items are documented. Notably, private items are documented, making it much easier to understand where
+   something happens. There are also in-code comments that help figure out what is going on under the hood.
+
+Also, last but not least, I do plan on keeping this crate alive as long as I can. Nobody can make promises in that
+regard, of course, but I have poured hours of work into this, and I would hate to see this go to waste.
+
+### Switching to `yaml-rust2`
+This release is `v0.6.0`, chosen to explicitly differ in minor version from `yaml-rust`. `v0.4.x` does not exist in this
+crate to avoid any confusion between the 2 crates.
+
+Switching to `yaml-rust2` should be a very simple process. Change your `Cargo.toml` to use `yaml-rust2` instead of
+`yaml-rust`:
+
+```diff
+-yaml-rust = "0.4.4"
++yaml-rust2 = "0.6.0"
+```
+
+As for your code, you have two options:
+
+ * Change your imports from `use yaml_rust::Yaml` to `use yaml_rust2::Yaml` if you import items directly, or change
+   occurrences of `yaml_rust` to `yaml_rust2` if you use fully qualified paths.
+ * Alternatively, you can alias `yaml_rust2` with `use yaml_rust2 as yaml_rust`. This would keep your code working if
+   you use fully qualified paths.
+
+Which one you pick is up to you.
+
+#### What about API breakage?
+Most of what I have changed is in the implementation details. You might notice more documentation appearing in your LSP,
+but documentation isn't bound by the API. There is only one change I made that could lead to compile errors. It is
+unlikely you used that feature, but I'd hate to leave this undocumented.
+
+If you use the low-level event parsing API (`Parser`,
+`EventReceiver` / `MarkedEventReceiver`) and namely the `yaml_rust::Event` enumeration, there is one change that might
+break your code. This was needed for tests in the YAML test suite. In `yaml-rust`, YAML tags are not forwarded from the
+lower-level `Scanner` API to the low-level `Parser` API.
+
+Here is the change that was made in the library:
+
+```diff
+ pub enum Event {
+     // ...
+-SequenceStart(usize),
+-MappingStart(usize),
++SequenceStart(usize, Option<Tag>),
++MappingStart(usize, Option<Tag>),
+     // ...
+ }
+```
+
+This means that you may now see YAML tags appearing in your code.
+
+## Closing words
+YAML is hard. Much more than I had anticipated. If you are exploring dark corners of YAML that `yaml-rust2` supports but
+`yaml-rust` doesn't, I'm curious to know what they are.
+
+Work on this crate is far from over. I will try and match `libfyaml`'s performance. Today is the first time I benched
+against it, and I wouldn't have guessed it to outperform `yaml-rust2` that much.
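For the `Event` change described under "What about API breakage?", the adaptation on the caller's side could look like the sketch below. It does not depend on the crate: `Tag` and `Event` are local stand-ins that mirror only the two changed variants, the `handle` / `suffix` fields of `Tag` are an assumption, and `describe` is a hypothetical helper.

```rust
/// Local stand-in for the crate's tag type (field names are an assumption).
#[derive(Debug, Clone, PartialEq)]
struct Tag {
    handle: String,
    suffix: String,
}

/// Local mirror of the two variants that gained an `Option<Tag>` field.
#[derive(Debug, Clone, PartialEq)]
enum Event {
    SequenceStart(usize, Option<Tag>),
    MappingStart(usize, Option<Tag>),
}

/// Hypothetical helper: summarizes a collection-start event.
fn describe(event: &Event) -> String {
    match event {
        // Code that wants the tag binds the new second field.
        Event::SequenceStart(anchor, Some(tag)) => {
            format!("sequence #{} tagged {}{}", anchor, tag.handle, tag.suffix)
        }
        Event::SequenceStart(anchor, None) => format!("sequence #{}", anchor),
        // Code that does not care about tags only needs an extra `_`
        // where it previously matched `MappingStart(anchor)`.
        Event::MappingStart(anchor, _) => format!("mapping #{}", anchor),
    }
}

fn main() {
    // Illustrative tag value, not taken from the crate.
    let tag = Tag { handle: "!!".to_string(), suffix: "set".to_string() };
    let events = [
        Event::SequenceStart(0, None),
        Event::MappingStart(1, Some(tag.clone())),
        Event::SequenceStart(2, Some(tag)),
    ];
    assert_eq!(describe(&events[0]), "sequence #0");
    assert_eq!(describe(&events[1]), "mapping #1");
    assert_eq!(describe(&events[2]), "sequence #2 tagged !!set");
}
```

Existing match arms that took a single field stop compiling, which is the breakage the post warns about; adding `_` for the second field restores the old behavior.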
+
+If you're interested in upgrading your `yaml-rust` crate, please do take a look at [davvid](https://github.com/davvid)'s
+[fork of `yaml-rust`](https://github.com/davvid/yaml-rust). Very recent developments on this crate were sparked by an
+[issue on advisory-db](https://github.com/rustsec/advisory-db/issues/1921) about the unmaintained state of `yaml-rust`.
+I hope that YAML in Rust will improve following this issue.
+
+Thank you for reading through this. If you happen to have issues with `yaml-rust2` or suggestions, do [drop an
+issue](https://github.com/Ethiraric/yaml-rust2/issues)!
+
+If however you wanted an OpenAPI linter, I'm afraid you're out of luck. Just as much as I'm out of time ;)
+
+-Ethiraric
diff --git a/documents/img/2024-03-15-benchmarks.csv b/documents/img/2024-03-15-benchmarks.csv
new file mode 100644
index 0000000..685b6cc
--- /dev/null
+++ b/documents/img/2024-03-15-benchmarks.csv
@@ -0,0 +1,5 @@
+,yaml-rust2,yaml-rust,libfyaml
+big.yaml,1644933464,2097747837,1642761913
+nested.yaml,1186706803,1461738560,1104480120
+small_objects.yaml,5459915062,5686715239,4402878726
+strings_array.yaml,1698194153,2044921291,924246153
diff --git a/documents/img/benchmarks-v0.6.svg b/documents/img/benchmarks-v0.6.svg
new file mode 100644
index 0000000..2e9ddd7
--- /dev/null
+++ b/documents/img/benchmarks-v0.6.svg
@@ -0,0 +1,69 @@
[SVG chart, markup omitted. Groups: `big`, `nested`, `small_objects`, `strings_array`. Axis ticks: 0 to 6000000 in steps of 1000000. Legend: `yaml-rust`, `yaml-rust2`, `libfyaml`. Axis title: "Time in ms (less is better)".]
\ No newline at end of file
diff --git a/src/lib.rs b/src/lib.rs
index 9c13e9a..7c01b66 100644
--- a/src/lib.rs
+++ b/src/lib.rs
@@ -11,7 +11,7 @@ //!
 //!
 //! ```toml
 //! [dependencies]
-//! yaml-rust2 = "0.5.0"
+//! yaml-rust2 = "0.6.0"
 //! ```
 //!
 //! # Examples
diff --git a/tools/bench_compare/Cargo.toml b/tools/bench_compare/Cargo.toml
index 7c7f97c..4ca9b33 100644
--- a/tools/bench_compare/Cargo.toml
+++ b/tools/bench_compare/Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name = "bench_compare"
-version = "0.5.0"
+version = "0.6.0"
 authors = [
   "Ethiraric "
 ]
diff --git a/tools/gen_large_yaml/Cargo.toml b/tools/gen_large_yaml/Cargo.toml
index 54b6b3c..0fe3eac 100644
--- a/tools/gen_large_yaml/Cargo.toml
+++ b/tools/gen_large_yaml/Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name = "gen_large_yaml"
-version = "0.5.0"
+version = "0.6.0"
 authors = [
   "Ethiraric "
 ]
@@ -11,7 +11,7 @@ readme = "README.md"
 edition = "2018"

 [dependencies]
-yaml-rust2 = { version = "0.5.0", path = "../../" }
+yaml-rust2 = { version = "0.6.0", path = "../../" }
 rand = "0.8.5"
 lipsum = "0.9.0"