Add bench_compare tool.
Ethiraric committed Mar 15, 2024
1 parent 3f6544b commit 35b22df
Showing 6 changed files with 363 additions and 0 deletions.
1 change: 1 addition & 0 deletions .cargo/config.toml
@@ -1,2 +1,3 @@
[alias]
gen_large_yaml = "run --profile=release-lto --package gen_large_yaml --bin gen_large_yaml --manifest-path tools/gen_large_yaml/Cargo.toml --"
bench_compare = "run --package bench_compare --bin bench_compare --manifest-path tools/bench_compare/Cargo.toml --"
6 changes: 6 additions & 0 deletions justfile
@@ -6,3 +6,9 @@ before_commit:
cargo test
cargo test --release
cargo build --profile=release-lto --package gen_large_yaml --bin gen_large_yaml --manifest-path tools/gen_large_yaml/Cargo.toml

ethi_bench:
cargo build --release --all-targets
cd ../Yaml-rust && cargo build --release --all-targets
cd ../libfyaml/build && ninja
cargo bench_compare run_bench
41 changes: 41 additions & 0 deletions tools/README.md
@@ -4,10 +4,15 @@ Due to dependency management, only some of them are available as binaries from t

| Tool | Invocation |
|------|------------|
| `bench_compare` | `cargo bench_compare` |
| `dump_events` | `cargo run --bin dump_events -- [...]` |
| `gen_large_yaml` | `cargo gen_large_yaml` |
| `run_bench` | `cargo run --bin run_bench -- [...]` |
| `time_parse` | `cargo run --bin time_parse -- [...]` |

## `bench_compare`
See the [dedicated README file](./bench_compare/README.md).

## `dump_events`
This is a debugging helper for the parser. It outputs events emitted by the parser for a given file. This can be paired with the `YAMLRUST2_DEBUG` environment variable to have an in-depth overview of which steps the scanner and the parser are taking.

@@ -171,6 +176,42 @@ All generated files are meant to be between 200 and 250 MiB in size.
This tool has dependencies that are not part of `yaml-rust2`'s `dependencies` or `dev-dependencies` and as such cannot be invoked directly through `cargo run`. Use the dedicated `cargo gen_large_yaml` alias to generate the benchmark files.
## `run_bench`
This is a benchmarking helper that runs the parser on a given file a given number of times and extracts simple metrics from the results. The `--output-yaml` flag makes the output a YAML document that can be fed into other tools.

This binary is meant to be used by `bench_compare`.

Synopsis: `run_bench input.yaml <iterations> [--output-yaml]`
### Examples
```sh
$> cargo run --release --bin run_bench -- bench_yaml/big.yaml 10
Average: 1.631936191s
Min: 1.629654651s
Max: 1.633045284s
95%: 1.633045284s
$> cargo run --release --bin run_bench -- bench_yaml/big.yaml 10 --output-yaml
parser: yaml-rust2
input: bench_yaml/big.yaml
average: 1649847674
min: 1648277149
max: 1651936305
percentile95: 1651936305
iterations: 10
times:
- 1650216129
- 1649349978
- 1649507018
- 1648277149
- 1649036548
- 1650323982
- 1650917692
- 1648702081
- 1650209860
- 1651936305
```

## `time_parse`
This is a benchmarking helper that times how long it takes for the parser to emit all events. It calls the parser on the given input file, receives parsing events and then immediately discards them. It is advised to run this tool with `--release`.

21 changes: 21 additions & 0 deletions tools/bench_compare/Cargo.toml
@@ -0,0 +1,21 @@
[package]
name = "bench_compare"
version = "0.5.0"
authors = [
"Ethiraric <ethiraric@gmail.com>"
]
license = "MIT OR Apache-2.0"
description = "Run multiple YAML parsers and compare their times"
repository = "https://github.com/Ethiraric/yaml-rust2"
readme = "README.md"
edition = "2018"

[dependencies]
anyhow = { version = "1.0.81", features = ["backtrace"] }
serde = { version = "1.0.197", features = ["derive"] }
serde_yaml = "0.9.32"
toml = "0.8.11"

[profile.release-lto]
inherits = "release"
lto = true
120 changes: 120 additions & 0 deletions tools/bench_compare/README.md
@@ -0,0 +1,120 @@
# `bench_compare`
This tool helps compare the times different YAML parsers take to parse the same input.

## Synopsis
```
bench_compare time_parse
bench_compare run_bench
```

This will run either `time_parse` or `run_bench` (described below) with the given set of parsers from the configuration file.

## Parser requirements
Parsers are expected to be event-based. In order to be fair to this crate's benchmark implementation, parsers should:

* Load the file into memory (a string, `mmap`, ...) **prior** to starting the clock
* Initialize the parser, if needed
* **Start the clock**
* Read events from the parser until parsing is done
* Discard events as they are received (dropping them, `free`ing them or anything similar) so as not to grow memory consumption too high and to let the parser reuse event structures
* **Stop the clock**
* Destroy resources, if needed/wanted (parser, file buffer, ...); the kernel will reclaim them after the process exits anyway
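The measurement loop above can be sketched as follows. This is not any benchmarked parser's actual code: `EventParser` and `DummyParser` are hypothetical stand-ins for a real event-based parser such as the ones being compared.

```rust
use std::time::Instant;

// Hypothetical event-based parser interface; a real benchmark would
// substitute an actual parser (yaml-rust2, libfyaml, ...) here.
trait EventParser {
    type Event;
    fn next_event(&mut self) -> Option<Self::Event>;
}

// Toy stand-in that emits one unit event per input byte.
struct DummyParser {
    remaining: u64,
}

impl EventParser for DummyParser {
    type Event = ();
    fn next_event(&mut self) -> Option<()> {
        if self.remaining == 0 {
            None
        } else {
            self.remaining -= 1;
            Some(())
        }
    }
}

/// Run the steps above: the input is already in memory and the parser
/// is built before the clock starts; events are dropped as they come.
/// Returns (number of events, elapsed nanoseconds).
fn time_parse(input: &str) -> (u64, u128) {
    let mut parser = DummyParser { remaining: input.len() as u64 };
    let mut n_events = 0;
    let start = Instant::now(); // start the clock
    while let Some(event) = parser.next_event() {
        drop(event); // discard immediately; do not accumulate
        n_events += 1;
    }
    let elapsed = start.elapsed().as_nanos(); // stop the clock
    (n_events, elapsed)
}

fn main() {
    let (events, nanos) = time_parse(&"a: b\n".repeat(1000));
    println!("{events} events in {nanos}ns");
}
```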


## Required parser binaries
This tool recognizes two binaries: `time_parse` and `run_bench`.

### `time_parse`
Synopsis:
```
time_parse file.yaml [--short]
```

The binary must run the aforementioned steps and display on its output the time the parser took to parse the given file.
With the `--short` option, the binary must only output the benchmark time in nanoseconds.

```sh
# This is meant to be human-readable.
# The example below is what this crate implements.
$> time_parse file.yaml
Loaded 200MiB in 1.74389s.

# This will be read by this tool.
# This must output ONLY the time, in nanoseconds.
$> time_parse file.yaml --short
1743892394
```

This tool will always provide the `--short` option.
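A minimal sketch of the output contract described above; the human-readable wording is free-form, and the nanosecond value below is a stand-in rather than a measurement:

```rust
use std::env;
use std::time::Duration;

// Format the result according to the `time_parse` contract: with
// `--short`, print nothing but the time in nanoseconds.
fn report(elapsed: Duration, short: bool) -> String {
    if short {
        // Machine-readable output consumed by bench_compare.
        format!("{}", elapsed.as_nanos())
    } else {
        // Human-readable output; the exact wording is free-form.
        format!("Loaded file in {:.5}s.", elapsed.as_secs_f64())
    }
}

fn main() {
    let short = env::args().any(|a| a == "--short");
    let elapsed = Duration::from_nanos(1_743_892_394); // stand-in value
    println!("{}", report(elapsed, short));
}
```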

### `run_bench`
Synopsis:
```
run_bench file.yaml <iterations> [--output-yaml]
```

The binary is expected to run `<iterations>` iterations of the aforementioned steps and print relevant information to its output.
The `--output-yaml` flag instructs the binary to output details about its runs as YAML on its standard output.
The binary may optionally perform warmup runs prior to the benchmark; the overall time the binary takes to run is not itself evaluated.

```sh
# This is meant to be human-readable.
# The example below is what this crate implements.
$> run_bench file.yaml 100
Average: 1.589485s
Min : 1.583078s
Max : 1.597028s
95% : 1.593219s

# This will be read by this tool.
# This must output a YAML as described below.
$> run_bench ../file.yaml 10 --output-yaml
parser: yaml-rust2
input: ../file.yaml
average: 1620303590
min: 1611632108
max: 1636401896
percentile95: 1636401896
iterations: 10
times:
- 1636401896
- 1623914538
- 1611632108
- 1612973608
- 1617748930
- 1615419514
- 1612172250
- 1620791346
- 1629339306
- 1622642412
```

The expected fields are (all times in nanoseconds):

* `parser`: The name of the parser (useful to detect mistakes when renaming output files)
* `input`: The path to the input file as given to the binary arguments
* `average`: The average time it took to run the parser
* `min`: The shortest time it took to run the parser
* `max`: The longest time it took to run the parser
* `percentile95`: The 95th percentile time of the runs
* `iterations`: The number of times the parser was run (`<iterations>`)
* `times`: An array of `iterations` times, one for each run, in the order they were run (first run first)
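The summary fields can be derived from the per-run times as sketched below. This is only one way to compute them, not necessarily what any given parser implements; the 95th percentile uses the nearest-rank method, which is consistent with the examples above where `percentile95` equals `max` for 10 iterations.

```rust
/// Summary statistics over per-run parse times in nanoseconds:
/// (average, min, max, 95th percentile).
fn summarize(times: &[u64]) -> (u64, u64, u64, u64) {
    assert!(!times.is_empty());
    let average = times.iter().sum::<u64>() / times.len() as u64;
    let min = *times.iter().min().unwrap();
    let max = *times.iter().max().unwrap();
    // Nearest-rank 95th percentile: with 10 samples this selects the
    // largest sample, as in the example output above.
    let mut sorted = times.to_vec();
    sorted.sort_unstable();
    let rank = ((sorted.len() as f64) * 0.95).ceil() as usize;
    let percentile95 = sorted[rank - 1];
    (average, min, max, percentile95)
}

fn main() {
    let (avg, min, max, p95) = summarize(&[3, 1, 2, 5, 4]);
    println!("avg={avg} min={min} max={max} p95={p95}");
}
```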

## Configuration
`bench_compare` is configured through a `bench_compare.toml` file. This file must be located in the current directory.
As of now, default values are unsupported and all fields must be set. The following fields are required:
```toml
yaml_input_dir = "bench_yaml" # The path to the directory containing the input yaml files
iterations = 10 # The number of iterations, if using `run_bench`
yaml_output_dir = "yaml_output" # The directory in which `run_bench`'s yamls are saved
csv_output = "benchmark.csv" # The CSV output aggregating times for each parser and file

[[parsers]] # A parser, can be repeated as many times as there are parsers
name = "yaml-rust2" # The name of the parser (used for logging)
path = "target/release/" # The path in which the parsers' `run_bench` and `time_parse` are

# If there is another parser, another block can be added
# [[parsers]]
# name = "libfyaml"
# path = "../libfyaml/build"
```
174 changes: 174 additions & 0 deletions tools/bench_compare/src/main.rs
@@ -0,0 +1,174 @@
use std::{fs::File, io::BufWriter, io::Write, path::Path};

use anyhow::Error;
use serde::{Deserialize, Serialize};

fn main() {
if let Err(e) = entrypoint() {
eprintln!("{e:?}");
std::process::exit(1);
}
}

fn entrypoint() -> Result<(), Error> {
let config: Config =
toml::from_str(&std::fs::read_to_string("bench_compare.toml")?)?;
if config.parsers.is_empty() {
println!("Please add at least one parser. Refer to the README for instructions.");
return Ok(());
}
let args: Vec<_> = std::env::args().collect();
if args.len() != 2 || !["time_parse", "run_bench"].contains(&args[1].as_str()) {
println!("Usage: bench_compare <time_parse|run_bench>");
return Ok(());
}
match args[1].as_str() {
"run_bench" => run_bench(&config)?,
"time_parse" => unimplemented!(),
_ => unreachable!(),
}
Ok(())
}

/// Run the `run_bench` binary on the given parsers.
fn run_bench(config: &Config) -> Result<(), Error> {
// Create output directory
std::fs::create_dir_all(&config.yaml_output_dir)?;

let inputs = list_input_files(config)?;
let iterations = format!("{}", config.iterations);
let mut averages = vec![];

// Inputs are ordered, so are parsers.
for input in &inputs {
let input_basename = Path::new(&input).file_name().unwrap().to_string_lossy();
let mut input_times = vec![];

// Run each input for each parser.
for parser in &config.parsers {
println!("Running {input_basename} against {}", parser.name);
// Run benchmark
let path = Path::new(&parser.path).join("run_bench");
let output = std::process::Command::new(path)
.arg(input)
.arg(&iterations)
.arg("--output-yaml")
.output()?;
// Check exit status.
if output.status.code().unwrap_or(1) == 0 {
let s = String::from_utf8_lossy(&output.stdout);
// Get output as yaml.
match serde_yaml::from_str::<BenchYamlOutput>(&s) {
Ok(output) => {
// Push average into our CSV-to-be.
input_times.push(output.average);
// Save the YAML for later.
serde_yaml::to_writer(
BufWriter::new(File::create(format!(
"{}/{}-{}",
config.yaml_output_dir, parser.name, input_basename
))?),
&output,
)?;
}
Err(e) => {
// Yaml is invalid, use 0 as "didn't run properly".
println!("Errored: Invalid YAML output: {e}");
input_times.push(0);
}
}
} else {
// An error happened, use 0 as "didn't run properly".
println!("Errored: process exited with a non-zero status");
input_times.push(0);
}
}
averages.push(input_times);
}

// Finally, save a CSV.
save_run_bench_csv(config, &inputs, &averages)
}

/// General configuration structure.
#[derive(Serialize, Deserialize)]
struct Config {
/// The path to the directory containing the input yaml files.
yaml_input_dir: String,
/// Number of iterations to run, if using `run_bench`.
iterations: u32,
/// The parsers to run.
parsers: Vec<Parser>,
/// The path to the directory in which `run_bench`'s yamls are saved.
yaml_output_dir: String,
/// The path to the CSV output aggregating times for each parser and file.
csv_output: String,
}

/// A parser configuration.
#[derive(Serialize, Deserialize)]
struct Parser {
/// The name of the parser.
name: String,
/// The path in which the parser's `run_bench` and `time_parse` are located.
path: String,
}

/// Output of running `run_bench` on a given parser.
#[derive(Serialize, Deserialize)]
struct BenchYamlOutput {
/// The name of the parser.
parser: String,
/// The file taken as input.
input: String,
/// Average parsing time (ns).
average: u64,
/// Shortest parsing time (ns).
min: u64,
/// Longest parsing time (ns).
max: u64,
/// 95th percentile of parsing times (ns).
percentile95: u64,
/// Number of iterations.
iterations: u64,
/// Parsing times for each run.
times: Vec<u64>,
}

/// Save a CSV file with all averages from `run_bench`.
fn save_run_bench_csv(
config: &Config,
inputs: &[String],
averages: &[Vec<u64>],
) -> Result<(), Error> {
let mut csv = BufWriter::new(File::create(&config.csv_output)?);
for parser in &config.parsers {
write!(csv, ",{}", parser.name)?;
}
writeln!(csv)?;
for (path, averages) in inputs.iter().zip(averages.iter()) {
let filename = Path::new(path).file_name().unwrap().to_string_lossy();
write!(csv, "{}", filename)?;
for avg in averages {
write!(csv, ",{avg}")?;
}
writeln!(csv)?;
}

Ok(())
}

/// Returns the paths to the input yaml files.
fn list_input_files(config: &Config) -> Result<Vec<String>, Error> {
Ok(std::fs::read_dir(&config.yaml_input_dir)?
.filter_map(Result::ok)
.map(|entry| entry.path().to_string_lossy().to_string())
.filter(|path| {
Path::new(path)
.extension()
.map_or(false, |ext| ext.eq_ignore_ascii_case("yaml"))
})
.collect())
}
