Evaluate Profile-Guided Optimization (PGO) and Post-Link Optimization (PLO) usage #80

zamazan4ik · 2024-04-13T19:16:46Z

Hi!

Recently I evaluated Profile-Guided Optimization (PGO) on multiple projects. The results are available here. According to those tests, PGO improves performance in many cases and for many kinds of applications, including compilers and static analyzers. Since pylyzer cares about performance, I think optimizing it with these techniques is worth trying.

I already did some benchmarks and want to share my results here.

Test environment

Fedora 39
Linux kernel 6.8.4
AMD Ryzen 9 5900X
48 GiB RAM
SSD Samsung 980 Pro 2 TiB
Compiler - Rustc 1.74 and Clang 17 (for the tree-sitter build), CFLAGS are -O3
pylyzer version: main branch at commit 70c23905ae768ab554000abeefab36fe48ab54f4 (the latest at the time of testing)
Turbo Boost disabled

Benchmark

For benchmark purposes, I used a scenario from the README file - pylyzer tests/test.py. All PGO and PLO optimizations are done with cargo-pgo.

For the Release build, the built-in benchmarks were run with cargo bench -p benches. The PGO instrumentation phase was run with cargo pgo bench -- -p benches, and the PGO-optimized benchmarks with cargo pgo optimize bench -- -p benches.
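
The steps above can be sketched as a small script. This is a minimal sketch, not the exact commands used for the measurements: the target triple, the resulting binary path, and the run helper are assumptions made for illustration. By default the script only prints what it would execute.

```shell
#!/usr/bin/env bash
# Sketch of the cargo-pgo workflow: instrument, collect profiles, optimize.
# Prints the commands by default; set DRY_RUN=0 to actually execute them.
set -eu

TARGET="${TARGET:-x86_64-unknown-linux-gnu}"  # assumed target triple
WORKLOAD="${WORKLOAD:-tests/test.py}"         # training input (README scenario)

# Hypothetical helper: echo the command in dry-run mode, run it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

pgo_binary() {
  # 1. Build the instrumented binary.
  run cargo pgo build
  # 2. Run it on the training workload to collect profiles.
  #    A non-zero exit code is tolerated here.
  run "./target/${TARGET}/release/pylyzer" "${WORKLOAD}" || true
  # 3. Rebuild using the collected profiles.
  run cargo pgo optimize
}

pgo_benches() {
  run cargo pgo bench -- -p benches           # instrumented benchmark run
  run cargo pgo optimize bench -- -p benches  # PGO-optimized benchmark run
}

pgo_binary
pgo_benches
```

Running it with DRY_RUN=0 from the repository root would execute the same commands instead of printing them, provided cargo-pgo is installed.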

All tests were run on the same machine, repeated many times with hyperfine, and under the same background "noise" (as far as I can guarantee, of course). The results are consistent across runs.

The LTO build is made by adding the following lines to Cargo.toml:

[profile.release]
codegen-units = 1
lto = true
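
To try the same settings without editing Cargo.toml, Cargo's profile environment variables can be used instead. A minimal sketch, assuming the binary is built with a plain cargo build --release; the run helper is hypothetical and only prints the command unless DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
# One-off LTO build via Cargo's CARGO_PROFILE_<name>_<key> overrides.
set -eu

# Hypothetical helper: echo the command in dry-run mode, run it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

lto_build() {
  # Mirrors codegen-units = 1 and lto = true under [profile.release];
  # "fat" is the LTO mode that lto = true selects.
  run env CARGO_PROFILE_RELEASE_CODEGEN_UNITS=1 \
          CARGO_PROFILE_RELEASE_LTO=fat \
          cargo build --release
}

lto_build
```

This keeps the default release profile untouched, which is convenient when comparing builds side by side.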

Results

The results:

hyperfine --warmup 200 --min-runs 1000 -i 'taskset -c 0 ./pylyzer_release ../tests/test.py' 'taskset -c 0 ./pylyzer_release_with_lto ../tests/test.py' 'taskset -c 0 ./pylyzer_optimized ../tests/test.py' 'taskset -c 0 ./pylyzer_pgo_and_bolt_optimized ../tests/test.py' 'taskset -c 0 ./pylyzer_lto_and_bolt_optimized ../tests/test.py'
Benchmark 1: taskset -c 0 ./pylyzer_release ../tests/test.py
  Time (mean ± σ):      18.2 ms ±   0.3 ms    [User: 10.6 ms, System: 7.4 ms]
  Range (min … max):    17.6 ms …  19.5 ms    1000 runs

  Warning: Ignoring non-zero exit code.

Benchmark 2: taskset -c 0 ./pylyzer_release_with_lto ../tests/test.py
  Time (mean ± σ):      17.1 ms ±   0.3 ms    [User: 9.6 ms, System: 7.3 ms]
  Range (min … max):    16.4 ms …  18.9 ms    1000 runs

  Warning: Ignoring non-zero exit code.

Benchmark 3: taskset -c 0 ./pylyzer_optimized ../tests/test.py
  Time (mean ± σ):      16.7 ms ±   0.3 ms    [User: 9.2 ms, System: 7.3 ms]
  Range (min … max):    16.1 ms …  18.7 ms    1000 runs

  Warning: Ignoring non-zero exit code.

Benchmark 4: taskset -c 0 ./pylyzer_pgo_and_bolt_optimized ../tests/test.py
  Time (mean ± σ):      15.8 ms ±   0.1 ms    [User: 8.9 ms, System: 6.7 ms]
  Range (min … max):    15.4 ms …  17.4 ms    1000 runs

  Warning: Ignoring non-zero exit code.
  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Benchmark 5: taskset -c 0 ./pylyzer_lto_and_bolt_optimized ../tests/test.py
  Time (mean ± σ):      15.6 ms ±   0.2 ms    [User: 8.8 ms, System: 6.6 ms]
  Range (min … max):    15.1 ms …  17.2 ms    1000 runs

  Warning: Ignoring non-zero exit code.
  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  taskset -c 0 ./pylyzer_lto_and_bolt_optimized ../tests/test.py ran
    1.01 ± 0.02 times faster than taskset -c 0 ./pylyzer_pgo_and_bolt_optimized ../tests/test.py
    1.07 ± 0.02 times faster than taskset -c 0 ./pylyzer_optimized ../tests/test.py
    1.10 ± 0.02 times faster than taskset -c 0 ./pylyzer_release_with_lto ../tests/test.py
    1.17 ± 0.02 times faster than taskset -c 0 ./pylyzer_release ../tests/test.py

where:

pylyzer_release - Release build
pylyzer_release_with_lto - Release build + LTO
pylyzer_optimized - Release build + PGO
pylyzer_pgo_and_bolt_optimized - Release build + PGO + PLO (via LLVM BOLT)
pylyzer_lto_and_bolt_optimized - Release build + LTO + PLO (via LLVM BOLT)

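According to the tests above, LTO, PGO, and PLO each give a measurable improvement: the mean run time drops from 18.2 ms (Release) to 17.1 ms with LTO, 16.7 ms with PGO, 15.8 ms with PGO + BOLT, and 15.6 ms with LTO + BOLT.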
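
The speedups relative to the plain Release build can be recomputed from the mean times in the hyperfine output. A small sketch: the ratio helper and the variable names are only illustrative, and the numbers are copied from the results above.

```shell
#!/usr/bin/env bash
set -eu

# Mean wall-clock times in ms, copied from the hyperfine output above.
release=18.2; lto=17.1; pgo=16.7; pgo_bolt=15.8; lto_bolt=15.6

# ratio BASELINE OPTIMIZED: how many times faster the optimized build is.
ratio() { awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'; }

ratio "$release" "$lto"       # prints 1.06
ratio "$release" "$pgo"       # prints 1.09
ratio "$release" "$pgo_bolt"  # prints 1.15
ratio "$release" "$lto_bolt"  # prints 1.17
```

The last value matches the 1.17 reported in the hyperfine summary. The summary itself uses the fastest build (LTO + BOLT) as the reference, so its other ratios differ from the ones printed here.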

For reference, here are the timings of the instrumented binaries (PGO and LLVM BOLT, with and without LTO):

PGO instrumented run:

hyperfine --warmup 50 --min-runs 200 -i './pylyzer_instrumented ../tests/test.py'
Benchmark 1: ./pylyzer_instrumented ../tests/test.py
  Time (mean ± σ):      36.5 ms ±   0.8 ms    [User: 15.4 ms, System: 17.4 ms]
  Range (min … max):    34.2 ms …  38.8 ms    200 runs

LTO enabled + PGO instrumented run:

hyperfine --warmup 10 --min-runs 50 -i 'taskset -c 0 ./pylyzer_lto_instrumented ../tests/test.py'
Benchmark 1: taskset -c 0 ./pylyzer_lto_instrumented ../tests/test.py
  Time (mean ± σ):      28.0 ms ±   0.5 ms    [User: 12.0 ms, System: 13.1 ms]
  Range (min … max):    27.1 ms …  30.4 ms    103 runs

LLVM BOLT instrumented run:

hyperfine --warmup 10 --min-runs 50 -i 'taskset -c 0 ./pylyzer_bolt_insrtumented ../tests/test.py'
Benchmark 1: taskset -c 0 ./pylyzer_bolt_insrtumented ../tests/test.py
  Time (mean ± σ):     668.0 ms ±   6.7 ms    [User: 152.5 ms, System: 497.4 ms]
  Range (min … max):   660.8 ms … 698.1 ms    50 runs

LTO enabled + LLVM BOLT instrumented run:

hyperfine --warmup 10 --min-runs 50 -i 'taskset -c 0 ./pylyzer_lto_bolt_instrumented ../tests/test.py'
Benchmark 1: taskset -c 0 ./pylyzer_lto_bolt_instrumented ../tests/test.py
  Time (mean ± σ):     556.7 ms ±   5.9 ms    [User: 132.6 ms, System: 406.3 ms]
  Range (min … max):   547.4 ms … 584.9 ms    50 runs
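
To put the instrumented timings in perspective, the slowdown of each instrumented binary can be computed against the plain Release mean of 18.2 ms. This is only a rough sketch: the row labels are made up for readability, a single baseline is used for all rows, and the PGO instrumented run was not pinned with taskset, so the factors are approximate.

```shell
#!/usr/bin/env bash
set -eu

# Slowdown of each instrumented binary relative to the Release mean (18.2 ms).
# Columns: label, mean time in ms (copied from the runs above).
overhead() {
  awk -v base=18.2 '{ printf "%-22s %.1fx\n", $1, $2 / base }' <<'EOF'
pgo_instrumented       36.5
lto_pgo_instrumented   28.0
bolt_instrumented      668.0
lto_bolt_instrumented  556.7
EOF
}

overhead
# prints:
# pgo_instrumented       2.0x
# lto_pgo_instrumented   1.5x
# bolt_instrumented      36.7x
# lto_bolt_instrumented  30.6x
```

The BOLT instrumented binaries are more than an order of magnitude slower than the PGO instrumented ones, which matters when choosing how large a training workload to run.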

Further steps

I can suggest the following action points:

Perform more PGO and PLO benchmarks on pylyzer. If they show improvements, add a note to the documentation about the performance gains available with PGO.
Provide an easier way (e.g. a build option) to build pylyzer with PGO. This helps end users and maintainers, since they can then optimize pylyzer for their own workloads.
Optimize the pre-built binaries (if any).
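
As an illustration of the "easier way" mentioned above, a build wrapper could expose the optimization level as a single argument. This is a hypothetical sketch, not an existing pylyzer script: the script name, the mode names, the target triple, the binary paths, and the run and train helpers are all assumptions. It only prints the commands unless DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
# Hypothetical build wrapper: ./build.sh [release|pgo|pgo-bolt]
set -eu

TARGET="${TARGET:-x86_64-unknown-linux-gnu}"  # assumed target triple
WORKLOAD="${WORKLOAD:-tests/test.py}"         # training input; replace with your own workload
BIN="./target/${TARGET}/release/pylyzer"      # assumed path of the instrumented binary

# Hypothetical helper: echo the command in dry-run mode, run it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Run a binary on the training workload; a non-zero exit code is tolerated.
train() { run "$1" "${WORKLOAD}" || true; }

build() {
  case "$1" in
    release)
      run cargo build --release ;;
    pgo)
      run cargo pgo build
      train "${BIN}"
      run cargo pgo optimize ;;
    pgo-bolt)
      run cargo pgo build
      train "${BIN}"
      run cargo pgo bolt build --with-pgo
      train "${BIN}-bolt-instrumented"
      run cargo pgo bolt optimize --with-pgo ;;
    *)
      echo "usage: $0 [release|pgo|pgo-bolt]" >&2
      return 2 ;;
  esac
}

build "${1:-pgo}"
```

Keeping the training workload configurable through WORKLOAD is the point of the design: users can profile pylyzer on their own code base instead of the single test file used here.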

Here are some examples of how PGO optimization is integrated into other projects:

Rustc: a CI script for the multi-stage build
GCC:
- Official docs, section "Building with profile feedback" (even AutoFDO build is supported)
- A part in a "wonderful" configure script
Clang: Docs
Python:
- CPython: README
- Pyston: README
Go: Bash script
V8: Bazel flag
ChakraCore: Scripts
Chromium: Script
Firefox: Docs
- Thunderbird has PGO support too
PHP: Makefile command and old Centminmod scripts
MySQL: CMake script
YugabyteDB: GitHub commit
FoundationDB: Script
Zstd: Makefile
Foot: Scripts
Windows Terminal: GitHub PR
Pydantic-core: GitHub PR
file.d: GitHub PR
OceanBase: CMake flag

I have some examples of how PGO information looks in the project-specific documentation:

ClickHouse: https://clickhouse.com/docs/en/operations/optimizing-performance/profile-guided-optimization
Databend: https://databend.rs/doc/contributing/pgo
Vector: https://vector.dev/docs/administration/tuning/pgo/
Nebula: https://docs.nebula-graph.io/3.5.0/8.service-tuning/enable_autofdo_for_nebulagraph/
GCC: Official docs, section "Building with profile feedback" (even AutoFDO build is supported)
Clang:
- https://llvm.org/docs/HowToBuildWithPGO.html
- https://llvm.org/docs/AdvancedBuilds.html
tsv-utils: https://github.com/eBay/tsv-utils/blob/master/docs/BuildingWithLTO.md

Regarding LLVM BOLT integration, I have the following links:

Rustc:
- Rustc itself (GitHub PR)
- LLVM in Rustc (Reddit)
CPython: GitHub PR
YDB: GitHub comment
Clang:
LDC: GitHub comment
HHVM, Proxygen and others: Facebook paper
NodeJS: Blog
Chromium: Blog
MySQL, MongoDB, memcached, Verilator: Paper

By the way, I think applying PGO and PLO to https://github.com/erg-lang/erg would be a good idea too. What do you think? If you agree, do I need to create a separate issue in the erg repo?

Another idea - what do you think about enabling LTO for the project? It can improve performance and reduce binary size as well. However, LTO and PGO currently cannot be enabled at the same time for pylyzer due to a bug in Rustc: rust-lang/rust#115344 (comment). According to the tests above, LTO + BOLT is even marginally faster than PGO + BOLT (1.01 ± 0.02 times, which is within the measurement noise).
