
WIP: add an asv-based benchmarks on cirun [skip cirrus] #4751

Closed (wanted to merge 16 commits)

Conversation

@ev-br (Contributor) commented Jun 13, 2024

No description provided.

@ev-br commented Jun 15, 2024

The goal here is to run benchmarks on a Graviton (arm64) instance on AWS, via Cirun.

The demo setup (for x86_64 for now) is https://github.com/ev-br/ob_bench and https://github.com/ev-br/ob-bench-asv with web visualization at https://ev-br.github.io/ob-bench-asv/.
(The setup is not mine, shamelessly mirrored from https://sgkit-dev.github.io/sgkit-benchmarks-asv/)

In brief:

  • The nightly wheel is in fact weekly (it gets built every Thursday), which matches the weekly cron job we want on AWS.
  • In principle, an alternative to using the wheel would be to replicate its build from source, but there is little reason to.
  • This way, the CI job does not really need to rebuild OpenBLAS.
  • As such, it does not really need to live in the main OpenBLAS repo. What is more, having it live in the main repo is confusing (it clones the source, then never touches it).

So how about the following plan, @martin-frbg:

  • clone https://github.com/ev-br/ob_bench and https://github.com/ev-br/ob-bench-asv into the OpenBLAS org, and rename them if wanted (these are placeholder names)
  • either configure the publisher repo to serve gh-pages from the main branch, or temporarily give me enough permissions to configure it (I do not ask for, or need, any permissions outside of the publisher repo!)
  • I'll adapt the benchmark repo to run on AWS cirun instead of vanilla GHA runners.
  • we will need to generate/add a token so that the benchmark and publisher repos can talk to each other, but that comes after the benchmarks run on cirun.
  • (later) I'll look into deduplicating with the codspeed benchmarks.

@martin-frbg (Collaborator):
OK, so this is a somewhat different concept from the codspeed one, not meant to flag changes caused by a particular PR? In that case, I think we could even consider running it at larger-than-weekly intervals. Giving it its own home under the OpenMathLib umbrella should also be no problem, I guess. Not sure I understand your comment about deduplicating: are you intending this to replace the codspeed setup that you committed recently? (Different architecture, different frequency of runs, different purpose as far as I understood?)
Sorry if looking at your demo setup would already answer these - I'll try to do that later today or tomorrow.

@ev-br commented Jun 15, 2024

IIUC, the concept was to run benchmarks on a cron everywhere, and codspeed then happened to be easy and free to run on each PR. AWS via cirun is trickier and not free; not exorbitantly expensive, but a cost nonetheless.

The deduplication comment is about trivial implementation details: the Python side of the benchmarking is almost, but not completely, the same. Two reasons: 1) self-built OpenBLAS vs the wheel (prefixed names, scipy_daxpy_ vs daxpy_; the Meson detection of which library to use); 2) the benchmark runners are different, so while the core benchmarks are the same, the surrounding paraphernalia differs slightly. Compare https://github.com/OpenMathLib/OpenBLAS/pull/4751/files#diff-69617a6cd63a6737e2b271070f71b8e55d40a0abb8028283baef5b14f8c6ff71R59 and https://github.com/OpenMathLib/OpenBLAS/blob/develop/benchmark/pybench/benchmarks/bench_blas.py#L25
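
To illustrate point 1), here is a minimal Python sketch of papering over the symbol-prefix difference at load time. The library filename, the ctypes-based loading, and the LP64 (32-bit integer) calling convention are assumptions for illustration; neither repo necessarily does it exactly this way.

import ctypes

import numpy as np

def load_openblas(libpath="libscipy_openblas.so"):
    # The filename is an assumption: the nightly wheel ships its own shared
    # object, while a self-built OpenBLAS would be a plain libopenblas.so.
    return ctypes.CDLL(libpath)

def get_symbol(lib, name):
    # The wheel prefixes BLAS symbols (scipy_daxpy_); a self-built OpenBLAS
    # does not (daxpy_). Try both spellings.
    for candidate in ("scipy_" + name, name):
        try:
            return getattr(lib, candidate)
        except AttributeError:
            continue
    raise AttributeError("neither scipy_%s nor %s found" % (name, name))

lib = load_openblas()
daxpy = get_symbol(lib, "daxpy_")
daxpy.restype = None

# daxpy computes y <- a*x + y, with Fortran-style pass-by-reference arguments.
n, a = 1000, 2.0
x = np.ones(n)
y = np.zeros(n)
one = ctypes.c_int(1)
daxpy(ctypes.byref(ctypes.c_int(n)),
      ctypes.byref(ctypes.c_double(a)),
      x.ctypes.data_as(ctypes.POINTER(ctypes.c_double)),
      ctypes.byref(one),
      y.ctypes.data_as(ctypes.POINTER(ctypes.c_double)),
      ctypes.byref(one))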

Merging them completely is not very easy, so I'd consider deduplicating to some degree once everything runs and we're not on a deadline. No need to worry about it right now, I'd say.

So the intended endgame is:

  • codspeed runs as is (or on a cron, up to you)
  • asv-based benchmarks run on a cron, on AWS via cirun, on a Graviton instance plus maybe a Skylake x86_64 one.

@ev-br commented Jun 15, 2024

codspeed runs as is (or on a cron, up to you)

Now I remember asking them: not easy, it cannot run on a cron; codspeed only supports pushes and pull requests.
One could work around this by having a bot push to a separate repo, but it is hardly worth it.

(Different architecture, different frequency of runs, different purpose as far as I understood?)

Yeah. Codspeed is nice but only supports x86_64.
So we trade some UI niceties and use asv on AWS. One price to pay is that the runners are different.
#4721 runs unmodified benchmarks which run on codspeed, but there is no web viewer.

@martin-frbg:
ok, so we keep the codspeed setup as a canary for performance regressions, and this here is basically creating codspeed-like performance-vs-commits graphs on arm64 at larger intervals (?) So far, so good - they could live in their own repository named something like BLAS-benchmarks. I wonder if it would make sense to create graphs of performance vs matrix size as well (like the older OpenBLAS benchmarks do), and - in light of #4744 - perhaps add baseline data for "competing" implementations too?

@ev-br commented Jun 16, 2024

ok, so we keep the codspeed setup as a canary for performance regressions, and this here is basically creating codspeed-like performance-vs-commits graphs on arm64 at larger intervals (?)

Exactly! Codspeed serves as a canary within a PR, and asv as a canary over a week's worth of PRs (or more than a week, whatever the interval ends up being).

I wonder if it would make sense to create graphs of performance vs matrix size as well (like the older OpenBLAS benchmarks do)

This exists: https://ev-br.github.io/ob-bench-asv/#benchmarks.Nrm2.time_dnrm2?x-axis=size
From the landing page, https://ev-br.github.io/ob-bench-asv/, click on a benchmark box, then in the left-side panel click on "size" under "x-axis".
Here the name "size" is set at each benchmark class: https://github.com/ev-br/ob_bench/blob/main/benchmarks/benchmarks.py#L62

and - in light of #4744 - perhaps add baseline data for "competing" implementations too?

Certainly doable. I need to sort out some plumbing, akin to what I mentioned under the "deduplicate" rubric above, to be able to link to it.

they could live in their own repository named something like BLAS-benchmarks

Great! I'll be able to start adapting them to cirun once the repository exists in the org.

EDIT: In addition to web graphs, asv produces text output, like this:

$ asv run -v

... build output snipped ...

[ 6.25%] ··· Running (benchmarks.DDot.time_ddot--)........
[56.25%] ··· benchmarks.DDot.time_ddot                                                                                                                                                                  ok
[56.25%] ··· ====== ===========
              size             
             ------ -----------
              100    413±0.8ns 
              1000    554±2ns  
             ====== ===========

[62.50%] ··· benchmarks.DSyrk.time_dsyrk                                                                                                                                                                ok
[62.50%] ··· ====== =============
              size               
             ------ -------------
              100     44.2±0.3μs 
              1000   22.7±0.06ms 
             ====== =============

[68.75%] ··· benchmarks.Daxpy.time_daxpy                                                                                                                                                                ok
[68.75%] ··· ====== ===========
              size             
             ------ -----------
              100    551±0.4ns 
              1000    747±4ns  
             ====== ===========

[75.00%] ··· benchmarks.Dgemm.time_dgemm                                                                                                                                                                ok
[75.00%] ··· ====== ============
              size              
             ------ ------------
              100    52.8±0.1μs 
              1000   42.0±0.1ms 
             ====== ============

[81.25%] ··· benchmarks.Dgesdd.time_dgesdd                                                                                                                                                              ok
[81.25%] ··· =========== =============
                (m, n)                
             ----------- -------------
                100, 5    11.4±0.01μs 
              1000, 222    29.0±0.2ms 
             =========== =============

[87.50%] ··· benchmarks.Dgesv.time_dgesv                                                                                                                                                                ok
[87.50%] ··· ====== =============
              size               
             ------ -------------
              100    70.4±0.07μs 
              1000   30.5±0.09ms 
             ====== =============

[93.75%] ··· benchmarks.Dsyev.time_dsyev                                                                                                                                                                ok
[93.75%] ··· ====== =============
              size               
             ------ -------------
               50     312±0.3μs  
              200    11.4±0.02ms 
             ====== =============

[100.00%] ··· benchmarks.Nrm2.time_dnrm2                                                                                                                                                                 ok
[100.00%] ··· ====== ===========
               size             
              ------ -----------
               100    642±0.6ns 
               1000    1.22±0μs 
              ====== ===========

This is on a c7g.large AWS instance, similar to what cirun uses. This sort of output will be available in the CI logs for each build.

@martin-frbg commented Jun 16, 2024

Good to know that "size" is doable (though having just 100 and 1000 and then drawing a line plot is a bit counterproductive).

@ev-br commented Jun 16, 2024

Of course. The benchmark functions and parameter combinations, both here and in codspeed, are proof-of-concept and are mostly thrown in to allow quick iteration.
So, what would be the most useful sizes? Also, I am currently only running the double-precision real workloads; do we want single precision and/or complex as well?

@martin-frbg commented Jun 16, 2024

So, what would be the most useful sizes?

That would very much depend on the BLAS function - most show some jitter that probably stems from cache misses at certain sizes if one does fine-grained benchmarks (like the ones in the benchmark folder, which default to checking all sizes between 1 and 1000 with a granularity of 1 - probably too expensive to do on AWS all the time).

Also, I am currently only running the double-precision real workloads; do we want single precision and/or complex as well?

Ideally yes, as in most cases each precision has its own dedicated kernel (certainly one for real and one for complex numbers). Probably all outside the scope of the milestone though...

@ev-br commented Jun 16, 2024

Well, life does not stop at a milestone :-). Let's take these to OpenMathLib/BLAS-Benchmarks#1 and OpenMathLib/BLAS-Benchmarks#2.
Luckily, tweaking benchmark parameters is super easy.
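
For instance (a sketch only: the concrete sizes are placeholders, and scipy.linalg.blas.get_blas_funcs stands in for the wheel-backed symbol lookup), covering more sizes and the s/d/c/z variants is just a wider parameter grid.

import numpy as np
from scipy.linalg.blas import get_blas_funcs  # stand-in for the wheel-backed lookup

class Nrm2:
    # Two parameter axes: problem size and the type/precision variant.
    params = ([100, 300, 1000, 3000, 10000],
              ["s", "d", "c", "z"])
    param_names = ["size", "variant"]

    def setup(self, size, variant):
        dtype = {"s": np.float32, "d": np.float64,
                 "c": np.complex64, "z": np.complex128}[variant]
        rng = np.random.default_rng(1234)
        x = rng.random(size)
        if np.issubdtype(dtype, np.complexfloating):
            x = x + 1j * rng.random(size)
        self.x = x.astype(dtype)
        # snrm2/dnrm2/scnrm2/dznrm2 naming differs per variant; let scipy
        # pick the right one from the array dtype.
        (self.nrm2,) = get_blas_funcs(("nrm2",), (self.x,))

    def time_nrm2(self, size, variant):
        self.nrm2(self.x)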

@ev-br commented Jun 21, 2024

Closing in favor of a cron job over at https://github.com/OpenMathLib/BLAS-Benchmarks/
