Inconsistent (and slowing) CI Build Times #383

Open
cpjordan opened this issue Jan 13, 2025 · 14 comments
@cpjordan (Contributor)

Issue Summary
The CI build times for Thetis are inconsistent, with the latest build taking up to 2 hours to complete. Since November 2024, build times have otherwise ranged between 50 and 90 minutes. The expected runtime is approximately 45 minutes.

Observations
Normal build time: ~42-46 minutes
Last normal build time: October 7th 2024 for the main build, October 31st 2024 for the last update
The Thetis build was then broken for nearly all of November 2024; the fix on the 28th restored it, but that build took ~1.5 hours.
It has since varied between 50 minutes and ~80 minutes.
The latest build has ~83,000 warnings (previously ~29,000, many repeated in both cases), which may explain why this particular instance is so slow.

With #376 taking a month, it is difficult to know whether it caused the change in run time or whether something else has slowed things down dependency-wise. Even the quickest run is still ~5-8 minutes slower than pre-November 2024. Furthermore, the previous variation in run time was only 4-5 minutes, rather than ~30-40 minutes.

@stephankramer (Contributor)

Great catch. Let's see if we can reproduce that "last normal build" (see #384) by using the firedrake-vanilla/2024-10 image of the Firedrake container. If that one's still "good", we could try 2024-11.

@stephankramer (Contributor)

Alongside that, we should probably take some of these warnings a bit more seriously indeed.
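
One low-effort way to take a specific warning more seriously is to promote it to an error in the test environment. A minimal sketch using the standard `warnings` module; this is not an existing Thetis configuration, and the message pattern below is illustrative only:

```python
# Minimal sketch: turn a specific deprecation FutureWarning into a hard error
# so offending call sites fail loudly in CI, leaving other warnings untouched.
# The message pattern is illustrative, not an agreed Thetis policy.
import warnings

warnings.filterwarnings(
    "error",
    category=FutureWarning,
    message=r".*interpolate.*deprecated",
)
```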

@stephankramer (Contributor)

So rerunning firedrake-vanilla 2024-10 with 8c6f9d1 indeed gives back the old performance - see https://github.com/thetisproject/thetis/actions/runs/12754435282/job/35548240308 - and only has 1277 warnings! So maybe we should indeed take a closer look at these :-)

@stephankramer (Contributor)

Having a stab at fixing some warnings (see #385), but it looks like most of them come from Firedrake's use of loopy, so they are presumably safe to ignore.
Let's also print timings of individual tests, so we can see whether any particular tests have become slower (see #386).
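
A minimal sketch of one way to report per-test timings from a `conftest.py` hook; this is not necessarily what #386 implements, and pytest's built-in `--durations=N` option gives a similar report:

```python
# conftest.py -- minimal sketch: collect the duration of each test's call
# phase and print a sorted summary at the end of the session.
durations = {}

def pytest_runtest_logreport(report):
    # report.when is "setup", "call" or "teardown"; keep only the call phase.
    if report.when == "call":
        durations[report.nodeid] = report.duration

def pytest_terminal_summary(terminalreporter):
    terminalreporter.section("per-test call durations")
    for nodeid, duration in sorted(durations.items(), key=lambda kv: -kv[1]):
        terminalreporter.write_line(f"{duration:8.2f}s  {nodeid}")
```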

@stephankramer (Contributor)

stephankramer commented Jan 16, 2025

So it seems to be a more or less consistent increase in runtime for all tests.

Here's the percentage increase for all tests > 1s, sorted from left to right by longest to shortest test. They are all slower, and many of them show an increase of 200%, i.e. three times as slow!

Actually, scratch that: there was a bug in my script. The performance regression is much more varied between tests:

[Image: per-test percentage increase in runtime, sorted from longest to shortest test]

@thangel (Contributor)

thangel commented Jan 16, 2025

Wow - this is quite substantial! What could be the cause of this?

@stephankramer (Contributor)

No idea. It looks to be a Firedrake issue (or something else in the larger environment), as it appears after 8c6f9d1 and none of the later Thetis commits would make all tests slower. I guess the next step is to see if we can reproduce it locally and then do some profiling to see what has started taking more time. It might be that this is only an issue when starting with a "cold" cache, i.e. that it is just the compilation that has become slower, in which case it would not affect production runs as much.
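
For local profiling, a minimal sketch using only pytest and the standard library; the test selection below is illustrative (any single Thetis test would do), and running it once with a cold cache and again with a hot cache should show where the extra time goes:

```python
# Minimal sketch: profile a single Thetis test in-process so the profile
# covers Firedrake/TSFC/PyOP2 code generation and compilation as well as the
# solve itself.  The test selection is illustrative.
import cProfile
import pstats
import pytest

profiler = cProfile.Profile()
profiler.enable()
pytest.main(["test/tracerEq/test_consistency_2d.py",
             "-k", "test_nonconst_tracer[DIRK33]", "-q"])
profiler.disable()

pstats.Stats(profiler).sort_stats("cumulative").print_stats(25)
```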

@stephankramer (Contributor)

See the update above. On average it is still just as bad, but the slowdown is concentrated in some tests that have increased substantially, which should hopefully be easier to track down.

@cpjordan (Contributor, Author)

Thanks for the above @stephankramer - I will also try to look into this once my machines are available. If any changes caused by an incorrect implementation on my part have led to a slowdown, my guesses would be #376 or #373.

I installed Firedrake and Thetis for a student this week and Thetis is substantially slower to run (consistent with the increased latest build time). Running the bottom friction test on that machine gives the warnings for interpolate, which really needs to be updated:

/home/s1626662/firedrake/src/firedrake/firedrake/interpolation.py:385: FutureWarning: The use of `interpolate` to perform the numerical interpolation is deprecated.
...

and the PyOP2 one that comes up often (which I think is fine?):

/home/s1626662/firedrake/src/firedrake/pyop2/types/glob.py:278: UserWarning: PyOP2.Global has no comm, this is likely to break in parallel!
  warnings.warn("PyOP2.Global has no comm, this is likely to break in parallel!")

but these new ones are perhaps what is causing the most recent slowdown in build times:

/home/s1626662/firedrake/src/loopy/loopy/schedule/tools.py:871: UserWarning: Cannot satisfy constraint that iname 'n' must be nested within 'layer''.
  cannot_satisfy_callback("Cannot satisfy constraint that"
/home/s1626662/firedrake/lib/python3.10/site-packages/pytools/persistent_dict.py:597: CollisionWarning: loopy-code-gen-cache-v3-2024.1-islpy2024.2-cgen2020.1-v1: key collision in cache at '/home/s1626662/.cache/pytools' -- these are sufficiently unlikely that they're often indicative of a broken hash key implementation (that is not considering some elements relevant for equality comparison)
  return self.fetch(key)
/home/s1626662/firedrake/lib/python3.10/site-packages/pytools/persistent_dict.py:597: CollisionWarning: loopy-memoize-cache-preprocess_program-LoopyKeyBuilder.LoopyKeyBuilder-v0-2024.1-islpy2024.2-cgen2020.1-v1: key collision in cache at '/home/s1626662/.cache/pytools' -- these are sufficiently unlikely that they're often indicative of a broken hash key implementation (that is not considering some elements relevant for equality comparison)
  return self.fetch(key)
/home/s1626662/firedrake/lib/python3.10/site-packages/pytools/persistent_dict.py:597: CollisionWarning: loopy-schedule-cache-v4-2024.1-islpy2024.2-cgen2020.1-v1: key collision in cache at '/home/s1626662/.cache/pytools' -- these are sufficiently unlikely that they're often indicative of a broken hash key implementation (that is not considering some elements relevant for equality comparison)
  return self.fetch(key)
/home/s1626662/firedrake/src/loopy/loopy/schedule/tools.py:871: UserWarning: Cannot satisfy constraint that iname 'n' must be nested within 'layer''.
  cannot_satisfy_callback("Cannot satisfy constraint that"
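
As an aside on the first warning above: in recent Firedrake the numerical interpolation has moved into assemble, so the deprecated call pattern can usually be replaced as in the minimal sketch below (the mesh and function space are illustrative only):

```python
# Minimal sketch of the non-deprecated interpolation pattern in recent
# Firedrake: interpolate(...) builds a symbolic object and assemble(...)
# performs the numerical work.  Mesh and space are illustrative.
from firedrake import (UnitSquareMesh, FunctionSpace, SpatialCoordinate,
                       Function, assemble, interpolate)

mesh = UnitSquareMesh(8, 8)
V = FunctionSpace(mesh, "CG", 1)
x, y = SpatialCoordinate(mesh)

# Deprecated: f = interpolate(x * y, V)   # triggers the FutureWarning above
f = assemble(interpolate(x * y, V))       # returns a Function in V

# Interpolating into an existing Function remains valid:
g = Function(V).interpolate(x * y)
```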

@connorjward - could we borrow your expertise on the latter set of warnings? Has there been a recent update that might be causing this? Do you have any idea whether there is something we need to address in Thetis, or whether this is a potential issue with Firedrake/loopy etc.? These are new warnings, which I think might be linked to the slowdown we've had between 7th and 13th Jan.

@connorjward (Contributor)

We have long held suspicions of possible performance regressions with our CI (e.g. @pbrubeck messaged me this morning about this) but it's really hard to pin things down. Performance varies hugely depending on how overloaded the machines are. I have updated loopy a number of times recently, which would probably be the most likely place for a CI performance regression to come from, but performance didn't appear to get immediately worse afterwards.

We are also seeing the same warnings in our CI. They only get raised for a handful of tests though so I doubt that they are significant wrt the slowdown.

Not strictly related but we are soon going to be getting two new replacement CI runners that should be a fair bit faster.

I have had one idea that might help speed things up. I will see what I can do.

@stephankramer (Contributor)

stephankramer commented Jan 16, 2025

So I've singled out one test, run it in firedrake-vanilla:2024-10 and firedrake-vanilla:latest, and compared a run with a cold cache (first run in a clean container) and a hot cache (third run of the same test), just running in Docker on my own laptop. In firedrake-vanilla:2024-10 the first run (cold cache) takes 79s and the third run 27s. In firedrake-vanilla:latest the first run takes 299s and the third run still takes 255s. See below for the associated flamegraphs:

[Flamegraph: firedrake-vanilla:2024-10, cold cache (~79s)]

[Flamegraph: firedrake-vanilla:2024-10, hot cache (~27s)]

[Flamegraph: firedrake-vanilla:latest, cold cache (~299s)]

[Flamegraph: firedrake-vanilla:latest, hot cache (~255s)]

So it seems that with the latest Firedrake it still spends an extraordinary amount of time in compilation, even though it doesn't appear to be creating any new tsfc or pyop2 kernels? @cpjordan @pbrubeck does this tell you anything?

@connorjward (Contributor)

connorjward commented Jan 16, 2025

This is great! Which test is this? I will try and fix it tomorrow.

Also I have had one idea that might help improve things for everyone using our runners: firedrakeproject/firedrake#3980. If this works then I will submit PRs to Thetis and Gusto. (I was wrong)

@stephankramer (Contributor)

It's

python -mpytest test/tracerEq/test_consistency_2d.py -k"test_nonconst_tracer[DIRK33]"
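
For anyone wanting to reproduce the cold-cache versus hot-cache comparison on that test outside a container, a minimal timing loop; this assumes a working Thetis checkout and that Firedrake's disk caches are what speed up the later runs:

```python
# Minimal sketch: time three consecutive runs of the singled-out test,
# mirroring the cold-cache (first run) vs hot-cache (third run) comparison.
import subprocess
import time

CMD = ["python", "-m", "pytest",
       "test/tracerEq/test_consistency_2d.py",
       "-k", "test_nonconst_tracer[DIRK33]", "-q"]

for attempt in range(1, 4):
    start = time.perf_counter()
    subprocess.run(CMD, check=True)
    print(f"run {attempt}: {time.perf_counter() - start:.1f}s")
```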

@connorjward (Contributor)

With firedrakeproject/firedrake#3982 I have managed to get things back down to 45s with a hot cache. I've created another issue which describes where we can improve things next.
