Inconsistent (and slowing) CI Build Times #383

Open
cpjordan opened this issue Jan 13, 2025 · 14 comments
@cpjordan (Contributor)

Issue Summary
The CI build times for Thetis are inconsistent, with the latest build taking up to 2 hours to complete. Since November 2024, build times have otherwise ranged between 50 and 90 minutes. The expected runtime is approximately 45 minutes.

Observations
Normal build time: ~42-46 minutes
Last normal build time: October 7th 2024 for the main build, October 31st 2024 for the last update
The Thetis build was then broken for nearly all of November 2024; the fix on the 28th restored it, but that build took ~1.5 hours.
It has since varied between 50 minutes and ~80 minutes.
The latest build has ~83,000 warnings (previously ~29,000, many repeated in both cases), which may explain why this particular instance is so slow.

With #376 taking a month, it is difficult to know whether it caused the change in run time or whether something else has slowed things down dependency-wise. Even the quickest run is still ~5-8 minutes slower than pre-November 2024. Furthermore, the previous variation in run time was only 4-5 minutes, rather than ~30-40 minutes.

@stephankramer (Contributor)

Great catch. Let's see if we can reproduce that "last normal build" (see #384) by using the firedrake-vanilla/2024-10 image of the Firedrake container. If that one's still "good", we could try 2024-11.

@stephankramer (Contributor)

Alongside that, we should probably take some of these warnings a bit more seriously indeed.
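
One low-effort way to take a specific warning more seriously is to promote it to an error in the test environment. A minimal sketch using the standard `warnings` module; this is not an existing Thetis configuration, and the message pattern below is illustrative only:

```python
# Minimal sketch: turn a specific deprecation FutureWarning into a hard error
# so offending call sites fail loudly in CI, leaving other warnings untouched.
# The message pattern is illustrative, not an agreed Thetis policy.
import warnings

warnings.filterwarnings(
    "error",
    category=FutureWarning,
    message=r".*interpolate.*deprecated",
)
```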

@stephankramer (Contributor)

So rerunning firedrake-vanilla 2024-10 with 8c6f9d1 indeed gives back the old performance - see https://github.com/thetisproject/thetis/actions/runs/12754435282/job/35548240308 - and only has 1277 warnings! So maybe we should indeed take a closer look at these :-)

@stephankramer (Contributor)

Having a stab at fixing some warnings (see #385), but it looks like most of them come from Firedrake's use of loopy, so they are presumably safe to ignore.
Let's also print timings of individual tests, so we can see whether any particular tests have become slower (see #386).
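
A minimal sketch of one way to report per-test timings from a `conftest.py` hook; this is not necessarily what #386 implements, and pytest's built-in `--durations=N` option gives a similar report:

```python
# conftest.py -- minimal sketch: collect the duration of each test's call
# phase and print a sorted summary at the end of the session.
durations = {}

def pytest_runtest_logreport(report):
    # report.when is "setup", "call" or "teardown"; keep only the call phase.
    if report.when == "call":
        durations[report.nodeid] = report.duration

def pytest_terminal_summary(terminalreporter):
    terminalreporter.section("per-test call durations")
    for nodeid, duration in sorted(durations.items(), key=lambda kv: -kv[1]):
        terminalreporter.write_line(f"{duration:8.2f}s  {nodeid}")
```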

@stephankramer (Contributor)

stephankramer commented Jan 16, 2025

So it seems to be a more or less consistent increase in runtime for all tests.

Here's the percentage increase for all tests > 1s, sorted from left to right by longest to shortest test. They are all slower, and many of them show an increase of 200%, i.e. three times as slow!

Actually, scratch that: there was a bug in my script. The performance regression is much more varied between tests:

[Image: per-test percentage increase in runtime, sorted from longest to shortest test]

@thangel (Contributor)

thangel commented Jan 16, 2025

Wow - this is quite substantial! What could be the cause of this?

@stephankramer (Contributor)

No idea. It looks to be a Firedrake issue (or something else in the larger environment), as it appears after 8c6f9d1 and none of the later Thetis commits would make all tests slower. I guess the next step is to see if we can reproduce it locally and then do some profiling to see what has started taking more time. It might be that this is only an issue when starting with a "cold" cache, i.e. that it is just the compilation that has become slower, in which case it would not affect production runs as much.
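
For local profiling, a minimal sketch using only pytest and the standard library; the test selection below is illustrative (any single Thetis test would do), and running it once with a cold cache and again with a hot cache should show where the extra time goes:

```python
# Minimal sketch: profile a single Thetis test in-process so the profile
# covers Firedrake/TSFC/PyOP2 code generation and compilation as well as the
# solve itself.  The test selection is illustrative.
import cProfile
import pstats
import pytest

profiler = cProfile.Profile()
profiler.enable()
pytest.main(["test/tracerEq/test_consistency_2d.py",
             "-k", "test_nonconst_tracer[DIRK33]", "-q"])
profiler.disable()

pstats.Stats(profiler).sort_stats("cumulative").print_stats(25)
```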

@stephankramer (Contributor)

See the update above. On average it is still just as bad, but the slowdown is concentrated in some tests that have increased substantially, which should hopefully be easier to track down.

@cpjordan (Contributor, Author)

Thanks for the above @stephankramer - I will also try to look into this once my machines are available. If any changes caused by an incorrect implementation on my part have led to a slowdown, my guesses would be #376 or #373.

I installed Firedrake and Thetis for a student this week and Thetis is substantially slower to run (consistent with the increased latest build time). Running the bottom friction test on that machine gives the warnings for interpolate, which really needs to be updated:

/home/s1626662/firedrake/src/firedrake/firedrake/interpolation.py:385: FutureWarning: The use of `interpolate` to perform the numerical interpolation is deprecated.
...

and the PyOP2 one that comes up often (which I think is fine?):

/home/s1626662/firedrake/src/firedrake/pyop2/types/glob.py:278: UserWarning: PyOP2.Global has no comm, this is likely to break in parallel!
  warnings.warn("PyOP2.Global has no comm, this is likely to break in parallel!")

but these new ones are perhaps what is causing the most recent slowdown in build times:

/home/s1626662/firedrake/src/loopy/loopy/schedule/tools.py:871: UserWarning: Cannot satisfy constraint that iname 'n' must be nested within 'layer''.
  cannot_satisfy_callback("Cannot satisfy constraint that"
/home/s1626662/firedrake/lib/python3.10/site-packages/pytools/persistent_dict.py:597: CollisionWarning: loopy-code-gen-cache-v3-2024.1-islpy2024.2-cgen2020.1-v1: key collision in cache at '/home/s1626662/.cache/pytools' -- these are sufficiently unlikely that they're often indicative of a broken hash key implementation (that is not considering some elements relevant for equality comparison)
  return self.fetch(key)
/home/s1626662/firedrake/lib/python3.10/site-packages/pytools/persistent_dict.py:597: CollisionWarning: loopy-memoize-cache-preprocess_program-LoopyKeyBuilder.LoopyKeyBuilder-v0-2024.1-islpy2024.2-cgen2020.1-v1: key collision in cache at '/home/s1626662/.cache/pytools' -- these are sufficiently unlikely that they're often indicative of a broken hash key implementation (that is not considering some elements relevant for equality comparison)
  return self.fetch(key)
/home/s1626662/firedrake/lib/python3.10/site-packages/pytools/persistent_dict.py:597: CollisionWarning: loopy-schedule-cache-v4-2024.1-islpy2024.2-cgen2020.1-v1: key collision in cache at '/home/s1626662/.cache/pytools' -- these are sufficiently unlikely that they're often indicative of a broken hash key implementation (that is not considering some elements relevant for equality comparison)
  return self.fetch(key)
/home/s1626662/firedrake/src/loopy/loopy/schedule/tools.py:871: UserWarning: Cannot satisfy constraint that iname 'n' must be nested within 'layer''.
  cannot_satisfy_callback("Cannot satisfy constraint that"
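
As an aside on the first warning above: in recent Firedrake the numerical interpolation has moved into assemble, so the deprecated call pattern can usually be replaced as in the minimal sketch below (the mesh and function space are illustrative only):

```python
# Minimal sketch of the non-deprecated interpolation pattern in recent
# Firedrake: interpolate(...) builds a symbolic object and assemble(...)
# performs the numerical work.  Mesh and space are illustrative.
from firedrake import (UnitSquareMesh, FunctionSpace, SpatialCoordinate,
                       Function, assemble, interpolate)

mesh = UnitSquareMesh(8, 8)
V = FunctionSpace(mesh, "CG", 1)
x, y = SpatialCoordinate(mesh)

# Deprecated: f = interpolate(x * y, V)   # triggers the FutureWarning above
f = assemble(interpolate(x * y, V))       # returns a Function in V

# Interpolating into an existing Function remains valid:
g = Function(V).interpolate(x * y)
```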

@connorjward - could we borrow your expertise on the latter set of warnings? Has there been a recent update that might be causing this? Do you have any idea whether there is something we need to address in Thetis, or whether this is a potential issue with Firedrake/loopy etc.? These are new warnings, which I think might be linked to the slowdown we've had between 7th and 13th Jan.

@connorjward (Contributor)

We have long held suspicions of possible performance regressions with our CI (e.g. @pbrubeck messaged me this morning about this) but it's really hard to pin things down. Performance varies hugely depending on how overloaded the machines are. I have updated loopy a number of times recently, which would probably be the most likely place for a CI performance regression to come from, but performance didn't appear to get immediately worse afterwards.

We are also seeing the same warnings in our CI. They only get raised for a handful of tests though so I doubt that they are significant wrt the slowdown.

Not strictly related but we are soon going to be getting two new replacement CI runners that should be a fair bit faster.

I have had one idea that might help speed things up. I will see what I can do.

@stephankramer (Contributor)

stephankramer commented Jan 16, 2025

So I've singled out one test, run it in firedrake-vanilla:2024-10 and firedrake-vanilla:latest, and compared a run with a cold cache (first run in a clean container) and a hot cache (third run of the same test), just running in Docker on my own laptop. In firedrake-vanilla:2024-10 the first run (cold cache) takes 79s and the third run 27s. In firedrake-vanilla:latest the first run takes 299s and the third run still takes 255s. See below for the associated flamegraphs:

[Flamegraph: firedrake-vanilla:2024-10, cold cache (~79s)]

[Flamegraph: firedrake-vanilla:2024-10, hot cache (~27s)]

[Flamegraph: firedrake-vanilla:latest, cold cache (~299s)]

[Flamegraph: firedrake-vanilla:latest, hot cache (~255s)]

So it seems that with the latest Firedrake it still spends an extraordinary amount of time in compilation, even though it doesn't appear to be creating any new tsfc or pyop2 kernels? @cpjordan @pbrubeck does this tell you anything?

@connorjward (Contributor)

connorjward commented Jan 16, 2025

This is great! Which test is this? I will try and fix it tomorrow.

Also I have had one idea that might help improve things for everyone using our runners: firedrakeproject/firedrake#3980. If this works then I will submit PRs to Thetis and Gusto. (I was wrong)

@stephankramer (Contributor)

It's

python -mpytest test/tracerEq/test_consistency_2d.py -k"test_nonconst_tracer[DIRK33]"
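
For anyone wanting to reproduce the cold-cache versus hot-cache comparison on that test outside a container, a minimal timing loop; this assumes a working Thetis checkout and that Firedrake's disk caches are what speed up the later runs:

```python
# Minimal sketch: time three consecutive runs of the singled-out test,
# mirroring the cold-cache (first run) vs hot-cache (third run) comparison.
import subprocess
import time

CMD = ["python", "-m", "pytest",
       "test/tracerEq/test_consistency_2d.py",
       "-k", "test_nonconst_tracer[DIRK33]", "-q"]

for attempt in range(1, 4):
    start = time.perf_counter()
    subprocess.run(CMD, check=True)
    print(f"run {attempt}: {time.perf_counter() - start:.1f}s")
```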

@connorjward (Contributor)

With firedrakeproject/firedrake#3982 I have managed to get things back down to 45s with a hot cache. I've created another issue which describes where we can improve things next.
