-
Notifications
You must be signed in to change notification settings - Fork 30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Inconsistent (and slowing) CI Build Times #383
Comments
Great catch. Let's see if we can reproduce that "last normal build" (see #384 ) by using firedrake-vanilla/2024-10 image of the firedrake container. If that one's still "good", we could try 2024-11 |
Next to that we should probably take some of these warnings a bit more seriously indeed |
So rerunning firedrake-vanilla 2024-10 with 8c6f9d1 indeed gives back the old performance - see https://github.com/thetisproject/thetis/actions/runs/12754435282/job/35548240308 - and only has 1277 warnings! So maybe we should take a closer look at these indeed :-) |
Actually scratch that: there's was a bug in my script - actually the performance regression is a lot more varied than that...it's actually much more varied between tests: |
Wow - this is quite substantial! What could be the cause of this? |
No idea. It looks to be a Firedrake issue (or something else in the larger environment) as it's after 8c6f9d1 and none of the later Thetis commits would make all tests slower. I guess the next step is to see if we can reproduce locally and then do some profiling to see what's started taking more time. It might be that this is an issue only when starting with a "cold" cache, i.e. that it's just the compilation that has become slower, so then it would not affect production runs as much |
See update above. So on average still just as bad, just concentrated more on some tests that have substantially increased - should hopefully be easier to track down. |
Thanks for the above @stephankramer - I will also try to look into this once my machines are available. If any changes due to incorrect implementation on my part have caused a slow down, my guesses would be #376 or #373. I installed Firedrake and Thetis for a student this week and Thetis is substantially slower to run (consistent with the increased latest build time). Running the bottom friction test for that machine there are the warnings for interpolate which really needs to be updated:
and the PyOP2 one that comes up often (which i think is fine?):
but these new ones are perhaps what is causing the most recent reduction in build times:
@connorjward - could we borrow your expertise on the latter set of warnings? Has there been a recent update that might be causing this, do you have any idea whether there is something we need to address somewhere in Thetis or is this a potential issue with Firedrake/loopy etc.? This is a new warning, which I think might be linked to the slow down we've had between 7th-13th Jan. |
We have long held suspicions of possible performance regressions with our CI (e.g. @pbrubeck messaged me this morning about this) but it's really hard to pin things down. Performance varies hugely depending on how overloaded the machines are. I have updated loopy a number of times recently, which would probably be the most likely place for a CI performance regression to come from, but performance didn't appear to get immediately worse afterwards. We are also seeing the same warnings in our CI. They only get raised for a handful of tests though so I doubt that they are significant wrt the slowdown. Not strictly related but we are soon going to be getting two new replacement CI runners that should be a fair bit faster. I have had one idea that might help speed things up. I will see what I can do. |
So I've singled out a single test and run it in firedrake-vanilla:2024-10 and firedrake-vanilla:latest and compared a run with a cold cache (first run in a clean container) and a hot cache (third run of the same test). Just running in docker on my own laptop. In firedrake-vanilla:2024-10 the first run (cold cache) takes 79s and the third run 27s. When running in firedrake-vanilla:latest the first run is 299s, the third run is still 255s. See below for the associated flamegraphs:
So it seems with latest firedrake, it still spends an extraordinary time in compilation even though it doesn't appear to be creating any new tsfc or pyop2 kernels? @cpjordan @pbrubeck does this tell anything? |
This is great! Which test is this? I will try and fix it tomorrow.
|
It's
|
With firedrakeproject/firedrake#3982 I have managed to get things back down to 45s with a hot cache. I've created another issue which describes where we can improve things next. |
Issue Summary
The CI build times for Thetis are inconsistent, with the latest build taking up to 2 hours to complete. Build times have otherwise started to range between 50 and 90 minutes since November 2024. The expected runtime is approximately 45 minutes.
Observations
Normal build time: ~42-46 minutes
Last normal build time: October 7th 2024 for main build, October 31st 2024 for last update
The Thetis build was then broken nearly all of November 2024 until the 28th which fixed it but took ~1.5hrs.
It has since varied between 50 minutes and ~80 minutes.
The latest build has ~83000 warnings (previously ~29000 warnings, many repeated in both cases), which may explain this particular instance being so slow.
With #376 taking a month, it is difficult to know if this caused the change in run time or something else has slowed things down dependency-wise. Even for the quickest run, it is still ~5-8 minutes slower than pre-November 2024. Furthermore, the previous variation in run time was only 4-5 minutes, rather than ~30-40 minutes.
The text was updated successfully, but these errors were encountered: