Some fundamental concerns #73
Replies: 2 comments 4 replies
-
Your concerns are all valid, and we have looked into each of them. Let me break it down:
So although many caveats apply, the approach is at least based on some theoretically valid assumptions, and we have seen it work with appropriately configured machines.

Now to answer your question in full: we simply do not know whether the values GitHub claims for its machines are correct. They can say the machine has a fixed frequency, or that it is an AMD Epyc processor, when in reality it is not. The hypervisor values can be faked. Eco-CI can therefore only give you a sound estimation, but nothing more. On top of that, Eco-CI is not intended to provide "estimations for optimization", but really only for accounting, and for that the approach is quite feasible. We looked into the optimization idea in this case study: https://www.green-coding.io/case-studies/ci-pipeline-energy-variability/ and saw that, due to network latency and resource over-subscription, even super-simple pipelines tend to fluctuate by 30% or more. So no optimization is possible in GitHub Actions from the start, no matter how accurate Eco-CI is :)

Regarding the overhead: yes, this is a fair point. It is too high. Period. The way forward is to implement even stronger caching through a container and also to pre-train the model. This would probably reduce the overhead by 90% or more. See issues and approaches here: We are very happy to accept PRs on this, as Eco-CI is currently one of our free open source works for the community.

Also a nudge: since I saw you work for Siemens, maybe you can even convince your employer to support an open source project financially. We would be very grateful and super eager to devote more time to making Eco-CI better, and we hope we have already given some good head-start work. With more time for us to work on it, there is more good work to come :)

In any case: thanks for taking the time to write the discussion. I believe this is very helpful for us and also for anyone else who reads it. ty!
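To make the fluctuation argument concrete, here is a toy illustration (all numbers invented for demonstration, not taken from the case study):

```python
# Toy illustration (numbers invented, not from the case study):
# with ~30 % run-to-run jitter, single-run comparisons cannot
# reveal a modest optimization.
import statistics

# Five runs of the *same* pipeline, fluctuating around ~100 J:
runs = [82, 104, 127, 95, 118]

mean = statistics.mean(runs)
spread = (max(runs) - min(runs)) / mean  # relative min-max spread
print(f"mean={mean:.1f} J, spread={spread:.0%}")  # → mean=105.2 J, spread=43%

# A 10 % improvement (~10 J) is far smaller than the observed spread,
# so one measurement before and after an "optimization" proves nothing.
```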
-
I already pinged you in a different thread, but wanted to answer here as well for completeness. We have overhauled the plugin, and PR #76 has removed a lot of dependencies: no packages are installed anymore, and no Docker container is downloaded. Happy if you give it a spin, and looking forward to your opinion now. It is also unclear to me why GitHub still takes 6 s for the measurement step, as detailed in the PR. Maybe you have some pointers / ideas.
-
Hi all!
While looking into some high-level slides about green computing and the Green Software Foundation, I thought: "Why not make this concrete? Why not have a CI extension that tells me roughly what I emitted?" Then I found this project and was pleased that it already exists. But now, after trying it out and thinking about it further, I'm no longer so sure. Here are my concerns:
How realistic are the numbers produced by this measurement?
As far as I understood, the model basically just looks at CPU usage. It does not consider the base consumption of the runner or peaks caused by peripherals such as storage or network interfaces (not to speak of GPU accelerators). Did you compare the numbers generated by the model on a local server against a power meter attached to that machine? Do the numbers scale realistically? Can we ignore the other factors without getting even relative results wrong? I lack confidence that the visualized numbers do not suggest wrong optimizations. And that leads to my second concern.
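For reference, the kind of CPU-only estimate I am questioning can be sketched like this (the wattage figures and the linear interpolation are my own invention for illustration, not Eco-CI's actual model):

```python
# Minimal sketch of a CPU-utilization-only power model (NOT Eco-CI's
# actual model; the idle/full-load wattages and the linear interpolation
# are invented for illustration). Storage, network and GPU are ignored.
IDLE_W = 50.0   # assumed idle power draw of the runner, in watts
FULL_W = 180.0  # assumed full-load power draw, in watts

def power_estimate(cpu_utilization: float) -> float:
    """Estimated power draw in watts for a utilization in [0, 1]."""
    return IDLE_W + (FULL_W - IDLE_W) * cpu_utilization

def energy_joules(samples: list[float], interval_s: float = 1.0) -> float:
    """Integrate estimated power over utilization samples, one per interval."""
    return sum(power_estimate(u) * interval_s for u in samples)

# 60 s at a constant 50 % utilization:
print(energy_joules([0.5] * 60))  # → 6900.0 (joules)
```

Anything that draws power without raising CPU utilization (disk I/O waits, network transfers, an idle GPU) is invisible to such a model, which is exactly why I am asking about validation against a physical power meter.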
How to avoid wrongly optimizing the runner at the price of causing emissions, possibly high ones, elsewhere?
Do you have numbers on what caching services consume? Looking at #14, it does not seem that clear. It is fair to ignore external factors while optimizing, provided the ratio between their estimated impact and the part under our own control is negligible (say, 1:10).
Last but not least: I would strongly recommend not configuring these CI steps unconditionally. Activate them only every n-th run or on explicit request, to get an update after noteworthy pipeline changes. Even the cached execution is still too heavy IMHO, given all the dependencies it installs. And the execution time can apparently also go up (https://github.com/siemens/kas/actions/runs/9336544184/job/25697237499 vs. https://github.com/siemens/kas/actions/runs/9331141542/job/25685599701). Is that something your https://github.com/green-kernel is supposed to address?
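The every-n-th-run gating could look roughly like this (a sketch: the period of 10 and the helper name are my invention; `GITHUB_RUN_NUMBER` itself is an environment variable provided by GitHub Actions):

```python
# Sketch: run the measurement step only on every n-th pipeline execution.
# GITHUB_RUN_NUMBER is set by the GitHub Actions runner; the period of 10
# and the should_measure() helper are invented for illustration.
import os

def should_measure(run_number: int, period: int = 10) -> bool:
    """True on every `period`-th run (run numbers start at 1)."""
    return run_number % period == 0

if __name__ == "__main__":
    run_number = int(os.environ.get("GITHUB_RUN_NUMBER", "1"))
    # Print the decision as a key=value pair so a workflow step could
    # append it to $GITHUB_OUTPUT and gate later steps with an `if:`.
    print(f"run_measurement={str(should_measure(run_number)).lower()}")
```

The printed key/value pair could then be appended to `$GITHUB_OUTPUT` and checked in an `if:` condition on the measurement step, so the other nine out of ten runs skip it entirely.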
I know we have to start somewhere, but given that overall emissions are what counts, I'm a bit reluctant to perform and promote local measurements too early. Until then, I would rather stick with existing, "free" KPIs like reducing CI time or avoiding job executions altogether.