[clusteragent/autoscaling] Add telemetry metrics for autoscaling controller #28115

jennchenn · 2024-08-01T04:24:44Z

What does this PR do?

Add the following telemetry metrics around the autoscaling controller:

autoscaling.queue_*
autoscaling.status_conditions
workload_autoscaling.horizontal_scaling
workload_autoscaling.vertical_scaling

Motivation

Improve monitoring and debugging.

Additional Notes

Possible Drawbacks / Trade-offs

Describe how to test/QA your changes

Deploy the Cluster Agent with the autoscaling controller activated. The autoscaling.* metrics should be available on the Prometheus endpoint (localhost:5000/metrics).
With Agent >= 7.57, the metrics should also be scraped by the DCA integration.

pr-commenter · 2024-08-01T05:12:42Z

Regression Detector

Regression Detector Results

Run ID: 38014940-d618-4655-a287-e9a5434b8577 Metrics dashboard Target profiles

Baseline: 6108bcf
Comparison: c6f367c

Performance changes are noted in the perf column of each table:

✅ = significantly better comparison variant performance
❌ = significantly worse comparison variant performance
➖ = no significant change in performance

No significant changes in experiment optimization goals

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

There were no significant changes in experiment optimization goals at this confidence level and effect size tolerance.

Fine details of change detection per experiment

perf	experiment	goal	Δ mean %	Δ mean % CI	links
➖	otel_to_otel_logs	ingress throughput	+0.82	[+0.00, +1.63]	Logs
➖	idle	memory utilization	+0.31	[+0.28, +0.35]	Logs
➖	file_tree	memory utilization	+0.29	[+0.23, +0.35]	Logs
➖	tcp_dd_logs_filter_exclude	ingress throughput	+0.00	[-0.01, +0.01]	Logs
➖	uds_dogstatsd_to_api	ingress throughput	-0.00	[-0.00, +0.00]	Logs
➖	basic_py_check	% cpu utilization	-0.08	[-2.86, +2.70]	Logs
➖	uds_dogstatsd_to_api_cpu	% cpu utilization	-0.43	[-1.31, +0.46]	Logs
➖	pycheck_1000_100byte_tags	% cpu utilization	-0.67	[-5.46, +4.12]	Logs
➖	tcp_syslog_to_blackhole	ingress throughput	-1.69	[-14.22, +10.84]	Logs

Explanation

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:

Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
Its configuration does not mark it "erratic".
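
The three criteria above can be sketched as a small Go predicate. This is an illustration of the stated rule, not the Regression Detector's actual code: the Experiment type and the isRegression name are hypothetical, and the Erratic flag is assumed false for the rows shown because the table does not report it.

```go
package main

import (
	"fmt"
	"math"
)

// Experiment (hypothetical type) holds one row of the per-experiment table.
type Experiment struct {
	Name      string
	DeltaMean float64 // Δ mean %
	CILow     float64 // lower bound of the Δ mean % CI
	CIHigh    float64 // upper bound of the Δ mean % CI
	Erratic   bool    // assumed: set from the experiment's configuration
}

// Effect size tolerance quoted in the report: |Δ mean %| ≥ 5.00%.
const effectSizeTolerance = 5.00

// isRegression applies the three criteria from the explanation:
// a large enough effect, a CI that excludes zero, and a non-erratic experiment.
func isRegression(e Experiment) bool {
	bigEnough := math.Abs(e.DeltaMean) >= effectSizeTolerance
	excludesZero := e.CILow > 0 || e.CIHigh < 0
	return bigEnough && excludesZero && !e.Erratic
}

func main() {
	rows := []Experiment{
		{Name: "idle", DeltaMean: 0.31, CILow: 0.28, CIHigh: 0.35},
		{Name: "tcp_syslog_to_blackhole", DeltaMean: -1.69, CILow: -14.22, CIHigh: 10.84},
	}
	for _, e := range rows {
		fmt.Printf("%s regression=%v\n", e.Name, isRegression(e))
	}
}
```

Both rows come out as non-regressions for different reasons: idle has a CI that excludes zero but an effect far below 5%, while tcp_syslog_to_blackhole has a CI that straddles zero.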

pkg/clusteragent/autoscaling/workload/controller_horizontal.go

pkg/clusteragent/autoscaling/controller.go

pkg/clusteragent/autoscaling/workload/model/pod_autoscaler.go

pkg/clusteragent/autoscaling/workload/telemetry.go

pkg/clusteragent/autoscaling/workload/controller_vertical.go

pkg/clusteragent/autoscaling/workload/controller_horizontal.go

pkg/clusteragent/autoscaling/workload/telemetry.go

pr-commenter · 2024-08-01T22:30:47Z

Test changes on VM

Use this command from test-infra-definitions to manually test this PR's changes on a VM:

inv create-vm --pipeline-id=41250820 --os-family=ubuntu

Note: This applies to commit c6f367c

ogaca-dd

LGTM for files owned by ASC

vboulineau · 2024-08-09T07:04:08Z

/merge

dd-devflow · 2024-08-09T07:04:14Z

🚂 MergeQueue: pull request added to the queue

The median merge time in main is 22m.

Use /merge -c to cancel this operation!

jennchenn added 3 commits July 29, 2024 15:29

Add workqueue load telemetry

15cb327

fixup! Add workqueue load telemetry

6b47815

Add telemetry for recommendation values and status conditions

692b415

vboulineau reviewed Aug 1, 2024

jennchenn added 6 commits August 1, 2024 15:44

Pass workqueue as param to autoscaling controller constructor

9d16bfc

Use gauge for applied recommendation values

8a718ea

Use gauge for autoscaler status

e4d60f1

fixup! Use gauge for applied recommendation values

7823b3a

Add telemetry for received recommendation values

09489d2

Add join leader label to metrics

55c36e4

vboulineau added 2 commits August 8, 2024 15:03

Update with latest changes

c3a49a4

Merge remote-tracking branch 'origin/main' into jenn/CONTAS-1_add-aut…

0c876fb

…oscaling-telemetry

vboulineau marked this pull request as ready for review August 8, 2024 13:10

vboulineau requested review from a team as code owners August 8, 2024 13:10

vboulineau added this to the 7.57.0 milestone Aug 8, 2024

vboulineau added the team/containers, changelog/no-changelog, and qa/done (QA done before merge and regressions are covered by tests) labels Aug 8, 2024

vboulineau approved these changes Aug 8, 2024

ogaca-dd approved these changes Aug 8, 2024

vboulineau force-pushed the jenn/CONTAS-1_add-autoscaling-telemetry branch from eb5f081 to 523619c August 8, 2024 13:30

Fix merge

c6f367c

vboulineau force-pushed the jenn/CONTAS-1_add-autoscaling-telemetry branch from 523619c to c6f367c August 8, 2024 16:29

dd-mergequeue bot merged commit 2c247a1 into main Aug 9, 2024
218 checks passed

dd-mergequeue bot deleted the jenn/CONTAS-1_add-autoscaling-telemetry branch August 9, 2024 07:23

jennchenn added the component/autoscaling label Nov 5, 2024
