Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

design/impl: control plane metrics monitoring #1955

Closed
wants to merge 1 commit into from

Conversation

Xunzhuo
Copy link
Member

@Xunzhuo Xunzhuo commented Oct 11, 2023

What type of PR is this?

What this PR does / why we need it:

design: control plane metrics monitoring

Which issue(s) this PR fixes:

Fixes #700

https://gateway.xunzhuo.cafe/latest/design/eg-metrics

@Xunzhuo Xunzhuo force-pushed the design-cp-obs-metrics branch 2 times, most recently from 2eea36e to 0b9f85e Compare October 11, 2023 13:55
@codecov
Copy link

codecov bot commented Oct 11, 2023

Codecov Report

Merging #1955 (bf0b87a) into main (fa55804) will decrease coverage by 1.10%.
The diff coverage is 36.75%.

@@            Coverage Diff             @@
##             main    #1955      +/-   ##
==========================================
- Coverage   65.53%   64.43%   -1.10%     
==========================================
  Files          92      101       +9     
  Lines       13528    13965     +437     
==========================================
+ Hits         8865     8999     +134     
- Misses       4114     4407     +293     
- Partials      549      559      +10     
Files Coverage Δ
api/v1alpha1/validation/validate.go 74.30% <100.00%> (ø)
...ternal/infrastructure/kubernetes/infra_resource.go 81.65% <100.00%> (+2.26%) ⬆️
internal/provider/kubernetes/predicates.go 54.78% <100.00%> (ø)
internal/provider/kubernetes/routes.go 56.00% <100.00%> (ø)
internal/xds/translator/runner/runner.go 69.81% <100.00%> (+0.58%) ⬆️
internal/admin/server.go 56.75% <66.66%> (ø)
internal/provider/kubernetes/filters.go 52.30% <66.66%> (ø)
internal/xds/translator/extension.go 74.66% <93.75%> (+2.27%) ⬆️
internal/cmd/server.go 17.68% <0.00%> (-0.37%) ⬇️
internal/gatewayapi/runner/runner.go 22.60% <25.00%> (-0.41%) ⬇️
... and 13 more

... and 1 file with indirect coverage changes

@Xunzhuo Xunzhuo force-pushed the design-cp-obs-metrics branch 2 times, most recently from eb8d7cf to 5ca340d Compare October 11, 2023 14:07
@Xunzhuo
Copy link
Member Author

Xunzhuo commented Oct 11, 2023

I have set up metrics pkg basement which can be easily used in different component, and we support push/pull metrics.
But I want to know what/where metrics we should build in each components:

  • Provider ( already have controller-runtime metrics )
  • GWAPI Translator
  • Infra Manager
  • xDS Translator
  • Extension Manager

I would like to hear from @envoyproxy/gateway-maintainers .

@Xunzhuo Xunzhuo force-pushed the design-cp-obs-metrics branch from 5ca340d to 542d663 Compare October 11, 2023 14:13
@arkodg
Copy link
Contributor

arkodg commented Oct 11, 2023

thanks for kickstarting this @Xunzhuo, hoping this lands in v0.6.0

sharing my thoughts on what we should be measuring

  • Input Resources from K8s
    • kubernetes_provider_reconcile_duration_seconds - gauge that measures how much time was spent in the reconciler , note this may already be provided by controller-runtime (haven't looked into this yet)
  • Translate Resources
    • <runner_duration_seconds - gauge that measures how much time was spent in every runner during a subscribe event e.g. in here, with a unique label for every runner
  • Output Resources to Envoy
    • xds_push_duration_seconds - gauge that measures time taken to push xds from xds server to envoy
  • Write Status to API Server
    • <resource_status_write_kubernetes_duration_seconds> - gauge that measures how much time it took to write status back to k8s server (this func) with a unique label for every resource type

request to add a design doc that highlights a few things, which are already answered by your PR

  • why this is needed
  • what are we exactly measuring
  • how we are observing the metrics

@Xunzhuo Xunzhuo force-pushed the design-cp-obs-metrics branch 13 times, most recently from 29ccc76 to 83d8d2e Compare October 12, 2023 07:07
@Xunzhuo Xunzhuo added the area/observability Observability related issues label Oct 12, 2023
@Xunzhuo Xunzhuo added this to the 0.6.0-rc1 milestone Oct 12, 2023
@Xunzhuo Xunzhuo force-pushed the design-cp-obs-metrics branch 2 times, most recently from c398366 to aff1fa3 Compare October 12, 2023 09:21
@@ -0,0 +1,634 @@
---
date: 2023-10-10
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ptal @envoyproxy/gateway-reviewers @envoyproxy/gateway-maintainers

@Xunzhuo Xunzhuo force-pushed the design-cp-obs-metrics branch 2 times, most recently from a13322e to 02ad089 Compare October 13, 2023 02:43
@Xunzhuo Xunzhuo force-pushed the design-cp-obs-metrics branch 5 times, most recently from 19ebd3a to 2b1963b Compare October 13, 2023 10:02
@Xunzhuo Xunzhuo changed the title design: control plane metrics monitoring design/impl: control plane metrics monitoring Oct 13, 2023
@Xunzhuo Xunzhuo marked this pull request as ready for review October 13, 2023 10:29
@Xunzhuo Xunzhuo requested a review from a team as a code owner October 13, 2023 10:29
@Xunzhuo Xunzhuo force-pushed the design-cp-obs-metrics branch 2 times, most recently from af6b271 to 4f90b09 Compare October 16, 2023 02:36
@Xunzhuo
Copy link
Member Author

Xunzhuo commented Oct 16, 2023

required: #1967

@Xunzhuo Xunzhuo force-pushed the design-cp-obs-metrics branch 5 times, most recently from 480123f to 643eeff Compare October 16, 2023 12:23
@@ -69,6 +69,12 @@ type EnvoyGatewaySpec struct {
// +optional
Debug *EnvoyGatewayDebug `json:"debug,omitempty"`

// Telemetry defines the desired control plane telemetry related abilities.
// If unspecified, the telemetry is used with default configuration.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the default configuration ? is it opt in ?

}
return e.Telemetry
}
e.Telemetry = DefaultEnvoyGatewayTelemetry()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are we enabling prometheus metrics by default ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think most people in Kubernetes are/will use prometheus, it should always enable by default.

//
// +optional
// +kubebuilder:default=true
Enable bool `json:"enable,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need an explicit enable flag, isn't the instantiation of EnvoyGatewayPrometheusProvider an implicit enable ?

// +kubebuilder:default=OpenTelemetry
Type MetricSinkType `json:"type"`
// Host define the sink service hostname.
Host string `json:"host"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we start using urls here instead of using 3 fields ?
cc @zirain

)

// metrics label definitions
// component is which component the update belong to.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you cannot map a single component to watchable, because it involves a subscriber component and a publisher component.
Also we internally call components runners


import "github.com/envoyproxy/gateway/internal/metrics"

var (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these metrics need to be split up in terms of subscribe and publish

@@ -0,0 +1,73 @@
// Copyright Envoy Gateway Authors
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we take this library out as a separate PR ?

Copy link
Member Author

@Xunzhuo Xunzhuo Oct 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I would like to separate it with 3 PR:

  • design + api
  • library
  • metrics instrumentation

@Xunzhuo Xunzhuo force-pushed the design-cp-obs-metrics branch 3 times, most recently from 82ae3fb to aec4cb8 Compare October 17, 2023 07:54
Signed-off-by: bitliu <bitliu@tencent.com>
@Xunzhuo Xunzhuo force-pushed the design-cp-obs-metrics branch from aec4cb8 to bf0b87a Compare October 17, 2023 08:00
@Xunzhuo
Copy link
Member Author

Xunzhuo commented Oct 17, 2023

Close in favor of PR separations

@Xunzhuo Xunzhuo closed this Oct 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/observability Observability related issues
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Observability: Envoy Gateway
3 participants