Proxies connected to the secondary gateway do not receive configuration #4845

Closed · cico78 opened this issue Dec 4, 2024 · 7 comments · Fixed by #4767

cico78 commented Dec 4, 2024

Description:

When a proxy is connected to a secondary gateway, it does not receive configuration.
This is new behaviour after the upgrade to 1.2.3 and did not happen with version 1.1.0.

Repro steps:
Have two gateways configured with leader election (default configuration, version 1.2.3). Scale up the Envoy deployment and check the proxies connected to the secondary gateway. Those proxies do not receive the configuration and complain that the initial fetch is timing out (see logs).

Those proxies will be able to get the configuration once the primary gateway is deleted and the secondary becomes the new primary.

Environment:

Gateway version: 1.2.3
Envoy version: 1.32.1
Kubernetes: EKS 1.28

Logs:

Default log level is set to debug

Logs from the Envoy proxy PROXY-POD-NAME:

...
[2024-12-04 07:48:34.237][1][info][config] [source/server/configuration_impl.cc:124] loading 0 static secret(s)
[2024-12-04 07:48:34.237][1][info][config] [source/server/configuration_impl.cc:130] loading 3 cluster(s)
[2024-12-04 07:48:34.241][1][info][config] [source/server/configuration_impl.cc:138] loading 1 listener(s)
[2024-12-04 07:48:34.242][1][warning][misc] [source/extensions/filters/network/http_connection_manager/config.cc:88] internal_address_config is not configured. The existing default behaviour will trust RFC1918 IP addresses, but this will be changed in next release. Please explictily config internal address config as the migration step or config the envoy.reloadable_features.explicit_internal_address_config to true to untrust all ips by default
[2024-12-04 07:48:34.243][1][info][config] [source/server/configuration_impl.cc:154] loading stats configuration
[2024-12-04 07:48:34.285][1][info][main] [source/server/server.cc:990] starting main dispatch loop
[2024-12-04 07:48:34.296][1][info][runtime] [source/common/runtime/runtime_impl.cc:631] RTDS has finished initialization
[2024-12-04 07:48:34.296][1][info][upstream] [source/common/upstream/cluster_manager_impl.cc:245] cm init: initializing cds
[2024-12-04 07:48:49.295][1][warning][config] [source/extensions/config_subscription/grpc/grpc_subscription_impl.cc:130] gRPC config: initial fetch timed out for type.googleapis.com/envoy.config.cluster.v3.Cluster
[2024-12-04 07:48:49.295][1][info][upstream] [source/common/upstream/cluster_manager_impl.cc:249] cm init: all clusters initialized
[2024-12-04 07:48:49.295][1][info][main] [source/server/server.cc:970] all clusters initialized. initializing init manager
[2024-12-04 07:49:04.295][1][warning][config] [source/extensions/config_subscription/grpc/grpc_subscription_impl.cc:130] gRPC config: initial fetch timed out for type.googleapis.com/envoy.config.listener.v3.Listener
[2024-12-04 07:49:04.295][1][info][config] [source/common/listener_manager/listener_manager_impl.cc:944] all dependencies initialized. starting workers

Logs from the secondary gateway:

...
2024-12-04T07:45:15.240Z	INFO	provider	kubernetes/controller.go:607	processing OIDC HMAC Secret	{"runner": "provider", "namespace": "envoy-gateway-system", "name": "envoy-oidc-hmac"}
2024-12-04T07:45:15.240Z	INFO	provider	kubernetes/controller.go:387	processing Backend	{"runner": "provider", "kind": "Service", "namespace": "XXX", "name": "YYY"}
2024-12-04T07:45:15.240Z	INFO	provider	kubernetes/controller.go:401	added Service to resource tree	{"runner": "provider", "namespace": "XXX", "name": "YYY"}
2024-12-04T07:45:15.240Z	INFO	provider	kubernetes/controller.go:459	added EndpointSlice to resource tree	{"runner": "provider", "namespace": "XXX", "name": "YYY-IDf"}
2024-12-04T07:45:15.240Z	INFO	provider	kubernetes/controller.go:387	processing Backend	{"runner": "provider", "kind": "Service", "namespace": "XXX", "name": "YYY"}
2024-12-04T07:45:15.240Z	INFO	provider	kubernetes/controller.go:401	added Service to resource tree	{"runner": "provider", "namespace": "XXX", "name": "YYY"}
2024-12-04T07:48:37.332Z	DEBUG	xds-server	cache/snapshotcache.go:297	First incremental discovery request on stream 1, got nodeID PROXY-POD-NAME-ID
2024-12-04T07:48:37.332Z	INFO	xds-server	v3/simple.go:571	open delta watch ID:1 for type.googleapis.com/envoy.config.cluster.v3.Cluster Resources:map[] from nodeID: "PROXY-POD-NAME-ID"
2024-12-04T07:48:49.297Z	INFO	xds-server	v3/simple.go:571	open delta watch ID:2 for type.googleapis.com/envoy.config.listener.v3.Listener Resources:map[] from nodeID: "PROXY-POD-NAME-ID"
cico78 added the triage label Dec 4, 2024
zetaab (Contributor) commented Dec 4, 2024

We have the same issue: rate limiting is not working in 1.2.3 with multiple controller replicas. We scaled from 2 -> 1 and everything seems to be working again.

zetaab (Contributor) commented Dec 4, 2024

cc @arkodg

arkodg added the kind/bug, help wanted, cherrypick/release-v1.1.4 and cherrypick/release-v1.2.4 labels and removed the triage label Dec 4, 2024
arkodg added this to the v1.3.0-rc.1 milestone Dec 4, 2024
kraashen commented Dec 5, 2024

> we have same issue, rate limiting is not working in 1.2.3 with multiple controller replicas. We scaled from 2 -> 1 and it seems that everything is working again

Debug logs from the rate limit pods when running v1.2.3 and sending multiple GET requests to a configured API endpoint:

time="2024-12-04T09:09:37Z" level=debug msg="caught error during call: no rate limit configuration loaded"
time="2024-12-04T09:09:37Z" level=debug msg="caught error during call: no rate limit configuration loaded"
time="2024-12-04T09:09:38Z" level=debug msg="caught error during call: no rate limit configuration loaded"
time="2024-12-04T09:09:38Z" level=debug msg="caught error during call: no rate limit configuration loaded"

arkodg (Contributor) commented Dec 7, 2024

This is a regression from #4809, most likely because the secondary's status updater will never be ready (wg.Add(1) will not be called for it), so the client will block until the pod becomes the leader. This poses a problem because there are two client status calls made directly from the provider goroutine.

Should these function calls be avoided and the requests sent over via watchable to the status updater, or should we avoid making these calls when the controller is not the leader? @alexwo @zhaohuabing
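
For illustration, a minimal sketch of the blocking behaviour described above, with hypothetical names and a plain channel standing in for the actual readiness mechanism in Envoy Gateway: the status client is gated on a signal that only fires when the replica wins leader election, so a call made directly from the provider goroutine on a secondary replica never returns.

```go
// Hypothetical sketch, not the actual Envoy Gateway code: a status client
// gated on a leader-election readiness signal. On a secondary replica the
// signal never fires, so callers block indefinitely.
package main

import (
	"fmt"
	"time"
)

type statusUpdater struct {
	ready chan struct{} // closed only when this replica becomes the leader
}

func newStatusUpdater() *statusUpdater {
	return &statusUpdater{ready: make(chan struct{})}
}

// onElectedLeader unblocks all pending and future status calls.
func (u *statusUpdater) onElectedLeader() {
	close(u.ready)
}

// UpdateStatus waits for leadership before writing status.
func (u *statusUpdater) UpdateStatus(obj string) {
	<-u.ready
	fmt.Println("status written for", obj)
}

func main() {
	u := newStatusUpdater()

	done := make(chan struct{})
	go func() {
		// Called directly from the provider goroutine; on a secondary
		// replica onElectedLeader is never invoked, so this never returns.
		u.UpdateStatus("gateway/eg")
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("update completed")
	case <-time.After(2 * time.Second):
		fmt.Println("update blocked: replica never became leader")
	}
}
```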

zhaohuabing (Member) commented Dec 7, 2024

Looks like we also got a regression of the 503 issue #4685 (comment).

My previous PR tried to consolidate all the status updates into the watchable so they won't block the senders, but there's a race condition between the gateway-api runner and the provider runner. I need more time to dig into this if we want to solve it with this approach.
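
As an illustration of that approach, a minimal sketch with plain channels standing in for the actual watchable package and all names hypothetical: runners enqueue status updates and return immediately, and only the leader replica drains the queue and writes status, so non-leader replicas never block their callers.

```go
// Hypothetical sketch of a non-blocking status pipeline: senders enqueue
// updates; a leader-only consumer drains the queue and writes status.
package main

import (
	"context"
	"fmt"
	"time"
)

type statusUpdate struct {
	resource string
	message  string
}

type statusQueue struct {
	ch chan statusUpdate
}

func newStatusQueue() *statusQueue {
	return &statusQueue{ch: make(chan statusUpdate, 1024)}
}

// Send never blocks the provider or gateway-api runners; if the buffer is
// full the update is dropped (a real implementation would coalesce updates
// per resource instead).
func (q *statusQueue) Send(u statusUpdate) {
	select {
	case q.ch <- u:
	default:
	}
}

// Run is started only after this replica wins leader election.
func (q *statusQueue) Run(ctx context.Context) {
	for {
		select {
		case <-ctx.Done():
			return
		case u := <-q.ch:
			fmt.Printf("writing status for %s: %s\n", u.resource, u.message)
		}
	}
}

func main() {
	q := newStatusQueue()

	// Runners send updates regardless of leadership and are never blocked.
	q.Send(statusUpdate{resource: "gateway/eg", message: "Accepted"})

	// Leader-only consumer.
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	q.Run(ctx)
}
```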

arkodg (Contributor) commented Dec 7, 2024

@zhaohuabing afaik there is no regression for the 503 issue with 1 EG replica.

kraashen commented
Also confirming that this fixes the issue with the ratelimit pods not receiving configuration when running multiple gateway replicas, which resulted in requests returning HTTP 200 even when rate limits were hit. The ratelimit pods seem to work OK now.
