Proxies connected to the secondary gateway do not receive configuration #4845

Closed · cico78 opened this issue Dec 4, 2024 · 7 comments · Fixed by #4767

cico78 commented Dec 4, 2024

Description:

When a proxy is connected to a secondary gateway, it does not receive configuration.
This is new behaviour after the upgrade to 1.2.3 and did not happen with version 1.1.0.

Repro steps:
Have two gateways configured with leader election (default configuration, version 1.2.3). Scale up the Envoy deployment and check the proxies connected to the secondary gateway. Those proxies do not receive the configuration and complain that the initial fetch is timing out (see logs).

Those proxies will be able to get the configuration once the primary gateway is deleted and the secondary becomes the new primary.

Environment:

Gateway version: 1.2.3
Envoy version: 1.32.1
Kubernetes: EKS 1.28

Logs:

Default log level is set to debug

Logs from the Envoy proxy PROXY-POD-NAME:

...
[2024-12-04 07:48:34.237][1][info][config] [source/server/configuration_impl.cc:124] loading 0 static secret(s)
[2024-12-04 07:48:34.237][1][info][config] [source/server/configuration_impl.cc:130] loading 3 cluster(s)
[2024-12-04 07:48:34.241][1][info][config] [source/server/configuration_impl.cc:138] loading 1 listener(s)
[2024-12-04 07:48:34.242][1][warning][misc] [source/extensions/filters/network/http_connection_manager/config.cc:88] internal_address_config is not configured. The existing default behaviour will trust RFC1918 IP addresses, but this will be changed in next release. Please explictily config internal address config as the migration step or config the envoy.reloadable_features.explicit_internal_address_config to true to untrust all ips by default
[2024-12-04 07:48:34.243][1][info][config] [source/server/configuration_impl.cc:154] loading stats configuration
[2024-12-04 07:48:34.285][1][info][main] [source/server/server.cc:990] starting main dispatch loop
[2024-12-04 07:48:34.296][1][info][runtime] [source/common/runtime/runtime_impl.cc:631] RTDS has finished initialization
[2024-12-04 07:48:34.296][1][info][upstream] [source/common/upstream/cluster_manager_impl.cc:245] cm init: initializing cds
[2024-12-04 07:48:49.295][1][warning][config] [source/extensions/config_subscription/grpc/grpc_subscription_impl.cc:130] gRPC config: initial fetch timed out for type.googleapis.com/envoy.config.cluster.v3.Cluster
[2024-12-04 07:48:49.295][1][info][upstream] [source/common/upstream/cluster_manager_impl.cc:249] cm init: all clusters initialized
[2024-12-04 07:48:49.295][1][info][main] [source/server/server.cc:970] all clusters initialized. initializing init manager
[2024-12-04 07:49:04.295][1][warning][config] [source/extensions/config_subscription/grpc/grpc_subscription_impl.cc:130] gRPC config: initial fetch timed out for type.googleapis.com/envoy.config.listener.v3.Listener
[2024-12-04 07:49:04.295][1][info][config] [source/common/listener_manager/listener_manager_impl.cc:944] all dependencies initialized. starting workers

Logs from the secondary gateway:

...
2024-12-04T07:45:15.240Z	INFO	provider	kubernetes/controller.go:607	processing OIDC HMAC Secret	{"runner": "provider", "namespace": "envoy-gateway-system", "name": "envoy-oidc-hmac"}
2024-12-04T07:45:15.240Z	INFO	provider	kubernetes/controller.go:387	processing Backend	{"runner": "provider", "kind": "Service", "namespace": "XXX", "name": "YYY"}
2024-12-04T07:45:15.240Z	INFO	provider	kubernetes/controller.go:401	added Service to resource tree	{"runner": "provider", "namespace": "XXX", "name": "YYY"}
2024-12-04T07:45:15.240Z	INFO	provider	kubernetes/controller.go:459	added EndpointSlice to resource tree	{"runner": "provider", "namespace": "XXX", "name": "YYY-IDf"}
2024-12-04T07:45:15.240Z	INFO	provider	kubernetes/controller.go:387	processing Backend	{"runner": "provider", "kind": "Service", "namespace": "XXX", "name": "YYY"}
2024-12-04T07:45:15.240Z	INFO	provider	kubernetes/controller.go:401	added Service to resource tree	{"runner": "provider", "namespace": "XXX", "name": "YYY"}
2024-12-04T07:48:37.332Z	DEBUG	xds-server	cache/snapshotcache.go:297	First incremental discovery request on stream 1, got nodeID PROXY-POD-NAME-ID
2024-12-04T07:48:37.332Z	INFO	xds-server	v3/simple.go:571	open delta watch ID:1 for type.googleapis.com/envoy.config.cluster.v3.Cluster Resources:map[] from nodeID: "PROXY-POD-NAME-ID"
2024-12-04T07:48:49.297Z	INFO	xds-server	v3/simple.go:571	open delta watch ID:2 for type.googleapis.com/envoy.config.listener.v3.Listener Resources:map[] from nodeID: "PROXY-POD-NAME-ID"
cico78 added the triage label Dec 4, 2024
zetaab (Contributor) commented Dec 4, 2024

We have the same issue: rate limiting is not working in 1.2.3 with multiple controller replicas. We scaled from 2 -> 1 and everything seems to be working again.

zetaab (Contributor) commented Dec 4, 2024

cc @arkodg

arkodg added the kind/bug, help wanted, cherrypick/release-v1.1.4 and cherrypick/release-v1.2.4 labels and removed the triage label Dec 4, 2024
arkodg added this to the v1.3.0-rc.1 milestone Dec 4, 2024
kraashen commented Dec 5, 2024

> we have same issue, rate limiting is not working in 1.2.3 with multiple controller replicas. We scaled from 2 -> 1 and it seems that everything is working again

Debug logs from the rate limit pods when running v1.2.3 and sending multiple GET requests to a configured API endpoint:

time="2024-12-04T09:09:37Z" level=debug msg="caught error during call: no rate limit configuration loaded"
time="2024-12-04T09:09:37Z" level=debug msg="caught error during call: no rate limit configuration loaded"
time="2024-12-04T09:09:38Z" level=debug msg="caught error during call: no rate limit configuration loaded"
time="2024-12-04T09:09:38Z" level=debug msg="caught error during call: no rate limit configuration loaded"

arkodg (Contributor) commented Dec 7, 2024

This is a regression from #4809, most likely because the secondary's status updater will never be ready (wg.Add(1) will not be called for it), so the client will block until the pod becomes the leader. This poses a problem because there are two client status calls made directly from the provider goroutine.

Should these function calls be avoided and the requests sent over via watchable to the status updater, or should we avoid making these calls when the controller is not the leader? @alexwo @zhaohuabing
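
For illustration, a minimal sketch of the blocking behaviour described above, with hypothetical names and a plain channel standing in for the actual readiness mechanism in Envoy Gateway: the status client is gated on a signal that only fires when the replica wins leader election, so a call made directly from the provider goroutine on a secondary replica never returns.

```go
// Hypothetical sketch, not the actual Envoy Gateway code: a status client
// gated on a leader-election readiness signal. On a secondary replica the
// signal never fires, so callers block indefinitely.
package main

import (
	"fmt"
	"time"
)

type statusUpdater struct {
	ready chan struct{} // closed only when this replica becomes the leader
}

func newStatusUpdater() *statusUpdater {
	return &statusUpdater{ready: make(chan struct{})}
}

// onElectedLeader unblocks all pending and future status calls.
func (u *statusUpdater) onElectedLeader() {
	close(u.ready)
}

// UpdateStatus waits for leadership before writing status.
func (u *statusUpdater) UpdateStatus(obj string) {
	<-u.ready
	fmt.Println("status written for", obj)
}

func main() {
	u := newStatusUpdater()

	done := make(chan struct{})
	go func() {
		// Called directly from the provider goroutine; on a secondary
		// replica onElectedLeader is never invoked, so this never returns.
		u.UpdateStatus("gateway/eg")
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("update completed")
	case <-time.After(2 * time.Second):
		fmt.Println("update blocked: replica never became leader")
	}
}
```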

zhaohuabing (Member) commented Dec 7, 2024

Looks like we also got a regression of the 503 issue #4685 (comment).

My previous PR tried to consolidate all the status updates into the watchable so they won't block the senders, but there's a race condition between the gateway-api runner and the provider runner. I need more time to dig into this if we want to solve it with this approach.
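
As an illustration of that approach, a minimal sketch with plain channels standing in for the actual watchable package and all names hypothetical: runners enqueue status updates and return immediately, and only the leader replica drains the queue and writes status, so non-leader replicas never block their callers.

```go
// Hypothetical sketch of a non-blocking status pipeline: senders enqueue
// updates; a leader-only consumer drains the queue and writes status.
package main

import (
	"context"
	"fmt"
	"time"
)

type statusUpdate struct {
	resource string
	message  string
}

type statusQueue struct {
	ch chan statusUpdate
}

func newStatusQueue() *statusQueue {
	return &statusQueue{ch: make(chan statusUpdate, 1024)}
}

// Send never blocks the provider or gateway-api runners; if the buffer is
// full the update is dropped (a real implementation would coalesce updates
// per resource instead).
func (q *statusQueue) Send(u statusUpdate) {
	select {
	case q.ch <- u:
	default:
	}
}

// Run is started only after this replica wins leader election.
func (q *statusQueue) Run(ctx context.Context) {
	for {
		select {
		case <-ctx.Done():
			return
		case u := <-q.ch:
			fmt.Printf("writing status for %s: %s\n", u.resource, u.message)
		}
	}
}

func main() {
	q := newStatusQueue()

	// Runners send updates regardless of leadership and are never blocked.
	q.Send(statusUpdate{resource: "gateway/eg", message: "Accepted"})

	// Leader-only consumer.
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	q.Run(ctx)
}
```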

arkodg (Contributor) commented Dec 7, 2024

@zhaohuabing afaik there is no regression for the 503 issue with 1 EG replica.

kraashen commented
Also confirming that this fixes the issue with the ratelimit pods not receiving configuration when running multiple gateway replicas, which resulted in requests returning HTTP 200 even when rate limits were hit. The ratelimit pods seem to work OK now.
