
bug: envoy proxy pods always restart when configuration changes #2965

Closed
zetaab opened this issue Mar 18, 2024 · 7 comments
zetaab (Contributor) commented Mar 18, 2024

Description:

 % kubectl get pods -n envoy-gateway-system
NAME                                          READY   STATUS        RESTARTS   AGE
envoy-eg-external-a514c411-5ffc5d664b-74g5j   2/2     Running       0          2d19h
envoy-eg-external-a514c411-5ffc5d664b-dhmmg   2/2     Running       0          2d19h
envoy-eg-internal-7f4ff7e4-7fb9c8d7df-8kjgk   2/2     Terminating   0          27s
envoy-eg-internal-7f4ff7e4-7fb9c8d7df-krcqc   2/2     Terminating   0          25s
envoy-eg-internal-7f4ff7e4-8685896d59-4z8n8   1/2     Terminating   0          4m31s
envoy-eg-internal-7f4ff7e4-8685896d59-gqtd7   2/2     Running       0          6s
envoy-eg-internal-7f4ff7e4-8685896d59-lrd4f   2/2     Running       0          4s
envoy-gateway-5987f4589-9h6ts                 1/1     Running       0          3d

When I modify resources like HTTPRoutes, it leads to Envoy pod restarts. This is not a good situation when running external load balancers in front of Envoy. I do understand that Envoy drains connections. However, since Envoy uses externalTrafficPolicy: Local by default, the external load balancer marks only the nodes that host Envoy pods as healthy. So when these pods move between machines, it always takes 10-30 seconds (depending on how the load balancer health checks are configured) before services start replying again.
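For context, a minimal sketch of the kind of Service sitting in front of Envoy here (name, selector, and ports are hypothetical). With externalTrafficPolicy: Local, traffic is only routed to nodes that actually host an Envoy pod, so the external load balancer has to re-learn node health every time the pods move:

apiVersion: v1
kind: Service
metadata:
  name: envoy-eg-internal            # hypothetical name
  namespace: envoy-gateway-system
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local       # only nodes hosting an Envoy pod pass the LB health check
  selector:
    app.kubernetes.io/name: envoy    # hypothetical selector
  ports:
    - name: https
      port: 443
      targetPort: 10443
      protocol: TCP
  # with Local, Kubernetes allocates a healthCheckNodePort for the external LB to probe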

Repro steps:
Use a Service of type LoadBalancer in front of Envoy; if needed, set the external load balancer health check interval to 60 seconds (to see how it really behaves). Then modify HTTPRoute configurations (a minimal example follows) and watch the pods restart and move between Kubernetes nodes -> the services become unavailable for some seconds.
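A minimal HTTPRoute of the kind whose modification triggers the rollout (names and hostname are hypothetical); editing the rules or hostnames is enough to reproduce:

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: echo                 # hypothetical
  namespace: default
spec:
  parentRefs:
    - name: eg-internal      # hypothetical Gateway name
  hostnames:
    - echo.example.com
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /
      backendRefs:
        - name: echo         # hypothetical backend Service
          port: 8080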

Instead of restarting pods, the Envoy configuration should be reloaded. Avoid modifying the Kubernetes Deployment spec itself all the time: each modification causes downtime when using external load balancers with externalTrafficPolicy: Local, because the health checks are not that fast.

Environment:
Envoy Gateway (eg) 1.0.0

@zetaab zetaab added the triage label Mar 18, 2024
arkodg (Contributor) commented Mar 18, 2024

@zetaab can you share a config to repro this? Is this specific to setting mergeGateways to true?
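For reference, a sketch of the mergeGateways setting being asked about, assuming it is enabled through an EnvoyProxy resource referenced from the GatewayClass (resource names are hypothetical):

apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyProxy
metadata:
  name: proxy-config               # hypothetical
  namespace: envoy-gateway-system
spec:
  mergeGateways: true              # all Gateways of this class share one Envoy fleet
---
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: eg
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller
  parametersRef:
    group: gateway.envoyproxy.io
    kind: EnvoyProxy
    name: proxy-config
    namespace: envoy-gateway-system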

cnvergence (Member) commented:

Possibly related: #2637

github-actions bot commented:

This issue has been automatically marked as stale because it has not had activity in the last 30 days.

@github-actions github-actions bot added the stale label Apr 17, 2024
arkodg (Contributor) commented Apr 18, 2024

Does this issue still exist, @zetaab?

@github-actions github-actions bot removed the stale label Apr 19, 2024
zetaab (Contributor, Author) commented Apr 19, 2024

I have not tried EG since then; I need to revisit when I have time.

zetaab (Contributor, Author) commented Apr 25, 2024

@arkodg the issue still exists with 1.0.1, at least. I am using merged gateways. When I create additional Gateways, it always restarts the Envoy pods. Here is a diff between the ReplicaSets:

>         - containerPort: 10443
>           name: echose-ebb62894
>           protocol: TCP
220,222d222
<           protocol: TCP
<         - containerPort: 10443
<           name: envoy-938fd695

So the issue is container port naming in the Kubernetes Deployment configuration. I'm not sure whether #3130 is part of 1.0.1? It could fix this, but I'm not sure.
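For anyone reproducing this, a sketch of how such a diff can be produced; the ReplicaSet names below are placeholders taken from the pod listing at the top of this issue:

# dump both ReplicaSets and diff them (the pod template is what changed)
kubectl get rs envoy-eg-internal-7f4ff7e4-7fb9c8d7df -n envoy-gateway-system -o yaml > old.yaml
kubectl get rs envoy-eg-internal-7f4ff7e4-8685896d59 -n envoy-gateway-system -o yaml > new.yaml
diff old.yaml new.yaml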

zetaab (Contributor, Author) commented Apr 26, 2024

I can confirm that this works much better on the latest main, so I think this is solved.
