
Envoy gateway AWS LB can't be cleaned up by EKS #2939

Closed
liyihuang opened this issue Mar 15, 2024 · 6 comments
@liyihuang

Description:
When I use Envoy Gateway on AWS and delete the LoadBalancer-type Service, the EKS LB controller (whether it's the in-tree one from EKS directly or https://github.com/kubernetes-sigs/aws-load-balancer-controller) just can't clean up the load balancer generated for Envoy Gateway.

I will use a CLB as the example to reproduce, but based on my tests it's the same either way, and I opened an issue on the AWS side as well (kubernetes-sigs/aws-load-balancer-controller#3592).
Repro steps:

eksctl create cluster --region ca-central-1

kubectl create deployment nginx --image=nginx

aws elb describe-load-balancers --query 'LoadBalancerDescriptions[*].LoadBalancerName' --output text
-----
No ELB in this region
-----

kubectl expose deployment nginx --port=80 --type=LoadBalancer

(⎈|liyi.huang@isovalent.com@beautiful-party-1710511746.ca-central-1.eksctl.io:N/A)~ aws elb describe-load-balancers --query 'LoadBalancerDescriptions[*].LoadBalancerName' --output text

ac53833b7872d40eda2c4ba33c505096
-----
only one nginx LB
-----

helm install eg oci://docker.io/envoyproxy/gateway-helm --version v1.0.0 -n envoy-gateway-system --create-namespace
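
-----
(the quickstart from the v1.0.0 release was also applied around this point, which is what creates the backend Service and the eg Gateway shown below; presumably it was the apply counterpart of the delete command further down:)
-----

kubectl apply -f https://github.com/envoyproxy/gateway/releases/download/v1.0.0/quickstart.yaml -n default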


(⎈|liyi.huang@isovalent.com@beautiful-party-1710511746.ca-central-1.eksctl.io:N/A)~ k get svc -A
NAMESPACE              NAME                            TYPE           CLUSTER-IP       EXTERNAL-IP                                                                  PORT(S)               AGE
default                backend                         ClusterIP      10.100.245.126   <none>                                                                       3000/TCP              56s
default                kubernetes                      ClusterIP      10.100.0.1       <none>                                                                       443/TCP               20m
default                nginx                           LoadBalancer   10.100.64.134    ac53833b7872d40eda2c4ba33c505096-2064217360.ca-central-1.elb.amazonaws.com   80:32391/TCP          6m40s
envoy-gateway-system   envoy-default-eg-e41e7b31       LoadBalancer   10.100.152.77    ae8b77ce648ea46d2b968ae23cf69ed6-2060289306.ca-central-1.elb.amazonaws.com   80:31945/TCP          10s
envoy-gateway-system   envoy-gateway                   ClusterIP      10.100.1.7       <none>                                                                       18000/TCP,18001/TCP   33s
envoy-gateway-system   envoy-gateway-metrics-service   ClusterIP      10.100.251.116   <none>                                                                       19001/TCP             33s
kube-system            kube-dns                        ClusterIP      10.100.0.10      <none>                                                                       53/UDP,53/TCP         20m
-------

you can see there are 2 LBs among the Services; I forgot to run the aws command at this step (the command is repeated below for completeness)
-------
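
(for completeness, the AWS-side check here would have been the same describe call as before:)

aws elb describe-load-balancers --query 'LoadBalancerDescriptions[*].LoadBalancerName' --output text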

(⎈|liyi.huang@isovalent.com@beautiful-party-1710511746.ca-central-1.eksctl.io:N/A)~ kubectl delete -f https://github.com/envoyproxy/gateway/releases/download/v1.0.0/quickstart.yaml -n default

Warning: deleting cluster-scoped resources, not scoped to the provided namespace
gatewayclass.gateway.networking.k8s.io "eg" deleted
gateway.gateway.networking.k8s.io "eg" deleted
serviceaccount "backend" deleted
service "backend" deleted
deployment.apps "backend" deleted
httproute.gateway.networking.k8s.io "backend" deleted
(⎈|liyi.huang@isovalent.com@beautiful-party-1710511746.ca-central-1.eksctl.io:N/A)~ aws elb describe-load-balancers --query 'LoadBalancerDescriptions[*].LoadBalancerName' --output text
ac53833b7872d40eda2c4ba33c505096        ae8b77ce648ea46d2b968ae23cf69ed6
(⎈|liyi.huang@isovalent.com@beautiful-party-1710511746.ca-central-1.eksctl.io:N/A)~ aws elb describe-load-balancers --query 'LoadBalancerDescriptions[*].LoadBalancerName' --output text
ac53833b7872d40eda2c4ba33c505096        ae8b77ce648ea46d2b968ae23cf69ed6

--------
still 2 LBs when I check AWS directly
-------


(⎈|liyi.huang@isovalent.com@beautiful-party-1710511746.ca-central-1.eksctl.io:N/A)~ k get svc
NAME         TYPE           CLUSTER-IP      EXTERNAL-IP                                                                  PORT(S)        AGE
kubernetes   ClusterIP      10.100.0.1      <none>                                                                       443/TCP        22m
nginx        LoadBalancer   10.100.64.134   ac53833b7872d40eda2c4ba33c505096-2064217360.ca-central-1.elb.amazonaws.com   80:32391/TCP   8m13s

(⎈|liyi.huang@isovalent.com@beautiful-party-1710511746.ca-central-1.eksctl.io:N/A)~ k delete svc nginx
service "nginx" deleted
(⎈|liyi.huang@isovalent.com@beautiful-party-1710511746.ca-central-1.eksctl.io:N/A)~ k get svc
NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
kubernetes   ClusterIP   10.100.0.1   <none>        443/TCP   23m
(⎈|liyi.huang@isovalent.com@beautiful-party-1710511746.ca-central-1.eksctl.io:N/A)~ aws elb describe-load-balancers --query 'LoadBalancerDescriptions[*].LoadBalancerName' --output text
ae8b77ce648ea46d2b968ae23cf69ed6

-----
LB from envoy gateway is still there
-----

I did check CloudTrail on the AWS side, and it clearly shows there is no API call attempting to delete the LB.

I know it sounds like an AWS issue, but I suspect the naming generated by Envoy Gateway is causing the AWS LB controller not to delete the LB on the AWS side.
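
(a CloudTrail query along these lines, assuming the delete event is recorded as DeleteLoadBalancer, is enough to confirm that no delete was ever attempted:)

aws cloudtrail lookup-events --region ca-central-1 \
  --lookup-attributes AttributeKey=EventName,AttributeValue=DeleteLoadBalancer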

Environment:

Include the environment like gateway version, envoy version and so on.

(⎈|liyi.huang@isovalent.com@beautiful-party-1710511746.ca-central-1.eksctl.io:N/A)~ k version
Client Version: v1.29.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.25.16-eks-508b6b3
WARNING: version difference between client (1.29) and server (1.25) exceeds the supported minor version skew of +/-1

Logs:

Include the access logs and the Envoy logs.

@arkodg
Contributor

arkodg commented Mar 15, 2024

👋 @liyihuang good to see you here :)

can you share the metadata snippet from the generated svc, e.g. envoy-default-eg-e41e7b31? Let's make sure EG is respecting the finalizers added by the AWS controller
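
something like this should be enough to capture it (service name taken from your output above):

kubectl get svc envoy-default-eg-e41e7b31 -n envoy-gateway-system -o yaml
# or just the finalizers
kubectl get svc envoy-default-eg-e41e7b31 -n envoy-gateway-system -o jsonpath='{.metadata.finalizers}'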

@liyihuang
Author

@arkodg lol, I didn't expect you to pick this up.

I deleted this environment after I created the issue, but I do have another screenshot from last night showing the same issue, where I checked the finalizers and I think they look OK.

Please let me know if you want me to deploy a new environment.

[screenshot: metadata of the generated NLB Service, showing its finalizers and annotations]

For this particular screenshot it's an NLB from AWS, so I have the LB controller installed to manage the AWS LB (https://kubernetes-sigs.github.io/aws-load-balancer-controller/v2.2/guide/service/annotations/#legacy-cloud-provider); this annotation tells the external AWS LB controller to manage it rather than the in-tree one.
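
(for reference, the annotation from that doc is service.beta.kubernetes.io/aws-load-balancer-type; if I understand it correctly, a value of "external" hands the Service over to the external AWS LB controller instead of the in-tree provider, and it can be read back with something like:)

kubectl get svc envoy-default-eg-e41e7b31 -n envoy-gateway-system \
  -o jsonpath='{.metadata.annotations.service\.beta\.kubernetes\.io/aws-load-balancer-type}'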

The symptoms are the same, but I'm able to see more logs from the LB controller, which I attached to the AWS issue (kubernetes-sigs/aws-load-balancer-controller#3592).

You can see from the following logs that the LB controller deregistered the targets but never deleted the LB itself
(the log is from 3 weeks ago, so it's not the NLB from the photo attached here):

{"level":"info","ts":"2024-02-24T14:09:56Z","logger":"controllers.service","msg":"created targetGroup","stackID":"envoy-gateway-system/envoy-default-eg-e41e7b31","resourceID":"envoy-gateway-system/envoy-default-eg-e41e7b31:80","arn":"arn:aws:elasticloadbalancing:ca-central-1:679388779924:targetgroup/k8s-envoygat-envoydef-662a714b08/af0a72cf569098e8"}
{"level":"info","ts":"2024-02-24T14:09:56Z","logger":"controllers.service","msg":"creating loadBalancer","stackID":"envoy-gateway-system/envoy-default-eg-e41e7b31","resourceID":"LoadBalancer"}
{"level":"info","ts":"2024-02-24T14:09:57Z","logger":"controllers.service","msg":"created loadBalancer","stackID":"envoy-gateway-system/envoy-default-eg-e41e7b31","resourceID":"LoadBalancer","arn":"arn:aws:elasticloadbalancing:ca-central-1:679388779924:loadbalancer/net/k8s-envoygat-envoydef-d844852579/3f34d754ad6b8335"}
{"level":"info","ts":"2024-02-24T14:09:57Z","logger":"controllers.service","msg":"creating listener","stackID":"envoy-gateway-system/envoy-default-eg-e41e7b31","resourceID":"80"}
{"level":"info","ts":"2024-02-24T14:09:57Z","logger":"controllers.service","msg":"created listener","stackID":"envoy-gateway-system/envoy-default-eg-e41e7b31","resourceID":"80","arn":"arn:aws:elasticloadbalancing:ca-central-1:679388779924:listener/net/k8s-envoygat-envoydef-d844852579/3f34d754ad6b8335/46f8b73b1cf97a8e"}
{"level":"info","ts":"2024-02-24T14:09:57Z","logger":"controllers.service","msg":"creating targetGroupBinding","stackID":"envoy-gateway-system/envoy-default-eg-e41e7b31","resourceID":"envoy-gateway-system/envoy-default-eg-e41e7b31:80"}
{"level":"info","ts":"2024-02-24T14:09:57Z","logger":"controllers.service","msg":"created targetGroupBinding","stackID":"envoy-gateway-system/envoy-default-eg-e41e7b31","resourceID":"envoy-gateway-system/envoy-default-eg-e41e7b31:80","targetGroupBinding":{"namespace":"envoy-gateway-system","name":"k8s-envoygat-envoydef-662a714b08"}}
{"level":"info","ts":"2024-02-24T14:09:57Z","logger":"controllers.service","msg":"successfully deployed model","service":{"namespace":"envoy-gateway-system","name":"envoy-default-eg-e41e7b31"}}
{"level":"info","ts":"2024-02-24T14:09:59Z","msg":"authorizing securityGroup ingress","securityGroupID":"sg-0eeb8e99fb1fa92e7","permission":[{"FromPort":10080,"IpProtocol":"tcp","IpRanges":null,"Ipv6Ranges":null,"PrefixListIds":null,"ToPort":10080,"UserIdGroupPairs":[{"Description":"elbv2.k8s.aws/targetGroupBinding=shared","GroupId":"sg-036e2ab103adf694a","GroupName":null,"PeeringStatus":null,"UserId":null,"VpcId":null,"VpcPeeringConnectionId":null}]}]}
{"level":"info","ts":"2024-02-24T14:09:59Z","msg":"authorized securityGroup ingress","securityGroupID":"sg-0eeb8e99fb1fa92e7"}
{"level":"info","ts":"2024-02-24T14:09:59Z","msg":"registering targets","arn":"arn:aws:elasticloadbalancing:ca-central-1:679388779924:targetgroup/k8s-envoygat-envoydef-662a714b08/af0a72cf569098e8","targets":[{"AvailabilityZone":null,"Id":"10.5.2.168","Port":10080}]}
{"level":"info","ts":"2024-02-24T14:10:00Z","msg":"registered targets","arn":"arn:aws:elasticloadbalancing:ca-central-1:679388779924:targetgroup/k8s-envoygat-envoydef-662a714b08/af0a72cf569098e8"}
{"level":"info","ts":"2024-02-24T14:25:37Z","msg":"deRegistering targets","arn":"arn:aws:elasticloadbalancing:ca-central-1:679388779924:targetgroup/k8s-envoygat-envoydef-662a714b08/af0a72cf569098e8","targets":[{"AvailabilityZone":"ca-central-1d","Id":"10.5.2.168","Port":10080}]}
{"level":"info","ts":"2024-02-24T14:25:37Z","msg":"deRegistered targets","arn":"arn:aws:elasticloadbalancing:ca-central-1:679388779924:targetgroup/k8s-envoygat-envoydef-662a714b08/af0a72cf569098e8"}
{"level":"info","ts":"2024-02-24T14:25:37Z","msg":"revoking securityGroup ingress","securityGroupID":"sg-0eeb8e99fb1fa92e7","permission":[{"FromPort":10080,"IpProtocol":"tcp","IpRanges":null,"Ipv6Ranges":null,"PrefixListIds":null,"ToPort":10080,"UserIdGroupPairs":[{"Description":"elbv2.k8s.aws/targetGroupBinding=shared","GroupId":"sg-036e2ab103adf694a","GroupName":null,"PeeringStatus":null,"UserId":"679388779924","VpcId":null,"VpcPeeringConnectionId":null}]}]}
{"level":"info","ts":"2024-02-24T14:25:37Z","msg":"revoked securityGroup ingress","securityGroupID":"sg-0eeb8e99fb1fa92e7"}

@arkodg
Contributor

arkodg commented Mar 29, 2024

hey @liyihuang, I suspect that in your case the finalizers may have been overwritten by EG.
#3034 fixed this recently.
Can you try again with the latest image?

helm install eg oci://docker.io/envoyproxy/gateway-helm --version v0.0.0-latest -n envoy-gateway-system --create-namespace
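
and then rerun the same checks from your repro after deleting the quickstart, i.e. confirm the AWS controller's finalizer is still on the Service and that the LB disappears on the AWS side:

kubectl get svc envoy-default-eg-e41e7b31 -n envoy-gateway-system -o jsonpath='{.metadata.finalizers}'
aws elb describe-load-balancers --query 'LoadBalancerDescriptions[*].LoadBalancerName' --output text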

@liyihuang
Author

@arkodg thanks. I will look into it next week. Isn't today a public holiday in the US, and aren't you in the PST timezone?

@arkodg
Contributor

arkodg commented Apr 1, 2024

@arkodg thanks. I will look into it next week. Isn't today a public holiday in the US, and aren't you in the PST timezone?

it is, but I'm traveling this week, and in another tz :)

@liyihuang
Author

@arkodg I just got the time to look into this and can confirm the issue is resolved.
