
503s and _No_route_to_host errors due to routing to non-existent Endpoints #4685

Closed
coro opened this issue Nov 8, 2024 · 37 comments · Fixed by #4754

@coro
Contributor

coro commented Nov 8, 2024

Description:
We have been seeing many 503 errors when connecting to a Service with a lot of Pod churn.

We also saw in the logs that in these cases, the upstream_host that Envoy was attempting to connect to was for Pods that no longer existed in the cluster. These Pods could have been terminated over 50 mins earlier.

Repro steps:
Simple setup of Gateway (AWS NLB) -> HTTPRoute -> Service pointing to a Deployment with a lot of Pod churn.

Environment:
Gateway: v1.2.1 (also seen on v1.1.3) (not seen on v1.1.1)
Envoy: v1.32.1 (also seen on v1.31.1) (not seen on v1.31.0)
EKS cluster v1.29

Logs:

{
  "start_time": "2024-11-08T13:54:35.382Z",
  "method": "POST",
  "x-envoy-origin-path": "/v1/models/custom-model:predict",
  "protocol": "HTTP/1.1",
  "response_code": "503",
  "response_flags": "UF",
  "response_code_details": "upstream_reset_before_response_started{remote_connection_failure|delayed_connect_error:_No_route_to_host}",
  "connection_termination_details": "-",
  "upstream_transport_failure_reason": "delayed_connect_error:_No_route_to_host",
  "bytes_received": "494",
  "bytes_sent": "165",
  "duration": "3060",
  "x-envoy-upstream-service-time": "-",
  "x-forwarded-for": "10.0.99.64",
  "user-agent": "python-requests/2.32.3",
  "x-request-id": "3dbc6bcc-79b1-4ec1-855c-1569b69f5416",
  ":authority": "kserve-detection.prod.signal",
  "upstream_host": "10.0.101.243:8080",
  "upstream_cluster": "httproute/default/detection/rule/0",
  "upstream_local_address": "-",
  "downstream_local_address": "10.0.101.92:10080",
  "downstream_remote_address": "10.0.99.64:41562",
  "requested_server_name": "-",
  "route_name": "httproute/default/detection/rule/0/match/0/kserve-detection_prod_signal"
}

The generated cluster config for this route was:

   "dynamic_active_clusters": [
    {
     "version_info": "a998a3b0100c28a29ff3ee662cb01dc45463f0260c9a6a50107dcc12a45938fc",
     "cluster": {
      "@type": "type.googleapis.com/envoy.config.cluster.v3.Cluster",
      "name": "httproute/default/detection/rule/0",
      "type": "EDS",
      "eds_cluster_config": {
       "eds_config": {
        "ads": {},
        "resource_api_version": "V3"
       },
       "service_name": "httproute/default/detection/rule/0"
      },
      "connect_timeout": "10s",
      "per_connection_buffer_limit_bytes": 32768,
      "lb_policy": "LEAST_REQUEST",
      "circuit_breakers": {
       "thresholds": [
        {
         "max_retries": 1024
        }
       ]
      },
      "dns_lookup_family": "V4_ONLY",
      "outlier_detection": {},
      "common_lb_config": {
       "locality_weighted_lb_config": {}
      },
      "ignore_health_on_host_removal": true
     },
     "last_updated": "2024-11-08T14:29:21.557Z"
    },
...
@coro added the triage label Nov 8, 2024
@arkodg
Contributor

arkodg commented Nov 8, 2024

hey @coro, we haven't seen such issues yet. Can you run kubectl get endpointslice -l <service label> -o yaml as well as egctl config envoy-proxy endpoint multiple times over a few seconds, so we can better understand the consistency issue?

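For reference, a rough way to capture both views side by side over a few iterations (the Service name and Envoy pod name below are placeholders for your setup):

# placeholders: substitute your Service name and Envoy proxy pod name
SERVICE=detection
ENVOY_POD=envoy-default-eg-xxxxx
for i in 1 2 3 4 5; do
  date
  # addresses currently published in the EndpointSlice
  kubectl get endpointslice -l kubernetes.io/service-name=$SERVICE -o yaml | grep -A1 'addresses:'
  # addresses Envoy currently has for its clusters
  egctl config envoy-proxy endpoint -n kube-system $ENVOY_POD | grep -i address
  sleep 2
done
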
To avoid this, you can set up retries and passive health checks; here's an example:

apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: retry-for-route
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: backend
  retry:
    numRetries: 5
    perRetry:
      backOff:
        baseInterval: 100ms
        maxInterval: 10s
      timeout: 250ms
    retryOn:
      httpStatusCodes:
        - 500
      triggers:
        - connect-failure
        - retriable-status-codes
  healthCheck:
    passive:
      baseEjectionTime: 10s
      interval: 2s
      maxEjectionPercent: 100
      consecutive5XxErrors: 1 
      consecutiveGatewayErrors: 0
      consecutiveLocalOriginFailures: 1
      splitExternalLocalOriginErrors: false
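
With consecutive5XxErrors: 1 and splitExternalLocalOriginErrors: false, the passive health check should eject an endpoint after its first connect failure, and the connect-failure retry trigger covers the request that hit it, so this should mask most of the 503s while the root cause is investigated. Applying and verifying it is just kubectl (assuming you save the manifest above as retry-for-route.yaml):

# apply the policy and check that it was accepted for the targeted HTTPRoute
kubectl apply -f retry-for-route.yaml
kubectl get backendtrafficpolicy retry-for-route -o yaml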

@coro
Contributor Author

coro commented Nov 11, 2024

Thanks for the suggestion @arkodg! On the endpointslice front, anecdotally I can tell you that we did not see the IPs of the non-existent Pods in the EndpointSlices, but we will check again with that and the egctl output and get back to you.

It's worth mentioning that this was on a KServe InferenceService being scaled by a KEDA autoscaler, so perhaps there is some kind of race condition between KEDA's control of the EndpointSlice and EG's reading of it. Again, we will take a look and let you know.

@arkodg
Contributor

arkodg commented Nov 11, 2024

cc @dprotaso

@coro
Contributor Author

coro commented Nov 12, 2024

@arkodg Just a quick update, we can confirm that this does not happen with EG v1.1.1 / Envoy v1.31.0. Rolling back seems to have resolved this issue. We haven't had a moment to safely reproduce this and get your additional debug info just yet, so we will let you know again when we have that.

@coro
Contributor Author

coro commented Nov 12, 2024

cc @evilr00t @sam-burrell

@sam-burrell
Contributor

We can confirm that after a pod is terminated, egctl config envoy-proxy endpoint for the route shows the pod is not in the list of lbEndpoints; there is no mention of the pod's IP (healthy or otherwise).

Yet after that point in time we still get logs from Envoy saying upstream_reset_before_response_started{remote_connection_failure|delayed_connect_error:_Connection_refused} for the upstream IP of a pod that is not in Envoy's list of lbEndpoints.

There seems to be some routing within Envoy that still tries to reach a dead/non-existent pod. Perhaps this is cached, or perhaps there is a long-lived connection still open or being reused?

This is continuing from the same environment as @coro (We work on the same team)
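
One way we could check the long-lived-connection theory: port-forward to the proxy's Envoy admin interface (19000 by default in Envoy Gateway deployments, if I have that right) and look at the per-host stats for the suspect IP:

# placeholders: Envoy proxy pod name and the stale upstream IP from the access logs
ENVOY_POD=envoy-default-eg-xxxxx
STALE_IP=10.0.101.243
kubectl -n kube-system port-forward pod/$ENVOY_POD 19000:19000 &
PF_PID=$!
sleep 2
# /clusters lists every host Envoy currently knows about, with cx_active,
# rq_active and health flags; a terminated pod IP showing up here with active
# connections would point at stale state or a still-open connection inside Envoy
curl -s localhost:19000/clusters | grep $STALE_IP
kill $PF_PID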

@evilr00t

evilr00t commented Nov 13, 2024

Hi!

Looks like we were able to identify the issue.

We've been comparing proxy endpoints with egctl config envoy-proxy endpoint -n kube-system envoy_pod, and since v1.1.3 (we've checked 1.1.1 and 1.1.2) there seems to be a regression.

After the upgrade, the controller no longer updates its proxy endpoints; they are only populated at startup, which, as you can imagine, causes a lot of issues in a very dynamic Kubernetes world.

The Kubernetes EndpointSlice is being updated, but the endpoints on the proxy stay static from the gateway's startup. A rollout restart refreshes the endpoints, but that is not what we want.

We were comparing different tags and I think this is what caused the regression, although I'm not 100% sure - just giving you a hint. (Those changes were cherry-picked from 1.2.0 into 1.1.3, where we also noticed the same issue.)

#4336
https://github.com/envoyproxy/gateway/pull/4337/files

@arkodg
Contributor

arkodg commented Nov 14, 2024

unsure how status updater can affect endpointslice reconciliation
cc @zhaohuabing

@ovaldi

ovaldi commented Nov 14, 2024

We’re also facing the same issue.

Downgrading EG from v1.2.1 to v1.1.2 resolved it.

@ligol

ligol commented Nov 14, 2024

We're getting the same error on our side too; rolling back to v1.1.2 seemed to fix the issue for us as well.

@zhaohuabing
Member

zhaohuabing commented Nov 14, 2024

@ligol @ovaldi @evilr00t @sam-burrell @coro

Is there any easy way to reproduce this? I tried modifying the replicas of the deployment to increase/decrease pods, but couldn't reproduce the issue.

Also, are there any errors/warnings in the EG/Envoy logs while this happened?

Can you also try
v1.1.3+ envoyproxy/envoy:distroless-v1.31.2
or
v1.1.2 envoyproxy/envoy:distroless-v1.31.3?

Looks like the only significant change within v1.1.3 is the upgrade of envoy to v1.31.3.
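
If it helps with mixing and matching, the proxy image can be overridden independently of the EG version through an EnvoyProxy resource referenced from the GatewayClass via parametersRef. A minimal sketch (field paths per the v1alpha1 API as I understand them; please double-check against your installed CRDs):

kubectl apply -f - <<'EOF'
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyProxy
metadata:
  name: custom-proxy-image
  namespace: envoy-gateway-system
spec:
  provider:
    type: Kubernetes
    kubernetes:
      envoyDeployment:
        container:
          image: envoyproxy/envoy:distroless-v1.31.2
EOF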

@ligol

ligol commented Nov 14, 2024

I will try to test the different combinations tomorrow, when I have some traffic on the preprod environment; otherwise I don't really have any other way to reproduce this.

@sam-burrell
Contributor

sam-burrell commented Nov 15, 2024

These are all the combinations we have tested

✔️ Working
gateway 1.1.1
Envoy image envoyproxy/envoy:distroless-v1.31.1

✔️ Working
gateway 1.1.2
Envoy image envoyproxy/envoy:distroless-v1.31.2

✔️ Working
gateway 1.1.2
Envoy image envoyproxy/envoy:distroless-v1.31.3

🛑 Not Working
gateway 1.1.3
Envoy image envoyproxy/envoy:distroless-v1.31.1

🛑 Not Working
gateway 1.1.3
Envoy image envoyproxy/envoy:distroless-v1.31.2

🛑 Not Working
gateway 1.2.1
Envoy image envoyproxy/envoy:distroless-v1.32.1

All on EKS cluster v1.29.8

@sam-burrell
Contributor

sam-burrell commented Nov 15, 2024

The following shell output is how we are defining "Not Working".
We get results like the following after randomly killing a few pods:

EndpointSlice FOOBAR_ENDPOINT endpoints:                       ['10.0.100.160', '10.0.100.32', '10.0.102.251', '10.0.103.11', '10.0.104.40', '10.0.106.44', '10.0.110.230', '10.0.111.196', '10.0.96.13', '10.0.98.131']
Selected endpoints for FOOBAR_ROUTE on FOOBAR_GATEWAY_POD:     ['10.0.100.160', '10.0.100.32', '10.0.102.251', '10.0.103.11', '10.0.104.40', '10.0.106.44', '10.0.110.230', '10.0.111.196', '10.0.96.13', '10.0.98.131']
Selected endpoints for FOOBAR_ROUTE on FOOBAR_GATEWAY_POD:     ['10.0.102.251', '10.0.106.41', '10.0.106.44', '10.0.110.230', '10.0.111.196', '10.0.96.116', '10.0.96.13', '10.0.97.194', '10.0.98.131']
Selected endpoints for FOOBAR_ROUTE on FOOBAR_GATEWAY_POD:     ['10.0.102.251', '10.0.106.41', '10.0.106.44', '10.0.110.230', '10.0.111.196', '10.0.96.116', '10.0.96.13', '10.0.97.194', '10.0.98.131']

Some gateway pods have IPs that don't exist in the EndpointSlice.
IP 10.0.106.41 is not in the EndpointSlice and returns a 503 when hit by Envoy.

The following is the script we are using to test this:

import subprocess
import json
from datetime import datetime
from pprint import pprint

HTTP_ROUTE = "HTTP_ROUTE"
ENDPOINT_SLICE = "ENDPOINT_SLICE"
GATEWAY_POD_PREFIX = "GATEWAY_POD_PREFIX"

# Define the command to get the EndpointSlice
command = ["kubectl", "get", "endpointslice", "-o", "json", ENDPOINT_SLICE]
result = subprocess.run(command, capture_output=True, text=True)
endpointslice_output = result.stdout
endpointslice_data = json.loads(endpointslice_output)

# Print the EndpointSlice information
ENDPOINT_SLICE_endpoints = [
    endpoint["addresses"][0]
    for endpoint in endpointslice_data["endpoints"]
    if endpoint["conditions"]["ready"] == True and endpoint["conditions"]["serving"] == True
]

ENDPOINT_SLICE_endpoints = sorted(ENDPOINT_SLICE_endpoints)
print(f"EndpointSlice {ENDPOINT_SLICE} endpoints: \t\t\t\t\t\t\t\t\t\t{ENDPOINT_SLICE_endpoints}")


command = ["kubectl", "get", "pods", "-n", "kube-system", "-o", "json"]
result = subprocess.run(command, capture_output=True, text=True)
pods_output = result.stdout

# Parse the JSON output to get the pod names
pods_data = json.loads(pods_output)
envoy_pods = [
    pod["metadata"]["name"]
    for pod in pods_data["items"]
    if pod["metadata"]["name"].startswith(GATEWAY_POD_PREFIX)
]

all_selected_endpoints = []

for envoy_pod in envoy_pods:
    # Define the command to run
    command = [
        "egctl", "config", "envoy-proxy", "endpoint", "-n", "kube-system",
        envoy_pod
    ]

    # Run the command and capture the output
    result = subprocess.run(command, capture_output=True, text=True)
    output = result.stdout

    # Parse the JSON output
    data = json.loads(output)

    # Extract the required information
    endpoints = data["kube-system"][envoy_pod]["dynamicEndpointConfigs"]
    selected_endpoints = [
        lb_endpoint["endpoint"]["address"]["socketAddress"]["address"]
        for ep in endpoints
        if ep["endpointConfig"]["clusterName"] == HTTP_ROUTE
        for lb_endpoint in ep["endpointConfig"]["endpoints"][0]["lbEndpoints"]
    ]
    selected_endpoints = sorted(selected_endpoints)
    print(f"Selected endpoints for {HTTP_ROUTE} on {envoy_pod}: \t{selected_endpoints}")

@sam-burrell
Contributor

We have turned debug logs on for envoy-gateway-controller and see nothing obvious.
We do not see any logs in envoy-gateway-controller matching "error|warning|fatal|Error"
We do not see any logs in envoy-gateway-pods other than the 503s

@sam-burrell
Contributor

Is there anything we can do to assist with this?

@arkodg added the kind/bug and cherrypick/release-v1.2.2 labels Nov 19, 2024
@zhaohuabing
Member

zhaohuabing commented Nov 20, 2024

I tried the below setup with both 1.2.1 and 1.1.2

Constantly create and delete pods with:

while true
do
kubectl patch deployment backend -p '{"spec":{"replicas":100}}'
sleep 10
kubectl patch deployment backend -p '{"spec":{"replicas":1}}'
sleep 10
done

And access the Gateway with

while true
do
status_code=`curl -o /dev/null -s -w "%{http_code}\n"  -HHost:www.example.com http://172.18.0.200`
if [ "$status_code" -eq 503 ]; then
  echo "503"
fi
done

I got some 503 errors with both 1.2.1 and 1.1.2. After I stopped creating and deleting pods, the 503 errors ceased for both versions. There doesn't seem to be any noticeable difference in behavior between 1.2.1 and 1.1.2.

Probably there's something missing in my configuration and that's why I can't reproduce this. @sam-burrell @ligol, can you help me reproduce it?
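
For correlating any 503s with the churn, a small variation of the curl loop above that timestamps each failure:

while true
do
  status_code=`curl -o /dev/null -s -w "%{http_code}"  -HHost:www.example.com http://172.18.0.200`
  if [ "$status_code" -eq 503 ]; then
    echo "`date +%T` 503"
  fi
done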

@zhaohuabing
Member

The following shell output is how we are defining "Not Working".
We get results like the following after randomly killing a few pods:

I think it's acceptable for the endpoints in Envoy xDS to temporarily differ from the k8s EndpointSlice. This delay occurs because Envoy Gateway (EG) needs time to propagate the changes to Envoy. However, if an endpoint still exists in Envoy for a while after being deleted from the EndpointSlice, that indicates a problem.

@zhaohuabing
Member

@ligol @ovaldi @evilr00t @sam-burrell @coro Could you please try #4754 and see if it fixes this?

@dromadaire54

dromadaire54 commented Nov 21, 2024

@zhaohuabing we are going to check whether the revert fixes the issue.

@zhaohuabing
Member

zhaohuabing commented Nov 22, 2024

@zhaohuabing we are going to check whether the revert fixes the issue.

@dromadaire54 Please also try #4767 if you have a stable reproducible env. Thanks!

@evilr00t

Just checked the latest v0.0.0-latest and this bug still happens; only 1 of the proxies is being updated during scale-up / scale-down.

Screenshot included:

Screenshot 2024-11-22 at 10 34 59 AM

@zirain
Contributor

zirain commented Nov 22, 2024

Just checked the latest v0.0.0-latest and this bug still happens; only 1 of the proxies is being updated during scale-up / scale-down.

Screenshot included:

Screenshot 2024-11-22 at 10 34 59 AM

it's not merged yet, so you need to build it yourself.

@arkodg
Contributor

arkodg commented Nov 22, 2024

@zhaohuabing can you provide a custom image from your Docker Hub repo (with the commit reverted) that others can try?

@zhaohuabing
Member

zhaohuabing commented Nov 23, 2024

I prefer #4767 because the revert in #4755 would cause a regression of another issue.

@zhaohuabing
Member

zhaohuabing commented Nov 23, 2024

Update: please use zhaohuabing/gateway-dev:e406088d7 for #4767. The new image contains the fix for e2e tests.

@zhaohuabing
Member

Just checked the latest v0.0.0-latest and this bug still happens; only 1 of the proxies is being updated during scale-up / scale-down.

Screenshot included:

Screenshot 2024-11-22 at 10 34 59 AM

@evilr00t Is it temporary, or do the deleted pod IPs stay in the endpoints for a long time?

@evilr00t

Just checked the latest v0.0.0-latest and this bug still happens; only 1 of the proxies is being updated during scale-up / scale-down.
Screenshot included:
Screenshot 2024-11-22 at 10 34 59 AM

@evilr00t Is it temporary, or do the deleted pod IPs stay in the endpoints for a long time?

In the provided screenshot, when I scaled up the ReplicaSet, the new Pod IPs were added to only 1 Envoy proxy. Deleted Pods are removed from that 1 working proxy. The 2 remaining proxies are not updated at all and still show the IPs of Pods that were available when those proxies started up.

@tskinner-oppfi

Thanks for looking into this @zhaohuabing. This is a great project!

I am seeing similar issues and finally took some time to put together reproduction steps, using some of the scripts in the gateway repo. Here is what I've come up with:

  1. Spin up a local test cluster with eg and the demo app (I didn't look around the repo too thoroughly, so if there is a better way, please tell me):
    make create-cluster
    make kube-install-image
    make kube-deploy
    make kube-demo
    
  2. Scale up the envoy proxies and gateways to 3. The issue doesn't seem to happen with only a single replica.
  3. Scale the backend (demo app) up to 10 replicas or something
  4. Add a listener to the gateway. I have a similar listener in my cluster and it seemed to be key in reproducing, if not this exact error, then a similar one.
    - protocol: TCP
      port: 1080
      name: tcp
      allowedRoutes:
        namespaces:
          from: Selector
          selector:
            matchLabels:
              kubernetes.io/metadata.name: default
    
  5. Scale the backend demo app to zero and back up again and monitor requests to it (see the sketch below)

The behavior doesn't seem to appear with the zhaohuabing/gateway-dev:d3e3c6bf7 image.
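
For step 5, something like this is what I mean (assuming the demo backend Deployment is named backend and $GATEWAY_ADDR points at the Gateway listener; adjust to your setup):

# scale the demo backend down and back up while traffic is flowing
kubectl scale deployment backend --replicas=0
sleep 30
kubectl scale deployment backend --replicas=10

# in another terminal, keep requests flowing and print any non-200s
while true; do
  code=$(curl -o /dev/null -s -w "%{http_code}" -HHost:www.example.com http://$GATEWAY_ADDR/)
  [ "$code" != "200" ] && echo "$(date +%T) $code"
done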

@zhaohuabing
Member

zhaohuabing commented Nov 27, 2024

@tskinner-oppfi Thanks for testing the fixes! v1.2.2 will be released this week with the fix from #4754.

@dromadaire54

@zhaohuabing thanks for the fix; no errors for several days in our env.

@zhaohuabing
Member

Hello @dromadaire54 @tskinner-oppfi
If possible, could you please also try the image zhaohuabing/gateway-dev:96002955f? It's built from this PR: #4809

@tskinner-oppfi

tskinner-oppfi commented Nov 29, 2024

I've tested the image zhaohuabing/gateway-dev:96002955f in one of our EKS clusters and I still see errors, although they are different from the 503s seen previously:

{"start_time":"2024-11-29T18:20:00.028Z","method":"-","x-envoy-origin-path":"-","protocol":"-","response_code":"0","response_flags":"UH","response_code_details":"-","connection_termination_details":"-","upstream_transport_failure_reason":"-","bytes_received":"0","bytes_sent":"0","duration":"0","x-envoy-upstream-service-time":"-","x-forwarded-for":"-","user-agent":"-","x-request-id":"-",":authority":"-","upstream_host":"-","upstream_cluster":"EmptyCluster","upstream_local_address":"-","downstream_local_address":"10.143.32.129:1080","downstream_remote_address":"172.28.6.182:65105","requested_server_name":"-","route_name":"-"}

The end result is that I see a mixture of 200 and 0 status codes.

@dromadaire54

Hi,
We are seeing 503 error codes with the latest version, 1.2.3, which was supposed to fix this issue. Is anybody else seeing the same thing?

@ovaldi

ovaldi commented Dec 2, 2024

Recently, after upgrading to v1.2.0 and v1.2.1, we encountered the 503s issue.

And, after upgrading to v1.2.2, we encountered the DPANIC issue.

So for v1.2.3, we decided to wait and see... 😂

Hi, We are seeing 503 error codes with the latest version, 1.2.3, which was supposed to fix this issue. Is anybody else seeing the same thing?

@arkodg
Contributor

arkodg commented Dec 3, 2024

Hi,
We are seeing 503 error codes with the latest version, 1.2.3, which was supposed to fix this issue. Is anybody else seeing the same thing?

@dromadaire54 can you share the access log output? I'm not sure how the fix that went into v1.2.3 could cause this (we only wait during init). The issue @tskinner-oppfi is hitting is tied to a TCP listener without an attached TCPRoute, which looks like a separate issue (maybe a configuration issue) and needs to be investigated separately.
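
For the TCP listener case, a TCPRoute attached to that listener would look roughly like this (the names are guesses based on the repro steps above; sectionName must match the listener name and the backendRef your Service):

kubectl apply -f - <<'EOF'
apiVersion: gateway.networking.k8s.io/v1alpha2
kind: TCPRoute
metadata:
  name: tcp-demo
  namespace: default
spec:
  parentRefs:
    - name: eg            # Gateway name from the repro above (a guess)
      sectionName: tcp    # must match the TCP listener's name
  rules:
    - backendRefs:
        - name: backend   # demo app Service (a guess)
          port: 3000
EOF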

@dromadaire54

@arkodg I shared the logs with you by DM in Slack.
