[kube-up][scale-out] coredns pod not able to list resources from TP master #1359

Open
h-w-chen opened this issue Feb 14, 2022 · 5 comments

@h-w-chen
Collaborator

What happened:
coredns is in Running state; however, its ready container count is 0/1. The coredns pod log has the following error records:

...
Failed to list *v1.Endpoints: Get \"https://10.40.0.2:443/api/v1/endpoints?limit=500\u0026resourceVersion=0\": dial tcp 10.40.0.2:443: connect: no route to host\n","stream":"stderr","time":"2022-02-14T19:07:10.303482004Z"}
Failed to list *v1.Service: Get \"https://10.40.0.2:443/api/v1/services?limit=500\u0026resourceVersion=0\": dial tcp 10.40.0.2:443: connect: no route to host\n","stream":"stderr","time":"2022-02-14T19:07:20.515095501Z"}

What you expected to happen:
coredns should be able to list resources from the TP master.

How to reproduce it (as minimally and precisely as possible):
Using the poc-2022-01-30 code, run kube-up.sh to start a scale-out cluster with 1 TP, 1 RP, and 1 worker.

Anything else we need to know?:
10.40.0.2 is the physical node IP address of the TP master.
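
For reference, a minimal connectivity check from the worker node hosting the failing coredns pod (a sketch; curl availability on the node is an assumption, and since the error comes from inside the pod, the same check is best repeated from the pod's network namespace):

# on the worker node hosting the failing coredns pod
ping -c 3 10.40.0.2                     # TP master node IP
curl -k https://10.40.0.2:443/healthz   # any HTTP response (even 401) proves the port is reachable
ip route get 10.40.0.2                  # confirm a route to the TP master exists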

Environment:

  • Arktos version (use kubectl version): poc-2022-01-30, commit 2b6855
  • Cloud provider or hardware configuration:
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools: kube-up scale-out 1+1x1
  • Network plugin and version (if this is a network-related bug):
  • Others:
@sonyafenge
Collaborator

Adding the previous investigation with flannel for reference.
For scale-up:

  1. COS (Container-Optimized OS):

The coredns pod starts successfully.

Pods can ping each other across nodes.

ip neigh on master

# ip neigh
10.40.0.1 dev eth0 lladdr 42:01:0a:28:00:01 REACHABLE

ip neigh on minion node

# ip neigh
10.64.1.5 dev cni0 lladdr ba:6c:7a:2e:ff:e7 REACHABLE
10.64.1.10 dev cni0 lladdr e6:b6:78:9c:0d:97 REACHABLE
10.64.1.19 dev cni0 lladdr 16:e4:f7:91:d0:77 REACHABLE
10.64.1.4 dev cni0 lladdr 4e:ca:cd:9e:a5:1d REACHABLE
10.64.1.9 dev cni0  FAILED
10.40.0.1 dev eth0 lladdr 42:01:0a:28:00:01 REACHABLE
10.64.1.7 dev cni0  FAILED
10.64.1.3 dev cni0 lladdr 92:48:35:fd:06:25 DELAY
10.64.1.6 dev cni0 lladdr a2:c1:25:4c:a8:db REACHABLE
10.64.1.11 dev cni0 lladdr 8a:d9:ac:5d:09:7c REACHABLE
10.64.1.2 dev cni0 lladdr 22:4f:8b:43:8d:8c REACHABLE
  2. Ubuntu 20.04:

coredns failed to start, with the same error as reported in this issue.

Pods cannot ping each other across nodes (additional checks are sketched after the ip neigh output below).

ip neigh on master

# ip neigh
10.40.0.1 dev ens4 lladdr 42:01:0a:28:00:01 REACHABLE

ip neigh on minion nodes

# ip neigh
10.40.0.1 dev ens4 lladdr 42:01:0a:28:00:01 REACHABLE
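
For the Ubuntu case, a few extra checks can help narrow down whether flannel is programming neighbor/FDB entries at all (a sketch; the flannel.1 device and cni0 bridge are assumptions based on a default VXLAN-backed flannel deployment, and flannel.1 will not exist with the host-gw backend):

# on a minion node (Ubuntu case)
ip route | grep -E 'cni0|flannel'       # pod CIDR routes via cni0/flannel.1
ip -d link show flannel.1               # VXLAN interface details
bridge fdb show dev flannel.1           # forwarding entries for remote nodes
ping -c 3 <pod IP on another node>      # cross-node pod connectivity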

@sonyafenge
Collaborator

Checked scale-out with mizar: running "ip neigh" on the minion nodes also returns only the 10.40.0.1 entry, with no pod IPs listed.
Suspect the same issue as with flannel:
flannel-io/flannel#1155

@Sindica
Collaborator

Sindica commented Feb 15, 2022

This is why coredns and kube-dns are crashing in scale-out (both local and kube-up) and in scale-up (kube-up only, since the local setup deploys the coredns/kube-dns pod to the master).

kube-dns:
2022-02-15T22:42:35.697490309Z stderr F E0215 22:42:35.697343       1 reflector.go:201] k8s.io/dns/pkg/dns/dns.go:189: Failed to list *v1.Endpoints: Get https://10.0.0.1:443/api/v1/endpoints?resourceVersion=0: dial tcp 10.0.0.1:443: i/o timeout
2022-02-15T22:42:35.6975123Z stderr F E0215 22:42:35.697348       1 reflector.go:201] k8s.io/dns/pkg/dns/dns.go:192: Failed to list *v1.Service: Get https://10.0.0.1:443/api/v1/services?resourceVersion=0: dial tcp 10.0.0.1:443: i/o timeout
...
2022-02-15T22:43:03.197113883Z stderr F I0215 22:43:03.196855       1 dns.go:219] Waiting for [endpoints services] to be initialized from apiserver...
2022-02-15T22:43:03.697056566Z stderr F I0215 22:43:03.696870       1 dns.go:219] Waiting for [endpoints services] to be initialized from apiserver...
2022-02-15T22:43:04.19701365Z stderr F I0215 22:43:04.196833       1 dns.go:219] Waiting for [endpoints services] to be initialized from apiserver...
2022-02-15T22:43:04.697068732Z stderr F I0215 22:43:04.696904       1 dns.go:219] Waiting for [endpoints services] to be initialized from apiserver...
2022-02-15T22:43:05.197105713Z stderr F I0215 22:43:05.196876       1 dns.go:219] Waiting for [endpoints services] to be initialized from apiserver...
2022-02-15T22:43:05.697058396Z stderr F F0215 22:43:05.696845       1 dns.go:209] Timeout waiting for initialization
coredns:
2022-02-15T22:47:59.821381993Z stderr F E0215 22:47:59.821145       1 reflector.go:178] pkg/mod/k8s.io/client-go@v0.18.3/tools/cache/reflector.go:125: Failed to list *v1.Service: Get "https://172.30.0.14:6443/api/v1/services?limit=500&resourceVersion=0": dial tcp 172.30.0.14:6443: i/o timeout
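
For the 10.0.0.1:443 timeouts, a quick way to check whether the in-cluster apiserver service VIP is being translated at all (a sketch; assumes an iptables-based kube-proxy and should be run with the appropriate kubeconfig for the partition):

# on the node hosting the failing dns pod
kubectl get svc kubernetes -o wide      # clusterIP should match the IP in the logs above
kubectl get endpoints kubernetes        # should list the apiserver host endpoint(s)
sudo iptables-save | grep 10.0.0.1      # kube-proxy DNAT rules for the service VIP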

@Sindica Sindica added this to the 0.10 milestone Feb 17, 2022
@Sindica
Collaborator

Sindica commented Feb 17, 2022

After adding the mizar daemon to the TP master and adding the TP master to the mizar droplets, the kube-up scale-out 1x1 cluster still has the service-crashing issue.

sindica2000@ying-scaleout-tp-1-master:/var/log$ cat kube-controller-manager.log | grep mizar-node-controller.go | grep mizar | grep Create | grep successfully
I0217 06:33:40.553971       1 mizar-node-controller.go:239] Mizar handled request successfully for mizar_node. key ying-scaleout-tp-1-master, eventType Create
I0217 06:34:20.331171       1 mizar-node-controller.go:239] Mizar handled request successfully for mizar_node. key ying-scaleout-rp-1-minion-group-2gt2, eventType Create
I0217 06:34:24.335101       1 mizar-node-controller.go:239] Mizar handled request successfully for mizar_node. key ying-scaleout-rp-1-master, eventType Create
I0217 06:34:24.943984       1 mizar-node-controller.go:239] Mizar handled request successfully for mizar_node. key ying-scaleout-rp-1-minion-group-w5f3, eventType Create

sindica2000@ying-dev1:~/go/src/sindica-arktos$ kubectl --kubeconfig cluster/kubeconfig.tp-1 get eps | grep host
ying-scaleout-rp-1-master-default--hostep-13576365                            host     4a:f7:5b:5c:20:54   10.40.0.3      89.225.0.1   32       Provisioned   aaa-default-network-subnet      aaa-default-network      13576365   ying-scaleout-rp-1-master              ehost-13576365   vehost-13576365           10.40.0.3   42:01:0a:28:00:03   2022-02-17T06:44:35.409457   0.522525                    
ying-scaleout-rp-1-minion-group-2gt2-default--hostep-13576365                 host     1a:3d:d2:71:73:a8   10.40.0.4      89.225.0.1   32       Init          aaa-default-network-subnet      aaa-default-network      1          ying-scaleout-rp-1-minion-group-2gt2   ehost-13576365   vehost-13576365           10.40.0.4   42:01:0a:28:00:04   2022-02-17T06:44:35.091425                               
ying-scaleout-rp-1-minion-group-w5f3-default--hostep-13576365                 host     12:1c:8d:2a:01:74   10.40.0.5      89.225.0.1   32       Init          aaa-default-network-subnet      aaa-default-network      1          ying-scaleout-rp-1-minion-group-w5f3   ehost-13576365   vehost-13576365           10.40.0.5   42:01:0a:28:00:05   2022-02-17T06:44:35.831306                               
ying-scaleout-tp-1-master-default--hostep-1                                   host     b2:30:08:53:36:7d   10.40.0.2      20.0.0.1     32       Provisioned   net0                            vpc0                     1          ying-scaleout-tp-1-master              ehost-1          vehost-1                  10.40.0.2   42:01:0a:28:00:02   2022-02-17T06:34:08.829725   1.330807                    
ying-scaleout-tp-1-master-default--hostep-13576365                            host     9a:f9:43:fb:8f:ec   10.40.0.2      89.225.0.1   32       Provisioned   aaa-default-network-subnet      aaa-default-network      13576365   ying-scaleout-tp-1-master              ehost-13576365   vehost-13576365           10.40.0.2   42:01:0a:28:00:02   2022-02-17T06:44:34.931540   0.842169                    
ying-scaleout-tp-1-master-default--hostep-13961987                            host     0e:4f:f1:bc:d9:a9   10.40.0.2      1.36.0.1     32       Provisioned   system-default-network-subnet   system-default-network   13961987   ying-scaleout-tp-1-master              ehost-13961987   vehost-13961987           10.40.0.2   42:01:0a:28:00:02   2022-02-17T06:34:08.555134   0.998453     

sindica2000@ying-dev1:~/go/src/sindica-arktos$ kubectl --kubeconfig cluster/kubeconfig.tp-1 get pods -o wide -AT | grep ying-scaleout-rp-1-minion-group-w5f3 | grep -v netpod
system   default       mizar-daemon-jbc5b                                    4045097148037907557   1/1     Running            0          9h    10.40.0.5    ying-scaleout-rp-1-minion-group-w5f3   <none>           <none>
system   kube-system   coredns-default-ying-scaleout-tp-1-7545f94d7c-qzz4z   4574940730222740260   0/1     CrashLoopBackOff   135        9h    1.36.0.14    ying-scaleout-rp-1-minion-group-w5f3   <none>           <none>
system   kube-system   fluentd-gcp-v3.2.0-48bdq                              1457492913046417354   1/1     Running            0          9h    10.40.0.5    ying-scaleout-rp-1-minion-group-w5f3   <none>           <none>
system   kube-system   heapster-v1.6.0-beta.1-7c546f8546-twxbl               8803644565539269132   2/2     Running            98         9h    1.36.0.19    ying-scaleout-rp-1-minion-group-w5f3   <none>           <none>
system   kube-system   kube-proxy-ying-scaleout-rp-1-minion-group-w5f3       5903977805958703516   1/1     Running            0          9h    10.40.0.5    ying-scaleout-rp-1-minion-group-w5f3   <none>           <none>
system   kube-system   metrics-server-v0.3.3-5f994fcb77-6dzf6                523873102943355845    1/2     CrashLoopBackOff   116        9h    1.36.0.34    ying-scaleout-rp-1-minion-group-w5f3   <none>           <none>

Log from kubernetes-dashboard-848965699-jx5vx:

{"log":"2022/02/17 16:23:14 Error while initializing connection to Kubernetes apiserver. This most likely means that the cluster is misconfigured (e.g., it has invalid apiserver certificates or service account's configuration) or the --apiserver-host param points to a server that does not exist. Reason: Get https://10.0.0.1:443/version: dial tcp 10.0.0.1:443: connect: no route to host\n","stream":"stdout","time":"2022-02-17T16:23:14.830337903Z"}

Log from kube-dns-autoscaler:

{"log":"E0217 16:26:46.478019       1 reflector.go:190] github.com/kubernetes-incubator/cluster-proportional-autoscaler/pkg/autoscaler/k8sclient/k8sclient.go:94: Failed to list *v1.Node: Get https://10.0.0.1:443/api/v1/nodes: dial tcp 10.0.0.1:443: getsockopt: no route to host\n","stream":"stderr","time":"2022-02-17T16:26:46.478272251Z"}

Log from metrics server:

{"log":"panic: Get https://10.0.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication: dial tcp 10.0.0.1:443: connect: no route to host\n","stream":"stderr","time":"2022-02-17T16:27:49.286259419Z"}

Note that the host eps for the minion nodes are still in the Init state.
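
To watch whether those host eps ever move out of Init (a sketch, reusing the kubeconfig and query from above):

watch -n 10 'kubectl --kubeconfig cluster/kubeconfig.tp-1 get eps | grep host'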

@Sindica
Collaborator

Sindica commented Mar 3, 2022

I still see issues in the new tenant's coredns pod. It is not 100% reproducible, but it should be reproducible by creating multiple tenants in the same cluster.

{"log":"E0303 17:47:03.438944       1 reflector.go:178] pkg/mod/k8s.io/client-go@v0.18.3/tools/cache/reflector.go:125: Failed to list *v1.Endpoints: Get \"https://10.40.0.2:443/api/v1/endpoints?limit=500\u0026resourceVersion=0\": dial tcp 10.40.0.2:443: i/o timeout\n","stream":"stderr","time":"2022-03-03T17:47:03.439111869Z"}
{"log":"[INFO] SIGTERM: Shutting down servers then terminating\n","stream":"stdout","time":"2022-03-03T17:47:09.34679178Z"}

This may not be an issue, since the new VPC started with 127.
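
A rough way to keep an eye on the intermittent failure across tenants (a sketch, reusing the kubeconfig and the -AT flag shown earlier):

# re-check every 30s whether any tenant's coredns pod is crashing or restarting
while true; do
  kubectl --kubeconfig cluster/kubeconfig.tp-1 get pods -o wide -AT | grep coredns
  sleep 30
done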
